(The following notes taken by S. Bowers)
- Feta Architecture
- Ontologist (Chris Wroe) -> Ontology Editor -> DL Reasoner -> Classification (in RDF(S)) -> obtain classification -> Feta, PeDRo
- Store WSDL Descriptions (in special XML schema), then annotate, and give to Feta
- The classified ontology and the annotated WSDL are merged into a single graph
- Taverna Workflow Workbench issues "semantic discovery via conceptual descriptions" against Feta ... a set of canned queries
- Feta Engine
- Feta Loader uses the myGrid service ontology and domain ontology
- uses Jena, e.g., to run RDQL queries, etc. (see the sketch below)
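- A minimal sketch of such a query using the Jena 2-era RDQL API (class names from Jena 2, which later replaced RDQL with SPARQL; the file name and the mygrid# URIs here are hypothetical placeholders, not the actual myGrid ontology):

```java
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdql.Query;
import com.hp.hpl.jena.rdql.QueryEngine;
import com.hp.hpl.jena.rdql.QueryResults;
import com.hp.hpl.jena.rdql.ResultBinding;

public class FetaQuerySketch {
    public static void main(String[] args) {
        // Load the merged graph (classified ontology + annotated WSDL).
        Model model = ModelFactory.createDefaultModel();
        model.read("file:feta-merged.rdf");

        // Canned query: find operations performing a given task.
        String rdql =
            "SELECT ?op WHERE (?op, <http://example.org/mygrid#performsTask>, " +
            "<http://example.org/mygrid#sequence_alignment>)";

        Query query = new Query(rdql);
        query.setSource(model);
        QueryResults results = new QueryEngine(query).exec();
        while (results.hasNext()) {
            ResultBinding binding = (ResultBinding) results.next();
            System.out.println(binding.get("op"));
        }
        results.close();
    }
}
```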
- Feta Data Model
- Operation (name, description, task -- from a bio service ontology, method -- particular type of algorithm/code, also from the ontology but not used much, resource, application, hasInput : Parameter, hasOutput : Parameter)
- Parameter (name, desc, semantic type, format, transport type, collection type, collection format)
- Service (name, description, author, organizations)
- WSDL-based operation is a subclass of Operation
- WSDL-based Web Service is a subclass of Service (hasOperation : WSDL-based operation)
- workflow, BioMoby service, Soaplab service, and local Java code are subclasses of Service and Operation
- SeqHound service is an Operation
- each parameter can have a semantic type, stating that the parameter is an instance of an ontology class; an operation's "task" and "method" are likewise semantic types (rough class sketch below)
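- Read as plain classes, the data model above might look like this sketch (field names follow the notes, not the actual Feta XML schema):

```java
import java.util.List;

// Sketch of the Feta data model as noted above.
class Parameter {
    String name, description;
    String semanticType;     // ontology class the parameter is an instance of
    String format;           // e.g. fasta, embl, genbank
    String transportType;
    String collectionType;
    String collectionFormat;
}

class Operation {
    String name, description;
    String task;             // from the bio service ontology
    String method;           // algorithm/code type, also from the ontology
    String resource;
    String application;
    List<Parameter> hasInput;
    List<Parameter> hasOutput;
}

class Service {
    String name, description, author;
    List<String> organizations;
}

// WSDL-based operations/services specialize the generic model;
// workflows, BioMoby/Soaplab services, and local Java code do likewise.
class WsdlOperation extends Operation { }

class WsdlWebService extends Service {
    List<WsdlOperation> hasOperation;
}
```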
- SHIM (need acronym)
- semantically compatible, syntactically incompatible services
- uniprot database (uniprot_record) -> parser and filter shim -> blastp analysis (protein_sequence)
- working definition: a software component whose main purpose is to syntactically match otherwise incompatible resources. it takes some input, performs some task and produces an output. depending on usage, a shim can be semantically neutral ...
- in myGrid, basically doing type manipulations (mapping between abstract and concrete types), e.g., embl, genbank, fasta are concrete types, dna_sequence is an abstract type
- examples:
- parser / filter
- dereferencer
- syntax translator
- mapper
- iterator
- dereferencer
- service a (genbank id) -> dereferencer -> service b (genbank record)
- retrieves information from a URL (minimal sketch after these examples)
- syntax translator
- service a (dna seq; bsml) -> syntax translator -> service b (dna seq; agave)
- mapper
- service a (genbank id) -> mapper -> service b (embl id)
- iterator
- service a (collection of x) -> iterator -> service b (a single x)
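- A minimal dereferencer sketch, assuming records are fetchable over HTTP (plain Java URL handling; any URL template mapping ids to records would be supplied by the service and is not shown):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

// Dereferencer shim: given a URL derived from an id (e.g. a GenBank id),
// fetch the record text it points to, so the downstream service receives
// a record rather than an identifier.
public class DereferencerShim {
    static String dereference(String url) throws IOException {
        StringBuilder record = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                record.append(line).append('\n');
            }
        }
        return record.toString();
    }
}
```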
- seven steps to shim "nirvana"
- recognize 2 services are not compatible (syntactically, possibly semantically)
- recognize the degree of mismatch
- everything connected to everything
- identify what type of shim(s) is/are needed
- find or manufacture the shim
- advise user on "semantic safety" of the shim
- not clear what this means ...
- invoke the shim
- record provenance
- my (Shawn's) proposal: a shim is an actor/service whose input semantic type is the same as or more general than its output semantic type (see the sketch below)
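- A minimal sketch of that test, assuming a DL reasoner exposes a subsumption check (the Concept interface and subsumes() method are hypothetical stand-ins, not myGrid API):

```java
// Rendering of the proposal above: a service qualifies as a shim when its
// input semantic type subsumes (is equal to or more general than) its
// output semantic type.
interface Concept {
    /** True if this concept equals or is more general than other. */
    boolean subsumes(Concept other);
}

class ShimCheck {
    static boolean isShim(Concept inputType, Concept outputType) {
        return inputType.subsumes(outputType);
    }
}
```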
- Motivation
- workflows in grid-using communities
- challenges in supporting workflow management
- research on workflow planning at usc/isi
- using ai techniques in Pegasus to generate executable grid workflows
- using metadata descriptions as first step, to get away from the file encodings of VDL and Pegasus
- an operator is specified generally as an (if preconditions then add <stuff>) form, in Lisp/Scheme syntax
- example: user can say: I want the results of a pulsar search at this time and location
- the generation of the operation defs is done by hand ... began looking at how to construct them automatically
- The information model
- Organization of people, projects, experiments, and so on
- Operations, ... (Pinar)
- every data item can be annotated with various type information ... some slides
- mime types
- primary objective is to model escience processes, not the domain -- capturing the process provides added value: facilitates contextualization, data-model contracts between components, visualize integrated result object (as a result of a workflow), ...
- data fusion/integration not guided by this model
- The aim
- providing more direct support for the implementation of e-Science processes by:
- increasing the synergy between components
- facilitating data-model contracts between myGrid components
- defining a coherent myGrid architecture
- Some benefits:
- automatically capturing provenance and context information that is relevant to the interpretation and sharing of the results of the e-science experiments
- facilitating personalization and collaboration
- Implementation
- a database with a web service interface ... as canned queries
- generic interface, i.e., sql query
- performance penalty -- overhead, access calls, etc.
- Questions
- Does the model support "synthetic" versus "raw/natural" data?
- What about the set-up and calibration of tools?
- Also, predicted data versus experimentally observed
- The model is based on the CCRC model
- There are also a lot of standards that should be incorporated, so need some kind of extensibility
- There need to be place-holders for these within the information model
- Related issue is where the results should be stored
- three stores: one is the third-party databases (e.g., arrayexpress gene expression database ...) and link back
- this is encompassed by the MIR -- myGrid Info. Repository; like a notebook
- First thing done with information model
- Workbench: MIR browser, metadata browser, WF model editor/explorer, feta search gui
- Taverna execution environment: freefluo, and various plug-ins for MIR, Metadata Storage, and Feta
- MIR external
- Interestingly, the information model is "viewed" through a tree browser
- The Mediator
- Application oriented
- directly supports the e-Scientist by:
- providing pre-configured e-Science process templates (i.e., system-level workflows)
- helping to capture and maintain context information that is relevant to the interpretation and sharing of the results of the e-science experiments
- facilitating personalization and collaboration
- middleware-oriented
- contributes to the synergy between mygrid services by
- acting as a sink for e-Science events initiated by myGrid components
- interpreting the intercepted events and triggering interactions w/ other related components entailed by the semantics of those events
- compensating for possible impedance mismatches with other services both in terms of data types and interaction protocols
- not really an issue -- won't do much here -- but might be some other components that want to participate, and would need to have this service
- inspired, etc., by WSMF, WSMO, WSMX, WSML, ..., DERI web services -- Dieter Fensel, et al.
- Supporting the e-Scientist
- recurring use-cases can be captured
- find workflows use-case
- etc.
- mediating between services
- fully service based approach
- the whole myGrid as a service
- all communication done through web services (the mediator acts as the front door / gateway)
- the name Mediator is taken from the Gang of Four pattern of the same name
- internals
- mediation layer: action decision logic, event handlers, etc.
- interface aggregation layer: request router
- component access layer: mir proxy, enactor proxy, registry proxy, mds store proxy, dqp proxy, etc. (rough routing sketch after this list)
- all of these docs are under the MIR portion of the Wiki
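- As described, this is roughly the GoF Mediator wrapped around component proxies; a hypothetical sketch under those assumptions (none of these class names come from the actual myGrid code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the layering above: the request router (interface aggregation
// layer) dispatches incoming calls to proxies in the component access layer.
interface ComponentProxy {
    String handle(String request);
}

class Mediator {
    private final Map<String, ComponentProxy> proxies = new HashMap<>();

    void register(String component, ComponentProxy proxy) {
        proxies.put(component, proxy);
    }

    // The single "front door": all web-service calls pass through here and
    // are routed to the right component proxy (mir, enactor, registry, ...).
    String route(String component, String request) {
        ComponentProxy proxy = proxies.get(component);
        if (proxy == null) {
            throw new IllegalArgumentException("unknown component: " + component);
        }
        return proxy.handle(request);
    }
}
```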
- Peter Li: Large data set transfer use case from Graves' disease scenario
- Graves' disease: autoimmune thyroid disease; lymphocytes attack thyroid gland cells causing hyperthyroidism; symptoms: increased pulse rate, sweating, heat intolerance, goitre, exophthalmos; inherited
- In silico experiments: microarray data analysis, gene annotation pipeline, design of genotype assays for SNP variations
- large data set transfer problem: ~9 data sets × 60 MB of GD array data; affyR service integrates data sets, ...
- demo
- Tom Oinn
- service a passes data to service b
- service b may start before service a has finished executing
- need a comprehensive solution
- lsid's won't work
- to get the data out of it, you have to use soap calls, and you get all the data at once, or none
- the only way is if the lsid points to a stream -- otherwise lsid arch. won't support it
- Inferno ... Reading e-Science Centre (?) in the UK ... Inferno e-service
- take any command line tool, wrap it up in this mechanism, deal with the reference passing, automatically
- inputs are urls, protocol called styx
- basically, a naming convention that lets you denote streams
- http://www.vitanuova.com/solutions/grid/grid.html
- Chris Wroe
- use case from integrative biology
- oxford and new zealand
- from dna to whole organism modeling
- cardiac vulnerability to acute ischemia: step 1: import mechanical model from Auckland data
- get mechanical model of heart
- take slice, place in perfusion bath, top and bottom surfaces isolated, site pacing ...
- finite element approach
- properties of perfusion bath
- protocol for what they do in the experiment: pace at 250ms, apply shock, repeat with diff. intervals, etc.
- each simulation takes a week
- perturb initial conditions; stage 1 hypoxia (lack of oxygen), stage 2 hypoxia
- data analysis: construct activation map, measure activation potential duration, threshold for fibrillation, file produced every 1ms, big
- perl/shell scripts for all of this
- want to "e-ify" this.
- simulation step
- long running, no other examples of this in myGrid
- finite element bidomain solver: mechanical model, electrophysio model, simulation protocol, initial conditions, parameters -> a result file produced every 1ms, 7.3 MB each
- monitor, stop, checkpoint, discard, restart with different parameters
- a mesh problem ... so more computation and you still run it for a week
- http://www.geodise.org (Simon Cox)
- Jeffrey Grethe
- BIRN workflow requirements (Biomedical informatics research network)
- enable new understanding of neurological disease by integrating data across multiple scales from macroscopic brain function etc.
- telescience portal enabled tomography workflow
- composed of the sequence of steps required to acquire, process, visualize, and extract useful information from a 3D volume
- morphometry workflow
- structural analysis of data
- large amounts of pre-processing
- normalization, calibration, etc., to get data in a form to be analyzed
- most methods in the pre-process stream can lead to errors
- requires manual editing, etc., and have a set of checkpoints, where a user interacts
- moving towards high-performance computing resources
- parameter sweeps
- taking birn-miriad numbers and comparing to what scientist has done ...
- researcher traced out different areas of the brain; need to compare with the fully automated approach
- looking for correct parameters to use for the imaging
- get as close as you can to what the trained researcher can do: correlate minute changes in actual brain structure with saying to some patient "we should put you on some drug regime because you have Alzheimer's" -- some preventive course of action
- has picture/slide of the workflow
- baseline preprocessing can take upwards of a day
- Karan Vahi
- Abstract Workflow (DAX): expressed in terms of logical entities; specifies all logical files required to generate the desired data product from scratch; dependencies between the jobs; analogous to a build-style DAG
- format for specifying the abstract workflow; identifies the recipe for creation
- XML syntax / format (hypothetical fragment below)
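- For flavor, a hypothetical DAX-style fragment for a two-job build-style DAG (element and attribute names are approximations in the spirit of the format, not the exact Pegasus schema):

```xml
<adag name="example">
  <job id="ID000001" name="preprocess">
    <uses file="f.input" link="input"/>
    <uses file="f.intermediate" link="output"/>
  </job>
  <job id="ID000002" name="analyze">
    <uses file="f.intermediate" link="input"/>
    <uses file="f.output" link="output"/>
  </job>
  <!-- dependency: ID000002 runs after ID000001 -->
  <child ref="ID000002">
    <parent ref="ID000001"/>
  </child>
</adag>
```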
- Concrete workflow ...
- alternate replica mechanisms
- how to manage replicas of the same service?
- haven't been looking at that, because of the mandate of the Pegasus ...
- all jobs run independently, wrapped around java executables, shell scripts, etc.
- leveraging Condor and Condor-G, which don't go further with web services, etc.
- Adam Birnbaum
- Resurgence project
- encyclopedia of life (eol): automated annotations for all of the known protein sequences; slurp 1.5 million things out of a db, and push through seven to ten programs
- both want to have some kind of simple visual programming screen, see nothing but icons relevant to their field, set up the workflow, say go, and do it 1.5 million times / domain-specific tools/icons, and say go repeatedly
- need constraints among icons: outputs and inputs
- template workflows, default settings, etc.
- check validity of resulting configurations / workflows
- what is meant here by high throughput: thousands of tasks per month (not FLOPS), e.g., 1.5 million jobs over a 6-month period
- scientist wants to run many times, varying the inputs
- apst and nimrod (tested with these)
- pegasus in same category
- Brainstorm
- data-intensive
- 3rd party transfer
- handling handles
- streaming
- SRB
- where does data intensive transport fit?
- separation of concerns ... who does what?
- is there a one-size-fits-all framework?
- wf-life cycle
- construction / design
- instantiation / parameter / data binding
- execution ~ streaming (provenance)
- compute-intensive
- streaming
- wf exception handling
- job scheduling: where does it fit? (to hide or not to hide)
- Non-Breakout Breakout on Registry Services, etc.
- mygrid and biomoby "data models" are similar enough to plug together
- different ontologies: service, bioinformatics, molecular biology
- data model for services, etc.
- lots of discussion ...
- Verification of experiment data; recipes for experiment designs; explanation for the impact of changes; ownership; performance; data quality
- The "Provenance Pyramid" -- Knowledge level; Organisation Level; Data Level; Process Level
- Organisation Level at the bottom left of the pyramid, the same size as the right side, which contains the Data Level on top of the Process Level
- myGrid approach
- LSIDs: to identify objects
- myGrid information model and mIR: to store lower levels of the pyramid
- sem web technologies (RDF, Ontologies): to store knowledge provenance
- Taverna workflow workbench and plugins: ensure automated recording
- LSIDs
- each bioinf database on the web has:
- diff. policies for assigning and maintaining identifiers, dealing with versioning, etc.
- diff. mechanisms ...
- OMG standard
- urn:lsid:AuthorityID:NamespaceID:ObjectID:RevisionID
- urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2
- LSID designator -- the item being identified is a life science-specific resource
- authority identifier -- internet domain owned by the org that assigns an LSID to a resource
- namespace id -- namespace within the authority (e.g., GenBank)
- etc. (parsing sketch below)
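- A minimal parsing sketch based on the layout above (illustrative only; real applications would use IBM's LSID client libraries rather than hand-parsing):

```java
// Splits an LSID URN of the form urn:lsid:Authority:Namespace:Object[:Revision]
// into its components; the revision part is optional.
class Lsid {
    final String authority, namespace, object, revision; // revision may be null

    Lsid(String urn) {
        String[] parts = urn.split(":");
        if (parts.length < 5 || !"urn".equals(parts[0]) || !"lsid".equals(parts[1])) {
            throw new IllegalArgumentException("not an LSID: " + urn);
        }
        authority = parts[2];                           // e.g. ncbi.nlm.nih.gov
        namespace = parts[3];                           // e.g. GenBank
        object    = parts[4];                           // e.g. T48601
        revision  = parts.length > 5 ? parts[5] : null; // e.g. 2
    }
}
```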
- how is data retrieved with LSIDs?
- application -> 1. get me info for ID --> LSID client
- 2. where can i get data and metadata for ID
- returns wsdl doc giving information on where to get the data
- Authority commitments
- data returned for a given lsid must always be the same
- must always maintain an authority at e.g. pdb.org that can point to data and metadata resolvers
- lsid components
- IBM built client and server implementations in Perl, Java, C++ ...
- fairly straightforward to wrap an existing db as a source of data or metadata
- client also straightforward
- LSID launchpad ... within internet explorer (type in your lsid, returns metadata, etc)
- Use of LSIDs within myGrid
- needed an id for things such as workflows, experiments, new data results, etc.
- everything id'd with LSIDs
- built and deployed: LSID assigning server; lsid authority (http://www.mygrid.org.uk); metadata resolver; data resolver; (all based on IBM's open source implementation)
- experiences
- advantages: urn makes it easy to integrate with semantic web tools; more explicit than a url: there is an explicit protocol for separating metadata from data
- disadvantages: have to decide what is data and what is metadata because they have different commitments (versioning); up to Jul 04, implementations were chasing revisions as the standard matured ... now seems stable as standardisation is more complete; to be successful across the community, it will require widespread adoption by providers such as GenBank, UniProt, etc.
- Provenance storage
- architecture
- 1. data sent/received from services; 2. new LSIDs assigned to data; 3. data / metadata stored; ...
- metadata store: Jena RDF store; pushes RDF to LSID metadata resolver
- mIR is an object relational database pushes XStream-RDF to LSID metadata resolver, and objects to LSID data resolver
- use Jena to store the RDF data (rough sketch below)
- lsid resolver outputs xml and text-plain
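- A rough sketch of steps 2-3 under these assumptions, using Jena to record derivation triples against LSIDs (the property URIs and LSID namespaces below are made-up placeholders, not the myGrid schema):

```java
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

public class ProvenanceSketch {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();

        // Placeholder vocabulary for provenance relations.
        String ns = "http://example.org/provenance#";
        Property producedBy = model.createProperty(ns, "producedBy");
        Property usedInput  = model.createProperty(ns, "usedInput");

        // A result gets a fresh LSID, then triples recording how it was
        // derived go into the RDF metadata store.
        Resource result = model.createResource("urn:lsid:www.mygrid.org.uk:result:42");
        result.addProperty(producedBy,
            model.createResource("urn:lsid:www.mygrid.org.uk:operation:blastp"));
        result.addProperty(usedInput,
            model.createResource("urn:lsid:www.mygrid.org.uk:data:41"));

        // The serialized RDF is what the LSID metadata resolver serves back.
        model.write(System.out, "RDF/XML");
    }
}
```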
- scientific annotation
- the goal of this experiment was ...
- the results prove the hypothesis that...
- need a schema for these annotations
- tools to add the annotations
- Tracy Craddock
- Williams workflow B ...
- large amounts of data (or datatypes)
- data implicitly linked within itself
- data is implicitly linked outside of itself
- genomic sequence is the central co-ordinating point, but there are a number of different co-ordinate systems
- some "biological", some artifacts of the workflow
- what's the problem
- we don't have a domain model
- we need a model for visualization
- but, domain models are hard
- it's not clear that the domain model should be in the middleware
- what have we done!?
- bioinformatics pm (pre myGrid)
- one big distributed data heterogeneity and integration problem
- still a big distributed data heterogeneity and integration problem
- how do we solve the problem
- take the data, use something (perl or an MSc student) to map the data into a (partial) data model
- visualize this ...
- but what if the workflow changes?
- second solution
- large quantities of data are already available with rich mark up in a visualizable form
- this is unparsable, so also get the flat file rep
- start to build visualization information into the workflow using beanshell
- linked data from output -- domain model = scripts that hack these things together
- summary
- domain models are hard
- workflows can obfuscate the model
- visualization requires one
- we can build some knowledge of a domain model into the workflow and steal the rest.
- is there a better way?