(The following notes taken by S. Bowers)
 
 
 
 Feta Architecture
 Ontologist (Chris Wroe) -> Ontology Editor -> DL Reasoner -> Classification (in RDF(S)) -> obtain classification -> Feta, PeDRo
 Store WSDL descriptions (in a special XML schema), then annotate, and give to Feta
 The classified ontology and the annotated WSDL are merged into a single graph
 Taverna Workflow Workbench issues "semantic discovery via conceptual descriptions" against Feta ... a set of canned queries
  Feta Engine
 Feta Loader uses the myGrid service ontology and the domain ontology
 uses Jena, e.g., to run RDQL queries (a sketch follows)
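 A minimal sketch of the loader/query step using Apache Jena. The notes mention RDQL; current Jena uses SPARQL, so the query below is SPARQL, and the file names, prefix, and feta:performsTask property are illustrative, not the actual Feta vocabulary:

    import org.apache.jena.query.*;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    public class FetaLoaderSketch {
        public static void main(String[] args) {
            // Load the classified ontology and the annotated service descriptions,
            // then merge them into the single graph that Feta queries against.
            Model ontology = ModelFactory.createDefaultModel().read("mygrid-ontology.rdf");
            Model annotations = ModelFactory.createDefaultModel().read("annotated-wsdl.rdf");
            Model merged = ontology.union(annotations);

            // A "canned" discovery query: find operations performing a given task.
            String q = "PREFIX feta: <http://example.org/feta#> "
                     + "SELECT ?op WHERE { ?op feta:performsTask feta:sequence_alignment }";
            try (QueryExecution qe = QueryExecutionFactory.create(QueryFactory.create(q), merged)) {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    System.out.println(results.next().get("op"));
                }
            }
        }
    }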
  Feta Data Model
 Operation (name, description, task -- from a bio service ontology, method -- particular type of algo/codes also from onto but not used much, resource, application, hasInput : Parameter, hasOutput : Parameter)
 Parameter (name, desc, semantic type, format, transport type, collection type, collection format)
 Service (name, description, author, organizations)
 A WSDL-based operation is a subclass of Operation
 A WSDL-based web service is a subclass of Service (hasOperation : WSDL-based operation)
 workflow, bioMoby service, Soaplab service, and local Java code are subclasses of Service and Operation
 A seqHound service is an operation
 Each parameter can have a semantic type stating that the parameter is an instance of an ontology class; the operation's "task" and "method" are also semantic types (a class sketch follows)
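 A sketch of the data model above as plain Java classes; the fields follow the notes, the types are assumptions:

    import java.util.List;

    class Parameter {
        String name, description, semanticType, format,
               transportType, collectionType, collectionFormat;
    }

    class Operation {
        String name, description;
        String task;    // from the bio service ontology
        String method;  // particular type of algorithm/code, also from the ontology
        String resource, application;
        List<Parameter> hasInput, hasOutput;
    }

    class Service {
        String name, description, author;
        List<String> organizations;
    }

    // WSDL-based specializations, per the subclass relationships above.
    class WsdlOperation extends Operation { }
    class WsdlWebService extends Service {
        List<WsdlOperation> hasOperation;
    }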
 
 
 
 SHIM (need acronym)
 semantically compatible, syntactically incompatible services
 uniprot database (uniprot_record) -> parser and filter shim -> blastp analysis (protein_sequence)
 working definition: a software component whose main purpose is to syntactically match otherwise incompatible resources. It takes some input, performs some task, and produces an output. Depending on usage, a shim can be semantically neutral ... (a sketch of two shim kinds follows the examples below)
 in myGrid, shims basically do type manipulations (mapping between abstract and concrete types), e.g., EMBL, GenBank, and FASTA are concrete types; dna_sequence is an abstract type
 examples:
 parser / filter
 de-referencer
 syntax translator
 mapper
 iterator
  dereferencer
 service a (genbank id) -> dereferencer -> service b (genbank record)
 retrieves information from a URL
  syntax translator
 service a (dna seq; bsml) -> syntax translator -> service b (dna seq; agave)
  mapper
 service a (genbank id) -> mapper -> service b (embl id)
  iterator
 service a (collection of x) -> iterator -> service b (a single x)
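 A sketch of the shim idea in Java, covering the dereferencer and iterator examples above; the Shim interface and the endpoint URL are illustrative assumptions, not myGrid API:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URI;
    import java.util.Iterator;
    import java.util.List;

    // A shim takes one input, performs a purely syntactic task, produces one output.
    interface Shim<I, O> {
        O apply(I input);
    }

    // "iterator" shim: adapts a collection-producing service to a
    // downstream service that consumes one item at a time.
    class IteratorShim<X> implements Shim<List<X>, Iterator<X>> {
        public Iterator<X> apply(List<X> collection) {
            return collection.iterator();
        }
    }

    // "dereferencer" shim: turns an id (e.g., a GenBank id) into the
    // record it denotes by retrieving it from a URL.
    class DereferencerShim implements Shim<String, String> {
        public String apply(String genbankId) {
            String url = "https://example.org/genbank/" + genbankId; // hypothetical endpoint
            try (InputStream in = URI.create(url).toURL().openStream()) {
                return new String(in.readAllBytes());
            } catch (IOException e) {
                throw new RuntimeException("dereference failed for " + genbankId, e);
            }
        }
    }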
  seven steps to shim "nirvana"
 recognize 2 services are not compatible (syntactically, possibly semantically)
 recognize the degree of mismatch
 everything connected to everything
  identify what type of shim(s) is/are needed
 find or manufacture the shim
 advise user on "semantic safety" of the shim
 not clear what this means ... 
  invoke the shim
 record provenance
 my (Shawn's) proposal: a shim is an actor/service whose input semantic type is the same as, or more general than, its output semantic type (see the subsumption check sketched below)
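 A sketch of that subsumption test with Jena's ontology API; the ontology file and class URIs are illustrative:

    import org.apache.jena.ontology.OntClass;
    import org.apache.jena.ontology.OntModel;
    import org.apache.jena.ontology.OntModelSpec;
    import org.apache.jena.rdf.model.ModelFactory;

    public class ShimCheck {
        // A service qualifies as a shim if its input semantic type equals,
        // or is a superclass of (more general than), its output semantic type.
        static boolean isShim(OntModel onto, String inputType, String outputType) {
            OntClass in = onto.getOntClass(inputType);
            OntClass out = onto.getOntClass(outputType);
            return in != null && out != null && (in.equals(out) || out.hasSuperClass(in));
        }

        public static void main(String[] args) {
            OntModel onto = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_RDFS_INF);
            onto.read("mygrid-domain-ontology.owl");
            System.out.println(isShim(onto,
                "http://example.org/onto#sequence",
                "http://example.org/onto#dna_sequence"));
        }
    }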
 
 
 
 Motivation
 workflows in grid-using communities
 challenges in supporting workflow management
  research on workflow planning at USC/ISI
 using AI techniques in Pegasus to generate executable grid workflows
  using metadata descriptions as a first step, to get away from the file encodings of VDL and Pegasus
 an operator is specified generally as an "(if preconditions then add <stuff>)" form, in Lisp/Scheme syntax (rendered in a Java sketch below)
 example: a user can say: I want the results of a pulsar search at this time and location
  the operator definitions are written by hand ... began looking at how to construct them automatically
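 The slides used Lisp/Scheme syntax; here is the same STRIPS-style "(if preconditions then add <stuff>)" shape as a Java sketch, with all names illustrative:

    import java.util.HashSet;
    import java.util.Set;

    record Operator(String name, Set<String> preconditions, Set<String> addList) {
        // Apply the operator to a world state when its preconditions hold.
        Set<String> apply(Set<String> state) {
            if (!state.containsAll(preconditions)) {
                throw new IllegalStateException("preconditions not met for " + name);
            }
            Set<String> next = new HashSet<>(state);
            next.addAll(addList);  // the "add <stuff>" part
            return next;
        }
    }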
 
 
 
 The information model
 Organization of people, projects, experiments, and so on
 Operations, ... (Pinar)
 every data item can be annotated with various type information ... some slides 
 mime types
 primary objective is to model e-Science processes, not the domain -- capturing the process provides added value: it facilitates contextualization and data-model contracts between components, and lets you visualize the integrated result object (the outcome of a workflow), ...
 data fusion/integration not guided by this model
  The aim
 providing more direct support for the implementation of e-Science processes by:
 increasing the synergy between components
 facilitating data-model contracts between myGrid components
 defining a coherent myGrid architecture
  Some benefits:
 automatically capturing provenance and context information that is relevant to the interpretation and sharing of the results of the e-science experiments
 facilitating personalization and collaboration
  Implementation
 a database with a web service interface ... as canned queries
 generic interface, i.e., SQL query
 performance penalty -- overhead, access calls, etc.
  Questions
 Does the model support "synthetic" versus "raw/natural" data? 
 What about the set-up and calibration of tools?
 Also, predicted data versus experimentally observed data
 The model is based on the CCRC model
 There are also a lot of standards that should be incorporated, so need some kind of extensibility
 There needs to be place-holders for these within the information model
 Related issue is where the results should be stored
 three stores: one is the third-party databases (e.g., arrayexpress gene expression database ...) and link back
 this is encompassed by the MIR -- myGrid Info. Repository; like a notebook
  First thing done with information model
 Workbench: MIR browser, metadata browser, WF model editor/explorer, feta search gui
 Taverna execution environment: freefluo, and various plug-ins for MIR, Metadata Storage, and Feta
 MIR external
 Interestingly, the information model is "viewed" through a tree browser 
  The Mediator
 Application oriented
 directly supports the e-Scientist by:
 providing pre-configured e-Science process templates (i.e., system-level workflows)
 helping capture and maintain context information that is relevant to the interpretation and sharing of the results of e-Science experiments
 facilitating personalization and collaboration
  middleware-oriented
 contributes to the synergy between mygrid services by
 acting as a sink for e-Science events initiated by myGrid components
 interpreting the intercepted events and triggering interactions w/ other related components entailed by the semantics of those events
 compensating for possible impedance mismatches with other services, both in terms of data types and interaction protocols
 not really an issue -- won't do much here -- but might be some other components that want to participate, and would need to have this service
  inspired by WSMF, WSMO, WSMX, WSML, ..., DERI web services -- Dieter Fensel, et al.
  Supporting the e-Scientist
 recurring use-cases can be captured
 find workflows use-case
 etc.
  mediating between services
 fully service based approach
 the whole myGrid as a service
 all communication done through web services (the mediator acts as the front door / gateway)
  the name "mediator" is taken from the Gang of Four pattern of the same name (sketched below)
 internals
 mediation layer: action decision logic, event handlers, etc.
 interface aggregation layer: request router
 component access layer: mir proxy, enactor proxy, registry proxy, mds store proxy, dqpproxy, etc.
  all of these docs are under the MIR portion of the wiki
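 A Gang-of-Four-style sketch of that layering; the event names and proxy keys are illustrative, not the actual myGrid interfaces:

    import java.util.Map;

    // Component access layer: one proxy per myGrid service (mIR, enactor, registry, ...).
    interface ComponentProxy {
        void handle(String payload);
    }

    class Mediator {
        private final Map<String, ComponentProxy> proxies;

        Mediator(Map<String, ComponentProxy> proxies) {
            this.proxies = proxies;
        }

        // Mediation layer: interpret an intercepted e-Science event and trigger
        // the interactions with other components that its semantics entail.
        void onEvent(String eventType, String payload) {
            switch (eventType) {
                case "workflow.completed" -> route("mirProxy", payload); // store results
                case "data.annotated"     -> route("mdsProxy", payload); // update metadata
                default -> { /* sink: ignore events no component cares about */ }
            }
        }

        // Interface aggregation layer: request router.
        private void route(String proxyName, String payload) {
            proxies.get(proxyName).handle(payload);
        }
    }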
 
 
 
 Peter Li: Large data set transfer use case from Graves' disease scenario
  Graves' disease: autoimmune thyroid disease; lymphocytes attack thyroid gland cells, causing hyperthyroidism; symptoms: increased pulse rate, sweating, heat intolerance, goitre, exophthalmos; inherited
 In silico experiments: microarray data analysis, gene annotation pipeline, design of genotype assays for SNP variations
 large data set transfer problem: ~9 data sets x 60 MB of GD array data; the affyR service integrates the data sets, ...
 demo
 
 
 Tom Oinn
 service A passes data to service B
 service B may start before service A finishes execution
 need a comprehensive solution
 LSIDs won't work
 to get the data out, you have to use SOAP calls, and you get all the data at once or none
 the only way is if the LSID points to a stream -- otherwise the LSID architecture won't support it
 Inferno ... Reading e-Science Centre (?) in the UK ... Inferno e-service
 take any command-line tool, wrap it up in this mechanism, and deal with the reference passing automatically
 inputs are URLs; the protocol is called Styx
 basically, a naming convention that lets you denote streams (a sketch follows the link below)
 http://www.vitanuova.com/solutions/grid/grid.html  
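 A sketch of the pass-by-reference idea: service A hands service B a URL naming a stream, and B consumes it incrementally instead of pulling the whole value in one SOAP call. The endpoint is hypothetical, and plain HTTP stands in for Styx:

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URI;

    public class StreamHandoff {
        // B can start reading before A has finished writing.
        static void consume(String streamUrl, OutputStream sink) throws Exception {
            try (InputStream in = URI.create(streamUrl).toURL().openStream()) {
                in.transferTo(sink);  // bytes flow through as they arrive
            }
        }

        public static void main(String[] args) throws Exception {
            consume("http://example.org/run42/output-stream", System.out);
        }
    }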
 
 Chris Wroe
 use case from integrative biology
 oxford and new zealand
 from DNA to whole-organism modeling
 cardiac vulnerability to acute ischemia: step 1: import mechanical model from Auckland data
 get mechanical model of heart
 take a slice, place it in a perfusion bath, top and bottom surfaces isolated, site pacing ...
 finite element approach
  properties of the perfusion bath
 protocol for what they do in the experiment: pace at 250 ms, apply shock, repeat with different intervals, etc.
 each simulation takes a week
 perturb initial conditions; stage 1 hypoxia (lack of oxygen), stage 2 hypoxia
 data analysis: construct activation map, measure action potential duration, threshold for fibrillation; a file is produced every 1 ms, big
 perl/shell scripts for all of this
  want to "e-Science-ify" this
 simulation step
 long running, no other examples of this in myGrid
 finite element bidomain solver: mechanical model, electrophysiology model, simulation protocol, initial conditions, parameters -> a 7.3 MB result file is produced for every 1 ms
 monitor, stop, checkpoint, discard, restart with different parameters
 a mesh problem ... so more computation and you still run it for a week
  http://www.geodise.org Simon Cox 
 
 
 Jeffrey Grethe 
 BIRN workflow requirements (Biomedical Informatics Research Network)
 enable new understanding of neurological disease by integrating data across multiple scales from macroscopic brain function etc.
 telescience portal enabled tomography workflow
 composed of the sequence of steps required to acquire, process, visualize, and extract useful information from a 3D volume
  morphometry workflow
 structural analysis of data
 large amounts of pre-processing
 normalization, calibration, etc., to get data in a form to be analyzed
 most methods in the pre-process stream can lead to errors
 requires manual editing, etc., and has a set of checkpoints where a user interacts
  moving towards high-performance computing resources
  parameter sweeps
 taking BIRN-MIRIAD numbers and comparing them to what the scientist has done ...
 a researcher traced out different areas of the brain; need to compare the fully automated approach against that
 looking for the correct parameters to use for the imaging (a parameter-sweep sketch follows this list)
 get as close as you can to what the trained researcher can do: correlate minute changes in actual brain structure so you can say to a patient "we should put you on this drug regime because you have Alzheimer's" -- move to some preventive course of action
 has picture/slide of the workflow 
 baseline preprocessing can take upwards of a day
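 A toy sketch of such a sweep: run the automated analysis over a grid of parameter values and keep the setting whose output best matches the researcher's manual tracing. segment(), manualTracing(), and the similarity measure are stand-ins, not BIRN tools:

    public class SweepSketch {
        public static void main(String[] args) {
            double[] thresholds = {0.1, 0.2, 0.3, 0.4, 0.5};
            double[] smoothings = {1.0, 2.0, 4.0};
            double best = Double.NEGATIVE_INFINITY;
            double bestT = 0, bestS = 0;
            for (double t : thresholds) {
                for (double s : smoothings) {
                    double score = similarity(segment(t, s), manualTracing());
                    if (score > best) { best = score; bestT = t; bestS = s; }
                }
            }
            System.out.printf("best score %.3f at threshold=%.2f smoothing=%.2f%n",
                    best, bestT, bestS);
        }

        static double[] segment(double threshold, double smoothing) {  // stand-in
            return new double[]{threshold, smoothing};
        }
        static double[] manualTracing() {                              // stand-in
            return new double[]{0.3, 2.0};
        }
        static double similarity(double[] a, double[] b) {             // higher is better
            double d = 0;
            for (int i = 0; i < a.length; i++) d += Math.abs(a[i] - b[i]);
            return -d;
        }
    }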
 
 
 Karan Vahi 
 Abstract Workflow (DAX): expressed in terms of logical entities; specifies all logical files required to generate the desired data product from scratch; dependencies between the jobs; analogous to a build-style DAG
 format for specifying the abstract workflow; identifies the recipe for creation
 XML syntax / format (a structural sketch follows)
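 A structural sketch of the abstract-workflow idea in Java: jobs named by logical transformation and logical files only, plus dependencies, with nothing bound to physical locations yet. The record shapes mirror the description above, not the actual DAX schema:

    import java.util.List;
    import java.util.Map;

    record Job(String id, String transformation, List<String> inputs, List<String> outputs) {}

    record AbstractWorkflow(List<Job> jobs, Map<String, List<String>> childToParents) {}

    class DaxSketch {
        public static void main(String[] args) {
            Job preprocess = new Job("ID1", "preprocess", List.of("f.input"), List.of("f.mid"));
            Job analyze    = new Job("ID2", "analyze",    List.of("f.mid"),   List.of("f.out"));
            AbstractWorkflow dax = new AbstractWorkflow(
                    List.of(preprocess, analyze),
                    Map.of("ID2", List.of("ID1")));  // analyze depends on preprocess
            System.out.println(dax);
        }
    }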
  Concrete workflow ... 
 alternate replica mechanisms 
 how to manage replicas of the same service?
 haven't been looking at that, because of the mandate of the Pegasus ... 
 all jobs run independently, wrapped around Java executables, shell scripts, etc.
 leveraging Condor and Condor-G, which don't go further with web services, etc.
 
 
 Adam Birnbaum
 Resurgence project
 Encyclopedia of Life (EOL): automated annotations for all of the known protein sequences; slurp 1.5 million things out of a DB and push them through seven to ten programs
 both projects want some kind of simple visual programming screen: see nothing but icons relevant to their field, set up the workflow, say go, and run it 1.5 million times / domain-specific tools/icons, and say go repeatedly
 need constraints among icons: outputs and inputs
 template workflows, default settings, etc.
 check validity of resulting configurations / workflows (a sketch follows this list)
 what is meant here by high throughput: thousands of tasks per month (not FLOPS), e.g., 1.5 million jobs over a 6-month period
 scientist wants to run many times, varying the inputs
 APST and Nimrod (tested with these)
 Pegasus is in the same category
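 A sketch of the output-to-input constraint check; exact type matching stands in for whatever compatibility rule the tools actually use:

    import java.util.List;

    record Port(String name, String type) {}
    record Connection(Port from, Port to) {}

    class WorkflowValidator {
        // A wiring is valid only if every connection joins matching types.
        static List<Connection> invalid(List<Connection> wiring) {
            return wiring.stream()
                    .filter(c -> !c.from().type().equals(c.to().type()))
                    .toList();
        }

        public static void main(String[] args) {
            List<Connection> bad = invalid(List.of(
                new Connection(new Port("fetch.out", "genbank_record"),
                               new Port("blast.in", "protein_sequence"))));
            bad.forEach(c -> System.out.println("type mismatch: " + c));
        }
    }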
 
 
 Brainstorm
 data-intensive
 3rd party transfer
 handling handles
 streaming
 SRB
 where does data intensive transport fit?
 separation of concerns ... who does what? 
 is there a one-size-fits-all framework? 
 wf-life cycle
 construction / design
 instantiation / parameter / data binding
 execution ~ streaming (provenance)
  compute-intensive
 streaming
 wf exception handling
 job scheduling: where does it fit? (to hide or not to hide)
 
 
 Non-Breakout Breakout on Registry Services, etc.
 myGrid and bioMoby "data models" are similar enough to plug together
 different ontologies: service, bioinformatics, molecular biology
 data model for services, etc.
 lots of discussion ...
 
 
 
 Verification of experiment data; recipes for experiment designs; explanation for the impact of changes; ownership; performance; data quality
 The "Provenance Pyramid" -- Knowledge level; Organisation Level; Data Level; Process Level
 Organisation Level sits at the bottom left of the pyramid, the same size as the right side, which contains the Data Level on top of the Process Level
  myGrid approach
 LSIDs: to identify objects
 myGrid information model and mIR: to store lower levels of the pyramid
 sem web technologies (RDF, Ontologies): to store knowledge provenance
 Taverna workflow workbench and plugins: ensure automated recording
  LSIDs
 each bioinf database on the web has: 
 diff. policies for assigning and maintaining identifiers, dealing with versioning, etc.
 diff. mechanisms ...
  OMG standard
 urn:lsid:AuthorityID:NamespaceID:ObjectID:RevisionID
 urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2
 lsid designator -- the item being identified is a life-science-specific resource
 authority identifier -- internet domain owned by the org that assigns an LSID to a resource
 namespace id -- name of the resource
 etc. (a parser sketch follows)
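 A minimal parser for that layout; this is illustrative, not IBM's client API:

    record Lsid(String authority, String namespace, String object, String revision) {
        static Lsid parse(String urn) {
            String[] parts = urn.split(":");
            if (parts.length < 5 || !parts[0].equalsIgnoreCase("urn")
                    || !parts[1].equalsIgnoreCase("lsid")) {
                throw new IllegalArgumentException("not an LSID: " + urn);
            }
            // RevisionID is optional.
            return new Lsid(parts[2], parts[3], parts[4],
                            parts.length > 5 ? parts[5] : null);
        }
    }

    class LsidDemo {
        public static void main(String[] args) {
            System.out.println(Lsid.parse("urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2"));
        }
    }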
  how is data retrieved with LSIDs?
 application -> 1. get me info for ID --> LSID client
 2. where can I get data and metadata for the ID?
 returns a WSDL doc giving information on where to get the data
  Authority commitments
 data returned for a given lsid must always be the same
 must always maintain an authority at e.g. pdb.org that can point to data and metadata resolvers
  lsid components
 IBM built client and server implementations in Perl, Java, C++ ...
 fairly straightforward to wrap an existing db as a source of data or metadata
 client also straightforward
 LSID LaunchPad ... within Internet Explorer (type in your LSID, get back metadata, etc.)
  Use of LSIDs within myGrid
 needed an id for things such as workflows, experiments, new data results, etc.
 everything id'd with LSIDs
 built and deployed: LSID-assigning server; LSID authority (http://www.mygrid.org.uk ); metadata resolver; data resolver (all based on IBM's open source implementation)
  experiences
 advantages: the URN form makes it easy to integrate with semantic web tools; more explicit than a URL: there is an explicit protocol for separating metadata from data
 disadvantages: you have to decide what is data and what is metadata because they carry different commitments (versioning); up to Jul 04, implementations were chasing revisions as the standard matured ... now seems stable as standardisation is more complete; to be successful across the community, it will require widespread adoption by providers such as GenBank, UniProt, etc.
  Provenance storage
 architecture
 1. data sent/received from services; 2. new LSIDs assigned to data; 3. data / metadata stored; ...
  metadata store: Jena RDF store; pushes RDF to the LSID metadata resolver
 the mIR is an object-relational database; it pushes XStream RDF to the LSID metadata resolver and objects to the LSID data resolver
 use Jena to store the RDF data
 the LSID resolver outputs XML and text/plain (a storage sketch follows)
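 A sketch of the metadata-store step with Jena: assign an LSID to a new data item and record provenance as RDF, which the LSID metadata resolver can then serve. The LSID, property choice (Dublin Core), and values are illustrative:

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.vocabulary.DC;

    public class ProvenanceStoreSketch {
        public static void main(String[] args) {
            Model store = ModelFactory.createDefaultModel();
            Resource item = store.createResource("urn:lsid:www.mygrid.org.uk:experiment:42");
            item.addProperty(DC.creator, "workflow-run-17")
                .addProperty(DC.date, "2004-09-01");
            store.write(System.out, "RDF/XML");  // what the metadata resolver would return
        }
    }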
  scientific annotation
 the goal of this experiment was ... 
 the results prove the hypothesis that...
 need a schema for these annotations
 tools to add the annotations
  Tracy Craddock
 
 
 
 