Science Environment for Ecological Knowledge

e-Science Link-Up Meeting, Oct 04



Meeting notes and updates on the e-Science Link-Up Meeting

(The following notes were taken by S. Bowers.)

Semantic Registration in Taverna (Pinar Alper)

    • Feta Architecture
      • Ontologist (Chris Wroe) -> Ontology Editor -> DL Reasoner -> Classification (in RDF(S)) -> obtain classification -> Feta, PeDRo
      • Store WSDL Descriptions (in special XML schema), then annotate, and give to Feta
      • The classified ontology and the annotated WSDL are merged into a single graph
      • The Taverna Workflow Workbench issues "semantic discovery via conceptual descriptions" queries against Feta ... a set of canned queries
    • Feta Engine
      • Feta Loader uses the myGrid service ontology and domain ontology
      • uses Jena, e.g., to run RDQL queries, etc. (see the sketch after this list)
    • Feta Data Model
      • Operation (name; description; task -- from a bioinformatics service ontology; method -- a particular type of algorithm/code, also from the ontology but not used much; resource; application; hasInput : Parameter; hasOutput : Parameter)
      • Parameter (name, desc, semantic type, format, transport type, collection type, collection format)
      • Service (name, description, author, organizations)
      • WSDL based operation is a subclass of Operation
      • WSDL based Web Service is a subclass of Service (hasOperation : WSDL based operation)
      • workflow, BioMoby service, Soaplab service, and local Java code are subclasses of Service and Operation
      • SeqHound service is an operation
      • Each parameter can have a semantic type, stating that the parameter is an instance of a class, and the operation can have a "task" which is also a "semantic type" and "method"
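
To make the Feta Engine notes above concrete, here is a minimal sketch, assuming Jena 2's RDQL API, of loading the merged graph (classified ontology plus annotated WSDL) and running one canned query: find operations whose input parameter has a given semantic type. The namespace URI, property and class names, and file name are invented placeholders, not the real Feta schema.

    // Hedged sketch of a Feta-style canned query; assumes Jena 2 (RDQL).
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdql.Query;
    import com.hp.hpl.jena.rdql.QueryEngine;
    import com.hp.hpl.jena.rdql.QueryResults;
    import com.hp.hpl.jena.rdql.ResultBinding;

    public class FetaQuerySketch {
        public static void main(String[] args) {
            // Load the merged graph from a (hypothetical) RDF file.
            Model model = ModelFactory.createDefaultModel();
            model.read("file:feta-merged.rdf");

            // Canned query: operations whose input has a given semantic type.
            String rdql =
                "SELECT ?op " +
                "WHERE (?op, <f:hasInput>, ?param), " +
                "      (?param, <f:semanticType>, <f:protein_sequence>) " +
                "USING f FOR <http://example.org/feta#>";

            Query query = new Query(rdql);
            query.setSource(model);
            QueryResults results = new QueryEngine(query).exec();
            while (results.hasNext()) {
                ResultBinding binding = (ResultBinding) results.next();
                System.out.println(binding.get("op"));
            }
            results.close();
        }
    }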

SHIM breakout (Jim leads discussion)

    • SHIM (need acronym)
      • semantically compatible, syntactically incompatible services
      • uniprot database (uniprot_record) -> parser and filter shim -> blastp analysis (protein_sequence)
      • working definition: a software component whose main purpose is to syntactically match otherwise incompatible resources; it takes some input, performs some task, and produces an output; depending on usage, a shim can be semantically neutral ...
      • in myGrid, shims basically do type manipulations (mapping between abstract and concrete types), e.g., embl, genbank, and fasta are concrete types; dna_sequence is an abstract type
      • examples:
        • parser / filter
        • de-referencer
        • syntax translator
        • mapper
        • iterator
      • dereferencer
        • service a (genbank id) -> dereferencer -> service b (genbank record)
        • retrieves information from a URL
      • syntax translator
        • service a (dna seq; bsml) -> syntax translator -> service b (dna seq; agave)
      • mapper
        • service a (genbank id) -> mapper -> service b (embl id)
      • iterator
        • service a (collection of x) -> iterator -> service b (a single x)
      • seven steps to shim "nirvana":
        • recognize that 2 services are not compatible (syntactically, possibly semantically)
        • recognize the degree of mismatch
          • everything connected to everything
        • identify what type of shim(s) is/are needed
        • find or manufacture the shim
        • advise the user on the "semantic safety" of the shim
          • not clear what this means ...
        • invoke the shim
        • record provenance
      • my (Shawn's) proposal: a shim is an actor/service whose input semantic type is the same as or more general than its output semantic type (see the sketch below)
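
A toy rendering of this proposal (mine, not myGrid's code): a component counts as a shim if its declared input semantic type subsumes, i.e., is the same as or an ancestor of, its output semantic type. The type hierarchy uses the abstract/concrete types mentioned above, wired up by hand for illustration.

    // Toy subsumption check for the shim proposal above (illustrative only).
    import java.util.Map;

    public class ShimCheckSketch {
        // child -> parent in a toy concept hierarchy: embl, genbank, and
        // fasta are concrete types under the abstract type dna_sequence.
        static final Map<String, String> PARENT = Map.of(
            "embl", "dna_sequence",
            "genbank", "dna_sequence",
            "fasta", "dna_sequence");

        // True if 'general' is the same as or an ancestor of 'specific'.
        static boolean subsumes(String general, String specific) {
            for (String t = specific; t != null; t = PARENT.get(t))
                if (t.equals(general)) return true;
            return false;
        }

        // Shim test under the proposal: input type subsumes output type.
        static boolean isShim(String inputType, String outputType) {
            return subsumes(inputType, outputType);
        }

        public static void main(String[] args) {
            System.out.println(isShim("dna_sequence", "fasta"));  // true
            System.out.println(isShim("fasta", "dna_sequence"));  // false
        }
    }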

Workflow management and AI Planning (Jim Blythe)

  • Motivation
    • workflows in grid-using communities
    • challenges in supporting workflow management
  • research on workflow planning at USC/ISI
    • using ai techniques in Pegasus to generate executable grid workflows
  • using metadata descriptions as a first step, to get away from the file encodings of VDL and Pegasus
  • an operator is generally specified in an (if preconditions then add <stuff>) form, in Lisp/Scheme syntax (see the sketch after this list)
    • example: user can say: I want the results of a pulsar search at this time and location
  • the operator definitions are created by hand ... they have begun looking at how to construct them automatically
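
A rough Java rendering of the operator form just described (the talk gives it in Lisp/Scheme syntax). The state representation and the pulsar-search operator are invented for illustration; they are not the actual Pegasus operator definitions.

    // Hedged sketch: a STRIPS-like "(if preconditions then add <stuff>)" operator.
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class OperatorSketch {
        record Operator(String name, List<String> preconditions, List<String> addEffects) {}

        // If every precondition holds in the state, add the effects;
        // otherwise leave the state unchanged.
        static Set<String> apply(Operator op, Set<String> state) {
            if (!state.containsAll(op.preconditions())) return state;
            Set<String> next = new HashSet<>(state);
            next.addAll(op.addEffects());
            return next;
        }

        public static void main(String[] args) {
            // Hypothetical operator: a pulsar search needs staged data and a
            // host, and produces search results.
            Operator search = new Operator("pulsar-search",
                List.of("data-staged", "host-available"),
                List.of("search-results-exist"));
            Set<String> state = new HashSet<>(List.of("data-staged", "host-available"));
            System.out.println(apply(search, state));  // includes search-results-exist
        }
    }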

Access Grid Meeting

  • The information model
    • Organization of people, projects, experiments, and so on
    • Operations, ... (Pinar)
    • every data item can be annotated with various type information ... some slides
    • mime types
    • primary objective is to model e-Science processes, not the domain -- capturing the process provides added value: it facilitates contextualization and data-model contracts between components, and lets you visualize the integrated result object (the result of a workflow), ...
    • data fusion/integration not guided by this model
  • The aim
    • providing more direct support for the implementation of e-Science processes by:
      • increasing the synergy between components
      • facilitating data-model contracts between myGrid components
      • defining a coherent myGrid architecture
  • Some benefits:
    • automatically capturing provenance and context information that is relevant to the interpretation and sharing of the results of the e-science experiments
    • facilitating personalization and collaboration
  • Implementation
    • a database with a web service interface ... as canned queries
    • generic interface, i.e., SQL queries
    • performance penalty -- overhead, access calls, etc.
  • Questions
    • Does the model support "synthetic" versus "raw/natural" data?
    • What about the set-up and calibration of tools?
    • Also, predicted data versus experimentally observed?
    • The model is based on CCRC model
    • There are also a lot of standards that should be incorporated, so need some kind of extensibility
    • There needs to be place-holders for these within the information model
    • Related issue is where the results should be stored
    • three stores: one is the third-party databases (e.g., the ArrayExpress gene expression database ...) with links back
    • this is encompassed by the MIR -- the myGrid Information Repository; like a notebook
  • First thing done with information model
    • Workbench: MIR browser, metadata browser, WF model editor/explorer, feta search gui
    • Taverna execution environment: freefluo, and various plug-ins for MIR, Metadata Storage, and Feta
    • MIR external
    • Interestingly, the information model is "viewed" through a tree browser
  • The Mediator
    • Application oriented
      • directly supports the e-Scientist by:
        • providing pre-configured e-Science process templates (i.e., system-level workflows)
        • helping capturing and maintaining context information that is relevant to the interpretation and sharing of the results of the e-science experiments
        • facilitating personalization and collaboration
    • middleware-oriented
      • contributes to the synergy between mygrid services by
        • acting as a sink for e-Science events initiated by myGrid components
        • interpreting the intercepted events and triggering interactions w/ other related components entailed by the semantics of those events
        • compensating for possible impedance mismatches with other services, both in terms of data types and interaction protocols
          • not really an issue -- won't do much here -- but might be some other components that want to participate, and would need to have this service
        • inspired, etc., by WSMF, WSMO, WSMX, WSML, ... -- DERI web services; Dieter Fensel, et al.
  • Supporting the e-Scientist
    • recurring use-cases can be captured
    • find workflows use-case
    • etc.
  • mediating between services
    • fully service based approach
      • the whole myGrid as a service
      • all communication done through web services (the mediator acts as the front door / gateway)
    • the name mediator is taken from the Gang of Four pattern of the same name (a sketch follows this list)
    • internals
      • mediation layer: action decision logic, event handlers, etc.
      • interface aggregation layer: request router
      • component access layer: MIR proxy, enactor proxy, registry proxy, MDS store proxy, DQP proxy, etc.
  • all of these documents are under the MIR portion of the wiki
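
A small sketch of the mediator layering above, in the spirit of the Gang of Four pattern the speaker names. The proxy interfaces, event names, and routing logic are placeholders, not the actual myGrid APIs.

    // Illustrative GoF-style mediator: event sink + routing to component proxies.
    public class MediatorSketch {
        // Component access layer: proxies to myGrid-style services (invented).
        interface MirProxy { void store(String lsid, String data); }
        interface EnactorProxy { void recordProvenance(String event); }

        // Mediation layer: sink for e-Science events; triggers the
        // interactions those events entail in other components.
        static class Mediator {
            private final MirProxy mir;
            private final EnactorProxy enactor;
            Mediator(MirProxy mir, EnactorProxy enactor) {
                this.mir = mir;
                this.enactor = enactor;
            }
            // Interface aggregation layer: one entry point routing requests.
            void onEvent(String event, String lsid, String payload) {
                if (event.equals("workflow-result-produced")) {
                    mir.store(lsid, payload);         // persist the result
                    enactor.recordProvenance(event);  // and record provenance
                }
                // ... handlers for other event types
            }
        }

        public static void main(String[] args) {
            Mediator m = new Mediator(
                (lsid, data) -> System.out.println("MIR store: " + lsid),
                event -> System.out.println("provenance: " + event));
            m.onEvent("workflow-result-produced", "urn:lsid:example.org:res:1", "...");
        }
    }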

Grid Workflow Case Studies / Use Cases

  • Peter Li: Large data set transfer use case from Graves' disease scenario
    • Graves' disease: autoimmune thyroid disease; lymphocytes attack thyroid gland cells causing hyperthyroidism; symptoms: increased pulse rate, sweating, heat intolerance, goitre, exophthalmos; inherited
    • In silico experiments: microarray data analysis, gene annotation pipeline, design of genotype assays for SNP variations
    • large data set transfer problem: ~9 data sets x 60 MB of GD array data; the affyR service integrates the data sets, ...
    • demo

  • Tom Oinn
    • service a passes data to service b
    • service b may start before service a finishes execution
    • need a comprehensive solution
    • LSIDs won't work
    • to get the data out, you have to use SOAP calls, and you get all the data at once, or none
    • the only way is if the LSID points to a stream -- otherwise the LSID architecture won't support it
    • Inferno ... Reading e-Science Centre (?) in the UK ... Inferno e-service
    • take any command line tool, wrap it up in this mechanism, and it deals with the reference passing automatically
    • inputs are URLs; the protocol is called Styx
    • basically, a naming convention that lets you denote streams (see the sketch after this list)
    • http://www.vitanuova.com/solutions/grid/grid.html
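
A minimal sketch of the reference-passing idea: instead of shipping data by value through SOAP, service a hands service b a URL, and b consumes the bytes as a stream, potentially while a is still producing them. The URL is hypothetical, and plain HTTP stands in here for the Styx protocol.

    // Hedged sketch: consume a result referenced by URL as a stream.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class StreamRefSketch {
        public static void main(String[] args) throws Exception {
            URL ref = new URL("http://example.org/run42/output");  // hypothetical
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(ref.openStream()))) {
                String line;
                // Process records as they arrive, without waiting for the
                // producer to finish.
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }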

  • Chris Wroe
    • use case from integrative biology
    • oxford and new zealand
    • from dna to whole organism modeling
      • cardiac vulnerability to acute ischemia: step 1: import mechanical model from Auckland data
      • get mechanical model of heart
        • take slice, place in perfusion bath, top and bottom surfaces isolated, site pacing ...
        • finite element approach
      • properties of fusion bath
      • protocol for what they do in the experiment: pace at 250 ms, apply shock, repeat with different intervals, etc.
      • each simulation takes a week
      • perturb initial conditions; stage 1 hypoxia (lack of oxygen), stage 2 hypoxia
      • data analysis: construct activation map, measure action potential duration, threshold for fibrillation; a file is produced every 1 ms; big
      • perl/shell scripts for all of this
    • they want to "e-ify" this, i.e., turn it into an e-Science workflow.
      • simulation step
      • long running, no other examples of this in myGrid
      • finite element bidomain solver: mechanical model, electrophysiology model, simulation protocol, initial conditions, parameters -> a 7.3 MB result file produced every 1 ms
      • monitor, stop, checkpoint, discard, restart with different parameters
      • a mesh problem ... so more computation and you still run it for a week
    • http://www.geodise.org (Simon Cox)

  • Jeffrey Grethe
    • BIRN workflow requirements (Biomedical informatics research network)
    • enable new understanding of neurological disease by integrating data across multiple scales from macroscopic brain function etc.
    • telescience portal enabled tomography workflow
      • composed of the sequence of steps required to acquire, process, visualize, and extract useful information from a 3D volume
    • morphometry workflow
      • structural analysis of data
      • large amounts of pre-processing
        • normalization, calibration, etc., to get data in a form to be analyzed
        • most methods in the pre-process stream can lead to errors
        • requires manual editing, etc., and have a set of checkpoints, where a user interacts
      • moving towards high-performance computing resources
    • parameter sweeps
      • taking birn-miriad numbers and comparing to what scientist has done ...
      • a researcher traced out different areas of the brain; need to compare with the fully automated approach
      • looking for correct parameters to use for the imaging
      • get as close as you can to the actual, to what the trained researcher can do: correlate minute changes in actual brain structure so you can tell a patient "we should put you on some drug regimen because you have Alzheimer's" -- i.e., move to a preventive course of action
      • has picture/slide of the workflow
      • baseline preprocessing can take upwards of a day

  • Karan Vahi
    • Abstract Workflow (DAX): expressed in terms of logical entities; specifies all logical files required to generate the desired data product from scratch; dependencies between the jobs; analogous to a build-style DAG
      • format for specifying the abstract workflow; identifies the recipe for creation
      • XML syntax / format
    • Concrete workflow ... (a toy sketch of the logical-to-physical binding follows this list)
    • alternate replica mechanisms
      • how to manage replicas of the same service?
        • haven't been looking at that, because of the mandate of the Pegasus ...
        • all jobs run independently, wrapped around java executables, shell scripts, etc.
        • leveraging condor, and condor-g, which don't go further with web-services, etc.
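
A toy sketch of the abstract-to-concrete step described above: an abstract job names only logical files, and concretization binds each logical file to a physical replica (here via a simple in-memory map standing in for a replica catalog). All names are invented; Pegasus itself does far more (site selection, adding transfer jobs, etc.).

    // Illustrative logical-to-physical binding for a build-style DAG node.
    import java.util.List;
    import java.util.Map;

    public class ConcretizeSketch {
        record AbstractJob(String transformation, List<String> logicalInputs) {}

        static String concretize(AbstractJob job, Map<String, String> replicaCatalog) {
            StringBuilder cmd = new StringBuilder(job.transformation());
            for (String lfn : job.logicalInputs()) {
                String pfn = replicaCatalog.get(lfn);  // logical -> physical
                if (pfn == null)
                    throw new IllegalStateException("no replica for " + lfn);
                cmd.append(' ').append(pfn);
            }
            return cmd.toString();
        }

        public static void main(String[] args) {
            AbstractJob job = new AbstractJob("blastp", List.of("seq.fasta"));
            Map<String, String> rc = Map.of("seq.fasta", "gsiftp://host/data/seq.fasta");
            System.out.println(concretize(job, rc));  // blastp gsiftp://host/data/seq.fasta
        }
    }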

  • Adam Birnbaum
    • Resurgence project
    • Encyclopedia of Life (EOL): automated annotations for all of the known protein sequences; slurp 1.5 million things out of a db, and push through seven to ten programs
    • both want some kind of simple visual programming screen: see nothing but icons relevant to their field, set up the workflow, say go, and run it 1.5 million times; domain-specific tools/icons, and say go repeatedly
    • need constraints among icons: outputs and inputs
    • template workflows, default settings, etc.
    • check validity of resulting configurations / workflows
    • what is meant here by high throughput: thousands of tasks per month (not FLOPS), e.g., 1.5 million jobs over a 6-month period
    • scientist wants to run many times, varying the inputs
    • APST and Nimrod (tested with these)
    • pegasus in same category

  • Brainstorm
    • data-intensive
      • 3rd party transfer
      • handling handles
      • streaming
      • SRB
      • where does data intensive transport fit?
      • separation of concerns ... who does what?
      • is there a one-size-fits-all framework?
      • wf-life cycle
        • construction / design
        • instantiation / parameter / data binding
        • execution ~ streaming (provenance)
    • compute-intensive
      • streaming
      • wf exception handling
      • job scheduling: where does it fit? (to hide or not to hide)

  • Non-Breakout Breakout on Registry Services, etc.
    • the myGrid and BioMoby "data models" are similar enough to plug together
    • different ontologies: service, bioinformatics, molecular biology
    • data model for services, etc.
    • lots of discussion ...

Provenance

  • Verification of experiment data; recipes for experiment designs; explanation for the impact of changes; ownership; performance; data quality
  • The "Provenance Pyramid" -- Knowledge level; Organisation Level; Data Level; Process Level
    • Organisation Level at the bottom left of the pyramid, the same size as the right size, which contains the Data Level on top of the Process Level
  • myGrid approach
    • LSIDs: to identify objects
    • myGrid information model and mIR: to store lower levels of the pyramid
    • sem web technologies (RDF, Ontologies): to store knowledge provenance
    • Taverna workflow workbench and plugins: ensure automated recording
  • LSIDs
    • each bioinf database on the web has:
      • diff. policies for assigning and maintaining identifiers, dealing with versioning, etc.
      • diff. mechanisms ...
    • OMG standard
      • urn:lsid:AuthorityID:NamespaceID:ObjectID:RevisionID (a parsing sketch appears at the end of this section)
      • urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2
      • LSID designator -- indicates the item being identified is a life science-specific resource
      • authority identifier -- internet domain owned by the org that assigns an LSID to the resource
      • namespace ID -- names the namespace within the authority (e.g., GenBank)
      • etc.
    • how is data retrieved with LSIDs?
      • application -> 1. "get me the info for this ID" --> LSID client
      • 2. "where can I get data and metadata for this ID?"
        • returns a WSDL doc giving information on where to get the data
    • Authority commitments
      • data returned for a given lsid must always be the same
      • must always maintain an authority at e.g. pdb.org that can point to data and metadata resolvers
    • lsid components
      • IBM built client and server implementations in Perl, Java, C++ ...
      • fairly straightforward to wrap an existing db as a source of data or metadata
      • the client side is also straightforward
      • LSID LaunchPad ... within Internet Explorer (type in your LSID; it returns metadata, etc.)
    • Use of LSIDs within myGrid
      • needed an ID for things such as workflows, experiments, new data results, etc.
      • everything is identified with LSIDs
      • built and deployed: an LSID assigning server; an LSID authority (http://www.mygrid.org.uk); a metadata resolver; a data resolver (all based on IBM's open source implementation)
    • experiences
      • advantages: the URN makes it easy to integrate with semantic web tools; more explicit than a URL: there is an explicit protocol for separating metadata from data
      • disadvantages: you have to decide what is data and what is metadata, because they carry different commitments (versioning); up to Jul 04, implementations were chasing revisions in the maturing standard ... now seems stable as standardisation is more complete; to be successful across the community, it will require widespread adoption by providers such as GenBank, UniProt, etc.
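
As promised above, a minimal sketch of pulling an LSID URN apart into the components listed under the OMG standard bullet (authority, namespace, object, optional revision). Illustrative only; real clients, such as IBM's, resolve an LSID to data and metadata via the authority's WSDL rather than by string splitting.

    // Hedged sketch: parse urn:lsid:AuthorityID:NamespaceID:ObjectID[:RevisionID].
    public class LsidParseSketch {
        record Lsid(String authority, String namespace, String object, String revision) {}

        static Lsid parse(String urn) {
            String[] parts = urn.split(":");
            if (parts.length < 5 || !parts[0].equalsIgnoreCase("urn")
                    || !parts[1].equalsIgnoreCase("lsid"))
                throw new IllegalArgumentException("not an LSID: " + urn);
            return new Lsid(parts[2], parts[3], parts[4],
                            parts.length > 5 ? parts[5] : null);
        }

        public static void main(String[] args) {
            Lsid id = parse("urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2");
            System.out.println(id.authority() + " / " + id.namespace()
                + " / " + id.object() + " rev " + id.revision());
        }
    }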


