E Science Link Up Oct 04

Meeting notes and updates on the e-Science Link-Up Meeting

(The following notes taken by S. Bowers)

Semantic Registration in Taverna (Pinar Alper)

Feta Architecture

Ontologist (Chris Wroe) -> Ontology Editor -> DL Reasoner -> Classification (in RDF(S)) -> obtain classification -> Feta, PeDRo
Store WSDL Descriptions (in special XML schema), then annotate, and give to Feta
The ontology, classified, and the annotated wsdl are merged into a single graph
Taverna Workflow Workbench issues "semantic discovery via conceptual descriptions" against feta ... a set of canned queries

Feta Engine

Feta Loader uses myGrid service onto and domain onto
use Jena, e.g., to do RDQL queries, etc.
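
A minimal sketch of what one canned "semantic discovery" query over the merged graph could look like. It uses plain Python triples rather than Jena or RDQL, and every identifier in it (the predicates subClassOf and performsTask, the task classes, the op: names) is a made-up placeholder, not something from the myGrid ontology.

```python
# Illustrative only: the merged graph as (subject, predicate, object) triples.
# All identifiers are hypothetical placeholders.
TRIPLES = {
    ("pairwise_aligning", "subClassOf", "aligning"),
    ("blast_aligning", "subClassOf", "pairwise_aligning"),
    ("retrieving", "subClassOf", "task"),
    ("aligning", "subClassOf", "task"),
    ("op:blastp", "performsTask", "blast_aligning"),
    ("op:getRecord", "performsTask", "retrieving"),
}


def subclasses_of(concept, triples):
    """Return the concept plus every class that is transitively a subclass of it."""
    found = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p == "subClassOf" and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def find_operations_by_task(task, triples):
    """Canned query: operations whose task is the given concept or a subclass of it."""
    wanted = subclasses_of(task, triples)
    return sorted(s for s, p, o in triples if p == "performsTask" and o in wanted)


print(find_operations_by_task("aligning", TRIPLES))
```

Because the classification is computed up front by the reasoner, the query itself only has to walk the stored hierarchy.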

Feta Data Model

Operation (name, description, task -- from a bio service ontology, method -- particular type of algo/codes also from onto but not used much, resource, application, hasInput : Parameter, hasOutput : Parameter)
Parameter (name, desc, semantic type, format, transport type, collection type, collection format)
Service (name, description, author, organizations)
WSDL based operation is a subclass of Operation
WSDL based Web Service is a subclass of Service (hasOperation : WSDL based operation)
workflow, bioMoby service, soaplab service, local java code subclasses of Service and Operation
seqHound service is an operation

Each parameter can have a semantic type, stating that the parameter is an instance of a class, and the operation can have a "task" which is also a "semantic type" and "method"
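
One way to picture the data model above is as a handful of Python dataclasses. This is only a sketch: the attribute names follow the notes, but the Python types, the defaults, and the example values (ExampleBlast, blast_report) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Parameter:
    name: str
    description: str = ""
    semantic_type: Optional[str] = None  # ontology class the value is an instance of
    format: Optional[str] = None
    transport_type: Optional[str] = None
    collection_type: Optional[str] = None
    collection_format: Optional[str] = None


@dataclass
class Operation:
    name: str
    description: str = ""
    task: Optional[str] = None         # from the bio service ontology
    method: Optional[str] = None       # also from the ontology, rarely used
    resource: Optional[str] = None
    application: Optional[str] = None
    has_input: List[Parameter] = field(default_factory=list)
    has_output: List[Parameter] = field(default_factory=list)


@dataclass
class Service:
    name: str
    description: str = ""
    author: str = ""
    organizations: List[str] = field(default_factory=list)


@dataclass
class WSDLBasedOperation(Operation):
    pass


@dataclass
class WSDLBasedWebService(Service):
    has_operation: List[WSDLBasedOperation] = field(default_factory=list)


# Hypothetical example values.
blastp = WSDLBasedOperation(
    name="blastp",
    task="aligning",
    has_input=[Parameter(name="query", semantic_type="protein_sequence")],
    has_output=[Parameter(name="report", semantic_type="blast_report")],
)
service = WSDLBasedWebService(name="ExampleBlast", has_operation=[blastp])
```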

SHIM breakout (Jim leads discussion)

SHIM (need acronym)

semantically compatible, syntactically incompatible services
uniprot database (uniprot_record) -> parser and filter shim -> blastp analysis (protein_sequence)
working definition: a software component whose main purpose is to syntactically match otherwise incompatible resources. it takes some input, performs some task and produces an output. depending on usage, a shim can be semantically neutral ...
in myGrid, basically doing type manipulations (map between abstract types to concrete types), e.g., embl, genbank, fasta concrete types, dna_sequence is an abstract type
examples:

parser / filter
de-referencer
syntax translator
mapper
iterator

dereferencer

service a (genbank id) -> dereferencer -> service b (genbank record)
retrieves information from a URL

syntax translator

service a (dna seq; bsml) -> syntax translator -> service b (dna seq; agave)

mapper

service a (genbank id) -> mapper -> service b (embl id)

iterator

service a (collection of x) -> iterator -> service b (a single x)
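
The shim kinds listed above can be sketched as small Python functions. The lookup tables, identifiers, and record format here are invented for the example; a real dereferencer or mapper would call out to a database or service.

```python
# Hypothetical lookup tables standing in for remote databases.
GENBANK_RECORDS = {"GB0001": ">GB0001 example\nACGT"}
GENBANK_TO_EMBL = {"GB0001": "EMBL0001"}


def dereferencer(genbank_id):
    """genbank id -> genbank record"""
    return GENBANK_RECORDS[genbank_id]


def mapper(genbank_id):
    """genbank id -> embl id"""
    return GENBANK_TO_EMBL[genbank_id]


def parser_filter(record):
    """record -> sequence: drop header lines, keep the sequence text."""
    lines = record.splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">"))


def iterator(service_b, collection):
    """Apply a service that takes a single x to each item in a collection of x."""
    return [service_b(x) for x in collection]


def sequence_length(sequence):
    """Stand-in for 'service b'."""
    return len(sequence)


ids = ["GB0001"]
records = iterator(dereferencer, ids)
sequences = iterator(parser_filter, records)
print(iterator(sequence_length, sequences))
```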

seven steps to shim "nirvana"
recognize 2 services are not compatible (syntactically, possibly semantically)
recognize the degree of mismatch

everything connected to everything

identify what type of shim(s) is/are needed
find or manufacture the shim
advise user on "semantic safety" of the shim

not clear what this means ...

invoke the shim
record provenance
my (Shawn's) proposal: a shim is an actor/service whose input semantic type is the same or more general than the output semantic type
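
The proposed condition can be written as a simple check against an is-a hierarchy. A minimal sketch, assuming a single-parent hierarchy; the concept names and the hierarchy itself are hypothetical.

```python
# Hypothetical is-a hierarchy: child -> parent.
PARENT = {
    "dna_sequence": "sequence",
    "protein_sequence": "sequence",
    "sequence": "thing",
}


def ancestors_or_self(concept, parent):
    """Yield the concept, then its parent, grandparent, and so on."""
    while concept is not None:
        yield concept
        concept = parent.get(concept)


def satisfies_proposal(input_type, output_type, parent=PARENT):
    """True if input_type is the same as, or more general than, output_type."""
    return input_type in ancestors_or_self(output_type, parent)


print(satisfies_proposal("sequence", "dna_sequence"))
print(satisfies_proposal("dna_sequence", "protein_sequence"))
```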

Workflow management and AI Planning (Jim Blythe)

Motivation

workflows in grid-using communities
challenges in supporting workflow management

research on workflow planning at usc/isi

using ai techniques in Pegasus to generate executable grid workflows

using metadata descriptions as first step, to get away from the file encodings of VDL and Pegasus
an operator is specified generally as an (if preconditions then add <stuff>) form, in Lisp/Scheme syntax

example: user can say: I want the results of a pulsar search at this time and location

the generation of the operation defs is done by hand ... began looking at how to construct them automatically
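
To make the "if preconditions then add" operator form concrete, here is a toy forward-chaining loop in Python. It is not how Pegasus plans; the operator names and facts are invented, and the loop is not goal-directed (it applies any applicable operator until the goal holds).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    name: str
    preconditions: frozenset
    adds: frozenset


def plan(initial, goal, operators):
    """Naive forward chaining over sets of facts.

    Returns the list of operator names applied, or None if the goal is unreachable.
    """
    state = set(initial)
    steps = []
    progress = True
    while not goal <= state and progress:
        progress = False
        for op in operators:
            # Applicable if all preconditions hold and it would add something new.
            if op.preconditions <= state and not op.adds <= state:
                state |= op.adds
                steps.append(op.name)
                progress = True
                break
    return steps if goal <= state else None


# Hypothetical operators and facts.
OPERATORS = [
    Operator("pulsar-search", frozenset({"have:frequency-data"}),
             frozenset({"have:pulsar-search-results"})),
    Operator("extract-band", frozenset({"have:raw-data"}),
             frozenset({"have:frequency-data"})),
]

print(plan({"have:raw-data"}, {"have:pulsar-search-results"}, OPERATORS))
```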

Access Grid Meeting

The information model

Organization of people, projects, experiments, and so on
Operations, ... (Pinar)
every data item can be annotated with various type information ... some slides
mime types
primary objective is to model escience processes, not the domain -- capturing the process provides added value: facilitates contextualization, data-model contracts between components, visualize integrated result object (as a result of a workflow), ...
data fusion/integration not guided by this model

The aim

providing more direct support for the implementation of e-Science processes by:

increasing the synergy between components
facilitating data-model contracts between myGrid components
defining a coherent myGrid architecture

Some benefits:

automatically capturing provenance and context information that is relevant to the interpretation and sharing of the results of the e-science experiments
facilitating personalization and collaboration

Implementation

a database with a web service interface ... as canned queries
generic interface, i.e., sql query
performance penalty -- overhead, access calls, etc.

Questions

Does the model support "synthetic" versus "raw/natural" data?
What about the set-up and calibration of tools?
Also, predicted data versus experimentally observed
The model is based on the CCLRC model
There are also a lot of standards that should be incorporated, so need some kind of extensibility
There needs to be place-holders for these within the information model
Related issue is where the results should be stored
three stores: one is the third-party databases (e.g., arrayexpress gene expression database ...) and link back
this is encompassed by the MIR -- myGrid Info. Repository; like a notebook

First thing done with information model

Workbench: MIR browser, metadata browser, WF model editor/explorer, feta search gui
Taverna execution environment: freefluo, and various plug-ins for MIR, Metadata Storage, and Feta
MIR external
Interestingly, the information model is "viewed" through a tree browser

The Mediator

Application oriented

directly supports the e-Scientist by:

providing pre-configured e-Science process templates (i.e., system level workflows)
helping capture and maintain context information that is relevant to the interpretation and sharing of the results of the e-science experiments
facilitating personalization and collaboration

middleware-oriented

contributes to the synergy between mygrid services by

acting as a sink for e-Science events initiated by myGrid components
interpreting the intercepted events and triggering interactions w/ other related components entailed by the semantics of those events
compensating for possible impedance mismatches with other services both in terms of data types and interaction protocols

not really an issue -- won't do much here -- but might be some other components that want to participate, and would need to have this service

inspired, etc., by WSMF, WSMO, WSMX, WSML, ..., DERI web services -- Dieter Fensel, et al.

Supporting the e-Scientist

recurring use-cases can be captured
find workflows use-case
etc.

mediating between services

fully service based approach

the whole myGrid as a service
all communication done through web services (the mediator acts as the front door / gateway)

the name mediator taken from Gang of Four pattern with the same name
internals

mediation layer: action decision logic, event handlers, etc.
interface aggregation layer: request router
component access layer: mir proxy, enactor proxy, registry proxy, mds store proxy, dqpproxy, etc.
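
A rough sketch of how the three layers could fit together, in the spirit of the Mediator pattern. The class names, method names, and the event name are all assumptions; the notes do not describe the actual interfaces.

```python
class MirProxy:
    """Stand-in for the component access layer's MIR proxy."""

    def __init__(self):
        self.stored = []

    def store(self, item):
        self.stored.append(item)
        return len(self.stored)


class EnactorProxy:
    """Stand-in for the enactor proxy."""

    def run(self, workflow):
        return f"result-of-{workflow}"


class Mediator:
    def __init__(self, mir, enactor):
        self.mir = mir
        self.enactor = enactor
        # Mediation layer: event name -> handler (hypothetical event name).
        self.handlers = {"workflow_submitted": self.on_workflow_submitted}
        self.log = []

    def handle(self, event, payload):
        """Interface aggregation layer: route an incoming event to its handler."""
        self.log.append(event)
        handler = self.handlers.get(event)
        if handler is None:
            raise KeyError(f"no handler for event {event!r}")
        return handler(payload)

    def on_workflow_submitted(self, workflow):
        """Action decision logic: run the workflow, then store its result."""
        result = self.enactor.run(workflow)
        self.mir.store(result)
        return result


mediator = Mediator(MirProxy(), EnactorProxy())
print(mediator.handle("workflow_submitted", "wf1"))
```

The components never call each other directly; each only knows the mediator, which is what the "front door / gateway" remark above amounts to.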

all of these doc's are under the MIR portion of the Wiki

Grid Workflow Case Studies / Use Cases

Peter Li: Large data set transfer use case from Graves' disease scenario

Graves' disease: autoimmune thyroid disease; lymphocytes attack thyroid gland cells causing hyperthyroidism; symptoms: increased pulse rate, sweating, heat intolerance, goitre, exophthalmos; inherited
In silico experiments: microarray data analysis, gene annotation pipeline, design of genotype assays for SNP variations
large data set transfer problem: ~9 data sets x 60 mb of GD array data; affyR service integrates data sets, ...
demo

Tom Oinn

service a passes data to service b
service b may start before service a finished execution
need a comprehensive solution
lsid's won't work
to get the data out of it, you have to use soap calls, and you get all the data at once, or none
the only way is if the lsid points to a stream -- otherwise lsid arch. won't support it
Inferno ... Reading e-Science Centre (?) in the UK ... Inferno e-service
take any command line tool, wrap it up in this mechanism, deal with the reference passing, automatically
inputs are urls, protocol called styx
basically, a naming convention that lets you denote streams
http://www.vitanuova.com/solutions/grid/grid.html
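
The "service b may start before service a finished" requirement can be illustrated in-process with Python generators. This sketch says nothing about Styx or Inferno themselves; the service bodies and the tracing helper are invented for the example.

```python
def service_a(n):
    """Produce n chunks one at a time instead of returning them all at once."""
    for i in range(n):
        yield f"chunk-{i}"


def service_b(chunks):
    """Consume chunks as they arrive, before service_a has finished."""
    for chunk in chunks:
        yield chunk.upper()


events = []


def traced(name, stream):
    """Record the order in which items pass through each stage."""
    for item in stream:
        events.append((name, item))
        yield item


pipeline = traced("b", service_b(traced("a", service_a(2))))
results = list(pipeline)
print(events)
```

The recorded events alternate between the two services, which is the behaviour an all-at-once SOAP call cannot give.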

Chris Wroe

use case from integrative biology
oxford and new zealand
from dna to whole organism modeling

cardiac vulnerability to acute ischemia: step 1; import mechanical model from Auckland Data
get mechanical model of heart

take slice, place in perfusion bath, top and bottom surfaces isolated, site pacing ...
finite element approach

properties of perfusion bath
protocol for what they do in the experiment: pace at 250ms, apply shock, repeat with diff. intervals, etc.
each simulation takes a week
perturb initial conditions; stage 1 hypoxia (lack of oxygen), stage 2 hypoxia
data analysis: construct activation map, measure activation potential duration, threshold for fibrillation, file produced every 1ms, big
perl/shell scripts for all of this

want to e-iffy this.

simulation step
long running, no other examples of this in myGrid
finite element bidomain solver: mechanical model, electrophysio model, simulation protocol, initial conditions, parameters -> result file produced for every 1ms 7.3 mb
monitor, stop, checkpoint, discard, restart with different parameters
a mesh problem ... so more computation and you still run it for a week
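
A toy illustration of the stop / checkpoint / restart requirement for a long-running step. The state layout, the JSON checkpoint file, and the per-step update are all assumptions; a real bidomain solver would checkpoint far more state.

```python
import json
import os
import tempfile


def run_simulation(total_steps, checkpoint_path, stop_after=None):
    """Advance a toy simulation, saving state after every step.

    Resumes from checkpoint_path if that file already exists.
    """
    state = {"step": 0, "value": 0.0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            break  # simulate an operator stopping the run
        state["step"] += 1
        state["value"] += 0.5  # placeholder for the real computation
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)
        done_this_run += 1
    return state


path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
print(run_simulation(10, path, stop_after=4))  # stopped early
print(run_simulation(10, path))                # resumed from the checkpoint
```

Discarding a run is then just deleting the checkpoint file before restarting with different parameters.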

http://www.geodise.org Simon Cox

Jeffrey Grethe

BIRN workflow requirements (Biomedical informatics research network)
enable new understanding of neurological disease by integrating data across multiple scales from macroscopic brain function etc.
telescience portal enabled tomography workflow

composed of the sequence of steps required to acquire, process, visualize, and extract useful information from a 3D volume

morphometry workflow

structural analysis of data
large amounts of pre-processing

normalization, calibration, etc., to get data in a form to be analyzed
most methods in the pre-process stream can lead to errors
requires manual editing, etc., and have a set of checkpoints, where a user interacts

moving towards high-performance computing resources

parameter sweeps

taking birn-miriad numbers and comparing to what scientist has done ...
researcher traced out diff area of the brain, need to compare fully automated approach
looking for correct parameters to use for the imaging
get as close as you can to the actual result, i.e., to the trained researcher; then you can correlate minute changes in actual brain structure with telling a patient to start some drug regime because they have Alzheimer's -- or some other preventive course of action
has picture/slide of the workflow
baseline preprocessing can take upwards of a day
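
A parameter sweep of this kind boils down to trying each combination and keeping the one closest to the manual result. A minimal sketch; the parameter names, the toy "segmentation" function, and the reference value are invented.

```python
from itertools import product


def sweep(run, reference, grid):
    """Try every parameter combination; return the one closest to the reference."""
    best_params, best_error = None, float("inf")
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        error = abs(run(**params) - reference)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


def toy_segmentation_volume(threshold, smoothing):
    """Placeholder for the automated pipeline; returns a made-up volume."""
    return 100.0 * threshold + smoothing


best_params, best_error = sweep(
    toy_segmentation_volume,
    reference=52.0,  # stands in for the trained researcher's manual tracing
    grid={"threshold": [0.4, 0.5, 0.6], "smoothing": [0, 2, 4]},
)
print(best_params, best_error)
```

With baseline preprocessing taking up to a day per run, each grid point is expensive, hence the move towards high-performance computing resources.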

Karan Vahi

Abstract Workflow (DAX): expressed in terms of logical entities; specifies all logical files required to gen. the desired data prod. from scratch; dependencies between the jobs; analogous to build style dag

format for specifying the abstract workflow, id's the recipe for creation
xml syntax / format
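
The "build style dag" idea can be sketched by deriving job dependencies from logical file names and sorting them. This is not the DAX format; the job and file names are hypothetical, and graphlib needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs, each described only in terms of logical file names.
JOBS = {
    "extract": {"inputs": {"raw.dat"}, "outputs": {"band.dat"}},
    "search": {"inputs": {"band.dat"}, "outputs": {"candidates.dat"}},
    "report": {"inputs": {"candidates.dat", "band.dat"}, "outputs": {"report.txt"}},
}


def job_dependencies(jobs):
    """Map each job to the jobs that produce one of its input files."""
    producer = {f: name for name, job in jobs.items() for f in job["outputs"]}
    return {
        name: {producer[f] for f in job["inputs"] if f in producer}
        for name, job in jobs.items()
    }


def execution_order(jobs):
    """Return one valid order in which the jobs can run."""
    return list(TopologicalSorter(job_dependencies(jobs)).static_order())


print(execution_order(JOBS))
```

Files with no producer (raw.dat here) are the ones that must already exist somewhere, which is where the concrete workflow and replica lookup come in.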

Concrete workflow ...
alternate replica mechanisms

how to manage replicas of the same service?

haven't been looking at that, because of the mandate of the Pegasus ...
all jobs run independently, wrapped around java executables, shell scripts, etc.
leveraging condor, and condor-g, which don't go further with web-services, etc.

Adam Birnbaum

Resurgence project
encyclopedia of life (eol) : automated annotations for all of the known protein sequences; slurp 1.5 million things out of a db, and push through seven to ten programs
both want to have some kind of simple visual prog. screen, see nothing but icons relevant to their field, setup the workflow, say go, and do it 1.5 mill times / domain specific tools/icons, and say go repeatedly
need constraints among icons: outputs and inputs
template workflows, default settings, etc.
check validity of resulting configurations / workflows
what is meant here by high throughput, thousands of tasks per month (not flops), 1.5 mill jobs over 6 month period, e.g.
scientist wants to run many times, varying the inputs
apst and nimrod (tested with these)
pegasus in same category

Brainstorm

data-intensive

3rd party transfer
handling handles
streaming
SRB
where does data intensive transport fit?
separation of concerns ... who does what?
is there a one-size-fits-all framework?
wf-life cycle

construction / design
instantiation / parameter / data binding
execution ~ streaming (provenance)

compute-intensive

streaming
wf exception handling
job scheduling: where does it fit? (to hide or not to hide)

Non-Breakout Breakout on Registry Services, etc.

mygrid and biomoby "data models" are similar enough to plug together
different ontologies: service, bioinformatics, molecular biology
data model for services, etc.
lots of discussion ...

Provenance

Verification of experiment data; recipes for experiment designs; explanation for the impact of changes; ownership; performance; data quality
The "Provenance Pyramid" -- Knowledge level; Organisation Level; Data Level; Process Level

Organisation Level at the bottom left of the pyramid, the same size as the right side, which contains the Data Level on top of the Process Level

myGrid approach

LSIDs: to identify objects
myGrid information model and mIR: to store lower levels of the pyramid
sem web technologies (RDF, Ontologies): to store knowledge provenance
Taverna workflow workbench and plugins: ensure automated recording

LSIDs

each bioinf database on the web has:

diff. policies for assigning and maintaining identifiers, dealing with versioning, etc.
diff. mechanisms ...

OMG standard

urn:lsid:AuthorityID:NamespaceID:ObjectID:RevisionID
urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2
lsid designator -- the item being id'd is a life science-specific resource
authority identifier -- internet domain owned by org that assigns an LSID to a resource
namespace id -- name of the resource
etc.
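
Splitting an LSID into the parts named above is straightforward. A minimal sketch: the example identifier is made up, and the parser treats the revision as optional and does no validation beyond counting the colon-separated parts.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split urn:lsid:AuthorityID:NamespaceID:ObjectID[:RevisionID] into its parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    revision = parts[5] if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, revision)


# Hypothetical identifier, for illustration only.
print(parse_lsid("urn:lsid:example.org:demo:abc123:1"))
```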

how is data retrieved with LSIDs?

application -> 1. get me info for id --> LSID client
2. where can i get data and metadata for ID

returns wsdl doc giving information on where to get the data

Authority commitments

data returned for a given lsid must always be the same
must always maintain an authority at e.g. pdb.org that can point to data and metadata resolvers

lsid components

IBM built client and server implementations in Perl, Java, C++ ...
fairly straightforward to wrap an existing db as a source of data or metadata
client also straightforward
LSID launchpad ... within internet explorer (type in your lsid, returns metadata, etc)

Use of LSIDs within myGrid

needed an id for things such as workflows, experiments, new data results, etc.
everything id'd with LSIDs
built and deployed: LSID assigning server; lsid authority (http://www.mygrid.org.uk); metadata resolver; data resolver; (all based on IBM's open source implementation)

experiences

advantages: urn makes it easy to integrate with semantic web tools; more explicit than a url: there is an explicit protocol for separating metadata from data
disadvantages: have to decide what is data and metadata because they have different commitments (versioning); up to Jul 04, implementations chasing revisions in the standard maturing ... now seems stable as standardisation more complete; to be successful across the community, it will require widespread adoption by providers such as GenBank, UniProt, etc.

Provenance storage

architecture

1. data sent/received from services; 2. new lsids assigned to data; 3. data / metadata stored; ...

metadata store: Jena RDF store; pushes RDF to LSID metadata resolver
mIR: an object-relational database; pushes XStream-RDF to LSID metadata resolver, and objects to LSID data resolver
use jena to store the rdf data
lsid resolver outputs xml and text-plain

scientific annotation

the goal of this experiment was ...
the results prove the hypothesis that...

need a schema for these annotations
tools to add the annotations

Tracy Craddock

Visualization in myGrid

Williams workflow B ...

large amounts of data (or datatypes)
data implicitly linked within itself
data is implicitly linked outside of itself
genomic sequence is central co-ordinating point, but there are a number of different co-ordinate systems
some "biological", some artifacts of the workflow

what's the problem

we don't have a domain model
we need a model for visualization
but, domain models are hard
it's not clear that the domain model should be in the middleware

what have we done!?

bioinformatics pm (pre myGrid)
one big distributed data heterogeneity and integration problem
still a big distributed data heterogeneity and integration problem

how do we solve the problem

take the data, use something (perl or an MSc student) to map the data into a (partial) data model
visualize this ...
but what if the workflow changes?

second solution

large quantities of data are already available with rich mark up in a visualizable form
this is unparsable, so also get the flat file rep
start to build visualization information into the workflow using beanshell
linked data from output -- domain model = scripts that hack these things together

summary

domain models are hard
workflows can obfuscate the model
visualization requires one
we can build some knowledge of a domain model into the workflow and steal the rest.
is there a better way?

Breakout: myGrid "Data Model" (schema) for capturing Metadata and Semantics

common.xsd

service description

serviceName
organisation

UDDI fields, e.g., organization name, etc.

author
locationURL
interfaceWSDL
serviceDescriptionText
operations (units of functionality)

service operation

operation name
portName
operationDescriptionText
operationInputs

parameter

parameterName
messageName
parameterDescription
defaultValue
semanticType
XMLSchemaURI
isConfigurationParameter

operationOutputs
operationTask (the "what", i.e., what the operation does -- the verb or action -- e.g., "aligning, ncbi_blast_local_aligning, etc.")
operationResource (underlying resources that the operation may use, like a database, coming from an ontology...)
operationMethod
operationApplication (software application)

serviceType

either: "Soaplab service, WSDL service, Workflow service"
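
For orientation, an annotated service description along these lines could be assembled as below. The wrapper element names (serviceDescription, serviceOperation, operationName) and the nesting are guesses from the outline above, not taken from common.xsd, and the example values are invented.

```python
import xml.etree.ElementTree as ET


def build_service_description(name, location_url, operation_name, task):
    """Build a small XML document using field names from the outline above."""
    service = ET.Element("serviceDescription")  # assumed wrapper element
    ET.SubElement(service, "serviceName").text = name
    ET.SubElement(service, "locationURL").text = location_url
    operations = ET.SubElement(service, "operations")
    op = ET.SubElement(operations, "serviceOperation")  # assumed element name
    ET.SubElement(op, "operationName").text = operation_name
    ET.SubElement(op, "operationTask").text = task
    ET.SubElement(service, "serviceType").text = "WSDL service"
    return service


doc = build_service_description(
    "ExampleBlast", "http://example.org/blast", "blastp", "aligning"
)
print(ET.tostring(doc, encoding="unicode"))
```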

pedro

uses this schema to drive the user interface for annotation
also uses an external xml file to state that certain xml schema elements are to be filled in by semantic types, and where to look in the ontologies to fill those concepts
http://www.cs.man.ac.uk/~penpecip/feta/misc for files ...