Science Environment for Ecological Knowledge
E Science Link Up Oct 04

Difference between version 53 and version 36:

Removed lines 1-2
- ----
-
Line 178 was replaced by line 176
- ** [www.GeoDise.org] Simon Cox
+ ** [http://www.geodise.org] Simon Cox
At line 183 added 193 lines.
+ ** telescience portal enabled tomography workflow
+ *** composed of the sequence of steps required to acquire, process, visualize, and extract useful information from a 3D volume
+ ** morphometry workflow
+ *** structural analysis of data
+ *** large amounts of pre-processing
+ **** normalization, calibration, etc., to get data in a form to be analyzed
+ **** most methods in the pre-process stream can lead to errors
+ **** requires manual editing, etc., and has a set of checkpoints where a user interacts
+ *** moving towards high-performance computing resources
+ ** parameter sweeps
+ *** taking BIRN-MIRIAD numbers and comparing to what the scientist has done ...
+ *** a researcher traced out different areas of the brain; need to compare against the fully automated approach
+ *** looking for the correct parameters to use for the imaging
+ *** if you can get as close as possible to the actual, to the trained researcher, you can correlate minute changes in actual brain structure with saying to some patient "we should put you on some drug regime because you have Alzheimer's" -- with some preventive course of action
+ *** has picture/slide of the workflow
+ *** baseline preprocessing can take upwards of a day
+
+ * Karan Vahi
+ ** Abstract Workflow (DAX): expressed in terms of logical entities; specifies all logical files required to generate the desired data product from scratch; dependencies between the jobs; analogous to a build-style DAG
+ *** format for specifying the abstract workflow; identifies the recipe for creation
+ *** xml syntax / format
+ ** Concrete workflow ...
+ ** alternate replica mechanisms
+ *** how to manage replicas of the same service?
+ **** haven't been looking at that, because of the mandate of the Pegasus ...
+ **** all jobs run independently, wrapped around java executables, shell scripts, etc.
+ **** leveraging condor, and condor-g, which don't go further with web-services, etc.
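The abstract-workflow idea above (jobs over logical files, parent/child dependencies, a build-style DAG) can be sketched in a few lines. This is an illustrative toy, not the actual Pegasus DAX schema; the element names (`adag`, `job`, `uses`, `child`, `parent`) and the helper function are assumptions for demonstration only.

```python
# Toy sketch of a DAX-like abstract workflow: jobs described only by logical
# files, plus parent/child dependencies forming a build-style DAG.
# Element names are illustrative, not the real Pegasus DAX schema.
import xml.etree.ElementTree as ET

def make_abstract_workflow(jobs, deps):
    """jobs: {job_id: {"uses": [(logical_file, "input"|"output"), ...]}}
       deps: [(parent_id, child_id), ...]  -> XML string"""
    dax = ET.Element("adag", name="example")
    for job_id, spec in jobs.items():
        job = ET.SubElement(dax, "job", id=job_id)
        for fname, link in spec["uses"]:
            ET.SubElement(job, "uses", file=fname, link=link)
    for parent, child in deps:
        c = ET.SubElement(dax, "child", ref=child)
        ET.SubElement(c, "parent", ref=parent)
    return ET.tostring(dax, encoding="unicode")

xml_doc = make_abstract_workflow(
    {"preprocess": {"uses": [("f.a", "input"), ("f.b", "output")]},
     "analyze":    {"uses": [("f.b", "input"), ("f.c", "output")]}},
    [("preprocess", "analyze")])
print(xml_doc)
```

Because only logical file names appear, the document stays a "recipe for creation" that a planner can later bind to concrete replicas and sites.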
+
+ * Adam Birnbaum
+ ** Resurgence project
+ ** Encyclopedia of Life (EOL): automated annotations for all of the known protein sequences; slurp 1.5 million things out of a db, and push through seven to ten programs
+ ** both want to have some kind of simple visual programming screen: see nothing but icons relevant to their field, set up the workflow, say go, and do it 1.5 million times / domain-specific tools/icons, and say go repeatedly
+ ** need constraints among icons: outputs and inputs
+ ** template workflows, default settings, etc.
+ ** check validity of resulting configurations / workflows
+ ** what is meant here by high throughput: thousands of tasks per month (not flops), e.g., 1.5 million jobs over a 6-month period
+ ** scientist wants to run many times, varying the inputs
+ ** apst and nimrod (tested with these)
+ ** pegasus in same category
+
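The "run many times, varying the inputs" pattern described above is a parameter sweep: expand a grid of input values into independent task descriptions that a scheduler (Condor, APST, Nimrod, ...) could dispatch. A minimal sketch, assuming a simple templated command line; the helper name and the BLAST example are made up for illustration.

```python
# Sketch of a parameter sweep: expand a {name: [values]} grid into concrete,
# independent command lines. The template syntax and example are illustrative.
from itertools import product

def expand_sweep(template, grid):
    """template: command string with {name} placeholders;
       grid: {name: [values, ...]} -> list of concrete command lines."""
    names = sorted(grid)
    return [template.format(**dict(zip(names, combo)))
            for combo in product(*(grid[n] for n in names))]

tasks = expand_sweep("blastp -query seq{seq}.fa -evalue {evalue}",
                     {"seq": [1, 2, 3], "evalue": [0.001, 0.01]})
print(len(tasks))  # 6 independent jobs
```

Each resulting task is independent, which is exactly what makes this style of workload "high throughput" in the tasks-per-month sense rather than flops.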
+ * Brainstorm
+ ** data-intensive
+ *** 3rd party transfer
+ *** handling handles
+ *** streaming
+ *** SRB
+ *** where does data intensive transport fit?
+ *** separation of concerns ... who does what?
+ *** is there a one-size-fits-all framework?
+ *** wf-life cycle
+ **** construction / design
+ **** instantiation / parameter / data binding
+ **** execution ~ streaming (provenance)
+ ** compute-intensive
+ *** streaming
+ *** wf exception handling
+ *** job scheduling: where does it fit? (to hide or not to hide)
+
+ * Non-Breakout Breakout on Registry Services, etc.
+ ** mygrid and biomoby "data models" are similar enough to plug together
+ ** different ontologies: service, bioinformatics, molecular biology
+ ** data model for services, etc.
+ ** lots of discussion ...
+
+ !Provenance
+
+ * Verification of experiment data; recipes for experiment designs; explanation for the impact of changes; ownership; performance; data quality
+ * The "Provenance Pyramid" -- Knowledge level; Organisation Level; Data Level; Process Level
+ ** Organisation Level at the bottom left of the pyramid, the same size as the right side, which contains the Data Level on top of the Process Level
+ * myGrid approach
+ ** LSIDs: to identify objects
+ ** myGrid information model and mIR: to store lower levels of the pyramid
+ ** sem web technologies (RDF, Ontologies): to store knowledge provenance
+ ** Taverna workflow workbench and plugins: ensure automated recording
+ * LSIDs
+ ** each bioinf database on the web has:
+ *** diff. policies for assigning and maintaining identifiers, dealing with versioning, etc.
+ *** diff. mechanisms ...
+ ** OMG standard
+ *** urn:lsid:AuthorityID:NamespaceID:ObjectID:RevisionID
+ *** urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2
+ *** lsid designator -- the item being id'd is a life-science-specific resource
+ *** authority identifier -- internet domain owned by the org that assigns an LSID to a resource
+ *** namespace id -- name of the resource
+ *** etc.
+ ** how is data retrieved with LSIDs?
+ *** application -> 1. get me info for id --> LSID client
+ *** 2. where can I get data and metadata for ID
+ **** returns wsdl doc giving information on where to get the data
+ ** Authority commitments
+ *** data returned for a given lsid must always be the same
+ *** must always maintain an authority at e.g. pdb.org that can point to data and metadata resolvers
+ ** lsid components
+ *** IBM built client and server implementations in Perl, Java, C++ ...
+ *** fairly straightforward to wrap an existing db as a source of data or metadata
+ *** client also straightforward
+ *** LSID launchpad ... within internet explorer (type in your lsid, returns metadata, etc)
+ ** Use of LSIDs within myGrid
+ *** needed an id for things such as workflows, experiments, new data results, etc.
+ *** everything id'd with LSIDs
+ *** build and deployed: LSID assigning server; lsid authority ([http://www.mygrid.org.uk]); metadata resolver; data resolver; (all based on IBM's open source implementation)
+ ** experiences
+ *** advantages: urn makes it easy to integrate with semantic web tools; more explicit than a url: there is an explicit protocol for separating metadata from data
+ *** disadvantages: have to decide what is data and what is metadata because they have different commitments (versioning); up to Jul 04, implementations were chasing revisions as the standard matured ... now seems stable as standardisation is more complete; to be successful across the community, it will require widespread adoption by providers such as Genbank, UniProt, etc.
+ ** Provenance storage
+ *** architecture
+ **** 1. data sent/received from services; 2. new lsids assigned to data; 3. data / metadata stored; ...
+ *** metadata store: Jena RDF store; pushes RDF to LSID metadata resolver
+ *** mIR is an object relational database pushes XStream-RDF to LSID metadata resolver, and objects to LSID data resolver
+ *** use jena to store the rdf data
+ *** lsid resolver outputs xml and text-plain
+ ** scientific annotation
+ *** the goal of this experiment was ...
+ *** the results prove the hypothesis that...
+ **** need a schema for these annotations
+ **** tools to add the annotations
+ *** Tracy Cradddic
+
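The LSID URN layout noted above (urn:lsid:AuthorityID:NamespaceID:ObjectID:RevisionID, with the revision optional) can be split into its parts mechanically. A small sketch; the function name is an assumption, and real resolution would then go through the authority's WSDL-described resolver rather than this local parse.

```python
# Split an LSID URN into its components per the layout in the notes:
# urn:lsid:Authority:Namespace:Object[:Revision]. Revision is optional.
def parse_lsid(lsid):
    parts = lsid.split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not an LSID: %r" % lsid)
    return {"authority": parts[2],
            "namespace": parts[3],
            "object":    parts[4],
            "revision":  parts[5] if len(parts) > 5 else None}

print(parse_lsid("urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2"))
# {'authority': 'ncbi.nlm.nih.gov', 'namespace': 'GenBank',
#  'object': 'T48601', 'revision': '2'}
```

The authority component is what the "authority commitments" above attach to: whoever owns that domain must keep resolving the LSID to the same data forever.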
+ !Visualization in myGrid
+
+ * Williams workflow B ...
+ ** large amounts of data (or datatypes)
+ ** data implicitly linked within itself
+ ** data is implicitly linked outside of itself
+ ** genomic sequence is the central co-ordinating point, but there are a number of different co-ordinate systems
+ ** some "biological", some artifacts of the workflow
+ * what's the problem
+ ** we don't have a domain model
+ ** we need a model for visualization
+ ** but, domain models are hard
+ ** it's not clear that the domain model should be in the middleware
+ * what have we done!?
+ ** bioinformatics pm (pre myGrid)
+ ** one big distributed data heterogeneity and integration problem
+ ** still a big distributed data heterogeneity and integration problem
+ * how do we solve the problem
+ ** take the data, use something (perl or an MSc student) to map the data into a (partial) data model
+ ** visualize this ...
+ ** but what if the workflow changes?
+ * second solution
+ ** large quantities of data are already available with rich mark up in a visualizable form
+ ** this is unparsable, so also get the flat file rep
+ ** start to build visualization information into the workflow using beanshell
+ ** linked data from output -- domain model = scripts that hack these things together
+ * summary
+ ** domain models are hard
+ ** workflows can obfuscate the model
+ ** visualization requires one
+ ** we can build some knowledge of a domain model into the workflow and steal the rest.
+ ** is there a better way?
+
+ !Breakout: myGrid "Data Model" (schema) for capturing Metadata and Semantics
+
+ * common.xsd
+ ** service description
+ *** serviceName
+ *** organisation
+ **** UDDI fields, e.g., organization name, etc.
+ *** author
+ *** locationURL
+ *** interfaceWSDL
+ *** serviceDescriptionText
+ *** operations (units of functionality)
+ **** service operation
+ ***** operation name
+ ***** portName
+ ***** operationDescriptionText
+ ***** operationInputs
+ ****** parameter
+ ******* parameterName
+ ******* messageName
+ ******* parameterDescription
+ ******* defaultValue
+ ******* semanticType
+ ******* XMLSchemaURI
+ ******* isConfigurationParameter
+ ***** operationOutputs
+ ***** operationTask (the "what", i.e., what the operation does -- the verb or action -- e.g., "aligning, ncbi_blast_local_aligning, etc.")
+ ***** operationResource (underlying resources that the operation may use, like a database, coming from an ontology...)
+ ***** operationMethod
+ ***** operationApplication (software application)
+ *** serviceType
+ **** either: "Soaplab service, WSDL service, Workflow service"
+ ** pedro
+ *** uses this schema to drive the user interface for annotation
+ *** also uses an external xml file to state that certain xml schema elements are to be filled in by semantic types, and where to look in the ontologies to fill those concepts
+ *** [http://www.cs.man.ac.uk/~penpecip/feta/misc] for files ...
+
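To make the common.xsd shape above concrete, here is a toy service-description instance built as a nested structure. The field names follow the notes; the example service, values, and the traversal helper are entirely made up for illustration and are not part of the myGrid schema itself.

```python
# Toy instance of the service-description shape sketched in common.xsd.
# Field names follow the notes; the example values are invented.
service = {
    "serviceName": "blast",
    "organisation": {"name": "example.org"},       # UDDI-style org fields
    "interfaceWSDL": "http://example.org/blast?wsdl",
    "serviceType": "Soaplab service",              # one of the three listed types
    "operations": [{
        "operationName": "run",
        "operationTask": "aligning",               # the "what" -- a verb from an ontology
        "operationInputs": [{
            "parameterName": "sequence",
            "semanticType": "protein_sequence",    # concept drawn from an ontology
            "isConfigurationParameter": False,
        }],
        "operationOutputs": [{"parameterName": "report"}],
    }],
}

# e.g. list every semantically typed input across all operations
typed = [p["parameterName"]
         for op in service["operations"]
         for p in op["operationInputs"] if "semanticType" in p]
print(typed)  # ['sequence']
```

This is the sort of record a tool like Pedro could elicit through a schema-driven form, with the `semanticType` slots filled from the ontologies as the notes describe.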
+ !More on SHIMs and Planning
+
+ * Shims in detail: UniProt database to BLASTp analysis
+ ** UniProt produces concrete type: UniProt_record
+ ** contains protein_sequence
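A shim of the kind described above dereferences the concrete type (a UniProt_record) down to the bare protein_sequence that a BLASTp service expects. A minimal sketch against the UniProt flat-file layout, where sequence lines follow the `SQ` header and the record ends with `//`; the function name and the truncated record text are made-up examples.

```python
# Sketch of a UniProt -> BLASTp shim: extract the raw amino-acid sequence
# from a UniProt flat-file record (sequence lines sit between the SQ header
# and the // terminator). Function name and sample record are illustrative.
def uniprot_to_sequence(record_text):
    seq_lines, in_seq = [], False
    for line in record_text.splitlines():
        if line.startswith("SQ"):
            in_seq = True            # sequence data starts after the SQ header
        elif line.startswith("//"):
            in_seq = False           # end-of-record terminator
        elif in_seq:
            seq_lines.append(line.replace(" ", ""))
    return "".join(seq_lines)

record = "ID   TEST_HUMAN\nSQ   SEQUENCE 10 AA;\n     MKTAYIAKQR\n//\n"
print(uniprot_to_sequence(record))  # MKTAYIAKQR
```

The point of naming such adapters explicitly as shims is that they carry no science of their own, yet without one the two typed services cannot be composed.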
Removed lines 185-189
-
-
-
-
- *
