Science Environment for Ecological Knowledge
Kepler Meeting SMS Notes

This is version 26.


The Semantic Mediation System and KEPLER

Back to Kepler Meeting Agenda


Exploiting Ontologies

  • In SEEK we want to exploit "eco" ontologies to do "smart discovery and integration"
  • The goal is to "tag" (annotate) data and workflows (and their components) using ontology terms
  • Our solutions are meant to be generic and directly applicable to KEPLER

Ontology Languages

  • An ontology is:
    1. a set of concept (class) names,
    2. subconcept (subclass) links,
    3. named (directed, binary) relationships between concepts,
    4. and constraints (cardinality, equivalence, conjunction, disjunction, etc.)
  • In SEEK, we've adopted the Web Ontology Language (OWL)
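The four-part definition above can be sketched as a tiny in-memory structure. This is only an illustration (OWL itself is far richer); the concept names below are invented, not from a real SEEK ontology.

```python
# A minimal in-memory ontology sketch mirroring the four parts listed above:
# concept names, subconcept links, named relationships, and constraints.
class Ontology:
    def __init__(self):
        self.concepts = set()    # 1. concept (class) names
        self.subconcept = {}     # 2. child -> parent (subclass) links
        self.relations = set()   # 3. (name, domain, range) triples
        self.constraints = []    # 4. e.g., cardinality rules (opaque here)

    def add_concept(self, name, parent=None):
        self.concepts.add(name)
        if parent is not None:
            self.subconcept[name] = parent

    def is_subconcept(self, child, ancestor):
        """True if `child` is `ancestor` or reachable via subconcept links."""
        while child is not None:
            if child == ancestor:
                return True
            child = self.subconcept.get(child)
        return False

onto = Ontology()
onto.add_concept("Observation")
onto.add_concept("Measurement", parent="Observation")
onto.add_concept("SpeciesCount", parent="Measurement")
print(onto.is_subconcept("SpeciesCount", "Observation"))  # True
```

A real OWL ontology would be handled by a reasoner (Jena, Racer, etc., as noted under Required Tools); this sketch only captures the is-a backbone.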

Semantic Annotations

  • A semantic annotation assigns an "item" to an ontology "expression".

    • Items
      • Datasets: An entire dataset or some portion (a single table, one or more attributes, one or more data values, etc.)
      • Workflows and components: A workflow, a workflow component, or some portion (parameters, ports, substructures of a port type, etc.).

    • Selecting Items
      • Can be as simple as an LSID, e.g., one that identifies an entire component or dataset
      • Simple query expressions can also be used, e.g., like XPath/XPointer addressing, using EML attribute identifiers, etc.
      • More generally, expressed as a query.

    • Ontology Expressions
      • Defines the semantic "context" of the item selected
      • Can be as simple as a single concept id (like "Measurement")
      • Simple expressions can also be used, e.g., as paths in an ontology
        • Example: Measurement.spatialContext.loc.latDeg specifies the location of a Measurement's spatialContext as a latitude in degrees
      • More generally, update queries, e.g., SQL-style update queries
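A semantic annotation, then, pairs an item selector with an ontology expression. The sketch below shows the simplest case from above: a bare LSID as the selector and a dotted path as the expression. The LSID is made up for illustration.

```python
# Sketch: an annotation pairs an item selector (here, a bare LSID) with an
# ontology expression (here, a dotted path of concept/relation steps).
def parse_path(expr):
    """Split 'Measurement.spatialContext.loc.latDeg' into its steps."""
    return expr.split(".")

annotation = {
    "item": "urn:lsid:example.org:actor:42",   # hypothetical LSID
    "expression": "Measurement.spatialContext.loc.latDeg",
}

steps = parse_path(annotation["expression"])
root, path = steps[0], steps[1:]
print(root)   # 'Measurement'
print(path)   # ['spatialContext', 'loc', 'latDeg']
```

Richer selectors (XPath/XPointer, EML attribute identifiers) and richer expressions (SQL-style update queries) would replace the two strings, but the pairing stays the same.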

Architecture Issues

  • SMS-Based Applications
    1. Browsing/Keyword Search
      • Categorize workflows, components, datasets according to their position in the ontology concept hierarchy.
      • Search based on individual concepts (as a keyword), providing "term expansion" capabilities
    2. Find "compatible" workflow components
      • Given a workflow component (an actor), find components that can be connected to it (either as input or output) based on semantic annotations. If the annotations are "compatible" according to the ontology(ies), the component is returned.
      • Could result in "data binding" -- a dataset may be a "compatible" input.
      • Note that semantic compatibility does not imply structural compatibility (the i/o types may not match; see below)
      • Requires port inputs/outputs to be semantically annotated
    3. Workflow "analysis"
      • Given a workflow of connected components, check that each connection (input/output) is semantically compatible.
      • Analysis may take advantage of annotation propagation (this is still research)
    4. Workflow-component structural integration
      • Given two components that are semantically compatible, determine one or more transformations (either by inserting new components or deriving transformation "code") to make them structurally compatible.
        • In general, component integration is a planning-style search problem (and still research)
        • May be a place where SCIA can contribute, to derive the structural transformation code and help users refine mappings
    5. Dataset merging and integration
      • Search for "similar" datasets based on semantic annotations of current dataset
      • Given two datasets, merge them (data fusion) into a single dataset based on their semantic annotations + metadata
      • Define a dataset of interest (as a query---the classic approach---or as a target, annotated schema), then find/integrate datasets to populate result (classic data integration).
        • Perhaps places for SCIA to contribute?
        • In general, still research
        • Integration depends on the granularity/quality of the annotations, ontologies, etc.
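Applications 1 and 2 above can be sketched with a toy is-a hierarchy: term expansion makes a query concept also match its subconcepts, and semantic compatibility holds when an output's concept is subsumed by the input's expected concept. The hierarchy, annotations, and actor names are invented examples, not SEEK's.

```python
# Toy is-a hierarchy: child -> parent links.
SUBCONCEPT = {
    "SpeciesCount": "Measurement",
    "Density": "Measurement",
    "Measurement": "Observation",
}

def expand(concept):
    """Term expansion: all concepts subsumed by `concept`, including itself."""
    closure = {concept}
    changed = True
    while changed:
        changed = False
        for child, parent in SUBCONCEPT.items():
            if parent in closure and child not in closure:
                closure.add(child)
                changed = True
    return closure

ANNOTATIONS = {          # component -> annotating concept
    "CountActor": "SpeciesCount",
    "DensityActor": "Density",
    "PlotActor": "Observation",
}

def search(concept):
    """App 1: keyword search over annotations, with term expansion."""
    terms = expand(concept)
    return sorted(c for c, a in ANNOTATIONS.items() if a in terms)

def ancestors(concept):
    chain = {concept}
    while concept in SUBCONCEPT:
        concept = SUBCONCEPT[concept]
        chain.add(concept)
    return chain

def compatible(out_concept, in_concept):
    """App 2: output acceptable if its concept is the input concept or below it."""
    return in_concept in ancestors(out_concept)

print(search("Measurement"))                       # ['CountActor', 'DensityActor']
print(compatible("SpeciesCount", "Measurement"))   # True
```

As noted above, this says nothing about structural (i/o type) compatibility; that is the separate integration problem in applications 3-5.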

  • Repositories
    1. Ontology(ies)
    2. Datasets (or metadata stating how to obtain the datasets)
    3. Workflows and Workflow Components (or metadata, etc.)
    4. Semantic Annotations

    • "Smart discovery and integration" needs access to these components:
      • To search for a workflow component, we would search through semantic annotations. When an annotation matches, obtain the corresponding component.
      • To organize all actors (for browsing) according to their annotations, we might iterate over the actors; similarly for datasets.

  • Required Tools
    1. Ontology Editors/Browsers
      • The KR group in SEEK
    2. Semantic Annotation Editors/Browsers
      • For creating, editing, registering annotations
      • KR and SMS group in SEEK
    3. Ontology-based query rewriting/answering
      • Classification based on ontology (Jena, Racer, etc.)
      • Efficiently using rewriting to find components
      • Testing semantic compatibility
      • Annotation propagation (research)
    4. Component integration reasoning
      • Structural transformation algorithms (SCIA? CLIO? Schema Mapping?)
      • Search a la planners
    5. Data merging and integration reasoning
      • Algorithms and rules for fusing together data
      • Structural transformation algorithms (see above)
      • Basic conversions like count/area = density
    6. Explanation viewers/systems
      • To explain why an answer was obtained
      • Closely tied to ontology editors/browsers
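One of the "basic conversions" named above (count/area = density) can be sketched directly. A real SMS would select such a rule via the ontology and unit metadata; here it is hard-coded for illustration.

```python
# Sketch of the basic conversion count/area = density mentioned above.
def count_area_to_density(count, area):
    """density = count / area (e.g., individuals per square meter)."""
    if area <= 0:
        raise ValueError("area must be positive")
    return count / area

print(count_area_to_density(30, 6.0))  # 5.0
```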

  • "Smart" Actor Search in Kepler

      • A very simple keyword-based search we (Chad and I) implemented within Kepler.
        • Integrated with the component 'quick search' frame
        • Allows dynamic actor classification (for browsing)
        • Allows runtime annotation and re-classification of actors
        • Term expansion for individual concept queries
      • Required a number of new features in Kepler:
        1. ID mechanism for actors
        2. Repositories
          • Faked out: a component repository (as a Ptolemy XML config file), an annotation repository (an XML file), and an ontology (a simple is-a hierarchy; no relationships, etc.)
        3. A hand-coded, very naive local ID service (for LSID-like identifiers)
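The faked-out annotation repository above was just an XML file, so the keyword search amounts to scanning it. This sketch uses invented element and attribute names, not Kepler's actual schema.

```python
# Sketch of the simple keyword search run against a "faked out" annotation
# repository held as XML (element/attribute names are illustrative only).
import xml.etree.ElementTree as ET

ANNOTATION_XML = """
<annotations>
  <annotation actorId="actor.1" concept="Measurement"/>
  <annotation actorId="actor.2" concept="SpeciesCount"/>
</annotations>
"""

def find_actors(concept, xml_text=ANNOTATION_XML):
    """Return the IDs of actors annotated with exactly `concept`."""
    root = ET.fromstring(xml_text)
    return [a.get("actorId") for a in root.findall("annotation")
            if a.get("concept") == concept]

print(find_actors("Measurement"))  # ['actor.1']
```

Combining this with term expansion (expanding the query concept over the is-a hierarchy first) gives the "smart" search behavior described above.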

What's needed for KEPLER

  • Ontologies and Ontology Tools
    • There basically aren't any.
    • There also aren't any tools in Kepler for creating, browsing, or editing ontologies.

  • Annotations
    • Need to extend the annotation "language"
    • Desperately need an annotation editor/browser
      • Need a reasonable/practical GUI design
      • Need a good way to access/browse a component/dataset and its attributes, such as its ports and their input/output types.

  • Basic Kepler GUI Hooks
    • Like for toolbar, menus, etc.
    • Checking semantic compatibility (can steal unit resolver?).
    • Explanation of results (like for searching, etc.)

  • Algorithms
    • Need to understand the integration/merging algorithms
    • We could write the other types of search algorithms today
  • Repositories
    • Basically none of the repositories exist (except perhaps for Data, not sure)
    • I think the Kepler Obj. Manager can help with this; what we need from it is:
      • Ability to register components, data sets, ontologies, and annotations with the obj. manager
      • Ability to access all LSIDs of a certain type, e.g., components, data sets, ontologies, annotations
      • Ability to retrieve the object for an LSID
      • Some form of annotation indexing (this is similar to metadata indexing perhaps)
        • A search can be executed directly against an in-memory annotation file (e.g., obtained dynamically from all registered objects)
        • In contrast to asking the obj mngr for all lsids that are annotations, and for each retrieving the annotation file, etc.
      • For efficiency, we probably want multiple access paths via LSIDs, e.g., get all the workflow components and, for each, retrieve its annotation (if there are a lot more annotations than just those for components); or build an annotation index based on these LSIDs, etc.
        • Exactly what types of indexing are needed should be driven by development/testing, but we may consider an obj. mngr. architecture that can easily support "extensible" indexing strategies (e.g., through listeners, etc.)
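The in-memory annotation index discussed above might look like the following: rather than asking the object manager for every annotation LSID and fetching each object, build a concept-to-component map once and search against it. The LSIDs and the `registry` structure are hypothetical.

```python
# Sketch of an in-memory annotation index: concept -> [component LSID].
from collections import defaultdict

registry = [  # (annotation LSID, annotated component LSID, concept)
    ("urn:lsid:ex:ann:1", "urn:lsid:ex:actor:1", "Measurement"),
    ("urn:lsid:ex:ann:2", "urn:lsid:ex:actor:2", "SpeciesCount"),
    ("urn:lsid:ex:ann:3", "urn:lsid:ex:actor:1", "SpeciesCount"),
]

def build_index(entries):
    """One pass over the registered annotations builds the access path."""
    index = defaultdict(list)
    for _ann_lsid, component_lsid, concept in entries:
        index[concept].append(component_lsid)
    return index

index = build_index(registry)
print(index["SpeciesCount"])  # ['urn:lsid:ex:actor:2', 'urn:lsid:ex:actor:1']
```

An "extensible" indexing strategy (e.g., via listeners on the object manager) would rebuild or update this map as annotations are registered.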



This particular version was published on 20-Jan-2005 12:26:48 PST by SDSC.bowers.