Taxon Meeting, Jan 30, 2003

Polycom address: 205.253.57.82

Attendees

Jim Beach
Crispin Wilson
Bob Peet
Jesse Kennedy
Dave Vieglais
Bill Michener
Matthew Jones
Ricardo Pereira
Scott Downie
Greg Vorontsov
Aimee Stewart
Meg Kumin

Crispin Wilson

http://65.205.36.26/taxon/index.asp (login: bcis, 12@bcis)

Multiple taxonomy resolution with annotation. Primarily for browsing, but it should not be too hard to add a programmatic interface. Data are stored in a single database (a centralized repository).

Use Cases

There is now a TaxonUseCases page, which contains the current list of use cases for the SEEK taxon group.

Questions

What is a Taxonomic Concept?
What sort of mapping methods? Hard coded by specialist? Automated?

Revised Scope

Services:

An Internet taxonomic concept (assertion) resolution service employing a semantic mediation engine would exploit the SEEK architecture to enable precise species concept based data discovery and integration.
Service to provide a measure of relative equivalence of two (or more) taxonomic names.
e.g. Algae taxonomy is very dynamic, so a mechanism is needed to determine whether data from two experiments (such as density measurements) can be compared: was the experimenter working with the same organism?
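
A minimal sketch of what such an equivalence measure could look like. The notes do not specify a measure, so everything here is an assumption: each concept is keyed by a (name, usage reference) pair, is described by the set of specimens its author placed in it, and is compared by set overlap (Jaccard similarity). The names, references and specimen ids are hypothetical.

```python
def jaccard(a: frozenset, b: frozenset) -> float:
    """Overlap of two circumscriptions: 1.0 = identical, 0.0 = disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical data: (name, usage reference) -> specimens assigned to it.
CONCEPTS = {
    ("Alga alpha", "Smith 1990"): frozenset({"s1", "s2", "s3", "s4"}),
    ("Alga alpha", "Jones 2001"): frozenset({"s1", "s2"}),
    ("Alga beta", "Jones 2001"): frozenset({"s3", "s4"}),
}


def relative_equivalence(concept_a: tuple, concept_b: tuple) -> float:
    """Score how far two concepts cover the same organisms (assumed measure)."""
    return jaccard(CONCEPTS[concept_a], CONCEPTS[concept_b])


# Same name, different usage: only half of the older concept is retained.
score = relative_equivalence(("Alga alpha", "Smith 1990"),
                             ("Alga alpha", "Jones 2001"))
print(score)  # 0.5
```

In this toy data the same name used in two publications scores 0.5, which is the situation the algae example describes: identical labels do not guarantee that the experimenters worked with the same organism.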

Mechanism for integrating existing and ongoing efforts rather than creating another stand-alone system

Develop standard interchange model
Standard API for accessing system(s) regardless of back-end architecture and implementation

Architecture:

Central facility for the SEEK environment
Shall support distributed classification and concept maps with standard schemas for storage and programmatic interfaces
Must maintain autonomy for data providers. Architecture provides value to authors for contributions.
Interface specification, not implementation. Providers must be able to support the interface defined by this project.
Support multiple classifications with arbitrary number of levels
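
One way to picture "interface specification, not implementation" is the sketch below. The interface name `ConceptProvider`, its two methods, and the `resolve` helper are all hypothetical, since the notes do not define an API. Each provider keeps its own data and back end, and the central facility only talks to the shared interface.

```python
from abc import ABC, abstractmethod


class ConceptProvider(ABC):
    """Hypothetical interface that every data provider would implement."""

    @abstractmethod
    def find_concepts(self, name: str) -> list[str]:
        """Return identifiers of all concepts that use this name."""

    @abstractmethod
    def children(self, concept_id: str) -> list[str]:
        """Return child concept ids; the number of levels is not fixed."""


class InMemoryProvider(ConceptProvider):
    """Toy back end; a real provider could wrap any database."""

    def __init__(self, names: dict, tree: dict):
        self._names = names  # name -> list of concept ids
        self._tree = tree    # concept id -> list of child concept ids

    def find_concepts(self, name: str) -> list[str]:
        return list(self._names.get(name, []))

    def children(self, concept_id: str) -> list[str]:
        return list(self._tree.get(concept_id, []))


def resolve(name: str, providers: dict) -> dict:
    """Central facility: query each provider and report hits per provider."""
    results = {}
    for label, provider in providers.items():
        hits = provider.find_concepts(name)
        if hits:
            results[label] = hits
    return results
```

Because `resolve` depends only on the abstract interface, a provider can change its storage or hosting without affecting the central service, which is the autonomy requirement stated above.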

Why Distributed?

Contributors retain control
Scalability
Distributed in the sense of contribution and editing, not necessarily for database infrastructure.

Why Not?

Need to provide a fast, reliable service. Distributed model can make this quite complicated. This is primarily a problem for the database, not the activity.

Populating the Resource:

Populate initially with weak concept lists (such as ITIS)
Populate in more detail through several tiers of concepts and relatedness for some portions to demonstrate capability and functionality
Prioritize population process by the needs of demonstration experiments for the project.
What is the minimum content for a concept entry?

concept = Full name + (author + publication, date)  +  usage reference [to taxonomic work]

'Full name': can be a "proper" scientific name or any other string that provides a handle that was used in a publication (e.g. an experiment label such as "spp#1")

'Minimal e.g.': ITIS as a dynamic system does not provide a set of concepts, but a published version of ITIS would qualify as a set of concepts.

Names by themselves are a useful entity

Names are not substitutes for concepts
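
The minimum content above can be sketched as two small record types. The class and field names are assumptions made for illustration, not an agreed schema. The point of keeping `Name` separate from `Concept` is that the same name, used in two different works, yields two distinct concepts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Name:
    """A name string plus its optional authorship details."""
    full_name: str                    # scientific name or any handle, e.g. "spp#1"
    author: Optional[str] = None
    publication: Optional[str] = None
    date: Optional[str] = None


@dataclass(frozen=True)
class Concept:
    """A name as used in one particular taxonomic work."""
    name: Name
    usage_reference: str

    def __post_init__(self):
        # Both parts of the minimum content must be present.
        if not self.name.full_name.strip():
            raise ValueError("a concept needs a full name")
        if not self.usage_reference.strip():
            raise ValueError("a concept needs a usage reference")


# Hypothetical usage references, for illustration only.
label = Name("spp#1")
first = Concept(label, "hypothetical field report A")
second = Concept(label, "hypothetical field report B")
print(first.name == second.name)  # True: the names match
print(first == second)            # False: the concepts differ
```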

An Internet taxonomic concept (assertion) resolution service employing a semantic mediation engine would exploit the SEEK architecture to enable precise species concept based data discovery and integration. This specific concept identity resolution problem is representative of a large class of problems (e.g. with classifications for biotic communities, soils, rocks, places) where there exist many-to-many relationships between concepts and names. The solutions we develop should have fundamental utility far beyond biological nomenclature and biodiversity.

IT Research Challenges

Development of a comprehensive conceptual model that can represent all relevant aspects of biological classification and nomenclature semantics, specifically models of multiple interpretations depending on explicit representations of context information, e.g., temporal, hierarchical and circumscription dimensions.
Development of logic representations that allow reasoning about the consistency and consequences of multiple, possibly competing interpretations. For example, using formalizations in modal and many-valued logic [84, 85], an automated deduction system may be devised that allows one to systematically compute all consequences of different taxonomic interpretations and feed those into the semantic mediation system, which in turn would show the different data and analysis views arising from the different nomenclature interpretations.
Deducing concepts rather than species name strings from distributed taxonomic data sources

Deliverables

Conceptual schema and data model for concept-based nomenclature data leveraging previous research by collaborators and colleagues.
Data entry software that allows scientists to add new or published assertions as needed and to map institutional and personal perspectives on the relationships among assertions.
Desktop visualization tools for data discovery and management of multiple classifications will be based on previous work by the Napier University Prometheus Project and others.
Database implementation for an operational, Web-accessible prototype database with representative data from several different taxonomic groups (e.g. higher plants, fishes) aimed ultimately at a global, distributed and federated system of taxonomic concept servers.
Internet service for automated name/concept resolution, accessible via EcoGrid, for several groups using information from synonymies currently available in public databases.
Usability analysis of the functional requirements by working group members would evaluate all applications and tools developed for nomenclature resolution.

+++ Added during workshop +++

Milestones

Year 1

Communications and outreach activities required to avoid duplication of effort (e.g. TDWG effort)

All classification database mafia
attend meetings / workshops etc to familiarize ourselves with other projects
populate working group activity with reps. from other groups

Schedule working group meeting soon
Draft schema for taxonomic concept object - as annotated XML-Schema document
Reports by Bob and Jesse on their systems
Jesse, Bob and Jim will take lead in analyzing all the other models and developing a plan for communicating with the rest of the group
By February meeting

working group will summarize research on data models
This document including use cases will be completed

Identify and hire human resources (early)

Year 2

+++

From CVS Document

Scope

Includes any type of analysis or model in ecology and biodiversity science.

Goal is to massively streamline the analysis and modeling process, and provide for archiving analyses and their outputs. Includes support for analyses in SAS, Matlab, R, SysStat and custom models written in various languages (e.g., C). The system should allow the addition of various back-end analytical engines as they become available or as new versions are released.

The system as a whole should not be tied to any one metadata standard, back-end system or operating system/platform. Flexibility should be a major concern in the design process due to the heterogeneous makeup of the ecological scientific community.

The system should include features that assist users in determining the appropriateness of combining various analytical steps and data sources based on semantic mediation. Semantic mediation should occur in three areas. First, to determine whether it is appropriate to link together particular analytic steps. Second, to mediate between multiple data sets to determine in what ways they can be combined. Third, to determine whether the selected data sources are appropriate inputs for the selected analysis.

Functional requirements

  FR1: Analyses and models documented in declarative language (e.g., XML)
  FR2: Must support 'pipelining' of models in a graph
  FR3: Ability to archive analyses and their outputs
  FR4: Ability to version analyses and their outputs
  FR5: Must have an easy-to-use front end GUI to assist scientists with
       building and executing pipelines
  FR6: Allows the sharing of analytical processes amongst scientists
  FR7: Flexibility in input, processing and output.  e.g. not binding the
       system to one metadata standard, back-end system or platform
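
A minimal sketch of FR1 and FR2 together, under stated assumptions: the XML vocabulary (`pipeline`, `step`, `input`, `op`, `ref`) and the three toy operations are invented for illustration and are not a proposed format. The pipeline is read from a declarative description and its steps are executed in dependency order using the standard-library `graphlib` module (Python 3.9 or later).

```python
import xml.etree.ElementTree as ET
from graphlib import TopologicalSorter

# Hypothetical declarative description of a three-step pipeline.
PIPELINE_XML = """
<pipeline>
  <step id="load" op="load"/>
  <step id="clean" op="drop_negative"><input ref="load"/></step>
  <step id="mean" op="mean"><input ref="clean"/></step>
</pipeline>
"""

# Toy analytic steps; a real system would dispatch to a back-end engine.
OPERATIONS = {
    "load": lambda: [4.0, -1.0, 2.0],
    "drop_negative": lambda xs: [x for x in xs if x >= 0],
    "mean": lambda xs: sum(xs) / len(xs),
}


def run_pipeline(xml_text: str) -> dict:
    """Execute every step after its inputs; return all outputs by step id."""
    root = ET.fromstring(xml_text)
    steps = {s.get("id"): s for s in root.findall("step")}
    # Map each step to the steps it depends on (its predecessors).
    graph = {sid: [i.get("ref") for i in s.findall("input")]
             for sid, s in steps.items()}
    results = {}
    for sid in TopologicalSorter(graph).static_order():
        args = [results[dep] for dep in graph[sid]]
        results[sid] = OPERATIONS[steps[sid].get("op")](*args)
    return results


outputs = run_pipeline(PIPELINE_XML)
print(outputs["mean"])  # 3.0
```

The returned dictionary keeps intermediate as well as final outputs, which is the kind of record FR3 would archive.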

Use cases

  UC1: Scientist can create new analytic steps
  UC2: Scientist can use a graphical interface to arrange analytical steps
       into a pipeline, save it, bind data to the inputs, and execute it
  UC3: Scientist can execute an analysis or model described in a declarative
       language
  UC4: Scientist can archive various intermediate and endpoint results of an
       analytical process
  UC5: Scientist can create new versions of analytical steps, and can return
       to old versions
  UC6: Scientist can share coded pipelines or sub-pipeline steps and results
       of pipeline analyses with other scientists
  UC7: Administrators can add support for additional metadata processors
       and back-end systems when needed
  UC8: Scientist can work backwards through a pipeline of interest: starting
       from the semantics of the desired result, the scientist can determine
       the type of data needed as inputs to the pipeline
  UC9: Given a particular data set and set of pipelines, the scientist can 
       use the semantic mediation system to determine the types of analyses
       that are possible to carry out on the data set.
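
UC8 and UC9 can be illustrated with a small sketch. It assumes that each analytic step declares the semantic types it consumes and the one it produces, and it uses exact string matching as a stand-in for real semantic mediation. The step names and type labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    inputs: frozenset  # semantic types consumed
    output: str        # semantic type produced


# Hypothetical steps: density from counts and area, biomass from density.
STEPS = [
    Step("density", frozenset({"count", "area"}), "density"),
    Step("biomass", frozenset({"density", "mean_mass"}), "biomass"),
]


def required_raw_inputs(target: str, steps: list) -> set:
    """UC8: walk backwards from a result type to the raw data types needed."""
    producers = {s.output: s for s in steps}
    needed, seen, stack = set(), set(), [target]
    while stack:
        t = stack.pop()
        if t in seen:
            continue
        seen.add(t)
        if t in producers:
            stack.extend(producers[t].inputs)  # derived type: expand it
        else:
            needed.add(t)                      # no producer: must be supplied
    return needed


def runnable_steps(available: set, steps: list) -> list:
    """UC9: list steps that can run on the data, letting outputs feed later steps."""
    have, runnable, changed = set(available), [], True
    while changed:
        changed = False
        for s in steps:
            if s.name not in runnable and s.inputs <= have:
                runnable.append(s.name)
                have.add(s.output)
                changed = True
    return runnable


print(sorted(required_raw_inputs("biomass", STEPS)))  # ['area', 'count', 'mean_mass']
print(runnable_steps({"count", "area"}, STEPS))       # ['density']
```

A real mediation system would replace the string comparison with reasoning over an ontology, so that related but differently labeled types could still be matched.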

Software components

  SW1: Metadata language for formal description of analyses
  SW2: Metadata language for the formal description of data and model semantics
  SW3: Server-side system for execution of analyses and models
  SW4: Server-side system for processing semantic metadata
  SW5: Client interface for creating and executing analyses and models



This page last changed on 28-Jun-2004 08:01:41 PDT by LTER.stekell.