Taxon Meeting Notes_13_May_2003

Taxon WG Meeting, San Diego Supercomputer Center

Opening Session Tuesday May 13, 2003, 1:30 PM

Present: Trevor Paterson, Aimee Stewart, Paula Huddleston, Bob Peet, Jessie Kennedy, Dave Vieglais, Hannu Saarenmaa, Jim Beach

Paula Huddleston briefly reviewed the status of the ITIS project. Hannu Saarenmaa briefly reviewed GBIF plans for serving biological names; the electronic Catalog of Life consortium will likely provide the names.

Bob: three basic functions: retrieval and mapping of tagged and not tagged names for data set queries

discovery/retrieval/merger of named taxa without concept mapping (GUIDs)
discovery/retrieval/merger of named taxa with concept mappings (using GUIDs)
APIs, standards, rules, work protocols and registries for identification of a unique concept with a GUID

GUIDs for taxonomic concepts
GUIDs for taxonomic names
GUIDs for taxonomic references (publications that use names)
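
A minimal sketch of the three kinds of GUID-bearing records listed above. The class names, the use of UUID4 as the GUID scheme, and the idea that a concept record points at one name and one reference are all assumptions made for illustration; the notes do not fix any of them.

```python
import uuid
from dataclasses import dataclass, field


def new_guid() -> str:
    """Return a fresh identifier (UUID4 is an assumption, not a WG decision)."""
    return str(uuid.uuid4())


@dataclass(frozen=True)
class TaxonName:
    text: str
    guid: str = field(default_factory=new_guid)


@dataclass(frozen=True)
class Reference:
    # A publication that uses names.
    citation: str
    guid: str = field(default_factory=new_guid)


@dataclass(frozen=True)
class TaxonConcept:
    # Assumed shape: a concept ties one name to the reference that uses it.
    name_guid: str
    reference_guid: str
    guid: str = field(default_factory=new_guid)


# Hypothetical data: one name used in two publications gives two concepts.
name = TaxonName("Rosa")
ref_a = Reference("Hypothetical flora A")
ref_b = Reference("Hypothetical flora B")
concept_a = TaxonConcept(name.guid, ref_a.guid)
concept_b = TaxonConcept(name.guid, ref_b.guid)
```

Keeping separate GUIDs for names, references and concepts lets the same name string appear in several concepts without the identifiers colliding.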

JK:

EML - how can it be extended to deal effectively with taxonomic concepts.
SMS group is expecting the Taxon group to provide structure for taxonomic semantic mediation
How does this fit into the Ecogrid? We will likely need to create a taxonomic name service for testing and also (DV) some standard APIs for taxonomic name services, so that other people can be concept and name data providers.

DV: Need to consider what kinds of queries will be asked within the SEEK architecture of a taxon data provider for SEEK analysis needs, for example for ecological niche modeling. Might need to start with a common name, for example. The service needs to provide the mappings between names and concepts from the start.

There are many implementation issues with returning this kind of data, for example DoS problems, "Have any bananas?"

Concept resolution needs to be done both completely automatically for naive users and with intermediate results for expert browsing, so that experts can see which of the names that come back, in which data sets, are what the user meant.

There is also the need for taxon comparison support functions for expert users, who need to map between the name in their query and the results

Clear that we need a federated schema for concepts, with client and server APIs, to enable multiple different systems.

JK reviewed her summary of concept classifications, diagram. How taxonomic concepts are defined: Rank, Label, Publication, Definition, Author.

Quality score based on what kind of information was provided with the concept. 0.0.0.0 if for example, no name, no publication, no author, no definition.
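
A minimal sketch of how such a dotted score could be computed. Only the all-zero case (0.0.0.0) comes from the notes; the field order, the use of 1 for "present", and the `ConceptRecord` class are assumptions, and the real scheme may grade each position more finely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConceptRecord:
    name: Optional[str] = None
    publication: Optional[str] = None
    author: Optional[str] = None
    definition: Optional[str] = None

    def quality_score(self) -> str:
        """One flag per field, in the assumed order name.publication.author.definition."""
        fields = (self.name, self.publication, self.author, self.definition)
        return ".".join("1" if value else "0" for value in fields)


empty = ConceptRecord()
partial = ConceptRecord(name="Rosa", author="L.")
print(empty.quality_score())
print(partial.quality_score())
```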

4:15 Break

Use Cases

Ecogrid providers and consumers

Use Cases for the architecture and not for any particular database Do retrieval first

5:15 Adjourn

Wednesday Morning May 14, 8:30 AM, SDSC

Present: David Stockwell, Trevor Paterson, Aimee Stewart, Susan Gauch, Paula Huddleston, Bob Peet, Jessie Kennedy, Dave Vieglais, Hannu Saarenmaa, Jim Beach

Discussion of daily agenda priorities

Bob Peet Presentation -- his perspective of Taxon WG vision, objectives, deliverables.

Ecology will be dramatically different in 10 years with easy access to wide and deep data sets
CEED Project, SDSC: no one wanted to use it
ESA Digital Archives: repository to park archives and supplemental data sets from ESA journals
Ecologists are not entering metadata for their data sets, despite software and opportunities
NSF seems like the only way to motivate the research community, SEEK can provide the infrastructure
Motivational issues are critical, without large scale buy-in, any architecture will not be used, SEEK needs to deal with that
For SEEK Taxon, we are providing the mediation tools to deal with M:M cardinality between concepts and names
Need to allow, support, facilitate party (authority) perspectives on classifications; other types of data also have the same problem with M:M names and concepts
We can build an architecture, but unless we get data into it the project will fail.
We need to accommodate new marked up concepts, but also legacy data sets that only have names (the "weak concepts" concept from the Jan meeting).
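
A minimal sketch of the M:M cardinality between names and concepts mentioned above, kept as two in-memory indexes. The class, method names and identifiers are hypothetical; a real implementation would sit on the concept database.

```python
from collections import defaultdict


class NameConceptIndex:
    """Many-to-many links: one name can label several concepts and vice versa."""

    def __init__(self):
        self._concepts_by_name = defaultdict(set)
        self._names_by_concept = defaultdict(set)

    def link(self, name, concept_id):
        self._concepts_by_name[name].add(concept_id)
        self._names_by_concept[concept_id].add(name)

    def concepts_for(self, name):
        return sorted(self._concepts_by_name.get(name, ()))

    def names_for(self, concept_id):
        return sorted(self._names_by_concept.get(concept_id, ()))


# Hypothetical identifiers, for illustration only.
index = NameConceptIndex()
index.link("Name A", "concept:1")
index.link("Name A", "concept:2")   # same name, second party's concept
index.link("Name B", "concept:2")   # second name for the same concept
```

A legacy data set that only carries names can still be queried through `concepts_for`, which returns every candidate concept rather than guessing one.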

We need to parse out what the responsibilities are for development in the next two days.

Funded Parties of the WG

KU
UNC
Napier

Components

model for concept data elements and for the architecture, leading to schemas, APIs, standards
web-based prototype that demonstrates the mediation of concepts, for ecologists. Bob has built a system that does that. Seems like a logical piece to build upon for initial SEEK deliverables. The demo database then needs to be populated, likely from ITIS.
Need software tools for taxonomists to author concept records, Bob has a prototype of this also, from his Natureserve demonstration. UNC Postdocs might work on this
Web interface for searching and retrieving data
Need "visualization tools", similar to JK's
Need database tools for legacy data

JK: the most fundamental decision is what our design requirements are, based on the expectations of the other WGs; what the taxon name resolution service is depends a lot on how the Ecogrid group expects it to operate. Integration with the overall project architecture is pretty important.

DV: BEAM WG has identified a typical use case for how SEEK would work, Powerpoint slide of searching for concepts related to the Fringed Filefish, then a query on specimen databases, pipeline into GARP, identification for environmental data sets, run GARP model.

Hannu: an architectural document would be important for sharing with the community. GBIF has one; a copy was circulated at the meeting, and it is on the GBIF web site, http://www.gbif.org

GBIF has about 8 use cases that would be valuable for SEEK to look at. GBIF architecture: the DiGIR architecture should be up and running next year, and GBIF will support that. GBIF does not have a project in place to deal with a taxon concept architecture.

Susan: would like to see 1 use case, would like to build one function, and then at the next meeting talk about what we would like to add to that. Would like to pursue the spiral model of development.

Bob would like use case 1 to be what is needed to store a concept-level record in an archive. SG: Then my interest would be to see what would be needed to bring legacy data into that archive, how they could be mapped to fully qualified taxon records, and how they could be retrieved with them in an overall architecture.

There is a comfort level with approaching WG modeling and development activities one use case at a time, with a user function like archiving being a case.

DV: Presented overall SEEK Use Case, powerpoint slides.
The only requirement of the use case for the Taxon WG is that a common name be resolved to a scientific name, in order to do a specimen query.

HS: defining the interfaces on how you access multiple name providers to do this kind of thing, would be very useful for GBIF. The APIs and architectural requirements are key.

DV: We should just start with this, let's not try to solve all of the taxonomic problems and issues with concepts. Let's just use ITIS for name lookups and call it our first demonstration.

JK: Don't see a requirement here for the SEEK concept architecture. DV: yes, but it is a starting point and a point of departure for concepts.

SG: Use case 2 should be that this common name goes to two or more scientific names; how do these map? How could the user deal with multiple concepts to then do a query?
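
A minimal sketch covering both use cases: a common name that resolves to exactly one scientific name (use case 1) and one that resolves to several and needs a user decision before the specimen query (use case 2). The lookup table holds placeholder names, and the function name and status strings are assumptions; in the proposed first demonstration the lookup would come from ITIS.

```python
def resolve_common_name(common_name, lookup):
    """Return (status, candidates); status is "unique", "ambiguous" or "unknown"."""
    candidates = sorted(lookup.get(common_name.strip().lower(), []))
    if not candidates:
        return "unknown", []
    if len(candidates) == 1:
        return "unique", candidates
    return "ambiguous", candidates


# Placeholder data only; not real taxonomy.
LOOKUP = {
    "common name x": ["Scientific name 1"],
    "common name y": ["Scientific name 2", "Scientific name 3"],
}

status, names = resolve_common_name("Common Name Y", LOOKUP)
if status == "ambiguous":
    # Use case 2: show the candidates to the user instead of picking one silently.
    print("Please choose one of:", ", ".join(names))
```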

JK: (Need to get Jessie's model comparison slides on here.)

PH: SEEK will need a respository of its own to store data, because other people may not have a database.

JK: I am worried that working on a single provider, ITIS for the first use case, is too limited, we will do a lot of things that will not be of generality, we will not be thinking about multiple taxonomies

JK: Maybe the internal one is more important to do first.

DV: First thing we do is define the data structure and the API, and then look for existing repositories that could support our requirements. Still the first step is to build the API and the data structures.

JK: we cannot design for a single application use case, we need to also dig into the ways to represent concepts and classifications in a database. If we don't look at the broader requirements of the architecture, then we are going to waste our time on one-off demonstrations.

SG, DV: Don't disagree.

Coffee Break 10:15-10:45 AM

Hannu: Do it the way the DiGIR group did it:

coming up with a federated schema
have a draft protocol about how data in those records could be transported
have a draft interface definition
find an existing database provider that would be willing to accommodate concept data types
need server software to map from the database to the federated schema
need an internal database that would cache these data types
build applications around it.

Stockwell: would be most useful if this could be generalized to other systems, e.g. vegetation systems, gene name data.

DS: if Jessie's tools could be generalized to vegetation types, that would create a great buzz. JK: But it is not easy. Classifications are not trees, there is a ton of semantics, we are not just doing name comparisons, all of my systems are built to deal with taxon concepts.

Talk about generalizing the software to be able to generalize to other types of classifications, without actually getting into the semantics of the data.

One could simply create a system that would allow people to map between concepts manually, would not need to include any machine reasoning to compare things like vegetation types.

Bob -- let's take the diagrams that we have and come up with a draft schema for concepts, including the metadata.

Links would have to be modeled separately, with a distinct schema.

Lunch 12:00-1:30

Agreed to:

multiple classification architecture
implement (model, specify, design) functions to satisfy SEEK primary use case

implement a service to take a name and return a taxonomic concept object
implementation will be in the context of multiple data providers with different classification models and different classification data

Define a data exchange federation schema which may be congruent with the SEEK taxonomic data repository structure.
Identify/adopt a query/response protocol. (Operations: search, retrieve, scan)
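
A minimal sketch of a name-to-concept service over multiple data providers, using the three operations named above. The notes only name search, retrieve and scan; the meanings given to them here, the `ConceptProvider` interface, the record fields and the in-memory provider are all assumptions for illustration.

```python
from typing import Dict, List, Protocol


class ConceptProvider(Protocol):
    provider_id: str

    def search(self, name: str) -> List[str]: ...           # name -> concept ids
    def retrieve(self, concept_id: str) -> Dict[str, str]: ...  # id -> concept record
    def scan(self, prefix: str) -> List[str]: ...            # browse names held


class InMemoryProvider:
    """Toy provider; a real one would wrap a database such as ITIS or Prometheus."""

    def __init__(self, provider_id, concepts):
        self.provider_id = provider_id
        self._concepts = concepts  # concept_id -> {"name": ..., "according_to": ...}

    def search(self, name):
        wanted = name.lower()
        return [cid for cid, c in self._concepts.items() if c["name"].lower() == wanted]

    def retrieve(self, concept_id):
        return dict(self._concepts[concept_id])

    def scan(self, prefix):
        p = prefix.lower()
        return sorted({c["name"] for c in self._concepts.values()
                       if c["name"].lower().startswith(p)})


def federated_lookup(name, providers):
    """Take a name and return concept records from every provider that knows it."""
    results = []
    for provider in providers:
        for cid in provider.search(name):
            record = provider.retrieve(cid)
            record["concept_id"] = cid
            record["provider"] = provider.provider_id
            results.append(record)
    return results


# Hypothetical providers holding different classifications of the same name.
p1 = InMemoryProvider("provider-1", {"c1": {"name": "Rosa", "according_to": "Source A"}})
p2 = InMemoryProvider("provider-2", {"k9": {"name": "Rosa", "according_to": "Source B"},
                                     "k10": {"name": "Rubus", "according_to": "Source B"}})
hits = federated_lookup("rosa", [p1, p2])
```

Each returned record keeps its provider and concept id, so a client can show that the same name maps to different concepts in different classifications.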

Implementation options (JK): use Prometheus as a database to serve concepts in the SEEK schema; use Prometheus viz tools to read concepts out of one or more concept data servers.

Discussion notes below from both Wednesday and Thursday, juxtaposed where appropriate

Wed: Discussion of IDs for concepts and concept instances. JK: we should not use usage as the basis for a unique concept ID; if there is no additional information then the identity is unknowable, and we can't use that data.

We have to decide what level of concept we want to capture: do institutions have concepts, do individuals have concepts? Or do we want to say we don't know what it is, so all we can say is that it falls into the unknown concept bucket. So, collection catalogs have classifications associated with them, but we assume that the concepts are not new, they are simply uses of existing concepts from unknown sources.

Thurs: the distinction of what is acceptable/workable for SEEK is whether the classification and concepts are explicitly defined or implicit. Usage of concepts, in an implicit classification, does not "create" new concepts. Alternative classifications, non-scientific classifications, special purpose classifications, amateur birder classifications, if explicitly created and defined, are OK.

MS: what about the formal rules of classifications? Do they constrain what is acceptable for the architecture? BP: no, we are not going to constrain concept creation and traffic to a restricted subset of authors. But the sources must be explicit taxonomies.

Wed: Where do we draw the line on what we accept as a new concept? Weak concepts might be species field names, Melastome #1, Melastome #2, moth "A", "big brown moth B" etc. But what about a box of insects with no name, or a specimen with just a catalog number? (Hannu argues that we need to accommodate these, but are they implicit or explicit?)

JK: if concepts are well described, regardless of the source, then they should be considered concepts.
BP: informal names from field studies are not concepts unless they are explicitly authored by that author or someone else, "big brown moth" does not ascend to concept status merely by its use in a field data set, JK agrees.

BP: Revised: anytime anyone uses a name on a permanent record, e.g. a specimen, then it is a concept and we should be able to handle it. The concept has to be explicitly authored to be recognized. Someone has to put those into SEEK concept schema, e.g. Gentry's concept of "green plant" according to Bob Peet. JK: this is a classification of Peet's -- fine.

MJ: one of the best times to do concept mappings is when researchers are trying to do data integration with two or more datasets. Someone has an interest in doing that mapping for a particular analysis. Mappings of data set field names, should not be automatically done.

BP, SG: all names should be tagged with an explicit concept by authors of incoming data sets. If the name does not already exist in the federation, there should be no automatic concept generation, the policy should be that the author of the data set must register the name by linking it to a new or existing concept.

MJ: I see lots of field names "A" "B", we have to allow those data sets to exist in the SEEK architecture without forcing the author to map the field names to formal names. (BP: they must make the mappings to be stored in SEEK otherwise they are useless and not wanted.) MJ: A data set that is not completely mapped to formal classifications is NOT useless.

Summary Thurs: We are not going to create concepts in the SEEK federation implicitly; we are not going to scan data sets and create concepts or mappings from field data.
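
A minimal sketch of the "no implicit concepts" policy as BP and SG stated it: a name enters the federation only when the data set author explicitly links it to a new or existing concept, and ingestion never invents a concept for an unknown name. The class, method and exception names are hypothetical. MJ's objection (partly mapped data sets are still useful) could be served by reporting the unregistered names instead of rejecting the data set; that variant is not shown.

```python
class UnregisteredNameError(Exception):
    """Raised when a data set uses names nobody has linked to a concept."""


class ConceptRegistry:
    def __init__(self):
        self._concept_by_name = {}

    def register(self, name, concept_id):
        """Explicit, author-driven link of a name to a new or existing concept."""
        self._concept_by_name[name] = concept_id

    def ingest(self, dataset_names):
        """Map names to concepts; never create a concept automatically."""
        missing = sorted(n for n in set(dataset_names)
                         if n not in self._concept_by_name)
        if missing:
            raise UnregisteredNameError(", ".join(missing))
        return {n: self._concept_by_name[n] for n in dataset_names}


registry = ConceptRegistry()
registry.register("moth A", "concept:1")   # hypothetical concept id
mapped = registry.ingest(["moth A"])
```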

Wed: JK: it is a concept but of no use, an informationless label is meaningless and not useful, this is not accepting a new concept, this is an application of a name to some unknowable object.

DS: this is a policy decision, the data model should accommodate all potential uses of names. E.g. the data model should be able to handle a concept which is just a name in use, applied in some way, by somebody.

SG: we should accommodate all kinds of concepts including "small black moth" and

Susan: What about specimen or observation data in ecological data sets with no names? There will be these pathological issues with some data, we should work with the data that follows the rules, and not worry about things without names.

Trevor: but without names there is very limited value in having the data.

Thursday, May 15, 2003 AM 8:30, SDSC

Present: Kennedy, Huddleston, Saarenmaa, Schildhauer, Jones, Stewart, Vieglais, Pereira, Trevor, Gauch, Stockwell, Ludaescher

Agenda

Brief review (Aimee Stewart)
Review of remaining agenda items

Discussion of the Metamodel for concept classifications: it simply shows how similar the different classes of models are; it does not suggest how they might be reconciled, but how the data can be accommodated.

Notes are intercalated above from the Thursday morning discussion.

Break 10:30-10:50

List of priorities
Complete the specification
Identification and schedule of tasks

multiple classification architecture

implement (model, specify, design) functions to satisfy SEEK primary use case

implement a service to take a name and return a taxonomic concept object
implementation will be in the context of multiple data providers with different classification models and different classification data

Define a data exchange federation schema which may be congruent with the SEEK taxonomic data repository structure.
Identify/adopt a query/response protocol. (Operations: search, retrieve, scan)
Extensions to EML (EML record could be a description of the collection at a high level.) How is the taxonomic data going to be marked up when the collection data is put in?
API for returning probability values for similarity between the query statement and the EML metadata value describing the taxonomic holdings of the data set. A search on "roses" in a database labeled "plants" would return a lower similarity value than a query for "Rosa" on a database tagged with "Rosa". Option: take a DIGIR result set and make it into an EML document, say from a DIGIR scan query.
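
A minimal sketch of such a similarity value, reproducing only the ordering in the roses/Rosa example. The scoring rule (1 divided by 1 plus the number of ranks between the queried taxon and the data set's tag), the toy lineage, the synonym table and the function names are all assumptions; the result is a score between 0 and 1, not a calibrated probability.

```python
# Toy lineage, highest rank first; only the ranks needed for the example.
LINEAGE = {"Rosa": ["Plantae", "Rosales", "Rosaceae", "Rosa"]}

# Hypothetical common-word lookup so that "roses" and "plants" can be matched.
SYNONYMS = {"roses": "Rosa", "plants": "Plantae"}


def normalise(term):
    return SYNONYMS.get(term.lower(), term)


def similarity(query, dataset_tag):
    """Score how well a data set's taxonomic tag matches the queried taxon."""
    taxon = normalise(query)
    tag = normalise(dataset_tag)
    lineage = LINEAGE.get(taxon, [taxon])
    if tag not in lineage:
        return 0.0
    # Number of ranks between the tag and the queried taxon.
    steps = len(lineage) - 1 - lineage.index(tag)
    return 1.0 / (1 + steps)


broad = similarity("roses", "plants")   # tag is far above the queried taxon
exact = similarity("Rosa", "Rosa")      # tag is the queried taxon itself
```

A broadly tagged data set still matches, but scores lower than one tagged at the queried rank, which is the behaviour the example asks for.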

Discussion of EML and DiGIR and museum databases.
How do we use EML for museum database queries?

DV: To participate in Ecogrid, we will insist that data providers have Ecogrid interfaces.

The discovery of data sets will require query to an EML interface.
DIGIR will have an EML interface.
Query language for Ecogrid, allows a query on "all organisms with x property", this will return different kinds of data objects, museum data sets, ecological data sets, etc. Then you need to further process the result set to identify the subset you want. The user query might be complex where you have multiple underlying schema underneath the Ecogrid interfaces. If you have 100 different museum database schema, then mapping the ecogrid query to 100 schema is tough, that is why we will use DiGIR!
A subset of Darwin Core elements should be transposable to EML and vice-versa.
Data sets will need to be taxonomically described in some way, either statically or dynamically in EML or in Darwin Core schema.

Prior to lunch Susan and Jessie were asked to work on diagrams of the architecture as they saw it.

Lunch 12:00-1:30 PM

Post Lunch
Susan Gauch architecture diagram
Jessie Kennedy Architecture diagram
discussion of options for expansion, relevancy assessment
MJ: Collaboration diagrams are useful from this point, to trace single paths through the system

TWG Development Tasks to enable the Fringed Filefish Concept Retrieval Demonstration with the SEEK architecture.

Add Concepts to the Concept Database

Define Concept Schema (PH: has ITIS US Schema for this in CVS, BP e-mailed draft schema 5/15/03, TDWG-Berendsohn-ABCD schema, vegbank model.) DRAFT XML schema to represent a taxonomic concept (DV) (JK will complete her survey to complete the metamodel)
Create Tools to populate (grab data from ITIS, AS/KU)
Policies to control population (postponed to later)
Match User Query Taxonomic Criteria to Concepts in Database
Define User Query Schema (may be extended beyond ecogrid generic schema, Susan based on concept schema)
Create/Test Initial Matching Algorithm (AS consulting with SG, JK)
Define API (DV)(Dave Vieglais has notes for services in a CVS document taxonservice.txt: http://cvs.ecoinformatics.org/cvs/cvsweb.cgi/~checkout~/seek/projects/taxon/docs/taxonservice.html )
Define Site Taxonomic Schema in EML. We will hack the mapping to the internal SEEK schema as a first cut, then decide whether to modify EML or make a SEEKEML in the future to fix the mapping mismatches. What are the first resources to mark up? Someone should contact BEAM about which data sets we should use for this demonstration; Bill M queried the modelers for this information (JB will contact Michener)
Create Tools to Populate Site MetaData Cache AS
Create/Test Initial Matching Algorithm (AS with SG, JK)
Define API (DV)
Determine timing and take action on other 50% KU TWG Programmer

Task Schedule
Date - Task

May 26  - 1a strawman 
June 2  - 2 draft, 6
June 9  - 10
June 16 - 1b, draft
June 18 Wednesday - conference call, 10AM US CDT
June 23 - 3+6; 5+9 draft
June 30 - NSF reporting deadline for previous year 
July 7  - 2 final
July 14 - 7 final
July 16 - Wednesday - conference call 10AM US CDT
July 21 - 
July 28 - 4+8 initial

Taxon WG Human Resources

1 developer KU

Split 50% time Aimee Stewart
50% another person

2 full-time postdocs UNC (now hiring)
1 full-time informatics postdoc/programmer Taxon WG KU (new staff position to be advertised, 4yr)
1 full-time undergraduate CS student with Susan for the summer (Robert Gales) part time in the Fall
1 half-time research assistant Napier (existing RAs)
years 2-3-4 100% time graduate student (research orientation) possibly Trevor
1/6 of Dave V
1/6 Susan
1/4 Jim
1/12 Jessie

Original Agenda

AGENDA Tuesday May 13 Afternoon:

Review Draft Use Cases from Jan 30 notes, WITH ADDS from WG members before the meeting. Quickly review current approaches and literature on classification concept mapping and retrieval

Beach: Beach, Pramanik and Beaman model for classification concepts (Taxon), James Ytow's paper
Kennedy: Prometheus, Perobase, Spice model, contribute to Berendsohn's and Ytow's models. Who wants to do other GBIF name approaches?
Peet: VegBank model and relationship to NatureServe's Biotics model and Berendsohn's model. Overview of functional requirements of SEEK taxon architecture, five year vision, what SEEK needs, what everyone else needs, how much overlap?

Wednesday May 14

Morning:

Go through literature, mini-reviews (continued, if needed) From UPDATED Use Case list, identify FR and primary deliverables (PD) for Year 1 objectives Identify FR and PD for Year 3 objectives Identify FR and PD for Year 5 objectives

Afternoon:

Collaboration relationships with other projects: our own software overlap, e.g. Prometheus, ITIS, Vegbank, BIOTICS; requirements overlap for services, data, objectives, etc.; data overlap, e.g. ITIS; Specify Project will be a DIGIR source of multiple classifications of collection catalogs. Other community projects, e.g. GBIF Octopus

Thursday May 15

Morning:

Briefly review any unresolved issues with Use Cases, Functional Requirements, Primary Deliverables, Staging, Staffing. Begin development of project deliverables, software and publications

Afternoon:

Development activities continued future planning, meetings (Seek, side and outreach), new hires, next steps

Attachments:

Taxonomic_Concept_Models.doc		51712 bytes
taxonomic_concept.ppt		78336 bytes


This page last changed on 07-Jul-2004 10:06:11 PDT by LTER.stekell.