Jessie et al., this page contains questions and comments from Napier (and a few from KU) and KU responses, each preceded by the speaker's name. AtomicNames ;).

KU: I think one of the basic misunderstandings concerns the original purpose of the TES, which was to let providers transfer data into and out of the SEEK Taxon Cache. The TES should allow data providers to submit data to the SEEK Taxon Cache. It should also allow the transmission of data using the GUIDs that we are proposing for the community. This is one scenario we are trying to provide a better option for. The existing Napier schema does not allow the transfer of existing elements, or the submission of new elements that use existing elements, when entering data into SEEK.

Napier: The function of ExistingSomeElement is not clear. From the telco it appears it is an empty tag containing a GUID. There are no references to any of the items from the local file. To me this means the maximum you get there is a list of GUIDs – possibly as the result of the query. We can’t think of a scenario where this would be useful in the context of a transfer schema. What are we overlooking?

KU: Are we all agreed that GUIDs would be beneficial not just within SEEK but within the taxonomy community as a whole? If so, there should be some way of transmitting just a GUID through the TES. The Napier schema does not currently support this; it requires the full definition of everything to transmit data from one service to another. So if GUIDs were used with the Napier schema, each service would have to resolve the definitions to a GUID, rather than the other way around. The problem here is that a set of data is highly unlikely to resolve to one particular GUID; it will resolve to a set of GUIDs with different weightings. What is the point of the GUID if it cannot be used to uniquely identify a particular concept with no other information needed?
If publications had DOIs, for example, which would be more beneficial for the publishing community: to get a DOI and resolve it to Charles Darwin – The Origin of Species, or to get “Charles Darwin – The Origin of Species” and resolve that to a DOI? Getting a list of GUIDs is the minimum that one would receive from the system, not the maximum. It depends on the situation: a concept would carry only an ExistingSomeElement tag if it already had a GUID, in which case there is no need to define it fully; if it did not, the full definition would be required. The reason for distinguishing between existing and fully defined concepts is that we wanted to make it impossible to use a GUID with invalid information. ExistingAtomicTaxon can be referenced in any place that AtomicTaxon can be referenced. This is useful when creating TaxonConcepts using AtomicTaxa already present in the Taxon Cache. One place this would be useful, in the schema we propose and possibly in the newer versions of the Napier schema, would be to say “Concept 53 is congruent to Concept 54” if those concepts already had GUIDs. This is not possible from what I can see of the Napier schema. You would have to define concepts local53 and local54 in the document, i.e. fill in each one's name, according to, circumscriptions, etc. After that you would have to create the relationship between the local IDs of those concepts. On the other end, the recipients would get that information, be required to resolve the data from concept local53 to GUID 53 and from concept local54 to GUID 54, and then create the relationship. This assumes that GUIDs are adopted.

Napier: A design principle of the TDWG TaxonConcept transfer schema is broken by having the top-level containers now optional (and the contained elements now compulsory). This is the opposite of the TDWG situation. Is there a reason? Is it important?

KU: What are the design principles you are referencing?
Some optional top-level elements are those that can be referenced elsewhere in the document: Repositories, Vouchers, Publications, AccordingTos. Each item that can be referenced is encountered before the reference to it, to simplify parsing. Repositories can be referenced in Vouchers; Publications can be referenced in AccordingTos; AccordingTos can be referenced in AtomicTaxa, Relationships, and TaxonConcepts; and Vouchers can be referenced in AtomicTaxa. AtomicTaxa (required) can be referenced in the optional TaxonConcepts and Relationships. The draft schema is just that, a draft; there will certainly be some things that need to be changed. We are now thinking that the required fields should be AtomicTaxa or TaxonConcepts or both.

Napier: TaxonConcepts has been renamed to Concepts. Why? To separate Name 'concepts' (AtomicConcepts) from TaxonConcepts?

KU: You appear to be looking at the first version of our TES_draft; the second has the concept section with an element name of TaxonConcepts.

Napier: With the removal of TaxonCircumscription and ConceptCircumscription, how do you cater for concepts that include these in the description? We perceive omitting information but attaching an AccordingTo as an incorrect representation. While the info would be in the database, it wouldn’t get a (concept) GUID, right?

KU: TaxonCircumscription is still present, in that concepts are defined by a group of atomic taxa and the relationships they have with one another. If by ConceptCircumscription you are referring to concept synonymies, then we have a different viewpoint on whether they are a defining characteristic of a concept or a relationship between self-contained entities called concepts. I don't see how it is an incorrect representation. Say I publish a paper that describes Pan but adds a species to it. In that paper I state that the concept I'm defining as Pan overlaps with ITIS' Pan.
By that statement in the paper I am not saying that the concept of Pan I'm describing is partly defined by the fact that it overlaps with ITIS' Pan. What I feel a treatment of that nature is saying is Concept X sec Gales > Concept Y sec ITIS, where the relationship is also sec Gales. So there are three unique entities here: 1) Concept X, 2) the relationship, and 3) Concept Y, all three of which would get a GUID. A TaxonConcept would be described in the TaxonConcepts section. Yes, it would get a TaxonConcept GUID. The TaxonConcept includes AtomicTaxon and AtomicTaxonRelationships. Hierarchy (TaxonCircumscription) is one of the enumerated types available for AtomicTaxonRelationships. The AtomicTaxon would be described in the AtomicTaxa section (and get a GUID if it was new), then referenced in the TaxonConcept. The AtomicTaxonRelationships would be described in the Relationships section.

Napier: The structure of the Relationship container is confusing. Nomenclatural relationships are required and Conceptual ones optional, so that in order to express a concept relationship it would appear necessary to express a (possibly redundant) name relationship.

KU: Again, you must have been looking at the first CVS checkin and not the second; Nomenclatural relationships were replaced with AtomicTaxon relationships. Again, there are certainly some things that we need to think about; one that we realize we need to reconsider is which elements are required. We've been discussing making either a single atomic taxon relationship or a single conceptual relationship required, given that a relationship section is present.

Napier: Based on our (possibly incorrect) interpretation of the schema, here are some more questions of principle:

Role of the TES

Napier: We understand that TES is for internal SEEK use; it is a conceptual schema – not an implementation schema – representing the structure of the SEEK Taxon database, and/or the transfer document to exchange data. Is this correct?

KU: We created the schema to open a dialog for discussing changes that we, as the only group to attempt an implementation of the Napier schema, see as beneficial to the taxonomic community as a whole. It is definitely not the schema for our database implementation; however, they are intimately tied together.

Napier: Our TDWG TCS is a global (inclusive) schema for data exchange between any data providers, who will map their data to this schema and provide data output in XML format matching this XSD.

KU: We see the same goal with the TES schema we are proposing.

Napier: TES should therefore be able to map to the TDWG TCS (and vice versa). If this is possible, there is no problem; this is the central issue. One obvious difficulty would be the normalisation of data represented in TES. It may not be possible to reliably or perfectly normalize data from TDWG TCS providers, and data could be lost or corrupted in each mapping.

KU: It would be nice for us to get datasets for conceptual models you think we would have problems mapping data into, because we feel that we could represent any of the models with the schema we are proposing as well. There are some differences, certainly. For example, the Napier schema has synonymy as part of the definition of a concept, while we have it as a relationship between concepts. So there are some semantic differences. If a synonymy is part of the definition and we have concept A -> concept B -> concept C, then with the Napier model concept A would include concept B by virtue of the synonymy relationship. Likewise B would include concept C, so “get concept A” would return all three. If synonymy is a relationship between concepts, then “get concept A” returns only concept A, and “get concept A's synonymies” would return only B.
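
The difference between the two readings of the A -> B -> C example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the in-memory `SYNONYMS` table are hypothetical and are not part of the TES or of the Napier schema.

```python
# Hypothetical synonymy links taken from the example: A -> B -> C.
SYNONYMS = {"A": ["B"], "B": ["C"], "C": []}


def get_concept_embedded(concept_id, synonyms=SYNONYMS):
    """Synonymy as part of the definition: fetching a concept pulls in
    its synonyms, their synonyms, and so on (a transitive closure)."""
    seen = []
    stack = [concept_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles in the synonymy links
        seen.append(current)
        stack.extend(reversed(synonyms.get(current, [])))
    return seen


def get_concept_standalone(concept_id):
    """Synonymy as a separate relationship: the concept is returned alone."""
    return [concept_id]


def get_synonymies(concept_id, synonyms=SYNONYMS):
    """Direct synonymy relationships only, with no recursion."""
    return list(synonyms.get(concept_id, []))


print(get_concept_embedded("A"))    # all three concepts
print(get_concept_standalone("A"))  # only A
print(get_synonymies("A"))          # only B
```

The first function needs a cycle guard and has no natural stopping point other than exhausting the links, which is the recursion concern raised in the discussion.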
Perhaps we need clarification here. Perhaps the Napier schema has an additional assumption that the surrounding concept is really just one end point, in which case you are saying the same thing as we are but representing it differently. With the newer schemas on the Wiki, there seems to be support for the addition of a relationship without making a claim about either of the concepts at the endpoints. If this is the case, that it is possible to add a relationship without saying anything about the concepts, then a relationship cannot be part of the definition of a concept, but is something independent between self-contained entities. If the TES we are proposing is viewed as a conceptualization itself, and it seems that not only we but others agree that we need to be able to pass around taxonomic data that is not at the conceptual level, then the Napier schema does not support this without creating artificial concepts for what we call the AtomicTaxon.

Napier: Are we agreed that TES is SEEK internal and TDWG TCS global?

KU: No, we are not in agreement; we see the TES as global as well.

Napier: Who externally would see or use the TES? Is it necessary to expose it externally? What is the TES actually going to be used for, i.e. is data ever going to be provided in an XML document defined by the TES xsd, or is it just a representation of the data in SEEK Taxon database systems?

KU: These questions are relevant only if the TES is internal only.

GUIDs

Napier: As the result of the normalisation there are more elements that can potentially have GUIDs. In fact you seem to suggest all elements with an ExistingSomeElement will have GUIDs. Is this really useful? AccordingTos especially seem not to be sufficiently different from publications to warrant that.

KU: There would definitely be more elements that have GUIDs. In fact, that is part of the whole reason we did this.
We feel there is merit to having some of these smaller elements stable and uniquely identifiable with a GUID. AccordingTo is an element that can be referenced by an internal id, not necessarily a GUID. Publication may have a GUID, but one that is created by another system, such as the library DOI system. Only elements starting with Existing are elements that we are currently proposing to have GUIDs. The one that we feel most strongly about is the AtomicTaxon, as we feel it is the component that could be reused the most by many concepts. We also think that it is a useful element to be able to return from a query with a globally unique ID for reference elsewhere.

Napier: There are as yet no 'central' GUID authorities for names, people, institutions, references etc., so SEEK would have to provide all of these. Would these be externalised and available for outside users/providers?

KU: Of course, someone has to take the lead, right? This is something we have discussed in our SEEK Taxon meetings repeatedly. While we may initiate the use of GUIDs, I don't think anyone is suggesting that creation, distribution, and maintenance of GUIDs is a long-term responsibility of SEEK. Some neutral third-party organization would be a better fit for that role. But until someone starts using these elements with GUIDs, no one will ever see the benefit of using them.

Napier: Who creates, provides, maintains GUIDs?

KU: I'm not sure why this question is relevant to the schema. We have discussed the use of GUIDs many times in SEEK Taxon meetings, and I thought we were in agreement that this is the course to take. The question of “who” is less important to us as implementors. If GUIDs will be used, as we all agree they should, it should be possible to use them when transferring data from provider to provider. If there is no mechanism for that in the transfer schema, what is the point of using GUIDs at all?

Napier: Do they represent an obtainable resource?

KU: Isn't that the definition of a GUID? A unique identifier that is a handle to a piece of data?

Napier: Are SEEK GUIDs exposed to and usable by the external world?

KU: See above.

AtomicTaxa

Napier: As far as we understand, they are introduced to allow separate representation of names. (Napier thinks that names can be sufficiently represented within concepts.) Names can have an AccordingTo, and it is not clear how this would work: would it be the original use of a name or a specific use? (Napier would treat each of these as concepts.) Furthermore, the AtomicTaxon is given a status field. Would this change over time, and who decides it (an original name may be valid at the time it is created and then declared non-valid later)?

KU: There is a misunderstanding here. We created the AtomicTaxon so that there could be different representations of TaxonConcepts that use the same AtomicTaxon. For example, ITIS has Pan troglodytes defined by Blumenbach in 1775. So we would have an AtomicTaxon with name = Pan troglodytes, reference Blumenbach (1775); let's say that has a GUID of 17. Now the concept (i.e. the position in ITIS' tree with all its relationships) would be, say, concept 20 sec. ITIS with AtomicTaxon 17. Now if Species2000 has Pan troglodytes (Blumenbach, 1775) as well, its concept containing all the atomic taxon relationships as defined by Species2000 would be, say, concept 21 sec. Species2000 with AtomicTaxon 17. This is where we see the biggest benefit in the reuse. Additionally, if someone queries the system for Pan troglodytes, AtomicTaxon 17 may be all they really want. If they are a taxonomist and want to see all the relationships defined by different people using Pan troglodytes, they can also see that. But Pan troglodytes (AtomicTaxon 17) can be referenced uniquely, without the necessity of discerning the (possibly obscure) differences in relationships defined by different data providers. The according to on the AtomicTaxon would be a specific use of that name (i.e. if there was a new publication by Gales in 2004 that added information to Pan troglodytes (Blumenbach, 1775), there would be a new AtomicTaxon with author Blumenbach 1775 sec Gales 2004 with the additional information). The status field is something that we feel definitely needs to be present, though how it is represented needs to be ironed out once we discuss with some taxonomists how status is used, assigned, etc.

Napier: Apart from reusability of Names, what explicit information cannot be captured with Concepts (containing Names), such that it requires the introduction of AtomicTaxa?

KU: I personally feel that going this route has the disadvantage of bloated numbers of concepts with little to no information, just for the storage of names. So it seems that a possibility with this, as Nico once described, is that we are creating more concepts than there are ideas. I also think that there is a need to be able to pass around non-conceptual taxonomic information, i.e. AtomicTaxa and relationships between those AtomicTaxa, without necessarily being concerned with concepts. I also see a need for a taxonomist to use an AtomicTaxon to build upon without unnecessarily creating new concepts just to create the relationships they need to represent their world view.

Relationships

Napier: TES has removed relationships from the definition of concepts (which is provided in TDWG TCS to allow an immutable record of what the original concept expressed). TES allows relationships to be recorded between two concepts AccordingTo some author (in response to discussions at the Edinburgh meeting, TDWG TCS has now added the ability to express 'third party' relationships). The original relationships recorded as part of a concept would be discoverable, but are not directly represented in the model of a concept. Relationships are also allowed between AtomicTaxa (i.e. names) in the TES. What would these be, and could they not be represented by relationships between TaxonConcepts?

KU: I think we definitely view concepts differently in this respect. We do not see the relationships (at least synonymy) as a defining characteristic of a concept; see above for the reasons. The relationships we currently see between AtomicTaxa are homotypic synonymy, heterotypic synonymy, and hierarchical relationships. We feel a concept is a particular person's point of view of a group of AtomicTaxa and the relationships between them at a given point in history, creating a single entity that can be identified. We view synonymy relationships (using Nico's language) as relationships between these entities, while the Napier schema views synonymy relationships as part of the defining characteristics of a concept. That creates a recursive structure which, without arbitrary stopping rules, would force a concept to include all its synonyms, all its synonyms' synonyms, and so on. This is the problem we have always had with representing relationships as part of the definition of the concept itself.

Napier: Are relationships between AtomicTaxa (Names) necessary (and what are they)?

KU: We do feel that there are some relationships that are better suited between atomic taxa than between concepts. For us, currently, these are homotypic synonymy, heterotypic synonymy and hierarchical relationships, though we do not feel strongly that these are the only relevant ones.

Napier: Is it necessary/useful to separate original relationships out of Concepts (and is anything lost in this process)?

KU: We don't feel anything is lost. A relationship is a link between two well-defined entities, neither of which is defined by that link. For example, the data within a node in a binary tree can be moved because the relationships are not part of the definition of that data. The Napier schema seems to be taking a different approach: the link itself is part of what defines the data within a node.
From the telecons, though, it seems this is what you actually believe: that these relationships are static and, once defined, cannot change or be repositioned without defining a new concept.

Napier: From reading Nico's commentary it is not clear that his interpretation of the TES structure is compatible/identical with that of the Kansas team.

KU: Any perceived differences are probably due to the loaded words we are all using.

Napier: Our understanding of where Aimee/RobertG were coming from was a desire to achieve re-use of data (1) via the introduction of GUIDs (or at least local implementations of GUIDs until true GUIDs were available) and (2) by atomizing a core reusable 'thing', which would be based on a name and optional AccordingTo citation and which now has optional specimen and character circumscriptions. This AtomicTaxon does not hold relationships to other AtomicTaxa; rather, a TaxonConcept holds a reference to an AtomicTaxon and may hold relationships between Atomic Taxa (we assume one of the referenced Atomic Taxa must be the same as the Taxon Concept’s Atomic Taxon. Is this correct?).

KU: There is a “root” AtomicTaxon, which is the required element in a TaxonConcept. It is the “from” part of all the relationships in that TaxonConcept.

Napier: In addition, further relations can be expressed between Taxon Concepts but not as part of the definition of a taxon concept. Reading Nico’s notes, he describes an AtomicTaxon as a name sec someone, i.e. a concept. This is the same as what we see taxon concepts to be: a name according to someone, i.e. as described in that publication. When we asked which according to went with the Atomic Taxon, no one knew, which is clearly an issue, but from Nico’s notes it sounds like the According To is the publication by someone in which the name is described.

KU: Yes, an AtomicTaxon is a Name, an AccordingTo, and optional other elements. Relationships are not a part of the AtomicTaxon, so it is not the same as your “concept”.
This AtomicTaxon gets a GUID. Yes, the AccordingTo is from the publication. If the publication describes any relationships, they are entered separately, with the same AccordingTo. In this case, the TaxonConcept (which also gets a GUID) is a combination of the AtomicTaxon and its relationships, all with the same AccordingTo. Carving up the TaxonConcept into its smaller elements allows our system to show similarities as well as differences.

Napier: Let’s digress a moment to look at GUIDs and keys. One of the main reasons we believe in using GUIDs is to eliminate the uncertainty about which concept anyone is talking about, so that when you pass a GUID for a concept you can resolve that GUID somehow and you will know which concept it is. In this sense a GUID is like a surrogate key or object identifier which is system controlled, not user defined, and which replaces a user-defined primary key. In this case the user-defined key would be the name and according to (for atomic taxa as per Nico’s description of Atomic Taxa, and for the TDWG Taxon Concept). The problem is that name and according to are often represented differently when referring to the same thing. Therefore they don’t work well as primary keys: they are user defined and not consistent, which causes problems when matching. So if name and according to act as the primary key, then each “unique” combination of name and according to will be a different concept (or atomic taxon), the only issue being when the name+according to combination is unique (this is the problem for GUID allocation that I mentioned on the tel. call).

KU: The issue about the Name/AccordingTo being an imperfect primary key is a problem no matter how we represent the data. We must accept that we will have messy data, and try to resolve similarities and differences programmatically.

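
As an illustration of the asymmetry discussed here, the sketch below contrasts resolving a GUID (one exact lookup) with resolving a free-text name plus according to (a ranked list of candidate GUIDs with weightings). It is a minimal sketch, not the SEEK implementation: the `REGISTRY` contents, the GUID strings, the `normalise` rules and the 0.6 threshold are all assumptions made for the example, and the similarity measure is simply the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher

# Hypothetical registry: GUID -> (name, according_to). The GUID strings
# and the in-memory storage are assumptions for illustration only.
REGISTRY = {
    "guid-17": ("Pan troglodytes", "Blumenbach 1775"),
    "guid-99": ("Pan troglodytes", "Gales 2004"),
}


def normalise(text):
    """Lower-case and drop punctuation that commonly varies between sources."""
    for char in ",()":
        text = text.replace(char, " ")
    return " ".join(text.lower().split())


def candidate_guids(name, according_to, registry=REGISTRY, threshold=0.6):
    """Resolve messy name + according-to text to weighted GUID candidates."""
    query = normalise(f"{name} {according_to}")
    scored = []
    for guid, (reg_name, reg_sec) in registry.items():
        target = normalise(f"{reg_name} {reg_sec}")
        score = SequenceMatcher(None, query, target).ratio()
        if score >= threshold:
            scored.append((guid, round(score, 3)))
    # Best match first; the caller still has to decide which one is meant.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def resolve_guid(guid, registry=REGISTRY):
    """Resolving a GUID is a single unambiguous lookup."""
    return registry[guid]


print(candidate_guids("Pan troglodytes", "(Blumenbach, 1775)"))
print(resolve_guid("guid-17"))
```

The text-to-GUID direction always returns a list that someone must adjudicate, while the GUID-to-record direction does not, which is the point both sides make about user-defined keys.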
Napier: So, given that the primary key of an entity determines its attributes, it doesn’t matter whether or not you include relationships, or specimens, or character circumscriptions; we’re always talking about the same entity, and the attributes should all be determined by the primary key and hence the GUID. Therefore the stability/re-use that Kansas wants is not provided by removing relationships from the TDWG view of a Taxon Concept. You will still have the same number of atomic taxa as concepts published and therefore need the same number of GUIDs, one for each name+according to combination. This assumes Nico’s interpretation of Atomic Taxa as being name sec. someone, which we think might not be the same as Kansas’ but is the same as the TDWG Taxon Concept.

KU: Here is a very basic misunderstanding. Since we do not include the relationships in the AtomicTaxon, the entity we are referring to when referencing the GUID of an AtomicTaxon also does not include the relationships. We believe that other taxonomists have used and will use the initial AtomicTaxon as a jumping-off point, from which they will define full concepts with Relationships different from the original author's. The stability comes in recognizing previous work and building on it.

Napier: So, we are not convinced that this AtomicTaxon is necessarily any more 'stable' than the TDWG style Taxon Concept. As AtomicTaxon has an optional AccordingTo (i.e. implicitly records an opinion sec. an author, whether or not this is explicitly held in the database as circumscription information), AtomicTaxa are to all intents and purposes what we would see as an immutable taxon concept: any new use or revised definition of such a concept would be according to a new author.

KU: An AtomicTaxon is only an immutable AtomicTaxon. It is not an immutable TaxonConcept. A new TaxonConcept (with a new author) can be defined using the existing AtomicTaxon (and its author) and a different TaxonCircumscription (i.e. hierarchical relationships with the new author).

Napier: One AT could be shared/reused amongst different TCs, allowing different relationships to be expressed, but this could also be done by having a TDWG Taxon Concept refer to other TDWG Taxon Concepts (do you see a difference?). However, any change in the other constituents of the AT (i.e. circumscriptions) would create a new AT. If there is a reason we are not quite getting, then is it logical to hive off only relationships if stability is the rationale? Why not hive off specimens and characters? Except then Atomic Taxa wouldn’t be concepts as per Nico’s understanding.

KU: A new AtomicTaxon is created when its elements (AccordingTo, CharacterCircumscription; not hierarchical relationships, i.e. TaxonCircumscription, or synonymies) change. Any new relationships defined using it create a new TaxonConcept with a new AccordingTo (probably required). Should we hive off specimens and characters? Specimens seem like a logical choice: treat them as a special case of TaxonCircumscription.

Napier: Nico's interpretation of the AtomicTaxaRelationships that can be held in a Taxon Concept would seem to be that they are restricted to vertical (parent/child) relationships within one classification. However, the enumerated types of this relationship include the interclassification/hierarchy relationships homotypic and heterotypic synonymy, so again there is a misunderstanding somewhere.

KU: Yes, the AtomicTaxaRelationships are mostly vertical. We would like to revisit homotypic and heterotypic synonymy.

Napier: Nico gives a benefit of separating out AtomicTaxa from relationships in that they then contain no hierarchical information. But what sort of taxonomic concept really has no hierarchical information? Do people not think of taxon concepts as being composed of other taxon concepts?

KU: The TaxonConcept contains hierarchical information while the AtomicTaxon does not.
We believe the AtomicTaxon is a useful element to be able to reference unambiguously in and of itself.

Napier: How would the multiple AccordingTo assertions for Atomic Taxa, Taxon Concepts, ATRelationships and ConceptRelations relate, or would they? Would they be duplicated? Could a TaxonConcept have different AccordingTo information than its Atomic Taxon? What would it mean if the AccordingTo was the same for these? How would original relationships asserted as part of the definition of an original Concept be linked?

KU: The AccordingTos could be duplicated. An AtomicTaxon could have a different AccordingTo than some of the TaxonConcepts it belongs to. Original definitions which contained both the AtomicTaxon definition and the relationships present in the TaxonConcept would all have the same AccordingTo.

Napier: Let’s digress to looking at reuse for a moment. From a database perspective, normalisation of data is good for removing redundancy and thereby giving reuse when a primary/foreign key is introduced rather than duplicating the data. This removes problems of propagating updates and inconsistencies in the data, but does make it less efficient for querying in many cases. Full normalisation also assumes that there will be lots of updates to the data. However, if we are recording concepts as defined in publications, then these concepts are fixed as they were defined in the publication and are therefore not subject to updates as such (once they have been created and checked), or at least not to updates that would warrant changes to the concept, or not to so many that changes could not be propagated in batch mode if necessary.

KU: We are recording TaxonConcepts as defined in publications (yes, they are fixed), but the AtomicTaxa are also defined in the publication, and we are extracting each as a separate entity with a separate GUID that may be referenced elsewhere. Another TaxonConcept may use that AtomicTaxon definition in its TaxonConcept definition.

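
The reuse KU describes can be sketched as a small data model, using the Pan troglodytes example from earlier in the discussion (AtomicTaxon 17, concept 20 sec. ITIS, concept 21 sec. Species2000). This is a hedged illustration, not the TES itself: the class and field names, the `concepts_using` helper, and the parent GUID strings in the relationships are hypothetical. The check in `__post_init__` encodes KU's statement that the root AtomicTaxon is the “from” part of all relationships in a TaxonConcept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomicTaxon:
    guid: str
    name: str
    according_to: str  # no relationships are stored here


@dataclass(frozen=True)
class Relationship:
    from_guid: str     # GUID of the root AtomicTaxon
    to_guid: str       # GUID of the related AtomicTaxon
    kind: str          # e.g. "hierarchy"
    according_to: str


@dataclass(frozen=True)
class TaxonConcept:
    guid: str
    according_to: str
    root: AtomicTaxon
    relationships: tuple = ()

    def __post_init__(self):
        # The root AtomicTaxon is the "from" side of every relationship.
        for rel in self.relationships:
            if rel.from_guid != self.root.guid:
                raise ValueError("relationship must start at the root AtomicTaxon")


def concepts_using(atomic_guid, concepts):
    """List the GUIDs of all TaxonConcepts built on one AtomicTaxon."""
    return [c.guid for c in concepts if c.root.guid == atomic_guid]


pan_troglodytes = AtomicTaxon("17", "Pan troglodytes", "Blumenbach 1775")

# Two concepts by different authorities reuse the same AtomicTaxon.
# The parent GUIDs below are placeholders invented for this sketch.
itis = TaxonConcept(
    "20", "ITIS", pan_troglodytes,
    (Relationship("17", "hypothetical-parent-itis", "hierarchy", "ITIS"),),
)
species2000 = TaxonConcept(
    "21", "Species2000", pan_troglodytes,
    (Relationship("17", "hypothetical-parent-sp2000", "hierarchy", "Species2000"),),
)

print(concepts_using("17", [itis, species2000]))
```

A query for the name alone can stop at AtomicTaxon 17, while a taxonomist interested in competing views can enumerate the concepts that share it.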
Napier: So normalisation of the data may not be efficient for retrieving full record information (which is why big data warehouses use materialized views). For example, a query to SEEK may return only lists of GUIDs that would then have to be resolved by SEEK or an external service. Furthermore, resolving which relationships are held about an original concept, according to the original author, might require complex searching and dereferencing. This would not seem to be an ideal structure for a transfer schema returning data to users.

KU: That certainly seems like an implementation issue for SEEK. It's easy enough to return data using GUIDs, or fully expanded, or both.

Napier: The goal of the TDWG schema was to explicitly record/represent all the original information recorded about a concept and allow this to be moved in/out/between users and providers.

KU: The goals of the TDWG schema have not been made clear, but the goals of the SEEK schema have always been to transfer data to and from the SEEK Taxon Cache, which should facilitate understanding about taxonomy.

Napier: One view of what is achieved in the TES is a separation of Taxonomic Concepts into two types: one, the Atomic type, cannot express relationships, whilst the other, TaxonConcept, can, although we cannot define a concept in which the definition includes “conceptual relationships”. We investigated the feasibility of representing different types of concepts in the TDWG model, for example having Name, Original, Reference, Revision, Vernacular Concept Types (see version 0.58 on TDWG http://www.soc.napier.ac.uk/tdwg/index.php?pagename=TheSchema). Rules might specify which 'fields/attributes' were required for each type of concept and which kinds of relationships they could participate in. However, it was felt that this would be overly complex (how would a user/provider decide which type they were providing, etc.), and consequently this is more likely to be a division that some users would like to make according to their own rules and requirements, rather than something that is explicitly part of the transfer schema.

KU: We agree that different types of Concepts with different required elements seem overly complex for a transfer schema. Allowing TaxonConcepts to be defined by their component parts does not seem overly complex. You have enumerated (verbally) many assumptions about the behavior of the data model when discussing your schema. Is there some documentation of those assumptions?

This material is based upon work supported by the National Science Foundation under award 0225676. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF). Copyright 2004 Partnership for Biodiversity Informatics, University of New Mexico, The Regents of the University of California, and University of Kansas