Knowledge Representation

Introduction

The range of different data structures for ecological information makes data sets difficult to align and merge for synthetic research. The Ecological Metadata Language (EML) has accomplished much in terms of making ecological data discoverable and accessible. But once data are accessed, time must be spent determining if the various data sources are both semantically compatible and structurally convertible so that they can be normalized before merging. The aim of the Knowledge Representation group was to develop a knowledge model for addressing the necessary semantic considerations for aligning and merging disparate ecological data. In pursuing this practical aim, other powerful semantic capabilities have been realized, such as semantically enhanced data discovery methods that improve upon current text-based and keyword methods. These improvements to data discovery include the ability to explore data--which we consider part of the discovery process--using summarization techniques that are enabled by the knowledge model.

Most observational data sets are a series of attributes (e.g., data columns in a table), in which instances among the attributes are related in space, time, or part. A related group of data instances is generically called a tuple, but can be thought of as a row in a typical data table. The objective for data set integration is to ascertain whether two or more attributes are semantically compatible, and whether any structural conversion or scaling must be undertaken before merging them. At the scale of the attribute, integration may appear trivial. However, the context and scale at which the data were collected must also be compatible, which requires cross-attribute knowledge. For example, an attribute "weight 1" might be compatible with a second attribute "weight 2" in that both are continuous quantities with easily convertible measurement units. However, the first weight might pertain to all the grass biomass in a 1 m² plot, and the second to that in a 2 m² plot; or one might pertain to trees and the other to fish; or one might have been collected in Alaska and the other in Indonesia. To automate the alignment and integration of ecological data sets, the knowledge model must contain the necessary machinery to reason not only between attributes in different data sets, but also among the attributes within the same data set.

The objective of this technical note is to present a knowledge model specifically designed for ecological data integration. Called the observation ontology, the model breaks down scientific observation and measurement into the components required to determine whether data are semantically compatible and structurally convertible for merging.

The attribute entity

Do data attributes refer to the same entity or thing? For example, it would not be sensible to merge an attribute for a spatial area with one for a temporal duration.

The word "observation" is an overloaded term, and what constitutes a scientific observation is debatable. For example, an observation may be thought of as the tuple, which captures an associated group of attributes along some common thread in space and time. Conversely, an observation can be thought of as each individual measurement or cell, such as date, time, place, species, and height. A whole data set can even be thought of as an observation of some broader scientific concept, such as "productivity" or "ecosystem functioning." In the observation ontology, we define observation as an entity that is distinguishable from the other entities in a data set; for example, a location, time, or organism. More than one characteristic (or property) may be recorded for a given entity, translating to more than one attribute (or column) in a data set. Our goal is to be able to distinguish the different entities in a data set so that we can describe how they are contextually related to each other. For example, a study location may provide context for a focal organism. However, we may record several characteristics of the focal organism, such as its taxonomic identity and weight. Because these characteristics both belong to the organism, the corresponding attributes are grouped under a single observation.

The attribute characteristic

Are the data attributes capturing the same characteristic of the entity being recorded? For example, two attributes might both pertain to an organism, but one records the organism's height and the other its weight. Attributes must refer to dimensionally or semantically compatible characteristics (or properties) of the entity.
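
The entity and characteristic checks described so far can be illustrated with a minimal Python sketch. The class name, field names, and vocabulary terms below are hypothetical and are not part of the observation ontology itself; exact string matching stands in for whatever reasoning a real ontology would provide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    """One data column, annotated with what it records (hypothetical structure)."""
    name: str            # column label as it appears in the data set
    entity: str          # the thing observed, e.g. "organism"
    characteristic: str  # the property recorded, e.g. "height"


def same_entity_and_characteristic(a: Attribute, b: Attribute) -> bool:
    """True when both attributes describe the same characteristic of the same entity."""
    return a.entity == b.entity and a.characteristic == b.characteristic


# Illustrative columns from two imagined data sets.
height_a = Attribute("ht", "organism", "height")
height_b = Attribute("plant_height", "organism", "height")
weight = Attribute("wt", "organism", "weight")

print(same_entity_and_characteristic(height_a, height_b))
print(same_entity_and_characteristic(height_a, weight))
```

Passing this check is only a first filter: two attributes that match here may still differ in measurement standard, precision, context, or scale.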

The attribute (measurement) standard

Were the data attributes recorded using the same standard? Characteristics of entities can be recorded as data in many ways, including as physical quantities, names, or dates. For example, the height of an organism might be measured in meters for one data attribute, in feet for a second attribute, and nominally as "tall" or "short" for another. Not only should there be the ability to convert among measurement standards, but also the ability to map qualitative standards to quantitative standards if the necessary information exists as metadata (e.g., "tall" = 10-20 meters).
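
As a rough illustration of both kinds of mapping, the following Python sketch converts heights to meters and then assigns a nominal class. Only the "tall" = 10-20 meters bound comes from the example above; the bounds for "short", the half-open interval convention, and all function and variable names are assumptions made for this sketch.

```python
# Conversion factors to meters (the international foot is defined as 0.3048 m).
TO_METERS = {"meter": 1.0, "foot": 0.3048}

# Hypothetical metadata: nominal height classes with bounds in meters.
# "tall" follows the text; "short" is an assumed example.
HEIGHT_CLASSES = {"short": (0.0, 10.0), "tall": (10.0, 20.0)}


def to_meters(value, unit):
    """Convert a height from the given unit to meters."""
    return value * TO_METERS[unit]


def to_height_class(value_m):
    """Map a height in meters to a nominal class, or None if no class applies.

    Intervals are treated as half-open [low, high), an assumption of this sketch.
    """
    for label, (low, high) in HEIGHT_CLASSES.items():
        if low <= value_m < high:
            return label
    return None


# A height recorded in feet can be aligned with a nominal attribute.
print(to_height_class(to_meters(50, "foot")))
```

Mapping in the other direction, from "tall" to a number, is only possible as a range, so merging a nominal attribute with a quantitative one generally means coarsening the quantitative data.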

Attribute precision

If attributes are quantitative, with what precision are they recorded? If two attributes were measured with different precision, then the more precise attribute must be reduced to the precision of the less precise one before merging. Precision depends on units, so it should be normalized after unit conversion.
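
The order of operations (convert units first, then reduce precision) can be sketched as follows. This sketch assumes that precision is expressed as the smallest recorded increment of each attribute; the function names and data values are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def coarsest_resolution(resolutions):
    """Return the coarsest (largest) increment among resolutions in the same unit."""
    return max(resolutions)


def round_to_resolution(value, resolution):
    """Round a value to the nearest multiple of the given resolution."""
    steps = (value / resolution).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)
    return steps * resolution


# Attribute A: heights in meters, recorded to the nearest 0.1 m.
resolution_a_m = Decimal("0.1")
# Attribute B: heights in centimeters, recorded to the nearest 1 cm.
cm_to_m = Decimal("0.01")
resolution_b_m = Decimal("1") * cm_to_m   # convert the resolution as well
height_b_m = Decimal("1234") * cm_to_m    # 1234 cm expressed in meters

# After unit conversion, reduce both attributes to the coarsest resolution.
target = coarsest_resolution([resolution_a_m, resolution_b_m])
merged_b = round_to_resolution(height_b_m, target)
print(target, merged_b)
```

Decimal arithmetic is used here only to keep the rounding exact in the example; the same logic applies to floating-point data.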

The attribute context

Possibly the most important and non-trivial aspect of attribute merging is the correct alignment of contextual (mereological) dependencies. When aligning multiple attributes, it is necessary that the spatial, temporal, and material containment hierarchies align or are, at least, made explicit. For example, a nested sampling design "location <- biomass" is not directly compatible with a second design "location <- plot <- biomass". Merging the data sets and ignoring "plot" will deflate biomass estimates in the second data set. Knowledge of the nesting structure of the data sets indicates that biomass must be scaled by the "location" area of the first data set before merging.
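
A minimal way to make such a mismatch explicit is to compare the containment chains of the two designs. The Python sketch below assumes each design is written as a list of level names ordered from outermost to innermost; this representation and the function name are illustrative assumptions, not the ontology's own encoding.

```python
def context_mismatch(chain_a, chain_b):
    """Return the levels that appear in only one of two containment chains."""
    return sorted(set(chain_a) ^ set(chain_b))


# The two sampling designs from the example, outermost level first.
design_1 = ["location", "biomass"]
design_2 = ["location", "plot", "biomass"]

missing = context_mismatch(design_1, design_2)
print(missing)
```

A non-empty result signals that the designs nest differently and that some rescaling or aggregation step is needed before the biomass attributes can be merged.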

The attribute (spatial or temporal) scale

Were the data collected at the same spatial or temporal scale? Biomass of plants collected in a 1 square meter plot cannot be merged directly with biomass collected in a 2 square meter plot without normalizing the spatial scales. Sometimes such normalization can be handled by simple scaling (i.e., multiplication), but at other times it may require more complex curve fitting or rarefaction techniques. Although fully automating scaling may not be feasible, the potential need for scaling must at least be detectable.
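
For the simple case, detection and normalization might look like the following Python sketch. It assumes that biomass scales linearly with plot area, which will not hold where curve fitting or rarefaction is needed; the function names and biomass values are hypothetical.

```python
def needs_scaling(plot_area_a_m2, plot_area_b_m2):
    """Detect whether two attributes were collected at different spatial scales."""
    return plot_area_a_m2 != plot_area_b_m2


def per_unit_area(biomass_g, plot_area_m2):
    """Normalize biomass to grams per square meter (assumes linear scaling)."""
    return biomass_g / plot_area_m2


# Illustrative values: 50 g from a 1 square meter plot, 120 g from a 2 square meter plot.
if needs_scaling(1.0, 2.0):
    density_a = per_unit_area(50.0, 1.0)
    density_b = per_unit_area(120.0, 2.0)
    print(density_a, density_b)
```

Even when the normalization itself cannot be automated, a check like needs_scaling is enough to flag the attribute pair for attention before merging.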