Beam Knowledge Rep Sept 04

Beam Knowledge Representation Meeting, Sept. 21-23, 2004

Participants

Bob Waide (rwaide@lternet.edu)
Steve Cox (Stephen.Cox@tiehh.ttu.edu)
David Chalcraft (CHALCRAFTD@MAIL.ECU.EDU)
Shawn Bowers (bowers@sdsc.edu)
Bertram Ludaescher (ludaesch@sdsc.edu)
Deana Pennington (dpennington@lternet.edu)
Mark Schildhauer (schild@nceas.ucsb.edu)
Katy Suding (ksuding@uci.edu)
Kristin Vanderbilt (vanderbi@sevilleta.unm.edu)
Evan Weiher (weiher@uwec.edu)
Rich Williams (rwilliams@nceas.ucsb.edu)
Chad Berkley (berkley@nceas.ucsb.edu)
Dan Higgins (higgins@nceas.ucsb.edu)
Jianting Zhang (jzhang@lternet.edu)

September 21st

Introductions and Presentations

Mark presentation on SEEK background
Deana presentation on Biodiversity, etc.
Discussion

Species traits
Data sets, availability
Integration

Chad / Dan Presented on Kepler

Much interface discussion
Showed Pred/Prey and Bio index models

Rich Williams on ontologies

Restricted to certain axes; spatial patterns (naturally occurring gradients); abundance; temporal
In a particular control plot, how are things changing, and are those changes in the same trajectory (in the same gradient change)?
Ton of data in range management
Q: Are most of these datasets freely available on the web; are people sharing them? A: Few available on web ... Q: But is it possible to get them? A: It is a very divergent / diverse community ...

Shawn Bowers on semantic integration

September 22nd

Agenda Setting

Some overlapping stuff on every biodiversity analysis
Methods / Design focus
Analysis focus
Mark: Goal is to compartmentalize so as to provide general utility for the next project or for some other analysis; capture the standard functions and describe them so that others can use them
Bob: At one point there was a database with various scripts that provided the full spectrum of data integration

They aren't there anymore (they were on KNB bio)
Everything was done using scripts, no manual work
From 16 or so grasslands
Raw data that was read in by the scripts
Project technically still ongoing
Aug 2002 first working group
Then six months of work after that
It would be a good test case / use case to look into
Steve: Jornada would be the test case

Data Integration and Analysis

Katy (note, the following mixes discussion with Katy's presentation)

Predicting species response to increased resource availability (history, questions, dataset)
what happens when you increase productivity (experimentally), and then look at what happens to diversity
N (nitrogen) Fertilization experiments (KBS oldfields, ARC heath)

if you add nitrogen, it generally increases productivity (Gough et al. 2000)
Every experiment found that as you increase productivity, diversity decreases, which doesn't follow the natural productivity/species-richness curve (this is gaining interest in the community: we are increasing the productivity of systems in general with the environmental change going on, e.g., urbanization increases nitrogen/fertilization, and there is a desire to know the effect on diversity)
What is Primary Productivity "can of worms" discussion

Want to aggregate data at many different scales/communities to get many types of graphs; you want to ultimately go from smallest possible scale to largest scale (infer curve of graph)
What decision making occurs before any analysis happens? This goes into the data discovery/integration. Deana: can we build a repository of methodologies?
You do a broad category of the dataset, but not the details (a little bit of woody stuff, and so on...)
Bob: need a step where someone can look at the methodology ...
Rich: Need to capture what it is you are measuring; it isn't as much a methodology issue
Bob: Clark and Clark paper covers some of this (Bob said he'd dig up the ref)

Most of the time, productivity scales well, i.e., it is a linear scaling, e.g., anpp vs. area is a linear relationship. So for example, even though they are smaller plots, you can get g/m^2 measures. Basically, productivity doubles as area doubles...
N addition decreases species diversity: plot at LTER sites of ANPP (above-ground net primary productivity) versus relative species density; species density is the number of species observed in a given area
Two studies, measures at different scales: either you simulate or measure a species area curve, and extrapolate to different areas.
N fertilization positively affects productivity; negatively affects diversity

N fert. interacts with environment; species sorting (dominance composition) negatively affects diversity
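The species-area extrapolation mentioned above (measure richness at a few plot sizes, then extrapolate to other areas) can be sketched as follows. This is only an illustration: it assumes the commonly used power-law form S = c * A^z, which the notes do not specify, and the function names and the numbers are hypothetical.

```python
import math

def fit_power_law(areas, richness):
    """Fit S = c * A**z by least squares on log-transformed values.

    Assumption: a power-law species-area curve; the notes do not say
    which curve form the studies actually used.
    """
    xs = [math.log(a) for a in areas]
    ys = [math.log(s) for s in richness]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    z = sxy / sxx                       # slope in log-log space
    c = math.exp(mean_y - z * mean_x)   # intercept back-transformed
    return c, z

def predict_richness(c, z, area):
    """Expected species count at the given area under the fitted curve."""
    return c * area ** z

# Made-up sample: species counts observed at nested plot sizes (m^2).
areas = [0.25, 0.5, 1.0, 2.0]
richness = [4, 5, 7, 9]
c, z = fit_power_law(areas, richness)
print(predict_richness(c, z, 1.0))
```

A fit of this kind is one way to put a 0.3 m^2 sample and a 1 m^2 sample on a common area before comparing their species densities.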

Species response to fertilization:

In the case of increased fertility, can we predict what species we will lose? What species will become dominant?
Are these responses contingent on system characteristics?

Dataset: 8 lter sites, 28 community types (mainly vegetation classifications, based on the experiments/manipulations done at the sites); 831 species
Dataset characteristics

N added (g/m^2/yr), 10 (ARC), 9.5 (CDR), 60/wk (GCE), etc.
Form of N: NH4-NO3 pellets (ARC), Liquid Urea-N, NH4-NO3 pellets, etc.
Treatment plot size (m^2): differs from 900 to .25
Sample plot size (m^2): .25 to 10 (these don't differ that much really: .32, .30, .25, 1) ... it isn't, however, appropriate to compare directly, without the species area curves, the .3's with the 1's
Replication: 2 to 10
Duration (yrs): 2 to 13

Often, a matrix like this is constructed at the beginning of doing a "synthesis", and the most important points to track for each site are: treatment, sample size, duration, and replication.

Data "Request" (this is basically a data procurement request/query)

Contacted LTER and asked for the data in a particular format (basically the matrices)
At each site, asked for N-fertilization experiments: abundance, species, measures of productivity, treatment plots, un-treated plots, herbaceous systems, and to give latest sample time
List of vegetational forms; growth forms (secondary growth); herbaceous is a property of plants
Herbaceous term applied to a nonwoody stem/plant with minimal secondary growth
Example dataset:

atts: site, comm, species name, RA_Control, V_Control, RankC, PRankC, n, V_Naddition, RankN, PRankN, Cot, Dur, LF, DLF, HT, CLN, Origin, Family, Response, ImmExt, InRR, Change,
comm is a subset of n-fertilization of sites (e.g., tilled and untilled in KBS), thus <Site,Comm> denotes the actual place
each row constitutes an observation within an experiment
RA_control is the mean, V_control is the variance, the rank is derived from control, ...
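As a rough illustration of how the derived columns relate to one another, the sketch below computes a mean (RA_Control), a variance (V_Control), and a rank (RankC) per species from replicate control plots. The species names and abundance values are invented, and the choices of sample variance and of rank 1 for the most abundant species are assumptions; the notes do not state either.

```python
from statistics import mean, variance

# Hypothetical relative abundances of each species in three control plots.
control_plots = {
    "sp_a": [0.50, 0.40, 0.60],
    "sp_b": [0.30, 0.35, 0.25],
    "sp_c": [0.20, 0.25, 0.15],
}

def summarize_control(plots_by_species):
    """Return one row per species with RA_Control, V_Control and RankC."""
    rows = [
        {
            "species": species,
            "RA_Control": mean(values),     # mean relative abundance
            "V_Control": variance(values),  # sample variance (assumption)
        }
        for species, values in plots_by_species.items()
    ]
    # Assumption: rank 1 is the most abundant species in the control.
    rows.sort(key=lambda row: row["RA_Control"], reverse=True)
    for rank, row in enumerate(rows, start=1):
        row["RankC"] = rank
    return rows

for row in summarize_control(control_plots):
    print(row)
```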

Lots of discussion about separation of syntax and semantics issues; and of excel details
"Generic" stuff

Species/attribute matrix: to compute trait responses, functional attributes
Measure traits in as many species as possible, then throw away points

Both projects pretty much took from the same original, "raw" data sets

34 from Katy's project
13 sites from NCEAS project

Brainstorming:

Six-month view: we know diversity is made up of species, and we have all that data but don't use it to its full potential; productivity/diversity data needs to be integrated with community structure, and integrated across a lot of sites.
General tasks: Identifying data that is relevant (talking with people), permission to obtain the data, understanding the structure and content of the data (sampling design, how or what attributes mean), and then determining which can be appropriately integrated to do an analysis. From Jornada: http://jornada-www.nmsu.edu/ (go to "Research Data" > "LTER Data" > "Plant")
ANPP versus r, with a bunch of data points: how can the data points be linked back to a table of other features of the species that are involved in each point? A point represents the set of species in an area (the species "richness"); the auxiliary table holds the characteristics of the species
As another example, compare what happens to the species between the points (as area increases); and more importantly the functional traits of the change

Comment: These are just visualization things

Null model, and the null expectation
Abundance vs loss prob for n-fix and not n-fix
Standard toolset
Measuring functional diversity is a big problem -- a computationally complex problem
For loss-probability and N-fixers you need to do data integration: how many sites are needed, and so on ...

Biodiversity Change Analysis

How to understand change in biodiversity: what are the factors causing the change in biodiversity

Given an attribute matrix, which attributes are changing the most or least at particular segments in the species-area curve
Focus on Community structure
Niche breadth
Predict why diversity would decline / change

Deana's Example

Query

Some data by keyword: "biodiversity", "species counts", "abundance", species names, functional trait
Location, e.g., place name, coordinates, bounding box
Sample method
Analysis type, e.g., counts, abundance

Construct logically and semantically equivalent views
Group on sample methods (transect vs plot)
Data

data1: transect 20m -> rarefaction (species-area curve) -> scale to 1 m interpolation
data3: transect 20m -> rarefaction (species-area curve) -> scale to 1 m interpolation
data2: plot 5m^2 -> species-area curve -> scale to 1 m interpolation
data4: plot 1m^2

Integrate transect data
Integrate plot data
Construct graphs
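The rarefaction step in the data list above could look roughly like the sketch below. It uses the classic individual-based rarefaction formula (expected number of species in a random subsample of n individuals) as a stand-in; the notes do not say which rarefaction variant was intended, and scaling a 20 m transect down to 1 m may well call for a sample-based method instead. The function name and the counts are hypothetical.

```python
from math import comb

def rarefy(counts, n):
    """Expected number of species in a random subsample of n individuals.

    counts: number of individuals observed per species in the full sample.
    Uses E[S_n] = sum_i (1 - C(N - N_i, n) / C(N, n)), with N = sum(counts).
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total number of individuals")
    denom = comb(total, n)
    # comb(total - c, n) is 0 when fewer than n individuals belong to other
    # species, so that species is certain to appear in the subsample.
    return sum(1 - comb(total - c, n) / denom for c in counts)

# Made-up transect sample: individuals counted per species.
transect_counts = [12, 7, 3, 1, 1]
curve = [rarefy(transect_counts, n) for n in range(1, sum(transect_counts) + 1)]
print(curve[:5])
```

Evaluating the function over a range of n gives a rarefaction curve, which is what lets samples of different size be compared at a common effort.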

September 23rd

Agenda Setting

Next meeting
Convert Deana's starting workflow to a specific analysis ...

Next Meeting

February or March time frame
What we might speak about / do for next meeting

Get data, scripts, etc.
Get the biodiv workflow into Kepler
Maybe get familiar with Kepler / ecological tutorial for feedback
Data integration examples
Given the technology -- what do you want to do?
IRC channel for BEAM? Weekly, bi-weekly scheduled meeting?

Group Breakouts

Biodiversity ontology breakout

Participants: Deana Pennington, Kristin Vanderbilt, Evan Weiher and Rich Williams.

Biodiversity workflow breakout

Participants: Steve Cox, David Chalcraft, Shawn Bowers, Bertram Ludaescher, Mark Schildhauer, Chad Berkley, Dan Higgins, Jianting Zhang

Notes from ontology breakout

The discussion broadly covered traits of plant communities important in biodiversity and productivity experiments and experimental methodologies. The following notes are raw and will require considerable work to formalize. As such, the categorizations suggested by the indented formatting should be regarded as preliminary.

Traits of a Population (aggregated group of individuals) of Plant Species:

Abundance

Count
Cover
Biomass

Size

Height

Mean, variance of “average height of the highest photosynthetic organ of a well-grown individual”

Biomass

Avg above ground biomass of an individual

Avg Canopy size (area)

Path of Resource Uptake

Photosynthesis (C3, C4, CAM)
Nitrogen fixing
Mycorrhizal fungi associations (Yes/no, Endo/ecto)

Modes of Reproduction

Clonality (None/clumping/branching).

What is the definition used to separate clumping and branching?

Resprouting ability

Life Form/Habit

Grass
Forb
Subshrub
Shrub
Tree
Vine

All these may be definable using other traits (height, woodiness, leaf shape, self-supportingness, perhaps others).

Life Span

Annual, biennial, perennial

Phenological Traits

Seasonality
Sprouting cue

Native/naturalized/non-native

Traits of Parts of Plants: (note the important plant parts given by the trait categories)

Leaf Traits (Photosynthetic organ traits?)

Evergreen, deciduous (defined based on leaf longevity?)
Specific leaf area (Area/mass) or mass per unit area.
Water content or % dry mass
Many others ...

Root Traits
Stem Traits

Stem density

Mass per unit volume
Woody/nonwoody

Branching pattern

Seed Traits

Size

Mass

Shape
Appendages/Fruits -- closely related to dispersal categories, often highly correlated with seed size.

Fly through the air
Stick to an animal
Eaten and excreted
Cached for later consumption
Ballistic dispersal
Floating

Traits of Interactions

Competitive ability

Measure how an individual suppresses the growth of a neighbor

Interaction strength
Effect on environment (ability to reduce resources) (Tilman)

Experimental Methods

Experiment

Field Experiment

Observational/Empirical Experiment
Manipulation

All field experiments have

Where (site, plots etc)
When (sampling regime)
What (properties of organism/population/community/system)

(An empirical experiment is a field experiment with no manipulation)
Manipulations have one or more Treatments
Treatment has

What was treated?
Strength (amount), can be positive (addition) or negative (exclusion)
Temporal extent

When defining a treatment, a scientist might describe a substance (nutrient, presence of an organism) as being manipulated, or describe the manipulation of a process.
Sampling Regime

Random
Stratified
Stratified random
Nested
Regular (uniform)
Haphazard
Random haphazard

Note that the choice of a sampling regime (and of plot layout?) constrains the possible statistical analysis techniques that can be applied.
Traits of Experiments

Balanced or unbalanced sampling
Replication

Traits of Treatment Regime

Factorial (all possible combinations of treatments) or not
Random factors (treatments along a natural gradient)
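To make "factorial" concrete: a treatment regime is factorial when every combination of treatment levels is applied. The sketch below enumerates the combinations and checks a set of applied treatments against them. The factor names and levels are invented for illustration.

```python
from itertools import product

def factorial_design(factors):
    """List every combination of levels, one dict per treatment cell."""
    names = list(factors)
    return [
        dict(zip(names, combo))
        for combo in product(*(factors[name] for name in names))
    ]

def is_factorial(factors, applied):
    """True if the applied treatments cover exactly all combinations."""
    names = list(factors)
    expected = set(product(*(factors[name] for name in names)))
    observed = {tuple(cell[name] for name in names) for cell in applied}
    return observed == expected

# Hypothetical two-factor experiment: nitrogen addition x herbivore exclusion.
factors = {"nitrogen": ["none", "added"], "herbivores": ["present", "excluded"]}
cells = factorial_design(factors)
print(len(cells), is_factorial(factors, cells))
```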

Notes from workflow breakout

Workflow based on part of Steve and David's Jornada analysis
http://jornada-www.nmsu.edu/studies/lter/datasets/plants/nppqdbio/data/nppqdbio.htm
General steps outlined:

Data Request
Quality Control and Assurance (if from different sites)
Data Integration
Quality Control and Assurance (of the integration)
Analysis
Capture result of analysis …

Workflow we examined:

http://cvs.ecoinformatics.org/cvs/cvsweb.cgi/~checkout~/seek/projects/kr-sms/docs/beam_kr_sms_meeting_sept_04_workflow.png

Useful Actors

List Summarizer

A set of values in a data column

List Comparator

Given two sets (lists), do they match?
Which ones in the first list aren’t in the second
Assign first list values to new values
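The List Summarizer and List Comparator behaviours described above can be sketched in plain Python as follows. These are hypothetical helper functions written for illustration, not existing Kepler actors, and the example values are made up.

```python
def summarize(column):
    """List Summarizer: the distinct values in a data column, sorted."""
    return sorted(set(column))

def lists_match(first, second):
    """List Comparator: do the two lists contain the same set of values?"""
    return set(first) == set(second)

def missing_from_second(first, second):
    """Values of the first list that are not in the second, in first-seen order."""
    second_values = set(second)
    seen = set()
    missing = []
    for value in first:
        if value not in second_values and value not in seen:
            seen.add(value)
            missing.append(value)
    return missing

def remap(values, mapping):
    """Assign first-list values to new values; unmapped values are kept."""
    return [mapping.get(value, value) for value in values]

# Made-up taxon codes from two sites.
site_a = ["BOER", "BOGR", "boer", "LATR"]
site_b = ["BOER", "BOGR", "LATR"]
print(summarize(site_a))
print(missing_from_second(site_a, site_b))
print(remap(site_a, {"boer": "BOER"}))
```

A typical use is the quality-control step before integration: summarize a column, compare it with a reference list, then remap the stragglers.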

Nested Transpose

(site, taxon, count)
{(A, x, 3), (A, y, 1), (B, y, 4), (C, z, 2)}
Transpose to:

(site, x, y, z)
{(A, 3, 1, 0), (B, 0, 4, 0), (C, 0, 0, 2)}

Notes about this from Bertram and Shawn after meeting:

Given an annotated schema S, denoted S*, and a white-box actor q s.t. q(S*) -> S’, we want to “push through” the annotations to obtain S’*.
The “nested” transpose is basically a combination of various lower-level algebraic operators, such as (theoretical) group-by, matrix transpose, projection, etc. So, given q as such a plan of operators, can we reason over the plan (white-box actor) q to obtain S’*? Using symbolic manipulation? Using the chase, e.g., as for similar problems in integrity constraints?
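The (site, taxon, count) example given for the Nested Transpose actor can be reproduced with the short sketch below. The function name is hypothetical, and two choices are assumptions not stated in the notes: missing (site, taxon) pairs are filled with 0, as in the example, and repeated pairs are summed.

```python
def nested_transpose(rows):
    """Turn (site, taxon, count) rows into one row per site, one column per taxon."""
    taxa = sorted({taxon for _, taxon, _ in rows})
    sites = sorted({site for site, _, _ in rows})
    counts = {}
    for site, taxon, count in rows:
        # Assumption: repeated (site, taxon) pairs are summed.
        counts[(site, taxon)] = counts.get((site, taxon), 0) + count
    header = ("site", *taxa)
    table = [
        (site, *(counts.get((site, taxon), 0) for taxon in taxa))
        for site in sites
    ]
    return header, table

rows = [("A", "x", 3), ("A", "y", 1), ("B", "y", 4), ("C", "z", 2)]
header, table = nested_transpose(rows)
print(header)
print(table)
```

Note that the output schema depends on the data (the taxon values become column names), which is what makes pushing annotations through this operator non-trivial.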

Often-found pattern of computation

Can Kepler/Ptolemy efficiently and conveniently support the following pattern?
Given a data set, construct a scatter plot for pairs of variables, allow user to select a subset of the plots -or- pairs of variables of interest, return data subsets based on chosen pairs (with no extraneous variables)
Similarly, given data sets, an actor computes a set of regressions, the user is shown the results, the user selects the regressions of interest, and the workflow then proceeds using only those selected regressions
These "patterns" can be supported now (with lots of plumbing) using the browser actor. Can we also add functionality to better support/model these patterns?
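The first pattern (enumerate variable pairs, let the user choose some, return only the chosen variables) can be sketched without any workflow system as below. The user selection is stubbed with a fixed choice, and the data set, column names, and function names are all hypothetical.

```python
from itertools import combinations

def variable_pairs(variables):
    """All unordered pairs of variables, i.e. one candidate scatter plot each."""
    return list(combinations(variables, 2))

def subset_for_pairs(data, chosen_pairs):
    """Keep only the columns that appear in a chosen pair (no extraneous variables)."""
    wanted = {name for pair in chosen_pairs for name in pair}
    return {name: values for name, values in data.items() if name in wanted}

# Made-up data set, stored as column name -> list of values.
data = {
    "anpp": [120.0, 180.0, 260.0],
    "richness": [14, 11, 8],
    "n_added": [0.0, 5.0, 10.0],
}

pairs = variable_pairs(list(data))
# Stand-in for the interactive step: pretend the user picked the first plot.
chosen = [pairs[0]]
subset = subset_for_pairs(data, chosen)
print(pairs)
print(list(subset))
```

In a workflow, the interactive choice is the only step that needs user-interface support; the steps before and after it are ordinary data transformations.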


This page last changed on 25-Jan-2005 15:39:17 PST by SDSC.bowers.