An Overview of Biodiversity Informatics

Background for the first meeting of the All Species Project

By Stanley D. Blum
    California Academy of Sciences

First distributed: September 18, 2000
This version: October 7, 2000

Preface

This overview of biodiversity informatics is strongly focussed on biological systematics and the work conducted in systematics collections (i.e., natural history museums). It does not address observational (monitoring) data, rare and endangered species, ecosystem characterization, threat assessment, and a host of other biodiversity issues.

Introduction

Regardless of what the All Species project sets as its ultimate goals, it is certain that the work will create large quantities of new information. Most people will actually experience the project's results not by coming into contact with newly discovered and described organisms, but by experiencing the new information generated about the organisms -- everything from text, to pictures, diagrams, maps, sounds and video.

Ensuring that information flows efficiently, from creation, through analysis, into appropriate outputs, is the essence of biodiversity informatics -- the application of information technology to the domain of biodiversity.

This overview of biodiversity informatics describes the main subject areas in biological systematics, their interrelationships, and the most important informatics projects in a given area. The subject areas are:

In each of these areas, I will discuss

Taxonomic Names and Classification

Nature and Uses of Names and Classification

Biological taxonomy -- the scientific names of organisms -- provides a global (at least internationally recognized) system of designating natural groups of organisms; i.e., species and higher taxa. Classifications assemble smaller groups into larger groups, and provide a way of making statements about or retrieving information about many species at a time. One of the first steps in communicating the discovery of a new kind of organism is to give it a name, and to infer what it is by classifying it -- i.e., saying that it is a particular kind of something more general and perhaps already familiar.

Examples of taxonomic names
Gorilla gorilla -- the name of a species
Scombridae -- the name of a family of fishes (tunas, mackerels, etc.); the family contains 15 genera and 49 valid/accepted species; 212 species names are "available" for these 49 species, so (212 - 49 =) 163 of those names are synonyms.
 
-- (From the ITIS database)
 
 
An example of a species in a classification
 
Kingdom Animalia animals
  Phylum Chordata chordates
    Subphylum Vertebrata vertebrates
      Class Mammalia mammals
        Order Primates primates
          Family Hominidae man-like primates
            Genus Gorilla
              Species Gorilla gorilla
 
  -- (From the ITIS database)  
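A classification like this lends itself to a simple tree representation in software; retrieving "all primates" then means nothing more than walking parent links. A minimal sketch in Python follows (the structure is taken from the ITIS example above; the Homo entries are hypothetical additions for illustration):

    # A minimal sketch: a classification as child-to-parent links, so that
    # one higher taxon can stand for many species in a query.
    # Names follow the ITIS example above; Homo is added for illustration.

    classification = {
        "Chordata": "Animalia",
        "Vertebrata": "Chordata",
        "Mammalia": "Vertebrata",
        "Primates": "Mammalia",
        "Hominidae": "Primates",
        "Gorilla": "Hominidae",
        "Gorilla gorilla": "Gorilla",
        "Homo": "Hominidae",                # hypothetical sibling entry
        "Homo sapiens": "Homo",
    }

    def descends_from(taxon, ancestor):
        """True if 'taxon' is placed anywhere under 'ancestor'."""
        while taxon is not None:
            if taxon == ancestor:
                return True
            taxon = classification.get(taxon)   # step up to the parent
        return False

    # Retrieve every species classified under Primates (in this toy
    # encoding, binomials are the names containing a space):
    species = [t for t in classification if " " in t]
    print([s for s in species if descends_from(s, "Primates")])
    # prints: ['Gorilla gorilla', 'Homo sapiens']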

Defining a group of organisms (taxon) and naming it are conceptually distinct operations. Determining that a taxon is biologically meaningful or real is a question of scientific judgement and is subject to refutation. Determining what name should be applied to it is a matter of following the rules set forth in the relevant international code of nomenclature. (There are three: a code for microbiology, another for botany, and a third for zoology.)

Taxonomic names are created and put into use via publication. There are no requirements of scientific veracity, reasonableness, or qualifications of the author for a name to be effectively published and admitted into the universe of discourse. Once a name is published it has to be dealt with in subsequent works.

Scope of nomenclatural and classification data

There are an estimated 1.5 - 2 million known species. There are somewhere between one and two synonyms for every valid/accepted species (in addition to the valid name). Compiling a list of scientific names for a major group takes years of effort. For the resulting list to represent real progress -- something that doesn't have to be done again -- each original publication must be reviewed (at least briefly) by a taxonomist and the decisions documented with supplemental information. Data gathered along with the name typically include the bibliographic reference, author(s), and date of publication. Additional information may include references to type specimens (institution and catalog number), type locality, and references to subsequent taxonomically significant publications.

Significant projects

Projects that aim to compile taxonomic databases fall into two major categories: nomenclators and checklists. A nomenclator is a compilation of all relevant names, but does not present opinions about which taxa are accepted or valid. A checklist represents determinations about which taxa are accepted or valid; for each taxon it gives the correct name and a list of synonyms, and it places all taxa in a single coherent classification.
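The distinction between the two kinds of compilation can be made concrete in data terms. The following is a hedged sketch; the field names are my own choices for illustration, not any project's actual schema:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class NameRecord:
        """A nomenclator entry: one published name with its bibliography,
        recorded without any opinion about validity."""
        name: str
        author: str
        year: int
        reference: str                  # where the name was published

    @dataclass
    class TaxonRecord:
        """A checklist entry: one accepted taxon, its correct name, its
        synonyms, and its placement in a single classification."""
        accepted_name: NameRecord
        synonyms: List[NameRecord] = field(default_factory=list)
        parent: Optional[str] = None    # e.g., the containing genus or family

In these terms, a nomenclator is simply a set of NameRecords, while a checklist adds a layer of TaxonRecords that organizes those names into accepted taxa, synonymies, and a single classification.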

There are four significant projects that aim to compile and/or provide information across major taxonomic groups on a global scale:

Both of the checklist projects (ITIS and Species 2000) are attempting to engage members of the systematics community to act as compilers and "curators" of taxonomic information on an on-going basis. ITIS is building a centralized database, whereas Species 2000 is creating a federation among distributed and independently managed databases. To date ITIS has made more progress, probably because it has had more funding and stronger management -- all the key players are US federal employees. In both projects, however, the bulk of the information is coming from databases created independently by practicing taxonomists, and in the case of ITIS layered on top of legacy databases managed by federal agencies. Governments have not yet funded large-scale data gathering for either of these projects. (Governments have probably funded the compilation of somewhere between 70-90% of the data, but only through their traditional channels for supporting basic research.)

The two nomenclator projects (IPNI and ION) are derived from "traditional" literature indexing operations. (An indexer reads publications and records the taxonomic names and subjects addressed in each publication; the resulting indexes are published annually.) The two largest contributors to IPNI, Kew Gardens and the Harvard Herbaria, have both been indexing botanical publications since the late 1880s. The largest contributor to ION, the Zoological Record, has been indexing the zoological literature since 1864.

The strangest thing about all four projects is that while there are large overlaps among them, they are collaborators primarily in name and not really in substance. The Web site for each project mentions at least one of the others as a "partner" and the management team for each project has at least one liaison from another project. Yet despite this openness the projects do not share actual data, nor does there appear to be any intention to rationalize the efforts and create data flows between projects. Each has a sufficiently different strategic plan, technology base, or funding model that actual collaboration (data sharing) appears politically infeasible, at least for the time being.

Also of interest is the NCBI ("GenBank") taxonomy project [ http://www.ncbi.nlm.nih.gov/Taxonomy ]. GenBank is the scientific community's repository of miscellaneous DNA/RNA sequences. GenBank accepts sequence data about all genes and all organisms (in contrast to the comprehensive genome databases for specific organisms, such as human, mouse, fruitfly, etc.). Individual researchers and their labs "publish" sequence data by submitting it to GenBank. The taxonomic information associated with submissions is typically of mixed quality and makes the database difficult to use. To solve the problem NCBI hired several curators who worked hard to establish, and now maintain, a coherent system of nomenclature. The GenBank taxonomy database was thinly populated at first, but is now approaching 80,000 taxa and nearly 137,000 names (including higher taxa, synonyms and common names). The GenBank taxonomy project is not intended to be a comprehensive treatment of the nomenclature for any group of organisms, but it serves as an example of how important consistent nomenclature is to information management in all of biology.

Other resources describing taxonomic name projects

Articles addressing the question, "How many species are there?"

Software Packages for Compiling Taxonomic Names

Neither of these packages has attained widespread acceptance relative to the potential market. Every practising taxonomist could use one of these applications, but most use either a word processor or their own specially built database (such as dBase, Access, FileMakerPro, etc.) to manage the taxonomic and bibliographic information in their research.

Taxonomic Character Data

After the tasks of collecting new specimens and assembling specimens from previous expeditions, the real work of taxonomy concerns studying, analyzing, and describing the attributes of organisms, or more importantly how those features vary among individuals, populations, species and higher taxa. Natural variation typically has a hierarchical structure: some characteristics are variable within species, others are good for telling species apart, and still others reveal the evolutionary history and relationships among species. The taxonomically meaningful patterns in variation are elucidated by examining and recording many observations on thousands of specimens. It's a lot of work.

The purposes of all these comparisons are: to define or circumscribe taxa, to enable other people to identify unknown specimens efficiently, and to determine the relationships among taxa. The methods for studying relationships are quite separate from the methods for developing taxonomic descriptions and identification keys. These two areas will be treated separately, even though the data are very similar.

Taxonomic Descriptions and Keys

Nature and Uses

Discovering a new species means finding one or more specimens that have a combination of characteristics not seen by previous workers. In essence, the new specimens fall outside the boundaries that have been laid out in "taxonomic descriptions," a standard part of taxonomic publications. A taxonomic description enumerates the characteristics of a group of organisms. Good descriptions are difficult and tedious to write. Each statement in a description is typically expressed simply and positively; e.g., individuals of this taxon have this feature or that range of variation. Descriptions are expressed this way, rather than comparatively, because the author cannot anticipate the comparisons a reader might want to make. The author tries to provide a description that is detailed enough to support all reasonable comparisons. Ideally the author includes characteristics that: 1) are shared by individuals of that taxon; 2) distinguish the taxon from similar taxa; 3) are variable within the taxon, but are judged to have no taxonomic significance; and 4) indicate the placement of this taxon within the classification.

One of the most common reasons for consulting taxonomic descriptions is to identify a specimen, but going through descriptions, one after another, is not the most efficient way to make an identification. A taxonomic key, if one is available, typically makes the task much easier. A taxonomic key is a decision tree; each fork in the tree represents a choice among two or more mutually exclusive character states. Each choice narrows the field of possibilities until a single answer remains. Good keys are also difficult to write because experts often have difficulty anticipating how a naïve user might interpret completely unfamiliar morphology. Even a good key can be difficult to use if the specimen in question is missing critical structures because of its sex, life stage, damage, or because the important structures are simply hard to examine or easily misinterpreted. Perhaps the essential problem with written (static) keys is that they force the user to make decisions in a fixed order and an early mistake can mean hours of frustration. Interactive or "multi-entry" keys enable a user to make decisions in any sequence rather than in a specified sequence. Several software packages have been developed to build interactive keys.
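The difference between a static and a multi-entry key is easy to illustrate in code. In a multi-entry key, identification amounts to progressive filtering of a candidate set by character states, in whatever order the user chooses. A minimal sketch (the taxa and characters are invented):

    # A multi-entry key as progressive filtering: the user asserts
    # character states in ANY order and the candidate set shrinks.
    # The taxa and characters here are invented for illustration.

    candidates = {
        "Taxon A": {"scales": "ctenoid", "barbels": "present"},
        "Taxon B": {"scales": "cycloid", "barbels": "absent"},
        "Taxon C": {"scales": "ctenoid", "barbels": "absent"},
    }

    def narrow(remaining, character, state):
        """Keep only the taxa consistent with an observed character state."""
        return {name: chars for name, chars in remaining.items()
                if chars.get(character) == state}

    # Unlike a printed key, the order of observations is up to the user,
    # and a character missing from a damaged specimen can simply be skipped.
    remaining = narrow(candidates, "barbels", "absent")
    remaining = narrow(remaining, "scales", "ctenoid")
    print(list(remaining))   # prints: ['Taxon C']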

Data Capture and Management

Descriptions and keys are perhaps the most important products of taxonomy. From an informatics perspective, there are several things that are important to note:

Significant Software Projects

DELTA/IntKey and LucID are the two "market leading" software packages in this area, and both are developed by research labs of CSIRO Australia (an agency of the Australian federal government). Both the DELTA and LucID websites provide good explanations of interactive keys. DELTA differs from LucID (and is unique) because it was designed to produce taxonomic descriptions, not just keys.
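The principle behind generating descriptions from coded data can be sketched briefly. This is not DELTA's actual format or output, just an illustration of the idea that a single coded matrix can yield descriptions as well as keys:

    # Sketch of the principle behind DELTA: code observations once, in a
    # structured matrix, then generate prose descriptions from the codes.
    # The characters, templates, and taxa are invented; DELTA's actual
    # format and output differ.

    characters = {
        "body shape": {1: "fusiform", 2: "laterally compressed"},
        "barbels":    {1: "present",  2: "absent"},
    }

    observations = {
        "Taxon A": {"body shape": 1, "barbels": 1},
        "Taxon B": {"body shape": 2, "barbels": 2},
    }

    def describe(taxon):
        """Render one taxon's coded states as simple positive statements."""
        clauses = [f"{char} {characters[char][code]}"
                   for char, code in observations[taxon].items()]
        return taxon + ": " + "; ".join(clauses) + "."

    print(describe("Taxon A"))
    # prints: Taxon A: body shape fusiform; barbels present.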

DELTA / INTKEY - http://www.biodiversity.uno.edu/delta/

DELTA was developed in the 1980s, and was first made widely available as a DOS program. It was not easy to use, and a surprising number of friendlier front-ends were developed by third parties (see the Digital Taxonomy site, below). IntKey was developed later as a tool to support interactive identification, using DELTA datasets supplemented with digital images. The DELTA web site lists 27 other sites where taxonomists have published one or more DELTA-based datasets or keys.

LucID - http://www.publish.csiro.au/lucid/index.htm

LucID has been in use since about 1994. LucID was initiated as yet another attempt to provide a better interface to DELTA-like capabilities, but then evolved into a completely separate product with a slightly different underlying data model; data can be moved from DELTA into LucID, but not necessarily the other way (depending on the features used). There are about 21 LucID keys that are done or nearly done, with another 25 in development.

ETI - http://www.eti.uva.nl/Default.html

Another significant organization in the area of interactive keys is the "Expert Center for Taxonomic Identification" (ETI) in Amsterdam. ETI produces CD-ROMs about particular taxonomic groups, sometimes with a geographic focus (e.g., "Annonaceae - Neotropical Genera and Species", "Birds of Europe", and "Bats of the Indian Subcontinent"), and most contain interactive keys. ETI has developed its own software, Linnaeus II, for authoring interactive keys. Until recently, ETI had used CDs exclusively as its publishing medium. Part of the reason for this is probably the economic model supporting the organization and the need to create a revenue stream from a saleable product. ETI has produced more than 50 titles.

Web-based Keys

Finally, some taxonomists have used the interactive capabilities of the Web and developed Web sites (HTML and, in some cases, Java) to publish interactive keys.

Web examples:

Status

Most taxonomic character information exists only as unstructured text on paper, not in structured digital form that can be manipulated by software applications or re-used in the next revision. Although a significant number of taxonomists recognize the advantages of interactive keys, only a small minority of systematists, almost certainly less than 5%, are creating, or intend to create, interactive keys in current projects. Most systematists regard traditional paper publication to be the most important, if not only appropriate, medium for disseminating taxonomic knowledge; they believe Web pages and CDs are ephemeral. Perhaps even more important is the fact that most systematists remain unconvinced that they could produce traditional paper publications more easily by incorporating DELTA (or DELTA-compatible software) into their work processes.

As we discover more and more species and generate more and more taxonomic character data, paper becomes an increasingly inappropriate medium for disseminating and managing this information. The informatics challenges surrounding taxonomic character data are:

Phylogenetic Data

Another class of taxon-by-character data matrices comprises the data sets used for phylogenetic analysis. In the 1970s and early '80s systematics was revitalized through the emergence of a subdiscipline called cladistics or phylogenetics. The debate over theory and methods was vigorous and re-established systematics as an intellectually challenging subdiscipline of biology. In the late '80s, two software tools in particular, PAUP and MacClade, brought phylogenetics to the masses. These tools enabled researchers to analyze their observations and interpretations according to the emerging methodologies, and thereby to develop scientific hypotheses about the evolutionary histories of their study organisms. By the early '90s a very significant proportion of revisionary treatments (taxonomic studies containing new classifications, not just new species) were based on phylogenetic methods.

PAUP (Phylogenetic Analysis Using Parsimony [ http://www.lms.si.edu/PAUP ]) is a program used to find the best "tree" (hypothesis of phylogenetic relationships) for a given taxon-by-character data matrix.

MacClade [ http://www.sinauer.com/Titles/frmaddison.htm ], on the other hand, can use the same data sets (the Nexus format), but allows users to draw and rearrange their own trees. This enables them to see what a different tree topology implies about character evolution and overall tree length (the sum of all character state changes on the tree).
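The notion of tree length has a precise meaning that a small example makes concrete. The following sketch counts the minimum number of state changes a single character requires on a fixed tree (the Fitch parsimony counting step); summing such counts over all characters gives the tree length. The tree and data are invented, and programs like PAUP and MacClade of course do far more than this:

    # Fitch parsimony counting: the minimum number of state changes one
    # character requires on a fixed tree. Summed over all characters,
    # these counts are the "tree length". Tree and data are invented.

    tree = ("A", ("B", ("C", "D")))     # a nested-tuple tree of four taxa
    states = {"A": "0", "B": "0", "C": "1", "D": "1"}   # one binary character

    def fitch(node):
        """Return (possible ancestral states, change count) for a subtree."""
        if isinstance(node, str):                       # a leaf
            return {states[node]}, 0
        left, right = node
        lset, lcost = fitch(left)
        rset, rcost = fitch(right)
        if lset & rset:                                 # states agree: no change
            return lset & rset, lcost + rcost
        return lset | rset, lcost + rcost + 1           # disagreement: one change

    _, length = fitch(tree)
    print(length)   # prints: 1 -- the tree groups C with D, so one change suffices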

PAUP, MacClade, and other phylogenetics programs became very widely used, but the data matrices people used as input to these programs were not widely shared, except as tables in published papers. TreeBASE [ http://www.herbaria.harvard.edu/treebase ] was established in 1996 to facilitate data sharing and re-use. It now holds data from more than 450 different studies, covering almost 14,000 taxa.

Specimen Data and Species Distributions

Nature and Uses of Specimen Data

The most important physical resources to the pursuit of systematics are the specimen collections housed in natural history museums. Collections serve as holding facilities for large quantities of unstudied material as well as permanent repositories for specimens already studied.

The archival role of collections is critical because our knowledge of taxonomy is constantly shifting; new species are being discovered and existing species are being merged or split along new lines. When classifications are revised, specimens need to be re-identified. In some situations, identifications can be updated automatically, e.g., when species are merged, but in other situations every specimen needs to be re-examined. The good thing about museum specimens is that they can be re-examined and re-identified. A disembodied species name on a piece of paper (i.e., an observation without a "voucher" specimen) cannot. A museum specimen therefore represents a very long-lived and update-able piece of evidence that an organism occurred at a particular place and time. Museum specimens and vouchered observations provide the best long-term evidence for the distributions of species.

Another important point that derives from the update-ability of specimen data is that the management of the specimens and the management of the data about the specimens should never be separated. A data set downloaded from a collection database becomes progressively out of date as time goes on; specimen identifications are updated, new specimens are collected and cataloged, more specimens are prepared and studied. In addition, collection managers are constantly improving data management practices and are working hard to improve the consistency and completeness of existing specimen records. The Museum of Vertebrate Zoology (UC Berkeley), for example, recently estimated that nearly one third of their 625,000 specimen records are updated every year.

The basic information usually associated with a specimen can be divided into three broad categories (a structural sketch follows the list):

  1. information intrinsic to the specimen itself -- taxonomic identification, sex, life-stage, etc.;
  2. information that describes circumstances of its collection and its context in nature -- date/time of collection, names of collector(s), method of collection, habitat description, geographic location, etc. (Note, specimens without the basic original locality information are commonly discarded);
  3. information about the specimen's special relevance to science, if any -- e.g., that the specimen has been photographed, measured, is a nomenclatural type, etc.
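These categories map naturally onto a structured record. A minimal sketch, with field names chosen for illustration rather than taken from any museum's actual schema:

    from dataclasses import dataclass, field
    from typing import List, Optional

    # A minimal sketch of the three categories of specimen data listed
    # above. Field names are illustrative, not any museum's actual schema.

    @dataclass
    class Specimen:
        # 1. intrinsic to the specimen itself
        identification: str                  # current taxonomic identification
        sex: Optional[str] = None
        life_stage: Optional[str] = None

        # 2. circumstances of collection and context in nature
        collection_date: Optional[str] = None
        collectors: List[str] = field(default_factory=list)
        locality_text: Optional[str] = None  # the original verbal locality
        latitude: Optional[float] = None     # added later by geo-referencing
        longitude: Optional[float] = None

        # 3. special relevance to science
        is_type: bool = False                # a nomenclatural type specimen?
        photographed: bool = False
        measured: bool = False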

Taxonomists first want a simple list of appropriate material held by an institution -- i.e., "What do you have that I need to study?" (Finding the relevant material is an initial step in every original taxonomic study.) Beyond that, a taxonomist wants to see everything known about that material -- a complete dump, in readable form. Many taxonomists then want to get specimen data in structured form so they can use their computers to sort, count, and plot distributions. There are a couple of important points to make about the uses of data in structured form. First, these uses treat a museum catalog as a data set, not as a text document or an index for finding specimens. Second, a user with these purposes in mind is not really concerned about data from a particular museum -- he or she wants all the relevant data, from all the different museums. The fact that retrieving all the relevant data will require 10-50 different museums to be queried is not just a nuisance, but a major impediment.

Some of the biggest challenges facing the systematics community concern collection data. One is getting all collections computerized so that the data are simply accessible in electronic form. Another is creating a networked information retrieval system that will make all collections accessible from a single interface and will return structured data (not just free text) as an integrated data set (not as n data sets in n different structures). As a separate part of the digitizing effort, each specimen needs to be associated with a geo-referenced locality -- i.e., a collection locality expressed as a latitude-longitude. (Most specimen localities were originally recorded as a textual description.) Progress is being made on all of these fronts (more about the networked information retrieval system, below), but much work remains to be done.
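Geo-referencing itself can be partly automated when the textual locality names a known place plus an offset. A heavily simplified sketch (the gazetteer entry and locality are invented, and real geo-referencing also attaches an uncertainty radius to the resulting point):

    import math

    # A heavily simplified sketch of geo-referencing: turning a textual
    # locality like "10 km N of Townsville" into a latitude-longitude via
    # a gazetteer lookup plus a simple offset calculation.

    gazetteer = {"Townsville": (-19.26, 146.82)}    # (lat, lon), illustrative

    def offset(place, km, bearing_deg):
        """Move 'km' from a named place along a compass bearing (0 = north)."""
        lat, lon = gazetteer[place]
        d_lat = km * math.cos(math.radians(bearing_deg)) / 111.32
        d_lon = km * math.sin(math.radians(bearing_deg)) / (
            111.32 * math.cos(math.radians(lat)))
        return lat + d_lat, lon + d_lon

    print(offset("Townsville", 10, 0))   # "10 km N of Townsville"
    # prints approximately (-19.17, 146.82)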

History and Status

Natural history collections are organized and managed according to disciplines based on higher taxonomic groups, such as botany, entomology, and ichthyology. The conservation requirements for specimens (as in museum object conservation, not habitat conservation) and other practical concerns, such as the ratio of staff to specimens, mean that virtually every discipline has developed its own suite of collection management practices. Collection operations are much more similar within a discipline and across museums than across the disciplines represented in a given museum. There are typically between 5 and 10 independent collection management units in any large museum.

The Directory of Research Systematics Collections (DRSC; see below) contains survey results submitted from 145 institutions - predominantly US - that hold a total of 525 primary collections.

The number of specimens in a collection varies widely, from tens of thousands to tens of millions. Globally, natural history museums are estimated to hold about two and a half billion specimens (Duckworth et al. 1993). The taxonomic diversity contained in a collection also varies widely. Large entomology collections have examples of more than 100,000 species, while the largest mammal collection contains representatives of fewer than 6,000 species. The geographic coverage of collections also varies; some have a regional focus, while others are best described as global, though even a global collection always has strengths in particular regions.

By the early part of this century, many (but not all) disciplines had made specimen cataloging a standard part of collection management practice; items were cataloged either in ledgers or on cards. Even before 1900, leading naturalists were aware that what distinguished research collections from curiosity cabinets was the information recorded along with specimens when they were collected. Catalog numbers affixed to specimens establish a permanent correspondence between the specimen and information in a catalog or notebook. In some collections, taxonomic and geographic cross-indexes were set up, typically on index cards, to make certain kinds of information retrieval more efficient or even possible.

Computer-based cataloging began in the 1960s and by the middle to late 1980s had become widespread in disciplines where item-level cataloging was considered standard practice. There are, however, several disciplines where item-level cataloging is not standard practice. These are the disciplines such as entomology and micro-paleontology, where the number of specimens in a single collection is commonly more than a million, or even tens of millions.

Botany is unique in that the most complete information and the specimens themselves are almost always physically co-located on numbered herbarium sheets. The herbarium itself represents a large catalog. Many botany collections have begun computerizing, but few of the internationally important collections are more than 25% computerized.

Retrospective data capture is the process of digitizing item-level information from specimen labels, index cards, or ledgers. Pre-existing paper-based catalogs make retrospective data capture much more efficient, but a subsequent reconciliation or inventory is then required to ensure that the database accurately reflects what is in the collection. Again, botanical collections are different because data are transcribed directly from the herbarium sheets themselves. Each specimen has to be handled, but no reconciliation phase is required.

The typical percentage of each collection computerized also varies by discipline. Vertebrate collections lead the way with many important collections 100% computerized.

Over the last 15 to 20 years, a number of collection management applications (perhaps 10) have been developed explicitly as products. (The term "product" here means developed as a software package, intended for use at other organizations.) Another 10 or 20 applications have been made available passively -- offered without charge and without support. Most of these applications were developed with a bias towards a particular discipline. A system developed for plants might not work well for fishes, or vice versa. While virtually all these product-ized applications have been used successfully at other collections, no application has achieved more than perhaps a 15% market share (across all collections with active computerization projects).

Example Software Projects

MUSE -- http://usobi.org/specify/musesup.html

MUSE was developed in the late eighties (as a Novell/B-trieve server with a C++/DOS client) for fish collections, and was deployed widely in the United States and Latin America. The MUSE project relocated a few times and stumbled when the transition to Windows was undertaken. Eventually MUSE transformed into Specify, but under different direction. MUSE installations are now considered legacy applications.

BIOTICA -- http://www.conabio.gob.mx/biotica_ingles/acerca_biotica.html

Biotica was developed by CONABIO in 1995 and is now in its third major release. Biotica manages information about nomenclature (both scientific and vernacular), geography, specimens (and observations), people and institutions, and literature. It is a Windows application developed in Visual Basic, stores data in 33 tables, supports customized reports, and is designed to import and export data easily for use by other programs, such as GIS. Biotica has both Spanish and English versions.

BIOTA -- http://viceroy.eeb.uconn.edu/biota

Biota was developed for Macintosh as a "Fourth Dimension" (4-D) application, but along with 4-D has become capable of running on Windows. Biota was first released in 1997. Biota provides an excellent balance between capability and simplicity. Six incremental releases have been issued since 1997, approximately 4 to 6 months apart. Biota seems to target individual researcher- or project-level collections, up to medium-sized institutional collections. (This is based on features offered, not on scalability testing or analysis of the install-base.)

SPECIFY -- http://usobi.org/specify/

Specify was implemented as a Delphi application using MS Jet-Engine as the database. It was designed to support virtually any type of systematic collection and the underlying data model is highly normalized. For these reasons it is also relatively complex, which makes data migration (importing data from other applications) somewhat difficult. The Specify project has strong NSF support, but the level of funding could still be considered low relative to the task of supporting an install-base as large as it could be. A new version using MS SQL-Server as the back end is expected in late 2000.

BIOLINK -- http://www.ento.csiro.au/biolink/software.html

BioLink is one of the newest entries into the collection software arena. It is a 3-tiered system with a Visual Basic presentation layer, Visual C++ in the middle tier, and MS SQL-Server on the backend. The underlying model may be a bit more complex than Biota's, but not as complicated as those in Biotica or Specify. Some of the model reflects its origins in entomology.

Also see the list of "Software for Biological Collection Management" compiled by Walter Berendsohn, Chair, TDWG Subgroup on Accession Data; http://www.bgbm.fu-berlin.de/TDWG/acc/Software.htm.

There is a pervasive tendency for each collection to develop its own collection cataloging application. This trend may be getting stronger as generic data management tools become more accessible and deliver more capability to people with little or no programming background. The issue driving most individuals in their choice of software appears to be control and the assurance of long-term support. Their concerns are commonly expressed in the questions: 1) "Can I make the system do what I want it to?", and 2) "Is the software going to be maintained and updated, or will I be left with an obsolete system?"

The "roll your own" trend has good and bad aspects. On the positive side, collection managers and curators who do it themselves become better educated about the complexity and difficulties of data management. It gives them a better appreciation of data capture protocols at the time of collection and ultimately makes them more technically capable. The down side is that most biologists have very little appreciation of how complicated data management can be, and absolutely no clue how complicated software development can be. Their education comes at the cost of mistakes. In addition a lot of effort is being wasted on independently developing relatively unsophisticated systems.

The Species Analyst (TSA)

The unmanaged, organic spread of information technology through natural history collections has also resulted in a large degree of heterogeneity among database systems; nearly every collection database has a different underlying structure even though each collection keeps roughly the same core information. The heterogeneity makes information integration difficult but not impossible, at least on a modest scale. The Species Analyst is a distributed information retrieval system that can query multiple collection databases at the same time and return data in a simple, tabular format. It has been in prototype for about a year, and FishNet, the first significant deployment project, has recently been funded to bring about 20 collections on-line. Another project, the Mammal Network Information System (MANIS), is in preparation and if funded will bring another 18 collections on-line.

Species Analyst - http://habanero.nhm.ukans.edu

Fishnet - http://habanero.nhm.ukans.edu/Fishnet
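The core idea -- one query fanned out to many differently structured databases, with each result mapped onto a common set of fields and returned as a single table -- can be sketched as follows. The museums, schemas, and records here are invented, and the real Species Analyst is built on the Z39.50 retrieval standard rather than plain function calls:

    # Sketch of distributed retrieval: one query is fanned out to every
    # collection, each record is mapped from local field names onto a
    # common set of columns, and the rows return as ONE integrated table.
    # Museums, schemas, and records here are invented.

    museum_a = [{"sci_name": "Gorilla gorilla", "cat_no": "A-101"}]
    museum_b = [{"taxon": "Gorilla gorilla", "catalogNumber": "B-77"}]

    # Per-collection mapping of local field names onto shared column names.
    sources = [
        (museum_a, {"sci_name": "scientific_name", "cat_no": "catalog_number"}),
        (museum_b, {"taxon": "scientific_name", "catalogNumber": "catalog_number"}),
    ]

    def federated_query(name):
        """Return one integrated table of matching records from all sources."""
        table = []
        for records, mapping in sources:
            for rec in records:
                row = {mapping[k]: v for k, v in rec.items()}
                if row["scientific_name"] == name:
                    table.append(row)
        return table

    print(federated_query("Gorilla gorilla"))
    # prints two rows, one per museum, in a single shared structure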

Other Resources

(A List of Lists)

Other important projects and web sites: