Proceedings of the Taxonomic Authority Files Workshop, Washington, DC, June 22-23, 1998

Introduction


Stanley D. Blum

California Academy of Sciences


The premise for this workshop is actually very simple. The information environment in the biological sciences is very partitioned. Data are created and managed in a highly distributed fashion, and in that kind of environment, consistency across all shared concepts is critical to the retrieval and integration of data. The library community has years of experience in promoting consistency within and among their catalogs through the use of authority files. A number of us in the natural history community think we can learn a lot from their example.

To explain this a bit further, I would like to walk you through what I might call a simplified flow of basic taxonomic data, from its sources through some of its uses (Fig. 1). (By "basic taxonomic data" I mean taxonomic names and classifications.) It begins with taxonomists and specimens. Taxonomists make observations on specimens, reconcile these against existing knowledge (i.e., the literature), and create new taxonomies, which they typically publish as papers and monographs. To become globally available for use, names and classifications all go into the literature. (Taxonomists also apply taxonomic names to specimens and observations, but taxonomists are not the only ones who do this.) From the primary literature, compilers, indexers, and editors gather that information into taxonomic authority files, which can be very different in scope; virtually every compilation effort has an explicit taxonomic and geographic focus. These authority files can be assembled into progressively larger files. The logical end point for this type of assembly would be a global taxonomic authority file. The assembly-decomposition arrows go both ways because a particular user group may find it useful to extract a subset of taxa for a particular purpose.

Figure 1. Flow of Basic Taxonomic Data.
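The assembly and decomposition steps in this flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any existing project's software: the `NameRecord` fields, the helper names `make_file`, `assemble`, and `extract`, and the rule that the earlier file wins on a duplicate name are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameRecord:
    name: str    # taxonomic name
    parent: str  # name of the next-higher taxon; "" at the top of a file
    source: str  # which compilation supplied the record (hypothetical field)


def make_file(source, pairs):
    """Build a small authority file (dict keyed by name) from (name, parent) pairs."""
    return {name: NameRecord(name, parent, source) for name, parent in pairs}


def assemble(*files):
    """Merge authority files; on a duplicate name the earlier file's record is kept."""
    merged = {}
    for authority in files:
        for name, record in authority.items():
            merged.setdefault(name, record)
    return merged


def extract(authority, root):
    """Return the subset of records at or below `root` in the classification."""
    subset = {}
    for name, record in authority.items():
        current, seen = record, set()
        # Walk up the parent links until we reach `root` or run out of parents.
        while current is not None and current.name not in seen:
            if current.name == root:
                subset[name] = record
                break
            seen.add(current.name)
            current = authority.get(current.parent)
    return subset


# Two narrowly scoped compilations (illustrative data only).
salmonids = make_file("salmonid checklist", [
    ("Salmonidae", ""),
    ("Oncorhynchus", "Salmonidae"),
    ("Oncorhynchus mykiss", "Oncorhynchus"),
    ("Salmo", "Salmonidae"),
    ("Salmo salar", "Salmo"),
])
cyprinids = make_file("cyprinid checklist", [
    ("Cyprinidae", ""),
    ("Cyprinus", "Cyprinidae"),
    ("Cyprinus carpio", "Cyprinus"),
])

larger_file = assemble(salmonids, cyprinids)          # assembly
salmonid_subset = extract(larger_file, "Salmonidae")  # decomposition
```

A real assembly step would also have to decide what to do when two compilations disagree about the same name; the "earlier file wins" rule here simply stands in for that editorial decision.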

From a data management perspective, some of the most important users of authority files are the catalogers: the people who attach taxonomic names to specimens and observations, and enter these as electronic records into biological databases. I've shown this flow as a dashed red arrow because I don't think we do this well. We don't yet have an effective means to disseminate authoritative representations of taxonomic names to the catalogers, at least not in a coordinated or integrated fashion.

The last steps involve: the retrieval of biological information, from multiple, independently-managed sources; its integration into data sets; and the derivation of biological knowledge by analysts. The jobs of the data integrators and analysts are much more difficult because the taxonomies contained in each of the different data sources are inconsistent, and not necessarily because of substantive differences in taxonomic opinion, but simply because it is hard to keep taxonomic identifications current when classifications change. So the message I want to convey in this diagram is that our inability to get authoritative taxonomic information into the hands of the catalogers makes it more difficult to derive biological knowledge from the ever-growing repositories of biological data.
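To make the integration problem concrete, here is a minimal sketch of what a data integrator could do if an authority file were available: map each name found in the source records to its currently accepted name before combining them. The lookup table, the record layout, and the function name `reconcile` are assumptions made for this example; the two trout names are used only as a familiar illustration of one species filed under an older and a newer name.

```python
# Hypothetical authority file: every known name maps to its currently accepted name.
ACCEPTED_NAME = {
    "Oncorhynchus mykiss": "Oncorhynchus mykiss",
    "Salmo gairdneri": "Oncorhynchus mykiss",  # older name for the rainbow trout
}


def reconcile(records, accepted=ACCEPTED_NAME):
    """Split records into those resolved against the authority file and those not."""
    resolved, unmatched = [], []
    for record in records:
        current = accepted.get(record["name"].strip())
        if current is None:
            unmatched.append(record)  # needs a cataloger's attention
        else:
            resolved.append({**record, "accepted_name": current})
    return resolved, unmatched


# Records from two independently managed sources (illustrative data only).
museum_a = [{"id": "A-1", "name": "Salmo gairdneri"}]
museum_b = [
    {"id": "B-7", "name": "Oncorhynchus mykiss"},
    {"id": "B-8", "name": "Salmo sp."},
]

resolved, unmatched = reconcile(museum_a + museum_b)
```

Without the shared lookup, records A-1 and B-7 would look like two different species to an analyst, even though no one disagrees about the taxonomy.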

As you hear about the taxonomic authority file projects this morning and the authority control processes in the library community later this afternoon and tomorrow, I would just ask that you listen with an open mind, and that you think critically about the components that are missing in the natural history community, the future directions we need to take, and the ways we can leverage our collective resources. This workshop will be punctuated with coffee breaks, meals, and discussion sessions. I would also ask that you take advantage of these opportunities to interact with each other, and particularly with individuals outside your own community. Let's see if we can come up with some solid recommendations about what needs to be developed in the systematics community to make more effective use of taxonomic information, particularly in biological data management. So those are my objectives for this workshop.

In the long term, I think the principles and processes that are used in library cataloging can be exploited by the natural history community to accelerate the rate of data capture and to improve the quality of data. I can explain this by showing you the basic concepts encompassed by natural history collections data. This diagram (Fig. 2) is a somewhat simplified version of the ASC Information Model for Biological Collections.

Figure 2. The Association of Systematics Collections Information Model for Biological Collections. [After ASC, 1993; simplified.]

This model was developed in 1992, in another NSF-sponsored workshop, and represents a reconciliation of the high-level concepts found in collection databases across the various systematic disciplines, such as Botany, Entomology, Ichthyology, etc. It shows that there are only about six major concepts or subject areas in the model. If we think about who has control over or responsibility for these data, we see that only the Collection-Object is the purview of any local institution. The rest are global concepts that are shared across the entire community and used to describe Collection-Objects.

The role of these other concepts as descriptors might be more clear if we translate this schema into an object-oriented framework (Fig. 3), where objects are in pink and the attributes of objects are in blue. This diagram shows the structure of a hypothetical Collection-Object, and we see that some of its attributes are actually pointers to other objects. Another way of saying this is that a Collection-Object record is made up of component objects. These other objects represent our common language—they are the terms, or the meanings of the terms, we apply to Collection-Objects as descriptors.

Figure 3. An object-oriented representation of the data structure for a Collection-Object.
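The idea that a Collection-Object record is made up of pointers to shared component objects can be sketched with Python dataclasses. This is a simplified illustration, not the ASC model itself: the class names, the particular attributes, and the sample values are all hypothetical, chosen only to show local records referring to shared descriptors.

```python
from dataclasses import dataclass
from typing import Optional


# Shared, community-wide descriptor objects (hypothetical, simplified).
@dataclass(frozen=True)
class Taxon:
    name: str


@dataclass(frozen=True)
class Person:
    name: str


@dataclass(frozen=True)
class Locality:
    description: str


@dataclass(frozen=True)
class Reference:
    citation: str


# The only object that is the responsibility of a local institution.
@dataclass
class CollectionObject:
    catalog_number: str                   # local attribute
    taxon: Taxon                          # pointer to a shared object
    collector: Person                     # pointer to a shared object
    locality: Locality                    # pointer to a shared object
    cited_in: Optional[Reference] = None  # pointer to a shared object, if any


# One shared set of descriptors (placeholder values)...
rainbow_trout = Taxon("Oncorhynchus mykiss")
collector = Person("A. Collector")
creek = Locality("Example Creek, above the road crossing")

# ...reused by two locally managed specimen records.
specimen_1 = CollectionObject("CAT-0001", rainbow_trout, collector, creek)
specimen_2 = CollectionObject("CAT-0002", rainbow_trout, collector, creek)
```

Both specimen records point at the same `Taxon`, `Person`, and `Locality` objects, so a descriptor maintained once in a shared file is available to every record that refers to it.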

If we were to take the processes of compilation and dissemination that we are talking about applying to taxonomic information, and apply them to the rest of these objects—particularly people, bibliographic references, and localities—we could leverage our resources and in essence create our own system for cooperative cataloging. Sharing these component objects would accelerate the rate at which we capture natural history collections data and would improve the quality of our data. So my longer-range objective is to get you thinking about creating our own cooperative cataloging program, much like the one in the library community, and deploying it within the natural history community.

The program for the workshop is as follows. This morning and in the first part of the afternoon we're going to hear about some example taxonomic authority file compilation projects. Later in the afternoon we will begin to hear about authority control in the library community. The general talks will be this afternoon, and the more specific talks, about the structures of authorities, gazetteers, and thesauri, will come tomorrow morning. These will be followed by two talks on data models and data structures for taxonomic information, and then finally, two talks on mechanisms for accessing and replicating authority information.

It's a very full schedule, so let's get right into the program.