A Software Tool for Retrospectively Georeferencing Specimen Localities using ArcView®

Elizabeth J. Proctor, Stanley D. Blum and George Chaplin
California Academy of Sciences

Version Date-Time: August 4, 2004 17:08

Abstract

Natural history collections represent a large but underutilized source of information about the distribution of organisms in space and time. In most collections, however, only the most recently collected specimens (typically 1-2%) have locality data that include latitude and longitude. Most locality records contain only a textual description of place, which makes these data effectively unusable in spatial analyses. Retrospective georeferencing is the assignment of latitude and longitude coordinates to a textual locality description. To facilitate retrospective georeferencing, the California Academy of Sciences is developing a georeferencing tool as an extension to ESRI's ArcView software. It works by allowing the user to select a locality record, quickly locate the referenced place on a base map, and draw a shape that represents the user's interpretation of the locality without loss of information. The drawn locality is stored as a "shape" in a spatial database, an ArcView shapefile, which can hold thousands of shapes. To make the georeferenced locality more available to mapping and spatial analysis software, the tool also derives the centroid and precision of the locality, which are stored as three simple numbers: the latitude and longitude (in decimal degrees) and the maximum span of the locality (in meters). The span of the locality represents its precision and can be employed by end-users to exclude vaguely represented localities from spatial analysis. The tool improves on previous methods by: a) integrating place-name look-ups with a variety of spatial measuring tools to assist the placement of localities; b) enabling a user to represent localities as two-dimensional shapes (e.g., circles, buffered lines, or polygons); c) quantifying the precision of localities; and d) facilitating the capture of metadata to document the georeferencing process.

Background

Natural history museum collections (preserved and cataloged plant and animal specimens) provide some of the most detailed information about the distribution of organisms in space and time. For hundreds of years, the data associated with museum specimens have consistently included geographic location, but well over 95% of this locality information exists in only textual form, such as "South bank of Purisima Creek", or "1 mi W of El Camino Real", rather than coordinates such as latitude/longitude or UTM. This representation makes them effectively unavailable to simple mapping tools or more sophisticated spatial analysis software.

The style, precision, and consistency of locality descriptions vary widely among records, which makes a completely automated approach to georeferencing relatively ineffective.

Due to the growing use of GIS applications, there is an effort underway to make these collection data available to spatial analyses. Increased societal and scientific interest in issues like global change and biodiversity is causing ecologists to question how species patterns occur and vary in time and space (Levin 1992, as cited in Michener et al. 1997). As an example, mapping historical distributions of species is an important part of documenting the decline of extant species (Shaffer et al. 1998). The first step, and the first hurdle for many natural history museums, is simply computerizing their collection catalogs (Bojoroquez-Tapia et al. 1994), which might exist in handwritten ledgers or index cards. The second step is converting the textual locality descriptions into spatial data that can be mapped.

This process, which can be called retrospective georeferencing, is relatively new, and to date there is no standardized way of doing it. It is a laborious process, requiring the interpretation of antiquated and often imprecise textual locality descriptions and their conversion into coordinates such as latitude and longitude. Converting a vague locality description into a single subjective point creates the problem of false precision, often to a scientifically unacceptable degree. On the positive side, it is a process that only needs to be done once.

To date, different methods have been used to georeference historical locality descriptions, including gazetteer lookups and software packages that identify latitude and longitude on basemaps (such as MapTech’s Terrain Navigator). Since these methods involve the subjective placement of a single (x,y) location point by the user, they entail differing degrees of overgeneralization, false precision, and time spent. The California Academy of Sciences (CAS) has developed a georeferencing tool that facilitates the same process, using desktop ArcView GIS software.

Retrospective georeferencing tool

This georeferencing tool was written by George Chaplin of CAS as an extension to ESRI's ArcView software. It is designed to facilitate the retrospective georeferencing of biological specimen localities. In brief, it works by allowing the user to search and sort a database of localities, quickly locate places on a base map, and draw shapes to represent locality descriptions. These shapes, stored in ArcView’s shapefile format, are used to derive two values: the centroid of the shape expressed as latitude and longitude in decimal degrees, and the shape’s span. The span provides a quantitative expression of the textual locality’s vagueness; imprecise localities will tend to be drawn with large shapes having large spans, and precise localities will tend to be drawn with smaller shapes having smaller spans. This relative measure improves on other methods of expressing coordinates’ precision levels as subjective categories. Overall, this tool improves on previous retrospective georeferencing methods by a) increasing speed, b) maximizing consistency between users, c) allowing incorporation of interpretation standards established by collection managers, and d) quantifying textual localities’ vagueness.

Retrospective georeferencing process

The process of retrospective georeferencing involves three steps: 1) preparing the specimen database, 2) georeferencing, and 3) integrating the newly created data.

1) Preparing the specimen database

The tool is designed to work with any collection database format. As such, it works with collections in which individual specimens, identified by unique code numbers, are the finest "grain" in the database. For such databases, the user would derive a latitude and longitude value for each specimen. However, upon examination of locality descriptions, redundancy is almost always found. Redundancy can exist for valid reasons: very commonly, multiple specimens have been collected at one location, so the same place name is associated with each specimen. But redundancy can also be meaningless, as in the case of different wordings used by different collectors, inconsistencies in abbreviation, antiquated place names, or simple typographic errors. This kind of redundancy can be called "false duplication," since different locality descriptions are used to signify what is intended to be the same place. Within the context of georeferencing, false duplication means that the same locality has to be georeferenced two or more times. Therefore, a normalization step is suggested prior to georeferencing. Creating a separate "edited locality" field enables locality data to be converted into a standard form while preserving the original language of the collector.

The amount of false duplication may be surprising. In the case of the Herpetology Department at CAS, the database contained more than 47,000 unique localities in California (before editing). By standardizing locality descriptions and removing false duplicates, the number of unique localities was reduced to 10,107. The general categories of the false duplicates corrected in the California herpetology database are summarized in the following table:

Table 1. Types of inconsistency that contribute to false duplication of locality records.

| Source of inconsistency | Examples |
|---|---|
| Inconsistent interpretation of original handwritten locality | "Llagas Creek at Pajaro River"; "Llagas Creek at the Pajaro River" (handwritten as: "Llagas Cr. at the Pajaro R., Santa Clara Co., Ca.") |
| Different wordings of the same place | "Santa Cruz, University of California"; "University of California Santa Cruz" |
| Different orderings of the same locality | "Sequoia National Park, Colony Mill"; "Colony Mill, Sequoia National Park" |
| Inconsistencies of form, such as capitalization, abbreviation, use of prepositions, and punctuation | junction/jct/Jct/jct./jctn … etc.; highway/hwy/Hwy; "2 mi N of Barstow" vs. "2 mi N Barstow"; "near 3-Rocks" vs. "near Three Rocks" |
| Typographical errors and misspellings | "39 33 9.3 N, 120 38 27.24 W" vs. "39 39 9.3 N, 120 38 27.24 W"; "Thornsberry Rd at Lovall Valey Rd" vs. "Thornsberry Rd at Lovall Valley Rd"; "Lemoor Naval Air Station" vs. "Lemoore Naval Air Station" |
| Inconsistent representation of precision | 3 vs. 3.0 |
| Inclusion of micro-habitat as location data | "5 mi NW of Los Trancos Woods"; "5 mi NW of Los Trancos Woods, under log" |
| Other typographical anomalies | "Alder Cr Camp (on Hwy 50)"; "Alder Cr Camp [on Hwy 50]" |

To correct these inconsistencies, we first visually inspected the database and created a list of the existing locality variations for the collection manager to review and decide on the desired standardizations. Next, we used a combination of software and tools to implement the collection manager’s approved changes, including Textpad (version 4.2.2), Microsoft Access 2000, the USGS Geographic Names Information System (GNIS), and Perl scripts.

We used Textpad’s "search and replace" function for global edits such as deleting extra spaces and periods at the ends of lines, and changing square brackets to parentheses. We used Textpad’s "spellcheck" function to find common typographical errors. Next, we wrote a Perl script that used regular expressions to find and replace common patterns globally. These patterns included inconsistencies in abbreviation (junction/jct/Jct/jct./jctn, all changed to jct) as well as inconsistencies in form, such as a missing "of" after a cardinal direction ("2 mi N Barstow" was changed to "2 mi N of Barstow"). We then used Textpad to make a second pass through the database and evaluate further edits individually. These included edits that would have been problematic if made globally, such as "mountain," which should be spelled out if part of a place name (Mountain View), but which should be abbreviated in one of three different ways when part of a feature name (Mt St. Helena, San Bruno Mtn, Sierra Nevada Mtns).
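The global pattern edits described above can be sketched in a few lines. The following is only an illustration in Python rather than the Perl script we used; the function name `normalize_locality` and the exact regular expressions are assumptions, and a real rule set would follow the collection manager's approved standardizations.

```python
import re

# Hypothetical rules modelled on the examples in the text.
# junction/jct/Jct/jct./jctn are all changed to "jct".
JUNCTION = re.compile(r"\b(?:junction|jctn|jct)\b\.?", re.IGNORECASE)

# A distance and compass direction that is not followed by "of",
# e.g. "2 mi N Barstow".
MISSING_OF = re.compile(
    r"\b(\d+(?:\.\d+)?\s+mi\s+(?:NE|NW|SE|SW|N|S|E|W))\s+(?!of\b)"
)


def normalize_locality(text: str) -> str:
    """Apply the global edits to one locality string."""
    text = re.sub(r"\s+", " ", text).strip()          # extra spaces
    text = re.sub(r"\.$", "", text)                   # period at end of line
    text = text.replace("[", "(").replace("]", ")")   # brackets to parentheses
    text = JUNCTION.sub("jct", text)                  # abbreviation variants
    text = MISSING_OF.sub(r"\1 of ", text)            # insert missing "of"
    return text


if __name__ == "__main__":
    for raw in ["2 mi N Barstow", "Alder Cr Camp [on Hwy 50]"]:
        print(raw, "->", normalize_locality(raw))
```

Edits that depend on context, such as the treatment of "mountain," are deliberately left out of this sketch because they need individual review.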

Next, we examined the database for inconsistencies in place name spellings. Using the Geographic Names Information System, we looked up place names and corrected spelling inconsistencies (Devil’s Slide changed to Devils Slide; Idylwild changed to Idyllwild). Finally, we imported the database into Access and ran a query that displayed only "unique" localities. We visually inspected these localities individually and, using Textpad, corrected any remaining problems missed by the previous steps.
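The "unique localities" query can be illustrated as follows. This sketch uses Python's built-in SQLite module as a stand-in for Microsoft Access; the table name `specimens`, the column names, and the sample rows are hypothetical.

```python
import sqlite3


def unique_localities(rows):
    """Return (edited_locality, specimen_count) pairs, one per unique locality.

    rows is a list of (catalog_no, edited_locality) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE specimens (catalog_no INTEGER, edited_locality TEXT)"
    )
    conn.executemany("INSERT INTO specimens VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT edited_locality, COUNT(*) FROM specimens "
        "GROUP BY edited_locality ORDER BY edited_locality"
    ).fetchall()
    conn.close()
    return result


# Hypothetical sample: five specimens, three unique localities.
SAMPLE = [
    (1, "Colony Mill, Sequoia National Park"),
    (2, "Colony Mill, Sequoia National Park"),
    (3, "2 mi N of Barstow"),
    (4, "near Three Rocks"),
    (5, "2 mi N of Barstow"),
]

if __name__ == "__main__":
    for locality, count in unique_localities(SAMPLE):
        print(count, locality)
```

Listing the count beside each locality makes it easier to see which remaining variants are probably false duplicates of a more common form.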

The entire editing process took approximately 75 hours. As suggested above, this normalization step achieves two improvements: 1) georeferencing is faster because false duplicates are removed, and 2) searches are more productive because standardization improves the form of the database.

2) Georeferencing

Figure 1. Screen-shot: the SDO table and Interactively Geocode dialog box, with locality "Camptonville" selected.

In this context, retrospective georeferencing means the assignment of latitude and longitude coordinates to a textual locality description. Following is a specific description of how CAS’ georeferencing tool can be used to georeference a dataset of localities. This description uses terms and concepts with the assumption that the reader is familiar with ArcView GIS software (version 3.2). It describes the process of georeferencing following the collection and preparation of base layers; specific suggestions about how to obtain and incorporate base layer data for a desired geographic region are available from the authors upon request.

First, open the ArcView project and type in your user name at the logon prompt. The table showing your normalized database of localities, called the "sdo" (spatial data object) table, will open. This table should contain a field with unique id numbers identifying each locality, called the Placeindex field. A dialog box called "Interactively Geocode" will also open. Select a record from the sdo.dbf table, for example, "Camptonville." It will appear in the Interactively Geocode window (see Figure 1).

Figure 2. Screen-shot: base map zoomed to locality "Camptonville."

The Interactively Geocode dialog box provides some options for navigating the geocoding process. Clicking the "Go To Map" button will simply bring up the base maps associated with the project. Clicking the "Parse SDO" button instead will begin a series of steps that will "zoom" you to the proper location on the base maps. "Parse SDO" splits the locality into its component words and allows you to identify which of the words is the place name. Click on "Camptonville" to select it as the place name to zoom to, and select the appropriate named-place database to search (GNIS, GNIS-historical, Rivers, or Roads). A list of "Candidates for geocoding" appears; click on the best-matching place name and you will zoom to that spot on the base map (see Figure 2).
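To make the parse-and-look-up step concrete, here is a rough sketch in Python. It is not the tool's own code: the word-splitting rule, the tiny name list standing in for a named-place database, and the use of `difflib` for approximate matching are all assumptions made for illustration.

```python
import difflib
import re

# Hypothetical stand-in for a named-place database such as GNIS.
PLACE_NAMES = ["Arboga", "Campbell", "Camptonville", "Compton", "Marysville"]


def parse_words(locality: str) -> list:
    """Split a locality description into its component words."""
    return re.findall(r"[A-Za-z0-9.]+", locality)


def candidates(word: str, names=PLACE_NAMES, limit: int = 3) -> list:
    """Return up to `limit` place names that resemble `word`, best match first."""
    return difflib.get_close_matches(word, names, n=limit, cutoff=0.6)


if __name__ == "__main__":
    words = parse_words("2 mi W of Camptonville")
    print(words)
    # The user picks which word is the place name; here it is the last one.
    print(candidates(words[-1]))
```

Approximate matching is one way to tolerate residual misspellings; an exact look-up against the gazetteer would be a simpler alternative once the localities have been normalized.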

The spot you picked is designated by a graphic (a black dot). If there was an offset in the locality (e.g., "2 mi from..."), a graphic circle will be drawn with the appropriate radius. If there was a direction in the locality (e.g., "2 mi W of..."), a second dot will be drawn indicating that direction.
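The offset and direction graphics correspond to a simple calculation, sketched below in Python. This is an illustration rather than the tool's implementation: it assumes a spherical Earth, accepts only the pattern "<distance> mi <direction> of <place>", and the starting coordinates in the usage example are illustrative values, not a surveyed position.

```python
import math
import re

METERS_PER_MILE = 1609.344
EARTH_RADIUS_M = 6371000.0  # mean radius; spherical approximation

BEARINGS = {"N": 0.0, "NE": 45.0, "E": 90.0, "SE": 135.0,
            "S": 180.0, "SW": 225.0, "W": 270.0, "NW": 315.0}

OFFSET = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*mi\s+(NE|NW|SE|SW|N|S|E|W)\s+of\s+(.+?)\s*$"
)


def parse_offset(locality: str):
    """Return (distance_m, bearing_deg, place_name), or None if no offset."""
    match = OFFSET.match(locality)
    if match is None:
        return None
    distance_m = float(match.group(1)) * METERS_PER_MILE
    return distance_m, BEARINGS[match.group(2)], match.group(3)


def offset_point(lat: float, lon: float, distance_m: float, bearing_deg: float):
    """Great-circle destination from (lat, lon); all angles in decimal degrees."""
    phi1, lam1 = math.radians(lat), math.radians(lon)
    theta = math.radians(bearing_deg)
    delta = distance_m / EARTH_RADIUS_M  # angular distance
    phi2 = math.asin(math.sin(phi1) * math.cos(delta)
                     + math.cos(phi1) * math.sin(delta) * math.cos(theta))
    lam2 = lam1 + math.atan2(
        math.sin(theta) * math.sin(delta) * math.cos(phi1),
        math.cos(delta) - math.sin(phi1) * math.sin(phi2),
    )
    return math.degrees(phi2), math.degrees(lam2)


if __name__ == "__main__":
    distance_m, bearing, place = parse_offset("2 mi W of Camptonville")
    # Illustrative coordinates for the named place.
    print(place, offset_point(39.45, -121.05, distance_m, bearing))
```

The parsed distance gives the radius of the circle graphic, and the destination point gives the position of the second dot.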

The "Draw a shape for this SDO" window provides some tools and options for drawing a shape to represent that locality. The locality itself is displayed for reference in the window, above buttons representing point, line, circle, and polygon shape drawing tools. Click on the desired shape tool and, using the base map for reference, draw the desired shape. For example, "Camptonville" could be represented by a circle of any size, an irregular polygon, or a buffered line (see Figure 3).

Figure 3. Possible representations of the locality "Camptonville": a) as a 100 m radius circle; b) as a 500 m radius circle; c) as an irregular polygon; d) as a line with a 100 m buffer.

Before clicking the Finish button, you can move the shape, edit its vertices, or delete and redraw it. When you are satisfied with the shape, click Finish. The shape will be saved in an ArcView shapefile called "sdoout.shp." A completed set of 53 localities in Yuba County, California, including the shapes drawn to represent the localities and their associated centroids, is shown in Figure 4.

Figure 4. Finished set of 53 localities in Yuba County: a) with centroids (latitude-longitude) only displayed; b) with associated shapes and centroids.

To end a session, close all open windows, click the Save button, and close the project.

3) Integrating the new data

The georeferencing tool creates several tables which store information about the localities and about the georeferencing session. These tables have different uses, but in general they can be exported, edited, joined, and used to retain metadata, using the Placeindex field as join field. Each of the tables is discussed below.

Sdoout.shp

The locality shapes drawn by the user are saved in an ArcView shapefile called "sdoout.shp." The fields of the sdoout.shp attribute table include Placeindex, the unique index number identifying the normalized locality; Logname, the name of the user; X_coord and Y_coord, the longitude and latitude of the centroid of the shape in decimal degrees; and Span, the longest distance across the shape. While in this shapefile format, the coordinates are projected, in the projection of the ArcView project’s View. (For other uses, the shapefile can be unprojected using ArcView or ArcInfo software.) The Span field is expressed in map units, which are meters in this example (see Figure 5).
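For readers who want to see how a centroid and span can be derived from a drawn polygon, a minimal sketch in Python follows. It is not the tool's code: it assumes a simple (non-self-intersecting) polygon whose vertices are given in projected map units (meters), uses the standard area-weighted centroid formula, and takes the span as the largest distance between any two vertices. Conversion of the centroid to decimal degrees is omitted.

```python
import math
from itertools import combinations


def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as (x, y) tuples."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0.0:
        raise ValueError("polygon has zero area")
    return cx / (3.0 * area2), cy / (3.0 * area2)


def polygon_span(vertices):
    """Longest straight-line distance between any two vertices."""
    return max(math.dist(p, q) for p, q in combinations(vertices, 2))


if __name__ == "__main__":
    # Hypothetical 400 m by 300 m rectangle in map units (meters).
    shape = [(0.0, 0.0), (400.0, 0.0), (400.0, 300.0), (0.0, 300.0)]
    print(polygon_centroid(shape), polygon_span(shape))
```

For a polygon, the longest distance across the shape is always reached between two vertices, so comparing vertex pairs is sufficient; circles and buffered lines would need their own treatment.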

Figure 5. Attributes of sdoout shapefile after one shape has been drawn for locality "Camptonville."

This attribute table can be exported into a .dbf file (or a tab-delimited text file) using ArcView’s "Export" function. This .dbf can then be edited and joined back to the original collection database using Placeindex as the join field. In this way, latitude, longitude, and span values become associated with the original textual locality. Since the span measure is used as a relative quantitative expression of the "vagueness" of a locality, the most vague localities ("near Marysville") will tend to have larger spans, and the most precise localities ("4.55 mi NW of Arboga") will tend to have the smallest spans. Depending on the subsequent use of these data, the collection manager can sort the localities by span and use only those localities with the appropriate precision. For example, the resulting "near Marysville" coordinates might not be appropriate for a regional map, but would still be meaningful at the continental scale.
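The join and the span-based selection can be sketched as follows in Python. The field names Placeindex, X_coord, Y_coord, and Span come from the attribute table described above; the `Locality` field, the function names, and all sample values are hypothetical.

```python
def join_by_placeindex(localities, georef_rows):
    """Attach X_coord, Y_coord, and Span to each locality record.

    Localities with no drawn shape get None for the three fields.
    """
    by_index = {row["Placeindex"]: row for row in georef_rows}
    joined = []
    for loc in localities:
        merged = dict(loc)
        geo = by_index.get(loc["Placeindex"])
        for field in ("X_coord", "Y_coord", "Span"):
            merged[field] = geo[field] if geo else None
        joined.append(merged)
    return joined


def usable_at(joined, max_span_m):
    """Keep georeferenced records whose span does not exceed max_span_m."""
    return [r for r in joined
            if r["Span"] is not None and r["Span"] <= max_span_m]


# Hypothetical sample data; coordinates and spans are made up.
LOCALITIES = [
    {"Placeindex": 1, "Locality": "near Marysville"},
    {"Placeindex": 2, "Locality": "4.55 mi NW of Arboga"},
    {"Placeindex": 3, "Locality": "E Buttes"},
]
GEOREF = [
    {"Placeindex": 1, "X_coord": -121.59, "Y_coord": 39.15, "Span": 12000.0},
    {"Placeindex": 2, "X_coord": -121.60, "Y_coord": 39.10, "Span": 150.0},
]

if __name__ == "__main__":
    joined = join_by_placeindex(LOCALITIES, GEOREF)
    for record in usable_at(joined, 1000.0):
        print(record["Locality"], record["Y_coord"], record["X_coord"])
```

The span threshold is the only parameter an end-user needs to choose: a regional analysis might pass a small value, while a continental map could accept much larger spans.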

SDOUpdate.txt

The tool creates an "update" table, in tab-delimited text format, with Placeindex as the join field. This table stores any edits which are suggested by the user. For example, after examining the map, the user might find that a locality is not in the stated county, but is located in an adjacent county. Clicking the "Update Entry" button on the Interactively Geocode dialog brings up a form with fields that can be filled in with proposed updated information. This information can then be reviewed by a supervisor and upon approval used to batch update the original collection database.

SDOMemo.txt

The tool creates a "memo" table in tab-delimited text format, with Placeindex as the join field. This table stores two kinds of comments, temporary and permanent. For example, a locality might not be found anywhere on the particular base maps being used, so the user would skip it and make a temporary comment that alternative sources need to be consulted. Alternatively, if a locality is too imprecise to be usable at all ("E Buttes"), a permanent comment to that effect could be saved. Again, this information is saved in a separate table so that the original database is preserved and the collection manager can review the comments and make changes at his or her discretion.

Session-level Metadata

Session-level metadata, i.e., information about the georeferencing session, including the date and time, the user's logon name, and details about the ArcView project file (*.apr), including filename, projection, datum, and reference themes, will be identified by a session number and stored as an XML document in a separate table.

Figure 6. Example contents of a session-level metadata record.
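As a rough indication of what such a record could look like, the following Python sketch builds a session-level XML document from the items listed above. It does not reproduce the contents of Figure 6 or the tool's actual schema: every element name, the function name `build_session_record`, and all sample values are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_session_record(session_no, timestamp, logname, project):
    """Serialize one georeferencing session as an XML string.

    Element names are hypothetical, chosen only to mirror the items
    listed in the text.
    """
    root = ET.Element("session", number=str(session_no))
    ET.SubElement(root, "datetime").text = timestamp
    ET.SubElement(root, "logname").text = logname
    proj = ET.SubElement(root, "project")
    ET.SubElement(proj, "filename").text = project["filename"]
    ET.SubElement(proj, "projection").text = project["projection"]
    ET.SubElement(proj, "datum").text = project["datum"]
    themes = ET.SubElement(proj, "referenceThemes")
    for name in project["themes"]:
        ET.SubElement(themes, "theme").text = name
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample values.
SAMPLE_PROJECT = {
    "filename": "georef.apr",
    "projection": "Albers Equal-Area Conic",
    "datum": "NAD27",
    "themes": ["GNIS", "Rivers", "Roads"],
}

if __name__ == "__main__":
    print(build_session_record(1, "2004-08-04T17:08:00", "jdoe", SAMPLE_PROJECT))
```

Keeping the record keyed by session number means each shape only needs to carry that number to be traceable to the projection, datum, and reference themes in use when it was drawn.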

Conclusion and next steps

This tool was designed to facilitate the georeferencing of biological specimen localities. It can be used by natural history museum collection managers and staff to derive latitude and longitude coordinates for localities that are represented by textual descriptions of varying precision. Overall, the tool improves on previous retrospective georeferencing methods by a) increasing speed; b) enabling a user to represent a locality as a two-dimensional shape (e.g., a circle, buffered line, or irregular polygon); c) integrating place-name look-ups and a variety of spatial measuring tools to assist drawing a locality; d) quantifying the precision of a locality; and e) facilitating the capture of metadata to document the georeferencing process. Currently the California Academy of Sciences is testing this tool to assess speed and consistency among users.

Our long-term goals for natural history collection data should include maximizing the quantity and quality of data that are available for spatial analysis. Maximizing precision will enable much greater power in answering research questions, such as determining critical habitat requirements for a species. To deliver on the full promise of its collections data, the collections community will have to dedicate significant effort to retrospective georeferencing.

The capabilities to retrieve, map, and analyze collection data are about to coalesce and create a large demand for georeferenced collection data. Relatively inexpensive and powerful mapping software has been available for several years. Other analytical tools, such as the Biodiversity Species Workshop (Stockwell), have also come on-line recently and represent another, more sophisticated, class of uses for georeferenced collection data. Distributed query software for collection databases (ZBIG/Vieglais) is being developed and is about to become (we predict) very widely used. The natural history collections community is about to be confronted with the fact that collection databases we've been building for the last 20 years can't deliver on their potential because the data have not been translated into a machine-usable form. This georeferencing tool will provide institutions with the technical ability to translate their textual locality data into geospatial data accurately and efficiently.

Contact

For more information about the software and CAS’ retrospective georeferencing project, contact:

Stan Blum
Research Information Manager
California Academy of Sciences
Golden Gate Park
San Francisco, CA 94118
sblum@calacademy.org

References

Bojoroquez-Tapia, L., P. Balvanera, and A. Cuaron. 1994. "Biological inventories and computer data bases: their role in environmental assessments." Environmental Management. Vol. 18, No. 5. p.775-785.

Michener, W., J. Brunt, J. Helly, T. Kirchner, and S. Stafford. 1997. "Nongeospatial metadata for the ecological sciences." Ecological Applications. Vol. 7, No. 1. p.330-342.

Shaffer, H., R. Fisher, and C. Davidson. 1998. "The role of natural history collections in documenting species declines." Trends in Ecology & Evolution. Vol. 13, No. 1. p.27-30.

©2001 California Academy of Sciences