Discussion at the Taxonomic Authority Files Workshop, Washington, DC, June 22-23, 1998

General discussion following Session II

A Framework For Authority Control: From Shared Vocabularies To The Cooperative Cataloging Of Common Data Objects


Gregory New:
I just want the taxonomists here to realize that when Joan is talking about trusting the work of your colleagues she's talking about the "Ivy League" of the cataloging community. Not just everybody is admitted to the cooperative cataloging program. These are the prestige names of the profession. So what she's saying is, I think, in developing a taxonomic authority file you'll basically be doing the same thing: selecting the prestige names and institutions and integrating their work. You're going to have to learn to trust each other. If you can't trust the "Ivy League" schools, your whole system is in bad shape.
 
Roy McDiarmid:
I don't want to get into that comparison of where you came from or where you are today, but I'd like to raise an issue, and it sort of grows out of this idea of perfection. We frequently run into problems in tracking authorial names for scientific names (that is, authors) and the conflicts that you find in the literature, both in how a name is represented and in the accuracy of that representation. When you start looking within groups, there are conventions that people tend to follow for how to represent an author's name. Unfortunately authors change their names, editorial influence dictates different representations of authors' names on publications, and when you start looking across groups, you immediately recognize that there are all sorts of problems that deal with accuracy versus consistency. I know that the botanical community has come up with an agreed-upon representation of authorial names for botany. We're coming to grips with that in the ITIS program, and I'd like to get some guidance from those of you in the library community, because what we're tending toward is consistency, because of the retrieval importance and internal authority files within systems, rather than accuracy of representation of authorial names associated with scientific names and bibliographies. And I'd like you to give us some guidance about that, whoever wants to take that on.
 
Karen Calhoun:
As you saw in the examples I provided—and you'll be seeing more—we use a system of cross-references to help control that problem. So that in the catalog if you would enter a name that's not represented in the catalog, you would be taken directly to the group of records under that heading in its authorized form, or you would see a message that says for this author see this form of the name. That's the traditional way that it's been done. In an Internet or Web environment, you might be able to do something somewhat more sophisticated.
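
[Editorial illustration: the cross-reference behavior described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions. The headings, variant forms, record labels, and the `lookup` function are all hypothetical and are not taken from any actual catalog or from the Name Authority File.]

```python
# Hypothetical data: the authorized heading, its variants, and the record
# labels are invented for illustration only.
RECORDS_BY_HEADING = {
    "Gill, T. N.": ["record 101", "record 102"],
}

# "See" references: variant form of a name -> authorized form of the heading.
SEE_REFERENCES = {
    "Gill, T.": "Gill, T. N.",
    "Gill, Theodore": "Gill, T. N.",
}


def lookup(name):
    """Return (records, message) for a name entered at the catalog.

    If the name is an authorized heading, its records are returned directly.
    If it is a known variant, the user is taken to the records under the
    authorized form and shown a "see" message. Otherwise nothing is found.
    """
    if name in RECORDS_BY_HEADING:
        return RECORDS_BY_HEADING[name], None
    authorized = SEE_REFERENCES.get(name)
    if authorized is not None:
        message = f"For {name} see {authorized}"
        return RECORDS_BY_HEADING[authorized], message
    return [], f"No heading found for {name}"


if __name__ == "__main__":
    records, message = lookup("Gill, Theodore")
    print(message)
    print(records)
```

[In this sketch a variant form never stores records of its own; it only points to the authorized heading, so a correction to the heading has to be made in one place.]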
 
Roy McDiarmid:
That assumes that an authority file already exists?
 
Karen Calhoun:
Yes, and as a matter of fact you might be able to take advantage of the Name Authority File. It would certainly not cover all of your authors, and those who publish only in journals would be less well covered than those who have published monographs. The same is true of geographic names. One of the speakers this morning spoke of the difficulty of geographic names. The name authority file covers the names of countries and places, and the subject authority file covers things like names of rivers, in authoritative form, with cross references, and including superseded place names. So you might be able to take advantage of some of the work that's already been done there and is available freely to you over the Internet.
 
Gary Rosenberg:
A couple of thoughts on the issue of perfectionism: one is that I find that publishing in electronic database form makes it much less important to get everything right to have a completed project. That is, my database for molluscs is on the Net, it grows, I make corrections, but I won't publish a hardcopy form, like Bill [Eschmeyer] did with his fish, until I feel I've gotten it to a certain state of perfection. The electronic process makes it much easier to have works in progress. The other side of this is that if you have a cooperative effort... the last five percent are some of the hardest to get. If we approached some of these projects in a different way, we could have someone who goes through and finds all taxa in a particular journal, rather than all the taxa within a particular family. If we could distribute our efforts in a different way we'd pick up all these hard-to-find things just because we looked in lots of places, rather than having 50 people looking in all the same places for different sets of things.
 
Scott Miller:
Just a follow-up comment to that: When we were compiling various authority files for organisms found in Hawaii as part of the Hawaii Biological Survey, in some cases we tried extensively to get specialists around the world to review those lists for us before we so-called "published" them, and we had very poor success at that. But we found, on the other hand, once we put the thing on the World Wide Web, a lot of those same people who wouldn't respond to us before sent us these nasty letters of: "You missed this, this, this, and this..." We'd respond with a letter saying: "We'd be happy to put those in the database if you'd provide us with these fields of documentation: ..." Then they'd go back and madly write out all this stuff and send it back to us. We found that getting things to a certain level of completion and putting them out on the Internet as interim products was a very effective review process.
 
Karen Calhoun:
Scott, I'd like to comment on that. That was kind of what I was getting at when I was saying "pick a place to start and start doing it." Another thing that has struck me today is that you have many models that are potentially viable for beginning a cooperative effort. At the same time, you have immense historical data to record in machine-readable form; in other words you're looking at a mountain, just to get started. One possible way to approach it is to begin at a point that is easy, and go forward from there. Attack your large mountains as you go, rather than try to make everything perfect on day one.
 
Larry Speers:
You bring out many points about provisional data sets and interim publishing, and it brings up the whole issue of time-stamps, archiving, and various related issues. It's an area we haven't really started to discuss, but if we're going down that road it needs to be dealt with.
 
Laurel Jizba:
We are also converting part of our card file. We finished "A" through "Q", and we've got the rest of the alphabet to do. As you might know, in the Library of Congress classification, Q is in the sciences, and in order to find funding to do it—because most of the libraries around the country who've had to do retrospective conversion have had to go to outside funding agencies—we took advantage of the recent national... I can't remember the name but there's a federal agency that is funding joint projects between libraries and museums... and we've applied with a natural history museum in the area and thrown in our retrospective conversion of 3x5 cards on that project. It took us a while to work out all the details, but thinking of going into a cooperative arrangement to get the funding from federal agencies is not a bad idea.
 
Linda Hill:
I guess I'm from the library community, but I'm kind of a maverick as far as the library community goes. Sitting here and listening to a description of the authority kinds of processes that the library community has brings up the concerns that I've always had about the fact that it's not scalable. The formal libraries pay attention to a very limited amount of the information that is published, that is, by and large, monographic publications. If you look at the abstracting and indexing services, which have a much larger volume of publication, you find that they don't go in for name authority files, for example (I think this is true across the board), but go for a standardized way of representing the name: that is, last name and first initials only, or just as the author has put it into the piece. And somehow this works. I would submit that the emphasis is on making this work at the information retrieval stage rather than putting the effort into the authority files up front, for that particular set, personal names. Also, I'd like to say that abstracting and indexing services pay attention to the affiliation of the author, and you won't find that information in the library catalogs. This is often one of the chief pieces of evidence that you want to know: where this person was working when they published the item. One other point is that as we get into electronic publication, where the information is perhaps coded in SGML, the document will come into the libraries or abstracting and indexing services already coded. You can just read the electronic text and it will say that this is the author. I think it's going to be increasingly difficult for the libraries to maintain the authority files.
 
Paul Morris:
Speaking from within the museum collections community, we have really two competing goals with respect to information. One is we need to archive information about unique objects and to make sure that we're doing that as accurately as possible, so that a hundred years from now someone who pulls up the information associated with the specimen has real information and not information we've interpreted or mistyped along the way. The other goal is getting that information out and making it accessible; making it easily and effectively searchable. These are two very different competing goals we have to worry about.
 
Karen Calhoun:
I'd like to speak to Linda's remarks on whether authority control for personal names can be sustained. One of the things that is always a tension in covering information resources is whether the user group that is associated with that set of resources has a need for current information, the latest thing on that topic, or whether it's a group of humanists, or artists, or musicians. I would maintain that, for humanities, art, and museums, it's very important to maintain the kinds of consistency that we have provided in the past for personal names because those names are actually names and subjects. For many scientists and engineers, current information is what is desired and the name access point is not necessarily the most important part of the citation. This may be different for the taxonomic community, I'm not sure. But distinctions need to be made and in some cases it is valuable to continue this level of personal name authority control.
 
Stan Blum:
We had a little bit of a discussion earlier about "what is an authority file?" and I was wondering whether anyone would care to offer a definition?
 
Joan Swanekamp:
What I provided was a little bit of background, because where the library community started was not necessarily with the creation of authority files, but with the authoritative forms of headings. And it was groupings of files of those headings that became our authority files. For a very, very long time we've been interested in standardizing the forms of names we've used for personal names, for corporate bodies, for subjects, and in doing so, we had rules for that, and that is how we created our authority files. Our rules have changed over time, sometimes dramatically, but by having these files, we've been able to keep some sense of order with our referencing structures or whatever. Where this really all began was with authoritative forms of headings.
 
Stan Blum:
It seems that it's not just form now because it's very clear that there's a concept behind this certified heading for this person. If you get an alternative form it goes under a different heading, or a different record, so now we've got a clear concept of a person, whose name can change or evolve, or who can have many names or forms of names.
 
Karen Calhoun:
I think an analogy is, if you think about the building where the fire department is, it isn't the building that's important; it's really not the file that's important. What's important is what it does, and it does several things. It provides an authorized form for retrieving information from the database, for sorting it, and for displaying it. Or it might involve links to other forms or related forms of headings that work the same way. So it has cross-reference structure, it has the authorization, it has information in it that helps people to establish names or subjects in the proper way. It also has that automated function. Joan was talking about their retrospective conversion and about the millions of records that they need to simply key into a system and then they're going to send it to an authority control vendor to clean up the headings. And what they're going to use is the authority file. So think about what the fire department does and not where it lives.
 
John Riemer:
I have a definition to offer for "Authority Control" which is: The process of selecting, establishing, and maintaining unique and standardized headings to be used as access points in the catalog, as well as relating variations and other headings to them. So then I could back up from that and say that an authority file is any record of the decisions you've made toward this process.
 
Bill Eschmeyer:
On the Catalog of Fishes there are some very practical matters on a day-to-day basis. For example T. N. Gill might be listed in my database with 300 papers. He published as "T. Gill", "T. N. Gill," and "Theodore Gill". I had to pick one to make them come out alphabetically, to pick up all of his records chronologically—and then with his co-authors—just to make my literature cited. But I would really like to see an authority file that would give all the combinations that he's published under and also a short biography. Now we have several Gills publishing. We have two "C. L. Hubbs" that have published extensively. Sometimes I have published remarks, "also published as ____ and ____". And then Asian names are particularly difficult. Journals handle them differently. So you almost need an annotated file, because if you just work retroactively from your cards, you're going to get "T. Gill" and some "Theodore Gill"s. I had to make choices on a lot of authors and first I had to make sure that I was dealing with one person; sometimes I wasn't. That's why going from the electronic version to the printed version involved a lot of editing.

On perfection: the books are not perfect; they're better than most anything you can find out there, but there's still a lot more to do. I experienced a little of what Scott mentioned: "Hey, you missed my paper," or "you missed this species." Some of that goes into it when you put it up as an electronic database. I would rather see biographies for current authors. We do a little newsletter where everyone sends in a little account of what they're working on, and the most heavily used part of that is e-mail addresses. A lot of journals are now giving e-mail addresses. So that should go in the author file, too, if they're current workers.
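
[Editorial illustration: the annotated author file described in this turn, one record per person listing every form of name published under, could be sketched as follows. This is a hedged sketch, not the Catalog of Fishes data model. The `AuthorRecord` class, the record identifiers, the chosen headings, and the notes are all hypothetical. Only the printed forms "T. Gill", "T. N. Gill", "Theodore Gill", and "C. L. Hubbs" come from the discussion.]

```python
from dataclasses import dataclass, field


@dataclass
class AuthorRecord:
    """One record per person (hypothetical structure)."""
    record_id: str
    heading: str                       # the single form chosen for sorting
    variants: set = field(default_factory=set)  # forms as printed on papers
    note: str = ""                     # free-text annotation, e.g. a short biography


def build_index(records):
    """Map every printed form (heading or variant) to the matching record ids."""
    index = {}
    for rec in records:
        for form in {rec.heading, *rec.variants}:
            index.setdefault(form, []).append(rec.record_id)
    return index


def resolve(printed_name, index):
    """Return the ids of all persons who have published under this form.

    One id means the citation can be filed automatically. More than one id
    means the form is ambiguous and a person still has to make the decision.
    """
    return index.get(printed_name, [])


# Illustrative records; ids, headings, and notes are invented.
RECORDS = [
    AuthorRecord("a1", "Gill, T. N.", {"T. Gill", "T. N. Gill", "Theodore Gill"}),
    AuthorRecord("a2", "Hubbs, C. L. (first author)", {"C. L. Hubbs"}),
    AuthorRecord("a3", "Hubbs, C. L. (second author)", {"C. L. Hubbs"}),
]

if __name__ == "__main__":
    index = build_index(RECORDS)
    print(resolve("Theodore Gill", index))
    print(resolve("C. L. Hubbs", index))
```

[The design choice here is that the printed form is never overwritten: it is kept as a variant and linked to a person, so consistency for retrieval and accuracy of the original citation can both be preserved.]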
 
Eimear Nic Lughadha:
What Dr. Eschmeyer has described is exactly what we've been doing for plant-name authors, first in book form and now on-line. The trick is to identify which people you're talking about, and then which names they've published under, adopting standard forms, and making sure those standard forms aren't duplicated by people who are likely to be using the same information set. By doing this we've reduced the number of apparent authors dramatically, by a factor of two or three.
 
John Attig:
I think you're still going to have to make the decisions. That's the job that has to be done. An authority file is a way of reducing the number of times you have to make the same decision, by recording all of those decisions in a way that everybody can consult them.
 
Laurel Jizba:
One of the things that's really fun about adding to and contributing to authority records is that they have an open-ended length; they can go on forever. So we're not talking about a fixed form. And some of them do, when they get into the variants in Chinese and Russian names, and lengthy notes about when the heading first appeared a certain way, and the date of that publication. And then we have notes about things like conflicts. Two people publishing in the same or similar fields, and what the title of their actual job was at the time. I don't know if this has been said in this context yet, but we have no limit except that, as a value, we know we're trying to get it out faster. Any one individual may not spend a long time on that record, but after many months and years the build up of information in that record can represent a substantial amount of decision making.
 
Scott Miller:
A somewhat heretical question about subject headings in libraries: When we only had card files they were very important, but now that library databases can search freely, with boolean searches, etc., is control on subject headings actually cost effective or useful? I know I almost never use them.
 
John Attig:
Is your literature all in English?
 
Karen Calhoun:
That's right. "Is your literature all in English" is a very good question. The Council on Library Resources showed that subject searching in library catalogs, or a subject approach to a catalog, was still extremely important. Over the years, the Library of Congress subject heading system has become less essential because of the rise of keyword search and all of the different ways that you can retrieve data. Now some of the libraries are experimenting with relevance searching, such as is done on the web. So the answer is a difficult one. The purpose that the Library of Congress subject heading system has at this time is for catalogs like Cornell's, which are very, very rich in other languages, other-language materials, other cultures' materials. It provides an English-speaking person very good keyword access to those materials.
 
Joan Swanekamp:
This is probably a failing of our automated catalogs more than anything else, but keyword searches provide almost endless results with almost no order at all, with no sort of hierarchy. One of the things that an LC subject heading or other thesauri can do is bring together in a logical fashion a large group of related materials. When you get a result of a keyword search in most of our systems, you often don't have a clue as to why you got it.
 
Jessica Milstead:
It's important to keep in mind too that even if you're not searching on a subject heading, if there is a subject heading in the record and you search on one of the words in it, you may be searching on the subject heading without even knowing it.
 
Beacher Wiggins:
Another thing we should keep in mind though is that the structure that we are creating can be used and applied as the need arises. The library community has indeed been working with a subject heading system, and I'd be the first to say that the Library of Congress subject heading apparatus is in need of some work—and we've begun to do that work—but I think the value here is that a community, or an institution, or a group can take advantage of pieces of the work and build on it as you need it. When you were talking about "What is an authority file?"—the file can be whatever you want it to be, and it can control whatever universe you want to work with. I think keeping that in mind will also help in terms of deciding how to approach the problem or being able to manage a portion of the problem. Don't think that it's all or nothing. Talking to any three or four of the librarians that are here, you're likely to get a different view in terms of how best to approach a particular problem. I think part of that, and the dialog we're having, is understanding that it can be applied in portions and not the entire universe. I think you have to keep in mind too the history that we are establishing here and the fact that these are dynamic records that we are building and by virtue of that the files we create are also dynamic and can be bitten off as the need arises.
 
Linda Hill:
Information retrieval research (in reply to the question about whether controlled vocabularies are valuable) has shown that when they tested free text—that is searching every word in an item—against searching by controlled vocabularies (and subject headings are just one form of controlled vocabularies), what they've found is that the best system is being able to use both. You can't do as well with either as you can do with both.
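
[Editorial illustration: a toy Python sketch of why searching free text and a controlled vocabulary together can retrieve more than either alone. It is not a reproduction of the research referred to above. The three records, their titles, their subject headings, and the function names are invented for illustration.]

```python
# Hypothetical catalog records: a title (free text) plus assigned subject headings.
RECORDS = [
    {"id": 1, "title": "Fishes of the Hawaiian Islands", "subjects": ["Fishes"]},
    {"id": 2, "title": "Poissons d'eau douce", "subjects": ["Fishes"]},
    {"id": 3, "title": "A new fish parasite", "subjects": ["Parasites"]},
]


def free_text_search(term, records):
    """Ids of records whose title contains the term as a whole word."""
    term = term.lower()
    return {r["id"] for r in records if term in r["title"].lower().split()}


def heading_search(heading, records):
    """Ids of records to which the controlled subject heading was assigned."""
    return {r["id"] for r in records if heading in r["subjects"]}


def combined_search(term, heading, records):
    """Union of the free-text and controlled-vocabulary results."""
    return free_text_search(term, records) | heading_search(heading, records)


if __name__ == "__main__":
    print(free_text_search("fish", RECORDS))        # title words only
    print(heading_search("Fishes", RECORDS))        # subject headings only
    print(combined_search("fish", "Fishes", RECORDS))
```

[In this toy data the free-text search misses the French-language title, the heading search misses the record indexed under a different heading, and the combined search retrieves all three.]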