A Framework For Authority Control: From Shared Vocabularies To The Cooperative
Cataloging Of Common Data Objects
Gregory New:
I just want the taxonomists here to realize that when Joan is talking about trusting the
work of your colleagues she's talking about the "Ivy League" of the cataloging community.
Not just everybody is admitted to the cooperative cataloging program. These are the
prestige names of the profession. So what she's saying is, I think, in developing a
taxonomic authority file you'll basically be doing the same thing: selecting the prestige
names and institutions and integrating their work. You're going to have to learn to trust
each other. If you can't trust the "Ivy League" schools, your whole system is in bad
shape.
Roy McDiarmid:
I don't want to get into that comparison of where you came from or where you are today,
but I'd like to raise an issue, and it sort of grows out of this idea of
perfection. We frequently run into problems in tracking authorial names for
scientific names, that is, authors, and the conflicts that you
find, both in the literature in terms of how a name is represented and the accuracy of
representing that name, and when you start looking within groups there are conventions
that people tend to follow for how to represent an author's name.
Unfortunately authors change their names, editorial influence dictates different
representations of authors' names on publications, and when you start looking across
groups, you immediately recognize that there are all sorts of problems that deal with
accuracy versus consistency. I know that the botanical community has come up with an
agreed upon representation of authorial names for botany. We're coming to grips with that
in the ITIS program, and I'd like to get some guidance from those of you in the library
community, because what we're tending toward is consistency (because of its importance
for retrieval and for internal authority files within systems) rather than accuracy of
representation of authorial names associated with scientific names and bibliographies. And
I'd like you to give us some guidance about that, whoever wants to take that on.
Karen Calhoun:
As you saw in the examples I provided (and you'll be seeing more), we use a
system of cross-references to help control that problem. So in the catalog, if you
enter a name that's not represented in the catalog, you would be taken directly to
the group of records under that heading in its authorized form, or you would see a message
that says for this author see this form of the name. That's the traditional way that it's
been done. In an Internet or Web environment, you might be able to do something somewhat
more sophisticated.
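
As an illustration of the cross-reference mechanism described here, a minimal Python sketch follows. It is not drawn from any real catalog system: the headings, the record labels, and the function name look_up are all hypothetical, and a production catalog would normalize names rather than match exact strings.

```python
from typing import Dict, List, Tuple

# Hypothetical data for illustration; not taken from any real authority file.
CATALOG: Dict[str, List[str]] = {
    "Doe, Jane A.": ["Record 1", "Record 2"],
}
# "See" references: variant form -> authorized form of the heading.
SEE_REFERENCES: Dict[str, str] = {
    "Doe, J.": "Doe, Jane A.",
    "Doe, Jane": "Doe, Jane A.",
}


def look_up(name: str, redirect: bool = True) -> Tuple[str, List[str]]:
    """Return (message, records) for a name typed by a catalog user."""
    if name in CATALOG:
        return ("", CATALOG[name])
    authorized = SEE_REFERENCES.get(name)
    if authorized is None:
        return (f"No heading found for {name!r}.", [])
    if redirect:
        # Take the user straight to the records under the authorized form.
        return ("", CATALOG.get(authorized, []))
    # Otherwise show the traditional "see" message.
    return (f"For {name!r} see {authorized!r}.", [])
```

Calling look_up("Doe, J.") returns the records filed under the authorized heading, while look_up("Doe, J.", redirect=False) returns only the "see" message, which corresponds to the two traditional behaviors mentioned above.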
Roy McDiarmid:
That assumes that an authority file already exists?
Karen Calhoun:
Yes, and as a matter of fact you might be able to take advantage of the Name Authority
File. It would certainly not cover all of your authors, and those who publish only in
journals would be less well covered than those who have published monographs. The same is
true of geographic names. One of the speakers this morning spoke of the difficulty of
geographic names. The name authority file covers the names of countries and places, and
the subject authority file covers things like names of rivers, in authoritative form, with
cross references, and including superseded place names. So you might be able to take
advantage of some of the work that's already been done there and is available freely to
you over the Internet.
Gary Rosenberg:
A couple of thoughts on the issue of perfectionism: one is that I find that publishing
in electronic database form makes it much less important to get everything right to have a
completed project. That is, my database for molluscs is on the Net, it grows, I make
corrections, but I won't publish a hardcopy form, like Bill [Eschmeyer] did with his fish,
until I feel I've gotten it to a certain state of perfection. The electronic process makes
it much easier to have works in progress. The other side of this is that if you have a
cooperative effort... the last five percent are some of the hardest to get. If we
approached some of these projects in a different way, we could have someone who goes
through and finds all the taxa in a particular journal, rather than all the taxa within a
particular family. If we could distribute our efforts in a different way we'd pick up all
these hard to find things just because we looked in lots of places, rather than having 50
people looking in all the same places for different sets of things.
Scott Miller:
Just a follow up comment to that: When we were compiling various authority files for
organisms found in Hawaii as part of the Hawaii Biological Survey, in some cases we tried
extensively to get specialists around the world to review those lists for us before we
so-called "published" them, and we had very poor success at that. But we found,
on the other hand, once we put the thing on the World Wide Web, a lot of those same people
who wouldn't respond to us before sent us these nasty letters saying: "You missed this,
this, this, and this..." We'd respond with a letter saying: "We'd be happy to
put those in the database if you'd provide us with these fields of documentation:
..." Then they'd go back and madly write out all this stuff and send it back to us.
We found that getting things to a certain level of completion and putting them out on the
Internet as interim products was a very effective review process.
Karen Calhoun:
Scott, I'd like to comment on that. That was kind of what I was getting at when I was
saying "pick a place to start and start doing it." Another thing that has struck
me today is that you have many models that are potentially viable for beginning a
cooperative effort. At the same time, you have immense historical data to record in
machine-readable form; in other words, you're looking at a mountain just to get
started. One possible way to approach it is to begin at a point that is easy, and go forward
from there. Attack your large mountains as you go, rather than try to make everything
perfect on day one.
Larry Speers:
You bring out many points about provisional data sets and interim publishing, and it
brings up the whole issue of time-stamps, archiving, and various related issues. It's an area we
haven't really started to discuss, but if we're going down that road it needs to be dealt
with.
Laurel Jizba:
We are also converting part of our card file. We finished "A" through
"Q", and we've got the rest of the alphabet to do. As you might know, in the
Library of Congress classification, Q is in the sciences, and in order to find funding to
do it (because most of the libraries around the country that have had to do retrospective
conversion have had to go to outside funding agencies), we took advantage of the recent
national... I can't remember the name but there's a federal agency that is funding joint
projects between libraries and museums... and we've applied with a natural history museum
in the area and thrown in our retrospective conversion of 3x5 cards on that project. It
took us a while to work out all the details, but thinking of going into a cooperative
arrangement to get the funding from federal agencies is not a bad idea.
Linda Hill:
I guess I'm from the library community, but I'm kind of a maverick as far as the library
community goes. Sitting here and listening to a description of the authority kinds of
processes that the library community has brings up the concerns that I've always had about
the fact that it's not scalable. The formal libraries pay attention to a very limited
amount of the information that is published, that is, by and large, monographic
publications. If you look at the abstracting and indexing services, which have a much
larger volume of publication, you find that they don't go in for (I think this is true
across the board) name authority files, for example, but go for a standardized way of
representing the name, that is, last name and first initials only, or just as the
author has put it into the piece. And somehow this works. I would submit that the emphasis
is on making this work at the information retrieval stage rather than putting the effort
into the authority files up front, for that particular set: personal names. Also, I'd
like to say that abstracting and indexing services pay attention to the affiliation of the
author, and you won't find that information in the library catalogs. Yet this is often one
of the chief pieces of evidence that you want to know: where this person was working
when they published the item. One other point is that as we get into electronic
publication, where the information is perhaps coded in SGML, the document will come into
the libraries or abstracting and indexing services already coded. You can just read the
electronic text and it will say that this is the author. I think it's going to be
increasingly difficult for the libraries to maintain the authority files.
Paul Morris:
Speaking from within the museum collections community, we have really two competing
goals with respect to information. One is we need to archive information about unique
objects and to make sure that we're doing that as accurately as possible, so
that a hundred years from now someone who pulls up the information associated with the
specimen has real information and not information we've interpreted or mistyped along the
way. The other goal is getting
that information out and making it accessible; making it easily and effectively searchable. These
are two very different competing goals we have to worry about.
Karen Calhoun:
I'd like to speak to Linda's remarks on whether authority control for personal names can
be sustained. One of the things that is always a tension in covering information resources
is whether the user group that is associated with that set of
resources has a need for current information, the latest thing on the topic, or whether
it's a group of humanists, or artists, or musicians. I would maintain that, for humanities,
art, and museums, it's very important to maintain the kinds of consistency that we have
provided in the past for personal names because those names are actually names and
subjects. For many scientists and engineers, current information is what is desired and
the name access point is not necessarily the most important part of the citation. This may
be different for the taxonomic community, I'm not sure. But distinctions need to be made
and in some cases it is valuable to continue this level of personal name authority
control.
Stan Blum:
We had a little bit of a discussion earlier about "what is an authority file?"
and I was wondering whether anyone would care to offer a definition?
Joan Swanekamp:
What I provided was a little bit of background, because where the library community
started was not necessarily with the creation of authority files, but with the
authoritative forms of headings. And it was groupings of files of those headings that
became our authority files. For a very, very long time we've been interested in
standardizing the forms of names we've used for personal names, for corporate bodies, for
subjects, and in doing so, we had rules for that, and that is how we created our authority
files. Our rules have changed over time, sometimes dramatically, but by having these
files, we've been able to keep some sense of order with our referencing structures or
whatever. Where this really all began was with authoritative forms of headings.
Stan Blum:
It seems that it's not just form now because it's very clear that there's a concept
behind this certified heading for this person. If you get an alternative form it goes
under a different heading, or a different record, so now we've got a clear concept of a
person, whose name can change or evolve, or who can have many names or forms of names.
Karen Calhoun:
I think an analogy is, if you think about the building where the fire department is, it
isn't the building that's important. In the same way, it's really not the file that's
important. What's important is what it does, and it does several things. It provides an
authorized form for retrieving information from the database, for sorting it, and for
displaying it. Or it might involve links to other forms or related forms of headings that
work the same way. So it has cross-reference structure, it has the authorization, it has information
in it that helps people to establish names or subjects in the proper way. It also has that
automated function. Joan was talking about their retrospective conversion and about the
millions of records that they need to simply key into a system and then they're going to
send it to an authority control vendor to clean up the headings. And what they're going to
use is the authority file. So think about what the fire department does and not where it
lives.
John Riemer:
I have a definition to offer for "Authority Control" which is: The process of
selecting, establishing, and maintaining unique and standardized headings to be used as
access points in the catalog, as well as relating variations and other headings to them.
So then I could back up from that and say that an authority file is any record of the
decisions you've made toward this process.
Bill Eschmeyer:
On the Catalog of Fishes there are some very practical matters on a day-to-day basis.
For example T. N. Gill might be listed in my database with 300 papers. He published as
"T. Gill", "T. N. Gill," and "Theodore Gill". I had to pick
one to make them come out alphabetically, to pick up all of his records
chronologically (and then with his co-authors), just to make my literature cited.
But I would really like to see an authority file that would give all the combinations that
he's published under and also a short biography. Now we have several Gills publishing. We
have two "C. L. Hubbs" that have published extensively. Sometimes I have
published remarks, "also published as ____ and ____". And then Asian names are
particularly difficult. Journals handle them differently. So you almost need an annotated
file, because if you just work retroactively from your cards, you're going to get "T.
Gill" and some "Theodore Gill"s. I had to make choices on a lot of authors
and first I had to make sure that I was dealing with one person; sometimes I wasn't.
That's why going from the electronic version to the printed version involved a lot of
editing.
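
The kind of annotated author file described above can be sketched roughly as follows. This is only an illustration, not the structure of the Catalog of Fishes database: the class AuthorRecord, the functions build_index and literature_cited, and the choice of "Gill, T. N." as the heading are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AuthorRecord:
    """One person: a chosen heading plus every form they published under."""
    heading: str
    variants: List[str] = field(default_factory=list)
    note: str = ""  # e.g. a short biography or "also published as ..."


def build_index(records: List[AuthorRecord]) -> Dict[str, str]:
    """Map every known form of a name to its chosen heading."""
    index: Dict[str, str] = {}
    for rec in records:
        for form in [rec.heading, *rec.variants]:
            if form in index and index[form] != rec.heading:
                # Two people share this form (the "two C. L. Hubbs" case):
                # flag it for a human decision instead of guessing.
                raise ValueError(f"ambiguous form: {form}")
            index[form] = rec.heading
    return index


def literature_cited(
    citations: List[Tuple[str, int, str]], index: Dict[str, str]
) -> List[Tuple[str, int, str]]:
    """Sort (author as printed, year, title) by heading, then by year."""
    resolved = [(index.get(author, author), year, title)
                for author, year, title in citations]
    return sorted(resolved)
```

With a single record whose variants are "T. Gill", "T. N. Gill", and "Theodore Gill", citations printed under any of those forms sort together under one heading, in chronological order. A form shared by two different people raises an error, because that case needs an editor's decision.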
On perfection: the books are not perfect, they're better than most anything you can
find out there, but there's still a lot more to do. I experienced a little of what Scott
mentioned: "Hey, you missed my paper," or "you missed this species."
Some of that goes into it when you put it up as an electronic database. I would rather see
biographies for current authors. We do a little newsletter where everyone sends in a
little account of what they're working on, and the most heavily used part of that is
e-mail addresses. A lot of journals are now giving e-mail addresses. So that should go in
the author file, too, if they're current workers.
Eimear Nic Lughadha:
What Dr. Eschmeyer has described is exactly what we've been doing for plant-name
authors, first in book form and now on-line. The trick is to identify which people you're
talking about, and then which names they've published under, adopting standard forms, and
making sure those standard forms aren't duplicated by people who are likely to be using
the same information set. By doing this we've reduced the number of apparent authors
dramatically, by a factor of two or three.
John Attig:
I think you're still going to have to make the decisions. That's the job that has to be
done. An authority file is a way of reducing the number of times you have to make the same
decision, by recording all of those decisions in a way that everybody can consult them.
Laurel Jizba:
One of the things that's really fun about adding to and contributing to authority
records is that they have an open-ended length; they can go on forever. So we're not
talking about a fixed form. And some of them do, when they get into the variants in Chinese
and Russian names, and lengthy notes about when the heading first appeared a certain way,
and the date of that publication. And then we have notes about things like conflicts. Two
people publishing in the same or similar fields, and what the title of their actual job
was at the time. I don't know if this has been said in this context yet, but we have no
limit except that, as a value, we know we're trying to get it out faster. Any one
individual may not spend a long time on that record, but after many months and years the
build up of information in that record can represent a substantial amount of decision
making.
Scott Miller:
A somewhat heretical question about subject headings in libraries: When we only had card
files they were very important, but now that library databases can search freely, with
boolean searches, etc., is control on subject headings actually cost effective or useful?
I know I almost never use them.
John Attig:
Is your literature all in English?
Karen Calhoun:
That's right. "Is your literature all in English" is a very good question. The
Council on Library Resources showed that subject searching in library catalogs or a
subject approach to a catalog was still extremely important. Over the years, the Library
of Congress subject heading system has become less essential because of the rise of
keyword search and all of the different ways that you can retrieve data. Now some of the
libraries are experimenting with relevance searching, such as is done on the web. So the
answer is a difficult one. The purpose that the Library of Congress subject heading system
has at this time is for catalogs like Cornell's, which are very, very rich in other
languages, other language materials, other cultures' materials. It provides an English
speaking person very good keyword access to those materials.
Joan Swanekamp:
This is probably a failing of our automated catalogs more than anything else, but
keyword searches provide almost endless results with almost no order at all, with no sort
of hierarchy. Really, one of the things that an LC subject heading or other
thesauri can do is bring together in a logical fashion a large group of related materials.
When you get a result of a keyword search in most of our systems, you often don't have a
clue as to why you got it.
Jessica Milstead:
It's important to keep in mind too that even if you're not searching on a subject
heading, if there is a subject heading in the record and you search on one of the words in
it, you may be searching on the subject heading without even knowing it.
Beacher Wiggins:
Another thing we should keep in mind though is that the structure that we are creating
can be used and applied as the need arises. The library community has indeed been working
with a subject heading system, and I'd be the first to say that the Library of Congress
subject heading apparatus is in need of some work (and we've begun to do that
work), but I think the value here is that a community, or an institution, or a group
can take advantage of pieces of the work and build on it as you need it. When you were
talking about "What is an authority file?", the file can be whatever you
want it to be, and it can control whatever universe you want to work with. I think keeping
that in mind will also help in terms of deciding how to approach the problem or being able
to manage a portion of the problem. Don't think that it's all or nothing. Talking to any
three or four of the librarians that are here, you're likely to get a different view in
terms of how best to approach a particular problem. I think part of that, and the dialog
we're having, is understanding that it can be applied in portions and not the entire
universe. I think you have to keep in mind too the history that we are establishing here
and the fact that these are dynamic records that we are building and by virtue of that the
files we create are also dynamic and can be bitten off as the need arises.
Linda Hill:
Information retrieval research (in reply to the question about whether controlled
vocabularies are valuable) has shown that when they tested free text, that is,
searching every word in an item, against searching by controlled vocabularies (and subject
headings are just one form of controlled vocabularies), what they've found is that the
best system is being able to use both. You can't do as well with either as you can do with
both.
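
A toy Python sketch can show why the two approaches complement each other. It does not reproduce the research mentioned above: the three records, the heading "Mollusks", and the function names are made up for illustration only.

```python
from typing import FrozenSet, List, NamedTuple, Set


class Record(NamedTuple):
    rid: int
    text: str                 # words a free-text search can see
    subjects: FrozenSet[str]  # headings assigned from a controlled vocabulary


# Made-up records, for illustration only.
RECORDS: List[Record] = [
    Record(1, "Notes on freshwater snails", frozenset({"Mollusks"})),
    Record(2, "Les mollusques terrestres", frozenset({"Mollusks"})),
    Record(3, "Mollusks and their shells", frozenset()),  # no heading assigned
]


def free_text(word: str) -> Set[int]:
    """Match the word against every word in the item's text."""
    w = word.lower()
    return {r.rid for r in RECORDS if w in r.text.lower().split()}


def by_subject(heading: str) -> Set[int]:
    """Match an exact heading from the controlled vocabulary."""
    return {r.rid for r in RECORDS if heading in r.subjects}


def combined(word: str, heading: str) -> Set[int]:
    """Use both access routes and merge the results."""
    return free_text(word) | by_subject(heading)
```

In this toy data, free_text("mollusks") finds only record 3, and by_subject("Mollusks") finds only records 1 and 2 (including the French-language item). Only combined("mollusks", "Mollusks") retrieves all three.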