Besser-FirstMonday 02110/3/18

The Next Stage: Moving from Isolated Digital Collections to Interoperable Digital Libraries

Howard Besser

Abstract:

Online collections do not yet function like conventional libraries. Many digital collections are experimental and lack service components, and few have preservation components. The function of searching across collections is a dream frequently discussed but seldom realized at a robust level. This paper places a conceptual framework upon digital library development, and discusses how we might move from isolated digital collections to interoperable digital libraries. It first examines how early efforts to construct digital collections were conceived as experiments rather than operational libraries. It then discusses various conventional library components that are necessary to the deployment of operational digital libraries. Finally, the author points to functions (such as infrastructure, robust metadata, and preservation components) that can be deployed to move us from isolated digital collections to interoperable digital libraries.

Contents

What is a Library?

Brief Digital Library History

Moving to a more user-centered architecture

General processes and stages of technological development

The Importance of Standards

The Various Metadata Types

Metadata Philosophies and Harvesting: Warwick vs. MARC

Best Practices

Other standards issues

The Next Step: Moving from Isolated Digital Collections to Interoperable Digital Libraries

What is a Library?

Digital library development has thus far proceeded in a piecemeal fashion, incrementally attempting to mimic specific functions of conventional libraries. But now that our digital libraries are sufficiently advanced and functional, it seems appropriate to examine the roles and functions of traditional libraries, and to see what additional developments should be incorporated into the digital libraries we are planning and developing.

Writing about public libraries, McClure has outlined a critical set of roles they fulfill (McClure 1987) including: community activities center, community information center, formal education support center, independent learning center, popular materials library, preschoolers’ door to learning, reference library, and research center. This set of roles needs to be rethought in an age when physical location and service can be separated from one another; some of these roles are more tied to the library’s physical presence in the community, while others may function very well if delivered from remote sites. Besser has tried to update McClure's roles for the digital age (Besser 1998), claiming that the four core missions of a public library are: that it is a physical place, that it is a focus spot for continuous educational development, that it has a mission to serve the underserved, and that it is a guarantor of public access to information. But we need to go beyond public libraries to make some generalizations that can apply to most types of libraries.

Traditionally, libraries have been more than just collections. They have components (including service to a clientele, stewardship over a collection, sustainability, and the ability to find material that exists outside that collection)[1] and they uphold ethical traditions (including free speech, privacy, and equal access).

Almost all conventional libraries (be they Special, Academic, Public, or School Libraries) have a strong service component. All but the smallest libraries tend to have a substantial "public service" unit. Library schools teach about service (from "public service" courses to "reference interviews"). And the public in general regards librarians as helpful people to whom they turn to meet their information needs.

Many libraries deliver information to multiple clienteles. They are very good at using the same collection to serve many different groups of users, each group incorporating different modalities of learning and interacting, different levels of knowledge of a certain subject, etc. Public libraries serve people of all ages and professions, from those barely able to read, to high schoolers, to college students, to professors, to blue collar workers. Academic libraries serve undergraduates who may know very little in a particular field, faculty who may be specialists in that field, and non-native English speakers who may understand detailed concepts in a particular domain, but have difficulty grasping the language.

Most libraries also incorporate the component of stewardship over a collection. For some libraries, this is primarily a matter of reshelving and circulation control. But for most libraries, this includes a serious preservation function over at least a portion of their collection. For research libraries and special collections, preservation is a significant portion of their core responsibilities, but even school, public, and special libraries are usually responsible for maintaining a core collection of local records and works over long periods of time.

Libraries are organizations that last over long periods of time. Though occasionally a library does "go out of business", in general, libraries are social entities that have a great deal of stability. Though services may occasionally change in slight ways, people rely on their libraries to provide a sustainable set of services. And when services do change, there is usually a lengthy period where input is solicited from those who might be affected by those changes.

Another key component of libraries is that each library offers the service of providing information that is not physically housed within that library. Libraries see themselves as part of a networked world of libraries that work together to deliver information to an individual (who may deal directly only with his or her own library). Tools such as union catalogs and services such as inter-library loan have produced a sort of interoperable library network that was able to search for and deliver material from afar long before the advent of the World Wide Web.

Libraries also have strong ethical traditions. These include fervent protection of readers' privacy, equal access to information, diversity of information, serving the underserved, etc. (ALA 1995). Librarians also serve as public guardians over information, advocating for these ethical values.

The library tradition of privacy protection is very strong. Librarians have risked serving jail time rather than turn over whole sets of patron borrowing records. Libraries in the US have even designed their circulation systems to only save aggregate borrowing statistics; they do not save individual statistics that could subsequently be data-mined to determine what an individual had borrowed.
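The aggregate-only design described above can be illustrated with a minimal sketch. It is not drawn from any actual circulation system: the class name, method names, and the assumption that a patron-to-item link is held only while an item is out on loan are all hypothetical.

```python
from collections import Counter


class AggregateCirculationLog:
    """Hypothetical circulation log that retains only per-title loan counts."""

    def __init__(self):
        # Aggregate statistic: how many times each title has been borrowed.
        self._loans_per_title = Counter()
        # Assumption: the patron link exists only while the item is checked out.
        self._on_loan = {}

    def check_out(self, item_id, title, patron_id):
        self._on_loan[item_id] = patron_id
        self._loans_per_title[title] += 1

    def check_in(self, item_id):
        # Dropping the link on return leaves no individual borrowing history.
        self._on_loan.pop(item_id, None)

    def loans_for(self, title):
        return self._loans_per_title[title]

    def current_loans(self, patron_id):
        # Only items still out can be tied to a patron; past loans cannot.
        return [item for item, p in self._on_loan.items() if p == patron_id]
```

Under these assumptions, once an item is returned the log can still report how often a title circulated, but nothing remains that could be data-mined to reconstruct what a given individual had borrowed.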

Librarians believe strongly in equal access to information. Librarians traditionally see themselves as providing information to those who cannot afford to pay for that information on the open market. And the American Library Association even mounted a court challenge to the Communications Decency Act because it prevented library users from accessing information that they could access from venues outside the library. Librarians have been in the forefront of the struggle against the privatizing of US government information on the grounds that those steps would limit the access of people who could not afford to pay for it.

Librarians also have a strong ethical tradition of assuring diversity of information. Libraries purposely collect material from a wide variety of perspectives. Collection development policies often stress collection diversity. And librarians pride themselves on being able to offer patrons a rich and diverse set of information.

Librarians are key public advocates for these ethical values. As guardians of information, they try to make sure that the richness, context, and value of information does not get lost.

As we move towards constructing digital libraries, we need to remember that libraries are not merely collections of materials. They have both services and ethical traditions that are a critical part of the functions they serve. The digital collections we build will not truly be digital libraries until they incorporate a significant number of these services and ethical traditions.

Brief Digital Library History

The first major acknowledgement of the importance of Digital Libraries came in a 1994 announcement that $24.4 million of US federal funds would be dispersed among six universities for "digital library" research (NSF 1994). This funding came through a joint initiative of the National Science Foundation (NSF), the Department of Defense Advanced Research Projects Agency (ARPA), and the National Aeronautics and Space Administration (NASA). The projects were at Carnegie Mellon University, the University of California-Berkeley, the University of Michigan, the University of Illinois, the University of California-Santa Barbara, and Stanford University.

These six well-funded projects helped set in motion the popular definition of a "digital library". These projects were computer science experiments, primarily in the areas of architecture and information retrieval. According to an editorial in D-Lib Magazine, "Rightly or wrongly, the DLI-1 grants were frequently criticized as exercises in pure research, with few practical applications" (Hirtle 1999).

Though these projects were exciting attempts to experiment with digital collections, in no sense of the word did they resemble libraries. They had little or no service components, no custodialship over collections, no sustainability, no base of users, no ethical traditions. We will call this the "Experimental" stage of digital library development (see Figure #1). Because efforts during this Experimental stage were the first to receive such widespread acknowledgement under the term "digital library", they set a popular impression for that term that persisted for many years.

By 1996, social scientists who had previously worked with conventional libraries began trying to broaden the term "digital libraries" (Bishop & Star 1996; Borgman et al. 1996). But the real breakthrough came in late 1998 when the US federal government issued its highly funded DLI-2 awards (Griffin 1999) to projects that contained some elements of traditional library service, such as custodialship, sustainability, and relationships to a community of users. Around that time, administrators of conventional libraries began building serious digital components.

Stage | Date | Sponsor | What
I. Experimental | 1994 | NSF/ARPA/NASA | Experiments on collections of digital materials
II. Developing | 1998/99 | NSF/ARPA/NASA, DLF/CLIR | Begin to consider custodialship, sustainability, user communities
III. Mature | ? | Funded through normal channels? | Real sustainable interoperable digital libraries

Figure #1--Stages of Digital Library Development

As librarians and social scientists became more involved in these digital projects, we moved away from computer science experiments into projects that were more operational. We shall call this the "Developing" stage of digital libraries. By the late 1990s, particularly under the influence of the US Digital Library Federation, projects began to address traditional library components such as stewardship over a collection and interoperability between collections. But even though these issues are finally being addressed, and great progress has been made on real interoperability and digital preservation, they are far from being solved in a robust operational environment. In order to enter the "Mature" stage where we can really call these new entities "digital libraries", we will need to make much more progress in moving conventional library components such as sustainability and interoperability into the digital realm. And we need to begin to seriously address how we can move our library ethical traditions (such as free speech, privacy, and equal access) into the digital realm as well. The remainder of this paper examines important efforts to move us in those directions.

Moving to a more user-centered architecture

Both the early computer science experiments in digital libraries and the initial efforts to build online public access catalogs (OPACs) followed a model similar to that in Figure #2. Under this model, a user needed to interact with each digital repository independently, to learn the syntax supported by each digital repository, and to have installed on their own computer the applications software needed to view the types of digital objects supported by each digital repository.

So, in order for a user to search Repository A, s/he would need to first adjust to Repository A's specialized user interface, then learn the search syntax supported by this repository. (For example, NOTIS-based OPACs required search syntax like A=Besser, Howard, while Innovative-based OPACs required search syntax like FIND PN Besser, Howard.) Once the search was completed, s/he could retrieve the appropriate digital objects, but would not necessarily be able to view them. Each repository would only support a limited number of encoding formats, and would require that the user have specific software installed on their personal computer (such as viewers for Microsoft Word 98, SGML, Adobe Acrobat, TIFF, PNG, JPEG, or specialized software distributed by that repository) in order to view the digital object. Thus users might search and find relevant works, but not be able to view them.
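To make the burden of per-repository syntax concrete, the two author-search forms quoted above can be generated from a single user-supplied name. The following is a minimal sketch: only the two syntax patterns come from the examples in the text, while the system labels, the table, and the function name are hypothetical.

```python
# Author-search templates taken from the two examples in the text.
# The dictionary keys are illustrative labels, not official product names.
AUTHOR_TEMPLATES = {
    "notis": "A={name}",
    "innovative": "FIND PN {name}",
}


def author_query(system, name):
    """Render an author search in the native syntax of the given OPAC type."""
    try:
        template = AUTHOR_TEMPLATES[system]
    except KeyError:
        raise ValueError(f"no known author-search syntax for {system!r}") from None
    return template.format(name=name)


for system in AUTHOR_TEMPLATES:
    print(system, "->", author_query(system, "Besser, Howard"))
```

A translation layer of this kind moves the job of remembering each syntax from the user to the software, though it does nothing about the separate problem of viewing the retrieved objects.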

The user would then have to repeat this process with Repository B, C, D, etc., and each of these repositories may have required a different syntax and a different set of viewers. Once the user had searched several different repositories, they still could not examine all their retrieved objects together. There was no way of merging sets. And because different repositories supported different viewing software, any attempt to examine objects from several repositories would likely require going back and forth between several different display applications.

Obviously the model in Figure #2 was not very user-friendly. Users don't want to learn several search syntaxes or install a variety of viewing applications on their desktops; they want to make a single query that accesses a variety of different repositories. Users want to access an interoperable information world, where a set of separate repositories looks to them like a single information portal. A more user-friendly model is outlined in Figure #3. Under this model a user makes a single query that propagates across multiple repositories. The user need learn only a single search syntax.

The user doesn't need to have a large number of software applications installed for viewing. And retrieved sets of digital objects may be looked at together on the user's workstation. The model in Figure #3 envisions a world of interoperable digital repositories, and is a model we need to strive for.
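The single-query model of Figure #3 can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any deployed system: the `Record` and `InMemoryRepository` types, the `search` method, and the title-substring matching are all hypothetical stand-ins for real repositories and protocols.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    identifier: str
    title: str
    repository: str


class InMemoryRepository:
    """Hypothetical repository that matches a term against record titles."""

    def __init__(self, name, titles):
        self.name = name
        self._titles = dict(titles)  # identifier -> title

    def search(self, term):
        term = term.lower()
        return [
            Record(identifier, title, self.name)
            for identifier, title in self._titles.items()
            if term in title.lower()
        ]


def federated_search(term, repositories):
    """Send one query to every repository and merge the hits into one set."""
    merged = []
    for repo in repositories:
        merged.extend(repo.search(term))
    # A single ordering lets the user examine all results together.
    return sorted(merged, key=lambda r: (r.title.lower(), r.repository))


repo_a = InMemoryRepository("A", {"a1": "Digital Libraries", "a2": "Card Catalogs"})
repo_b = InMemoryRepository("B", {"b1": "Metadata for Digital Collections"})
for hit in federated_search("digital", [repo_a, repo_b]):
    print(hit.repository, hit.identifier, hit.title)
```

The user issues one query in one syntax and receives one merged result set. In practice each repository would sit behind a network protocol and a shared metadata format, which is where the standards work discussed in this paper comes in.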

Over the years we have made some significant progress towards the Figure #3 model, particularly in the area of OPACs. Web browsers have given us a common "look-and-feel" between different repository user interfaces. The Z39.50 protocols have allowed users to employ a single familiar search syntax, even when the repository's native search syntax appears foreign. Z39.50 has also promised to let user queries propagate to different repositories. But when one leaves the world of OPACs and enters the world of digital repositories, much work still needs to be done to achieve real interoperability. Most of this work involves creation and adoption of a wide variety of standards: from standards for the various types of metadata (administrative, structural, identification, longevity), to ways of making that metadata visible to external systems (harvesting), to common architectures that will support interoperability (open archives).

General processes and stages of technological development

The automation of any type of conventional process often follows a series of pragmatic steps as well as a series of conceptual stages.

Pragmatic implementation steps usually begin by using technology to experiment with new methods of performing some function, followed by building operational systems, followed by building interoperable operational systems. And at the later stages of this, developers begin trying to make these systems useful for users. We have seen this pattern (experimental systems to operational systems to interoperable systems to useful systems) repeat in the development of OPACs, Indexing and Abstracting services, and image retrieval. The automation of each of these has begun with experiments, followed by implementations that envisioned closed operational systems (with known bodies of users who needed to learn particular user interfaces and syntaxes to interact with the system), followed by implementations that allowed the user to more easily interact with multiple systems (and sometimes to even search across various systems). Today's "digital libraries" are not much beyond the early experimental stage, and need much more work to make them truly interoperable and user-centered.

The conceptual steps typically include first trying to replicate core activities that functioned in the analog environment, then attempting to replicate some (but not all) of the non-core analog functions, then (after being in use for some time) discovering and implementing new functions that did not exist within the previous analog environment. This final step is a major shift in terms of creating something different that makes good use of the new functional environment enabled by the new technology. So, for example, word processors were initially built as typewriters with storage mechanisms, but over time grew to incorporate functions such as spell-checking and revision-tracking, and eventually enabled very different functions (such as desktop publishing). Our early efforts at creating MARC records began as ways to automate the production of catalog cards, then moved to the creation of bibliographic utilities and their union catalogs, then to OPACs. Functionally, our OPACs began as mere replicas of card catalogs, then added Boolean searching, then title-word searching capabilities, and now are poised to allow users to propagate distributed searches across a series of OPACs. Today's digital collections are not much past the initial stage where we are replicating the collections of content and cataloging that existed in analog form, and are just beginning to add minor functions. In the future we can expect our digital libraries to incorporate a variety of functions that employ the new technological environments in ways we can hardly imagine today.