Preservation of Digital Objects

Chapter for Volume 38, Annual Review of Information Science and Technology

Patricia Galloway

School of Information

University of Texas-Austin

Introduction

The preservation of digital objects (defined here as objects in digital form that require a computer to support their existence and display) is an important practical issue for the information professions, and its importance is growing daily as more and more information objects are produced in or converted to digital form. Yakel’s (2001) review of the field provided a much-needed introduction. At the same time the complexity of new digital objects continues to increase, challenging existing efforts (Lee et al. 2002). The field of information science itself is beginning to pay some reflexive attention to its own participation on the production side of fragile and unpreservable digital objects. But these concerns often focus on the practical problems of short-term repurposing of digital objects rather than on actual preservation, which I take to mean the activity of carrying digital objects from one software generation to another, undertaken for purposes beyond the original creation of the objects. For preservation in this sense to be possible, information science as a discipline needs to be active in the formulation of and advocacy for national information policies worldwide that challenge the effects of the planned-obsolescence model on what is coming to constitute the entire infrastructure of modern culture.

When it comes to preserving digital objects for the future, even into an indefinite future, there is little doubt that when thought of in terms of the actual computational tasks to be accomplished, “we have the technology.” Current content management/knowledge management systems are able to cope with the complexities of networked environments, while migration techniques have been in use on an enterprise scale for decades, the latest manifestation being seen in the Y2K process. The problem is not one of being unable to proceed for lack of adequate equipment or algorithms. The real questions are why to preserve, what to preserve, and who must do it. Digital preservation has been going on quite happily for many decades where well-defined answers to these questions set explicit temporal and content limits to the preservation task, as the following examples demonstrate.

Practical examples of ongoing digital preservation

Scientific datasets: Large datasets gathered by physical and social scientists have been preserved for at least forty years, to such effect that there are now repositories full of valuable scientific data. Examples include ICPSR and other university-based repositories, as well as (since 1970) the National Archives. This has been possible for these kinds of data for very good reasons. The data were formulated as databases, which have historically been easy to maintain because they contain information encoded in standardized and documented ways. The designated user communities for the data generally possess or have access to the computational expertise to effect the preservation, and the archives provide detailed instructions for data preparation (ICPSR 2002; CFR 2002). Finally, interested parties are motivated to preserve the data (and are often obliged to do so by law or contract) because the data represent considerable investment in unrepeatable experiments and a significant cornerstone of individual bodies of work. Here, however, the data need not be preserved in their precise original form as long as significant content is preserved, and the repository controls the format.

Data warehouses: In the 1980s businesses became aware that they were in possession of an “information asset” in their various databases, and that they would be well advised to reformulate their systems to preserve historical data instead of overwriting it. This apprehension was answered by Inmon’s (1992) notion of “data warehouses.” The idea led to today’s retention by businesses of large historical databases containing everything from customer and supplier transaction histories to warehouse inventory histories to sales contacts and more. The existence of such information has proved its value and spawned whole areas of specialization like knowledge management (KM) and customer relationship management (CRM). As the enterprise database infrastructure changes over time, effective data warehouses have been migrated according to well-understood IS best practices, eliciting investment to realize perceived advantage. But the data in data warehouses is normalized to standard formats and is subject to purging if ineffective or if judged by legal counsel to be a source of dangerous liability.

Author/publisher text files: A much more recent and informal example can be taken from the field of the arts. For nearly thirty years computers have been an indispensable support for creative writing, as writers became early adopters of personal computer technology. Computer files have thus underlain the production of a significant body of cultural objects, even though the objects themselves have often been translated into material form for communication to audiences (e.g., books). These files and their preservation, sometimes over the years it takes for a complex work to be completed, have been an important consideration for the writers involved. Such files have indeed been preserved systematically (and one might say artisanally) across version changes in software and even in operating systems on desktop and print-production systems. Usually, however, the expense of individualized conversions can be justified because of the value of the objects, the fact that the “user community” for electronic book drafts is extremely restricted (authors, publishers, printers), and the fact that such files have not been systematically preserved for any longer than was required for the final product to be produced.

Government records: Since 2000 the Public Record Office of Victoria, Australia has been archiving digital text files under its Victorian Electronic Records Strategy (VERS), initiated in response to Australia’s policy of making states accountable for the preservation of their own digital records. The Victorian strategy focused upon a relatively low-tech solution that they considered acceptable to records creators. Because document files are required by law to be printable, they are converted to Adobe’s PDF 1.3 format, after which the PDF files are converted to Base64-encoded files that are associated with an XML “wrapper” document containing relevant metadata (Public Record Office Victoria 2000). This approach is expected to permit the preservation of the converted records through future systematic conversions, but though still “digital,” the original document is no longer “computable,” since it has been converted to an image.
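The encapsulation step described for VERS can be illustrated with a minimal sketch. It is not the VERS specification: the element names (Record, Metadata, Field, Content), the function names, and the choice to embed the encoded content inside the wrapper are assumptions made here for illustration only, and the PDF conversion step is taken as already done.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_record(content: bytes, metadata: dict) -> str:
    """Return an XML string holding metadata plus Base64-encoded content.

    Element and attribute names are hypothetical, not the VERS schema.
    """
    root = ET.Element("Record")
    meta = ET.SubElement(root, "Metadata")
    for name, value in metadata.items():
        field = ET.SubElement(meta, "Field", attrib={"name": name})
        field.text = value
    body = ET.SubElement(root, "Content", attrib={"encoding": "base64"})
    # Base64 turns arbitrary bytes into plain ASCII that is safe inside XML.
    body.text = base64.b64encode(content).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def unwrap_record(xml_text: str) -> tuple:
    """Recover (metadata, original bytes) from a wrapper made by wrap_record."""
    root = ET.fromstring(xml_text)
    metadata = {f.get("name"): f.text or "" for f in root.find("Metadata")}
    content = base64.b64decode(root.find("Content").text or "")
    return metadata, content


if __name__ == "__main__":
    pdf_bytes = b"%PDF-1.3 stand-in bytes \x00\xff"  # placeholder, not a real PDF
    wrapped = wrap_record(pdf_bytes, {"title": "Minutes", "creator": "Agency X"})
    meta, restored = unwrap_record(wrapped)
    print(meta["title"], restored == pdf_bytes)
```

The point of the design is that the wrapper is ordinary text: the byte stream survives a round trip unchanged, and the metadata stays readable without any software that understands the enclosed format.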

The primary problems in the preservation of digital objects, therefore, are not technological ones: Moore’s Law and other such phenomena seem destined to guarantee that no matter how large the data universe, there will be adequate media to support it, system designs to assure its integrity, and ever-increasing processing speeds and ingenious algorithms to enable searching it. The major problems, instead, are societal ones, and they have to do with whether and how any given community desires to preserve the record of its existence over the long term—or desires it enough. Intellectual and social capital are both at issue here: the former primarily for business, the arts, and researchers in academia and the latter primarily for government and academia as custodians of political and artistic culture. Because of these different concerns—and because different communities fetishize different attributes of the digital object—the problem of digital preservation is caught up in a confusion of discourses that often obscures common features addressed across the domain.

Preservation of digital objects requires action and expenditure. Preservation costs are exacerbated by a profusion of proprietary formats. To limit those costs, the format problem must be solved in some way and hard selection choices must be made and justified. But this is just the juncture where the problem fails to gain broad attention because so many creators of digital objects have historically had little interest in their long-term preservation, so the demand for a commercial solution has been small and little action has been taken. In order to be economical and workable, preservation must be provided for before the digital object is created, and there must be an institutional commitment to provide adequate support for it. As suggested above, it is most likely that such commitment will be concentrated in government, academia, and the digital content industry. Without an alteration in funding model, however, only preservation in the last of these venues will be self-funding.

This pattern may be new to the world of digital objects, but it is a familiar one to the world of cultural heritage. The human sense of situation with respect to history is notoriously presentist; in oral cultures anything past five or six generations fades into legendary “ancient times,” and it cannot really be said that contemporary literate cultures do much better in terms of personal experience and knowledge (Vansina 1965; Lowenthal 1985). It is therefore difficult to interest the general public in a past that has little personal relation to themselves. There is then no reason to be surprised when governments and cultural organizations enjoined to “run like a business” apply a business yardstick to the preservation of digital objects.

Why “preservation of digital objects” and not “digital preservation”?

By now a wide range of activities come under the informal heading of “digital preservation,” divided by Yakel (2001) into two groups according to the original format of the object to be preserved:

digitization of non-digital original object for preservation and access (public and private owners)
    where original is preserved
        publicly owned
        privately owned
    where original is not preserved
        publicly owned
        privately owned

preservation of born-digital objects
    publicly owned
    privately/corporately owned

Digitization, or digital reformatting or preservation reformatting, has often served as a metonym for the whole preservation activity and indeed has often been seen as the whole activity: “digital preservation” thus becomes “preservation of analog objects by digital means.” This is thanks both to businesses touting digital means for the preservation of analog cultural objects and to highly visible and effective efforts by libraries and archives themselves to supplement or replace imperilled original analog objects by use copies, formerly on microfilm and other non-digital media and now increasingly in digital form. The promise of access has attracted large amounts of public and private funding for the digitization of cultural materials. The practice has been going on long enough that the problem of preserving the digital surrogates themselves has now arisen, and the investment is large enough that it has become the poster child for the digital preservation problem. It must be recognized, however, that where for whatever reason the digitized surrogate is considered worth long-term preservation, the surrogate acquires, for all practical purposes including preservation, the same status as if it were a born-digital object. And it will require the same range of actions to achieve long-term preservation. Hence the problem addressed here is the more general one of the preservation of digital objects, whatever their origin (cf. Hedstrom & Montgomery 1998).

It seems to make sense, therefore, to draw a different dividing line: one between publicly-owned or publicly-accessible digital objects and privately-owned digital objects to which access is or may be restricted. This is especially important because of the repercussions this distinction may have on the kind of national or global preservation infrastructure that is devised and the degree of public involvement that it may have or require. By far the greatest bulk of digital objects presently being intentionally preserved past their original intended use are the scientific datasets and business data warehouses mentioned above and such public records as have been captured in digital form, most of the latter being databases held by government archives. They are records of historical process, generally preserved in some sense for public or private good. The interest the user community has in them is instrumental: using them to learn, to make new understandings, and to facilitate activities of business or daily life. They are not aimed primarily at being sold directly.

On the direct for-profit side, on the other hand, many entertainment genres and most scholarly communication still do their work in analog form, but scholarly journals, music, and film are increasingly (and in some cases for the first time) being distributed to the public in digital form. As that happens their preservation also becomes a cause for concern for public institutions that are concerned with cultural memory. If an uninterrupted flow of payment for use can be secured, the publishing and media industries seem to be willing to hand over the preservation task to established institutions of cultural memory like libraries, archives, and museums (as they formerly did by default through the right of first sale that allowed the lending and fair use of a purchased object), especially since the passage of new laws like the Digital Millennium Copyright Act and the Sonny Bono Copyright Term Extension Act to guarantee their ownership rights for the foreseeable future (Lynch 2001). But such arrangements are crucially dependent upon legal regimes that include some kind of deposit library as custodian, which has given rise in the past few years to efforts to establish such a legal requirement where it does not exist.

Most research so far in support of the preservation of digital objects has come from government or government-funded academic projects, mostly undertaken in Australia, Canada, the US, and Europe. One relatively recent change is the growing participation of computer science practitioners attracted by the scale and interest of the problem: the newly-formed IEEE Technical Committee on Digital Libraries makes reference to the “Collective Memory Community” (IEEE 2002), and it is worth looking at who these stakeholders are.

The Collective Memory Community: Stakeholders for the Preservation of Digital Objects

It is becoming clear that the central position here is occupied by the digital library. The digital library movement has received the greatest amount of attention and funding, especially from the NSF and the Mellon Foundation, as significant shifts in research practice and communication models have required new kinds of support. The future importance to all libraries of (especially noncommercial) digital collections requires a recognition that at least some libraries in turn have an important role to play in the preservation of digital collections. As a result, under the leadership of the Digital Library Federation in the United States, leading academic libraries like UC Berkeley, Cornell, Harvard, and MIT and library service entities like OCLC, RLG, CLIR, and CNI have responded to the availability of grant funding to undertake some of the most important research projects regarding digital preservation. A similar structure and interest has emerged in Britain, under the leadership of JISC and CURL and drawing in academic libraries like Oxford, Cambridge, and Leeds. Government libraries have also been prominent in these developing concerns, both as centers for research (British Library, National Library of Australia, National Library of the Netherlands) and as grant providers for academic and other research (Library of Congress, British Library). Libraries and library researchers have brought to the table particularly significant expertise in the areas of descriptive metadata and digital reformatting and a long-established acquaintance with the issues surrounding preservation through multiplication of copies, and they have carried out extensive research, especially into the long-term preservation of digital scholarly journals (Waters 2002). The digital library’s central concern is preservation for access.

A second significant stakeholder with additional interests is the archival community, whose major focus has always been preservation for cultural support and the guarantee of genuineness of unique archival holdings over time. Government archives for the western democracies, charged primarily with the retention and preservation of government records in the public interest as evidence of rights and freedoms, have struggled with the volume of modern paper records since the nineteenth century. With the rapid emergence of electronic recordkeeping they have moved (sometimes tardily, and led by national archives like the NARA, PRO, NLA, and AN) to cope with the concerns of carrying out the same mission for digital government records. As a result government archives have focused almost exclusively on born-digital records. Collecting archives, on the other hand, which function like museums of documents and may or may not be public institutions, have focused instead on digitization projects for access, since so far they have managed to avoid collecting many materials in digital formats. Archival practice in both cases still dictates a collection-level descriptive focus, a presupposition of uniqueness, and a concern to serve accustomed researcher communities, all of which practices and assumptions are being changed by digital records preservation research. But led by the major national archives, the archival community has done much work on the preservation of born-digital objects (summary in Thibodeau 2002). Archives, however, are still wrestling with their sense that archival objects must be unique rather than precisely replicable.