The MathArc System: its characteristics and status
A report to the EMANI meeting at Grenoble in October 2006.
William R. Kehoe, Cornell University Library
Over the past three years, the MathArc project has developed a protocol and software that enable multiple institutions to share and store digital objects in each other's OAIS repositories, regardless of the nature of each system's underlying repository. In the pilot version, the Göttingen State and University Library (SUB) and the Cornell University Library (CUL) are sharing, storing, and managing collections preserved in Göttingen's KOPAL system (based on DIAS) and Cornell's CUL-OAIS (based on aDORe). The digital objects include component TIFF, PDF, PostScript, XML, and LaTeX files.
The MathArc system isn't another institutional repository or standalone preservation archive. The characteristic that distinguishes it from other current approaches is that it is designed to share complex digital objects among dissimilar OAIS archives. So the PDF, PostScript, XML, and LaTeX files that make up the journal issues published in Cornell's Project Euclid, and which are stored in Cornell's digital preservation archive, can be automatically ingested into Göttingen's digital archive, even though the archives are quite different.
It was decided at the beginning of the project that no attempt would be made to preserve the access systems that are currently being used to disseminate and display the journals. Changes in technology may make the display methods obsolete in the future. Thus this is not a system of mirrors. This design avoids the problem of trying to move executable systems into the future on changing platforms. The focus of the MathArc system is instead to preserve complex digital objects separate from the current access mechanisms, with the intention that they can be delivered by future systems.
To make future file migration possible while preserving the original content, the MathArc system supports versioning. It has been designed to link newer versions of a component file to older versions and to preserve the version tree if changing technologies make it necessary to transform files from one format to another. Preservation metadata describing the changes accompanies each new object version.
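The version linking described above can be pictured with a small data structure. What follows is only a minimal sketch in Python, not the MathArc implementation: the class name FileVersion, its fields, and the migrate and lineage methods are all hypothetical names chosen for illustration, and the actual system records this information in its own preservation metadata.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class FileVersion:
    """One version of a component file (hypothetical model, not MathArc's schema)."""
    version_id: str
    file_format: str
    # Link back to the version this one was derived from (None for the original).
    predecessor: Optional["FileVersion"] = None
    # Preservation metadata describing the change that produced this version.
    change_note: str = ""
    # Links forward to derived versions; together with predecessor this forms a tree.
    successors: List["FileVersion"] = field(default_factory=list)

    def migrate(self, version_id: str, file_format: str, change_note: str) -> "FileVersion":
        """Record a new version derived from this one; the older version is kept."""
        new_version = FileVersion(version_id, file_format,
                                  predecessor=self, change_note=change_note)
        self.successors.append(new_version)
        return new_version

    def lineage(self) -> List[str]:
        """Return version ids from the original file down to this version."""
        chain = []
        node: Optional[FileVersion] = self
        while node is not None:
            chain.append(node.version_id)
            node = node.predecessor
        return list(reversed(chain))
```

In this sketch, migrating a PostScript file to PDF would call migrate on the original version, which leaves the original untouched and makes the new version reachable from it, so the full history can still be walked in either direction.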
The MathArc system uses open-source software throughout its design. The mechanisms of the system are thus open to external assessment and to future modification. They are well documented and are thus easy to maintain.
From the beginning, the system was designed to admit multiple partners. For example, it would be possible for a third partner to join and share some collections with Cornell, some of the same or other collections with Göttingen, or to become a sharing partner with only some of the other partners, but not all. The partners are the only users of the system. The intent is that partners are custodians for each other's collections, not distributors of the objects to a reading public. Access and automatic collection rights are controlled by the primary owner or custodian, so only those partners who have signed custodial agreements with the primary custodian can store objects.
As ongoing research and system-building continue around the world, the digital preservation environment is starting to be populated with special-purpose systems. Some are designed to be central repositories for publishers, such as the Portico system in the United States. Other models distribute the repositories among partners, but suggest that all the repositories be of the same architecture. Still others focus on one type of content. The LOCKSS system, for example, stores only files meant for display, but not the underlying components from which the objects are constructed. The niche the MathArc system inhabits is that of a system that permits dissimilar archives to share custodianship of objects its partner institutions have created or have published.
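The partner rule described above, in which storage rights follow from custodial agreements with the primary custodian, can be illustrated with a short sketch. This is an assumption-laden illustration in Python, not the MathArc protocol itself: the class CustodialRegistry, its methods, and the partner and collection names used with it are hypothetical.

```python
class CustodialRegistry:
    """Tracks which partners may store which collections (hypothetical model)."""

    def __init__(self):
        # Maps (primary custodian, collection) to the set of partners
        # that have signed a custodial agreement for that collection.
        self._agreements = {}

    def sign_agreement(self, owner, collection, partner):
        """Record that partner signed a custodial agreement with owner."""
        self._agreements.setdefault((owner, collection), set()).add(partner)

    def may_store(self, partner, owner, collection):
        """A partner may store a collection only if it owns it or has an agreement."""
        if partner == owner:
            return True
        return partner in self._agreements.get((owner, collection), set())
```

Because agreements are recorded per collection, a third partner in this sketch could hold one collection from Cornell and a different one from Göttingen without gaining rights to anything else, which matches the selective sharing the report describes.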
The project is coming to an end. Cornell's funding ends in February 2007; Göttingen’s, six months later. If further funding is found, more partners will be added, the system will be enhanced to allow remote statistical sampling of stored files, and the reporting system will be refined.