NDIIPP Partnership Technical Architecture Survey Collation – 1/2006
The following information has been extracted from the survey responses provided at the July 2005 Partner Meeting. It is recognized that some information is incomplete or requires updating.
Functions, Processes and Procedures
Ingest/Acquisition of Content/Content File Validation and Technical Analysis
[CDL] JHOVE (
[EMORY] Consists of the LOCKSS OAI-PMH-driven content ingestion and replication software component ( as well as a hardware component that makes use of Linux systems administration tools for allocating disk space among institutional nodes.
[ICPSR] Virtual Data Center (VDC) (
[UCSB] Working to develop a registry of geospatial formats based on the ADL Object Format Thesaurus ( See also
Archival Storage/Institutional Repository
[CDL] Storage Resource Broker /CDL Repository
[EMORY] Archival Storage is the OAIS layer that will be investigated least by the current MetaArchive Partnership. The MetaArchive Network would like to bridge the content ingestion and replication system with a modular component that will enforce archival storage (format migration or emulation, and another layer of data integrity checks that is informed by or communicates with the integrity checks contained within the LOCKSS software).
[ICPSR] VDC, through the repository service component. The Repository stores and manages digital objects and the administrative metadata (such as the object’s owner, or last time of access) associated with them. A repository access protocol allows for maintenance, hiding the details of their storage (currently a SQL database) from the rest of the system. The repository itself treats every object as a MIME-typed BLOB. All knowledge about complex objects (objects that cannot be rendered by a browser without pre-processing) is encapsulated inside the User Interface Service (UIS) (see Access Service above).
[NCSU] DSpace
[PDPT] DSpace
[UCSB] ADL
[UIUC] Evaluating DSpace, Fedora, Eprints, Greenstone, and the OCLC Digital Archive, with DSpace used as the initial, primary repository. They will also be evaluating the OCLC Web Archivist’s Workbench (WAW), which is being developed as another part of the UIUC project.
Access Management
[ICPSR] A central catalog will be provided at the Harvard-MIT Data Center (HMDC) with a registry of persistent identifiers, and a union-catalog created through harvesting. The union catalog will enable searching, will itself be searchable through Z39.50 and harvestable through OAI-PMH, and will link to the local holdings.
VDC, through the User Interface Service (UIS), the gateway to the other service components (repository, name resolution and indexing service components). The UIS is implemented as a set of Java servlets, each of which encapsulates access to a particular service or object. Each object or service is itself described in XML, and XSL is used to render the object.
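The union catalog described above is to be harvestable through OAI-PMH. As a rough illustration of what a harvest request looks like, the sketch below builds a ListRecords URL using the standard OAI-PMH request parameters. The base URL is a hypothetical placeholder, not an endpoint named in the survey, and `oai_dc` is used only because every OAI-PMH repository is required to support it.

```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL.

    base_url is the repository's OAI-PMH endpoint (hypothetical here).
    set_spec and from_date are the optional selective-harvesting arguments.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint; harvest everything changed since the July 2005 meeting.
url = oai_list_records_url("https://catalog.example.org/oai", from_date="2005-07-01")
print(url)
```

A harvester would issue this request over HTTP and follow any resumptionToken in the response to page through the full record set.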
[NCSU] Any content exposed through the NC OneMap access delivery framework will be represented via geospatial web services using Open Geospatial Consortium (OGC) specifications. The current NC OneMap viewer implementation makes both image and vector data available in a format-independent manner via the OGC Web Map Server (WMS) specification. The NC OneMap application acts as a cascading map server, making WMS requests to distributed state, local, and federal data servers and combining the resulting information into a single map image sent to the user. The WMS specification is limited in that it does not deliver the underlying data, just a map image that is the result of the request. The Web Feature Server (WFS) specification will be explored as a means to stream the actual data content to the end user. NC OneMap’s web services also stream content to the National Map.
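To make the WMS limitation noted above concrete, the following sketch assembles a WMS 1.1.1 GetMap request. The parameter names come from the OGC WMS specification; the server URL, layer names, and bounding box are hypothetical values chosen for illustration and are not taken from NC OneMap.

```python
from urllib.parse import urlencode


def wms_getmap_url(base_url, layers, bbox, width=600, height=400,
                   srs="EPSG:4326", image_format="image/png"):
    """Build a WMS 1.1.1 GetMap URL.

    layers: list of layer names (hypothetical in this example).
    bbox:   (minx, miny, maxx, maxy) in the units of the given SRS.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": ",".join(layers),
        "STYLES": "",  # empty value requests the default style for each layer
        "SRS": srs,
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": image_format,
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical server and layers covering roughly the extent of North Carolina.
getmap = wms_getmap_url(
    "https://maps.example.org/wms",
    layers=["counties", "roads"],
    bbox=(-84.5, 33.8, -75.4, 36.6),
)
print(getmap)
```

The response to such a request is a rendered image in the requested FORMAT, which is why a client cannot recover the underlying vector features from it. A WFS GetFeature request would return the feature data itself.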
[UCSB] The Alexandria Digital Library (ADL) middleware server ( is written in Java and Python and can be run as a web application inside a servlet container, as an RMI server, or both. Distributed with the server is the "Bucket99 driver," a configurable component that allows relational databases to be viewed as collections.
Web Crawling/File Indexing
[CDL] Under development: A Web Crawl service for initiating and monitoring web crawls and for processing web crawl results. This service will interact with a new CDL resource manager (Web Crawl Manager), which in turn will interact with Internet Archive’s Heritrix ( web crawler, utilizing Heritrix’s JMX interface.
eXtensible Text Framework (XTF): a flexible indexing and query tool written by CDL that supports searching across collections of heterogeneous data and presents results in a highly configurable manner. Details at
Under development: A curation service for defining web crawl collections, scheduling crawls, packaging web crawl results with archival metadata, submitting crawl results to the preservation repository, etc.
[EMORY] LOCKSS
[ICPSR] VDC, through the Index Server (IS). The IS manages indexing and searching (queries) of the descriptive metadata associated with each object. Index servers act with a large amount of independence – they are assigned sets of identifiers that they are responsible for indexing. In addition, the index servers asynchronously resolve the identifiers to a repository, retrieve the metadata component of these objects, and build indices based on this metadata.
Rights Management
[CDL] Under development: A rights management service for identifying and recording rights metadata for crawled content.
[EMORY] LOCKSS
[ICPSR] VDC
Persistent Identifier Manager
[CDL] NOID (Nice Opaque Identifier) Minting and Binding Tool. The NOID utility written by CDL creates minters (identifier generators) and accepts commands that operate them. A NOID minter is a lightweight database designed for efficiently generating, tracking, and binding unique identifiers, which are produced without replacement in random or sequential order, and with or without a check character that can be used for detecting transcription errors. CDL utilizes the Archival Resource Key (ARK) identifier, a naming scheme for persistent access to digital objects (including images, texts, data sets, and finding aids), currently being tested and implemented by the California Digital Library (CDL) for collections that it manages. Details on ARK at and NOID at
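The check character mentioned above can be illustrated with a short sketch. It assumes the scheme described in the NOID documentation: each character is mapped to its ordinal in a 29-character alphabet of digits and consonants (characters outside the alphabet count as zero), the ordinal is multiplied by its 1-based position, and the sum modulo 29 selects the check character. The sample identifier is illustrative only, and this is a sketch rather than the NOID tool's own code.

```python
# Extended-digit alphabet assumed from the NOID documentation:
# the digits plus consonants, with vowels and 'l' omitted (29 characters).
XDIGITS = "0123456789bcdfghjkmnpqrstvwxz"


def check_char(identifier):
    """Compute a check character for an identifier string."""
    total = 0
    for position, ch in enumerate(identifier, start=1):
        ordinal = XDIGITS.find(ch)
        if ordinal < 0:
            ordinal = 0  # characters outside the alphabet (e.g. '/') count as zero
        total += position * ordinal
    return XDIGITS[total % len(XDIGITS)]


def append_check(identifier):
    """Return the identifier with its check character appended."""
    return identifier + check_char(identifier)


def is_valid(identifier_with_check):
    """True if the final character matches the check computed over the rest."""
    if len(identifier_with_check) < 2:
        return False
    body, check = identifier_with_check[:-1], identifier_with_check[-1]
    return check_char(body) == check


minted = append_check("13030/xf93gt2")
print(minted, is_valid(minted))
# A transcription error that swaps two adjacent characters is detected.
print(is_valid("13030/xf39gt2" + minted[-1]))
```

Because the modulus is prime and each position has a distinct weight, a single wrong character or a swap of two adjacent characters changes the sum, so both kinds of transcription error are caught.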
[ICPSR] VDC, through the name resolution system (NRS), manages identifiers for each digital object. Each distinct intellectual work stored in the system will be supplied with a persistent identifier using the CNRI handle system ( (and additional identifiers later identified by the partnership) and a format-independent cryptographic hash or digital signature.
[NCSU] Unique, non-semantic, auto-generated identifiers will be used to track items in the workflow management database and will be included with item metadata. NOTE: There is no universal unique identifier scheme for data resources in the geospatial industry space. Upon ingest, DSpace-provided Handles will also be stored in the workflow management database. The DSpace Handles are initially seen as redundant. Separate collection identifiers will connect individual data items with broader collections.
Standards
Metadata
[CDL] Under development: A standard encoding format for representing crawled content and associated metadata, to be used for moving content between CDL-CF services, and between CDL’s repositories and partners’ repositories. This format will likely combine METS metadata encoding and the next generation of Internet Archive’s web archive file format (ARC). An ARC file is the concatenation of many datastreams, whereby a separation between datastreams is created by a section that provides – mainly crawling-related – metadata in a text-only format (see
[EMORY] The MetaArchive Metadata Schema (MACDMS) is derived primarily from the UKOLN RSLP Collection Description Profile ( but has been augmented with a local at-risk ranking and with LOCKSS specific metadata tags.
[ICPSR] The basic object managed in the VDC system is the study. Each study comprises a metadata object and a set of associated data objects. The metadata object follows the Data Documentation Initiative (DDI) standard ( an XML-based standard for social science metadata, and contains all of the structural metadata for that study, and the descriptive metadata for the corresponding (abstract) intellectual work. The associated data objects consist of text files (usually for supplementary documentation), MIME-typed BLOBs (Binary Large Objects), and/or structured quantitative databases. The metadata object acts to document the study and to tie the associated data objects together.
[NCSU] Metadata will be stored in the following forms: 1) as FGDC records or ESRI Profile FGDC records, 2) as Qualified Dublin Core, Library Application Profile (roughly) within the DSpace database (Oracle), and 3) as METS records (stored with the content as DSpace bitstreams) containing the FGDC records and additional elements.
The Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata will apply to most acquired content. Harmonization of the FGDC standard with the ISO Draft Technical Specification ISO 19139 (Geographic information – Metadata) is ongoing.
Additional metadata elements which have either been extracted from the FGDC records, disambiguated from FGDC elements, or have been created in addition to FGDC elements will be stored in a separate MySQL database along with administrative and repository ingest workflow elements related to the item. A combination of ESRI FGDC elements and additional elements from the MySQL database will be mapped to: 1) the DSpace Simple Archive Format, which uses the Library Application Profile (roughly) of Qualified Dublin Core, and 2) METS records for submission as DSpace bitstreams along with the data items. While current plans involve the use of METS, developments related to GeoDRM and the prospective use of MPEG 21 as a content packaging framework will be watched closely. Longer term, it is expected that the project would adopt whatever content packaging scheme is adopted by the geospatial industry.
[PDPT] PBCore ( and METS (
[UCSB] ADL metadata (
File Format Standards
Not requested in the survey
Infrastructure
Network
[CDL] CDL Common Framework (CDL-CF) is an open, services-oriented architecture written in Java. Consistent exposure of services through SOAP and Java Client API. All CDL-CF services are exposed as web services, using SOAP with attachments via HTTP.
[EMORY] The LOCKSS software installation is completely closed except to the nodes housed at the member institutions. The trusted relationship among servers has had the side effect of our institutions looking to this system as a test of the Shibboleth open-source platform for inter-institutional identity and trust relationships (
[NCSU] The project will explore a future component focusing on the issue of integrating preserved content into existing geospatial data discovery and access systems, notably the NC OneMap framework at the state level ( and Geospatial One-Stop ( and the National Map at the national level (
Hardware
[EMORY] The hardware is constructed from off-the-shelf components from EMC and Dell. This is a standard storage area network utilizing Intel-based hardware and Fibre Channel SATA disk storage.
1 Dell PowerEdge 1850 Server (Storage) (Processor: 1 Intel Xeon 800 MHz front side bus – Memory: 1 GB DDR 2 – Storage: 2 x 73 GB 15K SCSI drives (mirrored))
1 Dell/EMC AX100 Storage System (3 TB SATA Storage – 2 active processors)
1 Dell PowerEdge 1850 Server (Firewall) (Processor: 1 Intel Xeon 800 MHz front side bus – Memory: 2 GB DDR 2 – Storage: 2 x 73 GB 10K SCSI drives (mirrored))
1 Dell PowerConnect 2616 16 Port GigE Unmanaged Switch
1 Dell PowerEdge 4210 Rack Mount System
[ICPSR] The VDC node at HMDC is currently hosted on redundant Red Hat Linux Opteron-based servers, using XSAN ( technology for redundant storage and DLT tape jukeboxes for backup. Each partner will use additional hardware and software for ingest, local storage, and dissemination, tailored to the partner’s needs and workflow. This currently includes: Solaris and other Unix servers, SAS and SPSS statistical software for data ingest and manipulation, DLT and DVD archival backup.
[NCSU] 2 Nexsan ATABeast systems, each with forty-two 400GB ATA drives, 7200 rpm, Fibre to ATA connectivity, 512 MB cache. Capacity will be 16.8 TB per system, with a total of 12.4 TB of redundant, usable space. One system will be deployed offsite and will replicate content on the other system. Additional drive upgrades in an existing ATABeast system provide additional auxiliary space. The disk space will be managed in 1.5-2 TB partitions, with tape backups. The storage environment will be managed by a Sun Fire V440 server cluster (four 1.593GHz UltraSPARC IIIi units).
[PDPT] WGBH DAM (Digital Asset Management) implementation ( Sun Fire servers and StorEdge server to run Artesia for DAM and Oracle database server. For OAIS conformance, the software requirements are Java for the framework implementation and a PostgreSQL server to host the DSpace repository’s backend database.
[UCSB] UCSB and Stanford are using a variety of systems, both disk and tape: Isilon, Centera, etc. Archivas is a common denominator.
Operating System
[EMORY] Operating system on the Storage Area Network: RedHat Enterprise Linux AS v. 3
[ICPSR] Red Hat Linux at the VDC
[NCSU] Solaris 9
Preservation Planning and Strategies
[CDL] Normalization of Data on Ingest: Dates may be normalized either on ingest or in subsequent update.
[EMORY] Redundant Data Storage: Redundancy is spread out over six different institutions utilizing the backbone of the Internet2 Abilene network and the local connections of the Southern Crossroads (SoX) network consortium and the Mid-Atlantic Crossroads (MAX) network consortium (See Abilene Map
Bit Preservation: MetaArchive Network is only working on bit preservation. This is accomplished using MD5 checksums along with the LOCKSS polling algorithm that slowly checks each host node’s content against each other for faults, additions or subtractions. This decentralized model does not necessarily preserve bits within the framework of an individual institution’s access but provides a cost sharing model for bit preservation and access within the MetaArchive Network.
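The idea of checking each node's copy against the others can be sketched as a simple majority comparison of MD5 digests. This is a deliberately simplified illustration and not the LOCKSS polling protocol, which is considerably more elaborate. The node names and content are hypothetical.

```python
import hashlib
from collections import Counter


def md5_digest(content):
    """Return the hex MD5 digest of a bytes object."""
    return hashlib.md5(content).hexdigest()


def poll(copies):
    """Compare every node's copy of one item by digest.

    copies maps a node name to the bytes that node holds.
    Returns (winning_digest, nodes_that_disagree). If no digest is held
    by a strict majority of nodes, returns (None, all_nodes).
    """
    digests = {node: md5_digest(data) for node, data in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(copies) / 2:
        return None, sorted(digests)  # no majority: every copy is suspect
    disagreeing = sorted(node for node, d in digests.items() if d != winner)
    return winner, disagreeing


# Six hypothetical nodes, mirroring the six institutions; one copy is damaged.
good = b"archived page content"
copies = {
    "node_a": good,
    "node_b": good,
    "node_c": b"archived page c0ntent",  # simulated bit fault
    "node_d": good,
    "node_e": good,
    "node_f": good,
}
winner, damaged = poll(copies)
print(damaged)
```

A node listed as disagreeing would then repair its copy from a node in the majority, which is the cost-sharing benefit of holding replicas at several institutions.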
[ICPSR] Migration Strategies: The VDC system provides built-in support for format migration.
Normalization of Data on Ingest: Data will be normalized on ingest, which will involve the conversion of data in proprietary statistical formats to human-readable text+XML formats, and will also involve the conversion (or creation of preservation copies) of documentation in proprietary formats to preservation formats, such as XML, plain text, and PDF/A. Descriptive statistics are calculated to ensure that the data has been successfully normalized. When a data object in a proprietary format such as SAS or SPSS is ingested into the system, it is converted to a set of plain-text files and XML files that completely capture all of the data and metadata embedded in the original file.
Redundant Data Storage: As much of the content as possible will be located on a VDC system at Harvard that will mirror data at each individual partner’s sites.
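The ICPSR normalization check described above, in which descriptive statistics are calculated to confirm a successful conversion, might look like the following sketch. It assumes the proprietary file has already been parsed into rows by some reader (not shown, since that requires third-party software), writes the rows to plain-text CSV, reads them back, and compares summary statistics. The column names and values are invented for illustration.

```python
import csv
import io
import statistics


def describe(values):
    """Summary statistics used to compare a variable before and after conversion."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }


def normalize_to_csv(columns, rows):
    """Write parsed rows to a plain-text CSV string (the normalized form)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


def read_column(csv_text, column):
    """Read one numeric variable back out of the normalized CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [float(record[column]) for record in reader]


# Hypothetical rows standing in for data parsed from a SAS or SPSS file.
columns = ["respondent_id", "age"]
rows = [(1, 34.0), (2, 51.5), (3, 28.0)]

normalized = normalize_to_csv(columns, rows)
original_ages = [row[1] for row in rows]
verified = describe(original_ages) == describe(read_column(normalized, "age"))
print(verified)
```

Matching statistics do not prove the conversion is lossless, but a mismatch is a cheap and reliable signal that a variable was truncated, mis-typed, or dropped during normalization.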
[NCSU] Migration Strategies: The emergence of a widely accepted and understood Geography Markup Language (GML) application schema ( might trigger the migration of additional vector content at a future point. There is some interest in developing mechanisms, based on data inventories, to track format market strength in order to assess decline in support of particular formats, informing migration triggers.
Normalization of Data on Ingest: Existing FGDC metadata will need to be normalized. Malformed content will need to be restructured using openly available FGDC tools known as ‘cns’ (Chew and Spit) and ‘mp’ (Metadata Parser). Metadata will be imported using the National Park Service Metadata Toolkit on top of ESRI’s ArcCatalog for synchronization of elements and subsequent export as ESRI Profile FGDC XML records.
Managing Time-Versioned Content: In one possible approach, serial item representations of data items would be assigned Handles which serve as serial item identifiers. A separate database table would be used to manage serial representations, including storage of volatile layer-related information which would be inappropriate for inclusion in metadata stored with repository objects. The serial representation would be actively managed over time, and would provide the basis for Handle-based call-backs from the time-versioned items--after they have separated from the archive--for “get current metadata,” “get current object,” and “get current DRM” requests. The role of the Handle would be to redirect to the current location of the serial object manager, which would resolve to current context of the individual serial object. This approach is still speculative.
[UCSB] Migration Strategies: Along with each archival data object we intend to archive sufficient format, semantic, and contextual information that enables access mechanisms for the object to be re-created at any point in the future, and that allows the object to be used for the scientific purposes for which it was intended when originally created. This approach is effectively a foundation for other preservation approaches (periodic migration, etc.).