Collection Repository: Personal to Organizational

Introduction to the CERN Document Server

at the San Diego Supercomputer Center (CDS @ SDSC)

Karen S. Baker, Anna Gold, Frank Sudholt

Abstract

An individual, a project, and an organization (whether an institution, a network, or a discipline) have collection management needs in common. In an academic arena, a collection often consists of bibliographic citations but may also focus on the documents themselves as well as on photographs, videos, course materials, and/or artifacts. The multiple levels from personal to organizational (project, department, campus, or network of campuses) represent nested tiers across which information can flow when technology and metadata standards are partnered to provide an accessible, interoperable digital framework for distributed collection management.

We seek a generalized tool that can be easily adapted to the multi-level needs for a full range of repository activities: gathering, sharing, and discovering materials. The gathering together of objects into different collections may involve unique or re-used objects. An object is repurposed through use in multiple collections where a collection may be viewed as a type of intellectual capital (Greenstein, 200x). A collection conveys information by the selections and omissions; it shares a view or piece of knowledge about a subject or an organization. This work is driven by the recognition of the power of a collection to present information about individual entries and to convey insights resulting from an assembled whole.

Introduction

This report provides an overview, in theory and in practice, of a project seeking to define a document repository useful at multiple levels. The initial focus is on bibliographic materials and system requirements to support coordinated repository efforts. Beyond theory, a field report and local observations are presented in addition to a consideration of next steps. Progress in creating a prototype repository at the San Diego Supercomputer Center using CERN CDSware is detailed, including the reconciliation of divergent practices and motivations of target repository participants.

We build from the assumptions that information flow is enhanced through use of the web as a common interface, cooperation is facilitated through a central repository, and the user is to be respected through coordination with attention to local practice. The project goals that focus project activity include: 1) to create a repository supporting two-way information flow (ingest and export) involving exchange between an individual’s collection system and a centralized collection system through partnering of existing technology and metadata; 2) to develop a specific working environment while staying informed about compatibility with alternative developments in the metadata and archives arenas; 3) to consider information about collections in a broader context with reference to collectors and their organizational ties; 4) to employ an iterative prototyping process as a development environment in order to optimize product usefulness given existing practice.

With so many metadata and digital library efforts, one is prompted to ask “why another repository?” The response arises from recognizing the value of considering a diversity of approaches, given the complexity involved in first establishing and then maintaining a repository. A range of important differences arises when considering repository design issues such as

  • How things get in
  • How things get out
  • Who can put things in (and take out)
  • What things can be put in
  • What linkages they have to other systems
  • What protocols/standards they follow

There are multiple, differing roles and motivations for participation by individuals, groups, networks, institutions, and disciplines (Table 1, Appendix). The process observes protocols that enable federation and interconnectivity. The process also allows personal or institutional differentiation or “streams” (both into and out of the federated processes).

Repository success depends on a good match between technical and social design. Both technical and social issues are complex. Good social design remains an unsolved research challenge as the cultural and management aspects of repository building are emerging as areas of investigation (ARL, 2002). In addition to the multiple levels of organization, it is important to take into consideration the divergence ingrained in more-or-less well-functioning workflows and practice. We recognize and are committed to research and discovery as a process, not a product (Floyd, 1987) and to the participation of the multiple voices from scientist to information manager to system designer.

Background

Discussions on digital repositories are ongoing (Peters, 2002). Visions often focus on the particular, i.e. individual, project, organizational, national or international levels. Repository models are often driven by issues of publishing and/or identity. Our own work is stimulated by the notion of a single collection as a representation of an ongoing instance of learning and of a diversity of collections as a presentation of multiple understandings, all important to preserve and to connect without limitations. Further, we focus on the flow between these levels recognizing the importance of identity at all levels along with the critical need to engage the individual participant in the process of information gathering.

The growing work in cross-domain research prompts a refocus to the empirical understanding of the interdisciplinary research environment itself (Spanner, 2001) where the communications and infrastructures differ even between established and fringe interdisciplinary studies. In both, however, high priority is given to informal communications. The personal repository may be unique in providing a mechanism to extend informal communications if collection-making tools are available. Collections may provide a diversity of individual views into a discipline that when shared serve as informal guides and communications for associates. They also serve as an important outlet similar to that provided by contribution to volunteer open-source efforts where an individual is motivated to contribute by the ability to create according to a personal vision of a shared product.

Knowledge management and information flow become critical issues the moment a collection of objects is assembled. An individual insight is instantiated through selection and classification. Tools that facilitate aggregation create a knowledge heterogeneity that informs at multiple levels and from multiple views. Individual collections are a first step in a learning process (research cycle) involving document diversity, information flow, and knowledge heterogeneity. Our view emphasizes the 3R’s (research, relationships and reflexivity) with a focus on both infrastructure (documents, content, work practices) and cyber infrastructure (tools, methods, best practices). As we situate our position within the world of repositories, archives and libraries, we start with a document oriented view and extend it through integration of related information such as administrative tables, alternative media such as photographs, and associated services such as organizational metrics and report displays. We seek to create a process of learning and informing for participants by providing mechanisms that enable an individual’s work. The project name FLOW is a purposeful metaphor calling to mind the uncounted rivulets shaped by the local landscape that join to a river of information contributing to heterogeneous pools of knowledge.

Linking individuals and organizations to documents and collections contributes to identity, but technical, conceptual, and social problems arise. Technical hurdles include resource support, open design, and implementation strategies. Social barriers include the need for a critical mass of participation, since acceptance of the system is dependent on activation energy: people must be motivated to take time to make time. Conceptual difficulties involve articulating associations between individuals, materials, and organizations in order to capture the relationships. When considering the federation of collections, complexity is also introduced by the multiple levels of relationships and by changes in classification requirements as learning happens.

Benefits of a multiple level, multi-participant approach include

  • New tools to facilitate information gathering and reporting at multiple levels of organization
  • Knowledge generation enabling participation from multiple individuals and groups
  • Data/information reuse providing accessibility for multiple use
  • Project definition through identity enhancement and multiple reflections

The project demonstrates that short-term/local approaches are not only compatible with long-term federation strategies but also critical to initiating information flow, contributing to knowledge diversity, and ensuring a continuing feed-back process.

Our design approach is to start small, design grounded in the local particular with an eye to federation, and implement at the organizational level. Critical system design elements include

  • Centralization by organization of individual, project, institution or partnership;
  • Federation through a common theme and across multiple locations;
  • Openness with interoperability through protocols such as OAI.

The approach is a reflexive process where we act locally, think globally, and then reflect and react. Reacting reinitiates the process of customized local actions, of experiments and experiences, thereby enabling learning and change.

In Partnership

This project, supporting partnering of the Long-Term Ecological Research (LTER) Palmer site information manager with participants from the San Diego Supercomputer Center (SDSC) and the University of California San Diego (UCSD) Libraries, creates a larger view of organizational infrastructure. The partnership role is to

  • Empower partners with broader vision
  • Bridge individual, organizational, & national needs
  • Define and meet current needs
  • Provide arena for cross-domain work
  • Bridge from present to future needs
  • Anticipate change, optimizing for sustainability, and
  • Work from bottom up toward repository system.

This team identified the European Organization for Nuclear Research (CERN) Document System (CDS) as a powerful prototyping software package and installed the software locally at SDSC. Communications with the CERN software developers were initiated and a technology transfer agreement was established between SDSC and CERN. This collaboration (Figure 1) brings the active involvement of the CDS development team to the UCSD CDS @ SDSC prototype team.

To present a larger context, Figure 1 shows four tiers: the SDSC computational environment, the Semantics Group project, and the broader community of digital repositories and services, which may be divided into two categories, those compliant with the Open Archives Initiative (OAI) and those not compliant with OAI. Embedded within the SDSC computational environment is the CDSware software, whose storage may potentially be interfaced with the facility's Storage Resource Broker. The Storage Resource Broker does not address ingestion or workflow; it is a logical name space (rather than a database) for managing the storage of the data rather than the data itself.

During these times of rapid technological change and social transition, there is good purpose to multiple approaches, both as complementary, synergistic investigations and as training grounds for all participants. With the ability to preserve more information in the form of collections, new needs will arise, requiring more new services and tools than any one group can provide.

Our semantics group partnership goals may be summarized as follows:

  • Create personal citation libraries
  • Create input & retrieval tools for users and administrators
  • Assure discovery and output capabilities
  • Design for interoperability through identification of national categories and use of national standards
  • Design for local needs (define local categories)
  • Term lists (controlled vocabulary)
  • Keyword fields, i.e., by discipline, theme, grant support, bibliography
  • Design for scalability through mapping (cross-walking)
  • Consider sustainability of process
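The mapping (cross-walking) goal above can be illustrated with a minimal sketch. It assumes citations arrive in EndNote's tagged export style (one `%X value` pair per line) and are mapped to Dublin Core element names. The particular tag-to-element choices, the function name `crosswalk`, and the sample record are illustrative assumptions, not the project's actual import filter.

```python
# Illustrative crosswalk from EndNote tagged-export tags to Dublin Core
# element names. The mapping choices are assumptions made for this sketch.
ENDNOTE_TO_DC = {
    "%A": "creator",
    "%T": "title",
    "%D": "date",
    "%J": "source",
    "%K": "subject",
    "%X": "description",
}


def crosswalk(tagged_record: str) -> dict[str, list[str]]:
    """Map one tagged record (one '%X value' pair per line) to Dublin Core."""
    dc: dict[str, list[str]] = {}
    for line in tagged_record.splitlines():
        line = line.strip()
        if len(line) < 3 or not line.startswith("%"):
            continue  # skip blank or malformed lines
        tag, value = line[:2], line[2:].strip()
        element = ENDNOTE_TO_DC.get(tag)
        if element and value:
            # Repeated tags (e.g. several authors) accumulate in order.
            dc.setdefault(element, []).append(value)
    return dc


# Placeholder record, invented for illustration only.
sample = """%0 Journal Article
%A Doe, Jane
%A Roe, Richard
%T An example title
%D 2002
%K repositories
"""
record = crosswalk(sample)
```

Keeping the mapping in a single table means local categories can be added or changed without touching the parsing logic, which is one way to keep a crosswalk maintainable as classifications evolve.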

Existing participant approaches to bibliographic materials include:

LTER: Using EndNote PC platform software to gather structured bibliographic data from disparate sites. Using proprietary software as a means to share data for discovery and networking.

SDSC: Compiling bibliographic records and citation counts, in order to testify to research impact of funding for projects such as the National Partnership for Advanced Computational Infrastructure (NPACI). Gathering is manual, with no means to share or enable discovery.

UCSD Departments: Maintaining large EndNote files of references on research on a dedicated workstation. Lacks capability to share, interactively submit, or discover by partners overseas.


Figure 1. CDS @ SDSC Computational and Partnership Environments

Federated Library on the Web (FLOW)

A stream metaphor (Figure 2a) is invoked in the conceptual schema, in contrast to the original notion for bibliographic citation entries (Figure 2b). Ease of input and output is critical to both models. The concept of a Federated Library on the Web, FLOW, develops from understanding the distinctive document management tools and practices used within each layer (individual, group, center, network, discipline) and from recognizing that these layers represent boundaries across which information could flow openly if technology and metadata could provide an enabling digital framework (“metadata grid”).

Figure 2a: FLOW: Federating Libraries on the Web.

The immediate need for such a model was to gather data from research users through web forms for input of information about publications, research grants, individuals, and organizational context. A dual-mode function is desired for gathering entries in batches (EndNote, Web of Science searches) and one-by-one (individual submissions). These data could then be shared as collections of information and discovered and retrieved in multiple ways from a relational database. This approach, with a central repository, creates public exposure for data to enhance its impact.
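As a rough sketch of the dual-mode gathering described above, the following uses Python's built-in `sqlite3` module as a stand-in for the relational store. The table name `citation`, its columns, the function names, and the sample rows are all hypothetical and chosen only for illustration; they do not describe the CDSware schema.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE citation ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT NOT NULL,"
    " year INTEGER,"
    " collection TEXT NOT NULL)"
)


def submit_one(title, year, collection):
    """One-by-one mode: a single web-form submission."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO citation (title, year, collection) VALUES (?, ?, ?)",
            (title, year, collection),
        )
    return cur.lastrowid


def submit_batch(rows):
    """Batch mode: many parsed entries loaded in one transaction."""
    with conn:
        conn.executemany(
            "INSERT INTO citation (title, year, collection) VALUES (?, ?, ?)",
            rows,
        )


submit_one("Single submitted entry", 2002, "personal")
submit_batch([
    ("Batch entry A", 2001, "project"),
    ("Batch entry B", 2002, "project"),
])

# Discovery: retrieve within one collection or across all of them.
in_project = conn.execute(
    "SELECT title FROM citation WHERE collection = ? ORDER BY title",
    ("project",),
).fetchall()
total = conn.execute("SELECT COUNT(*) FROM citation").fetchone()[0]
```

Loading a batch inside one transaction means a malformed entry rolls back the whole upload rather than leaving a partial import, which is one reasonable design choice for the batch path.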

Figure 2b. Creating an organizational bibliography: Structured, Parsable, Loosely Federated, MultiLevel Approach

Functional criteria call for a system supporting the ability to:

  • Modify and update submissions
  • Provide full text via local or remote files
  • Search by fields or full text
  • Support citation counting
  • Handle various media and data types
  • Provide for a review process, and
  • Offer customization (personalization) and alert services.

Implementation criteria for the system include:

  • Standards-based
  • Open source
  • Flexible
  • Fast
  • Support search within or across collections.

Technical Issues

Design Decisions

The following protocols and standards were considered:

  • OAI-Protocol for Metadata Harvesting
  • MARC 21
  • Z39.80 (article databases, bibliographic software)
  • Dublin Core
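To make the first of these concrete, the sketch below builds an OAI-PMH `ListRecords` request for Dublin Core metadata (`oai_dc`, the metadata format every OAI-PMH repository is required to support). The base URL and the set name are hypothetical placeholders; a real harvester would use the endpoint and sets that a given repository publishes.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; a real repository publishes its own base URL.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec      # selective harvesting by set
    if from_date:
        params["from"] = from_date    # datestamp lower bound, e.g. 2002-07-01
    return f"{BASE_URL}?{urlencode(params)}"


# "bibliography" is an invented set name used only for this example.
url = list_records_url(set_spec="bibliography", from_date="2002-07-01")
```

The `from` argument supports incremental harvesting, so a federated partner can request only records added or changed since its last visit.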

Additional design considerations included:

  • Representing both people and digital objects in the system:
  • “creators” are considered both authors and people
  • integration with the personnel database was needed in order to enable organization views such as “all the people associated with XYZ research group”
  • Incorporating records for non-document objects (events as well as groups, people, grants)
  • Allowing a hybrid system of metadata with or without associated digital objects
  • Planning for end-user upload from EndNote or similar commercial citation management software, and
  • Creating genre-based views for public, and organization views for internal institutional purposes.
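The considerations above can be sketched as a small data model, assuming people and objects are separate record types linked through a creator relationship, and that an attached file is optional. The class names, field names, and sample values are hypothetical and serve only to show the shape of such a model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Person:
    name: str
    groups: set = field(default_factory=set)  # organizational ties


@dataclass
class Record:
    title: str
    kind: str                                      # e.g. "article", "event", "grant"
    creators: list = field(default_factory=list)   # Person objects
    file_path: Optional[str] = None                # None means metadata only


def people_in_group(records, group):
    """Organization view: names of creators tied to a given group."""
    return sorted(
        {p.name for r in records for p in r.creators if group in p.groups}
    )


# Placeholder people and records, invented for illustration only.
alice = Person("A. Author", {"XYZ research group"})
bob = Person("B. Builder", {"Another group"})
records = [
    Record("Metadata-only citation", "article", [alice, bob]),
    Record("Workshop", "event", [alice], file_path="files/workshop.pdf"),
]
xyz_people = people_in_group(records, "XYZ research group")
metadata_only = [r.title for r in records if r.file_path is None]
```

Treating the file as optional is what allows the hybrid system: a bibliography entry and a full-text deposit share one record type and differ only in whether an object is attached.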

CERN Document System Background

Several software options existed, such as OpenEprints software and the CERN Document System (CDS, later CDSware). A comparison of active potential repository systems is given in Table 2 (see Appendix). The California Digital Library’s bepress software is still under development. The CERN Document System was found to be compatible with open eprints initiatives in research communities (OAI and OAI-PMH) but independent of OpenEprints / OAI priorities.

Running at CERN, the CERN Document System (CDS), revised and released in July 2002 as CDSware, is a program that allows the user to:

  • Search a scientific publication database
  • Submit objects into the database (metadata and document files)
  • Personalize the user account (predefined searches, publication baskets etc.)

The public interface is the World Wide Web. The current CERN implementation of CDSware manages over 350 collections of data, consisting of over 550,000 bibliographic records, including 220,000 full-text documents: preprints, articles, books, journals, photographs. The capabilities of the CDSware system include

  • Batch uploading of bibliographic citations
  • Submitting and modifying individual submissions
  • Support for differentiated collections or “catalogues” that can be searched separately or together, and
  • Support for implementing other modules, including: personalization, alert services, output and file format options.

The CERN Document System (CDS) was identified as available for rapid deployment and interactive feedback under the project name CDS @ SDSC. CDSware strengths were an active, ongoing development staff; compliance and evolution with existing international standards; and responsiveness to a user community during the development stage. Support has been available from the CERN technical staff for installation, configuration, and modification of the CDS system, which combines a CERN-developed front end with off-the-shelf open source back-end software. System experts helped to configure CDSware, working with our local development team leader, who set up the server, updated supporting software (WML, C compiler, make, Perl, and zlib), and ran the basic installations (MySQL, PHP, Apache, and Python). The enhancement of import filters, development of export filters, and population of the CDS @ SDSC system will continue through the second year of this grant.

The CDSware software has the advantages of

  • Proven institutional implementation at CERN
  • Full implementation of extended features (personalization, review)
  • OAI compliance
  • Support for hybrid repository / bibliography
  • Technical support and active development, and
  • Open source distribution under GNU license.

CDSware presents a configurable portal-like interface for hosting various kinds of collections, and features:

  • A powerful search engine with Google-like syntax
  • User personalization, including document baskets and email notification alerts
  • Electronic submission and upload of various types of documents
  • Compliance with OAI data and service provider protocols, enabling the metadata exchange between heterogeneous repositories, and
  • Automated citation recognition and linking.

There are two basic input/output forms, batch and individual. Separately, there are three configurable modes for submission: direct (no curation or intervention), reviewed (the submission goes to a staging area for approval before posting), and peer reviewed (more complicated routing and approval).
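The three submission modes can be pictured as a small approval workflow. The sketch below is an assumption-laden illustration, not CDSware's implementation: it models each mode by the number of approvals required before posting, and the figure of two approvals for peer review is an arbitrary stand-in for the "more complicated routing" mentioned above.

```python
from enum import Enum


class Mode(Enum):
    DIRECT = "direct"
    REVIEWED = "reviewed"
    PEER_REVIEWED = "peer reviewed"


# Approvals required before posting; the peer-review count is an assumption.
APPROVALS_NEEDED = {Mode.DIRECT: 0, Mode.REVIEWED: 1, Mode.PEER_REVIEWED: 2}


class Submission:
    def __init__(self, title, mode):
        self.title = title
        self.mode = mode
        self.approvals = 0

    @property
    def status(self):
        """'staged' until enough approvals are recorded, then 'posted'."""
        needed = APPROVALS_NEEDED[self.mode]
        return "posted" if self.approvals >= needed else "staged"

    def approve(self):
        """Record one approval and return the resulting status."""
        self.approvals += 1
        return self.status


direct = Submission("Direct entry", Mode.DIRECT)
reviewed = Submission("Reviewed entry", Mode.REVIEWED)
peer = Submission("Peer-reviewed entry", Mode.PEER_REVIEWED)
```

A direct submission is posted immediately, while a reviewed submission stays in the staging state until one call to `approve()` and a peer-reviewed one until two.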