Developing Digital Libraries: Technologies and Challenges
Dr. Mohan Raj Pradhan
Dept. of Lib. & Inf. Sc.
Tribhuvan University
Kathmandu, Nepal
Email:
Abstract
This paper discusses the basics of developing a digital library and explains the new concepts underlying digital library development procedures with regard to technologies and managerial skills. Measures are needed to overcome the problems of computer viruses and unauthorised use. The initial investment in digital libraries is high, as is the cost of maintenance; it is therefore essential to explore new sources of funding.
Keywords
Libraries, Information technology, Internet, Developing countries, Electronic data interchange, Funding
1 Introduction
The digital library (DL) is a new concept. It has brought a phenomenal change in the collection, preservation and dissemination of information throughout the world. For the profession of librarianship, this turn of events is a blessing in disguise. The concept has arrived at a time when the traditional library is fading and fast losing its value against the powerful growth of the Internet and the virtual library. But the Internet is just a tool that facilitates access to available information. It lacks the significant societal role of preserving and diffusing human knowledge through the ages and across generations. Moreover, the Internet simply provides everything that is published, regardless of quality, authenticity or reliability.
At this point, the time-tested profession of librarianship has embraced the technology, bringing its centuries-old techniques of selecting and acquiring quality objects of human knowledge (in this case digital documents), performing subject analysis and cataloguing (in this case metadata definition), organizing them into searchable collections accessible via the web (called digital libraries) and preserving them for future use.
2 Development of the DL concept
Digital libraries began to appear in the early 1990s as research and development projects centered within the computer science departments of universities, sometimes funded by government grants. As these projects matured, Information Technology (IT) groups began to partner with the library to develop campus-wide standards for the operation of digital libraries as part of the education enterprise.
With the introduction of the Internet and intranets, it was no longer considered practical to travel to a particular place to access information sources at a particular time of day. Because of the time value of information, users expect instant access to information from any location at any time. The digital library in its current shape is an attempt to fulfil this objective.
3 Defining the Digital Library
The literature contains many definitions of the term digital library. This article uses the definition that is most logical from the viewpoint of librarians, proposed by the American Digital Library Federation in 1998:
“Digital libraries are organizations that provide the resources, including the specialised staff, to select, structure, offer intellectual access to, interpret, distribute, preserve the integrity of, and ensure the persistence over time of collections of digital works, so that they are readily and economically available for use by a defined community or set of communities.”
Based on the above definition, Cleveland (1998) gave some of its characteristics. One of these characteristics is:
“Digital libraries are the digital face of traditional libraries and include both electronic (digital) as well as print and other (e.g. film, sound) materials.”
Cleveland (1998) also thought that: “In reality, digital libraries will not be a single, complete digital system that allows users to promptly access all information, for all disciplines, from anywhere around the world. Instead, they will most likely be a collection of disparate resources and disparate systems, catering to specific communities and user groups, created for specific purposes. They will also include, perhaps indefinitely, paper-based collections.”
Sharma and Vishwanathan (2001) said that “Growth of digital libraries involves digitisation of existing library materials; connectivity to the users in the world online and offline; integration with networking; and availability on the World Wide Web.”
4 Key components
A fully developed digital library environment involves the following elements:
a. Digital documents (either born digital or created by converting content to digital form);
b. The extraction or creation of metadata or indexing information describing the content to facilitate searching, as well as administrative and structural metadata to assist in object viewing, management and preservation;
c. Storage of digital content and metadata in an appropriate multimedia repository that meets the requirements of intellectual property rights;
d. Client services for the browser, including repository querying and workflow;
e. Content delivery via file transfer or streaming media;
f. User access through a browser or dedicated client; and
g. A private or public computer network.
5 Developing Digital Libraries—Team Approach
Very few people have all the skills required to construct a digital library. Most of the skills are too specialized for librarians or other non-specialists to acquire. Digital library development projects are therefore very much a team effort. The skill set of a typical digital library team may be as follows:
• Technical skill (knowledge of IT hardware/software);
• Project management;
• Database development;
• Cataloguing (metadata);
• Computer programming;
• Web design;
• Subject specialists;
• Preservation (document formats and long-term storage media);
• Photography;
• Graphic design/digitization skills; and
• Volunteer/student help.
6 Digital Archiving in the Framework of Information Life Cycle Management
The framework of the information life cycle consists of creation, acquisition, cataloguing/identification, preservation and access. Each is described briefly below.
6.1 DLs- Creation
Building digital libraries begins with creating digital content and collections. Creation is the act of producing the information product. The creator may be a human author or originator, or a piece of equipment such as a sensing device, satellite or laboratory instrument.
Several key practices are involved in archiving projects. First, the creator may be involved in assessing the long-term value of the information.
Secondly, the preservation and archiving process is made more efficient when attention is paid to issues of consistency, format standardization and metadata description at the very beginning of the information life cycle. Limits are placed both on the software that can be used and on the format and layout of the documents in order to make short- and long-term information management easier.
6.1.1 The digital content may be
a. Born digital. These materials exist in electronic format from the beginning, as the originating source.
b. Digitized. These materials are converted to a digital format from an initial analogue form.
6.1.2 Digitization
Digitization is the process of creating digital files by scanning or converting analogue materials.
The technology used to digitize analogue objects is called scanning, and the equipment is called a scanner. Scanning relies on either imaging technology or optical character recognition (OCR) technology.
The imaging process creates a photographic image of the paper document, which may later be converted to PDF or saved as a JPEG or bitmap image, according to the policy or standard adopted for digitization. The image is read-only and cannot be edited.
OCR technology converts the scanned image into editable electronic text, either as plain text or in a word processor document format.
6.2 DLs- Content Selection and Acquisition
Content selection and acquisition is the stage in which the created object is “incorporated” physically or virtually in the archive. The object must be known to the archive administration. There are two main aspects to the selection and acquisition of digital objects – content selection policies and acquisition procedures.
6.2.1 DLs- Content Selection Policies
The type, size and format of the digital content selected for a DL are the main factors that dictate the technological requirements, namely the hardware, software and IT capabilities needed in the future.
One should be very clear about what content will constitute the DL. Will the DL’s content consist only of the internal documents of an organization, or will external documents also be included? If external documents are included, will they be free or acquired from commercial vendors? If purchased from commercial vendors, will there be an ongoing expense or a one-time expense? For example, subscribing to an e-journal involves an ongoing expense.
If electronic resources are purchased from commercial sources, will they be owned by the DL after purchase, or will the DL merely have access to them through a user name and password or through an IP address? If a document is purchased in electronic format, will the print format be continued or not? In both cases, the utility for the present and the future should be assessed.
Regarding access to electronic resources through a license, two aspects have to be considered: the number of users and Internet speed. License cost may differ with the number of users. Accessing electronic resources through a license requires Internet access, and high-speed Internet access is very expensive in developing countries, so one should calculate the ongoing cost of the connection. If the electronic resources are purchased on CD, the costly Internet access need not be paid for. One should weigh the pros and cons of accessing electronic resources over the Internet against access through a CD server or a local computer.
Another consideration is how the users will use the DL. Will access be free of charge or paid? What technology will be required for accessing the DL? Are the infrastructure and manpower available for accessing the DL? Do the users know how to use it, or is training required for accessing electronic resources?
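The comparison between licensed online access and CD-based access can be made concrete with a small calculation. The following is a minimal sketch, not a costing model: every figure is hypothetical, the function names `online_cost` and `cd_cost` are introduced only for illustration, and the assumption that the license fee is charged per user per year is a simplification.

```python
def online_cost(years, licence_per_user, users, internet_per_year):
    """Total cost of licensed online access over a number of years.

    Assumes (hypothetically) a licence fee per user per year plus a
    fixed yearly charge for the Internet connection.
    """
    return years * (licence_per_user * users + internet_per_year)


def cd_cost(years, cd_subscription_per_year, server_one_time):
    """Total cost of CD-based access: a one-time CD server plus yearly discs."""
    return server_one_time + years * cd_subscription_per_year


# Hypothetical figures in an arbitrary currency unit, for illustration only.
online = online_cost(years=5, licence_per_user=20, users=50, internet_per_year=3000)
local = cd_cost(years=5, cd_subscription_per_year=2500, server_one_time=4000)

print("online:", online)
print("CD/local:", local)
print("cheaper option:", "online" if online < local else "CD/local")
```

With other figures the result can easily reverse, which is why the calculation should be repeated with the library's own prices, user numbers and planning horizon.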
6.3 DLs- Content Acquisition
Just as with print documents, the volume of e-documents will keep increasing. The acquisition and ongoing loading of e-documents will be a regular routine. It should be decided whether the loading will be centralised or distributed. If the loading is distributed, then loading capability, firewall security and available manpower should be considered.
Compatible software should be available for accessing electronic resources. If conversion is needed for access (e.g. from text to HTML or PDF), it should be ensured that the hardware and software for conversion are available.
Consideration should also be given to acquisition made through archived links. In that case, there should be a policy on refreshing the archived links, on the gathering approach and on the extent of what is gathered.
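A refresh policy for archived links can be expressed as a simple rule. The sketch below assumes a policy of the form "re-gather any link whose archived copy is older than a fixed number of days"; the function name, the 90-day limit and the example.org addresses are hypothetical and serve only to illustrate the idea.

```python
from datetime import date, timedelta


def links_due_for_refresh(links, today, max_age_days):
    """Return the URLs whose archived copy is older than the policy allows.

    `links` maps each URL to the date on which it was last gathered.
    """
    limit = timedelta(days=max_age_days)
    return [url for url, last_gathered in links.items()
            if today - last_gathered > limit]


# Hypothetical archive register: URL -> date of last gathering.
archived = {
    "http://example.org/report-a": date(2024, 1, 10),
    "http://example.org/report-b": date(2024, 5, 1),
}

due = links_due_for_refresh(archived, today=date(2024, 6, 1), max_age_days=90)
print(due)
```

Here only the first link is reported, because its archived copy is more than 90 days old on the chosen date.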
6.4 DLs- Identification and Cataloguing
Once the archive has acquired the digital object, it is necessary to identify and catalogue it. Both identification and cataloguing allow the archiving organization to manage the digital objects over time. Identification provides a unique key for finding the object and linking that object to other related objects. Cataloguing in the form of metadata supports organization and access.
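The role of the identifier as a key for finding and linking objects can be illustrated as follows. This is a minimal sketch under stated assumptions: deriving the key from a SHA-256 hash of the content is only one possible approach (archives often assign identifiers through an external naming scheme instead), and the names `make_identifier`, `register` and `catalogue`, as well as the "dl" prefix, are hypothetical.

```python
import hashlib

# In-memory stand-in for the archive's catalogue (hypothetical structure).
catalogue = {}


def make_identifier(content: bytes, prefix: str = "dl") -> str:
    """Derive a repeatable key from the object's bytes using SHA-256."""
    digest = hashlib.sha256(content).hexdigest()
    return f"{prefix}:{digest[:16]}"


def register(content: bytes, title: str, related=()) -> str:
    """Record an object under its identifier and link it to related objects."""
    identifier = make_identifier(content)
    catalogue[identifier] = {"title": title, "related": list(related)}
    return identifier


article_id = register(b"full text of an article", "Sample article")
dataset_id = register(b"rows of a data file", "Supporting data",
                      related=[article_id])

print(article_id)
print(catalogue[dataset_id]["related"] == [article_id])
```

Because the key is computed from the content, the same object always receives the same identifier, while the `related` list carries the links between objects.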
6.4.1 Metadata
One of the most challenging aspects of the digital environment is the identification of resources available on the web as well as in the digital repositories. The existence of searchable descriptive data increases the chance of accessing the archived digital object for use.
Metadata is defined as “data about data” or “information about information”. It is the information that describes significant aspects of a digital resource. Most discussion to date has tended to emphasize metadata for the purposes of resource discovery. Examples of metadata systems include library catalogues, archival finding aids, and museum inventory control or registration systems. Over the years, metadata formats have been developed for a wide range of digital objects. Within this range of formats, there is a degree of consistency across all metadata schemes that supports interoperability. For example, most schemes provide for a title field, a date field and an identifier field.
There is usually a direct relationship between the cost of metadata creation and the benefit to the user. For example, applying standard subject vocabularies and classification schemes is more expensive than assigning a few keywords.
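The shared title, date and identifier fields are what make it possible to map records from different schemes onto a common form. The sketch below illustrates such a mapping; the two schemes, their field names and the sample records are invented for illustration and do not correspond to any published standard.

```python
# Fields that, per the text, most metadata schemes have in common.
COMMON_FIELDS = ("title", "date", "identifier")

# Hypothetical field mappings for two invented schemes.
CROSSWALKS = {
    "scheme_a": {"Title": "title", "Issued": "date", "ID": "identifier"},
    "scheme_b": {"name": "title", "created": "date", "accession_no": "identifier"},
}


def to_common(record, scheme):
    """Map a scheme-specific record onto the common fields.

    Fields with no counterpart in the common set are ignored; common
    fields missing from the record are left as None.
    """
    mapping = CROSSWALKS[scheme]
    common = {field: None for field in COMMON_FIELDS}
    for source_field, value in record.items():
        target = mapping.get(source_field)
        if target is not None:
            common[target] = value
    return common


record_a = {"Title": "Annual Report", "Issued": "2001", "ID": "A-17", "Pages": "40"}
record_b = {"name": "Annual Report", "created": "2001", "accession_no": "A-17"}

print(to_common(record_a, "scheme_a"))
print(to_common(record_a, "scheme_a") == to_common(record_b, "scheme_b"))
```

Once both records are expressed in the common fields, they can be searched together even though they were catalogued under different schemes.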
6.5 Preservation
Preservation is the aspect of archival management that preserves the content as well as the look and feel of the digital object. There is no common agreement on the time frame for the preservation of digital objects. However, the cycle for hardware/software migration is estimated at 2 to 10 years.
6.5.1 Hardware and Software Migration
New releases of databases, spreadsheets and word processors can be expected every 6 months to three years, with patches and minor updates released more often. Software vendors provide backward compatibility for some versions, but this typically no longer applies after 2 to 3 version changes. The problem becomes serious when firms dealing in computer hardware, software and peripherals close, are sold or merge.
The best practice for the foreseeable future will be migration to new hardware and software platforms; emulation will begin to be used if and when the hardware and software industries begin to endorse it.
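A migration policy based on the estimated 2 to 10 year cycle can be sketched as a simple classification. The thresholds below take the two ends of that estimate as "review" and "overdue" limits, which is an assumption made here for illustration; the function name and the file names are hypothetical.

```python
def migration_status(years_since_migration, review_after=2, overdue_after=10):
    """Classify an object by the time elapsed since it was last migrated.

    The default thresholds reuse the 2 to 10 year estimate as policy
    limits; a real archive would set its own.
    """
    if years_since_migration >= overdue_after:
        return "overdue"
    if years_since_migration >= review_after:
        return "review"
    return "ok"


# Hypothetical holdings: file name -> years since last migration.
holdings = {"annual-report.doc": 1, "survey-data.dbf": 4, "thesis.wpd": 12}

for name, age in holdings.items():
    print(name, migration_status(age))
```

Running such a check periodically gives the archive a list of objects to examine before their formats or platforms become unreadable.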
6.5.2 Preservation of the Look and Feel
Several approaches are being used to preserve the “look and feel” of material. For journal articles, the majority of the projects reviewed use image files (TIFF), PDF, or HTML. TIFF is the most prevalent among organizations that are involved in any way with the conversion of paper backfiles. Because OCR technology is only 95% accurate, the OCR text is used only for searching, and the TIFF image is the actual delivery format that the user sees. However, this does not allow the embedded references to be active hyperlinks. HTML is another popular format used for archiving documents.
For purely electronic documents, PDF is the most prevalent format. It provides a replica of the PostScript format of the document, but relies upon proprietary encoding technologies. A document placed on the Internet in PDF format consumes more bandwidth than the same document in HTML; in HTML, however, tables and pictures cannot be reproduced as an exact replica of the original document.
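The practical effect of 95% OCR accuracy can be estimated with a short calculation. The sketch below assumes that the figure refers to accuracy per character and that errors occur independently; the text does not state this, so the results are indicative only. The function names and the page size of 2,000 characters are hypothetical.

```python
def expected_errors(characters, accuracy=0.95):
    """Expected number of misrecognised characters in a text."""
    return characters * (1 - accuracy)


def word_fully_correct(word_length, accuracy=0.95):
    """Probability that every character of a word is recognised correctly,
    assuming independent per-character errors."""
    return accuracy ** word_length


print(round(expected_errors(2000)))
print(round(word_fully_correct(6), 3))
```

Under these assumptions a page of 2,000 characters contains about 100 errors, and a six-letter word is recognised without error only about 74% of the time, which helps explain why the OCR text is kept for searching while the image is what the user sees.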
6.6 Access
The e-documents contained in a digital library may be accessed through the search and retrieval software. Searches may be of multiple types:
1. Structured or metadata-driven searches, in which the software runs through the metadata elements and retrieves documents based on the content analysis done by the creator of the metadata record. This can be compared with a library catalogue, and the efficiency of retrieval depends on the cataloguer’s work.
2. Object searches, which are based on text tagging and indexing done either manually or by indexing software. This capability allows full-text, multimedia and object searches.
3. Global searches and resource-type searches, e.g. for e-journals and reports.
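The difference between the first two search types can be sketched in a few lines. The example below is an illustration only: the two documents, their metadata and the function names are hypothetical, and the word index is a deliberately simple stand-in for the indexing software mentioned above.

```python
from collections import defaultdict

# Hypothetical collection: each document has a metadata record and full text.
documents = {
    "doc1": {"metadata": {"title": "Digital preservation basics", "type": "report"},
             "text": "Migration and emulation are preservation strategies."},
    "doc2": {"metadata": {"title": "Metadata for libraries", "type": "e-journal"},
             "text": "Metadata supports discovery and preservation of digital objects."},
}


def metadata_search(field, value):
    """Structured search: match a value against one metadata element."""
    value = value.lower()
    return sorted(doc_id for doc_id, doc in documents.items()
                  if value in doc["metadata"].get(field, "").lower())


def build_index():
    """Build a word index (word -> set of document ids) over the full text."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for word in doc["text"].lower().split():
            index[word.strip(".,;:")].add(doc_id)
    return index


def full_text_search(index, word):
    """Full-text search: look a word up in the index."""
    return sorted(index.get(word.lower(), set()))


index = build_index()
print(metadata_search("type", "report"))
print(full_text_search(index, "preservation"))
print(full_text_search(index, "emulation"))
```

The metadata search finds only what the cataloguer recorded, whereas the full-text search finds any word that occurs in the documents; searching on the `type` element also shows how a resource-type search can be served from the same metadata.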