The Digitized AARC
A Development Proposal
Rex E. Bradford
January 14, 1999
Summary
This document outlines a plan to [add: organize and] digitize the JFK assassination-related holdings of the [add: Assassination Archives and Research Center] (AARC), and [add: thereby] make [del: thus] them accessible to the research community and the public [add: in fulfillment of the goal of the JFK Records Act]. The proposal contained herein is necessarily short and high-level. It presents a staged plan, with the important goal of having useful "intermediate" results generated throughout the project.
Introduction
The AARC is in possession of [add: all] assassination records released by the [add: Assassination Records Review Board] (ARRB) during its tenure [move: , roughly 4.5 million pages]. In addition, it contains other relevant holdings obtained prior to the ARRB's work. The research community, much less the public, is still largely unfamiliar with the contents of these files. The purpose of this project is to rectify that situation.
Besides the sheer volume of the records, several factors impede their use by the research community and the public:
- Physical access - Traveling to Washington to view documents is an obvious problem. NARA makes copies available via mail, but its search system is so limited that it is hard for researchers to find out which documents they wish to receive copies of without traveling to Washington to view them firsthand.
- Organization - RIF #'s are useful as the most basic categorization scheme, but there need to be other methods for locating documents. Again, NARA supplies a simple keyword search scheme, but one which leaves much to be desired.
- Analysis and Presentation - [del: For all but the most hard-core researchers, some amount of analysis, even if limited to] Summaries of documents and document sets, would be very helpful [add: to all users and essential for most. The network of project associates includes topical experts who can assist in interpretation with document annotations and the compilation of keys to acronyms and abbreviations.]
The digitization proposal described in this document addresses all of the preceding concerns. Furthermore, it proceeds in stages, in order to [add: provide clear opportunities for functional review and the incorporation of lessons learned and improved techniques and to ] avoid a lengthy project which only bears fruit when it is completed. What follows are proposed phases for such a project.
Phase 1: Cataloging and Organizing the Documents
Before any digitization project can begin in earnest, the set of documents must be organized in some fashion. [del: This proposal makes the assumption that][add: Most, if not all, of the] documents are already tagged with NARA's RIF identification codes. This allows documents to be [add: automatically] fed into scanners [del: with these id's as their identifying mark, with the larger organization tasks deferred.][add: and tracked by a unique identifier.]
[del: Still, some form of] [add: Additional] categorization is necessary to:
- Verify that the documents are physically appropriate for [add: automated] scanning equipment, that they have RIF numbers assigned, etc.
- Organize the [del: boxes into rough categories at a level slightly more detailed than just CIA, FBI, etc.][add: documents chronologically in high-level subject categories (such as “Dallas” and “autopsy”) to elaborate their present simple segregation by issuing agency (CIA, FBI, etc.).]
- Prepare a [del: master][add: prioritized] list of document sets to be digitized. [del: and importantly an ordering to the digitization process.] Those document sets which can be identified as most important to the research community [del: should][add: will] be schedule[d] for digitization [del: early][add: first].
This phase of the project necessarily entails a small team of people, whose members are familiar with the [del: case][add: subject matter] and can be brought up to speed on [add: our project’s and] the agencies' classification schemes. These people would have to spend some number of days or weeks at the AARC [del: to perform such this basic cataloging job].
This phase requires no special equipment other than a computer on which to enter the master document set list.
[note: from here on my comments will concern only concepts and content (not style or wording)]
Phase 2: Digitization and Basic Access
Digitization
Simply digitizing such a massive collection of documents is itself a technical challenge. Picture a single scanning station with an automatic feeder, capable of processing 10 pages per minute. Run 8 hours per day, it would take about 1000 days of operation to scan all of the ARRB releases. So scale is an issue here. The general options are: 1) a sophisticated and expensive scanning machine, run perhaps in shifts, 2) two or more PC-based scanning stations operated in parallel, and 3) a single PC-based scanning station, scheduled to operate over a period of a few years.
The last option, entailing a multi-year basic scanning project, is not necessarily a bad one. The phases of this project are designed such that useful work can be done on documents as they are fed into the "system." Subsequent phases of this project can overlap with Phase 2; in fact, it is expected that what is labeled as Phase 3 would operate simultaneously with Phase 2.
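The scanning-time estimate above can be checked, and the parallel-station option compared, with a few lines of arithmetic. This is only a sketch: the function name is illustrative, and it assumes a steady feed rate with no downtime for jams, maintenance, or document handling, so real schedules would run longer.

```python
import math

def scanning_days(total_pages, pages_per_minute=10, hours_per_day=8, stations=1):
    """Working days needed to scan total_pages at a steady feed rate.

    Assumes no downtime; every station runs at the same rate.
    """
    pages_per_day = pages_per_minute * 60 * hours_per_day * stations
    return math.ceil(total_pages / pages_per_day)

# 4.5 million pages is the figure given for the ARRB releases.
for stations in (1, 2, 4):
    days = scanning_days(4_500_000, stations=stations)
    print(f"{stations} station(s): {days} working days")
```

At one station this gives a little under the "about 1000 days" cited above; each doubling of stations halves the figure.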
The idea of Phase 2 is simply to "get the documents scanned in" as quickly as possible, using RIF numbers and no other identification as the basis for identification with the digital document database. The scanning process can then be run by people who are not skilled researchers. Documents fed into the system should be stored in a high-quality graphic format, for instance JPEG (with quality settings turned up) or perhaps some other format which offers good compression along with high quality.
[note: after initial processing, the text will be OCR’ed and only select zones preserved as graphics]
The computer system on which these documents are stored can be PC-based but needs massive storage capability. If, for instance, each document page required 100K bytes to store in compressed but high-quality format, then 5 million pages would require 500 gigabytes of disk space. This is well above the capabilities of workstation-class server PC's. For instance, the current crop of such machines sold by Compaq, Dell, and others has a maximum storage capacity of 30-50 gigabytes. Two options are available: 1) pay more for a larger-scale computer system, or 2) buy the highest-end PC-based server and add a CD-ROM "jukebox" to extend its storage capacity. The good news here is that hard disk capacities are rising rapidly and costs are falling just as fast. Buying a workstation with 50 gigabyte capacity and planning on expanding or replacing that in a year or two is not an unreasonable option. In any case, the selection of the basic storage computer system or network requires study and is beyond the scope of this proposal.
[note: Data storage is a basic feature of the project and should be firmed up as much as possible in this proposal. I think the raw JPEGs will have only very specialized, perhaps temporary, use and so can be on a set of CDs. Mostly everyone will be dealing with the “text & graphic insert” files which will be much smaller and should not pose any storage problem (for that matter the planned Stage 4 DVDs will be 17 GB each, I believe).]
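The storage estimate for the raw page images can be reproduced as follows. This is a rough sketch using decimal units (1 gigabyte = 1,000,000 kilobytes); the function names are illustrative, and the 100K-per-page figure is the working assumption stated above, not a measured value.

```python
import math

def storage_gigabytes(pages, kilobytes_per_page=100):
    """Total image storage in gigabytes (decimal units)."""
    return pages * kilobytes_per_page / 1_000_000

def servers_needed(total_gigabytes, gigabytes_per_server=50):
    """How many servers of a given disk capacity would hold the full set."""
    return math.ceil(total_gigabytes / gigabytes_per_server)

total = storage_gigabytes(5_000_000)
print(total, "GB for the full collection")
print(servers_needed(total), "servers at 50 GB each")
```

Changing `kilobytes_per_page` shows how sensitive the total is to the chosen image format and quality setting.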
To reiterate, the digitization aspect of Phase 2 is meant to be the fastest method of getting raw unprocessed documents into a computer system, leaving their processing (optical character recognition, indexing, categorization, analysis, etc.) as a distinct phase which could be carried out by different personnel.
Basic Access
Even raw digital documents accessed by RIF number are useful to some researchers, if made available somehow. Here is a proposal for a very simple system that could be made available:
- Researchers are required to use the NARA search engine or other means to collect a list of documents they would like, using RIF numbers.
- Researchers submit such a list to the AARC using either:
- a letter listing desired documents by RIF number
- an email with the same contents as above
- a simple order entry form available on an AARC web site. This site would not allow searching; it would simply be a means of entering a list of RIF numbers to be ordered
Such orders would be processed by someone with direct access to the computer storage system at the AARC. An automated process would allow them to submit the RIF number list to a computer program, which would then collect those documents and burn them onto a custom CD-ROM.
It would also be possible for knowledgeable people to identify a set of documents as an interesting collection, and make these available for ordering over the AARC web site. CD-ROM copies of popular data sets could be pre-burned instead of handled as orders come in.
Presuming that payments to the AARC for this service should at least cover costs, but need not recover the cost of developing the system, nominal charges might be: $5-$10 for media and processing, plus 1 or 2 cents per page of document. Note that a single CD-ROM could hold up to 5000 pages, assuming the storage estimates given previously.
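To make the pricing concrete, here is a minimal sketch of how an order might be costed. The function name and defaults are illustrative. It assumes the media-and-processing charge is applied per disc (the text leaves open whether it is per disc or per order), uses the upper ends of the quoted ranges, and works in whole cents to avoid rounding problems.

```python
import math

def order_cost(page_count, per_page_cents=2, per_disc_cents=1000, pages_per_disc=5000):
    """Return (discs, total_cents) for an order of page_count pages.

    Assumption: the $5-$10 media/processing charge applies to each disc.
    """
    discs = max(1, math.ceil(page_count / pages_per_disc))
    total_cents = discs * per_disc_cents + page_count * per_page_cents
    return discs, total_cents

for pages in (1200, 12000):
    discs, cents = order_cost(pages)
    print(f"{pages} pages: {discs} disc(s), ${cents / 100:.2f}")
```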
The personnel functions which would be needed, on site, during this phase include:
- Management of computer system, backups, etc.
- Maintenance of web site
- Overseeing the scanning of documents
- Running the scanner
- Processing of requests for CD-ROM collections
It's possible that one person aided by interns or volunteers to run the scanner(s) could do the job, perhaps aided by a part-time technical consultant.
Equipment needs have been noted above, and require further study. It is anticipated that at least $20,000, and quite possibly more, would be needed for the hardware to run this phase. There would also be ongoing service charges to run a web site.
Note that the CD-ROM publication services in this phase could begin within a few months of the initiation of scanning, particularly if careful attention is paid to choosing which document sets should be scanned first.
Phase 3: Organization, Indexing, and Structured Online Access
While the Phase 2 document access services are useful to some researchers, the volume of documents cries out for more organization and access methodologies. This could occur while Phase 2 is ongoing, and consists of the following:
- Creating a master set of keywords by which each document or document set can be indexed, as well as other identifying information such as originating agency, date of creation, date of first release, date of final release, etc. A short one- or two-sentence summary of each document or document set might be very useful as well.
- Tagging documents and document sets with the relevant keywords and other identifiers. This process may be automated (see discussion below).
- Applying optical [add: and intelligent] character recognition to (some) documents.
- Creating an internet-based document retrieval mechanism.
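As an illustration of how the keyword tagging and retrieval steps listed above might fit together, the sketch below stores one record per document and builds a simple keyword-to-RIF index. The record fields, the placeholder RIF values, and the function names are all hypothetical; a real system would presumably sit on a proper database.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class DocumentRecord:
    rif: str                 # RIF number (placeholder values used below)
    agency: str              # originating agency
    created: str = ""        # date of creation, if known
    summary: str = ""        # one- or two-sentence summary
    keywords: set = field(default_factory=set)

def build_keyword_index(records):
    """Map each lower-cased keyword to the set of RIF numbers tagged with it."""
    index = defaultdict(set)
    for rec in records:
        for keyword in rec.keywords:
            index[keyword.lower()].add(rec.rif)
    return index

def search(index, *keywords):
    """Return sorted RIF numbers tagged with every one of the given keywords."""
    hits = [index.get(keyword.lower(), set()) for keyword in keywords]
    return sorted(set.intersection(*hits)) if hits else []

records = [
    DocumentRecord("RIF-0001", "CIA", keywords={"Dallas"}),
    DocumentRecord("RIF-0002", "FBI", keywords={"Dallas", "autopsy"}),
]
index = build_keyword_index(records)
print(search(index, "dallas"))
print(search(index, "Dallas", "autopsy"))
```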
Organization and Indexing
The biggest decision here is whether to apply full optical character recognition to documents and make them available in text-only or hybrid (e.g., Adobe Acrobat) format. Rendering a document as text allows that text to be automatically searched, a great aid for researchers. However, optical character recognition (OCR) is an imperfect art, and requires proofreading and hand-correction in the best of circumstances. The poor quality of many documents makes the job even tougher in this case.
Furthermore, many documents have signatures, marginalia, stamped markings, and other graphics which need to be preserved. The Adobe Portable Document Format is a distinct possibility for these, as it converts text in the pages into searchable text, while leaving graphic portions intact.
One advantage of OCR is that it allows the tagging of each document (with keywords from the master index, dates, etc.) to be at least partially if not mostly automated. A custom program with some built-in knowledge of the layout of memos and other formal documents could easily identify generators and recipients of documents, dates, etc. It could then present this information in a dialog box for confirmation and editing by the person operating the program. This could reduce the time to tag documents dramatically. On the other hand, the OCR process itself is time-consuming, and its benefits may be so minor that the effort is not justified. In that case, the keyword tagging would be a manual process.
It should be noted that the process of applying OCR and/or tagging documents does not need to be done on-site at the AARC. Indeed, if a set of dedicated researchers across the country could be persuaded to sign up for the task, the following process could be initiated:
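A minimal sketch of such a tagging aid is shown below. It assumes memo-style header lines of the form "TO:", "FROM:", "DATE:", and "SUBJECT:" near the top of the OCR output. Real documents vary widely in layout and OCR quality, so the results are only suggestions for the operator to confirm or correct. The function name and the sample memo are invented for illustration.

```python
import re

# Matches header lines such as "FROM : Field Office" at the start of a line.
HEADER_PATTERN = re.compile(
    r"^\s*(TO|FROM|DATE|SUBJECT)\s*:\s*(.+?)\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def suggest_tags(ocr_text, max_header_lines=15):
    """Propose tag values from the first few lines of OCR text.

    Only the first occurrence of each field is kept.
    """
    head = "\n".join(ocr_text.splitlines()[:max_header_lines])
    tags = {}
    for field_name, value in HEADER_PATTERN.findall(head):
        tags.setdefault(field_name.lower(), value)
    return tags

# Invented sample text, not taken from any real document.
sample = """MEMORANDUM
TO: Director
FROM : Field Office
DATE: 1 January 1960
SUBJECT: Example only

Body text follows here.
"""
print(suggest_tags(sample))
```

Limiting the search to the top of the page reduces false matches from body text, at the cost of missing headers that appear lower down.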
- Document sets are assigned to off-site researchers, and each is sent a set of CD-ROMs containing that person's assigned digital documents. Their package would also contain a computer program into which they would enter the tags for each document or document set. If OCR is part of the job, the program to be used is also supplied.
- Researchers run these programs to create the keywords and other information for each document. The results are returned to the AARC via e-mail, floppy disk, ZipDisk, or other medium.
- The returned results are [add: reviewed and] fed into the growing database of supplementary document information.
Structured Online Access
The second aspect of this phase is to make its work product available. This means creating a web-based document retrieval system. This would be a system similar to the NAIL system set up at NARA, wherein users may type in search parameters and receive the results. It would be more useful than the NARA system, however, because:
- The document keyword and summary information is more useful. The "hits" of a given search can be identified more readily as being relevant.
- Low-quality (but legible) document pages may be viewed directly over the internet, in order to verify that the documents are of interest. Such low-quality versions of each document image can be trivially (and automatically) generated during Phase 2 or subsequently. [note: really? For me, this idea seems to bring up some potential problems.]
- Documents identified as desired through this system may be "check-marked" and put into a document list. This list can then be submitted to the same process already developed in Phase 2, which is used to ship high-quality document page images on CD-ROM to the recipient.
High-quality documents could also be delivered directly across the internet, limited only by the bandwidth capacity of the web server and the individual user's modem.
Besides searching for documents via keywords, agency names, dates, and other such information, it would be useful to "organize" documents into some classification scheme. Actually, since the documents are inherently identified by the system via RIF number, there is no reason why multiple such classification schemes couldn't co-exist. One might use originating agency as the highest-level breakdown, whereas another might use the investigative phase (Warren Commission, Garrison investigation, Church Committee, HSCA, etc.) as the primary organizing mechanism. One or more such "tables of contents" presented as an expandable-collapsible tree (similar to the Windows Explorer interface) could be presented directly on the AARC web site, with the low-quality document images [note: or rather, the document set summaries] as the "leaves" on such a tree.
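The idea of several co-existing classification schemes can be sketched as follows: the same document records are grouped into different trees simply by changing the order of the fields used. The field names, the placeholder RIF values, and the function name are hypothetical.

```python
def build_tree(records, scheme):
    """Group RIF numbers into a nested dict following the field order in scheme.

    The last field in scheme maps to a list of RIF numbers (the "leaves").
    """
    tree = {}
    for rec in records:
        node = tree
        for field_name in scheme[:-1]:
            node = node.setdefault(rec[field_name], {})
        node.setdefault(rec[scheme[-1]], []).append(rec["rif"])
    return tree

records = [
    {"rif": "RIF-0001", "agency": "CIA", "investigation": "HSCA"},
    {"rif": "RIF-0002", "agency": "FBI", "investigation": "Warren Commission"},
    {"rif": "RIF-0003", "agency": "CIA", "investigation": "Warren Commission"},
]

by_agency = build_tree(records, ("agency", "investigation"))
by_investigation = build_tree(records, ("investigation", "agency"))
print(by_agency)
print(by_investigation)
```

Because each tree holds only RIF numbers, adding a new scheme never requires copying or moving the documents themselves.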
Some equipment is needed for this phase in addition to that needed for Phase 2, even assuming that the work is done by off-site researchers who already own a PC. The web site would need to be enhanced to support access to low-quality documents, which would still take a large amount of storage for the full set (perhaps as much as 100 gigabytes when completed). [note: indeed, there’s a rub; I’d scrap the low-quality idea] Thus AARC would either need to host the web site directly (and probably purchase a separate computer and set of hard disks for it), or pay additional monthly fees for a large-storage web site.
Phase 4: Analysis and Presentation
Beyond the simple though large-scale digitization and basic classification of documents, it would be wonderful if the AARC could also serve as a repository for analysis of document sets or cross-document ideas. In paper format, journal articles or books are limited to citing testimony, exhibits, or documents in footnotes. On-line at the AARC, though, each such reference could be a hyperlink to the documents being discussed. Such "articles" could range from focused discussions of a small set of documents to lengthy material such as a high-level overview of the 15 volumes of Warren Commission testimony, complete with hyperlinks to any particular appearance under discussion.
Here are some examples of the forms of analyses which could be stored on-line:
