Computing Strategies for Beginning a Data Service[1]

I have occasionally been asked about the type of computing equipment needed to begin a data service within the library. While the question is usually asked with an expectation of being told to buy X or Y brand of computer, I feel that the best advice begins by focusing on the level of data service envisioned within the library. Once the service model is clearly understood, a computing strategy can be devised to complement this plan.

For example, if the data service plan consists of only ordering and passing data along to clients, the computing requirements will be less than if the plan further included a data extraction or subsetting service. If archiving research data for the campus is part of the service plan, the computing requirements will be different again. In the end, your computing strategy should identify -- within the context of your university's computing environment -- the hardware, software, and network connectivity needed to support a specific level of data service.

In all likelihood, your computing strategy will need to be built upon the computing resources available through your library and the central computing services offered by your university. Each of us must live with the computing environment we have been dealt. Very few of us have enough influence to steer the direction of central computing at our institution, although we may be a bit more successful in our immediate workplace. Thus, we all try to match our computing needs with the resources provided locally.

The pieces of a typical computing strategy for data services include processing power, storage space, statistical software, utility software, and network connectivity. Components of such a strategy are described below.

- desktop computing power to support a variety of applications. The type of software useful in providing data services includes a word processor for data documentation, desktop versions of statistical software (SPSS and/or SAS), network tools (telnet, ftp, web browsers, etc.), and general file utilities. By today's computing standards, this desktop machine should be at least a Pentium 120 or a Power Macintosh 8500, although some of us have operated data services with lesser machines.

This workstation may not be the machine where all of the large-scale processing is done but rather may be a service point from which data are downloaded from a larger processor, especially if a central Unix system is used to create extract files for clients.

- large, readily available quantities of disk space on a fast system for packaging files (such as using PKZip to store or restore multiple files from a single file) and for compressing and uncompressing files. The most efficient method of moving files on the Internet is in a compressed and packaged format, which reduces the size and number of objects flowing over the network. Thus, ample disk space to manipulate these network bundles is required.

For example, we have this type of environment set up on three systems -- Unix, a Pentium, and a PowerMac -- since we end up receiving or distributing files in each of these environments.
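The packaging and compressing steps described above can be sketched with the standard Unix tar and gzip utilities, which play the same role as PKZip on a PC. This is only an illustration; the file names are hypothetical stand-ins for a real data extract.

```shell
# A minimal sketch of bundling a data extract for network transfer.
# File names are hypothetical stand-ins for a real extract.
set -e
mkdir -p bundle_demo && cd bundle_demo

# Two sample files standing in for the pieces of a data extract.
printf 'record1\nrecord2\n' > part1.dat
printf 'record3\n' > part2.dat

# Package, then compress: one small object crosses the network
# instead of several larger ones.
tar cf extract.tar part1.dat part2.dat
gzip -f extract.tar              # leaves extract.tar.gz

# On the receiving end, reverse the steps.
gunzip -f extract.tar.gz
tar xf extract.tar
```

Note that both ends of the transfer need scratch disk space for the bundle as well as the unpacked files, which is exactly why ample working space matters.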

- access to at least one fast processing machine with statistical software (preferably SPSS or SAS). This machine may be used to subset data from large files, such as the Individual Public Use Microdata File from the 1991 Census of Canada or the CRSP Daily stock exchange file. It may also be used to perform file verification tasks, such as ensuring that a file contains the proper number of records and that the record length matches the documentation. For example, on Unix I use a series of utilities -- head, tail, maxline, cio, view, dd, and od -- to investigate the properties of files.
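A verification pass of this kind can be sketched with standard Unix tools such as wc, awk, and od (maxline and cio are local utilities, so they are not used here). The file name and the documented record count and width are hypothetical.

```shell
# A minimal sketch of checking a fixed-length flat file against its
# documentation: here, 3 records of 9 characters each (hypothetical values).
set -e
printf '%-9s\n%-9s\n%-9s\n' AAA BBB CCC > sample.dat

# Does the record count match the documentation?
records=$(wc -l < sample.dat | tr -d ' ')
echo "records: $records"

# Is every record the documented width? A single value here means
# all records share one length.
widths=$(awk '{ print length($0) }' sample.dat | sort -u)
echo "record length(s): $widths"

# od -c dumps the raw bytes -- handy for spotting stray characters
# such as carriage returns or embedded tabs.
od -c sample.dat | head -2
```

Checks like these take only a few seconds and catch truncated transfers or mismatched documentation before a file reaches a client.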

- mass storage that supports multiple-user access to files and preferably multiple-system access. In libraries, this mode of storage and access initially consisted of CD-ROM towers on a local area network. More recently, some libraries have turned to Unix systems with large disk farms to deliver database services, such as those available through ERL or OVID, on the campus network.

A distributed file system on a campus network currently offers the best model for multiple-user, multiple-system mass storage because these systems provide high-speed access (unlike most Unix tape devices and PC CD-ROMs, which are very slow by comparison) and because file management on these systems, such as naming, deleting, linking, and cataloguing files, is straightforward.

- other software that may or may not exist on your desktop but that will assist in providing a data service. Included in this category are a catalogue system for your data file collection, which may be the library's OPAC or a stand-alone system; communication tools such as email and a Web service; and presentation software to support client training projects.

- network connectivity that permits high-speed transfer of files across the campus network as well as the wider Internet. One wants to steer clear of a connection to the network where bottlenecks will likely occur, such as large local area networks sharing a common router. Another situation to avoid is sharing a connection with a large cluster of OPAC stations.

Securing a stable network connection can be a challenge and might require using services at another location on the campus network with better connectivity. For example, the central Unix service may be better situated on the network. You may then decide to use your local workstation as a terminal connected to the Unix service for conducting large file transfers. An X terminal is a reasonable investment if access is primarily through a Unix system supported elsewhere on the campus network.

One of the challenges of implementing a computing strategy arises from the variety of products and services available in each of the areas discussed above. Here are some practical guidelines to follow when acquiring new equipment or software for a data service.

Always investigate the computing support at your institution. Take advantage of the site licenses or educational discounts your campus has for software and hardware. Site-licensed software may offer advantages beyond saving some money: central support is often provided with it, including help with installing and debugging programs. There still may be instances, however, where you discover that the options provided by your campus's site licenses are too narrow and that you need to purchase other software or hardware. For example, your institution may only have a site license for Minitab when you feel you need either SPSS or SAS.

When buying hardware consider purchasing equipment that is compatible with your university's computing environment. This will often allow you to share peripheral devices, such as printers, without a great deal of hassle. Furthermore, you are more likely to find people with experience who can help with installing systems and diagnosing hardware problems.

A corollary to this guideline is to acquire equipment that is compatible with that of the majority of your clients. If the largest percentage of your clients use PCs, purchasing a PC would be wiser than purchasing a Macintosh, even though you may prefer the Macintosh operating system. By working with equipment similar to your clients', you will be better able to address their inquiries about using data on their personal workstations.

Never underestimate the amount of time and skill it takes to administer a computing system for data services. If you do not have the staff or the skills to provide a great deal of system administration, you will want to keep the computing environment in your data service as simple as possible. Currently, maintaining a PC or Macintosh system is easier than supporting a Unix machine, which requires a great deal more systems administration. Unless you are willing to invest in the skills to support a Unix system, you are better off with a PC or a Macintosh.

Another guideline to follow when purchasing hardware is to buy as much memory and disk storage as you can afford. Both of these are a good investment. A Pentium with 32 megabytes of RAM and a 2 gigabyte hard drive currently is a reasonable configuration to consider for a data service.

One last bit of advice. Ask your data service colleagues at other DLI institutions about their experiences. While the variety of solutions may seem great, many of the experiences are common. Just knowing how others have confronted their computing problems can often help.

[1] Originally published in DLI Update, Vol 1(2), M