Accessing Scientific and Communication Information from Researchers:

Methodological Underpinnings[1]

R. Sooryamoorthy

Associate Professor

Dept. of Sociology, University of KwaZulu-Natal,

Howard College Campus, Durban, South Africa

Email:

Wesley Shrum

Professor, Dept. of Sociology

Louisiana State University, Louisiana, USA

Email:

Abstract

Researchers and scientists produce knowledge and information in their own professional fields and disciplines. The creation of new knowledge and information relies largely on the existing body of knowledge and on how it is accessed from various sources. Internet and communication technologies have made access to information easier and faster in the ongoing process of knowledge production. The way researchers produce knowledge is as important as the production itself. How do we access such scientific and communication information from researchers and scientists who are engaged in the production of knowledge? This paper presents the methodological details of a study we have been conducting since 1994 among researchers, academics and scientists located in research institutes and universities in six developing areas, namely Kerala (India), Ghana, Kenya, South Africa, the Philippines and Chile. The study looks at the impact of the Internet on research communication within the developing world and with the international scientific community through an analysis of the conditions associated with interpersonal networking and information search behaviour. We gather both quantitative and qualitative data, and we have devised and used methods to gather information on the use of email and the Web. The respondents are approached for data on their research activities, collaboration, professional networks and contacts, productivity, attitudes towards the research system, computer use, email use and Web use. The paper highlights the procedures, limitations and difficulties in the collection of scientific and communication data from researchers.

Introduction

The Internet is a unique technology in many respects. Unlike technologies such as the telephone and television, the costs and capabilities associated with the Internet have the potential not only to transform the scientific and educational communities in developing countries by creating linkages with partners in the developed world, but also to allow access to their informational resources. This paper is about an ongoing study of scientific communication that is under way in six developing areas.

The study was originally started in 1994 in three areas, namely Ghana, Kenya and Kerala (India), and since 2003 it has been extended to include South Africa, the Philippines and Chile. The study examines the patterns of research communication among researchers, academics and scientists, and the effects of new information technology on science in developing areas. Briefly, the attempt is to study the impact of the Internet on research communication within the developing world and with the international scientific community through an analysis of the conditions associated with interpersonal networking and information search behaviour. The overall objective of this research is to determine the impact of the Internet on scientific and technological practice in the chosen developing areas by following the behaviour of scientists in institutions in the years following its introduction.

This study is aimed at: (1) enhancing basic connectivity by providing local area networks to research organizations with point connections; (2) assessing professional relationships, information search behaviour, and changes in research practice over a five-year period; and (3) training faculty and graduate students in the social analysis of science and technology.

Background

In 1994, when the study began, face-to-face interviews were conducted with 294 scientists in approximately 100 organizations (Shrum, 1996). This 1994 study was a systematic sample of researchers in universities and research institutes that provided: (1) information on the characteristics of individual scientific careers and research networks; (2) data on the extent and frequency of professional ties; and (3) indicators of communication patterns for the early 1990s, prior to the period in which the World Wide Web developed as a significant global phenomenon. The findings of this 1994 study, which paved the way for the present one, can be summarized as follows (Shrum, 1996):

  • Scientists in less developed areas are not, as some have claimed, "isolated." Most professional contacts of researchers in developing areas are simply local. That is, ties tend to be internal to the national or regional research systems.
  • Africans report larger professional networks than Indian researchers.
  • Academics have smaller professional networks than researchers in government institutes, particularly in terms of local ties. However, more professional contacts are reported to universities, both internal and external, than any other sector.
  • Higher education is associated with increased ties to developed countries. But education abroad is associated with stronger links to developed countries only for state research institutes.
  • Academics who have been educated in developed countries do not have more ties abroad than those who have not.
  • For researchers in national institutes, immediate access to a personal computer is associated with larger developed country networks, but international visibility is associated with smaller local networks.
  • International publication and education abroad are related to international linkages but the payoff is smaller for Ghanaians than Indians.
  • Local networks are larger for older scientists.
  • There is an inverse relationship between local ties and ties to developed countries.

The broad hypothesis addressed in this project is that the dissemination of the Internet, by changing the constraints on research communication, will promote changes in the patterns identified above. First, those who have access to personal computers (and hence larger international networks to begin with) are more likely to make rapid use of the Internet. Second, if the inverse relationship between local and developed country ties is due to resource constraints of time and technology, then international linkages may increase without a reduction of local ties. In general, increased integration into global scientific networks may be expected. It is worth noting that this is not the only possible outcome. If the inverse relationship between domestic and international linkages is due not to resource constraints but to the increased opportunities represented by ties with international science, which draw researchers away from local issues, then Internet technology may exacerbate the problem.

One of the most surprising features of the 1994 study was the extremely low priority given by respondents in all three areas to the development of electronic communication networks as a means of improving the research system (on both rating and ranking items). In 1994, email was the only use of such networks recognized by these scientists. The World Wide Web was just beginning a phase of spectacular diffusion. Three years after the original study, attitudes and awareness had begun to shift.

Methodology

The strategy is to follow 300 scientists in each location over a five-year period. During this period the research teams in their respective locations will collect three types of data at specific points in the study. They are:

  1. Survey data (years 1, 3, 5) and qualitative interviews (years 2, 4) on scientists' use of information technology. The 1994 communication network items will be administered three times during the course of the study in order to determine shifts in the frequency, means, duration, and location (local or international) of professional ties. These surveys in years one, three, and five will be relatively comprehensive with respect to the active scientists at the research sites.
  2. Internet connection time, incoming/outgoing email addresses, and Web site visits will be provided by several of the participating organizations from server logs.
  3. Email correspondence (incoming and outgoing) will be stored and forwarded by a subset of participating scientists who will be compensated for their cooperation.

The 1994 questionnaire was originally designed to measure the scientific networks associated with a comprehensive sample of research organizations in each country, as well as to assess researchers' priorities for the development of the research system. This instrument contains a variety of items tapping communication patterns. The most comprehensive set of items is an eight-page list of organizations that includes international agencies, foundations, private firms, parastatals, universities, state research institutes, and NGOs.

The basic interviewing technique involves first asking the respondent which organizations on the list s/he has had any contact with during a given period of time. Following that, the respondent provides more detail on the content of the ties: whether they involve friendship, collaboration, information exchange, and so forth. A second set of items asks respondents to list their primary professional contacts (at the individual rather than the organizational level), including their organizational locations and the primary means of communication. The network portions of the survey will remain the same in the present study, apart from changes to reflect the composition of the organizational population and an improvement in the format of the professional contacts section. This constancy is necessary in order to utilize the 1994 study as a baseline. Apart from the network sections, the questionnaire has been redesigned to emphasize access to and use of information technology (specifically computer, email, and Web).

The survey instrument included both structured and unstructured sections on the major dimensions of professional research activities, international and national organizational contacts, frequency of discussions with various groups, supervisory roles and local contacts, professional memberships and activities, self-reported productivity, attitudes on agricultural and environmental issues, and the needs of the research system.

The survey contained a series of ten "written output" questions asking respondents to report their own productivity since 1990, but the most salient items for what follows pertain to articles in foreign journals and articles in national journals. Respondents rarely needed to consult documents prior to responding, but answered straightaway, or after a little thought. Publications are no less relevant to researchers in developing countries than elsewhere, and respondents have no apparent difficulty in remembering them, perhaps even less so for publications appearing in international journals.

As part of the project, six organizational sites received a local area network by the end of the first year or the beginning of the second year. For this subset of the participating organizations, the project provides desktop connections to the LAN but not the institution's basic Internet connection. Although this is indeed a kind of intervention, Internet usage patterns and the longitudinal network data can be analyzed using it as an independent variable.

We considered several options for gathering Web and email data. The first option involved the establishment of a proxy server and collector at the U.S. National Research Council (NRC), such that when participants open their browser or send email, all traffic to and from their machine is routed through the NRC server, allowing collection and storage of data. This option was simulated in Accra (Ghana) by setting the browsers on several machines to a proxy at the NRC, with no detectable effect on Web use. However, this option (software and storage at the NRC) is extremely expensive. More important, although computer professionals at the project sites were willing to participate in an assessment of email and Web use, there were objections to rerouting traffic in this manner.

The second option involved the automated collection of all Web and email traffic from servers at the project sites themselves. This concept is implemented through a conventional proxy system. It is both feasible and popular because it involves the participation of local IT professionals who find it an interesting technical activity that may help them to understand connectivity and communication issues. However, there are two major problems with this strategy as a way of studying Internet use by scientists. First, many scientists use the Internet from home or public sites. Second, Web-based email systems (such as Hotmail and Yahoo) are extremely popular in both Africa and India. Encryption makes these transmissions very difficult to capture, even with the permission of the respondents.

The preferred solution, therefore, involves cooperation by two groups: IT professionals and the participating scientists themselves. Computer services will provide three types of data for consenting scientists: Internet connection time (frequency and duration of logins), incoming/outgoing email addresses and subject lines (but without content), and Web site visits. Basic proxy server software associates all traffic with an IP address, which is sufficient for users with exclusive use of a desktop computer. For participants who share a computer, there is a directory feature that enables tracking of user activity by ID/password. Unfortunately, this data cannot be collected in every location. Discussions with heads of department at JKUAT and Katumani (Kenya), Vellayani, CTCRI, and CESS (Kerala), and CSIR and Cape Coast (Ghana) indicate that it can be done in these locations owing to the presence of able technical staff and available software to track usage on a central server (e.g., Webtrends). The importance of this data is that it allows an independent and more reliable estimate of Internet use than self-reports. The list of Web sites (ranked by frequency of visits) and the email addresses of correspondents provide a starting point for the semi-structured interviews to be conducted in the second and fourth years of the project.
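The ranking of Web sites by frequency of visits can be illustrated with a minimal Python sketch. It assumes that the requested URLs have already been extracted from the server log for one consenting scientist; the function name rank_sites and the example URLs are hypothetical and are not part of the project's software.

```python
from collections import Counter
from urllib.parse import urlsplit


def rank_sites(urls):
    """Return (hostname, visits) pairs, most frequently visited first."""
    hosts = (urlsplit(u).hostname for u in urls)
    # Skip entries for which no hostname could be recovered.
    counts = Counter(h for h in hosts if h)
    return counts.most_common()


# Hypothetical requests, for illustration only.
visited = [
    "http://www.example.org/journals/index.html",
    "http://www.example.org/journals/vol2.html",
    "http://mail.example.com/inbox",
    "http://www.example.org/",
    "http://mail.example.com/compose",
    "http://data.example.net/rainfall.csv",
]
ranking = rank_sites(visited)
for host, visits in ranking:
    print(host, visits)
```

Counting by hostname rather than by full URL is a design choice of this sketch: it groups all pages of one site together, which seems closer to what an interviewer would want to ask about.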

In the alternating years, focused interviews were conducted with participating scientists, who were asked to provide us with a sample of their email correspondence from home and work. These interviews are critical for providing detailed information on Internet use and experiences. Even server data more detailed than simple online time and Web visits will not capture home usage, nor will it provide direct information on the location of professional ties. Learning whether the Internet facilitates science across international borders without decreasing local ties is one of the key objectives of the study and requires information about the location of email correspondents. The solution is to remunerate participants with a small sum for saving incoming/outgoing email messages over a specified time period. This time period has not been determined for all sites. We are not interested in the detailed content of these messages (and will not keep copies after coding). These two basic forms of data (Web site visits from the server and email dumps from the scientists) will be the focus for the qualitative interviews regarding the social and professional uses of the medium for scientists in developing areas. Of course, such interviews can also be conducted without such information, based only on self-reports, but they will be more valuable and yield more reliable information on Internet use when it is available in advance.

Accessing and Analysing Email and Web Data

The raw data for this analysis is the access log generated by the proxy server (Squid) running on Linux (Debian GNU/Linux). Each line of the access log represents a single request by a single user for information from the Internet (specifically, a user on a particular host computer within the local area network).

Each line contains the following information:

  1. A number indicating a time code: this is the number of seconds elapsed since a fixed starting point some decades ago (the Unix epoch, 1 January 1970), and it must be converted to real time if it is to be interpreted directly as a calendar time point.
  2. IP address of the computer making the request for information.
  3. The status of the request: GET/DENY indicates whether the user is allowed to make this request (properly identified to the system, not a forbidden site, etc.); HIT/MISS indicates whether the information is already in the server cache (for immediate retrieval) or is not in the cache and must be retrieved from another server on the Internet.
  4. Volume of data transferred (in kilobytes).
  5. Address (in the form of a URL) of the server from which the data is requested.
  6. User Name (when each user is identified with a password).
  7. IP address of the remote server that provided the data. (Same location as address but this is in numerical format.) The same server can be associated with many different addresses, because many Web pages are on the same server.
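As an illustration of how one such line can be read, the following Python sketch splits a log line into the items listed above and converts the time code to a calendar time. It assumes the whitespace-separated layout of Squid's default native log format (time, elapsed time, client, status, size, method, URL, user, peer, content type); the log format used at a given site may be configured differently. The sample line, the LogEntry type and the parse_line function are hypothetical.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class LogEntry(NamedTuple):
    when: datetime    # item 1, converted to calendar time (UTC)
    client_ip: str    # item 2
    status: str       # item 3
    volume: int       # item 4, in the unit recorded by the server
    url: str          # item 5
    user: str         # item 6
    server_ip: str    # item 7


def parse_line(line: str) -> LogEntry:
    """Parse one access-log line (assumed native Squid field order)."""
    parts = line.split()
    time_code, _elapsed, client_ip, status, volume, _method, url, user, peer = parts[:9]
    # The time code counts seconds since 1 January 1970 (UTC).
    when = datetime.fromtimestamp(float(time_code), tz=timezone.utc)
    # The peer field is assumed to look like "DIRECT/<ip>".
    server_ip = peer.split("/", 1)[1] if "/" in peer else ""
    return LogEntry(when, client_ip, status, int(volume), url, user, server_ip)


# Invented sample line, for illustration only.
sample = ("1066036250.603 120 192.168.0.5 TCP_MISS/200 1234 GET "
          "http://www.example.org/index.html user01 DIRECT/203.0.113.7 text/html")
entry = parse_line(sample)
print(entry.when.isoformat(), entry.client_ip, entry.user, entry.url)
```

The status token in the sample follows Squid's own naming, which may not match the GET/DENY and HIT/MISS labels used in the description above.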

Only the user and host IP (items 6 and 2), the volume of information (item 4), and the URL (item 5) are relevant for our study. The rest of the line may be removed (status of the request; IP address of the remote server; any JavaScript requests indicating code on the same Web page, such as a pop-up window).
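This reduction step might be sketched as follows in Python. The sketch again assumes the native Squid field order, and it assumes that script requests can be recognised by a URL path ending in ".js"; both are assumptions for illustration, and the function name relevant_fields and the sample lines are hypothetical.

```python
from urllib.parse import urlsplit


def relevant_fields(line):
    """Return (user, host_ip, volume, url), or None if the line is dropped."""
    parts = line.split()
    if len(parts) < 8:
        return None  # malformed or truncated line
    host_ip, volume, url, user = parts[2], parts[4], parts[6], parts[7]
    # Assumption: a path ending in ".js" marks a script request on a page.
    if urlsplit(url).path.lower().endswith(".js"):
        return None
    return user, host_ip, int(volume), url


# Invented sample lines, for illustration only.
lines = [
    "1066036250.603 120 192.168.0.5 TCP_MISS/200 1234 GET "
    "http://www.example.org/index.html user01 DIRECT/203.0.113.7 text/html",
    "1066036251.101 15 192.168.0.5 TCP_HIT/200 310 GET "
    "http://www.example.org/popup.js user01 NONE/- application/x-javascript",
]
kept = [row for row in map(relevant_fields, lines) if row is not None]
print(kept)
```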

The processed log file gives:

  • the duration of the report period for the log file, and
  • the number of unique host/user combinations (these are the client machines for the LAN server).
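A minimal Python sketch of these two summary figures is given below. It assumes that each log line has already been reduced to a (time code, host IP, user) record, with the time code in seconds since 1 January 1970; the function name summarise and the records are hypothetical.

```python
from datetime import datetime, timezone


def summarise(records):
    """Return (report period, number of unique host/user combinations).

    records: non-empty iterable of (time_code, host_ip, user) tuples.
    """
    records = list(records)
    times = [t for t, _host, _user in records]
    start = datetime.fromtimestamp(min(times), tz=timezone.utc)
    end = datetime.fromtimestamp(max(times), tz=timezone.utc)
    # A set keeps each host/user pair once, however many requests it made.
    combos = {(host, user) for _t, host, user in records}
    return end - start, len(combos)


# Invented records, for illustration only.
records = [
    (1066036250.0, "192.168.0.5", "user01"),
    (1066039850.0, "192.168.0.5", "user01"),
    (1066100000.0, "192.168.0.9", "user01"),
    (1066122650.0, "192.168.0.9", "user02"),
]
period, n_combos = summarise(records)
print(period, n_combos)
```

Here the report period is taken as the time between the first and last request in the file, which is one possible reading of "duration of the report period".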

If a user (known by their password) uses a second computer, this is counted as a separate host/user combination. For most purposes, however, one person uses one machine. The user ID is the best indicator for our purposes and can be summarized across all machines used by the user in the time period.
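Summarizing by user ID across machines could look like the following Python sketch. It assumes records of the form (user, host IP, volume) and reports, for each user, the machines used, the number of requests and the total volume; the function name totals_by_user and the records are hypothetical.

```python
from collections import defaultdict


def totals_by_user(records):
    """Aggregate (user, host_ip, volume) records by user ID."""
    summary = defaultdict(lambda: {"hosts": set(), "requests": 0, "volume": 0})
    for user, host_ip, volume in records:
        row = summary[user]
        row["hosts"].add(host_ip)   # every machine the user logged in from
        row["requests"] += 1
        row["volume"] += volume
    return dict(summary)


# Invented records, for illustration only.
records = [
    ("user01", "192.168.0.5", 120),
    ("user01", "192.168.0.9", 30),
    ("user02", "192.168.0.9", 50),
]
totals = totals_by_user(records)
for user, row in sorted(totals.items()):
    print(user, sorted(row["hosts"]), row["requests"], row["volume"])
```

Keeping the set of hosts alongside the totals makes it possible to check how often the assumption of one person per machine actually holds.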