Sabina Leonelli, February 2008. Draft accepted for publication in Philosophy of Science (2009). To be revised, please do not distribute. A longer version is available at

On the Locality of Data and Claims about Phenomena

Abstract

Bogen and Woodward (1988) characterise data as embedded in the context in which they are produced (‘local’) and claims about phenomena as retaining their significance beyond that context (‘non-local’). This view does not fit sciences such as biology, which successfully disseminate data via packaging processes that include appropriate labels, vehicles and human interventions. These processes enhance the evidential scope of data and ensure that claims about phenomena are understood in the same way across research communities. I conclude that the degree of locality characterising data and claims about phenomena varies depending on the packaging used to make them travel.

Word count: 5,008

Introduction

Can data[1] be circulated independently of the claims for which they are taken as evidence, so as to be used in research contexts other than the one in which they have been produced? And in which ways and with which consequences does this happen, if at all? This paper tackles these questions through a study of how biological data travel across research contexts. I argue that data need to be appropriately packaged to be circulated and used as evidence for new claims; and that studying the process of packaging helps to understand the relation between data, claims about phenomena and the local contexts in which they are produced. My analysis leads me to challenge some of the conclusions drawn by Bogen and Woodward (B&W) on the evidential role of data and claims about phenomena. B&W characterise data as unavoidably embedded in one experimental context, a condition which they contrast with the mobility enjoyed by claims about phenomena, whose features and interpretation are alleged not to depend on the setting in which they are formulated. This view does not account for cases where data travel beyond their original experimental context and are adopted as evidence for new claims, nor for the extent to which the travelling of claims about phenomena depends on shared understanding across epistemic cultures.

  1. Making Facts About Organisms Travel

Biology has yielded immense amounts of data in the last three decades. This is especially due to genome sequencing projects, which generated billions of datasets about the DNA sequence of various organisms. Researchers in all areas of biology are busy exploring the functional significance of those structural data. This leads to the accumulation of even more data of different types, including data about gene expression, morphological effects correlated with ‘knocking out’ specific genes, and so forth (Krohs and Callebaut 2007). These results are obtained through experimentation on a few species whose features are particularly tractable through the available laboratory techniques. These are called ‘model organisms’, because it is assumed that results obtained on them will be applicable to other species with broadly similar features (Ankeny 2007).

Researchers are aware that assuming model organisms to be representative of other species is problematic, as they cannot know the extent to which species differ from each other unless they perform accurate comparative studies. Indeed, reliance on cross-species inference is a pragmatic choice. Focusing research efforts on a few species enables researchers to integrate data about different aspects of their biology, thus obtaining a better understanding of organisms as complex wholes. Despite the dubious representational value of model organisms, the majority of biologists agree that cooperation towards the study of several aspects of one organism is a good strategy to advance knowledge. The circulation of data across research contexts is therefore considered a priority in model organism research: the aimed-for cooperation can only spring from an efficient sharing of results.

The quest for efficient means to share data has recently become a lively research area in its own right, usually referred to as bioinformatics. One of its main objectives is to exploit new technologies to construct digital databases that are freely and easily available for consultation (Rhee et al. 2006). Aside from the technical problems of building such resources, bioinformaticians have to confront two main issues: (1) the fragmentation of model organism biology into epistemic communities with their own expertise, traditions, favourite methods, instruments and research goals (Knorr Cetina 1999), which makes it necessary to find a vocabulary and a format that make data retrievable by anyone according to their own interests and background; and (2) the characteristics of data coming from disparate sources and produced for a variety of purposes, which make it difficult to assess the evidential scope of the data that become available (‘what are they evidence for?’) as well as their reliability (‘were they produced by competent scientists through adequate experimental means?’). I examine bioinformaticians’ responses to these demands by looking at three phases of data travel: (i) the disclosure of data by researchers who produce them; (ii) the circulation of data through databases, as arranged by database curators; and (iii) the retrieval of data by researchers wishing to use them for their own purposes.

  1.1 Disclosure

The majority of experimenters disclose their results through publication in a refereed journal. Publications include a description of the methods and instruments used, a sample of the data thus obtained and an argument for how those data support the claim that researchers are trying to make. Data disclosed in this way are selected on the basis of their value as evidence for the researchers’ claim; their ‘aesthetic’ qualities (e.g. clarity, legibility, exactness); and their adherence to the standards set in the relevant field. Because of these strict selection criteria, a large amount of data produced through experiment is discarded without being circulated to the wider community. Also, published data are only available to researchers who are interested in the claim made in the paper. There is little chance that researchers working in another area will see those data and thus be in a position to evaluate their relevance to their own projects. Hence, those data will never be employed as evidence for claims other than the one that they were produced to substantiate.

If disclosure is left solely to publications, data will travel as the empirical basis of specific claims about phenomena and will rarely be examined independently of those claims. Biologists and their sponsors are unhappy with this situation, since maximising the use made of each dataset means maximising the return on the investments made in experimental research and avoiding the danger that scientists waste time and resources in repeating each other’s experiments simply because they do not know that the sought-for datasets already exist.

  1.2 Circulation

Community-based databases, devoted to collecting data about model organisms, constitute an alternative to the inefficient disclosure provided by publications. The curators responsible for the construction and maintenance of databases ground their work on one crucial insight: biologists consulting a database wish to see the actual ‘marks’, to borrow Hacking’s term, obtained through measurements and observations of organisms: the sequence of amino acids, the dots of microarray experiments, the photograph of a hybridised embryo. These marks constitute unique documents about a specific set of phenomena. Their production is constrained by the experimental setting and the nature of the entities under scrutiny; still, researchers with differing interests and expertise might find different interpretations of the same data and use them as evidence for a variety of claims about phenomena.

Curators have realised that researchers are unlikely to see data in the light of their own research interests, unless the data are presented independently of the claims that they were produced to validate. The idea that data can be separated from information about their provenance might seem straightforward. Most facts travelling from one context to another do not bring with them all the details about how they were originally fabricated. The source becomes important, however, when adding credibility to the claim: to evaluate the quality of a claim we need to know how the claim originated. Knowing whether to trust a rumour depends on whether we know where the rumour comes from and why it was spread; a politician wishing to assess a claim about climate change needs to reconstruct the reasoning and methods used by scientists to validate it. Thus, on the one hand, facts travel well when stripped of everything but their content and means of expression; on the other hand, the reliability of facts can only be assessed by reference to how they are produced. Curators are aware of this seeming paradox and strive to find ways to make data travel without depriving researchers who ‘receive’ those data of the opportunity to assess their reliability according to their own criteria.

To confront the challenge, data are ‘mined’ from sources such as publications; labelled with a machine-readable ‘unique identifier’ so as to be computed and analysed; and classified via association with keywords signalling the phenomena for which each dataset might be used as evidence. Data are thus taken to speak for themselves as ‘marks’ that are potentially applicable, in ways to be specified, to the range of phenomena indicated by keywords. At the same time, the structure of databases enables curators to store as much information as possible about how data have been produced. This is done through ‘evidence codes’ classifying each dataset according to the method and protocol with which it was obtained; the model organism and instruments used in the experiment; the publication or repository in which it first appeared; and the contact details of the researchers responsible, who can therefore be contacted directly about any question not answered in the database.
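The packaging just described can be pictured as a simple record structure. What follows is only a minimal sketch in Python, not a description of how any actual database is implemented: all class names, field names, identifiers and values are hypothetical and chosen for illustration. It keeps the ‘marks’ and their keywords apart from the provenance information carried by the evidence code, so that data can be found through keywords first and assessed through provenance later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    """Information about how a dataset was produced (hypothetical fields)."""
    evidence_code: str      # classifies the method and protocol used
    method: str
    organism: str           # model organism used in the experiment
    instruments: tuple
    first_appeared_in: str  # publication or repository
    contact: str            # researchers responsible


@dataclass(frozen=True)
class PackagedDataset:
    """A dataset as packaged for travel (hypothetical structure)."""
    identifier: str         # machine-readable unique identifier
    keywords: frozenset     # phenomena the data might be evidence for
    marks: str              # stand-in for the data themselves
    provenance: Provenance


def index_by_keyword(datasets):
    """Map each keyword to the identifiers of the datasets labelled with it."""
    index = {}
    for dataset in datasets:
        for keyword in sorted(dataset.keywords):
            index.setdefault(keyword, []).append(dataset.identifier)
    return index


def provenance_of(datasets, identifier):
    """Return provenance only when a user asks about one specific dataset."""
    for dataset in datasets:
        if dataset.identifier == identifier:
            return dataset.provenance
    raise KeyError(identifier)


# Invented example records; none of these values come from a real database.
records = [
    PackagedDataset(
        identifier="DS-0001",
        keywords=frozenset({"gene expression"}),
        marks="microarray dots (placeholder)",
        provenance=Provenance(
            evidence_code="EXP-MICROARRAY",
            method="microarray experiment",
            organism="Arabidopsis thaliana",
            instruments=("array scanner",),
            first_appeared_in="journal article (placeholder)",
            contact="lab-a@example.org",
        ),
    ),
    PackagedDataset(
        identifier="DS-0002",
        keywords=frozenset({"gene expression", "mitosis"}),
        marks="embryo photograph (placeholder)",
        provenance=Provenance(
            evidence_code="EXP-IMAGING",
            method="in situ hybridisation",
            organism="Arabidopsis thaliana",
            instruments=("light microscope",),
            first_appeared_in="repository entry (placeholder)",
            contact="lab-b@example.org",
        ),
    ),
]

index = index_by_keyword(records)
```

In this sketch, a keyword lookup such as `index["gene expression"]` returns identifiers without any reference to how the data were produced, while `provenance_of(records, "DS-0001")` retrieves the production details on request. The separation is the point of the illustration; the particular fields are assumptions.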

  1.3 Retrieval

Retrieval happens through so-called ‘search tools’, each of which accepts queries about one type of item: e.g. genes, markers, ecotypes, polymorphisms. Search results are displayed through digital models that visualise data according to the parameters requested by users and thus to their own expertise and demands. A query phrased through appropriate keywords typically yields links to datasets associated with the keywords. By clicking on these links, users can quickly generate information concerning the quantity and types of data classified as relevant to a given phenomenon; alternative ways of ordering the same dataset, for instance in relation to two different but biologically related phenomena; and even new hypotheses about how one dataset might correlate with another, which researchers might then test experimentally. Researchers would not be able to gather this type of information unless data were presented in formats that do not take into account the differences in the ways in which they originated. Without de-contextualisation, it would be impossible to consult and compare such a high number of different items, not to speak of distributing such information across research communities.

Once users have found data of particular interest to them, they can narrow their analysis down to that dataset and assess its quality and reliability. It is at this stage that they will need to know how the data have been produced, by whom and for which purpose. The first step in that direction is the consultation of evidence codes, which provide access to the information needed to decide whether and how to pursue the investigation further.

  2. Packaging for Travel

Due to the variability in the types of data available, and in the purposes for which they are used, sciences such as experimental biology have specific ways to deal with data and their connection to claims about phenomena. Data are not produced to validate or refute a given claim about phenomena, but rather because biologists possess new technologies to extract data from organisms and it is hoped that the potential biological significance of those data will emerge through comparisons among different datasets. This type of research is data-driven as much as it is hypothesis-driven: the characteristics of available data shape the questions asked about their biological relevance, while open questions about biological entities and processes shape the ways in which data are produced and interpreted.

This inter-dependence between data types and claims about phenomena is made possible by packaging processes such as the ones used by database curators, which connect data to claims about phenomena in ways that differ from the one-way evidential relation depicted by B&W. Curators aim to allow users to recognise the potential relevance of data for as many claims as possible. I now focus on three elements of packaging and on the epistemic consequences of adopting these measures to make data travel.

Labels

The allocation of labels involves the selection of terms apt to classify the facts being packaged, thus making it possible to organise and retrieve them. In the case of datasets in databases, the labels are the keywords indicating the range of phenomena for which data might prove relevant. These labels are used to classify data as well as to formulate claims about phenomena for which data provide evidential support. Indeed, the labelling system devised by curators has the crucial epistemic role of aligning the terms used to indicate which phenomena data can be taken to document with the terms used to formulate claims about those phenomena.

In research circumstances such as the ones examined by B&W, these two labelling processes satisfy different demands arising from different circumstances and scientists thus tend to keep them separate from each other. Labelling data for prospective retrieval and reuse means choosing terms referring to phenomena that are easily observed in the lab, either because they are very recognisable parts of an organism or a cell (‘meristem’, ‘ribosome’), or because they are processes whose characteristics and mechanisms are widely established across research contexts (‘mitosis’). By contrast, labelling phenomena for the purposes of formulating claims about them has the primary aim of ensuring that the resulting claims are compatible with the background knowledge and interests of the scientists adopting the labels, and will therefore fit existing theories and explanations: any label seen as somehow compatible with observations will be accepted, as long as it fits the research context in question. This means that there will be as much variation among labels adopted to formulate claims about phenomena as there is variation across research cultures and beliefs. For instance, both immunologists and ecologists use the term ‘pathogen’ as a label in formulating their claims; however, they use it to refer to two different phenomena (immunologists consider pathogens as a specific type of parasite, while ecologists tend to view them as potential symbionts). Equally common are cases where the same phenomenon is described through different terms at different locations.

The multiplicity of definitions assigned to the same terms (and of terms assigned to the same definition) within experimental biology limits the power of these terms to carry information across contexts. Indeed, the use of ‘local’ labels is one of the reasons why journal publication is an inefficient means to disseminate data. Databases have ways to confront this issue. First, curators assign a strict definition to each term chosen as a label. These definitions are as close as possible to the ones used by scientists working at the bench; yet, they are meant to be understood in as many research contexts as possible. To this end, once the labels are chosen and defined, curators examine the cases where a different label is given the same definition or several labels are proposed as fitting one definition. To accommodate the former option, curators create a system of synonyms associated with each label. For instance, the term ‘virion’ is defined as ‘the complete fully infectious extracellular virus particle’. Given that some biologists use the term ‘complete virus particle’ to fit this same definition, this second term is listed in the database as a synonym of ‘virion’. Users looking for ‘complete virus particle’ are thus able to retrieve data relevant to the phenomenon of interest, even if it is officially labelled ‘virion’. Curators use another strategy for cases of substantial disagreement on how a specific term should be defined. They use the qualifier ‘sensu’ to generate sub-terms to match the different definitions assigned to the same term within different communities. This is especially efficient when dealing with species-specific definitions: the term ‘cell wall’ is labelled ‘cell wall (sensu Bacteria)’ (peptidoglycan-based) and ‘cell wall (sensu Fungi)’ (containing chitin and beta-glucan). As long as curators are aware of differences in the use of terms across communities, those differences are registered and assimilated so that users from all communities are able to query for data.
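The two strategies, synonyms and ‘sensu’ sub-terms, can be illustrated with a small lookup routine. This is a sketch under stated assumptions rather than a rendering of any curators’ actual software: the dictionaries, the function name `resolve` and the matching rules are hypothetical, and only the example terms and their glosses are taken from the text above.

```python
# Official labels with their definitions or glosses (examples from the text).
LABELS = {
    "virion": "the complete fully infectious extracellular virus particle",
    "cell wall (sensu Bacteria)": "peptidoglycan-based",
    "cell wall (sensu Fungi)": "containing chitin and beta-glucan",
}

# Alternative terms that fit the definition of an official label.
SYNONYMS = {
    "complete virus particle": "virion",
}


def resolve(term):
    """Return the official labels that a user's query term maps to.

    Assumed rules (illustrative only):
    1. a synonym is replaced by its official label;
    2. an exact, case-insensitive match on a label wins;
    3. otherwise every 'sensu' sub-term of the queried term is returned,
       so that the user can choose the community-specific sense.
    """
    key = term.strip().lower()
    key = SYNONYMS.get(key, key)
    exact = [label for label in LABELS if label.lower() == key]
    if exact:
        return exact
    prefix = key + " (sensu "
    return sorted(label for label in LABELS if label.lower().startswith(prefix))
```

Under these assumptions, `resolve("complete virus particle")` leads to the label ‘virion’, while `resolve("cell wall")` returns both ‘sensu’ sub-terms and leaves the choice between them to the user. A term that curators have not registered resolves to nothing, which mirrors the condition that differences in usage are accommodated only as long as curators are aware of them.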