Abstract

Data sharing practices within the Earth Sciences vary between disciplines. Each discipline, indeed each institution, has its own policies and practices to facilitate data sharing. Two major challenges that every discipline must contend with are ensuring data quality and fostering sharing behaviors among researchers and institutions. This paper examines data sharing practices within the Earth Sciences by exploring the data life cycle as a means of ensuring data quality, and by examining the motivations and incentives behind data sharing behaviors.

Introduction

Data sharing within the Earth Sciences is invaluable. Sharing research effectively with other scientists, as well as with the general public, is arguably as important as the research itself. If no one has access to the research or can properly interpret the meaning of its findings, then the data essentially has no meaning. Data sharing policies and practices are therefore becoming more common within the scientific community in order to encourage and facilitate the process of data sharing. This paper will explore the impediments to and behaviors surrounding data sharing practices in the Earth Sciences, and will discuss in more detail the programs and institutions designed to facilitate data sharing in the atmospheric sciences, oceanographic sciences, and astronomy.

Importance of Data Sharing

Before exploring barriers and behaviors of data sharing, it is important to briefly emphasize why sharing data matters. Providing accessible, high-quality data encourages “open scientific inquiry,” allowing research to be “validated or refuted” by the scientific community [1]. Data sharing encourages debate among scientists and prompts further scrutiny of research conclusions [2]. This further scrutiny in turn fosters a new level of integrity in the data. Sharing data also allows other researchers to draw new conclusions from the work and can provide users with a basis “for new research and new methods of data analysis” [1]. New collaborations often form when researchers draw new conclusions by combining another’s data with their own [2]. Maintaining large repositories of data offers researchers access to a wealth of information far larger than any individual, or even any single institution, could generate [1]. Sharing information also prevents the “unnecessary duplication of effort” and thus promotes greater and faster strides in scientific discovery [1]. This wasted effort is measured not only in cost but also in time. Lastly, providing open access to data encourages learning and discovery among the public and involves the public in the scientific process. By sharing data with the public, the scientific community has encouraged a movement toward citizen science, which is quickly becoming a popular method of sharing data across scientific disciplines. Encouraging the public to become involved in the data sharing process will become increasingly important as scientists and the public alike attempt to wade through the age of the data deluge.

Impediments to Data Sharing

There are many impediments to effective data sharing. One of the most basic is data quality: data sharing cannot be discussed without first discussing data quality, because if researchers cannot trust the data they find, sharing it is useless. Data quality can be assessed in many different ways. Following the life cycle of data from production to management to use and re-use is one way to maintain it [3].

In the production phase, two key components are the calibration of the instruments used to collect the data and the methodology chosen to collect the information [3]. Instruments must be continually checked and recalibrated in order to maintain a high level of data quality. Researchers must also choose the most appropriate method of collecting the data, from gathering general human observations to using a wireless sensing system. The Center for Embedded Network Sensing (CENS) recognizes the value of collecting trustworthy data and facilitating the reuse of that data, especially when that data is impossible to reproduce [4]. For example, CENS uses dynamic sensors which can adjust monitoring conditions in real time [4]. CENS also deploys scientists into the field with the sensors, allowing them to fine-tune the instruments; however, this raises other data integrity issues, such as differences in equipment setup between different teams of scientists. In the past, CENS relied heavily on oral exchange regarding equipment usage, calibration, and methodology, but has more recently discovered the need for consistent documentation to ensure data quality [4]; documentation is the last component of the production phase. According to CENS scientists, one of their most important needs is confidence in their measurements. This confidence rests on equipment selection, equipment calibration, and human reliability [4]. The proper documentation of these components, whether in paper or digital form, is essential to enhancing trust [4]. CENS applications in the Earth Sciences include its seismic research area, which implements network technology to monitor aftershock and volcanic zones [5]. It uses a wireless network with a high signal-to-noise ratio in order to accurately monitor seismic events [5]. The concept of the Wireless Linked Seismic Network has worked well: the network itself is located in Mexico but has been managed almost completely from the United States. Nevertheless, many problems, including hardware failures, software bugs, weather-related failures, and poor design, have led to “significant loss of data” [5]. Though most problems were recognized by logging into the network and “probing the sites,” a field engineer had to be deployed to Mexico to resolve them. This illustrates the need for constant monitoring of instruments and networks in order to safeguard data quality.
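To make these production-phase checks concrete, the sketch below shows one way a simple quality screen might look. It is a minimal illustration only, not CENS's actual pipeline: the calibration bounds, the signal-to-noise threshold, and the function name are assumptions invented for this example.

    import statistics

    # Hypothetical calibration bounds and minimum signal-to-noise ratio;
    # real values would come from the instrument's calibration documentation.
    CALIBRATED_MIN, CALIBRATED_MAX = -1.0, 1.0
    SNR_THRESHOLD = 3.0

    def passes_quality_check(readings, noise_floor):
        """Return True if a window of sensor readings stays within its
        calibrated range and its signal-to-noise ratio is acceptable."""
        # Out-of-range values suggest calibration drift or a faulty sensor.
        if any(r < CALIBRATED_MIN or r > CALIBRATED_MAX for r in readings):
            return False
        if noise_floor <= 0:
            return False
        signal_rms = statistics.pstdev(readings)  # crude signal-strength estimate
        return (signal_rms / noise_floor) >= SNR_THRESHOLD

    window = [0.02, 0.35, -0.28, 0.41, -0.33]
    print(passes_quality_check(window, noise_floor=0.05))  # True for this window

A check like this would run continuously as data arrives, flagging suspect windows for review rather than silently discarding them.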

The second phase of the data life cycle is data management, which refers to the long-term accessibility of the data [3]. Data is often stored in data archives or repositories specific to particular disciplines. There are many different archives in the Earth Sciences, including those of the National Oceanic and Atmospheric Administration, the British Oceanographic Data Centre, the National Aeronautics and Space Administration, and the Australian Antarctic Data Centre. NOAA’s National Climatic Data Center (NCDC) archives its data in the Hierarchical Data Storage System (HDSS), “the robotic tape assembly used to store large datasets at NCDC” [6]. The data is then transferred from the tapes onto the public FTP site [6]. The British Oceanographic Data Centre uses the relational model of database design to store information, which allows tables to have “relationships or links” to other tables [7]. It also uses the National Oceanographic Database to store metadata associated with the datasets. NASA uses the Distributed Active Archive Centers, or DAACs; these centers each serve a particular discipline in the Earth Sciences to “process, archive, document, and distribute data” from different satellites and programs [8]. The Australian Antarctic Data Centre (AADC) uses several databases and a SCAR Feature Catalogue for spatial data [9]. The AADC also has a system specifically for cataloguing metadata associated with a dataset, the Catalogue of Australian Antarctic and Sub-Antarctic Metadata [9]. Even this small sample of institutions shows that there is no one perfect way of storing and sharing data. Storing information in a database for easy retrieval is obviously a commonality, but the specific implementation of practices and uses varies across institutions.
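The relational approach BODC describes can be illustrated with a small self-contained example. The sketch below uses SQLite purely for illustration; the table and column names, and the sample rows, are invented and do not reflect BODC's actual schema.

    import sqlite3

    # Two linked tables: datasets, and metadata records that reference them.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE datasets (
        dataset_id INTEGER PRIMARY KEY,
        title      TEXT NOT NULL,
        format     TEXT NOT NULL
    );
    CREATE TABLE metadata (
        record_id  INTEGER PRIMARY KEY,
        dataset_id INTEGER REFERENCES datasets(dataset_id),
        key        TEXT NOT NULL,
        value      TEXT NOT NULL
    );
    """)
    conn.execute("INSERT INTO datasets VALUES (1, 'Example CTD casts', 'netCDF')")
    conn.execute("INSERT INTO metadata VALUES (1, 1, 'platform', 'example research vessel')")

    # The "relationship or link" between tables is expressed as a join.
    for title, key, value in conn.execute("""
        SELECT d.title, m.key, m.value
        FROM datasets d JOIN metadata m ON m.dataset_id = d.dataset_id
    """):
        print(title, key, value)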

Another component of data management is “retrievability” [3], which refers to the metadata that accompanies the data as well as the data format. Research data is available in many different formats, including XML, spreadsheet files, database schemas, HTML, Word documents, PDF, and many more [10]. The Australian Antarctic Data Centre offers researchers the option of submitting data in many different formats, including TXT, HTML, XML, MS Excel, CSV, MS Access, JPEG, MPEG, and MP3, to name a few [9]. The British Oceanographic Data Centre requires the use of standard formats such as the BODC request (ASCII) format, the Ocean Data View format, a netCDF format, and an AXF format [7]. These standard formats are described in explicit detail on its website, and researchers are required to format their data to these specific standards. The Global Observing Systems Information Center (GOSIC) portal, discussed later in greater detail, facilitates data sharing because it returns data regardless of the format [11]. Again, there is no single format for easy data sharing, and each institution maintains its data in the way that works best for it and its researchers. The UK Data Archive does make recommendations of data formats to use for the long-term preservation of research data [2]. It makes recommendations for quantitative tabular data with extensive metadata, quantitative tabular data with minimal metadata, geospatial data, qualitative data, digital image data, digital audio data, digital video data, and documentation and scripts. It is important to understand that the main goal of collecting data in specific formats is to ensure long-term preservation, accessibility, and usability.
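As a concrete illustration of preservation-friendly formatting, the sketch below writes tabular data to plain CSV with a human-readable JSON sidecar file holding its metadata. The file names and fields are invented for this example and are not drawn from any of the archives above.

    import csv
    import json

    rows = [
        {"station": "A1", "depth_m": 10, "temp_c": 14.2},
        {"station": "A1", "depth_m": 20, "temp_c": 12.8},
    ]

    # CSV is an open, plain-text format well suited to long-term preservation.
    with open("example_profile.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["station", "depth_m", "temp_c"])
        writer.writeheader()
        writer.writerows(rows)

    # Keep descriptive metadata alongside the data rather than embedded in it.
    with open("example_profile.metadata.json", "w") as f:
        json.dump({
            "title": "Example temperature profile",
            "variables": {
                "depth_m": "depth in metres",
                "temp_c": "temperature in degrees Celsius",
            },
            "format": "CSV, UTF-8, comma-delimited",
        }, f, indent=2)

The design choice mirrors the archives' goal: the data file stays simple and software-independent, while the sidecar records what a future user needs in order to interpret it.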

The second issue surrounding retrievability is providing sufficient metadata to support the understanding and management of data. The three types of metadata are descriptive, administrative, and structural [12]. Descriptive metadata is information regarding the content of the dataset [12]; it helps users properly interpret the datasets and extrapolate from them or from data collections. Administrative metadata is the information needed to allow proper management of the data [12]; it is used by those responsible for maintaining the datasets. Structural metadata describes how different components of associated datasets relate to each other [12]. All these types of metadata are crucial to the management of datasets and thus to quality control. GOSIC’s three main systems, the Global Climate Observing System, the Global Ocean Observing System, and the Global Terrestrial Observing System, are required to have “directory level” and “archive level” metadata associated with their datasets [13]. Directory-level metadata refers to the “general descriptive information” needed by a user to identify the dataset [13], including information about the location of the dataset and contact information. Archive-level metadata refers to the information needed to understand the dataset [13]. At the Australian Antarctic Data Centre, a data record is not complete until all associated metadata has been submitted [9]. Metadata is essential in enabling data sharing and allowing users to use data effectively.
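The three metadata types can be made concrete with a small sketch. The record below groups illustrative descriptive, administrative, and structural fields for a single dataset; the field names are assumptions for this example, not a formal standard such as those used by GOSIC or the AADC.

    from dataclasses import dataclass, field

    @dataclass
    class DatasetMetadata:
        # Descriptive: what the data is, so users can interpret it.
        title: str
        abstract: str
        # Administrative: what maintainers need to manage the data.
        steward_contact: str
        access_policy: str
        # Structural: how components of the dataset relate to each other.
        related_files: dict = field(default_factory=dict)

    record = DatasetMetadata(
        title="Aftershock monitoring, example zone",
        abstract="Continuous waveform data from a wireless seismic array.",
        steward_contact="data-office@example.org",
        access_policy="open after 12-month embargo",
        related_files={"waveforms.mseed": "raw data", "stations.csv": "site list"},
    )
    print(record.title)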

Another area to consider when discussing accessibility of research data is data policy. According to a questionnaire posed to a sample of Dutch professors and senior lecturers, open access to datasets (perhaps after an embargo period) is popular in the physical sciences, which encompass the Earth Sciences [3]. Nonetheless, the popularity of open access varies across specific Earth Sciences disciplines. In the field of atmospheric sciences, the NOAA/National Climatic Data Center Open Access to Physical Climate Data Policy provides essentially full and open data access [14]. According to the policy, all raw data collected from NOAA’s many climate observing systems, as well as output from its climate models, are “openly available in as timely a manner as possible” [14]. Additionally, NOAA makes its derived datasets available to the public, along with access to climate-related model simulations [14]. NOAA’s National Climatic Data Center also operates the Global Observing Systems Information Center (GOSIC), which allows people to access international climate-related datasets from the Global Climate Observing System, the Global Ocean Observing System, and the Global Terrestrial Observing System [11]. The goal of GOSIC is to provide full and open exchange of data, data products, and metadata for all of these systems at the lowest cost to the user. The easiest way for users to access information is through the GOSIC portal. This portal does not contain the datasets, but rather serves as a “single entry point for users” [11]. The portal maintains information about the datasets and gives users easy access to the data without their having to navigate hundreds of websites to find the information they seek [11].
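The “single entry point” idea can be sketched simply: the portal stores descriptive records and pointers to the hosting systems, not the datasets themselves. Everything in the example below, from the dataset keys to the URLs, is invented for illustration and is not GOSIC's actual catalog or interface.

    # A toy catalog mapping dataset keys to their hosting system and location.
    CATALOG = {
        "surface-temperature": {
            "system": "GCOS",
            "description": "Global surface temperature observations",
            "access_url": "https://example.org/gcos/surface-temperature",
        },
        "sea-level": {
            "system": "GOOS",
            "description": "Tide-gauge sea level records",
            "access_url": "https://example.org/goos/sea-level",
        },
    }

    def locate(dataset_key: str) -> str:
        """Resolve a dataset key to the hosting system's access URL."""
        entry = CATALOG[dataset_key]
        return f"{entry['description']} (hosted by {entry['system']}): {entry['access_url']}"

    print(locate("sea-level"))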

At NASA, the data sharing policy promotes the “full and open sharing of all data” with academia, private industry, and the public [15]. One goal of the policy is to create a National Information Infrastructure to foster an Environmental Information Economy, promoting a “routine exchange of environmental data” [15]. However, NASA does retain the right to protect data first produced by NASA or by a recipient that contains trade secrets, commercial or financial information, or other confidential information for a period of two years [15].

BODC also promotes the use of its data for the advancement of industry, education, science, and public knowledge [7]. BODC follows the Natural Environment Research Council (NERC) Data Policy, under which environmental data will be made available to any person or organization who requests it [16]. There are a few restrictions on open access, but those are specifically set out in the Environmental Information Regulations [16]. Also, in order to protect ongoing research projects, NERC allows researchers exclusive rights to data they have collected for a maximum of two years from the end of the data collection period [16]. NERC also requires the development of a formal data management plan, much like the requirements of the National Science Foundation [16].

AADC releases submitted data to the public after embargo periods specific to the kind of data being submitted [9]. For example, ship-sourced observations and measurements are released by a project’s end date, while data on threatened species has an unlimited embargo period [9]; a simple sketch of such rules appears below. All of these institutions illustrate the differences in data sharing policies among the Earth Sciences disciplines. Even though they all fall under the Earth Sciences, each maintains its own data policy practices. This is why it is impossible to discuss overall data sharing practices in the Earth Sciences without discussing the individual disciplines.
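The embargo logic described in these policies is straightforward to model, as the sketch below shows. The rule values are illustrative stand-ins, not the institutions' actual terms.

    from datetime import date, timedelta

    # Each data type maps to an embargo rule applied from the end of data
    # collection; None marks an unlimited embargo.
    EMBARGO_RULES = {
        "ship_observations": timedelta(days=0),       # released at project end
        "general_research": timedelta(days=2 * 365),  # e.g. a two-year maximum
        "threatened_species": None,                   # unlimited embargo
    }

    def release_date(data_type, collection_end):
        """Return the public release date, or None if embargoed indefinitely."""
        rule = EMBARGO_RULES[data_type]
        return None if rule is None else collection_end + rule

    print(release_date("general_research", date(2010, 6, 30)))  # 2012-06-29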

A final consideration regarding the accessibility of data, under the management phase of the data life cycle, is copyright. One proposed solution to copyright issues is Creative Commons licensing [3]. Creative Commons promotes “universal access” to research in order to achieve “an Internet full of open content” [17]. Its mission is to give individuals, institutions, and companies the ability to keep their copyright while also allowing others “certain uses” of their work [17]. This allows for a “some rights reserved” approach to information sharing rather than an “all rights reserved” mentality. The AADC is one Earth Science institution which uses a Creative Commons license. Under the Creative Commons Attribution 3.0 License, users are able to share or remix the work. The only condition on using the data is that users must attribute the content to the AADC or, more specifically, to the original creator [17].