Paul Avery
Nov. 10, 2001
Version 5
Data Grids: A New Computational Infrastructure
for Data Intensive Science
Abstract
Twenty-first century scientific and engineering enterprises are increasingly characterized by their geographic dispersion and their reliance on large data archives. These characteristics bring with them unique challenges. First, the increasing size and complexity of modern data collections require significant investments in information technologies to store, retrieve and analyze them. Second, the increased distribution of people and resources in these projects has made resource sharing and collaboration across significant geographic and organizational boundaries critical to their success.
In this paper I explore how computing infrastructures based on Data Grids offer data intensive enterprises a comprehensive, scalable framework for collaboration and resource sharing. A detailed example of a Data Grid framework is presented for a Large Hadron Collider experiment, where a hierarchical set of laboratory and university resources comprising petaflops of processing power and a multi- petabyte data archive must be efficiently utilized by a global collaboration. The experience gained with these new information systems, providing transparent managed access to massive distributed data collections, will be applicable to large-scale data-intensive problems in a wide spectrum of scientific and engineering disciplines, and eventually in industry and commerce. Such systems will be needed in the coming decades as a central element of our information-based society.
Keywords: Grids, data, virtual data, data intensive, petabyte, petascale, petaflop, virtual organization, GriPhyN, PPDG, CMS, LHC, CERN.
1 Introduction
2 Data intensive activities
3 Data Grids and Data Intensive Sciences
4 Data Grid Development for the Large Hadron Collider
4.1 The CMS Data Grid
4.2 The Tier Concept
4.3 Elaboration of the CMS Data Grid Model
4.4 Advantages of the CMS Data Grid Model
5 Data Grid Architecture
5.1 Globus-Based Infrastructure
5.2 Virtual Data
5.3 Development of Data Grid Architecture
6 Examples of Major Data Grid Efforts Today
6.1 The TeraGrid
6.2 Particle Physics Data Grid
6.3 GriPhyN Project
6.4 European Data Grid
6.5 CrossGrid
6.6 International Virtual Data Grid Laboratory and DataTAG
7 Common Infrastructure
8 Summary
References
1 Introduction
Twenty-first century scientific and engineering enterprises are increasingly characterized by their geographic dispersion and their reliance on large data archives. These characteristics bring with them unique challenges. First, the increasing size and complexity of modern data collections require significant investments in information technologies to store, retrieve and analyze them. Second, the increased distribution of people and resources in these projects has made resource sharing and collaboration across significant geographic and organizational boundaries critical to their success.
Infrastructures known as “Grids” [1,2] are being developed to address the problem of resource sharing. An excellent introduction to Grids can be found in the article “The Anatomy of the Grid” [3], which provides the following description:
“The real and specific problem that underlies the Grid concept is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. The sharing that we are concerned with is not primarily file exchange but rather direct access to computers, software, data, and other resources, as is required by a range of collaborative problem-solving and resource-brokering strategies emerging in industry, science, and engineering. This sharing is, necessarily, highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs. A set of individuals and/or institutions defined by such sharing rules form what we call a virtual organization (VO).”
The existence of very large distributed data collections adds a significant new dimension to enterprise-wide resource sharing, and has led to substantial research and development effort on “Data Grid” infrastructures, capable of supporting this more complex collaborative environment. This work has taken on more urgency for new scientific collaborations, which in some cases will reach global proportions and share data archives with sizes measured in dozens or even hundreds of petabytes within a decade. These collaborations have recognized the strategic importance of Data Grids for realizing the scientific potential of their experiments, and have begun working with computer scientists, members of other scientific and engineering fields and industry to research and develop this new technology and create production-scale computational environments. Figure 1 shows a U.S. based Data Grid consisting of a number of heterogeneous resources.
My aim in this paper is to review Data Grid technologies and how they can benefit data intensive sciences. Developments in industry are not included here, but since most Data Grid work is presently carried out to address the urgent data needs of advanced scientific experiments, the omission is not a serious one. (The problems solved while dealing with these experiments will in any case be of enormous benefit to industry in a short time.) Furthermore, I will concentrate on those projects which are developing Data Grid infrastructures for a variety of disciplines, rather than “vertically integrated” projects that benefit a single experiment or discipline, and explain the specific challenges faced by those disciplines.
2 Data intensive activities
The number and diversity of data intensive projects are expanding rapidly. The following survey of projects, while incomplete, illustrates the scope of data intensive methods and the immense interest in applying them to scientific problems.
Physics and space sciences: High energy and nuclear physics experiments at accelerator laboratories at Fermilab, Brookhaven and SLAC already generate dozens to hundreds of terabytes of colliding beam data per year that is distributed to and analyzed by hundreds of physicists around the world to search for subtle new interactions. Upgrades to these experiments and new experiments planned for the Large Hadron Collider at CERN will increase data rates to petabytes per year. Gravitational wave searches at LIGO, VIRGO and GEO will accumulate yearly samples of approximately 100 terabytes of mostly environmental and calibration data that must be correlated and filtered to search for rare gravitational events. New multi-wavelength all-sky surveys utilizing telescopes instrumented with gigapixel CCD arrays will soon drive yearly data collection rates from terabytes to petabytes. Similarly, remote-sensing satellites operating at multiple wavelengths will generate several petabytes of spatial-temporal data that can be studied by researchers to accurately measure changes in our planet’s support systems.
Biology and medicine: Biology and medicine are rapidly increasing their dependence on data intensive methods. New generation X-ray sources coupled with improved data collection methods are expected to generate copious data samples from biological specimens with ultra-short time resolutions. Many organism genomes are being sequenced by extremely fast “shotgun” methods requiring enormous computational power. The resulting flow of genome data is rising exponentially, with standard databases doubling in size every few months. Sophisticated searches of these databases will require much larger computational resources as well as new statistical methods and metadata to keep up with the flow of data. Proteomics, the study of protein structure and function, is expected to generate enormous amounts of data, easily dwarfing the data samples obtained from genome studies. When applied to protein-protein interactions, which are of extreme importance to drug designers, these studies will require additional orders of magnitude increases in computational capacity (to hundreds of petaflops) and storage capacity (to thousands of petabytes). In medicine, a single three-dimensional brain scan can generate a significant fraction of a terabyte of data, while systematic adoption of digitized radiology scans will produce dozens of petabytes of data that can be quickly accessed and searched for evidence of breast cancer and other diseases. Exploratory studies have shown the value of converting patient records to electronic form and attaching digital CAT scans, X-ray charts and other instrument data, but systematic use of such methods would generate databases many petabytes in size. Medical data pose additional ethical and technical challenges stemming from exacting security restrictions on data access and patient identification.
Computer simulations: Advances in information technology in recent years have given scientists and engineers the ability to develop sophisticated simulation and modeling techniques for improved understanding of the behavior of complex systems. When coupled to the huge processing power and storage resources available in supercomputers or large computer clusters, these advanced simulation and modeling methods become tools of rare power, permitting detailed and rapid studies of physical processes while sharply reducing the need to conduct lengthy and costly experiments or to build expensive prototypes. The following examples provide a hint of the potential of modern simulation methods. High energy and nuclear physics experiments routinely generate simulated datasets whose size (in the multi-terabyte range) is comparable to and sometimes exceeds the raw data collected by the same experiment. Supercomputers generate enormous databases from long-term simulations of climate systems with different parameters that can be compared with one another and with remote satellite sensing data. Environmental modeling of bays and estuaries using fine-scale fluid dynamics calculations generates massive datasets that permit the calculation of pollutant dispersal scenarios under different assumptions that can be compared with measurements. These projects also have geographically distributed user communities who must access and manipulate these databases.
3 Data Grids and Data Intensive Sciences
To develop the argument that Data Grids offer a comprehensive solution to data intensive activities, I first summarize some general features of Grid technologies. These technologies comprise a mixture of protocols, services, and tools that are collectively called “middleware”, reflecting the fact that they are accessed by “higher level” applications or application tools while they in turn invoke processing, storage, network and other services at “lower” software and hardware levels. Grid middleware includes security and policy mechanisms that work across multiple institutions; resource management tools that support access to remote information resources and simultaneous allocation (“co-allocation”) of multiple resources; general information protocols and services that provide important status information about hardware and software resources, site configurations, and services; and data management tools that locate and transport datasets between storage systems and applications.
The diagram in Figure 2 outlines in a simple way the roles played by these various Grid technologies. The lowest level Fabric contains shared resources such as computer clusters, data storage systems, catalogs, networks, etc. that Grid tools must access and manipulate. The Resource and Connectivity layers provide, respectively, access to individual resources and the communication and authentication tools needed to communicate with them. Coordinated use of multiple resources – possibly at different sites – is handled by Collective protocols, APIs, and services. Applications and application toolkits utilize these Grid services in myriad ways to provide “Grid-aware” services for members of a particular virtual organization. A much more detailed explication of Grid architecture can be found in reference [3].
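To make this layering more concrete, the following minimal Python sketch (purely illustrative; it does not reproduce the API of Globus or any other Grid toolkit, and all class and method names are invented) traces how a single application-level request might pass down through the Collective, Resource, and Connectivity layers to Fabric-level resources at different sites.

# Illustrative sketch of the layered Grid architecture described above.
# All names are hypothetical; they do not correspond to a real toolkit API.

class FabricResource:
    """A Fabric-level shared resource: a cluster, storage system, or catalog."""
    def __init__(self, name, site):
        self.name = name
        self.site = site

    def execute(self, task):
        return f"{task} ran on {self.name} at {self.site}"


class Connectivity:
    """Connectivity layer: communication and authentication with a resource."""
    def authenticate(self, user, resource):
        # A real Grid would use certificate-based single sign-on here.
        return f"{user} authenticated to {resource.site}"


class ResourceLayer:
    """Resource layer: access to an individual resource."""
    def __init__(self, connectivity):
        self.connectivity = connectivity

    def submit(self, user, resource, task):
        self.connectivity.authenticate(user, resource)
        return resource.execute(task)


class Collective:
    """Collective layer: coordinated use of multiple resources at different sites."""
    def __init__(self, resource_layer, resources):
        self.resource_layer = resource_layer
        self.resources = resources

    def co_allocate(self, user, task):
        # Trivial broker: run the task on every available resource.
        return [self.resource_layer.submit(user, r, task) for r in self.resources]


# Application level: a virtual-organization member issues one logical request.
fabric = [FabricResource("cluster-A", "Caltech"), FabricResource("tape-store", "Fermilab")]
grid = Collective(ResourceLayer(Connectivity()), fabric)
for line in grid.co_allocate("physicist@vo-cms", "reconstruct run 1234"):
    print(line)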
While standard Grid infrastructures provide distributed scientific communities the ability to collaborate and share resources, additional capabilities are needed to cope with the specific challenges associated with scientists accessing and manipulating very large distributed data collections. These collections, ranging in size from terabytes (TB) to petabytes (PB), comprise raw (measured) data and many levels of processed or refined data, as well as comprehensive metadata describing, for example, under what conditions the data was generated or collected, how large it is, etc. New protocols and services must facilitate access to significant tertiary (e.g., tape) and secondary (disk) storage repositories to allow efficient and rapid access to primary data stores, while taking advantage of disk caches that buffer very large data flows between sites. They also must make efficient use of high performance networks that are critically important for the timely completion of these transfers. For example, transporting 10 TB of data to a computational resource in a single day requires a 1 gigabit per second network operating at essentially 100% utilization. Efficient use of these extremely high network bandwidths also requires special software interfaces and programs that in most cases have yet to be developed.
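The bandwidth figure quoted above follows from simple unit arithmetic; the short calculation below reproduces it, assuming only the 10 TB volume and one-day deadline stated in the text.

# Back-of-the-envelope check of the network requirement quoted above:
# moving 10 TB in one day over a fully utilized link.

data_bytes = 10e12          # 10 terabytes (10^12 bytes each)
seconds_per_day = 86400

required_bps = data_bytes * 8 / seconds_per_day   # bits per second
print(f"Required throughput: {required_bps / 1e9:.2f} Gbit/s")
# -> roughly 0.93 Gbit/s, i.e. a 1 gigabit per second link at ~100% utilization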
The computational and data management problems encountered in data intensive research include the following challenging aspects:
- Computation-intensive as well as data-intensive: Analysis tasks are compute-intensive and data-intensive and can involve hundreds or even thousands of computer, data handling, and network resources. The central problem is coordinated management of computation and data, not just data curation and movement.
- Need for large-scale coordination without centralized control: Rigorous performance goals require coordinated management of numerous resources, yet these resources are, for both technical and strategic reasons, highly distributed and not always amenable to tight centralized control.
- Large dynamic range in user demands and resource capabilities: It must be possible to support and arbitrate among a complex task mix of experiment-wide, group-oriented, and (perhaps thousands of) individual activities—using I/O channels, local area networks, and wide area networks that span several distance scales.
- Data and resource sharing: Large dynamic communities would like to benefit from the advantages of intra and inter community sharing of data products and the resources needed to produce and store them.
The “Data Grid” has been introduced as a unifying concept to describe the new technologies required to support such next-generation data-intensive applications—technologies that will be critical to future data-intensive computing in the many areas of science and commerce in which sophisticated software must harness large amounts of computing, communication and storage resources to extract information from data. Data Grids are typically characterized by the following elements: (1) they have large extent (national and even global) and scale (many resources and users); (2) they layer sophisticated new services on top of existing local mechanisms and interfaces, facilitating coordinated sharing of remote resources; and (3) they provide a new dimension of transparency in how computational and data processing are integrated to provide data products to user applications. This transparency is vitally important for sharing heterogeneous distributed resources in a manageable way, a point to which I return later.
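The transparency mentioned in point (3) can be illustrated with a small sketch. The Python fragment below is a hypothetical toy, not a description of any existing Data Grid service: the catalog contents, site names and function names are all invented. It shows the basic decision such a service makes on a user's behalf, returning an existing replica of a requested data product when one can be located and otherwise scheduling the computation that produces it.

# Illustrative sketch of transparent data-product delivery.
# Catalog contents, locations and function names are hypothetical.

replica_catalog = {
    # logical data product name -> list of physical locations
    "higgs-candidates-v3": ["gsiftp://tier1.fnal.gov/store/higgs-v3"],
}

recipes = {
    # logical name -> (input product, transformation that derives it)
    "higgs-candidates-v4": ("higgs-candidates-v3", "reselect_with_new_cuts"),
}

def deliver(product):
    """Return a location for `product`, materializing it if necessary."""
    if product in replica_catalog:
        # Cheapest option: point the user at an existing replica.
        return replica_catalog[product][0]
    if product in recipes:
        source, transform = recipes[product]
        # Otherwise compute it from its parent product and register the result.
        location = f"gsiftp://tier2.ufl.edu/derived/{product}"  # placeholder site
        print(f"running {transform} on {deliver(source)}")
        replica_catalog[product] = [location]
        return location
    raise KeyError(f"unknown data product: {product}")

print(deliver("higgs-candidates-v4"))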
4 Data Grid Development for the Large Hadron Collider
I now turn to a particular experimental activity, high energy physics at the Large Hadron Collider (LHC) at CERN, to provide a concrete example of how a Data Grid computing framework might be implemented to meet the demanding computational and collaborative needs of a data intensive experiment. The extreme requirements of experiments at the LHC, due to commence operations in 2006, have been known to physicists for several years. Distributed architectures based on Data Grid technologies have been proposed as solutions, and a number of initiatives are now underway to develop implementations at increasing levels of scale and complexity. The particular architectural solution shown here is not necessarily the most effective or efficient one for other disciplines, but it does offer some valuable insights about the technical and even political merits of Data Grids when applied to large-scale problems.
For definiteness, and without sacrificing much generality, I focus on the CMS experiment [30] at the LHC. CMS faces computing challenges of unprecedented scale in terms of data volume, processing requirements, and the complexity and distributed nature of the analysis and simulation tasks among thousands of physicists worldwide. The detector will record events[*] at a rate of approximately 100 Hz, accumulating 100 MB/sec of raw data or several petabytes per year of raw and processed data in the early years of operation. The data storage rate is expected to grow in response to the pressures of higher luminosity, new physics triggers and better storage capabilities, leading to data collections of approximately 20-30 PB by 2010, rising to several hundred petabytes over the following decade.
The computational resources required to reconstruct and simulate this data are similarly vast. Each raw event will be approximately 1 MB in size and require roughly 3000 SI95-sec[†] to reconstruct and 5000 SI95-sec to fully simulate. Estimates based on the initial 100 Hz data rate have yielded a total computing capability of approximately 1800K SI95 in the first year of operation, rising rapidly to keep up with the total accumulated data. Approximately one third of this capability will reside at CERN and the remaining two thirds will be provided by the member nations [[4]]. To put the numbers in context, 1800K SI95s is roughly equivalent to 60,000 of today’s high-end PCs.
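These storage and processing figures can be reproduced from the quoted event rate and per-event costs. The sketch below redoes the arithmetic; the annual live-time of roughly 10^7 seconds is an assumed value typical of accelerator operation, and the 30 SI95-per-PC conversion is simply what the 60,000-PC comparison implies, neither figure being stated explicitly in the text.

# Reproducing the CMS scale estimates quoted above.
# The ~1e7 s of running per year is an assumed accelerator live-time;
# the 30 SI95-per-PC figure is implied by the "60,000 high-end PCs" comparison.

event_rate_hz = 100          # events recorded per second
event_size_mb = 1.0          # raw event size in megabytes
reco_cost_si95_sec = 3000    # CPU cost to reconstruct one event
live_seconds_per_year = 1e7  # assumed annual running time

raw_data_pb_per_year = event_rate_hz * event_size_mb * live_seconds_per_year / 1e9
print(f"Raw data: ~{raw_data_pb_per_year:.1f} PB/year (before processed copies)")

# CPU needed just to reconstruct events as fast as they are recorded:
reco_power_si95 = event_rate_hz * reco_cost_si95_sec
print(f"Real-time reconstruction alone: {reco_power_si95 / 1e3:.0f}K SI95")

# The quoted 1800K SI95 total also covers simulation, reprocessing and analysis.
total_si95 = 1800e3
si95_per_pc = 30
print(f"Total capacity: ~{total_si95 / si95_per_pc:,.0f} high-end PCs")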
4.1 The CMS Data Grid
The challenge facing CMS is how to build an infrastructure that will provide these computational and storage resources and permit their effective and efficient use by a scientific community of several thousand physicists spread across the globe. Simple scaling arguments as well as more sophisticated studies using the MONARC [[5]] simulation package have shown that a distributed infrastructure based on regional centers provides the most effective technical solution, since large geographical clusters of users are close to the datasets and resources that they employ. A distributed configuration is also preferred from a political perspective since it allows local control of resources and some degree of autonomy in pursuing physics objectives.
As discussed in the previous section, the distributed computing model can be made even more productive by arranging CMS resources as a Grid, specifically a Data Grid in which large computational resources and massive data collections (including cached copies and data catalogs) linked by very high-speed networks and sophisticated middleware software form a single computational resource accessible from any location. The Data Grid framework is expected to play a key role in realizing CMS’ scientific potential by transparently mobilizing large-scale computing resources for large-scale computing tasks and by providing a collaboration-wide computing fabric that permits full participation in the CMS research program by physicists at their home institutes. This latter point is particularly relevant for participants in remote or distant regions. As a result, a highly distributed, hierarchical computing infrastructure exploiting Grid technologies is a central element of the CMS worldwide computing model.