U.S. CMS Software and Computing Project

Version 1.0, May 22, 2001

Paul Avery


Data Grids: A New Computational Infrastructure for Data Intensive Science

1 Introduction

Twenty-first century scientific and engineering enterprises are increasingly characterized by their geographic dispersion and their reliance on large data archives. These characteristics bring with them unique challenges. First, the increasing size and complexity of modern data collections require significant investments in information technologies to store, retrieve and analyze them. Second, the increased distribution of people and resources in these projects has made resource sharing and collaboration across significant geographic and organizational boundaries critical to their success.

Infrastructures known as “Grids”[1] are being developed to address the problem of resource sharing. An excellent introduction to Grids can be found in the article[2], “The Anatomy of the Grid”, which provides the following interesting description:

“The real and specific problem that underlies the Grid concept is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. The sharing that we are concerned with is not primarily file exchange but rather direct access to computers, software, data, and other resources, as is required by a range of collaborative problem-solving and resource-brokering strategies emerging in industry, science, and engineering. This sharing is, necessarily, highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs. A set of individuals and/or institutions defined by such sharing rules form what we call a virtual organization (VO).”

The existence of very large distributed data collections adds a significant new dimension to enterprise-wide resource sharing, and has led to a substantial research and development effort on “Data Grid” infrastructures capable of supporting this more complex collaborative environment. This work has taken on more urgency for new scientific collaborations, which in some cases will reach global proportions and share data archives measured in dozens or even hundreds of Petabytes within a decade. These collaborations have recognized the strategic importance of Data Grids for realizing the scientific potential of their experiments, and have begun working with computer scientists, members of other scientific and engineering fields, and industry to research and develop this new technology and create production-scale computational environments. Figure 1 below shows a U.S.-based Data Grid consisting of a number of heterogeneous resources.

My aim in this paper is to review how Data Grids can benefit data intensive sciences. Developments in industry are not included here, but since most Data Grid work is presently carried out to address the urgent data needs of advanced scientific experiments, the omission is not a serious one. (The problems solved in meeting the needs of these experiments will in any case be of enormous benefit to industry within a short time.) Furthermore, I concentrate on projects that are developing Data Grid infrastructures for a variety of disciplines, rather than on “vertically integrated” projects that benefit a single experiment or discipline, and I explain the specific challenges faced by those disciplines.

2 Data intensive activities

The number and diversity of data intensive projects are expanding rapidly. The following survey of projects, while incomplete, shows the scope of, and the immense interest in, data intensive methods for solving scientific problems.

Physics and space sciences: High energy and nuclear physics experiments at accelerator laboratories at Fermilab, Brookhaven and SLAC already generate dozens to hundreds of Terabytes of colliding beam data per year that is distributed to and analyzed by hundreds of physicists around the world to search for subtle new interactions. These rates will increase to Petabytes per year in experiments planned for the Large Hadron Collider at CERN. Gravitational wave searches at LIGO, VIRGO and GEO will accumulate yearly samples of approximately 100 Terabytes of mostly environmental and calibration data that must be correlated and filtered to search for rare gravitational events. New multi-wavelength all-sky surveys utilizing telescopes instrumented with gigapixel CCD arrays will soon drive yearly data collection rates from Terabytes to Petabytes. Similarly, remote-sensing satellites operating at multiple wavelengths will generate several Petabytes of spatial-temporal data that can be studied by researchers to accurately measure changes in our planet’s support systems.

Biology and medicine: Biology and medicine are rapidly increasing their dependence on data intensive methods. Experiments at new generation light sources have the potential to generate massive amounts of data while recording the changes in shape of individual protein molecules. Organism genomes are being sequenced by new generations of sequencing engines and stored in databases, and their properties are compared using new statistical methods requiring massive computational power. Proteomics, the study of protein structure and function, is expected to generate enormous amounts of data, easily dwarfing the data samples obtained from genome studies. In medicine, a single three-dimensional brain scan can generate a significant fraction of a Terabyte of data, while systematic adoption of digitized radiology scans will produce dozens of Petabytes of data that can be quickly accessed and searched for breast cancer and other diseases. Exploratory studies have shown the value of converting patient records to electronic form and attaching digital CAT scans, X-ray charts and other instrument data, but systematic use of such methods would generate databases many Petabytes in size. Medical data pose additional ethical and technical challenges, stemming from strict restrictions on access to the data and on patient identification.

Computer simulations: Advances in information technology in recent years have given scientists and engineers the ability to use sophisticated simulation and modeling techniques to improve their understanding of the behavior of complex systems. When coupled to the huge processing power and storage resources available in supercomputers or large computer clusters, these advanced simulation and modeling methods become tools of rare power, permitting detailed and rapid studies of physical processes while sharply reducing the need to conduct lengthy and costly experiments or to build expensive prototypes. The following examples provide a hint of the potential of modern simulation methods. High energy and nuclear physics experiments routinely generate simulated datasets whose size (in the multi-Terabyte range) is comparable to, and sometimes exceeds, the raw data collected by the same experiment. Supercomputers generate enormous databases from long-term simulations of climate systems with different parameters that can be compared with one another and with remote satellite sensing data. Environmental modeling of bays and estuaries using fine-scale fluid dynamics calculations generates massive datasets that permit the calculation of pollutant dispersal scenarios under different assumptions that can be compared with measurements. These projects also have geographically distributed user communities who must access and manipulate these databases.

Physics at the Large Hadron Collider: Although I briefly discussed high energy physics earlier, the requirements for experiments at CERN’s Large Hadron Collider (LHC), due to start operations in 2006, are so extreme that they merit separate treatment. LHC experiments face computing challenges of unprecedented scale in terms of data volume and complexity, processing requirements, and the complexity and distributed nature of the analysis and simulation tasks shared among thousands of scientists worldwide. Every second, each of the two general purpose detectors will filter one billion collisions and record one hundred of them to mass storage, generating data rates of 100 Mbytes per second and several Petabytes per year of raw, processed and simulated data in the early years of operation. The data storage rate is expected to grow in response to the pressures of increased beam intensity, additional physics processes that must be recorded and better storage capabilities, leading to LHC data collections totaling approximately 100 Petabytes by the end of this decade, and up to an Exabyte (1000 Petabytes) by the middle of the following decade. The challenge facing the LHC laboratory and scientific community is how to build an information technology infrastructure that provides these computational and storage resources while enabling their effective and efficient use by thousands of physicists spread across the globe.
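As a rough consistency check of these figures, the sketch below converts the quoted trigger rate into a yearly raw data volume. The event size (roughly 1 MByte) and the effective running time (roughly 1e7 seconds per year) are assumptions made for illustration, not numbers taken from the text.

    # Back-of-the-envelope estimate of the raw data volume for one LHC detector.
    # Assumed inputs (not given in the text): ~1 MByte per stored event and
    # ~1e7 seconds of effective data taking per year.
    events_per_second = 100        # events written to mass storage each second
    event_size_bytes = 1.0e6       # assumed average event size (~1 MByte)
    seconds_per_year = 1.0e7       # assumed effective running time per year

    rate_bytes_per_second = events_per_second * event_size_bytes
    yearly_volume_bytes = rate_bytes_per_second * seconds_per_year

    print(f"data rate: {rate_bytes_per_second / 1e6:.0f} MBytes/s")       # ~100 MBytes/s
    print(f"yearly raw volume: {yearly_volume_bytes / 1e15:.1f} PBytes")  # ~1 PByte of raw data

Adding processed and simulated data on top of this raw volume yields the “several Petabytes per year” quoted above.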

3 Data Grids and Data Intensive Sciences

To develop the argument that Data Grids offer a comprehensive solution to data intensive activities, I first summarize some general features of Grid architectures. Grid technologies comprise a mixture of protocols, services, and tools that are integrated into what is now called “middleware”, reflecting the fact that these new technologies are accessed by “higher level” applications or application tools while they themselves invoke processing, storage, network and other services at “lower” software and hardware levels. Grid middleware includes security and policy mechanisms that work across multiple institutions; resource management tools that support access to remote information resources and simultaneous allocation (“co-allocation”) of multiple resources; general information protocols and services that provide important status information about hardware and software resources, site configurations, and services; and data management tools that locate and transport datasets between storage systems and applications.[3]

Figure 2 illustrates a categorization that we have found useful when explaining the roles played by various Grid technologies. In the Fabric, we have the resources that we wish to share: computers, storage systems, data, catalogs, etc. The Connectivity layer provides communication and authentication services needed to communicate with these resources. Resource protocols (and, as in each layer, associated APIs) negotiate access to individual resources. Collective protocols, APIs, and services are concerned with coordinating the use of multiple resources, and finally application toolkits and applications themselves are defined in terms of services of these various kinds. Other papers present views on necessary components of a Grid architecture [8] and the additional services required within a Data Grid architecture [5].
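As an illustration only, the sketch below records this layering as a simple data structure; the layer names follow the description above, while the example entries in each layer are generic placeholders rather than a normative list.

    # Illustrative sketch of the layered Grid architecture described above.
    # Layer names follow the text; the entries are generic placeholder examples.
    grid_layers = {
        "Application":  ["analysis applications", "application toolkits"],
        "Collective":   ["co-allocation and brokering services", "replica selection",
                         "community-wide policy services"],
        "Resource":     ["protocols/APIs negotiating access to a single resource",
                         "job submission", "storage access"],
        "Connectivity": ["communication protocols", "authentication services"],
        "Fabric":       ["computers", "storage systems", "data", "catalogs", "networks"],
    }

    # A request flows downward: applications invoke collective services, which
    # coordinate resource-level protocols over the connectivity layer to reach
    # the shared fabric.
    for layer, examples in grid_layers.items():
        print(f"{layer:>12}: {', '.join(examples)}")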

While standard Grid infrastructures give distributed scientific communities the ability to collaborate and share resources, Data Grids with additional capabilities must be developed to cope with the specific challenges that arise when scientists access and manipulate very large distributed data collections. These capabilities arise from the need to accommodate various data resources. For example, Data Grids must facilitate access to large tertiary (e.g., tape) and secondary (disk) storage repositories to allow efficient and rapid access to primary data stores, while taking advantage of disk caches that buffer the very large inter-site data flows that will take place. They must also make efficient use of the high performance networks that are critical for the timely completion of these transfers: transporting 10 Terabytes of data to a computational resource in a single day requires a 1 Gigabit per second network operated at essentially 100% utilization. Efficient use of such high network bandwidths also requires special software interfaces and programs that in most cases have yet to be developed.
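The bandwidth figure quoted above follows from simple arithmetic, sketched below; the 10 Terabyte volume and one-day window come from the text, and the rest is unit conversion.

    # Sustained bandwidth needed to move 10 Terabytes in one day.
    data_bytes = 10.0e12          # 10 Terabytes
    transfer_seconds = 86400      # one day

    required_bits_per_second = data_bytes * 8 / transfer_seconds
    print(f"required sustained rate: {required_bits_per_second / 1e9:.2f} Gbit/s")
    # ~0.93 Gbit/s, i.e. a 1 Gbit/s link running at essentially 100% utilization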

4 Data Grid Projects

High energy physics historically has been one of the most data intensive disciplines, principally because of the demands of accelerator-based experiments, where rapid increases in data volume (Gigabytes to Terabytes) and data complexity (thousands of channels to millions of channels) confronted physicists early with the need to filter, collect, store and analyze this data. As a result, experiments and laboratories devoted large resources to data acquisition, processing, storage and networking infrastructures, as well as to human capital in the form of software and computing teams. The rising cost of these and other computing technologies as a fraction of total expenditures, together with the creation of large laboratory-based computing infrastructures and expertise, reflects their critical importance in making new scientific discoveries in this field.

4.1 Particle Physics Data Grid

The Particle Physics Data Grid (PPDG) collaboration is developing, evaluating and delivering vitally needed Grid-enabled tools for data-intensive collaboration in particle and nuclear physics. Novel mechanisms and policies will be vertically integrated with Grid middleware, experiment-specific applications and computing resources to form effective end-to-end capabilities. PPDG is a collaboration of computer scientists with a strong record in distributed computing and Grid technology, and physicists with leading roles in the software and network infrastructures for major high-energy and nuclear experiments. Together they cover the wide spectrum of scientific disciplines and technologies required to bring Grid-enabled data manipulation and analysis capabilities to the desk of every physicist. A three-year program is now underway, taking advantage of the strong driving force provided by currently running physics experiments together with recent advances in Grid middleware. PPDG’s goals and plans are guided by the immediate, medium-term and longer-term needs and perspectives of the LHC experiments ATLAS[4] and CMS[5], which will run for at least a decade from late 2005, and by the research and development agenda of other Grid-oriented efforts. The project exploits the immediate needs of the running experiments BaBar[6], D0[7], STAR[8] and the Jlab[9] experiments to stress-test both concepts and software in return for significant medium-term benefits. PPDG is also actively involved in establishing the necessary coordination between potentially complementary data-grid initiatives in the US, Europe and beyond.

The BaBar experiment faces the challenge of data volumes and analysis needs planned to grow by more than a factor of 20 by 2005. During 2001, the CNRS-funded computer center CCIN2P3 in Lyon, France, will join SLAC in contributing data analysis facilities to the fabric of the collaboration. The STAR experiment at RHIC has already acquired its first data and has identified Grid services as the most effective way to couple the facilities at Brookhaven with its second major center for data analysis at LBNL. An important component of the D0 fabric is the SAM[10] distributed data management system at Fermilab, to be linked to applications at major US and international sites. The LHC collaborations have identified data-intensive collaboratories as a vital component of their plan to analyze tens of Petabytes of data in the second half of this decade. US CMS is developing a prototype worldwide distributed data production system for detector and physics studies.

4.2 GriPhyN

GriPhyN is a large collaboration of information technology (IT) researchers and experimental physicists who aim to provide the IT advances required to enable Petabyte-scale data intensive science in the 21st century. Driving the project are unprecedented requirements for geographically dispersed extraction of complex scientific information from large collections of measured data. To meet these requirements, which arise initially from the four partner physics experiments (ATLAS, CMS, LIGO and SDSS) but will also be fundamental to science and commerce more broadly, the GriPhyN team is pursuing IT advances centered on the creation of Petascale Virtual Data Grids (PVDGs) that meet the data-intensive computational needs of a diverse community of thousands of scientists spread across the globe.

GriPhyN has adopted the concept of virtual data as a unifying theme for its investigations of Data Grid concepts and technologies. The term refers to two related ideas: transparency with respect to location, as a means of improving access performance in terms of speed and/or reliability; and transparency with respect to materialization, as a means of facilitating the definition, sharing and use of data derivation mechanisms. To realize these concepts, GriPhyN conducts research into virtual data cataloging, execution planning, execution management, and performance analysis (see the figure). The results of this research, together with other relevant technologies, are developed and integrated to form a Virtual Data Toolkit (VDT). Successive VDT releases are applied and evaluated in the context of the four partner experiments.
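To make these two forms of transparency concrete, the sketch below uses purely hypothetical names (they do not reflect actual Virtual Data Toolkit interfaces) to show a catalog that resolves a logical data product either to an existing replica or to a recorded derivation that can regenerate it on demand.

    # Hypothetical illustration of virtual data. A logical data product resolves
    # either to an existing replica (location transparency) or to a recorded
    # derivation that can recreate it (materialization transparency). All names
    # here are invented for illustration; they are not VDT interfaces.

    replica_catalog = {
        # logical name -> known physical locations
        "higgs_candidates_v1": ["gsiftp://siteA/store/higgs_v1",
                                "file:///cache/higgs_v1"],
    }

    derivation_catalog = {
        # logical name -> (transformation, inputs) needed to materialize it
        "higgs_candidates_v1": ("select_higgs_like_events", ["raw_run_001"]),
    }

    def resolve(logical_name):
        """Return an existing replica if one is known; otherwise return the
        derivation recipe so a planner can schedule its (re)materialization."""
        replicas = replica_catalog.get(logical_name)
        if replicas:
            return ("fetch", replicas[0])
        return ("derive", derivation_catalog[logical_name])

    print(resolve("higgs_candidates_v1"))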


4.3 European Data Grid

5 Common Infrastructure

References


[1] Formerly known by the name “Computational Grid”, the term “Grid” reflects the fact that the resources to be shared may be quite heterogeneous and have little to do with computing.

[2] I. Foster, C. Kesselman, S. Tuecke, “The Anatomy of the Grid: Enabling Scalable Virtual Organizations”, International Journal of High Performance Computing Applications 15(3), 2001.

[3] This description of general Grid technologies is paraphrased from Ian Foster.

[4] The ATLAS Experiment, A Toroidal LHC ApparatuS,

[5] The CMS Experiment, A Compact Muon Solenoid,

[6] BaBar,

[7] D0,

[8] STAR,

[9] Jlab experiments,

[10] SAM,