DISUN: Creating Cyberinfrastructure for the CMS Experiment and Other Science Communities

Mar. 79, 2005

Table of Contents

B. Project Summary
D. Project Description
D.1. Introduction to DISUN
D.1.1. DISUN Vision
D.1.2. Connection with ATLAS Cyberinfrastructure
D.1.3. Proposed Activities
D.1.4. Outline of Proposal
D.2. The CMS Research Program and its Computing Challenges
D.2.1. CMS Computing Challenges
D.2.2. Tiered Model of Regional Computing and Analysis Centers
D.2.3. The CMS Tier-2 Centers and DISUN
D.2.4. The CMS Tier-2 Centers and Campus Grids
D.2.5. The CMS Tier-2 Centers and Open Science Grid
D.3. Building the DISUN Distributed Computing Facility
D.3.1. Existing and Planned Computing and Storage Resources
D.3.2. Networking Connections and Capabilities
D.4. Creating the Science Data Analysis Framework (SDAF)
D.5. DISUN Work Plan for SDAF and its Extension to Other Communities
D.5.1. Phase 1 (Two Years): Developing and Deploying the Basic Infrastructure
D.5.2. Phase 2 (Eighteen Months): Enhancing Networking and Resource Utilization
D.5.3. Phase 3 (One Year): Highly Functional and Scalable SDAF (Medium LHC Luminosity)
D.5.4. Phase 4 (One Year): Fully Functional Global Analysis (Full LHC Luminosity)
D.6. Leveraging Research Activities and Partnerships
D.6.1. Grid3 and the Open Science Grid
D.6.2. Cooperatively Owned Campus Grids
D.6.3. Collaborative Projects with ATLAS
D.7. Supporting, Training and Expanding the Community
D.7.1. Grid Outreach Activities
D.7.2. Summer Grid Tutorials
D.7.3. CHEPREO Program
D.7.4. International Digital Divide Activities
D.7.5. Other International Education Outreach Activities
D.8. DISUN Management Plan and Deployment of Funds
E. References

B. Project Summary

We propose to collaborate with grid computing researchers, computer scientists and middleware developers to build and operate a Data Intensive Science University Network (DISUN), which will enable over 200 physicists distributed across the US to analyze data from the CMS detector at the Large Hadron Collider (LHC) project at CERN. Such analyses require unprecedented levels of data-intensive computing to search for rare signatures of new physics in the petabyte-per-year data collections that CMS will begin recording in 2007. We propose to operate compute, storage and network facilities and, most importantly, the tools to make these accessible to individual scientists at their local sites. While the basic reconstruction of the events recorded by the detector is performed at CERN (Tier-0) and at the host laboratory in the US (the FNAL Tier-1), the bulk of the data-intensive scientific analysis computing, used directly by physicists, will be located at the university-based Tier-2 computing facilities. The Tier-2 facilities are required to provide at least 5 million SI2000 of CPU power and over a petabyte of storage, all connected with about 10 GB/s of network bandwidth. More importantly, these resources need to function as a single integrated facility in which they can be allocated to individual physicists according to the policies defined by the experiment, while providing opportunities for innovative and exploratory analyses. The CMS physicists and computer scientists at four universities, Caltech, the University of California at San Diego, the University of Florida and the University of Wisconsin, Madison, have joined together in this proposal to build and operate a shared cyberinfrastructure facility that is linked by ultra-high speed optical networks and managed as a well-integrated distributed environment. We expect DISUN to provide approximately 70% of the total US CMS Tier-2 capacity. DISUN will constitute a sufficiently large, complex and realistic operating environment that it will serve as a scalable model for shared cyberinfrastructure that extends to other data-intensive science disciplines, and as a test bed for the development of novel data-intensive distributed computing technology. We therefore propose to integrate DISUN and the other CMS Tier-2 computing sites into the nation-wide Open Science Grid and to bring this shared cyberinfrastructure to the larger data-intensive science community.

We propose to take the next step in cyberinfrastructure integration by creating a multi-disciplinary distributed facility for US physicists participating in the CMS experiment at CERN’s Large Hadron Collider (LHC). Known as DISUN (Data Intensive Science University Network), the Grid-based facility will comprise computing, Grid and personnel resources from four universities, Caltech, the University of California at San Diego, the University of Florida and the University of Wisconsin, Madison, linked by ultra-high speed optical networks. DISUN will constitute the most powerful component of a tier (“Tier-2”) of university resources, serving US-CMS physicists as well as other science and engineering communities with massive computational, storage and data movement requirements.

DISUN will also provide a place where advanced cyberinfrastructure can be developed, tested, evaluated and integrated in Grid-based applications capable of harnessing unprecedented levels of computational, storage, and data transfer capacity. Such applications will be necessary to achieve the full scientific potential of the LHC, a multi-billion dollar facility currently under construction at the CERN laboratory in Europe (with over $0.5B in US investment) and expected to operate for decades. US physicists, who work remotely from the LHC laboratory, particularly need advanced cyberinfrastructure to seamlessly integrate their university-based computational and storage resources with those of national laboratories so that they can participate effectively in the LHC research program.

The equipment acquired as part of this proposal will be installed and operated at the four DISUN sites, totaling about 1000 dual-CPU nodes and about 100 storage servers. We have also included funding for dedicated personnel to design, install and operate these facilities, and to provide support services for scientists. While providing this critical cyberinfrastructure for the CMS experiment, DISUN will also focus on developing general techniques and tools to integrate university-based scientists and facilities with international computing and data storage facilities. A major deliverable will be the provision of a “Science Data Analysis Framework” (SDAF) that will simplify how scientists integrate their applications and tools into a shared distributed computing environment that dramatically increases their capabilities and facilitates collaboration. This work will be coordinated closely with physicists from ATLAS, the other major LHC experiment, who are submitting a similar proposal to develop cyberinfrastructure for their research program. This proposal also includes the software engineers who will work closely with the computer science community to ensure that the solutions DISUN develops are applicable to the data-intensive science community at large. An important facet of our effort will be to integrate our CMS-specific cyberinfrastructure with the resources and scientific communities of extensive campus grids (e.g., GLOW), bringing the wider data-intensive science community into this shared cyberinfrastructure.

The intellectual merit of our proposal lies in the distributed computing and software cyberinfrastructure that we propose to develop, deploy and operate. Integrating advanced middleware and distributed compute, storage and networking capabilities into a dependable cyberinfrastructure will require the development of novel service-oriented frameworks, software tools and operational procedures. By exposing the proposed infrastructure to a large, diverse and demanding community of users, we will uncover new challenges in the areas of resource and software management that will need to be addressed and solved. Evolving a distributed and heterogeneous production infrastructure will require the development of new middleware distribution tools and procedures. Scientists exposed to the capabilities of the proposed infrastructure will require help in transforming their applications and computing methodology to benefit from the services it offers. By interfacing with other campus-based cyber communities, DISUN will advance the technologies and methodologies needed to integrate campuses across the US into a national cyberinfrastructure. Specifically, DISUN will improve and support the physics research capability of the CMS experiment by building a distributed scientific facility from state-of-the-art middleware and the computational, network and storage resources at the participating institutes. More generally, this distributed scientific facility will serve as a powerful and reliable resource for scientists from other disciplines, who will be encouraged and helped to make use of it. DISUN will especially benefit scientific researchers working with multi-terabyte to petabyte distributed datasets or requiring significant aggregate CPU power.

DISUN’s broad impact will be a measurable improvement in the scientific computing capabilities of university-based researchers. For such scientists, the services provided by this cyberinfrastructure will dramatically enhance their ability to mobilize distributed computing resources to attack new research problems, strengthening their participation in national and international research projects. While US-CMS physicists will initially evaluate and improve the DISUN cyberinfrastructure, it will be fully integrated with the Open Science Grid to benefit a wide community of scientists and engineers, whom we will assist with application integration. The software tools DISUN integrates into the SDAF will be distributed via the Virtual Data Toolkit to facilitate even wider adoption. DISUN will also work with several partnering outreach projects to train scientists and students in using advanced information technologies.

D. Project Description

D.1. Introduction to DISUN

D.1.1. DISUN Vision

(Copy from Project Summary)

D.1.2. Connection with ATLAS Cyberinfrastructure

The scope and activities in this proposal have been developed in coordination with members of the other major LHC experiment, ATLAS. Although our two collaborations are scientific rivals and must build distinct computing infrastructures to satisfy separate constituencies, we clearly understand that we share the same daunting computing challenges and that both collaborations can benefit by pooling resources, sharing Grid middleware and agreeing on Grid software standards. This understanding has led to a close working relationship since 1999. For example, we have jointly commissioned grid middleware from Condor and Globus and participate together in a number of major Grid initiatives, including PPDG (1999-2006), GriPhyN (2000-2005), iVDGL (2001-2007) and, most recently, UltraLight (2005-2009). We have demonstrated extensive shared use of facilities as part of the Grid3 national infrastructure, where resource sharing between the experiments dramatically enhanced recent data challenge exercises. We are currently working together (along with computer scientists and scientists from other disciplines) in the creation of the Open Science Grid, due to come online in early 2005. Similarly, both US collaborations are working to ensure compatibility with the LHC Computing Grid (LCG), primarily through our joint activities in the Open Science Grid. These and numerous smaller examples demonstrate our commitment to work with our ATLAS colleagues in making Tier-2 computing and the national cyberinfrastructure program a success.

D.1.3. Proposed Activities

Our proposed activities are organized into three main thrusts, described in more detail in the following sections.

  • Build and operate production-quality computing infrastructure supporting leading CMS analyses: The linked resources of the four DISUN universities will form a persistent production infrastructure (5 million SI2000 of CPU and 1 petabyte of storage) for CMS as well as OSG, especially for applications that create, analyze, store, and share large volumes of data.
  • Develop, integrate and drive shared cyberinfrastructure usage: Roughly 10-15% of the DISUN hardware will be dedicated to integration testbeds. Our testing and integration efforts will help drive the development and deployment of a new generation of cyberinfrastructure. For scientists to easily and effectively use the shared cyberinfrastructure for their analyses, we envision the development of the Science Data Analysis Framework (SDAF); a simple sketch of the kind of job submission the SDAF would automate follows this list. This development will be accelerated by the advanced capabilities of our computing infrastructure (particularly our unique advanced network capabilities) and the resources in our existing NSF- and DOE-funded projects.
  • Support, train and expand the community: Our primary channel for collaboration with broader data-intensive science communities will be our campus cyber communities and the Open Science Grid, where DISUN will actively foster exchanges and interactions among CMS physicists and other partnering scientists, engineers and computer scientists to develop OSG as a nation-wide shared cyberinfrastructure facility. DISUN members will also provide (in collaboration with partner projects) training and education to researchers and students in specific targeted user communities at our respective campuses, providing them with educational materials, documentation and extensive test facilities. We will decrease the entry costs for new partners by working closely with cluster management software providers, e.g. the UCSD Rocks [37] team, on incorporating core OSG service implementations (e.g. Condor and SRM).
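
To make the intended usage concrete, the following is a minimal, hypothetical sketch (in Python) of the kind of Condor job submission that the SDAF would prepare on a physicist’s behalf; the executable name, dataset path and job count are illustrative assumptions, not parameters taken from this proposal.

```python
# Minimal sketch: generate an HTCondor submit description for a batch of
# CMS analysis jobs of the kind the SDAF would prepare and hand off to the
# shared DISUN/OSG resources. All job-specific names below are hypothetical.

N_JOBS = 50  # hypothetical number of jobs, one per slice of the dataset

submit_description = f"""\
universe   = vanilla
executable = run_cms_analysis.sh
arguments  = --dataset /store/user/example_dataset --slice $(Process)
output     = analysis_$(Cluster)_$(Process).out
error      = analysis_$(Cluster)_$(Process).err
log        = analysis_$(Cluster).log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue {N_JOBS}
"""

with open("cms_analysis.sub", "w") as f:
    f.write(submit_description)

# The user (or the SDAF acting for them) would then submit the batch with:
#   condor_submit cms_analysis.sub
```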

D.1.4. Outline of proposal

The rest of the proposal proceeds as follows. Section 2 first outlines the scale and scientific goals of the LHC facility and the CMS experimental collaboration, followed by a discussion of CMS’ tremendous computing challenges and the hierarchical global Data Grid cyberinfrastructure that has been proposed to address them. Section 3 describes the existing and planned computational, storage and networking infrastructure at the participating DISUN universities. Section 4 describes in detail the Science Data Analysis Framework (SDAF), the main DISUN deliverable, and Section 5 outlines a five-year work plan for implementing this CMS cyberinfrastructure and extending it to other science communities. Section 6 describes the broad array of research activities and partnerships that DISUN will leverage in creating an advanced cyberinfrastructure for CMS, while related training, education and outreach activities are covered in Section 7. Finally, Section 8 summarizes our management plan, including how the NSF funds will be disbursed.

D.2. The CMS Research Program and its Computing Challenges

The colliding proton beams produced by the LHC will open a new regime in particle physics up to and beyond the TeV energy scale, made accessible for the first time by a combination of a proton-proton collision energy of 14TeV and a “luminosity” corresponding to 1 billion head-on collisions per second. The unprecedented energy range and luminosity of this new particle accelerator, combined with the special capabilities of the CMS (Compact Muon Solenoid) experiment for particle detection and measurement, are expected to lead to discoveries of new elementary particles and novel behaviors of fundamental forces, as well as new precision measurements of particle decay processes. These achievements could have revolutionary effects on our understanding of the unification of forces, the origin and stability of matter, the ultimate underpinnings of the observable universe, and the nature of space-time itself. The LHC, which is now being constructed by a global collaboration of thousands of physicists and engineers, will commence operations in 2007 with a scientific program that will continue for decades.

D.2.1. CMS Computing Challenges

Experiments at the LHC face unprecedented computing challenges, arising primarily from the rapidly increasing size and complexity of the datasets that will be collected and from the enormous computational, storage and networking resources that must be made available to their global collaborations in order to process, distribute and analyze those datasets. For example, CMS, one of the two very large experiments at the LHC (the other is ATLAS), will in the early years of operation record events[1] (1.5 MB/event) at a rate of approximately 150 Hz, accumulating over 200 MB/sec of raw data and tens of petabytes (PB) per year of raw and processed data that must be analyzed by 2,500 physicists from 160 institutes in 40 countries. The CMS data storage rate is expected to grow in response to the pressures of higher luminosity, searches for additional physics processes and better storage capabilities, leading to data collections of more than a hundred PB within a few years and an exabyte (1,000 PB) by the end of the decade. The computational resources required to process this data are similarly vast. The data collections, augmented by 1–2 billion events collected per year, will initially require 40M SpecInt2000 of CPU power for reconstruction, simulation and analysis, but this requirement will rapidly increase. Comparable challenges confront the ATLAS experiment.
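
As a rough check of these figures, the following back-of-the-envelope sketch reproduces the raw data rate and annual raw data volume from the event size and trigger rate quoted above; the assumed 10^7 live seconds of data taking per year is a conventional planning figure, not a number taken from this proposal.

```python
# Back-of-the-envelope estimate of the CMS raw data rate and annual volume.
event_size_mb = 1.5          # MB per recorded event (quoted above)
trigger_rate_hz = 150        # events recorded per second (quoted above)
live_seconds_per_year = 1e7  # assumed live time per year (conventional figure)

raw_rate_mb_s = event_size_mb * trigger_rate_hz              # ~225 MB/s
raw_volume_pb = raw_rate_mb_s * live_seconds_per_year / 1e9  # ~2.25 PB/year of raw data

print(f"raw data rate:   {raw_rate_mb_s:.0f} MB/s")
print(f"raw data volume: {raw_volume_pb:.2f} PB/year "
      f"(raw only; processed and simulated data add several times this)")
```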

The total scale of CMS computing in the U.S., which includes approximately 25% of the collaboration, is expected to require by 2007 multi-petabyte disk caches, roughly 10M SpecInt2000 units of CPU power, and 10 Gb/sec optical networks between sites in order to fully exploit the physics potential of the experiment. This large-scale computing infrastructure, part of which we aim to develop and deploy through this proposal, will enable our CMS physicists to perform the team-based data analyses required to achieve scientific results, and thus fulfill the promise of the overall CMS scientific program.
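
To put the 10 Gb/sec requirement in perspective, here is a hedged back-of-the-envelope estimate of how long it would take to replicate a large analysis dataset between sites; the 100 TB dataset size is a hypothetical example, and a fully utilized link is an optimistic assumption.

```python
# Rough transfer-time estimate for moving a large dataset over a 10 Gb/s link.
# Assumes the link is fully and continuously utilized, which is optimistic.
link_gbps = 10    # site-to-site optical link capacity (quoted above)
dataset_tb = 100  # hypothetical dataset to replicate to a Tier-2 site

transfer_seconds = (dataset_tb * 1e12 * 8) / (link_gbps * 1e9)
print(f"{dataset_tb} TB over {link_gbps} Gb/s: "
      f"~{transfer_seconds / 3600:.0f} hours (~{transfer_seconds / 86400:.1f} days)")
```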

It is worth noting that while the challenges confronting the LHC collaborations are especially daunting, an increasing number of enterprises and projects are encountering similar barriers to their efficient operation as they grow in size and rely more heavily on massive shared collections of measured and derived data and on shared computing resources. These enterprises include scientific collaborations (digital astronomy, gravitational wave searches, genomics and proteomics research), engineering projects (space exploration, earthquake simulations), medical teams (distributed medical imaging), governmental organizations (state extension programs, federal agencies) and corporations (particularly multinational companies). Thus computing infrastructures that provide effective resource sharing and collaboration across significant geographic and organizational boundaries will have wide applicability.