LLNL-PROP-652542-DRAFT

FastForward 2 R&D

Draft Statement of Work

April 3, 2014

This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.

Contents

1 INTRODUCTION
2 ORGANIZATIONAL OVERVIEW
2.1 The Department of Energy Office of Science
2.1.1 Advanced Scientific Computing Research Program
2.2 National Nuclear Security Administration
2.2.1 Advanced Simulation and Computing Program
3 MISSION DRIVERS
3.1 Office of Science Drivers
3.2 National Nuclear Security Administration Drivers
4 EXTREME-SCALE TECHNOLOGY CHALLENGES
4.1 Power Consumption and Energy Efficiency
4.2 Concurrency
4.3 Fault Tolerance and Resiliency
4.4 Memory Technology
4.5 Programmability
5 APPLICATION CHARACTERISTICS
6 ROLE OF CO-DESIGN
6.1 Overview
6.2 ASCR Co-Design Centers
6.3 ASC Co-Design Project
6.4 Proxy Apps
7 REQUIREMENTS
7.1 Description of Requirement Categories
7.2 Requirements for Research and Development Investment Areas
7.3 Common Mandatory Requirements
7.3.1 Solution Description (MR)
7.3.2 Research and Development Plan (MR)
7.3.3 Technology Demonstration (MR)
7.3.4 Productization Strategy (MR)
7.3.5 Staffing/Partnering Plan (MR)
7.3.6 Project Management Methodology (MR)
7.3.7 Intellectual Property Plan (MR)
7.3.8 Coordination with Current Research (MR)
8 EVALUATION CRITERIA
8.1 Evaluation Team
8.2 Evaluation Factors and Basis for Selection
8.3 Performance Features
8.4 Feasibility of Successful Performance
8.5 Supplier Attributes
8.5.1 Capability
8.6 Price of Proposed Research and Development
ATTACHMENT 1: NODE ARCHITECTURE RESEARCH AND DEVELOPMENT REQUIREMENTS
A1-1 Key Challenges for Node Architecture Technologies
A1-1.1 Component Integration
A1-1.2 Energy Utilization
A1-1.3 Resilience and Reliability
A1-1.4 On-Chip and Off-Chip Data Movement
A1-1.5 Concurrency
A1-1.6 Programmability and Usability
A1-2 Areas of Interest
A1-2.1 Component Integration
A1-2.2 Energy Utilization
A1-2.3 Resilience and Reliability
A1-2.4 On-Chip and Off-Chip Data Movement
A1-2.5 Concurrency
A1-2.6 Programmability and Usability
A1-3 Performance Metrics (MR)
A1-4 Mandatory Requirements
A1-4.1 Overall Node Design (MR)
A1-5 Target Requirements
A1-5.1 Component Integration (TR-1)
A1-5.2 NIC Integration (TR-2)
A1-5.3 Energy Utilization (TR-1)
A1-5.4 Resilience and Reliability (TR-1)
A1-5.5 On-Chip Data Movement (TR-2)
A1-5.6 Processing Near Memory (TR-2)
A1-5.7 Programmability and Usability: Hardware (TR-1)
A1-5.8 Programmability and Usability: Software (TR-1)
A1-5.9 System Integration Strategy (TR-1)
ATTACHMENT 2: MEMORY TECHNOLOGY RESEARCH AND DEVELOPMENT REQUIREMENTS
A2-1 Key Challenges for Memory Technology
A2-1.1 Energy Consumption
A2-1.2 Memory Bandwidth and Latency
A2-1.3 Memory Capacity
A2-1.4 Reliability
A2-1.5 Error Detection, Correction, and Reporting
A2-1.6 Processing in Memory
A2-1.7 Integration of NVRAM Technology
A2-1.8 Ease of Programmability
A2-1.9 New Abstractions for Memory Interfaces and Operations
A2-1.10 Integration of Optical and Electrical Technologies
A2-2 Areas of Interest
A2-3 Performance Metrics (MR)
A2-3.1 DRAM Performance Metrics
A2-4 Multivendor Integration Strategy (MR)
A2-5 Target Requirements
A2-5.1 Energy per Bit
A2-5.2 Aggregate Delivered DRAM Bandwidth
A2-5.3 Memory Capacity per Socket
A2-5.4 FIT Rate per Node
A2-5.5 Error Detection Coverage and Reporting
A2-5.6 Advanced Processing in Memory Capabilities
A2-5.7 NVRAM Performance Metrics
A2-5.8 Multivendor Integration Strategy

1  INTRODUCTION

The Department of Energy (DOE) has a long history of deploying leading-edge computing capability for science and national security. Going forward, DOE’s compelling science, energy assurance, and national security needs will require a thousand-fold increase in usable computing power, delivered as quickly and energy-efficiently as possible. Those needs, and the ability of high performance computing (HPC) to address other critical problems of national interest, are described in reports from the ten DOE Scientific Grand Challenges Workshops[1] that were convened in 2008–2010. A common finding across these efforts is that scientific simulation and data analysis requirements are exceeding petascale capabilities and rapidly approaching the need for exascale computing. However, workshop participants also found that due to projected technology constraints, current approaches to HPC software and hardware design will not be sufficient to produce the required exascale capabilities.

In April 2011 a Memorandum of Understanding was signed between the DOE Office of Science (SC) and the DOE National Nuclear Security Administration (NNSA), Office of Defense Programs, regarding the coordination of exascale computing activities across the two organizations. This led to the formation of a consortium that includes representation from seven DOE laboratories: Argonne National Laboratory, Lawrence Berkeley National Laboratory, Lawrence Livermore National Laboratory, Los Alamos National Laboratory, Oak Ridge National Laboratory, Pacific Northwest National Laboratory, and Sandia National Laboratories.

Funding for the DOE Exascale Computing Initiative has not yet been secured, but DOE has compelling real-world challenges that will not be met by existing vendor roadmaps. In response to these challenges, DOE SC and NNSA initiated an R&D program called FastForward that established partnerships with multiple companies to accelerate the R&D of critical technologies needed for extreme-scale computing. FastForward funded five companies (two of which have since merged into one) starting in July 2012. With the initial two-year FastForward program coming to an end, DOE SC and NNSA are planning a follow-on program called FastForward 2. This new program will focus on two areas: Node Architecture and Memory Technology. The timeframe for productization of the resulting Node Architecture and Memory Technology projects is 2020–2023. Node Architecture proposals for near-term product development that does not meet exascale needs are not in scope.

The Node Architecture focus area broadens the previous FastForward focus on Processors to include the entire architecture of a compute node. Both the node hardware and any necessary enabling software are in scope. A Node Architecture research proposal may also encompass several related research areas. For example, if novel runtime techniques or programming models are needed to make a new node architecture usable, research into these technologies could be included in a proposal. (However, a software-only proposal would not be in scope.)

The Memory Technology focus area includes technologies that could be used in multiple vendors’ systems. Memory technologies that are an integral part of a proprietary node design should be proposed in the Node Architecture focus area. Processor-in-memory (PIM) research may be proposed in the Memory Technology focus area if the resulting technologies could be used in multiple vendors’ node designs.

Vendors currently funded under FastForward may propose follow-on research under FastForward 2, and DOE also welcomes new research areas and new vendors for this program.

FastForward 2 seeks to fund innovative new or accelerated R&D of technologies targeted for productization in 5–8 years. The period of performance for any subcontract resulting from this request for proposal (RFP) will be approximately 27 months and end on November 1, 2016.

The consortium is soliciting innovative R&D proposals in Node Architecture and advanced Memory Technology that will maximize energy and computational efficiency while increasing the performance, productivity, and reliability of key DOE extreme-scale applications. The proposed technology roadmaps could have disruptive and costly impacts on the development of DOE applications and the productivity of DOE scientists. Therefore, proposals submitted in response to this solicitation should address the impact of the proposed R&D on both DOE extreme-scale mission applications and the broader HPC community. Offerors are expected to leverage the DOE SC and NNSA Co-Design Centers to ensure solutions are aligned with DOE needs. While DOE’s extreme-scale computer requirements are a driving factor, these projects should also exhibit the potential for technology adoption by broader segments of the market outside of DOE supercomputer installations. This public-private partnership between industry and DOE will aid the development of technology that reduces economic and manufacturing barriers to building systems that deliver exascale performance, and it will also further DOE’s goal that the selected technologies have the potential to impact low-power embedded, cloud/datacenter, and midrange HPC applications. This ensures that DOE’s investment furthers a sustainable software/hardware ecosystem supported by applications across not only HPC but also the broader IT industry. This breadth will increase the consortium’s ability to leverage commercial developments. The consortium does not intend to fund the engineering of near-term capabilities that are already on existing product roadmaps.

2  ORGANIZATIONAL OVERVIEW

2.1  The Department of Energy Office of Science

The SC is the lead Federal agency supporting fundamental scientific research for energy and the Nation’s largest supporter of basic research in the physical sciences. The SC portfolio has two principal thrusts: direct support of scientific research and direct support of the development, construction, and operation of unique, open-access scientific user facilities. These activities have wide-reaching impact. SC supports research in all 50 States and the District of Columbia, at DOE laboratories, and at more than 300 universities and institutions of higher learning nationwide. The SC user facilities provide the Nation’s researchers with state-of-the-art capabilities that are unmatched anywhere in the world.

2.1.1  Advanced Scientific Computing Research Program

Within SC, the mission of the Advanced Scientific Computing Research (ASCR) program is to discover, develop, and deploy computational and networking capabilities to analyze, model, simulate, and predict complex phenomena important to the DOE. A particular challenge of this program is fulfilling the science potential of emerging computing systems and other novel computing architectures, which will require numerous significant modifications to today's tools and techniques to deliver on the promise of exascale science.

2.2  National Nuclear Security Administration

The NNSA is responsible for the management and security of the nation’s nuclear weapons, nuclear non-proliferation, and naval reactor programs. It also responds to nuclear and radiological emergencies in the United States and abroad.

2.2.1  Advanced Simulation and Computing Program

Established in 1995, the Advanced Simulation and Computing (ASC) Program supports the NNSA Stockpile Stewardship Program’s shift in emphasis from test-based confidence to simulation-based confidence. Under ASC, simulation and computing capabilities are developed to analyze and predict the performance, safety, and reliability of nuclear weapons and to certify their functionality. Modern simulations on powerful computing systems are key to supporting the U.S. national security mission. As the nuclear stockpile moves further from the nuclear test base through either the natural aging of today’s stockpile or introduction of component modifications, the realism and accuracy of ASC simulations must further increase through development of improved physics models and methods requiring ever greater computational resources.

3  MISSION DRIVERS

3.1  Office of Science Drivers

DOE’s strategic plan calls for promoting America’s energy security through reliable, clean, and affordable energy; ensuring America’s nuclear security; strengthening U.S. scientific discovery and economic competitiveness; and improving quality of life through innovations in science and technology. In support of these themes is DOE’s goal to advance simulation-based scientific discovery significantly. This goal includes the objective to “provide computing resources at the petascale and beyond, network infrastructure, and tools to enable computational science and scientific collaboration.” All other research programs within the SC depend on ASCR to provide the advanced facilities needed as the tools for computational scientists to conduct their studies.

Between 2008 and 2010, program offices within the DOE held a series of ten workshops to identify critical scientific and national security grand challenges and to explore the impact that exascale modeling and simulation will have on these challenges. The extreme-scale workshops documented the need for integrated mission and science applications, systems software and tools, and computing platforms that can solve billions, if not trillions, of equations simultaneously. The platforms and applications must access and process huge amounts of data efficiently and run ensembles of simulations to help assess uncertainties in the results. New simulation capabilities, such as cloud-resolving earth system models and multi-scale materials models, can be effectively developed for and deployed on exascale systems. The petascale machines of today can perform some of these tasks in isolation or in scaled-down combinations (for example, ensembles of smaller simulations). However, the computing goals of many scientific and engineering domains of national importance cannot be achieved without exascale (or greater) computing capability.

3.2  National Nuclear Security Administration Drivers

Maintaining the reliability, safety, and security of the nation’s nuclear deterrent without nuclear testing relies upon the use of complex computational simulations to assess the stockpile, to investigate basic weapons physics questions that cannot be investigated experimentally, and to provide the kind of information that was once gained from underground experiments. As weapon systems age and are refurbished, the state of systems in the enduring stockpile drifts from the state of weapons that were historically tested. In short, simulation is now used in lieu of testing as the integrating element. The historical reliance upon simulations of specific weapons systems tuned by calibration to historical tests will not be adequate to support the range of options and challenges anticipated by the mid-2020s, by which time the stewardship of the stockpile will need to rely on a science-based predictive capability.

To maintain the deterrent, the U.S. Nuclear Posture Review (NPR) insists that “the full range of Life Extension Program (LEP) approaches will be considered: refurbishment of existing warheads, reuse of nuclear components from different warheads, and replacement of nuclear components.” In addition, as the number of weapons in the stockpile is reduced, the reliability of the remaining weapons becomes more important. By the mid-2020s, the stewardship of the stockpile will need to rely on a science-based predictive capability to support the range of options with sufficient certainty as called for in the NPR. In particular, existing computational facilities and applications will be inadequate to meet the demands for the required technology maturation for weapons surety and life extension by the middle of the next decade. Evaluation of anticipated surety options is raising questions for which there are shortcomings in our existing scientific basis. Correcting those shortcomings will require simulation of more detailed physics to model material behavior at a more atomistic scale and to represent the state of the system. This requirement pushes the need for computational capability into the exascale level.