UCAR RFP R17-27363 Attachment 1, NWSC-2a Technical Specifications (v1)

UCAR RFP R17-27363 NWSC-2a Technical Specifications

Contents

1 Introduction

2 NWSC-2a System: Data Analysis, Visualization and Machine/Deep Learning

3 NWSC-2a System Description

3.1 NWSC-2a System Hardware

3.1.1 Nodes

3.1.2 Graphics Adapters, GPGPUs and NVMe Drives

3.1.3 System Interconnectivity

3.1.4 Administrative Infrastructure

3.1.5 System Racks

3.2 System Environmental Attributes

3.3 NWSC Facility Information

4 NWSC-2a System Maintenance and Support Services

4.1 Project and Risk Management

4.2 Facility and Installation Planning

4.3 System Assembly and Factory Test

4.4 System Delivery and Installation

4.5 System Acceptance Testing

4.6 Hardware Warranty/Maintenance

4.7 Problem Reporting, Isolation, Repair and Resolution Services

5 Options

6 Reference Information

1 Introduction

The technical attributes of the NWSC-2a resource are provided in this Attachment 1 to UCAR RFP R17-27363. The University Corporation for Atmospheric Research (UCAR) is issuing this RFP on behalf of the Computational Information Systems Laboratory (CISL) of the National Center for Atmospheric Research (NCAR), for the acquisition of a next-generation Data Analysis and Visualization system (DAV), herein referred to as NWSC-2a. The anticipated acquisition, delivery and acceptance schedule is provided in the RFP document, Section 2.22.

This RFP is for the acquisition of the NWSC-2a system’s hardware and maintenance services. CISL will install and maintain the NWSC-2a system’s software stack.

Prospective Offerors are invited to respond to these technical specifications per the instructions in the RFP document, Sections 2.7 and 2.7.2. Technical information regarding the Offeror’s proposed solution shall be provided in the Technical Volume of the Offeror’s Proposal; the section and item numbering of this Attachment should be retained in the Offeror’s response to facilitate UCAR’s evaluation. Pricing information shall be provided in the Business/Price Volume of the Offeror’s Proposal.

2 NWSC-2a System: Data Analysis, Visualization and Machine/Deep Learning

Data analysis, visualization, post-processing and Machine Learning/Deep Learning (ML/DL) workloads require substantial computational and memory resources, as well as excellent bandwidth to data storage resources.

The following subsections describe the general attributes CISL anticipates for the NWSC-2a system and its requisite connectivity with the NCAR Wyoming Supercomputing Center’s (NWSC)[1] computational and data storage environment[3].

The NWSC-2a system is intended to be a production data analysis, visualization and ML/DL resource for the NCAR community. This system will replace NCAR’s current DAV systems, Geyser and Caldera[2]; be used primarily to process and visualize data generated by the Cheyenne supercomputer system[4][5] and for the evaluation and development of Machine Learning/Deep Learning (ML/DL) analysis applications in the geosciences; and be connected to and interoperate with NCAR’s GLADE file systems[6].

The NWSC-2a system is envisioned to be a cluster of stateful nodes which may, on occasion, run multi-node MPI-based applications and be integrated with the Cheyenne[4][5] system’s InfiniBand fabric. The system shall not have data storage, except for the specified node-local SSD storage and node-local disk needed for system administration, management, maintenance and operation.

3 NWSC-2a System Description

The NWSC-2a system will be installed in the NCAR-Wyoming Supercomputing Center (NWSC)[1] and will replace NCAR’s existing Data Analysis and Visualization (DAV) systems[2]. Additional information regarding the NWSC computational environment can be found in the references provided in Section 6.

3.1 NWSC-2a System Hardware

The following subsections describe the hardware attributes of the NWSC-2a system.

3.1.1 Nodes

The NWSC-2a system shall be comprised of twenty-four nodes, each containing:

  1. Two Intel Xeon E5 v4 (or later) processors, each with at least 18 cores
  2. A dual port 10 Gb Ethernet adapter with SFP+ ports (e.g. Mellanox MCX4121A-XCAT ConnectX-4, or similar)
  3. A dual port Mellanox EDR InfiniBand adapter accommodating standard QSFP28 connectors on a PCIe Gen 3.0 x16 (or faster) bus with support for VPI (e.g. Mellanox MCX456A-ECAT ConnectX VPI InfiniBand Host Bus Adapter, or similar)
  4. IPMI compatible Out-of-Band Management or LOM (BMC)
  5. The ability to accommodate the graphics and GPGPU cards specified in Section 3.1.2 with dedicated PCIe lanes (x16 or faster)
  6. The ability to accommodate the NVMe SSD drives specified in Section 3.1.2

Of the twenty-four nodes,

  1. twenty-two nodes shall be “small-memory nodes” configured with at least 512 GBytes of DDR4-2400 (or faster) memory (i.e. sized and performance-optimized for the number of memory channels supported by the supplied processor), and
  2. two nodes shall be “large-memory nodes” configured with at least 1 TB of DDR4-2400 (or faster) memory (i.e. sized and performance-optimized for the number of memory channels supported by the supplied processor).
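
The minimum aggregate resources implied by the node counts above can be tallied as in the following sketch. It assumes that 1 TB is counted as 1024 GBytes, which the specification does not state; the variable names are illustrative only.

```python
# Minimum aggregate resources implied by Section 3.1.1.
# Assumption: 1 TB is counted as 1024 GBytes; adjust if decimal units are intended.
SOCKETS_PER_NODE = 2
MIN_CORES_PER_SOCKET = 18

# node class -> (node count, minimum memory per node in GBytes)
NODE_CLASSES = {
    "small-memory": (22, 512),
    "large-memory": (2, 1024),
}

total_nodes = sum(count for count, _ in NODE_CLASSES.values())
min_cores = total_nodes * SOCKETS_PER_NODE * MIN_CORES_PER_SOCKET
min_memory_gb = sum(count * mem for count, mem in NODE_CLASSES.values())

print(f"nodes: {total_nodes}")
print(f"minimum cores: {min_cores}")
print(f"minimum memory: {min_memory_gb} GBytes")
```

These figures are lower bounds only; a proposal with later processors or more memory per node would exceed them.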

The InfiniBand and Ethernet adapters should use standard PCIe edge connectors and be supplied with vendor firmware that is upgradable by CISL and does not require custom or integrator-provided firmware or PSIDs.

3.1.2 Graphics Adapters, GPGPUs and NVMe Drives

The NWSC-2a system shall include:

  1. Sixteen NVIDIA GeForce Titan X, or comparable, graphics cards (two in each of any eight small-memory nodes)
  2. Sixteen NVIDIA Tesla V100 GPUs (four in each of the two large-memory nodes and four in each of any two small-memory nodes), with support for optimal and scalable GPUDirect communication between GPUs; NVLink is preferable if available
  3. Twenty-four NVMe SSDs, each with a capacity of at least 2 TB (one in each node)
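
As a cross-check, the per-node placements in the list above can be summed to confirm that they add up to the stated totals. This is a minimal sketch; the table layout and names are illustrative and not part of the specification.

```python
# Tally of the accelerators and NVMe drives listed in Section 3.1.2.
# Each entry: (item, units per node, number of nodes, node class)
PLACEMENT = [
    ("GeForce Titan X (or comparable)", 2, 8, "small-memory"),
    ("Tesla V100", 4, 2, "large-memory"),
    ("Tesla V100", 4, 2, "small-memory"),
    ("NVMe SSD", 1, 24, "any"),
]

totals = {}
for item, per_node, nodes, _node_class in PLACEMENT:
    totals[item] = totals.get(item, 0) + per_node * nodes

MIN_NVME_TB = 2  # minimum capacity per NVMe SSD
min_nvme_capacity_tb = totals["NVMe SSD"] * MIN_NVME_TB

for item, count in totals.items():
    print(f"{item}: {count}")
print(f"minimum aggregate NVMe capacity: {min_nvme_capacity_tb} TB")
```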

3.1.3 System Interconnectivity

The NWSC-2a system shall include:

  1. Twenty-five 30-meter QSFP28 EDR optical cables

Each node of the NWSC-2a system will be connected into Cheyenne’s Enhanced Hypercube EDR InfiniBand fabric via a direct connection to the HPE/SGI EDR Premium switches. Details of Cheyenne’s InfiniBand fabric are provided in Attachment 1A of this RFP. If the Offeror envisions any issue with this plan, it should be addressed in the Technical Volume of the proposal.
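
This section does not state how the twenty-five cables are allocated. One plausible reading, treated here as an assumption, is one cable for each of the twenty-four nodes of Section 3.1.1 plus one for the administration node of Section 3.1.4. The sketch below works through that arithmetic.

```python
# Assumption (not stated in the specification): one EDR cable per node,
# including the administration node of Section 3.1.4.
COMPUTE_NODES = 24
ADMIN_NODES = 1
PORTS_PER_ADAPTER = 2   # dual-port EDR adapter, Section 3.1.1
CABLES_SUPPLIED = 25

nodes = COMPUTE_NODES + ADMIN_NODES
cables_per_node = CABLES_SUPPLIED / nodes
unused_ports = nodes * PORTS_PER_ADAPTER - CABLES_SUPPLIED

print(f"cables per node: {cables_per_node:g}")
print(f"adapter ports left uncabled: {unused_ports}")
```

Under this reading, the second port of each dual-port adapter is left uncabled by the equipment requested in this section.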

CISL will provide Ethernet connectivity and infrastructure for the NWSC-2a system, independent of this RFP.

3.1.4 Administrative Infrastructure

The NWSC-2a system shall include an administrative infrastructure containing:

  1. One node, identical to those specified in Section 3.1.1, but with at least 256 GBytes of DDR4-2400 (or faster) memory, for use as a system administration and management node.
  2. At least 2 TB of additional usable disk storage for administrative file system(s) with sufficient resilience and/or redundancy to survive at least one drive failure without downtime.
  3. A dedicated administrative network for cluster management, stateless booting and IPMI communications.

3.1.5 System Racks

The NWSC-2a system shall include:

  1. One or more racks (i.e. a sufficient number to accommodate all NWSC-2a system hardware and provide at least 10RU of unused expansion space). CISL prefers standard 19” air-cooled racks, but can accommodate other rack footprints and liquid cooling for systems with high power-density.
  2. Each rack shall be provided with integrated Intelligent Power Distribution Units (IPDUs) which allow power to be remotely turned on/off for each node and/or switch comprising the system, and per socket monitoring of volts, amps, and watts via remote Ethernet.
  3. The IPDUs shall accept 480/277v and 208/120v 3-phase 60Hz input power.
  4. The system shall be provided with 6-meter power cables with approved Hubbell plugs that are the appropriate configuration for the rated voltage/amperage.

The Offeror shall consult with and obtain the approval of CISL on the power cables and plugs, any custom cooling infrastructure, any component labeling schema, and the preferred and optimal layout of equipment in the rack(s) prior to assembly.

3.2 System Environmental Attributes

The Offeror shall supply, as part of the response to these specifications,

  1. an assembly diagram indicating the planned layout of system components in the rack(s), including the minimum required unused expansion space specified in Section 3.1.5, and
  2. a Machine Unit Specification (MUS) chart for the proposed system. This MUS shall include system physical attributes, power and cooling requirements, and other information necessary for NWSC facility preparation and integration of the NWSC-2a system with the NWSC facility infrastructure.

3.3 NWSC Facility Information

The Offeror shall review the following information regarding the NWSC facility and its infrastructure and confirm that the Offeror’s proposed equipment and services are compatible with the NWSC facility.

Table 1. NWSC Facility Specifications

NCAR-Wyoming Computing Center Information
Location / NCAR-Wyoming Supercomputer Center, 8120 Veta Drive, Cheyenne, Wyoming
Altitude / 6,260 feet
Seismic / N/A
Air Cooling / The system must operate within the temperature ranges specified in the ‘ASHRAE 2015 Thermal Guidelines for Data Processing Environments, 4th Edition’.
Water Cooling / The system must operate within the temperature ranges specified in the ‘ASHRAE 2015 Thermal Guidelines for Data Processing Environments, 4th Edition’. The NWSC provides 65F chilled water, currently has 75F return water, and can accommodate higher return temperatures up to 80F without coordination with or modifications of the NWSC facility infrastructure. The NWSC chilled water system can accommodate a large range of flow rates. De-ionized water is available.
Floor / 10’ raised floor
Ceiling / 12’ ceiling; 9’ 9” maximum cabinet height
Floor Loading / The floor loading shall not exceed a uniform live load of 250 pounds per square foot with a deflection of not more than 0.04 inch, and a concentrated load of 2,500 pounds per square inch.
Shipment Dimensions and Weight / For delivery, shipping units and/or containers shall weigh less than 7,000 pounds and be able to be moved from the NWSC facility’s loading dock through its doors and hallways, which are 6’ 0” in width and 9’ 9” in height, or larger.
Cabling / All power cabling and water connections are below the access floor. All other cabling (e.g. system interconnect, administrative networking) is above the floor.
External network interfaces supported by the NWSC / 1GbE, 10GbE, 40GbE, 100GbE
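
An Offeror might screen a proposed rack against the dimensional and weight limits of Table 1 with a check along the following lines. This is a minimal sketch under stated assumptions: the `Rack` fields and `facility_issues` function are hypothetical, the rack's own dimensions stand in for those of its shipping unit, and the floor-loading test simply divides installed weight by rack footprint, which is a simplification of a real structural analysis.

```python
from dataclasses import dataclass

# Limits taken from Table 1.
MAX_CABINET_HEIGHT_IN = 9 * 12 + 9      # 9' 9"
MIN_DOOR_WIDTH_IN = 6 * 12              # 6' 0"
MAX_SHIPPING_WEIGHT_LB = 7000           # shipping units must weigh less than this
MAX_UNIFORM_LOAD_PSF = 250              # pounds per square foot


@dataclass
class Rack:
    """Hypothetical rack description; all values are supplied by the Offeror."""
    height_in: float
    width_in: float
    depth_in: float
    installed_weight_lb: float
    shipping_weight_lb: float


def facility_issues(rack: Rack) -> list[str]:
    """Return the Table 1 limits that the rack appears to exceed."""
    issues = []
    if rack.height_in > MAX_CABINET_HEIGHT_IN:
        issues.append("cabinet height exceeds 9' 9\"")
    if min(rack.width_in, rack.depth_in) > MIN_DOOR_WIDTH_IN:
        issues.append("rack cannot pass through a 6' 0\" doorway")
    if rack.shipping_weight_lb >= MAX_SHIPPING_WEIGHT_LB:
        issues.append("shipping weight is not under 7,000 pounds")
    # Simplification: installed weight spread evenly over the rack footprint.
    footprint_sqft = (rack.width_in * rack.depth_in) / 144
    if rack.installed_weight_lb / footprint_sqft > MAX_UNIFORM_LOAD_PSF:
        issues.append("footprint load exceeds 250 pounds per square foot")
    return issues


# Example with made-up dimensions for illustration only.
example = Rack(height_in=78.7, width_in=23.6, depth_in=47.2,
               installed_weight_lb=1800, shipping_weight_lb=2100)
print(facility_issues(example) or "no issues found")
```

The cooling, deflection and concentrated-load rows of Table 1 are not covered by this sketch and would need to be confirmed separately.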

4 NWSC-2a System Maintenance and Support Services

The NWSC-2a system shall be provided with the following maintenance and support services. The Offeror’s Technical Volume shall describe the services proposed in response to each of the following subsections.

4.1 Project and Risk Management

The Technical Volume of the Offeror’s proposal shall provide a system assembly, test, delivery and installation plan and timeline, identify any risks associated with the plan, and identify the roles and responsibilities of the individuals involved in executing the plan.

4.2 Facility and Installation Planning

The Offeror shall provide facility and installation planning services for the NWSC-2a system. The Offeror’s Technical Volume shall include a description of the facility and installation planning services to be provided.

4.3 System Assembly and Factory Test

The Offeror shall assemble and test the NWSC-2a system prior to shipping it to the NWSC in accordance with Attachment 3F of this RFP. CISL requests that the Offeror allow CISL personnel to travel to the Offeror’s system assembly site to observe, and potentially participate in, the factory test. The Offeror’s system integrity tests shall be made available to CISL for post-installation testing at the NWSC.

4.4 System Delivery and Installation

The Offeror shall include any transportation, delivery, uncrating, packing material removal, and installation services and costs for the NWSC-2a system in the Offeror’s proposal. Delivery shall be to the NCAR-Wyoming Supercomputing Center, 8120 Veta Drive, Cheyenne, WY.

Offeror personnel should be on-site during the system’s installation within the NWSC, and shall provide assistance to CISL if any on-site assembly or cabling of the system is required.

4.5 System Acceptance Testing

The NWSC-2a system shall undergo Acceptance testing in accordance with Attachment 3F of this RFP. The Offeror shall review Attachment 3F and provide any exceptions, along with a rationale for each change, via change-tracked modifications to that document as part of the Technical Volume of the Offeror’s proposal. A final acceptance test plan will be developed during subcontract negotiations.

4.6 Hardware Warranty/Maintenance

The Offeror shall provide, as a minimum, a standard warranty, or combination of warranty and maintenance services, for CISL self-maintenance of the system, including replacement of failed system hardware components and firmware/microcode updates for supplied hardware, covering a period of four calendar years subsequent to the successful completion of system installation, testing and acceptance of the NWSC-2a system.

Services shall include replacement parts, firmware/microcode updates, shipment of replacement parts and delivery to the NWSC within three business days, and the RMA of failed components. Proposed warranty and maintenance services shall be described in the Offeror’s Technical Volume; annualized costs shall be itemized in the Offeror’s Business/Price Volume.

4.7 Problem Reporting, Isolation, Repair and Resolution Services

As part of the NWSC-2a system installation, the Offeror shall provide written procedures and web-based and/or onsite training (commensurate with the warranty/maintenance proposed in response to Section 4.6) to CISL for system hardware failure reporting and failed component diagnosis, isolation and repair, including the ordering of replacement components and return, if necessary, of failed components to the Offeror.

The Offeror shall provide 24x7 telephone and web-based problem reporting, ticketing, query and resolution services during the warranty and maintenance period. The Offeror’s problem reporting and ticketing system shall be full-functioned and provide participant roles, keyword searches, a unified view of hardware problem reports/resolutions, and a customer-searchable knowledge database which allows a customer to determine if other customers have experienced a similar problem.

5 Options

The Offeror shall respond to the following option requests, providing a technical description in the Technical Volume, and pricing information in the Business/Price Volume, of the Offeror’s response.

  1. 5th year hardware maintenance as proposed in response to Section 4.6.
  2. 6th year hardware maintenance as proposed in response to Section 4.6.
  3. Maintenance uplift option providing for 9x5, next business day, on-site hardware maintenance by the Offeror’s field service personnel, exceeding that proposed in response to Section 4.6. A maintenance uplift description shall be provided in the Technical Volume; and prices shall be provided in the Business/Price Volume for each of the initial 4 years, and the 5th and 6th option years.
  4. Additional node, identical to the small-memory nodes configured in response to Section 3.1.1, and all mounting hardware and any communications and power cabling necessary for integrating the node into the system. The additional node price and associated additional annual maintenance price shall be included in the Business/Price Volume.
  5. On-site spare parts cache, comprised of at least one (two for larger quantity or lower MTBF items, such as DIMMs or disk drives) of each Customer Replaceable Unit within the NWSC-2a system, to allow CISL to effect repairs of failed components and return nodes to production service while awaiting replacement parts.
  6. A 36-port EDR InfiniBand switch (e.g. Mellanox MSB7700-ES2F, or similar) and twenty-five QSFP28 DAC EDR InfiniBand cables of sufficient length for connection of the system’s nodes to the InfiniBand switch and their associated additional annual maintenance: technical description shall be included in the Technical Volume; prices shall be included in the Business/Price Volume.
  7. Additional NVIDIA GeForce Titan X graphics card, or the proposed comparable card, and its associated additional annual maintenance: prices shall be included in the Business/Price Volume.
  8. Additional NVIDIA Tesla P100 GPGPU card with GPUDirect RDMA support and its associated additional annual maintenance: prices shall be included in the Business/Price Volume.
  9. A 32-port 100 GbE Ethernet switch (e.g. Mellanox Spectrum SN2700 32-Port 100GbE Open Ethernet Switch with MLNX-OS, 32 QSFP28 ports, 2 Power Supplies (AC), x86 CPU, standard depth, P2C airflow, Rail Kit, RoHS6)
  10. The differential cost of a large-memory node over that of a small-memory node shall be included in the Business/Price Volume. If there is an associated additional annual maintenance cost, that shall be included in the Business/Price Volume.
  11. The cost of upgrading an already-installed small-memory node to a large-memory node shall be included in the Business/Price Volume. If there is an associated additional annual maintenance cost, that shall be included in the Business/Price Volume.
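
The sizing rule for the on-site spare parts cache in option 5 (at least one of each Customer Replaceable Unit, two for larger-quantity or lower-MTBF items) can be sketched as follows. The CRU list is hypothetical: only DIMMs and disk drives are named in the option text, and the classification of the other entries is an assumption made for illustration.

```python
# Hypothetical CRU list; the real list comes from the Offeror's proposed system.
# Each entry: CRU name -> True if it is a larger-quantity or lower-MTBF item.
CRUS = {
    "DIMM": True,                 # named in option 5
    "disk drive": True,           # named in option 5
    "power supply": False,        # assumed classification
    "InfiniBand adapter": False,  # assumed classification
}


def minimum_spares(crus: dict) -> dict:
    """Return the minimum on-site spare count for each CRU under option 5."""
    return {name: 2 if flagged else 1 for name, flagged in crus.items()}


cache = minimum_spares(CRUS)
for name, count in cache.items():
    print(f"{name}: {count}")
```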

6 Reference Information

[1] The NCAR-Wyoming Supercomputing Center (NWSC) facility:

[2] The current NCAR Data Analysis and Visualization environment:

[3] The NWSC computational and data storage/archival environment:

[4] Cheyenne: the NWSC-2 high-performance computing system:

[5] An introductory presentation on the Cheyenne system and NWSC user environment:

[6] General user information regarding the NCAR Globally Accessible Data Environment (GLADE):
