PASTA Run II – WG (d) – Report V0

CERN Technology Tracking for LHC

PASTA 2002 Edition

Working Group G

Computer Architectures

System Interconnects

Scalable Clusters

Arie Van Praag, CERN IT/ADC

30th September 2002


Introduction

With the rise of clock frequencies from the first Pentium processors at 60 MHz up to 3 GHz in the Pentium 4, PCs have an excess of power for most daily tasks. For normal office work and home applications, a 300 MHz Pentium was adequate. This brings PC development to a crossroads that changes the orientation of mass production. A split will occur between standard desktops, high-end PCs and Intel-oriented workstations. In addition, an increasing part of the market will be portables and soon the so-called “tablets”. Besides the Intel Pentium, other manufacturers produce equivalent processor chips, such as the AMD Athlon and the low-power Crusoe. They all have different architectures that may show different efficiencies when running LINUX.

Networking and interconnects also move up by an order of magnitude, into the bandwidth range from 10 Gbit/s to 40 Gbit/s. The I/O and storage connections will profit from new standards. Network storage has become more reliable and allows bandwidths that cannot be reached with classical file servers.

System Interconnects and I/O facilities

2.1 Busses

The bus that delivers the highest performance is still the SGI XIO bus. Sun’s UPA and its built-in crossbar switches seem to have disappeared, but a large part of the design objectives have been realized in what is now Infiniband. Firewire is unchanged and not very widespread, and USB is now in its second version, increasing the bandwidth to 60 MByte/s. The PCI bus standard is extending rapidly with PCI-X, PCI-X 2 and the serial PCI Express.

2.1.1 IDE developments

Changes in the sector of disk connections are mainly in the physical connection. The parallel ATA flat-cable connection will be replaced by a new standard called “Serial-ATA”. The physical connection takes the form of a thin cable with signal pairs, completed with power wires, and multiple disks are connected in a star topology.

The standard defines three classes:

1500 = 150 MByte/s

3000 = 300 MByte/s

6000 = 600 MByte/s

As the difference is purely in the hardware, no new drivers are needed.
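
The class numbers are the serial line rates in Mbit/s; with the 8b/10b line encoding used on such serial links, ten line bits carry one data byte, which reproduces the net rates listed above. A minimal sketch of this arithmetic (the 8b/10b overhead figure is the usual one for serial links of this kind, stated here as an assumption rather than taken from this report):

    #include <stdio.h>

    /* Net Serial-ATA data rate derived from the line rate in Mbit/s.
     * Assumes 8b/10b encoding: 10 bits on the wire per data byte. */
    static double sata_net_mbyte_per_s(double line_rate_mbit)
    {
        return line_rate_mbit / 10.0;   /* 1500 -> 150, 3000 -> 300, 6000 -> 600 */
    }

    int main(void)
    {
        const double classes[] = { 1500.0, 3000.0, 6000.0 };
        for (int i = 0; i < 3; i++)
            printf("Serial-ATA %.0f: %.0f MByte/s\n",
                   classes[i], sata_net_mbyte_per_s(classes[i]));
        return 0;
    }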

Serial-ATA disks and interfaces have been demonstrated, and there is considerable activity among disk manufacturers and in the silicon industry.

PCI interfaces for Serial-ATA, with or without RAID capability, have been announced, with the first products to be delivered in Q2 2002.

Most probably, the fastest classes will first be introduced in large server machines. The lower classes, starting at the 1500 level, may appear in desktop systems soon, as the thinner cables take away serious cooling problems. The serial cables are also meant to be more robust than the parallel flat cables now in use.

2.1.2 SCSI developments 2002

SCSI is currently at 160 MByte/s, with 320 MByte/s announced. A standard for Serial-SCSI is under discussion that adapts the physical layer of the third Serial-ATA class, with a bandwidth of 600 MByte/s, for SCSI connections. No products have been announced so far; however, the same prognosis may be valid as for Serial-ATA.

2.1.3 iSCSI

During the October 2000 workshop at CERN, Julian Satran from IBM Haifa presented his project for networked storage, called iSCSI. Standardization of this TCP/IP extension of SCSI is handled by ANSI T10, and it is now an accepted standard.

iSCSI has been very well received, and a number of products are available, including iSCSI to Fiber Channel converters for connections to NAS and SAN storage, and PCI interfaces. The latter use PCI 64/66 and a TCP/IP offload engine on the interface.

This kind of interface brings network traffic preparation back to a DMA transfer and, as such, offloads the processor and memory. Using such interfaces for TCP/IP or iSCSI in disk servers can substantially increase I/O bandwidth; network latency, however, will at best be slightly shorter. iSCSI has a bright future in large low-end clusters and in database applications.

2.1.4 USB

USB-1 is now a commodity serial I/O bus with a long list of commercial products available.

New is USB-2, which corrects some points of the specification and increases the bandwidth to 60 MByte/s while staying compatible with USB-1. Intel and a number of other chipmakers have adopted USB-2 and incorporated it on many motherboards, and peripherals are starting to become commercially available.

2.1.5 IEEE 1394b

Introduced by Apple in 1993 as Firewire, the IEEE 1394b standard for high-speed serial I/O is not very widespread. It has mainly been adopted by Apple and can be found on a number of digital video cameras.

2.1.6 PCI developments 2002

PCI 64/66 can be found mostly on server systems and on a limited number of small systems. Implementations of the PCI-X standard from 1999 can be found on some newer server systems, but introduction is slow.

The PCI-X 2 standard has been adopted with a maximum clock of:

266 MHz with a theoretical bandwidth of 2.13 GByte/s

533 MHz with a theoretical bandwidth of 4.26 GByte/s

The standard supports DDR and QDR technology.

As with all PCI figures, these theoretical numbers include some 25 % bus overhead.
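
These theoretical figures follow directly from the 64-bit bus width: each transfer moves 8 bytes, so the bandwidth is the effective transfer rate times 8 bytes. A minimal sketch of the arithmetic (the 64-bit width is the standard PCI-X bus width, assumed here rather than quoted from this report):

    #include <stdio.h>

    /* Theoretical PCI-X 2 bandwidth: effective transfer rate in
     * MTransfers/s times the 64-bit (8 byte) bus width.
     * Bus overhead is not subtracted. */
    static double pcix2_gbyte_per_s(double mtransfers_per_s)
    {
        return mtransfers_per_s * 8.0 / 1000.0;   /* 266 -> 2.13, 533 -> 4.26 */
    }

    int main(void)
    {
        printf("266 MHz (DDR): %.2f GByte/s\n", pcix2_gbyte_per_s(266.0));
        printf("533 MHz (QDR): %.2f GByte/s\n", pcix2_gbyte_per_s(533.0));
        return 0;
    }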

PCI Express is new; it is the new name for 3GIO, alias project Arapahoe (see NGIO, section 2.2.1).

Again, the first applications will be in high-end servers, moving down to smaller systems later. It is questionable whether the highest data rates of PCI-X 2 will ever be found in low-end workstations, as the resulting bandwidth also depends on processor and memory. PCI Express is an internal serial bus and not an interconnect; to be successful it has to move the market away from the massively installed and popular PCI standard.

2.2 New Interconnects

2.2.1 NGIO and Future I/O

NGIO and Future I/O did not make it to a standard but merged into INFINIBAND (see 2.3.3).

Intel continued the work on NGIO under the project name Arapahoe, resulting in the industrial proposal 3GIO: a symmetrical bi-directional bus with 2.5 Gbit/s bandwidth and a 32- and 64-bit address space. Its implementation is at the chipset level, making it an internal I/O bus. Intel proposes 3GIO as PCI Express.

This may be a political move by Intel to impose 3GIO against the superior HyperTransport from AMD (see also PCI developments, section 2.1.6). PCI-SIG, the PCI governing organization, was founded by Intel and is still largely dependent on its financial backing.

2.2.2 HyperTransport

HyperTransport is proposed by AMD as the future internal bus for its Intel-equivalent processors. It is a 12.8 GByte/s asymmetrical bi-directional bus with a 64-bit address space. The physical connections, called “tunnels”, are 8 or 16 bits wide. The chipset includes a small routing crossbar switch.

The specification foresees a connection to the outside that can be used either to connect peripherals or to connect up to 4 or 8 systems into a small cluster with cross-linked memory access. Several independent chipset makers have announced that they support HyperTransport (VIA press release, 1-7-2002).

HyperTransport is theoretically superior to 3GIO from Intel but market acceptance will decide its future.

2.2.3 I2C

This serial bus has developed into several industrial busses with applications in control and robotics technology (Fieldbus, CANbus, Profibus, etc.). It is a relatively slow interconnect that may find its place for slow controls in experiments.

However, USB 1 and 2 and IEEE 1394b are both children of the Philips I2C bus.

2.3 LANs as System interconnects

The example of GSN, demonstrating a reliable network in the 10 Gbit/s range, generated several similar interconnects. INFINIBAND copied many details from the GSN standard and has three bandwidth options: 2.5 Gbit/s, 10 Gbit/s and 30 Gbit/s.

The Ethernet community introduced a 10 Gbit/s version, 10GE, and is already working on a future 40 Gbit/s.

In Ethernet-type connections a reversal took place in that the relatively slow processors could not keep up with ever-faster networks. With multi-GHz processors this is no longer the case, but a new reversal will come with 10 Gigabit Ethernet. The tendency to move network handling onto the interface will only shift the latency problem to the network connection, not take it away; processor and memory efficiency, however, will increase.

Tremendous developments have been seen in wireless networking with at least three different standards commercially available.

2.3.1 GSN

GSN has developed into a serious network standard and is used in a number of computer centers in the USA and in Europe. It is the only “secure network” (to be understood as “guaranteed data integrity”) in this bandwidth range and is the network with the lowest latency. Physical connections are available as copper cable and as parallel fiber cable.

Bridges with conversion to Gigabit Ethernet, Fiber-channel, HIPPI and SONET OC48/SDH16 exist.

GSN has its own low-latency protocol, Scheduled Transfer (ST), based on network DMA, but is also fully TCP/IP capable. A SCSI extension over ST (SST) makes storage access over the network possible (SAN, NAS, etc.).

GSN has been demonstrated, on systems with high-bandwidth memory and high-bandwidth I/O, to reach almost wire speed (780 MByte/s) with a processor load under 5 %. GSN is deployed at more than 40 sites, mainly in the USA and some in Europe.

GSN is successfully applied in high-end systems, but low-end servers and workstations for the moment lack the I/O bandwidth needed to exploit the full link speed. PCI 64/66 interfaces have shown that 400 MByte/s is obtainable.

2.3.2 FC, HIPPI, Myrinet

Fiber Channel (FC) has crystallized out as dominant in one field: fast hard disks for SAN technology. Today’s standard is 1 Gbit/s, and 2 Gbit/s is also operational. 10 Gbit/s has been announced but, given the difficulty of reliably implementing the different protocols at this speed, it will not be around soon.

Even if HIPPI has not seen new technological developments since Serial-HIPPI, it is still a strong contender with constant commercial activity. Myrinet is still a relatively cheap interconnect that has proven reliable up to 1 Gbit/s; versions at 2 Gbit/s and 10 Gbit/s have been announced. For cluster computing, Myrinet seems, according to several reports, to have proven communication and latency advantages in combination with MPI.

2.3.3 INFINIBAND

Infiniband is a combination of Fiber Channel and PCI at one end, together with GSN, RIO and some IBM technology. The definition is “a network-capable interconnect”. It has coaxial copper connections for short distances and parallel fiber connections for longer distances. The basic speed is 2.5 Gbit/s, which can be multiplied by 4 or by 12; striping is sequential over the parallel connections. Networking is done with crossbar switches.
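
The three bandwidth options are simply the 2.5 Gbit/s basic lane rate times the 1x, 4x or 12x link width; with the 8b/10b encoding used on such links, the net data rate is 2 Gbit/s per lane. A minimal sketch of this arithmetic (the 8b/10b overhead is the usual figure for links of this kind, stated as an assumption rather than taken from this report):

    #include <stdio.h>

    /* INFINIBAND raw and net link rates for the 1x, 4x and 12x widths.
     * Assumes the 2.5 Gbit/s basic lane rate and 8b/10b link encoding
     * (8 data bits carried in 10 line bits). */
    int main(void)
    {
        const double lane_gbit = 2.5;
        const int widths[] = { 1, 4, 12 };

        for (int i = 0; i < 3; i++) {
            double raw = lane_gbit * widths[i];    /* 2.5, 10, 30 Gbit/s  */
            double net = raw * 8.0 / 10.0;         /* after 8b/10b coding */
            printf("%2dx link: %5.1f Gbit/s raw, %5.1f Gbit/s net\n",
                   widths[i], raw, net);
        }
        return 0;
    }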

Infiniband has many proprietary protocols but also supports TCP/IP. One of these protocols is foreseen for system management.

RDMA is a low latency protocol much the same as ST and also based on network DMA.

The fact that every INFINIBAND node can be its own network controller makes maintenance difficult.

Intel stopped chip design for INFINIBAND and, in the same move, put forward its own PCI Express as a system interconnect (see also NGIO, section 2.2.1).

INFINIBAND will survive Intel’s move away from producing specialized silicon, as a very large number of firms adhere to the interest group. But INFINIBAND is a complex system with many protocols, which likely means it will be successful in one or more key applications such as blade servers (as Fiber Channel is in storage) but will lose against more accepted networking standards elsewhere.

2.3.4 Advances in Ethernet

As PC clock frequencies have gone up to over 2 GHz, the bottleneck of packet building and IP stack handling for Gigabit Ethernet has almost gone.

Several companies have announced network cards with the IP stack as part of the interface. However, while this offloads the processor, it shrinks the transfer latency only marginally.

The 10 Gigabit Ethernet (10GE) standard was accepted on 17 June 2002. The physical connection has many different options, divided into two classes.

The first uses 4 streams at 2.5 Gbit/s with Coarse Wavelength Division Multiplexing (CWDM), a technology that is not applicable for longer distances and is very difficult to make stable.

The second is a set of single-fiber, single-stream connections at different wavelengths, using multimode or single-mode fiber. Distances of up to 80 km are announced.

Part of the standard is the mapping of 1500-Byte Ethernet frames into 64-KByte SONET/SDH packets in POS mode, making long-distance transport of Ethernet frames over standard communication channels possible. To avoid conflicts when using this feature, it is impossible to allow Jumbo frames. After adapting the communication hardware, Ethernet frames will move directly over the WAN, with a tenfold reduction in (service) cost.

The 10 Gigabit Ethernet standard is made in such a way that it can easily be adapted to higher speeds, and work on 40 Gigabit Ethernet has started. The introduction of 40GE will mainly be decided by the price/performance ratio of the necessary optical components.

10GE will become successful in the coming years as a backbone interconnect and as a high-end system interconnect. The compatibility with SONET/SDH will make it the ideal WAN interconnect. However, the small Ethernet frames will remain a constant problem, introducing latency, especially as Jumbo frames have been refused as an official standard. Again, network offload engines may help processor efficiency, but they displace the network latency to the interface.
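
To put the frame-size problem in numbers: at 10 Gbit/s a 1500-byte frame, plus the usual 38 bytes of preamble, header, CRC and inter-frame gap, leaves only about 1.2 microseconds per frame, i.e. over 800 000 frames per second to build and handle, whereas 9000-byte Jumbo frames would reduce that rate by roughly a factor of six. A minimal sketch of this arithmetic (the 38-byte per-frame overhead is the usual Ethernet figure, assumed here rather than taken from this report):

    #include <stdio.h>

    /* Frames per second on an Ethernet link for a given payload size.
     * Assumes the usual 38 bytes of per-frame overhead
     * (preamble 8 + header 14 + CRC 4 + inter-frame gap 12). */
    static double frames_per_second(double link_gbit, double payload_bytes)
    {
        const double overhead_bytes = 38.0;
        double bytes_per_second = link_gbit * 1e9 / 8.0;
        return bytes_per_second / (payload_bytes + overhead_bytes);
    }

    int main(void)
    {
        double std_rate   = frames_per_second(10.0, 1500.0);  /* ~813 000 /s */
        double jumbo_rate = frames_per_second(10.0, 9000.0);  /* ~138 000 /s */

        printf("1500-byte frames: %.0f frames/s (%.2f us each)\n",
               std_rate, 1e6 / std_rate);
        printf("9000-byte frames: %.0f frames/s (%.2f us each)\n",
               jumbo_rate, 1e6 / jumbo_rate);
        return 0;
    }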

2.3.5 Wireless LANs

The last three years have seen the introduction and commercialization of 802.11b wireless Ethernet with a theoretical bandwidth of 11 Mbit/s. It uses the free 2.4 GHz band, and interference with other users of this frequency is sometimes a problem. Early security problems are being corrected over time, as encryption algorithms improve.

More recently, the wireless Ethernet standard was followed by 802.11a, with a bandwidth of up to 54 Mbit/s, using a frequency band at 5 GHz and implementing the latest 128-bit security algorithms. Both 802.11 variants should be compatible except for speed.

The Bluetooth standard is spreading rapidly. It uses the 2.4 GHz band and copies many of the characteristics of 802.11b, but is also able to use other protocols, for printer connections etc. Its range is limited to about 5 m to 10 m due to the low transmission power.

All three of these systems may be very useful in office applications and may avoid pulling new network cables. However, in places such as a computer center or a HEP experiment, the high level of electromagnetic noise may affect reliable operation.

2.4 Middleware

2.4.1 VIA

VIA has not succeeded in imposing itself. However, Microsoft is still working on a version that merges the VIA ideas with ST technology (Microsoft hired Jim Pinckerton from SGI).

2.4.2 PVM

2.4.3 MOSIX

MOSIX is software that can make a cluster of PCs (workstations and servers) run almost like an SMP. MOSIX can support configurations with large numbers of computers, with minimal scaling overhead impairing performance. A low-end configuration may include several PCs connected by Ethernet, while a larger configuration may include workstations and servers connected by a higher-speed LAN, e.g. Fast Ethernet.

Even though the Hebrew University of Jerusalem puts real effort into this OS extension, the network interconnect will always be a disadvantage compared to real SMP systems.

2.4.4 MPI

MPI is successfully applied in clusters that need inter-processor message passing, for example at NASA Ames (Mountain View) and at Sandia Lab (Albuquerque). Activities are going on to make the I/O part of MPI perform better.
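
As an illustration of the message-passing style meant here, the sketch below passes a small array from one process to another with the standard MPI point-to-point calls; it is a minimal generic example, not code from any of the installations mentioned above.

    #include <stdio.h>
    #include <mpi.h>

    /* Minimal MPI point-to-point example: rank 0 sends an array of doubles
     * to rank 1, which prints the first element.
     * Run with at least two processes, e.g. "mpirun -np 2 ./a.out". */
    int main(int argc, char **argv)
    {
        int rank;
        double data[4] = { 1.0, 2.0, 3.0, 4.0 };

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Send(data, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Status status;
            MPI_Recv(data, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received first element %f\n", data[0]);
        }

        MPI_Finalize();
        return 0;
    }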

2.4.5 Threading

Methods have been developed by Intel and AMD to increase processor efficiency by running multiple threads, pipelined or simultaneously, in the processor. Intel will introduce this technology in the XEON version due in Q3 2002, and AMD has announced this feature for the HAMMER series.

Scalable Clusters

The tendency is to see three different technologies continuing in scalable clusters, each with a multitude of smaller partners around one main player: the MIPS group with SGI, the SPARC group with Sun and the PowerPC group with IBM. In addition, all three propose Intel-based clusters with proprietary or open-source operating systems (LINUX).

3.1. High Level Clusters

Under pressure from the ASCI project in the USA and the Global Simulation project in Japan, the capacity of high-level clusters moves toward or beyond 10 Teraflops. They are the domain of a few manufacturers, such as Compaq with Alpha processors, IBM with PowerPC processors and SGI with MIPS and/or Itanium processors in the USA, and NEC with its own vector-oriented microprocessor in Japan. Some of these systems can be delivered with the LINUX OS.

High-end clusters will continue to exist. They will be used mainly in large simulation programs, weather forecast and biological computing where multiple processors need access to a single large data frame.

3.2. Medium Level Clusters

A class of systems built especially to be combined into clusters can be seen as medium-level cluster machines. They consist for a large part of the so-called 1”-high “pizza box” systems, which exist mainly in one- and two-processor configurations, with some four-processor exceptions.

Very high-density clusters can be built with these systems, but power dissipation in a small volume is a problem. This adds to an installation price that is already higher than that of dual-processor workstations. Pizza box systems are available from all larger manufacturers and some smaller start-up companies.

Newcomers in this field are the so-called “blade servers”. They can be seen as bare motherboards that plug into a backplane providing power and network interconnects. To avoid power problems, the blades are normally equipped with low-power processors. Production methods can be compared to the miniaturization of portables, together with the use of high-reliability components, which keeps prices high.

Blade systems are available or announced from Compaq-HP, Dell and Sun. It may be of interest to make a price/performance comparison if these systems become more common and their cost drops near workstation prices.

Medium-level clusters are what service providers need, so there is a huge market out there. But they will keep high prices, first because of the high demand and secondly because of the high-reliability components included. Sophisticated cooling needs increase the cost of ownership.