Large-Scale Cluster Computing Workshop, FNAL

Day 1: 5/22/2001

General information (Dane)

Mail presentations to

Cluster builders' handbook based on notes from the workshop: http://conferences.fnal.gov/lccws/draft/

Goals (Alan)

HEPiX formed a Large Clusters SIG last year

The SIG should run “special” meetings, and this workshop (LCCWS) is one of those.

Primary goal: gather practical experience and build the ‘cluster builders’ guide’.

Welcome (Matthias)

D0 data handling: SAM – GRID enabled. Remote clients: Lyon, NIKHEF, Lancaster, Prague, Michigan, Arlington

LHC computing needs (Wolfgang)

The IEEE CS Task Force on Cluster Computing (TFCC) (Gropp)


Sets standards. Involved with issues related to the design, analysis, and development of cluster systems, as well as the applications that use them.

· Task force with short-term lifetime (2-3 years)

· Cluster computing is NOT just parallel, distributed, OSs or the Internet. It is a mix of them all.

· http://www.ieee.org

· http://www.clustercomp.org

· http://www.tu-chemnitz.de/cluster2000

· http://andy.usc.edu/cluster2001

· http://www.ieeetfcc.org/ClusterArchive.html

· http://www.TopClusters.org

· TFCC whitepaper: http://www.dcs.port.ac.uk/~mab/tfcc/WhitePaper/

· TFCC newsletter: http://www.eg.bucknell.edu/

· TopClusters project: numeric, I/O, web, database, and application-level benchmarking of clusters

· TFCC: over 300 registered members. Over 450 on the TFCC mailing list

· TFCC future plans: cease as a task force and attain full Technical Committee status.

Scalable clusters

· TopClusters.org list: 26 Clusters with 128+ nodes (8 with 500+ nodes). Does not include MPP-like systems (IBM SP, SGI Origin, Compaq, Intel TFLOPs, etc.)

· Scalability, practical definition: operations complete “fast enough” -> 0.5 to 3 seconds for “interactive”.

· Operations are reliable: approach to scalability must not be fragile.

· Developing and testing cluster management tools: requires convenient access to large-scale system. Can it co-exist with production computing?

· Too many different tools

o Why not adopt the UNIX philosophy? Example solution: “Scalable Unix tools”, parallel versions of common Unix commands like ps, ls, cp, …, with appropriate semantics. Designed for users but found useful by administrators.

o The basic Unix commands (ls, …) are quintessential tools

o Commands follow pt<unix-command> naming (e.g. ptcp, ptexec, ptdisp)

o Performance of ptcp: copying a single 10MB file to 241 systems took 14 seconds (100BaseT)

o Based on a very simple daemon forking processes under the user’s ID. Authentication relies on trusted hosts.

o Implementation layered on MPI (a sketch in this spirit appears at the end of this talk’s notes).

o GUI tool: ptdisp

o Open Source. Get from http://www.mcs.anl.gov/sut

o Needs an MPI implementation with mpirun. Developed with Linux, MPICH, and MPD on Chiba City at Argonne.

· http://www-unix.mcs.anl.gov/chiba/ is a resource available for research on scalability issues.

· Large programs:

o DOE Scientific Discovery through Advanced Computing (SciDAC)

o NSF Distributed Terascale Facility (DTF)

o OSCAR: goal is a “cluster in a box” CD. Targets smaller clusters (sponsored by Intel)

o PVFS (Parallel Virtual File System)

o Commercial Efforts: Scyld, etc.

· Q: the model we have for a small Unix box does not necessarily scale. A: it is maybe not an elegant model, but a working one.

· Q (Tim): why not just ptexec? A: to ease learning.
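
A minimal sketch of how a pt-style tool can be layered on MPI, as a hypothetical re-implementation of ptcp in mpi4py (the real Scalable Unix Tools are C programs; names and paths here are illustrative, not the ANL code):

    # ptcp_sketch.py - parallel file copy in the spirit of ptcp.
    # Hypothetical mpi4py sketch; the real tools are C programs
    # layered on MPI (http://www.mcs.anl.gov/sut).
    import sys
    from mpi4py import MPI

    def ptcp(src_path, dst_path):
        comm = MPI.COMM_WORLD
        # Rank 0 reads the source file once; all ranks receive it.
        data = None
        if comm.Get_rank() == 0:
            with open(src_path, "rb") as f:
                data = f.read()
        data = comm.bcast(data, root=0)
        # Every rank writes the copy to its local disk.
        with open(dst_path, "wb") as f:
            f.write(data)

    if __name__ == "__main__":
        ptcp(sys.argv[1], sys.argv[2])

Run with something like mpirun -np 241 python ptcp_sketch.py /src/file /local/file, mirroring the 241-system measurement above.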

Linux farms at RCF

The Linux farms provide the majority of CPU power in the RCF: mass processing of raw data from the RHIC experiments. Mostly rack-mounted due to space restrictions.

· 338 Intel-based nodes (duals and quads), ranging from 200MHz to 800MHz; ~18,176 SpecInt95

· RH6.1 with a 2.2.18 kernel, which is taken advantage of to bind local disks together

· OS upgrades are done over the network or initiated via a bootable floppy. The master Linux image is kept on a separate machine.

· AFS (mostly OpenAFS) and NFS servers for access to user software and data.

· CRS: Central Reconstruction System

o Interface with HPSS for access to STK silos

o Jobs submitted to batch server nodes are relayed to the batch master node, which interfaces with HPSS and the CRS nodes.

o Checks that HPSS and NFS are available, to minimize job losses

· CAS: Central Analysis

o Jobs run under LSF or interactively

o LSF queues separated by experiment and by priority

o LSF license cost is an issue

· Monitoring software:

o Web interfaces and automatic paging systems (also for users)

o System availability and critical services (NFS, NIS, SSH, LSF); a probe sketch follows at the end of this talk’s notes

o Data stored on disk and backed-up

o Cluster management software collects vital hardware statistics (VA Linux)

o VACM for remote system management

· Security:

o Transitioning to OpenSSH. The few systems upgraded haven’t had any problems. The issue for farm deployment is Kerberos token passing, so that users don’t need to authenticate twice.

· CTS for trouble ticket system
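
A minimal sketch of the kind of availability check described above, assuming simple TCP probes of service ports and a stand-in page() hook (ports, hostnames, and the paging hook are illustrative assumptions; the actual RCF monitoring software is not described here):

    # service_check.py - sketch of critical-service availability checks
    # (SSH, NFS, NIS via the portmapper) with paging on failure.
    import socket

    SERVICES = {"ssh": 22, "nfs": 2049, "portmap/NIS": 111}  # assumed ports

    def is_up(host, port, timeout=3.0):
        try:
            with socket.create_connection((host, port), timeout):
                return True
        except OSError:
            return False

    def page(msg):
        print("PAGE:", msg)  # stand-in for the automatic paging system

    for host in ["farm001.example.org", "farm002.example.org"]:
        for name, port in SERVICES.items():
            if not is_up(host, port):
                page(f"{host}: {name} (port {port}) unreachable")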

BaBar Clusters (Charles Young, SLAC)

· Migrating to Linux (and Solaris)

· Reconstruction cluster is Solaris based.

o Up to 200 CPU farm

o Not using batch but a customized job control

o Scaling issues: tightly coupled systems, serialization or choke points, reliability concerns when scaling up.

· Pursuing Linux farm nodes:

o VA Linux 1220 – 1 RU, 2 CPU, 866MHz, 1GB

o Some online code is Solaris specific

· MonteCarlo

o 100 CPU farm nodes. Mix of Linux and Solaris

o LSF

· Analysis Cluster

o Data stored primarily in Objectivity format

o Disk cache with HPSS back end

o Levels: Tag, Micro, Mini, Reco, Raw, etc.

o Varied access patterns:

§ High rate and many parallel jobs to Micro

§ Lower rate and fewer jobs to Raw.

· Analysis farm: ~200 CPUs, Solaris, Gigabit Ethernet

· BaBar computing is divided into Tier A, B, and C. Three Tier-A sites: SLAC, IN2P3, RAL

· Offline clusters more likely GRID adaptable

o Distributed resources

o Distributed management

o Distributed users

Fermilab offline computing Farms (Stephen Wolbers)

· Batch system is the Fermilab-written FBSNG (Farms Batch System New Generation), a product that is the result of many years of evolution (ACP, cps, FBS, FBSNG)

· No real analysis farms.

· The farms are designed for a small number of large jobs and a small number of expert users

· 314 PCs + 124 more on order

· CDF: two I/O nodes (big SGI). Tape system directly attached to one of them

· Data volume per experiment per year ~doubles every 2.4 years (i.e. roughly 33% annual growth, since 2^(1/2.4) ≈ 1.33)

· Future: disk farms may replace tapes. Analysis farms?

HPC in the Human Genome Project (James Cuff)

· The Sanger Centre was founded in 1993 and now has >570 staff members

· Data volume: 3000 MB database for Human Genome.

· IT farms

o >350 Compaq Alpha systems

o +440 nodes annotation farm

o -> 750 Alphas in total

· Front-end compute servers to login to

· ATM backbone.

· LSF

· Computer systems architecture: Fiber channel/memory channel Tru64 (V5) clusters.

· Annotation farms: 8 racks of 40 Tru64 v5.0 nodes each. Total 320GB memory and 19.2 TB of spinning internal storage.

· Two network subnets (multicast and backbone)

· Highly available NFS (Tru64 CAA)

· Fast I/O (moved from ATM to switched full-duplex Ethernet)

· Socket data transfer (rdist, ..)

· Modular supercomputing

· Immediate future: looking heavily into SAN. Wide Area Clusters. GRID technology.

· Global compute engines

Linux clusters of the H1 experiment at DESY (Ralf Gerhards)

· ~50 nodes

· SuSE linux v3

· Batch system: PBS (some problems integrating it with AFS). No support contract with the PBS people.

· Data access: AFS, RFIO, Disk Cache

· H1 framework (for event distribution) is based on CORBA

· Conclusions:

o H1 computing fits well with a Linux-based desktop environment

o Investigate new data storage models

§ Move processes to data

§ Use linux file servers

PC Clusters at KEK (A.Manabe)

· Belle PC cluster:

o >400 CPUs across 3 clusters

o Works in cooperation with Sun servers for I/O

o Number of users is small (<5)

· Belle PC cluster (1): 1999. Machines installed by physicists; racks are homemade. Cabling done by staff.

· Belle PC cluster (2): since 2000 winter. Compaq Proliant

· Belle PC cluster (3): since 2001 March. Compaq Proliant (4CPU) x 60.

· NFS for data access

· No LSF. Job submission is done manually with the help of Perl scripts and a DB managing experiment run information (a sketch of this style of submission follows at the end of this list)

· 14TB disk (10TB RAID, 4TB local)

· 100BaseT

· Data servers: 40 Sun servers (each with a DTF2 tape drive). Tape library with 500TB capacity.

· Other activity: PC farm I/O R&D: HPSS (Linux HPSS client API driver by IBM), SAN (with Fujitsu), and GRID activity for ATLAS in Japan.
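
A sketch of the script-driven submission style described above, assuming a hypothetical SQLite run database and ssh-based dispatch (the runs.db schema, the reco_job command, and the node names are illustrative; the real site uses Perl scripts and its own run-information DB):

    # submit_runs.py - sketch of manual, DB-driven job submission.
    import sqlite3
    import subprocess
    from itertools import cycle

    NODES = [f"node{i:02d}" for i in range(1, 61)]  # e.g. 60 farm hosts

    def pending_runs(db_path="runs.db"):
        con = sqlite3.connect(db_path)
        rows = con.execute("SELECT run_number, input_file FROM runs "
                           "WHERE status = 'pending'").fetchall()
        con.close()
        return rows

    # Round-robin the pending runs over the farm nodes via ssh;
    # bookkeeping stays in the run database.
    for node, (run, infile) in zip(cycle(NODES), pending_runs()):
        subprocess.Popen(["ssh", node, "reco_job", str(run), infile])
        print(f"run {run} -> {node}")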

Lattice QCD (James Simone, FNAL)

· Parallelism: domain decomposition. Partition the spacetime lattice into smaller volumes and distribute them among parallel processes

· Each processor communicates only boundary sites with neighboring processors (see the halo-exchange sketch at the end of this section)

· Traditionally performed on commercial or specially built supercomputers

· Old FNAL built system ACPMAPS

· Proposal for US facilities: three ~10 Tflop/s facilities by FY2005

· Ramp up with 300 cluster nodes/year over three years for FNAL and JLAB

· 80-node PIII Cluster installed at FNAL September-October 2000.

o BIOS and EMP redirected to COM serial ports, monitored via Cyclades

o Remote boot via PXE-capable Ethernet BIOS

o Myrinet-2000 NICs and an 80-port switch

o Linux 2.2; PBS (Maui); MPI: MPICH/VMI (virtual driver layer provided by NCSA)

· Myrinet is much superior to TCP over Ethernet for MPI in terms of performance per CPU (Mflop/s per CPU vs. number of CPUs)

· FY2001 program: $0.6M to fund cluster investigations. Add 130 – 180 compute nodes. Expand the high-performance network (Myrinet, SCI, GigE)
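
The boundary-only communication pattern described at the top of this section, sketched as a 1-D halo exchange (a toy mpi4py example with stand-in field values; production lattice QCD codes are C/Fortran over MPICH/VMI):

    # halo_sketch.py - 1-D domain decomposition with exchange of
    # boundary (ghost) sites between neighbouring ranks, illustrating
    # why only surface data crosses the network.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    local = np.full(8, float(rank))  # this rank's slab of lattice sites
    ghosts = np.empty(2)             # boundary sites from both neighbours

    left, right = (rank - 1) % size, (rank + 1) % size  # periodic ring

    # Ship my right boundary right while receiving my left neighbour's
    # right boundary, then the mirror image; only boundaries move.
    comm.Sendrecv(local[-1:], dest=right, recvbuf=ghosts[0:1], source=left)
    comm.Sendrecv(local[:1], dest=left, recvbuf=ghosts[1:2], source=right)

    print(f"rank {rank}: ghosts from {left} and {right} = {ghosts}")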

Panel

· I/O servers and data access: pre-emptive LSF jobs fire off to replicate data locally on all nodes (the BioInfo case, where all jobs use the same data)

· Any tests using desktop CPUs for jobs? DESY tested this but no longer does (mainly a management issue: much cheaper to invest in a few more farm nodes than to keep a person for administration). Time-based scheduling is not new, but does it scale with many users? Works well if the desktops are uniform (US government…?). Tim: HEP is too chaotic, with users “enhancing” the environment.

· How is hardware upgraded? What’s the strategy, and how is old hardware phased out?

· Asset management? A huge issue at CERN. A vendor is looking at broadcasting the MAC address. Some new systems can load asset information into the BIOS (DMI dumper).

· What local file systems are used?

· Why don’t you use cfengine?

· Major issues from panel:

o AFS support

o File system usage is not scaling

o Data access

Day 2: 5/23/2001

CERN Clusters (Tim)

· 37 cluster configurations, e.g. CMS (a year ago):

o Interactive: Solaris, Linux

o Batch: Solaris, HP-UX, Linux

· Interactive cluster: 50 bi-processor PCs; 512MB, 440 – 800MHz

o DNS load balancing: DNS is updated every 30 seconds, with members chosen by load and other factors (a selection sketch follows at the end of this talk’s notes)

· Batch Clusters

o Some nodes are dedicated to experiments. Nodes can be dedicated when experiments hit their peak load; other jobs are then suspended. Still a little mechanical. Physically separate clusters are gone, but a logical separation remains.

· Batch clusters with scheduled access

· ASIS loads 3GB locally onto every machine, stretching the install process from a few minutes to several hours

· Console mgmt

o PCM (DEC POLYCENTER Console Manager)

o Console concentrators

o Cross-wiring serial ports: a cost-effective way of getting the ports, but a nightmare to manage

o Etherlite, VACM

· No power mgmt

· Monitoring:

o SURE: alarm

o perfmon

o remperf

o … about 16 different monitoring projects

Q: power management? A: a Btech(?) power switch with addressable power ports
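
A sketch of the selection step behind this style of DNS load balancing: every 30 seconds, rank the candidate hosts by load and publish only the least-loaded ones under the cluster alias. The load probe, host names, and the DNS hand-off are illustrative assumptions, not CERN’s actual code:

    # dns_lb_sketch.py - periodic least-loaded member selection for a
    # DNS-balanced alias.
    import subprocess
    import time

    CANDIDATES = ["lxnode01", "lxnode02", "lxnode03"]  # hypothetical

    def get_load(host):
        # Stand-in probe; a real system would ask a monitoring daemon.
        out = subprocess.run(["ssh", host, "cat", "/proc/loadavg"],
                             capture_output=True, text=True, timeout=10)
        return float(out.stdout.split()[0])

    while True:
        loads = {h: get_load(h) for h in CANDIDATES}
        members = sorted(loads, key=loads.get)[:2]  # two best hosts
        # Hand the winners to the DNS server (e.g. via nsupdate).
        print("alias members ->", members)
        time.sleep(30)  # matches the 30-second update interval above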

VA Linux

· The problem with denser floor space and multi-unit 1U boxes is cooling and power distribution

· Trying to provide a uniform management structure for each machine:

o Power cycle

o Get to the BIOS remotely

o Sensor data

· VACM is a management structure portable to all VA boxes.

o 64 nodes plus a controller node; the VACM user interface interacts with the nodes through the controller

o Serial port console re-direction

o Parallel rsh access

o System status

o EMP functionality. Hardware sensors (fan speeds, temperature, ..)

o The next version (VACM 3.0) will include alarms and SNMP plug-ins

· VACM is not specific to VA systems. Most of it is not Intel-specific, though probably quite Linux-specific. Very modular, with an open API.

· SystemImager allows a client system to be configured the way you want. Geared towards homogeneous clusters. The image server provides the golden image, which the client pulls over through different mechanisms. Allows file-by-file pulls (a client-pull sketch follows this Q&A).

· Q (Tim): how do you verify against the golden image? It is based on rsync, which checks for differences; only differences are updated. But what about a client image that goes bad? If a client image contains a file not present in the golden image, the file is removed. What about rsync scalability? With 100BaseT it is feasible to update 20 clients per server, but for larger clusters a hierarchy could be built. The next release will contain a new utility, pushUpdate; with multicast, performance will be significantly improved.

· Q (Tim): how does it handle errors? With 500 machines there will always be some down. If run non-interactively, the proper error code from the client is handled; with the pushUpdate utility this is logged. It goes ahead but doesn’t retry. A system coming back from repair could be handled with a fresh install.
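
SystemImager’s pull model is rsync underneath; a minimal sketch of the equivalent client-side pull, assuming the golden image is exported by an rsync daemon on the image server (server name, module, and excludes are illustrative, not SystemImager’s real layout). The --delete flag is what removes client files absent from the golden image, as discussed above:

    # image_pull_sketch.py - SystemImager-style client pull: rsync the
    # golden image from the image server, deleting local files that
    # are not in the image.
    import subprocess

    def pull_golden_image(server="imageserver", module="golden-image"):
        subprocess.run([
            "rsync", "-aHx", "--delete",
            # Never clobber volatile or host-specific trees.
            "--exclude=/proc/", "--exclude=/etc/hostname",
            f"rsync://{server}/{module}/", "/",
        ], check=True)

    if __name__ == "__main__":
        pull_golden_image()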

The SLAC cluster (Chuck Boeheim)

· Solaris Farm: 900 single CPU units

· Linux Farm: 256 CPU units installed and 256 on the truck

· 7 AFS servers (3TB), 21 NFS servers (16TB), 94 Objectivity servers (52TB)

· LSF

· HPSS with 10 tape movers (20 x 9840, 20 x 9940). Objectivity manages the disk and HPSS manages the tape.

· Interactive: 25 servers, + E10000

· Build farm: 12 servers, Solaris & Linux (for BaBar code builds, >7 MSLOC of C++). LSF-scheduled, but a set of core developers are allowed onto the servers directly. A full build takes ~24 hours.

· Network: 9 Cisco 6509 switches. Farm nodes are connected over 100BaseT.

· Q (Tim): who mounts NFS? Mostly farm nodes, using the automounter with soft mounts. Have seen mount storms, but no stale handles. (Tim) We see stale handles with soft mounts.

· Staffing:

o 7 sysadmin

o 3 mass storage

o 3 applications

o 1 batch

o 4 operations

o 0 operators! Works well.

o Same staff supports most UNIX desktops

· Approaching ~100 systems per staff member. To cope with this they always aim to reduce complexity.

· Issues: limited floor space. Power + cooling requirements go up per rack.

o With no operators, remote power and console management becomes important

o Console servers with up to 500 serial lines per server

o Installation. Haven’t been very successful with “burn-in”.

o Maintenance. Frequent problems with divergence from the original models (e.g. physical memory)

o Need database and bar-code to keep track of machines

· System admin:

o Network install using KickStart and JumpStart: 256 machines in < 1 hour

o Patches and configuration management done with a ~10-year-old homemade system

o Nightly maintenance

o System Ranger for monitoring (local tool)

o Report summarization: reporting becomes massive. A tool condenses reports to present common reports for groups of machines (a condensing sketch follows at the end of this talk’s notes)

· User Application issues:

o Workload scheduling starts to become complicated

o Startup effects, e.g. a user submits a job to 300 nodes and the loading of the executable kills the file server

o Mount-storms with AMD

· Q (Tim): does anybody do anything to dump Linux? No
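
A sketch of the report-condensing idea mentioned above: identical report lines from many machines are collapsed into one summary line listing how many (and which) hosts produced them. The "host: message" input format is an assumption about the nightly reports; SLAC’s actual tool is local:

    # condense_reports.py - collapse identical per-machine report lines
    # into one summary line per message, so 500 copies of the same
    # complaint read as one.
    import sys
    from collections import defaultdict

    by_message = defaultdict(list)
    for line in sys.stdin:
        host, _, message = line.partition(":")
        by_message[message.strip()].append(host.strip())

    # Most widespread messages first.
    for message, hosts in sorted(by_message.items(),
                                 key=lambda kv: -len(kv[1])):
        sample = ", ".join(sorted(hosts)[:5])
        print(f"[{len(hosts)} hosts] {message}  (e.g. {sample})")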

Farm Cluster observations (BNL, Thomas A. Yanuklis)

· Machine life-cycle: at the RCF, machines have a life span of 3 years before becoming obsolete

· Operational notes: no benchmarking. Only minor differences between interactive and batch systems (homogeneous clusters)

· Power and cooling are becoming a problem. Recently installed a new (30-ton?) AC unit.