HPCx Service Report
July 2004
1 Introduction
This report covers the period from 1 July 2004 at 0800 to 1 August 2004 at 0800. This gives a service month of 744 hours.
This month we delivered more than 3.3 million AUs to users, which is another record. This represents 75% of the available time. The number of incidents was down from 31 to 21.
2 Usage
2.1 Availability
Incidents
During this month, there were 21 incidents, 2 of which were at SEV 1. The following table indicates the severity levels of the incidents, where SEV 1 is defined as a Failure (in contractual terms). The definitions used for severity levels can be found in Appendix A.
Severity / Number
1 / 2
2 / 9
3 / 10
4 / 0
The attributions for the SEV 1 incidents were as follows:
Attribution / SEV 1 Incidents / MTBF (hours)
IBM / 1.0 / 732
Site / 1.0 / 732
External / 0.0 / ∞
Overall / 2.0 / 366
The following table gives more details on the Severity 1 incidents:
Incident / IBM / Site / External / Reason
04.151 / 100% / 0% / 0% / Emergency power down after fire condition
04.169 / 0% / 100% / 0% / Maintenance session overrun
Serviceability
There was a total of 22.4 hours of scheduled downtime this month.
Attribution / UDT (h:mm) / Serviceability (%)
IBM / 1:30 / 99.8
Site / 8:43 / 98.8
External / 0:00 / 100.0
Overall / 10:13 / 98.6
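The serviceability percentages in the table above follow from the unscheduled downtime (UDT) and the 744-hour service month. A minimal Python sketch, assuming serviceability is simply uptime as a fraction of the service month (the exact contractual formula may differ):

```python
# Serviceability from unscheduled downtime (UDT) given as "h:mm" strings.
# Assumption: serviceability = (service hours - UDT) / service hours.

SERVICE_HOURS = 744  # 1 July at 0800 to 1 August at 0800

def udt_hours(hhmm: str) -> float:
    """Convert an 'h:mm' downtime string to decimal hours."""
    h, m = hhmm.split(":")
    return int(h) + int(m) / 60

def serviceability(hhmm: str) -> float:
    """Percentage of the service month for which the service was up."""
    return round(100 * (SERVICE_HOURS - udt_hours(hhmm)) / SERVICE_HOURS, 1)

for who, udt in [("IBM", "1:30"), ("Site", "8:43"),
                 ("External", "0:00"), ("Overall", "10:13")]:
    print(who, serviceability(udt))  # reproduces the table: 99.8, 98.8, 100.0, 98.6
```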
2.2 CPU Usage by Consortium
The PIs and titles for the various consortia are listed in Appendix B.
Consortium / CPU Hours (Parallel) / CPU Hours (Other) / AUs charged / %age
e01 / 159833 / 105 / 592083 / 17.8%
e02 / 717 / 150 / 3353 / 0.1%
e03 / 147763 / 0 / 571474 / 17.2%
e04 / 9275 / 39 / 36023 / 1.1%
e05 / 103237 / 134 / 399790 / 12.0%
e06 / 290058 / 561 / 1123969 / 33.8%
e07 / 5365 / 3 / 20758 / 0.6%
e08 / 4003 / 1 / 15485 / 0.5%
e10 / 2604 / 0 / 10071 / 0.3%
e11 / 956 / 109 / 4119 / 0.1%
e12 / 6869 / 1 / 26569 / 0.8%
e15 / 1487 / 0 / 5749 / 0.2%
e18 / 6680 / 14 / 25889 / 0.8%
e20 / 51258 / 4 / 198259 / 6.0%
EPSRC Total / 790105 / 1122 / 3033590 / 91.3%
n01 / 23396 / 22 / 90570 / 2.7%
n02 / 3666 / 2 / 14184 / 0.4%
n03 / 2074 / 0 / 8020 / 0.2%
n04 / 71 / 0 / 273 / 0.0%
NERC Total / 29206 / 24 / 113047 / 3.3%
p01 / 388 / 26 / 1601 / 0.0%
PPARC Total / 388 / 26 / 1601 / 0.0%
c01 / 21127 / 102 / 82101 / 2.5%
CCLRC Total / 21127 / 102 / 82101 / 2.5%
b02 / 28 / 0 / 107 / 0.0%
b03 / 2770 / 3 / 10725 / 0.3%
b06 / 8 / 0 / 31 / 0.0%
b07 / 4 / 0 / 15 / 0.0%
BBSRC Total / 2810 / 3 / 10879 / 0.3%
x01 / 1351 / 1 / 5230 / 0.2%
External Total / 1351 / 1 / 5230 / 0.2%
z001 / 16851.4 / 24.8 / 65268 / 2.0%
z002 / 1217.5 / 0 / 4709 / 0.1%
z004 / 1063 / 61.3 / 4348 / 0.1%
HPCx Total / 20471.8 / 87.4 / 79512 / 2.4%
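Each %age figure in the table above is that consortium's AUs charged as a share of all AUs delivered this month. A minimal sketch, taking the total as the sum of the group totals listed above:

```python
# Each consortium's "%age" is its AUs charged divided by the total AUs
# delivered across all groups this month (about 3.33 million).

TOTAL_AUS = 3033590 + 113047 + 1601 + 82101 + 10879 + 5230 + 79512

def share(aus_charged: int) -> str:
    """Format a consortium's share of total AUs to one decimal place."""
    return f"{100 * aus_charged / TOTAL_AUS:.1f}%"

print(share(592083))   # e01 -> 17.8%
print(share(1123969))  # e06 -> 33.8%
```

Note that the EPSRC Total percentage (91.3%) is the sum of the rounded per-consortium rows, so it can differ slightly from dividing the EPSRC total AUs directly.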
2.3 CPU Usage by Job Type
The figures for Raw AUs given here show the number of AUs actually supplied by the system to users’ jobs. They use the conversion rate for the AU which corresponds to the results of the Linpack benchmark running on the new platform; that is, 1 CPU hour = 3.8675 AUs.
Number of Processors / Raw AUs / %age / Number of Jobs
≤32 / 276804 / 8.3% / 4493
33–64 / 388665 / 11.6% / 2259
65–128 / 962990 / 28.7% / 694
129–256 / 523172 / 15.6% / 207
257–512 / 596320 / 17.8% / 119
513–1024 / 596576 / 17.8% / 46
>1024 / 2630 / 0.1% / 2
The system is divided into three regions.
Development Region (9 frames, jobs using ≤64 CPUs): a total of 665469 raw AUs were used; that is 80.3% of the total available in this region.
Production Region (40 frames, jobs using >64 CPUs): a total of 2681690 raw AUs were used; that is 72.8% of the total available in this region.
The remaining frame is reserved for interactive parallel jobs.
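The regional utilisation figures follow from the frame counts, the 744-hour service month, and the AU conversion rate. A minimal sketch, assuming 32 CPUs per frame (consistent with the reported percentages, but not stated in this report):

```python
# Utilisation of a scheduling region: raw AUs used divided by the AUs
# available (frames x CPUs per frame x hours x AUs per CPU hour).
# Assumption: 32 CPUs per frame.

AU_PER_CPU_HOUR = 3.8675  # Linpack-based AU conversion rate
HOURS = 744               # service month
CPUS_PER_FRAME = 32       # assumed frame size

def region_utilisation(frames: int, raw_aus_used: float) -> float:
    """Percentage of a region's available AUs actually delivered to jobs."""
    available = frames * CPUS_PER_FRAME * HOURS * AU_PER_CPU_HOUR
    return round(100 * raw_aus_used / available, 1)

print(region_utilisation(9, 665469))    # Development Region -> 80.3
print(region_utilisation(40, 2681690))  # Production Region -> 72.8
```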
2.4 Slowdown and Job Wait Times
Slowdown
Slowdown is a widely used measure of the relative wait times of different classes of jobs. It is defined as:
Slowdown = (job run time + job wait time) / (job run time)
Slowdowns of less than around 10 are usually regarded as reasonable. The graph below plots slowdown against run-time (ignoring jobs of less than 5 minutes duration). In general, slowdown this month was satisfactory, reflecting the increased size of the system and the arrival of the summer holidays.
In the graph below, we plot the slowdown figures against the number of processors used, ignoring development jobs of less than 1 hour.
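As defined above, slowdown is simply a job's total time in the system relative to its run time; a value of 1.0 means the job never waited. A minimal sketch with illustrative numbers:

```python
# Slowdown of a job: (run time + wait time) / run time.

def slowdown(run_hours: float, wait_hours: float) -> float:
    return (run_hours + wait_hours) / run_hours

# A 2-hour job that waited 4 hours has slowdown 3.0; a 10-hour job with
# the same 4-hour wait has slowdown 1.4 -- long jobs tolerate waits better.
print(slowdown(2, 4))   # 3.0
print(slowdown(10, 4))  # 1.4

# This is also why jobs of less than 5 minutes are excluded from the
# graphs: even a short wait gives a very short job an enormous slowdown.
```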
Job wait times
The following table shows the average wait time (in hours) for each class of job. These are also satisfactory in general, although the wait times for certain capability-level classes will be kept under review.
Job Class / Category / Maximum Number of CPUs / Maximum Job Length (hours) / Average Wait Time (hours) / Number of Jobs
par32_1 / parallel / 32 / 1 / 1.5 / 2347
par32_3 / parallel / 32 / 3 / 9.9 / 170
par32_6 / parallel / 32 / 6 / 3.6 / 1976
par64_1 / parallel / 64 / 1 / 1.5 / 553
par64_3 / parallel / 64 / 3 / 13.6 / 52
par64_6 / parallel / 64 / 6 / 1.5 / 1654
par128_1 / parallel / 128 / 1 / 1.2 / 290
par128_3 / parallel / 128 / 3 / 5.1 / 82
par128_6 / parallel / 128 / 6 / 3.7 / 90
par128_9 / parallel / 128 / 9 / 4.6 / 2
par128_12 / parallel / 128 / 12 / 17.3 / 230
par256_1 / parallel / 256 / 1 / 2.0 / 103
par256_3 / parallel / 256 / 3 / 8.1 / 15
par256_6 / parallel / 256 / 6 / 3.4 / 18
par256_9 / parallel / 256 / 9 / 0.0 / 0
par256_12 / parallel / 256 / 12 / 8.0 / 71
par512_1 / parallel / 512 / 1 / 6.8 / 45
par512_3 / parallel / 512 / 3 / 6.2 / 12
par512_6 / parallel / 512 / 6 / 5.0 / 21
par512_9 / parallel / 512 / 9 / 0.0 / 0
par512_12 / parallel / 512 / 12 / 13.3 / 41
par1024_1 / parallel / 1024 / 1 / 20.0 / 13
par1024_3 / parallel / 1024 / 3 / 0.1 / 11
par1024_6 / parallel / 1024 / 6 / 4.0 / 2
par1024_9 / parallel / 1024 / 9 / 0.0 / 0
par1024_12 / parallel / 1024 / 12 / 26.4 / 20
par1280_1 / parallel / 1280 / 1 / 0.0 / 0
par1280_3 / parallel / 1280 / 3 / 0.0 / 0
par1280_6 / parallel / 1280 / 6 / 0.0 / 0
par1280_9 / parallel / 1280 / 9 / 0.0 / 0
par1280_12 / parallel / 1280 / 12 / 0.0 / 0
serial_1 / serial / 1 / 1 / 15.0 / 111
serial_3 / serial / 1 / 3 / 2.9 / 86
serial_6 / serial / 1 / 6 / 0.7 / 28
serial_9 / serial / 1 / 9 / 0.0 / 2
serial_12 / serial / 1 / 12 / 0.0 / 0
inter32_1 / interactive / 32 / 1 / 0.0 / 2836
course32_1 / parallel / 32 / 1 / 0.0 / 55
2.5 Disk Occupancy
Home Space
Home space is the part of the disk space that is regularly backed up.
Consortium / Disc Occupancy (Kb) / Disc Quota (Kb)
b01 / 64 / 1,024,000
b02 / 4,252,320 / 51,200,000
b03 / 4,096 / 51,200,000
b04 / 64 / 51,200,000
b05 / 193,792 / 51,200,000
b06 / 51,744 / 51,200,000
b07 / 40,832 / 20,480,000
c01 / 47,185,504 / 51,200,000
e01 / 49,252,800 / 50,006,016
e02 / 35,045,920 / 39,760,896
e03 / 93,839,200 / 102,412,288
e04 / 92,862,912 / 102,400,000
e05 / 98,777,920 / 225,280,000
e06 / 159,689,216 / 204,800,000
e07 / 5,606,720 / 20,480,000
e08 / 9,961,408 / 20,480,000
e09 / 64 / 51,200,000
e10 / 8,494,496 / 10,240,000
e11 / 80,838,592 / 102,400,000
e12 / 7,459,680 / 20,480,000
e13 / 621,632 / 102,400,000
e14 / 64 / 102,400,000
e15 / 1,114,720 / 51,200,000
e16 / 64 / 20,480,000
e17 / 192 / 51,200,000
e18 / 14,306,688 / 40,960,000
e19 / 64 / 40,960,000
e20 / 10,179,872 / 30,720,000
n01 / 34,120,800 / 51,200,000
n02 / 20,475,168 / 40,960,000
n03 / 41,773,376 / 51,202,048
n04 / 93,973,824 / 102,400,000
n05 / 2,080 / 10,240,000
p01 / 29,114,272 / 30,720,000
x01 / 6,968,704 / 51,200,000
z001 / 151,179,584 / 163,842,048
z002 / 31,813,696 / 49,153,024
z003 / 256 / 3,072
z004 / 49,080,640 / 51,200,000
z05 / 256 / 20,480,000
z06 / 39,820,032 / 51,200,000
z07 / 5,892,224 / 10,240,000
Workspace
Consortium / Disc Occupancy (Kb) / Disc Quota (Kb)
b01 / 64 / 5,120,000
b02 / 15,008 / 1,049,600
b03 / 4,552,192 / 102,400,000
b04 / 64 / 102,400,000
b05 / 192 / 102,400,000
b06 / 224 / 102,400,000
b07 / 128 / 40,960,000
c01 / 25,428,288 / 30,720,000
e01 / 881,653,696 / 1,177,600,000
e02 / 3,115,200 / 10,240,000
e03 / 9,824 / 524,288
e04 / 733,994,560 / 1,126,400,000
e05 / 43,384,896 / 110,592,000
e06 / 152,672,480 / 204,800,000
e07 / 8,124,096 / 102,398,976
e08 / 128 / 1,024,000
e09 / 64 / 102,400,000
e10 / 17,814,272 / 51,200,000
e11 / 160 / 102,400,000
e12 / 743,584 / 102,400,000
e13 / 450,506,048 / 512,000,000
e14 / 64 / 102,400,000
e15 / 192 / 102,400,000
e16 / 64 / 61,440,000
e17 / 96 / 102,400,000
e18 / 160 / 81,920,000
e19 / 64 / 81,920,000
e20 / 139,429,280 / 1,024,000,000
n01 / 179,936,800 / 256,000,000
n02 / 212,375,488 / 409,601,024
n03 / 21,504 / 1,026,048
n04 / 68,076,960 / 204,798,976
n05 / 25,564,480 / 92,160,000
p01 / 192 / 1,024,000
x01 / 384 / 102,400,000
z001 / 73,761,216 / 204,798,976
z002 / 211,264 / 788,480
z003 / 192 / 3,072
z004 / 1,056 / 1,024,000
z05 / 96 / 1,024,000
z06 / 18,451,680 / 102,400,000
z07 / 1,056 / 1,024
2.6 Tape Archive
Consortium / Usage (Tapes) / Quota (Tapes) / Files / Data (Gb)
c01 / 2 / 2 / 8 / 16
e01 / 15 / 15 / 20,161 / 1,377.2
e03 / 2 / 2 / 5,010 / 71.3
e04 / 2 / 2 / 1,356 / 170.3
n01 / 24 / 35 / 829 / 2,339
n02 / 7 / 10 / 5,270 / 73.3
z001 / 2 / 2 / 4,932 / 17.8
z002 / 4 / 4 / 368 / 11.3
z06 / 1 / 1 / 833 / 67.9
Note that a tape is counted in the Usage column even if it is only partly occupied.
3 Support
3.1 Helpdesk
Classifications
Category / Number / % of all
Administrative / 56 / 42.1
Technical / 73 / 54.9
In-depth / 4 / 3.0
PMR / 0 / 0.0
TOTAL / 133 / 100.0
The PMR category indicates in-depth queries that result in Problem Management Reports for IBM.
Service Area / Number / % of all
Phase 2 platform / 110 / 82.7
Website / 6 / 4.5
Other/general / 17 / 12.8
TOTAL / 133 / 100.0
Performance
All non-in-depth queries / Number / % / Target
Finished within 24 Hours / 114 / 88.4 / 75%
Finished within 72 Hours / 129 / 100.0 / 97%
Finished after 72 Hours / 0 / 0.0
Administrative queries / Number / % / Target
Finished within 48 Hours / 56 / 100.0 / 97%
Finished after 48 Hours / 0 / 0.0
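The performance percentages above are cumulative shares of the 129 non-in-depth queries (133 queries in total, minus the 4 in-depth ones). A minimal sketch:

```python
# Helpdesk performance: cumulative percentage of non-in-depth queries
# resolved within each deadline. Of 133 queries, 4 were in-depth.

TOTAL_QUERIES = 133
IN_DEPTH = 4
NON_IN_DEPTH = TOTAL_QUERIES - IN_DEPTH  # 129

def resolved_pct(finished: int) -> float:
    """Percentage of non-in-depth queries finished by a deadline."""
    return round(100 * finished / NON_IN_DEPTH, 1)

print(resolved_pct(114))  # within 24 hours -> 88.4 (target 75%)
print(resolved_pct(129))  # within 72 hours -> 100.0 (target 97%)
```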
Experts Handling Queries
Expert / Admin / Technical / In-Depth / PMR
epcc.ed.ac.uk / 36 / 27 / 2 / 0
dl.ac.uk / 1 / 8 / 1 / 0
Sysadm / 19 / 38 / 1 / 0
Other people / 0 / 0 / 0 / 0
3.2 Training
Title of Course / Start Date / Length (Days) / Place-days / HPCx Users / HPCx Staff
Improved Performance Scaling on HPCx (Edinburgh) / 8-Jul / 1 / 26 / 9 / 4
4 Staffing
4.1 Science Support Staffing
Daresbury Laboratory
Name / Days
Ashworth / 12.0
Blake / 0.7
Bush / 21.0
Guest / 5.3
Jones / 1.6
Plummer / 15.0
Sherwood / 2.6
Sunderland / 21.0
Thomas / 9.5
Pickles / 3.4
Total (Days) / 92.0
FTEs / 5.2
EPCC
Name / Days
Simpson / 11.5
Booth / 14.5
Henty / 17.0
Smith / 18.5
Bull / 11.8
Fisher / 9.0
Hein / 24.8
Jackson, Adrian / 18.0
Pringle / 13.2
Reid / 21.5
Carter / 0.7
Breitmoser / 4.0
Stratford / 2.3
Nowell / 0.7
Helpdesk / 1.0
Total (Days) / 168.5
FTEs / 9.5
In the table above, the figures for Stephen Booth and Adrian Jackson were estimated on the basis of their work schedules, since they were away on leave. If necessary, they will be corrected in the quarterly report.
Overall Levels
FTEs
DL / 5.2
EPCC / 9.5
Total / 14.7
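The FTE figures are the days booked divided by the working days in the month. A minimal sketch; the divisor of 17.75 working days per month is an assumption inferred from the reported totals, not a figure stated in this report:

```python
# Converting days booked to FTEs. The divisor of 17.75 working days per
# month is an ASSUMPTION inferred from the reported totals (92.0 days ->
# 5.2 FTEs, 168.5 -> 9.5, 110.8 -> 6.2); the contractual figure may differ.

WORKING_DAYS_PER_MONTH = 17.75

def ftes(days: float) -> float:
    """Full-time equivalents for a given number of days booked."""
    return round(days / WORKING_DAYS_PER_MONTH, 1)

print(ftes(92.0))   # DL science support -> 5.2
print(ftes(168.5))  # EPCC science support -> 9.5
print(ftes(110.8))  # systems staffing -> 6.2
```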
4.2 Systems Staffing
Name / Days
Andrews / 12.0
Blake / 0.0
Brown / 21.0
Elwell / 15.8
Fisher / 8.0
Franks / 15.8
Jones / 1.6
Shore / 15.8
BITD / 21.0
Total (days) / 110.8
FTEs / 6.2
Note: BITD covers a range of bookings from a support department that provides approximately 1 FTE to support computer room operations, electrical and mechanical site services, and networking and security. Roughly a dozen staff charge time to the project in amounts that vary from month to month. We believe that reporting these individual bookings adds no value, although a full listing can be provided annually if required.
5 Summary of Performance Metrics
Metric / TSL / FSL / Monthly Measurement
Technology serviceability / 80% / 99.2% / 99.8%
Technology MTBF (hours) / 200 / 300 / 732
Number of AV FTEs / 7.5 / 10 / 14.7
Number of training days per month / 30/12 / 40/12 / 23/7
Non in-depth queries resolved within 3 days / 85% / 97% / 100%
Number of A&M FTEs / 3.75 / 5.75 / 6.2
A&M serviceability / 80% / 100% / 98.8%
Appendix A: Incident Severity Levels
SEV 1: anything that constitutes a FAILURE as defined in the contract with EPSRC.