Power5 HPS Service Pack Upgrade

GA7 SP10 to GA7 SP12

Terry Dietter

Dave Trnka

IBM Poughkeepsie

April 26, 2007


Table of Contents

Introduction

Stage 1 – HMC Update

Stage 2 – Cluster Management Server Update

Stage 3 – Power Subsystem Code and System Firmware Concurrent Update

Stage 4 – Managed System LPP Updates

Appendix A: AIX Health Check List

Appendix B: Cluster Ready Hardware Server Health Check List

Appendix C: Verifying Power Subsystem Code levels and Status

Power5 HPS Service Pack Upgrade

Introduction

The upgrade of a Power5 Federation cluster is a multi-step process. Every Power5 Service Pack released by IBM covers the three main Hardware code levels: GA5, GA6, and GA7. It also covers both the AIX 5.2 and AIX 5.3 releases. Each Service Pack is described in detail in a README provided by IBM on the following website:

The purpose of this document is to provide a detailed view of the upgrade steps that were taken during an upgrade in IBM Poughkeepsie. This particular upgrade was a GA7 Service Pack 10 to GA7 Service Pack 12 upgrade on a Single Frame 12 Node Power5 IH HPS Cluster. Because the upgrade stayed within the confines of GA7, the majority of it was completed concurrently. The upgrade was completed in four stages, allowing the system to run workload while the upgrade was in progress; the only real outage was the node reboot in the final stage. This gives the customer greater flexibility during GA7 upgrades, allowing each Stage to be completed in separate non-disruptive service windows.

This document is intended as a reference only. It depicts one scenario for an upgrade and does not cover all combinations of Power5 HPS upgrades.

The following table is an overview of the code levels that were touched upon in this upgrade.

Code or LPP / Level (AIX 5.3 Environment)
HMC / V6R1.0+MH00839
HMC-HPSNM / MH00817 (Service Pack 7)
Power Subsystem / BP240_203
GFW (System) / SF240_284
AIX / 5300-05-03
VSD / 4.1.0
LAPI / 2.4.3
HPS/SNI / 1.2.0
PE / 4.2.2
LoadL / 3.3.2
GPFS / 3.1.0
CSM / 1.5.1
RSCT / 2.4.6
ESSL / 4.2.0
PESSL / 3.3.0

Stage 1 – HMC Update

  • The first step in any update is to preserve your existing environment in case a recovery action is required. The HMC has tools that allow you to do this, the first being Backup Critical Console Data. This takes a snapshot of your HMC OS and copies it to DVD or a remote site. This operation can take an hour or more in some cases.

Use the provided eServer Backup DVD (PN:09P5407)

From the Licensed Internal Code Maintenance pulldown

Select HMC Code Update

Select Back up Critical Console Data

The second tool is Save Upgrade Data, a smaller snapshot that backs up mainly console configuration data (i.e., network and user configurations) to the HMC hard drive. This data can be restored if the HMC needs to be reinstalled. The actual node LPAR configuration is stored on each node's service processor and is not affected by what occurs on the HMC.
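
If you prefer the command line, the Save Upgrade Data task can also be run from the HMC restricted shell; a minimal sketch, assuming remote command execution is enabled on the HMC and using a hypothetical HMC hostname:

ssh hscroot@c595hmc1 "saveupgdata -r disk"

This saves the upgrade data to the HMC hard disk, the same result as the GUI task.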

  • Save your network settings:

From the HMC Management pulldown

Select HMC Configuration

Select Customize Network Settings

Be sure to collect the information on each tab above. This information is handy to have should an unrecoverable condition occur on your HMC where the Save Upgrade Data cannot be accessed. This step is optional, but recommended.
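
As an optional cross-check, the same network information can be captured from the HMC restricted shell and kept with your other recovery data; a sketch, again assuming remote command execution is enabled and a hypothetical HMC hostname:

ssh hscroot@c595hmc1 "lshmc -n"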

  • Install Corrective Service: This step updates the HMC to the levels indicated below. This can be accomplished either through the FTP download method or a local DVD; the FTP download is the easiest. When multiple updates are being applied (i.e., HMC, HPSNM, Security), you can save the reboot for the final update or reboot with each one, at the administrator's preference. To shorten the HMC outage window, a single reboot was done here.

Tip: For patch file locations see “Installation Instructions” tab under the respective HMC version corrective service support link:

From the Licensed Internal Code Maintenance pulldown

Select HMC Code Update

Select Install Corrective Service

Enter your FTP site or use IBM’s software site

The Corrective Service for this upgrade will result in the following changes:

From: Version: 6

Release: 1.0

HMC Build level 20060801.1

MH00781: Required fixes for V6R1.0 (08-03-2006)

To: Version: 6

Release: 1.1

HMC Build level 20061103.1

  • Install the HMC HPSNM code. Although Power5 Federation requires CRHS, which runs on the CMS, it is still recommended to perform this update.

  • Reboot the HMC.
  • When the HMC comes back up, verify the updated levels.

The above verification from the HMC GUI is just one way to confirm the update. Other HMC 'hscroot' commands can be executed via the Management Server 'dsh' command. See Appendix B for additional tools to verify that your HMC has been successfully updated and has rejoined the cluster. Cluster Ready Hardware Server (CRHS) is the most important aspect of the Power5 Federation cluster, and the HMC plays a critical role in it; proper CRHS operation is required before moving forward with the remaining upgrade steps.
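
For example, the new level can be confirmed on every HMC from the Management Server; a minimal sketch, assuming ssh access to the HMCs as hscroot is already configured and using hypothetical HMC hostnames:

dsh -l hscroot -n c595hmc1,c595hmc2 "lshmc -V"

Each HMC should report Release 1.1 and HMC Build level 20061103.1 before proceeding.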

Stage 2 – Cluster Management Server Update

  • Once the HMC update and verification stage is complete, the cluster can remain in this state until the next service window, or the cluster update can continue with the next stage.
  • A very useful tool for AIX upgrades is the 'multibos' utility. Multibos allows administrators to copy the existing OS to a Standby BOS (within the same rootvg) while updating that BOS at the same time. Ensure there are sufficient FREE partitions on the hard drive to perform this task; logical volumes hd5, hd4, hd2, hd9var and hd10opt will be copied. This upgrade uses multibos. However, due to an existing multibos PMR 10314,999,866, the update was performed in two steps: build a Standby BOS with the AIX updates only, then boot off the Standby BOS and update the HPC LPPs. Although this requires two reboots, it leaves the original AIX environment intact in case you need to revert. This update brings the AIX level up to AIX 5300-05-03, with the HPC LPPs being updated via “Order ALL Fixes” APAR IY92413.
  • Build a Standby BOS, updating the AIX images only

Remove the multibos output log on the Management Server.

"rm /etc/multibos/logs/op.alog"

Clean up the old Standby BOS on the Management Server.

"multibos -tR"

Commit all filesets.

"installp -c all"

Build the Standby BOS using the AIX 5300-05-03 update images.

"multibos -Xsa -l /csminstall/aix530503"

Check the multibos log for errors in /etc/multibos/logs/op.alog
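
The op.alog file is stored in alog format; one way to scan it for problems (a quick sketch):

alog -f /etc/multibos/logs/op.alog -o | grep -i error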

Verify the bootlist is the Standby BOS boot device.

"bootlist -m normal -o"

  • Launch the standby BOS command prompt.

"multibos -S"

  • Verify the Standby BOS.

See Appendix A: AIX Health Check List for suggested checks

  • Exit out of multibos via “exit”
  • Boot off the standby BOS

Verify the Standby BOS boot devices and that the correct bootlist is set:

  • "lsvg -l rootvg"

The original BOS LVs should be “open/syncd”, while the Standby BOS (bos_*) LVs should be “closed/syncd”.

  • "bootlist -m normal -o"

The new bootlist generated by multibos should now show the Standby BOS boot device first in order, e.g., hdisk0 blv=bos_hd5.

  • "shutdown -Fr"
  • Verify that the Cluster Management Server booted off the Standby BOS

"lsvg -l rootvg" - The opposite of the above should now be true.

  • Update the HPC LPP pack using update_all

Use “Order ALL Fixes” APAR IY92413 without “order the latest”.
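
One way to drive the update_all operation from the command line once the images are staged; a sketch only, where /csminstall/sp12/hpc is a hypothetical download location for the APAR images:

install_all_updates -Yd /csminstall/sp12/hpc

The smit update_all fastpath against the same directory is equivalent; verify afterwards with "lppchk -v".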

  • Reboot the Cluster Management Server
  • Verification

See Appendix A: AIX Health Check List

Appendix B: Cluster Ready Hardware Server Health Check List.

Stage 3 – Power Subsystem Code and System Firmware Concurrent Update

  • There are several ways to update Power Subsystem Code and System Firmware. The approach in this document was to use the available CSM commands. These commands are issued from the Management Server and are a considerable time saver. Remember, this is a GA7 concurrent upgrade; the workload continues to run.

Tip: CSM recommends using the HMC IP address for the ConsoleServerName and HWControlPoint attributes in the node definitions. Ensure the node definitions are set up correctly using the 'lsnode' command.

"lsnode -a ConsoleServerName"

"lsnode -a HWControlPoint"

See the chnode command to change node attributes.
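
For example, any node whose HWControlPoint does not point at the HMC can be spotted quickly; a sketch, using 10.1.0.1 (the HMC address seen elsewhere in this document) as the expected value:

lsnode -a HWControlPoint | grep -v 10.1.0.1

Any node returned needs its attributes corrected with chnode before continuing.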

  • You must enable remote copy to the HMCs for rfw commands to work

“chhwdev -a RemoteCopyCmd=/usr/bin/scp RemoteShellUser=hscroot”

  • Obtain Power Subsystem and System firmware code.

02BP240_203_168.rpm – Power Subsystem code

01SF240_284_201.rpm – System firmware

  • Copy the code to the proper CSM location using the mkflashfiles command.

“mkflashfiles -f /csminstall/sp12/02BP240_203_168.rpm”

“mkflashfiles -f /csminstall/sp12/01SF240_284_201.rpm”

  • Before updating, it is a good idea to verify the existing Power Subsystem code levels and status using the HMC LIC GUI Frame information. This ensures that your existing Frame Power and FRU Code levels are correct.

See Appendix C: Verifying Power Subsystem Code levels and Status.

  • The code levels can also be checked via the CSM rfwscan command.

[c595mgrs][/]> rfwscan -n c595sq01

Nodename = c595sq01.ppd.pok.ibm.com

Managed System Release Code Level = 01SF240

Active Service Code Level = 261

Installed Service Code Level = 261

Accepted Service Code Level = 261

Power Subsystem MTMS = 9458-100*99200PA

Power Subsystem Release Code Level = 02BP240

Active Service Code Level = 197

Installed Service Code Level = 197

Accepted Service Code Level = 197

In the above example, the Active, Installed, and Accepted levels are the same for both the Power Subsystem Code and the System Firmware.

Active is the code that is presently running. Installed is the code that was last installed (temporary). Accepted is the code that is now the backup, to which you can return if you decide to remove the Installed level. You will notice throughout these upgrade steps that these levels change.

Tip: Use rfwscan -xa to view levels in machine-parsable format

Tip: If your system contains switch-only frames, the power update for these frames must be done via the HMC LIC GUI.

  • Update the Power Subsystem Code for all Frames.

For the purpose of this exercise, we had no switch-only frames, so we were able to update the Power Subsystem and Switch Power using the CSM commands. Targeting a node in a frame and specifying the type as 'power' updates the power code in that frame, not the system firmware on the node.

[c595mgrs][/]> rfwflash -n c595sq01 -f -t power --activate concurrent

Flashing code level 02BP240_203_168.rpm.

Querying LIC levels on targets.

Finished querying LIC levels on targets.

Waiting for 10.1.0.1 to be available

Copying Code Update Packages to targets.

Running Code Update Packages on targets.

Note: This operation may take a long time to complete.

This took approximately one hour to complete.

  • Scan the results of the Power Subsystem Update

[c595mgrs][/]> rfwscan -n c595sq01

Nodename = c595sq01.ppd.pok.ibm.com

Managed System Release Code Level = 01SF240

Active Service Code Level = 261

Installed Service Code Level = 261

Accepted Service Code Level = 261

Power Subsystem MTMS = 9458-100*99200PA

Power Subsystem Release Code Level = 02BP240

Active Service Code Level = 203

Installed Service Code Level = 203

Accepted Service Code Level = 197

Note that the Power Subsystem Active and Installed levels have changed to 203, while the Accepted level remains at 197.

See Appendix C to verify the Frame Power code levels and status again, as was done prior to the Power upgrade. Once comfortable that the Power Code update has been successful on all Frames, move on to the system firmware.
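
To compare levels across all managed systems at once, the full scan can be filtered; a sketch:

rfwscan -a | grep -E "Nodename|Service Code Level"

Every frame should report Power Subsystem Active and Installed at 203, with Accepted still at 197, before the system firmware is touched.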

  • Update the Managed System Firmware Code for a single node first. Notice that the type (-t) is now 'system', which updates the system firmware.

[c595mgrs][/]> rfwflash -n c595sq01 -f -t system --activate concurrent

Flashing code level 01SF240_284_201.rpm.

Querying LIC levels on targets.

Finished querying LIC levels on targets.

Waiting for 10.1.0.1 to be available

Copying Code Update Packages to targets.

Running Code Update Packages on targets.

Note: This operation may take a long time to complete.

  • Scan the results of the Managed System Firmware Update.

[c595mgrs][/]> rfwscan -n c595sq01

Nodename = c595sq01.ppd.pok.ibm.com

Managed System Release Code Level = 01SF240

Active Service Code Level = 284

Installed Service Code Level = 284

Accepted Service Code Level = 261

Power Subsystem MTMS = 9458-100*99200PA

Power Subsystem Release Code Level = 02BP240

Active Service Code Level = 203

Installed Service Code Level = 203

Accepted Service Code Level = 197

  • After things check out on this managed system, update the remaining managed systems. You will notice a 'disruptive' message below for the system that has already been updated; this is OK, and that system will be skipped in this step.

[c595mgrs][/]> rfwflash -a -f -t system --activate concurrent

Flashing code level 01SF240_284_201.rpm.

Querying LIC levels on targets.

Finished querying LIC levels on targets.

rfwflash: 2651-404 The requested operation is disruptive for the following nodes:

c595sq01.ppd.pok.ibm.com

Please run the command again with the "--activate disruptive" flag to update

these nodes.

Waiting for 10.1.0.1 to be available

Copying Code Update Packages to targets.

Running Code Update Packages on targets.

Note: This operation may take a long time to complete.

  • Once complete, scan all Managed Systems for proper code levels.

See Appendix C: Verifying Power Subsystem Code levels and Status

  • When confidence is high that all code updates can be made Permanent, use the following command to do so. This will move the Accepted levels up to the same levels as Active and Installed.

"rfwflash -a -f --commit"
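
Afterwards, a quick scan confirms the commit took effect; a sketch:

rfwscan -a | grep "Accepted Service Code Level"

Every managed system should now report Accepted levels of 284 (system firmware) and 203 (power).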

Stage 4 – Managed System LPP Updates

This stage updates the AIX and HPC filesets on the nodes. It requires an outage to reboot the nodes.

  • Update the AIX and HPC LPP filesets. This stage follows the same 'multibos' steps as used on the Management Server in Stage 2.
  • Build the Standby BOS using the multibos commands

Export and mount the proper fileset repository on all the nodes.

dsh -av "mount c595mgrs:/csminstall /csminstall"

Remove the multibos output log on all the nodes.

dsh -av "rm /etc/multibos/logs/op.alog"

Clean up old Standby BOS on all the nodes.

dsh -av "multibos -tR"

Commit all filesets.

dsh -av "installp -c all"

Build the node Standby BOS using the AIX 5300-05-03 update images.

dsh -av "multibos -Xsa -l /csminstall/aix530503"

Check the multibos log for errors.

Verify the bootlist is the Standby BOS boot device.

dsh -av "bootlist -m normal -o"
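
With twelve nodes, piping the dsh output through dshbak makes any mismatch easy to spot; a sketch:

dsh -av "bootlist -m normal -o" | dshbak -c

Identical bootlists collapse into a single group; any node reported in its own group deserves a closer look before the reboot.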

  • At this time, shut down the workload, LoadLeveler, and GPFS using normal shutdown procedures.
  • Reboot the nodes.
  • When the nodes come back up, perform some simple verification checks before upgrading the HPC LPPs (a cluster-wide sketch follows this list).

"lsvg -l rootvg"

verify the Standby BOS LVs are now “open/syncd”

"lppchk -v"

"hps_check.pl"

verify all switch links are 'Timed'
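
These checks can be driven across all nodes from the Management Server; a minimal sketch:

dsh -av "oslevel -s" | dshbak -c

Every node should report the 5300-05-03 level.

dsh -av "lppchk -v" | dshbak -c

No output from lppchk means the installed filesets are consistent. See Appendix A: AIX Health Check List for the fuller set of checks.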

  • Update the node HPC filesets at this time using the ‘update_all’ installp command, then reboot.
  • Resume workloads.

Appendix A: AIX Health Check List

AIX Checks / Description / Complete
instfix -cik 5300_AIX_ML | grep ":-:" / Detects any missing AIX ML filesets
instfix -vi | grep AIX_ML / Ensure all AIX_ML filesets are installed
lsvg -l rootvg | grep stale / Check for 'stale' partitions in rootvg
df /var / Check for full /var
df /tmp / Check for full /tmp
sysdumpdev -l / Check for proper system dump config
lsattr -El mem0 / Ensure proper memory configuration
lsattr -El sni0 (and sni1) / Ensure proper HPS Adapter settings, specifically rdma, pool sizes and number of windows
emgr -l / Check 'efix' inventory, if any
bindprocessor -q / Check for offline processors
lsattr -El ent0 | grep media_speed / Check for proper 'admin' network speed
lppchk -v / Verify installed filesets
lppchk -c / Sum check installed filesets
lsps -a / Verify paging space
vmstat -l / Verify large page allocation
smtctl / Verify SMT on/off on all the nodes
lssrc -ls xntpd | grep "Reference Id" / Check NTP on the nodes
raso -L / Check LMT on the nodes
dsh -av date / Verify date
vmo -a / Check all vmo settings
no -a / Check all network options settings
vmo -a | grep lgpg / Check large page settings
no -a | grep arp / Check ARP settings
netstat -in | grep sn | grep -v link / Check for sni IP addresses
netstat -in | grep ml | grep -v link / Check for ml0 IP addresses

Appendix B: Cluster Ready Hardware Server Health Check List

Cluster Ready Hardware Server / Power5 Federation Checks / Complete
hwsda -a / Collects all networked hardware
lsrhws -e, -f, -m / Lists the hardware CRHS is seeing
chswnm -q / Check for active FNM daemon
hps_check.pl / Check for TIMED/MPA=YES
lsswendpt / Check for endpoint Up:Operational
lssrc -a | grep rsct / Verify proper rsct subsystems
lssrc -ls dhcpsd or dadmin -s / Checks dhcp status
lspeer / Check hmc peer domain
lshwdev / Check hw devices
frame -l / Verify frames, nodes, ips
/usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc / Verify rmc domain on the MS for hmcs and nodes
lsswtopol -n 1 (2) / Check switch topology for Svc reqd
csmstat / Check node status
ssh hscpe@hmc \
lssvcevents -t hardware \
--filter "status=open" -F \
problem_num refcode sys_name \
sys_mtms enclosure_mtms text \
failing_mtms analyzing_hmc / Check SFP on all the hmcs
ssh hscpe@hmc lssysconn -r all / Verify frame/FSP connections on all hmcs
ssh hscpe@hmc lssyscfg -r frame \
-F name,state / Verify frames on all hmcs
ssh hscpe@hmc lssyscfg -r sys \
-F name,state / Verify CEC connections on all the hmcs
lsrpdomain / Proper rsct domain online
lsrpnode / Nodes are online in domain
lssrc -ls cthags / Check for correct providers (node only)
lssrc -ls cthats | grep CG / Check interface Mbrs

Appendix C: Verifying Power Subsystem Code levels and Status

From the Licensed Internal Code Maintenance pulldown