Tivoli Software / TPC 3.1.3 PM Spike Control User’s Guide

SWG - TIVOLI / TPC 3.1.3 Performance Manager
Performance Data Spike Control
Version 1.0
Issue Date: 11/14/2006
Revision Status: Draft
Stefan Jaquet
IBM Tivoli
San Jose

Copyright IBM Corporation 2006 – All rights reserved

Table of Contents

1. Introduction

2. The Origin of Performance Data Spikes

3. Detection of Performance Data Spikes

3.1 Limitations

4. Prevention of Performance Data Spikes

5. Conclusion

1. Introduction

The IBM TotalStorage Productivity Center (TPC) is a SAN management application suite, focusing on storage management within the SAN. TPC consists of four primary components. TPC for Data provides SAN management of host system (server) storage, including attached disks, databases, file systems, and files. TPC for Disk provides SAN management of storage subsystems, including volume allocation and assignment, asset and capacity tracking, and subsystem performance management. TPC for Fabric provides SAN management of fabrics, switches, and routers, including SAN topology, zoning, and performance management of switches. Finally, TPC for Tape provides SAN management of tape subsystems, including media, media changers, and drives.

One of the main sub-components of TPC is the Performance Manager (PM), which spans the TPC for Disk and TPC for Fabric products. PM consists of functions to monitor the performance of selected storage subsystems and switches, to notify users when potential performance problems are encountered (via setting and checking performance thresholds), and to report on the historical performance behavior and threshold violations of the monitored subsystems and switches. PM also includes some additional advanced functions, which are not relevant for this document.

During the course of monitoring the performance of certain devices, TPC will in some cases record in its database some very large values for the performance metrics of those devices. These values, referred to as spikes in this document, represent impossibly high performance measurements for the device and its associated hardware components. The problem with having impossibly high values recorded in the database surfaces primarily during graphical reporting of the performance data. While the spikes themselves are immediately apparent and could be mentally discounted by any competent user, the historical performance graphs displayed in the TPC GUI adjust the X and Y axes to fit the range of the data being presented. This means that whenever an invalid high spike exists, all valid performance values are pushed so low on the Y axis that they appear as virtually zero values on the graph. In other words, the presence of invalid spike values in the database renders all other, valid data values useless when viewing the graphs.

Starting in version 3.1.3, the TPC Performance Manager software will attempt to automatically detect any such spikes while performance monitors are running against a device, and will prevent such invalid values from being saved in the TPC database. This document describes some of the background of performance spikes and explains some of the implementation details of this new feature in TPC 3.1.3.

It is important to note that not all performance spikes are necessarily invalid. When this document refers to "spikes" in a generic manner, it means only invalid spikes, i.e., spikes in the performance data that represent impossibly high performance measurements for the associated devices.

2. The Origin of Performance Data Spikes

To understand how invalid spikes can occur in the performance data, it is necessary to understand the basic mechanism for measuring performance on devices in the SAN.

The TPC Performance Manager (PM) achieves performance monitoring of subsystems and switches by repeatedly polling performance statistics counters that are maintained by these devices. These counters track various activities in the device, for example the number of I/Os issued to particular volumes or ports, or the amount of time the device takes to process read or write requests. For some devices, these counters are maintained natively by the firmware of the device; in other cases, they are maintained by the CIM Agent used to manage the device. In either case, the counters are always monotonically increasing, meaning that over time their values consistently grow larger. Therefore, the counter values retrieved by a single poll of the device are rather meaningless in isolation. It is necessary to perform two successive polls, separated by a certain amount of time, say n seconds. The actual activity that occurred on the device during those n seconds can then be determined by computing the differences (deltas) between the counter values retrieved by the two polls.
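
As a minimal sketch of this two-poll delta computation (illustrative Java, not TPC's actual internals; counter wraps are covered next):

    /**
     * Computes the per-counter activity during one sample interval as the
     * difference between two successive polls of the device's monotonically
     * increasing statistics counters. This simple subtraction assumes no
     * counter wrapped between the two polls; wrap handling is shown below.
     */
    static long[] computeDeltas(long[] previousPoll, long[] currentPoll) {
        long[] deltas = new long[currentPoll.length];
        for (int i = 0; i < currentPoll.length; i++) {
            deltas[i] = currentPoll[i] - previousPoll[i];
        }
        return deltas;
    }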

Counters are usually represented as unsigned 32-bit or 64-bit integers. But no matter how large the data type, an individual counter will eventually wrap, meaning that a subsequent poll returns a smaller counter value than the prior poll: the counter has been incremented up to its limit (32 or 64 bits), wrapped back to 0, and then been incremented further from there. PM automatically adjusts for such wraps during its delta computations, to avoid any gaps in the collected performance data for the device.
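
A wrap-adjusted delta can be computed by adding the counter's modulus (2^32 or 2^64) back into a negative difference. A minimal sketch, assuming the counter width is known (illustrative only, not TPC's actual code):

    import java.math.BigInteger;

    /**
     * Computes the wrap-adjusted delta for an unsigned counter of the given
     * width (32 or 64 bits). If the second poll returns a smaller value,
     * the counter is assumed to have wrapped past its maximum back to zero.
     */
    static BigInteger wrapAdjustedDelta(BigInteger previous, BigInteger current,
                                        int counterBits) {
        BigInteger delta = current.subtract(previous);
        if (delta.signum() < 0) {
            // Wrap case: add back the counter's modulus (2^32 or 2^64).
            delta = delta.add(BigInteger.ONE.shiftLeft(counterBits));
        }
        return delta;
    }

Note that this same wrap adjustment is what later turns a counter reset into an invalid spike, as described below.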

This, in a nutshell, is the basic concept behind performance data collection in PM: devices keep incrementing their performance counters, and PM keeps polling them, computing the deltas, and saving the results in the database for later display in performance reports. However, this is where the problem comes in. Unfortunately, a real-life SAN environment does not always live up to these ideals, and there are always periodic, planned or unplanned, hardware or software outages. For those devices that maintain their performance counters in firmware, a shutdown of the device usually causes the counters to be reset to zero. The same occurs for devices that maintain their counters in the CIM Agent; a shutdown of the CIM Agent software causes the counters to be reset. And there may be other valid causes of counter resets, for example fail-over or fail-back in ESS/DS subsystems.

On top of these valid cases, there are invalid counter resets that can occur due to firmware or CIMOM bugs. This has especially been a problem with earlier microcode releases of ESS and DS storage subsystems, where counters were invalidly reset as a result of machine warm-starts, or even as a result of hosts issuing LUN resets to their assigned volumes. It is also a problem for McData switches, for example, where counters are reset due to new port-speed negotiations by the hardware whenever an attached host is rebooted. It was also a problem with earlier versions of the Engenio CIM provider, which kept track of the performance counters itself and would in some cases perform invalid arithmetic on them. And of course there may be other cases of firmware or CIMOM/provider bugs that have not yet been discovered.

In any of these scenarios, valid or invalid, the result is that when PM computes the deltas between the counters, it will usually arrive at very large values, because the assumption is that if the second of two consecutive polls returns a smaller counter value, then the counter has been incremented to its maximum value (32 or 64 bits), wrapped around to zero, and then been incremented further up to its present value. In the performance reports, such cases display as metric values in the millions or more, where normal values might be in the hundreds or thousands. These values, therefore, are the invalid spikes that corrupt the TPC performance graphs.
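
For illustration (hypothetical numbers): suppose a 32-bit counter stands at 3,500,000 at the first poll, the device is then rebooted so that the counter is reset to zero, and the second poll returns 1,200. Interpreting the decrease as a wrap, PM would compute a delta of (2^32 - 3,500,000) + 1,200 = 4,291,468,496, even though the true activity during the interval was only 1,200.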

3. Detection of Performance Data Spikes

Starting in version 3.1.3, TPC PM will attempt to detect, as best as possible, any invalid performance data spikes. A new data-validation step has been added to the normal processing of the gathered performance data. This validation step iterates through all the gathered performance statistics counters for all of the gathered device components and attempts to detect any invalidly high values. Detection occurs when any individual counter equals or exceeds a certain maximum limit value, which represents the maximum expected value for any performance measurement from the device. There are two such limit values implemented in PM, one for 32-bit counters and one for 64-bit counters, each with an appropriate default value. Normally these limit values should never need to be adjusted, but if absolutely necessary, they can be set using two TPC configuration parameters. If this becomes necessary in a given environment, an IBM service representative should be contacted, who can provide specifics on how to set these parameters and, more importantly, to what values they should be set. Remember that setting the limit values too low could result in valid data being misinterpreted as invalid spikes, and could result in deletion of that valid performance data.
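
Conceptually, the validation step reduces to comparing each gathered value against the limit for its counter width. A minimal sketch (illustrative Java; the limit constants shown are arbitrary placeholders, not TPC's actual defaults):

    /**
     * Flags a sample record as containing a spike if any of its computed
     * values meets or exceeds the limit for its counter width. The limits
     * below are arbitrary placeholders; in TPC they are controlled by
     * configuration parameters, as described above.
     */
    static final long LIMIT_32_BIT = 1000000000L;        // placeholder
    static final long LIMIT_64_BIT = 1000000000000000L;  // placeholder

    static boolean containsSpike(long[] values, boolean is64BitCounters) {
        long limit = is64BitCounters ? LIMIT_64_BIT : LIMIT_32_BIT;
        for (long value : values) {
            if (value >= limit) {
                return true;   // at least one impossibly high measurement
            }
        }
        return false;
    }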

3.1 Limitations

Performance data spike detection sounds like a simple concept, but by necessity it is an inexact science. If a counter has been validly incremented close to its maximum value (32 or 64 bits), then a reset of that counter to zero will be undetectable, regardless of whether the reset is a valid reset due to external factors or an invalid reset caused by bugs in the firmware or the CIM provider.

Detection of spikes after they have been inserted into the database is even more error-prone, due to aggregation of the performance counters across higher-level components, and due to summarization of the counters across multiple sampling intervals. A single spike in an individual counter can be masked by all the other counters for other components and/or other sampling intervals, so that the sum of all these counters might appear to be a valid value. When setting the limit value for spike detection, a limit that is too low may cause a valid aggregated or summarized value to be detected as a spike, while a limit that is too high may let a true spike pass as though it were a valid value.

As a result, a mechanism for spike detection can never be 100% accurate; there will always be false negatives, where a spike occurred but was not detected, and false positives, where a spike was detected but none occurred.

However, the spike detection logic within the PM data-validation step has to deal only with the sample data received directly from the devices, so the inherent inaccuracy of the spike detection is limited. The only danger is failing to detect a counter reset because the corresponding counter had already been validly incremented close to its maximum value (32 or 64 bits) before being reset to zero. However, each data object consists of many performance counters, and a counter reset generally affects more than a single counter. Since a spike is detected when any of the counters of a data object exceeds the limit value, the probability of a counter reset going undetected is very small, because the probability of all associated counters already being close to their maximum is very small. The probability is not zero, however, and therefore there may be rare cases where a false negative occurs, such that an invalid spike is not detected by TPC and is then propagated into the database.

4. Prevention of Performance Data Spikes

Once an invalid performance data spike has been detected, there are several approaches that could be used for spike prevention. Spike prevention is admittedly a somewhat ambiguous term: since any actual spikes in the performance counters depend entirely on the counter values returned by the various devices, it is impossible to prevent spikes in the counters themselves. This section therefore describes ways to prevent such counter spikes from being propagated into the TPC database and into the performance reports, where they would adversely affect the readability of the reports.

One approach to spike prevention would be to discard each performance record that contains one or more counters with an invalid spike. Discarding the whole record is reasonable, because if one of the counters has been reset by the device, it is highly likely that other counters have also been reset. The problem with this approach is that if too many such records are deleted, the remaining performance data will be severely skewed. Some of the performance data reported on by TPC is aggregated from many lower-level components to a few higher-level components. For example, if a subsystem contains 1000 volumes, the performance metrics of those 1000 volume records are summed up and reported at the level of the entire subsystem. If 800 of those 1000 volume records are discarded due to spike prevention, then the values reported at the subsystem level will not represent reality, because they take into account only the remaining 200 records. This is not acceptable, and would arguably be worse than not doing any spike prevention at all: with spikes in the data it is at least immediately obvious to the user that the information presented in the performance report is wrong, whereas with partial results it will not at all be obvious that the presented information is wrong.

Another approach would be to discard the entire sample when one or more spikes have been detected. This may at first seem extreme, but it is the best option given the drawbacks described above for the record-level approach. Note also that in normal cases, spikes will only be seen when the device has been rebooted or reset in some manner, which should happen extremely rarely. Losing an occasional individual sample whenever a maintenance window for a device is reached is acceptable.

The latter approach has been implemented in PM for TPC 3.1.3 as the default mechanism of spike prevention. Whenever a spike has been detected as described previously, the current sample pass is terminated without saving any of the sample data in the database, and in addition the following message is written to the log:

HWNPM2124W Performance data continuity has been broken. The device may have been reset or rebooted. {0} invalid performance data records were discarded.

This message will be displayed in the collector (job) log for every sample pass where a spike has been detected. The {0} is a placeholder and is filled in with the actual number of performance data records that were discarded. Note that this new message parallels the existing message which is issued for every successful sample pass:

HWNPM2123I Performance data was collected and processed successfully. {0} performance data records were inserted into the database.

However, some customers with very active DS6000 and DS8000 environments may experience some kind of spike in almost every single sample interval. For these customers, the default approach would mean that all or a large percentage of the gathered performance data would be discarded. For these special circumstances, a new TPC configuration parameter is available to change the default behavior, as follows:

Attribute:      LimitCheckLenient
Default Value:  false
Context:        PerformanceManager
Description:    Controls the spike prevention logic. If false, all records
                for the sample interval are discarded when at least one
                spike is detected in any of the records. If true, only
                those records that contain spikes are discarded, and the
                rest are processed normally.

If the LimitCheckLenient parameter is set to false (the default), the entire sample is discarded whenever any spikes are detected. If set to true, only the individual data records that contain spikes are discarded, instead of all data records for the sample pass. However, remember that setting this parameter to true means that the aggregated performance values for higher-level components could end up significantly skewed, due to the individual data records that were discarded. As a result, an additional warning message is printed in this case, in addition to HWNPM2124W, as a reminder that all computed aggregate values may be incorrect:

HWNPM2125W Aggregated performance values have been computed from the remaining data records, but their accuracy cannot be guaranteed.

In addition, HWNPM2123I would also be issued in that case, since some records would end up being inserted into the TPC database.
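
Conceptually, the prevention decision reduces to the following branching. A sketch with illustrative names (PerfRecord is a hypothetical placeholder type, not an actual TPC class):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    /**
     * Decides which records of one sample pass are inserted into the
     * database. records is the full sample pass; spiked is the subset of
     * records in which a spike was detected.
     */
    static List<PerfRecord> applySpikePrevention(List<PerfRecord> records,
                                                 List<PerfRecord> spiked,
                                                 boolean limitCheckLenient) {
        if (spiked.isEmpty()) {
            return records;                    // normal pass: HWNPM2123I
        }
        if (!limitCheckLenient) {
            // Default: discard the entire sample pass (HWNPM2124W).
            return Collections.emptyList();
        }
        // Lenient mode: drop only the spiked records (HWNPM2124W and
        // HWNPM2125W); aggregates built from the remainder may be skewed.
        List<PerfRecord> kept = new ArrayList<PerfRecord>(records);
        kept.removeAll(spiked);
        return kept;
    }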

To set the LimitCheckLenient parameter, it is necessary to insert or update a row in the T_RES_CONFIG_DATA table in the TPC database. The attribute and context columns should be specified as shown above, and the value column should contain the desired value (true or false). After the parameter has been inserted into the table once, it is then also possible to modify the value using the tpctool command line interface.
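
For illustration, the initial insert could be performed with a small JDBC program such as the following sketch. The connection URL and credentials are placeholders, and the column names of T_RES_CONFIG_DATA are assumptions inferred from the attribute/context/value description above; the table should only be modified under the guidance of an IBM service representative.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class SetLimitCheckLenient {
        public static void main(String[] args) throws Exception {
            // Placeholder URL and credentials for the TPC repository database.
            Connection con = DriverManager.getConnection(
                    "jdbc:db2://tpcserver:50000/TPCDB", "user", "password");
            // Column names are assumptions based on the description above.
            PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO T_RES_CONFIG_DATA (ATTRIBUTE, CONTEXT, VALUE)"
                    + " VALUES (?, ?, ?)");
            ps.setString(1, "LimitCheckLenient");
            ps.setString(2, "PerformanceManager");
            ps.setString(3, "true");
            ps.executeUpdate();
            ps.close();
            con.close();
        }
    }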