COMP 290 – Data Mining Final Project

Using sequence mining techniques for performance data

Todd Gamblin

Abstract

For this project, we attempted to evaluate the effectiveness of rudimentary sequence mining techniques for characterizing I/O trace data. We have taken trace information for a scientific application running on a cluster, and we have attempted to use K-Medoids based clustering algorithms to correlate particular trace sequences with phases of application execution. We also present a novel approach to sequence comparison, where we consider packets in sequences qualitatively based on high-level observations about the distribution of their durations. We show that our approach can reveal correlations between I/O sequences and phases of application execution, and that sequence mining techniques hold promise for adaptive performance monitoring.

1. Introduction and Motivation

With the advent of Grid Computing, an application designer may never know the exact performance characteristics of the environment in which his application runs. An application may run on a set of nodes with completely different memory systems, disk drives, and associated resources, so it is the responsibility of the runtime system to discover these details and optimize the application's behavior accordingly. To do this, it must be possible to detect, based only on performance data collected from the running system, which phase of execution a particular application is in at a given time. This information can then be used to guide how the application's nodes are monitored, and to dynamically assign resources to the application to improve its performance.

Large cluster systems today can have thousands of nodes, and monitoring all of them in real-time is beyond the capabilities of current interconnection networks. Statistical sampling can be used to significantly reduce the number of nodes we need to monitor and, accordingly, the volume of data that needs to be sent over the network. In particular, if we establish a minimum acceptable accuracy for our results, we can determine a minimum sample size necessary to achieve that accuracy. This minimum sample size depends on the variability of the data. For data that varies less, it is possible to sample fewer nodes while maintaining the same accuracy. If we modulate the sample size according to the variability of the data monitored, we can further reduce the load on the network.

Since the performance characteristics of an application can vary by execution phase, and since execution phase in a dynamic application can vary by machine, we would like to be able to divide our nodes into different groups and sample each group independently. This could further reduce monitoring overhead. First, however, we need to understand what the different execution phases of applications are, and we need to learn how to identify them based on observable information.

2. Prior Work

Hao Wang [9] has run tests on the same data we examine here, but his approach did not consider the sequential nature of the data. Instead, he applied various data mining techniques to individual packets of data, without taking into account their order. Computer programs are inherently sequential, and their actions are difficult to characterize without examining their order. Streaming I/O references made by computer programs will also be strongly sequential, and mining for frequently occurring sequences of data will be more effective for characterization than analysis at the reference level.

Lu and Reed [4] have investigated a low-overhead alternative to event tracing called “application signatures”. They use curve fitting to characterize individual performance metrics, and use these to compare applications across platforms. This technique is comparable to our sequence-mining approach, but our research is focused more on leveraging powerful existing data-mining techniques on available data than on low overhead.

Much work has been done on sequential pattern matching (string matching) and on cluster mining in the past. Sequence mining and approximate string matching have had important applications to biological data, and we apply these techniques to performance data. In particular, we use the K-Medoids clustering algorithm [3], the CLARA [5] clustering algorithm, and the edit distance [7] measure of the similarity of sequences.

3. Data Characterization

For this project, we ran tests on approximately 850 MB of application trace data from traces run by the University of Illinois Pablo project [6]. The data is in Pablo’s Self-Defining Data Format (SDDF), and it contains records of each I/O event that occurred during a particular run of Dyna3D [1], a structural mechanics simulator. The simulation we examine ran on 32 nodes of the ASCI Blue system (blue.pacific.llnl.gov), and the dataset was a physical model containing 1,046,797 nodal points. Dyna3D performance can be very data dependent, so it will be especially beneficial for us to identify different phases in its execution, as we may be able to perform load-balancing between nodes, or reallocate resources based on this kind of information.

The fundamental unit of data in an SDDF file is called a “packet”. Some of these represent I/O trace events, while others are actual I/O system calls made by the application. There are 9,594,415 packets total. For the purpose of this project, we have filtered out the trace events, since it is only the system calls that we would be monitoring in an actual runtime environment.

A packet relevant to us in the data might look like this:

"Write" {
    [2] { -382203,1}, 214.738809725,
    700017, 15, 0.0032184, 38, 13, 0, 0
};;

Each packet has a type name, specified on its first line. The six specific types of packets we are concerned with are Read, Write, Open, Close, Seek, and Flush. These are the names of well-known I/O operations, so we will not explain them in detail here. The second line of the packet is its timestamp in two forms: the first is an array of ints, and the second is the same value represented as a double. The third line contains the following fields:

  1. Unique id of the packet
  2. Id of the node on which the I/O operation occurred
  3. Duration of the operation in seconds
  4. Id of the file the operation refers to
  5. Number of bytes read/written by the operation
  6. Number of variables passed to the I/O call
  7. Bookkeeping value used by the instrumentation code that recorded this data
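To make this layout concrete, the sketch below parses the example packet text into named fields. This is illustrative only: the project itself reads packets through the Pablo SDDF library, and the regex and field names here are our own assumptions based on the field list above.

```python
import re

# Illustrative parser for the example packet shown above. The real code
# uses the Pablo SDDF library; this regex and the field names are
# assumptions for demonstration purposes only.
PACKET_RE = re.compile(
    r'"(?P<op>\w+)"\s*\{\s*'      # operation type, e.g. "Write"
    r'\[2\]\s*\{[^}]*\},\s*'      # timestamp as an array of ints (skipped)
    r'(?P<time>[0-9.]+),\s*'      # timestamp as a double
    r'(?P<rest>[^}]*)\};;')       # the seven remaining fields

FIELD_NAMES = ('id', 'node', 'duration', 'file_id',
               'bytes', 'num_vars', 'bookkeeping')

def parse_packet(text):
    m = PACKET_RE.search(text)
    if m is None:
        raise ValueError('unrecognized packet')
    packet = dict(zip(FIELD_NAMES,
                      (float(v) for v in m.group('rest').split(','))))
    packet['op'] = m.group('op')
    packet['time'] = float(m.group('time'))
    return packet

example = '''"Write" {
    [2] { -382203,1}, 214.738809725,
    700017, 15, 0.0032184, 38, 13, 0, 0
};;'''
```

Applied to the example packet, this yields a Write of 13 bytes on node 15 taking 0.0032184 seconds.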

The timestamp field is important for our analysis, because events with similar timestamps can represent the same phase of application execution. If we can show correlations between other attributes and time, then we may be able to use this kind of data to learn about running applications. Number of bytes, number of variables passed to the I/O call, and the duration of the I/O operation might also be of concern at runtime. Before we started writing our code, we looked at the distribution of these values across the data. Our results are presented below.

Figure 1: Histogram of Operation Types

The data consists of 9,138,950 reads, 449,431 writes, 256 file opens, 227 closes, and 31 seeks. There are no flushes, and there are a total of 9,588,895 operations.

Figure 2: Histogram of values for bytes read/written in I/O Operations

There were very few distinct values for the number of bytes read or written in I/O operations: either 1 or 18 bytes were read at a time, and 2, 4, 13, or 15 bytes were written at a time. The number of variables passed to read and write calls was always zero in the data we used, so there was no information to be gained from this dimension of the data.

The duration of I/O operations was the most varied metric for the I/O references we examined. The distributions of values for each type of I/O operation are shown below.

Figure 3: Durations of reads and writes.

Figure 4: Durations of Opens, Closes, and Seeks

A large number of read operations are tightly grouped around 10^-6 seconds, while smaller groups of reads tend to take around 10^-4 seconds. There are almost no reads taking more than 10^-2 seconds, save for a small group taking around 10^0 seconds to complete. Writes vary in a similar fashion to the reads. The largest portion of writes is around 10^-6 seconds in duration, tapering off to close to zero writes above 10^-2 seconds. Just as with the reads, there is a small group of writes taking around 10^0 seconds to complete. The speed of reads and writes can be related to the memory hierarchy of the system tested, or to its virtual memory system. Slower times could imply a cache miss, a page fault, or a higher load on the machine on which these timings were observed. They can tell us important information about the performance of the application at the point in time they were observed. The duration data thus seems to be the most relevant for our mining algorithms.

Other operations are far less frequent than writes or reads are in the dataset. Closes seem to follow a similar distribution to the writes and the reads, while seeks are all in the 10^-6 second range. The distribution of opens is similar to reads, writes, and closes, but without the large number of fast operations that we see with these other types.

4. Adapting algorithms for analysis of performance data

In this section, we describe how we have adapted existing algorithms for approximate string comparison and for data mining to work with our performance data. We present a novel high-level way of looking at sequences, where we categorize reads and writes into bins based on their distribution as described in Section 3, and we then describe how sequences of this sort can be used as input for clustering algorithms.

4.1. Mapping Q-Grams to I/O Traces

String matching is typically defined in terms of characters in sequence, or, in the case of genetics, in terms of bases in DNA molecules. We will use packets as characters, and sequences of packets, called q-grams [8], in place of strings.

Q-grams are discussed in [8,9], and the term generally refers to taking a stream of inputs and breaking it into subsequences of a fixed length, q. In our case, we do just that: we split the data stream into q-grams of packets. As q-grams are sequential, comparing them instead of individual packets should give us more of a sense for the types of tasks the application is performing, rather than for the overall mix of operations it is using.
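This step can be sketched as follows (a Python sketch, though the project code is C++). We assume packets are reduced to (operation, duration) pairs and that the stream is cut into non-overlapping windows; the text does not specify whether windows overlap.

```python
# Sketch of splitting a packet stream into q-grams of fixed length q.
# Packets are shown as (operation, duration) pairs; disjoint windows are
# an assumption, since the report does not say whether windows overlap.
def qgrams(packets, q):
    """Return the consecutive, non-overlapping length-q subsequences."""
    return [tuple(packets[i:i + q])
            for i in range(0, len(packets) - q + 1, q)]

stream = [('Read', 1e-6), ('Read', 2e-6), ('Write', 3e-4),
          ('Read', 1e-6), ('Read', 2e-6), ('Write', 3e-4)]
```

With q = 3, the six packets above yield two identical q-grams; a trailing partial window would simply be dropped.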

To compare q-grams with each other, we need some definition of what makes two packets equal. We obviously cannot compare two packets for equality based on their timestamps, as this would result in none of them being equal. Instead, we define three different measures for packet equality based on the attributes available to us in the data, and we look for the q-grams that meet these criteria.

  1. Operation-type equality
    If two packets have the same operation type, they are considered equal.
  2. Exact duration equality
    If two packets are of the same type and have exactly the same duration, they are considered equal.
  3. Binned duration equality
    Similar to operation-type equality, but with slightly finer granularity. Since the duration values for reads and writes are so varied, we chose to group them into bins. Reads shorter than 10^-5 seconds are considered “short reads”, those longer than 1 second are considered “long reads”, and those in between are considered “middle reads”. Writes are handled similarly.
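The binned-duration measure can be sketched as a labeling function (a Python sketch; the 10^-5 second and 1 second cutoffs are the ones given above, interpreting “short” reads as the fast ones):

```python
# Sketch of binned-duration equality: reads and writes are reduced to a
# (speed, type) label using the cutoffs from the text; other operations
# are compared by type alone. Two packets are "binned-duration equal"
# when their labels match.
def duration_label(op, duration):
    if op not in ('Read', 'Write'):
        return op
    if duration < 1e-5:
        speed = 'short'
    elif duration > 1.0:
        speed = 'long'
    else:
        speed = 'middle'
    return '%s %s' % (speed, op.lower())
```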

Two q-grams are considered equal if all of their component packets are equal. We use a trie data structure to search for the unique q-grams in our data, and we test this with each of the above definitions of equality.
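The trie search can be sketched as follows (a Python sketch of the C++ implementation described in Section 4.3, with packets identified here by operation type only):

```python
# Sketch of counting unique q-grams with a trie. Each level of nested
# dictionaries is one position in the q-gram; a '#count' entry at the
# final node tallies occurrences of that exact q-gram.
def count_unique(qgram_list):
    root, unique = {}, 0
    for gram in qgram_list:
        node = root
        for key in gram:
            node = node.setdefault(key, {})
        if '#count' not in node:
            unique += 1
            node['#count'] = 0
        node['#count'] += 1
    return root, unique

grams = [('Read', 'Read', 'Write'),
         ('Read', 'Read', 'Write'),
         ('Read', 'Write', 'Read')]
```

Because every q-gram has the same fixed length q, no q-gram is a proper prefix of another, so the count marker never conflicts with a child edge.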

4.2. Clustering Q-Grams

Clustering is a technique widely used in data mining to find groups of similar objects within a larger dataset. The frequent q-gram search we described above can find frequently occurring q-grams, but it will find only exact sequence matches. We wanted to find equivalence classes of similar q-grams in the data. Given these, we could characterize data at runtime based on its membership in these classes, and we could use this information to deduce how an application is running.

Clustering data requires some means for assessing how points are similar. In our case, our points are q-grams, and we can again look to string matching research. A common measure of the dissimilarity of sequential strings is the edit distance. Edit distance is a measure of how many changes (e.g. adding, deleting, or replacing characters) would need to be made to make one string into another. Given two q-grams, we can again treat packets as characters, and using the equality measures outlined above we can calculate the edit distance between two q-grams. Our clustering algorithms use this as their measure of q-gram similarity.
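Concretely, this is the standard Wagner-Fischer dynamic program with packets in place of characters and a pluggable equality test (a Python sketch; the project's implementation is in C++):

```python
# Edit distance between two q-grams: the minimum number of packet
# insertions, deletions, and substitutions turning one into the other.
# `equal` stands in for one of the packet-equality measures of Sec. 4.1.
def edit_distance(a, b, equal=lambda p, q: p == q):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # i deletions
    for j in range(n + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if equal(a[i - 1], b[j - 1]) else 1
            d[i][j] = min(d[i - 1][j] + 1,          # delete a[i-1]
                          d[i][j - 1] + 1,          # insert b[j-1]
                          d[i - 1][j - 1] + sub)    # substitute or match
    return d[m][n]
```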

Traditional methods for clustering include K-Means [3], K-Medoids [3], and Density Clustering [2]. K-Means and K-Medoids are similar in that both attempt to create a fixed number of clusters, k, by assigning points to clusters and minimizing the total dissimilarity within them. K-Means tries to minimize the sum of squared dissimilarities of points in a cluster from their mean value. K-Medoids uses medoids instead of means, where a medoid is restricted to being an actual point in a cluster rather than the average of the positions of all points. Density-based clustering algorithms take a different approach, and try to find groups of points that are “density connected,” without considering overall dissimilarity. This tends to be more robust than the K-cluster based approaches, as it can find clusters of arbitrary shape. Also, Density Clustering does not need to know the number of clusters to search for in advance.

Of these methods, the only one applicable to our sequence data is K-Medoids. Because the only two operations we have defined on our q-grams are equality and edit distance, we cannot use K-Means or Density Clustering. K-Means requires the computation of a mean from the data points, but all we have is a measure of dissimilarity. There is no way to compute a “mean” q-gram. Density clustering suffers from similar problems, and is only applicable to spatial data.

K-Medoids, however, can be applied, as it requires only a measure of “distance” between two points. A proper distance measure must obey the triangle inequality: for any three points x, y, and z, d(x,y) ≤ d(x,z) + d(z,y). The edit distance does obey this property, and it is thus usable for K-Medoids clustering.

K-Medoids is the slowest of the clustering algorithms mentioned above, and it is not practical for large datasets: a single iteration requires O(k(n-k)^2) time, so running enough iterations to converge on a large dataset could take an unmanageable amount of time. We therefore chose to use CLARA [5], which offers significantly better performance. Instead of clustering the entire dataset, CLARA samples the dataset several times and runs K-Medoids on each sample. It then chooses, from the sample runs, the set of medoids that best fits the total dataset. CLARA has been shown to perform well on large datasets like ours.
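A much-simplified sketch of this scheme (in Python; a PAM-style swap search on each sample, keeping whichever medoid set is cheapest over the full dataset; the sample count and size here are illustrative defaults, not the values the project actually used):

```python
import random

def total_cost(medoids, points, dist):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def k_medoids(points, k, dist):
    """Greedy PAM-style search: swap medoids while total cost improves."""
    medoids = random.sample(points, k)
    best = total_cost(medoids, points, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for p in points:
                if p in medoids:
                    continue
                trial = medoids[:i] + [p] + medoids[i + 1:]
                cost = total_cost(trial, points, dist)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids

def clara(points, k, dist, samples=5, sample_size=40):
    """Run K-Medoids on several samples; keep the best medoids overall."""
    best_medoids, best_cost = None, float('inf')
    for _ in range(samples):
        sample = random.sample(points, min(sample_size, len(points)))
        medoids = k_medoids(sample, k, dist)
        cost = total_cost(medoids, points, dist)   # judged on the full data
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids
```

For q-grams, `dist` would be the edit distance described above; any distance function over the points works, e.g. absolute difference for scalar data.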

It is worth noting that the sequence clustering problem has been examined in considerable detail by Wang et al., who presented CLUSEQ, an algorithm for finding clusters in sequential data without a preset number of clusters and without a preset sequence length (which our q-gram analysis inherently requires). CLUSEQ also measures similarity by statistical properties of sequences, rather than by a single distance metric. We believe this algorithm would be very effective for analyzing I/O trace data, but an implementation was not available at the time we performed our experiments, and the algorithm is sophisticated enough that there was no time to reimplement and debug it. Testing CLUSEQ on performance data is a promising direction for future work.

4.3. Implementation

We have implemented all of the above algorithms in C++. Our implementation uses the Pablo project’s library for processing SDDF data as its backend. It first reads sequence data from SDDF files, splits it into q-grams, and stores unique q-grams in a trie. The trie is then queried for frequent q-grams, which are passed to the CLARA clustering algorithm. Finally, we relate each unique q-gram to all of its instances in the main dataset to calculate statistics for the clusters.

5. Q-Gram Results

In this section, we describe the experiments we performed and the results we obtained from applying our q-gram and cluster algorithms to the Dyna3D dataset.

5.1 Unique Q-Grams

Our first experiment was to count the number of unique q-grams in the dataset. We show that the number of unique q-grams in an I/O trace can give us an idea of the variety and relative complexity of behavior in the system.

We first ran our q-gram search on the raw dataset, with I/O trace packets from the different nodes interleaved in the order they were traced. We then split the data into separate files, one per node, and ran the same experiment on the split data, comparing the results with the number of unique q-grams in the un-split dataset. Finally, we looked at the distribution of all instances of these unique q-grams in time, to see if there was any temporal locality among like q-grams.

                          Binned-10   Binned-100   Exact-10   Exact-100   OpType-10   OpType-100
Main file                     23178        20467     928905       95698        1007          835
All Split (w/ node 0)         21903        19863     831517       95381         669          342
All Split (w/o node 0)        17829        15497     610464       72508         246          247
Node 0                         4202         4366     221405       22873         430           98
Avg. Nodes 1-31                 634          500      20050        2339          12           10

Figure 5: Summary of unique Q-Grams

5.1.1 Main dataset

Unique q-grams in the raw data file are summarized in Figure 5. We ran our tests with q-grams of size 10 and 100 and for all definitions of q-gram equality.

Within the main data, the number of unique q-grams found through exact matching is the greatest at almost 1 million q-grams of length 10 and almost 100 thousand of length 100. These numbers are approximately one-tenth and one one-hundredth the number of packets in the dataset, respectively. This indicates that we have almost the maximum number of unique q-grams possible when we use exact matching.