An Open Framework for Scalable, Reconfigurable Performance Analysis[1]

Supplementary Material for Poster Submission to SC 2007

Todd Gamblin‡, Prasun Ratn*†,

Bronis R. de Supinski*, Martin Schulz*, Frank Mueller†

Robert J. Fowler‡, and Daniel A. Reed‡

*Lawrence Livermore National Laboratory

†North Carolina State University

‡Renaissance Computing Institute, University of North Carolina at Chapel Hill

Petascale application developers will need monitoring and analysis tools capable of processing enormous data volumes. For 100,000 or more processors, an instrumented application can generate terabytes of event trace data during a single execution. At such scales, the transfer and storage of complete traces are infeasible, necessitating tools that reduce, process, and analyze data on-line before human exploration and optimization.

We are developing a scalable, reconfigurable infrastructure for performance analysis on large-scale machines. One of its central capabilities is the collection of near-constant size communication traces. In this poster, we build on this capability with new mechanisms that annotate runtime traces with timing statistics and computational load measures. We describe our experiences using these techniques on current scientific applications. We also give preliminary results showing that these techniques can be used to visualize the time-evolution of load imbalance and to replay codes for accurate postmortem analysis.

Poster Structure

The poster will be divided into three sections. In Section 1, we will illustrate our architecture for scalable performance monitoring and describe its key components and data flow.

Sections 2 and 3 will provide two examples of how our framework enables new types of analysis on real-world scientific codes. We will illustrate how each of these examples can be constructed as a module within our framework, and we will show results obtained using these techniques with scientific codes.

Examples will include:

  • A technique to annotate scalable traces with statistical timing data for computation and communication; and
  • A model for recording load balance information in communication traces.

Section 1: Scalable Infrastructure

Traditional performance analysis tools are insufficient for monitoring petascale applications. At one end of the spectrum, tracing tools preserve complete timing data, but they generate prodigious outputs and excessively perturb the monitored application. Traces generated by these tools can be trimmed by manually reducing instrumentation, but this is at most a linear data reduction. Scalable traces must grow sub-linearly with the number of processes in the system.

Conversely, profiling tools (e.g., in HPCToolkit, TAU and Open|SpeedShop) provide a compact, whole-run summary of application performance, but aggregate data over time. This lack of timing-dependent data often precludes diagnosis of adaptive codes that exhibit evolutionary or cyclic behavior.

We have built tools to collect near-constant size communication traces using local and inter-node compression. We are currently developing a reconfigurable architecture to annotate these traces with additional performance data, creating a scalable hybrid framework that combines the expressiveness and detail of tracing with the compactness of profiling.
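As a simplified illustration of the idea behind local compression, the sketch below collapses back-to-back repetitions of an identical communication event into a single record with a repeat count. The type and function names (trace_event, trace_append) are hypothetical, and our actual tools use a more general loop-detection scheme rather than plain run-length encoding.

    /* Illustrative sketch only: simplified run-length compression of a local
     * communication trace. Names are hypothetical; the real tools detect and
     * compress more general repeating patterns, not just adjacent repeats. */
    #include <stdlib.h>

    typedef struct {
        int op;      /* event kind, e.g., 0 = MPI_Send, 1 = MPI_Recv */
        int peer;    /* communication partner rank */
        int size;    /* message size in bytes */
        int count;   /* run length: number of back-to-back repetitions */
    } trace_event;

    typedef struct {
        trace_event *events;
        size_t       len, cap;
    } trace_buf;

    /* Append an event; if it matches the previous record, bump its run length,
     * so a loop issuing the same communication N times costs one record. */
    void trace_append(trace_buf *t, int op, int peer, int size)
    {
        if (t->len > 0) {
            trace_event *last = &t->events[t->len - 1];
            if (last->op == op && last->peer == peer && last->size == size) {
                last->count++;
                return;
            }
        }
        if (t->len == t->cap) {
            t->cap = t->cap ? 2 * t->cap : 64;
            t->events = realloc(t->events, t->cap * sizeof(trace_event));
        }
        t->events[t->len++] = (trace_event){ op, peer, size, 1 };
    }

Inter-node compression proceeds analogously, merging structurally identical per-process traces across ranks so total trace size grows far more slowly than the process count.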

This section of the poster will introduce our framework and trace format. We will describe the dynamic configuration of our modular performance tools, which enables control over data volume and overhead. This material will provide the context for Sections 2 and 3 by illustrating the environment into which those tools will be integrated.

Section 2: Histograms and Timing Delta Information

Our traces can be input to a replay engine that facilitates detailed analysis of communication patterns and task mappings. This engine needs statistical data on trace event durations to reproduce application behavior accurately. However, timing data does not compress as easily as the communication trace data: even small (and mostly irrelevant) variations in computation and communication times can prevent very similar regions from being compressed and lead to unacceptably large traces.

We use statistical methods to preserve timing data without interfering with our compression algorithm. This poster will show the steps needed to generate adaptive histograms that approximate the timing data distributions. It will also show how this algorithm can be modularized and integrated with our framework to annotate events in communication traces with histograms.
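As a rough sketch of how such histograms can be kept at constant size, the example below maintains a fixed number of equal-width bins and doubles the bin width (merging adjacent bins pairwise) whenever a new timing sample falls outside the current range. The binning policy and names shown are illustrative rather than the exact algorithm used in our framework.

    /* Illustrative sketch only: an adaptive histogram in constant space.
     * When a sample exceeds the current range, adjacent bins are merged and
     * the bin width doubles, so memory use is independent of sample count. */
    #include <string.h>

    #define NBINS 32

    typedef struct {
        double min;                   /* left edge of bin 0 */
        double width;                 /* width of each bin  */
        unsigned long count[NBINS];
    } adapt_hist;

    void hist_init(adapt_hist *h, double min, double width)
    {
        h->min = min;
        h->width = width;
        memset(h->count, 0, sizeof h->count);
    }

    void hist_add(adapt_hist *h, double x)
    {
        /* Grow the range (doubling bin width) until the sample fits. */
        while (x >= h->min + NBINS * h->width) {
            for (int i = 0; i < NBINS / 2; i++)
                h->count[i] = h->count[2 * i] + h->count[2 * i + 1];
            memset(&h->count[NBINS / 2], 0, (NBINS / 2) * sizeof h->count[0]);
            h->width *= 2.0;
        }
        if (x < h->min)
            x = h->min;               /* clamp below-range samples into bin 0 */
        int bin = (int)((x - h->min) / h->width);
        h->count[bin]++;
    }

During replay, an approximate event duration can then be drawn by selecting a bin with probability proportional to its count and sampling a value within that bin.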

Additionally, this poster will provide results showing the accuracy of our trace replay engine for several scientific codes. The figure at left shows preliminary results for one of them, the NAS BT benchmark.

Section 3: Progress/Effort Model for Evolutionary Load-Balance Analysis

Modern solvers use methods such as Adaptive Mesh Refinement (AMR) and time-subcycling that may unevenly distribute work among processes. Managing such load imbalance at scale is difficult, as no extant tools can collect this data in a scalable manner. We have developed a model for annotating traces with loop information, which can inform engineers about the evolution of load imbalance during application execution.

This poster will sketch how to add such semantic annotations to traces. These annotations will record progress (i.e., runtime events corresponding to progress toward some application goal, typically outer loops) and effort (i.e., data-dependent runtime events with variable iteration counts, typically inner loops). Using these annotations in conjunction with other metrics, we will show how one can easily diagnose and visualize load imbalance.
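As a minimal, hypothetical illustration of the progress/effort distinction, the sketch below instruments an outer timestep loop as progress and a data-dependent inner loop as effort; the per-step effort totals are the kind of information the trace annotations would record. The function names (note_progress, note_effort) are placeholders, not part of our framework's API.

    /* Illustrative sketch only: manual progress/effort annotation.
     * Progress marks one step toward the application's goal (an outer
     * timestep); effort counts the data-dependent inner-loop work spent to
     * make that step. Comparing effort per progress step across ranks and
     * over time exposes evolving load imbalance. */
    #include <stdio.h>

    static unsigned long progress_steps   = 0;
    static unsigned long effort_this_step = 0;

    static void note_effort(unsigned long units) { effort_this_step += units; }

    static void note_progress(void)
    {
        /* In a real tool this would emit an annotated trace event; here we
         * simply print the effort accumulated during the completed step. */
        printf("step %lu: effort = %lu\n", ++progress_steps, effort_this_step);
        effort_this_step = 0;
    }

    int main(void)
    {
        for (int t = 0; t < 10; t++) {          /* outer loop: progress      */
            int inner = 100 + 50 * (t % 3);     /* stand-in for data-dependent work */
            for (int i = 0; i < inner; i++)
                note_effort(1);                 /* inner loop: effort        */
            note_progress();
        }
        return 0;
    }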

We will also provide several graphs of useful data gathered by our tool, showing the evolution of load imbalance on large-scale runs of UMT2K, a code known to exhibit load imbalances, and Raptor, a well-known AMR code.

[1] Part of this work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48 (UCRL-ABS-233187). This work was also supported in part by NSF grants CNS-0410203, CCF-0429653, CAREER CCR-0237570, SCI-0510267, and the SciDAC Performance Engineering Research Institute grant DE-FC02-06ER25764.