Load Balancing Irregular Hierarchical Parallelism:
A Case Study from Biomolecular Structure Computation
Electrical Engineering Dept. / Computer Science Dept. / Stanford Medical Informatics
Stanford University / Princeton University / Stanford University
/ /
Abstract
Many applications exhibit a tree-structured computational pattern, in which the nodes encapsulate significant parallelizable parallelizable computation and the topology expresses dependencies among the nodes. We evaluate two partitioning algorithms to exploit the resulting irregular hierarchical parallelism, applying them to our molecular structure determination program. Our first strategy, similar to several sparse Cholesky partitioning methods, statically allocates processors over the computational hierarchy according to estimated workloads in the tree nodes. While further load balancing by common task stealing techniques is possible in some applications, it is made infeasible here by computational dependencies within each node. Instead, we present a new, distributed dynamic partitioning approach which we call “dynamic regrouping”. Groups of processes start traversing the task graph according to aforementioned statically initialized schedules, but later dynamically regroup to achieve load balance, through a general algorithm implemented via orderly state transitions of the tree nodes. We compare the performance of the two load balancing algorithms on a 64-processor SGI Origin2000, a cache-coherent distributed shared memory multiprocessor. The dynamic regrouping approach yields increasingly better performance over the static scheme as the topology becomes more irregular and the workload more unpredictable, improving the execution time by an average of 20% and as high as 70% in typical, realistic cases. The dynamic algorithm is expected to perform even better for the new types of data being incorporated in more recent macromolecular modeling efforts. Finally, we briefly discuss the applicability of this dynamic approach to other emerging classes of parallel applications with similar computational patterns.
Keywords: load balancing, hierarchical, probabilistic least squares, shared memory multiprocessing, molecular structures.
1
1.Introduction
The computational structure of many applications can be described by a tree. There is significant parallelism within the computation of each node, as well as topological parallelism across nodes according to the dependency imposed by the underlying task graph. Examples include direct sparse Cholesky factorization via elimination trees [23, 42, 29], query evaluation in belief networks using join trees [36], and several hierarchical biomolecular structure determination algorithms [30, 4, 12].
When the computational content in individual nodes is very large and can be divided evenly among the processors, static work partition solely within nodes suffices. Even when topological parallelism is important to exploit, as long as the work distribution over the computational hierarchy is predictable and evenly divisible, static assignment based on analysis of the workload distribution can perform well. However, in many applications the distribution of work can be highly skewed and even non-deterministic. A static partitioning of computation among the processors based on analysis of the input data may suffer significant load imbalance. Alternatively, the workload may be statically assignable, but the execution environment may not be well suited to static schemes; e.g. a heterogeneous system, a multiprogrammed system, or a system in which communication imbalances and contention make an otherwise load balanced computation imbalanced. Task queues with dynamic task stealing are often used to recover load balance after a good static assignment in such cases. However, in a hitherto not well studied class of applications, the computation in each node may itself have multiple stages with dependencies and localized synchronization among them, making common dynamic assignment techniques such as task stealing difficult and/or ineffective. We report our experience with the parallelization of a probabilistic least squares algorithm adapted for molecular structure estimation, which has all these challenging properties. We describe a distributed dynamic partitioning algorithm that we have developed, called dynamic regrouping, and compare its performance on several test problems from our application domain with that of an earlier parallel implementation based on static analysis of the input data.
The static partitioning scheme exploits both available levels of parallelism by assigning a subset of the processors to each node or subtree according to an estimated workload distribution across nodes and subject to the topological ordering constraints. The strategy is similar to several partitioning algorithms for sparse matrix factorization. The dynamic regrouping approach builds on the initial static work partition to improve load balance while still preserving locality of reference as much as possible. Groups of processors start traversing the computational hierarchy according to statically initialized schedules. When a processor group is stalled at a parent node on its static schedule due to computational dependencies, it dynamically computes a “detour” down the subtree rooted at a sibling node, taking on additional, available nodes or joining another group to achieve load balance. Eventually the original idle processor group, now possibly part of a larger group, comes back to the parent node at which it would have stalled after having satisfied the computational dependencies. These regrouping events are “permanent”, so locality is preserved thereafter; they are regulated by orderly state transitions of the tree nodes which signify the completion of certain computational dependencies.
We demonstrate the improved performance of the dynamic scheme over the static work partition alone on a 64-processor SGI Origin2000, a cache-coherent distributed shared memory multiprocessor. The dynamic algorithm yields increasingly better performance as the computational hierarchy becomes more irregular and the workload more unpredictable, improving the execution time by an average of 20% and as high as 70% for typical, realistic input problems. The dynamic algorithm is expected to perform even better for the new types of structural data being incorporated in more recent macromolecular modeling efforts. Finally, this multi-level parallelism is not unique to our program. We point out several applications with similar computational patterns and touch on the applicability of the dynamic scheduling technique to them.
2.Background
Determining the three-dimensional structure of molecules such as proteins and nucleic acids is an important task of molecular biology, both for understanding their functions and to aid in the design of drugs. However, large-scale structure determination presents a difficult computational challenge because of the large number of variables in the objective function and the highly nonlinear relationships among the parameters. In addition, structural data come from many experimental and theoretical sources, where their quality and abundance can vary significantly. Thus, it is important to combine these sources of information effectively to produce not only a structure consistent with the data, but also a measure of uncertainty in the estimated structure.
A probabilistic least squares algorithm has been applied to estimate the three-dimensional structure of molecules from multiple sources of uncertain data in [3, 1]. Its parallel implementation was reported in [13], and an extension of the algorithm for more general probability distributions can be found in [2]. Our most recent study of this algorithm focused on the hierarchical decomposition of the algorithm to improve its performance tremendously on real problems and an initial, static parallelization of the resulting computation [11, 12].
In our hierarchical application, the root of the tree represents the entire molecule of interest. Within the root node, the state vector contains the atomic coordinates; and the covariance matrix contains estimates of the uncertainties in the coordinates, as well as linear correlations between them. The children of a tree node correspond to a partition of the molecular structure represented by that node. The state vector of the node is partitioned, and the corresponding diagonal blocks of the covariance matrix are distributed among the children. Attached to each node is a queue of constraints which relate various parts of the substructure of the node. The computation at each node applies the available structural constraints to the state variables of the node using the least squares algorithm (Figure 1) and propagates the final, updated structure to the parent of the node (Figure 2). By visiting the tree in a reverse topological order, we have shown that the hierarchical decomposition (Figure 3) yields the same result as the original “flat” computation with the same ordering of constraints, at greatly reduced computational complexity [12].
3.Exploiting Parallelism
The computation within a node in the hierarchy is an instance of the original, flat problem, which sequentially applies the constraints in an iterative algorithm to update the state vector and covariance matrix (Figure 1). The entailed matrix operations may be parallelized (in-node parallelism). In addition, nodes with no ancestor-descendant relationship encapsulate independent computations and may be carried out by different processors (topological parallelism). At the leaves of the tree, there are many independent nodes, each of which encompasses a relatively small amount of work. As the computation progresses toward the root, there are many fewer nodes remaining; however, the data space per node and its computational content grow significantly. Therefore both levels of parallelism, in-node and topological, are important for high performance.
Our previous parallel implementation [11] exploits both levels of parallelism by statically assigning a subset of the processors to each node or subtree according to estimated workload. This partitioning heuristic is similar to several used for sparse matrix factorization. It first estimates the work at each node using simple polynomial models of computational complexities for the update equations (Figure 1) and accumulates the numbers upward so that the estimated amount of work in any subtree is available. All the processors are initially assigned to work on the root. Then, recursively, given a node which has been assigned a group of processors, those processors are distributed across the child branches according to the ratio of work in the corresponding subtrees. An example work partition is shown in Figure 3. A processor assigned to a node also participates in the computation of its parent node. Therefore a group barrier among processors assigned to the children is used, at entry to the parent node computation, to ensure that memory has been updated.
A potential shortcoming with the static assignment strategy is that the available number of processors at a node is usually not divisible in the same ratio as the workload distribution in the children’s subtrees. Furthermore, the work at a node may not be easily predictable because of non-determinism. The sparse matrix computations in the tree nodes incur unpredictable interaction with the computer memory system hierarchy and deviate substantially from a simple estimation of execution time by counting flops. Also, constraint function evaluation may have a randomized algorithmic component with a large variance in its execution time. And finally, the linearization loop in the constraint application within a node (Figure 1) usually executes a variable and unpredictable number of iterations as well. The resulting mismatch in processor partitions leads to load imbalance and longer synchronization wait times at entry to the node computation. We need a dynamic load balancing mechanism.
Unfortunately, a tasking stealing approach cannot be effectively applied in this application due to the nature of its computational dependencies. First, defining an entire node as the unit of task stealing is too coarse a division, especially further up the tree. Second, down at the next level of granularity, the constraints pertaining to a node may not be defined as tasks and parceled out among processors. They need to be applied to the node sequentially (Figure 1). Finally, within a constraint application, there are multiple phases of computation interspersed with in-node synchronization operations. These dependencies detract greatly from the effectiveness of breaking up the matrix computations within a constraint application into tasks for stealing, because the results of the stolen tasks must quickly and repeatedly be merged with those of the others at these synchronization points.
Our dynamic load balancing scheme uses a technique we call dynamic regrouping. Since static assignment gives us more control over data locality, we begin with the aforementioned static work partition, which assigns individual processors or groups of processors to each node. We stipulate that a processor visit its assigned portion of the tree (see Figure 3) according to a reverse topological order in which left siblings take precedence over right siblings and their descendants. This assignment “spaces out” the processors over the tree as much as possible, so that different processor groups tend not to access the same data and hence communicate. Eventually, however, a processor group may have to wait for others at a group barrier (involving only those assigned to sibling nodes) before entering a parent node. At this point, the idle processor group instead takes a “detour” from its precomputed schedule to participate in additional work in other nodes. In order to maintain as much data locality defined by the static schedule as possible, we limit the idle processor group to systematic searches within the parent’s subtree. First, we allow the idle processor group to take over available nodes which are not on their statically assigned schedule. Second, an idle processor group may join a busy processor group, even while the latter is currently working on a node, but only at certain synchronization points between constraint applications to that node. Once the processors join a group, they remain part of that group in subsequent computations, retaining data locality. Eventually the original idle processor group, now possibly part of a larger group, will come back to the parent node at which it would have stalled after having satisfied the computational dependencies. These regrouping events load balance the progress of computation over the tree.
The dynamic algorithm is general, and can be used for applications with the two-tiered parallelism structure we have described. To explain how it is orchestrated, we first identify distinct phases a node may go through in the course of the hierarchical computation (Figure 4).
- Countdown
In this state, a node keeps track of the number of its children whose constraint updates have not completed. - Ready
The current node has no outstanding topological dependencies; however, no processor group is working on the node at the moment. - Occupied
A processor group is presently working on the node. - Final
This final state signifies that all constraints have been applied to the current node, and the updated substructure has been propagated to the parent.
Initially, the leaf nodes of the tree are in the ready state, while all internal nodes are in countdown. The first nodes visited by processors are advanced to the occupied state to begin the state transitions.
When a processor group has completed the update of a node and propagated the substructure upward, it advances the node’s state to final and decrements the countdown counter of the parent node. If the parent’s counter becomes zero, the processor group changes the parent’s state from countdown to occupied, thereby claiming it for computation. The operations on these two related nodes’ states must occur atomically; however, other state transitions in different nodes may proceed without interference. We therefore need a distributed mechanism to minimize bottlenecks in regulating node transitions while preserving the necessary synchronization and atomicity.
The necessary atomicity in updating certain related nodes' states is achieved by associating a lock with every node. When a processor group finishes updating a node, it initiates a critical section by obtaining the lock of the parent node. This critical section encompasses the aforementioned sequence of actions: advancing the node to the final state, and decrementing the parent’s counter. At this point, if the parent’s counter reaches zero, the parent is advanced to the occupied state and claimed by the processor group, and the lock is released to terminate the critical section.
Otherwise, the goal of the just-turned-idle processor group is to find additional work as soon as possible, by locating a ready descendant of the parent node (to take over) or an occupied descendant (to join). A positive countdown in the parent means that a sibling has not reached the final state. The processor group, while holding the lock on the parent, performs a cyclical scan of the siblings, starting to the immediate right of the current node, until a non-final node is found (guaranteed by possession of the parent’s lock). The processor group first reads a sibling’s state, and if that state is non-final, the processor group obtains that sibling’s lock and re-examines its state. If, on the second examination, the state becomes final, the processor group releases the sibling’s lock and moves on to the next node. This additional locking is required because further action depends on the exact state of the non-final sibling, which otherwise could change at any time from countdown or ready to occupied. Thus a processor group may hold up to two locks at a time.