Dynamic Load Balancing for Cluster Computing

Jaswinder Pal Singh,

CSE @ Technische Universität München.

Abstract: In parallel simulations, partitioning and load-balancing algorithms compute the distribution of application data and work to processors. The effectiveness of this distribution greatly influences the performance of a parallel simulation. Decompositions that balance processor loads while keeping the application's communication costs low are preferred. Although a wide variety of partitioning and load-balancing algorithms have been developed, the load-balancing problem is not yet completely solved, as the effectiveness of each algorithm depends on the characteristics of the application using it. New applications and architectures require new partitioning features. New models are needed for non-square, non-symmetric, and highly connected systems arising from applications in biology, circuits, and materials simulations. Increased use of heterogeneous computing architectures requires partitioners that account for non-uniform computing, network, and memory resources. This paper introduces the topic and proposes an algorithm to parallelize adaptive grid generation on a cluster, followed by a brief look at the future prospects of DLB.

Keywords: Distributed Systems, Dynamic Load Balancing, Adaptive Grids, Refinement Tree Bisection

1. Introduction

Modern scientific computations are increasingly large, irregular, and computationally intensive, demanding more computing power than a conventional sequential computer can provide. The computing power of a single processor is bounded by the fastest processor available at any given time, but it can be increased dramatically by integrating a set of processors; such systems are known as parallel systems (Fig 1). An in-depth discussion of existing systems is presented in [1], [2], [3]. The cluster, the system relevant to our discussion, is a collection of heterogeneous workstations with a dedicated high-performance network, and it can present a Single System Image (SSI) spanning its nodes. The message-passing programming model, in which each processor (or process) is assumed to have its own private data space and data must be explicitly moved between spaces by sending messages, is inherent to clusters. The principal challenge of parallel programming is to decompose the program into subcomponents that can be run in parallel; this can be achieved by exploiting either functional or data parallelism.

In functional parallelism, the problem is decomposed into a large number of smaller tasks, which are assigned to the processors as they become available. Processors that finish quickly are simply assigned more work. Functional parallelism is typically implemented in a client-server paradigm: the tasks are allocated to a group of slave processes by a master process that may also perform some of the tasks itself.
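To make the pattern concrete, the following is a minimal master-worker sketch. It is an illustration rather than part of the original text; the use of mpi4py, the tag names, and the squaring "task" are all assumptions made for the example.

    # Master-worker (functional parallelism) sketch using mpi4py.
    # Run with e.g.: mpirun -n 4 python master_worker.py
    # Assumes there are more tasks than workers; the task itself is hypothetical.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()
    TASK_TAG, STOP_TAG = 1, 2

    if rank == 0:                           # master
        tasks = list(range(20))             # hypothetical work items
        status = MPI.Status()
        results, active = [], size - 1
        for worker in range(1, size):       # seed every slave with one task
            comm.send(tasks.pop(), dest=worker, tag=TASK_TAG)
        while active > 0:
            result = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
            results.append(result)
            if tasks:                       # fast workers simply get more work
                comm.send(tasks.pop(), dest=status.Get_source(), tag=TASK_TAG)
            else:
                comm.send(None, dest=status.Get_source(), tag=STOP_TAG)
                active -= 1
        print("results:", sorted(results))
    else:                                   # slave
        status = MPI.Status()
        while True:
            task = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
            if status.Get_tag() == STOP_TAG:
                break
            comm.send(task * task, dest=0, tag=TASK_TAG)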

Data parallelism, also known as domain decomposition, represents the most common strategy for scientific programs. The application is decomposed by subdividing the data space over which it operates and assigning different processors to the work associated with different data subspaces. This leads to data sharing at the boundaries, and the programmer is responsible for ensuring that these data are correctly synchronized. Domain decomposition methods are useful in two contexts. First, the division of a problem into smaller problems through usually artificial subdivisions of the domain is a means of introducing parallelism into the problem. Second, many problems involve more than one mathematical model, each posed on a different domain, so that domain decomposition occurs naturally; fluid-structure interactions are an example of the latter. Associated with MIMD architectures, data parallelism originated the single program, multiple data (SPMD) programming model [4]: the same program is executed on different processors, over distinct data sets.
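The sketch below illustrates the SPMD style for a one-dimensional domain decomposition: each process owns a strip of the data and exchanges one ghost value with each neighbor before every local update. It is an illustration only; mpi4py, the subdomain size, and the placeholder stencil are assumptions.

    # SPMD data-parallel sketch: a 1-D domain split across ranks with
    # ghost-cell exchange at subdomain boundaries (mpi4py assumed).
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    n_local = 8                                # hypothetical local subdomain size
    u = np.full(n_local + 2, float(rank))      # two extra ghost cells
    left = rank - 1 if rank > 0 else MPI.PROC_NULL
    right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

    for step in range(10):
        # Synchronize shared boundary data with both neighbors.
        comm.Sendrecv(sendbuf=u[1:2], dest=left, recvbuf=u[-1:], source=right)
        comm.Sendrecv(sendbuf=u[-2:-1], dest=right, recvbuf=u[:1], source=left)
        # Local update on interior points (placeholder for the real physics).
        u[1:-1] = 0.25 * (u[:-2] + u[2:]) + 0.5 * u[1:-1]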

Load Balancing: Motivation

Consider, for example, an application that after domain decomposition can be mapped onto the processors of a parallel architecture. In our case the underlying hardware is a cluster, and a problem arises because a static resource problem is mapped onto a system with dynamic resources, resulting in a potentially unbalanced execution. Things get even more complicated if we run an application with dynamic run-time behavior on a cluster, i.e., a dynamic resource problem is mapped onto a dynamic resource machine. System changes such as variation in the available power of individual processors, variation in the number of processors, or dynamic changes in the run-time behavior of the application lead to load imbalance among the processing elements. These factors contribute to inefficient use of resources and an increase in total execution time, which is a major concern in parallel programming.

Load balancing (or load sharing) is a policy that takes advantage of the communication facility between the nodes of a cluster by exchanging status information and jobs between any two nodes, in order to find the appropriate granularity of tasks and partition them so that each node is assigned load in proportion to its performance. It aims at improving the performance of the system and decreasing the total execution time. Load-balancing algorithms can be either static or dynamic. Static load balancing uses only information about the average system behavior at the initialization phase, i.e., the assignment is done at compile time, while dynamic load balancing uses run-time state information to manage task allocation. The paper is outlined as follows. The steps involved in a dynamic load balancing algorithm and its classification criteria are presented in Section 2. Section 3 introduces some popular domain decomposition techniques. Parallelization of adaptive grid generation on a cluster is discussed in Section 4 as an application. Section 5 mentions current challenges and future prospects of domain decomposition methods.

2. Dynamic Load Balancing

Dynamic load balancing is carried out through task migration: the transfer of tasks from overloaded nodes to underloaded nodes. To decide when and how to perform task migration, information about the current workload must be exchanged among the nodes. The communication and computation required to make the balancing decision consume processing power and may result in worse overall performance if the algorithm is not efficient enough. In the context of SPMD applications, a task migration between two nodes corresponds to transferring all the data associated with the task and necessary for its execution; hence the need for an appropriate DLB algorithm.

In order to understand a DLB algorithm completely, its four main components (initiation, load-balancer location, information exchange, and load movement) have to be understood:

  • Initiation: The initiation strategy specifies the mechanism that invokes the load-balancing activities. Initiation may be periodic or event-driven. The latter is load-dependent, based upon the monitoring of the local load, and thus more responsive to load imbalances. Event-driven policies can be either sender-initiated or receiver-initiated: in sender-initiated policies, congested servers attempt to transfer work to lightly loaded ones, and the opposite takes place in receiver-initiated policies.
  • Load-balancer location: Specifies the location at which the algorithm itself is executed. In centralized algorithms, a single processor computes the necessary reassignments and informs the involved processors. A distributed algorithm runs locally within each processor. The former may become a bottleneck, while the latter requires load information to be propagated to all the processors, leading to higher communication costs.
  • Information exchange: Specifies the information and load flow through the system, based upon whether the information used by the algorithm for decision-making is local or global, and upon the communication policy. All processors take part in global schemes, whereas local schemes use only information from the processor itself or gathered from its surrounding neighborhood. Local schemes involve lower communication costs, but global information exchange strategies tend to give more accurate decisions.

The communication policy determines the neighborhood of each processor. It specifies the connection topology, which does not have to represent the actual physical topology. A uniform topology indicates a fixed set of neighbors to communicate with, while in a randomized topology each processor randomly chooses another processor to exchange information with. The communication policy also specifies the task/load exchange between different processors. In global strategies, task/load transfers may take place between any two processors, while local strategies define groups of processors and allow transfers to take place only between two processors within the same group.

  • Load movement: Specifies the appropriate load items to be moved or exchanged, as there is a trade-off between the benefit of moving work to balance load and the cost of data movement. Local averaging represents one of the common techniques: the overloaded processor sends load packets to its neighbors until its own load drops to a specific threshold or to the average load (a sketch of this scheme follows this list).
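The following is a serial simulation of a diffusion-style local-averaging step; the ring topology, the diffusion constant, and the assumption of arbitrarily divisible load are all choices made for the illustration, not prescriptions from the literature.

    # Local-averaging sketch: every node ships a fraction of its load excess
    # toward lighter ring neighbors; repeated steps approach the average load.
    def local_averaging_step(loads):
        n = len(loads)
        transfers = [0.0] * n
        for i in range(n):
            for j in ((i - 1) % n, (i + 1) % n):     # ring neighborhood
                excess = loads[i] - loads[j]
                if excess > 0:
                    transfers[i] -= excess / 4       # send part of the excess
                    transfers[j] += excess / 4
        return [l + t for l, t in zip(loads, transfers)]

    loads = [10.0, 2.0, 4.0, 8.0]
    for _ in range(20):
        loads = local_averaging_step(loads)
    print(loads)    # every entry approaches the average load, 6.0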

3. Domain Decomposition and Partitioning Algorithms

3.1 The Partitioning Problem

At its simplest, a partitioning algorithm attempts to assign equal numbers of objects to partitions while minimizing communication costs between partitions. A partition's subdomain, then, consists of the data uniquely assigned to the partition; the union of subdomains is equal to the entire problem domain (Fig 2). Objects may have weights proportional to the computational costs of the objects. These nonuniform costs may result from, e.g., variances in computation time due to different physics being solved on different objects, more degrees of freedom per element in adaptive p-refinement [5], or more small time steps taken on smaller elements to enforce timestep constraints in local mesh-refinement methods [6]. Similarly, nonuniform communication costs may be modeled by assigning weights to connections between objects. Partitioning then has the goal of assigning equal total object weight to each subdomain while minimizing the weighted communication cost.
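Stated more formally (this formalization is added for clarity and is not taken verbatim from the references): let object i have computational weight w_i, let c_ij be the communication weight between objects i and j, and let a map \pi assign each object to one of P partitions. The weighted partitioning problem is then

    \min_{\pi} \sum_{(i,j)\,:\,\pi(i) \neq \pi(j)} c_{ij}
    \quad \text{subject to} \quad
    \sum_{i\,:\,\pi(i) = p} w_i \le (1 + \epsilon) \frac{\sum_i w_i}{P}
    \qquad \text{for } p = 1, \dots, P,

where \epsilon is the acceptable load-imbalance tolerance.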

3.2 Dynamic Repartitioning and Load Balancing Problem

Workloads in dynamic computations evolve in time; for example, in finite element methods with adaptive mesh refinement, process workloads can vary dramatically as elements are added and/or removed from the mesh. Dynamic repartitioning of mesh data, often called dynamic load balancing, then becomes necessary. It is also needed to maintain geometric locality in applications like crash simulations, where high parallel efficiency is obtained when subdomains are constructed of geometrically close elements [7].

In our case dynamic load balancing has the same goals as partitioning, but with the additional constraints that procedures (i) must operate in parallel on already distributed data, (ii) must execute quickly, as dynamic load balancing may be performed frequently, and (iii) should be incremental (i.e., small changes in workloads produce only small changes in the decomposition), as the cost of redistributing mesh data is often the most significant part of a dynamic load-balancing step. While a more expensive procedure may produce a higher-quality result, it is sometimes better to use a faster procedure that obtains a lower-quality decomposition, if the workloads are likely to change again after a short time.

3.3 Partition Quality Assessment

The most obvious measure of partition quality is computational load balance, but load balance alone does not ensure efficient parallel computation. Communication costs must also be minimized, which corresponds to minimizing the number of objects that share data across subdomain boundaries. For mesh-based applications, this cost is often approximated by the number of element faces on boundaries between two or more subdomains. To estimate the cost of interprocess communication, the following metrics have proved to provide better results:

  • A subdomain's surface index is the percentage of all element faces within that subdomain that lie on the subdomain boundary. The maximum local surface index is the largest surface index over all subdomains and approximates the maximum communication needed by any one subdomain, while the global surface index measures the percentage of all element faces that are on subdomain boundaries [8] and approximates the total communication volume (a sketch computing both indices follows this list). Minimizing only the edge cut or global surface index statistics is not enough [8] for the following reasons:

First, the number of faces shared by subdomains is not necessarily equal to the communication volume between the subdomains [8]; an element could easily share two or more faces with a neighboring subdomain, but the element's data would be communicated to that neighbor only once.

Second, interprocess connectivity, i.e., the number of processes with which each process must exchange information during the solution phase, is a significant factor due to its dependence upon interconnection network latency [8].

Third, communication should be balanced, not necessarily minimized [9]. A balanced communication load often corresponds to a small maximum local surface index.

  • Internal connectivity of the subdomains has also proved to be a measure of partition quality. Having multiple disjoint connected components within a subdomain (also known as subdomain splitting) can be undesirable, as the solution of the linear systems will converge slowly for partitions with this property [11]. Additionally, if a relatively small disjoint part of one subdomain can be merged into a neighboring subdomain, the boundary size will decrease, thereby improving the surface indices.
  • Subdomain aspect ratio is the ratio of the square of the radius of the smallest circle that contains the entire subdomain to the subdomain's area (Diekmann et al. [10]). It has also been reported as an important factor in partition quality [11], particularly when iterative methods such as Conjugate Gradient (CG) or Multigrid are used to solve the linear systems. The authors of [11] show that the number of iterations needed for a preconditioned CG procedure grows with the subdomain aspect ratio. Furthermore, large aspect ratios are likely to lead to larger boundary sizes.
  • Geometric locality of elements is an important indicator of partition effectiveness for some applications. While mesh connectivity provides a reasonable approximation to geometric locality in some simulations, it does not represent geometric locality in all simulations. (In a simulation of an automobile crash, for example, the windshield and bumper are far apart in the mesh, but can be quite close together geometrically.) Quality metrics based on connectivity are not appropriate for these types of simulations.
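The surface indices above are straightforward to compute from a face-to-subdomain mapping. The following sketch does so for a toy mesh; the face list, the part assignment, and the data layout are hypothetical.

    # Computes the maximum local surface index and the global surface index.
    # faces: list of (elem_a, elem_b) pairs; elem_b is None on the domain boundary.
    # part:  dict mapping each element to its subdomain id.
    from collections import defaultdict

    def surface_indices(faces, part):
        total = defaultdict(int)     # faces touching each subdomain
        shared = defaultdict(int)    # faces of each subdomain on inter-subdomain boundaries
        cut, nfaces = 0, 0
        for a, b in faces:
            nfaces += 1
            total[part[a]] += 1
            if b is not None:
                total[part[b]] += 1
                if part[a] != part[b]:
                    shared[part[a]] += 1
                    shared[part[b]] += 1
                    cut += 1
        local = {p: shared[p] / total[p] for p in total}
        return max(local.values()), cut / nfaces

    faces = [(0, 1), (1, 2), (2, 3), (0, None), (3, None)]   # toy mesh
    part = {0: 0, 1: 0, 2: 1, 3: 1}
    print(surface_indices(faces, part))   # (0.25, 0.2)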

3.4 Partitioning and Dynamic Load Balancing Taxonomy

A variety of partitioning and dynamic load balancing procedures have been developed. Since no single procedure is ideal in all situations, many of these alternatives are commonly used. This section describes many of the approaches, grouping them into geometric methods, global graph-based methods, and local graph-based methods. Geometric methods examine only the coordinates of the objects to be partitioned. Graph-based methods use the topological connections among the objects. Most geometric or graph-based methods operate as global partitioners or repartitioners. Local graph-based methods, however, operate among neighborhoods of processes in an existing decomposition to improve load balance. This section describes the methods; their relative merits are discussed below.

3.4.1 Geometric Methods

Geometric methods use only objects' spatial coordinates and computational weights to compute a decomposition that balances the total weight of the objects assigned to each partition. They are effective for applications in which objects interact only if they are geometrically close to each other. Examples of such methods are:

  1. Recursive Bisection: Recursive bisection methods divide the simulation's objects into two equally weighted sets; the algorithm is then applied recursively to obtain the desired number of partitions (a sketch of RCB follows this list). In Recursive Coordinate Bisection (RCB) [12], the two sets are computed by cutting the problem geometry with a plane orthogonal to a coordinate axis. The plane's direction is selected to be orthogonal to the longest direction of the geometry; its position is computed so that half of the object weight is on each side of the plane. RCB is incremental and suitable for DLB. Like RCB, Recursive Inertial Bisection (RIB) [13] uses cutting planes to bisect the geometry; however, the direction of the plane is computed to be orthogonal to the principal axis of inertia. RIB is not incremental and may not be suitable for dynamic load balancing.
  2. Space-Filling Curves: A space-filling curve (SFC) maps n-dimensional space to one dimension [11]. In SFC partitioning, an object's coordinates are converted to a SFC key representing the object's position along a SFC through the physical domain. Sorting the keys gives a linear ordering of the objects. This ordering is cut into appropriately weighted pieces that are assigned to processors.
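The following sketch illustrates RCB on a set of weighted points. It is a serial illustration with hypothetical data; it assumes the number of partitions is a power of two, whereas a production RCB would also handle distributed data and general part counts.

    # Recursive Coordinate Bisection (RCB) sketch: cut along the longest
    # coordinate direction at the weighted median, then recurse on each half.
    import numpy as np

    def rcb(points, weights, nparts):
        if nparts == 1:
            return [np.arange(len(points))]
        # Cut orthogonal to the longest extent of the bounding box.
        axis = np.argmax(points.max(axis=0) - points.min(axis=0))
        order = np.argsort(points[:, axis])
        csum = np.cumsum(weights[order])
        split = np.searchsorted(csum, csum[-1] / 2.0)      # weighted median
        halves = (order[:split + 1], order[split + 1:])
        parts = []
        for idx in halves:                                 # recurse on each side
            for sub in rcb(points[idx], weights[idx], nparts // 2):
                parts.append(idx[sub])                     # map back to caller's indices
        return parts

    pts = np.random.rand(100, 2)        # hypothetical object coordinates
    wts = np.ones(100)                  # unit computational weights
    for i, p in enumerate(rcb(pts, wts, 4)):
        print("part", i, "total weight", wts[p].sum())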

3.4.2 Global Graph-Based Partitioning

A popular and powerful class of partitioning procedures makes use of connectivity information rather than spatial coordinates. These methods use the fact that the partitioning problem in Section 3.1 can be viewed as the partitioning of an induced graph G = (V, E), where objects serve as the graph vertices (V) and connections between objects are the graph edges (E). For example, Figure 3 shows an induced graph for the mesh in Figure 2; here, elements are the objects to be partitioned and, thus, serve as vertices in the graph, while shared element faces define graph edges. A k-way partition of the graph G is obtained by dividing the vertices into subsets V1, …, Vk, where V = V1 ∪ … ∪ Vk and Vi ∩ Vj = ∅ for i ≠ j. Figure 3 (right) shows one possible decomposition of the graph induced by the mesh. Vertices and edges may have weights associated with them representing computation and communication costs, respectively. The goal of graph partitioning, then, is to create subsets V1, …, Vk with equal vertex weights while minimizing the weight of the edges "cut" by subset boundaries. An edge eij between vertices vi and vj is cut when vi belongs to one subset and vj belongs to a different one. Algorithms providing an optimal partitioning are NP-complete [14], so heuristic algorithms are generally used.

The greedy algorithm, spectral partitioning, and multilevel partitioning are among the static partitioners, intended for use as a preprocessing step rather than as dynamic load balancing procedures; a sketch of the greedy approach follows. Some of the multilevel procedures do operate in parallel and can be used for dynamic load balancing.
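As an illustration of the greedy approach, the sketch below grows each part from a seed vertex by breadth-first search until the part holds roughly 1/k of the total vertex weight. The toy graph and the tie-breaking choices are hypothetical; real greedy partitioners use more careful seed selection.

    # Greedy graph-growing partitioning sketch.
    # adj: adjacency lists of the induced graph; weights: vertex weights.
    from collections import deque

    def greedy_partition(adj, weights, k):
        n = len(adj)
        target = sum(weights) / k
        part = [-1] * n
        unassigned = set(range(n))
        for p in range(k):
            if not unassigned:
                break
            queue, acc = deque([next(iter(unassigned))]), 0.0
            while queue and (acc < target or p == k - 1):
                v = queue.popleft()
                if part[v] != -1:
                    continue
                part[v] = p
                acc += weights[v]
                unassigned.discard(v)
                queue.extend(u for u in adj[v] if part[u] == -1)
        for v in unassigned:             # vertices BFS never reached (disconnected graphs)
            part[v] = k - 1
        return part

    adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3, 5], [4]]   # toy mesh graph
    print(greedy_partition(adj, [1] * 6, 2))                 # e.g. [0, 0, 0, 1, 1, 1]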