Variation Aware Application Scheduling for Chip Multi-Processors

Lavanya Subramanian, Aman Kumar

Carnegie Mellon University

{lsubrama, amank}@andrew.cmu.edu

Abstract

Variations in chip multiprocessors are fast becoming a major concern with nanometer scaling. Within-die variation, in particular, is gaining significance at sub-65-nanometer technology nodes. Techniques are being explored to exploit variability information to achieve better performance and energy efficiency. We propose a unified approach to application scheduling that uses variability information to attack performance and energy efficiency simultaneously.

1. Introduction

Variations in chip multiprocessors are a major concern. They have two components: a die-to-die component and a within-die component.

  • The die-to-die component of variation is addressed by speed binning, and there has been considerable work on such techniques and methodologies.
  • The within-die component, however, has been gaining attention lately. At the transistor/device level, these are variations in Leff (effective gate length) and Vth (threshold voltage). They translate into frequency and leakage current variations at the micro-architecture level.

The view of a chip multiprocessor (CMP) as a collection of homogeneous cores is no longer valid. A CMP must instead be viewed as a collection of heterogeneous cores with different frequency and power profiles. These variations can be profiled or modelled in terms of per-core leakage and frequency parameters. This information, coupled with the characteristics of the applications/workloads that run on the CMP, can be exploited to obtain better energy efficiency and performance.

2. Related Work

There has been some work in this direction. [1] presents a set of algorithms, each aimed at either power or performance.

  • The basic power-oriented algorithm (VarP) maps applications onto the least leaky cores.
  • Its enhanced version (VarP+AppP) maps the applications with the highest dynamic power onto the least leaky cores.
  • Similarly, the performance-oriented algorithms map applications onto the fastest cores.

3. Motivation

[1] presents power- and performance-optimized algorithms; each, however, targets either power reduction or performance enhancement alone. We aim to treat the two in a unified fashion, motivated by the following observation: among cores that can operate at a given maximum frequency, there is a wide variation in leakage profiles, and among cores with a given leakage power, there is a wide spread in maximum frequency [3]. On the basis of this observation, we propose to enhance the schemes presented in [1].

4. Proposed Scheme

As mentioned earlier, previous work has focused on either power or performance. One possible heuristic for a unified scheme is as follows:

  1. Rank the cores by the maximum frequency at which each can run.
  2. Obtain the static leakage power of each core (profiled statically at a nominal temperature).
  3. Rank the applications by dynamic power (obtained by static profiling on a core).
  4. For each application, starting with the one with the highest dynamic power, map it onto the fastest core, using least leakage to break ties. This can be achieved by sorting the cores into frequency and leakage levels/bins; a minimal sketch follows this list.
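
A minimal Python sketch of steps 1-4 follows. The Core and App records and their fields are assumed stand-ins for the profiled per-core and per-application data described in Section 5, not part of any existing tool.

    from dataclasses import dataclass

    @dataclass
    class Core:
        cid: int
        freq_ghz: float     # maximum frequency the core can sustain
        leakage_w: float    # statically profiled leakage power

    @dataclass
    class App:
        name: str
        dyn_power_w: float  # statically profiled dynamic power

    def unified_schedule(cores, apps):
        """Greedy unified mapping: the highest-dynamic-power
        applications go to the fastest cores, with leakage as the
        tie-breaker within a frequency bin."""
        ranked_cores = sorted(cores, key=lambda c: (-c.freq_ghz, c.leakage_w))
        ranked_apps = sorted(apps, key=lambda a: -a.dyn_power_w)
        return {a.name: c.cid for a, c in zip(ranked_apps, ranked_cores)}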

We plan to analyze the power/performance gains from this heuristic and possibly tweak it based on the results we obtain.

The variability model in [2] will be used to model the frequency and leakage variability information.

5. Technical Description

The infrastructure needed to run our heuristic and analyze it against the other algorithms is built in the following steps.

5.1 Static Profiling

This is the first step in the power macro-modelling for the BLESS CMP simulator. We use a single-core simulator, Sim-GALS, to obtain these numbers. This simulator targets a globally asynchronous, locally synchronous system; we set all local frequencies to be equal, and our main reason for using it is the reasonably accurate leakage modelling that is part of the tool. We use 45nm technology models. We simulate SPEC 2000 benchmarks on this simulator and obtain per-instruction dynamic power numbers for memory and non-memory instructions, as well as per-cycle leakage numbers. The rationale for obtaining these numbers is that BLESS does not model the core in great detail; it distinguishes only between memory and non-memory instructions. The power numbers from the static profiling are presented in the Preliminary Results section; we use their averages in BLESS.
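
For concreteness, here is a small sketch of how the per-benchmark Sim-GALS numbers collapse into the average constants used in BLESS. The values are those of Table 1 in Section 6.1; the tuple layout and names are our own.

    # Per-benchmark (NMIDP, MIDP, ACLP/cycle) tuples from Table 1, in watts.
    PROFILES = {
        "ammp":   (4.8560, 3.6018, 0.1272),
        "gzip":   (2.5140, 1.3364, 0.0897),
        "vpr":    (4.0125, 2.9914, 0.1569),
        "mesa":   (2.6177, 1.5051, 0.1261),
        "art":    (3.7089, 2.8037, 0.1719),
        "mcf":    (3.3925, 2.5841, 0.1716),
        "parser": (2.6258, 1.7255, 0.1529),
        "vortex": (3.8746, 2.8734, 0.1536),
        "bzip2":  (2.4704, 1.3382, 0.0854),
    }

    def averages(profiles):
        """Column-wise average over all benchmarks."""
        cols = zip(*profiles.values())
        return tuple(sum(c) / len(c) for c in cols)

    NMIDP_W, MIDP_W, ACLP_W = averages(PROFILES)  # ~3.341, ~2.307, ~0.137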

5.2 Variation map generation

The next step is to generate variation maps characterizing the variation of leakage power and frequency across the different cores. We obtain these maps at per-core granularity, using the Varimap tool developed by Sebastian to generate variation maps for Leff (gate length). We model the dependence of leakage power on Leff as follows: we simulate an inverter in HSPICE while varying the gate length, and plot the inverter's leakage power against gate length. Fitting this data in MATLAB yields the following relationship for leakage power:

LeakageVar = exp(0.051 ∆Leff² − 0.6 ∆Leff − 0.062) × Leakage

where ∆Leff is the gate length variation from the nominal, Leakage is the nominal leakage power, and LeakageVar is the variation-adjusted leakage power.

Frequency variation is modelled by taking the delay variation to be directly proportional to the gate length variation.
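
The two models can be sketched in Python as follows, assuming ∆Leff is expressed in nanometres; the delay sensitivity alpha is an illustrative assumption, not a fitted value.

    import math

    def leakage_with_variation(nominal_leakage_w, d_leff_nm):
        """Scale nominal leakage by the exponential fitted above
        (MATLAB fit to the HSPICE inverter sweep)."""
        scale = math.exp(0.051 * d_leff_nm ** 2 - 0.6 * d_leff_nm - 0.062)
        return scale * nominal_leakage_w

    def freq_with_variation(nominal_freq_ghz, d_leff_nm, alpha=0.05):
        """Delay is taken to vary in direct proportion to the gate
        length variation, and frequency is its reciprocal; alpha is
        an assumed per-nm delay sensitivity."""
        return nominal_freq_ghz / (1.0 + alpha * d_leff_nm)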

We use these models to produce a 4 x 4 variation map, giving the leakage power and frequency of each core in a 4 x 4 CMP; a sketch of this step follows.
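
The sketch below reuses the Core record from Section 4 and the model functions above. The independent Gaussian ∆Leff samples, nominal frequency, and sigma are illustrative assumptions standing in for the actual Varimap output, which is spatially correlated.

    import random

    def make_variation_map(n=4, f_nom_ghz=3.0, leak_nom_w=0.137,
                           sigma_nm=2.0, seed=0):
        """Build an n x n per-core (frequency, leakage) map from
        per-core dLeff samples; leak_nom_w is the Table 1 average."""
        rng = random.Random(seed)
        cores = []
        for cid in range(n * n):
            d_leff = rng.gauss(0.0, sigma_nm)
            cores.append(Core(cid,
                              freq_with_variation(f_nom_ghz, d_leff),
                              leakage_with_variation(leak_nom_w, d_leff)))
        return cores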

5.3 Power/Variation Macro modelling in BLESS

The next step is to bring the variation-adjusted power and frequency models into BLESS, the CMP simulator. We read in the frequency and leakage maps generated by the variation modelling. For dynamic power, we use the per-instruction dynamic power numbers for memory and non-memory instructions, scaled by the operating frequency of the corresponding core; for leakage power, we use each core's per-cycle leakage number from the variation map. Putting these together, we report the power and performance (MIPS) of the different cores, and we examine how they vary across cores to get a rough feel for the variation behaviour. We present this in the Preliminary Results section; a sketch of the per-core accounting follows.
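
This sketch is one plausible reading of the accounting described above: the constants NMIDP_W and MIDP_W are the Table 1 averages from the earlier sketch, while the nominal frequency and the exact weighting of per-instruction powers are assumptions, not the precise BLESS formulation.

    def core_stats(n_mem, n_nonmem, cycles, core, f_nom_ghz=3.0):
        """Report (average power in W, MIPS) for one core over a run
        of `cycles` cycles executing n_mem memory and n_nonmem
        non-memory instructions."""
        # Dynamic power: per-instruction powers weighted by the
        # instruction mix per cycle, scaled by the core's frequency.
        dyn_w = (n_mem * MIDP_W + n_nonmem * NMIDP_W) / cycles
        dyn_w *= core.freq_ghz / f_nom_ghz
        power_w = dyn_w + core.leakage_w          # add per-cycle leakage
        secs = cycles / (core.freq_ghz * 1e9)     # wall-clock run time
        mips = (n_mem + n_nonmem) / secs / 1e6
        return power_w, mips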

We now have the basic infrastructure: a CMP simulator with power and variability models. The next step is to build a mock scheduler, into or on top of the CMP simulator, that migrates applications between cores at scheduling intervals (sketched below). Then we will be set to compare the algorithms proposed in [1] against our heuristic.
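
As a rough illustration of the planned mock scheduler, here is a sketch of the interval loop; the sim object and all of its methods are hypothetical placeholders for a BLESS interface, not an existing API.

    def scheduling_loop(sim, cores, apps, interval_cycles=1_000_000):
        """At each scheduling interval, re-run the Section 4 heuristic,
        migrate applications, and refresh measured dynamic powers."""
        while not sim.finished():                    # hypothetical API
            mapping = unified_schedule(cores, apps)  # heuristic from Sec. 4
            sim.migrate(mapping)                     # move apps between cores
            sim.run(interval_cycles)                 # simulate one interval
            for app in apps:                         # measured over the interval
                app.dyn_power_w = sim.dynamic_power(app.name)
        return sim.report()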

6. Preliminary Results

6.1 Static Profiling results from Sim-GALS

These are our static profiling results from Sim-GALS for the SPEC 2000 benchmarks. They list the dynamic power of memory and non-memory instructions and the per-cycle core leakage power. We average the numbers across all benchmarks for use in BLESS.

Application | NMIDP* (Watt) | MIDP* (Watt) | ACLP/cycle* (Watt)
ammp        | 4.856         | 3.6018       | 0.1272
gzip        | 2.514         | 1.3364       | 0.0897
vpr         | 4.0125        | 2.9914       | 0.1569
mesa        | 2.6177        | 1.5051       | 0.1261
art         | 3.7089        | 2.8037       | 0.1719
mcf         | 3.3925        | 2.5841       | 0.1716
parser      | 2.6258        | 1.7255       | 0.1529
vortex      | 3.8746        | 2.8734       | 0.1536
bzip2       | 2.4704        | 1.3382       | 0.0854
Average     | 3.3414        | 2.3066       | 0.1373

Table 1: Sim-GALS results (45nm)

*NMIDP – Non-memory instruction Dynamic Power

*MIDP – Memory Instruction Dynamic Power

*ACLP/cycle – Avg. Core Leakage Power per Cycle

6.2 Results after power/variation macro-modelling in BLESS

We picked two applications: perlbench, a compute-intensive application, and mcf, a memory-intensive one. We mapped a copy of perlbench onto every core and studied the MIPS and power with and without variation, then repeated the experiment for mcf. The results are interesting.

Application | Variation | Mean | Standard Deviation
perlbench   | Without   | 6273 | 20
perlbench   | With      | 6077 | 133
mcf         | Without   | 1970 | 99
mcf         | With      | 1909 | 103

Table 2: MIPS comparison

Application | Variation | Mean (Watt) | Standard Deviation
perlbench   | Without   | 5.9         | 0.0179
perlbench   | With      | 5.7         | 0.1720
mcf         | Without   | 1.9462      | 0.0906
mcf         | With      | 1.8669      | 0.1258

Table 3: Avg Power per Cycle (Watt) comparison

The standard deviation of MIPS for perlbench, the compute-intensive application, is much larger once variation is accounted for. The increase for mcf is small: being memory intensive, its performance is less affected by variations in processor frequency. The standard deviation of average power per cycle, however, is quite large for both perlbench and mcf relative to the no-variation case (though perlbench's is larger than mcf's). This is explained by the fact that even memory instructions consume about 2.3 Watt of dynamic power in the core (Table 1), so this component is affected by variation as well.

7. Original Plan

  • Milestone 1:

Static profiling of applications on SimpleScalar with Wattch, to obtain dynamic power numbers.

Build variability information into the BLESS CMP simulator.

  • Milestone 2:

Build a scheduler into or on top of the CMP simulator.

  • Milestone 3:

Implement and analyze the proposed scheme against the baseline algorithms.

8. Progress

We have stuck to our original plan and schedule so far, achieving Milestone 1 as promised in the proposal. Building the scheduler and analyzing the different algorithms remain to be done.

Lavanya worked on the static profiling and its incorporation into BLESS. Aman worked on the variation modelling and map generation.

9. Conclusion

We observe that there is a definite difference in the power/performance of different cores, even when they run the same application, and that a large part of this difference is a result of process variability. We believe these observations further bolster our original plan of building scheduling algorithms that exploit this process variability.

10. Project Website

11. References

[1] R. Teodorescu and J. Torrellas. Variation-aware application scheduling and power management for chip multiprocessors. In ISCA '08: Proceedings of the 35th Annual International Symposium on Computer Architecture, 2008.

[2] Y. Abulafia and A. Kornfeld. Estimation of FMAX and ISB in microprocessors. IEEE Transactions on VLSI Systems, 13(10), Oct. 2006.

[3] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De. Parameter variations and impact on circuits and microarchitecture. In Proceedings of the 40th Annual Design Automation Conference (DAC '03), Anaheim, CA, June 2003, pp. 338-342.

[Figure: variability per core, as modelled]