Architecture and Early Performance Results of Turbo Boost Technology on the Intel® Core™ i7 Processor and Intel® Xeon® Processor 5500 Series (2009)

Markus Mattwandel, Todd Baird, Jorge Garcia, Seongwoo Kim, Herbert Mayer*

Abstract: We survey the Turbo Boost Technology on the new Intel® Core™ i7 multi-core, multi-threaded microprocessor. Turbo Boost Technology dynamically increases the frequency of processor cores for the benefit of higher performance, while operating under thermal design limits and maintaining safe conditions on the physical chip. This paper outlines the degree to which the core frequency can be raised as a function of the number of currently active cores and of other electrical and temperature parameters. We explain the conditions under which such boosts are possible, depending on the instantaneously flowing current, on overall power consumption with its resulting heat generation, and on the actual temperature of the core(s) being boosted. We contrast Turbo with overclocking, another method of boosting frequency and improving performance, and discuss the pros and cons of Turbo versus thermal throttling. Since Turbo Boost Technology has been implemented in silicon on the Core i7, on both single-socket desktops and dual-socket servers, we include actual performance data from average to ideal cases. The Core i7 is implemented in 45 nm High-K Silicon, launched in late 2008 as a High-End Desktop platform with one processor, and in 2009 as a server with two processors, each having 4 cores and 2 hardware threads per core. We conclude with conjectures into the future and a list of references.

Keywords: Multi-Core; Turbo Mode; Overclocking; Simultaneous Multi-Threading (SMT); Parallel System; Logical Core; Green Computing


1. Introduction

Turbo Boost Technology (Turbo, for short) dynamically enables a temporary performance boost on the new Intel® Core™ i7 multi-core, multi-threaded microprocessor, stylized in Figure 2.1. Turbo Boost Technology increases the core clock of a processor in defined, discrete frequency steps (also known as bins) for the benefit of higher performance, while conditions on the physical chip allow this without endangering the microprocessor. This survey outlines the degree to which the core frequency can be raised as a function of the number of active cores and of other parameters.

Section 2 describes the design goals of Turbo Boost Technology on Core i7 and contrasts the new method with an older Turbo legacy method implemented in earlier Intel silicon. It discusses the pros and cons of Turbo vs. overclocking, both of which are methods of boosting frequency to increase performance, yet with different goals and conditions. It also compares Turbo with thermal throttling. Section 3 summarizes how much Turbo boosting is theoretically possible, as set by predefined system parameters. In Section 4 we list costs, shortcomings, and dangers of Turbo. Since Turbo Boost Technology has been implemented in silicon on the Core i7, on both single-socket desktops and dual-socket servers, Section 5 includes detailed, actual performance data on client and server platforms, from average to ideal cases. Section 6 contrasts Turbo with other performance boost ideas, while Sections 7 and 8 conclude with a conjecture into the future and references.

The physical Core i7 microprocessor is realized by Intel in 45 nm High-K Silicon technology, launched in late 2008 as a High-End Desktop platform with a single socket, and in 2009 as a server with 2 sockets.

  • Corresponding author:
  • SPEC, SPECint, and SPECfp are trademarks of the Standard Performance Evaluation Corporation (SPEC)

2. Description of Turbo Boost Technology

Why Turbo Boost? Intel Turbo Boost Technology, introduced on Intel’s flagship Core i7 and Core i7 Extreme Edition processors in Q3’2008, allows processor cores to automatically run faster than their base operating frequency if the cores are operating at the low end of a defined envelope of power, current, and temperature, the specification limits. The amount of additional frequency upside each core actually achieves depends on the total number of active cores executing the processes (threads) that a workload has spawned, and on the thermal operating environment, which includes current (thermal design current, or TDC) and power consumption (thermal design power, or TDP), as well as temperature. Turbo Boost kicks in when the OS power scheme is set for performance and the processor package is operating below critical constraints. The core frequency is dynamically adjusted within the defined limits as the operating conditions change.
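As an informal illustration, the minimal Python sketch below expresses the eligibility check just described: extra frequency is considered only while the package is below its power, current, and temperature limits and the OS power scheme requests performance. All function names and threshold values are hypothetical placeholders, not Intel's implementation.

def turbo_boost_allowed(power_w, current_a, temp_c, os_wants_performance,
                        tdp_w=95.0, tdc_a=100.0, t_max_c=100.0):
    # Grant extra frequency bins only while the package sits inside the
    # specification envelope and the OS power scheme asks for performance.
    # The 95 W / 100 A / 100 C limits are placeholders, not real fuse values.
    return (os_wants_performance
            and power_w < tdp_w
            and current_a < tdc_a
            and temp_c < t_max_c)

# Example: a lightly loaded package with headroom permits boosting.
print(turbo_boost_allowed(power_w=60.0, current_a=70.0, temp_c=65.0,
                          os_wants_performance=True))   # True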

Figure 2.1 High-Level Nehalem Architecture

2.1. Turbo Boost Technology vs. Enhanced Dynamic Acceleration Technology: Prior to the introduction of Turbo Boost in Core i7, Intel’s previous-generation Core 2 Duo processors introduced the first generation of Turbo technologies, known as Enhanced Dynamic Acceleration Technology (EDAT). This technology allows processor cores to automatically run faster than their base operating frequency if one or more cores are idle. In that event, the operating frequency of the other cores is increased. Note that this increase is influenced by the number of active hardware threads and by various electrical and thermal parameters before a clock boost within the product constraints is granted. Turbo Boost and EDAT also happen to be “Green” technologies that provide performance on demand, while keeping power consumption at a minimum when the additional processor performance is not needed, as judged by the current load.

2.2. Turbo Boost vs. Overclocking

Turbo is quite distinct from overclocking. First of all, overclocking increases the clock frequency by running the part outside its specification, while Turbo operates completely within spec. Turbo does not change the reliability or durability of a part. Overclocking occurs when the clock rate of the processor is manually and statically increased. This results in running the processor outside its specified and thus safe limits. Conversely, Turbo technologies run the processor within specification and aim to take advantage of the thermal headroom available during under-utilized conditions. Overclocking is not a “Green” technology, since it forces increased processor power consumption continuously, without regard to actual demand.

Attribute / Turbo Boost / Overclocking
Starting clock / Base operating frequency / Base operating frequency
Heat protection / Yes / Yes
Protective action / Decrease clock / Thermal throttling set by the user
Application mechanism / Automatic, based on system conditions / Manual, user driven by brute force

2.3. Turbo Execution vs. Thermal Throttling

Turbo Boost Technology is a conservative performance enhancement method that increases the clock rate after the microprocessor recognizes that an increase in clock speed is safe; it is understood that the processor was already operating in a safe way before boosting the clock speed. When the thermal parameters change, or when the number of active cores increases, the prior clock increase is reversed, not only saving the chip from possible damage, but also saving power.

Attribute / Turbo Boost / Thermal Throttle
Starting clock / Low, to run safely / High, to run fast
Heat protection / Yes / Yes
Protective action / Decrease clock / Decrease clock
Arch. driven / Yes / Yes

Thermal throttling comes from the other end, taking a greedy approach to performance enhancement. Thermal throttling assumes that the microprocessor is generally running in some steady state of execution, but acknowledges that temporary hot spots are possible. This happens when the typical mix of IO-bound plus compute-bound execution is replaced by compute-bound-only execution, resulting in more heat generation than is safe. Similar to the safety action taken in Turbo, the frequency is throttled in thermal throttling, resulting in less current and thus less heat being generated, and less performance being delivered.
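The sketch below, with purely illustrative names and numbers, contrasts the two directions of adjustment: Turbo starts at the base clock and adds bins while conditions remain safe, whereas thermal throttling starts at a high clock and removes bins once the temperature limit is reached.

BIN_MHZ = 133.33   # one frequency step (bin), as defined in Section 3

def turbo_step(freq_mhz, base_mhz, max_bins, within_limits):
    # Raise the clock by one bin while safe; otherwise fall back toward base.
    if within_limits and freq_mhz + BIN_MHZ <= base_mhz + max_bins * BIN_MHZ:
        return freq_mhz + BIN_MHZ
    return max(base_mhz, freq_mhz - BIN_MHZ)

def throttle_step(freq_mhz, temp_c, t_limit_c):
    # Keep the high clock until the temperature limit is hit, then back off.
    return freq_mhz - BIN_MHZ if temp_c >= t_limit_c else freq_mhz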

A microprocessor architect must decide which safe technology of performance boosting should be realized in silicon: one, the other, or both. On the Core i7, Intel decided to provide both methods.

3. Ideal Performance Speedup with Turbo

A number of dynamic parameters dictate the upper limit of the Turbo Boost speedup. These include the current core’s temperature, the overall current and momentary power, and the number of active cores. Each Turbo Boost frequency step is 133.33 MHz. For each SKU, fuse values are set in a small internal table during chip manufacturing to define an upper bound on how many of these frequency steps a core may safely gain. The table parameters d-c-b-a mean: if 1 core is active, that core’s frequency may increase by a bins; else if 2 cores are active, these cores can grow by b frequency steps, and so on.

Applying the same encoding principle, but starting from the other end, the table entry 1-1-4-8 means that with 3 or 4 cores active, the frequency may increase by just 1 frequency step. But if only 2 cores are busy, the speed may grow by up to 4 steps, and if only a single core is active, that core may grow by 8 frequency steps, amounting to 1.06 GHz of incremental clock speed.
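The table lookup can be written down directly. The sketch below uses the 1-1-4-8 example and the 133.33 MHz bin size from the text; the 2.66 GHz base frequency is merely an assumed example value, not tied to any particular SKU.

BIN_MHZ = 133.33
BASE_MHZ = 2666.0   # assumed base (non-Turbo) frequency, for illustration only

def max_turbo_mhz(bin_table, active_cores):
    # bin_table is written d-c-b-a: d applies with 4 active cores, a with 1.
    # This is the fused upper bound before the dynamic TDP/TDC/temperature
    # constraints are applied.
    d, c, b, a = bin_table
    bins_by_active = {4: d, 3: c, 2: b, 1: a}
    return BASE_MHZ + bins_by_active[active_cores] * BIN_MHZ

for n in (1, 2, 3, 4):
    print(n, "active core(s):", round(max_turbo_mhz((1, 1, 4, 8), n)), "MHz")
# A single active core gains 8 bins, i.e. 8 x 133.33 MHz = 1066.6 MHz of extra clock.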

However, this boost may decrease if, for any reason, a predefined envelope of maximally allowable current or temperature is exceeded. The decrease is designed not only to save the microprocessor from thermal stress, but also to save power and run “more green”. Similarly, as the sample 1-1-4-8 bound shows, other cores may become active, forcing a currently high boost rate to decrease, again to protect the processor and save power.

4. Technology Investment for Turbo Boost

Although the goal of improving performance with Intel Turbo Boost Technology is worth pursuing, the long-term investments and shorter-term costs must be weighed against gains on the performance side for the user and on the business side for the manufacturer.

4.1. Engineering Investment

The up-front engineering costs to design and implement Turbo Boost Technology were noticeable but contained, despite the existence of related past technologies at Intel, e.g., Enhanced SpeedStep Technology. Design costs included a new mini-controller, called the Power Control Unit (PCU), and its associated microcode. Also, the cost of validation was significant, because new methods had to be developed to ensure that the feature was working properly without interfering with normal processor operation. The manufacturing flow was also updated to support testing of the PCU, which added another minor development cost.

4.2. End User Costs

When Turbo Boost Technology promotes cores to a higher frequency, the processor draws more current than it would while running at the nominal frequency. The end user incurs an incremental cost for the additional electrical power consumed in this mode; however, this cost is very minor compared to the power used by the system as a whole. If necessary, users may choose to manually adjust the balance between performance and power consumption through the OS power policies.

A final theoretical cost to note is the introduction of a variable frequency processor into an environment that has largely been able to depend on a constant processor frequency. Some applications may attempt to synchronize events in time based on the assumption that frequency does not change over time, although none has yet been found by Intel. Computer users may also become alarmed when their frequency reporting tools begin to show dynamic frequency changes.

5. Actual Performance Data with Turbo Boost

We isolated workloads known to be CPU-centric, and concentrated further on single- and multi-threaded workloads in our focus on Turbo performance measurements. We proceeded by running at three baseline frequencies without enabling Turbo. The base frequencies were 2.66 GHz, 2.80 GHz, and 2.93 GHz, chosen to cover the lower and upper bounds for the workloads.

Initial results were mixed because the OS scheduler was allowing single-core workloads to run on multiple CPUs. By setting affinity manually and forcing workloads to run on a single CPU, we were able to obtain the maximum benefit from Turbo. Affinity here means associating any particular thread with a dedicated core or hyper-thread. This practice of setting affinity manually was then applied to all single-threaded workloads.

5.1. Turbo Speedup on UP Client

Setting processor affinity is the process by which an application manually tells the OS scheduler where to run; in other words, it restricts the hardware threads on which the workload may run. For instance, setting Affinity = P3 tells the OS scheduler to run the application only on processor 3. Setting Affinity = P0, P2, P3 allows an application to run on hardware thread 0, 2, or 3.
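The experiments do not depend on any particular API for this; as one illustrative way to express the same restriction, the Linux affinity interface exposed by Python 3 is shown below.

import os

os.sched_setaffinity(0, {3})        # Affinity = P3: run only on logical processor 3
print(os.sched_getaffinity(0))      # -> {3}

os.sched_setaffinity(0, {0, 2, 3})  # Affinity = P0, P2, P3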

Figure 5.1 Cinebench 10

Allowing single-threaded workloads to run on any hardware thread incurs performance penalties, because each time a thread moves the OS needs to save and restore state to preserve determinism.

Figure 5.2 Cinebench 9.5

Figures 5.1 and 5.2 show Cinebench obtaining the highest Turbo upside when affinity is set, as it can then run on one single core for the whole test duration. Rendering software performance data show Turbo to have a positive result; rendering is conventionally measured in time units, hence smaller is better.

Figure 5.3 Rendering Workloads

Figure 5.3 shows 3DStudioMax and MainConcept H.264 reaching nearly ideal Turbo speedup because these workloads are CPU-centric, can run on specific cores, and incur no other overhead.

Figure 5.4 Arithmetic and Multi-Media Workloads

Figure 5.4 exhibits Sandra measurements of arithmetic and multimedia applications with multi-threaded workloads.

Figure 5.5 Estimated Individual SPEC CPU2000 Score, 4-Users

Figure 5.6 Estimated Individual SPEC CPU2000 Score, 4-Users

Figures 5.5 and 5.6 display various components of SPEC CPU2000 visibly benefiting from Turbo. These workloads represent a gamut of diverse disciplines and do not all scale linearly with core frequency. Thus, some workloads do not reach the full theoretical Turbo benefit. Estimated individual SPEC CPU2000 scores are based on measurements on Intel internal development platforms and may differ from measurements on production platforms available later in 2009. For more information about the benchmarks see [4].

5.2. Turbo Speedup on DP Server

Table 5.1 summarizes our setup for the DP Turbo experiments. We used an engineering validation board, called Green City, in an open bench-top configuration. This is certainly a different thermal system condition compared to a typical end-user environment in a standard chassis. However, we learned from pre-silicon studies and other post-silicon experiments that the thermal impact on Turbo performance is still second-order. As shown in Figure 5.7, each processor has an individual heatsink with active fans attached. In addition, four external fans are placed on the side to cool the memory, voltage regulators, etc. All fans were running at a constant speed. If the workload does not hit the Turbo constraints, the core frequency can increase up to 3.33 GHz dynamically, depending on the number of active cores.

Table 5.1 Experimental Setup

Figure 5.7 NHM-EP System with external fans

We first tested the SPEC CPU2000 benchmark, compiled with Intel Compiler 11.0, for multiple cases of interest. The baseline configuration had Turbo mode turned off, and its benchmark scores were compared with the Turbo-mode cases. Since Turbo is designed to operate within predetermined TDC, TDP, and thermal constraints, it may not always run at the maximum Turbo frequency. In order to assess the efficiency, we compared actual performance against the maximum performance without the constraints. This unconstrained case would give the same performance as overclocking the processors to the Turbo frequency in non-Turbo mode, e.g., 3.20 GHz for a multi-core-active workload, unless there is some overhead caused by Turbo. Note that we used the maximum score among several samples of each experiment in the comparison. The system could occasionally generate exceptionally low scores due to certain abnormal transient conditions at the beginning of tests. Based on our previous experience, we believe this high-water-mark approach to sampling is effective when dealing with a pre-production platform prior to fine tuning. Figure 5.8 presents the Turbo speedup for 16-user SPECintRate along with the level of run-to-run variation. The red line indicates the amount of frequency increase between P1 and P0, i.e., slightly more than 9%. On average, Turbo mode brings about a 5.8% performance upside compared to non-Turbo.

This extra boost is still within the thermal design envelope. For example, bzip2 and gcc reach the ideal Turbo performance target. On the other hand, multiple components are below the ideal level. Our analysis using workload profiles from the power control unit showed that these workloads hit the TDP limit. The variation is represented by the standard deviation over the mean for 9 different trials of each benchmark component. The variability is mainly due to sub-optimal memory usage by the OS under the non-uniform memory access (NUMA) configuration between the two separate processors (two separate sockets on a server platform), and it is the main explanation for the cases where the two ideal performance bars mismatch.
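For reference, the quantities discussed above can be reproduced with a short calculation. The 3.20 GHz all-core Turbo frequency is from the text; a 2.93 GHz base (P1) frequency is assumed here because it reproduces the "slightly more than 9%" increase quoted above, and the per-trial scores are invented purely to illustrate how the run-to-run variation is computed.

from statistics import mean, stdev

base_ghz, turbo_ghz = 2.93, 3.20
ideal_upside = turbo_ghz / base_ghz - 1.0
print(f"P1 -> P0 frequency increase: {ideal_upside:.1%}")   # slightly more than 9%

# Run-to-run variation for one benchmark component over 9 trials
# (scores are invented placeholders, not measured data):
scores = [41.2, 41.0, 41.5, 40.8, 41.3, 41.1, 40.9, 41.4, 41.2]
print(f"variation (stdev / mean): {stdev(scores) / mean(scores):.2%}")
print(f"high-water mark used for comparison: {max(scores)}")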

Figure 5.8 Turbo Speedup for SPEC CPU 2000 Integer Rate IC11.0 – 16-user

Figure 5.9 shows the observed speedup for SPECfpRate. The average performance benefit from Turbo is 3.3% out of a 3.5% goal, which is less than observed in the integer suite. One of the reasons is that some floating-point components rely on activity that does not scale with frequency, e.g., DRAM accesses. Even though some components directly take advantage of a faster core clock, e.g., sixtrack, they often hit the TDP constraint throughout the execution. Some bars look erroneous in terms of the basic relationships; however, run-to-run variation has to be factored in to explain them. Although we present the variation only for the Turbo case here, its level was not dramatically different in non-Turbo cases. There is no empirical evidence so far suggesting that Turbo introduces additional run-to-run variation on a given system.