Dynamic Power Management: A Quantitative Approach
by Johan De Gelas on January 18, 2010 2:00 AM EST
Posted in: IT Computing
Analysis: What Happened?
The measurements on the previous page are fine, but we also want to understand how well the hardware and operating system coped with the "low load" scenario. What did Windows Server 2008 R2 do? We asked the Windows Driver Kit "Powertest" tool to tell us more. The first thing we want to know is the clock speed the CPU was ordered to run at in "Balanced" mode. The differences are very telling. First, the Xeon's clock speed changes:
Xeon L3426 Core Speeds (requested clock in MHz)
Occurrences | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7 |
10 times | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 |
20 times | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 |
1 time | 1463 | 1463 | 1463 | 1463 | 1463 | 1463 | 1463 | 1463 |
10 times | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 |
1 time | 1729 | 1729 | 1729 | 1729 | 1729 | 1729 | 1729 | 1729 |
Many | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 |
The Xeon L3426 almost always ran at 1.86GHz. In a period of 30 seconds, we noticed only two P-state change requests: one speed bin lower (-133MHz) and 3 speed bins lower (-400MHz). All cores were always asked to run at the same clock speed.
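The speed-bin arithmetic above can be checked with a small sketch. It assumes that one speed bin equals one step of roughly 133.33MHz, which is inferred from the -133MHz and -400MHz figures in the text rather than stated explicitly; the function name `bins_below` is purely illustrative.

```python
# Assumption: one "speed bin" is ~133.33 MHz, inferred from the article's
# "-133MHz" (1 bin) and "-400MHz" (3 bins) figures.
BIN_MHZ = 400 / 3

def bins_below(nominal_mhz, requested_mhz, bin_mhz=BIN_MHZ):
    """Return how many speed bins a requested clock sits below the nominal clock."""
    return round((nominal_mhz - requested_mhz) / bin_mhz)

# Requested clocks in the order they appear in the Xeon table (one value per
# row, since all cores always received the same request).
requests_mhz = [1863, 1863, 1463, 1863, 1729, 1863]

for mhz in requests_mhz:
    print(f"{mhz} MHz -> {bins_below(1863, mhz)} bin(s) below 1863 MHz")
```

Only the 1729MHz and 1463MHz rows come out as non-zero, which corresponds to the two P-state change requests mentioned above.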
Next those of the Opteron:
Opteron 2435 Core Speeds (requested clock in MHz)
Occurrences | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 |
1 time | 800 | 1400 | 800 | 2600 | 800 | 800 |
1 time | 800 | 800 | 1400 | 1400 | 800 | 800 |
1 time | 800 | 800 | 800 | 800 | 800 | 800 |
1 time | 800 | 800 | 800 | 2600 | 800 | 800 |
1 time | 800 | 800 | 800 | 800 | 800 | 800 |
1 time | 800 | 800 | 800 | 800 | 2600 | 1400 |
1 time | 800 | 800 | 800 | 800 | 800 | 800 |
Where the Xeon hardly gets any P-state changes, the six-core Opteron 2435 frequently switches between 0.8GHz, 1.4GHz, and 2.6GHz. Often one of the cores runs at 1400MHz, another at 2600MHz, and the rest at 800MHz. Basically, the above table is repeated over and over again. This means that the frequency scaling is far from ideal: we should see two cores at 2.6GHz most of the time, as the application spawns two threads that each require 100% of a core. This in turn explains the 15% performance hit between "Balanced" and "Performance". If the hardware and OS worked together better, the performance hit should not be more than a few percent. We conclude that in this case, the 4W power savings are not worth the performance hit.
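To make the "far from ideal" observation concrete, the following sketch counts how many cores sit at the top P-state in each sample of the Opteron table. The sample values are copied from the table; the names `cores_at_max` and `BUSY_THREADS` are only illustrative, and treating "two cores at 2600MHz" as the ideal outcome follows the reasoning in the text.

```python
# Requested clocks (MHz) per core, one list per row of the Opteron 2435 table.
samples = [
    [800, 1400, 800, 2600, 800, 800],
    [800, 800, 1400, 1400, 800, 800],
    [800, 800, 800, 800, 800, 800],
    [800, 800, 800, 2600, 800, 800],
    [800, 800, 800, 800, 800, 800],
    [800, 800, 800, 800, 2600, 1400],
    [800, 800, 800, 800, 800, 800],
]

MAX_MHZ = 2600      # top P-state of the Opteron 2435
BUSY_THREADS = 2    # the benchmark spawns two fully loaded threads

def cores_at_max(sample, max_mhz=MAX_MHZ):
    """Count the cores that were asked to run at the top P-state."""
    return sum(1 for mhz in sample if mhz == max_mhz)

counts = [cores_at_max(s) for s in samples]
ideal_samples = sum(1 for c in counts if c >= BUSY_THREADS)

print("cores at 2600 MHz per sample:", counts)
print("samples with at least two cores at 2600 MHz:", ideal_samples)
```

In none of the seven samples are two cores at 2.6GHz at the same time, which is consistent with the performance hit described above.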
Sleeping
We have focused on the active cores so far, but important power savings can also come from putting idle cores in sleep states. Did the CPU driver and OS scheduler work well together? Again, there are remarkable differences.
CPU Sleep State Comparison (C-state columns: % of idle time)
CPU | % Idle | ACPI C1 | ACPI C2 | ACPI C3 |
Opteron 2435 | 86 | 100 | 0 | 0 |
Xeon L3426 | 81 | 7 | 93 | 0 |
Opteron 2389 | 72.4 | 100 | 0 | 0 |
The six-core Opteron had more idle cores than the quad-core Opteron, and as a result it experienced more idle time. All idle time on the Opterons was spent in the C1 ("Halt") state.
The Xeon was quite a bit more aggressive: 93% of the idle time was spent in the C2 state, but C2 at the operating system level does not mean the hardware actually runs in C2. In theory, the hardware is capable of putting the core into a "deeper" CC (Core Sleep) state. Intel promised that the idle Nehalem cores would be able to reach even the deepest C6 sleep while other cores were working. Did that actually happen?
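As a rough cross-check, the table's two kinds of percentages can be combined into a share of total time per ACPI state. This sketch assumes the C1/C2/C3 columns are percentages of idle time (as the "93% of the idle time" remark suggests); the helper name `share_of_total` is hypothetical.

```python
# (idle % of total time, {ACPI state: % of idle time}), copied from the table.
sleep_table = {
    "Opteron 2435": (86.0, {"C1": 100, "C2": 0, "C3": 0}),
    "Xeon L3426": (81.0, {"C1": 7, "C2": 93, "C3": 0}),
    "Opteron 2389": (72.4, {"C1": 100, "C2": 0, "C3": 0}),
}

def share_of_total(idle_pct, state_pcts):
    """Convert 'percent of idle time' into 'percent of total time' per state."""
    return {state: idle_pct * pct / 100 for state, pct in state_pcts.items()}

for cpu, (idle_pct, state_pcts) in sleep_table.items():
    shares = share_of_total(idle_pct, state_pcts)
    summary = ", ".join(f"{s}: {v:.1f}%" for s, v in shares.items())
    print(f"{cpu}: {summary}")
```

Under that assumption, the Xeon spends roughly three quarters of all time in ACPI C2, while the Opterons never go deeper than C1.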
Software tools read out the API of the OS and thus - as far as we know - always read out the ACPI states. We followed the guidelines in Intel's White Paper, "Intel Turbo Boost Technology in Intel Core Microarchitecture Based Processors", and did some programming (in assembly) to find the actual hardware C-states.
First we read out the Time Stamp Counter (TSC):
RDTSC
0x000086FCCA7EBD0E
Next we read out the right Machine Specific Register (MSR):
RDMSR 3FDH
High 32bit(EDX) = 0x00007265, Low 32bit(EAX) = 0xF842A000
We wait for 1500ms and then repeat the previous procedure:
RDTSC
0x000086FD78268DC2
RDMSR 3FDH
High 32bit(EDX) = 0x00007265, Low 32bit(EAX) = 0xFA3F0000
In some cases, the MSR did not get one tick more, clearly indicating that the CPU had not entered C6 during the 1.5 second period. Both the "real" physical and logical core report the same TSC and MSR info, so it is quite easy to make a distinction between the real cores and the logical cores which are a result of SMT (Hyper-Threading).
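The two samples above can be turned into a C6 residency figure with a few lines of arithmetic. This is a minimal sketch, assuming the residency MSR counts at the same rate as the TSC (which the percentages in the tables imply); the helper names `msr_value` and `c6_residency` are illustrative and not part of any Intel tool.

```python
def msr_value(edx, eax):
    """Combine the EDX:EAX halves returned by RDMSR into one 64-bit value."""
    return (edx << 32) | eax

def c6_residency(tsc_start, tsc_end, c6_start, c6_end):
    """Return (elapsed ticks, ticks in C6, percentage of time in C6)."""
    elapsed = tsc_end - tsc_start
    in_c6 = c6_end - c6_start
    return elapsed, in_c6, 100.0 * in_c6 / elapsed

# Values copied from the two read-outs above, taken about 1500 ms apart.
tsc_1 = 0x000086FCCA7EBD0E
tsc_2 = 0x000086FD78268DC2
c6_1 = msr_value(0x00007265, 0xF842A000)
c6_2 = msr_value(0x00007265, 0xFA3F0000)

elapsed, in_c6, pct = c6_residency(tsc_1, tsc_2, c6_1, c6_2)
print(elapsed, in_c6, f"{pct:.2f}%")  # prints: 2913456308 33316864 1.14%
```

These deltas are the same numbers that appear in the Core 1 row of the "Performance" table.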
With the "Performance" power plan we get:
"Performance" Power Profile C6
Core | Clockticks | Ticks spent in C6 | Percentage C6 |
Core 1 | 2913456308 | 33316864 | 1.14% |
Core 2 | 2933155470 | 0 | 0.00% |
Core 3 | 2950461391 | 2809569280 | 95.22% |
Core 4 | 2957802638 | 0 | 0.00% |
So on average the CPU is in C6 24% of the time, which is quite impressive. However, the way we measure this is not perfect: the measurement puts an extra load (slightly less than a chess thread) on the CPU. So the load on the CPU is not two but rather three threads. This means that the CPU probably spends even more time in C6 mode with two active threads.
Next the same measurement but with the "Balanced" power plan:
"Balanced" Power Profile C6
Core | Clockticks | Ticks spent in C6 | Percentage C6 |
Core 1 | 2961019252 | 0 | 0.00% |
Core 2 | 2991271044 | 2371919872 | 79.29% |
Core 3 | 3012220038 | 74088448 | 2.46% |
Core 4 | 3012878436 | 22192128 | 0.74% |
This time we spend a little bit less time in C6: about 21%. Setting the power plan to Performance allows the idle cores to go just a little bit more into deep sleep as the active cores are working harder. Of course total power does not decline as the higher power consumption of the Turbo Boosted cores is much more important than the small effect of some cores being in deep sleep an extra 10% of the time.
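The 24% and 21% averages can be reproduced from the two tables. The sketch below assumes the averages are simple unweighted means of the per-core percentages, which matches the quoted figures; the function names are illustrative.

```python
# core -> (clockticks, ticks spent in C6), copied from the two tables.
performance = {
    1: (2913456308, 33316864),
    2: (2933155470, 0),
    3: (2950461391, 2809569280),
    4: (2957802638, 0),
}
balanced = {
    1: (2961019252, 0),
    2: (2991271044, 2371919872),
    3: (3012220038, 74088448),
    4: (3012878436, 22192128),
}

def c6_percentages(plan):
    """Percentage of elapsed ticks each core spent in C6."""
    return {core: 100.0 * c6 / ticks for core, (ticks, c6) in plan.items()}

def mean_c6(plan):
    """Unweighted mean of the per-core C6 percentages."""
    values = list(c6_percentages(plan).values())
    return sum(values) / len(values)

print(f"Performance: {mean_c6(performance):.1f}% average C6 residency")
print(f"Balanced:    {mean_c6(balanced):.1f}% average C6 residency")
```

The difference between the two plans is about 3.5 percentage points, so the extra deep sleep under "Performance" is small, as noted above.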
35 Comments
JohanAnandtech - Monday, January 18, 2010
In which utility do you set/manage the frequency of a separate core?
n0nsense - Monday, January 18, 2010
Gnome panel applets: the CPU Frequency Monitor, which I guess uses cpufreq. Each instance monitors one core, so I have 4 of them visible all the time. If you have enabled CPU frequency scaling (kernel), then you can select the governor (performance, ondemand, conservative, etc.) or a static frequency. I can do it for each core, and it displays what I have set. Of course the processor should support frequency scaling (PowerNow! and SpeedStep).
Most mainstream distributions (Ubuntu, Sabayon, Fedora) will use the ondemand governor by default when a processor with frequency scaling is available. No user intervention required.
jordanclock - Monday, January 18, 2010
I really think you're mistaken. Core 2 CPUs don't have any mechanism to allow per-core frequencies. There is one FSB clock and one multiplier. There is no way to set CPU0 to a different frequency than CPU1 (or for quad core, CPU2 and CPU3) because the variables that control the clock speed are chip wide.
VJ - Tuesday, January 19, 2010
These people seem to be convinced of per-core SpeedStep: https://bugs.launchpad.net/ubuntu/+source/linux-so...
Maybe someone can ask David Tomaschik for the Intel documentation he refers to?
n0nsense - Monday, January 18, 2010
I heard it in the past, but I still tend to believe my eyes :)
While writing this reply, I saw every possible combination. My Q9300 has 2 states, 2.0GHz and 2.5GHz. It's not a server CPU. I have no reason to mislead you.
VJ - Tuesday, January 19, 2010
If there's only two states, then it's possible that one core is in the C2 state while the other is in its C0 state.
The core in state C2 may be shown to be operating at 2GHz (its lowest frequency) while it's really off. The OS may simply be reporting the lowest possible frequency while the core is really not receiving a clock signal.
So in general, if one core is showing its lowest frequency it may be off which still allows the other core to operate (at a different frequency).
It would be very strange if both cores are operating greater than their lowest and less than their highest frequencies at different frequencies.
From a different angle: Has anybody ever seen /proc/cpuinfo report a frequency less than the CPU/Core's lowest active frequency or even zero? Probably not.
n0nsense - Tuesday, January 19, 2010
Nice theory :)
But in this case, I see that each core is doing something. htop shows each core at somewhere around 15% usage. So the only options left are:
1. Each core frequency can be controlled independently on C2D and C2Q (May be i3 i5 i7 too)
2. The OS is completely unaware of whats going on :) (which is less possible)
mino - Thursday, January 21, 2010
"The OS is completely unaware of whats going on" is the right answer. :)
BTW, the only x86 CPUs able to change freq per core are >=K10 for AMD and >=Nehalem for Intel.
VJ - Tuesday, January 19, 2010
Not to defeat your argument/observations, rather for completeness' sake:
It's also possible that the differences are due to the reading of the attributes. If the attributes are read in succession, then it's possible that the differences are due to the time of reading the attributes, while at any given instant, notwithstanding the allowable subtle differences in frequency described in this article, all cores are operating at the same frequency.
There's a lot of time at the bottom.
JanR - Tuesday, January 19, 2010
Hi,
I completely agree with this:
"It's also possible that the differences are due to the reading of the attributes."
The point is that desktop usage together with ondemand governor leads to a lot of fast frequency changes. Therefore, this is not a good scenario to decide on "per core" vs "per CPU". We did a lot of testing the following way:
Put load on all cores using "taskset" (this avoids C-states). Switch to "userspace" governor and then set frequencies of individual cores differently. You have one control per core but the actual hardware decides what really happens - you can check this in /proc/cpuinfo or using a tool such as "mhz" from lmbench as load generator (this one calculates actual frequency based on CPI and time, it allows also measurement of turbo frequencies).
Trying around, the results are:
AMD K8: One clock domain, maximum of the requested frequencies is taken
Intel Core2 Duo: Same as K8
AMD K10: Individual clock domains, you can clock each core individually
Intel Core 2 Quad: TWO clock domains! These CPUs are two dual core dies glued together so each die has its own multiplier. Therefore, the cores of each die get the maximum of the requested frequencies but you can clock the two dies independently.
Intel Nehalem: One clock domain, maximum of requests of all cores that are not in C-state! If you set one core to, e.g., 2.66 GHz and all other to 1.6, all cores clock at 1.6 as long as the core set to 2.66 is not used, they all switch to 2.66 if you put load on that core.
So much for our findings. "cat /proc/cpuinfo" or some funny tools are useless if you do not control the environment (userspace, manual settings). If you then enable ondemand, the system switches fast between different states and looking at it is just a snapshot, maybe taken in the middle of a transition.
Greetings,
Jan
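The clock-domain behaviour described in the last comment can be summarised as a toy model. This is only a sketch of the rules as reported there, not a description of how any driver or CPU implements them: each clock domain runs at the maximum frequency requested by its cores, and for the Nehalem case only cores that are not in a C-state count. The function `effective_frequencies`, the core numbering, and the fallback for a fully idle domain are all assumptions made for illustration.

```python
def effective_frequencies(requested, domains, active=None):
    """Model per-domain clocking.

    requested: list of requested MHz, indexed by core number.
    domains:   list of lists of core numbers sharing one clock domain.
    active:    set of cores not in a C-state, or None if every core counts.
    """
    result = list(requested)
    for domain in domains:
        voters = [c for c in domain if active is None or c in active]
        if voters:
            freq = max(requested[c] for c in voters)
        else:
            # Not covered by the comment: arbitrary choice for an idle domain.
            freq = min(requested[c] for c in domain)
        for core in domain:
            result[core] = freq
    return result

# K8 / Core 2 Duo: one domain, the maximum request wins.
print(effective_frequencies([1000, 2000], [[0, 1]]))
# K10: every core is its own domain.
print(effective_frequencies([800, 2600, 1400, 800], [[0], [1], [2], [3]]))
# Core 2 Quad: two dual-core dies, one domain per die.
print(effective_frequencies([2000, 2500, 2000, 2000], [[0, 1], [2, 3]]))
# Nehalem: one domain, but only cores outside a C-state are considered.
print(effective_frequencies([2660, 1600, 1600, 1600], [[0, 1, 2, 3]], active={1, 2, 3}))
print(effective_frequencies([2660, 1600, 1600, 1600], [[0, 1, 2, 3]], active={0, 1, 2, 3}))
```

The last two calls mirror the 2.66GHz/1.6GHz example from the comment: the whole package stays at 1.6GHz until the core set to 2.66GHz receives load.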