Intel Woodcrest, AMD's Opteron and Sun's UltraSparc T1: Server CPU Shoot-out
by Johan De Gelas on June 7, 2006 12:00 PM EST - Posted in IT Computing
Secure Sockets Layer RSA Performance
Secure Web communication is possible through the Secure Sockets Layer (SSL) protocol. Using the command "openssl speed rsa" we can measure the number of RSA signing operations (signs) that a system can perform per second.
While "openssl speed rsa" is sufficient to test the Xeons and Opterons, the Sun T1 can speed up the Rivest Shamir Adleman (RSA) and Digital Signature Algorithm (DSA) encryption and decryption operations needed for SSL processing, thanks to a modular arithmetic unit (MAU) that supports modular exponentiation and multiplication. Each T1 core has a MAU, so one 8 core T1 has 8 MAUs. To make use of those 8 MAUs, you have to run the SSL calculations through the Solaris Cryptographic Framework (SCF). To test the T1 with the MAU crunching at full speed we used the command "openssl speed -engine pkcs11 rsa". The Solaris 10 OS also provides in-kernel SSL termination, offering greater security than SSL termination outside the kernel.
We included the HP DL585 to see whether 8 cores of complex general purpose CPUs (Opteron 880) can keep up with the 8 MAUs of the Sun T1. If you want to compare Woodcrest and the Opteron, you should check the 2 and 4 concurrency numbers. You can find our 1024-bit numbers in the graph below. One thread per core is optimal, so we tested the DL585 with a maximum of 16 threads, to show you that the peak is attained at 8 threads. The Xeon Irwindale was tested with 8 threads to show you that 4 threads (4 logical cores) is optimal, and so on.
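To make the MAU's job concrete, here is a minimal Python sketch of modular exponentiation by square-and-multiply, the operation at the heart of every RSA sign and verify. It is an illustration only: the function name `mod_exp` is ours, and neither OpenSSL nor the T1 hardware is claimed to work exactly this way.

```python
def mod_exp(base: int, exponent: int, modulus: int) -> int:
    """Compute (base ** exponent) % modulus with right-to-left square-and-multiply."""
    if modulus <= 0:
        raise ValueError("modulus must be positive")
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    result = 1 % modulus          # handles modulus == 1
    base %= modulus
    while exponent > 0:
        if exponent & 1:          # current exponent bit is set: multiply it in
            result = (result * base) % modulus
        base = (base * base) % modulus   # square for the next bit
        exponent >>= 1
    return result


# Small sanity check against Python's built-in three-argument pow().
print(mod_exp(4, 13, 497) == pow(4, 13, 497))
```

Every loop iteration is one or two modular multiplications on numbers as wide as the key, which is why hardware that accelerates modular multiplication and exponentiation helps RSA so directly.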
Notice that the 8 MAUs of the Sun T1 only come into full action once we fire off 32 "SSL RSA signing" threads. Once that happens, the little 1 GHz T1 is able to keep up with the massive 2.4 GHz 8 core DL585. Without MAU, the T1 is as fast as a 1.8 GHz Xeon Irwindale. It is thus very important to check that your favorite web server works with SCF if you want to run your secure web services on the Sun T2000.
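The article does not say how the concurrent signing threads were launched. One way to reproduce a concurrency sweep with stock OpenSSL is the `-multi` option of `openssl speed`, which forks parallel benchmark processes (processes, not threads). The Python sketch below only builds the command lines; the helper name `build_speed_command` and the chosen concurrency levels are our assumptions.

```python
from typing import List, Optional


def build_speed_command(bits: int, processes: int,
                        engine: Optional[str] = None) -> List[str]:
    """Build an `openssl speed` command line for one RSA key size.

    bits      -- RSA key length, e.g. 1024 (becomes the algorithm name rsa1024)
    processes -- number of parallel benchmark processes (-multi)
    engine    -- optional engine name, e.g. "pkcs11" as used for the T1
    """
    cmd = ["openssl", "speed"]
    if processes > 1:
        cmd += ["-multi", str(processes)]
    if engine is not None:
        cmd += ["-engine", engine]
    cmd.append(f"rsa{bits}")
    return cmd


# Print the commands for a sweep similar to the concurrency levels discussed above.
for n in (1, 2, 4, 8, 16, 32):
    print(" ".join(build_speed_command(1024, n)))
```

Each list can be handed to `subprocess.run(cmd, capture_output=True, text=True)` on a machine with OpenSSL installed. Note that engine support differs between OpenSSL versions, so the `-engine` form may not be available on a current install.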
It looks like we've discovered the first - but rather insignificant to most people - "weakness" of the new Core architecture: decryption and encryption. The Opteron at 2.4 GHz has no trouble keeping up with the 3 GHz Woodcrest. This might be because the Woodcrest can only perform one rotate per cycle, while the Opteron can do 3. Although the RSA algorithm doesn't really use rotations, the hash algorithms needed to sign or encrypt a key do. However, the most important reason is probably that the Opteron can sustain 2 ADC (Add with Carry) instructions per clock cycle, while Woodcrest can only do one. As ADC accounts for about 17% of the instruction mix of the RSA algorithm, this might be enough to negate the extra integer power (memory disambiguation, 4 wide decode ...) that the Woodcrest has.
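To see why ADC shows up so often in RSA code, consider how a 1024-bit number is handled: it is split into machine-word "limbs", and every addition has to pass a carry from one limb to the next. The Python sketch below mimics that carry chain; the 64-bit limb width and the helper names are assumptions made for illustration, not details taken from OpenSSL.

```python
from typing import List, Tuple

LIMB_BITS = 64                      # assumed machine word size
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(value: int, count: int) -> List[int]:
    """Split a non-negative integer into `count` little-endian limbs."""
    return [(value >> (LIMB_BITS * i)) & LIMB_MASK for i in range(count)]


def from_limbs(limbs: List[int]) -> int:
    """Reassemble little-endian limbs into one integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def add_with_carry(a: List[int], b: List[int]) -> Tuple[List[int], int]:
    """Add two equal-length limb arrays; return (sum limbs, final carry)."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of limbs")
    out = []
    carry = 0
    for x, y in zip(a, b):
        total = x + y + carry        # one ADC in hardware: add plus carry-in
        out.append(total & LIMB_MASK)
        carry = total >> LIMB_BITS   # carry-out feeds the next limb
    return out, carry


# A 1024-bit operand is 16 limbs of 64 bits; adding 1 to all-ones ripples a carry
# through every limb.
limbs, carry = add_with_carry(to_limbs(2**1024 - 1, 16), to_limbs(1, 16))
print(from_limbs(limbs), carry)
```

Each limb depends on the carry from the previous one, so how many ADC instructions the core can sustain per cycle directly limits how fast such chains run.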
Also notice that the previous NetBurst architecture, represented by the Xeon Irwindale, does very badly. The reason is that the P4 doesn't have a barrel shifter, a circuit in the chip which can shift or rotate any number in one clock cycle. Without this shifter, rotates and shifts take much longer, resulting in high latency. Most x86 code couldn't care less, but most encryption code makes heavy use of rotates or shifts or both. We also did a quick test with Hyper-Threading on and off. In this case Hyper-Threading sped up the encryption (signs/s) by 20 to 28%.
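For readers unfamiliar with rotates, the short Python sketch below shows what a 32-bit rotate-left does. Hash functions such as SHA-1 apply this kind of operation (for example rotations by 1, 5 and 30 bits) many times per block. The function name `rotl32` is ours.

```python
def rotl32(value: int, count: int) -> int:
    """Rotate a 32-bit value left by `count` bits."""
    count %= 32
    value &= 0xFFFFFFFF
    # Bits shifted out at the top re-enter at the bottom.
    return ((value << count) | (value >> (32 - count))) & 0xFFFFFFFF


print(hex(rotl32(0x12345678, 8)))
```

A barrel shifter performs this in a single step for any rotate count, which is why a core without one pays a latency penalty on shift-heavy code.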
To end the RSA signs/s benchmark, we'll make a quick comparison between the quad-core AMD Opteron 2.4 GHz, the quad-core Intel Xeon Woodcrest and Sun's T1 with MAU enabled across different RSA key lengths.
RSA Encryption (Signs/s)
Key length | Opteron 2.4 GHz, 4 threads | Xeon 5160 3 GHz, 4 threads | Sun T1 with MAU, 32 threads |
512 bit | 19003 | 21194 | 35613 |
1024 bit | 6098 | 6240 | 10722 |
2048 bit | 1145 | 1087 | 1918 |
4096 bit | 185 | 164 | 1 |
Notice that the hardware acceleration of the T1 does not work beyond 2048-bit keys. Considering that most secure applications use 1024-bit and only a few "high security" ones use 2048-bit, this is not an issue.
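The relative standings are easier to read as ratios. The following Python sketch recomputes them from the signs/s figures in the table above; the per-GHz normalisation uses the clock speeds from the column headers and is our own addition, not a metric the article reports.

```python
# Signs/s figures copied from the table above.
SIGNS_PER_SEC = {
    512:  {"Opteron": 19003, "Xeon 5160": 21194, "Sun T1": 35613},
    1024: {"Opteron": 6098,  "Xeon 5160": 6240,  "Sun T1": 10722},
    2048: {"Opteron": 1145,  "Xeon 5160": 1087,  "Sun T1": 1918},
    4096: {"Opteron": 185,   "Xeon 5160": 164,   "Sun T1": 1},
}

# Clock speeds of the two x86 systems, taken from the column headers.
CLOCK_GHZ = {"Opteron": 2.4, "Xeon 5160": 3.0}


def ratio(bits: int, system: str, baseline: str) -> float:
    """Throughput of `system` relative to `baseline` at one key length."""
    row = SIGNS_PER_SEC[bits]
    return row[system] / row[baseline]


def signs_per_ghz(bits: int, system: str) -> float:
    """Aggregate signs/s divided by core clock (x86 systems only)."""
    return SIGNS_PER_SEC[bits][system] / CLOCK_GHZ[system]


for bits in sorted(SIGNS_PER_SEC):
    print(f"{bits:>4} bit: "
          f"T1/Opteron = {ratio(bits, 'Sun T1', 'Opteron'):.2f}, "
          f"Opteron/Xeon = {ratio(bits, 'Opteron', 'Xeon 5160'):.2f}, "
          f"Opteron per GHz = {signs_per_ghz(bits, 'Opteron'):.0f}, "
          f"Xeon per GHz = {signs_per_ghz(bits, 'Xeon 5160'):.0f}")
```

On these figures the Opteron trails the Xeon at 512 bits, is within a few percent at 1024 bits and leads from 2048 bits upward, while per GHz it is ahead at every key length. The T1's collapse at 4096 bits is visible as a ratio close to zero.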
When doing verifies as opposed to signs, the server has to authenticate the identity of the client. This is a lot less intensive, and we'll show you the verifies per second numbers at 2048 bits. At 1024-bit length, both the Woodcrest and Opteron were able to verify more than 50000 keys per core, and that is a hard limit of the OpenSSL benchmark.
Again, the Opteron takes the lead. The Sun T1, even with the 8 MAUs, is half as fast as four Opterons or Woodcrests, but this is hardly an issue. Encrypting or signing will slow down a server much more quickly than verifying keys.
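Why verifying is so much cheaper than signing can be shown with textbook RSA. The Python sketch below uses a toy key with tiny primes and no padding, hashing or CRT, so it is an illustration under those assumptions rather than what OpenSSL does. The point is the size of the exponents: the public exponent is short (commonly 65537 in practice), while the private exponent is about as long as the modulus.

```python
# Textbook RSA with tiny primes, for illustration only.
# The key (p=61, q=53, e=17) is a toy example, not the benchmark's key.
P, Q = 61, 53
N = P * Q                       # modulus
PHI = (P - 1) * (Q - 1)
E = 17                          # short public exponent
D = pow(E, -1, PHI)             # private exponent (needs Python 3.8+)


def sign(message: int) -> int:
    """Sign with the private exponent: one long modular exponentiation."""
    return pow(message, D, N)


def verify(message: int, signature: int) -> bool:
    """Verify with the public exponent: one short modular exponentiation."""
    return pow(signature, E, N) == message


def modmul_count(exponent: int) -> int:
    """Modular multiplications needed by plain square-and-multiply."""
    squarings = exponent.bit_length() - 1
    multiplies = bin(exponent).count("1") - 1
    return squarings + multiplies


signature = sign(65)
print(verify(65, signature))
print(modmul_count(E), modmul_count(D), modmul_count(65537))
```

With a real 1024-bit key the private exponent has roughly a thousand bits, so a sign needs on the order of a thousand or more modular multiplications, whereas a verify with exponent 65537 needs only 17 by this count.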
Both the verifies/s and signs/s benchmarks are rather synthetic. It is much more realistic to test with a real web server running SSL, and that is what we are currently doing. We followed Sun's instructions to enable RSA hardware acceleration for Apache, but for some reason, the Apache web server is still not making use of the Solaris Cryptographic Framework. So our Web server SSL test is a work in progress.
91 Comments
JohanAnandtech - Thursday, June 8, 2006
I should have mentioned this: most of the tests have also been done on SUSE Linux SLES9. The reason why we use Gentoo is that we are able to use the latest kernel and to tune the kernel specifically for the AMD or Intel architecture. With SUSE Enterprise you need to wait for SUSE to use a new kernel. Your suggestion is noted, and from now on I will include the SUSE SLES numbers too.
But to call our numbers useless, well, that is a heavy exaggeration. There was about a 1-2% difference between running on Gentoo and running on SUSE. It is only natural: they both use more or less the same kernel, only the tools are different.
ashyanbhog - Thursday, June 8, 2006
"Two months of testing and tweaking"
so that's the time you took to make sure you could say
"In one word: Woodcrest rocks!"
and surprisingly your emotions were quite tepid when AMD processors were showing similar performance advantages over Intel processors earlier!
http://www.anandtech.com/IT/showdoc.aspx?i=2447&am...
ashyanbhog - Thursday, June 8, 2006
What 1-2% difference b/w SLES and Gentoo are you talking about? Anand's own earlier benchmarks show SLES performance as 9-17% better than Gentoo!
http://www.anandtech.com/IT/showdoc.aspx?i=2447&am...
If you have specifically used Gentoo for the optimization options that it provides, why didn't you list the specific compile time optimizations for Intel and AMD that were finally used to run the benchmarks? The purpose of an independent benchmark is to ensure a setup that is neutral and verifiable by any third party using similar hardware and software. Does your review report provide the info necessary for the same?
Your earlier benchmarks using Linux + DB2 show dual dual core Opterons gaining a 50% - 80% improvement over dual single core Opterons when more than 5 threads come into the picture, and a mere 1% to 2% gain in the case of one or two threads. Okay, I know DB2 was not part of the benchmarks this time around, but shouldn't these figures have set off enough alarm bells to force inclusion of something other than MySQL for database benchmarks?
Even MySQL on Gentoo shows a modest 10% to 17% gain with concurrency numbers from 5 and higher in Single Core + Dual CPU vs Dual Core + Dual CPU. Strange that Linux and MySQL misbehave on Opteron this time and show a 10% performance degradation! You deserve an award for this! How can somebody contradict their own earlier benchmarks?
http://www.anandtech.com/IT/showdoc.aspx?i=2447&am...
The MSI motherboard you used for the benchmarks has only a single channel to the memory bank for both the processors, a compromise made to cut its price and compete in the lowest market segment for 2P Opteron boards. A major design feature of the Opteron is its ability to use separate memory channels for each processor, giving it NUMA capabilities, and dedicated memory lanes also cut latencies when accessing the memory. Did you specifically choose this motherboard to negate Opteron's advantage? The Intel board used for "Irwindale" retails for around $500, the price for the one used for Woodcrest is not known, the MSI board is available for $250, so even the price range is different! "relatively cheap workstation board" as you noted in your earlier benchmarks. Were Tyan K8WE, ASUS K8N-DL, Supermicro H8DCi or the Iwill DK8EW, which are more popular, so hard to come by? Also you don't specify which of the three Opteron systems was used for which benchmark, or was it an average of the three? The extreme attention to detail usually found at Anand is surprisingly lacking in this review.
http://www.ocforums.com/showthread.php?t=459111
http://forums.amd.com/lofiversion/index.php/t56855...
http://geek.pricegrabber.com/search_getprod.php/ma...
http://geek.pricegrabber.com/search_getprod.php/ma...
Aren't temperature readings also important, especially for a new Xeon chip, as earlier ones had forced admins to double their AC capacities and discard the covers of rack cabinets for better cooling?
It's good to know Intel is back on track, but this review seems to have only one purpose: show Woodcrest in a favorable light against Opterons.
rayl - Wednesday, June 7, 2006
It doesn't take a server to compute that "Woodcrest rocks" = 2 words. :p
merlinm - Wednesday, June 7, 2006
where are the postgresql quad core benchmarks? My experience is that 2-4 cores on postgresql gives you 1.7x the power on non i/o constrained databases. This would have been a huge upset to have PostgreSQL blow out mysql in a quad core configuration.
also a postgresql.conf containing non-default values would have been nice.
blackbrrd - Wednesday, June 7, 2006
Where does it say they are using the default postgresql.conf? Actually I can't find any information on what kind of tweaking has been done here at all.
Is there any special reason for only running postgresql on a single cpu instead of a dual dual core setup like you did for the rest of the tests? There are no comments about it on page 9 (http://www.anandtech.com/IT/showdoc.aspx?i=2772&am...) at least...
merlinm - Wednesday, June 7, 2006
right...what I meant was, could you please supply .conf entries which were edited and changed from the stock configuration. Actually, for this type of benchmark (90% read), there's not a whole lot to change in postgresql.conf...generally the more writes there are the more you have to tweak.
the major tweak in postgresql is to use prepared statements over the parameterized interface...
merlinm - Wednesday, June 7, 2006
oh, and postgresql 8.1 is about 20+% faster than 8.0 in most read operations involving very small (one statement) transactions.
squash - Wednesday, June 7, 2006
Hello,
With the recent "official" support in Ubuntu for that Niagara server, would it be possible to also include performance numbers for that server running Linux?
I have seen other benchmarks showing Linux to have improved performance on the same hardware compared to Solaris. Filesystem performance is typically much higher in ext2 vs ufs+logging, and if you scan Sun's issue tracking database, there are many entries for libc and kernel operations which are much slower than Linux.
Maybe as a separate article....
Squash
OddTSi - Wednesday, June 7, 2006
The Verify/s graph (the last one on the page) doesn't have a line for the Dual Opteron, yet the author still claims "Again, the Opteron takes the lead." Does the Dual Opteron take the lead and there was just an error in showing it on the graph?
Also in the signs/s chart the Dual Woodcrest tops out at just over 6,000 and the Dual Opteron tops out at just over 5,000, which is a 20% lead, yet the author writes "The Opteron at 2.4 GHz has no trouble keeping up with the 3 GHz Woodcrest." I'm not trying to be a nitpicky fanboy here but being beaten by 20% isn't "keeping up," at least not in my definition of the expression.
Finally, I have a question. Why are there no Windows-based tests? I know that LAMP is very popular in the webserving part of the server world but in most other server/enterprise areas it's mostly Windows, SQL Server, Visual Studio, .NET, etc. I'd like to see some benchmarks that use software that those of us in the non-webserving community are most likely to use. I know there's no chance of running the UltraSPARC in a Windows configuration but quite frankly who cares. I would like to see which of the x86 offerings (which are FAR more likely to be used) is better.