
Monday, September 7, 2009

Will Nehalem take the server world by storm?

A dramatic turn of events is the best way to describe what we'll witness in a few weeks. But let us first talk about the current situation. As we pointed out in our last server CPU comparison, AMD's latest quad-core Opteron was a very positive surprise. Sure, you can point to a few server benchmarks where the Intel CPU wins, like Black-Scholes or some exotic HPC benchmark, but the server applications that really make the difference, such as web servers and database servers, run faster on the latest AMD "Shanghai" CPU. It all depends on what kind of application is important to you, of course. But look at the complete picture: performing more than 30% faster in virtualization benchmarks is the final proof that AMD's latest is, overall, the best server CPU at this point in time.

But a few weeks from now, that will all change. As always, we cannot disclose benchmark information before a certain date, but if you look around at this site, you have been able to discern the omens. The K10 architecture of Shanghai is a well-rounded architecture, but it lacks some really crucial weapons needed to keep up with Nehalem:
  • Simultaneous multithreading (Hyper-Threading) offers a performance boost (up to 45%!) that IPC improvements alone are not capable of delivering.
  • Memory latency: Nehalem's memory latency is up to 40% lower.
  • Memory bandwidth: three channels is complete overkill for desktop apps, but it does wonders for many HPC and, to a lesser degree, server applications.
  • A really aggressive integer engine.
Nehalem does use somewhat more expensive DDR3 DIMMs, which hardly offer any real performance boost compared to DDR2. So moving to DDR3 later will not help AMD much either.
Istanbul?
The details on the six-core Istanbul are still sketchy. But the dual-socket Xeon "Westmere" will get six cores too and will appear in the same timeframe as AMD's hexacore. Only if AMD has very secretly added SMT to Istanbul will it be able to turn the tide. Considering that this would be a first for AMD, it is very unlikely that SMT made it into Istanbul.
A dent in Nehalem's armour?
Does AMD have a chance in the server market in 2009 (and possibly 2010)? I must say it was not easy to find a weakness in Nehalem's architecture. The challenge made it all the more attractive to search anyway :-). So what follows is a big "if" story, and you should take it with a big grain of salt... as you should always do with forward-looking articles.
There is one market where AMD has really been the leader, and that is virtualization, thanks to the IMC and the support for segments (four privilege levels) in the AMD64 instruction set architecture. AMD's performance running VMware ESX in the "good old" binary translation mode (software virtualization) was better than that of an Intel system running the latest hardware virtualization in the hypervisor. VMware only uses hardware virtualization on an AMD server if NPT (also known as RVI or HAP) is present. In contrast, hardware virtualization slowed the Xeons of 2005 and 2006 down a bit, but it was absolutely necessary to run 64-bit guests on a hypervisor on top of a Xeon server.
Nehalem is catching up with EPT and VPID (see here), and while they are well implemented, one thing is lacking: the TLB is rather small. I pointed this out about a year ago: while the TLB got AMD a lot of bad press, it will probably be the one thing that keeps AMD somewhat in Intel's slipstream. Let me make that clearer:
CPU | L1 TLB Data | L1 TLB Instr | L2 TLB
AMD Shanghai / Opteron 238x or 838x | 48 (4 KB), 48 (large) | 48 (4 KB), 48 (large) | 512 (4 KB), 128 (large)
Intel Penryn / Xeon 54xx | 16 (4 KB), 16 (large) | 128 (4 KB), 8 (large) | 256 (4 KB), 32 (large)
Intel Nehalem / Xeon 55xx | 64 (4 KB), 32 (large) | 128 (4 KB), 14 (large) | 512 (4 KB), 0 (large)
Notice that when you use large pages, the Nehalem TLB has very few entries. So, let us now do a thought experiment. Currently, most of the virtualization benchmarks like VMmark (VMware) and vConsolidate (Intel) use relatively small VMs: for example, a small Apache web server or a MySQL server that gets between 512 MB and 2 GB of RAM. As a result, most of them run with large pages off (page size = 4 KB). These benchmarks are very similar to the daily practice of an enterprise that uses IT mostly for "infrastructure purposes" such as authenticating its employees and giving them access to mail, FTP, file serving, print serving and web browsing.
It becomes totally different when you are an IT firm that offers its services to a relatively large number of customers on the internet. You need a large database and a number of probably pretty heavy web portals that offer a good interactive experience.
So you are not going to consolidate something like 84 (14 tiles x 6 VMs) tiny VMs on one physical machine, but rather 5 to 10 "fat" VMs. By fat VMs I mean VMs that get 4 GB or more of RAM, 2 to 4 vCPUs, run a 64-bit guest OS, and so on.
Those applications also open tons of connections, which they have to destroy and recreate after some time. In other words, there is a lot of memory management activity going on.
EPT and NPT can offer between 10% and 35% better performance when there is a lot of memory management activity going on. In contrast with the shadow page table technique, a change in the page tables does not cause a trap and the associated overhead (which can be thousands of cycles). So you could say that the road to the TLB of your CPU is a lot smoother. But if the TLB fails to deliver, the hardware page walk is very costly.
In search of the real page table
A hardware page walk consists of searching several tables that allow the CPU to find the real physical address, as the running software always supplies a virtual address. With a normal OS, the OS has set the CR3 register to contain the physical address where the first table is located. The first table converts the first part of the virtual address into a physical one: a pointer to the physical address where the next table is located. With large pages, it takes about three steps to translate the virtual address into the physical one.
With EPT/NPT, the guest OS gives a (CR3) address which is in fact virtual and which must be converted into a real physical address. All the guest OS tables contain pointers to virtual addresses, so each table gives you a virtual address pointing towards the next table. But the next table is not actually located at that virtual address, so we need to go out and search for the real address in the nested tables. So instead of 3 accesses to memory, we need roughly 3 x 3 accesses. If this happens too many times, EPT will actually reduce performance instead of improving it!
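To make that cost difference a bit more concrete, here is a minimal back-of-the-envelope sketch in Python using the simplified model above: a native walk with large pages touches about 3 tables, while a nested (EPT/NPT) walk has to resolve every guest table pointer through the nested tables as well, so it touches roughly 3 x 3 tables. The function names, miss rates and cycles-per-access figure are purely illustrative assumptions on our part, not measured values.

    # Back-of-the-envelope model of the page walk cost described above.
    # All names, miss rates and cycle counts are illustrative assumptions.

    GUEST_LEVELS = 3   # guest page table levels with 2 MB large pages
    NESTED_LEVELS = 3  # nested (EPT/NPT) levels resolved per guest pointer

    def native_walk_accesses(levels=GUEST_LEVELS):
        # Walk on bare metal: one memory access per table level.
        return levels

    def nested_walk_accesses(guest=GUEST_LEVELS, nested=NESTED_LEVELS):
        # Every guest table pointer is itself virtual, so each guest level
        # costs a full walk of the nested tables: roughly guest * nested.
        return guest * nested

    def avg_walk_cycles(tlb_miss_rate, accesses, cycles_per_access=200):
        # Average page-walk cycles per memory reference (illustrative only).
        return tlb_miss_rate * accesses * cycles_per_access

    print(avg_walk_cycles(0.001, nested_walk_accesses()))  # rare TLB misses: ~1.8 cycles/ref
    print(avg_walk_cycles(0.05, nested_walk_accesses()))   # frequent misses: ~90 cycles/ref

The point of the sketch is simply that the nested walk multiplies the number of table accesses, so everything hinges on how often the TLB misses.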
It is good practice to use large pages with large databases. Now remember that we are moving towards a datacenter where almost everything is virtualized, databases included. In that case, Nehalem's TLB can only make sure that about 32 x 2 MB or only 64 MB of data and 28 MB of code is covered by the TLB. As a result, lots of relatively heavy hardware page walks will happen. Luckily, Intel caches the real physical page tables in the L3 cache, so it should not be too painful.
The latest quad-core Opteron has a much more potent TLB. As instructions take up a lot less space than data, it is safe to say that the data TLB can cover up to 176 (48 + 128) entries times 2 MB, or 352 MB of data. Considering that virtualized machines easily have between 32 and 128 GB of RAM and are much better utilized (60-80% CPU load), it is clear that the AMD chip has an advantage there. How much difference can this make? We have to measure it, but based on our profiling and early benchmarking we believe that "an overflowing TLB" can decrease virtualized performance by as much as 15%. To be honest: it is too early to tell, but we are pretty sure it is not peanuts in some important applications.
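As a quick sanity check on those coverage numbers, here is how they fall out of the large-page entry counts in the TLB table above (the small helper is ours, purely for illustration):

    # TLB coverage = number of large-page entries * 2 MB page size.
    # Entry counts come from the TLB table earlier in this article.

    LARGE_PAGE_MB = 2  # 2 MB large pages

    def coverage_mb(large_page_entries):
        return large_page_entries * LARGE_PAGE_MB

    print(coverage_mb(32))        # Nehalem L1 data TLB, large pages -> 64 MB
    print(coverage_mb(14))        # Nehalem instruction TLB, large pages -> 28 MB
    print(coverage_mb(48 + 128))  # Shanghai L1 + L2 data TLB, large pages -> 352 MB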
So what are we saying? Well, it is possible that the Opteron might be able to do some "damage control" compared to Nehalem when we try out a benchmark with large and fat VMs (like we have done here). But there are a lot of "if"s. Firstly, AMD must also cache the page tables in its caches. If for some reason the page tables are kept out of the caches, the advantage will probably be partly negated. Secondly, if the applications running on the physical machine demand a lot of bandwidth, the fact that the Nehalem platform has up to 70% more bandwidth might spoil the advantage too.
The last AMD Stronghold?
So should Intel worry about this? Most likely not. For simplicity's sake, let us assume that both cores, Shanghai and Nehalem, offer equal crunching power. They more or less do when it comes to pure raw FP power, but SPECint makes it clear that Nehalem is faster in integer loads.
But let us forget that, as most server applications are unable to use all that superscalar power anyway. The AMD chip is still disadvantaged by the fact that it does not have SMT. Considering that most server apps have ample threads and that virtualization makes it easier to load each logical CPU up to 80%, that remains a hard gap to close. Secondly, many of these applications do not fit entirely in the cache, so the fact that AMD's memory latency is up to 40% higher is not helping either. Thirdly, all top Xeons (2.66 GHz and higher) are capable of adding two extra speed bins even if all four cores are busy (as was the case in SAP). It will be interesting to see how much power this costs, and whether Turbo mode is possible with an 80% loaded virtualized machine.
In a nutshell: expect Nehalem with its ample bandwidth and EPT to do very well in VMmark. However, we think that AMD might stay in the slipstream of the Intel flagship in some virtualization setups. It is possible that AMD counters with an even better optimized memory controller in Istanbul, but it is going to be tough.

Return to Linpack
The benchmarks where AMD will be able to stay close should have no use for massive amounts of memory bandwidth, SMT or Turbo mode. Feel free to educate us, but so far we have only found one benchmark that fits this profile: Linpack. Linpack achieves about the highest IPC rates of any software out there. That means the Nehalem Xeon will be consuming peak power, and will not be able to use Turbo mode. Linpack (with MKL or ACML) is also so carefully optimized that it runs almost completely in the caches, and SMT or Hyper-Threading only disturbs the carefully placed code lines. Considering that a 2.7 GHz Shanghai CPU with registered RAM was only a tiny bit slower than a Nehalem CPU with unregistered RAM, you may expect to see both CPUs very close in this benchmark.
Outlook for 2009
The AMD quad-core is now the server CPU to get, but it is not going to stay that way for very long. Until AMD comes up with SMT or another form of multi-threading and a faster memory controller, Intel's newest platform and CPU will force AMD to make the quad-core Opteron very cheap. We expect that the AMD quad-core will only be competitive in Linpack and some virtualization scenarios.
And unless Istanbul has a very nice surprise for us, that is not going to change soon. Admittedly, to our loyal readers this does not come as a surprise...

Far Cry 2 Dissected: Massive Amounts of Performance Data

Originally we had planned on doing a rather quick Far Cry 2 performance article, as the game has been anticipated for quite some time and we like to keep our benchmarks up to date with the latest and greatest titles. Unfortunately we hit some snags along the way. We've finally got all the data we could pull together ready to go, and there is quite a bit of it. Despite some issues that precluded us from obtaining all the data we wanted, we do have an interesting picture of Far Cry 2 performance.

Because of the inclusion of a very robust and useful benchmarking tool, the process of collecting the data was greatly eased. Unfortunately, the benchmark tool was a bit unstable, which did mean lots of babysitting. But other than that, it was still a much nicer process to benchmark Far Cry 2 than most other games. The tool not only helps with running the benchmark, but it does a great job of collecting data. Lots of data. But we'll get to all that in a bit.

By now, many people know about the AMD driver issues that have plagued Far Cry 2 performance and consistency. We were unable to test CrossFire because of driver issues. We didn't do a full SLI analysis because there isn't much to compare it against, but we did include two SLI configurations in order to help illustrate the potential scaling we could see from other SLI setups and to give us a target we hope CrossFire eventually hits (when it works). It is worth noting that this is the kind of issue that really damages AMD's credibility with respect to betting on single-card CrossFire at the high end. We absolutely support their strategy, but they have simply got to execute. This type of fumble is simply unacceptable.

Our lineup of tests is an analysis of Far Cry 2 performance running at High, Very High and Ultra quality, with and without AA, under DX9 and DX10. After we take a look at that, we'll drill down into Ultra High quality DX10 performance and look at AMD and NVIDIA performance from top to bottom. We will touch on both built-in and custom demo performance, as well as 4xAA.

Video Card Buyer's Guide - Spring 2009

It's been since the holidays that we last did a GPU buyer's guide. It never seems like the right time to do a new one, as NVIDIA and AMD have been pushing aggressively back and forth for leadership in the marketplace. When new parts or tweaked cards haven't been coming out, prices have been adjusted quickly to maintain tight competition.

Now is no exception. There are a couple of spots in our lineup where we will have to make recommendations based on what we know about what's happening in the marketplace. In competitive reviews, we try very hard to look only at that exact time slice to make our recommendations. In our buyer's guides we like to be a little more flexible and take a more retail- and marketplace-oriented view rather than the heavily technology- and performance-based focus of our GPU reviews.

Starting out, we're looking at the roughly $75 market where we split our recommendation between the 4670 and the 9600 GT. Prices have compressed more over the past few months, and the 4670 comes in low enough to cover many needs at very little cost. You can always spend less on graphics and get less, but if you want more than 2D, the 4670 and 9600 GT are where you should start looking.

$75 Recommendation: ATI Radeon HD 4670

ATI Radeon HD 4670
Apollo    $64.99
Gigabyte  $79.99
Sapphire  $69.99

And we've got the GeForce 9600 GT. Just a little more performance in some games, maybe a little less in others, with roughly the same cost. But if you want any more than that, you'll want to wait about a month.

$75 Recommendation: NVIDIA GeForce 9600 GT

NVIDIA GeForce 9600 GT
Apollo    $74.99
Gigabyte  $67.99
Sparkle   $89.99
PNY       $97.99

For our ~$100 price point (plus or minus a bit) we are going to strongly recommend that people wait for about a month. This price point will be shaken up a bit in about that time and we really aren't comfortable recommending anyone purchase something in this market until sometime in early May. This may or may not further compress the sub $100 market, but there really isn't much more room down there, so we don't expect much change except at right around $100.

ATI Radeon Xpress 200: Performance, PCI Express & DX9 for Athlon 64

The Radeon Xpress Family

The RX480/RS480 is the first ATI chipset for the AMD platform. It is also the first ATI chipset available as a discrete chipset, without integrated graphics. Previous ATI chipsets have concentrated on integrated graphics for the Intel platform.



While the previous ATI chipsets brought interesting integrated graphics to the Intel platform, the performance never really threatened Intel's domination of the Pentium 4 chipset market. Without truly competitive performance as a chipset, there was no real reason for a discrete ATI chipset solution for Intel, although each generation of the ATI chipset for Intel brought more competitive performance. ATI firmly believes that RX480 for Athlon 64 has broken through the performance barrier, bringing competitive or better performance to Athlon 64.

To understand the current ATI lineup better, you need to take a closer look at how they will be branded and sold.



Radeon Xpress is ATI's name for the new PCI Express chipsets. The 480 series is aimed at the AMD Athlon 64, and the future 400 series will bring ATI PCI Express to Intel. The current introduction covers the Athlon 64 chipsets, with the chipsets for Intel targeted for the beginning of 2005. ATI clearly believes that there is a better market for AMD Athlon 64 solutions right now, which is why they have concentrated on the chipsets for AMD first.



RX480 and RS480 are identical except for integrated graphics. RX480, called Radeon Xpress 200P, is the discrete Athlon 64 solution for PCI Express graphics cards. RS480, marketed as Radeon Xpress 200G, adds integrated DX 9 graphics to the 480 core. Both RX480 and RS480 are currently combined with the SB400 Southbridge.



The RX480/RS480 Northbridge supports dual- or single-channel DDR memory, PCI Express x16 for graphics, and up to 4 PCIe x1 slots. It is interesting that communication between the North and South bridges (RX480/SB400) is handled by 2 additional lanes of PCI Express, which brings the total number of PCI Express lanes to 22. Communication with the CPU runs over a 1GHz (1000MHz) HyperTransport link. RS480 adds integrated DX9 graphics with both VGA and DVI outputs.

The current SB400 south bridge supports 8 USB 2.0 ports, 4 SATA 150 drives, 4 ATA-133 drives, PCI 8-channel AC '97 audio, and up to 5 PCI slots. SATA drives can be combined in RAID 0 and 1 configurations, but RAID 0+1 is not currently supported. The 2-chip design allows ATI to upgrade features just by using a new Southbridge. For instance, an SB450 Southbridge supporting High Definition audio appears on the ATI roadmaps. The SB450 should be available in early 2005.



Optional Integrated Graphics - Radeon Xpress 200G adds ATI's first DirectX 9 integrated graphics, which can drive both DVI (digital) and VGA (analog) outputs. The graphics core is a modified version of the discrete Radeon X300 core with only two rendering pipelines instead of four. Since the logic core is identical on RX480 and RS480, PCIe x16 support is also available. Outputs from the internal graphics and an external graphics card can be combined with ATI SurroundView.

LAN - The Radeon Xpress 200 series does not provide integrated LAN in the chipset. ATI claims that integrated Gigabit LAN offers no performance or cost advantage compared to Gigabit LAN supported by the PCI Express bus. PCI Express Gigabit LAN can deliver bi-directional 500MB/s total bandwidth per device. Gigabit or 10/100 Ethernet can be supported by the Northbridge PCI Express bus or the Southbridge PCI bus. This should allow manufacturers the option to implement LAN for top performance or for lowest cost.

SLI - ATI claims that this 20-lane PCI Express design is capable of supporting dual PCIe x8 slots for combining 2 graphics cards in an SLI configuration. Plans are already in place at ATI for an SLI version of RX480 to be introduced in early 2005.

Understanding the iPhone 3GS

Putting it in Perspective

Below is a table of the CPUs used in some of the top smartphones on the market; let's put our newly refreshed knowledge to the test.

Phone | CPU | Issue Width | Basic Pipeline | Clock Speed
Apple iPhone / iPhone 3G | Samsung ARM11 | single | 8-stage | 412MHz
Apple iPhone 3GS | Samsung ARM Cortex A8 | dual | 13-stage | 600MHz
HTC Hero | Qualcomm ARM11 | single | 8-stage | 528MHz
Nokia N97 | ARM11 | single | 8-stage | 424MHz
Palm Pre | TI ARM Cortex A8 | dual | 13-stage | 600MHz
RIM Blackberry Storm | Marvell ARM11 | single | 8-stage | 624MHz
T-Mobile G1 | ARM11 | single | 8-stage | 528MHz

The first thing you’ll notice is that there are a number of manufacturers of the same CPUs. Unlike the desktop x86 CPU market, there are a multitude of players in the ARM space. In fact, ARM doesn’t manufacture any processors - it simply designs them. The designs are then licensed to companies like Marvell, Samsung, Texas Instruments and Qualcomm. Each company takes the ARM core it has licensed, surrounds it with other processors (e.g. graphics cores from PowerVR) and delivers the entire solution as a single chip called a System on a Chip (SoC). You get a CPU, GPU, cellular modem and even memory all on a single chip, all with minimal design effort.


A derivative of this is what you'll find in the iPhone 3GS

While it takes ARM a few years to completely architect a new design, its licensees can avoid the painful duty of designing a new chip and just license the core directly from ARM. ARM doesn’t have to worry about manufacturing, and its licensees don’t have to focus on building world-class microprocessor design teams. It’s a win-win situation for everyone in this business.

For the most part, ARM’s licensees don’t modify the design much at all. There are a few exceptions (e.g. Qualcomm’s Snapdragon Cortex A8), but usually the only things that will differ between chips are clock speeds and cache sizes.

The fundamentals of the architectures don’t vary from SoC to SoC; what does change are the clock speeds. Manufacturers with larger batteries and handsets can opt for higher clock speeds, while others will want to ship at lower frequencies. The ARM11-based products all fall within the 400 - 528MHz range. These are all single-issue chips with an 8-stage pipeline.

 | iPhone 3G (ARM11) | iPhone 3GS (ARM Cortex A8)
Manufacturing Process | 90nm | 65nm
Architecture | In-Order | In-Order
Issue Width | 1-issue | 2-issue
Pipeline Depth | 8-stage | 13-stage
Clock Speed | 412MHz | 600MHz
L1 Cache Size | 16KB I-Cache + 16KB D-Cache | 32KB I-Cache + 32KB D-Cache
L2 Cache Size | N/A | 256KB

The iPhone 3GS and the Palm Pre both ship with a Cortex A8. I’m actually guessing at the clock speeds here; there’s a chance that both of these devices run closer to 500MHz, but it’s tough to tell without querying the hardware at a lower level. The Cortex A8 gives us a deeper pipeline, and thus higher clock speeds, as well as a dual-issue front end. The end result is significantly higher performance. Apple promised a more than 2x performance improvement from the iPhone 3GS over the iPhone 3G; such an increase was only possible with a brand new architecture.

I must stress this again: clock speed alone doesn’t determine the performance of a processor. Gizmodo’s recent N97 review complained about the speed of Nokia’s 424MHz processor (rightfully so). The review continued by saying that HTC uses 528MHz processors, implying that Nokia should do the same. That second part misses the point: a slightly faster ARM11 isn’t what Nokia should be aiming for on its $500+ smartphone; what is inexcusable is the fact that Nokia is not using ARM’s latest and greatest Cortex A8 in such an expensive phone. It’s the equivalent of Dell shipping a high-end PC with a Core 2 Duo instead of a Core i7; after a certain price point, the i7 is just expected.