A Sun Blog Entry on Power Use

Sun's Marc Hamilton had a personal blog entry on August 29th that I found fascinating. It's about the problems the University of Buffalo has had getting enough power in place to run a new supercomputer grid made up of 834 dual-processor Dell boxes.

It seems Buffalo has about 210 kW available but needs 350 kW to run the thing. What Hamilton does in the blog, therefore, is show that they actually have enough power to run a comparable grid built from Sun's existing V20z servers with dual-core Opterons.

Here's how he introduces the key issue:

According to Buffalo's aptly named hotpages, their cluster has a total of 800 Dell SC1425 compute nodes each with two 3.2 GHz Xeon processors. For whatever reason, Intel makes it rather difficult to find information on CPU power usage on their web site, but Dell has this nice Power Calculator on their site which shows the SC1425 uses 437 Watts, which works out to about 350 KWatts for the whole lot of 800. That doesn't count the Myrinet, Fiber Channel, or Gigabit Ethernet switches used in the cluster, or the power needed to cool the system, but let's ignore that for the time being. If Buffalo only has enough power to run the cluster at 60% of capacity, let's assume they have 210 KWatts available for compute nodes.
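The power budget in that passage is simple arithmetic, and it is easy to check. The sketch below just redoes it; the constant and function names are hypothetical, the 437 W and 800-node figures are the ones Hamilton quotes, and the 60% fraction is his own assumption. Like his estimate, it counts compute nodes only, not switches or cooling.

```python
# Figures quoted in Hamilton's entry (not independently verified):
WATTS_PER_SC1425 = 437     # Dell Power Calculator figure per SC1425 node
DELL_NODES = 800           # compute nodes in the Buffalo cluster
AVAILABLE_FRACTION = 0.60  # assumption: power for ~60% of the cluster


def cluster_power_kw(nodes: int, watts_per_node: float) -> float:
    """Total draw of the compute nodes alone, in kilowatts.

    Ignores Myrinet/Fibre Channel/Ethernet switches and cooling,
    exactly as the original back-of-the-envelope estimate does.
    """
    return nodes * watts_per_node / 1000.0


dell_kw = cluster_power_kw(DELL_NODES, WATTS_PER_SC1425)
available_kw = dell_kw * AVAILABLE_FRACTION

print(f"Dell cluster draw: {dell_kw:.1f} kW")       # 349.6 kW
print(f"Assumed available: {available_kw:.1f} kW")  # 209.8 kW
```

Rounded, that gives the 350 kW demand and roughly 210 kW budget the rest of the argument relies on.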

Usually, when someone buys a 1600 CPU cluster, one of their goals is to get on the Top500 list. To qualify for the Top500 list, you need to run a benchmark called Linpack. There are two figures reported in the Top500 list. The first is a simple calculation called Rpeak which is the maximum theoretical number of floating point operations per second. For Dell's SC1425 server, the figure is calculated as 3.2 GHz * 2 CPUs/server * 2 floating point units/CPU = 12.8 GFlops. For 800 servers you get an Rpeak of 10.24 TFlops. Now let's look at Sun's V20z with dual core AMD Opteron CPUs. A single Sun Fire V20z has an Rpeak of 2.2 GHz * 2 CPUs * 2 cores/CPU * 2 floating point units/core = 17.6 GFlops. The same Rpeak value as the 800 node Dell cluster could thus be obtained by 582 V20z servers.
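The Rpeak comparison follows the same pattern: clock rate times sockets times cores times floating point operations per cycle, then divide the Dell total by the per-V20z figure and round up to whole servers. A minimal sketch, using Hamilton's per-CPU numbers; the function name is hypothetical, and treating "2 floating point units" as 2 flops per cycle per core is the simplification his figures imply.

```python
import math


def rpeak_gflops(ghz: float, sockets: int, cores_per_socket: int,
                 flops_per_cycle_per_core: int) -> float:
    """Theoretical peak (Rpeak) of one server, in GFlops."""
    return ghz * sockets * cores_per_socket * flops_per_cycle_per_core


# Dell SC1425: two single-core 3.2 GHz Xeons, 2 flops/cycle each
sc1425 = rpeak_gflops(3.2, 2, 1, 2)
dell_total = 800 * sc1425

# Sun Fire V20z: two dual-core 2.2 GHz Opterons, 2 flops/cycle per core
v20z = rpeak_gflops(2.2, 2, 2, 2)

# Round up: a fraction of a server can't be bought
servers_needed = math.ceil(dell_total / v20z)

print(f"SC1425 Rpeak: {sc1425:.1f} GFlops")                     # 12.8
print(f"800-node Dell Rpeak: {dell_total / 1000:.2f} TFlops")  # 10.24
print(f"V20z Rpeak: {v20z:.1f} GFlops")                         # 17.6
print(f"V20z servers to match: {servers_needed}")               # 582
```

10,240 / 17.6 is about 581.8, so rounding up yields the 582 servers in the quote.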

He then points out that running 582 V20z servers would take only 190 kW, below the roughly 210 kW available, and estimates the cost savings from not having to house, connect, and power the additional 218 servers. His conclusion: Buffalo could have gotten a lot more bang for fewer bucks by buying from Sun.
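Those last figures can be cross-checked the same way. The sketch below takes Hamilton's 190 kW total and 210 kW budget as given. The per-server draw it prints is back-calculated from those numbers, not a Sun specification, and the names are hypothetical.

```python
DELL_NODES = 800
V20Z_SERVERS = 582
V20Z_TOTAL_KW = 190.0  # Hamilton's figure for the 582-server grid
AVAILABLE_KW = 210.0   # Hamilton's assumed budget for compute nodes

# Back-calculated average draw per V20z implied by the 190 kW total
implied_watts_per_v20z = V20Z_TOTAL_KW * 1000 / V20Z_SERVERS
fits_budget = V20Z_TOTAL_KW <= AVAILABLE_KW
headroom_kw = AVAILABLE_KW - V20Z_TOTAL_KW
servers_avoided = DELL_NODES - V20Z_SERVERS

print(f"Implied draw per V20z: {implied_watts_per_v20z:.0f} W")  # 326 W
print(f"Fits the budget: {fits_budget}")                         # True
print(f"Headroom: {headroom_kw:.0f} kW")                         # 20 kW
print(f"Servers not needed: {servers_avoided}")                  # 218
```

The 218 avoided servers are what the rest of his cost estimate multiplies out.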

When I read this almost two weeks ago, I thought it a nice bit of analysis that stayed just on the right side of gloating in public. Yesterday, however, I was driving along, peaceably nodding my head at whatever my wife was saying, when something struck me about what he'd written. And, indeed, there's a key line:

218 extra Myrinet cards + switch ports @ average $1K = $218K

Myrinet makes some cool products, but their presence usually signals the absence of a Sun/Solaris perspective during the decision process.

Once I looked into it, it wasn't a surprise to find that the Dell choice reflected long-standing relationships, that the power issue had to have been understood in advance, and that there appears to be some serious posturing going on at the top.

In other words, a blog entry I initially understood as an interesting demonstration of the role of power measured in kilowatts is really far more about power measured in loyalties, control, and competence. As such, it looks like a call to reduce power consumption but is really more of a call to overturn power of another kind entirely.

Or, at least, that's how I read it now.