Fun with IBM's Z10 numbers

As far as I'm concerned, when an organization buys an IBM mainframe to run Linux somebody should be fired - but not the DP guy who bought it. It should be his boss: the non-technical guy who approved it without carefully reviewing the alternatives.

If you're into high performance computing, IBM has the machine for you: 64 customized Power processor cores on four 16-way/380GB SMP boards at 4.4GHz, each with up to 48GB/sec in bandwidth to the world outside.

It's a hot machine, but costs are high:

  z10 Costs
  Capital cost        $26 million
  Maintenance cost    $1,706,000/yr
  SuSe Linux          $768,000/yr

Here's some of what IBM said about the z10 in the original February 2008 press release:

IBM's next-generation, 64-processor mainframe, which uses Quad-Core technology, is built from the start to be shared, offering greater performance over virtualized x86 servers to support hundreds to hundreds of millions of users.

The z10 also supports a broad range of workloads. In addition to Linux, XML, Java, WebSphere and increased workloads from Service Oriented Architecture implementations, IBM is working with Sun Microsystems and Sine Nomine Associates to pilot the Open Solaris operating system on System z, demonstrating the openness and flexibility of the mainframe.

From a performance standpoint, the new z10 is designed to be up to 50% faster and offers up to 100% performance improvement for CPU intensive jobs compared to its predecessor, the z9, with up to 70% more capacity. The z10 also is the equivalent of nearly 1,500 x86 servers, with up to an 85% smaller footprint, and up to 85% lower energy costs. The new z10 can consolidate x86 software licenses at up to a 30-to-1 ratio.

If you want a lot of detail, be patient: there's a 170 page redbook (sg247515) describing the thing due for December 8/08 release.

Unfortunately IBM doesn't say what those "hundreds of millions" of users will be doing, doesn't define those 1,500 replaceable x86 servers, and doesn't allow anyone to publish authoritative benchmark results for the machine - so we can guess it would be insanely great as a Dungeons and Dragons host, but we really don't know how it would compare to something like IBM's own p595 or Sun's M8000.

We do know how some of the numbers compare to alternatives. For $26 million, for example, you could buy 21,131 low end Sun x86 servers - each with 2GB of memory and a dual core Opteron - or, if you prefer mid range x86, the same money would buy 4,985 Sun M2 X4200 servers, each with two dual core Opterons at 2.8GHz, 8GB of RAM, 292GB of disk, and Solaris.

If, however, you just bought 1,500 of these, you'd get to keep $18 million or so in change - enough, at $85K per FTE all in, to pay 21 additional IT staff for ten years.
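For anyone who wants to check that arithmetic, here is a minimal sketch. It assumes the X4200 unit price is simply the $26 million divided by the 4,985 servers quoted above, which is my inference and not a published list price; the function and variable names are made up for illustration.

```python
# Rough check of the "change from $26 million" arithmetic.
# Assumption: unit price is implied by "$26 million buys 4,985 X4200s".

Z10_CAPITAL = 26_000_000      # z10 capital cost, from the cost table
SERVERS_FOR_BUDGET = 4_985    # X4200s the same money would buy
FTE_COST = 85_000             # all-in cost per staff member per year


def change_after_cluster(servers: int) -> float:
    """Money left from the z10 budget after buying `servers` X4200s."""
    unit_price = Z10_CAPITAL / SERVERS_FOR_BUDGET
    return Z10_CAPITAL - servers * unit_price


def staff_for_years(budget: float, years: int) -> int:
    """Whole FTEs the budget can carry for the given number of years."""
    return int(budget // (FTE_COST * years))


change = change_after_cluster(1_500)
print(f"Change: ${change:,.0f}")
print(f"Staff for ten years: {staff_for_years(change, 10)}")
```

On those assumptions the change comes to a little over $18 million, and the headcount works out to 21.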

You'd also get rather more processing resources:

                  IBM z10     1500 X4200s
  CPU Cycles      282GHz      16,800GHz
  Memory          1,520GB     12,000GB
  Disk Storage    None        438TB
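The table entries appear to be straight multiplications of the specs quoted earlier. The sketch below is my reconstruction of that, not a benchmark: it just adds up raw clock cycles, memory, and disk, and the dictionary keys are illustrative names.

```python
# Reconstructing the resource table from the specs quoted in the text.
# This is raw aggregation only; it says nothing about real throughput.

def z10_resources() -> dict:
    cores, ghz_per_core = 64, 4.4    # 64 cores at 4.4GHz
    boards, gb_per_board = 4, 380    # four 16-way/380GB SMP boards
    return {
        "cpu_ghz": cores * ghz_per_core,
        "memory_gb": boards * gb_per_board,
        "disk_tb": 0,                # the table lists no disk for the z10
    }


def x4200_cluster_resources(servers: int = 1_500) -> dict:
    sockets, cores_per_socket, ghz_per_core = 2, 2, 2.8  # two dual core Opterons
    ram_gb, disk_gb = 8, 292
    return {
        "cpu_ghz": servers * sockets * cores_per_socket * ghz_per_core,
        "memory_gb": servers * ram_gb,
        "disk_tb": servers * disk_gb / 1_000,
    }


for name, res in (("z10", z10_resources()), ("1500 X4200s", x4200_cluster_resources())):
    print(name, {k: round(v, 1) for k, v in res.items()})
```

The z10 CPU figure comes out at 281.6GHz, which the table rounds to 282GHz.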

In fact, if you accept that each PPC cycle typically achieves about twice as much real work as an x86 cycle, and then cheerfully assume that the mainframe has essentially zero overheads on data transfer and switching, then you could more than match all of its throughput resources with only about 80 of these little x86 machines.

So why would anyone buy a z10 to run Linux? IBM's answer is that the z10 can virtualize 1,500 undefined x86 boxes or serve hundreds of millions of users - and there are circumstances in which both statements are perfectly true. People parsing sales brochures or attending data processing conventions can, for example, easily imagine 1,500 essentially idle servers or two hundred million users represented as records in a batch job.

The real reason people buy into this is worldview - but that's not the answer you get from people who make this kind of decision. Scratch one of them deeply enough to get past the personal attack that asking the question will generate, and you'll find rationalizations couched in terms of the space, power, and staffing savings they get from consolidation.

In reality this argument is absurd: if you bought twice as many four-way x86 servers as you need to match the mainframe's throughput, and then hired 20 full time IT staff at an all-in cost of $85K per FTE, you could keep the entire $26 million in z10 capital cost in your pockets while paying for your cluster (including staff) just from the maintenance and Linux licensing fees you're not paying IBM and Novell.
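A back-of-the-envelope version of that claim, under assumptions I'm adding: "twice as many as you need" means 160 servers (double the roughly 80 estimated above), and each costs the price implied by $26 million buying 4,985 of them. The names are hypothetical and the result is only as good as those inputs.

```python
# Can the avoided IBM/Novell fees pay for the staff and the cluster?
# Assumptions: 160 servers, unit price implied by $26M / 4,985 servers.

MAINTENANCE_PER_YEAR = 1_706_000   # z10 maintenance, from the cost table
LINUX_PER_YEAR = 768_000           # Linux licensing, from the cost table
STAFF, FTE_COST = 20, 85_000       # 20 staff at $85K all in

UNIT_PRICE = 26_000_000 / 4_985    # implied X4200 price (assumption)
SERVERS = 2 * 80                   # twice the ~80-machine estimate


def annual_surplus() -> int:
    """Avoided yearly fees minus yearly staff cost."""
    return MAINTENANCE_PER_YEAR + LINUX_PER_YEAR - STAFF * FTE_COST


def hardware_payback_years() -> float:
    """Years of surplus needed to cover the one-time cluster purchase."""
    return SERVERS * UNIT_PRICE / annual_surplus()


print(f"Annual surplus after staff: ${annual_surplus():,}")
print(f"Cluster paid off in about {hardware_payback_years():.1f} years")
```

On these numbers the avoided fees cover the 20 staff with about $774,000 a year to spare, and that surplus pays off the hardware in a little over a year.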

And yet the true believers will not only buy into this - but loudly tell other people they're right to do so. Why?

The answer, I think, is that data processing people still focus on system utilization as the primary measure of their own effectiveness - and, on that basis, a $30 million system that approaches 100% average utilization is infinitely preferable to a half million dollar system that does the same job at 20% utilization.

You may think, as I do, that this is absurd, but it's a matter of world view; because data processing originally reported to Finance, budgets are givens and the focus is inward: on managing a system in which users are treated as nuisances and system utilization is king.

To a Windows user, or a Unix manager, utilization rates are completely irrelevant: nobody cares if the machine is idle much of the time; we care that the resources are available when users need them.

Thus when I look at a large Sun Ray server installation running at an average 12% utilization during working hours, I see a success - a system in which users get the resources they need when they need them. The data processing guy looking at the same system, however, sees absolute proof of complete management incompetence: a machine that's 88% idle.

That's the bottom line difference between the data processing and Unix world views: they measure themselves against utilization because that made sense when their profession evolved, and we measure ourselves against user satisfaction, service quality, and response times because those are the measures our users care about.

All of which brings us to the real bottom line question: who's responsible when some organization chooses to spend insane amounts of money to limit user computing resources? My answer is that it's not the data processing guy - he's just doing what his predecessors did and what he's been trained to do: buying to maximize utilization.

So who is it? It's the guy in the executive suite: the guy who's so clueless about computing, and so desperate not to have anything to do with it, that he doesn't see that equating 1,500 complete x86 servers to only 64 PPC cores requires a profound belief in magic.