SPEC launches standardized energy efficiency benchmark

Summary: SPEC (Standard Performance Evaluation Corporation) launched its first standardized energy efficiency benchmark, SPECpower_ssj2008, this week, tackling something that the computer industry has struggled to define in recent years.  With datacenter energy costs spiraling out of control, server customers have struggled to sort out the conflicting messages from technology vendors about who is the energy efficiency leader.


SPEC (Standard Performance Evaluation Corporation) launched its first standardized energy efficiency benchmark, SPECpower_ssj2008, this week, tackling something that the computer industry has struggled to define in recent years.  With datacenter energy costs spiraling out of control, server customers have struggled to sort out the conflicting messages from technology vendors about who is the energy efficiency leader.  Now the industry has a standardized way to measure the energy efficiency of computer servers.

Even though this first version of SPECpower only addresses server side Java performance, it is one of the most comprehensive standards for energy efficiency to date, giving it instant credibility.  Other energy efficiency metrics like the Green500 list simply take the theoretical aggregate FLOPS (Floating Point Operations Per Second) of a cluster of computers and divide it by the measured peak power consumption, or even the peak rated power consumption if measurements aren't given.  Since FLOPS aren't really a good real-world measure of performance to begin with, and most people don't operate their servers at peak loads or run massive clusters, the Green500 list simply isn't that useful a metric.

SPECpower_ssj2008 is basically a measure of ssj_ops/watt (server side Java operations per second per watt).  I would personally prefer to call it ssj_opj (server side Java operations per unit of energy in joules) since "per second per watt" is by definition "per joule".  SPECpower_ssj2008 accounts for the fact that servers usually aren't operated at peak capacity and are even idle at times.  To capture both idle and peak load power consumption, average power consumption at 0, 10, 20, 30, and so on through 100 percent of load capacity is measured and disclosed.  Server side Java operations per second are then divided by the average power consumption in watts at every 10% increment, and all the scores are averaged again to produce the "overall" ssj_ops/watt metric.  The following graph is from the current SPECpower_ssj2008 performance leader as of December 12th, 2007, and it illustrates how this benchmark works.

[Graph: ssj_ops/watt at each target load level for the current SPECpower_ssj2008 performance leader]
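
To make the math concrete, here is a minimal sketch in Python of how the overall figure is assembled from the per-load measurements described above; the throughput and power numbers are made up for illustration and are not actual SPEC results.  It also shows why "per second per watt" is the same thing as "per joule".

    # Hypothetical per-load measurements for one server: target load level,
    # throughput in ssj_ops, and average power draw in watts.  These figures
    # are illustrative only, not published SPEC results.
    measurements = [
        (1.0, 220_000, 310),  # 100% target load
        (0.9, 198_000, 295),
        (0.8, 176_000, 278),
        (0.7, 154_000, 262),
        (0.6, 132_000, 246),
        (0.5, 110_000, 230),
        (0.4,  88_000, 214),
        (0.3,  66_000, 199),
        (0.2,  44_000, 184),
        (0.1,  22_000, 170),
        (0.0,       0, 155),  # active idle
    ]

    # Per-load efficiency: operations per second divided by watts.  Since a
    # watt is one joule per second, ops/sec/watt is identical to ops/joule.
    per_load = [ops / watts for _, ops, watts in measurements]

    # Overall score, computed here as the average of the per-load ratios,
    # following the description in the article.
    overall = sum(per_load) / len(per_load)
    print(f"overall ssj_ops/watt (= ssj_ops per joule): {overall:.0f}")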

I spoke with SPEC president Walter Bays yesterday about this new power benchmark, and my preference for using ssj_opj was one of the topics that came up.  I also asked Bays why there couldn't be a SPECint_rate2006/watt or SPECfp_rate2006/watt measurement as well.  Although Bays couldn't comment specifically on the availability or existence of future benchmarks, he did explain that the SPEC CPU benchmarks (SPECint and SPECfp) measure peak throughput only, which would make them fairly simple to measure and interesting.  The resulting metric would be a lot more valuable than the FLOPS/watt rating used in the Green500 list since SPEC CPU is much more comprehensive than a simple FLOP measurement.  Bays also explained that SPECweb2005 might be a good candidate, but it is a more complex benchmark (due to the multiple systems involved), making it too much to tackle for the initial version of SPECpower.

First server comparisons for SPECpower_ssj2008

On the benchmark's opening day (12/12/2007), there were exactly a dozen published submissions for SPECpower_ssj2008.  Eleven of the systems are Intel based and one is AMD based.  The fastest dual-processor quad-core Intel system, based on the recently launched 3.0 GHz 45nm Xeon E5450 processor, is capable of 1144 server side Java operations per joule at peak load and has an overall score of 698.  The fastest single-processor quad-core Intel server is based on the aging year-old 2.4 GHz 65nm Xeon X3220 processor and has an overall score of 667.  Even though its absolute performance is less than half that of the fastest dual-processor system, it also uses less than half the power, which makes the efficiency scores comparable.

The only AMD based system submitted so far is a dual-processor machine using dual-core Opteron 2216 HE (High Efficiency model) chips running at 2.4 GHz.  That system has an overall score of 203, though readers should note that the new High Efficiency 1.9 GHz quad-core Barcelona Opterons will undoubtedly do better once they get past their launch problems.  Also note that Colfax International used 8 memory DIMMs instead of the 4 used by every other vendor that submitted results, which probably added an unnecessary ~15 watts to the power consumption.  If we factor in the extra power consumed by those 4 extra DIMMs, the efficiency score might have been closer to 212.  Since the performance jump from a dual-core 2.4 GHz "K8" Opteron to a quad-core 1.9 GHz "Barcelona" won't be enough to double throughput, and the power consumption won't change (same TDP for the chips, with all other components remaining the same), the ssj_ops/watt score for AMD's 1.9 GHz Barcelona server will improve but it won't double.
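
To illustrate that adjustment, here is a back-of-the-envelope sketch in Python.  The average power figure is an assumption chosen purely for illustration (the published result discloses the real per-load numbers); it simply shows how shaving roughly 15 watts moves a score of 203 into the neighborhood of 212.

    # Back-of-the-envelope adjustment for the 4 extra memory DIMMs.
    published_score = 203      # overall ssj_ops/watt of the Opteron 2216 HE system
    assumed_avg_power = 350.0  # watts; a hypothetical average across the load levels
    extra_dimm_power = 15.0    # watts attributed to the 4 additional DIMMs

    # Holding throughput constant, the score scales inversely with power draw.
    adjusted = published_score * assumed_avg_power / (assumed_avg_power - extra_dimm_power)
    print(f"adjusted score: {adjusted:.0f}")  # roughly 212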

[Chart: Comparison of peak load efficiencies for the systems mentioned above]

This first version of SPECpower is only representative of server side Java performance and energy efficiency, and it doesn't necessarily translate to other server applications.  Server side Java has traditionally favored Intel even at comparable core counts and clock speeds (based on SPECjbb2005 results), so it's understandable that SPECpower_ssj2008 doesn't show well for AMD.  AMD isn't at a clock-for-clock, core-for-core disadvantage in SPECweb2005, so it's reasonable to hypothesize that a web server version of SPECpower would be more competitive for AMD if it can catch up on core count and clock speed.  A hypothetical SPECfp_rate2006/watt metric (of interest to HPC customers), which is weighted more heavily toward memory bandwidth, would also be more competitive for AMD because of AMD's faster memory subsystem, but only if AMD can deliver higher clocked quad-core chips.

In conclusion, the new SPECpower_ssj2008 benchmark will likely be welcomed with open arms by server side Java customers.  As a start, SPECpower_ssj2008 is an important milestone in the quest for a good energy efficiency metric.  The market may also demand additional versions of SPECpower that encompass web serving, 2D/3D graphics applications, high performance computing, and general purpose computing.  At least now we have a standardized framework to build on.

Topics: Open Source, Hardware, Processors, Servers

Talkback

8 comments
  • Great Article, George.

    Validates what has been known for some time now. No need for any more made up metrics like ACP or other nonsense.

    Why are there no AMD fans posting? Could it be because Intel owns the power/perf crown?

    Yeah, I thought so.
    thetruthhurts
    • This is the performance/watt crown for java performance

      This is the performance/watt crown for Java performance, and it's exaggerated by the fact that the Intel architecture is more suited to Java. If this were a SPECweb derivative of SPECpower, then you would at least see similar performance/watt at comparable core counts and clock speeds, but AMD can't match the clock speeds.
      georgeou
  • Why did SPEC agree to this if it's just Java?

    shortsighted?
    thetruthhurts
    • As I explained in the blog, it's the easiest candidate

      As I explained in the blog, it's the easiest candidate if you want something that will give you operations per joule at varying loads. SPECweb might be another candidate and may eventually become one, but it was too complex to tackle at first.
      georgeou
    • The real problem is the use of single numbers

      Real performance is not a single-dimensional quantity.

      To SPEC's credit, they require disclosure at specific loads, but they undo that by yielding to the pressure from journalists, marketeers, and consumers (i.e. just about everybody) for a single, easy-to-quote-but-totally-misleading number.

      But the simplest power model that makes any sense at all is one where there's a base idle consumption plus a cost/operation value. Limiting ourselves to linear models, we can decompose the operations into different types of operations, at the instruction level, memory operations, network, disk, GPU - and break it down even further from there.

      It gets real complicated real fast, but it could be done, and we'd have a multi-dimensional model that would really predict real-world performance.

      The only problem is, it would be pointless, because nobody knows the real-world makeup of real-world applications! Even if you could measure it, your results today might not be what you experience tomorrow.

      So we're stuck with a plethora of benchmarks that collapse the performance space along the single axis of some arbitrary load. Whatever load they choose, it won't be YOUR load, so you -- and everybody else -- will end up arguing about the benchmark.

      So benchmarks give us just a little insight over not having them. A little more, if we carefully consider the characteristics of the particular benchmark.

      But no satisfaction that we've answered the question of which product is better!

      Because the only time that question gets answered is if you benchmark your own real-world load -- assuming you can control and measure it adequately to get a statistically significant result!

      But to return to the subject -- in this case in particular, I think it was a major mistake to include the single number result. I think they should have analyzed it down to a "N watts idle + N joules / operation" formula -- because that's something you can reason about.
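
      (For what it's worth, a model of that shape can be fit to the per-load power numbers SPEC already requires vendors to disclose. Below is a minimal sketch in Python using an ordinary least-squares fit; the load/power points and the peak throughput are made-up figures, not data from any published result.)

          # Hypothetical (utilization, average watts) points as they might be
          # read from a SPECpower_ssj2008 disclosure; figures are made up.
          points = [(0.0, 155), (0.2, 184), (0.5, 230), (0.8, 278), (1.0, 310)]

          n = len(points)
          mean_x = sum(x for x, _ in points) / n
          mean_y = sum(y for _, y in points) / n

          # Ordinary least-squares fit of: power = idle_watts + slope * load
          slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
                   / sum((x - mean_x) ** 2 for x, _ in points))
          idle_watts = mean_y - slope * mean_x

          # If peak throughput (ssj_ops per second at 100% load) is known, the
          # slope converts to a marginal energy cost per operation in joules.
          peak_ops_per_sec = 220_000  # hypothetical
          joules_per_op = slope / peak_ops_per_sec

          print(f"idle ~ {idle_watts:.0f} W, marginal cost ~ {joules_per_op * 1000:.3f} mJ/op")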

      Just because a good benchmark is a hopeless task doesn't mean you shouldn't try!
      Bob.Kerns
      • You would be hard pressed to find a situation where one vendor beats another

        If your average transaction load is 70%, you look at the 70% score. If your average load is 20%, you look at the 20% score. The overall score isn't perfect, but it is a very FAIR score. Furthermore, you would be hard pressed to find a situation where one vendor beats another vendor at 20% load but loses at 80% load.
        georgeou
  • SPECPower vs. Green500

    Hello,

    I finally had a moment to read your article and I generally enjoyed it. I
    think you captured the importance of the SPEC benchmark and the motivation
    for using Java-based applications in the first pass. Note, my comments are
    my own and in this capacity I am not representing the SPECPower
    subcommittee or any other organization.

    I do have one nit about your article of course; few respond to such
    articles without some form of disagreement with the author. As the only
    person that has participated as a member of the SPECPower committee and a
    co-founder of the Green500 List, I disagree with your juxtaposition of the
    two benchmark methodologies.

    Prior to the release of both methodologies, there was a void in
    power-performance efficiency benchmarks. The history of performance-only
    benchmarks has shown a number of benchmarks that continue to coexist (SPEC
    CPU, TPC, LINPACK) and serve a variety of communities. Why would
    power-performance efficiency benchmarks be any different?

    Benchmarks serve distinct audiences. The commercial server community is
    driven by many that use applications similar to the java benchmarks used
    in SPECPower. The HPC community would find it laughable to use a java
    benchmark to gauge efficiency for a high-end system; they are much more
    comfortable interpreting single-metric benchmarks like LINPACK.

    Each benchmarking effort, as in SPECPower and Green500, is constantly
    making choices - like whether a single metric should be used - where the
    answer is often heavily dependent on the opinion of the participants. This
    implies the process itself will change and mature over time and that
    people will argue about the methodology and the usefulness of a particular
    benchmark indefinitely.

    Undoubtedly, both methodologies will evolve to serve their particular
    community. You pointed out some of the weaknesses in your article; for
    one, both methodologies should be broadened to capture additional load
    varieties.

    I find it most likely that multiple power-performance benchmarks and
    methodologies will emerge and take hold across several communities. In
    reference to your article, the comparison between SPECPower and Green500
    was somewhat ill-conceived. They serve different communities with
    different technical cultures and motivations. Comparing whether or not
    load levels were used across the methodologies fails to consider the
    context within which benchmark decisions were made. In essence, each
    benchmark methodology was designed to reflect the immediate needs of its
    community.

    Having participated in both efforts first-hand, I am quite pleased with
    the first-pass outcome of both methodologies. At the very least, the
    accolades and criticism of both efforts have ensured an elevation of the
    dialogue, which I for one have been lobbying for over 5 years.

    As mentioned, while I disagreed with portions of your article, I did find
    it well-written and thoroughly researched. And I look forward to reading
    your articles in the future on this topic.

    Kind Regards,

    Kirk W. Cameron
    Associate Professor
    Computer Science
    Virginia Tech
    co-founder Green500
    member SPECPower subcommittee
    kwcameron
    • Thanks for your response, but you misread me

      Kirk,

      Thank you for your thoughtful response, but I think you completely misread me.

      "The HPC community would find it laughable to use a java benchmark to gauge efficiency for a high-end system; they are much more comfortable interpreting single-metric benchmarks like LINPACK. ...SNIP... the comparison between SPECPower and Green500 was somewhat ill-conceived."

      I explicitly stated that SPECpower_ssj2008 was ONLY a good measure of Java performance and efficiency. I said that more was needed to measure other business applications AND HPC applications.

      I said that a SPECfp_rate2006/watt (or milliwatt) score was far superior to a FLOP/watt metric; I never said that ssj_opj was better than a FLOP/watt metric. I stand by my position that LINPACK is not as comprehensive or as relevant to the HPC community as SPECfp_rate2006. That's not to say that LINPACK is worthless, because it obviously reflects a subset of workloads relevant to the HPC community, but SPECfp_rate2006 reflects a bigger segment of HPC. Nor is SPECfp_rate2006 perfect by any measure, and HPC customers will ultimately want to test their own applications, but I think that most people will agree that SPECfp is more comprehensive than LINPACK.
      georgeou