The obligatory 'Victoria Falls' post

Summary: The reason Sun's new dual socket CMT/SMP machines don't double the throughput of their uniprocessor predecessors is that other components, particularly memory, cost too much to speed up - but the additional threads and processing power work wonders on things like response time for CPU constrained applications - like Lotus Domino.

TOPICS: Hardware, Oracle
For those who don't know, Sun's new T2+ machines extend the T2's CMT capabilities across multiple units to produce 16 and 32 core SMP machines capable of handling 128 and 256 concurrent threads respectively. Sun blogger Denis Sheahan provides a good overview of the current dual socket releases here. By itself the T2 continues to set new performance records - Sun's bmseer usually has the latest; most recently a pair of new SPECint_rate2006 and SPECfp_rate2006 records.

The new machines don't offer the kind of quantum leap the T2 did - obviously because the T2+ is a continuation within the UltraSPARC SMP/CMT line, and less obviously because market pricing constraints limit the throughput possible in other parts of the system. The most illustrative benchmark result I've seen on this, also reported by bmseer, involves Lotus Domino. Here's part of that report:
Lotus Domino 7.0.1 NotesBench R6iNotes Performance Chart (in increasing $/User order)
Users = number of users supported (bigger is better)
NotesMark = the benchmark metric (bigger is better)
$/User = cost per user (smaller is better)

System       Chip (GHz)         Cores/Chip  OS     Users  NotesMark  #Dom Part  AvRT   $/User
Sun T5240    2x US T2 Plus 1.2  8           Sol10  65000  55101      6          224ms  $2.84
IBM-P5 560Q  2x POWER5+ 1.8     4           AIXL   55000  46103      6          848ms  $4.89
Sun T5220    1x US T2 1.4       8           Sol10  43000  36240      6          584ms  $2.89

Complete benchmark results may be found at the Lotus NotesBench website http://www.notesbench.org.
Notice that doubling the CPU only produced about a fifty percent increase in throughput - an artifact of limitations elsewhere in the system. Users, however, don't care about throughput in applications like this: they care about response time - and that's where the T2+ really shines, reducing the average response time from 584ms to only 224ms - an improvement of roughly 60%. That's an artifact of the CMT architecture and a pointer, I think, to the markets that this thing will sell into in volume.

On the other hand, the way the processors are coupled - done by replacing the T2's on board 10 Gigabit Ethernet facility - demonstrates that Sun can now produce highly customized versions of the core CPU set and suggests what I believe may be a unique performance opportunity for this product line.

On the hardware customization side: suppose you consider a couple of million bucks no object for getting T2 machines that do FFT on short (16 way) vector processors - Sun has now shown it can do that with COTS parts that can be produced in volume.

The performance opportunity is a bit esoteric ( :) ) but comes down to this: there are time critical applications in which the majority of the processing effort goes into moving data between process groups - and the Solaris/T2+ combination lets you move relatively lightweight processes instead of "heavyweight" data across, for the expected four-way machine, 256 threads and 64 direct PCI/E channels. This possibility isn't going to change how products like Apache or even compilers are built, but should make it possible to do some things no one could before.
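The throughput and response time figures quoted from the NotesBench table can be checked with a few lines of arithmetic. This is only a sketch: the dictionary names and the `pct_change` helper are illustrative, and the numbers are copied from the T5220 and T5240 rows of the report.

```python
# Figures copied from the NotesBench report rows for the two Sun machines.
t5220 = {"users": 43000, "notesmark": 36240, "avg_rt_ms": 584}  # 1x T2
t5240 = {"users": 65000, "notesmark": 55101, "avg_rt_ms": 224}  # 2x T2 Plus


def pct_change(old, new):
    """Percentage change going from old to new (negative means a reduction)."""
    return (new - old) / old * 100.0


users_gain = pct_change(t5220["users"], t5240["users"])
notesmark_gain = pct_change(t5220["notesmark"], t5240["notesmark"])
rt_change = pct_change(t5220["avg_rt_ms"], t5240["avg_rt_ms"])

print(f"Users supported: {users_gain:+.1f}%")
print(f"NotesMark:       {notesmark_gain:+.1f}%")
print(f"Avg response:    {rt_change:+.1f}%")
```

Both throughput measures rise by a little over half while the average response time falls by a bit more than 60%, which is the asymmetry the paragraph above describes.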
Imagine, for example, that your application will get about 8GB worth of image data every three seconds - potentially 24 x 7; primary per image base processing now takes about 4.3 seconds on one of Mercury Computing's dual cell blades; secondary processing now takes another 8 seconds on one of those blades; you want to keep a minute's worth of data for instant replay; you generally expect to throw away more than 99.99% of all incoming data; and you want to move the entire system around on a truck. To do it now you'll need a large vehicle because you'll need to carry and power several rackmounts stuffed with cell blades - first because the things are incredibly fast at floating point, but terribly bad at throughput; and, equally importantly, because memory and bandwidth limitations combine with that playback requirement to force you to spend the majority of the effort you put into processing each arriving image just shuffling it around. Choose the T2+ instead and you'll get slower floating point but faster I/O and more storage flexibility - so, while the programming required might be a bit tricky (is there a Pulitzer for understatement?), I think success would give you something you could carry in a small launch or Hummer that would actually run faster and more reliably than anything else you could build.
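To put rough numbers on that scenario, here is a back-of-the-envelope sketch. It assumes (my assumption, not part of the scenario) that each 8GB arrival is one unit of work, that one blade handles one unit at a time, and that moving data between blades costs nothing, so the blade counts are a floor rather than an estimate. The variable names are illustrative.

```python
import math

# Scenario figures from the paragraph above.
BATCH_GB = 8        # data arriving per batch
ARRIVAL_S = 3       # seconds between batches
PRIMARY_S = 4.3     # primary processing time per batch on one blade
SECONDARY_S = 8     # secondary processing time per batch on one blade
REPLAY_S = 60       # instant replay window to retain

# Sustained ingest rate the system has to absorb.
ingest_gb_per_s = BATCH_GB / ARRIVAL_S

# Batches (and gigabytes) that must stay resident for instant replay.
replay_batches = REPLAY_S // ARRIVAL_S
replay_buffer_gb = replay_batches * BATCH_GB

# Minimum blades per stage just to keep up with arrivals
# (ASSUMPTION: one batch per blade at a time, zero data-movement cost).
primary_blades = math.ceil(PRIMARY_S / ARRIVAL_S)
secondary_blades = math.ceil(SECONDARY_S / ARRIVAL_S)

print(f"Ingest rate:      {ingest_gb_per_s:.2f} GB/s")
print(f"Replay buffer:    {replay_buffer_gb} GB ({replay_batches} batches)")
print(f"Blades (minimum): {primary_blades} primary + {secondary_blades} secondary")
```

Even this idealized floor needs five blades plus 160GB of live replay data; the real count climbs once the cost of shuffling each batch between blades is included, which is the point of the scenario.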


Talkback

19 comments
  • The 800 lb Gorilla in the room

    There is no doubt that Moore's Law is making some very amazing things happen in CPU land. The T2+ would be even better if Sun could make CPUs that can clock to 4 GHz (or EVEN 2 GHz), but for now the T2+ is surely amazing.

    Meanwhile back at the Bat Cave, the Riddler has given the caped crusader something to think about. As the computer gets faster, the storage devices . . . stay the same speed. So while Batman can process data about criminals ever faster, the computer is starving for data as the 15k FC drives are barely keeping up. So riddle me this - If your T2+ is a champion sprinter, then your storage is a man who's fat - so who's afraid of the big black bat?
    Roger Ramjet
    • Who's afraid? me..

      1 - Funny you should ask - This is close to Thursday's blog question.

      2 - It is possible to make and offer a cheap 3GHz+ T2 CPU. However... normal RAM would be far too slow to work with it and 64GB of on-board cache is currently both physically and financially improbable.
      murph_z
      • 3GHz+ cheaply?

        I find that hard to believe (mostly the cheap part).

        High-speed processors usually have really long pipelines. The T2 has a short pipeline. That way it doesn't require all the complex branch-prediction circuitry (not to mention pipeline circuitry) that Intel and Power processors have.

        This also makes them much less subject to stalls (e.g. waiting for memory) and flushes due to bad branch predictions. Stalls are further eliminated by making context switches really cheap (multiple sets of registers, TLBs, etc per core), so one can be done while waiting for memory.

        Anyway, in a nutshell the beauty of T2 is that it is SIMPLE. So I find it hard to believe that +3GHz could be had cheaply.
        Erik Engbrecht
        • T2+ = Titanic?

          The Itanium processor eschews pipeline prediction/branch prediction in order to make the hardware simpler. This of course means that you have shifted this work into the software (compiler). Of course the EPIC architecture also uses VLIW - which I assume T2+ does not. In any event, for as much criticism that Murph has for Itanium - his own favorite T2+ suffers from the same maladies. So which one is a stupider idea - the one that came first and fell on its face, or the one that had the advantage of going second (third?), and still managed to get nowhere (in terms of software/compiler/programming techniques)? Maybe we should create a new programming language for . . .

          *WHUMP*

          PAPPL has sacked the announcer and all responsible for the above content. PAPPL will continue to . . .

          *WHUMP*

          Watch out! There are Llamas . . . ;)
          Roger Ramjet
          • Not really the same

            IIRC, Itanic eschews complex branch predictions but keeps a complex pipeline, thereby putting the onus on software instead of on hardware. It's one of those things that sounds good in theory but didn't turn out so well...

            T2 has a short pipeline and can do cheap context switches rather than stalling, so it doesn't need complex branch prediction.
            Erik Engbrecht
          • Wump indeed!

            I have consistently said the Itanium was a decent technical idea that MS couldn't make work (because not x86), Intel only wanted to build in order to get a patentable instruction set (to kill AMD), and HP fell for because it wanted to become the MS/Intel hardware and services supplier.

            That said, it is not related to CMT - that's a wholly different set of ideas executed by people who not only could, but did.
            murph_z
          • Consecutive failures.

            I was going to title the post Whump-A$$, but I've had one post removed this week already.

            The best example of consecutive failure to learn from others' experience is from Mark Twain's "Fenimore Cooper's Literary Offenses".

            Twain is criticizing Cooper's writing in an episode in The Deerslayer. Read closely as we join the discussion in progress...



            Cooper made the exit of that stream fifty feet wide, in the first place, for no particular reason; in the second place, he narrowed it to less than twenty to accommodate some Indians.

            He bends a "sapling" to form an arch over this narrow passage, and conceals six Indians in its foliage. They are "laying" for a settler's scow or ark which is coming up the stream on its way to the lake; it is being hauled against the stiff current by rope whose stationary end is anchored in the lake; its rate of progress cannot be more than a mile an hour.

            Cooper describes the ark, but pretty obscurely. In the matter of dimensions "it was little more than a modern canal boat." Let us guess, then, that it was about one hundred and forty feet long. It was of "greater breadth than common." Let us guess then that it was about sixteen feet wide. This leviathan had been prowling down bends which were but a third as long as itself, and scraping between banks where it only had two feet of space to spare on each side. We cannot too much admire this miracle.

            A low-roofed dwelling occupies "two-thirds of the ark's length" -- a dwelling ninety feet long and sixteen feet wide, let us say -- a kind of vestibule train. The dwelling has two rooms -- each forty-five feet long and sixteen feet wide, let us guess. One of them is the bedroom of the Hutter girls, Judith and Hetty; the other is the parlor in the daytime, at night it is papa's bedchamber.

            The ark is arriving at the stream's exit now, whose width has been reduced to less than twenty feet to accommodate the Indians -- say to eighteen. There is a foot to spare on each side of the boat. Did the Indians notice that there was going to be a tight squeeze there? Did they notice that they could make money by climbing down out of that arched sapling and just stepping aboard when the ark scraped by?

            No, other Indians would have noticed these things, but Cooper's Indians never notice anything. Cooper thinks they are marvelous creatures for noticing, but he was almost always in error about his Indians. There was seldom a sane one among them.

            The ark is one hundred and forty feet long; the dwelling is ninety feet long. The idea of the Indians is to drop softly and secretly from the arched sapling to the dwelling as the ark creeps along under it at the rate of a mile an hour, and butcher the family. It will take the ark a minute and a half to pass under. It will take the ninety-foot dwelling a minute to pass under.

            Now, then, what did the six Indians do? It would take you thirty years to guess, and even then you would have to give it up, I believe.

            Therefore, I will tell you what the Indians did. Their chief, a person of quite extraordinary intellect for a Cooper Indian, warily watched the canal-boat as it squeezed along under him and when he had got his calculations fined down to exactly the right shade, as he judged, he let go and dropped. And missed the boat! That is actually what he did. He missed the house, and landed in the stern of the scow. It was not much of a fall, yet it knocked him silly. He lay there unconscious.

            If the house had been ninety-seven feet long he would have made the trip. The error lay in the construction of the house. Cooper was no architect.

            There still remained in the roost five Indians. The boat has passed under and is now out of their reach. Let me explain what the five did -- you would not be able to reason it out for yourself. No. 1 jumped for the boat, but fell in the water astern of it. Then No. 2 jumped for the boat, but fell in the water still further astern of it. Then No. 3 jumped for the boat, and fell a good way astern of it. Then No. 4 jumped for the boat, and fell in the water away astern. Then even No. 5 made a jump for the boat -- for he was a Cooper Indian. In that matter of intellect, the difference between a Cooper Indian and the Indian that stands in front of the cigar-shop is not spacious.

            The scow episode is really a sublime burst of invention; but it does not thrill, because the inaccuracy of the details throws a sort of air of fictitiousness and general improbability over it. This comes of Cooper's inadequacy as observer.
            Anton Philidor
        • A cheap chip != a cheap system

          notice I said they could make the chip cheaply, but customers couldn't afford the systems built around them. Why? because normal memory isn't remotely fast enough.

          Remember the complexities you complain of exist to reduce wait states on high gigahertz cpus accessing slow ram. If you don't have those work-arounds, you need very fast memory = very large piles of money => zero market share.
          murph_z
          • Still don't buy it

            You're right that memory is a big problem, but I don't think it is the only problem.

            Memory makes it so fast CPUs waste cycles, both by being in wait states and through bad branch predictions made while waiting for memory.

            But the long pipeline is needed to get the GHz economically. Just look at P4.

            So I don't think they could make the chip cheaply, especially not at their volume. Notice that only Intel's high-end Core chips are hitting this speed and they only did it recently.

            Now, I'm sure they could make it in the lab, and given some time make production quantities. Eventually they probably will. But I think there would be major investments to reach that point, and a period of low yields (and high prices) before the price could come down.
            Erik Engbrecht
    • Solved in a !!FLASH!!

      With more !!BANG!! for all the $ we're throwing about can't we get some of those drives which look like hard drives but used to have ram in them and now (I think) have flash in them?

      And put up with those 15K crawlers (nicely RAIDed) further out / for longer term storage?

      "That takes care of the Riddler Batman."

      "I'm not so sure, Robin. I think he was eavesdropping when the mad scientist said "the programming required might be a bit tricky". We must be ready for the Riddler to attack again at any moment."

      DON'T MISS NEXT WEEK'S COMPLICATED EPISODE!!!
      Ross44
  • RE: More comprehensive T2+ info collection

    Sun blogger Alan Parker has collected links to a wide range of T2+ related information - see:

    http://blogs.sun.com/allanp/entry/sun_s_cmt_goes_multi
    murph_z
  • Sun's fabs

    I partly agree, but I think Sun is still using 65nm fabs for the T2. If they were able to switch to a 45nm (or smaller feature size) process along with the usual manufacturing tricks/magic, that should allow them to bump up the clock without altering their design significantly.

    Also, a T2 has something like 500 million transistors, which isn't bad compared to a Core 2 Quad Penryn chip which has over 800 million. Also, the T2 has like a 342 square mm die size while a 45nm Core 2 Quad has a 210 square mm die size.

    So I don't think it's inconceivable that the lower transistor count coupled with a die shrink could lead to fairly high yield, smaller die, and therefore a fairly economical chip. Also, they could rely on water cooling or more exotic techniques if they wanted to make them fast on the cheap (relative to cost of die shrinks).
    t_mohajir
    • Supposed to be a reply to "Still don't buy it "

      Doh!
      t_mohajir
    • I don't think Sun has big fabs

      I think they outsource it. Sun is an engineering company at its core, and it shows in their financials.

      Intel is a manufacturer that happens to do engineering because it needs something to manufacture.

      I have no doubt that Intel at Intel scale could produce a really fast T2 at a low cost, but that's because it is their core competency.

      Recovering the non-recurring costs, including perfection of the fab process, would require more volume than Sun has.

      As a side note, stuff like water coolers would damage the marketing image of the T2. T2 is supposed to be power efficient.
      Erik Engbrecht
      • I think you're missing the point

        I said they could make the chip faster cheaply, but that doing so would raise the cost of the system beyond the marketable range. i.e. system costs are dictated by system components - and performance is dictated by the slowest parts - meaning that putting a really fast chip into a slow system isn't smart.

        Ever wonder why Xeon wastes so many cycles - or, how can a T2 at 1.4GHz blow away a 4 x 4 core Xeon on most benchmarks? Easy: the rest of the system forces the Xeon to spend most of its time doing nops - while Sun's CMT technologies replace those nops with runnable threads to achieve higher throughput at lower power.
        murph_z
        • The only problem left...

          The only problem left is writing code that can decently use all those threads.
          TheTruthisOutThere@...
          • Agreed

            Although I notice that some bigger apps - like Oracle's own financials/ERP package - were designed to operate on larger servers and so fit perfectly to CMT, but are often run by people so clueless they don't adjust for the number of CPUs and so produce some horribly bad results.
            murph_z
        • No, I get your point

          Even if Sun could make T2 crank out a lot more GHz and produce it cheaply, it wouldn't do much good because it would cost too much to make the rest of the system keep up.

          I agree with that. There are both business and technology issues at work, and I think Sun is doing the right thing and hitting a sweet spot in a wide array of areas.

          I also think if they tried to over double the clock rate of the chips, at least in a short timespan, they would veer way out of that sweet spot. The chips would cost too much, the systems to support them would cost too much, the systems would start being power hungry, and all this probably wouldn't yield that big of gains considering the investment and increased cost.

          You claimed Sun could cost effectively build a 3GHz T2 if they wanted, they just won't because it wouldn't make sense. I think they can't, for a variety of reasons that I've stated.

          If you have some evidence that they can, such as producing prototype units in reasonable quantity at that speed, or even credible claims from Sun people about it, I would be glad to stand corrected.

          If you want to discuss chip design issues, and why you think T2 is amenable to high GHz (I don't think it is), then we can do that.

          If you think Sun's suppliers both have the manufacturing capability to produce a 3GHz chip and would let Sun access it at a reasonable price without volume commitments beyond Sun's means, I would be happy to have a discussion around evidence.

          But right now, I think you made an empty claim and are simply trying to side-step the argument, unlike t_mohajir who addressed the technical issues.
          Erik Engbrecht
          • If they could they would have...

            I mean the current T2 (not the SMP-enabled T2+) has a max power consumption of 123 watts and a typical operating power consumption of 95 watts. Now, more than doubling the frequency to 3GHz would make it hotter than hell.

            // jesper
            JesperFrimann