Coolthreads vs hyperthreading

Summary: Laudon's description of the differences between Sun's SMP-capable CoolThreads approach to multi-threading and Intel's hyperthreading is as concise, lucid, and simple as anything I could find - so over to him:


James Laudon is one of the authors, along with Kunle Olukotun and Lance Hammond, of the book Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency, from Morgan & Claypool Publishers.

He used to work at Sun - and may still: all I know is that his former email address there has disappeared into the land of the unknown recipient.

While there he wrote a blog entry, dated December 6, 2005, offering the clearest, simplest explanation I've been able to find of the major structural difference between Intel-style "hyperthreading" and Sun's thread-level parallelism.

Here's the whole thing:

Threading the UltraSPARC T1

Well, the launch of the UltraSPARC T1 caused me to finally brave the waters of blogging. I've been working on the UltraSPARC T1 for about the last four years, and on multithreading and multiprocessors for the past twenty, and it's very gratifying to see many of the best ideas for architecting chip multiprocessors come together in the UltraSPARC T1.

I thought I'd use my first-ever blog entry to discuss the vertical multithreading used in the UltraSPARC T1. There are three main ways to multithread a processor: coarse-grain, vertical, and simultaneous. With coarse-grain threading, a single thread occupies the full resources of the processor until a long-latency event, such as a primary cache miss, is encountered. At that point, the pipeline is flushed and another thread starts executing, using the full pipeline resources. When that new thread hits a long-latency event, it will yield the processor to either another thread (if more than two are implemented in hardware) or the first thread (assuming its long-latency event has been satisfied). Coarse-grain threading has the advantage that it is less of an integral part of the processor pipeline than either vertical or simultaneous multithreading and can more easily be added to existing pipelines. However, coarse-grain threading has a big disadvantage: the large cost to switch between threads. As I described above, when a long-latency event like a cache miss is encountered, all the instructions in the pipeline behind the cache miss must be flushed from the pipeline and execution of the new thread starts filling the pipeline. Given the pipeline depth of modern processors, this means a thread switch cost in the tens of processor cycles. This large switch cost means that coarse-grain threading cannot be used to hide the effects of short pipeline stalls due to dependencies between instructions and even means that the thread switching latency will occupy much of the latency of a primary cache miss/secondary cache hit. As a result, coarse-grain multithreading has been primarily used when existing, single-threaded processor designs are extended to include multithreading.
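An illustrative aside, not part of Laudon's post: the switch-cost argument above can be sketched as a toy cycle-count model. Everything here is an assumption made for illustration: the function name coarse_grain_utilization, the fixed run length between misses, and the fixed miss latency and flush penalty. It is not a model of any real processor.

```python
def coarse_grain_utilization(n_threads, run_len, miss_latency, switch_cost,
                             cycles=10_000):
    """Fraction of cycles doing useful work under coarse-grain threading.

    Toy assumptions: every thread runs run_len useful cycles, then takes a
    miss that resolves miss_latency cycles later. Switching to a different
    thread flushes and refills the pipeline, costing switch_cost cycles.
    """
    ready_at = [0] * n_threads   # cycle at which each thread may run again
    current, now, useful = 0, 0, 0
    while now < cycles:
        # The current thread runs until its next miss (or the end of the run).
        work = min(run_len, cycles - now)
        useful += work
        now += work
        ready_at[current] = now + miss_latency
        # Choose the thread whose miss resolves soonest.
        nxt = min(range(n_threads), key=lambda t: ready_at[t])
        if nxt == current:
            # Nobody else to run: simply wait out the miss.
            now = ready_at[nxt]
        else:
            # Pay the flush penalty, and wait longer if nxt is not ready yet.
            now = max(now + switch_cost, ready_at[nxt])
            current = nxt
    return useful / cycles


if __name__ == "__main__":
    print(coarse_grain_utilization(1, 10, 30, 20))  # one thread, no switching
    print(coarse_grain_utilization(4, 10, 30, 0))   # four threads, free switch
    print(coarse_grain_utilization(4, 10, 30, 20))  # four threads, costly switch
```

With these made-up numbers, a 20-cycle switch penalty eats most of the 30-cycle miss it is supposed to hide, so four threads do only a little better than one. That is the effect the paragraph above describes for a primary cache miss that hits in the secondary cache.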

The two remaining techniques for threading, vertical threading (VT) and simultaneous multithreading (SMT), switch threads on a much finer granularity (and not surprisingly are referred to as fine-grained multithreading). On a processor capable of multiple instruction issue, an SMT processor can issue instructions from multiple threads during the same cycle, while a VT processor limits itself to issuing instructions from only one thread each cycle. On a single-issue processor there is no difference between VT and SMT, as only one instruction can be issued per cycle, but since there is no issue of instructions from different threads in the same cycle, single-issue fine-grained multithreaded processors are labeled VT. Both SMT and VT solve the thread switch latency problem by making the thread switch decision part of the pipeline. The threading decision is folded in with the instruction issue logic. Since the issue logic is simply trying to fill the pipeline with instructions from all of the hardware threads, there is no penalty associated with "switching" between threads. However, there is a little extra complexity added to the issue logic as it now needs to pick instructions from multiple ready threads. This additional issue logic complexity is fairly small (certainly much smaller than all the other issue-related complexity that is present in a modern superscalar processor) and well worth it in terms of performance. The advantages of SMT and VT are that very short pipeline latencies (all the way down to a single cycle) can be tolerated by executing instructions from other threads between the instructions with the pipeline dependency. The ability to switch threads at no cost is the key to enabling the impressive performance of the UltraSPARC T1, as many commercial benchmarks have significant amounts of both memory and pipeline latency.
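Another illustrative aside, not part of Laudon's post: a minimal sketch of vertical threading on a single-issue pipeline, where the issue logic picks one ready thread per cycle and switching costs nothing. The function name vertical_threading_utilization, the round-robin selection policy, and the fixed stall after every instruction are all assumptions chosen to keep the model small. The post does not say how the UltraSPARC T1 actually selects threads.

```python
def vertical_threading_utilization(n_threads, stall_after_issue, cycles=10_000):
    """Fraction of issue slots filled on a single-issue, VT-style pipeline.

    Toy assumptions: after a thread issues an instruction it cannot issue
    again for stall_after_issue cycles (a stand-in for a pipeline
    dependency). Each cycle the issue logic scans the threads round-robin
    and issues from the first ready one, with no switch penalty.
    """
    ready_at = [0] * n_threads   # earliest cycle each thread may issue again
    last = -1                    # thread that issued most recently
    issued = 0
    for now in range(cycles):
        # Start the scan just after the thread that issued last.
        for step in range(1, n_threads + 1):
            t = (last + step) % n_threads
            if ready_at[t] <= now:
                ready_at[t] = now + 1 + stall_after_issue
                last = t
                issued += 1
                break            # single issue: at most one instruction per cycle
    return issued / cycles


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, vertical_threading_utilization(n, stall_after_issue=3))
```

In this sketch, a three-cycle stall leaves a lone thread idle most of the time, while four threads interleave tightly enough to fill every issue slot. The same stalls would be too short to be worth a pipeline flush under coarse-grain threading.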

Most people are familiar with the hyperthreaded Intel processors, which employ SMT. They support two threads in hardware, and show modest gains on some parallel workloads. Given that SMT is the most aggressive of the three threading schemes, one would expect SMT to deliver the highest performance, but in general the performance gains seen from hyperthreading are small (and sometimes are actually performance losses). However, the gains seen from hyperthreading are not limited by the SMT but more by the memory system (a topic for a later post), and unfortunately the Intel hyperthreading implementation delivers a misleading message about the performance to be gained from fine-grained multithreading.

The UltraSPARC T1, on the other hand, was built from the ground up as a multithreaded chip multiprocessor, and each of the eight pipelines employs vertical threading of four hardware threads. The eight pipelines in the UltraSPARC T1 are short (6 stages), and one might be tempted to employ the slightly simpler coarse-grain threading. However, even on the UltraSPARC T1, the gains from vertical threading over coarse-grained multithreading ended up being substantial. In fact, the very earliest proposals at Afara Websystems for what became the UltraSPARC T1 employed coarse-grain threading. Rather quickly, the modest additional complexity of vertical threading was traded off against its performance gains and the switch to vertical threading was made. Now, roughly four years later, the performance resulting from that and many other architecture, implementation, and design decisions is being announced to the rest of the world. There's been a lot of hard work by a lot of people between then and now, but as the performance and performance/Watt numbers from the UltraSPARC T1 show, it's been worth it!

With its i7 technologies, Intel has largely adopted AMD's internal communications architecture and, because this removed some bottlenecks on cache sharing, has also brought back its early SMT-style hyperthreading. The result is that it can now claim to offer "an unprecedented 4-core, 8-thread design" that's remarkably similar to AMD's 2005 offerings, while Sun already offers 64 concurrent threads on each N2 processor and up to 256 fully SMP-capable threads on its Victoria Falls multi-processors.

Topics: Processors, Hardware, Networking

  • Thanks

    I appreciate the effort to give a great explanation
  • So a guy who admittedly works for Sun ...

    ... thinks their implementation is better than Intel's. Shocker! This from a guy that rails on any TCO study sponsored by Microsoft as not being objective. I wonder how the Intel guy would spin it.
    • There's no satisfying you

      [So a guy who admittedly works for Sun ...
      ... thinks]


      [their implementation is better than Intel's. Shocker!]

      How much does Murph have to do on this thread topic? First he is accused of lying, then of selective memory, then of not understanding. He is berated for not doing research and dismissed as an ABMer.

      I think this article sums up threads very nicely. Too bad you cannot believe ANYTHING that Murph (or anyone from Sun) has to say.
      Roger Ramjet
      • Spoken like the ....

        Murph's lap dog you are. He could try quoting from an impartial source rather than publishing the Sun marketing points. He attacks Microsoft for funding TCO studies but uses a biased report from Sun to prove his. That is called hypocrisy.

        How is your projection of the WiMax steamroller working out for you anyway?
        • Lap Dog huh?

          Murph and I share the same type of background - UNIX (administration). But we do disagree. He prefers centralized computing whereas I prefer distributed. He believes that the "goodness" of UNIX should be isolated to the server platform - where I believe that UNIX should be on all computers great and small. So in some ways I'm even more radical than Murph.

          Chip designers are rarely independent (Jay Miner!), so everything they write about their craft is from a certain corporate base. This specific article spelled out the different types of multithreading and how each company employs them. That part is factual. As for his opinion on what is BEST - well, you M$hills hate hearing "best" without M$ being the recipient, but you REALLY hate hearing the word "best" being applied to a M$ competitor.

          Murph has presented one side of a discussion with an expert witness. In a court of law, this is all that he is required to do . . .

          Ask Erik Engbrecht about WiMAX. He is a customer of theirs in Baltimore. See how he likes it.
          Roger Ramjet
          • Lap Dog Yes!

            Why else did you feel the need to rush to his aid so passionately? Was it because you felt he couldn't defend himself or because you share his belief?

            While it is true chip designers have biases, those biases are extremely slanted when they are talking about their own products. What if Murph found someone from his beloved PPC to comment on the subject?

            My favorite part is the M$. As if $un wasn't in it for the money. That is really original and mature.

            Finally the whole WiMax steamroller thing. One person in Baltimore would hardly classify as a steamroller. Nice dodge though.

            Most of you Unix guys are the ones that failed to make the switch when the PC steamroller came along. You failed to adapt. That is why you continue to be mired in the 70s. Wouldn't want to get you out of your comfort zone!
          • mired in the 70s?

            but... all the paradigms that microsoft has been reintroducing in windows seven have been part of various *nix operating systems for years... if *nix guys are mired in the 70s, where in time is microsoft?

            fact of the matter is that PCs are designed for use by one market, and unix was designed for use in another. with the convergent markets we see today, there is a lot of room for overlap, and the solution you choose largely has to do with your specific goals.

            if you want to deal with fragile application maintenance, tissue paper security, non-standard protocols, designed obsolescence, weak architectural designs passed off as "investment protection", and the like... go ahead, invest in a windows infrastructure.

            if you want inherent security, stability, extensibility, worldwide standards, etc... pick a *nix.

            Intel and its competitors, on the other hand, are always leap-frogging one another. However, Sun and AMD have the lead more often than Intel does. I realize it's unfair pitting just one company against two, but that's the price of market leadership. Intel gets by, and its predatory and anti-competitive market practices serve it well. Intel are geniuses at marketing and lying to the consumer's face. Admittedly, that's capitalism.

            Fact is that Sun has Intel beat soundly on this one. And AMD has them trounced as well. The only reason Intel hasn't keeled over yet is because OEMs love the big rebates that they get for not selling Intel competitors' products. It's not that they suck at what they do, in fact I'm on a core2 right now. It's just that they are in a paradigmatic time lag behind Sun and AMD and know how to hide it with fancy non-innovative solutions and exceptional marketing.
          • Another party heard from(nt)

            Same old clap trap Murph pushes. Just ain't buying it!
      • Thanks Roger!

        but don't bother - keep in mind that in the land of the blind the one eyed man is roundly hated.

        • Don't hate me(nt)

      • Too much apologizing Roger

        A four-year-old article by a Sun employee (who can't be found), trying to make out Sun's approach is better. Hmmm, didn't work did it? But then Murph would probably agree that history is written by the victors and Sun is fast approaching sunset.

        Once again Murph cannot give any real world examples of his approach. All I ask is ONE example of Murph shepherding a company through a Windows to *nix change. What's so hard about this? If Murph has any connection to the real world he would have already done this and be presenting us with lots of data on an actual conversion. Instead we get castles in the air, ideology or some marginal example that Murph manages to troll from the Web.

        Murph, if this is so good, life changing and will bring about peace in our time, then how about actually DOING IT!

        In fact how about an example of any client who's still talking to you ;-)
    • Rebuttal from a major OEM employee.

      Full disclosure dude, works both ways. Now, where is the writeup wrong? You being extremely pro MS and very defensive regarding any disrupting technology (understandable) doesn't mean you have nothing useful to say, so why should it be true for those who don't have your preconditions?

  • Intel: Good and bad?

    So Murph,
    Intel is bad unless it runs OSX?
    Please elaborate on your praise of all things Apple if Intel is so bad.

    Thanks and not holding my breath.

    • Dear Joe

      Intel is great - if your scope of comparison is limited to intel. Compare it to AMD and it's a follower, step outside x86 and it's not in the game.

      As for Apple - bet you can't find one positive thing I've said about Apple's x86 decision.
      • Fair enough

        ..As for Apple - bet you can't find one positive thing I've said about Apple's x86 decision....

        Fair enough.
        Same applies to Linux then.
        The issue is when you mention Intel it is usually WinTel and never include the other two.

        Thanks for the clarification.
        BTW, my take from your threading coverage is that it relates to servers, desktops are a different kettle of fish.
        There are more serious bottlenecks relating to desktop applications than the processor threading model.


        • Mostly yes

          1) Yes I mostly mention wintel in the threading context because windows threads co-evolve with Intel threads and are correspondingly different from unix threads as defined by the Unix leading edge - Solaris/SPARC.

          2) I also argue that Linux is largely captive to x86 because it is highly optimized for it. That affects its threading model too - leaving it mostly in the middle of the road: more like posix than like solaris or windows.

          3) The Mac's underlying BSD has a clean, early 90s, threads implementation - and is not optimal for x86. Darwin on an arm/ppc outperforms darwin on atom for that reason.

          4) Yes, desktop apps are often different from server apps and intel's ability to shunt some code into empty processor cycles has a higher probability of being useful on a desktop than on a server - while Sun's LWP approach doesn't differentiate between workloads very much but gets volume throughput at the cost of single process slowdowns and so seems to discriminate against single process desktop functions.

          However... how fast does a processor really have to be to run most desktop apps? Atom processors work - so single threads on 1400MHz N2s should too (and do).
  • RE: Coolthreads vs hyperthreading

    Good article. Sun definitely wins on threading and I think on performance / watt. No arguing that. What Sun seems also ready to admit is that single thread performance isn't all that good in T1 and they are addressing that in Rock.

    The second concern is that all that threading may not be as useable in all server workloads.

    My third concern is that when quoting benchmarks and providing links, they all seem to come from SUN, which is certainly fair, but it is also likely that they pick the type of workloads that show off their processor best. It will be interesting to see how the Nehalem based Xeon fares in comparison (more parallelism on independent benchmarks).

    Fourth, given the performance crown for servers, people may still want to develop for x86, because its reach is broadening (mobile to server), and it is good enough.
    • Agreed

      I often quote the sun coolthreads benchmark collection because it's a handy collection. Most of the benchmarks reported there are certified by third parties - spec, tpc, etc.
  • Compatibility will always win out over better

    Which is why Intel has nothing to worry about.
    Michael Kelly
  • Another classic fairy tale from our beloved Murph.

    He's fast becoming the next Grimm who transports the reader to far off magical places and contraptions far beyond the realms of reality.