James Laudon is one of the authors, along with Kunle Olukotun and Lance Hammond, of the book Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency, from Morgan & Claypool Publishers.
He used to work at Sun - and may still: all I know is that his former email address there has disappeared into the land of the unknown recipient.
While there he wrote a blog entry, dated December 06, 2005, offering the clearest, simplest explanation I've been able to find of the major structural difference between Intel-style "hyperthreading" and Sun's thread-level parallelism.
Here's the whole thing:
Threading the UltraSPARC T1
Well, the launch of the UltraSPARC T1 caused me to finally brave the waters of blogging. I've been working on the UltraSPARC T1 for about the last four years, and on multithreading and multiprocessors for the past twenty, and it's very gratifying to see many of the best ideas for architecting chip multiprocessors come together in the UltraSPARC T1.
I thought I'd use my first-ever blog entry to discuss the vertical multithreading used in the UltraSPARC T1. There are three main ways to multithread a processor: coarse-grain, vertical, and simultaneous. With coarse-grain threading, a single thread occupies the full resources of the processor until a long-latency event, such as a primary cache miss, is encountered. At that point, the pipeline is flushed and another thread starts executing, using the full pipeline resources. When that new thread hits a long-latency event, it will yield the processor to either another thread (if more than two are implemented in hardware) or the first thread (assuming its long-latency event has been satisfied). Coarse-grain threading has the advantage that it is less of an integral part of the processor pipeline than either vertical or simultaneous multithreading and can more easily be added to existing pipelines. However, coarse-grain threading has a big disadvantage: the large cost to switch between threads. As I described above, when a long-latency event like a cache miss is encountered, all the instructions in the pipeline behind the cache miss must be flushed from the pipeline and execution of the new thread starts filling the pipeline. Given the pipeline depth of modern processors, this means a thread switch cost in the tens of processor cycles. This large switch cost means that coarse-grain threading cannot be used to hide the effects of short pipeline stalls due to dependencies between instructions and even means that the thread switching latency will occupy much of the latency of a primary cache miss/secondary cache hit. As a result, coarse-grain multithreading has been primarily used when existing, single-threaded processor designs are extended to include multithreading.
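To put rough numbers on the switch-cost argument above, here is a minimal Python sketch. The function name and the cycle counts are assumptions chosen for illustration; they are not figures from the post or from any real processor. The model simply treats the flush-and-refill cost as cycles lost out of every stall.

```python
def hidden_cycles(stall_latency: int, switch_cost: int) -> int:
    """Cycles of a stall that another thread can fill with useful work.

    Simplified model (an assumption, not a measured result): switching
    threads wastes `switch_cost` cycles flushing and refilling the
    pipeline, so only the remainder of the stall can be overlapped.
    """
    return max(0, stall_latency - switch_cost)


# Hypothetical numbers, for illustration only.
SWITCH_COST = 15        # "tens of processor cycles" on a deep pipeline
L2_HIT_LATENCY = 20     # primary cache miss that hits in the secondary cache
DEPENDENCY_STALL = 3    # short stall between dependent instructions

if __name__ == "__main__":
    print("L2 hit, cycles hidden:", hidden_cycles(L2_HIT_LATENCY, SWITCH_COST))
    print("dependency stall, cycles hidden:", hidden_cycles(DEPENDENCY_STALL, SWITCH_COST))
    print("L2 hit with free switching:", hidden_cycles(L2_HIT_LATENCY, 0))
```

With these made-up values, most of the secondary cache hit latency is eaten by the switch itself and the short dependency stall cannot be hidden at all, which is the point the paragraph makes. Setting the switch cost to zero shows what a free thread switch would recover.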
The two remaining techniques for threading, vertical threading (VT) and simultaneous multithreading (SMT), switch threads on a much finer granularity (and not surprisingly are referred to as fine-grained multithreading). On a processor capable of multiple instruction issue, a SMT processor can issue instructions from multiple threads during the same cycle, while a VT processor limits itself to issuing instructions from only one thread each cycle. On a single-issue processor there is no difference between VT and SMT, as only one instruction can be issued per cycle, but since there is no issue of instructions from different threads in the same cycle, single-issue fine-grained multithreaded processors are labeled VT. Both SMT and VT solve the thread switch latency problem by making the thread switch decision part of the pipeline. The threading decision is folded in with the instruction issue logic. Since the issue logic is simply trying to fill the pipeline with instructions from all of the hardware threads, there is no penalty associated with "switching" between threads. However, there is a little extra complexity added to the issue logic as it now needs to pick instructions from multiple ready threads. This additional issue logic complexity is fairly small (certainly much smaller than all the other issue-related complexity that is present in a modern superscalar processor) and well worth it in terms of performance. The advantages of SMT and VT are that very short pipeline latencies (all the way down to a single cycle) can be tolerated by executing instructions from other threads between the instructions with the pipeline dependency. The ability to switch threads at no cost is the key to enabling the impressive performance of the UltraSPARC T1, as many commercial benchmarks have significant amounts of both memory and pipeline latency.
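The idea of folding the thread choice into the issue logic can be sketched with a toy simulation. This is a minimal sketch under stated assumptions, not the UltraSPARC T1's actual selection policy: it models a single-issue pipeline, assumes a round-robin pick among ready threads, and assumes every instruction leaves its thread stalled for a fixed number of cycles. The function name `simulate_vt` and all parameters are hypothetical.

```python
def simulate_vt(num_threads: int, stall_after_issue: int, cycles: int) -> float:
    """Fraction of cycles in which a single-issue pipeline issues an instruction.

    Assumptions (illustrative only): each cycle the issue logic picks the
    next ready thread in round-robin order at no switching cost, and after
    issuing, a thread cannot issue again for `stall_after_issue` cycles.
    """
    ready_at = [0] * num_threads  # earliest cycle at which each thread may issue
    next_thread = 0               # round-robin starting point
    issued = 0
    for cycle in range(cycles):
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                issued += 1
                ready_at[t] = cycle + 1 + stall_after_issue
                next_thread = (t + 1) % num_threads
                break  # single issue: at most one instruction per cycle
    return issued / cycles


if __name__ == "__main__":
    # Hypothetical 3-cycle stall after every instruction.
    for n in (1, 2, 4):
        print(n, "thread(s):", simulate_vt(n, stall_after_issue=3, cycles=1000))
```

Under these assumptions one thread keeps the pipeline busy a quarter of the time, two threads half the time, and four threads every cycle, because other threads' instructions fill the slots between dependent instructions. Real workloads have variable latencies, so actual utilization would differ.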
Most people are familiar with the hyperthreaded Intel processors, which employ SMT. They support two threads in hardware, and show modest gains on some parallel workloads. Given that SMT is the most aggressive of the three threading schemes, one would expect SMT to deliver the highest performance, but in general the performance gains seen from hyperthreading are small (and sometimes are actually performance losses). However, the gains seen from hyperthreading are not limited by the SMT but more by the memory system (a topic for a later post), and unfortunately the Intel hyperthreading implementation delivers a misleading message about the performance to be gained from fine-grained multithreading.
The UltraSPARC T1, on the other hand, was built from the ground up as a multithreaded chip multiprocessor, and each of the eight pipelines employs vertical threading of four hardware threads. The eight pipelines in the UltraSPARC T1 are short (6 stages), and one might be tempted to employ the slightly simpler coarse-grain threading. However, even on the UltraSPARC T1, the gains from vertical threading over coarse-grained multithreading ended up being substantial. In fact, the very earliest proposals at Afara Websystems for what became the UltraSPARC T1 employed coarse-grain threading. Rather quickly, the modest additional complexity of vertical threading was traded off against its performance gains and the switch to vertical threading was made. Now, roughly four years later, the performance resulting from that and many other architecture, implementation, and design decisions is being announced to the rest of the world. There's been a lot of hard work by a lot of people between then and now but, as the performance and performance/Watt numbers from the UltraSPARC T1 show, it's been worth it!
With its i7 technologies, Intel has largely adopted AMD's internal communications architecture and, because this removed some bottlenecks on cache sharing, has also brought back its early SMT-style hyperthreading. The result is that it can now claim to offer "an unprecedented 4-core, 8-thread design" that's remarkably similar to AMD's 2005 offerings, while Sun already offers 64 concurrent threads on each N2 processor and up to 256 fully SMP-capable threads on its Victoria Falls multiprocessors.