
The obligatory 'Victoria Falls' post

The reason Sun's new dual-socket CMT/SMP machines don't double the throughput of their uniprocessor predecessors is that other components, particularly memory, cost too much to speed up - but the additional threads and processing power work wonders on things like response time for CPU-constrained applications such as Lotus Domino.

Written by Paul Murphy, Contributor April 15, 2008 at 1:15 a.m. PT

For those who don't know, Sun's new T2+ machines extend the T2's CMT capabilities across multiple units to produce 16 and 32 core SMP machines capable of handling 128 and 256 concurrent threads respectively. Sun blogger Denis Sheahan provides a good overview of the current dual-socket releases here. By itself the T2 continues to set new performance records - Sun's bmseer usually has the latest; most recently a pair of new SPECint_rate2006 and SPECfp_rate2006 records. The new machines don't offer the kind of quantum leap the T2 did - obviously because the T2+ is a continuation within the UltraSPARC SMP/CMT line and less obviously because market pricing constraints limit the throughput possible in other parts of the system. The most illustrative benchmark result I've seen on this, also as reported by bmseer, involves Lotus Domino. Here's part of that report:

Lotus Domino 7.0.1 NotesBench R6iNotes Performance Chart (in increasing $/User order)
Users = number of users supported (bigger is better)
NotesMark = the benchmark metric (bigger is better)
$/User = cost per user (smaller is better)
System	Chip (GHz)	Cores/Chip	OS	Users	NotesMark	Domino partitions	Avg response time	$/User
Sun T5240	2x US T2 Plus 1.2	8	Sol10	65000	55101	6	224ms	$2.84
IBM-P5 560Q	2x POWER5+ 1.8	4	AIXL	55000	46103	6	848ms	$4.89
Sun T5220	1x US T2 1.4	8	Sol10	43000	36240	6	584ms	$2.89
Complete benchmark results may be found at the Lotus NotesBench website http://www.notesbench.org.
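
As a quick sanity check on the figures in the table, here is a minimal Python sketch that works out how the dual-socket T5240 compares with the single-socket T5220. The numbers are copied from the table; the dictionary layout and the `ratio` helper are illustrative names of mine, not anything from the NotesBench report.

```python
# Figures copied from the NotesBench excerpt above.
# The structure and helper names are illustrative only.
results = {
    "Sun T5240":   {"users": 65000, "notesmark": 55101, "avg_rt_ms": 224},
    "IBM-P5 560Q": {"users": 55000, "notesmark": 46103, "avg_rt_ms": 848},
    "Sun T5220":   {"users": 43000, "notesmark": 36240, "avg_rt_ms": 584},
}


def ratio(metric, new="Sun T5240", old="Sun T5220"):
    """Return the new/old ratio for one metric."""
    return results[new][metric] / results[old][metric]


users_gain = ratio("users") - 1          # fractional increase in supported users
notesmark_gain = ratio("notesmark") - 1  # fractional increase in NotesMark
rt_reduction = 1 - ratio("avg_rt_ms")    # fractional drop in average response time

print(f"Users supported: +{users_gain:.0%}")
print(f"NotesMark:       +{notesmark_gain:.0%}")
print(f"Avg response:    -{rt_reduction:.0%}")
```

Keep in mind that the two Sun machines also differ in clock speed (1.2 GHz against 1.4 GHz), so this is a system-to-system comparison rather than a clean measure of what a second socket adds.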

Notice that doubling the CPU count produced only about a fifty percent increase in throughput - an artifact of limitations elsewhere in the system. Users, however, don't care about throughput in applications like this: they care about response time - and that's where the T2+ really shines, cutting the average response time from 584ms to only 224ms - an improvement of roughly 60%. That's an artifact of the CMT architecture and a pointer, I think, to the markets that this thing will sell into in volume.

On the other hand, the way the processors are coupled - done by replacing the T2's on-board 10 Gigabit Ethernet facility - demonstrates that Sun can now produce highly customized versions of the core CPU set, and suggests what I believe may be a unique performance opportunity for this product line.

On the hardware customization side: suppose you consider a couple of million bucks no object for getting T2 machines that do FFT on short (16 way) vector processors - Sun has now shown it can do that with COTS parts that can be produced in volume.

The performance opportunity is a bit esoteric ( :) ) but comes down to this: there are time-critical applications in which the majority of the processing effort goes into moving data between process groups - and the Solaris/T2+ combination lets you move relatively lightweight processes instead of "heavyweight" data across, for the expected four-way machine, 256 threads and 64 direct PCI/E channels. This possibility isn't going to change how products like Apache or even compilers are built, but should make it possible to do some things no one could before.
Imagine, for example, that your application will get about 8GB worth of image data every three seconds, potentially 24 x 7; primary per-image base processing now takes about 4.3 seconds on one of Mercury Computing's dual Cell blades; secondary processing takes another 8 seconds on one of those blades; you want to keep a minute's worth of data for instant replay; you generally expect to throw away more than 99.99% of all incoming data; and you want to move the entire system around on a truck. To do it now you'll need a large vehicle because you'll need to carry and power several rackmounts stuffed with Cell blades - first because the things are incredibly fast at floating point but terribly bad at throughput; and, equally importantly, because memory and bandwidth limitations combine with that playback requirement to force you to spend the majority of the effort you put into processing each arriving image just shuffling it around. Choose the T2+ instead and you'll get slower floating point but faster I/O and more storage flexibility - so, while the programming required might be a bit tricky (is there a Pulitzer for understatement?), I think success would give you something you could carry in a small launch or Hummer that would actually run faster and more reliably than anything else you could build.
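
To put rough numbers on the data-movement side of that scenario, here is a small back-of-the-envelope sketch in Python. It uses only the figures quoted above (8GB every three seconds, a one-minute replay window, more than 99.99% discarded); the variable names are mine, and it assumes the 8GB arrives at a steady rate, which the scenario does not actually promise.

```python
# Back-of-the-envelope sizing for the hypothetical imaging scenario above.
# Assumes a steady arrival rate; all names are illustrative.
GB_PER_BURST = 8          # about 8GB of image data ...
BURST_INTERVAL_S = 3      # ... every three seconds
REPLAY_WINDOW_S = 60      # keep a minute's worth for instant replay
MAX_KEEP_FRACTION = 0.0001  # "more than 99.99%" is thrown away
SECONDS_PER_DAY = 24 * 60 * 60

# Sustained rate the system must absorb.
ingest_rate_gb_s = GB_PER_BURST / BURST_INTERVAL_S

# Data that must stay addressable for instant replay.
replay_buffer_gb = (REPLAY_WINDOW_S // BURST_INTERVAL_S) * GB_PER_BURST

# Volume arriving over a full day of 24 x 7 operation.
daily_ingest_gb = (SECONDS_PER_DAY // BURST_INTERVAL_S) * GB_PER_BURST

# Upper bound on what survives the discard step each day.
daily_kept_gb = daily_ingest_gb * MAX_KEEP_FRACTION

print(f"Sustained ingest:  {ingest_rate_gb_s:.2f} GB/s")
print(f"Replay buffer:     {replay_buffer_gb} GB")
print(f"Daily ingest:      {daily_ingest_gb} GB")
print(f"Daily kept (max):  {daily_kept_gb:.2f} GB")
```

The point of the exercise is only that the buffer and the sustained rate, not the final retained volume, dominate the sizing problem, which is consistent with the argument that I/O and memory matter more here than raw floating point.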

