The rise, fall, and rise of the supercomputer in the cloud era

Though the personal computer was born from garage projects, the supercomputer had been relegated to the back of the garage. That was until a handful of trends conspired to press the reset button for the industry. Now the race is back on.

A supercomputer is any mechanism whose performance capability, either by design or by default, enables it to compete -- effectively or otherwise -- in the market for functionality and information. There have been times throughout history where handfuls of spare processors, cobbled together with homemade substrates, produced supercomputers. And because they were super, they qualified.

Today, supercomputing performance cannot be achieved accidentally. It must be designed willfully, intentionally, and with a modicum of tolerance for both corporate and international politics.

The state of the supercomputing market


To be a supercomputer in today's market is to be something other than cobbled together from off-the-shelf parts. A modern system is intentionally architected for a single purpose: Parallel processing. This is not the same as multitasking, where a scheduling mechanism juggles multiple applications, adhering to some manner of concurrency.

Ordinary multicore processors, intended for use in personal computers and data center servers, may be incorporated in even the fastest supercomputers. Indeed, at the turn of the century, most machines that qualified at the time as supercomputers were composed of commercial, off-the-shelf (COTS) components. But a number of factors -- some related, some coincidental -- converged in recent years to re-inspire purpose-built supercomputer architecture:

  • Moore's Law collided head-on with physics. The economic principle, first observed by Intel co-founder Gordon Moore, suggested that "cramming" transistors onto processors in smaller spaces would be a good way of meeting customer expectations for higher yields. There's circumstantial evidence that this method of evolving processors, at some point soon if not already, may no longer be physically possible. At any rate, Intel has in recent years shifted away from its classic "tick-tock" production cadence, toward a more nuanced product approach that relies upon multiple layers of premium quality. In so doing, Intel has drawn more attention to the high-performance side of its Xeon product line, making performance a "thing" again.

  • The Internet removed the wrapper from software. In the PC era, both hardware and software were packaged, so it was harder for consumers to judge the result of a high-performance task in comparison against an application in a box with a features list, a cellophane wrapper, and a discount price. Today, most of the world's functionality (as well as its content) is delivered through the Web, so consumers have become more comfortable with a more figurative ideal of what software is.
  • The cloud brought market value back to stand-alone functionality. Once enterprises became accustomed to leasing virtual servers from service providers such as Amazon, Rackspace, and GoGrid, they perceived the functions of their business as something cultivated by their developers through the Web, rather than something installed by the IT department from a set of disks. This changed the nature of computing as a product, back to something much closer to what it started out to be.

  • Global warming became undeniable (except in executive tweets). Rapid changes in the Earth's living conditions brought forth renewed attention on the systems used to forecast those conditions, especially as the E.T.A. for dire consequences crossed the zone from our descendants' lifetimes into our own.

Many have argued over the years that Moore's Law has no bearing upon supercomputing performance, since it deals with the transistor count of processors and not the petaflops count of giant machines. As you'll see later, though, those in charge of funding the development of supercomputing through the years have made genuine efforts to correlate the famous formula (doubling transistor count on a commercial processor die every 18 months) with the very definition of supercomputers, specifically for purposes of export regulations.

What's more (pun possibly intended), scientists and engineers actively engaged in the architecture of the next generation of highest-performance machines, including so-called "neuromorphic" systems and quantum computers, have for the last three years declared themselves living in the "Post-Moore's Era of Supercomputing" (PMES), and produce an annual conference to help move it along.

A thousand-fold: Exascale resets the goal posts

Supercomputing is the modern descendant of the world's first computing industry. Though supercomputers are typically the stars of the show, in any public discussion or in articles such as this, it's the work they do that forms the industry's axis. PCs have always been a business; supercomputing is a job.

"We have problems that we need to solve better than the ones we do today," said Bert Still, leader of application strategy and computational physicist with Lawrence Livermore National Labs, during a conference panel session a few years ago. "We know there are things that are missing. We know that things are changing, evolving over time, which requires us to have better, higher fidelity models than we have now, and that means that we need more computing to be able to compute those models."

What is the actual job a supercomputer performs?

The job at hand that Still identified is extrapolation -- the extraction of meaning from the data and from the model. A supercomputer job is a model of a situation, typically a real-world one, which ravenously feeds upon data. To be engaged in supercomputing is to believe in the power of the algorithm to distill valuable, meaningful information from the repeated implementation of procedural logic. At the foundation of supercomputing are two ideals: one that professes that today's machine will eventually reach a new and extraordinarily valuable solution, followed by a second and more subtle notion that today's machine is a prototype for tomorrow's.

What matters in the end, Still pointed out, is what one tries to extrapolate. You can look at the weather today, he said, but you can't extrapolate from it the weather report for three months from now. But you can look at the Fukushima nuclear disaster, and you can make reasonable projections for when the cloud from that event would reach the U.S. west coast.

"There are some problems you can address with certainty," said Still, "and there are other ones that we really don't know how to address yet. So we have large problems that need to be solved and they require both models, as well as computing beyond today's reach. That really is what's driving our interest in exascale."

The real definition of "exascale"

Exascale is not a brand name, a supercomputer vendor, or a software initiative. Much more simply, it's a goal. And frankly, it's an arbitrary one. It's this: one quintillion operations per second. It's a kind of mathematical moonshot. In 2011, President Obama directed that $112 million be appropriated in the Fiscal Year 2012 Federal Budget for the Dept. of Energy, for a policy-supported project devoted to exascale computing. For FY 2018, some $232.7 million was appropriated.

There may be no way to prove a direct correlation between taxpayer funds and supercomputer performance, even with a supercomputer. But performance has scaled up, and the U.S. reclaimed the lead in that race. In late 2011, the highest performing American supercomputer was #3 on the Top 500 list: a Cray XT5 operated by the DoE's Oak Ridge National Laboratory, posting about 1,759 trillion floating point operations per second (1.76 petaflops). A Japanese machine built by Fujitsu was #1 with 10.51 petaflops; a Chinese machine #2 at 2.57.


The #1-ranked supercomputer in the world as of November 2018, Oak Ridge National Laboratory's "Summit" (Image: Creative Commons/Photo by Carlos Jones, ORNL)

In November 2018, an IBM machine built for Oak Ridge, called Summit [pictured above], posted a score of 143.5 petaflops, with another IBM model for Livermore Labs posting 94.6. A Chinese machine just barely held onto #3 with 93 petaflops.

But that's still somewhat short of the goal: one exaflop, or 10^18 operations per second. That's what exascale actually means. And the year the DoE projected for achieving exascale, when its taxpayer-funded project began, was 2018.
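To put "somewhat short" into numbers, here is a minimal Python sketch that compares Summit's November 2018 score with the exascale goal. The two figures come from the text; the variable names are only illustrative.

```python
# Figures from the article: exascale is 10^18 operations per second,
# and Summit posted 143.5 petaflops in November 2018.
EXAFLOP = 10**18   # one quintillion operations per second
PETAFLOP = 10**15  # one quadrillion operations per second

summit = 143.5 * PETAFLOP

fraction_of_goal = summit / EXAFLOP   # share of the exascale goal reached
shortfall_factor = EXAFLOP / summit   # how many Summits equal one exaflop

print(f"Summit reaches {fraction_of_goal:.1%} of exascale")
print(f"The goal is about {shortfall_factor:.1f} times Summit's score")
```

By this arithmetic the fastest machine of late 2018 sat at roughly one-seventh of the target.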

"To achieve these increases in capability by 2018," stated a DoE publication in 2010 [PDF], "significant acceleration in both hardware and software development is required. This could be accomplished through an intensive co-design effort, where system architects, application software designers, applied mathematicians, and computer scientists work interactively to characterize and produce an environment for computational science discovery that fully leverages these significant advances in computational capability."

That co-design effort is indeed taking place. But in recent years, it's been commandeered by the Post-Moore's Era contingent. Together, they've devoted their efforts to the advancement of quantum computing -- the theoretical technology based on a dysfunctional behavior of physics that, if it comes about, would render today's supercomputers obsolete in a quintillionth of a second.

What differentiates an algorithm from an application?

Through the years, parallelism has been confused with multitasking. Your PC and your smartphone juggle the execution of multiple applications all the time. A supercomputer is designed to be dedicated to the task of rendering a result for the oldest (and, many believe, best) form of computer program there is: the algorithm. (The name is in honor of a 9th century Persian mathematician, Abu Ja'far Mohammed ibn Musa al-Khwarizmi, whose writings helped introduce Hindu-Arabic decimal numerals to a wider audience.)

It is the capability to run an algorithm that differentiates a computer from a calculator, which resolves formulas and arithmetic problems. An algorithm is a logical procedure engineered to produce a solution for a situation, given any kind of rational input, through the same sequence of procedural steps. Put another way, do these things, in this order, and whatever input you give it, your output will be what you want. An algorithm has these characteristics:

  • A finite set of steps. It might seem obvious that the number of instructions in a program is never infinite, but groups of steps in an algorithm may be repeated, perhaps indefinitely. So there must be an exit clause -- a condition which, once satisfied, stops the repetition.
  • A clear point of termination. An algorithm knows when it's done, and when the solution is reached.
  • A minimal number of steps. The best algorithms through history are optimized and refined to require the fewest operations.
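As an illustration of those three characteristics, here is a short Python sketch of Euclid's method for the greatest common divisor. The choice of Euclid's method is ours, not the article's; it simply happens to show a repeated step, an exit clause, and a clear point of termination in a few lines.

```python
def gcd(a: int, b: int) -> int:
    """Greatest common divisor of two integers, by Euclid's method."""
    a, b = abs(a), abs(b)
    # Repeated step: replace (a, b) with (b, a mod b).
    # Exit clause: the loop stops once the remainder b reaches zero.
    while b != 0:
        a, b = b, a % b
    # Point of termination: a now holds the solution.
    return a


print(gcd(1071, 462))  # 1071 -> 462 -> 147 -> 21 -> 0, so the answer is 21
```

The same sequence of steps works for any pair of integers, which is the property the text describes: whatever rational input you give it, the procedure ends with the answer.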

The first electronic computers were essentially gear-boxes custom-built for resolving algorithms. In the 1950s, it became obvious to their builders that they needed a separately running program -- an "operating system" or a "control program" -- to separate the jobs the users wanted done, from the fundamental functions of the machines. When asked what the difference was between an OS and an algorithm, engineers explained it this way: An algorithm is designed to stop. An operating system is designed never to stop.

Because an algorithm is modular by design, it's well suited to parallelism. It's easy to determine which modules may be forked from the main thread, replicated (perhaps thousands of times), and executed in parallel, all without breaking the primary condition that the algorithm terminates upon reaching a solution. In fact, with supercomputing algorithms, the programmer may express these modular boundaries explicitly, easing the burden on the processors.

As a result, all computer programs, once compiled to object code to be executed by processors, utilize either of the following:

  • Explicit parallelism - The explicit declaration by the program of those modules and other program components that may be isolated from the rest of the program, replicated into as many copies as necessary, and executed at the same time; or,
  • Implicit parallelism - A processor capable of determining for itself where portions of the code can be segmented, replicated, and processed in parallel.
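Explicit parallelism can be sketched in Python with the standard library's concurrent.futures module. In this hedged example the programmer, not the processor, declares which pieces of work are independent: the range is cut into chunks, and each chunk is handed to a worker. The function names and the sum-of-squares task are hypothetical, chosen only for illustration. Note also that threads in standard CPython share an interpreter lock, so this shows the structure of the declaration rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(bounds):
    """One independent module: the sum of squares over [start, stop)."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def sum_of_squares(n, workers=4):
    """Explicitly split [0, n) into chunks and run the chunks concurrently."""
    step = max(1, -(-n // workers))  # ceiling division, at least 1
    chunks = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is an isolated, replicated copy of the same routine.
        return sum(pool.map(partial_sum, chunks))


print(sum_of_squares(1000))
```

With implicit parallelism, by contrast, the program would contain only the plain loop, and it would fall to the processor (or compiler) to discover that the iterations don't depend on one another.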

The state of the supercomputing race today

The Top 500 list, maintained independently by the University of Mannheim and published semi-annually, is actually a report on the geography of the high-performance market -- the space that supercomputers carve for themselves as their owners, and to an equal extent their components' manufacturers, stake their claims to functional superiority and, just as with any market, maximum competitive value.

The race that supercomputers run to earn a place on this list is based on a synthetic benchmark called Linpack. Folks have argued over the years that Linpack is not representative of how such machines run real-world tasks. Of course, if supercomputers were given actual, real-world tasks such as simulations or forecasting, and then judged against one another for performance, the argument would be that real-world tasks are so nuanced that it's impossible to render a truly objective conclusion.

What all the "FLOP/s" are about

Linpack renders its results in millions of floating-point operations per second (megaflops, or MFLOP/s), where floating-point refers to a method for representing fractional values in memory similar to scientific notation in common arithmetic. Over the years, the Top 500 has scaled up the unit it reports so that results fit on the grid more nicely. So now it renders values in teraflops (TFLOP/s), although the leaders on the list have now scored over 1,000 teraflops, entering the zone of petaflops (PFLOP/s) -- quadrillions of operations per second.
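The ladder of prefixes is easier to keep straight in code. This small Python sketch (the function name format_flops is hypothetical) picks the largest unit that fits a raw operations-per-second figure, using the standard SI steps of 1,000 between mega, giga, tera, peta, and exa.

```python
# Each step up the ladder is a factor of 1,000.
PREFIXES = [
    ("EFLOP/s", 10**18),  # exa: quintillions
    ("PFLOP/s", 10**15),  # peta: quadrillions
    ("TFLOP/s", 10**12),  # tera: trillions
    ("GFLOP/s", 10**9),   # giga: billions
    ("MFLOP/s", 10**6),   # mega: millions
]


def format_flops(ops_per_second: float) -> str:
    """Render a raw operations-per-second value in the largest fitting unit."""
    for unit, scale in PREFIXES:
        if ops_per_second >= scale:
            return f"{ops_per_second / scale:.1f} {unit}"
    return f"{ops_per_second:.1f} FLOP/s"


print(format_flops(143.5e15))  # Summit's score, from the text
print(format_flops(874.8e12))  # the #500 machine's score, from the text
```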

As of November 2018, the fastest supercomputer whose test results were confirmed by Mannheim University is Oak Ridge's "Summit." With a combination of IBM POWER9 CPUs and Nvidia Volta GV100 GPUs, Summit posted a confirmed performance score of 143.5 PFLOP/s. To give you an idea of how fast supercomputers are accelerating in general, it was a mere decade ago that I reported the news of a DOE supercomputer breaking the 1 petaflop mark.

But before you get the idea that the supercomputer market is made up of 500 or so systems racing neck-and-neck to make the #1 slot, consider this: The best posted score for the 2016 champion, China's Sunway TaihuLight, was just over 93 PFLOP/s -- about 65 percent of the current leader's performance, though it still ranks #3 today. From the #20 position to the bottom of the list, every contender posted a score of below 10 petaflops. And from #431 down, the scores are below 1 petaflop.

The first supercomputer was a "Stretch"

In April 1955, IBM had lost a major bid to build a computer for the U.S. Atomic Energy Commission's Livermore Laboratory, to the UNIVAC division of Remington Rand. UNIVAC had promised up to five times the processing power specified in the Government's bid request, so IBM decided it should play that game too, next time it had an opportunity.

When Los Alamos Scientific Laboratory was next to publish a bid request, IBM promised a system it boasted would operate at 100 times present speeds, to be ready for delivery at the turn of the decade. Here is where the categorical split happened between "conventional computers" and supercomputers: IBM committed itself to producing a wholly new kind of computing mechanism, for the first time entirely transistorized. There had always been a race to build the fastest and most capable machine, but the market had not yet begun its path to maturity until that first cell split, when it was determined that atomic physics research represented a different profile of customer from business accounting, and needed a different class of machine.

How the supercomputer concept may have been invented

Stephen W. Dunwell was Stretch's lead engineer and project manager. In a 1989 oral history interview for the University of Minnesota's Charles Babbage Institute [PDF], he recalled the all-hands meeting he attended, along with legendary IBM engineer Gene Amdahl and several others. There, the engineers and their managers came to the collective realization that there needed to be a class of computer above and beyond the common computing machine, if IBM was to regain an edge against competitors such as Sperry Rand.

"We got together and started out really from scratch," recalled Dunwell, "and said, 'What can be done in hardware, in systems design, and everything of that sort?' We came up with the conviction that, in fact, we could put together a machine which would serve both scientific and business purposes -- that we could meld these two, could bring the two together, and that we could also build a machine that was very much faster than any existing machine of any kind, and that this would be a very desirable thing to do."

C. Gordon Bell, the brilliant engineer who developed the VAX series for DEC, would later recall [PDF] that engineers of his ilk began using the term "supercomputer" to refer to machines in this upper class, as early as 1957 -- while the 7030 project was under way.

"Stretch" creates the supercomputing space


The IBM 7030 "Stretch," said by many to be the first supercomputer. (Image: Creative Commons)

The architectural gap between the previous IBM 701 design and that of the new IBM 7030 [pictured above] was so great that engineers dubbed the new system "Stretch." It introduced the notion of instruction "look-aheads" and index registers, both of which are principal components of modern x86 processor design. Though it utilized 64-bit "words" internally, Stretch utilized the first random-access memory mechanism from magnetic disk, breaking down those words into 8-bit alphanumeric segments that engineers dubbed "bytes."

Though IBM successfully built and delivered eight 7030 models between 1961 and 1963, keeping a ninth for itself, Dunwell's superiors declared it a failure for only being 30 times faster than 1955 benchmarks instead of 100. Declaring something you built yourself a failure typically prompts others to agree with you, often for no other viable reason. When competitor Control Data set about to build a system a mere three times faster than the IBM 7030, and then in 1964 met that goal with the CDC 6600 -- principally designed by Seymour Cray -- the "supercomputer" moniker stuck to it like glue. (Even before Control Data ceased to exist, the term attached itself to Cray.) Indeed, the CDC 6600 [pictured below] introduced vector processing -- executing single instructions on multiple registers in sequence, which was the beginning of parallelism. But no computer today, not even your smartphone, is without parallel processing, nor is it without index registers, look-ahead instruction pre-fetching, or bytes.


The Control Data 6600, first to use the moniker "supercomputer." (Image: Creative Commons)

Who decides how supercomputer performance scales over time?

Folks who have argued that Moore's Law has nothing to do with supercomputers would be surprised to learn that there is actually a direct historical correlation. The U.S. Department of Commerce and its various agencies have often had difficulty defining supercomputers for purposes of such important things as issuing export restrictions. When its performance benchmarks are two years old or more, it's had trouble determining whether a restricted product from two years ago should still be restricted today, given that "high performance" is a moving target.

In 2000, the General Accounting Office made a judgment that a supercomputer yields at least 85 billion theoretical operations per second (85,000 MTOP/s). At that time, this was not a measurement of observed performance, but actually a euphemism for cumulative, single-threaded clock speed (the megahertz or gigahertz of all the CPUs added together).

So when asked for an update by the DoC and others, the GAO had been known to punt, actually printing that supercomputing "is simply utilizing the fastest and most powerful computers available to solve complex computational problems."

Yet another misinterpretation of Moore's Law

Left with no other obvious alternative, the DoC has often looked to Moore's Law. Specifically, its researchers would estimate the total transistor count of a system whose cumulative processor cycles totaled 85 GHz, then double that transistor count for every 18 months (or sometimes 24) that had elapsed since 2000. The clock speed implied by the result would become the de facto supercomputing threshold.

At that time, a theoretical supercomputer meeting the DoC's minimum standards might have been composed of about 75 Intel Pentium III "Coppermine" processors clocked at 1.13 GHz each. With 29 million transistors on-die, such a machine would have a total transistor count of 2.175 billion, for a ratio of nearly 2 transistors for every 1 clock cycle.
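The arithmetic behind that hypothetical machine can be checked in a few lines of Python. The inputs are the figures quoted in the text. The final division is our reading of how the "nearly 2" ratio appears to be derived (total count on the whole machine over the clock rate of a single chip); the text does not spell that step out, so treat it as an assumption.

```python
CPU_COUNT = 75                    # Pentium III "Coppermine" chips, per the text
CLOCK_GHZ = 1.13                  # clock speed of each chip
TRANSISTORS_PER_CPU = 29_000_000  # on-die count quoted in the text

# Cumulative clock speed: the threshold was about 85 GHz (85,000 MTOP/s).
total_clock_ghz = CPU_COUNT * CLOCK_GHZ

# Total on-die count across the whole machine.
total_transistors = CPU_COUNT * TRANSISTORS_PER_CPU

# Assumed derivation of the ratio: machine-wide count per cycle of one chip.
ratio = total_transistors / (CLOCK_GHZ * 1e9)

print(f"{total_clock_ghz:.2f} GHz cumulative clock")
print(f"{total_transistors:,} on-die total")
print(f"{ratio:.2f} per single-chip clock cycle")
```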

If this sounds like a ludicrous formula to you, keep in mind it was brought to you by your U.S. Federal Government. Assuming Moore's Law had anything to do with clock cycles, the projected transistor count for a low-grade supercomputer circa 2018 should be about 118.8 billion.

A view of how performance scales, from the bottom of the heap


Inspur Yingxin SA5212M4 server unit. (Image: Inspur)

At the very bottom of the Top 500 list ("on the bubble," as they'd say in Indy 500 qualifying) is a machine built by China-based cloud service provider Inspur, for one of its Internet service provider customers. It's made up of several Yingxin model SA5212M4 servers, which as the photo above shows, fit into ordinary 2U rack units. The entire ensemble is made up of 45,440 Intel Xeon E5-2682v4 server-class processors clocked at 4.8 GHz. Using the DoC's methodology, #500 should weigh in at 218,112,000 MTOP/s (218,112 GHz).

(As it turns out, #500 posted a real-world performance score of 874.8 TFLOP/s, which is 4 times the DoC's projected score, if you convert "theoretical" operations into real floating-point operations.)
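Here is the same comparison as a short Python sketch. It uses only the figures quoted above, and it follows the text in treating one "theoretical" operation as one floating-point operation, which is a simplification made for the sake of the comparison rather than an engineering fact.

```python
CORES = 45_440           # processor figure quoted for machine #500
CLOCK_GHZ = 4.8          # clock speed quoted in the text
MEASURED_TFLOPS = 874.8  # posted Linpack score

# DoC-style "theoretical" figure: add up the clock speed of every unit.
theoretical_ghz = CORES * CLOCK_GHZ
theoretical_mtops = theoretical_ghz * 1_000   # 1 GHz = 1,000 MHz

# Treat 1 theoretical op as 1 floating-point op (the text's simplification).
theoretical_tflops = theoretical_ghz / 1_000  # 1,000 GFLOP/s = 1 TFLOP/s

ratio = MEASURED_TFLOPS / theoretical_tflops

print(f"{theoretical_mtops:,.0f} MTOP/s theoretical")
print(f"measured score is {ratio:.2f} times the projection")
```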

Although Intel doesn't publish this specification officially, the E5-2682v4 is estimated to have had 7.2 billion transistors just on one die. So #500 brings in a transistor count of 327.1 trillion. By that logic, supercomputer performance would have been declining precipitously.

So let's look at some real performance numbers, and apply some real logic. In November 2011, the #500 "on-the-bubble" machine, with 7,236 CPU cores (co-processors weren't a factor then), posted a score of 50.9 teraflops. For November 2018, the #500 machine with 45,440 CPU cores (no GPU) posted 874.8 teraflops. That's 17.2 times the performance, with 6.3 times the processor cores.

Granted, a 2018 core is much more evolved than a 2011 core. But this math exercise gives us a more realistic view of how much more: about 2.7 times more over a 7-year period. If you're a supercomputer operator, this means you can expect this year's model CPU to bring you roughly 15% better per-core performance over last year's, compounded annually. It also means your competitor replacing a 3-year-old system with a new one can expect roughly 1.5 times its old score, if it were to keep the same configuration. (Which it won't.)
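The per-core comparison can be worked through in Python using the two #500 data points quoted above. One assumption is ours and is not stated in the text: that per-core improvement compounds from year to year, so the annual rate is the seventh root of the 7-year ratio rather than the total divided by seven.

```python
# #500 "on-the-bubble" machines, figures from the text.
score_2011, cores_2011 = 50.9, 7_236     # teraflops, CPU cores
score_2018, cores_2018 = 874.8, 45_440
years = 2018 - 2011

score_ratio = score_2018 / score_2011    # growth in whole-system score
core_ratio = cores_2018 / cores_2011     # growth in core count
per_core_ratio = score_ratio / core_ratio

# Assumption: improvement compounds annually.
annual_per_core = per_core_ratio ** (1 / years)
three_year_gain = annual_per_core ** 3

print(f"score x{score_ratio:.1f}, cores x{core_ratio:.1f}")
print(f"per-core gain over {years} years: x{per_core_ratio:.2f}")
print(f"compounded per year: {annual_per_core - 1:.1%}")
print(f"same configuration after 3 years: x{three_year_gain:.2f}")
```

Because the yearly gains multiply rather than add, the compounded annual rate is considerably lower than a straight division of the 7-year gain would suggest.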

Graphics processors supercharge supercomputing


Nvidia's most recent "Tensor Core" GPU with Volta microarchitecture. (Image: Nvidia)

A full 138 of all the supercomputers on the list (nearly 28 percent) feature some type of acceleration or co-processing, with 128 of those systems using Nvidia GPUs -- essentially graphics co-processors rearchitected and retooled as parallel processing engines. Some 122 of these systems make use of Nvidia's original 2006 Tesla architecture for general-purpose GPUs, while the #1 "Summit" system and the #2 "Sierra" system [pictured below] use Nvidia's Volta microarchitecture, geared more towards AI applications. Its first GPUs on this platform [pictured above] were introduced last May and launched in December.


The world's No. 2 performing supercomputer, Livermore Labs' Sierra. (Image: Creative Commons)

What had made GPUs effective for desktop graphics was the means by which they can replicate a single group of instructions, then feed them through a pipelined path that enables them to be executed all at once, in parallel. This originally made shading and rendering large segments of triangular areas in 3D scenes easier and much faster.

But just after the turn of the century, academic researchers began experimenting with GPUs for different purposes. They leveraged GPUs' parallel processing capability not for rendering, but resolving complex algorithms. In 2006, Nvidia began publishing a software library for use with its GPUs, enabling programs managed by a CPU to delegate easily replicable algorithms to the GPU. Called Compute Unified Device Architecture (CUDA), its key feature is a runtime library (an accessory program that executes functions on behalf of a user application) that compiles routines to what Nvidia calls "kernels," which may then be disseminated by the GPU, distributed through its pipelines, and run in parallel. It leverages scientific programmers' preference for explicit parallelism to create a system for handing off parallel algorithms between CPU and GPU.
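The hand-off described here follows a simple pattern: the CPU-side program defines a small routine that handles one element of the data, then asks for it to be run once per element. The sketch below imitates that pattern in plain Python. It is not CUDA and uses none of Nvidia's API; the names saxpy_kernel and launch are hypothetical, and the loop runs sequentially where a GPU would run the invocations in parallel.

```python
def saxpy_kernel(i, a, x, y, out):
    """The 'kernel': the work for a single element index i.

    Each invocation touches only its own element, so invocations
    are independent and could safely run at the same time.
    """
    out[i] = a * x[i] + y[i]


def launch(kernel, n, *args):
    """Host-side stand-in for a launch: invoke the kernel once per index.

    A GPU would spread these n invocations across its pipelines;
    this stand-in simply loops.
    """
    for i in range(n):
        kernel(i, *args)


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)

launch(saxpy_kernel, len(x), 2.0, x, y, out)
print(out)  # [12.0, 24.0, 36.0, 48.0]
```

The design point is the explicit boundary: because the programmer writes the per-element routine separately, nothing has to guess which parts of the program can be replicated.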

The following year, Nvidia realized CUDA had sparked an entirely new market. The company began building general-purpose co-processor engines, which for a time Nvidia called "desk-side supercomputers." In 2009, in an affirmation that the GPU market had expanded beyond graphics and was no longer a niche, Nvidia began actively funding organizations that contributed to the CUDA platform.

What GPUs eventually accomplished was nothing less than the total rebirth of supercomputing, both as an industry and a science. At a time when high-end machines were in danger of becoming composites of low-end processors, Nvidia refocused programmers' attentions on the component that made supercomputing into a viable industry in the first place: the algorithm. This, in turn, made computer vendors care once again about designing machines around the functions they performed rather than the resources they consumed.

The supercomputer industry today has been reinfused with at least some of the spirit that inspired folks like Gordon Bell, Stephen Dunwell, Gene Amdahl, and Seymour Cray. They're building machines with intent and, to borrow Dunwell's term, "conviction." So in a market where the personal computer no longer dictates its architectural goals, the supercomputer may have rediscovered its lost leadership role. If only for a quintillionth of a second, everything old is new again.
