80 isn't nearly enough


What an exciting week this has been. We unleashed the ‘Era of Tera’ by showcasing the world’s first programmable processor that can deliver Teraflops performance with remarkable energy efficiency.

It’s rather extraordinary that after decades of single-core processors, the high volume processor industry has gone from single to dual to quad-core in just the last two years. Moore’s Law scaling should easily let us hit the 80-core mark in mainstream processors within the next ten years, quite possibly less. It is therefore reasonable to ask: what are we going to do with this sudden abundance of processors?

The answer is somewhat obvious on the server side of things. More cores and more threads mean more transactions per unit time, assuming that all those cores are given the necessary memory and I/O bandwidth. Other computationally intensive applications in scientific and engineering computing are also likely beneficiaries. I’m talking about seismic analysis, crash simulation, molecular modeling, genetic research, and fluid dynamics.

On the client end of the wire, things aren’t as obvious or straightforward, but they are no less interesting. The abundance of cores is likely to lead to a very different approach to resource allocation. For decades, operating systems have been optimized to manage very scarce processor resources by cleverly multiplexing many tasks or threads across one, or now two or four, cores. As quality of service has become more important to users, we’ve all come to realize the limitations of this approach as frames get dropped from video streams or productivity applications pause while the video goes full tilt. A different approach, and one that probably hasn’t received enough attention from the research community, is to dedicate cores to providing particular functions. The allocations become more static than what we see today, but they can certainly be changed over longer periods of time, ranging from seconds to hours or even days.

As an example, we could conceive of a multi-function computing appliance that contains a processor with perhaps three dozen cores: we might allocate four of those cores to running the core productivity and collaboration applications. Another cluster of cores, on the order of a dozen, might provide very high quality graphics and visualization. Media processing, beyond encode/decode (which would best be handled by dedicated hardware), would be the responsibility of yet another cluster of, say, six cores. Still other clusters might do real-time data mining on various streams of data flowing in from the Internet. Various bots operating within this cluster might be assembling news, shopping, or investing information. The key idea here is to let the abundant hardware resources replace a lot of very complex OS code. It’s replaced by cluster or partition management code, which doles out the resources but stays out of the way until there’s a major shift in the workload.
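
To make the partition-management idea concrete, here's a minimal sketch, assuming a Linux-style affinity interface; the partition names and core counts simply mirror the hypothetical appliance above and are illustrative, not a product plan:

    import os
    import subprocess

    # Hypothetical static partition map for a 36-core processor; the
    # cluster names and sizes echo the example in the text above.
    PARTITIONS = {
        "productivity": range(0, 4),    # 4 cores
        "graphics":     range(4, 16),   # 12 cores
        "media":        range(16, 22),  # 6 cores
        "data_mining":  range(22, 36),  # bots: news, shopping, investing
    }

    def launch_in_partition(cmd, partition):
        """Start a workload and pin it to its dedicated core cluster."""
        proc = subprocess.Popen(cmd)
        # Linux-only: restrict the new process to the partition's cores.
        os.sched_setaffinity(proc.pid, set(PARTITIONS[partition]))
        return proc

    # e.g. launch_in_partition(["./news_bot"], "data_mining")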

TJGeezer suggested using Tera-Scale capability along with huge amounts of NAND in an iPod-sized container for AI applications. He may be right. One can easily imagine clusters of cores supporting an advanced human interface with real-time speech and vision or language translation. A lot of algorithmic development would have to take place to make this feasible, but there is no doubt in my mind that we’ll have the hardware resources needed to host them. The statistical algorithms that will form the heart of these future recognition systems are highly parallel and thus a great fit for a high core count architecture.
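
As a toy illustration of that parallelism: scoring recognition hypotheses is independent work that fans out across cores with no coordination (the scoring function below is a stand-in, not a real recognizer):

    from multiprocessing import Pool

    def score(hypothesis):
        # Stand-in for an expensive statistical evaluation, e.g. an
        # acoustic or translation model applied to one hypothesis.
        return sum(ord(c) for c in hypothesis) % 97

    hypotheses = ["recognize speech", "wreck a nice beach"] * 1000

    if __name__ == "__main__":
        # Each worker can occupy its own core; no shared state is needed.
        with Pool() as pool:
            scores = pool.map(score, hypotheses)
        print(max(zip(scores, hypotheses)))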

An abundance of cores also enables new ways to deal with challenges associated with system operation in the face of device failures and cosmic radiation. Think of the collection of cores as a redundant array of computing engines (RACE). Two or more cores could be used in tandem to detect and correct faults. If a core becomes unreliable, it can simply be removed from service without significantly affecting overall system performance.
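
Here's a minimal sketch of the tandem idea, assuming a deterministic workload that can simply be replayed; a real system would pin the two workers to distinct physical cores:

    from multiprocessing import Pool

    def compute(x):
        # The computation being protected; it must be deterministic
        # so that two independent runs can be compared.
        return x * x + 1

    def redundant(pool, x, retries=3):
        """Run compute() twice in parallel and cross-check the results."""
        for _ in range(retries):
            a, b = pool.map(compute, [x, x])
            if a == b:
                return a
        raise RuntimeError("persistent fault: retire the suspect core")

    if __name__ == "__main__":
        with Pool(2) as pool:
            print(redundant(pool, 42))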

As we pack more and more computing resources into smaller areas, managing power and heat in a very fine-grain manner will be critical. If we have more cores than are needed to execute the desired set of workloads, we can swap threads between cores whenever one becomes too hot. It’s like the hot potato game – move the potato fast enough and you never get burned. We’ll need the ability to adjust supply voltages, operating frequencies, and sleep states of individual cores in a matter of microseconds.
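
A rough sketch of such a hot-potato policy, assuming a Linux-style interface; the sysfs temperature path is a placeholder that varies by platform, and a real implementation would live in firmware or the partition manager rather than user space:

    import os
    import time

    NUM_CORES = 80  # hypothetical Tera-Scale part

    def core_temp(core):
        # Placeholder probe: real platforms expose temperatures via
        # sysfs or MSRs, and the exact path differs by machine.
        with open(f"/sys/class/hwmon/hwmon0/temp{core + 1}_input") as f:
            return int(f.read()) / 1000.0  # millidegrees C -> degrees C

    def hot_potato(pid, threshold_c=85.0, period_s=0.001):
        """Move a process off any core that is running too hot."""
        while True:
            current = os.sched_getaffinity(pid)
            if any(core_temp(c) > threshold_c for c in current):
                coolest = min(range(NUM_CORES), key=core_temp)
                os.sched_setaffinity(pid, {coolest})
            time.sleep(period_s)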

While the hardware and software challenges of developing and fully utilizing these future Tera-Scale platforms are somewhat mind-boggling, the benefits and opportunities of putting these computing capabilities into the hands of all users are equally incredible.

So how many cores could you use, and what would you use them for? Ars Technica user dg65536 said it best in his post: “Now that I think about it...80 isn't nearly enough.”


Talkback

10 comments
  • Data supply will be even more important

    From the article:
    "...assuming that all those cores are given the necessary memory and I/O bandwidth."

    Don't brush this off with just a passing mention. Keeping data supplied to 80 cores is going to be more difficult than deciding what to use all these processing units for or how to program them. Processors that have nothing to work on will sit idle. As clock speeds have increased, memory and bus speeds have not kept pace.

    Even if the speed of the cores is kept constant, a processor with 80 cores represents a leap of more than six doublings. In terms of Moore's Law, that is an advance of 9 to 12 years of processing-power increases over single-core processors. If Intel can produce 80-core chips in, say, five years, will memory technology be able to keep up, let alone practically double the speed curve it has followed for the last decade or two? I'm skeptical.
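
    A quick back-of-the-envelope check of those numbers (the 18-24 month doubling cadence is the usual Moore's Law assumption, not from the article):

        import math

        doublings = math.log2(80)  # ~6.32 doublings to get from 1 core to 80
        print(f"{doublings:.2f} doublings")
        print(f"{doublings * 1.5:.1f} years at one doubling per 18 months")
        print(f"{doublings * 2.0:.1f} years at one doubling per 24 months")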

    - silent E
    --
    Turning dams into dames since 1974.
    silent E
  • 80 is nonsense for the next decade

    It reminds me of the days when MHz or GHz mattered, until no one needed more. The same goes for the number of cores. And of course, I am referring to everyday life and the normal population, not the business community.

    One post on Ars Technica said 80 cores isn't enough and points out multiple ways they could be utilized. My question: what is the point of running a given process faster?

    Example: instead of taking 10 seconds to process a job/task, the computer now takes 1 nanosecond.

    Impact 1: You get your answer "right away".

    Impact 2: Your computer or server is then left alone, idle, for (at least) the next 9 seconds, because you can't process information faster than that; human interaction is part of how decisions and actions get processed, validated, and executed.

    Next example: what I could use 80 cores for:
    - Running the OS faster, cutting boot time from 15 seconds to 5 seconds (it can't be faster, since BIOS/HW/SW initialization takes time)
    - Opening a web browser, from 2 seconds to 1 nanosecond (in user experience, that means right away)
    - Typing in a web address or clicking a link - the same, 2 seconds (we just can't click or type faster)
    - Waiting for a web page to load - the same (it's the broadband speed that matters, not the cores)
    - Browsing web pages/email - slightly faster thanks to image processing, but only by a fraction of a second (download speed is still key for those activities)
    - Typing a document - almost the same (your typing skill matters more than the cores)
    - Loading/saving/deleting a document, from ~3 seconds to instant
    - Copying a document from your computer to the network, from 5-10 seconds to 5 seconds (slightly faster, but limited by network utilization, speed, and bandwidth)
    .... the list goes on

    Other good impacts: MP3 encoding/decoding, photo/image/video editing, computer games, CAD, animation design/development... all of those improve tremendously, as long as the software is capable of using the cores.

    How to use all 80 cores?
    - How does the enterprise customer fill the idle time created by faster processing, considering there is a limit to how fast a human can work?
    - How does our everyday life - instant messaging, blogging, surfing the Internet, email - improve with 80 cores? (The fact is, not by much.)
    - What is the real need for everyone? There is effectively unlimited electric power available for everyday life, but we are limited by the devices we use; most of the time we buy a device that keeps drawing electricity we never really use. That is acceptable only because electricity is cheap for everyone.

    Until someone is able to fix/improve/innovate the ecosystem and the human interaction model (data and/or information usage), an 80-core system remains an inefficient way of utilizing resources. We should only buy what we need and what we are capable of using.
    stephen.oh33@...
  • simple multi core trend

    Currently, one of Intel's differentiators is the sheer complexity of designing a Pentium, for example. It requires so many people to tune the processor to the technology and squeeze out every bit of performance that only a few companies can afford such an investment. On the other hand, the trend toward multicore chips composed of replicated simple cores, and therefore scalable, can potentially require far fewer people to design (I've read that Polaris was the effort of 30 engineers, HW and SW, in a bit more than a year). The difficult part in the future will be, as mentioned, programming those cores. I wonder if Intel isn't shooting itself in the foot in the long term, as more companies will be capable of providing the underlying hardware substrate as a similar multicore chip. Is it crazy to think that?
    poeta nascitur
  • That's a HUGE Ass umption...

    QUOTE:
    More cores and more threads mean more transactions per unit time, assuming that all those cores are given the necessary memory and I/O bandwidth.
    ENDQUOTE:

    Ummm. How can you just assume that 80+ cores will get enough memory and I/O bandwidth? It's already a stretch with two to four cores, each CPU needing its own memory banks.

    No way will memory architectures suddenly be able to supply 20-80 times the current memory bandwidth without something insane like 8000-pin CPU packages.

    That's almost as bad as the statement from someone a while back that once we find a really fast, easy way to factor the products of large primes, all encryption will be easily broken. Ummmmm, yeah.
    Sxooter_z
  • Terrific article.

    Clearly multi-core is the next stage in utilizing all those Moore's Law transistors. Very interesting perspective, and now the 80-core proof-of-concept research chip that Intel submitted to ISSCC makes sense: that is, it researches the I/O and memory bottleneck areas.

    I think the "AOL researchers" who commented here have a good point. Four cores is all anyone will ever need. Yeah.
    Prognosticator
  • Oh Please

    Justin, Justin, Justin. Please spend your transistor budget more wisely. This rope does not deserve pushing. Let it lie and maybe the feeling will pass.
    Inflection
  • On the personal scale

    I see computers turning more toward the efficiency scale: conserving energy, running cooler, and of course shrinking. A unit the size of a deck or two of playing cards will serve as the PC and have wireless interfaces for all other input/output devices, except maybe video. At least I see that as logical, since you could simply stack upgrades like Lego blocks. Ah, I love to dream.
    Hrothgar - PCLinuxOS User
  • Right but for the wrong reasons

    80 cores is not enough, because the existing prototype achieves only 16 GFLOPS/W, resulting in a whopping 62 W of power dissipation at 0.9 V.

    While that is obviously better than the current crop of multicore processors, the types of application envisaged mean that a much higher percentage of that 62 W peak will be sustained than in current multicore processors, which are throttled by external memory bandwidth for HPC applications such as FEM and CFD (assuming Polaris can be fed with data).

    In my experience, designing for top-line performance rather than for power efficiency results in a higher cost of ownership and operation than designing for the optimum power/performance point of a given technology (in this case 65 nm).

    In order to maximise power efficiency and cost of operation, many more than the proposed 80 cores should be integrated into the same die area, but operating at much lower power.

    Secondly, in terms of keeping the cores fed, some of the cores can be dedicated to compressing and decompressing data from the external I/O, increasing the effective bandwidth "seen" by the other cores.
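
    A rough sketch of that effective-bandwidth idea (zlib here is just a stand-in for whatever codec the dedicated cores would run; the ratio depends entirely on the data):

        import zlib

        # Pretend this block is arriving over the external I/O link.
        raw = b"x = A[i] * B[i] + C[i];\n" * 4096
        compressed = zlib.compress(raw, 6)

        link_bytes = len(compressed)   # what actually crossed the pins
        useful_bytes = len(raw)        # what the compute cores end up with
        print(f"effective bandwidth multiplier: {useful_bytes / link_bytes:.1f}x")

        # Dedicated decompression cores would reverse the transform on-chip.
        assert zlib.decompress(compressed) == raw
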
    moloned@...
  • Marketing Spin

    http://forums.techgage.com/attachment.php?attachmentid=155&d=1171236277

    The ISSCC paper on Polaris is posted above, and my reading of it is that this is purely an exercise in technology development. Borkar's group published a paper at ISSCC a few years ago on a fast FP MAC, out of which Polaris has been developed.

    Despite Intel's claims, Polaris in this form is not usable for HPC applications, as it only supports single-precision arithmetic.

    The FP unit uses deferred normalisation, so any loop of code which stores results back to memory will have to perform an additional normalisation step, which will degrade the top-line performance.

    Furthermore, the on-chip data memory per node is only 2 kB (512 x 32-bit words), and the 3 kB instruction memory will only hold 256 x 96-bit instructions.

    The small memory and the reliance on the NoC to supply data mean that the power cost of sustaining performance will be very high compared with the IBM Cell, which has 256 kB per node of local data/program storage.

    Most tellingly of all, the instruction set only supports FP MACs - no divides, square roots, etc. - so it is really only of use for a headline-grabbing marketing exercise.
    moloned@...
  • Software Architecture to Match Your Cores

    SISA: A Scalable Instruction Set Architecture
    by Brian Fidel Davila
    1. Introduction

    The never-ending pursuit of higher performance dictates the ongoing development of processors that perform computations on progressively bigger blocks of data at a time. Two methods of achieving this goal are increasing the size of the data word, demonstrated most recently by the migration from 32- to 64-bit computing, and having an instruction operate on more words at once, as seen in vector computers and the "multimedia", "DSP" or "SIMD" extensions now common in modern processors.

    While it is tempting to design a new architecture from scratch every time technological advances make new operations possible, this is rarely feasible. Market realities dictate that pre-existing software continue to be supported unmodified. As a result, instructions are added to an existing architecture to provide this functionality.

    The need to modify instruction sets for new data formats is a fundamental limitation of standard architectures. Ideally, an architecture would not be tied to specific data sizes. New processors could then support different operations without requiring instruction set modifications.

    This paper presents a scalable instruction set architecture, SISA, which meets these requirements. While any processor has limits on the data formats it supports, SISA implementations are able to support operations on larger data words and arbitrarily long vectors using the same interface.

    Section two describes memory aliasing registers and their use in a simple implementation of SISA, the SISA-I. Next, section three defines vector operations and support for larger data words and introduces a vector machine, SISA-II. Section four continues with the SISA-III, which supports superscalar execution. Section five lists additional work, section six related research, and section seven concludes.
    2. Registers

    "Any problem in computer science can be solved by adding another level of indirection." - Alan Kay

    To minimize demands on the instruction data path, SISA instructions are 16 bits long. This instruction length allows only eight architecture registers, r0 to r7. Each register is associated, or aliased, to a memory address with the rset instruction. These addresses may subsequently be accessed with the rget instruction. Addresses may be read or written from any number of contiguous registers with a single instruction, though register storage bandwidth or other limitations may prevent operations on larger subsets of the registers from executing in a single cycle. Memory addresses may also be modified with the radd and raddi instructions, which add the value at a memory address or an immediate, respectively.

    SISA registers also have a width field, which defaults to 32 bits. This may be changed with the rsetw instruction. Register widths may be retrieved with rgetw. Both instructions may modify multiple registers, as with rset and rget. The maxwidth instruction returns the maximum width an implementation supports.

    In SISA-I, a register's width may be set to one, two, four, eight, 16 or 32 bits. The architecture may support any width; SISA-I supports only these widths for simplicity and performance. The widths of each operand and the result are passed to the execution stage so that it may perform the correct operation. The radd and raddi instructions also use register widths when incrementing alias addresses.

    SISA-I uses an eight-entry register table for storage of addresses and widths. When an rset or rsetw instruction is called, the corresponding entries are updated in the register table. This table then provides the values for rget and rgetw.
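
    As a toy model of the aliasing semantics just described (Python standing in for the unspecified SISA assembly syntax; only rset/rget/rsetw/rgetw/radd come from the text, and the exact radd operand convention is an assumption):

        class SISARegs:
            """Toy model of SISA's eight memory-aliasing registers."""

            def __init__(self, memory):
                self.memory = memory        # address -> value
                self.addr = [0] * 8         # aliased addresses for r0-r7
                self.width = [32] * 8       # width fields, defaulting to 32

            def rset(self, r, address): self.addr[r] = address
            def rget(self, r): return self.addr[r]
            def rsetw(self, r, w): self.width[r] = w
            def rgetw(self, r): return self.width[r]

            def radd(self, r, src):
                # Modify an alias address by adding the value at another
                # aliased location (operand convention assumed here).
                self.addr[r] += self.memory[self.addr[src]]

            def operand(self, r):
                # Operands are simply references to memory.
                return self.memory[self.addr[r]]

        mem = {100: 7, 104: 5}
        regs = SISARegs(mem)
        regs.rset(0, 100)   # alias r0 to address 100
        regs.rset(1, 104)   # alias r1 to address 104
        print(regs.operand(0) + regs.operand(1))  # 12: what "add r0, r1" would compute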

    Data values are stored in four banks of 32-bit wide, 16-entry register files, each with two ports. This arrangement allows up to four address registers to be saved or restored in one cycle and all eight in two cycles. This storage and a 16-entry, 24-bit wide tag file function as a direct-mapped Level-0, or L0, cache. During instruction decode, each aliased register specified is used to retrieve the currently aliased address, which is then used to access data in this cache.

    It is not feasible to solely use a direct-mapped cache. It would fail in the case where two operands are aliased to different addresses that would occupy the same cache line. While it would be possible to fetch one value directly from memory, this would result in substantial performance degradation.

    A victim cache is used to avoid this penalty in most cases. When a block is evicted from the direct-mapped cache, it is sent to this unit, which can then supply it to the data path as required. The requirement for 28-bit fully-associative lookups suggests the victim cache be kept small, so SISA-I uses eight entries. It is 128 bits wide in order to hold an entry from each of the four L0 cache banks in one line.

    The victim cache also functions as a write buffer. Normally, dirty cache lines will be written only when the bus to the next level in the memory hierarchy, the L1 data cache, is not otherwise busy. The exception is when the victim cache is full, in which case a write must take priority over a read.

    While a set-associative cache could also be used, there is a performance issue. The associative lookup in the victim cache can be performed in parallel with the direct-mapped cache access. Thus, it will be known whether the value resides in the victim cache at the start of the execution phase, and it can be muxed into the data path with little delay. When the value is not in the victim cache, the tag check for the direct-mapped cache can be performed in parallel with the execution stage, which is not possible with a set-associative cache. This is not an issue in an implementation where performing a serial tag check in the execution stage is not in the critical path.

    A simple implementation of SISA could directly access memory, but this would only be viable in very low-cost solutions or where memory requirements are so small that memory may be implemented with performance comparable to the speed of a cache.

    Operands pass from the decode stage to an ALU in the execution stage before results are written back to the L0 cache. Each instruction in these stages of the pipeline will keep an eight-bit destination field that specifies where the result will be written. Each register cache line will have a two-bit counter of the number of instructions in the pipeline that will write to it. The control logic ensures cache blocks waiting for results are not evicted and that instructions writing to a cache block that has not loaded from memory are stalled.

    Advantages of SISA-I over a conventional pipelined processor include the speed of function calls, returns, and context switches. The rset instruction may be used to quickly switch between many different sets of address registers as needed, and control logic will move values in and out of register storage in parallel with other operations. Explicit loads and stores of data are removed from the instruction stream and are performed in parallel with ALU operations. Loading longer immediate values will take more cycles, but the 16-bit instruction format has been shown to suffer only a small performance penalty relative to a 32-bit format in a load-store architecture [1].

    3. Vector operations

    Because operands are simply references to memory, it is easy to specify instructions that perform vector operations. While both scalar and vector instructions use a three-operand format, scalar instructions have two source operands and a destination, while the vector versions have one operand that is both source and destination, one source operand, and an operand to specify vector length.

    While the 16-bit instruction length does not allow a reasonable number of instructions with two 3-bit operands and the 8-bit immediate used in scalar instructions, there are a number of possible solutions. Either the immediate could be reduced to five bits to allow for a vector length operand, the vector length operand could be defined to come from a hardwired register, perhaps r0, or such instructions may simply not be supported. The last option is chosen for the current specification.

    An implementation may support any number of lanes [2] for vector operations, including one, and the maxveclen instruction, which takes a width as an argument, is provided to return this value. SISA-II supports operations on 128 bits, which may be sub-divided by any power of two bits. This restriction is not a property of the architecture; this implementation supports only these formats for simplicity and performance.

    To minimize data path complexity, the restriction that only one of the operands may come from the victim cache is enforced. When an instruction specifying two operands that reside in the victim cache is encountered, one of the operands is moved to the direct-mapped cache before the instruction proceeds.

    For the bit-wise logical instructions and the move instruction, there is no difference between operating on, for example, a vector of four 32-bit values and a vector of two 64-bit values. For addition and subtraction, the difference is only which carry bits are passed. Larger operations may be performed over multiple cycles. SISA-II supports 64-bit additions in one cycle and 128-bit additions in two.

    Multiplication presents a bigger challenge. Multiplication of operands with widths larger than those natively supported can be synthesized by combining smaller-width multiplications and additions, essentially decomposing the operands into "digits" of the smaller bit width. While providing a more consistent interface, these operations introduce complexity and may be better handled in software. SISA-II supports multiplications on operands of up to 32 bits and on vectors returning a maximum of 128 bits. Note that the width of the result register is passed to the execution stage and that this determines the limit, rather than the sum of the widths of the operands. The maxmult instruction takes a result width and returns the maximum number of consecutive multiplications supported.
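
    As a sketch of that digit decomposition (in Python for illustration; this is the schoolbook scheme the text alludes to, not a specified SISA instruction sequence):

        def wide_mul(a, b, digit_bits=32):
            """Multiply wide operands using only digit_bits-wide multiplies."""
            mask = (1 << digit_bits) - 1
            # Decompose each operand into "digits" of the smaller bit width.
            a_digits, b_digits = [], []
            while a:
                a_digits.append(a & mask); a >>= digit_bits
            while b:
                b_digits.append(b & mask); b >>= digit_bits
            # Combine partial products with shifted additions.
            result = 0
            for i, ad in enumerate(a_digits):
                for j, bd in enumerate(b_digits):
                    result += (ad * bd) << ((i + j) * digit_bits)
            return result

        x, y = 0xDEADBEEFCAFEBABE, 0x0123456789ABCDEF
        assert wide_mul(x, y) == x * y  # agrees with a native wide multiply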

    4. Superscalar execution

    The multiple execution units of SISA-II are underutilized by shorter scalar operations. To take advantage of this idle hardware, SISA-III examines two instructions at a time. The short instruction word keeps the demands placed on the instruction cache low, and the three-bit register operands simplify decode logic. Three bits are now used to record the number of in-flight instructions that will write to a cache line.

    Instead of adding ports to register storage, SISA-III simply does not allow the issue of two instructions that would require more than two operands to be read from different lines in the same bank. Similarly, one pipeline is stalled when two instructions attempt to write to different lines in the same bank. While these restrictions raise the CPI above what would otherwise be possible, they provide a significant improvement in throughput for a modest investment in hardware. Simultaneous multi-threading is also possible by duplicating the address, width and PC registers and inserting logic to guard against structural hazards.

    5. Additional Work

    An assembler and an implementation of this architecture for a Xilinx Spartan-3 FPGA are currently undergoing testing. Results for assembly programs versus equivalents targeted for a MIPS processor will be available shortly. A more thorough investigation requires targeting a compiler for the SISA architecture and perhaps developing a simulator as well.

    There is no reason this architecture must be limited to 16 bits. An instruction set with a different instruction width could easily be defined to support, for example, different immediate fields. An instruction length of 32 bits allows support for directly addressing data values in addition to the indirect addressing specified. To support a legacy instruction set alongside this new framework, architecture registers could be defined to reside at a certain memory address and instructions could be dynamically translated. Since data values in load-store architectures must be saved on context switches, they are already associated with a memory address. Static translation of legacy code from traditional architectures to SISA is also possible using this technique.

    The performance of a SISA implementation with an L0 cache is of course dependent on the performance of the cache. How the performance of various workloads varies with different L0 cache implementations warrants further investigation, but interesting results require a benchmark such as SPEC CINT2000 be ported to SISA.

    To improve the performance of the L0 cache, operand address prediction could be used to prefetch data. Such a mechanism could either rely on the branch prediction mechanism and record operations on the register addresses and widths along each path, or a method of predicting operand addresses independent of the branch prediction unit could be developed.

    Different compilation techniques and optimizations will be necessary for SISA. Gcc [3] 4.0 supports auto-vectorization, from which SISA would benefit. Because the radd and raddi instructions may operate on more than one consecutive register, one or more registers may be used to refer to the top values of a stack or stacks. Multiple registers may also refer to the same address but have different widths.

    6. Related Research

    Ditzel and McLellan [4] proposed using a cache instead of registers for operands, citing disadvantages that have been addressed by advances in technology and architecture. The Virtual Context Architecture [5] first studies a memory-memory architecture that uses memory addresses as operands and then modifies the Alpha architecture to support register windows that map logical registers to memory locations based on context. Banked register files have been studied by Cruz et al. [6].

    7. Conclusion

    A Scalable Instruction Set Architecture has been defined that allows microprocessor implementations to support operations on arbitrarily long data elements, vector lengths and memory address sizes. It has the additional advantages of using only a 16-bit instruction word and allowing for fast function calls, returns, and context switches. The adoption of such an architecture would greatly reduce the costs of introducing new processors supporting new operations.

    References
    [1] J. Bunda, D. Fussell, R. Jenevein, and W. C. Athas, "16-Bit vs. 32-Bit Instructions for Pipelined Microprocessors," Proceedings of the 20th Annual International Symposium on Computer Architecture, pp. 237-246, 1993.
    [2] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, San Mateo, CA, 2003.
    [3] Free Software Foundation, GNU Compiler Collection, http://gcc.gnu.org.
    [4] D. R. Ditzel and H. R. McLellan, "Register Allocation for Free: The C Machine Stack Cache," Symposium on Architectural Support for Programming Languages and Operating Systems, SIGPLAN Notices 17, 4 (Apr. 1982), pp. 48-56.
    [5] D. Oehmke, N. Binkert, S. Reinhardt, and T. Mudge, "Design and Applications of a Virtual Context Architecture," https://www.eecs.umich.edu/techreports/cse/2004/CSE-TR-497-04.pdf.
    [6] J.-L. Cruz, A. Gonzalez, M. Valero, and N. P. Topham, "Multiple-Banked Register File Architectures," Proceedings of ISCA-27, pp. 316-325, 2000.
    fdavila