Cracking the 1,000-core processor power challenge

Summary: With the number of cores in mainstream processors predicted to scale to hundreds in the near future, a team of UK researchers is looking at how to stop their power consumption from spiralling out of control.

TOPICS: Hardware

This year the first octa-core smartphone will hit the market, but as the number of cores inside mobiles and tablets grows, so does the toll on the battery.

The problem with cramming more cores onto processors is that energy efficiency doesn't scale with the number of cores stacked onto chips. As more cores are added, power consumption grows faster than performance.

If you were to put a 16-core processor into your average modern smartphone, the maximum battery life would fall to three hours; with a 100-core processor it would drop to just one hour, according to back-of-an-envelope calculations by a team of researchers from UK universities.

Tackling the rapacious appetite for power of many-core processors is about more than letting computers do more on the move. As an increasing number of cloud services such as Gmail and Spotify are accessed over the internet, the need to keep down the energy demands of datacentres full of densely packed server clusters is also becoming pressing.

If unaddressed, the rising power consumption of many-core processors may limit future increases in computing performance. There are predictions that within three processor generations, CPUs will need to be designed to use as little as 50 percent of their circuitry at any one time, to limit energy draw and prevent waste heat from destroying the chip.

Intel's Xeon Phi Co-processor. Image: Intel

Worrying about how to limit the draw of processors with hundreds of cores might sound academic, something that won't be an issue for mainstream processors for more than a decade. But many-core processors don't seem quite so distant when you consider there are already octa-core processors in desktops and servers, as well as specialist many-core processors such as the Xeon Phi Co-processor, and Moore's Law predicts there will be processors with more than 16 times the transistors of today's chips by 2020.

The International Technology Roadmap for Semiconductors 2011, a forecast of semiconductor development drawn up by experts from semiconductor companies worldwide, predicts that by 2015 there will be an electronics product with nearly 450 processing cores, rising to nearly 1,500 cores by 2020.

Chip designers are coming up with novel ways of pushing down power consumption in multi-core devices. One example is Arm's big.LITTLE configuration, in which a high-performance, high-power-consumption processor is paired with an energy-efficient, weedy one.

However, pairing energy-sipping with energy-hungry chips can only reduce power draw so much, according to Professor Bashir Al-Hashimi of the Electronics and Computer Science department at the University of Southampton, who is the director of a new project looking at a longer-term solution to the problem.

The University of Southampton is part of a group of universities and companies, including UK chip designer Arm and Microsoft, in the PRiME (Power-efficient, Reliable, Many-core Embedded systems) project. The project will examine how processors, operating systems and applications could be redesigned to allow CPUs to more precisely match their power consumption to the application they are running.

"In the long term, the focus should be not just the hardware; the system software needs to become much more intelligent and work co-operatively with the hardware," Al-Hashimi said.

He said that ensuring processors were not sucking up more power than they needed at any one time would require a lot more intelligence in how operating systems manage the power consumed by CPUs. Some current power reduction techniques, for example clock and power gating, are deployed when the chip is being designed, and invoked when the chip is in use.

"This happens at design time, so requires good understanding of or predictions about the type of application one would run in order to look for opportunities to reduce the energy cost of computation, or eliminate it where there's no useful work being done," he said.

PRiME will investigate a dynamic model of power management, where processors would work in conjunction with the operating system kernel to shut down parts of cores or adjust the CPU's clock speed and voltage based on the precise needs of the application running on the processor at that moment.
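The dynamic model described above amounts to repeatedly choosing an operating point from a table of supported frequency/voltage pairs. A minimal C sketch of that decision, with an invented table of operating points (this illustrates DVFS in general, not PRiME's actual design):

```c
#include <stddef.h>

/* Hypothetical table of operating points (frequency/voltage pairs).
 * The values are invented for illustration; real silicon publishes
 * its own table of supported points. */
typedef struct {
    int freq_mhz;
    int voltage_mv;
} op_point;

static const op_point levels[] = {
    {  200,  800 },  /* near-idle housekeeping */
    {  600,  900 },  /* light load */
    { 1200, 1050 },  /* moderate load */
    { 1800, 1200 },  /* full performance */
};

#define N_LEVELS (sizeof levels / sizeof levels[0])

/* Pick the lowest operating point whose frequency still covers the
 * application's demand, so no more power is drawn than necessary. */
op_point pick_op_point(int demanded_mhz)
{
    for (size_t i = 0; i < N_LEVELS; i++)
        if (levels[i].freq_mhz >= demanded_mhz)
            return levels[i];
    return levels[N_LEVELS - 1];  /* demand exceeds max: cap at top */
}
```

In a real kernel this choice would run periodically, fed by the counters and profiles the project plans to add.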

A new approach

This dynamic power management would require current computer hardware, operating systems and applications to be enhanced.

Additional circuitry, such as performance and energy counters, would need to be added to processors to capture more data on how much work a CPU was doing and how workloads were distributed across cores. This data could include the level of current being consumed and the operating frequency of each core.
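An OS sampling such counters might first reduce them to a per-core utilization figure before deciding anything. A sketch in C, with an invented counter layout (not any real hardware interface):

```c
#include <stdint.h>

/* Invented snapshot of one core's counters: cycles spent doing work
 * versus total cycles elapsed. Real hardware exposes similar numbers
 * through model-specific performance counters. */
typedef struct {
    uint64_t busy_cycles;
    uint64_t total_cycles;
} core_counters;

/* Percentage utilization of a core between two snapshots taken some
 * sampling interval apart. */
unsigned utilization_pct(core_counters before, core_counters after)
{
    uint64_t busy  = after.busy_cycles  - before.busy_cycles;
    uint64_t total = after.total_cycles - before.total_cycles;
    return total ? (unsigned)(busy * 100 / total) : 0;
}
```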

This data would then be interrogated by the operating system to capture a snapshot of the load on the processor and how the CPU was handling it. Interrogating this data in detail would require changes to power management routines within the kernels of operating systems.

Lastly, each application would likely need to come with a profile that described its power and performance needs to the power management system in the OS kernel. Al-Hashimi said one option would be for this to be generated during the application's development, using software tools that would estimate the app's performance needs on a given processor architecture.
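What such a profile might contain is easiest to see in code. A sketch with invented field names (the article does not specify a profile format, so everything here is an assumption):

```c
/* Hypothetical application profile of the kind described above; the
 * field names are invented, not part of any shipping OS or toolchain. */
typedef struct {
    int min_freq_mhz;       /* slowest clock that still meets deadlines */
    int preferred_cores;    /* parallelism the app can usefully exploit */
    int latency_sensitive;  /* 1 if output (audio/video) must not stall */
} app_profile;

/* A video decoder might declare a steady, modest clock and
 * little parallelism. */
static const app_profile video_decoder = {
    .min_freq_mhz      = 600,
    .preferred_cores   = 2,
    .latency_sensitive = 1,
};

/* The kernel's power manager could then decide how many cores an app
 * actually justifies keeping awake. */
int cores_to_keep_awake(const app_profile *p, int cores_available)
{
    return p->preferred_cores < cores_available
         ? p->preferred_cores
         : cores_available;
}
```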

All these changes would allow an OS to constantly monitor the performance and power usage of a CPU, scaling its power usage to precisely match the needs of the app outlined in its profile. This adjustment would take place via methods such as reducing the clock speed and voltage flowing to the CPU and shutting down parts of cores. More sophisticated manipulation of processors — for example, switching between heterogeneous cores, a la Arm's big.LITTLE, and homogeneous cores — would require further modification of processor hardware.

Research will be divided among the UK universities involved. Imperial College London will research hardware enhancement and reconfiguration, while the University of Southampton and Newcastle University will investigate optimisation of the software runtime management.

To help investigate power and reliability optimisations of many-core systems, the project will build a 1,024-core system with the help of the University of Manchester, which will contribute its knowledge of highly parallel systems, building on its work on the SpiNNaker architecture.

The SpiNNaker computing architecture. Manchester academics aim to use a million ARM processing cores to simulate the neuron network of the human brain. Photo: Manchester University

Researchers at the University of Southampton have also begun software work, modifying the Linux kernel power management system to try to capture data on the power and performance demands of an MPEG decoder.

"We're trying to get our intelligent power management to learn the task it's doing. We're trying to label tasks and how much they cost in terms of clock cycles. Based on that we will decide at what speed the processor needs to be operating at," Al-Hashimi said.

PRiME is a five-year project being undertaken by research groups from the Universities of Southampton, Imperial College, Manchester and Newcastle, funded by a £5.6m grant from the Engineering and Physical Sciences Research Council (EPSRC).

As well as investigating ways of dynamically adjusting power consumption, the project will also investigate ways of altering how an application runs based on a profile describing how important it, or the data it's handling, is. While a flipped bit in a processor register may need to be corrected when running flight software onboard a plane, it is less likely to need fixing in a tablet playing a video.



Nick Heath is chief reporter for TechRepublic UK. He writes about the technology that IT decision-makers need to know about, and the latest happenings in the European tech scene.



  • One problem

    "The problem with cramming more cores onto processors is that energy efficiency doesn't scale with the number of cores stacked onto chips. As more cores are added, power consumption grows faster than performance."

    I wouldn't go so far as to say that it's THE problem. Another problem with it is that it's more work to optimize programs for multiple cores. Either each individual program has to be written with multithreading in mind, or the operating system has to fill in that gap on its own, but many single-threaded programs aren't well-served by multiple cores.

    As it is now, it seems like multiple cores may be generally more useful for doing a bunch of smaller things (like server transactions) than big things.
    Third of Five
    • Other benefits of multicore seem to get overlooked.

      You don't have to rework your apps to be massively parallel in order to benefit from many cores. They also benefit the OS in allowing smoother running of multiple single thread applications at once. As somebody who has a dozen applications open most of the time, some of which are running CPU-intensive processes, having more cores is a huge benefit. It also obviously benefits applications which spawn multiple threads. Plus, it allows you to run multiple multi-threaded applications side-by-side. I'm a big fan of adding more cores on the desktop.

      On mobile, I don't see the reason for anything beyond 4-8 cores (for now). You really can't take advantage of running a lot of applications because the screens are too tiny. Once your mobile device is projecting a desktop-sized holographic display in front of you, that will change. But for now, the number of cores is keeping pace with the functionality of the devices pretty easily.
      • Not True, When a "holographic display" becomes commonplace

        there will be a holographic co-processor. Third of Five has a very valid point, today. What Intel is now doing is dynamically allocating power between CPU cores, including the GPU.

        This is one of the outstanding features in the new microarchitecture of Silvermont. Intel is predicting a 5X decrease in power consumption with Silvermont, and a 3X performance enhancement.

        What is interesting about Silvermont is that Intel is rolling it out first in the Atom line, along with 22nm lithography. Previously, new microarchitectures would begin with the i7 and it would take a year for the new technology to trickle down to the Atom.

        Silvermont is an all-new microarchitecture with features that address a broad range of products from mobile to servers.

        Programmers could take advantage of Intel's microarchitecture by following the techniques published in Intel's Architectures Optimization Reference Manual. It's more than just multiple cores. Spawning new threads may or may not be the correct approach.

        If two threads are spawned and they both use the same shared resources simultaneously, such as the memory bus, there is a performance hit rather than a gain. Thought must be given to how to group various functions to avoid resource conflicts.

        Cache memory, bus utilization, branch prediction, and the instruction execution core should all be considered.

        Based on the April 2012 revision of this manual, I found these techniques to be helpful. Most of these increase the performance of non-threaded apps too.

        Bus Utilization: Do not interleave variable reads and writes. Group them and separate them with some computational instructions.

        Cache: Define or align variables on 64 and 128 byte boundaries. Use local variables. When defining variables, group them as they are used in the code. Group variables into arrays. When using large variables, define them in 4K byte pages.

        Branch Prediction: Separate branch instructions. All that is needed is 3 μop instructions between branches. When branch instructions are contiguous, branch prediction is defeated.

        Instruction Execution Core: The newer Intel execution engines are capable of executing six simultaneous instructions. By using bitwise operators, the compiler will use less complex single-μop instructions. The programmer can assist the compiler by using expressions that give the compiler clues as to how to use the optimum instructions.

        x = x/2: the compiler, if multi-pass, may use a shift instruction
        x = x >> 1 leaves no doubt (a right shift by one halves the value)

        Loop Optimization: No more than 64 iterations. No more than 4 taken branches. It's not the number of branches in the code, but the number of branches actually taken.
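The alignment tips above can be expressed in standard C11 with alignas. A minimal sketch, assuming a 64-byte cache line as on recent Intel parts:

```c
#include <stdalign.h>
#include <stdint.h>

/* Aligning a hot array on a 64-byte boundary means it starts on a
 * fresh cache line rather than straddling one it shares with
 * unrelated data. */
static alignas(64) int hot_counters[16];

/* Grouping variables that are used together into one struct keeps
 * them on the same line, so one fetch brings them all in. */
struct ring_state {
    alignas(64) int head;
    int tail;
    int count;
};
```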
  • A Linear Array Processor

    Back in 1985-6, I worked on a simulation of a 1024 core linear array processor based on the SIMD (Single Instruction, Multiple Data) concept whereby a single program runs on all processors simultaneously but each processor has different data. This was part of the Wafer Scale Integration project at Anamartic. A simple example is a wages program where each processor handles a single employee but all employees are handled by the same program. Weather forecasting and air traffic control are other examples which naturally map onto such a system. The programs are simple to write and simple to debug; the processors can be simple and very low power yet the throughput is very high.
    This is very different from the current approach where complexity seems to be the order of the day. What happens when a software bug turns all processors into high power mode and a hardware meltdown results? How do you debug a multithreaded program designed to run on 1000 (1024?) processors?
    It's a shame that Anamartic died, and, without placing blame, it was not due to the technical excellence of its products and ideas.
    Has anyone heard of KISS? (Keep It Simple, Stupid!)
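The SIMD model the comment above describes can be sketched in a few lines of C: one function, applied unchanged to every data item, with a loop standing in for the 1,024 cores (figures invented):

```c
/* The SIMD idea in miniature: single program, multiple data. On the
 * linear array, each processor would run compute_pay on its own
 * employee; here the caller simply maps it over an array. */
typedef struct {
    double hours;
    double rate;
} employee;

double compute_pay(const employee *e)
{
    /* The same instructions run for every employee; only the data
     * differs, which is what makes the cores simple and low power. */
    return e->hours * e->rate;
}
```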
    • huh

      As I understand it, Anamartic's design is the same concept as multi-threaded processing, just less efficiently implemented.

      Not sure why people think multi-threaded design is hard; you can put your processes and loops into their own threads quite easily in most any language. It's the same as what you described: put all the employees in different threads so they can be run on other cores asynchronously to the base program.

      Most programs use multi-threading nowadays and those that don't are generally made by amateurs or are lower end and don't require threading. Multi-threading is compatible with both single and multiple cores and, if written properly, has no core limit.

      Not sure why debugging comes into it, because if you're doing multi-threading properly, it is still the main program and debugs just the same as a single-threaded program.
      Tom Gray
  • Ask the GPU manufacturers.

    There's already a class of processing units that are scaling to hundreds/thousands of cores on a single die: GPUs. They're found in practically everything now, from smart phones to desktops. And at the high end, yes they are scaling to hundreds/thousands of cores. nVidia's Titan has 2688 CUDA cores.

    So - if the CPU manufacturers want some insight into power consumption as the number of cores scales up - they should probably ask the GPU manufacturers, who are already dealing with this problem.

    "But many-core processors don't seem quite so distant when you consider there are already octa-core processors in desktops and servers – as well as specialist many-core processors such as the Xeon Phi Co-processor — and Moore's Law predicts there will be processors with more than 16 times the transistors of today's chips by 2020."

    Moore's "law" has pretty much ended from a "we're gonna scale up GHz and/or # of cores" standpoint. Maybe in servers octa-core is common, but it's difficult to find a consumer machine with more than two cores. You still need to find an enthusiast or gamer to find quad or octa-core, and honestly I wonder if we shouldn't have 16 or 32 cores by now for Moore's law to hold true.

    And I *don't* consider Moore's "law" to be an actual law. The *real* laws of physics simply won't allow it to continue indefinitely - there are limits to how small we can make things - once down to the atomic level, how do we make things smaller?

    I think if we are going to keep adding more cores, there may actually be a push to make CPUs simpler in order to fit more of them on a die. GPUs have already demonstrated that a simple enough execution unit can be replicated hundreds, even thousands, of times on a single die. But I suspect CPUs never made the jump because they're more complex.

    We're also approaching the point where scaling up doesn't make much sense from a consumer standpoint - very, very little consumer software can take full advantage of multiple cores. Your average PC user probably has 2 or 4 cores. Cell phones are up to four cores as well, but whether it really helps a mobile device is questionable, and it does drain more battery.

    The question may be - what's the consumer app that makes people want to have more than 2 to 4 cores? Other than games, there's not much out there that consumers want that would really push multi-core to its limits.

    Now for the real twist in the story: APUs. Your average GPU has to deal with the PCIE bus, but the upcoming APUs have no such limitation, and I'm sure that before long we'll be seeing the GPU side of APUs have 1000+ cores.

    What happens when we have 1000+ GPU cores, without the limitations of the PCIE bus?

    What happens when you can fit the power of a small data center on your desktop, or even in your pocket? What can we do with that?
    • It isn't just threaded software which benefits.

      "The question may be - what's the consumer app that makes people want to have more than 2 to 4 cores?"

      If people only ever ran one application at a time, this would be a valid point. However, having many cores allows the OS to run a dozen applications which each have multiple threads at the same time. More cores benefits multiple tasking as much or more than multiple threading. Even my mother-in-law, in her 70s, has a processor with 8 threads. Other than laptops, the only people I know who are still using a dual-core machine are businesses who haven't updated their PCs in 8+ years.

      The fact that GPUs have thousands of cores already shows that there are already consumer uses for massive numbers of cores. Those uses will continue to expand as we move forward.
      • um, yeah, ok, threads vs cores . . .

        um, yeah, ok . . . Intel did bring back 2 threads/core (hyperthreading) in more recent desktop CPUs.
      • Thousands of graphics cores

        That is relatively easy to imagine. One could imagine one tiny core per pixel as some sort of optimum perhaps, although those cores would not be all that busy most of the time. Coordinating them would also be pretty complex, but you get the picture (no pun intended).

        For OSs and apps, the situation is VERY different. That level of multithreading would be pretty much impossible to achieve and manage efficiently.

        Of course in servers, with literally millions of requests, the situation is perhaps more like what a GPU would face.
        • "relatively easy to imagine" NOT

          No, I cannot imagine one core per pixel. As GPUs are frame-oriented, a pixel core makes zero sense.

          OS/app multi-threading cannot be compared to the GPU.

          Servers are the least likely to be able to take advantage of a multi-core processor. Each request to a server involves executing basically the same set of instructions to retrieve data from storage and stream it to the network, and because the resources to do so are shared, only one core can have access at any given point in time.

          As with BillDem's thinking that more cores is better, it just does not pan out that way.

          To simply say 8 cores is better than 4 is nowhere close to reality.

          The only type of app that can benefit from multiple cores is one that has very few variables and instructions, where the variables can be contained within the execution core's registers and the instructions in the pipeline. If each core has its own cache, things begin to improve.

          Even if each app were running in a separate core, the code would need to be very carefully written to avoid resource conflicts.

          The only time multi-core offers a significant advantage is when an app is executing a small loop where instructions do not need to be fetched from RAM and data and code can be contained within the core's own cache.

          This generally only happens with hand coded assembly routines.

          If one app is primarily utilizing one resource, such as the PCI bus, and another primarily memory, then there may be a significant performance advantage if the code remains in cache and the code and data are well partitioned.

          None of these scenarios is likely to happen in real life use.

          Number crunching apps like cracking an encryption algorithm could make use of many cores. The app would have to be very carefully written.

          Mainstream processors with hundreds of cores are preposterous. It is not the energy consumption; as Intel is doing in their new Silvermont microarchitecture, they dynamically de-allocate power to unused cores.

          The reason it is impractical is the amount of silicon required for each core. Because shared resources limit the number of cores executing instructions versus the number waiting for resource access, the advantages are limited.

          Profitability of a semiconductor house is the yield per silicon wafer. If you have a large number of cores, the yield is fewer chips per wafer. Therefore each chip must be priced higher, with no benefit to the user.

          There are some apps such as CAE simulations that can utilize the expensive i7 processors with 8 cores. Conversely BillDem's mother cannot.

          NOTE: CAE=Computer Aided Engineering for chip design.

          CAE software is very expensive. The reason it is expensive is that the computationally intensive routines must be hand-coded in assembly language and optimized for a particular microarchitecture.

          In practical terms the i3 and i5 are the same as i7. In their programming documentation, Intel does not differentiate between i3, i5, and i7. They are all referred to as i7.

          Within i7 there are various microarchitectures. The difference between i3, i5, and i7 is the number of cores and the microarchitecture.

          There are i3 processors that outperform i5s. This can be seen in the Passmark scores.

          When it gets down to the nitty gritty, an Ivy Bridge i3 processor may outperform a Nehalem i7.

          There are hundreds of x86 processors being sold today, and there is a reason each one exists. It is not a simple task to pick the ideal processor for your own use.
          • err... there are many very common apps that are massively multi threaded

            server: IIS and SQL Server are very obvious ones
            desktop: internet browsers and virtually every app, as devs do not want their apps to freeze every time the user performs an action.
            On our servers we have in-house services running in excess of 100 threads with CPU usage in excess of 90% for 6 hours a day on average.
            Sure, you have to be careful when coding such apps, but this is not rocket science either!
          • Multi-Tasking is Something Else

            While similar in concept, they are very different.

            No it's not rocket science. Rocket science is very simple in comparison.

            Intel's programmers' optimization manual is 3,044 pages.

            The system bus is mostly for access to DRAM and PCI.

            The Intel Core processor has two symmetric cores that share the second-level cache and a single bus interface. Two threads executing on two cores in an Intel Core Duo processor can take advantage of shared second-level cache, accessing a single-copy of cached data without generating system bus traffic.

            If you are running a routine that has less than 32K of instructions and 32K data in each core, you can run all cores at 100%.

            That is not the case with browsers and web servers. These are streaming apps. They require moving data from PCI and DRAM to each core, but there is only one bus for PCI and DRAM: only one core at a time.

            If both cores are trying to access DRAM only one core can access DRAM at any given time.

            This is why increasing the number of cores has limited performance gains.

            Here are two small overview sections out of the 3,044-page optimization manual.

            8.3.1 Key Practices of Thread Synchronization
            Key practices for minimizing the cost of thread synchronization are summarized
            • Insert the PAUSE instruction in fast spin loops and keep the number of loop repetitions to a minimum to improve overall system performance.
            • Replace a spin-lock that may be acquired by multiple threads with pipelined locks such that no more than two threads have write accesses to one lock. If only one thread needs to write to a variable shared by two threads, there is no need to acquire a lock.
            • Use a thread-blocking API in a long idle loop to free up the processor.
            • Prevent “false-sharing” of per-thread-data between two threads.
            • Place each synchronization variable alone, separated by 128 bytes or in a separate cache line.
            See Section 8.4, “Thread Synchronization,” for details.
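The last two practices (preventing "false sharing" and separating synchronization variables by 128 bytes) amount to padding per-thread data out to its own cache-line region. A minimal C11 sketch of the idea, not code from the manual:

```c
#include <stdalign.h>

/* Give each thread's counter its own 128-byte region so that a write
 * by one thread never invalidates a cache line another thread is
 * using (128 bytes covers the adjacent-line prefetch on the
 * processors this manual targets). */
struct padded_counter {
    alignas(128) long value;
};

/* One slot per worker thread; neighbouring slots land 128 bytes
 * apart, so no two threads ever share a line. */
static struct padded_counter per_thread[4];
```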

            8.3.2 Key Practices of System Bus Optimization
            Managing bus traffic can significantly impact the overall performance of multi-threaded software and MP systems. Key practices of system bus optimization for achieving high data throughput and quick response are:
            • Improve data and code locality to conserve bus command bandwidth.
            • Avoid excessive use of software prefetch instructions and allow the automatic hardware prefetcher to work. Excessive use of software prefetches can significantly and unnecessarily increase bus utilization if used inappropriately.
            • Consider using overlapping multiple back-to-back memory reads to improve effective cache miss latencies.
            • Use full write transactions to achieve higher data throughput.
            See Section 8.5, “System Bus Optimization,” for details.

    Currently a single CORE is about 3.6 GHz.
    What I do not see is 2x 3.6 GHz cores but a 3.6 divided in two, 1.8 + 1.8
    Where's the REAL gain in that? There JUST AIN'T.
    My laptop was made in 2002 with a dual 3.2 GHz Pent-4.
    It's only a SQUEAK faster than my 3.0 AMD desktop.
    Both units are MAXED OUT in upgrades.
    They both take nearly the same time (1 to 3 seconds different) doing identical programs.
    "WHAT I SEE" there's no advantage in multi-cores....well except spending more moolah on hype.

    • Sigh . . .

      Sigh . . .

      No, dual cores are not dividing 3.6 GHz into two 1.8 GHz.

      Far more likely is that the application in question is using a single 3.8 GHz core, and not taking full advantage of dual cores.

      Also, you may have to consider what you're doing and how good *all* of the components are on the system. RAM could be full, hard drive could be slow. I don't know what your app is doing, but it might not be bottlenecked at the CPU.

      Computers are complex machines - sadly, a couple of numbers are rarely enough to tell their true performance for all things.
    • dumb

      Weedy Gonzales
  • AMD explanation

    BRIEF explanation from last nights email I received from AMD.

    1) "Single-core" processor running at 3.0GHz.
    2) "Dual-Core" processor(s) running at 1.5GHz each totals 3.0GHz.

    AND yes, ALL the PC components are balanced and working at their peak efficiency.
    • No

      That's not how it works.