Intel chases hyperscale datacentres with FPGA bolt-on

Summary: Intel is introducing reconfigurable silicon in its latest move to make it easier for web and cloud giants to customise their infrastructure for their hyperscale workloads.

Intel says that FPGAs can achieve 10 times the performance of CPUs. Image: Intel

As web giants such as Google and Facebook customise their huge IT estates to drive down running costs, Intel is trying to meet their needs by making its chips more flexible.

Its latest step is to marry its x86 Xeon processors with a field programmable gate array (FPGA), a chip whose core logic is reconfigurable using software.

Because the logic of an FPGA can be tailored to the demands of specific computing workloads — for instance, certain search query or video processing tasks — it can carry out these workloads more efficiently than a general purpose CPU. Intel cites benchmarks showing that FPGAs can achieve 10 times the performance of CPUs for specific tasks.

The FPGA will be combined with the Xeon in a single package that will fit into a standard E5 socket, and the processors will be linked via an Intel Quick Path Interconnect. Organisations can then customise the FPGA's logic to handle specific workloads, and if the demands of that workload change then they can reconfigure the logic of the FPGA accordingly.

FPGAs are already used within datacentres, but more commonly as discrete devices connected via PCI Express rather than packaged together with a CPU as Intel is proposing. The FPGA will have access to the CPU's cache hierarchy and, according to Intel, will be capable of twice the performance of discrete FPGAs.

Microsoft announced earlier this week that it had been experimenting with FPGAs in its datacentres, and found the configurable chips to be 40 times faster than a CPU at handling certain custom algorithms used by Bing search.

Where a company requires large numbers of chips with logic tailored to specific workloads they will often use application specific integrated circuits (ASICs), which can be cheaper and more efficient than FPGAs in volume.

There's no word from Intel on the cost or specifications of the FPGAs, but Intel already manufactures Altera's Stratix 10 FPGAs under a foundry agreement.

Intel already supplies chips customised to the computing needs of its largest customers, and the company delivered 15 custom products last year for customers including eBay and Facebook.

Diane Bryant, general manager of Intel's datacentre group, told the Gigaom Structure conference in San Francisco yesterday that this custom architecture would help satisfy "the move to scale-out, distributed applications".

The shift is being driven by cloud software providers and companies with large web presences which are building computing architectures suited to performing identical workloads on a huge scale.

Intel's announcement comes ahead of the release of 64-bit ARM-based chipsets aimed at providing a low power alternative to Intel's x86 CPUs in the datacentre, with AMD planning to make its ARM Cortex A57-based Opteron A1100 available this year.


Nick Heath is chief reporter for TechRepublic UK. He writes about the technology that IT-decision makers need to know about, and the latest happenings in the European tech scene.

  • To head off nay-sayers

    1. FPGA does not have to control the bus - not a security issue.
    2. Does not imply that the FPGA is frequently reprogrammed; it can be a secure flash operation that is only done occasionally.
    3. The FPGA does not need to execute software but only act as an accelerator for certain algorithms: encryption, hashing, compression, encoding/decoding (codecs) and pattern search.
    4. Data transfer to the FPGA can be very fast without DMA.
    5. Context switching is not necessary in this role as it is not acting as a processor but an accelerator. The FPGA signals when it is done.
    6. Keeping the logic very simple avoids wasted gates. A simple write-only memory-mapped register set and some read/write memory-mapped buffers for data transfer by the CPU (very fast) can keep the FPGA busy.

    In these roles the FPGA can be a very nice add-on.

    It is true that MCU and CPU cores can be implemented on an FPGA, but for this application that is not only unnecessary but disadvantageous: the Xeon would be much faster. An FPGA is faster only when it implements an algorithm in hardware, and in that role it can be much faster than the CPU.
    • Sorry, wrong.

      "1. FPGA does not have to control the bus - not a security issue."
      Depends on the actual implementation... It will be a very slow device if it can't access any data except what is contained in the CPU registers...

      "2. Does not imply that the FPGA is frequently reprogrammed; it can be a secure flash operation that is only done occasionally."

      Doesn't imply it isn't, either.

      "3. The FPGA does not need to execute software but only act as an accelerator for certain algorithms: encryption, hashing, compression, encoding/decoding (codecs) and pattern search."

      If it can't be reprogrammed on the fly, it becomes just another dedicated instruction. And that means it can't be used in a generic manner, as each program using the device can't have its own logic loaded.

      "5. Context switching is not necessary in this role as it is not acting as a processor but an accelerator. The FPGA signals when it is done."

      So you don't have to save the data (or the coding) the FPGA holds during a context switch to another process that replaces the data (or the coding)??? What magic prevents contamination? Use of only CPU registers??? If so, then it can't do encryption/decryption/compression/... as the blocks are larger than the CPU register file...

      "6. Keeping the logic very simple avoids wasted gates. A simple write-only memory-mapped register set and some read/write memory-mapped buffers for data transfer by the CPU (very fast) can keep the FPGA busy."

      So now you have context that must be saved and restored....

      Evidently you don't understand how CPUs operate...
      • do wish for an edit...

        "4. Data transfer to the FPGA can be very fast without DMA."

        Again, if limited to the CPU registers, the FPGA would be useless... Even going through a cache is a DMA action. And using more than CPU registers would mean there is context to be saved.
        • The final limitation

          is that any CPU that gets programmed (say, once) makes that CPU unique, and any program that needs that one function can only run on that one specific CPU. It can't run on any other...

          And that makes for a very limited capability. Not much better than just loading microcode.

          This may actually be useful for replacing faulty instruction implementations. It would allow for faster execution (nearly as fast as an error-free instruction) than what can be done in microcode.
          • 2 Uses

            I see three primary uses for the FPGA:
            1. Special instructions. Sections of an FPGA can be reprogrammed on the fly rather than the entire device. This lets it adapt to changing needs.
            2. Building a parallel processor fabric. This lets it direct data from one processor to the next without having to store it in memory first. It allows many different kinds of weaves for the fabric.
            3. Preprocessor. It could pull data from the IO and do some limited conditioning before pushing it to the processor. In a parallel processor fabric it could also provide some preconditioning of the data.

            Sorry Jesse, while you make some good points, I think DevGuy_z is mostly correct. Many things depend on how the FPGA is used.

            Please also note that I am sure we will come up with uses for the FPGA that no one imagines now.
          • 3 Uses

            Also wish there was an editor
          • It doesn't work for inter-core connections.

            The FPGA is not communicating with the other cores... or other processor sockets.

            The problem is that it makes the core unique... and that makes the software non-portable.

            If the FPGA is reloaded for each process context switch then it is much more useful - and allows load balancing across cores - but that makes the context switch much slower...

            He may be right for use in an embedded unit though. In that situation, making the core unique isn't a problem.
          • I'm wrong here. There is an inter-FPGA routing...

            Unfortunately, this causes a lot of synchronization problems between the FPGAs when they are in the process of being reconfigured.

            This in turn causes a long delay while reloading. It isn't fast at startup.

            Now once the FPGAs are loaded and synched they can process fairly fast... But it appears to require dedicated use... which makes it nearly impossible to change processing structures on the fly.

            Which is the same problem any attached processor has.

            Its data path is through a PCI-E interface, but no IOMMU is indicated for use. This makes each connection to the system a security weakness (the same problem existed in older Cray XT systems - a bad program loaded into the Xilinx FPGA would crash the node. Don't know if they fixed that yet).
          • wrong about being wrong.

            Really, really wish for an edit. The message was supposed to be about the MS Catapult...
          • Agreed on edit

            it would be nice to edit. My grammar is terrible on the fly.
          • If you follow my architecture it is simply memory mapped

            And interfacing with the CPU requires no context switching or bus management.

            As I have said before, I have this exact implementation working - with ARM rather than x86 - and it works great.

            1. Secure, as there is no chance for the FPGA to control anything, and the code for flashing the FPGA ROM isn't normally included.

            2. No context switching necessary.

            3. Fast IO to and from the FPGA: using a 50 MHz clock on a Spartan-6, reading and writing 8K buffers over a 16-bit data path.

            4. Only uses 15 bits of address space out of the 32 bits available. The data bus is 16-bit. The Cortex-M has a Harvard architecture, which supports simultaneous address and data bus access.

            5. FPGA handles real-time analog data processing and hardware control logic. We use DSP slices for some math.

            We make very efficient use of the FPGA: we conserve logic and obtain high-speed processing. It is simple to understand, simple to document, and it vastly improves the performance of our system. The MCU (ARM) gives us the flexibility of executing general software, while the FPGA gives us the performance for off-loading certain algorithms that the CPU would otherwise have to deal with.
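            A minimal host-side sketch in C of the flow described above. Everything here is an assumption for illustration: the register names, flag bits and buffer size are invented, and the fabric is simulated by a plain function so the sketch can run anywhere. On real hardware the statics would be volatile pointers into the memory-mapped FPGA window, and completion would arrive as an interrupt.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative layout only (hypothetical names and sizes). */
#define BUF_WORDS 4096u                /* 8K buffer, 16-bit data path    */
#define CTRL_START 0x0001u
#define STAT_DONE  0x0001u

static uint16_t ctrl_reg;              /* write-only control register    */
static uint16_t status_reg;            /* read-only status: bit 0 = done */
static uint16_t in_buf[BUF_WORDS];     /* CPU writes operands here       */
static uint16_t out_buf[BUF_WORDS];    /* CPU reads results back here    */

/* Stand-in for the fabric: a real design would run its algorithm at the
 * FPGA clock and raise an interrupt. Here it just inverts each word so
 * the control flow can be exercised on any machine. */
static void fpga_step(void)
{
    unsigned i;
    if (ctrl_reg & CTRL_START) {
        for (i = 0; i < BUF_WORDS; i++)
            out_buf[i] = (uint16_t)~in_buf[i];
        status_reg |= STAT_DONE;       /* hardware would raise an IRQ    */
        ctrl_reg = 0;
    }
}

/* Host-side flow: fill the buffer, kick off the job, wait for the done
 * flag, read the result back. No DMA, no FPGA-side context to save. */
void offload(const uint16_t *src, uint16_t *dst, unsigned n)
{
    memcpy(in_buf, src, n * sizeof(uint16_t));
    status_reg = 0;
    ctrl_reg = CTRL_START;             /* memory-mapped write starts job */
    fpga_step();                       /* hardware would run on its own  */
    while (!(status_reg & STAT_DONE))
        ;                              /* or sleep until the interrupt   */
    memcpy(dst, out_buf, n * sizeof(uint16_t));
}
```

            The point of the shape: the CPU does the (fast) buffer copies itself, one register write starts the job, and there is no FPGA-side state to save or restore around it.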
          • I only see 1, 3 and a possible 4. 2 I think is a stretch and redundant.

            I see preprocessing of IO, post-processing of IO and algorithm evaluation as the three use-cases. I could see a single FPGA implementing multiple algorithms, or having parallel algorithms (same logic but separately executed, like a thread).

            2 would be highly expensive without much ROI. The FPGA is not ideal in this role; it would be slow, and these fabrics already exist.
          • Hard to implement well

            The CPU is very flexible as it generically executes software. You are correct that it is limited to a specific instruction set.

            Personally I don't see correcting an instruction set as an efficient or typical use-case for an FPGA. To do so requires the FPGA to have control and to integrate tightly with the processor, which means that a large portion, if not the majority, of its gates have to be allocated to that kind of management and integration. As FPGAs are expensive, this doesn't make a lot of sense. It doesn't solve a very big problem.

            What you want to solve are the kinds of problems a typical CPU sees hundreds of times a second: encoding, decoding, encryption, hash calculation, certain search algorithms. Those you off-load. With an FPGA the algorithm can be updated as needed, but not frequently. And since the FPGA relies on the CPU for data transfer and only signals the CPU, its gates can be devoted to actual calculation or algorithm execution rather than to context switching, DMA, bus mastering etc.
        • It's all in how you read it

          The FPGA could read/write I/O ports, or memory. In either case, you don't HAVE to use DMA; data transfer CAN be very fast without it.

          An example [not server related]: raw data from an ADC can be piped through an FPGA for a Fourier transform (FFT) and/or DSP filtering, before analysis/correlation/decoding in the CPU.

          Just the thing for beamforming radar, or even EEG analysis - if only ARM FPGAs weren't already doing this.
          • "Fast" is relative...

            A CPU is about 10x faster than the fastest I/O it can drive.

            So wasting a CPU just to do I/O is not "fast" or "efficient".

            That is why DMA is done - it allows an expensive resource (the CPU) to do something more demanding.
          • yes as long as the calculation is way more expensive for the CPU

            Using my approach does mean a CPU IO hit (which is not free), but that hit is far less expensive than what you gain. If the CPU takes 100x longer to calculate a hash than an FPGA, then what is a few cycles to transfer some IO?

            Trust me, if you keep it simple you can get a very high return on the CPU IO investment. A 40x increase is doable.
          • You got it! This is what I am trying to say.

            You CAN use an FPGA to do all kinds of things like bus mastering or DMA etc, but this requires a lot of gates, and unless you really need it...
        • the other way around. CPU reads/writes FPGA memory and FPGA registers

          Many FPGAs reserve space for memory that is meant to be accessed like ordinary memory. Typically this isn't huge, but it is usually sufficient for buffering etc. The memory can be read-only, write-only or read/write. Often you use separate read and write buffers, as it makes the logic less complicated. FPGA registers are used for configuration and control. The speed depends on the bus clock and the data bus width.
      • This isn't hard. I never said CPU registers. CPU accesses FPGA memory

        Most hardware does not support context switching on an x86 platform; you simply wait until it finishes and it signals you when done. This works quite well, especially when the hardware finishes very quickly. It is no different for an FPGA. So no, you DO NOT have to save data.

        I have seen a number of hardware implementations involving CPU and FPGA and none of them required context switching. While you could support it with an FPGA it would be a waste of logic.

        The FPGA accesses nothing; it only raises an interrupt. The CPU accesses FPGA memory, and this is very fast (normally not as fast as main memory, but still very fast). FPGA memory can be pretty large for embedded purposes: 8K blocks are no problem, and I am sure with some FPGAs you could do much more. These are memory mapped.

        If you had a 32-bit data bus, an FPGA clocking at 100 MHz (some hit a gigahertz) and a separate address bus, you could achieve 400 MB/s data transfer. That's 20 µs to fill an 8K buffer. For encryption, assuming 10-100x calculation speed compared to the CPU, the overhead of the transfer is far more than made up for by the calculation speed. Note that the FPGA does not handle the data transfer; the CPU does. This is a CPU hit, but then the CPU doesn't have to do the calculation either.

        A significant calculation could take a CPU a few ms to finish and would probably see a few context switches during that time (save/load state). In my scenario the CPU would hand it over to the FPGA, which might solve the same problem in 10 µs, so you have 1 ms vs 50 µs (I am being conservative, because the read-back of the result would be smaller and not take 20 µs).
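        The transfer figures above can be sketched as a quick check in C; the inputs (32-bit bus, 100 MHz clock, 8192-byte buffer) are the assumptions from this comment, not measured values.

```c
/* Quick check of the figures quoted in the comment above. */

/* Peak CPU-driven transfer bandwidth, in bytes per second. */
double bus_bandwidth(double bus_bytes, double clock_hz)
{
    return bus_bytes * clock_hz;          /* 4 B x 100 MHz = 400 MB/s */
}

/* Microseconds to fill one buffer at that bandwidth. */
double fill_time_us(double buf_bytes, double bandwidth)
{
    return buf_bytes / bandwidth * 1e6;   /* 8192 / 400e6 = ~20.5 us  */
}
```

        With those inputs the bandwidth works out to 400 MB/s and the buffer fill time to about 20.5 µs, matching the figures quoted.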

        Again, if you look at modern HW peripherals, none of them are designed to save their state. That is why they are faster; they don't carry such overhead. The CPU saves its own state and could save and reconfigure the hardware state, but that is very wasteful. Typically the CPU configures the hardware, signals it to proceed and then waits to be interrupted with the result.

        Programming FPGAs on the fly is slow and wasteful, and not nearly as good as implementing an algorithm infrequently and just using it over and over.
