Microsoft to implement 'Catapult' programmable processors in its datacenters

Summary: A Microsoft Research pilot focused on field-programmable gate arrays in datacenters has passed muster and will be implemented by the Bing team in 2015.


Microsoft researchers have been experimenting with field-programmable gate array (FPGA) processors in an attempt to make the company's datacenters more efficient.


Researchers collaborated with Microsoft's Bing team to test a pilot of "Catapult," a programmable hardware/software "fabric," on more than 1,600 Microsoft datacenter servers running Intel Xeon processors and Altera FPGA chips. The goal of the pilot was to see if FPGA-enhanced servers could provide faster, better-quality search results at a lower cost. The answer, it turned out, was yes, and now Microsoft is planning to roll out FPGA-enhanced, Bing-powered servers to process customer searches starting in early 2015.

"The system takes search queries coming from Bing and offloads a lot of the work to the FPGAs, which are custom-programmed for the heavy computational work needed to figure out which webpages results should be displayed in which order," as Wired explains in its write-up about the new technology. According to MSR Director of Client & Cloud Apps, Doug Burger, who is heading up the pilot, the FPGAs are 40 times faster than a CPU at processing Bing's custom algorithms, and the overall system will be twice as fast as Bing's existing datacenter system. Microsoft will be able to chop the number of servers it is using to dish up these Bing queries by half, as a result.

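The two figures quoted above (40 times faster on the FPGA, twice as fast overall) can be related with a back-of-envelope Amdahl's-law estimate. This is only a sketch under a loud assumption that does not come from Microsoft's paper: it treats the 2x figure as an end-to-end speedup and asks what share of the work would have to be offloaded to reach it. The function names are illustrative.

```python
def overall_speedup(offloaded_fraction: float, accel_speedup: float) -> float:
    """Amdahl's law: total speedup when only part of the work is accelerated."""
    return 1.0 / ((1.0 - offloaded_fraction) + offloaded_fraction / accel_speedup)


def fraction_needed(target_speedup: float, accel_speedup: float) -> float:
    """Invert Amdahl's law to find the fraction that must be offloaded."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / accel_speedup)


# Assumed inputs: the 40x and 2x figures quoted in the article.
p = fraction_needed(target_speedup=2.0, accel_speedup=40.0)
print(f"offloaded fraction: {p:.3f}")
print(f"check, overall speedup: {overall_speedup(p, 40.0):.2f}")
```

Under that assumption, a little over half of the per-query work would need to run on the FPGA. The gap between 40x and 2x is what Amdahl's law predicts when the rest of the pipeline stays on the CPU.
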
The pilot is described in a new Microsoft Research white paper, published on June 16, entitled "A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services."

Microsoft Research has been doing work in the FPGA area for at least the past several years. Microsoft Technical Fellow Chuck Thacker has been working on a project to help build FPGAs -- which are semiconductors that can be custom-configured after they're manufactured -- along with industry and academic researchers on the Research Accelerator for Multiple Processors (RAMP) consortium, for example.
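
To make "custom-configured after they're manufactured" concrete: FPGA logic is built largely from small lookup tables (LUTs) whose contents are loaded as configuration data. The Python sketch below is a toy model of that idea, not a description of Altera's architecture; the `Lut` class and its methods are invented for illustration.

```python
from itertools import product


class Lut:
    """Toy k-input lookup table: the structure is fixed, the contents are not."""

    def __init__(self, num_inputs: int):
        self.num_inputs = num_inputs
        self.table = [0] * (2 ** num_inputs)  # unconfigured: always outputs 0

    def configure(self, func) -> None:
        """Load a new truth table, loosely analogous to loading a bitstream."""
        for bits in product((0, 1), repeat=self.num_inputs):
            self.table[self._index(bits)] = int(bool(func(*bits)))

    def evaluate(self, *bits: int) -> int:
        """Look up the output for one combination of input bits."""
        return self.table[self._index(bits)]

    @staticmethod
    def _index(bits) -> int:
        index = 0
        for bit in bits:
            index = (index << 1) | bit
        return index


lut = Lut(2)
lut.configure(lambda a, b: a ^ b)  # behave as XOR
xor_out = [lut.evaluate(a, b) for a, b in product((0, 1), repeat=2)]
lut.configure(lambda a, b: a & b)  # same table, reloaded to behave as AND
and_out = [lut.evaluate(a, b) for a, b in product((0, 1), repeat=2)]
print(xor_out, and_out)
```

The same object computes a different function after each `configure` call, which is the property that distinguishes an FPGA from a fixed-function chip.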

The researchers predicted in the white paper that programmability of FPGAs is going to be an issue in the long term. Currently, domain-specific languages like Scala and OpenCL, plus FPGA-targeted tools, can be used. But more integrated development tools are going to be needed within the next 10 to 15 years, "well past the end of Moore's Law," the paper's authors said.

"We conclude that distributed reconfigurable fabrics are a viable path forward as increases in server performance level off, and will be crucial at the end of Moore’s Law for continued cost and capability improvements. Reconfigurability is a critical means by which hardware acceleration can keep pace with the rapid rate of change in datacenter services," the authors concluded.


Mary Jo has covered the tech industry for 30 years for a variety of publications and Web sites, and is a frequent guest on radio, TV and podcasts, speaking about all things Microsoft-related. She is the author of Microsoft 2.0: How Microsoft plans to stay relevant in the post-Gates era (John Wiley & Sons, 2008).

  • That is interesting...

    Let's hope they figured out how to make the FPGA secure.

    The usual problem is that the normal architecture of the bus (not CPU) is that attached processors have full run of the physical memory...

    Imposing an IOMMU makes it slower...

    And using an FPGA makes the hardware inherently insecure (no easy context save/restore - the FPGA has to be reprogrammed in its entirety, thus the context is everything in the registers + the FPGA configuration + any bits defined by the FPGA...)

    This was the reason FPGAs failed years ago, other than defining specific dedicated operations that didn't change after being loaded. But it made the AP inflexible.

    So, this is interesting, if they really solved the problems.
    • In Complete

      FPGAs can be anywhere from small, basic chips to massive ones that have some interesting fixed cells/cores in them. Without giving any specifics on the FPGA, i.e. model number and a quick summary of its abilities, the article really says nothing. Most articles on Microsoft's new venture are missing this info. After searching the web and looking at a half dozen articles, I still know nothing more.
    • I disagree

      First, the FPGA doesn't need to see the entire address bus (assuming it is memory mapped, which it doesn't have to be). Depending on the complexity needed (number of gates and registers) it could use a much smaller addressable location (say 12 bits). The data bus could also be smaller.

      Secondly, because the logic used in an FPGA is often more primitive than what you might see in software, and can be simulated, it is easier to get correct.

      Thirdly, as long as the FPGA doesn't master the bus i.e. control it directly but results are retrieved by CPU then I don't see a security issue at all. It sounds like they use the FPGA as a co-processor for certain algorithms. In such a case the CPU would transfer data to an FPGA buffer, initiate the calculation and then retrieve it from another buffer. This leaves security totally in control of software (CPU). The FPGA is passive. Having the FPGA control the bus is way more complicated as you have bus contention issues.
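
      The copy-in, start, copy-out flow described here can be sketched as a toy Python model. Everything in it is assumed for illustration: the `PassiveAccelerator` class, its buffer sizes and the checksum "algorithm" are hypothetical and say nothing about how Catapult actually attaches to the host.

```python
class PassiveAccelerator:
    """Toy model of a memory-mapped coprocessor that never masters the bus."""

    def __init__(self, buffer_size: int = 16):
        self.input_buffer = bytearray(buffer_size)  # written by the CPU
        self.output_buffer = bytearray(4)           # read back by the CPU
        self.done = False

    def start(self, length: int) -> None:
        """Run the fixed function over the first `length` input bytes."""
        total = sum(self.input_buffer[:length]) & 0xFFFFFFFF
        self.output_buffer[:] = total.to_bytes(4, "little")
        self.done = True


def offload_checksum(device: PassiveAccelerator, data: bytes) -> int:
    """CPU-side driver: copy data in, start the device, poll, copy result out."""
    if len(data) > len(device.input_buffer):
        raise ValueError("data does not fit in the device buffer")
    device.done = False
    device.input_buffer[:len(data)] = data
    device.start(len(data))
    while not device.done:  # a real driver would poll a status register
        pass
    return int.from_bytes(device.output_buffer, "little")


result = offload_checksum(PassiveAccelerator(), bytes([1, 2, 3, 4]))
print(result)
```

      The device object only ever touches its own two buffers. All movement of data to and from host memory happens in `offload_checksum`, i.e. on the CPU side, which is the point being argued.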

      Regarding context, this is the case of almost all hardware and I am not sure why you would want to save/restore context as long as power is applied it will be fine. A reset can automatically reload the FPGA.

      And FPGAs haven't failed at all. They are being used more and more. The only problem I see with FPGAs is cost, power and PCB layout (modern FPGAs are pretty dense). They are pricey and consume a lot of power relative to an ASIC.

      The benefit of FPGAs is they can act as dedicated high speed coprocessors for certain operations. That is why they used to be used for bitcoin mining.

      Conceptually from a security standpoint there isn't much difference between an FPGA and an ASIC. The ROM could be controlled by a dedicated interface (JTAG) to prevent reprogramming and usually you would need to emit a reset anyway.
      • I should probably add that FPGAs can have their own internal memory

        Which is accessible to the CPU (if memory mapped) and thus the FPGA doesn't need access to CPU memory at all.
        • Cores Also

          They can also have processing cores. Some have more than one fixed core. Others are large enough that you can download a core to them. The real question is what FPGA they are using.

          The June 16 white paper answers a lot of these questions. This is a very technical paper and will take a couple of hours to read.
          • It doesn't make sense to me to implement a CPU

            Intel already has a fast CPU. Implementing a CPU doesn't make sense to me. What makes sense is to take algorithms that the CPU would normally have to execute in software and implement them in hardware with a huge speedup. That is the way to use FPGAs: as specialized co-processors.

            So yes it is possible to have an FPGA with various MCU cores, but why?
          • Parallel Processing

            There are many tasks for which a full-fledged Intel core is overkill. Data can be preprocessed in a smaller core with simple algorithms. The bigger core can then handle the complex stuff.

            Warning, your mind may explode when you start to think about true asynchronous multitasking on many cores. Following the data through several cores and realizing it is really a wave instead of a particle can warp your senses.
          • But intel already has that or could just use ARM, why FPGA

            Intel already has a crippled-core approach (I think they have a coprocessor card with 96 cores). Or you could just use a bunch of ARM cores. I see no point in doing this with an FPGA (way more expensive AND more power and slower).
          • FPGAs are expensive, use more power and are slower than ASIC

            So it doesn't make sense to implement MCU or CPU cores as your total processing scheme. The reason people create cores on FPGAs is so that they can combine both software and hardware logic.

            But for what they are doing I would just keep it hardware logic as this is very fast (orders of magnitude) where an MCU would execute slower than a dedicated core.
          • There are several reasons

            1. interface handling of the host (addressing, DMA,..).
            2. directive interpretation
            3. reconfiguration of the PGA
            4. speed - a hardwired processor (a core) will execute faster than one implemented via gate arrays. The data paths are smaller, register files closer to the processing units...

            Gate arrays are very good for custom applications. But they are not known for being fast.
          • Huh, a gate array can beat a CPU by orders of magnitude.

            That is why people use them for bitcoin mining. A CPU mostly executes serially while a gate array can "execute" 100s of logic decisions in parallel in a single clock. When an FPGA is programmed and used correctly it can beat not only an embedded core but an external core even though the external processor is clocked much faster. When people use FPGAs for bitcoin mining they don't implement a core on the FPGA that would be slower than using the main external processor. They implement many parallel hash algorithms with each being "executed" much much faster than the main processor could ever execute them. The processor has to decode instructions, manipulate stack, update registers, while the FPGA is just clocking gates at the propagation delay of the logic.
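
            The serial-versus-parallel argument can be put into rough numbers. The sketch below is a toy throughput model only: every figure in it (clock rates, cycles per hash, number of pipelines) is a made-up placeholder, not a measurement of any real CPU or FPGA.

```python
def cpu_hashes_per_second(clock_hz: float, cycles_per_hash: int) -> float:
    """Serial execution: one hash costs many instruction cycles."""
    return clock_hz / cycles_per_hash


def fpga_hashes_per_second(clock_hz: float, pipelines: int) -> float:
    """Each fully pipelined unit finishes one hash per clock once it is full."""
    return clock_hz * pipelines


# Hypothetical numbers chosen only to illustrate the shape of the argument.
cpu_rate = cpu_hashes_per_second(clock_hz=3e9, cycles_per_hash=3000)
fpga_rate = fpga_hashes_per_second(clock_hz=200e6, pipelines=4)
ratio = fpga_rate / cpu_rate
print(f"CPU {cpu_rate:.0f}/s, FPGA {fpga_rate:.0f}/s, ratio {ratio:.0f}x")
```

            With these placeholder inputs the slower-clocked part wins by a wide margin, because it completes work every clock in several pipelines at once instead of spending thousands of cycles per result.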

            The FPGA is optimal (vs CPU or embedded core) because it is dedicated. An ASIC though does it even faster with lower cost and lower power but can't be configured.
        • quite true.

          But that is part of the context that must be saved/restored when switching processes...
          • Not if you are not executing processes or software.

            I have no idea what Microsoft is up to, but my idea of utilizing FPGAs would not be to implement a CPU or MCU core but simply to transfer algorithms that normally would be coded in software to the FPGA and execute them in hardware. Encryption, compression, decode, encode, codecs and search algorithms would all be excellent uses of an FPGA, and you could achieve orders of magnitude faster execution than if done at the software level. As soon as you turn an FPGA to executing software you eliminate the performance gains.
          • Lots of Guessing

            Before posting a lot more guesses you all really need to read the Microsoft white paper. Warning it is very technical and will take a couple of hours to read and understand. They are really going after a new concept in computer science.
          • Fabric

            Also, you are talking about cores. Your thoughts need to cores and think about fabrics. Similar to physics, sometimes you deal with light as a particle but sometimes your thoughts need to expand and you have to think about wave. The transition area from particles to waves will warp your mind.
          • Fabric (corrected)

            Also, you are talking about cores. Your thoughts need to abandon cores and think about fabrics. Similar to physics, sometimes you deal with light as a particle but sometimes your thoughts need to expand and you have to think about light as waves. The transition area from particles to waves will warp your mind.
          • If the device is a dedicated function...

            Then an ASIC would be faster.

            And those functions you list (encryption, compression decode, encode, codecs, search) all have context data (and programming) that must be saved/restored.

            FPGAs are slow due to the long data paths they have to use.
          • I still don't see why you need to save/restore context

            In any architecture I have ever seen, it has never been a requirement that hardware context be saved and restored. You simply maintain power. A CPU needs context because it does multiple things that it can't do simultaneously unless it has multiple execution paths. But even with multiple execution paths, context switching is frequent. Not so with hardware (ASIC or FPGA): you never need to change context and restore it. For one, this is very wasteful. Once the hardware begins clocking the logic through, it goes to completion, probably much faster than any context switch would take.
      • No. To interface with the rest of the system requires

        the bus to match. Otherwise you don't know where in memory the result will go.

        Even using a 12 bit address can destroy an OS...

        Saving context is required to maintain proper operation of the device. Otherwise any process would be able to alter another process's code (as loaded into the FPGA). Thus context is necessary - even if it is dedicated to only one process - that dedication alone is context. Plus, the device has to be limited to the addresses used by that process... it can't just write anywhere...

        The loading of an ASIC/FPGA via jtag is one approach - unfortunately, it is SLOW. Quite reasonable when it is done rarely, unreasonable when it has to be done frequently.

        I wasn't saying FPGAs failed as FPGAs. What they failed at was being reconfigurable as part of general processing.

        Even those GPUs (which are ASICs) can't be reconfigured - the operations they are coded for cannot be replaced.

        Those that have programs loaded into them have special purpose programs - and they are dedicated to a single process unless they have context capability, and IOMMU control (some don't... the Xilinx boards made some processing features very fast... but no security whatsoever - a bad program loaded into them could take over the host).

        And bitcoin mining is one place they work quite well - when the FPGA does nothing else.
        • Misunderstanding and wrong on the bus.

          I didn't say load the FPGA via JTAG. I said load the FPGA ROM from JTAG. A reset then causes the ROM to program the FPGA - very fast.

          You are clearly wrong. It is quite possible for the FPGA or any other hardware peripheral to sit on a portion of the bus. I have done it and such things have been done with Intel architectures. And it doesn't need to control the bus.

          I don't see the best use of an FPGA as a processor; it won't do as good a job as an x86 CPU. What it does extremely well is implement logic constructs. The FPGA is best used as a dumb hardware execution unit. No stack, no PC, no AR, just configuration registers, memory buffers (memory mapped) and hardwired logic.