The new ONNX optimizations come from work the Bing team has done around BERT (Bidirectional Encoder Representations from Transformers). BERT is unlike previous deep-neural-network architectures that process words individually. Instead BERT uses a model type called transformers.
Microsoft execs have said that deep learning is widely used across the Bing search stack to run a number of its "intelligent" features. Natural-language models are used to improve Bing's understanding of search intent and related web pages.
Microsoft Bing and ONNX Runtime team have been working together to build the fastest, cheapest, and easiest way to run large transformer networks in production. Microsoft officials say the resulting technology offers more than ten times improvements in latency and over 800 times improvements in throughput.
Microsoft is incorporating these updates in the ONNX Runtime. Officials say that developers will be able to use the updated ONNX Runtime on any cloud or on premises with a choice of CPU or GPU.
Microsoft increasingly is using the ONNX Runtime to run advanced AI models across the company's various products and services, including Bing, Office, Azure computer vision and more. Many of the deep learning models from Microsoft's Project Turing -- machine reading comprehension for the semantic search portion of Microsoft Search -- runs on the ONNX runtime, as well.