Windows on ARM isn't new: from Windows Phone to Windows RT to Windows IoT, Microsoft has had multiple systems that take Windows beyond the familiar Intel and AMD processors. Older versions of Windows ran on PowerPC, Alpha, Itanium, and MIPS, after all, and in 2009 an unofficial internal project had Windows 7 running on ARM. Development continued for ARMv7 32-bit processors with VFP floating point, NEON (ARM's version of Intel's SSE instructions for processing data in parallel) and the Thumb-2 instruction set.
But when that shipped as Windows RT, it only ran apps that had been specifically written and compiled for ARM using only the WinRT APIs. The idea was to turn Windows into an OS that was designed for mobile -- like iOS -- to get better security and battery life. But not running standard Windows programs -- either recompiled or in emulation -- was an artificial limitation (although designing Windows RT devices just for Store apps meant they used rather underpowered Tegra SOCs that couldn't have delivered good emulation performance anyway).
The new Windows on ARM devices that are about to go on sale aren't taking the same approach. They use 64-bit Snapdragon 835 SOCs (which Qualcomm calls a 'Mobile PC Platform') and they can run many more applications. Although the first systems to go on sale will all come with Windows 10 S, which only runs apps that come from the Store, those Store apps can include standard desktop x86 apps that have been packaged up for distribution through the Store. There's also a free upgrade to Windows 10 Pro -- not a special or limited version of Windows, but the full Windows 10 Pro that Microsoft has compiled for ARM.
Install Windows 10 Pro and you can install standard Windows applications, with only one real limitation: even though it's a 64-bit version of Windows and the Snapdragon 835 is a 64-bit Kryo CPU, only 32-bit x86 applications are supported -- not 64-bit x64 code. So how does that work, and how well will it work?
Windows on Windows on ARM
Windows itself -- both the Windows kernel and the features inside Windows like the shell and File Explorer -- runs as native ARM 64 code. So do the NTDLL system services that let apps talk to the kernel, and system DLLs for storage, graphics, networking, and other device drivers that talk to the kernel, which means they get native hardware speed. UWP applications from the Store have been compiled into native ARM code, but x86 code runs in emulation, on top of the WOW (Windows on Windows) abstraction layer.
If that sounds familiar, it's because WOW has been in Windows for a long time. The first version was a subsystem that translated 16-bit APIs to 32-bit equivalents (a process called 'thunking') for running 16-bit code on 32-bit Windows (where all the 16-bit applications ran in a single virtual machine). Windows 10 still uses WOW for running 32-bit applications on 64-bit versions of Windows -- not just redirecting DLL calls but also mapping or mirroring registry keys from their 64-bit to their 32-bit equivalents, registering ODBC connections and providing a 32-bit CMD.EXE for command line calls, to create a full 32-bit environment for 32-bit applications to run in.
Those applications don't use virtualisation, like a virtual machine (which is about running code efficiently on a different operating system, not a different kind of hardware); they run on a CPU emulator.
On an x64 PC with an AMD or Intel CPU, 32-bit code doesn't need translating: the processor executes x86 instructions directly, so performance is pretty much the same as it would be on an x86 CPU. On an ARM system, the emulator runs in software: Microsoft has implemented what it calls a Dynamic Binary Translator, which translates blocks of code to ARM 64 code as they run and caches them in memory or on disk, so they don't have to be translated again the next time you run the same application.
The translator has to cope with differences beyond the ARM instruction set, like memory ordering and exception handling, which are both different on RISC processors like ARM compared to Intel and AMD CPUs.
ARM's looser memory consistency means multiprocessor systems can use much cheaper caching hardware, but they also have to create 'barriers' to preserve the order of memory when it matters -- like making sure that all the threads in a multi-threaded program see the correct value in a variable when it gets updated, rather than some getting the old value and some the new. The translator has to manage those barriers, and it has to strike a balance between adding so many barriers that the emulation runs slowly, and adding so few that one thread gets the wrong value from the variable and the code crashes.
Dynamic recompilation works on blocks of code rather than the whole program, translating logical chunks of code as the program reaches them. A block might end at a branch instruction, because the branch determines which code is needed next. The translated code can start running immediately, rather than having to wait for all the code to be ready; dynamic translation also gives the translator more information about runtime behaviour than static recompilation done in advance.
Just In Time
This kind of transcoding emulation (sometimes called Just In Time or JIT) is much faster than interpretive emulation, which steps through code one instruction at a time, simulating each processor instruction in turn. Instruction emulation is hundreds or even thousands of times slower than native code. Just-in-time translation is still slower than running native code -- the first time you run it, the code might run perhaps fifty times slower, but once the translated code is cached, it can run at up to 99 percent of the speed of native code.
The actual performance will depend on how 'compute bound' an application is: does it spend most of its time using the CPU for computation, or is more of the time spent in system and kernel code or loading files, using the network or drawing graphics? The former is slower because of emulation, while all of the latter run at the native hardware speed. Obviously, any application that itself generates code and then runs it will be rather slower, because both the application and the code it generates have to be translated.
"If the app is using the hard disk, graphics, or networking, all of this runs in the kernel and is running at native performance. If the application is CPU bound, it takes more time than native because it has to be translated. This will also vary by application. In our testing we have found that most of the apps running under emulation are consistent with user's expectation of responsiveness," Windows general manager Erin Chapple told ZDNet.
Even more dynamic libraries
One of the ways Windows on Windows improves performance for 32-bit applications on 64-bit Windows is, slightly counter-intuitively, by running copies of system DLLs that come with Windows as 32-bit code in emulation.
Microsoft has the source code and could easily recompile them to native 64-bit in advance, but the applications that talk to those DLLs use 32-bit data types and 32-bit calling conventions.
If the DLLs were 64-bit, WOW would have to 'thunk' every system call to them -- translating the data types from 32 to 64-bit and back -- and 'marshal' the calling convention into the right memory representation. On x64 systems that's actually more of a performance hit than running the DLLs in emulation, because while it's possible to automate the way the calls are marshalled from 32 to 64-bit, it's complicated, and the data type thunking is still an issue.
On ARM though, running the DLLs in emulation has more of an overhead, so Windows on ARM does this translation slightly differently. The native system DLLs are a new type of file called CHPE (Compiled Hybrid Portable Executable) files. They're compiled to ARM 64 code using the same source code as the native 64-bit versions of the DLL, but while the code is ARM 64 (and the system automates the marshalling of the calling convention) the interfaces are 32-bit x86 interfaces so the DLLs work with 32-bit data types that x86 processes can load. This combination of ARM code and x86 entry points is much more efficient; "for the most part these run at native speed," according to Arun Kishan of the Windows kernel team.
We'll test this out when hardware is available, but Kishan believes "you can get near-native, or very close to native performance with this approach".
The 64-bit question
That's all for 32-bit x86 applications that you'd run on a Windows PC -- but what about the increasing numbers of 64-bit applications for Windows, like Photoshop? They're not supported and won't work on Windows on ARM. That's a trade-off between how much more work would be required to support 64-bit applications and how useful that support would be.
"To emulate x64 in addition to x86 doubles the engineering work," Erin Chapple told ZDNet. Unlike the 32-bit support, it would be new work as well. "In addition, Windows only supports the Windows on Windows (WOW) abstraction layer for 32-bit applications, not 64-bit applications. We would have to add support for a 64-bit Windows on Windows layer."
Plus, the 64-bit emulation would have to deal with the fact that x64 CPUs have 16 general-purpose registers (small amounts of fast storage on the processor used to hold the current instruction or the data it's working with) compared to just 8 for x86.
Although the Kryo 280 CPU in the Snapdragon 835 has 31 general-purpose registers that can be accessed in either 32-bit or 64-bit mode (64-bit being more complex to code against), a number of those are reserved for specific purposes, and the emulator itself needs a few registers of its own. Keeping the registers used by emulated code mapped onto physical registers is key to getting good emulation performance. If the emulator has to use system memory to store values that should be sitting in hardware registers, rather than having a register available to use straight away, performance is going to drop significantly: an instruction that should take one processor cycle to run could be ten, twenty or fifty times slower.
Keeping enough registers allocated to running emulated code is certainly easier for the 8 registers the x86 instruction set uses than the 16 that x64 needs. And if the emulator itself is doing that allocation in a high-level language, rather than the registers being protected by the system architecture itself, that adds more complexity to the emulation -- especially when the 32-bit emulation layer already exists and a new 64-bit layer would have to be written from scratch.
The extra work and less predictable performance might not be useful to many people, Chapple suggested. "This is technically possible, [but] it is a resource trade-off of the work necessary versus the benefit to the user. When we looked at our telemetry for the most-used applications on Windows, we found that the majority of them have x86 versions. A lot of applications also have only x86 versions. Most of the 64-bit only applications are games which are outside of the target customer for this device. Lastly, those applications that are 64-bit only typically want to run natively for performance reasons. As a result, we decided to focus our engineering investments on the native ARM64 SDK to enable developers to natively write their application for the device."
If developers want their x64 code to run on Windows on ARM straight away, they need to compile it for x86 instead, and if they use the Desktop Bridge to put those applications in the Store they need to submit the x86 version to have it work on Windows on ARM. That even applies to installers, which could be an issue for some applications.
Microsoft has never stopped creating 32-bit versions of Windows: in 2015 then-head of the Windows Insider program Gabe Aul said there were "hundreds of millions of 32-bit PCs" that could upgrade to Windows 10.
Even so, some developers have moved exclusively to x64 for security, increased memory or performance. And even applications that are 32-bit sometimes have a 64-bit installer so you can install your choice of 32 or 64-bit code. The Creative Cloud installer no longer installs the 32-bit version of Photoshop, for instance; you have to download it separately, and it's marked as the 2014 version.
Chapple confirmed that 64-bit installers won't run on Windows on ARM, although she noted that "this is not a scenario we've run into in our testing of the top applications".
One way to avoid emulation is creating UWP apps that run natively on Windows on ARM PCs. Developers can turn existing desktop apps into UWP apps, if they only use features in the WinRT APIs and the Core version of .NET. If they use features like WinForms that need the full version of .NET, though, they'll have to keep them as x86 code and have that run in emulation.
In the longer term, there will be a way to get 64-bit PC applications onto Windows on ARM. Developers who need 64-bit features like access to more memory and 64-bit virtual addresses will be able to use the Windows SDK for Windows 10 ARM 64 and compile their code directly to 64-bit ARM code and avoid even the slight performance hit of emulation. If you're writing C++ code, you can experiment with that now, although there aren't any project templates and it takes a certain amount of configuration. You can't yet submit those applications to the Store, so they would only work on Windows 10 Pro on ARM, not devices with Windows 10 S.
Will the SDK include support for features like WinForms that require the desktop version of .NET? "We are still working through our ARM64 SDK plans, including what versions of .NET will be supported," Chapple said.
Microsoft has also made some interesting choices for its own applications.
The Edge browser is a 32-bit ARM application on Windows on ARM and runs natively, although it's switching to ARM64 (which may mean that support for everything it needs is only now in the Windows ARM64 SDK). "If you join the Windows insider program with a Windows on ARM device, you will see that MS Edge has moved to be 64 bit," Chapple said.
Internet Explorer and Office, on the other hand, are x86 applications: plugins and add-ons are overwhelmingly 32-bit, so it makes sense to keep the applications x86 -- especially as most of what Office does isn't CPU intensive.
Desktop applications are clearly part of the Windows future, which should reassure those who think that Microsoft is trying to move everyone to UWP. But as long as they're only 32-bit, that gives Windows on ARM a very clear position -- and one that's rather different from the way Apple views the iPad Pro.
Despite the increasing performance and capability of ARM processors, Windows on ARM is designed to give you an affordable and extremely mobile device with the emphasis on battery life and integrated LTE for connectivity, rather than attempting to compete with Intel for 64-bit PC performance.
The Windows 10 laptops will use hardware typically found in mobile phones, the ARM-based Qualcomm Snapdragon 835 chipset, to bring the always-on connectivity and longer battery life of smartphones to laptops.