'

Linux Meltdown patch: 'Up to 800 percent CPU overhead', Netflix tests show

The performance impact of Meltdown patches makes it essential to move systems to Linux 4.14.

Video: Fake Meltdown-Spectre patch emails hiding Smoke Loader malware

The Linux mitigation for Meltdown known as kernel page table isolation (KPTI) can cause a massive drain on CPU performance, according to an analysis by Brendan Gregg, a senior performance architect at Netflix.

While Intel's Spectre mitigations have attracted the most attention for causing performance and stability problems, Gregg finds that KPTI causes the "largest kernel performance regressions I've ever seen".

KPTI prevents Meltdown leaks by using completely separate page tables for user-mode execution and kernel-mode execution.

To test the impact of KPTI, Gregg created a microbenchmark and found that Netflix, one the largest users of AWS, is likely to see a performance overhead of between 0.1 percent and six percent due to KPTI. However, others may see much larger overheads.

"The KPTI patches to mitigate Meltdown can incur massive overhead, anything from one percent to over 800 percent," he writes.

"Where you are on that spectrum depends on your syscall and page fault rates, due to the extra CPU cycle overheads, and your memory working set size, due to TLB flushing on syscalls and context switches."

Gregg's analysis looks at five key factors that influence overhead, including system call rates, context switches, page fault rate, the working set size, and cache pattern access. Depending on the measurements for each factor, the performance overhead can balloon from two percent to 17 percent.

The circumstance where the overhead can exceed 800 percent is when using a version of Linux that didn't support PCID or process-context ID.

The Linux kernel added support for PCID in version 4.14, improving its handling of the Meltdown-fixing separate tables so long as the CPU supports PCID too.

Exactly how much the system is impacted depends on the characteristics of the application. As he notes, applications with higher system call, or syscall, rates, such as proxies and databases that do lots of tiny I/O, will suffer the largest losses. The impact also rises with higher context switch and page fault rates.

Gregg offered the following summary:

• Syscall rate: There are overheads relative to the syscall rate, although high rates are needed for this to be noticeable. At 50,000 syscalls per second per CPU, the overhead may be two percent, and climbs as the syscall rate increases. At Netflix, high rates are unusual in the cloud, with some exceptions, such as databases.

• Context switches: These add overheads similar to the syscall rate, and the context switch rate can simply be added to the syscall rate for the following estimations.

• Page fault rate: Adds a little more overhead as well, for high rates.

• Working set size, hot data: More than 10MB will cost additional overhead due to TLB flushing. This can turn a one percent overhead (syscall cycles alone) into a seven percent overhead. This overhead can be reduced by a) PCID, available in Linux 4.14, and b) huge pages.

• Cache access pattern: The overheads are exacerbated by certain access patterns that switch from caching well to caching a little less well. Worst case, this can add an additional 10 percent overhead, taking, say, the seven percent overhead to 17 percent.

He expects Netflix will be able to reduce the performance overhead to less than two percent by using Linux 4.14 with PCID support, huge pages, syscall reductions and other methods to fine-tune performance.

However, Gregg notes that KPTI is only one source of performance overheads in the fixes for Meltdown and Spectre, which include cloud hypervisor changes, Intel's microcode, and compilation changes such as Google's Retpoline fix.

linusmeltdowngreggkpti.png

Brendan Gregg has set out the cost of extra CPU cycles in the syscall path.

Image: Brendan Gregg

Previous and related coverage

Spectre reboot problems: Now Intel replaces its buggy fix for Skylake PCs

And offers patching tips from US CERT, which it failed to brief on the bugs.

Meltdown-Spectre: Malware is already being tested by attackers

Malware makers are experimenting with malware that exploits the Spectre and Meltdown CPU bugs.

Windows emergency patch: Microsoft's new update kills off Intel's Spectre fix

The out-of-band update disabled Intel's mitigation for the Spectre Variant 2 attack, which Microsoft says can cause data loss on top of unexpected reboots.

Meltdown-Spectre: Why were flaws kept secret from industry, demand lawmakers

Great work on patching your own products, but why were smaller tech companies kept in the dark?

Spectre flaw: Dell and HP pull Intel's buggy patch, new BIOS updates coming

Dell and HP have pulled Intel's firmware patches for the Spectre attack.

Windows 10 Meltdown-Spectre patch: New updates bring fix for unbootable AMD PCs

AMD PCs can now install Microsoft's Windows update with fixes for Meltdown and Spectre and the bug that caused boot problems.

Meltdown-Spectre: Intel says newer chips also hit by unwanted reboots after patch

Intel's firmware fix for Spectre is also causing higher reboots on Kaby Lake and Skylake CPUs.

26% of organizations haven't yet received Windows Meltdown and Spectre patches Tech Republic

Roughly a week after the update was released, many machines still lack the fix for the critical CPU vulnerabilities.

Bad news: A Spectre-like flaw will probably happen again CNET

Our devices may never truly be secure, says the CEO of the company that designs the heart of most mobile chips.