Patch for Piledriver chips emitted this week to kill off potentially exploitable glitches
Analysis AMD will tomorrow release new processor microcode to crush an esoteric bug that can be potentially exploited by virtual machine guests to hijack host servers.
Machines using AMD Piledriver CPUs, such as the Opteron 6300 family of server chips, and specifically CPU microcode versions 0x6000832 and 0x6000836 – the latest available – are vulnerable to the flaw.
When triggered, the bug can glitch a processor core to execute data as software, which crashes the currently running process.
It is possible for a non-root user in a virtual machine to exploit this defect to upset the host system, or trick the host kernel into executing malicious code controlled by the user.
In other words, it is possible on some AMD-powered servers for a normal user in a guest virtual machine to escape to the underlying host and take over the whole shared server.

Although it is rather tricky to exploit – for one thing, it requires precise timing – AMD has a fix ready for operating system makers to distribute to affected users from this Monday.
“AMD is aware of the potential issue of unprivileged code running in virtual machine guests on systems that make use of AMD Opteron 6200/6300,” a spokesman told The Register.
“Following a thorough investigation we have determined that only AMD driver patch 6000832 and patch 6000836 is affected by this issue. AMD has developed a patch to fully resolve the issue and will be made available to our partners on Monday, 7 March, 2016.”
The bug is related to the delivery of non-maskable interrupts (NMI), and is specific to the aforementioned microcode versions. On Linux, /proc/cpuinfo will list the ID number of the microcode running on your processor cores if you want to check if your machine is vulnerable. Microcode – basically, your processor’s firmware – can be installed by your motherboard’s BIOS or your kernel during boot-up: for example, Debian GNU/Linux distributes the latest patches in the amd64-microcode and intel-microcode packages.
For most affected people, a package update and reboot will ensure the fixed microcode is in place.
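If you'd rather not eyeball the file, here is a minimal Python sketch of that check. It scans /proc/cpuinfo-style text for the microcode field and flags the two revisions AMD has named. The function name and the sample text are our own, made up for illustration; only the two revision numbers come from AMD's statement.

```python
# The two microcode revisions AMD says are affected.
VULNERABLE_REVISIONS = {0x6000832, 0x6000836}


def vulnerable_revisions_found(cpuinfo_text):
    """Return the set of affected microcode revisions listed in the text.

    cpuinfo_text is the contents of /proc/cpuinfo, or text in that layout.
    """
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            try:
                revision = int(value.strip(), 16)
            except ValueError:
                continue  # skip anything that is not a hex number
            if revision in VULNERABLE_REVISIONS:
                found.add(revision)
    return found


# Hypothetical excerpt, trimmed to the relevant fields.
SAMPLE = """processor\t: 0
vendor_id\t: AuthenticAMD
microcode\t: 0x6000836
processor\t: 1
vendor_id\t: AuthenticAMD
microcode\t: 0x6000836
"""

print(sorted(hex(r) for r in vulnerable_revisions_found(SAMPLE)))
```

On a real Linux box you would feed it the contents of /proc/cpuinfo rather than the sample string. An empty result only means neither of the two named revisions is loaded.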

The new microcode is also expected to appear on the AMD operating system team’s website if you want to install it by hand.
The microcode flaw has so far reared its head on systems using QEMU-KVM for virtualization, but it may affect other hypervisors.
Bug hunt
Due to Intel’s dominance in the data center and virtualization world, this AMD-specific bug is not going to cause widespread chaos. However, it may give some people grief.

For one thing, the code gremlin managed to nip the OpenSUSE Linux project, which is cosponsored by AMD.
An OpenSUSE build server that sports an Opteron 6348 processor with microcode version 0x06000836 hit a Linux kernel “oops” while running post-compilation tests on a fresh copy of GDB.

The debugger’s bytes barely had time to settle on the hard drive before the tests were killed by the underlying kernel.
Jiri Slaby, a SUSE Linux programmer, reported the weird crash to the Linux kernel mailing list at the end of February, and uploaded a bunch of diagnostic information for fellow developers to pore over.
The crash was bizarre and, we’re told, couldn’t be repeated: while running tests on the newly built GDB debugger, the processor entered kernel mode and suddenly careered off course. Like a car hitting some black ice, it slid off the road and smashed into a tree.
It stopped executing the code it was supposed to be running, and instead slammed into a page of memory that had been wisely marked non-executable because it contained a critical kernel data structure rather than actual code.

That collision triggered a fault, which was flagged up as a potential kernel bug, and the running process was killed.
At the time of the crash, the kernel was leaving an internal function called ttwu_stat(), which updates some of the scheduler’s accounting statistics.
It is harmless.
Its instructions aren’t that complicated: just some compares, additions, and stack popping and pushing.
It’s called from the scheduler function try_to_wake_up().
Then a clue was spotted.

A scrap of torn red silk left at the GDB process’s murder scene.

Before ttwu_stat() is called, the kernel function try_to_wake_up() does a bunch of stuff that includes this instruction:

mov $0x16e80,%r15

This moves the hexadecimal value 0x16e80 into the CPU core’s R15 register.

What’s a stack?
Think of a stack as a pile of cafeteria trays: you push a tray, or a value, onto the stack, and you pop a tray, or value, off the stack.
If you push 1, then 4, then 5, and finally 2 onto the stack in that order, you’ll pop them off in the order of 2, 5, 4, and 1.
If you push the contents of R15 and then, say, R14 onto the stack, when you next pop a value off, you’ll get back R14’s.

Soon after, ttwu_stat() is called, which pushes R15 and other registers onto the stack.
At the end of ttwu_stat(), the registers, including R15, are pulled off the stack.

This means R15 should have the same value on leaving ttwu_stat() as it did entering the function – specifically, 0x16e80. Whatever the function did to R15, the register’s original value should be restored on leaving ttwu_stat().
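To make the save-and-restore dance concrete, here is a toy model in Python, with a list standing in for the stack. It is a sketch of the general idea only, not the kernel's code: the callee() function and the starting value for R14 are invented for illustration.

```python
stack = []

# The cafeteria trays: last pushed, first popped.
for value in (1, 4, 5, 2):
    stack.append(value)                    # push
popped = [stack.pop() for _ in range(4)]   # pop
print(popped)


def callee(regs, stack):
    """Toy stand-in for a function that saves and restores R15 and R14."""
    stack.append(regs["r15"])    # push %r15
    stack.append(regs["r14"])    # push %r14
    regs["r15"] = 0xdeadbeef     # the body is free to scribble on both
    regs["r14"] = 0
    regs["r14"] = stack.pop()    # pop %r14
    regs["r15"] = stack.pop()    # pop %r15


regs = {"r14": 0x1234, "r15": 0x16e80}   # 0x1234 is a made-up value for R14
callee(regs, stack)
print(hex(regs["r15"]))
```

As long as every push is matched by a pop in the reverse order, the caller gets its registers back untouched, whatever the function did in between.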
Let’s look at the “oops” report generated by the kernel, which reveals the contents of all the registers at the time of the exception:

RIP: 0010:[<ffff88023fd40000>] [<ffff88023fd40000>] 0xffff88023fd40000
RSP: 0018:ffff8800bb2a7c50 EFLAGS: 00056686
RAX: 00000000bb37e180 RBX: 0000000000000001 RCX: 00000000ffffffff
RDX: 0000000000000000 RSI: ffff88023fdd6e80 RDI: ffff88023fdd6e80
RBP: ffffffff810a535a R08: 0000000000000000 R09: 0000000000000020
R10: 0000000001b52cb0 R11: 0000000000000293 R12: 0000000000000046
R13: ffff8800bb37e180 R14: 0000000000016e80 R15: ffff8800bb2a7c80

R15 should be 0x16e80 but it’s actually 0xffff8800bb2a7c80 – and R14 is 0x16e80.

That’s not right at all.
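If squinting at hex dumps isn't your idea of fun, a few lines of Python can pull the general-purpose registers out of the report and make the comparison for you. This is a rough sketch: the parse_registers() helper is our own invention, and the regular expression is written to fit this particular dump rather than every oops format the kernel can emit.

```python
import re

# The register lines from the oops report, copied as-is.
OOPS = """\
RIP: 0010:[<ffff88023fd40000>] [<ffff88023fd40000>] 0xffff88023fd40000
RSP: 0018:ffff8800bb2a7c50 EFLAGS: 00056686
RAX: 00000000bb37e180 RBX: 0000000000000001 RCX: 00000000ffffffff
RDX: 0000000000000000 RSI: ffff88023fdd6e80 RDI: ffff88023fdd6e80
RBP: ffffffff810a535a R08: 0000000000000000 R09: 0000000000000020
R10: 0000000001b52cb0 R11: 0000000000000293 R12: 0000000000000046
R13: ffff8800bb37e180 R14: 0000000000016e80 R15: ffff8800bb2a7c80
"""


def parse_registers(text):
    """Map register names to values for entries shaped like 'R15: <16 hex digits>'.

    RIP and RSP are printed with a segment prefix, so this pattern skips them.
    """
    pairs = re.findall(r"\b(R[A-Z0-9]{2}):\s+([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


registers = parse_registers(OOPS)
EXPECTED_R15 = 0x16e80  # the value try_to_wake_up() loaded before the call

print("R15 as expected:", registers["R15"] == EXPECTED_R15)
print("R14 holds R15's value:", registers["R14"] == EXPECTED_R15)
```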
In ttwu_stat(), R15 is pushed onto the stack, then R14.

At the end of the function, R14 pulls its contents off the stack, and then R15 does the same.

But in this case, R14 has popped R15’s value instead of its own.
Something’s not right: the stack is in an unexpected state.
ttwu_stat()’s final instructions are:

pop %r14
pop %r15
pop %rbp
retq

That’s supposed to restore the contents of the R14, R15, and RBP registers from the stack in that order, and then pull another value off the stack: the location in try_to_wake_up() that ttwu_stat() is supposed to return to.

The final retq instruction pops this return address and jumps to it.
But, whoops, RBP contains 0xffffffff810a535a, which is the return address we want.

The retq instruction was expecting that value, but instead it’ll get whatever’s next on the stack.
This confirms the stack is off by one 64-bit register, or eight bytes: the value for R15 was popped into R14, the real return address was popped into RBP, and a previously stacked value was popped by retq as a return address and jumped to.

That explains why the kernel took off in a seemingly random direction – it tried using a pointer to data from the stack as a legit address to execute code.
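The off-by-one theory can be replayed in a few lines of Python. The sketch below lays out the stack as ttwu_stat() would have left it, then runs the four-step epilogue twice: once cleanly, and once with the stack pointer bumped past one eight-byte slot. It is a model under assumptions, not a reconstruction of the real machine state: the saved RBP and the stale value under the return address are inferred from the oops report, and the caller's R14 is a pure placeholder.

```python
# Values taken from the oops report above, plus one placeholder.
RETURN_ADDR = 0xffffffff810a535a   # where ttwu_stat() should return to
SAVED_RBP = 0xffff8800bb2a7c80     # inferred: the value that ended up in R15
SAVED_R15 = 0x16e80                # loaded by try_to_wake_up() before the call
SAVED_R14 = 0x1111                 # made up: the report can't tell us this
STALE_VALUE = 0xffff88023fd40000   # inferred: the address the CPU jumped to

# Bottom of the stack first; the last element is the top.
frame = [STALE_VALUE, RETURN_ADDR, SAVED_RBP, SAVED_R15, SAVED_R14]


def run_epilogue(stack, slots_skipped=0):
    """Model 'pop %r14; pop %r15; pop %rbp; retq' on a copy of the stack.

    slots_skipped models the glitch: slots the stack pointer moved past
    without anything popping them.
    """
    stack = list(stack)
    for _ in range(slots_skipped):
        stack.pop()
    regs = {}
    for name in ("r14", "r15", "rbp", "rip"):
        regs[name] = stack.pop()
    return regs


healthy = run_epilogue(frame)
glitched = run_epilogue(frame, slots_skipped=1)

for label, regs in (("healthy", healthy), ("glitched", glitched)):
    print(label, {name: hex(value) for name, value in regs.items()})
```

With one slot skipped, the model's R14, R15, RBP and instruction pointer line up with the values in the oops report, which is what you would expect if the stack pointer really was eight bytes out.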
While ttwu_stat() was running, something else tampered with the stack pointer – the special register that keeps track of where in memory values are pushed to and popped from the stack.
Something invisible dropped the stack pointer an extra eight bytes.

A poltergeist spilling cafeteria trays of register values all over the floor in the middle of the night.
You get the idea.
Hate to interrupt you, what’s this got to do with QEMU-KVM?
The one thing you just should never do is blame your compiler or kernel or microprocessor when your code bombs. Your carefully crafted source, like a teenager’s first poem to their first crush, is an extension of your essence, your passion to do things right. When it goes wrong, though, 99.999 per cent of the time it’s because you suck, and any time spent blaming the toolchain or CPU is time not spent fixing your own work.
Well, here’s one of those rare moments where you can blame someone else.
In the background to all of this, Google security engineer Robert Święcki had privately disclosed to AMD engineers and the Linux kernel security team a strange kernel “oops” that, in the words of Linux kernel chief Linus Torvalds, “turned out to be a AMD microcode problem with NMI delivery.”
Święcki had reported a similar exception to Slaby’s GDB crash: the kernel had tried to execute code in memory that was off limits.

That sort of fault will make the hairs on the back of a security engineer’s neck stand up: if a hacker can control or even simply influence where the CPU ricochets off to in kernel mode, she can potentially hijack the whole computer.
It’s the sort of bug that you have to get to the bottom of.
“I’m actually starting to suspect that it’s an AMD microcode bug that we know very little about,” Torvalds said, referring to Slaby’s GDB prang. “There’s apparently register corruption (the guess being from NMI handling, but virtualization was also involved) under some circumstances.
“We do have a reported ‘oops’ on the security list that looks totally different in the big picture, but shares the exact same ‘corrupted stack pointer register state resulting in crazy instruction pointer, resulting in NX fault’ behavior in the end.”
In other words, Slaby had stumbled across an AMD microcode issue on production hardware, an issue that Święcki and the Linux kernel security team were already investigating.
NMIs are interrupts that absolutely must be handled by the kernel and cannot be ignored: you can’t tell the chipset to postpone them because they are typically generated by a hardware failure or a watchdog timer raising an alarm. Like almost all interrupts, they can potentially fire at any time. Perhaps an NMI delivery problem occurred during the doomed GDB test; a microcode bug meddling with the stack pointer in an innocuous kernel function during a process scheduling operation that spiraled into a serious exception in the host kernel.
The other ingredient in this saga is virtualization: the OpenSUSE build server was compiling GDB and testing it in a QEMU-KVM virtual machine.

That means an unprivileged user in a guest virtual machine merely building software was able to trigger an “oops” in the host server’s kernel.

That’s not good.
According to Święcki, the microcode glitch mostly interferes with the host kernel’s stack pointer RSP, but it can also corrupt the contents of other registers – all of which can cause crashes, unpredictable behavior, or potentially be exploited to gain control of the system.

The Googler said he can, in “rare” conditions, commandeer the host machine’s kernel from a virtual machine guest.
“The visible effects are, in about 80 per cent of cases, incorrect RSP [values] leading to bad returns into kernel data or [triggering] stack-protector faults,” Święcki told the Linux kernel mailing list.
“But there are also more elusive effects, like registers being cleared before use in indirect memory fetches.
“I can trigger it from within QEMU guests, as non-root, causing bad RIP [instruction pointer register values] in the host kernel. When testing, a couple of times out of maybe 30 ‘oopses’, I was able to set it to user-space addresses mapped in the guest.
“It greatly depends on timing, but I think with some more effort and populating the kernel stack with guest addresses it’d be possible to create a more reliable QEMU guest to host ring-0 escape.”
He added:

My proof-of-concept code [to trigger the bug] works only under QEMU-KVM. Xen and KVMtools don’t appear to be affected by it because there’s some missing functionality in them that my PoC makes use of.

But another thread started on [the Linux kernel mailing list] made me think those hypervisors can also be affected, although that’s just speculation.

AMD told The Register the bad microcode – 0x6000832 and 0x6000836 – affects the Opteron 6200 and 6300 series, although Święcki believes the problem extends to newer AMD FX and Opteron 3300 and 4300 chips using the Piledriver architecture and the buggy microcode.
Specific details on how to trigger the bug have not been disclosed ahead of the updated microcode’s release. Not that you need to know exactly how to exploit the vulnerability: you could be unlucky like the SUSE team and encounter it randomly on a live system.
Finally, Święcki pointed to a similar bug VMware has worked around in its ESXi hypervisor software for AMD Opteron 6300 CPUs. “Under a highly specific and detailed set of internal timing conditions, the AMD Opteron Series 63xx processor may read an internal branch status register while the register is being updated, resulting in an incorrect RIP. The incorrect RIP causes unpredictable program or system behavior, usually observed as a page fault,” reads the VMware note, issued last year.
It’s no secret that microprocessors – especially today’s complex CPUs with billions of transistors – have bugs.
Intel and AMD both publish hundreds of pages of notes warning of subtle flaws in their designs. Most of the cockups are harmless to normal users, some are not; operating systems can work around the engineering blunders, or not bother at all for bugs that are benign.
Sometimes, though, new microcode is needed.

AMD last issued new microcode for its x86 processors in December 2014. ®
PS: Here’s a video of David Kaplan, a hardware security architect at AMD, explaining how you’d typically go about testing and debugging a modern x86 CPU.
Youtube video