QEMU and the Kernel Virtual Machine (KVM)

One of the main goals of this course is to learn to manage a Linux server. So, we need a server to begin with! It would be quite impractical (and expensive) if everyone were to use a “real” (physical) machine, so we’ll be using virtual machines (VMs) instead.

User space, kernel space and x86 protection rings

The job of an operating system is to allow multiple processes to share the underlying machine in a safe and secure manner. To make it possible, the operating system—or more precisely, the kernel of the operating system—needs to remain in control of the hardware, and the “regular” programs must instead access the hardware by requesting service from the kernel with a system call (syscall).

Linux 6.11 on x86-64 currently supports over 450 system calls¹ which provide various services: from writing files, through process management, to system configuration. This way, the kernel is able to separate programs from itself and one another.

This separation between the user programs and the kernel gives rise to the terms kernel space (where the kernel and device drivers run) and user space (where the “regular” programs run). The user space programs are mostly limited to general-purpose computation, and to interact with the machine or other programs they need to make appropriate syscalls. The kernel implements policy checking and synchronization to safely handle the syscall.

Kernel space and user space — On AMD64, to open a file, a user space program performs a so-called system call by executing the SYSCALL instruction. This switches the CPU from the most restrictive ring 3 where the user space program executes, to the least restrictive ring 0 where the operating system’s kernel executes. Because the syscall number was SYS_OPEN, the kernel knows to transfer execution to the do_sys_open handler. This code carries out the desired action (opening a file) and returns a file descriptor (or an error) to the calling program with a SYSRET. This switches the CPU back to ring 3 and the user space program resumes execution, with the return value of do_sys_open now in `fd`.
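
To make the round trip above concrete, here is a minimal sketch of a user space program asking the kernel for service directly, without the usual libc convenience wrappers. It assumes Linux with glibc; the function names raw_open_readonly and raw_getpid are ours. We use SYS_openat rather than SYS_open because it is available on every Linux architecture, and on x86-64 the syscall() helper ends up executing the SYSCALL instruction for us.

```c
#define _GNU_SOURCE
#include <fcntl.h>        /* O_RDONLY, AT_FDCWD */
#include <sys/syscall.h>  /* SYS_openat, SYS_getpid */
#include <unistd.h>       /* syscall(), close(), getpid() */

/* Open a file read-only by invoking the system call directly.
 * Returns a file descriptor, or -1 with errno set by the kernel's answer. */
int raw_open_readonly(const char *path)
{
    return (int)syscall(SYS_openat, AT_FDCWD, path, O_RDONLY);
}

/* Ask the kernel for our process ID without the libc getpid() wrapper. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

Calling raw_open_readonly("/dev/null") should hand back a small non-negative file descriptor, just like open() would, and raw_getpid() should agree with getpid().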

This approach to operating system design, with the operating system being the layer between user space programs and hardware, is called kernel interposition. A big advantage of this design is that it’s conceptually simple (the kernel is in charge of everything), but sometimes it can get in the way, as system calls have a noticeable overhead.

The isolation of the kernel from user space programs relies on features provided by the CPU. On x86, there are four so-called protection rings, or privilege levels. The kernel code typically runs in ring 0 and is largely unrestricted, user space programs are confined to ring 3, and rings 1 and 2 are typically unused. The higher the ring number, the lower the privilege. This and many other protection mechanisms ensure that the kernel stays in charge at all times: if a user space program tries to mess with hardware or other processes directly, the CPU prevents that and notifies the kernel to deal with the misbehaving process.
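
You can watch this protection in action. The sketch below, which assumes x86 Linux and a GCC or Clang compiler (for the inline assembly), forks a child process that executes HLT, an instruction reserved for ring 0. The CPU refuses, the kernel is notified, and the kernel terminates the child with a signal. The function name is ours, and on other architectures it simply returns -1.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run a privileged instruction in a child process (ring 3) and report
 * which signal killed it. Returns 0 if the child exited normally and
 * -1 on error or on a non-x86 machine. */
int signal_from_privileged_instruction(void)
{
#if defined(__x86_64__) || defined(__i386__)
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        __asm__ volatile("hlt");   /* ring 0 only: faults in user space */
        _exit(0);                  /* not reached on a regular Linux system */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
#else
    return -1;
#endif
}
```

On a regular x86 Linux system the result should be SIGSEGV: the general protection fault raised by the CPU is turned into a signal for the offending process.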

Sometimes, ring 0 is called kernel mode or supervisor mode, and ring 3 is called user mode.

Hardware virtualization

Note: Virtualization is a complex topic, and this reading is by no means a replacement for a full course such as NSWI150. Please note that we only describe a particular kind of virtualization called hardware-assisted virtualization, and only as it works on the x86 platform.

Hardware virtualization allows multiple operating systems to execute at the same time on a single physical machine. One of these operating systems acts as the hypervisor²: it monitors the execution of the other operating systems and steps in when necessary to make sure that they share the underlying hardware safely. The physical machine on which the hypervisor runs is called the host, while the operating systems managed by the hypervisor are called guests.

The relationship between the hypervisor and the guests is similar to the relationship between an operating system’s kernel and the user processes, as described in the previous section.

The guest OS and the hardware it runs on—some of which is virtualized with hardware support and some of which is emulated entirely in software—is then called a virtual machine (VM). In many important ways, a virtual machine mimics the behavior of a physical one very closely:

You can turn it on and off, but instead of pressing a physical button, you issue a command (or click a button in some management interface).
When a virtual machine is started, it boots up like a physical one.
You can ssh into a running virtual server to manage it, just like you would into a running physical server. You may not even know the machine is being virtualized, and the machine itself may be unaware of it³.
Configuring a server to securely provide services to the outside world is a difficult task regardless of whether the server is virtual or physical.

Advantages and disadvantages

Virtual machines come with many advantages:

Virtual machines are not tied to the hardware they run on. Should the host fail or be gracefully taken down for maintenance, the VM can be moved to another host with the same CPU architecture and sufficient resources. This process is called migration, and with the right setup, you can even move running operating systems between physical machines.
Like physical machines, VMs provide security isolation. When a machine is breached, the attacker gains control of only that machine. To spill over to other machines, the attacker needs to breach them, too.
Running many VMs on the same hardware usually allows you to increase the average utilization of the hardware, because most machines are idle most of the time. That allows you to run fewer physical machines in total, leading to lower electricity consumption and upkeep, lowering cost and environmental impact.

They also come with some disadvantages:

A virtualized guest usually runs slower than a native equivalent. In other words, running a VM incurs noticeable overhead, or decrease in efficiency.
The technology necessary to run them is complex and consequently can be difficult to understand.

VMs are a good fit for this course because:

We can pack many student VMs onto just a few physical machines. Since VMs are cheap, everyone can run multiple virtual machines.
Managing a VM is very similar to managing a physical machine, so those skills will be very handy even if you manage physical servers in the future.
The VMs provide isolation between students. If one student VM gets hacked, the fallout should be contained.
We gain experience with virtualization.

You’ll be in charge of managing your VMs end to end. That’s why it’s important that you understand, at least on a basic level, what they are.

Linux, KVM and QEMU

Note: This section is Linux-specific: it describes how Linux can be used as a high-performance hypervisor for the x86 platform.

The cornerstone of virtualization is the isolation of the host and the guests. The guests must not be able to interfere with the host or the other guests in any way: they must all act as if they were separate physical machines. Originally, x86 CPUs offered no hardware support for virtualization, and it was extremely difficult to implement an x86 hypervisor capable of safely running VMs at decent speed.

To address that issue, both Intel and AMD have gradually rolled out several extensions to the x86 instruction set, which make it somewhat easier to implement an efficient hypervisor:

Intel: Intel VT-x, Extended Page Tables (EPT), VT-d
AMD: AMD-V, Nested Page Tables (NPT), AMD-Vi

The Intel VT-x extension adds two new “modes” of CPU operation, the root mode and the non-root mode.⁴ The four protection rings of the CPU remain unchanged and are orthogonal to these new modes: the CPU can be in root mode protection level 3, or non-root mode protection level 0. As you would probably guess, hypervisor code is executing in root mode, and the guests are running in non-root mode. Sometimes, ring 0 in root mode is called ring -1 or hypervisor mode.

To start executing a VM, the hypervisor will switch the CPU from root mode to non-root mode. This is called VM Entry, and it should remind you of what the operating system does when it starts a user space process. The key feature of the non-root mode is that privileged instructions, which could potentially interfere with the hypervisor, switch the CPU from non-root mode back to the root mode. This is called VM Exit. The hypervisor is then provided with detailed information about the instruction which caused the exit, so that it can handle it in software. This is similar to what happens when a user space process attempts to perform a privileged operation, and the CPU raises an exception so that the kernel can deal with the situation.

The other processor extensions listed above each add hardware virtualization support for other functions of the CPU (such as paging) so that expensive emulation in software can be avoided.
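
If you are curious whether a particular CPU advertises these extensions, you can ask it with the CPUID instruction. The following is a minimal sketch assuming an x86 machine and the cpuid.h helper header shipped with GCC and Clang; the function names are ours. Bit 5 of ECX in CPUID leaf 1 reports Intel VT-x (VMX), and bit 2 of ECX in leaf 0x80000001 reports AMD-V (SVM).

```c
#include <stdbool.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* True if the CPU reports Intel VT-x (VMX) support. */
bool cpu_has_vmx(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return (ecx >> 5) & 1;          /* CPUID.1:ECX bit 5 = VMX */
#endif
    return false;
}

/* True if the CPU reports AMD-V (SVM) support. */
bool cpu_has_svm(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return (ecx >> 2) & 1;          /* CPUID.80000001h:ECX bit 2 = SVM */
#endif
    return false;
}
```

The same information shows up as the vmx or svm flag in /proc/cpuinfo. Keep in mind that a set bit only means the CPU supports the extension; the firmware can still have it switched off, and inside a VM both bits are usually hidden unless nested virtualization is enabled.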

Kernel-based Virtual Machine (KVM) is a Linux kernel module⁵ enabling Linux to use the virtualization features built into modern Intel and AMD CPUs. The KVM module makes it possible to write a user space program which uses the virtualization extensions of the underlying hardware (VT-x, AMD-V and others) to run a virtual machine efficiently.

When the KVM module is loaded into the kernel, it exposes the character device /dev/kvm into the user space:

$ ls -l /dev/kvm
crw-rw---- 1 root kvm 10, 232 Aug 23 23:42 /dev/kvm

By opening this file, you obtain a file descriptor representing the KVM subsystem of Linux. You drive KVM by making ioctl calls: for example, KVM_CREATE_VM on this descriptor creates a representation of a virtual machine and returns a new file descriptor for it, KVM_CREATE_VCPU on the VM descriptor does the same for a virtual CPU, and KVM_RUN on the vCPU descriptor performs VM Entry and executes code in the non-root mode. When the KVM_RUN ioctl returns, that means that VM Exit occurred and your intervention (you being the hypervisor) is required. See (*) KVM host in a few lines of code for a detailed C example.
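
As a small taste of the API, here is a sketch that only probes KVM without running any guest code. It assumes the Linux kernel headers (linux/kvm.h) are installed; the function name kvm_probe is ours. It shows the chain of file descriptors: the system descriptor from /dev/kvm, a VM descriptor from KVM_CREATE_VM, and a vCPU descriptor from KVM_CREATE_VCPU, which is the one KVM_RUN would later be issued on.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/kvm.h>   /* KVM_* ioctl numbers from the kernel headers */
#include <sys/ioctl.h>
#include <unistd.h>

/* Returns the KVM API version, or -1 if /dev/kvm cannot be used
 * (module not loaded, no hardware support, or no permission). */
int kvm_probe(void)
{
    int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    if (kvm < 0)
        return -1;

    int version = ioctl(kvm, KVM_GET_API_VERSION, 0);

    int vm = ioctl(kvm, KVM_CREATE_VM, 0);          /* system fd -> VM fd */
    if (vm >= 0) {
        int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);   /* VM fd -> vCPU fd */
        /* KVM_RUN would go here, after guest memory and registers are set up. */
        if (vcpu >= 0)
            close(vcpu);
        close(vm);
    }

    close(kvm);
    return version;
}
```

On a machine where KVM is usable, the returned version should equal the KVM_API_VERSION constant from the header. This sketch deliberately stops before setting up memory and registers.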

QEMU is able to use the KVM subsystem to run virtual machines. Apart from that, it is also an excellent hardware emulator. KVM alone can only virtualize a CPU, but to have a useful VM, we need much more—at a minimum, we need a BIOS and a serial port, but ideally also a graphics card. Whenever the guest kernel tries to communicate with any virtual hardware, VM Exit will occur, QEMU will step in to emulate the device in software and then it will resume the virtualization.

Execution of a VM under KVM — To run a VM, QEMU (a user space program running in root mode) performs ioctl calls on /dev/kvm. These calls instruct the KVM subsystem in the Linux kernel (running in kernel space in root mode) to perform a VM entry. The guest kernel starts executing (in kernel space in non-root mode). Eventually, the first user space process is launched in the guest (in user space in non-root mode). Whenever the guest kernel attempts to execute a sensitive instruction, a VM exit occurs. QEMU resumes (in user space in root mode), emulates the instruction and the cycle repeats.

To summarize: It is difficult to achieve efficient x86 virtualization. That’s why modern CPUs provide extensions such as AMD-V or Intel VT-x which make the task easier. A Linux module called KVM exposes the functionality provided by these extensions to user-space programs such as QEMU. These programs can then run virtual machines efficiently, often at near-native speed. But they usually still have to emulate other hardware, such as storage, network or graphics.

Using QEMU and KVM to run VMs

As mentioned before, QEMU will be both our interface to the KVM subsystem and our device emulator. To run the VMs, we’ll use the qemu-system-x86_64 command (the man page is qemu(1)). As the name suggests, this creates a virtual x86-64 system: virtual CPU, memory reserved from the host system and basic emulated hardware.

The command has loads of options to control the virtualization, such as CPU topology and the number of CPU cores, amount of reserved memory, types of disk controllers, disk images to appear as CDs inserted into a virtual CD-ROM drive, and so on.

Don’t read the entire qemu(1) man page at once. With a complex command such as this, it’s much better to skim it and get a rough idea of the options available. Follow up by reading the QEMU entry on the Arch Wiki.

Is this really secure?

“You are absolutely deluded, if not stupid, if you think that a worldwide collection of software engineers who can’t write operating systems or applications without security holes, can then turn around and suddenly write virtualization layers without security holes.” — Theo de Raadt

If you wanted to run two workloads and make sure they remain isolated from one another, from the least to the most secure, your options are running them:

as two processes on the same machine,
in two separate containers (to be explored later during the semester),
in two separate virtual machines,
on two separate physical machines,
on two separate air-gapped physical machines,
not running them.

The more isolation you have, the more inconvenient the setup is and the more resources are wasted. VMs lie somewhere in the middle. This of course doesn’t tell you much, but it helps to put things into perspective.

The cloud

Even though our VMs will run on servers provided by the faculty and not “in the cloud”, cloud deserves a passing mention, because (hardware) virtualization is what makes it possible.

Cloud is what you get when you let someone else worry about the hardware and you rent VMs⁶ from the cloud operator instead. Currently the largest cloud computing players are AWS (Amazon), Azure (Microsoft), Alibaba Cloud and Google Cloud.

You can ask the cloud provider to provision a VM for you with an API call. Depending on current demand, that takes between a few seconds and a few minutes to complete, after which you’re provided with credentials to access the machine. When you no longer need the VM, you can deallocate it with another API call. You are only charged for the period for which you occupy the machine, the usual billing granularity being between a second and a minute.

This tremendous flexibility is what differentiates cloud from server hosting. When you run in the cloud, you only pay for the resources you utilize, and you can allocate/deallocate resources quickly at no additional cost. With server hosting, you rent physical machines for fixed periods of time, usually on the order of months, and usually there are commitments and/or setup fees. Both have their pros and cons.

Cloud computing makes perfect sense for certain workloads. Many customer-facing services see their computing requirements change drastically during the day. For example, a video streaming service typically sees the peak traffic during the evening hours when people come home and watch TV. The difference in compute power necessary to run, say, Netflix can easily be 10x, or even 100x the average during live sports streams and prime time of popular TV shows.

To host such a service, you can either own and pay for many more servers than you need most of the time, or you can automatically provision and decommission resources in the cloud depending on the current load of your platform. Not only is the latter option typically cost effective, it’s also a necessary prerequisite for scalability. Should there be a sudden increase in the popularity of your service, you’ll simply request additional resources from the provider. This process is usually automated, and then it’s called autoscaling.

In case you’re wondering, yes, the cloud provider can temporarily run out of capacity in a particular region. It doesn’t happen often, but it can and does happen. But from the cloud customer’s perspective, the cloud appears “infinite” most of the time.

Normally, a single physical server in the provider’s data center runs many VMs from many different tenants (cloud customers). A single machine is usually too beefy, and so it’s split into many smaller virtual machines, which are then rented out individually. This is why virtualization is so important for the cloud: it allows efficient packing of unrelated workloads from many different tenants onto a single physical host, with a good deal of isolation.

(*) The thorny road to efficient x86 virtualization

TL;DR: It’s a small miracle that it works at all.

Prehistory: CP/CMS (1967)

Hardware virtualization is not exactly a new thing, having been first implemented in the late 1960s by IBM when mainframes were the only “serious” computers you could get. The motivation was simple: sales, and a hefty dose of corporate politics.

Most of IBM’s machines at the time, including their then-new System/360 line of mainframe computers, were built for batch processing. That didn’t sit well with MIT and Bell Labs, who thought time-sharing was a better fit for scientific computing. IBM started to lose market share to GE, spurring the development of System/360 Model 67 and a new operating system, the Time-Sharing System (TSS). TSS was however running late, and so IBM developed CP/CMS: a Control Program (CP), which we would call a hypervisor today, and the Cambridge Monitor System (CMS), a simple, interactive, single-user operating system.

Each user of the mainframe was allocated a virtual machine running CMS, and CP was in charge of all the VMs. Instead of running something very complex once, they decided to run something simple many times. From the user’s perspective, they had the entire mainframe machine at their disposal, when in fact, the many CMS instances were sharing the underlying hardware. Incredibly, the source code of CP was freely available. Think about how revolutionary this was almost sixty years ago!

The computers we are using today are not descendants of the mighty mainframes. Most of today’s personal computers and servers alike are developments of the original IBM PC running Intel 8088 (a variant of the 8086). Not surprisingly, the original x86 instruction set from the 1970s wasn’t designed with virtualization in mind - after all, the 8086 extended the 8-bit Intel 8080, whose own ancestor, the 8008, was designed for a programmable terminal.

And because in the x86 world, backward compatibility is king, in 2024 our 64-bit CPUs worth thousands of 1960s mainframes still pay homage to the 8086 with every tick. The x86 is a big success turned big mess, and x86 virtualization is no different.

First attempt: trap and emulate

“Trap and emulate” is a common virtualization technique which is both simple and efficient. In this approach, the hypervisor is running in ring 0, and the guest operating system runs in ring 3. That is, the entire guest runs in ring 3, both the user space and the kernel. Because the kernel is moved from the higher-privileged ring 0 to lesser-privileged ring 3, this is called deprivileging of the kernel.

The key observation is that the vast majority of code executing on a CPU is “safe” in the sense that it does not attempt to affect the shared resources of the machine in any way, such as modifying the page tables, configuring interrupt vectors or switching the CPU power modes. All user space code is safe by definition, since it is meant to be run in ring 3, where the CPU protection mechanisms would prevent such actions anyway. Similarly, a lot of kernel code is general-purpose computation, but interspersed with privileged or otherwise sensitive instructions. By running the virtual machine in ring 3, all of userspace code and a lot of kernel code can run just fine at the native speed of the CPU—with no overhead!

The question is, what happens when the VM’s kernel, running in ring 3, executes a sensitive instruction? The CPU protection mechanisms will prevent the instruction from executing (it’s running in ring 3) and an exception⁷ occurs. An exception handler registered by the hypervisor is executed in ring 0. This way, the hypervisor may step in and emulate the instruction in software and then switch back to the VM in ring 3.
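
The idea can be illustrated with a toy model. Everything in the sketch below is hypothetical: a made-up instruction set with one “safe” instruction (TOY_ADD) and two sensitive ones (TOY_CLI and TOY_STI, named after the x86 instructions that disable and enable interrupts), and a made-up vCPU structure. Safe instructions are executed directly, while sensitive ones “trap” into a handler that updates the guest’s virtual state instead of the real machine.

```c
#include <stdbool.h>
#include <stddef.h>

enum toy_op { TOY_ADD, TOY_CLI, TOY_STI };   /* hypothetical instruction set */

struct toy_insn { enum toy_op op; long operand; };

struct toy_vcpu {
    long acc;            /* general-purpose state, used directly by the guest */
    bool virtual_if;     /* the guest's virtual interrupt flag */
    unsigned traps;      /* how many times the hypervisor had to step in */
};

/* "Ring 0" side: runs only when the deprivileged guest faults. */
static void hypervisor_emulate(struct toy_vcpu *vcpu, struct toy_insn insn)
{
    vcpu->traps++;
    if (insn.op == TOY_CLI)
        vcpu->virtual_if = false;   /* real interrupts stay enabled */
    else if (insn.op == TOY_STI)
        vcpu->virtual_if = true;
}

/* "Ring 3" side: safe instructions run as-is, sensitive ones trap. */
void toy_run_guest(struct toy_vcpu *vcpu, const struct toy_insn *code, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        switch (code[i].op) {
        case TOY_ADD:
            vcpu->acc += code[i].operand;        /* no hypervisor involved */
            break;
        default:
            hypervisor_emulate(vcpu, code[i]);   /* exception -> hypervisor */
            break;
        }
    }
}
```

Running a program with three TOY_ADD instructions and one TOY_CLI/TOY_STI pair results in exactly two traps; the additions never involve the hypervisor at all. On real hardware, the “switch” is done by the CPU’s protection mechanism, not by software.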

Similarly, when user space code running in the VM makes a syscall with the SYSCALL or SYSENTER instruction, the CPU is switched to ring 0 and hypervisor code starts executing. In this case however, the hypervisor will simply switch back to the virtual machine’s kernel code, since that is the intended recipient of the syscall.

Unfortunately, this approach doesn’t work for x86.

Dynamic Binary Translation, VMware Workstation (1999)

The x86 instruction set, mostly due to backward compatibility, contains several so-called “non-virtualizable instructions”, also called stealth instructions. The POPF instruction is the canonical example given in every textbook. In ring 0, it is a privileged instruction, whereas in ring 3, it performs some non-privileged subset of the work, but doesn’t trigger an exception. It is therefore impossible for the hypervisor to trap and emulate that instruction. There are 17 such instructions in total, and they are the reason why trap and emulate alone doesn’t work for x86.
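
To see why POPF is a problem, consider the following simplified model of what it does to the interrupt flag (IF) and the I/O privilege level (IOPL) in protected mode. This is a sketch written for this text, not code from any real hypervisor: the function name popf_model is ours, and all other special cases of the real instruction are ignored. The important part is that bits the current ring is not allowed to change are silently kept, with no exception raised.

```c
#include <stdint.h>

#define EFLAGS_IF    (UINT32_C(1) << 9)    /* interrupt enable flag */
#define EFLAGS_IOPL  (UINT32_C(3) << 12)   /* I/O privilege level field */

/* Simplified protected-mode POPF: returns the new EFLAGS value given the
 * old one, the value popped from the stack and the current privilege
 * level (ring number). Only IF and IOPL are treated as protected. */
uint32_t popf_model(uint32_t old_flags, uint32_t popped, unsigned cpl)
{
    unsigned iopl = (old_flags & EFLAGS_IOPL) >> 12;
    uint32_t keep = 0;               /* bits POPF silently refuses to change */

    if (cpl > 0)
        keep |= EFLAGS_IOPL;         /* only ring 0 may change IOPL */
    if (cpl > iopl)
        keep |= EFLAGS_IF;           /* IF may change only if CPL <= IOPL */

    return (old_flags & keep) | (popped & ~keep);
}
```

A kernel in ring 0 that pops a value with IF cleared really disables interrupts. The same kernel deprivileged to ring 3 (with IOPL 0) executes the same instruction, IF stays set, and nobody is told: there is no trap for the hypervisor to catch, so it cannot update the guest’s virtual interrupt flag.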

Popek and Goldberg defined what it means for an instruction set to be virtualizable in 1974. Since their criteria hinted at the trap and emulate technique, the technique became synonymous with virtualizability. And since trap and emulate didn’t work for x86, x86 was thought to be impossible to virtualize, at least in the traditional sense. VMware was secretly started in 1998 and in 1999, VMware Workstation, an x86 hypervisor, was shipped to great commercial success.

To get around the problems caused by stealth instructions, a technique called dynamic binary translation was used by VMware engineers. User space code, safe by definition, would still run in ring 3, enjoying native performance. Syscalls made by user space code in the VM would still trap into the hypervisor and be forwarded to the VM’s kernel running in ring 1. But the kernel code wouldn’t be executed directly. Instead, it is translated on the fly: the translator inspects the instruction stream and replaces the stealth instructions with safer equivalent instructions, then runs the code in ring 1. This makes x86 virtualizable, even if not in the traditional trap and emulate sense.

Emulation, Shadow Structures, Shadow Page Tables

Through a combination of trapping privileged instructions and binary translating the kernel code, we now have a way of preventing all privileged operations from being executed by the VM. But so far, we didn’t discuss how exactly they should be emulated by the hypervisor. That, of course, depends on the operation. Page table emulation is an important, if complex, example.

As you probably know, the page tables are a tree-like data structure stored in memory, which the CPU uses to translate virtual memory addresses to physical ones. In Linux, each process has its own set of page tables which define its virtual memory layout. It’s the job of the operating system’s kernel to maintain the virtual-to-physical memory mapping for each process. To be in control of paging, the kernel needs at least to:

Store values into the control register CR3 of the CPU. Part of this register contains the Page Directory Base Address (physical address of the root page table). When Linux switches processes, it also switches the page tables, so that each process sees its own virtual memory address space.
Modify the page tables themselves. Page tables are “off-CPU” state, since they are stored in memory. Yet, they are crucial to the operation of the CPU, and must be protected from user space code.

To keep things simple, let’s only discuss how the hypervisor emulates reads and writes of the CR3 register.

The x86 architecture treats all access to the control registers, including CR3, as ring 0-only operations. Any attempt at reading or writing the CR3 register therefore traps into the hypervisor. The guest kernel must be able to read and write CR3, without ever touching the physical register.

The hypervisor therefore maintains a structure in memory representing the state of the CPU as the guest expects it to be. When the guest writes CR3, the hypervisor will instead update the value of CR3 in this CPU model. When the guest later reads CR3, the value stored in the model is retrieved.
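
In its simplest form, such a CPU model could look like the sketch below. The structure and function names are made up for illustration and are not taken from any real hypervisor; the two functions stand for the handlers a hypervisor would run after a guest access to CR3 has trapped.

```c
#include <stdint.h>

/* Hypothetical per-VM model of the CPU state the guest believes in. */
struct toy_vcpu_state {
    uint64_t cr3;   /* what the guest thinks is in CR3 */
};

/* Handler for a trapped write to CR3. */
void emulate_write_cr3(struct toy_vcpu_state *vcpu, uint64_t value)
{
    vcpu->cr3 = value;   /* the physical CR3 register is left alone */
    /* A real hypervisor would also have to make the guest's new
     * page tables take effect here, which is the hard part. */
}

/* Handler for a trapped read of CR3; the result is placed into the
 * guest's destination register by the caller. */
uint64_t emulate_read_cr3(const struct toy_vcpu_state *vcpu)
{
    return vcpu->cr3;
}
```

Because each VM has its own copy of the structure, two guests can “load” different values into CR3 and each will read back its own, without either of them ever touching the register the host relies on.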

For example, for the x86 platform, KVM maintains⁸ a fairly massive struct kvm_vcpu_arch representing the state of a VM’s CPU (also called virtual CPU, or vCPU):

struct kvm_vcpu_arch {
        /*
         * rip and regs accesses must go through
         * kvm_{register,rip}_{read,write} functions.
         */
        unsigned long regs[NR_VCPU_REGS];
        u32 regs_avail;
        u32 regs_dirty;

        unsigned long cr0;
        unsigned long cr0_guest_owned_bits;
        unsigned long cr2;
        unsigned long cr3;
        unsigned long cr4;
        unsigned long cr4_guest_owned_bits;
        unsigned long cr4_guest_rsvd_bits;
        unsigned long cr8;
        u32 host_pkru;
        u32 pkru;

        /* lots and lots of fields elided for brevity... */
};

TODO: It would be interesting to also describe how the emulation of page tables themselves works.

Hardware-assisted Virtualization: First Generation (2005)

In 2005, Intel introduced VT-x (originally called just VT), an extension to the x86 instruction set whose objective was to make it possible to build an x86 hypervisor without binary translation. Binary translation isn’t easy⁹ to get right and efficient at the same time: caches are needed for the translated code; the caches must be properly invalidated even in the face of self-modifying code, etc. VT-x finally provided a way to trap the stealth instructions, making binary translation unnecessary.

VT-x introduces 13 new instructions and a key new data structure stored in memory, the Virtual Machine Control Structure (VMCS). Once VT-x is enabled (with the VMXON instruction), the CPU enters a new mode called “root mode”. There is a complementary mode called “non-root mode”. Hypervisors using VT-x run in the root mode, and their guests (the VMs) run in non-root mode.

The four CPU protection rings still exist in addition to the root and non-root modes. Before VT-x, ring 0 was reserved for the hypervisor, and the guest kernel needed to be moved to a higher ring (deprivileged) to keep the hypervisor in charge. With VT-x enabled, there are now two ring 0’s: the root mode ring 0, where the hypervisor runs, and the non-root mode ring 0, where the guest kernel runs. Guest’s userspace runs in non-root mode ring 3.

This solves many issues. First of all, the issue with stealth instructions was caused by deprivileging: executing the instruction in a higher ring changed its semantics slightly, but did not cause a trap. Since kernel code is running in ring 0, stealth instructions don’t cause trouble anymore.

Furthermore, system calls (SYSCALL/SYSENTER) always switch the CPU to ring 0. Without VT-x, every system call made by the guest switched to ring 0 where a callback registered by the hypervisor was invoked, which then forwarded the system call to ring 1. This adds unnecessary overhead to every system call. Again, with VT-x, this is not an issue, since the kernel is actually running in ring 0.

To use VT-x, the hypervisor basically does this: first, it enables VT-x (VMXON). Then it will allocate a chunk of memory for each VM it wants to run which will hold the VMCS. The VMCS is an opaque binary blob, and may only be read and written with VMREAD/VMWRITE instructions. The actual encoding of the VMCS and its contents are considered an implementation detail of the CPU which no code should rely upon. The hypervisor then selects one of the VMs to be run and executes VMPTRLD (VM pointer load) to mark a VMCS active, and executes VMLAUNCH, starting the VM described by the active VMCS. Launching of the VM is called “VM entry”.

The VM is now free to run, subject to certain limitations captured in the VMCS: the hypervisor configures which conditions should trigger a so-called “VM exit,” or handing of control back to the hypervisor. As in trap-and-emulate, the hypervisor would service these conditions by emulating the instruction which caused the VM exit. Unlike trap-and-emulate however, the hypervisor is handed over (in the VMCS) detailed information about the offending instruction. Once instruction emulation is finished, VMRESUME resumes execution of the VM.
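
The order in which these instructions must be used can be captured in a tiny state machine. The sketch below is a toy model only: the toy_ names are ours, the real instructions can only be executed in ring 0, and nearly all of their actual behavior is left out. It models three rules: VMX instructions need VMXON first, VMPTRLD is what makes a VMCS current, and the first VM entry on a VMCS must use VMLAUNCH while later ones use VMRESUME.

```c
#include <stdbool.h>
#include <stddef.h>

enum toy_launch_state { TOY_VMCS_CLEAR, TOY_VMCS_LAUNCHED };

struct toy_vmcs { enum toy_launch_state state; };

struct toy_cpu {
    bool vmx_on;                /* set by VMXON */
    struct toy_vmcs *current;   /* set by VMPTRLD */
};

void toy_vmxon(struct toy_cpu *cpu) { cpu->vmx_on = true; }

/* Make a VMCS current; fails unless VMXON was executed first. */
bool toy_vmptrld(struct toy_cpu *cpu, struct toy_vmcs *vmcs)
{
    if (!cpu->vmx_on)
        return false;
    cpu->current = vmcs;
    return true;
}

/* First VM entry: needs a current VMCS that was not launched yet. */
bool toy_vmlaunch(struct toy_cpu *cpu)
{
    if (!cpu->vmx_on || cpu->current == NULL ||
        cpu->current->state != TOY_VMCS_CLEAR)
        return false;
    cpu->current->state = TOY_VMCS_LAUNCHED;
    return true;
}

/* Every later VM entry: needs a current VMCS that was already launched. */
bool toy_vmresume(struct toy_cpu *cpu)
{
    return cpu->vmx_on && cpu->current != NULL &&
           cpu->current->state == TOY_VMCS_LAUNCHED;
}
```

In this model, the only sequence that succeeds is toy_vmxon, toy_vmptrld, toy_vmlaunch, and then toy_vmresume after each VM exit, which mirrors the life cycle described above.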

One of the problems of trap-and-emulate we glossed over was that the hypervisor needed to decode the offending instruction from its binary form in software. Basically, it would read the instruction stream from the PC (program counter) onward and decode the first instruction to the right. Not only is that slow, but x86 instruction decoding is a non-trivial task, and with VT-x, the hypervisor is free from that hassle.

In the original implementation of VT-x, VM entry and exit were very expensive, as the CPU needed to store the current state of the CPU, load the new state, etc. On the first generation of Intel CPUs, the round-trip time (the time that VM exit followed by VM entry takes) was 1.5 microseconds, or roughly eternity. This led to a rather paradoxical situation where the software-only approach mastered by VMware was actually much faster than the first generation of hardware-assisted virtualization.
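
To get a feeling for why 1.5 microseconds is such a long time, it helps to convert it to clock cycles. The helper below is a trivial sketch (the function name is ours), and the 3 GHz clock used in the note after it is an assumption made purely for illustration, as the actual frequency depends on the CPU model.

```c
#include <stdint.h>

/* Number of clock cycles that elapse in `nanoseconds` on a CPU
 * clocked at `clock_mhz` megahertz. */
uint64_t cycles_elapsed(uint64_t nanoseconds, uint64_t clock_mhz)
{
    return nanoseconds * clock_mhz / 1000;
}
```

With the assumed 3000 MHz clock, cycles_elapsed(1500, 3000) gives 4500 cycles for a single round trip, during which no guest code runs at all.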

In 2006, AMD followed suit and introduced AMD-V, an incompatible but similar technology.

Functions of a Hypervisor, Type-1 and Type-2 Hypervisors

We now have a good deal of understanding of the problem which VT-x/AMD-V try to solve, and we have a basic understanding of the instructions and CPU modes they provide. What remains now is to explore how these features fit into a real-world hypervisor.

VT-x and AMD-V are low-level building blocks which are incredibly useful, but by themselves they aren’t a hypervisor. Let us explore some of the features a hypervisor needs to provide to be of any use:

1. Somehow, the hypervisor needs to get in control of the machine - after all, it needs to enable the hardware assisted virtualization features and switch the CPU to root mode. The host operating system needs to be aware of this happening.
2. The hypervisor needs to be able to utilize the underlying hardware-assisted virtualization features: detect which virtualization extensions are available, enable them, manage the VMCS for each VM and facilitate the transition between the root and non-root modes. It somehow needs to divide CPU time and memory among running VMs.
3. Some way to control what the hypervisor is doing is needed. This would typically be a low-level application programming interface (API) meant for programs rather than humans, to achieve interoperability.
4. A management interface for the administrator of the host is needed which translates the user’s commands into API calls. This allows the administrator to manage the life-cycle of the VMs: create (define) new VMs, start them, stop them, delete them, etc.
5. The guest needs additional hardware to run - virtualizing the CPU is not enough, we need to virtualize an entire platform (this is called platform virtualization). At a minimum, BIOS/UEFI firmware, persistent storage, networking and display capabilities are usually needed. These devices are often emulated in software¹⁰.
6. Emulated hardware is usually backed by actual hardware. For example, virtual disks often store their data in disk image files which are stored on actual storage devices, virtual network cards dispatch packets to physical network cards, etc. The hypervisor needs to control this hardware to be able to use it.

Going through the list, you’ll notice that many of the functions (such as hardware control and scheduling) are often provided by an operating system, and it’s natural to ask where the boundary lies between a hypervisor and an operating system. A coarse classification distinguishes between type-1 and type-2 hypervisors.

A type-1 hypervisor runs directly on the hardware of the host, or on the “bare metal”. Such hypervisors are often called bare-metal hypervisors for that reason, and they are small operating systems in their own right. They are what boots the machine up (1). Usually they contain the minimum amount of functionality required to run the VMs (2, 3, 5). However, to control the hardware, they require custom drivers (6). The prehistoric CP/CMS was of this kind. Xen is a modern type-1 hypervisor.

A type-2 hypervisor runs as a user-space process on top of an existing operating system, such as Linux or Windows, and is launched as any other application (1). It leaves device control, memory management, scheduling, … to the underlying operating system (2, 6) and only provides device emulation and VM management (3, 4, 5). VirtualBox is a type-2 hypervisor.

This classification is somewhat useful, but far from perfect: real-world hypervisors are often a compromise between these two poles, and it’s unclear how they should be classified. Strictly speaking, KVM is neither type-1 nor type-2. However, it’s useful to know these terms, since they remain in widespread use and, at least in the pure sense described above, are well defined.
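
To make KVM’s in-between position a little more concrete: its hypervisor core lives in the Linux kernel, and user-space programs control it through `ioctl` calls on the `/dev/kvm` device node. The sketch below asks KVM for its API version, which is about the simplest request the interface offers. It assumes a Linux host; the request number is the value of `KVM_GET_API_VERSION` from `<linux/kvm.h>`, and the function name `kvm_api_version` is made up for this example. If the device is missing or inaccessible, the function returns `None` instead of failing.

```python
import fcntl
import os

# _IO(KVMIO, 0x00) with KVMIO = 0xAE, as defined in <linux/kvm.h>.
KVM_GET_API_VERSION = 0xAE00


def kvm_api_version(device="/dev/kvm"):
    """Return the KVM API version, or None if KVM cannot be queried."""
    try:
        fd = os.open(device, os.O_RDWR | os.O_CLOEXEC)
    except OSError:
        # No KVM module loaded, no such device, or insufficient permissions.
        return None
    try:
        # With an integer argument, fcntl.ioctl returns the ioctl's result.
        return fcntl.ioctl(fd, KVM_GET_API_VERSION)
    except OSError:
        # The device exists but does not understand the request.
        return None
    finally:
        os.close(fd)


if __name__ == "__main__":
    version = kvm_api_version()
    if version is None:
        print("KVM is not available on this host")
    else:
        print(f"KVM API version: {version}")
```

The KVM API documentation states that the stable API reports version 12. A program such as QEMU starts with this same check before it creates VMs and virtual CPUs through further `ioctl` calls.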

Missing bits

Some bits are still missing and will be added in future revisions of this document. Let us know if you want to contribute any of these:

Mention microkernels, a different approach to operating system design
Mention Firecracker and compare it to QEMU
Describe how shadow page tables work
Describe VT-d and the AMD equivalent (basically IOMMU)
Measure VM-Entry and VM-Exit latency

Acknowledgements

Honza Dubský did a proofread of this text
Honza Černohorský helped with small improvements

Thanks!

Historic note: Unix version 7, often called “the last true Unix”, supported fewer than 50 syscalls: access, alarm, break, chdir, chmod, chown, chroot, close, creat, dup, exec, exece, exit, fork, fstat, ftime, getgid, getpid, getuid, gtime, kill, link, mknod, mount, nice, open, pause, profil, read, seek, setgid, setuid, ssig, stat, stime, sync, times, umask, umount, unlink, utime, wait, write. See sys1.c, sys2.c, sys3.c, sys4.c. ↩
An operating system used to be called a “supervisor program,” and the supervisor of the supervisors is… a hypervisor, obviously! ↩
Normally, when you connect to a virtual machine, it’s easy to tell that it’s being virtualized, for example by looking at the hardware configuration (CPU, memory, network, etc.) reported by the guest OS. Even if the host wanted to fool the guest into thinking it is running on real hardware, it would probably be nearly impossible, as virtualization has a measurable impact on the timing of events in the guest. ↩
Just to be clear: this has nothing to do with the root user. ↩
A module is a piece of pluggable functionality: something that’s not compiled into the kernel directly, but can be loaded and unloaded at runtime to extend the set of available features. KVM is typically built as a module, but can also be compiled directly into the kernel. ↩
Virtual machines are the most basic cloud offering. There are other compute, network and storage products you can choose from. To get a sense of the complexity, take a look at the AWS calculator. ↩
In systems programming, an exception is a condition which is triggered by software or by hardware. When a handler is registered for the exception, it is invoked, temporarily preempting whatever was running on the CPU when the exception occurred. An interrupt is a type of exception and so are faults such as division by zero.

Just to be clear, this has nothing to do with exceptions in high-level programming, even though there are some conceptual similarities. ↩
KVM doesn’t use binary translation and relies on hardware-assisted virtualization features of the CPU instead. But it still maintains a model of the CPU, and it was instructive to show an example here. ↩
See US Patent 6,397,242: Virtualization system including a virtual machine monitor for a computer with a segmented architecture. ↩
Software emulation of networking, storage etc. is viable as long as performance isn’t critical. When performance matters, it is possible to allow the guest VM to control some hardware directly by means of PCI pass-through. ↩