Required reading: Disco
What is a virtual machine? IBM definition: a fully protected and isolated copy of the underlying machine's hardware.
Another view is that it provides another example of a kernel API. In contrast to other kernel APIs (Unix, microkernel, and exokernel), the virtual machine operating system exports the processor API (e.g., the x86 interface) as its kernel API. Thus, each program running in user space sees the services offered by a processor, and each program sees its own processor. Of course, we don't want to make a system call for each instruction, and in fact one of the main challenges in virtual machine operating systems is to design the system in such a way that the physical processor executes the virtual processor API directly, at processor speed.
Virtual machines can be useful for a number of reasons:
If your operating system isn't a virtual machine operating system, what are the alternatives? Processor simulation (e.g., bochs) or binary emulation (e.g., WINE). Simulation runs instructions purely in software and is slow (e.g., 100x slowdown for bochs); virtualization gets out of the way whenever possible and can be efficient.
Simulation gives portability, whereas virtualization focuses on performance. However, simulation means that you need to model your hardware very carefully in software. Binary emulation focuses on just getting the system calls of a particular operating system's interface right. Binary emulation can be hard because it is targeted at a particular operating system (and even that interface can change between revisions).
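To see why pure simulation is slow, consider a minimal sketch of an interpreter loop for a made-up three-instruction machine. The opcodes, `struct toy_cpu`, and `toy_simulate` are all hypothetical and are not taken from bochs. Every guest instruction costs a fetch, a decode, and a dispatch in software, where a virtual machine monitor would let the hardware execute it directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy ISA: every instruction is 3 bytes (opcode, a, b). */
enum { OP_HALT = 0, OP_LOADI = 1, OP_ADD = 2 };

struct toy_cpu {
    uint32_t reg[4];  /* general-purpose registers */
    size_t   pc;      /* byte offset of the next instruction */
};

/* Interpret 'code' until HALT, an unknown opcode, or the end of the buffer.
 * Returns the number of guest instructions executed (HALT included). */
size_t toy_simulate(struct toy_cpu *cpu, const uint8_t *code, size_t len)
{
    size_t executed = 0;

    assert(cpu != NULL && code != NULL);
    while (cpu->pc + 3 <= len) {
        uint8_t op = code[cpu->pc];          /* fetch */
        uint8_t a  = code[cpu->pc + 1] & 3;  /* decode: destination register */
        uint8_t b  = code[cpu->pc + 2];      /* decode: immediate or source */
        cpu->pc += 3;
        executed++;

        switch (op) {                        /* dispatch and execute */
        case OP_LOADI: cpu->reg[a] = b;                  break;
        case OP_ADD:   cpu->reg[a] += cpu->reg[b & 3];   break;
        default:       return executed;      /* HALT or unknown opcode */
        }
    }
    return executed;
}
```

Even in this tiny loop, one guest `ADD` turns into a dozen or so host instructions, which is where the large slowdown of a full simulator comes from.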
To provide each process with its own virtual processor that exports the same API as the physical processor, what features must the virtual machine operating system virtualize?
Virtual machine monitors (VMM) can be implemented in two ways:
The three primary functions of a virtual machine monitor are:
Understanding memory virtualization. Let's consider the MIPS example from the paper. Ideally, we'd be able to intercept and rewrite all memory address references (e.g., by intercepting virtual memory calls). Why can't we do this on the MIPS? (There are addresses that don't go through address translation, but we don't want the virtual machine to directly access memory!) What does Disco do to get around this problem? (Relink the guest kernel so that it runs outside this unmapped segment, in mapped address space.)
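As a rough illustration of the problem, the 32-bit MIPS kernel segments can be classified by address alone. The sketch below assumes the conventional 32-bit segment layout (kseg0 and kseg1 unmapped, kseg2 mapped); the function name `mips_bypasses_tlb` is made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Conventional 32-bit MIPS kernel segment boundaries. */
#define KSEG0_BASE 0x80000000u  /* unmapped, cached */
#define KSEG1_BASE 0xA0000000u  /* unmapped, uncached */
#define KSEG2_BASE 0xC0000000u  /* mapped through the TLB */

static_assert(KSEG1_BASE - KSEG0_BASE == 0x20000000u, "kseg0 spans 512 MB");

/* True if a reference to 'va' is translated by hardware without
 * consulting the TLB, so a monitor cannot interpose on it via TLB misses. */
bool mips_bypasses_tlb(uint32_t va)
{
    return va >= KSEG0_BASE && va < KSEG2_BASE;
}
```

A kernel linked to run in kseg0 would therefore never take a TLB miss for its own code and data, which is why the kernel has to be moved into a mapped region before the monitor can control its memory accesses.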
Having gotten around that problem, how do we handle things in general?
// Disco's TLB miss handler.
// Called when a memory reference for virtual address 'va'
// is made, but there is no VA->MA (virtual -> machine)
// mapping in the CPU's TLB.
void tlb_miss_handler (va)
{
  // See if we have a mapping in our "shadow" TLB (which includes
  // the "main" TLB).
  tlb_entry *t = tlb_lookup (thiscpu->l2tlb, va);
  if (t && defined (thiscpu->pmap[t->pa]))  // is there a MA for this PA?
    tlbwrite (va, thiscpu->pmap[t->pa], t->otherdata);
  else if (t)
    // get a machine page, copy the physical page into it, and tlbwrite
  else
    // trap to the virtual CPU/OS's handler
}

// Disco's procedure which emulates the MIPS
// instruction which writes to the TLB.
//
// va -- virtual address
// pa -- physical address (NOT ma, the machine address!)
// otherdata -- perms and stuff
void emulate_tlbwrite_instruction (va, pa, otherdata)
{
  tlb_insert (thiscpu->l2tlb, va, pa, otherdata);  // cache
  if (!defined (thiscpu->pmap[pa])) {  // fill in pmap dynamically
    ma = allocate_machine_page ();
    thiscpu->pmap[pa] = ma;            // See 4.2.2
    thiscpu->pmapbackmap[ma] = pa;
    thiscpu->memmap[ma] = va;          // See 4.2.3 (for TLB shootdowns)
  }
  tlbwrite (va, thiscpu->pmap[pa], otherdata);
}

// Disco's procedure which emulates the MIPS
// instruction which reads the TLB.
tlb_entry *emulate_tlbread_instruction (va)
{
  // Must return a TLB entry that has a "physical" address;
  // this is recorded in our secondary TLB cache.
  // (We don't have to read from the hardware TLB since
  // all writes to the hardware TLB are mediated by Disco.
  // Thus we can always keep the l2tlb up to date.)
  return tlb_lookup (thiscpu->l2tlb, va);
}
Requirements:
The MIPS didn't quite meet the second criterion, as discussed above. But it does have a supervisor mode, between user mode and kernel mode, in which any privileged instruction will trap.
What might the VMM trap handler look like?
void privilege_trap_handler (addr)
{
  instruction, args = decode_instruction (addr);
  switch (instruction) {
  case foo:
    emulate_foo (thiscpu, args, ...);
    break;
  case bar:
    emulate_bar (thiscpu, args, ...);
    break;
  case ...:
    ...
  }
}
The emulate_foo bits will have to evaluate the state of the virtual CPU and compute the appropriate "fake" answer.
What sort of state is needed in order to appropriately emulate all of these things?
- all user registers
- CPU-specific registers (e.g., on x86, %crN, debugging, FP...)
- page tables (or TLB)
- interrupt tables

This is needed for each virtual processor.
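The state listed above could be gathered into one structure per virtual processor. The following is only a sketch for an x86-like guest: the field names, the sizes, and `vcpu_reset` are assumptions made for this example, and the debug and floating-point registers are left out for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NGPR 8   /* x86-style general-purpose registers */
#define NCR  5   /* %cr0 .. %cr4 */

/* Hypothetical per-virtual-CPU state kept by the monitor. */
struct vcpu {
    uint32_t gpr[NGPR];   /* user-visible registers */
    uint32_t eip;
    uint32_t eflags;      /* the guest's view, including its interrupt flag */
    uint32_t cr[NCR];     /* virtual control registers; cr[3] is the guest's page table root */
    uint32_t idt_base;    /* where the guest believes its interrupt table lives */
    uint16_t idt_limit;
    int      cpl;         /* virtual privilege level (0 = guest kernel) */
    int      pending_irq; /* virtual interrupt waiting for delivery, or -1 */
};

/* Put a virtual CPU into a simple initial state that starts at 'entry'. */
void vcpu_reset(struct vcpu *v, uint32_t entry)
{
    assert(v != NULL);
    memset(v, 0, sizeof *v);
    v->eip = entry;
    v->eflags = 0x2;      /* bit 1 of EFLAGS always reads as 1 on x86 */
    v->pending_irq = -1;
}
```

Each emulation routine then reads and writes this structure instead of the real registers, so two virtual processors never see each other's privileged state.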
We intercept all communication to the I/O devices: reads and writes to reserved memory addresses cause page faults into special handlers, which emulate or pass through the I/O as appropriate.
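A minimal sketch of such a special handler, assuming a made-up virtual serial device: the address `VUART_BASE`, the register layout, and the names `struct vuart` and `mmio_fault` are all hypothetical. The handler decides from the faulting address whether the access was meant for the device, and if so emulates it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VUART_BASE   0x10000000u  /* hypothetical guest-physical device address */
#define VUART_DATA   0x0u         /* write: transmit one byte */
#define VUART_STATUS 0x4u         /* read: 1 means ready to transmit */
#define VUART_SIZE   0x8u

struct vuart {
    char   out[64];  /* bytes the guest has "transmitted" */
    size_t len;
};

/* Called from the page fault path. Returns true if the fault was a device
 * access and has been emulated; false if it is an ordinary page fault. */
bool mmio_fault(struct vuart *u, uint32_t gpa, bool is_write, uint32_t *value)
{
    assert(u != NULL && value != NULL);
    if (gpa < VUART_BASE || gpa >= VUART_BASE + VUART_SIZE)
        return false;

    uint32_t off = gpa - VUART_BASE;
    if (is_write) {
        if (off == VUART_DATA && u->len < sizeof u->out)
            u->out[u->len++] = (char)*value;   /* emulate the transmit */
    } else {
        *value = (off == VUART_STATUS) ? 1 : 0; /* fake register contents */
    }
    return true;
}
```

A pass-through device would forward the access to the real hardware at this point instead of updating a software model.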
In a system like Disco, the sequence would look something like:
Interrupts will require some additional work:
The above can be slow! So sometimes you want the guest operating system to be aware that it is a guest, and allow it to avoid the slow path. Examples include special device drivers, or rewriting instructions that would cause traps into ordinary memory read/write instructions.
VMware, unlike Disco, runs as an application on a host OS and cannot modify the guest OS. Furthermore, it must virtualize the x86 instead of the MIPS processor. Both of these differences make for good design challenges.
The first challenge is that the monitor runs in user space, yet it must dispatch traps and execute privileged instructions, both of which require kernel privileges. To address this challenge, the monitor loads a piece of code, a kernel module, into the host OS. Most modern operating systems are constructed as a core kernel extended with loadable kernel modules, which privileged users can insert at run time.
The kernel module reads the IDT, copies it, and overwrites the hard-wired entries with the addresses of stubs in the newly loaded module. When a trap happens, the kernel module inspects the PC and forwards the trap either to the monitor running in user space or to the host OS. If the trap was caused by a guest OS executing a privileged instruction, the monitor can emulate that instruction by asking the kernel module to perform it (perhaps after modifying the instruction's arguments).
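The copy-and-overwrite step and the PC-based forwarding can be modeled in ordinary C with a table of function pointers standing in for the IDT. This is only a simulation under stated assumptions: a real module would manipulate hardware gate descriptors and trap frames, and every name here (`install_hooks`, `trap_stub`, the counting handlers, the `vm_start`/`vm_end` range) is invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NVEC 256

typedef void (*handler_t)(int vec, uintptr_t pc);

handler_t saved_idt[NVEC];   /* copy of the original entries */
handler_t monitor_entry;     /* hands a trap to the user-space monitor */
uintptr_t vm_start, vm_end;  /* hypothetical PC range in which guest code runs */

/* Example handlers that only count how often they were reached. */
int host_traps, monitor_traps;
void host_handler(int vec, uintptr_t pc)    { (void)vec; (void)pc; host_traps++; }
void monitor_handler(int vec, uintptr_t pc) { (void)vec; (void)pc; monitor_traps++; }

/* Installed in every vector: route by the PC at which the trap occurred. */
void trap_stub(int vec, uintptr_t pc)
{
    assert(vec >= 0 && vec < NVEC);
    if (pc >= vm_start && pc < vm_end)
        monitor_entry(vec, pc);   /* trap came from the virtual machine */
    else
        saved_idt[vec](vec, pc);  /* trap belongs to the original OS handler */
}

/* Copy the table, then overwrite every entry with the stub. */
void install_hooks(handler_t idt[NVEC], handler_t monitor,
                   uintptr_t start, uintptr_t end)
{
    memcpy(saved_idt, idt, sizeof saved_idt);
    monitor_entry = monitor;
    vm_start = start;
    vm_end = end;
    for (int i = 0; i < NVEC; i++)
        idt[i] = trap_stub;
}
```

Keeping the saved copy is what lets traps unrelated to the virtual machine continue to reach their original handlers unchanged.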
The second challenge is virtualizing the x86 instructions. Unfortunately, the x86 doesn't meet the first two of the requirements for CPU virtualization above. If you run the virtual CPU in ring 3, most x86 instructions will be fine, because most privileged instructions will result in a trap, which can then be forwarded to VMware for emulation. For example, consider a guest OS loading the root of a page table into CR3. This results in a trap (the guest OS runs in user space), which is forwarded to the monitor, which can emulate the load to CR3 as follows:
// addr is a physical address
void emulate_lcr3 (thiscpu, addr)
{
  thiscpu->cr3 = addr;
  Pte *fakepdir = lookup (addr, oldcr3cache);
  if (!fakepdir) {
    fakepdir = ppage_alloc ();
    store (oldcr3cache, addr, fakepdir);
    // May wish to scan through supplied page directory to see if
    // we have to fix up anything in particular.
    // Exact settings will depend on how we want to handle
    // problem cases below and our own MM.
  }
  asm ("movl fakepdir,%cr3");
  // Must make sure our page fault handler is in sync with what we do here.
}
To virtualize the x86, the monitor must intercept any modifications to the page table and substitute appropriate responses, and it must also update things like the accessed/dirty bits. The monitor can arrange for this by making all page table pages inaccessible, so that it can emulate loads and stores to page table pages. This setup allows the monitor to virtualize the memory interface of the x86.
Unfortunately, not all instructions that must be virtualized result in traps:
- pushf/popf: FL_IF is handled differently, for example. In user mode, setting FL_IF is just ignored.
- Instructions (push, pop, mov) that read or write %cs, which contains the privilege level.
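To make the pushf/popf problem concrete, the sketch below contrasts what the hardware does with popf in user mode with what the monitor would have to do on behalf of a guest kernel. Only the interrupt flag is modeled (IOPL and the other system flags are ignored), and `struct guest_flags`, `hw_popf_user`, `emulate_popf`, and `emulate_pushf` are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FL_IF 0x00000200u  /* x86 interrupt-enable flag */

struct guest_flags {
    uint32_t eflags;     /* the flags the guest believes it has */
    int      guest_cpl;  /* virtual privilege level (0 = guest kernel) */
};

/* Hardware behavior for popf in user mode: the new value is loaded,
 * but FL_IF silently keeps its old value and no trap is raised. */
uint32_t hw_popf_user(uint32_t cur, uint32_t popped)
{
    return (popped & ~FL_IF) | (cur & FL_IF);
}

/* What the monitor must do instead when the guest executes popf. */
void emulate_popf(struct guest_flags *g, uint32_t popped)
{
    assert(g != NULL);
    if (g->guest_cpl == 0)
        g->eflags = popped;                          /* guest kernel may change FL_IF */
    else
        g->eflags = hw_popf_user(g->eflags, popped); /* guest user code may not */
}

/* pushf must return the virtual flags, not the real ones. */
uint32_t emulate_pushf(const struct guest_flags *g)
{
    assert(g != NULL);
    return g->eflags;
}
```

Because the real instruction never traps, the monitor gets no chance to run `emulate_popf` unless it finds the instruction some other way.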
How can we virtualize these instructions? An approach is to decode the instruction stream that is provided by the user and look for bad instructions. When we find them, replace them with an interrupt (INT 3) that will allow the VMM to handle it correctly. This might look something like:
void initcode ()
{
  scan_for_nonvirtualizable (thiscpu, 0x7c00);
}

void scan_for_nonvirtualizable (thiscpu, startaddr)
{
  addr = startaddr;
  instr = disassemble (addr);
  while (instr is not branch or bad) {
    addr += len (instr);
    instr = disassemble (addr);
  }
  // remember that we wanted to execute this instruction.
  replace (addr, "int 3");
  record (thiscpu->rewrites, addr, instr);
}

void breakpoint_handler (tf)
{
  oldinstr = lookup (thiscpu->rewrites, tf->eip);
  if (oldinstr is branch) {
    newcs:neweip = evaluate branch
    scan_for_nonvirtualizable (thiscpu, newcs:neweip)
    return;
  } else {
    // something non virtualizable
    // dispatch to appropriate emulation
  }
}
All pages must be scanned in this way. Fortunately, most pages probably are okay and don't really need any special handling so after scanning them once, we can just remember that the page is okay and let it run natively.
What if a guest OS generates instructions, writes them to memory, and then wants to execute them? We must detect self-modifying code (e.g., we must simulate buffer overflow attacks correctly). When a write happens to a physical page that is in a code segment, we must trap the write and then rescan the affected portions of the page.
What about self-examining code? We need to protect it somehow, possibly by playing tricks with the instruction/data TLB caches, or by introducing a private segment for code (%cs) that is different from the segment used for reads/writes (%ds).
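The bookkeeping for "scan once, rescan after a write" can be sketched as a per-page flag. This is a simplified model with assumed names (`struct scan_state`, `before_execute`, `on_code_write`): the actual scan and the write protection of the page are represented only by a counter and a boolean, and the whole page is rescanned rather than just the affected portion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 16

struct scan_state {
    bool     scanned[NPAGES];  /* true: page was scanned and is write-protected */
    unsigned scans;            /* how many scans were performed */
};

/* Called before the guest is allowed to execute from 'page'. */
void before_execute(struct scan_state *s, unsigned page)
{
    assert(s != NULL && page < NPAGES);
    if (s->scanned[page])
        return;                /* known good: run natively */
    s->scans++;                /* stand-in for the real instruction scan */
    s->scanned[page] = true;   /* a real monitor would also write-protect the page */
}

/* Called when a trapped write hits a scanned code page. */
void on_code_write(struct scan_state *s, unsigned page)
{
    assert(s != NULL && page < NPAGES);
    s->scanned[page] = false;  /* contents changed: must rescan before next execution */
}
```

The common case, executing an unmodified page again, costs only the flag check, which is what keeps scanning affordable.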
Disco has some I/O specific optimizations.
Disco developers clearly had access to IRIX source code.
Performance?
Premise. Are virtual machines the preferred approach to extending operating systems? Have scalable multiprocessors materialized?
John Scott Robin, Cynthia E. Irvine. Analysis of the Intel Pentium's Ability to Support a Secure Virtual Machine Monitor.
Jeremy Sugerman, Ganesh Venkitachalam, Beng-Hong Lim. Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor. In Proceedings of the 2001 Usenix Technical Conference.
Kevin Lawton, Drew Northup. Plex86 Virtual Machine.
Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, Andrew Warfield. Xen and the Art of Virtualization. In SOSP 2003.
Keith Adams, Ole Agesen. A Comparison of Software and Hardware Techniques for x86 Virtualization. In ASPLOS 2006.