<html>
<head>
<title>Microkernel lecture</title>
</head>
<body>

<h1>Microkernels</h1>

<p>Required reading: Improving IPC by kernel design.

<h2>Overview</h2>

<p>This lecture looks at the microkernel organization. In a
microkernel, services that a monolithic kernel implements inside the
kernel run instead as user-level programs. For example, the file
system, UNIX process management, the pager, and the network protocols
each run in a separate user-level address space. The microkernel
itself supports only the services that are necessary to allow system
services to run well in user space; a typical microkernel has at least
support for creating address spaces, threads, and inter-process
communication.

<p>The potential advantages of a microkernel are simplicity of the
kernel (it is small), isolation of operating system components (each
runs in its own user-level address space), and flexibility (we can
have both a file server and a database server). One potential
disadvantage is performance loss, because what in a monolithic kernel
requires a single system call may in a microkernel require multiple
system calls and context switches.

<p>One way in which microkernels differ from each other is the exact
kernel API they implement. For example, Mach (a system developed at
CMU, which influenced a number of commercial operating systems) has
the following system calls: processes (create, terminate, suspend,
resume, priority, assign, info, threads), threads (fork, exit, join,
detach, yield, self), ports and messages (a port is a unidirectional
communication channel with a message queue and supporting primitives
to send, destroy, etc.), and regions/memory objects (allocate,
deallocate, map, copy, inherit, read, write).

<p>Some microkernels are more "microkernel" than others. For example,
some microkernels implement the pager in user space but the basic
virtual memory abstractions in the kernel (e.g., Mach); others are
more extreme and implement most of the virtual memory in user space
(L4). Yet others are less extreme: many servers run in their own
address spaces, but in kernel mode (Chorus).

<p>All microkernels support multiple threads per address space. xv6
and Unix until recently didn't; why? Because in Unix, system services
are typically implemented in the kernel, and those are the primary
programs that need multiple threads to handle events concurrently
(waiting for the disk while processing new I/O requests). In
microkernels, these services are implemented in user-level address
spaces, so they need a mechanism for handling operations concurrently.
(Of course, one can argue that if fork is efficient enough, there is
no need for threads.)

<h2>L3/L4</h2>

<p>L3 is a predecessor to L4. L3 provides data persistence, DOS
emulation, and an ELAN runtime system. L4 is a reimplementation of L3,
but without the data persistence. L4KA is a project at
sourceforge.net, and you can download the code for the latest
incarnation of L4 from there.
<p>L4 is a "second-generation" microkernel, with 7 calls: IPC (of
|
||
|
which there are several types), id_nearest (find a thread with an ID
|
||
|
close the given ID), fpage_unmap (unmap pages, mapping is done as a
|
||
|
side-effect of IPC), thread_switch (hand processor to specified
|
||
|
thread), lthread_ex_regs (manipulate thread registers),
|
||
|
thread_schedule (set scheduling policies), task_new (create a new
|
||
|
address space with some default number of threads). These calls
|
||
|
provide address spaces, tasks, threads, interprocess communication,
|
||
|
and unique identifiers. An address space is a set of mappings.
|
||
|
Multiple threads may share mappings, a thread may grants mappings to
|
||
|
another thread (through IPC). Task is the set of threads sharing an
|
||
|
address space.
|
||
|
|
||
|

<p>A thread is the execution abstraction; it belongs to an address
space and has a UID, a register set, a page fault handler, and an
exception handler. The UID of a thread is its task number plus the
number of the thread within that task.
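
<p>As a concrete illustration of such an encoding (the bit widths here
are made up, not L4's actual layout), the task number could live in
the high bits of the UID and the thread-within-task number in the low
bits:
<pre>
/* Illustrative UID packing; not the real L3/L4 layout. */
#define LTHREAD_BITS 7                       /* assumed threads-per-task width */
#define LTHREAD_MASK ((1u << LTHREAD_BITS) - 1)

unsigned tid_make(unsigned task, unsigned lthread)
{
    return (task << LTHREAD_BITS) | lthread;
}

unsigned tid_task(unsigned tid)    { return tid >> LTHREAD_BITS; }
unsigned tid_lthread(unsigned tid) { return tid & LTHREAD_MASK; }
</pre>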

<p>IPC passes data by value or by reference to another address space.
It also provides for sequence coordination. It is used for
communication between clients and servers, to pass interrupts to a
user-level exception handler, and to pass page faults to an external
pager. In L4, device drivers are implemented as user-level
processes with the device mapped into their address space.
Linux runs as a user-level process.

<p>L4 provides quite a range of message types: inline-by-value,
strings, and virtual memory mappings. The send and receive descriptors
specify how many of each, if any.
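
<p>A purely illustrative way to picture a descriptor (this is not the
real L4 message format) is as a header that counts how much of each
kind of payload follows:
<pre>
/* Made-up sketch of a send/receive descriptor, for intuition only. */
struct msg_descriptor {
    unsigned n_inline_words;   /* values copied directly */
    unsigned n_strings;        /* out-of-line strings to copy */
    unsigned n_mappings;       /* virtual memory mappings to establish */
};
</pre>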

<p>In addition, there is a system call for timeouts and controlling
thread scheduling.

<h2>L3/L4 paper discussion</h2>

<ul>

<li>This paper is about performance. What is a microsecond? Is 100
usec bad? Is 5 usec so much better that we care? How many instructions
does a 50-MHz x86 execute in 100 usec? What can we compute with that
number of instructions? How many disk operations in that time? How
many interrupts can we take? (The livelock paper, which we cover in a
few lectures, mentions 5,000 network pkts per second, and each packet
generates two interrupts.)
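
<p>A back-of-the-envelope calculation (my numbers, not the paper's):
<pre>
50 MHz          = 50 cycles/usec, so 100 usec ~ 5,000 cycles,
                  i.e. very roughly a few thousand instructions
disk operation  ~ 10 msec, so 100 usec is only ~1% of a disk op
5,000 pkts/sec x 2 interrupts/pkt = 10,000 interrupts/sec,
                  i.e. one interrupt every 100 usec -- a 100-usec
                  IPC path would eat the whole CPU just delivering
                  interrupts
</pre>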

<li>In performance calculations, what is the appropriate/better
metric? Microseconds or cycles?

<li>Goal: improve IPC performance by a factor of 10 through careful
kernel design that is fully aware of the hardware it is running on.
Principle: performance rules! Optimize for the common case. Because
in L3 interrupts are propagated to user level using IPC, the system
may have to support many IPCs per second (as many as the device can
generate interrupts).

<li>IPC consists of transferring control and transferring data. The
minimal cost for transferring control is 127 cycles, plus 45 cycles
for TLB misses (see table 3). What are the x86 instructions to enter
and leave the kernel? (int, iret) Why do they consume so much time?
(They flush the pipeline.) Do modern processors perform these
operations more efficiently? No, it is worse now: faster processors
are optimized for straight-line code, traps/exceptions flush a deeper
pipeline, and cache misses cost more cycles.

<li>What are the 5 TLB misses? 1) B's thread control block; loading
%cr3 flushes the TLB, so 2) the kernel text causes a miss; the iret
accesses both 3) the stack and 4+5) the user text -- two pages, since
B's user code looks at the message.

<li>Interface:
<ul>
<li>call (threadID, send-message, receive-message, timeout);
<li>reply_and_receive (reply-message, receive-message, timeout);
</ul>
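
<p>A minimal sketch of how a client and a server might use these two
primitives; the wrapper functions, message type, and timeout constant
below are invented for illustration and are not the real L4 calling
convention (which packs short messages into registers):
<pre>
struct msg { long w0, w1; };

/* assumed wrappers around the IPC trap, plus an assumed handler */
void ipc_call(int dest_tid, struct msg *snd, struct msg *rcv, int tmo);
void ipc_receive(int *src_tid, struct msg *rcv, int tmo);
void ipc_reply_and_receive(int reply_tid, struct msg *snd,
                           int *next_tid, struct msg *rcv, int tmo);
struct msg handle(struct msg req);

#define NEVER (-1)    /* assumed "no timeout" value */

/* client: one trap does the send plus the blocking receive of the reply */
struct msg rpc(int server_tid, struct msg req)
{
    struct msg reply;
    ipc_call(server_tid, &req, &reply, NEVER);
    return reply;
}

/* server: after the first receive, each iteration costs one trap,
   which replies to the previous client and waits for the next one */
void server_loop(void)
{
    int client;
    struct msg req, reply;

    ipc_receive(&client, &req, NEVER);
    for (;;) {
        reply = handle(req);
        ipc_reply_and_receive(client, &reply, &client, &req, NEVER);
    }
}
</pre>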

<li>Optimizations:
<ul>

<li>New system call: reply_and_receive. Effect: 2 system calls per
RPC.

<li>Complex messages: direct string, indirect strings, and memory
objects.

<li>Direct transfer by temporary mapping through a communication
window. The communication window is mapped into B's address space and
into A's kernel address space; why is this better than just mapping a
page shared between A's and B's address spaces? 1) With multi-level
security, a permanently shared page makes it hard to reason about
information flow; 2) the receiver can't check message legality (the
contents might change after the check); 3) when a server has many
clients, it could run out of virtual address space, and the shared
memory region has to be established ahead of time; 4) it is not
application friendly, since the data may already be at another
address, i.e., applications would have to copy anyway--possibly more
copies.

<li>Why not use the following approach: map the region copy-on-write
(or read-only) in A's address space after the send, and read-only in
B's address space? Because then B may have to copy the data, or cannot
receive data at its final destination.

<li>On the x86 this is implemented by copying B's PDE into A's address
space. Why two PDEs? (The maximum message size is 4 Mbyte, so the
transfer is guaranteed to work if the message starts in the bottom
4 Mbyte of an 8-Mbyte mapped region.) Why not just copy PTEs? That
would be much more expensive.
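
<p>A sketch of the PDE-copy idea (the names and the window slot below
are assumed, not the actual L3/L4 code); copying one page-directory
entry maps a whole 4-Mbyte chunk of B's space at once:
<pre>
#define PDE_SHIFT   22          /* each x86 PDE covers 4 Mbyte */
#define WINDOW_PDE  0x300       /* assumed slot of the window in A's pgdir */

/* Point the 8-Mbyte communication window in A's page directory at the
   destination address b_dest in B's address space. */
void map_comm_window(unsigned *a_pgdir, unsigned *b_pgdir, unsigned b_dest)
{
    unsigned i = b_dest >> PDE_SHIFT;
    a_pgdir[WINDOW_PDE]     = b_pgdir[i];       /* first 4 Mbyte */
    a_pgdir[WINDOW_PDE + 1] = b_pgdir[i + 1];   /* message may spill into next 4 Mbyte */
}
</pre>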

<li>What does it mean for the TLB to be "window clean"? Why do we
care? It means the TLB contains no mappings within the communication
window. We care because mapping is cheap (copy a PDE), but
invalidation is not: the x86 only lets you invalidate one page at a
time, or the whole TLB. Does TLB invalidation of the communication
window turn out to be a problem? Not usually, because the kernel has
to load %cr3 during IPC anyway.
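
<p>A sketch of the corresponding cleanup (the flag and names are
assumed, not the actual code): clearing the window PDEs is cheap, and
the TLB only needs an explicit flush if it might still hold window
translations:
<pre>
static int tlb_window_clean;    /* assumed bookkeeping flag */

void unmap_comm_window(unsigned *a_pgdir)
{
    a_pgdir[WINDOW_PDE]     = 0;
    a_pgdir[WINDOW_PDE + 1] = 0;
    if (!tlb_window_clean)      /* TLB may still cache window mappings */
        asm volatile("mov %%cr3, %%eax; mov %%eax, %%cr3" ::: "eax", "memory");
    tlb_window_clean = 1;
}
</pre>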

<li>The thread control block holds registers, links to various
doubly-linked lists, the pgdir, the UID, etc. The lower part of a
thread UID contains the TCB number. One can also deduce the TCB
address from the stack by ANDing the SP with a bitmask (the SP comes
out of the TSS when just switching to kernel mode).

<li>The kernel stack is on the same page as the TCB. Why? 1) It
minimizes TLB misses (since accessing the kernel stack brings in the
TCB); 2) it allows very efficient access to the TCB -- just mask off
the lower 12 bits of %esp; 3) with VM, the lower 32 bits of the thread
id can indicate which TCB; using one page per TCB means there is no
need to check whether a thread is swapped out (simply don't map that
TCB if it shouldn't be accessed).
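
<p>A minimal sketch of point 2 (illustrative; assumes 4-KB pages with
the TCB at the bottom of the page and the kernel stack growing down
from the top):
<pre>
struct tcb;     /* layout omitted */

/* The current thread's TCB shares a page with its kernel stack, so
   masking off the low 12 bits of the stack pointer finds the TCB. */
struct tcb *current_tcb(void)
{
    unsigned long esp;
    asm volatile("mov %%esp, %0" : "=r"(esp));
    return (struct tcb *)(esp & ~0xfffUL);
}
</pre>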

<li>Invariant on queues: queues always hold in-memory TCBs.

<li>Wakeup queue: a set of 8 unordered wakeup lists (indexed by wakeup
time mod 8), and a smart representation of time so that 32-bit
integers can be used in the common case (base + offset in msec; bump
the base and recompute all offsets every ~4 hours; the maximum timeout
is ~24 days, 2^31 msec).
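
<p>A sketch of that bookkeeping (field names and list handling are
assumed, not the actual L3 code):
<pre>
#define NWAKEUP 8

struct tcb {
    struct tcb *wakeup_next;
    unsigned    wakeup_offset;   /* msec past time_base_msec; fits in 32 bits */
    /* ... */
};

static unsigned long long time_base_msec;   /* bumped every ~4 hours */
static struct tcb *wakeup_list[NWAKEUP];    /* 8 unordered lists */

void set_timeout(struct tcb *t, unsigned long long wakeup_msec)
{
    unsigned i = (unsigned)(wakeup_msec % NWAKEUP);      /* pick a list */
    t->wakeup_offset = (unsigned)(wakeup_msec - time_base_msec);
    t->wakeup_next = wakeup_list[i];
    wakeup_list[i] = t;
}
</pre>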

<li>What is the problem addressed by lazy scheduling?
The conventional approach to scheduling:
<pre>
A sends message to B:
  Move A from ready queue to waiting queue
  Move B from waiting queue to ready queue
This requires 58 cycles, including 4 TLB misses. What are the TLB misses?
  One each for the head of the ready and waiting queues
  One each for the previous queue element during the remove
</pre>

<li>Lazy scheduling:
<pre>
Ready queue must contain all ready threads except the current one
  Might contain other threads that aren't actually ready, though
Each wakeup queue contains all threads waiting in that queue
  Again, might contain other threads, too
Scheduler removes inappropriate queue entries when scanning the queue
</pre>
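
<p>A sketch of the scanning side (names and fields are assumed):
<pre>
/* IPC leaves the queues alone in the common case; the scheduler
   filters out stale entries only when it actually scans. */
struct tcb *schedule(void)
{
    struct tcb *t;

    while ((t = ready_queue_head()) != 0) {     /* assumed queue helpers */
        if (t->state == READY)
            return t;               /* genuinely ready: run it */
        ready_queue_remove(t);      /* lazily drop the stale entry */
    }
    return idle_thread;             /* assumed idle thread */
}
</pre>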

<li>Why does this help performance? There are only three situations in
which a thread gives up the CPU but stays ready: a send syscall (as
opposed to call), preemption, and hardware interrupts. So very often
the kernel can IPC into a thread without ever putting it on the ready
list.

<li>Direct process switch. This section just says you should use
kernel threads instead of continuations.

<li>Short messages via registers.

<li>Avoiding unnecessary copies. Basically, one can send and receive
messages with the same vector. This makes forwarding efficient, which
is important for the Clans/Chiefs model.

<li>Segment register optimization. Loading segment registers is slow:
the CPU has to access the GDT, etc. But the common case is that user
code doesn't change its segment registers. Observation: it is faster
to check a segment register than to load it. So just check that the
segment registers are okay, and only load them if user code changed
them.
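
<p>A simplified sketch of the check-before-load idea (this only
compares selector values; USER_DS is an assumed constant, and the real
L3 check is more involved):
<pre>
#define USER_DS 0x23    /* assumed flat user data segment selector */

static void ensure_user_ds(void)
{
    unsigned short sel;

    asm volatile("mov %%ds, %0" : "=r"(sel));      /* cheap: register read */
    if (sel != USER_DS) {                          /* rare: user changed %ds */
        unsigned short want = USER_DS;
        asm volatile("mov %0, %%ds" :: "r"(want)); /* expensive: GDT access */
    }
}
</pre>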

<li>Registers for parameter passing wherever possible: system calls
and IPC.

<li>Minimizing TLB misses. Try to cram as many things as possible onto
the same page: the IPC kernel code, GDT, IDT, and TSS all on one page.
Actually, the whole tables may not fit, but the important parts of the
tables can go on the same page (maybe the beginning of the TSS, IDT,
or GDT only?).

<li>Coding tricks: short offsets, avoid jumps, avoid checks, pack
often-used data on same cache lines, lazily save/restore CPU state
like debug and FPU registers. Much of the kernel is written in
assembly!

<li>What are the results? Figures 7 and 8 look good.

<li>Is fast IPC enough to get good overall system performance? This
paper doesn't make a statement either way; we have to read their 1997
paper to find the answer to that question.

<li>Is the principle of optimizing for performance right? In general,
it is wrong to optimize for performance; other things matter more. Is
IPC the one exception? Maybe, perhaps not. Was Liedtke fighting a
losing battle against CPU makers? Should fast IPC be a hardware issue,
or just an OS issue?

</ul>

</body>
</html>