<title>Scheduling</title>
<html>
<head>
</head>
<body>

<h1>Scheduling</h1>

<p>Required reading: Eliminating receive livelock

<p>Notes based on Prof. Morris's lecture on scheduling (6.824, fall '02).

<h2>Overview</h2>

<ul>

<li>What is scheduling? The OS policies and mechanisms that allocate
resources to entities. A good scheduling policy ensures that the most
important entity gets the resources it needs. This topic was
popular in the days of time sharing, when there was a shortage of
resources. It seemed irrelevant in the era of PCs and workstations, when
resources were plentiful. Now the topic is back from the dead to handle
massive Internet servers with paying customers. The Internet exposes
web sites to international abuse and overload, which can lead to
resource shortages. Furthermore, some customers are more important
than others (e.g., the ones that buy a lot).

<li>Key problems:
<ul>
<li>Gap between desired policy and available mechanism. The desired
policies often include elements that are not implementable with the
mechanisms available to the operating system. Furthermore, there are
often many conflicting goals (low latency, high throughput, and
fairness), and the scheduler must make a trade-off between them.

<li>Interaction between different schedulers. One has to take a
systems view. Just optimizing the CPU scheduler may do little for
the overall desired policy.
</ul>

<li>Resources you might want to schedule: CPU time, physical memory,
disk and network I/O, and I/O bus bandwidth.

<li>Entities that you might want to give resources to: users,
processes, threads, web requests, or MIT accounts.

<li>Many policies for resource-to-entity allocation are possible:
strict priority, divide equally, shortest job first, minimum guarantee
combined with admission control.

<li>General plan for scheduling mechanisms
<ol>
<li> Understand where scheduling is occurring.
<li> Expose scheduling decisions, allow control.
<li> Account for resource consumption, to allow intelligent control.
</ol>

<li>Simple example from 6.828 kernel. The policy for scheduling
environments is to give each one equal CPU time. The mechanism used to
implement this policy is a clock interrupt every 10 msec and then
selecting the next environment in a round-robin fashion.

<p>But this only works if processes are compute-bound. What if a
process gives up some of its 10 msec to wait for input? Do we have to
keep track of that and give it back?

<p>How long should the quantum be? Is 10 msec the right answer?
A shorter quantum leads to better interactive performance, but
lowers overall system throughput because we reschedule more often,
and each reschedule has overhead.
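<p>A back-of-the-envelope sketch of this trade-off, assuming every
quantum is followed by one context switch of fixed cost. The 0.1 msec
switch cost, the count of 10 runnable environments, and both function
names are illustrative assumptions, not measurements from the 6.828 kernel.

```python
def cpu_efficiency(quantum_ms: float, switch_cost_ms: float) -> float:
    """Fraction of CPU time spent on useful work when each quantum
    is followed by one context switch of fixed cost."""
    return quantum_ms / (quantum_ms + switch_cost_ms)


def worst_case_wait_ms(quantum_ms: float, switch_cost_ms: float,
                       runnable: int) -> float:
    """Longest wait for a turn under round-robin: every other runnable
    environment uses a full quantum (plus a switch) first."""
    return (runnable - 1) * (quantum_ms + switch_cost_ms)


SWITCH_COST_MS = 0.1   # assumed, for illustration only
RUNNABLE = 10          # assumed number of runnable environments

for quantum in (1, 10, 100):
    print(quantum,
          cpu_efficiency(quantum, SWITCH_COST_MS),
          worst_case_wait_ms(quantum, SWITCH_COST_MS, RUNNABLE))
```

<p>Under these assumptions a longer quantum wastes a smaller fraction
of the CPU on switching, while the worst-case wait for a turn grows
roughly in proportion to the quantum.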

<p>What if the environment computes for 1 msec and sends an IPC to
the file server environment? Shouldn't the file server get more CPU
time because it operates on behalf of all other environments?

<p>Potential improvements for the 6.828 kernel: track "recent" CPU use
(e.g., over the last second) and always run the environment with the
least recent CPU use. (Still, if you sleep long enough you lose.)
Another solution: directed yield; specify on the yield to which
environment you are donating the remainder of the quantum (e.g., to the
file server so that it can compute on the environment's behalf).
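<p>A minimal sketch of the "least recent CPU use" idea, assuming an
exponentially decayed usage counter per environment. The <code>Env</code>
class, the decay factor, and the function names are hypothetical; they
are not part of the 6.828 kernel.

```python
DECAY = 0.5  # assumed: recorded usage halves every tick


class Env:
    """Hypothetical environment record with a decayed CPU-usage counter."""

    def __init__(self, name):
        self.name = name
        self.recent_cpu = 0.0
        self.runnable = True


def pick_next(envs):
    """Return the runnable environment with the least recent CPU use,
    or None if every environment is blocked."""
    runnable = [e for e in envs if e.runnable]
    return min(runnable, key=lambda e: e.recent_cpu) if runnable else None


def tick(envs):
    """Run one quantum: charge the chosen environment, then decay all usage."""
    chosen = pick_next(envs)
    if chosen is not None:
        chosen.recent_cpu += 1.0
    for e in envs:
        e.recent_cpu *= DECAY
    return chosen


hog, editor = Env("hog"), Env("editor")
editor.runnable = False        # editor is waiting for input
for _ in range(5):
    tick([hog, editor])        # only the compute-bound hog runs
editor.runnable = True         # a key was pressed
first = tick([hog, editor])
print(first.name)
```

<p>In this sketch the editor runs as soon as it wakes, because its
counter is lower than the hog's. Because usage decays toward zero, an
environment earns no extra credit for sleeping longer, which is one way
to read "if you sleep long enough you lose".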

<li>Pitfall: Priority Inversion
<pre>
Assume policy is strict priority.
Thread T1: low priority.
Thread T2: medium priority.
Thread T3: high priority.
T1: acquire(l)
context switch to T3
T3: acquire(l)... must wait for T1 to release(l)...
context switch to T2
T2 computes for a while
T3 is indefinitely delayed despite high priority.
Can solve if T3 lends its priority to holder of lock it is waiting for.
So T1 runs, not T2.
[this is really a multiple scheduler problem.]
[since locks schedule access to locked resource.]
</pre>
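<p>The trace above can be replayed in a toy model. This is a sketch
under stated assumptions: the <code>Thread</code> and <code>Lock</code>
classes and the <code>inherit</code> flag are hypothetical, and the
"scheduler" is just a function that picks the highest-priority thread
that is not blocked.

```python
class Thread:
    def __init__(self, name, priority):
        self.name = name
        self.base = priority
        self.donated = 0           # priority lent by a blocked waiter, if any
        self.waiting_for = None    # lock this thread is blocked on

    @property
    def priority(self):
        return max(self.base, self.donated)


class Lock:
    def __init__(self):
        self.holder = None


def acquire(thread, lock, inherit):
    """Take the lock if free; otherwise block, optionally lending priority."""
    if lock.holder is None:
        lock.holder = thread
        return True
    thread.waiting_for = lock
    if inherit:
        lock.holder.donated = max(lock.holder.donated, thread.priority)
    return False


def release(thread, lock, threads):
    """Drop the lock, give back any donated priority, unblock waiters."""
    lock.holder = None
    thread.donated = 0
    for t in threads:
        if t.waiting_for is lock:
            t.waiting_for = None   # waiter is runnable again and will retry


def pick(threads):
    """Strict priority: the highest-priority thread that is not blocked."""
    runnable = [t for t in threads if t.waiting_for is None]
    return max(runnable, key=lambda t: t.priority)


def next_to_run(inherit):
    t1, t2, t3 = Thread("T1", 1), Thread("T2", 2), Thread("T3", 3)
    l = Lock()
    acquire(t1, l, inherit)        # T1 holds l
    acquire(t3, l, inherit)        # T3 blocks waiting for T1
    return pick([t1, t2, t3]).name


print(next_to_run(inherit=False))  # T2: the inversion
print(next_to_run(inherit=True))   # T1: runs with T3's priority
```

<p>Without donation the medium-priority T2 is chosen while T3 waits;
with donation T1 is chosen, so it can release the lock sooner.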

<li>Pitfall: Efficiency. Efficiency often conflicts with fairness (or
any other policy). Examples: a long time quantum for CPU efficiency
versus low delay; shortest-seek-first versus FIFO disk scheduling;
contiguous read-ahead versus data needed now. For example, the scheduler
swaps out my idle emacs to let gcc run faster with more physical memory.
What happens when I type a key? These don't fit well into a "who gets
to go next" scheduler framework. Inefficient scheduling may make
<i>everybody</i> slower, including high priority users.

<li>Pitfall: Multiple Interacting Schedulers. Suppose you want your
emacs to have priority over everything else. Give it high CPU
priority. Does that mean nothing else will run if emacs wants to run?
The disk scheduler might not know to favor emacs's disk I/Os. A typical
UNIX disk scheduler favors disk efficiency, not process priority. Suppose
emacs needs more memory. Other processes have dirty pages; emacs must
wait for them to be written out. Does the disk scheduler know that these
other processes' writes are now high priority?

<li>Pitfall: Server Processes. Suppose emacs uses X windows to
display. The X server must serve requests from many clients. Does it
know that emacs' requests should be given priority? Does the OS know
to raise X's priority when it is serving emacs? Similarly for DNS,
and NFS. Does the network know to give emacs' NFS requests priority?

</ul>

<p>In short, scheduling is a system problem. There are many
schedulers; they interact. The CPU scheduler is usually the easy
part. The hardest part is system structure. For example, the
<i>existence</i> of interrupts is bad for scheduling. Conflicting
goals may limit effectiveness.

<h2>Case study: modern UNIX</h2>

<p>Goals:
<ul>
<li>Simplicity (e.g., avoid complex locking regimes).
<li>Quick response to device interrupts.
<li>Favor interactive response.
</ul>

<p>UNIX has a number of execution environments. We care about
scheduling transitions among them. Some transitions aren't possible,
some can't be controlled. The execution environments are:

<ul>
<li>Process, user half
<li>Process, kernel half
<li>Soft interrupts: timer, network
<li>Device interrupts
</ul>

<p>The rules are:
<ul>
<li>User is pre-emptible.
<li>Kernel half and software interrupts are not pre-emptible.
<li>Device handlers may not make blocking calls (e.g., sleep).
<li>Effective priorities: intr > soft intr > kernel half > user
</ul>

</ul>

<p>Rules are implemented as follows:

<ul>

<li>UNIX: Process User Half. Runs in process address space, on
per-process stack. Interruptible. Pre-emptible: interrupt may cause
context switch. We don't trust user processes to yield CPU.
Voluntarily enters kernel half via system calls and faults.

<li>UNIX: Process Kernel Half. Runs in kernel address space, on
per-process kernel stack. Executes system calls and faults for its
process. Interruptible (but can defer interrupts in critical
sections). Not pre-emptible. Only yields voluntarily, when waiting
for an event (e.g., disk I/O completion). This simplifies concurrency
control; locks are often not required. No user process runs if any kernel
half wants to run. Many processes' kernel halves may be sleeping in the
kernel.

<li>UNIX: Device Interrupts. Hardware interrupts the CPU to ask
for attention: a disk read/write completed, or a network packet was
received. Runs in kernel space, on a special interrupt stack. An
interrupt routine cannot block; it must return. Interrupts are
interruptible. They nest on the one interrupt stack. Interrupts are not
pre-emptible, and cannot really yield. The real-time clock is a device
and interrupts every 10ms (or whatever). Process scheduling decisions
can be made when an interrupt returns (e.g., wake up the process waiting
for this event). You want interrupt processing to be fast, since it has
priority. Don't do any more work than you have to. You're blocking
processes and other interrupts. Typically, an interrupt handler does the
minimal work necessary to keep the device happy, and then calls wakeup
on a thread.

<li>UNIX: Soft Interrupts. (Didn't exist in xv6.) Used when device
handling is expensive but there is no obvious process context in which
to run it. Examples include IP forwarding and TCP input processing. Runs
in kernel space, on the interrupt stack. Interruptible. Not pre-emptible,
can't really yield. Triggered by a hardware interrupt. Called when the
outermost hardware interrupt returns. Periodic scheduling decisions
are made in the timer s/w interrupt, which is scheduled by the hardware
timer interrupt (i.e., if the current process has run long enough, switch).
</ul>

<p>Is this good software structure? Let's talk about receive
livelock.

<h2>Paper discussion</h2>

<ul>

<li>What application is the paper addressing? IP forwarding.
What functionality does a network interface offer to the driver?
<ul>
<li> Read packets
<li> Poke hardware to send packets
<li> Interrupts when packet received/transmit complete
<li> Buffer many input packets
</ul>

<li>What devices in the 6.828 kernel are interrupt driven? Which ones
are polled? Is this ideal?

<li>Explain Figure 6-1. Why does it go up? What determines how high
the peak is? Why does it go down? What determines how fast it goes
down? Answer:
<pre>
(fraction of packets discarded)(work invested in discarded packets)
-------------------------------------------------------------------
                  (total work CPU is capable of)
</pre>
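<p>A rough model of that curve, as a sketch rather than the paper's
own analysis. It assumes every arriving packet costs a fixed amount of
CPU at interrupt priority, and every delivered packet needs a further
fixed amount at lower priority. The two cost constants and the function
name are made-up illustrative values.

```python
def delivered_rate(offered, intr_cost, proc_cost):
    """Packets/sec delivered when each arriving packet costs intr_cost
    seconds at interrupt priority and each delivered packet needs a
    further proc_cost seconds of lower-priority processing."""
    cpu_left = max(0.0, 1.0 - offered * intr_cost)  # CPU left after interrupts
    return min(offered, cpu_left / proc_cost)


INTR = 50e-6    # assumed per-packet interrupt cost, in seconds
PROC = 150e-6   # assumed per-packet forwarding cost, in seconds

peak = 1.0 / (INTR + PROC)   # load at which the CPU is exactly saturated

for load in (1000, 5000, 10000, 20000):
    print(load, round(delivered_rate(load, INTR, PROC)))
```

<p>In this model output tracks input up to the peak, which is set by the
total per-packet cost. Beyond the peak, each extra arriving packet steals
<code>intr_cost</code> of CPU from forwarding, so output falls with slope
<code>intr_cost / proc_cost</code> and reaches zero once interrupts alone
consume the whole CPU.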

<li>Suppose I wanted to test an NFS server for livelock.
<pre>
Run client with this loop:
while(1){
  send NFS READ RPC;
  wait for response;
}
</pre>
What would I see? Is the NFS server probably subject to livelock?
(No: the offered load is subject to feedback, since the client waits
for each response before sending the next request.)

<li>What other problems are we trying to address?
<ul>
<li>Increased latency for packet delivery and forwarding (e.g., start
disk head moving when first NFS read request comes)
<li>Transmit starvation
<li>User-level CPU starvation
</ul>

<li>Why not tell the O/S scheduler to give interrupts lower priority?
Interrupt handlers are non-preemptible.
Could you fix this by making interrupts faster? (Maybe, if coupled
with some limit on input rate.)

<li>Why not completely process each packet in the interrupt handler?
(I.e., forward it?) Other parts of the kernel don't expect to run at high
interrupt level (e.g., some packet processing code might invoke a function
that sleeps). Still might want an output queue.

<li>What about using polling instead of interrupts? Solves the overload
problem, but is a killer for latency.

<li>What's the paper's solution?
<ul>
<li>No IP input queue.
<li>Input processing and device input polling in kernel thread.
<li>Device receive interrupt just wakes up thread. And leaves
interrupts <i>disabled</i> for that device.
<li>Thread does all input processing, then re-enables interrupts.
</ul>
<p>Why does this work? What happens when packets arrive too fast?
What happens when packets arrive slowly?
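<p>A toy simulation of the scheme, to make the two questions concrete.
This is a sketch, not the paper's code: the <code>Nic</code> and
<code>Driver</code> classes, the ring size, and the quota are
hypothetical, and "forwarding" is reduced to a counter.

```python
from collections import deque


class Nic:
    """Hypothetical network card with a bounded receive ring."""

    def __init__(self, ring_size):
        self.ring = deque()
        self.ring_size = ring_size
        self.rx_intr_enabled = True
        self.dropped = 0


class Driver:
    def __init__(self, nic, quota):
        self.nic = nic
        self.quota = quota
        self.thread_runnable = False
        self.interrupts_taken = 0
        self.forwarded = 0

    def packet_arrives(self, pkt):
        nic = self.nic
        if len(nic.ring) >= nic.ring_size:
            nic.dropped += 1             # dropped on the card: no host CPU spent
            return
        nic.ring.append(pkt)
        if nic.rx_intr_enabled:
            self.rx_interrupt()

    def rx_interrupt(self):
        """Minimal handler: mask further rx interrupts, wake the thread."""
        self.interrupts_taken += 1
        self.nic.rx_intr_enabled = False
        self.thread_runnable = True

    def poll_once(self):
        """One turn of the kernel thread: process up to `quota` packets
        to completion, then let other work run."""
        if not self.thread_runnable:
            return
        for _ in range(self.quota):
            if not self.nic.ring:
                break
            self.nic.ring.popleft()
            self.forwarded += 1          # stands in for full IP forwarding
        if not self.nic.ring:
            self.nic.rx_intr_enabled = True   # caught up: back to interrupts
            self.thread_runnable = False


def run(per_round, rounds=100):
    drv = Driver(Nic(ring_size=16), quota=5)
    for _ in range(rounds):
        for p in range(per_round):
            drv.packet_arrives(p)
        drv.poll_once()
    return drv


fast, slow = run(per_round=10), run(per_round=1)
print(fast.interrupts_taken, fast.forwarded, fast.nic.dropped)
print(slow.interrupts_taken, slow.forwarded, slow.nic.dropped)
```

<p>In this model, overload produces a single interrupt: the thread keeps
polling, forwards at its full rate, and the excess is dropped on the card
before any host work is invested. At low load every packet raises an
interrupt and is handled at once, so latency stays low.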

<li>Explain Figure 6-3.
<ul>
<li>Why does "Polling (no quota)" work badly? (Input still starves
xmit complete processing.)
<li>Why does it immediately fall to zero, rather than gradually decreasing?
(xmit complete processing must be very cheap compared to input.)
</ul>

<li>Explain Figure 6-4.
<ul>

<li>Why does "Polling, no feedback" behave badly? There's a queue in
front of screend. We can still give 100% to input thread, 0% to
screend.

<li>Why does "Polling w/ feedback" behave well? Input thread yields
when queue to screend fills.

<li>What if screend hangs? What about other consumers of packets?
(E.g., can you ssh to the machine to fix screend?) Fortunately screend
is typically the only application. Also, re-enable input after a timeout.

</ul>
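<p>A minimal sketch of queue feedback with hysteresis, assuming the
input thread checks a flag before polling the device. The
<code>FeedbackQueue</code> class and its watermark fractions are
illustrative assumptions, and the timeout mentioned above is not modeled.

```python
from collections import deque


class FeedbackQueue:
    """Bounded queue to a consumer (e.g. screend) that tells the input
    thread when to stop and when to resume polling."""

    def __init__(self, capacity, high=0.75, low=0.25):
        self.items = deque()
        self.capacity = capacity
        self.high = high * capacity    # stop input at or above this length
        self.low = low * capacity      # resume input at or below this length
        self.input_enabled = True

    def put(self, pkt):
        """Called by the input thread. Returns False if the queue is full."""
        if len(self.items) >= self.capacity:
            return False
        self.items.append(pkt)
        if len(self.items) >= self.high:
            self.input_enabled = False
        return True

    def get(self):
        """Called by the consumer when it gets to run."""
        pkt = self.items.popleft()
        if len(self.items) <= self.low:
            self.input_enabled = True
        return pkt


q = FeedbackQueue(capacity=8)
for i in range(6):
    q.put(i)
print(q.input_enabled)     # input is inhibited once the queue is 75% full
for _ in range(4):
    q.get()
print(q.input_enabled)     # and resumes once it has drained to 25%
```

<p>While <code>input_enabled</code> is false the input thread yields
instead of polling, so the consumer gets CPU time and excess packets are
dropped on the card rather than after work has been spent on them.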

<li>Why are the two solutions different?
<ol>
<li> Polling thread <i>with quotas</i>.
<li> Feedback from full queue.
</ol>
(I believe they should have used #2 for both.)

<li>If we apply the proposed fixes, does the phenomenon totally go
away? (E.g., for a web server, waits for disk, etc.)
<ul>
<li>Can the net device throw away packets without slowing down the host?
<li>Problem: We want to drop packets for applications with big queues.
But it requires work to determine which application a packet belongs to.
Solution: NI-LRP (have the network interface sort packets).
</ul>

<li>What about the latency question? (Look at figure 14 p. 243.)
<ul>
<li>1st packet looks like an improvement over non-polling. But 2nd
packet transmitted later with polling. Why? (No new packets added to
xmit buffer until xmit interrupt.)
<li>Why wait for the xmit interrupt? In traditional BSD, to
amortize the cost of poking the device. Maybe better to poke a second
time anyway.
</ul>

<li>What if processing has more complex structure?
<ul>
<li>Chain of processing stages with queues? Does feedback work?
What happens when a late stage is slow?
<li>Split at some point, multiple parallel paths? Not so great; one
slow path blocks all paths.
</ul>

<li>Can we formulate any general principles from the paper?
<ul>
<li>Don't spend time on new work before completing existing work.
<li>Or give new work lower priority than partially-completed work.
</ul>

</ul>