<html>
<head>
<title>Scalable coordination</title>
</head>
<body>

<h1>Scalable coordination</h1>

<p>Required reading: Mellor-Crummey and Scott, Algorithms for Scalable
Synchronization on Shared-Memory Multiprocessors, TOCS, Feb 1991.

<h2>Overview</h2>

<p>Shared-memory machines are a bunch of CPUs sharing physical memory.
Typically each processor also maintains a cache (for performance),
which introduces the problem of keeping the caches coherent. If processor 1
writes a memory location whose value processor 2 has cached, then
processor 2's cache must be updated in some way. How?
<ul>

<li>Bus-based schemes. Any CPU can access any memory equally
("dance hall architecture"). Use "snoopy" protocols: each CPU's cache
listens to the memory bus. With a write-through architecture, a cache
invalidates its copy when it sees a write. Or have an "ownership" scheme
with a write-back cache (e.g., Pentium caches have MESI bits---modified,
exclusive, shared, invalid). If the E bit is set, the CPU caches the line
exclusively and can write it back. But the bus places limits on scalability.

<li>More scalability with NUMA schemes (non-uniform memory access). Each
CPU comes with fast "close" memory; it is slower to access memory that is
stored with another processor. Use a directory to keep track of who is
caching what. For example, processor 0 is responsible for all memory
starting with address "000", processor 1 is responsible for all memory
starting with "001", etc.

<li>COMA - cache-only memory architecture. Each CPU has local RAM,
treated as a cache. Cache lines migrate around to different nodes based
on access pattern. Data lives only in caches, with no permanent memory
location. (These machines aren't too popular any more.)

</ul>

<h2>Scalable locks</h2>

<p>This paper is about the cost and scalability of locking; what if you
have 10 CPUs waiting for the same lock? For example, what would
happen if xv6 ran on an SMP with many processors?

<p>What's the cost of a simple spinning acquire/release? Algorithm 1
*without* the delays, which is like xv6's implementation of acquire
and release (xv6 uses XCHG instead of test_and_set):
<pre>
each of the 10 CPUs gets the lock in turn
meanwhile, the remaining CPUs spin in XCHG on the lock
  the lock's cache line must be X (exclusive) to run XCHG
  otherwise all might read, then all might write
so the bus is busy all the time with XCHGs!
can we avoid constant XCHGs while the lock is held?
</pre>
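
<p>For concreteness, here is a minimal sketch of such a spinning lock in the
style of xv6's acquire/release. The xchg() wrapper and the struct layout are
illustrative, not xv6's exact code; xchg() is assumed to be an atomic exchange
(e.g., the x86 XCHG instruction):
<pre>
struct spinlock {
  volatile unsigned int locked;   // 0 = free, 1 = held
};

// atomic exchange: store newval into *addr, return the old value
// (assumed primitive, e.g. the x86 XCHG instruction)
unsigned int xchg(volatile unsigned int *addr, unsigned int newval);

void
acquire(struct spinlock *lk)
{
  // every iteration is an atomic read-modify-write on the bus
  while (xchg(&lk->locked, 1) != 0)
    ;
}

void
release(struct spinlock *lk)
{
  xchg(&lk->locked, 0);   // give up the lock
}
</pre>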

<p>test-and-test-and-set
<pre>
only run expensive TSL if not locked
spin on ordinary load instruction, so cache line is S
acquire(l)
  while(1){
    while(l->locked != 0) { }
    if(TSL(&l->locked) == 0)
      return;
  }
</pre>

<p>Suppose 10 CPUs are waiting; let's count the cost in total bus
transactions:
<pre>
CPU1 gets the lock in one cycle
  sets the lock's cache line to I in the other CPUs
9 CPUs each use the bus once in XCHG
then everyone has the line S, so they spin locally
CPU1 releases the lock
CPU2 gets the lock in one cycle
8 CPUs each use the bus once...
So 10 + 9 + 8 + ... + 1 = 55 transactions, O(n^2) in the # of CPUs!
Look at "test-and-test-and-set" in Figure 6
</pre>

<p>Can we have <i>n</i> CPUs acquire a lock in O(<i>n</i>) time?

<p>What is the point of the exponential backoff in Algorithm 1?
<pre>
Does it buy us O(n) time for n acquires?
Is there anything wrong with it?
  may not be fair
  exponential backoff may increase delay after release
</pre>
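
<p>A rough sketch of Algorithm 1 with backoff, reusing the struct spinlock from
the sketch above and assuming a test_and_set primitive that returns the old
value; the delay loop and the MAX_DELAY cap are illustrative (the paper tunes
the constants per machine):
<pre>
void
acquire(struct spinlock *lk)
{
  int delay = 1;
  // keep retrying test_and_set, waiting longer between attempts each time
  while (test_and_set(&lk->locked) != 0) {
    for (int i = 0; i < delay; i++)
      ;                       // spin without touching the bus
    if (delay < MAX_DELAY)
      delay *= 2;             // exponential backoff, capped
  }
}
</pre>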

<p>What's the point of the ticket locks, Algorithm 2?
<pre>
one interlocked instruction to get my ticket number
then I spin on now_serving with ordinary load
release() just increments now_serving
</pre>
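
<p>A sketch of what that looks like in code, assuming an atomic
fetch_and_increment that returns the old value (the field names follow the
paper's description, the rest is illustrative):
<pre>
struct ticketlock {
  volatile unsigned int next_ticket;   // next ticket number to hand out
  volatile unsigned int now_serving;   // ticket currently allowed in
};

void
acquire(struct ticketlock *lk)
{
  // one interlocked instruction to take a ticket...
  unsigned int my_ticket = fetch_and_increment(&lk->next_ticket);
  // ...then spin with ordinary loads until it's my turn
  while (lk->now_serving != my_ticket)
    ;
}

void
release(struct ticketlock *lk)
{
  lk->now_serving = lk->now_serving + 1;   // ordinary store, no interlocked op
}
</pre>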

<p>Why is that good?
<pre>
+ fair
+ no exponential backoff overshoot
+ no spinning with interlocked instructions---waiters just read now_serving
</pre>

<p>But what's the cost, in bus transactions?
<pre>
while the lock is held, now_serving is S in all caches
release makes it I in all caches
then each waiter uses a bus transaction to get the new value
so still O(n^2)
</pre>

<p>What's the point of the array-based queuing locks, Algorithm 3?
<pre>
a lock has an array of "slots"
a waiter allocates a slot and spins on that slot
release wakes up just the next slot
so O(n) bus transactions to get through n waiters: good!
the anderson lines in Figures 4 and 6 are flat-ish
  they only go up because the lock data structures are protected by a simpler lock
but O(n) space *per lock*!
</pre>
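
<p>A sketch of the array-based lock, again assuming fetch_and_increment; the
slot count, the padding to one cache line per slot, and passing the slot index
back to release() are illustrative choices:
<pre>
#define NSLOTS 64          // must be >= the maximum number of contending CPUs

struct arraylock {
  struct {
    volatile int has_lock;             // 1 means this slot may enter
    char pad[60];                      // pad each slot to its own cache line
  } slots[NSLOTS];
  volatile unsigned int next_slot;
};
// initially slots[0].has_lock = 1, all other slots 0, next_slot = 0

int
acquire(struct arraylock *lk)
{
  // one interlocked instruction to claim a slot
  unsigned int my_slot = fetch_and_increment(&lk->next_slot) % NSLOTS;
  while (!lk->slots[my_slot].has_lock)
    ;                                  // spin on my own slot only
  return my_slot;                      // caller hands this to release()
}

void
release(struct arraylock *lk, int my_slot)
{
  lk->slots[my_slot].has_lock = 0;
  lk->slots[(my_slot + 1) % NSLOTS].has_lock = 1;   // wake just the next waiter
}
</pre>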

<p>Algorithm 5 (MCS), the new algorithm of the paper, uses
compare_and_swap:
<pre>
int compare_and_swap(addr, v1, v2) {
  int ret = 0;
  // stop all memory activity and ignore interrupts
  if (*addr == v1) {
    *addr = v2;
    ret = 1;
  }
  // resume other memory activity and take interrupts
  return ret;
}
</pre>

<p>What's the point of the MCS lock, Algorithm 5?
<pre>
constant space per lock, rather than O(n)
one "qnode" per thread, used for whatever lock it's waiting for
lock holder's qnode points to start of list
lock variable points to end of list
acquire adds your qnode to end of list
then you spin on your own qnode
release wakes up next qnode
</pre>
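
<p>A sketch of MCS acquire and release, assuming an atomic fetch_and_store
(swap) in addition to compare_and_swap above; the qnode fields follow the
paper's description, the rest is illustrative:
<pre>
struct qnode {
  struct qnode *volatile next;
  volatile int locked;
};

// the lock itself is just a pointer to the tail of the queue of waiters
typedef struct qnode *mcs_lock;

void
acquire(mcs_lock *lk, struct qnode *me)
{
  me->next = 0;
  // atomically install myself as the new tail, getting the old tail back
  struct qnode *pred = fetch_and_store(lk, me);
  if (pred != 0) {            // lock is held; queue up behind pred
    me->locked = 1;
    pred->next = me;
    while (me->locked)
      ;                       // spin on my own qnode only
  }
}

void
release(mcs_lock *lk, struct qnode *me)
{
  if (me->next == 0) {
    // no known successor: try to swing the tail back to "no waiters"
    if (compare_and_swap(lk, me, 0))
      return;
    // a successor is between its fetch_and_store and its pred->next store
    while (me->next == 0)
      ;
  }
  me->next->locked = 0;       // hand the lock to the next waiter
}
</pre>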

<h2>Wait-free or non-blocking data structures</h2>

<p>The previous implementations all block threads when there is
contention for a lock. Other atomic hardware operations allow one
to build wait-free implementations of data structures. For example, one
can insert an element into a shared list without ever blocking the
inserting thread. Such versions are called wait-free.

<p>A linked list with locks is as follows:
<pre>
Lock list_lock;

insert(int x) {
  element *n = new Element;
  n->x = x;

  acquire(&list_lock);
  n->next = list;
  list = n;
  release(&list_lock);
}
</pre>

<p>A wait-free implementation is as follows:
<pre>
insert(int x) {
  element *n = new Element;
  n->x = x;
  do {
    n->next = list;
  } while (compare_and_swap(&list, n->next, n) == 0);
}
</pre>

<p>How many bus transactions does it take for 10 CPUs to each insert one
element into the list? Could you do better?

<p><a href="http://www.cl.cam.ac.uk/netos/papers/2007-cpwl.pdf">This
paper by Fraser and Harris</a> compares lock-based implementations
versus corresponding non-blocking implementations of a number of data
structures.

<p>It is not possible to make every operation wait-free, and there are
times when we will need an implementation of acquire and release.
Research on non-blocking data structures is active; the last word
on this topic hasn't been said yet.

</body>
</html>