xv6-cs450/web/l13.html

<title>High-performance File Systems</title>
<html>
<head>
</head>
<body>

<h1>High-performance File Systems</h1>

<p>Required reading: soft updates.

<h2>Overview</h2>

<p>A key problem in designing file systems is how to obtain
performance on file system operations while providing consistency.
With consistency, we mean, that file system invariants are maintained
is on disk.  These invariants include that if a file is created, it
appears in its directory, etc.  If the file system data structures are
consistent, then it is possible to rebuild the file system to a
correct state after a failure.

<p>To ensure consistency of on-disk file system data structures,
  modifications to the file system must respect certain rules:
<ul>

<li>Never point to a structure before it is initialized. An inode must
be initialized before a directory entry references it.  An block must
be initialized before an inode references it.

<li>Never reuse a structure before nullifying all pointers to it.  An
inode pointer to a disk block must be reset before the file system can
reallocate the disk block.

<li>Never reset the last point to a live structure before a new
pointer is set.  When renaming a file, the file system should not
remove the old name for an inode until after the new name has been
written.
</ul>
The paper calls these dependencies update dependencies.  

<p>xv6 ensures these rules by writing every block synchronously, and
  by ordering the writes appropriately.  With synchronous, we mean
  that a process waits until the current disk write has been
  completed before continuing with execution.

<ul>

<li>What happens if power fails after 4776 in mknod1?  Did we lose the
  inode for ever? No, we have a separate program (called fsck), which
  can rebuild the disk structures correctly and can mark the inode on
  the free list.

<li>Does the order of writes in mknod1 matter?  Say, what if we wrote
  directory entry first and then wrote the allocated inode to disk?
  This violates the update rules and it is not a good plan. If a
  failure happens after the directory write, then on recovery we have
  an directory pointing to an unallocated inode, which now may be
  allocated by another process for another file!

<li>Can we turn the writes (i.e., the ones invoked by iupdate and
  wdir) into delayed writes without creating problems?  No, because
  the cause might write them back to the disk in an incorrect order.
  It has no information to decide in what order to write them.

</ul>

<p>xv6 is a nice example of the tension between consistency and
  performance.  To get consistency, xv6 uses synchronous writes,
  but these writes are slow, because they perform at the rate of a
  seek instead of the rate of the maximum data transfer rate. The
  bandwidth to a disk is reasonable high for large transfer (around
  50Mbyte/s), but latency is low, because of the cost of moving the
  disk arm(s) (the seek latency is about 10msec).

<p>This tension is an implementation-dependent one.  The Unix API
  doesn't require that writes are synchronous.  Updates don't have to
  appear on disk until a sync, fsync, or open with O_SYNC.  Thus, in
  principle, the UNIX API allows delayed writes, which are good for
  performance:
<ul>
<li>Batch many writes together in a big one, written at the disk data
  rate.
<li>Absorp writes to the same block.
<li>Schedule writes to avoid seeks.
</ul>

<p>Thus the question: how to delay writes and achieve consistency?
  The paper provides an answer.

<h2>This paper</h2>

<p>The paper surveys some of the existing techniques and introduces a
new to achieve the goal of performance and consistency.

<p>

<p>Techniques possible:
<ul>

<li>Equip system with NVRAM, and put buffer cache in NVRAM.

<li>Logging.  Often used in UNIX file systems for metadata updates.
LFS is an extreme version of this strategy.

<li>Flusher-enforced ordering.  All writes are delayed. This flusher
is aware of dependencies between blocks, but doesn't work because
circular dependencies need to be broken by writing blocks out.

</ul>

<p>Soft updates is the solution explored in this paper.  It doesn't
require NVRAM, and performs as well as the naive strategy of keep all
dirty block in main memory.  Compared to logging, it is unclear if
soft updates is better.  The default BSD file systems uses soft
  updates, but most Linux file systems use logging.

<p>Soft updates is a sophisticated variant of flusher-enforced
ordering.  Instead of maintaining dependencies on the block-level, it
maintains dependencies on file structure level (per inode, per
directory, etc.), reducing circular dependencies. Furthermore, it
breaks any remaining circular dependencies by undo changes before
writing the block and then redoing them to the block after writing.

<p>Pseudocode for create:
<pre>
create (f) {
   allocate inode in block i  (assuming inode is available)
   add i to directory data block d  (assuming d has space)
   mark d has dependent on i, and create undo/redo record
   update directory inode in block di
   mark di has dependent on d
}
</pre>

<p>Pseudocode for the flusher:
<pre>
flushblock (b)
{
  lock b;
  for all dependencies that b is relying on
    "remove" that dependency by undoing the change to b
    mark the dependency as "unrolled"
  write b 
}

write_completed (b) {
  remove dependencies that depend on b
  reapply "unrolled" dependencies that b depended on
  unlock b
}
</pre>

<p>Apply flush algorithm to example:
<ul>
<li>A list of two dependencies: directory->inode, inode->directory.
<li>Lets say syncer picks directory first
<li>Undo directory->inode changes (i.e., unroll <A,#4>)
<li>Write directory block
<li>Remove met dependencies (i.e., remove inode->directory dependency)
<li>Perform redo operation (i.e., redo <A,#4>)
<li>Select inode block and write it
<li>Remove met dependencies (i.e., remove directory->inode dependency)
<li>Select directory block (it is dirty again!)
<li>Write it.
</ul>

<p>An file operation that is important for file-system consistency 
is rename.  Rename conceptually works as follows:
<pre>
rename (from, to)
   unlink (to);
   link (from, to);
   unlink (from);
</pre>

<p>Rename it often used by programs to make a new version of a file
the current version.  Committing to a new version must happen
atomically.  Unfortunately, without a transaction-like support
atomicity is impossible to guarantee, so a typical file systems
provides weaker semantics for rename: if to already exists, an
instance of to will always exist, even if the system should crash in
the middle of the operation.  Does the above implementation of rename
guarantee this semantics? (Answer: no).

<p>If rename is implemented as unlink, link, unlink, then it is
difficult to guarantee even the weak semantics. Modern UNIXes provide
rename as a file system call:
<pre>
   update dir block for to point to from's inode // write block
   update dir block for from to free entry // write block
</pre>
<p>fsck may need to correct refcounts in the inode if the file
system fails during rename.  for example, a crash after the first
write followed by fsck should set refcount to 2, since both from
and to are pointing at the inode.

<p>This semantics is sufficient, however, for an application to ensure
atomicity. Before the call, there is a from and perhaps a to.  If the
call is successful, following the call there is only a to.  If there
is a crash, there may be both a from and a to, in which case the
caller knows the previous attempt failed, and must retry.  The
subtlety is that if you now follow the two links, the "to" name may
link to either the old file or the new file.  If it links to the new
file, that means that there was a crash and you just detected that the
rename operation was composite.  On the other hand, the retry
procedure can be the same for either case (do the rename again), so it
isn't necessary to discover how it failed.  The function follows the
golden rule of recoverability, and it is idempotent, so it lays all
the needed groundwork for use as part of a true atomic action.

<p>With soft updates renames becomes:
<pre>
rename (from, to) {
   i = namei(from);
   add "to" directory data block td a reference to inode i
   mark td dependent on block i
   update directory inode "to" tdi
   mark tdi as dependent on td
   remove "from" directory data block fd a reference to inode i
   mark fd as dependent on tdi
   update directory inode in block fdi
   mark fdi as dependent on fd
}
</pre>
<p>No synchronous writes!

<p>What needs to be done on recovery?  (Inspect every statement in
rename and see what inconsistencies could exist on the disk; e.g.,
refcnt inode could be too high.)  None of these inconsitencies require
fixing before the file system can operate; they can be fixed by a
background file system repairer. 

<h2>Paper discussion</h2>

<p>Do soft updates perform any useless writes? (A useless write is a
write that will be immediately overwritten.)  (Answer: yes.) Fix
syncer to becareful with what block to start.  Fix cache replacement
to selecting LRU block with no pendending dependencies.

<p>Can a log-structured file system implement rename better? (Answer:
yes, since it can get the refcnts right).

<p>Discuss all graphs.

</body>
DO NOT MAIL: xv6 web pages 2008-09-03 06:50:04 +02:00			`<title>High-performance File Systems</title>`
			`<html>`
			`<head>`
			`</head>`
			`<body>`

			`<h1>High-performance File Systems</h1>`

			`<p>Required reading: soft updates.`

			`<h2>Overview</h2>`

			`<p>A key problem in designing file systems is how to obtain`
			`performance on file system operations while providing consistency.`
			`With consistency, we mean, that file system invariants are maintained`
			`is on disk. These invariants include that if a file is created, it`
			`appears in its directory, etc. If the file system data structures are`
			`consistent, then it is possible to rebuild the file system to a`
			`correct state after a failure.`

			`<p>To ensure consistency of on-disk file system data structures,`
			`modifications to the file system must respect certain rules:`
			`<ul>`

			`<li>Never point to a structure before it is initialized. An inode must`
			`be initialized before a directory entry references it. An block must`
			`be initialized before an inode references it.`

			`<li>Never reuse a structure before nullifying all pointers to it. An`
			`inode pointer to a disk block must be reset before the file system can`
			`reallocate the disk block.`

			`<li>Never reset the last point to a live structure before a new`
			`pointer is set. When renaming a file, the file system should not`
			`remove the old name for an inode until after the new name has been`
			`written.`
			`</ul>`
			`The paper calls these dependencies update dependencies.`

			`<p>xv6 ensures these rules by writing every block synchronously, and`
			`by ordering the writes appropriately. With synchronous, we mean`
			`that a process waits until the current disk write has been`
			`completed before continuing with execution.`

			`<ul>`

			`<li>What happens if power fails after 4776 in mknod1? Did we lose the`
			`inode for ever? No, we have a separate program (called fsck), which`
			`can rebuild the disk structures correctly and can mark the inode on`
			`the free list.`

			`<li>Does the order of writes in mknod1 matter? Say, what if we wrote`
			`directory entry first and then wrote the allocated inode to disk?`
			`This violates the update rules and it is not a good plan. If a`
			`failure happens after the directory write, then on recovery we have`
			`an directory pointing to an unallocated inode, which now may be`
			`allocated by another process for another file!`

			`<li>Can we turn the writes (i.e., the ones invoked by iupdate and`
			`wdir) into delayed writes without creating problems? No, because`
			`the cause might write them back to the disk in an incorrect order.`
			`It has no information to decide in what order to write them.`

			`</ul>`

			`<p>xv6 is a nice example of the tension between consistency and`
			`performance. To get consistency, xv6 uses synchronous writes,`
			`but these writes are slow, because they perform at the rate of a`
			`seek instead of the rate of the maximum data transfer rate. The`
			`bandwidth to a disk is reasonable high for large transfer (around`
			`50Mbyte/s), but latency is low, because of the cost of moving the`
			`disk arm(s) (the seek latency is about 10msec).`

			`<p>This tension is an implementation-dependent one. The Unix API`
			`doesn't require that writes are synchronous. Updates don't have to`
			`appear on disk until a sync, fsync, or open with O_SYNC. Thus, in`
			`principle, the UNIX API allows delayed writes, which are good for`
			`performance:`
			`<ul>`
			`<li>Batch many writes together in a big one, written at the disk data`
			`rate.`
			`<li>Absorp writes to the same block.`
			`<li>Schedule writes to avoid seeks.`
			`</ul>`

			`<p>Thus the question: how to delay writes and achieve consistency?`
			`The paper provides an answer.`

			`<h2>This paper</h2>`

			`<p>The paper surveys some of the existing techniques and introduces a`
			`new to achieve the goal of performance and consistency.`

			`<p>`

			`<p>Techniques possible:`
			`<ul>`

			`<li>Equip system with NVRAM, and put buffer cache in NVRAM.`

			`<li>Logging. Often used in UNIX file systems for metadata updates.`
			`LFS is an extreme version of this strategy.`

			`<li>Flusher-enforced ordering. All writes are delayed. This flusher`
			`is aware of dependencies between blocks, but doesn't work because`
			`circular dependencies need to be broken by writing blocks out.`

			`</ul>`

			`<p>Soft updates is the solution explored in this paper. It doesn't`
			`require NVRAM, and performs as well as the naive strategy of keep all`
			`dirty block in main memory. Compared to logging, it is unclear if`
			`soft updates is better. The default BSD file systems uses soft`
			`updates, but most Linux file systems use logging.`

			`<p>Soft updates is a sophisticated variant of flusher-enforced`
			`ordering. Instead of maintaining dependencies on the block-level, it`
			`maintains dependencies on file structure level (per inode, per`
			`directory, etc.), reducing circular dependencies. Furthermore, it`
			`breaks any remaining circular dependencies by undo changes before`
			`writing the block and then redoing them to the block after writing.`

			`<p>Pseudocode for create:`
			`<pre>`
			`create (f) {`
			`allocate inode in block i (assuming inode is available)`
			`add i to directory data block d (assuming d has space)`
			`mark d has dependent on i, and create undo/redo record`
			`update directory inode in block di`
			`mark di has dependent on d`
			`}`
			`</pre>`

			`<p>Pseudocode for the flusher:`
			`<pre>`
			`flushblock (b)`
			`{`
			`lock b;`
			`for all dependencies that b is relying on`
			`"remove" that dependency by undoing the change to b`
			`mark the dependency as "unrolled"`
			`write b`
			`}`

			`write_completed (b) {`
			`remove dependencies that depend on b`
			`reapply "unrolled" dependencies that b depended on`
			`unlock b`
			`}`
			`</pre>`

			`<p>Apply flush algorithm to example:`
			`<ul>`
			`<li>A list of two dependencies: directory->inode, inode->directory.`
			`<li>Lets say syncer picks directory first`
			`<li>Undo directory->inode changes (i.e., unroll <A,#4>)`
			`<li>Write directory block`
			`<li>Remove met dependencies (i.e., remove inode->directory dependency)`
			`<li>Perform redo operation (i.e., redo <A,#4>)`
			`<li>Select inode block and write it`
			`<li>Remove met dependencies (i.e., remove directory->inode dependency)`
			`<li>Select directory block (it is dirty again!)`
			`<li>Write it.`
			`</ul>`

			`<p>An file operation that is important for file-system consistency`
			`is rename. Rename conceptually works as follows:`
			`<pre>`
			`rename (from, to)`
			`unlink (to);`
			`link (from, to);`
			`unlink (from);`
			`</pre>`

			`<p>Rename it often used by programs to make a new version of a file`
			`the current version. Committing to a new version must happen`
			`atomically. Unfortunately, without a transaction-like support`
			`atomicity is impossible to guarantee, so a typical file systems`
			`provides weaker semantics for rename: if to already exists, an`
			`instance of to will always exist, even if the system should crash in`
			`the middle of the operation. Does the above implementation of rename`
			`guarantee this semantics? (Answer: no).`

			`<p>If rename is implemented as unlink, link, unlink, then it is`
			`difficult to guarantee even the weak semantics. Modern UNIXes provide`
			`rename as a file system call:`
			`<pre>`
			`update dir block for to point to from's inode // write block`
			`update dir block for from to free entry // write block`
			`</pre>`
			`<p>fsck may need to correct refcounts in the inode if the file`
			`system fails during rename. for example, a crash after the first`
			`write followed by fsck should set refcount to 2, since both from`
			`and to are pointing at the inode.`

			`<p>This semantics is sufficient, however, for an application to ensure`
			`atomicity. Before the call, there is a from and perhaps a to. If the`
			`call is successful, following the call there is only a to. If there`
			`is a crash, there may be both a from and a to, in which case the`
			`caller knows the previous attempt failed, and must retry. The`
			`subtlety is that if you now follow the two links, the "to" name may`
			`link to either the old file or the new file. If it links to the new`
			`file, that means that there was a crash and you just detected that the`
			`rename operation was composite. On the other hand, the retry`
			`procedure can be the same for either case (do the rename again), so it`
			`isn't necessary to discover how it failed. The function follows the`
			`golden rule of recoverability, and it is idempotent, so it lays all`
			`the needed groundwork for use as part of a true atomic action.`

			`<p>With soft updates renames becomes:`
			`<pre>`
			`rename (from, to) {`
			`i = namei(from);`
			`add "to" directory data block td a reference to inode i`
			`mark td dependent on block i`
			`update directory inode "to" tdi`
			`mark tdi as dependent on td`
			`remove "from" directory data block fd a reference to inode i`
			`mark fd as dependent on tdi`
			`update directory inode in block fdi`
			`mark fdi as dependent on fd`
			`}`
			`</pre>`
			`<p>No synchronous writes!`

			`<p>What needs to be done on recovery? (Inspect every statement in`
			`rename and see what inconsistencies could exist on the disk; e.g.,`
			`refcnt inode could be too high.) None of these inconsitencies require`
			`fixing before the file system can operate; they can be fixed by a`
			`background file system repairer.`

			`<h2>Paper discussion</h2>`

			`<p>Do soft updates perform any useless writes? (A useless write is a`
			`write that will be immediately overwritten.) (Answer: yes.) Fix`
			`syncer to becareful with what block to start. Fix cache replacement`
			`to selecting LRU block with no pendending dependencies.`

			`<p>Can a log-structured file system implement rename better? (Answer:`
			`yes, since it can get the refcnts right).`

			`<p>Discuss all graphs.`

			`</body>`