<title>L10</title>
<html>
<head>
</head>
<body>

<h1>File systems</h1>

<p>Required reading: iread, iwrite, and wdir, and code related to
these calls in fs.c, bio.c, ide.c, file.c, and sysfile.c.

<h2>Overview</h2>

<p>The next 3 lectures are about file systems:
<ul>
<li>Basic file system implementation
<li>Naming
<li>Performance
</ul>

<p>Users want to store their data durably, so that it survives when
the user turns off the computer. The primary media for doing so are
magnetic disks, flash memory, and tapes. We focus on magnetic disks
(e.g., through the IDE interface in xv6).

<p>To help users remember where they stored a file, they can
assign it a symbolic name, which appears in a directory.

<p>The data in a file can be organized in a structured way or not.
The structured variant is often called a database. UNIX uses the
unstructured variant: files are streams of bytes. Any particular
structure is likely to be useful to only a small class of
applications, and other applications will have to work hard to fit
their data into one of the pre-defined structures. Besides, if you
want structure, you can easily write a user-mode library program that
imposes that format on any file. This is the end-to-end argument in
action. (Databases have special requirements and support an important
class of applications, and thus have a specialized plan.)

<p>The API for a minimal file system consists of: open, read, write,
seek, close, and stat. Dup duplicates a file descriptor. For example:
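<pre>
  fd = open("x", O_RDWR);   /* open the existing file x for reading and writing */
  read(fd, buf, 100);       /* read up to 100 bytes; the file offset advances */
  write(fd, buf, 512);      /* write 512 bytes starting at the current offset */
  close(fd);                /* release the descriptor */
</pre>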
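<p>As a slightly fuller illustration of the same calls, here is a minimal
sketch that exercises open, write, seek, read, dup, stat, and close on a
scratch file. It uses the POSIX names (lseek, fstat) rather than the xv6
ones; the helper name api_tour and the scratch path are hypothetical and
not part of xv6.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: runs the basic calls against a scratch file and
 * returns the size reported by fstat, or -1 if any step fails. */
long api_tour(const char *path)
{
    char buf[512];
    struct stat st;
    long size = -1;
    int fd, fd2;

    assert(path != NULL);
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    memset(buf, 'a', sizeof buf);
    if (write(fd, buf, sizeof buf) == (ssize_t)sizeof buf  /* offset is 512 */
        && lseek(fd, 0, SEEK_SET) == 0                      /* offset is 0   */
        && read(fd, buf, 100) == 100) {                     /* offset is 100 */
        fd2 = dup(fd);          /* fd2 shares the offset with fd */
        if (fd2 >= 0) {
            if (lseek(fd2, 0, SEEK_CUR) == 100 && fstat(fd, &st) == 0)
                size = (long)st.st_size;
            close(fd2);
        }
    }
    close(fd);
    unlink(path);               /* remove the scratch file */
    return size;
}
```

<p>Calling api_tour on a writable scratch path should return the number of
bytes written; note that the duplicated descriptor observes the offset left
behind by the read on the original descriptor.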

<p>Maintaining the file offset behind the read/write interface is an
interesting design decision. The alternative is that the state of a
read operation is maintained by the process doing the reading
(i.e., the offset is passed as an argument to read).
This argument is compelling in view of the UNIX fork() semantics:
fork clones a process, and the child shares the file descriptors of its
parent. A read by the parent on a shared file descriptor (e.g.,
stdin) changes the read pointer seen by the child. On the other
hand, the alternative would make it difficult to get "(date; ls) > x"
right: both commands must share one offset, so that the output of ls
lands after the output of date instead of overwriting it.
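<p>The two designs can be compared on a POSIX system, which offers both:
read uses the offset kept behind the descriptor, while pread takes the
offset as an argument and leaves the shared one alone. The sketch below is
only an illustration; the function names and the scratch file are
hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Opens a 10-byte scratch file positioned at offset 0; -1 on failure. */
static int open_scratch(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (write(fd, "0123456789", 10) != 10 || lseek(fd, 0, SEEK_SET) != 0) {
        close(fd);
        unlink(path);
        return -1;
    }
    return fd;
}

/* The child reads n bytes through the descriptor it inherited from the
 * parent. Returns the offset the parent sees afterwards, or -1. */
long parent_offset_after_child_read(const char *path, size_t n)
{
    char buf[16];
    int status;
    off_t off = -1;
    pid_t pid;
    int fd = open_scratch(path);

    assert(n <= sizeof buf);
    if (fd < 0)
        return -1;
    pid = fork();
    if (pid == 0)               /* child: read, then exit immediately */
        _exit(read(fd, buf, n) == (ssize_t)n ? 0 : 1);
    if (pid > 0 && waitpid(pid, &status, 0) == pid)
        off = lseek(fd, 0, SEEK_CUR);
    close(fd);
    unlink(path);
    return (long)off;
}

/* The alternative design: the caller supplies the offset explicitly.
 * Returns the descriptor's offset after a pread, or -1. */
long offset_after_pread(const char *path)
{
    char buf[4];
    off_t off = -1;
    int fd = open_scratch(path);

    if (fd < 0)
        return -1;
    if (pread(fd, buf, sizeof buf, 6) == (ssize_t)sizeof buf)
        off = lseek(fd, 0, SEEK_CUR);
    close(fd);
    unlink(path);
    return (long)off;
}
```

<p>With the shared offset, the parent observes that the child's read moved
the pointer; with pread, the descriptor's offset stays where it was.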

<p>The Unix API doesn't specify that the effects of a write are
on the disk before the write returns. When they reach the disk is up to
the implementation of the file system, within certain bounds. Choices
(which are not mutually exclusive) include:
<ul>
<li>At some point in the future, if the system stays up (e.g., after
30 seconds);
<li>Before the write returns;
<li>Before close returns;
<li>User specified (e.g., before fsync returns).
</ul>
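<p>The user-specified choice can be sketched with the POSIX fsync call: the
application decides at which point a write must have reached the disk. The
helper name durable_append is hypothetical. (Opening the file with the
POSIX O_SYNC flag would instead request the "before the write returns"
behavior.)

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: appends rec to the file and returns 0 only after
 * fsync reports that the data has been handed to stable storage. */
int durable_append(const char *path, const char *rec)
{
    size_t len;
    int fd;

    assert(path != NULL && rec != NULL);
    len = strlen(rec);
    fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, rec, len) != (ssize_t)len) {  /* may sit in a cache */
        close(fd);
        return -1;
    }
    if (fsync(fd) < 0) {                        /* force it to disk */
        close(fd);
        return -1;
    }
    return close(fd);
}
```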

<p>A design issue is the semantics of a file system operation that
requires multiple disk writes. In particular, what happens if a
logical update requires writing multiple disk blocks and the power
fails during the update? For example, creating a new file
requires allocating an inode (which requires updating the list of
free inodes on disk) and writing a directory entry to record the
allocated inode under the name of the new file (which may require
allocating a new block and updating the directory inode). If the
power fails during the operation, the list of free inodes and blocks
may be inconsistent with the blocks and inodes in use. Again, it is
up to the implementation of the file system to keep on-disk data
structures consistent:
<ul>
<li>Don't worry about it much, but use a recovery program to bring
the file system back into a consistent state.
<li>Journaling file system. Never let the file system get into an
inconsistent state.
</ul>
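<p>To make the first option concrete, here is a toy model of file creation
as two separate "disk writes", with a simulated power failure in between
and an fsck-like recovery pass. Everything in it (struct toy_disk, the
function names, the tiny table sizes) is made up for illustration and does
not correspond to xv6 code.

```c
#include <assert.h>
#include <string.h>

#define NINODES 8   /* inode 0 is reserved to mean "no inode" */
#define NDIRENT 8

/* Toy on-disk state: an inode allocation table and one directory. */
struct toy_disk {
    int inode_used[NINODES];    /* 1 if the inode is marked allocated */
    int dirent_inum[NDIRENT];   /* 0 means the directory entry is unused */
};

void toy_format(struct toy_disk *d)
{
    memset(d, 0, sizeof *d);
}

/* Creates a file with two writes. If crash_after_write is 1, the power
 * "fails" after the first write. Returns the inode number, or 0. */
int toy_create(struct toy_disk *d, int crash_after_write)
{
    int inum, slot;

    for (inum = 1; inum < NINODES && d->inode_used[inum]; inum++)
        ;
    for (slot = 0; slot < NDIRENT && d->dirent_inum[slot] != 0; slot++)
        ;
    if (inum == NINODES || slot == NDIRENT)
        return 0;

    d->inode_used[inum] = 1;        /* write 1: mark the inode allocated */
    if (crash_after_write == 1)
        return 0;                   /* inode is allocated but unnamed */
    d->dirent_inum[slot] = inum;    /* write 2: record the name */
    return inum;
}

/* Recovery pass: frees inodes that are marked allocated but that no
 * directory entry refers to. Returns the number of inodes freed. */
int toy_recover(struct toy_disk *d)
{
    int referenced[NINODES] = {0};
    int i, freed = 0;

    for (i = 0; i < NDIRENT; i++) {
        assert(d->dirent_inum[i] >= 0 && d->dirent_inum[i] < NINODES);
        referenced[d->dirent_inum[i]] = 1;
    }
    for (i = 1; i < NINODES; i++) {
        if (d->inode_used[i] && !referenced[i]) {
            d->inode_used[i] = 0;
            freed++;
        }
    }
    return freed;
}
```

<p>Because the inode is marked allocated before the directory entry is
written, a crash can leak an inode but never leaves a name pointing at a
free inode; the recovery pass then only has to reclaim the leak.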

<p>Another design issue is the semantics of concurrent writes to
the same data item. What is the order of two updates that happen at
the same time? For example, two processes open the same file and write
to it. Modern Unix operating systems allow the application to lock a
file to get exclusive access. If file locking is not used and the
file descriptor is shared, then the bytes of the two writes will get
into the file in some order (this happens often for log files). If
the file descriptor is not shared, the end result is not defined. For
example, one write may overwrite the other (e.g., if they are
writing to the same part of the file).
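<p>The difference between a shared and a non-shared descriptor can be seen
even within one process. In this sketch (the function name is
hypothetical), two 4-byte writes go through either a dup'ed descriptor,
which shares one offset, or through two independent opens, each with its
own offset starting at 0.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Writes "AAAA" and then "BBBB" through two descriptors for the same
 * file and returns the resulting file size, or -1 on error.
 * shared != 0: the second descriptor is a dup of the first.
 * shared == 0: the second descriptor comes from a separate open. */
long size_after_two_writes(const char *path, int shared)
{
    struct stat st;
    long size = -1;
    int fd1, fd2;

    assert(path != NULL);
    fd1 = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd1 < 0)
        return -1;
    fd2 = shared ? dup(fd1) : open(path, O_WRONLY);
    if (fd2 >= 0) {
        if (write(fd1, "AAAA", 4) == 4 && write(fd2, "BBBB", 4) == 4
            && fstat(fd1, &st) == 0)
            size = (long)st.st_size;
        close(fd2);
    }
    close(fd1);
    unlink(path);
    return size;
}
```

<p>With the shared descriptor both writes survive, one after the other;
with separate opens the second write lands on top of the first.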

<p>An implementation issue is performance, because writing to a magnetic
disk is relatively expensive compared to computing. Three primary ways
to improve performance are: careful file system layout that induces
few seeks, an in-memory cache of frequently accessed blocks, and
overlapping I/O with computation, so that file operations don't have to
wait for the I/O to complete and so that the disk driver has more
data to write, which allows disk scheduling. (We will talk about
performance in detail later.)

<h2>xv6 code examples</h2>

<p>xv6 implements a minimal Unix file system interface. xv6 doesn't
pay attention to file system layout. It overlaps computation and I/O,
but doesn't do any disk scheduling. Its cache is write-through, which
simplifies keeping on-disk data structures consistent, but is bad for
performance.

<p>On disk, files are represented by an inode (struct dinode in fs.h)
and blocks. Small files have up to 12 block addresses in their inode;
large files use the last address in the inode as the disk address
of a block with 128 disk addresses (512/4). The size of a file is
thus limited to 12 * 512 + 128 * 512 bytes. What would you change to
support larger files? (Ans: e.g., double indirect blocks.)
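<p>The arithmetic above can be written out as follows. The constant and
function names are chosen for this sketch, and the double-indirect variant
assumes one extra address slot is added to the inode, which is only one of
several possible designs.

```c
#include <assert.h>

enum {
    BSIZE     = 512,        /* bytes per disk block */
    NDIRECT   = 12,         /* direct block addresses in the inode */
    NINDIRECT = BSIZE / 4   /* 4-byte addresses per indirect block */
};

/* Largest file with 12 direct blocks and one indirect block. */
long max_file_bytes(void)
{
    assert(NINDIRECT == 128);
    return (long)(NDIRECT + NINDIRECT) * BSIZE;
}

/* Hypothetical extension: one additional double-indirect block. */
long max_file_bytes_double(void)
{
    return (long)(NDIRECT + NINDIRECT + NINDIRECT * NINDIRECT) * BSIZE;
}

/* Which kind of address covers file block bn?
 * 0 = direct, 1 = via the indirect block, -1 = beyond the maximum. */
int toy_bmap_level(unsigned bn)
{
    if (bn < NDIRECT)
        return 0;
    if (bn < NDIRECT + NINDIRECT)
        return 1;
    return -1;
}
```

<p>Under these assumptions the current limit is 71680 bytes (70 KiB), and a
double-indirect block would raise it to 8460288 bytes.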

<p>Directories are files with a bit of structure to them. The file
consists of records of the type struct dirent. Each entry contains the
name of a file (or directory) and its corresponding inode number.
How many files can appear in a directory?
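<p>A sketch of such a record and of a lookup over an in-memory copy of a
directory follows. The layout (a 2-byte inode number and a 14-byte name) is
an assumption made for this example; check struct dirent in fs.h for the
real definition. The names toy_dirent, toy_dirlookup, and toy_max_entries
are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define DIRSIZ 14                          /* assumed name length */
#define MAXFILE_BYTES ((12 + 128) * 512)   /* file size limit from above */

/* Assumed record layout; inum == 0 marks an unused entry. */
struct toy_dirent {
    unsigned short inum;
    char name[DIRSIZ];      /* not NUL-terminated when all 14 bytes are used */
};

/* Returns the inode number recorded under name, or 0 if there is none. */
int toy_dirlookup(const struct toy_dirent *dir, int n, const char *name)
{
    int i;

    assert(dir != NULL && name != NULL);
    for (i = 0; i < n; i++) {
        if (dir[i].inum == 0)
            continue;       /* skip unused entries */
        if (strncmp(name, dir[i].name, DIRSIZ) == 0)
            return dir[i].inum;
    }
    return 0;
}

/* Number of records that fit in a directory of maximum file size. */
long toy_max_entries(void)
{
    return (long)(MAXFILE_BYTES / sizeof(struct toy_dirent));
}
```

<p>If the record is 16 bytes, as it is with this layout on common
platforms, a maximum-size directory holds 71680 / 16 = 4480 entries,
including "." and ".." if the file system stores them as ordinary entries.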

<p>In memory, files are represented by struct inode in fsvar.h. What is
the role of the additional fields in struct inode?

<p>What is xv6's disk layout? How does xv6 keep track of free blocks
and inodes? See balloc()/bfree() and ialloc()/ifree(). Is this
layout a good one for performance? What are other options?
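<p>One common way to track free blocks is a bitmap with one bit per block.
The sketch below shows the idea on an in-memory bitmap; it is not the xv6
balloc()/bfree() code, and the toy_ names are hypothetical. A real
implementation would read the bitmap block from disk, update it, and write
it back.

```c
#include <assert.h>

/* Scans the bitmap for the first free block (bit == 0), marks it used,
 * and returns its number; returns -1 if all nblocks are in use. */
int toy_balloc(unsigned char *bitmap, int nblocks)
{
    int b;

    for (b = 0; b < nblocks; b++) {
        unsigned char mask = (unsigned char)(1u << (b % 8));

        if ((bitmap[b / 8] & mask) == 0) {
            bitmap[b / 8] |= mask;
            return b;
        }
    }
    return -1;
}

/* Marks block b free again. */
void toy_bfree(unsigned char *bitmap, int b)
{
    unsigned char mask = (unsigned char)(1u << (b % 8));

    assert(bitmap[b / 8] & mask);   /* freeing a free block is a bug */
    bitmap[b / 8] &= (unsigned char)~mask;
}
```

<p>The linear scan is simple but ignores locality: it hands out the lowest
free block regardless of where the rest of the file lives, which is one
reason to ask whether such a layout is good for performance.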

<p>Let's assume that an application created a file x that
contains 512 bytes, and that the application now calls read(fd, buf,
100), that is, it is requesting to read 100 bytes into buf.
Furthermore, let's assume that the inode for x is i. Let's pick
up what happens by investigating readi(), line 4483.
<ul>
<li>4488-4492: can iread be called on other objects than files? (Yes.
For example, read from the keyboard.) Everything is a file in Unix.
<li>4495: what does bmap do?
<ul>
<li>4384: what block is being read?
</ul>
<li>4483: what does bread do? does bread always cause a read to disk?
<ul>
<li>4006: what does bget do? it implements a simple cache of
recently-read disk blocks.
<ul>
<li>How big is the cache? (see param.h)
<li>3972: look if the requested block is in the cache by walking down
a circular list.
<li>3977: we had a match.
<li>3979: some other process has "locked" the block; wait until it
releases it. the other process releases the block using brelse().
Why lock a block?
<ul>
<li>Atomic read and update. For example, allocating an inode: read
block containing inode, mark it allocated, and write it back. This
operation must be atomic.
</ul>
<li>3982: it is ours now.
<li>3987: it is not in the cache; we need to find a cache entry to
hold the block.
<li>3987: what is the cache replacement strategy? (see also brelse())
<li>3988: found an entry that we are going to use.
<li>3989: mark it ours but don't mark it valid (there is no valid data
in the entry yet).
</ul>
<li>4007: if the block was in the cache and the entry has the block's
data, return.
<li>4010: if the block wasn't in the cache, read it from disk. are
reads synchronous or asynchronous?
<ul>
<li>3836: a bounded buffer of outstanding disk requests.
<li>3809: tell the disk to move arm and generate an interrupt.
<li>3851: go to sleep and let some other process run. time sharing
in action.
<li>3792: interrupt: arm is in the right position; wakeup requester.
<li>3856: read block from disk.
<li>3860: remove request from bounded buffer. wakeup processes that
are waiting for a slot.
<li>3864: start next disk request, if any. xv6 can overlap I/O with
computation.
</ul>
<li>4011: mark the cache entry as holding the data.
</ul>
<li>4498: To where is the block copied? is dst a valid user address?
</ul>
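<p>The bget/bread/brelse steps above can be summarized in a single-threaded
toy cache. This is a simplification, not the xv6 code: it replaces the
circular list with a release counter to find the least recently released
entry, it returns NULL where xv6 would sleep on a busy block, and it fakes
the disk read by filling the buffer. All toy_ names and NBUF = 3 are
assumptions for the sketch.

```c
#include <assert.h>
#include <string.h>

#define NBUF    3       /* tiny on purpose, so eviction is easy to see */
#define BSIZE   512
#define B_BUSY  1       /* some caller currently holds the buffer */
#define B_VALID 2       /* data[] holds the block's contents */

struct toy_buf {
    int flags;
    int blockno;                    /* -1 while the entry is empty */
    unsigned long last_release;     /* larger = released more recently */
    unsigned char data[BSIZE];
};

struct toy_cache {
    struct toy_buf buf[NBUF];
    unsigned long clock;
    int disk_reads;                 /* counts simulated disk reads */
};

void toy_cache_init(struct toy_cache *c)
{
    int i;

    memset(c, 0, sizeof *c);
    for (i = 0; i < NBUF; i++)
        c->buf[i].blockno = -1;
}

/* Finds the cached entry for blockno, or recycles the least recently
 * released non-busy entry. Returns NULL if the block is busy or if no
 * entry can be recycled. */
static struct toy_buf *toy_bget(struct toy_cache *c, int blockno)
{
    struct toy_buf *victim = NULL;
    int i;

    for (i = 0; i < NBUF; i++) {
        struct toy_buf *b = &c->buf[i];

        if (b->blockno == blockno) {        /* we had a match */
            if (b->flags & B_BUSY)
                return NULL;                /* xv6 would sleep here */
            b->flags |= B_BUSY;             /* it is ours now */
            return b;
        }
        if (!(b->flags & B_BUSY)
            && (victim == NULL || b->last_release < victim->last_release))
            victim = b;
    }
    if (victim != NULL) {
        victim->blockno = blockno;
        victim->flags = B_BUSY;             /* ours, but not yet valid */
    }
    return victim;
}

struct toy_buf *toy_bread(struct toy_cache *c, int blockno)
{
    struct toy_buf *b = toy_bget(c, blockno);

    if (b != NULL && !(b->flags & B_VALID)) {
        memset(b->data, blockno & 0xff, BSIZE);  /* stand-in for the disk */
        c->disk_reads++;
        b->flags |= B_VALID;
    }
    return b;
}

void toy_brelse(struct toy_cache *c, struct toy_buf *b)
{
    assert(b->flags & B_BUSY);
    b->flags &= ~B_BUSY;
    b->last_release = ++c->clock;
}
```

<p>Reading the same block twice in a row costs one simulated disk read;
reading more distinct blocks than there are entries evicts the entry that
was released longest ago.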

<p>Now let's suppose that the process is writing 512 bytes at the end
of the file a. How many disk writes will happen?
<ul>
<li>4567: allocate a new block
<ul>
<li>4518: allocate a block: scan block map, and write entry
<li>4523: How many disk operations if the process had been appending
to a large file? (Answer: read indirect block, scan block map, write
block map.)
</ul>
<li>4572: read the block that the process will be writing, in case the
process writes only part of the block.
<li>4574: write it. is it synchronous or asynchronous? (Ans:
synchronous but with timesharing.)
</ul>
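<p>The read-then-write step can be sketched with a toy file whose blocks
live in memory. The sketch covers only the data blocks: it leaves out block
allocation and the block map and inode updates, which cost additional disk
writes in a real file system. struct toy_file and toy_writei are
hypothetical names, not the xv6 definitions.

```c
#include <assert.h>
#include <string.h>

#define BSIZE 512
#define NBLK  4

struct toy_file {
    unsigned char disk[NBLK][BSIZE];   /* the file's data blocks */
    unsigned size;                     /* current file size in bytes */
    int block_reads, block_writes;     /* simulated disk operations */
};

/* Writes n bytes from src at byte offset off, one block at a time.
 * Returns n, or -1 if the range is invalid. */
int toy_writei(struct toy_file *f, const unsigned char *src,
               unsigned off, unsigned n)
{
    unsigned char buf[BSIZE];
    unsigned tot, m;

    assert(f != NULL && src != NULL);
    if (off > f->size || n > NBLK * BSIZE - off)
        return -1;
    for (tot = 0; tot < n; tot += m, off += m, src += m) {
        unsigned bn = off / BSIZE;

        m = BSIZE - off % BSIZE;       /* bytes left in this block */
        if (m > n - tot)
            m = n - tot;
        /* Read the block first, in case only part of it is overwritten. */
        memcpy(buf, f->disk[bn], BSIZE);
        f->block_reads++;
        memcpy(buf + off % BSIZE, src, m);
        /* Write-through: the block goes straight back to "disk". */
        memcpy(f->disk[bn], buf, BSIZE);
        f->block_writes++;
    }
    if (off > f->size)
        f->size = off;
    return (int)n;
}
```

<p>A 512-byte write aligned to a block boundary touches one block; a
100-byte write that starts 12 bytes before a block boundary touches two,
and the read-before-write preserves the bytes it does not overwrite.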

<p>Lots of code to implement reading and writing of files. How about
directories?
<ul>
<li>4722: look through the directory, reading directory blocks, to see
if a directory entry is unused (inum == 0).
<li>4729: use it and update it.
<li>4735: write the modified block.
</ul>
<p>Reading and writing of directories is trivial.

</body>