VFS: add documentation about internal working
This commit is contained in:
parent
23c5f56e32
commit
625f4ae4a3
1 changed files with 674 additions and 0 deletions
674
servers/vfs/README
Normal file
674
servers/vfs/README
Normal file
|
@ -0,0 +1,674 @@
|
|||
Description of VFS Thomas Veerman 18-12-2012
|
||||
|
||||
Table of contents
|
||||
1 ..... General description of responsibilities
|
||||
2 ..... General architecture
|
||||
3 ..... Worker threads
|
||||
4 ..... Locking
|
||||
4.1 .... Locking requirements
|
||||
4.2 .... Three-level Lock
|
||||
4.3 .... Data structures subject to locking
|
||||
4.4 .... Locking order
|
||||
4.5 .... Vmnt (file system) locking
|
||||
4.6 .... Vnode (open file) locking
|
||||
4.7 .... Filp (file position) locking
|
||||
4.8 .... Lock characteristics per request type
|
||||
5 ..... Recovery from driver crashes
|
||||
5.1 .... Recovery from block drivers crashes
|
||||
5.2 .... Recovery from character driver crashes
|
||||
5.3 .... Recovery from File Server crashes
|
||||
|
||||
1 General description of responsibilities
|
||||
VFS implements the file system in cooperation with one or more File Servers
|
||||
(FS). The File Servers take care of the actual file system on a partition. That
|
||||
is, they interpret the data structure on disk, write and read data to/from
|
||||
disk, etc. VFS sits on top of those File Servers and communicates with
|
||||
them. Looking inside VFS, we can identify several roles. First, a role of VFS
|
||||
is to handle most POSIX system calls that are supported by Minix. Additionally,
|
||||
it supports a few calls necessary for libc. The following system calls are
|
||||
handled by VFS:
|
||||
access, chdir, chmod, chown, chroot, close, creat, fchdir, fcntl, fstat,
|
||||
fstatfs, fstatvfs, fsync, ftruncate getdents, ioctl, link, llseek, lseek,
|
||||
lstat, mkdir, mknod, mount, open, pipe, read, readlink, rename, rmdir, select,
|
||||
stat, statvfs, symlink, sync, truncate, umask, umount, unlink, utime, write.
|
||||
Second, it maintains part of the state belonging to a process (process state is
|
||||
spread out over the kernel, VM, PM, and VFS). For example, it maintains state
|
||||
for select(2) calls, file descriptors and file positions. Also, it cooperates
|
||||
with the Process Manager to handle the fork, exec, and exit system calls.
|
||||
Third, VFS keeps track of endpoints that are supposed to be drivers for
|
||||
character or block special files. File Servers can be regarded as drivers for
|
||||
block special files, although they are handled entirely different compared
|
||||
to other drivers.
|
||||
|
||||
The following diagram depicts how a read() on a file in /home is being handled:
|
||||
|
||||
----------------
|
||||
| user process |
|
||||
----------------
|
||||
^ ^
|
||||
| |
|
||||
read(2) \
|
||||
| \
|
||||
V \
|
||||
---------------- |
|
||||
| VFS | |
|
||||
---------------- |
|
||||
^ |
|
||||
| |
|
||||
V |
|
||||
------- -------- ---------
|
||||
| MFS | | MFS | | MFS |
|
||||
| / | | /usr | | /home |
|
||||
------- -------- ---------
|
||||
Diagram 1: handling of read(2) system call
|
||||
|
||||
The user process executes the read system call which is delivered to VFS. VFS
|
||||
verifies the read is done on a valid (open) file and forwards the request
|
||||
to the FS responsible for the file system on which the file resides. The FS
|
||||
reads the data, copies it directly to the user process, and replies to VFS
|
||||
it has executed the request. Subsequently, VFS replies to the user process
|
||||
the operation is done and the user process continues to run.
|
||||
|
||||
2 General architecture
|
||||
VFS works roughly identical to every other server and driver in Minix; it
|
||||
fetches a message (internally referred to as a job in some cases), executes
|
||||
the request embedded in the message, returns a reply, and fetches the next
|
||||
job. There are several sources for new jobs: from user processes, from PM, from
|
||||
the kernel, and from suspended jobs inside VFS itself (suspended operations
|
||||
on pipes, locks, or character special files). File Servers are regarded as
|
||||
normal user processes in this case, but their abilities are limited. This
|
||||
is to prevent deadlocks. Once a job is received, a worker thread starts
|
||||
executing it. During the lifetime of a job, the worker thread might need
|
||||
to talk to several File Servers. The protocol VFS speaks with File Servers
|
||||
is fully documented on the Wiki at [0]. The protocol fields are defined in
|
||||
<minix/vfsif.h>. If the job is an operation on a character or block special
|
||||
file and the need to talk to a driver arises, VFS uses the Character and
|
||||
Block Device Protocol. See [1]. This is sadly not official documentation,
|
||||
but it is an accurate description of how it works. Luckily, driver writers
|
||||
can use the libchardriver and libblockdriver libraries and don't have to
|
||||
know the details of the protocol.
|
||||
|
||||
3 Worker threads
|
||||
Upon start up, VFS spawns a configurable amount of worker threads. The
|
||||
main thread fetches requests and replies, and hands them off to idle or
|
||||
reply-pending workers, respectively. If no worker threads are available,
|
||||
the request is queued. There are 3 types of worker threads: normal, a system
|
||||
worker, and a deadlock resolver. All standard system calls are handled by
|
||||
normal worker threads. Jobs from PM and notifications from the kernel are taken
|
||||
care of by the system worker. The deadlock resolver handles jobs from system
|
||||
processes (i.e., File Servers and drivers) when there are no normal worker
|
||||
threads available; all normal threads might be blocked on a single worker
|
||||
thread that caused a system process to send a request on its own. To unblock
|
||||
all normal threads, we need to reserve one thread to handle that situation.
|
||||
VFS drives all File Servers and drivers asynchronously. While waiting for
|
||||
a reply, a worker thread is blocked and other workers can keep processing
|
||||
requests. Upon reply the worker thread is unblocked.
|
||||
As mentioned above, the main thread is responsible for retrieving new jobs and
|
||||
replies to current jobs and start or unblock the proper worker thread. Given
|
||||
how many sources for new jobs and replies there are, the work for the main
|
||||
thread is quite complicated. Consider Table 1.
|
||||
|
||||
---------------------------------------------------------
|
||||
| From | normal | deadlock | system |
|
||||
---------------------------------------------------------
|
||||
msg is new job
|
||||
---------------------------------------------------------
|
||||
| PM | | | X |
|
||||
+----------------------+----------+----------+----------+
|
||||
| Notification from | | | |
|
||||
| the kernel | | | X |
|
||||
+----------------------+----------+----------+----------+
|
||||
| Notification from | | | |
|
||||
| DS or system process | X | X | |
|
||||
+----------------------+----------+----------+----------+
|
||||
| User process | X | | |
|
||||
+----------------------+----------+----------+----------+
|
||||
| Unsuspended process | X | | |
|
||||
---------------------------------------------------------
|
||||
msg is reply
|
||||
---------------------------------------------------------
|
||||
| File Server reply | resume | | |
|
||||
+----------------------+----------+----------+----------+
|
||||
| Sync. driver reply | resume | | |
|
||||
+----------------------+----------+----------+----------+
|
||||
| Async. driver reply | resume/X | X | |
|
||||
---------------------------------------------------------
|
||||
Table 1: VFS' message fetching main loop. X means 'start thread'.
|
||||
|
||||
The reason why asynchronous driver replies get their own thread is for the
|
||||
following. In some cases, a reply has a thread blocked waiting for it which
|
||||
can be resumed (e.g., open). In another case there's a lot of work to be
|
||||
done which involves sending new messages (e.g., select replies). Finally,
|
||||
DEV_REVIVE replies unblock suspended processes which in turn generate new jobs
|
||||
to be handled by the main loop (e.g., suspended reads and writes). So depending
|
||||
on the reply a new thread has to be started. Having all this logic in the main
|
||||
loop is messy, so we start a thread regardless of the actual reply contents.
|
||||
When there are no worker threads available and there is no need to invoke
|
||||
the deadlock resolver (i.e., normal system calls), the request is queued in
|
||||
the fproc table. This works because a process can send only one system call
|
||||
at a time. When implementing kernel threads, one has to take this assumption
|
||||
into account.
|
||||
The protocol PM speaks with VFS is asynchronous and PM is allowed to
|
||||
send as many request to VFS as it wants. It is impossible to use the same
|
||||
queueing mechanism as normal processes use, because that would allow for
|
||||
just 1 queued message. Instead, the system worker maintains a linked list
|
||||
of pending requests. Moreover, this queueing mechanism is also the reason
|
||||
why notifications from the kernel are handled by the system worker; the
|
||||
kernel has no corresponding fproc table entry (so we can't store it there)
|
||||
and the linked list has no dependencies on that table.
|
||||
Communication with drivers is asynchronous even when the driver uses the
|
||||
synchronous driver protocol. However, to guarantee identical behavior,
|
||||
access to synchronous drivers is serialized. File Servers are treated
|
||||
differently. VFS was designed to be able to send requests concurrently to
|
||||
File Servers, although at the time of writing there are no File Servers that
|
||||
can actually make use of that functionality. To identify which reply from an
|
||||
FS belongs to which worker thread, all requests have an embedded transaction
|
||||
identification number (a magic number + thread id encoded in the mtype field
|
||||
of a message) which the FS has to echo upon reply. Because the range of valid
|
||||
transaction IDs is isolated from valid system call numbers, VFS can use that
|
||||
ID to differentiate between replies from File Servers and actual new system
|
||||
calls from FSes. Using this mechanism VFS is able to support FUSE and ProcFS.
|
||||
|
||||
4 Locking
|
||||
To ensure correct execution of system calls, worker threads sometimes need
|
||||
certain objects within VFS to remain unchanged during thread suspension
|
||||
and resumption (i.e., when they need to communicate with a driver or File
|
||||
Server). Threads keep most state on the stack, but there are a few global
|
||||
variables that require protection: the fproc table, vmnt table, vnode table,
|
||||
and filp table. Other tables such as lock table, select table, and dmap table
|
||||
don't require protection by means of exclusive access. There it's required
|
||||
and enough to simply mark an entry in use.
|
||||
|
||||
4.1 Locking requirements
|
||||
VFS implements the locking model described in [2]. For completeness of this
|
||||
document we'll describe it here, too. The requirements are based on a threading
|
||||
package that is non-preemptive. VFS must guarantee correct functioning with
|
||||
several, semi-concurrently executing threads in any arbitrary order. The
|
||||
latter requirement follows from the fact that threads need service from
|
||||
other components like File Servers and drivers, and they may take any time
|
||||
to complete requests.
|
||||
1) Consistency of replicated values. Several system calls rely on VFS keeping
|
||||
a replicated representation of data in File Servers (e.g., file sizes,
|
||||
file modes, etc.).
|
||||
2) Isolation of system calls. Many system calls involve multiple requests to
|
||||
FSes. Concurrent requests from other processes must not lead to otherwise
|
||||
impossible results (e.g., a chmod operation on a file cannot fail halfway
|
||||
through because it's suddenly unlinked or moved).
|
||||
3) Integrity of objects. From the point of view of threads, obtaining mutual
|
||||
exclusion is a potentially blocking operation. The integrity of any objects
|
||||
used across blocking calls must be guaranteed (e.g., the file mode in a vnode
|
||||
must remain intact not only when talking to other components, but also when
|
||||
obtaining a lock on a filp).
|
||||
4) No deadlock. Not one call may cause another call to never complete. Deadlock
|
||||
situations are typically the result of two or more threads that each hold
|
||||
exclusive access to one resource and want exclusive access to the resource
|
||||
held by the other thread. These resources are a) data (global variables)
|
||||
and b) worker threads.
|
||||
4a) Conflicts between locking of different types of objects can be avoided by
|
||||
keeping a locking order: objects of different type must always be locked in
|
||||
the same order. If multiple objects of the same type are to be locked, then
|
||||
first a "common denominator" higher up in the locking order must be locked.
|
||||
4b) Some threads can only run to completion when another thread does work on
|
||||
their behalf. Examples of this are drivers and file servers that do system
|
||||
calls on their own (e.g., ProcFS, PFS/UNIX Domain Sockets, FUSE) or crashing
|
||||
components (e.g., a driver for a character special file that crashes during
|
||||
a request; a second thread is required to handle resource clean up or driver
|
||||
restart before the first thread can abort or retry the request).
|
||||
5) No starvation. VFS must guarantee that every system call completes in finite
|
||||
time (e.g., an infinite stream of reads must never completely block writes).
|
||||
Furthermore, we want to maximize parallelism to improve performance. This
|
||||
leads to:
|
||||
6) A request to one File Server must not block access to other FS
|
||||
processes. This means that most forms of locking cannot take place at a
|
||||
global level, and must at most take place on the file system level.
|
||||
7) No read-only operation on a regular file must block an independent read
|
||||
call to that file. In particular, (read-only) open and close operations may
|
||||
not block such reads, and multiple independent reads on the same file must
|
||||
be able to take place concurrently (i.e., reads that do not share a file
|
||||
position between their file descriptors).
|
||||
|
||||
4.2 Three-level Lock
|
||||
From the requirements it follows that we need at least two locking types: read
|
||||
and write locks. Concurrent reads are allowed, but writes are exclusive both
|
||||
from reads and from each other. However, in a lot of cases it possible to use
|
||||
a third locking type that is in between read and write lock: the serialize
|
||||
lock. This is implemented in the three-level lock [2]. The three-level
|
||||
lock provides:
|
||||
TLL_READ: allows an unlimited number of threads to hold the lock with the
|
||||
same type (both the thread itself and other threads); N * concurrent.
|
||||
TLL_READSER: also allows an unlimited number of threads with type TLL_READ,
|
||||
but only one thread can obtain serial access to the lock; N * concurrent +
|
||||
1 * serial.
|
||||
TLL_WRITE: provides full mutual exclusion; 1 * exclusive + 0 * concurrent +
|
||||
0 * serial.
|
||||
In absence of TLL_READ locks, a TLL_READSER is identical to TLL_WRITE. However,
|
||||
TLL_READSER never blocks concurrent TLL_READ access. TLL_READSER can be
|
||||
upgraded to TLL_WRITE; the thread will block until the last TLL_READ lock
|
||||
leaves and new TLL_READ locks are blocked. Locks can be downgraded to a
|
||||
lower type. The three-level lock is implemented using two FIFO queues with
|
||||
write-bias. This guarantees no starvation.
|
||||
|
||||
4.3 Data structures subject to locking
|
||||
VFS has a number of global data structures. See Table 2.
|
||||
|
||||
--------------------------------------------------------------------
|
||||
| Structure | Object description |
|
||||
+------------+-----------------------------------------------------|
|
||||
| fproc | Process (includes process's file descriptors) |
|
||||
+------------+-----------------------------------------------------|
|
||||
| vmnt | Virtual mount; a mounted file system |
|
||||
+------------+-----------------------------------------------------|
|
||||
| vnode | Virtual node; an open file |
|
||||
+------------+-----------------------------------------------------|
|
||||
| filp | File position into an open file |
|
||||
+------------+-----------------------------------------------------|
|
||||
| lock | File region locking state for an open file |
|
||||
+------------+-----------------------------------------------------|
|
||||
| select | State for an in-progress select(2) call |
|
||||
+------------+-----------------------------------------------------|
|
||||
| dmap | Mapping from major device number to a device driver |
|
||||
--------------------------------------------------------------------
|
||||
Table 2: VFS object types.
|
||||
|
||||
An fproc object is a process. An fproc object is created by fork(2)
|
||||
and destroyed by exit(2) (which may, or may not, be instantiated from the
|
||||
process itself). It is identified by its endpoint number ('fp_endpoint')
|
||||
and process id ('fp_pid'). Both are unique although in general the endpoint
|
||||
number is used throughout the system.
|
||||
A vmnt object is a mounted file system. It is created by mount(2) and destroyed
|
||||
by umount(2). It is identified by a device number ('m_dev') and FS endpoint
|
||||
number ('m_fs_e'); both are unique to each vmnt object. There is always a
|
||||
single process that handles a file system on a device and a device cannot
|
||||
be mounted twice.
|
||||
A vnode object is the VFS representation of an open inode on the file
|
||||
system. A vnode object is created when a first process opens or creates the
|
||||
corresponding file and is destroyed when the last process, which has that
|
||||
file open, closes it. It is identified by a combination of FS endpoint number
|
||||
('v_fs_e') and inode number of that file system ('v_inode_nr'). A vnode
|
||||
might be mapped to another file system; the actual reading and writing is
|
||||
handled by a different endpoint. This has no effect on locking.
|
||||
A filp object contains a file position within a file. It is created when a file
|
||||
is opened or anonymous pipe created and destroyed when the last user (i.e.,
|
||||
process) closes it. A file descriptor always points to a single filp. A filp
|
||||
always point to a single vnode, although not all vnodes are pointed to by a
|
||||
filp. A filp has a reference count ('filp_count') which is identical to the
|
||||
number of file descriptors pointing to it. It can be increased by a dup(2)
|
||||
or fork(2). A filp can therefore be shared by multiple processes.
|
||||
A lock object keeps information about locking of file regions. This has
|
||||
nothing to do with the threading type of locking. The lock objects require
|
||||
no locking protection and won't be discussed further.
|
||||
A select object keeps information on a select(2) operation that cannot
|
||||
be fulfilled immediately (waiting for timeout or file descriptors not
|
||||
ready). They are identified by their owner ('requestor'); a pointer to the
|
||||
fproc table. A null pointer means not in use. A select object can be used by
|
||||
only one process and a process can do only one select(2) at a time. Select(2)
|
||||
operates on filps and is organized in such a way that it is sufficient to
|
||||
apply locking on individual filps and not on select objects themselves. They
|
||||
won't be discussed further.
|
||||
A dmap object is a mapping from a device number to a device driver. A device
|
||||
driver can have multiple device numbers associated (e.g., TTY). Access to
|
||||
a driver is exclusive when it uses the synchronous driver protocol.
|
||||
|
||||
4.4 Locking order
|
||||
Based on the description in the previous section, we need protection for
|
||||
fproc, vmnt, vnode, and filp objects. To prevent deadlocks as a result of
|
||||
object locking, we need to define a strict locking order. In VFS we use the
|
||||
following order:
|
||||
|
||||
fproc -> [exec] -> vmnt -> vnode -> filp -> [block special file] -> [dmap]
|
||||
|
||||
That is, no thread may lock an fproc object while holding a vmnt lock,
|
||||
and no thread may lock a vmnt object while holding an (associated) vnode, etc.
|
||||
Fproc needs protection because processes themselves can initiate system
|
||||
calls, but also PM can cause system calls that have to be executed in their
|
||||
name. For example, a process might be busy reading from a character device
|
||||
and another process sends a termination signal. The exit(2) that follows is
|
||||
sent by PM and is to be executed by the to-be-killed process itself. At this
|
||||
point there is contention for the fproc object that belongs to the process,
|
||||
hence the need for protection.
|
||||
The exec(2) call is protected by a mutex for the following reason. VFS uses a
|
||||
number of variables on the heap to read ELF headers. They are on the heap due
|
||||
to their size; putting them on the stack would increase stack size demands for
|
||||
worker threads. The exec call does blocking read calls and thus needs exclusive
|
||||
access to these variables. However, only the exec(2) syscall needs this lock.
|
||||
Access to block special files needs to be exclusive. File Servers are
|
||||
responsible for handling reads from and writes to block special files; if
|
||||
a block special file is on a device that is mounted, the FS responsible for
|
||||
that mount point takes care of it, otherwise the FS that handles the root of
|
||||
the file system is responsible. Due to mounting and unmounting file systems,
|
||||
the FS handling a block special file may change. Locking the vnode is not
|
||||
enough since the inode can be on an entirely different File Server. Therefore,
|
||||
access to block special files must be mutually exclusive from concurrent
|
||||
mount(2)/umount(2) operations. However, when we're not accessing a block
|
||||
special file, we don't need this lock.
|
||||
|
||||
4.5 Vmnt (file system) locking
|
||||
Vmnt locking cannot be seen completely separately from vnode locking. For
|
||||
example, umount(2) fails if there are still in-use vnodes, which means that
|
||||
FS requests [0] only involving in-use inodes do not have to acquire a vmnt
|
||||
lock. On the other hand, all other request do need a vmnt lock. Extrapolating
|
||||
this to system calls this means that all system calls involving a file
|
||||
descriptor don't need a vmnt lock and all other system calls (that make FS
|
||||
requests) do need a vmnt lock.
|
||||
|
||||
-------------------------------------------------------------------------------
|
||||
| Category | System calls |
|
||||
+-------------------+---------------------------------------------------------+
|
||||
| System calls with | access, chdir, chmod, chown, chroot, creat, dumpcore*, |
|
||||
| a path name | exec, link, lstat, mkdir, mknod, mount, open, readlink, |
|
||||
| argument | rename, rmdir, stat, statvfs, symlink, truncate, umount,|
|
||||
| | unlink, utime |
|
||||
+-------------------+---------------------------------------------------------+
|
||||
| System calls with | close, fchdir, fcntl, fstat, fstatvfs, ftruncate, |
|
||||
| a file descriptor | getdents, ioctl, llseek, pipe, read, select, write |
|
||||
| argument | |
|
||||
+-------------------+---------------------------------------------------------+
|
||||
| System calls with | fsync**, sync, umask |
|
||||
| other or no | |
|
||||
| arguments | |
|
||||
-------------------------------------------------------------------------------
|
||||
Table 3: System call categories
|
||||
* path name argument is implicit, the path name is "core.<pid>"
|
||||
** although fsync actually provides a file descriptor argument, it's only
|
||||
used to find the vmnt and not to do any actual operations on
|
||||
|
||||
Before we describe what kind of vmnt locks VFS applies to system calls with a
|
||||
path name or other arguments, we need to make some notes on path lookup. Path
|
||||
lookups take arbitrary paths as input (relative and absolute). They can start
|
||||
at any vmnt (based on root directory and working directory of the process doing
|
||||
the lookup) and visit any file system in arbitrary order, possibly visiting
|
||||
the same file system more than once. As such, VFS can never tell in advance
|
||||
at which File Server a lookup will end. This has the following consequences:
|
||||
- In the lookup procedure, only one vmnt must be locked at a time. When
|
||||
moving from one vmnt to another, the first vmnt has to be unlocked before
|
||||
acquiring the next lock to prevent deadlocks.
|
||||
- The lookup procedure must lock each visited file system with TLL_READSER
|
||||
and downgrade or upgrade to the lock type desired by the caller for the
|
||||
destination file system (as VFS cannot know which file system is final). This
|
||||
is to prevent deadlocks when a thread acquires a TLL_READSER on a vmnt and
|
||||
another thread TLL_READ on the same vmnt. If the second thread is blocked
|
||||
on the first thread due to it acquiring a lock on a vnode, the first thread
|
||||
will be unable to upgrade a TLL_READSER lock to TLL_WRITE.
|
||||
|
||||
We use the following mapping for vmnt locks onto three-level lock types:
|
||||
-------------------------------------------------------------------------------
|
||||
| Lock type | Mapped to | Used for |
|
||||
+------------+-------------+--------------------------------------------------+
|
||||
| VMNT_READ | TLL_READ | Read-only operations and fully independent write |
|
||||
| | | operations |
|
||||
+------------+-------------+--------------------------------------------------+
|
||||
| VMNT_WRITE | TLL_READSER | Independent create and modify operations |
|
||||
+------------+-------------+--------------------------------------------------+
|
||||
| VMNT_EXCL | TLL_WRITE | Delete and dependent write operations |
|
||||
-------------------------------------------------------------------------------
|
||||
Table 4: vmnt to tll lock mapping
|
||||
|
||||
The following table shows a sub-categorization of system calls without a
|
||||
file descriptor argument, together with their locking types and motivation
|
||||
as used by VFS.
|
||||
-------------------------------------------------------------------------------
|
||||
| Group | System calls | Lock type | Motivation |
|
||||
+-------------+--------------+------------+-----------------------------------+
|
||||
| File open | chdir, | VMNT_READ | These operations do not interfere |
|
||||
| ops. | chroot, exec,| | with each other, as vnodes can be |
|
||||
| (non-create)| open | | opened concurrently, and open |
|
||||
| | | | operations do not affect |
|
||||
| | | | replicated state. |
|
||||
+-------------+--------------+------------+-----------------------------------+
|
||||
| File create-| creat, | VMNT_EXCL | File create ops. require mutual |
|
||||
| and-open | open(O_CREAT)| for create | exclusion from concurrent file |
|
||||
| ops | | VMNT_WRITE | open ops. If the file already |
|
||||
| | | for open | existed, the VMNT_WRITE lock that |
|
||||
| | | | is necessary for the lookup is |
|
||||
| | | | not upgraded |
|
||||
+-------------+--------------+------------+-----------------------------------+
|
||||
| File create-| pipe | VMNT_READ | These create nameless inodes |
|
||||
| unique-and- | | | which cannot be opened by means |
|
||||
| open ops. | | | of a path. Their creation |
|
||||
| | | | therefore does not interfere with |
|
||||
| | | | anything else |
|
||||
+-------------+--------------+------------+-----------------------------------+
|
||||
| File create-| mkdir, mknod,| VMNT_WRITE | These operations do not affect |
|
||||
| only ops. | slink | | any VFS state, and can therefore |
|
||||
| | | | take place concurrently with open |
|
||||
| | | | operations |
|
||||
+-------------+--------------+------------+-----------------------------------+
|
||||
| File info | access, lstat| VMNT_READ | These operations do not interfere |
|
||||
| retrieval or| readlink,stat| | with each other and do not modify |
|
||||
| modification| utime | | replicated state |
|
||||
+-------------+--------------+------------+-----------------------------------+
|
||||
| File | chmod, chown,| VMNT_READ | These operations do not interfere |
|
||||
| modification| truncate | | with each other. They do need |
|
||||
| | | | exclusive access on the vnode |
|
||||
| | | | level |
|
||||
+-------------+--------------+------------+-----------------------------------+
|
||||
| File link | link | VMNT_WRITE | Identical to file create-only |
|
||||
| ops. | | | operations |
|
||||
+-------------+--------------+------------+-----------------------------------+
|
||||
| File unlink | rmdir, unlink| VMNT_EXCL | These must not interfere with |
|
||||
| ops. | | | file create operations, to avoid |
|
||||
| | | | the scenario where inodes are |
|
||||
| | | | reused immediately. However, due |
|
||||
| | | | to necessary path checks, the |
|
||||
| | | | vmnt is first locked VMNT_WRITE |
|
||||
| | | | and then upgraded |
|
||||
+-------------+--------------+------------+-----------------------------------+
|
||||
| File rename | rename | VMNT_EXCL | Identical to file unlink |
|
||||
| ops. | | | operations |
|
||||
+-------------+--------------+------------+-----------------------------------+
|
||||
| Non-file | sync, umask | VMNT_READ | umask does not involve the file |
|
||||
| ops. | | or none | system, so it does not need |
|
||||
| | | | locks. sync does not alter state |
|
||||
| | | | in VFS and is atomic at the FS |
|
||||
| | | | level |
|
||||
-------------------------------------------------------------------------------
|
||||
Table 5: System call without file descriptor argument sub-categorization
|
||||
|
||||
4.6 Vnode (open file) locking
|
||||
Compared to vmnt locking, vnode locking is relatively straightforward. All
|
||||
read-only accesses to vnodes that merely read the vnode object's fields are
|
||||
allowed to be concurrent. Consequently, all accesses that change fields
|
||||
of a vnode object must be exclusive. This leaves us with creation and
|
||||
destruction of vnode objects (and related to that, their reference counts);
|
||||
it's sufficient to serialize these accesses. This follows from the fact
|
||||
that a vnode is only created when the first user opens it, and destroyed
|
||||
when the last user closes it. A open file in process A cannot be be closed
|
||||
by process B. Note that this also relies on the fact that a process can do
|
||||
only one system call at a time. Kernel threads would violate this assumption.
|
||||
|
||||
We use the following mapping for vnode locks onto three-level lock types:
|
||||
-------------------------------------------------------------------------------
|
||||
| Lock type | Mapped to | Used for |
|
||||
+------------+-------------+--------------------------------------------------+
|
||||
| VNODE_READ | TLL_READ | Read access to previously opened vnodes |
|
||||
+------------+-------------+--------------------------------------------------+
|
||||
| VNODE_OPCL | TLL_READSER | Creation, opening, closing, and destruction of |
|
||||
| | | vnodes |
|
||||
+------------+-------------+--------------------------------------------------+
|
||||
| VNODE_WRITE| TLL_WRITE | Write access to previously opened vnodes |
|
||||
-------------------------------------------------------------------------------
|
||||
Table 6: vnode to tll lock mapping
|
||||
|
||||
When vnodes are destroyed, they are initially locked with VNODE_OPCL. After
|
||||
all, we're going to alter the reference count, so this must be serialized. If
|
||||
the reference count then reaches zero we obtain exclusive access. This should
|
||||
always be immediately possible unless there is a consistency problem. See
|
||||
section 4.8 for an exhaustive listing of locking methods for all operations on
|
||||
vnodes.
|
||||
|
||||
4.7 Filp (file position) locking
|
||||
The main fields of a filp object that are shared between various processes
|
||||
(and by extension threads), and that can change after object creation,
|
||||
are filp_count and filp_pos. Writes to and reads from filp object must be
|
||||
mutually exclusive, as all system calls have to use the latest version. For
|
||||
example, a read(2) call changes the file position (i.e., filp_pos), so two
|
||||
concurrent reads must obtain exclusive access. Consequently, as even read
|
||||
operations require exclusive access, filp object don't use three-level locks,
|
||||
but only mutexes.
|
||||
|
||||
System calls that involve a file descriptor often access both the filp and
|
||||
the corresponding vnode. The locking order requires us to first lock the
|
||||
vnode and then the filp. This is taken care of at the filp level. Whenever
|
||||
a filp is locked, a lock on the vnode is acquired first. Conversely, when
|
||||
a filp is unlocked, the corresponding vnode is also unlocked. A convenient
|
||||
consequence is that whenever a vnode is locked exclusively (VNODE_WRITE),
|
||||
all corresponding filps are implicitly locked. This is of particular use
|
||||
when multiple filps must be locked at the same time:
|
||||
- When opening a named pipe, VFS must make sure that there is at most one
|
||||
filp for the reader end and one filp for the writer end.
|
||||
- Pipe readers and writers must be suspended in the absence of (respectively)
|
||||
writers and readers.
|
||||
- To prevent pipe file sizes to grow too large and wrap, the file size is
|
||||
reset to zero when the pipe is empty. This can happen after a read(2).
|
||||
Because both filps are linked to the same vnode object (they are for the same
|
||||
pipe), it suffices to exclusively lock that vnode instead of both filp objects.
|
||||
|
||||
In some cases it can happen that a function that operates on a locked filp,
|
||||
calls another function that triggers another lock on a different filp for
|
||||
the same vnode. For example, close_filp. At some point, close_filp() calls
|
||||
release() which in turn will loop through the filp table looking for pipes
|
||||
being select(2)ed on. If there are, the select code will lock the filp and do
|
||||
operations on it. This works fine when doing a select(2) call, but conflicts
|
||||
with close(2) or exit(2). Lock_filp() makes an exception for this situation;
|
||||
if you've already locked a vnode with VNODE_OPCL or VNODE_WRITE when locking
|
||||
a filp, you obtain a "soft lock" on the vnode for this filp. This means
|
||||
that lock_filp won't actually try to lock the vnode (which wouldn't work),
|
||||
but flags the vnode as "skip unlock_vnode upon unlock_filp." Upon unlocking
|
||||
the filp, the vnode remains locked, the soft lock is removed, and the filp
|
||||
mutex is released. Note that this scheme does not violate the locking order;
|
||||
the vnode is (already) locked before the filp.
|
||||
|
||||
A similar problem arises with do_pipe. In this case we obtain a new vnode
|
||||
object, lock it, and obtain two new, locked, filp objects. If everything works
|
||||
out and the filp objects are linked to the same vnode, we run into trouble
|
||||
when unlocking both filps. The first filp being unlocked would work; the
|
||||
second filp doesn't have an associated vnode that's locked anymore. Therefore
|
||||
we introduced a plural unlock_filps(filp1, filp2) that can unlock two filps
|
||||
that both point to the same vnode.
|
||||
|
||||
4.8 Lock characteristics per request type
|
||||
For File Servers that support concurrent requests, it's useful to know which
|
||||
locking guarantees VFS provides for vmnts and vnodes, so it can take that
|
||||
into account when protecting internal data structures. READ = TLL_READ,
|
||||
READSER = TLL_READSER, WRITE = TLL_WRITE. The vnode locks applies to the
|
||||
REQ_INODE_NR field in requests, unless the notes say otherwise.
|
||||
|
||||
------------------------------------------------------------------------------
|
||||
| request | vmnt | vnode | notes |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_BREAD | | READ | VFS serializes reads from and writes to |
|
||||
| | | | block special files |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_BWRITE | | WRITE | VFS serializes reads from and writes to |
|
||||
| | | | block special files |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_CHMOD | READ | WRITE | vmnt is only locked if file is not |
|
||||
| | | | already opened |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_CHOWN | READ | WRITE | vmnt is only locked if file is not |
|
||||
| | | | already opened |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_CREATE | WRITE | WRITE | The directory in which the file is |
|
||||
| | | | created is write locked |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_FLUSH | | | Mutually exclusive to REQ_BREAD and |
|
||||
| | | | REQ_BWRITE |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_FSTATFS | | | |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_FTRUNC | READ | WRITE | vmnt is only locked if file is not |
|
||||
| | | | already opened |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_GETDENTS | READ | READ | vmnt is only locked if file is not |
|
||||
| | | | already opened |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_INHIBREAD| | READ | |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_LINK | READSER | WRITE | REQ_INODE_NR is locked READ |
|
||||
| | | | REQ_DIR_INO is locked WRITE |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_LOOKUP | READSER | | |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_MKDIR | READSER | WRITE | |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_MKNOD | READSER | WRITE | |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
|REQ_MOUNTPOINT| WRITE | WRITE | |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
|REQ_NEW_DRIVER| | | |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_NEWNODE | | | Only sent to PFS |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_PUTNODE | | READSER | READSER when dropping all but one |
|
||||
| | | or WRITE| references. WRITE when final reference |
|
||||
| | | | is dropped (i.e., no longer in use) |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_RDLINK | READ | READ | In some circumstances stricter locking |
|
||||
| | | | might be applied, but not guaranteed |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_READ | | READ | |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
|REQ_READSUPER | WRITE | | |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_RENAME | WRITE | WRITE | |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_RMDIR | WRITE | WRITE | |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_SLINK | READSER | READ | |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_STAT | READ | READ | vmnt is only locked if file is not |
|
||||
| | | | already opened |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_STATVFS | READ | READ | vmnt is only locked if file is not |
|
||||
| | | | already opened |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_SYNC | READ | | |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_UNLINK | WRITE | WRITE | |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_UNMOUNT | WRITE | | |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_UTIME | READ | READ | |
|
||||
+--------------+---------+---------+-----------------------------------------+
|
||||
| REQ_WRITE | | WRITE | |
|
||||
-----------------------------------------------------------------------------+
|
||||
Table 7: VFS-FS requests locking guarantees
|
||||
|
||||
5 Recovery from driver crashes
|
||||
VFS can recover from block special file and character special file driver
|
||||
crashes. It can recover to some degree from a crashed File Server (which we
|
||||
can regard as a driver).
|
||||
|
||||
5.1 Recovery from block drivers crashes
|
||||
When reading or writing, VFS doesn't communicate with block drivers directly,
|
||||
but always through a File Server (the root File Server being default). If the
|
||||
block driver crashes, the File Server does most of the work of the recovery
|
||||
procedure. VFS loops through all open files for block special files that
|
||||
were handled by this driver and reopens them. After that it sends the new
|
||||
endpoint to the File Server so it can finish the recover procedure. Finally,
|
||||
the File Server will retry pending requests if possible. However, reopening
|
||||
files can cause the block driver to crash again. When that happens, VFS will
|
||||
stop the recovery. A driver can return ERESTART to VFS to tell it to retry
|
||||
a request. VFS does this with an arbitrary maximum of 5 attempts.
|
||||
|
||||
5.2 Recovery from character driver crashes
|
||||
Character special files are treated differently. Once VFS has found out a
|
||||
driver has been restarted, it will stop the current request (if there is
|
||||
any). It makes no sense to retry requests due to the nature of character
|
||||
special files. If a character special driver can restart without changing
|
||||
endpoints, this merely results in the current request (e.g., read, write, or
|
||||
ioctl) failing and allows the user process to reissue the same request. On
|
||||
the other hand, if a driver restart causes the driver to change endpoint
|
||||
number, all associated file descriptors are marked invalid and subsequent
|
||||
operations on them will always fail with a bad file descriptor error.
|
||||
|
||||
5.3 Recovery from File Server crashes
|
||||
At the time of writing we cannot recover from crashed File Servers. When
|
||||
VFS detects it has to clean up the remnants of a File Server process (i.e.,
|
||||
through an exit(2)), it marks all associated file descriptors as invalid
|
||||
and cancels ongoing and pending requests to that File Server. Resources that
|
||||
were in use by the File Server are cleaned up.
|
||||
|
||||
[0] http://wiki.minix3.org/en/DevelopersGuide/VfsFsProtocol
|
||||
[1] http://www.cs.vu.nl/~dcvmoole/minix/blockchar.txt
|
||||
[2] http://www.minix3.org/theses/moolenbroek-multimedia-support.pdf
|
Loading…
Reference in a new issue