6ee180f5f7
Change-Id: I746f7c8ddabd1e047b8d536df14586c5b1594d55
679 lines
44 KiB
Text
679 lines
44 KiB
Text
## Description of VFS Thomas Veerman 21-3-2013
|
|
## This file is organized such that it can be read both in a Wiki and on
|
|
## the MINIX terminal using e.g. vi or less. Please, keep the file in the
|
|
## source tree as the canonical version and copy changes into the Wiki.
|
|
#pragma section-numbers 2
|
|
|
|
= VFS internals =
|
|
|
|
<<TableOfContents(2)>>
|
|
|
|
## Table of contents
|
|
## 1 ..... General description of responsibilities
|
|
## 2 ..... General architecture
|
|
## 3 ..... Worker threads
|
|
## 4 ..... Locking
|
|
## 4.1 .... Locking requirements
|
|
## 4.2 .... Three-level Lock
|
|
## 4.3 .... Data structures subject to locking
|
|
## 4.4 .... Locking order
|
|
## 4.5 .... Vmnt (file system) locking
|
|
## 4.6 .... Vnode (open file) locking
|
|
## 4.7 .... Filp (file position) locking
|
|
## 4.8 .... Lock characteristics per request type
|
|
## 5 ..... Recovery from driver crashes
|
|
## 5.1 .... Recovery from block drivers crashes
|
|
## 5.2 .... Recovery from character driver crashes
|
|
## 5.3 .... Recovery from File Server crashes
|
|
|
|
== General description of responsibilities ==
|
|
## 1 General description of responsibilities
|
|
VFS implements the file system in cooperation with one or more File Servers
|
|
(FS). The File Servers take care of the actual file system on a partition. That
|
|
is, they interpret the data structure on disk, write and read data to/from
|
|
disk, etc. VFS sits on top of those File Servers and communicates with
|
|
them. Looking inside VFS, we can identify several roles. First, a role of VFS
|
|
is to handle most POSIX system calls that are supported by Minix. Additionally,
|
|
it supports a few calls necessary for libc. The following system calls are
|
|
handled by VFS:
|
|
|
|
access, chdir, chmod, chown, chroot, close, creat, fchdir, fcntl, fstat,
|
|
fstatfs, fstatvfs, fsync, ftruncate getdents, ioctl, link, llseek, lseek,
|
|
lstat, mkdir, mknod, mount, open, pipe, read, readlink, rename, rmdir, select,
|
|
stat, statvfs, symlink, sync, truncate, umask, umount, unlink, utime, write.
|
|
|
|
Second, it maintains part of the state belonging to a process (process state is
|
|
spread out over the kernel, VM, PM, and VFS). For example, it maintains state
|
|
for select(2) calls, file descriptors and file positions. Also, it cooperates
|
|
with the Process Manager to handle the fork, exec, and exit system calls.
|
|
Third, VFS keeps track of endpoints that are supposed to be drivers for
|
|
character or block special files. File Servers can be regarded as drivers for
|
|
block special files, although they are handled entirely different compared
|
|
to other drivers.
|
|
|
|
The following diagram depicts how a read() on a file in /home is being handled:
|
|
{{{
|
|
----------------
|
|
| user process |
|
|
----------------
|
|
^ ^
|
|
| |
|
|
read(2) \
|
|
| \
|
|
V \
|
|
---------------- |
|
|
| VFS | |
|
|
---------------- |
|
|
^ |
|
|
| |
|
|
V |
|
|
------- -------- ---------
|
|
| MFS | | MFS | | MFS |
|
|
| / | | /usr | | /home |
|
|
------- -------- ---------
|
|
}}}
|
|
Diagram 1: handling of read(2) system call
|
|
|
|
The user process executes the read system call which is delivered to VFS. VFS
|
|
verifies the read is done on a valid (open) file and forwards the request
|
|
to the FS responsible for the file system on which the file resides. The FS
|
|
reads the data, copies it directly to the user process, and replies to VFS
|
|
it has executed the request. Subsequently, VFS replies to the user process
|
|
the operation is done and the user process continues to run.
|
|
|
|
== General architecture ==
|
|
## 2 General architecture
|
|
VFS works roughly identical to every other server and driver in Minix; it
|
|
fetches a message (internally referred to as a job in some cases), executes
|
|
the request embedded in the message, returns a reply, and fetches the next
|
|
job. There are several sources for new jobs: from user processes, from PM, from
|
|
the kernel, and from suspended jobs inside VFS itself (suspended operations
|
|
on pipes, locks, or character special files). File Servers are regarded as
|
|
normal user processes in this case, but their abilities are limited. This
|
|
is to prevent deadlocks. Once a job is received, a worker thread starts
|
|
executing it. During the lifetime of a job, the worker thread might need
|
|
to talk to several File Servers. The protocol VFS speaks with File Servers
|
|
is fully documented on the Wiki at [0]. The protocol fields are defined in
|
|
<minix/vfsif.h>. If the job is an operation on a character or block special
|
|
file and the need to talk to a driver arises, VFS uses the Character and
|
|
Block Device Protocol. See [1]. This is sadly not official documentation,
|
|
but it is an accurate description of how it works. Luckily, driver writers
|
|
can use the libchardriver and libblockdriver libraries and don't have to
|
|
know the details of the protocol.
|
|
|
|
== Worker threads ==
|
|
## 3 Worker threads
|
|
Upon start up, VFS spawns a configurable amount of worker threads. The
|
|
main thread fetches requests and replies, and hands them off to idle or
|
|
reply-pending workers, respectively. If no worker threads are available,
|
|
the request is queued. There are 3 types of worker threads: normal, a system
|
|
worker, and a deadlock resolver. All standard system calls are handled by
|
|
normal worker threads. Jobs from PM and notifications from the kernel are taken
|
|
care of by the system worker. The deadlock resolver handles jobs from system
|
|
processes (i.e., File Servers and drivers) when there are no normal worker
|
|
threads available; all normal threads might be blocked on a single worker
|
|
thread that caused a system process to send a request on its own. To unblock
|
|
all normal threads, we need to reserve one thread to handle that situation.
|
|
VFS drives all File Servers and drivers asynchronously. While waiting for
|
|
a reply, a worker thread is blocked and other workers can keep processing
|
|
requests. Upon reply the worker thread is unblocked.
|
|
As mentioned above, the main thread is responsible for retrieving new jobs and
|
|
replies to current jobs and start or unblock the proper worker thread. Given
|
|
how many sources for new jobs and replies there are, the work for the main
|
|
thread is quite complicated. Consider Table 1.
|
|
{{{
|
|
---------------------------------------------------------
|
|
| From | normal | deadlock | system |
|
|
---------------------------------------------------------
|
|
msg is new job
|
|
---------------------------------------------------------
|
|
| PM | | | X |
|
|
+----------------------+----------+----------+----------+
|
|
| Notification from | | | |
|
|
| the kernel | | | X |
|
|
+----------------------+----------+----------+----------+
|
|
| Notification from | | | |
|
|
| DS or system process | X | X | |
|
|
+----------------------+----------+----------+----------+
|
|
| User process | X | | |
|
|
+----------------------+----------+----------+----------+
|
|
| Unsuspended process | X | | |
|
|
---------------------------------------------------------
|
|
msg is reply
|
|
---------------------------------------------------------
|
|
| File Server reply | resume | | |
|
|
+----------------------+----------+----------+----------+
|
|
| Sync. driver reply | resume | | |
|
|
+----------------------+----------+----------+----------+
|
|
| Async. driver reply | resume/X | X | |
|
|
---------------------------------------------------------
|
|
}}}
|
|
Table 1: VFS' message fetching main loop. X means 'start thread'.
|
|
|
|
The reason why asynchronous driver replies get their own thread is for the
|
|
following. In some cases, a reply has a thread blocked waiting for it which
|
|
can be resumed (e.g., open). In another case there's a lot of work to be
|
|
done which involves sending new messages (e.g., select replies). Finally,
|
|
DEV_REVIVE replies unblock suspended processes which in turn generate new jobs
|
|
to be handled by the main loop (e.g., suspended reads and writes). So depending
|
|
on the reply a new thread has to be started. Having all this logic in the main
|
|
loop is messy, so we start a thread regardless of the actual reply contents.
|
|
When there are no worker threads available and there is no need to invoke
|
|
the deadlock resolver (i.e., normal system calls), the request is queued in
|
|
the fproc table. This works because a process can send only one system call
|
|
at a time. When implementing kernel threads, one has to take this assumption
|
|
into account.
|
|
The protocol PM speaks with VFS is asynchronous and PM is allowed to
|
|
send as many request to VFS as it wants. It is impossible to use the same
|
|
queueing mechanism as normal processes use, because that would allow for
|
|
just 1 queued message. Instead, the system worker maintains a linked list
|
|
of pending requests. Moreover, this queueing mechanism is also the reason
|
|
why notifications from the kernel are handled by the system worker; the
|
|
kernel has no corresponding fproc table entry (so we can't store it there)
|
|
and the linked list has no dependencies on that table.
|
|
Communication with drivers is asynchronous even when the driver uses the
|
|
synchronous driver protocol. However, to guarantee identical behavior,
|
|
access to synchronous drivers is serialized. File Servers are treated
|
|
differently. VFS was designed to be able to send requests concurrently to
|
|
File Servers, although at the time of writing there are no File Servers that
|
|
can actually make use of that functionality. To identify which reply from an
|
|
FS belongs to which worker thread, all requests have an embedded transaction
|
|
identification number (a magic number + thread id encoded in the mtype field
|
|
of a message) which the FS has to echo upon reply. Because the range of valid
|
|
transaction IDs is isolated from valid system call numbers, VFS can use that
|
|
ID to differentiate between replies from File Servers and actual new system
|
|
calls from FSes. Using this mechanism VFS is able to support FUSE and ProcFS.
|
|
|
|
== Locking ==
|
|
## 4 Locking
|
|
To ensure correct execution of system calls, worker threads sometimes need
|
|
certain objects within VFS to remain unchanged during thread suspension
|
|
and resumption (i.e., when they need to communicate with a driver or File
|
|
Server). Threads keep most state on the stack, but there are a few global
|
|
variables that require protection: the fproc table, vmnt table, vnode table,
|
|
and filp table. Other tables such as lock table, select table, and dmap table
|
|
don't require protection by means of exclusive access. There it's required
|
|
and enough to simply mark an entry in use.
|
|
|
|
=== Locking requirements ===
|
|
## 4.1 Locking requirements
|
|
VFS implements the locking model described in [2]. For completeness of this
|
|
document we'll describe it here, too. The requirements are based on a threading
|
|
package that is non-preemptive. VFS must guarantee correct functioning with
|
|
several, semi-concurrently executing threads in any arbitrary order. The
|
|
latter requirement follows from the fact that threads need service from
|
|
other components like File Servers and drivers, and they may take any time
|
|
to complete requests.
|
|
1. Consistency of replicated values. Several system calls rely on VFS keeping a replicated representation of data in File Servers (e.g., file sizes, file modes, etc.).
|
|
1. Isolation of system calls. Many system calls involve multiple requests to FSes. Concurrent requests from other processes must not lead to otherwise impossible results (e.g., a chmod operation on a file cannot fail halfway through because it's suddenly unlinked or moved).
|
|
1. Integrity of objects. From the point of view of threads, obtaining mutual exclusion is a potentially blocking operation. The integrity of any objects used across blocking calls must be guaranteed (e.g., the file mode in a vnode must remain intact not only when talking to other components, but also when obtaining a lock on a filp).
|
|
1. No deadlock. Not one call may cause another call to never complete. Deadlock situations are typically the result of two or more threads that each hold exclusive access to one resource and want exclusive access to the resource held by the other thread. These resources are a) data (global variables) and b) worker threads.
|
|
a. Conflicts between locking of different types of objects can be avoided by keeping a locking order: objects of different type must always be locked in the same order. If multiple objects of the same type are to be locked, then first a "common denominator" higher up in the locking order must be locked.
|
|
a. Some threads can only run to completion when another thread does work on their behalf. Examples of this are drivers and file servers that do system calls on their own (e.g., ProcFS, PFS/UNIX Domain Sockets, FUSE) or crashing components (e.g., a driver for a character special file that crashes during a request; a second thread is required to handle resource clean up or driver restart before the first thread can abort or retry the request).
|
|
1. No starvation. VFS must guarantee that every system call completes in finite time (e.g., an infinite stream of reads must never completely block writes). Furthermore, we want to maximize parallelism to improve performance. This leads to:
|
|
1. A request to one File Server must not block access to other FS processes. This means that most forms of locking cannot take place at a global level, and must at most take place on the file system level.
|
|
1. No read-only operation on a regular file must block an independent read call to that file. In particular, (read-only) open and close operations may not block such reads, and multiple independent reads on the same file must be able to take place concurrently (i.e., reads that do not share a file position between their file descriptors).
|
|
|
|
=== Three-level Lock ===
|
|
## 4.2 Three-level Lock
|
|
From the requirements it follows that we need at least two locking types: read
|
|
and write locks. Concurrent reads are allowed, but writes are exclusive both
|
|
from reads and from each other. However, in a lot of cases it possible to use
|
|
a third locking type that is in between read and write lock: the serialize
|
|
lock. This is implemented in the three-level lock [2]. The three-level
|
|
lock provides:
|
|
TLL_READ: allows an unlimited number of threads to hold the lock with the
|
|
same type (both the thread itself and other threads); N * concurrent.
|
|
TLL_READSER: also allows an unlimited number of threads with type TLL_READ,
|
|
but only one thread can obtain serial access to the lock; N * concurrent +
|
|
1 * serial.
|
|
TLL_WRITE: provides full mutual exclusion; 1 * exclusive + 0 * concurrent +
|
|
0 * serial.
|
|
In absence of TLL_READ locks, a TLL_READSER is identical to TLL_WRITE. However,
|
|
TLL_READSER never blocks concurrent TLL_READ access. TLL_READSER can be
|
|
upgraded to TLL_WRITE; the thread will block until the last TLL_READ lock
|
|
leaves and new TLL_READ locks are blocked. Locks can be downgraded to a
|
|
lower type. The three-level lock is implemented using two FIFO queues with
|
|
write-bias. This guarantees no starvation.
|
|
|
|
=== Data structures subject to locking ===
|
|
## 4.3 Data structures subject to locking
|
|
VFS has a number of global data structures. See Table 2.
|
|
{{{
|
|
--------------------------------------------------------------------
|
|
| Structure | Object description |
|
|
+------------+-----------------------------------------------------|
|
|
| fproc | Process (includes process's file descriptors) |
|
|
+------------+-----------------------------------------------------|
|
|
| vmnt | Virtual mount; a mounted file system |
|
|
+------------+-----------------------------------------------------|
|
|
| vnode | Virtual node; an open file |
|
|
+------------+-----------------------------------------------------|
|
|
| filp | File position into an open file |
|
|
+------------+-----------------------------------------------------|
|
|
| lock | File region locking state for an open file |
|
|
+------------+-----------------------------------------------------|
|
|
| select | State for an in-progress select(2) call |
|
|
+------------+-----------------------------------------------------|
|
|
| dmap | Mapping from major device number to a device driver |
|
|
--------------------------------------------------------------------
|
|
}}}
|
|
Table 2: VFS object types.
|
|
|
|
An fproc object is a process. An fproc object is created by fork(2)
|
|
and destroyed by exit(2) (which may, or may not, be instantiated from the
|
|
process itself). It is identified by its endpoint number ('fp_endpoint')
|
|
and process id ('fp_pid'). Both are unique although in general the endpoint
|
|
number is used throughout the system.
|
|
A vmnt object is a mounted file system. It is created by mount(2) and destroyed
|
|
by umount(2). It is identified by a device number ('m_dev') and FS endpoint
|
|
number ('m_fs_e'); both are unique to each vmnt object. There is always a
|
|
single process that handles a file system on a device and a device cannot
|
|
be mounted twice.
|
|
A vnode object is the VFS representation of an open inode on the file
|
|
system. A vnode object is created when a first process opens or creates the
|
|
corresponding file and is destroyed when the last process, which has that
|
|
file open, closes it. It is identified by a combination of FS endpoint number
|
|
('v_fs_e') and inode number of that file system ('v_inode_nr'). A vnode
|
|
might be mapped to another file system; the actual reading and writing is
|
|
handled by a different endpoint. This has no effect on locking.
|
|
A filp object contains a file position within a file. It is created when a file
|
|
is opened or anonymous pipe created and destroyed when the last user (i.e.,
|
|
process) closes it. A file descriptor always points to a single filp. A filp
|
|
always point to a single vnode, although not all vnodes are pointed to by a
|
|
filp. A filp has a reference count ('filp_count') which is identical to the
|
|
number of file descriptors pointing to it. It can be increased by a dup(2)
|
|
or fork(2). A filp can therefore be shared by multiple processes.
|
|
A lock object keeps information about locking of file regions. This has
|
|
nothing to do with the threading type of locking. The lock objects require
|
|
no locking protection and won't be discussed further.
|
|
A select object keeps information on a select(2) operation that cannot
|
|
be fulfilled immediately (waiting for timeout or file descriptors not
|
|
ready). They are identified by their owner ('requestor'); a pointer to the
|
|
fproc table. A null pointer means not in use. A select object can be used by
|
|
only one process and a process can do only one select(2) at a time. Select(2)
|
|
operates on filps and is organized in such a way that it is sufficient to
|
|
apply locking on individual filps and not on select objects themselves. They
|
|
won't be discussed further.
|
|
A dmap object is a mapping from a device number to a device driver. A device
|
|
driver can have multiple device numbers associated (e.g., TTY). Access to
|
|
a driver is exclusive when it uses the synchronous driver protocol.
|
|
|
|
=== Locking order ===
|
|
## 4.4 Locking order
|
|
Based on the description in the previous section, we need protection for
|
|
fproc, vmnt, vnode, and filp objects. To prevent deadlocks as a result of
|
|
object locking, we need to define a strict locking order. In VFS we use the
|
|
following order:
|
|
|
|
fproc -> [exec] -> vmnt -> vnode -> filp -> [block special file] -> [dmap]
|
|
|
|
That is, no thread may lock an fproc object while holding a vmnt lock,
|
|
and no thread may lock a vmnt object while holding an (associated) vnode, etc.
|
|
Fproc needs protection because processes themselves can initiate system
|
|
calls, but also PM can cause system calls that have to be executed in their
|
|
name. For example, a process might be busy reading from a character device
|
|
and another process sends a termination signal. The exit(2) that follows is
|
|
sent by PM and is to be executed by the to-be-killed process itself. At this
|
|
point there is contention for the fproc object that belongs to the process,
|
|
hence the need for protection.
|
|
The exec(2) call is protected by a mutex for the following reason. VFS uses a
|
|
number of variables on the heap to read ELF headers. They are on the heap due
|
|
to their size; putting them on the stack would increase stack size demands for
|
|
worker threads. The exec call does blocking read calls and thus needs exclusive
|
|
access to these variables. However, only the exec(2) syscall needs this lock.
|
|
Access to block special files needs to be exclusive. File Servers are
|
|
responsible for handling reads from and writes to block special files; if
|
|
a block special file is on a device that is mounted, the FS responsible for
|
|
that mount point takes care of it, otherwise the FS that handles the root of
|
|
the file system is responsible. Due to mounting and unmounting file systems,
|
|
the FS handling a block special file may change. Locking the vnode is not
|
|
enough since the inode can be on an entirely different File Server. Therefore,
|
|
access to block special files must be mutually exclusive from concurrent
|
|
mount(2)/umount(2) operations. However, when we're not accessing a block
|
|
special file, we don't need this lock.
|
|
|
|
=== Vmnt (file system) locking ===
|
|
## 4.5 Vmnt (file system) locking
|
|
Vmnt locking cannot be seen completely separately from vnode locking. For
|
|
example, umount(2) fails if there are still in-use vnodes, which means that
|
|
FS requests [0] only involving in-use inodes do not have to acquire a vmnt
|
|
lock. On the other hand, all other request do need a vmnt lock. Extrapolating
|
|
this to system calls this means that all system calls involving a file
|
|
descriptor don't need a vmnt lock and all other system calls (that make FS
|
|
requests) do need a vmnt lock.
|
|
{{{
|
|
-------------------------------------------------------------------------------
|
|
| Category | System calls |
|
|
+-------------------+---------------------------------------------------------+
|
|
| System calls with | access, chdir, chmod, chown, chroot, creat, dumpcore+, |
|
|
| a path name | exec, link, lstat, mkdir, mknod, mount, open, readlink, |
|
|
| argument | rename, rmdir, stat, statvfs, symlink, truncate, umount,|
|
|
| | unlink, utime |
|
|
+-------------------+---------------------------------------------------------+
|
|
| System calls with | close, fchdir, fcntl, fstat, fstatvfs, ftruncate, |
|
|
| a file descriptor | getdents, ioctl, llseek, pipe, read, select, write |
|
|
| argument | |
|
|
+-------------------+---------------------------------------------------------+
|
|
| System calls with | fsync++, sync, umask |
|
|
| other or no | |
|
|
| arguments | |
|
|
-------------------------------------------------------------------------------
|
|
}}}
|
|
Table 3: System call categories.
|
|
+ path name argument is implicit, the path name is "core.<pid>"
|
|
++ although fsync actually provides a file descriptor argument, it's only
|
|
used to find the vmnt and not to do any actual operations on
|
|
|
|
Before we describe what kind of vmnt locks VFS applies to system calls with a
|
|
path name or other arguments, we need to make some notes on path lookup. Path
|
|
lookups take arbitrary paths as input (relative and absolute). They can start
|
|
at any vmnt (based on root directory and working directory of the process doing
|
|
the lookup) and visit any file system in arbitrary order, possibly visiting
|
|
the same file system more than once. As such, VFS can never tell in advance
|
|
at which File Server a lookup will end. This has the following consequences:
|
|
* In the lookup procedure, only one vmnt must be locked at a time. When
|
|
moving from one vmnt to another, the first vmnt has to be unlocked before
|
|
acquiring the next lock to prevent deadlocks.
|
|
* The lookup procedure must lock each visited file system with TLL_READSER
|
|
and downgrade or upgrade to the lock type desired by the caller for the
|
|
destination file system (as VFS cannot know which file system is final). This
|
|
is to prevent deadlocks when a thread acquires a TLL_READSER on a vmnt and
|
|
another thread TLL_READ on the same vmnt. If the second thread is blocked
|
|
on the first thread due to it acquiring a lock on a vnode, the first thread
|
|
will be unable to upgrade a TLL_READSER lock to TLL_WRITE.
|
|
|
|
We use the following mapping for vmnt locks onto three-level lock types:
|
|
{{{
|
|
-------------------------------------------------------------------------------
|
|
| Lock type | Mapped to | Used for |
|
|
+------------+-------------+--------------------------------------------------+
|
|
| VMNT_READ | TLL_READ | Read-only operations and fully independent write |
|
|
| | | operations |
|
|
+------------+-------------+--------------------------------------------------+
|
|
| VMNT_WRITE | TLL_READSER | Independent create and modify operations |
|
|
+------------+-------------+--------------------------------------------------+
|
|
| VMNT_EXCL | TLL_WRITE | Delete and dependent write operations |
|
|
-------------------------------------------------------------------------------
|
|
}}}
|
|
Table 4: vmnt to tll lock mapping
|
|
|
|
The following table shows a sub-categorization of system calls without a
|
|
file descriptor argument, together with their locking types and motivation
|
|
as used by VFS.
|
|
{{{
|
|
-------------------------------------------------------------------------------
|
|
| Group | System calls | Lock type | Motivation |
|
|
+-------------+--------------+------------+-----------------------------------+
|
|
| File open | chdir, | VMNT_READ | These operations do not interfere |
|
|
| ops. | chroot, exec,| | with each other, as vnodes can be |
|
|
| (non-create)| open | | opened concurrently, and open |
|
|
| | | | operations do not affect |
|
|
| | | | replicated state. |
|
|
+-------------+--------------+------------+-----------------------------------+
|
|
| File create-| creat, | VMNT_EXCL | File create ops. require mutual |
|
|
| and-open | open(O_CREAT)| for create | exclusion from concurrent file |
|
|
| ops | | VMNT_WRITE | open ops. If the file already |
|
|
| | | for open | existed, the VMNT_WRITE lock that |
|
|
| | | | is necessary for the lookup is |
|
|
| | | | not upgraded |
|
|
+-------------+--------------+------------+-----------------------------------+
|
|
| File create-| pipe | VMNT_READ | These create nameless inodes |
|
|
| unique-and- | | | which cannot be opened by means |
|
|
| open ops. | | | of a path. Their creation |
|
|
| | | | therefore does not interfere with |
|
|
| | | | anything else |
|
|
+-------------+--------------+------------+-----------------------------------+
|
|
| File create-| mkdir, mknod,| VMNT_WRITE | These operations do not affect |
|
|
| only ops. | slink | | any VFS state, and can therefore |
|
|
| | | | take place concurrently with open |
|
|
| | | | operations |
|
|
+-------------+--------------+------------+-----------------------------------+
|
|
| File info | access, lstat| VMNT_READ | These operations do not interfere |
|
|
| retrieval or| readlink,stat| | with each other and do not modify |
|
|
| modification| utime | | replicated state |
|
|
+-------------+--------------+------------+-----------------------------------+
|
|
| File | chmod, chown,| VMNT_READ | These operations do not interfere |
|
|
| modification| truncate | | with each other. They do need |
|
|
| | | | exclusive access on the vnode |
|
|
| | | | level |
|
|
+-------------+--------------+------------+-----------------------------------+
|
|
| File link | link | VMNT_WRITE | Identical to file create-only |
|
|
| ops. | | | operations |
|
|
+-------------+--------------+------------+-----------------------------------+
|
|
| File unlink | rmdir, unlink| VMNT_EXCL | These must not interfere with |
|
|
| ops. | | | file create operations, to avoid |
|
|
| | | | the scenario where inodes are |
|
|
| | | | reused immediately. However, due |
|
|
| | | | to necessary path checks, the |
|
|
| | | | vmnt is first locked VMNT_WRITE |
|
|
| | | | and then upgraded |
|
|
+-------------+--------------+------------+-----------------------------------+
|
|
| File rename | rename | VMNT_EXCL | Identical to file unlink |
|
|
| ops. | | | operations |
|
|
+-------------+--------------+------------+-----------------------------------+
|
|
| Non-file | sync, umask | VMNT_READ | umask does not involve the file |
|
|
| ops. | | or none | system, so it does not need |
|
|
| | | | locks. sync does not alter state |
|
|
| | | | in VFS and is atomic at the FS |
|
|
| | | | level |
|
|
-------------------------------------------------------------------------------
|
|
}}}
|
|
Table 5: System call without file descriptor argument sub-categorization
|
|
|
|
=== Vnode (open file) locking ===
|
|
## 4.6 Vnode (open file) locking
|
|
Compared to vmnt locking, vnode locking is relatively straightforward. All
|
|
read-only accesses to vnodes that merely read the vnode object's fields are
|
|
allowed to be concurrent. Consequently, all accesses that change fields
|
|
of a vnode object must be exclusive. This leaves us with creation and
|
|
destruction of vnode objects (and related to that, their reference counts);
|
|
it's sufficient to serialize these accesses. This follows from the fact
|
|
that a vnode is only created when the first user opens it, and destroyed
|
|
when the last user closes it. A open file in process A cannot be be closed
|
|
by process B. Note that this also relies on the fact that a process can do
|
|
only one system call at a time. Kernel threads would violate this assumption.
|
|
|
|
We use the following mapping for vnode locks onto three-level lock types:
|
|
{{{
|
|
-------------------------------------------------------------------------------
|
|
| Lock type | Mapped to | Used for |
|
|
+------------+-------------+--------------------------------------------------+
|
|
| VNODE_READ | TLL_READ | Read access to previously opened vnodes |
|
|
+------------+-------------+--------------------------------------------------+
|
|
| VNODE_OPCL | TLL_READSER | Creation, opening, closing, and destruction of |
|
|
| | | vnodes |
|
|
+------------+-------------+--------------------------------------------------+
|
|
| VNODE_WRITE| TLL_WRITE | Write access to previously opened vnodes |
|
|
-------------------------------------------------------------------------------
|
|
}}}
|
|
Table 6: vnode to tll lock mapping
|
|
|
|
When vnodes are destroyed, they are initially locked with VNODE_OPCL. After
|
|
all, we're going to alter the reference count, so this must be serialized. If
|
|
the reference count then reaches zero we obtain exclusive access. This should
|
|
always be immediately possible unless there is a consistency problem. See
|
|
section 4.8 for an exhaustive listing of locking methods for all operations on
|
|
vnodes.
|
|
|
|
=== Filp (file position) locking ===
|
|
## 4.7 Filp (file position) locking
|
|
The main fields of a filp object that are shared between various processes
|
|
(and by extension threads), and that can change after object creation,
|
|
are filp_count and filp_pos. Writes to and reads from filp object must be
|
|
mutually exclusive, as all system calls have to use the latest version. For
|
|
example, a read(2) call changes the file position (i.e., filp_pos), so two
|
|
concurrent reads must obtain exclusive access. Consequently, as even read
|
|
operations require exclusive access, filp object don't use three-level locks,
|
|
but only mutexes.
|
|
|
|
System calls that involve a file descriptor often access both the filp and
|
|
the corresponding vnode. The locking order requires us to first lock the
|
|
vnode and then the filp. This is taken care of at the filp level. Whenever
|
|
a filp is locked, a lock on the vnode is acquired first. Conversely, when
|
|
a filp is unlocked, the corresponding vnode is also unlocked. A convenient
|
|
consequence is that whenever a vnode is locked exclusively (VNODE_WRITE),
|
|
all corresponding filps are implicitly locked. This is of particular use
|
|
when multiple filps must be locked at the same time:
|
|
* When opening a named pipe, VFS must make sure that there is at most one filp for the reader end and one filp for the writer end.
|
|
* Pipe readers and writers must be suspended in the absence of (respectively) writers and readers.
|
|
Because both filps are linked to the same vnode object (they are for the same
|
|
pipe), it suffices to exclusively lock that vnode instead of both filp objects.
|
|
|
|
In some cases it can happen that a function that operates on a locked filp,
|
|
calls another function that triggers another lock on a different filp for
|
|
the same vnode. For example, close_filp. At some point, close_filp() calls
|
|
release() which in turn will loop through the filp table looking for pipes
|
|
being select(2)ed on. If there are, the select code will lock the filp and do
|
|
operations on it. This works fine when doing a select(2) call, but conflicts
|
|
with close(2) or exit(2). Lock_filp() makes an exception for this situation;
|
|
if you've already locked a vnode with VNODE_OPCL or VNODE_WRITE when locking
|
|
a filp, you obtain a "soft lock" on the vnode for this filp. This means
|
|
that lock_filp won't actually try to lock the vnode (which wouldn't work),
|
|
but flags the vnode as "skip unlock_vnode upon unlock_filp." Upon unlocking
|
|
the filp, the vnode remains locked, the soft lock is removed, and the filp
|
|
mutex is released. Note that this scheme does not violate the locking order;
|
|
the vnode is (already) locked before the filp.
|
|
|
|
A similar problem arises with create_pipe. In this case we obtain a new vnode
|
|
object, lock it, and obtain two new, locked, filp objects. If everything works
|
|
out and the filp objects are linked to the same vnode, we run into trouble
|
|
when unlocking both filps. The first filp being unlocked would work; the
|
|
second filp doesn't have an associated vnode that's locked anymore. Therefore
|
|
we introduced a plural unlock_filps(filp1, filp2) that can unlock two filps
|
|
that both point to the same vnode.
|
|
|
|
=== Lock characteristics per request type ===
|
|
## 4.8 Lock characteristics per request type
|
|
For File Servers that support concurrent requests, it's useful to know which
|
|
locking guarantees VFS provides for vmnts and vnodes, so it can take that
|
|
into account when protecting internal data structures. READ = TLL_READ,
|
|
READSER = TLL_READSER, WRITE = TLL_WRITE. The vnode locks applies to the
|
|
REQ_INODE_NR field in requests, unless the notes say otherwise.
|
|
{{{
|
|
------------------------------------------------------------------------------
|
|
| request | vmnt | vnode | notes |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_BREAD | | READ | VFS serializes reads from and writes to |
|
|
| | | | block special files |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_BWRITE | | WRITE | VFS serializes reads from and writes to |
|
|
| | | | block special files |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_CHMOD | READ | WRITE | vmnt is only locked if file is not |
|
|
| | | | already opened |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_CHOWN | READ | WRITE | vmnt is only locked if file is not |
|
|
| | | | already opened |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_CREATE | WRITE | WRITE | The directory in which the file is |
|
|
| | | | created is write locked |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_FLUSH | | | Mutually exclusive to REQ_BREAD and |
|
|
| | | | REQ_BWRITE |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_FSTATFS | | | |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_FTRUNC | READ | WRITE | vmnt is only locked if file is not |
|
|
| | | | already opened |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_GETDENTS | READ | READ | vmnt is only locked if file is not |
|
|
| | | | already opened |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_INHIBREAD| | READ | |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_LINK | READSER | WRITE | REQ_INODE_NR is locked READ |
|
|
| | | | REQ_DIR_INO is locked WRITE |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_LOOKUP | READSER | | |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_MKDIR | READSER | WRITE | |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_MKNOD | READSER | WRITE | |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
|REQ_MOUNTPOINT| WRITE | WRITE | |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
|REQ_NEW_DRIVER| | | |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_NEWNODE | | | Only sent to PFS |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_PUTNODE | | READSER | READSER when dropping all but one |
|
|
| | | or WRITE| references. WRITE when final reference |
|
|
| | | | is dropped (i.e., no longer in use) |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_RDLINK | READ | READ | In some circumstances stricter locking |
|
|
| | | | might be applied, but not guaranteed |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_READ | | READ | |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
|REQ_READSUPER | WRITE | | |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_RENAME | WRITE | WRITE | |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_RMDIR | WRITE | WRITE | |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_SLINK | READSER | READ | |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_STAT | READ | READ | vmnt is only locked if file is not |
|
|
| | | | already opened |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_STATVFS | READ | READ | vmnt is only locked if file is not |
|
|
| | | | already opened |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_SYNC | READ | | |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_UNLINK | WRITE | WRITE | |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_UNMOUNT | WRITE | | |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_UTIME | READ | READ | |
|
|
+--------------+---------+---------+-----------------------------------------+
|
|
| REQ_WRITE | | WRITE | |
|
|
-----------------------------------------------------------------------------+
|
|
}}}
|
|
Table 7: VFS-FS requests locking guarantees
|
|
|
|
== Recovery from driver crashes ==
|
|
## 5 Recovery from driver crashes
|
|
VFS can recover from block special file and character special file driver
|
|
crashes. It can recover to some degree from a crashed File Server (which we
|
|
can regard as a driver).
|
|
|
|
=== Recovery from block drivers crashes ===
|
|
## 5.1 Recovery from block drivers crashes
|
|
When reading or writing, VFS doesn't communicate with block drivers directly,
|
|
but always through a File Server (the root File Server being default). If the
|
|
block driver crashes, the File Server does most of the work of the recovery
|
|
procedure. VFS loops through all open files for block special files that
|
|
were handled by this driver and reopens them. After that it sends the new
|
|
endpoint to the File Server so it can finish the recover procedure. Finally,
|
|
the File Server will retry pending requests if possible. However, reopening
|
|
files can cause the block driver to crash again. When that happens, VFS will
|
|
stop the recovery. A driver can return ERESTART to VFS to tell it to retry
|
|
a request. VFS does this with an arbitrary maximum of 5 attempts.
|
|
|
|
=== Recovery from character driver crashes ===
|
|
## 5.2 Recovery from character driver crashes
|
|
Character special files are treated differently. Once VFS has found out a
|
|
driver has been restarted, it will stop the current request (if there is
|
|
any). It makes no sense to retry requests due to the nature of character
|
|
special files. If a character special driver can restart without changing
|
|
endpoints, this merely results in the current request (e.g., read, write, or
|
|
ioctl) failing and allows the user process to reissue the same request. On
|
|
the other hand, if a driver restart causes the driver to change endpoint
|
|
number, all associated file descriptors are marked invalid and subsequent
|
|
operations on them will always fail with a bad file descriptor error.
|
|
|
|
=== Recovery from File Server crashes ===
|
|
## 5.3 Recovery from File Server crashes
|
|
At the time of writing we cannot recover from crashed File Servers. When
|
|
VFS detects it has to clean up the remnants of a File Server process (i.e.,
|
|
through an exit(2)), it marks all associated file descriptors as invalid
|
|
and cancels ongoing and pending requests to that File Server. Resources that
|
|
were in use by the File Server are cleaned up.
|
|
|
|
[0] http://wiki.minix3.org/en/DevelopersGuide/VfsFsProtocol
|
|
|
|
[1] http://www.cs.vu.nl/~dcvmoole/minix/blockchar.txt
|
|
|
|
[2] http://www.minix3.org/theses/moolenbroek-multimedia-support.pdf
|