- ack assumes that the direction flag in eflags is clear when
assigning two structures. It is implemented by a call to a built-in
function which is like memcpy but needs the flag to be clear
otherwise rubish is copied. This patch fixes the kernel entries.
- When the cpu halts, the interrupts are enable so the cpu may be
woken up. When the interrupt handler returns but another interrupt
is available it is also serviced immediately. This is not a problem
per-se. It only slightly breaks time accounting as idle accounted is
for the kernel time in the interrupt handler.
- As the big kernel lock is lock/unlocked in the smp branch in the
time acounting functions as they are called exactly at the places
we need to take the lock) this leads to a deadlock.
- we make sure that once the interrupt handler returns from the nested
trap, the interrupts are disabled. This means that only one
interrupt is serviced after idle is interrupted.
- this requires the loop in apic timer calibration to keep reenabling
the interrupts. I admit it is a little bit hackish (one line),
however, this code is a stupid corner case at the boot time.
Hopefully it does not matter too much.
- before enabling paging VM asks kernel to resize its segments. This
may cause kernel to segfault if APIC is used and an interrupt
happens between this and paging enabled. As these are 2 separate
vmctl calls it is not atomic. This patch fixes this problem. VM does
not ask kernel to resize the segments in a separate call anymore.
The new segments limit is part of the "enable paging" call. It
generalizes this call in such a way that more information can be
passed as need be or the information may be completely different if
another architecture requires this.
- if an exception occurs in kernel and this exception is not handled
in an sane way and the kernel crashes, it also dumps what was loaded
in the general purpose registers exactly at the time of the
exception to help to debug the problem
the kernel. They are not used atm, but having them in trunk allows them
to be easily used when needed. To set a breakpoint that triggers when
the variable foo is written to (the most common use case), one calls:
breakpoint_set(vir2phys((vir_bytes) &foo), 0,
BREAKPOINT_FLAG_MODE_GLOBAL |
BREAKPOINT_FLAG_RW_WRITE |
BREAKPOINT_FLAG_LEN_4);
It can later be disabled using:
breakpoint_set(vir2phys((vir_bytes) &foo), 0,
BREAKPOINT_FLAG_MODE_OFF);
There are some limitations:
- There are at most four breakpoints (hardware limit); the index of the
breakpoint (0-3) is specified as the second parameter of
breakpoint_set.
- The breakpoint exception in the kernel is not handled and causes a
panic; it would be reasonably easy to change this by inspecing DR6,
printing a message, disabling the breakpoint and continuing. However,
in my experience even just a panic can be very useful.
- Breakpoints can be set only in the part of the address space that is
in every page table. It is useful for the kernel, but to use this for
user processes would require saving and restoring the debug registers
as part of the context switch. Although the CPU provides support for
local breakpoints (I implemened this as BREAKPOINT_FLAG_LOCAL) they
only work if task switching is used.
forget about the dirtypde bitmap and WIPEPDE/DONEPDE macros too.
check if mapping happens to already be in place, and if so, don't
reload cr3 (on the account of that mapping, that is).
don't reload cr3 unconditionally.
UPDATING INFO:
20100317:
/usr/src/etc/system.conf updated to ignore default kernel calls: copy
it (or merge it) to /etc/system.conf.
The hello driver (/dev/hello) added to the distribution:
# cd /usr/src/commands/scripts && make clean install
# cd /dev && MAKEDEV hello
KERNEL CHANGES:
- Generic signal handling support. The kernel no longer assumes PM as a signal
manager for every process. The signal manager of a given process can now be
specified in its privilege slot. When a signal has to be delivered, the kernel
performs the lookup and forwards the signal to the appropriate signal manager.
PM is the default signal manager for user processes, RS is the default signal
manager for system processes. To enable ptrace()ing for system processes, it
is sufficient to change the default signal manager to PM. This will temporarily
disable crash recovery, though.
- sys_exit() is now split into sys_exit() (i.e. exit() for system processes,
which generates a self-termination signal), and sys_clear() (i.e. used by PM
to ask the kernel to clear a process slot when a process exits).
- Added a new kernel call (i.e. sys_update()) to swap two process slots and
implement live update.
PM CHANGES:
- Posix signal handling is no longer allowed for system processes. System
signals are split into two fixed categories: termination and non-termination
signals. When a non-termination signaled is processed, PM transforms the signal
into an IPC message and delivers the message to the system process. When a
termination signal is processed, PM terminates the process.
- PM no longer assumes itself as the signal manager for system processes. It now
makes sure that every system signal goes through the kernel before being
actually processes. The kernel will then dispatch the signal to the appropriate
signal manager which may or may not be PM.
SYSLIB CHANGES:
- Simplified SEF init and LU callbacks.
- Added additional predefined SEF callbacks to debug crash recovery and
live update.
- Fixed a temporary ack in the SEF init protocol. SEF init reply is now
completely synchronous.
- Added SEF signal event type to provide a uniform interface for system
processes to deal with signals. A sef_cb_signal_handler() callback is
available for system processes to handle every received signal. A
sef_cb_signal_manager() callback is used by signal managers to process
system signals on behalf of the kernel.
- Fixed a few bugs with memory mapping and DS.
VM CHANGES:
- Page faults and memory requests coming from the kernel are now implemented
using signals.
- Added a new VM call to swap two process slots and implement live update.
- The call is used by RS at update time and in turn invokes the kernel call
sys_update().
RS CHANGES:
- RS has been reworked with a better functional decomposition.
- Better kernel call masks. com.h now defines the set of very basic kernel calls
every system service is allowed to use. This makes system.conf simpler and
easier to maintain. In addition, this guarantees a higher level of isolation
for system libraries that use one or more kernel calls internally (e.g. printf).
- RS is the default signal manager for system processes. By default, RS
intercepts every signal delivered to every system process. This makes crash
recovery possible before bringing PM and friends in the loop.
- RS now supports fast rollback when something goes wrong while initializing
the new version during a live update.
- Live update is now implemented by keeping the two versions side-by-side and
swapping the process slots when the old version is ready to update.
- Crash recovery is now implemented by keeping the two versions side-by-side
and cleaning up the old version only when the recovery process is complete.
DS CHANGES:
- Fixed a bug when the process doing ds_publish() or ds_delete() is not known
by DS.
- Fixed the completely broken support for strings. String publishing is now
implemented in the system library and simply wraps publishing of memory ranges.
Ideally, we should adopt a similar approach for other data types as well.
- Test suite fixed.
DRIVER CHANGES:
- The hello driver has been added to the Minix distribution to demonstrate basic
live update and crash recovery functionalities.
- Other drivers have been adapted to conform the new SEF interface.
Move archtypes.h to include/ dir, since several servers require it. Move
fpu.h and stackframe.h to arch-specific header directory. Make source
files and makefiles aware of the new header locations.
-Convert the include directory over to using bsdmake
syntax
-Update/add mkfiles
-Modify install(1) so that it can create symlinks
-Update makefiles to use new install(1) options
-Rename /usr/include/ibm to /usr/include/i386
-Create /usr/include/machine symlink to arch header files
-Move vm_i386.h to its new home in the /usr/include/i386
-Update source files to #include the header files at their
new homes.
-Add new gnu-includes target for building GCC headers
this change
- makes panic() variadic, doing full printf() formatting -
no more NO_NUM, and no more separate printf() statements
needed to print extra info (or something in hex) before panicing
- unifies panic() - same panic() name and usage for everyone -
vm, kernel and rest have different names/syntax currently
in order to implement their own luxuries, but no longer
- throws out the 1st argument, to make source less noisy.
the panic() in syslib retrieves the server name from the kernel
so it should be clear enough who is panicing; e.g.
panic("sigaction failed: %d", errno);
looks like:
at_wini(73130): panic: sigaction failed: 0
syslib:panic.c: stacktrace: 0x74dc 0x2025 0x100a
- throws out report() - printf() is more convenient and powerful
- harmonizes/fixes the use of panic() - there were a few places
that used printf-style formatting (didn't work) and newlines
(messes up the formatting) in panic()
- throws out a few per-server panic() functions
- cleans up a tie-in of tty with panic()
merging printf() and panic() statements to be done incrementally.
process waiting for" logic, which is duplicated a few times in the
kernel. (For a new feature for top.)
Introducing it and throwing out ESRCDIED and EDSTDIED (replaced by
EDEADSRCDST - so we don't have to care which part of the blocking is
failing in system.c) simplifies some code in the kernel and callers that
check for E{DEADSRCDST,ESRCDIED,EDSTDIED}, but don't care about the
difference, a fair bit, and more significantly doesn't duplicate the
'blocked-on' logic.
- as thre are still KERNEL and IDLE entries, time accounting for
kernel and idle time works the same as for any other process
- everytime we stop accounting for the currently running process,
kernel or idle, we read the TSC counter and increment the p_cycles
entry.
- the process cycles inherently include some of the kernel cycles as
we can stop accounting for the process only after we save its
context and we start accounting just before we restore its context
- this assumes that the system does not scale the CPU frequency which
will be true for ... long time ;-)
- there are no tasks running, we don't need TASK_PRIVILEGE priviledge anymore
- as there is no ring 1 anymore, there is no need for level0() to call sensitive
code from ring 1 in ring 0
- 286 related macros removed as clean up
* Userspace change to use the new kernel calls
- _taskcall(SYSTASK...) changed to _kernel_call(...)
- int 32 reused for the kernel calls
- _do_kernel_call() to make the trap to kernel
- kernel_call() to make the actuall kernel call from C using
_do_kernel_call()
- unlike ipc call the kernel call always succeeds as kernel is
always available, however, kernel may return an error
* Kernel side implementation of kernel calls
- the SYSTEm task does not run, only the proc table entry is
preserved
- every data_copy(SYSTEM is no data_copy(KERNEL
- "locking" is an empty operation now as everything runs in
kernel
- sys_task() is replaced by kernel_call() which copies the
message into kernel, dispatches the call to its handler and
finishes by either copying the results back to userspace (if
need be) or by suspending the process because of VM
- suspended processes are later made runnable once the memory
issue is resolved, picked up by the scheduler and only at
this time the call is resumed (in fact restarted) which does
not need to copy the message from userspace as the message
is already saved in the process structure.
- no ned for the vmrestart queue, the scheduler will restart
the system calls
- no special case in do_vmctl(), all requests remove the
RTS_VMREQUEST flag
- copies a mesage from/to userspace without need of translating
addresses
- the assumption is that the address space is installed, i.e. ldt and
cr3 are loaded correctly
- if a pagefault or a general protection occurs while copying from
userland to kernel (or vice versa) and error is returned which gives
the caller a chance to respond in a proper way
- error happens _only_ because of a wrong user pointer if the function
is used correctly
- if the prerequisites of the function do no hold, the function will
most likely fail as the user address becomes random
- switch_address_space() implements a switch of the user address space
for the destination process
- this makes memory of this process easily accessible, e.g. a pointer
valid in the userspace can be used with a little complexity to
access the process's memory
- the switch does not happed only just before we return to userspace,
however, it happens right after we know which process we are going
to schedule. This happens before we start processing the misc flags
of this process so its memory is available
- if the process becomes not runnable while processing the mics flags
we pick a new process and we switch the address space again which
introduces possibly a little bit more overhead, however, it is
hopefully hidden by reducing the overheads when we actually access
the memory
- the syscalls are pretty much just ipc calls, however, sendrec() is
used to implement system task (sys) calls
- sendrec() won't be used anymore for this, therefore ipc calls will
become pure ipc calls
kernel (sys task). The main reason is that these would have to become
cpu local variables on SMP. Once the system task is not a task but a
genuine part of the kernel there is even less reason to have these
extra variables as proc_ptr will already contain all neccessary
information. In addition converting who_e to the process pointer and
back again all the time will be avoided.
Although proc_ptr will contain all important information, accessing it
as a cpu local variable will be fairly expensive, hence the value
would be assigned to some on stack local variable. Therefore it is
better to add the 'caller' argument to the syscall handlers to pass
the value on stack anyway. It also clearly denotes on who's behalf is
the syscall being executed.
This patch also ANSIfies the syscall function headers.
Last but not least, it also fixes a potential bug in virtual_copy_f()
in case the check is disabled. So far the function in case of a
failure could possible reuse an old who_p in case this function had
not been called from the system task.
virtual_copy_f() takes the caller as a parameter too. In case the
checking is disabled, the caller must be NULL and non NULL if it is
enabled as we must be able to suspend the caller.
Some cases were fixed by declaring the function void, others were fixed
by adding a return <value> statement, thereby avoiding potentially
incorrect behavior (usually in error handling).
Some enum correctness in boot.c.
There is not that much use for it on a single CPU, however, deadlock
between kernel and system task can be delected. Or a runaway loop.
If a kernel gets locked up the timer interrupts don't occure (as all
interrupts are disabled in kernel mode). The only chance is to
interrupt the kernel by a non-maskable interrupt.
This patch generates NMIs using performance counters. It uses the most
widely available performace counters. As the performance counters are
highly model-specific this patch is not guaranteed to work on every
machine. Unfortunately this is also true for KVM :-/ On the other
hand adding this feature for other models is not extremely difficult
and the framework makes it hopefully easy enough.
Depending on the frequency of the CPU an NMI is generated at most
about every 0.5s If the cpu's speed is less then 2Ghz it is generated
at most every 1s. In general an NMI is generated much less often as
the performance counter counts down only if the cpu is not idle.
Therefore the overhead of this feature is fairly minimal even if the
load is high.
Uppon detecting that the kernel is locked up the kernel dumps the
state of the kernel registers and panics.
Local APIC must be enabled for the watchdog to work.
The code is _always_ compiled in, however, it is only enabled if
watchdog=<non-zero> is set in the boot monitor.
One corner case is serial console debugging. As dumping a lot of stuff
to the serial link may take a lot of time, the watchdog does not
detect lockups during this time!!! as it would result in too many
false positives. 10 nmi have to be handled before the lockup is
detected. This means something between ~5s to 10s.
Another corner case is that the watchdog is enabled only after the
paging is enabled as it would be pure madness to try to get it right.
Main changes:
- COW optimization for safecopy.
- safemap, a grant-based interface for sharing memory regions between processes.
- Integration with safemap and complete rework of DS, supporting new data types
natively (labels, memory ranges, memory mapped ranges).
- For further information:
http://wiki.minix3.org/en/SummerOfCode2009/MemoryGrants
Additional changes not included in the original Wu's branch:
- Fixed unhandled case in VM when using COW optimization for safecopy in case
of a block that has already been shared as SMAP.
- Better interface and naming scheme for sys_saferevmap and ds_retrieve_map
calls.
- Better input checking in syslib: check for page alignment when creating
memory mapping grants.
- DS notifies subscribers when an entry is deleted.
- Documented the behavior of indirect grants in case of memory mapping.
- Test suite in /usr/src/test/safeperf|safecopy|safemap|ds/* reworked
and extended.
- Minor fixes and general cleanup.
- TO-DO: Grant ids should be generated and managed the way endpoints are to make
sure grant slots are never misreused.
- if debugging on serial console is enabled typing Q kills the system. It is
handy if the system gets locked up and the timer interrupts still work. Good
for remote debugging.
- NOT_REACHABLE reintroduced and fixed. It should be used for marking code which
is not reachable because the previous code _should_ not return. Such places
are not always obvious