- each CPU has its own runqueues
- processes on BSP are put on the runqueues later after a switch to
the final stack when cpuid works to avoid special cases
- enqueue() and dequeue() use the run queues of the cpu the process is
assigned to
- pick_proc() uses the local run queues
- printing of per-CPU run queues ('2') on serial console
- APs configure local timers
- while configuring local APIC timer the CPUs fiddle with the interrupt
handlers. As the interrupt table is shared the BSP must not run
- APs wait until BSP turns paging on, it is not possible to safely
execute any code on APs until we can turn paging on as well as it
must be done synchronously everywhere
- APs turn paging on but do not continue and wait
- to isolate execution inside kernel we use a big kernel lock
implemented as a spinlock
- the lock is acquired asap after entering kernel mode and released as
late as possible. Only one CPU as a time can execute the core kernel
code
- measurement son real hw show that the overhead of this lock is close
to 0% of kernel time for the currnet system
- the overhead of this lock may be as high as 45% of kernel time in
virtual machines depending on the ratio between physical CPUs
available and emulated CPUs. The performance degradation is
significant
- kernel detects CPUs by searching ACPI tables for local apic nodes
- each CPU has its own TSS that points to its own stack. All cpus boot
on the same boot stack (in sequence) but switch to its private stack
as soon as they can.
- final booting code in main() placed in bsp_finish_booting() which is
executed only after the BSP switches to its final stack
- apic functions to send startup interrupts
- assembler functions to handle CPU features not needed for single cpu
mode like memory barries, HT detection etc.
- new files kernel/smp.[ch], kernel/arch/i386/arch_smp.c and
kernel/arch/i386/include/arch_smp.h
- 16-bit trampoline code for the APs. It is executed by each AP after
receiving startup IPIs it brings up the CPUs to 32bit mode and let
them spin in an infinite loop so they don't do any damage.
- implementation of kernel spinlock
- CONFIG_SMP and CONFIG_MAX_CPUS set by the build system
- most global variables carry information which is specific to the
local CPU and each CPU must have its own copy
- cpu local variable must be declared in cpulocal.h between
DECLARE_CPULOCAL_START and DECLARE_CPULOCAL_END markers using
DECLARE_CPULOCAL macro
- to access the cpu local data the provided macros must be used
get_cpu_var(cpu, name)
get_cpu_var_ptr(cpu, name)
get_cpulocal_var(name)
get_cpulocal_var_ptr(name)
- using this macros makes future changes in the implementation
possible
- switching to ELF will make the declaration of cpu local data much
simpler, e.g.
CPULOCAL int blah;
anywhere in the kernel source code
- kernel turns on IO APICs if no_apic is _not_ set or is equal 0
- pci driver must use the acpi driver to setup IRQ routing otherwise
the system cannot work correctly except systems like KVM that use
only legacy (E)ISA IRQs 0-15
- kernel exports DSDP (the root pointer where ACPI parsing starts) and
apic_enabled in the machine structure.
- ACPI driver uses DSDP to locate ACPI in memory. acpi_enabled tell
PCI driver to query ACPI for IRQ routing information.
- the ability for kernel to use ACPI tables to detect IO APICs. It is
the bare minimum the kernel needs to know about ACPI tables.
- it will be used to find out about processors as the MPS tables are
deprecated by ACPI and not all vendorsprovide them.
- removes p_delivermsg_lin item from the process structure and code
related to it
- as the send part, the receive does not need to use the
PHYS_COPY_CATCH() and umap_local() couple.
- The address space of the target process is installed before
delivermsg() is called.
- unlike the linear address, the virtual address does not change when
paging is turned on nor after fork().
- FPU context is stored only if conflict between 2 FPU users or while
exporting context of a process to userspace while it is the active
user of FPU
- FPU has its owner (fpu_owner) which points to the process whose
state is currently loaded in FPU
- the FPU exception is only turned on when scheduling a process which
is not the owner of FPU
- FPU state is restored for the process that generated the FPU
exception. This process runs immediately without letting scheduler
to pick a new process to resolve the FPU conflict asap, to minimize
the FPU thrashing and FPU exception hadler execution
- faster all non-FPU-exception kernel entries as FPU state is not
checked nor saved
- removed MF_USED_FPU flag, only MF_FPU_INITIALIZED remains to signal
that a process has used FPU in the past
There seems to have been a broken assumption in the fpu context
restoring code. It restores the context of the running process, without
guarantee that the current process is the one that will be scheduled.
This caused fpu saving for a different process to be triggered without
fpu hardware being enabled, causing an fpu exception in the kernel. This
practically only shows up with DEBUG_RACE on. Fix my thruby+me.
The fix
. is to only set the fpu-in-use-by-this-process flag in the
exception handler, and then take care of fpu restoring when
actually returning to userspace
And the patch
. translates fpu saving and restoring to c in arch_system.c,
getting rid of a juicy chunk of assembly
. makes osfxsr_feature private to arch_system.c
. removes most of the arch dependent code from do_sigsend
- Currently the cpu time quantum is timer-ticks based. Thus the
remaining quantum is decreased only if the processes is interrupted
by a timer tick. As processes block a lot this typically does not
happen for normal user processes. Also the quantum depends on the
frequency of the timer.
- This change makes the quantum miliseconds based. Internally the
miliseconds are translated into cpu cycles. Everytime userspace
execution is interrupted by kernel the cycles just consumed by the
current process are deducted from the remaining quantum.
- It makes the quantum system timer frequency independent.
- The boot processes quantum is loosely derived from the tick-based
quantas and 60Hz timer and subject to future change
- the 64bit arithmetics is a little ugly, will be changes once we have
compiler support for 64bit integers (soon)
-Makefile updates
-Update mkdep
-Build fixes/warning cleanups for some programs
-Restore leading underscores on global syms in kernel asm files
-Increase ramdisk size
ask to map in oxpcie i/o memory and support serial i/o for it in the
kernel. set oxpcie=<address> in boot monitor (retrieve address using
pci_debug=1 output). (no sanity checking is done on the address
currently.) disabled by default.
The change also contains some other minor cleanup (a new serial.h to set
register info common to UART and the OXPCIe card, in-kernel memory
mapping a little more structured and env_get() to get sysenv variables
without knowing about the params_buffer).
- this patch only renames schedcheck() to switch_to_user(),
cycles_accounting_stop() to context_stop() and restart() to
+restore_user_context()
- the motivation is that since the introduction of schedcheck() it has
been abused for many things. It deserves a better name. It should
express the fact that from the moment we call the function we are in
the process of switching to user.
- cycles_accounting_stop() was originally a single purpose function.
As this function is called at were convenient places it is used in
for other things too, e.g. (un)locking the kernel. Thus it deserves
a better name too.
- using the old name, restart() does not call schedcheck(), however
calls to restart are replaced by calls to schedcheck()
[switch_to_user] and it calls restart() [restore_user_context]
- this patch moves the former printslot() from arch_system.c to
debug.c and reimplements it slightly. The output is not changed,
however, the process information is printed in a separate function
print_proc() in debug.c as such a function is also handy in other
situations and should be publicly available when debugging.
this patch changes the way pagefaults are delivered to VM. It adopts
the same model as the out-of-quantum messages sent by kernel to a
scheduler.
- everytime a userspace pagefault occurs, kernel creates a message
which is sent to VM on behalf of the faulting process
- the process is blocked on delivery to VM in the standard IPC code
instead of waiting in a spacial in-kernel queue (stack) and is not
runnable until VM tell kernel that the pagefault is resolved and is
free to clear the RTS_PAGEFAULT flag.
- VM does not need call kernel and poll the pagefault information
which saves many (1/2?) calls and kernel calls that return "no more
data"
- VM notification by kernel does not need to use signals
- each entry in proc table is by 12 bytes smaller (~3k save)
this patch does not add or change any functionality of do_ipc(), it
only makes things a little cleaner (hopefully).
Until now do_ipc() was responsible for handling all ipc calls. The
catch is that SENDA is fairly different which results in some ugly
code like this typecasting and variables naming which does not make
much sense for SENDA and makes the code hard to read.
result = mini_senda(caller_ptr, (asynmsg_t *)m_ptr, (size_t)src_dst_e);
As it is called directly from assembly, the new do_ipc() takes as
input values of 3 registers in reg_t variables (it used to be 4,
however, bit_map wasn't used so I removed it), does the checks common
to all ipc calls and call the appropriate handler either for
do_sync_ipc() (all except SENDA) or mini_senda() (for SENDA) while
typecasting the reg_t values correctly. As a result, handling SENDA
differences in do_sync_ipc() is no more needed. Also the code that
uses msg_size variable is improved a little bit.
arch_do_syscall() is simplified too.
- cotributed by Bjorn Swift
- In this first phase, scheduling is moved from the kernel to the PM
server. The next steps are to a) moving scheduling to its own server
and b) include useful information in the "out of quantum" message,
so that the scheduler can make use of this information.
- The kernel process table now keeps record of who is responsible for
scheduling each process (p_scheduler). When this pointer is NULL,
the process will be scheduled by the kernel. If such a process runs
out of quantum, the kernel will simply renew its quantum an requeue
it.
- When PM loads, it will take over scheduling of all running
processes, except system processes, using sys_schedctl().
Essentially, this only results in taking over init. As children
inherit a scheduler from their parent, user space programs forked by
init will inherit PM (for now) as their scheduler.
- Once a process has been assigned a scheduler, and runs out of
quantum, its RTS_NO_QUANTUM flag will be set and the process
dequeued. The kernel will send a message to the scheduler, on the
process' behalf, informing the scheduler that it has run out of
quantum. The scheduler can take what ever action it pleases, based
on its policy, and then reschedule the process using the
sys_schedule() system call.
- Balance queues does not work as before. While the old in-kernel
function used to renew the quantum of processes in the highest
priority run queue, the user-space implementation only acts on
processes that have been bumped down to a lower priority queue.
This approach reacts slower to changes than the old one, but saves
us sending a sys_schedule message for each process every time we
balance the queues. Currently, when processes are moved up a
priority queue, their quantum is also renewed, but this can be
fiddled with.
- do_nice has been removed from kernel. PM answers to get- and
setpriority calls, updates it's own nice variable as well as the
max_run_queue. This will be refactored once scheduling is moved to a
separate server. We will probably have PM update it's local nice
value and then send a message to whoever is scheduling the process.
- changes to fix an issue in do_fork() where processes could run out
of quantum but bypassing the code path that handles it correctly.
The future plan is to remove the policy from do_fork() and implement
it in userspace too.
- ack assumes that the direction flag in eflags is clear when
assigning two structures. It is implemented by a call to a built-in
function which is like memcpy but needs the flag to be clear
otherwise rubish is copied. This patch fixes the kernel entries.
- When the cpu halts, the interrupts are enable so the cpu may be
woken up. When the interrupt handler returns but another interrupt
is available it is also serviced immediately. This is not a problem
per-se. It only slightly breaks time accounting as idle accounted is
for the kernel time in the interrupt handler.
- As the big kernel lock is lock/unlocked in the smp branch in the
time acounting functions as they are called exactly at the places
we need to take the lock) this leads to a deadlock.
- we make sure that once the interrupt handler returns from the nested
trap, the interrupts are disabled. This means that only one
interrupt is serviced after idle is interrupted.
- this requires the loop in apic timer calibration to keep reenabling
the interrupts. I admit it is a little bit hackish (one line),
however, this code is a stupid corner case at the boot time.
Hopefully it does not matter too much.
- before enabling paging VM asks kernel to resize its segments. This
may cause kernel to segfault if APIC is used and an interrupt
happens between this and paging enabled. As these are 2 separate
vmctl calls it is not atomic. This patch fixes this problem. VM does
not ask kernel to resize the segments in a separate call anymore.
The new segments limit is part of the "enable paging" call. It
generalizes this call in such a way that more information can be
passed as need be or the information may be completely different if
another architecture requires this.
- if an exception occurs in kernel and this exception is not handled
in an sane way and the kernel crashes, it also dumps what was loaded
in the general purpose registers exactly at the time of the
exception to help to debug the problem
the kernel. They are not used atm, but having them in trunk allows them
to be easily used when needed. To set a breakpoint that triggers when
the variable foo is written to (the most common use case), one calls:
breakpoint_set(vir2phys((vir_bytes) &foo), 0,
BREAKPOINT_FLAG_MODE_GLOBAL |
BREAKPOINT_FLAG_RW_WRITE |
BREAKPOINT_FLAG_LEN_4);
It can later be disabled using:
breakpoint_set(vir2phys((vir_bytes) &foo), 0,
BREAKPOINT_FLAG_MODE_OFF);
There are some limitations:
- There are at most four breakpoints (hardware limit); the index of the
breakpoint (0-3) is specified as the second parameter of
breakpoint_set.
- The breakpoint exception in the kernel is not handled and causes a
panic; it would be reasonably easy to change this by inspecing DR6,
printing a message, disabling the breakpoint and continuing. However,
in my experience even just a panic can be very useful.
- Breakpoints can be set only in the part of the address space that is
in every page table. It is useful for the kernel, but to use this for
user processes would require saving and restoring the debug registers
as part of the context switch. Although the CPU provides support for
local breakpoints (I implemened this as BREAKPOINT_FLAG_LOCAL) they
only work if task switching is used.
forget about the dirtypde bitmap and WIPEPDE/DONEPDE macros too.
check if mapping happens to already be in place, and if so, don't
reload cr3 (on the account of that mapping, that is).
don't reload cr3 unconditionally.
UPDATING INFO:
20100317:
/usr/src/etc/system.conf updated to ignore default kernel calls: copy
it (or merge it) to /etc/system.conf.
The hello driver (/dev/hello) added to the distribution:
# cd /usr/src/commands/scripts && make clean install
# cd /dev && MAKEDEV hello
KERNEL CHANGES:
- Generic signal handling support. The kernel no longer assumes PM as a signal
manager for every process. The signal manager of a given process can now be
specified in its privilege slot. When a signal has to be delivered, the kernel
performs the lookup and forwards the signal to the appropriate signal manager.
PM is the default signal manager for user processes, RS is the default signal
manager for system processes. To enable ptrace()ing for system processes, it
is sufficient to change the default signal manager to PM. This will temporarily
disable crash recovery, though.
- sys_exit() is now split into sys_exit() (i.e. exit() for system processes,
which generates a self-termination signal), and sys_clear() (i.e. used by PM
to ask the kernel to clear a process slot when a process exits).
- Added a new kernel call (i.e. sys_update()) to swap two process slots and
implement live update.
PM CHANGES:
- Posix signal handling is no longer allowed for system processes. System
signals are split into two fixed categories: termination and non-termination
signals. When a non-termination signaled is processed, PM transforms the signal
into an IPC message and delivers the message to the system process. When a
termination signal is processed, PM terminates the process.
- PM no longer assumes itself as the signal manager for system processes. It now
makes sure that every system signal goes through the kernel before being
actually processes. The kernel will then dispatch the signal to the appropriate
signal manager which may or may not be PM.
SYSLIB CHANGES:
- Simplified SEF init and LU callbacks.
- Added additional predefined SEF callbacks to debug crash recovery and
live update.
- Fixed a temporary ack in the SEF init protocol. SEF init reply is now
completely synchronous.
- Added SEF signal event type to provide a uniform interface for system
processes to deal with signals. A sef_cb_signal_handler() callback is
available for system processes to handle every received signal. A
sef_cb_signal_manager() callback is used by signal managers to process
system signals on behalf of the kernel.
- Fixed a few bugs with memory mapping and DS.
VM CHANGES:
- Page faults and memory requests coming from the kernel are now implemented
using signals.
- Added a new VM call to swap two process slots and implement live update.
- The call is used by RS at update time and in turn invokes the kernel call
sys_update().
RS CHANGES:
- RS has been reworked with a better functional decomposition.
- Better kernel call masks. com.h now defines the set of very basic kernel calls
every system service is allowed to use. This makes system.conf simpler and
easier to maintain. In addition, this guarantees a higher level of isolation
for system libraries that use one or more kernel calls internally (e.g. printf).
- RS is the default signal manager for system processes. By default, RS
intercepts every signal delivered to every system process. This makes crash
recovery possible before bringing PM and friends in the loop.
- RS now supports fast rollback when something goes wrong while initializing
the new version during a live update.
- Live update is now implemented by keeping the two versions side-by-side and
swapping the process slots when the old version is ready to update.
- Crash recovery is now implemented by keeping the two versions side-by-side
and cleaning up the old version only when the recovery process is complete.
DS CHANGES:
- Fixed a bug when the process doing ds_publish() or ds_delete() is not known
by DS.
- Fixed the completely broken support for strings. String publishing is now
implemented in the system library and simply wraps publishing of memory ranges.
Ideally, we should adopt a similar approach for other data types as well.
- Test suite fixed.
DRIVER CHANGES:
- The hello driver has been added to the Minix distribution to demonstrate basic
live update and crash recovery functionalities.
- Other drivers have been adapted to conform the new SEF interface.
Move archtypes.h to include/ dir, since several servers require it. Move
fpu.h and stackframe.h to arch-specific header directory. Make source
files and makefiles aware of the new header locations.
-Convert the include directory over to using bsdmake
syntax
-Update/add mkfiles
-Modify install(1) so that it can create symlinks
-Update makefiles to use new install(1) options
-Rename /usr/include/ibm to /usr/include/i386
-Create /usr/include/machine symlink to arch header files
-Move vm_i386.h to its new home in the /usr/include/i386
-Update source files to #include the header files at their
new homes.
-Add new gnu-includes target for building GCC headers
this change
- makes panic() variadic, doing full printf() formatting -
no more NO_NUM, and no more separate printf() statements
needed to print extra info (or something in hex) before panicing
- unifies panic() - same panic() name and usage for everyone -
vm, kernel and rest have different names/syntax currently
in order to implement their own luxuries, but no longer
- throws out the 1st argument, to make source less noisy.
the panic() in syslib retrieves the server name from the kernel
so it should be clear enough who is panicing; e.g.
panic("sigaction failed: %d", errno);
looks like:
at_wini(73130): panic: sigaction failed: 0
syslib:panic.c: stacktrace: 0x74dc 0x2025 0x100a
- throws out report() - printf() is more convenient and powerful
- harmonizes/fixes the use of panic() - there were a few places
that used printf-style formatting (didn't work) and newlines
(messes up the formatting) in panic()
- throws out a few per-server panic() functions
- cleans up a tie-in of tty with panic()
merging printf() and panic() statements to be done incrementally.
process waiting for" logic, which is duplicated a few times in the
kernel. (For a new feature for top.)
Introducing it and throwing out ESRCDIED and EDSTDIED (replaced by
EDEADSRCDST - so we don't have to care which part of the blocking is
failing in system.c) simplifies some code in the kernel and callers that
check for E{DEADSRCDST,ESRCDIED,EDSTDIED}, but don't care about the
difference, a fair bit, and more significantly doesn't duplicate the
'blocked-on' logic.