- the Intel architecture cycle counter (performance counter) does not
count when the CPU is idle therefore we use busy loop instead of
halting the cpu when there is nothing to schedule
- the downside is that handling interrupts may be accounted as idle
time if a sample is taken before we get out of the nested trap and
pick a new process
- contributed by Bjorn Swift
- adds process accounting, for example counting the number of messages
sent, how often the process was preemted and how much time it spent
in the run queue. These statistics, along with the current cpu load,
are sent back to the user-space scheduler in the Out Of Quantum
message.
- the user-space scheduler may choose to make use of these statistics
when making scheduling decisions. For isntance the cpu load becomes
especially useful when scheduling on multiple cores.
- when a process is migrated to a different CPU it may have an active
FPU context in the processor registers. We must save it and migrate
it together with the process.
- this makes sure that each process always run with updated TLB
- this is the simplest way how to achieve the consistency. As it means
significant performace degradation when not require, this is nto the
final solution and will be refined
- APIC timer always reprogrammed if expired
- timer tick never happens when in kernel => never immediate return
from userspace to kernel because of a buffered interrupt
- renamed argument to lapic_set_timer_one_shot()
- removed arch_ prefix from timer functions
- any cpu can use smp_schedule() to tell another cpu to reschedule
- if an AP is idle, it turns off timer as there is nothing to
preempt, no need to wakeup just to go back to sleep again
- if a cpu makes a process runnable on an idle cpu, it must wake it up
to reschedule
- each CPU has its own runqueues
- processes on BSP are put on the runqueues later after a switch to
the final stack when cpuid works to avoid special cases
- enqueue() and dequeue() use the run queues of the cpu the process is
assigned to
- pick_proc() uses the local run queues
- printing of per-CPU run queues ('2') on serial console
- kernel detects CPUs by searching ACPI tables for local apic nodes
- each CPU has its own TSS that points to its own stack. All cpus boot
on the same boot stack (in sequence) but switch to its private stack
as soon as they can.
- final booting code in main() placed in bsp_finish_booting() which is
executed only after the BSP switches to its final stack
- apic functions to send startup interrupts
- assembler functions to handle CPU features not needed for single cpu
mode like memory barries, HT detection etc.
- new files kernel/smp.[ch], kernel/arch/i386/arch_smp.c and
kernel/arch/i386/include/arch_smp.h
- 16-bit trampoline code for the APs. It is executed by each AP after
receiving startup IPIs it brings up the CPUs to 32bit mode and let
them spin in an infinite loop so they don't do any damage.
- implementation of kernel spinlock
- CONFIG_SMP and CONFIG_MAX_CPUS set by the build system
- most global variables carry information which is specific to the
local CPU and each CPU must have its own copy
- cpu local variable must be declared in cpulocal.h between
DECLARE_CPULOCAL_START and DECLARE_CPULOCAL_END markers using
DECLARE_CPULOCAL macro
- to access the cpu local data the provided macros must be used
get_cpu_var(cpu, name)
get_cpu_var_ptr(cpu, name)
get_cpulocal_var(name)
get_cpulocal_var_ptr(name)
- using this macros makes future changes in the implementation
possible
- switching to ELF will make the declaration of cpu local data much
simpler, e.g.
CPULOCAL int blah;
anywhere in the kernel source code
- for better readability xpp is substitued by sender
- makes sure that the dequeued sender has p_q_link == NULL and that
this condition holds when enqueuing the sender again. This is a
sanity check to make sure that the new sender is not enqueued
already
- Before this change the dequeued sender's p_q_link may not be NULL
and it was only set to NULL when enqueued again
- removes p_delivermsg_lin item from the process structure and code
related to it
- as the send part, the receive does not need to use the
PHYS_COPY_CATCH() and umap_local() couple.
- The address space of the target process is installed before
delivermsg() is called.
- unlike the linear address, the virtual address does not change when
paging is turned on nor after fork().
- FPU context is stored only if conflict between 2 FPU users or while
exporting context of a process to userspace while it is the active
user of FPU
- FPU has its owner (fpu_owner) which points to the process whose
state is currently loaded in FPU
- the FPU exception is only turned on when scheduling a process which
is not the owner of FPU
- FPU state is restored for the process that generated the FPU
exception. This process runs immediately without letting scheduler
to pick a new process to resolve the FPU conflict asap, to minimize
the FPU thrashing and FPU exception hadler execution
- faster all non-FPU-exception kernel entries as FPU state is not
checked nor saved
- removed MF_USED_FPU flag, only MF_FPU_INITIALIZED remains to signal
that a process has used FPU in the past
There seems to have been a broken assumption in the fpu context
restoring code. It restores the context of the running process, without
guarantee that the current process is the one that will be scheduled.
This caused fpu saving for a different process to be triggered without
fpu hardware being enabled, causing an fpu exception in the kernel. This
practically only shows up with DEBUG_RACE on. Fix my thruby+me.
The fix
. is to only set the fpu-in-use-by-this-process flag in the
exception handler, and then take care of fpu restoring when
actually returning to userspace
And the patch
. translates fpu saving and restoring to c in arch_system.c,
getting rid of a juicy chunk of assembly
. makes osfxsr_feature private to arch_system.c
. removes most of the arch dependent code from do_sigsend
- removes dependency of do_safecopy() on the m_type field of the kcall
messages.
- instead of do_safecopy() figuring out what action is requested, the
correct safecopy method is called right away.
- Currently the cpu time quantum is timer-ticks based. Thus the
remaining quantum is decreased only if the processes is interrupted
by a timer tick. As processes block a lot this typically does not
happen for normal user processes. Also the quantum depends on the
frequency of the timer.
- This change makes the quantum miliseconds based. Internally the
miliseconds are translated into cpu cycles. Everytime userspace
execution is interrupted by kernel the cycles just consumed by the
current process are deducted from the remaining quantum.
- It makes the quantum system timer frequency independent.
- The boot processes quantum is loosely derived from the tick-based
quantas and 60Hz timer and subject to future change
- the 64bit arithmetics is a little ugly, will be changes once we have
compiler support for 64bit integers (soon)
- this patch only renames schedcheck() to switch_to_user(),
cycles_accounting_stop() to context_stop() and restart() to
+restore_user_context()
- the motivation is that since the introduction of schedcheck() it has
been abused for many things. It deserves a better name. It should
express the fact that from the moment we call the function we are in
the process of switching to user.
- cycles_accounting_stop() was originally a single purpose function.
As this function is called at were convenient places it is used in
for other things too, e.g. (un)locking the kernel. Thus it deserves
a better name too.
- using the old name, restart() does not call schedcheck(), however
calls to restart are replaced by calls to schedcheck()
[switch_to_user] and it calls restart() [restore_user_context]
it does this by
- making all processes interruptible by running out of quantum
- giving all processes a single tick of quantum
- picking a random runnable process instead of in order, and
from a single pool of runnable processes (no priorities)
This together with very high HZ values currently provokes some race conditions
seen earlier only when running with SMP.
- this patch substitutes *xpp for sender to increase readability of
mini_receive().
- makes sure that the dequeued sender has p_q_link == NULL and that
this condition holds when enqueuing the sender again.
- it is a sanity check to make sure that the new sender is not
enqueued already. Before this change the dequeued sender's p_q_link
may not be NULL and it was only set to NULL when enqueued again.
- deadlock() is more verbose in case of a detected deadlock. First, it
lists all processses in the deadlock group. Then it prints the proc
extra info, not only stack trace and register dump
- this is a small addition to the userspace scheduling.
proc_kernel_scheduler() tests whether to use the default scheduling
policy in kernel. It is true if the process' scheduler is NULL _or_
self. Currently none of the tests was complete.
this patch does not add or change any functionality of do_ipc(), it
only makes things a little cleaner (hopefully).
Until now do_ipc() was responsible for handling all ipc calls. The
catch is that SENDA is fairly different which results in some ugly
code like this typecasting and variables naming which does not make
much sense for SENDA and makes the code hard to read.
result = mini_senda(caller_ptr, (asynmsg_t *)m_ptr, (size_t)src_dst_e);
As it is called directly from assembly, the new do_ipc() takes as
input values of 3 registers in reg_t variables (it used to be 4,
however, bit_map wasn't used so I removed it), does the checks common
to all ipc calls and call the appropriate handler either for
do_sync_ipc() (all except SENDA) or mini_senda() (for SENDA) while
typecasting the reg_t values correctly. As a result, handling SENDA
differences in do_sync_ipc() is no more needed. Also the code that
uses msg_size variable is improved a little bit.
arch_do_syscall() is simplified too.
- IPC_FLG_MSG_FROM_KERNEL status flag is returned to userspace if the
receive was satisfied by s message which was sent by the kernel on
behalf of a process. This perfectly reliale information.
- MF_SENDING_FROM_KERNEL flag added to processes to be able to set
IPC_FLG_MSG_FROM_KERNEL when finishing receive if the receiver
wasn't ready to receive immediately.
- PM is changed to use this information to confirm that the scheduling
messages are indeed from the kernel and not faked by a process.
PM uses sef_receive_status()
- get_work() is removed from PM to make the changes simpler
- there are cycles wasted in the IPC call due to a fairly compliacted
way of copying messages from userland to kernel. Sometimes this
complicated way (generic though) is used even for copying within the
kernel address space, sometimes it is used for copying in case _no_
copying is necessary. The goal of this patch is to improve this a
little bit.
- the places where a copy is from user to kernel use the
copy_msg_from_user() kernel-kernel copies are turned into
assignments and BuildNotifyMessage uses the delivery buffers to
avoid copying.
- copy_msg_from_user() was introduced when removing the system task
and is about 2/3 faster then using the current mechanism
(phys_copy). It also avoids the PHYS_COPY_CATCH macro. Assignment is
also faster and no copy is the fastest ;-) so perhaps there will be
some hardly noticable performance gain besides the clean up.
- cotributed by Bjorn Swift
- In this first phase, scheduling is moved from the kernel to the PM
server. The next steps are to a) moving scheduling to its own server
and b) include useful information in the "out of quantum" message,
so that the scheduler can make use of this information.
- The kernel process table now keeps record of who is responsible for
scheduling each process (p_scheduler). When this pointer is NULL,
the process will be scheduled by the kernel. If such a process runs
out of quantum, the kernel will simply renew its quantum an requeue
it.
- When PM loads, it will take over scheduling of all running
processes, except system processes, using sys_schedctl().
Essentially, this only results in taking over init. As children
inherit a scheduler from their parent, user space programs forked by
init will inherit PM (for now) as their scheduler.
- Once a process has been assigned a scheduler, and runs out of
quantum, its RTS_NO_QUANTUM flag will be set and the process
dequeued. The kernel will send a message to the scheduler, on the
process' behalf, informing the scheduler that it has run out of
quantum. The scheduler can take what ever action it pleases, based
on its policy, and then reschedule the process using the
sys_schedule() system call.
- Balance queues does not work as before. While the old in-kernel
function used to renew the quantum of processes in the highest
priority run queue, the user-space implementation only acts on
processes that have been bumped down to a lower priority queue.
This approach reacts slower to changes than the old one, but saves
us sending a sys_schedule message for each process every time we
balance the queues. Currently, when processes are moved up a
priority queue, their quantum is also renewed, but this can be
fiddled with.
- do_nice has been removed from kernel. PM answers to get- and
setpriority calls, updates it's own nice variable as well as the
max_run_queue. This will be refactored once scheduling is moved to a
separate server. We will probably have PM update it's local nice
value and then send a message to whoever is scheduling the process.
- changes to fix an issue in do_fork() where processes could run out
of quantum but bypassing the code path that handles it correctly.
The future plan is to remove the policy from do_fork() and implement
it in userspace too.
Currently a sequence of messages between a sender A and a receiver B of the
form: A.asynsend(M1, B); A.send(M2, B) may result in the receiver receiving
M1 first and then M2 or viceversa. This patch makes sure that the original
order M1, M2 is always preserved.
Note that the order of a hypotetical sequence A.asynsend(M1, B);
A.asynsend(M2, B) is already guaranteed by the implementation of
asynsend by design. Other senda-based wrappers can define their own
semantics.
IPC changes:
- receive() is changed to take an additional parameter, which is a pointer to
a status code.
- The status code is filled in by the kernel to provide additional information
to the caller. For now, the kernel only fills in the IPC call used by the
sender.
Syslib changes:
- sef_receive() has been split into sef_receive() (with the original semantics)
and sef_receive_status() which exposes the status code to userland.
- Ideally, every sys process should gradually switch to sef_receive_status()
and use is_ipc_notify() as a dependable way to check for notify.
- SEF has been modified to use is_ipc_notify() and demonstrate how to use the
new status code.
this change
- makes panic() variadic, doing full printf() formatting -
no more NO_NUM, and no more separate printf() statements
needed to print extra info (or something in hex) before panicing
- unifies panic() - same panic() name and usage for everyone -
vm, kernel and rest have different names/syntax currently
in order to implement their own luxuries, but no longer
- throws out the 1st argument, to make source less noisy.
the panic() in syslib retrieves the server name from the kernel
so it should be clear enough who is panicing; e.g.
panic("sigaction failed: %d", errno);
looks like:
at_wini(73130): panic: sigaction failed: 0
syslib:panic.c: stacktrace: 0x74dc 0x2025 0x100a
- throws out report() - printf() is more convenient and powerful
- harmonizes/fixes the use of panic() - there were a few places
that used printf-style formatting (didn't work) and newlines
(messes up the formatting) in panic()
- throws out a few per-server panic() functions
- cleans up a tie-in of tty with panic()
merging printf() and panic() statements to be done incrementally.
process waiting for" logic, which is duplicated a few times in the
kernel. (For a new feature for top.)
Introducing it and throwing out ESRCDIED and EDSTDIED (replaced by
EDEADSRCDST - so we don't have to care which part of the blocking is
failing in system.c) simplifies some code in the kernel and callers that
check for E{DEADSRCDST,ESRCDIED,EDSTDIED}, but don't care about the
difference, a fair bit, and more significantly doesn't duplicate the
'blocked-on' logic.