- removes p_delivermsg_lin item from the process structure and code
related to it
- as the send part, the receive does not need to use the
PHYS_COPY_CATCH() and umap_local() couple.
- The address space of the target process is installed before
delivermsg() is called.
- unlike the linear address, the virtual address does not change when
paging is turned on nor after fork().
- FPU context is stored only if conflict between 2 FPU users or while
exporting context of a process to userspace while it is the active
user of FPU
- FPU has its owner (fpu_owner) which points to the process whose
state is currently loaded in FPU
- the FPU exception is only turned on when scheduling a process which
is not the owner of FPU
- FPU state is restored for the process that generated the FPU
exception. This process runs immediately without letting scheduler
to pick a new process to resolve the FPU conflict asap, to minimize
the FPU thrashing and FPU exception hadler execution
- faster all non-FPU-exception kernel entries as FPU state is not
checked nor saved
- removed MF_USED_FPU flag, only MF_FPU_INITIALIZED remains to signal
that a process has used FPU in the past
There seems to have been a broken assumption in the fpu context
restoring code. It restores the context of the running process, without
guarantee that the current process is the one that will be scheduled.
This caused fpu saving for a different process to be triggered without
fpu hardware being enabled, causing an fpu exception in the kernel. This
practically only shows up with DEBUG_RACE on. Fix my thruby+me.
The fix
. is to only set the fpu-in-use-by-this-process flag in the
exception handler, and then take care of fpu restoring when
actually returning to userspace
And the patch
. translates fpu saving and restoring to c in arch_system.c,
getting rid of a juicy chunk of assembly
. makes osfxsr_feature private to arch_system.c
. removes most of the arch dependent code from do_sigsend
- Currently the cpu time quantum is timer-ticks based. Thus the
remaining quantum is decreased only if the processes is interrupted
by a timer tick. As processes block a lot this typically does not
happen for normal user processes. Also the quantum depends on the
frequency of the timer.
- This change makes the quantum miliseconds based. Internally the
miliseconds are translated into cpu cycles. Everytime userspace
execution is interrupted by kernel the cycles just consumed by the
current process are deducted from the remaining quantum.
- It makes the quantum system timer frequency independent.
- The boot processes quantum is loosely derived from the tick-based
quantas and 60Hz timer and subject to future change
- the 64bit arithmetics is a little ugly, will be changes once we have
compiler support for 64bit integers (soon)
ask to map in oxpcie i/o memory and support serial i/o for it in the
kernel. set oxpcie=<address> in boot monitor (retrieve address using
pci_debug=1 output). (no sanity checking is done on the address
currently.) disabled by default.
The change also contains some other minor cleanup (a new serial.h to set
register info common to UART and the OXPCIe card, in-kernel memory
mapping a little more structured and env_get() to get sysenv variables
without knowing about the params_buffer).
- this patch only renames schedcheck() to switch_to_user(),
cycles_accounting_stop() to context_stop() and restart() to
+restore_user_context()
- the motivation is that since the introduction of schedcheck() it has
been abused for many things. It deserves a better name. It should
express the fact that from the moment we call the function we are in
the process of switching to user.
- cycles_accounting_stop() was originally a single purpose function.
As this function is called at were convenient places it is used in
for other things too, e.g. (un)locking the kernel. Thus it deserves
a better name too.
- using the old name, restart() does not call schedcheck(), however
calls to restart are replaced by calls to schedcheck()
[switch_to_user] and it calls restart() [restore_user_context]
- this patch moves the former printslot() from arch_system.c to
debug.c and reimplements it slightly. The output is not changed,
however, the process information is printed in a separate function
print_proc() in debug.c as such a function is also handy in other
situations and should be publicly available when debugging.
SYSLIB CHANGES:
- DS calls to publish / retrieve labels consider endpoints instead of u32_t.
VFS CHANGES:
- mapdriver() only adds an entry in the dmap table in VFS.
- dev_up() is only executed upon reception of a driver up event.
INET CHANGES:
- INET no longer searches for existing drivers instances at startup.
- A newtwork driver is (re)initialized upon reception of a driver up event.
- Networking startup is now race-free by design. No need to waste 5 seconds
at startup any more.
DRIVER CHANGES:
- Every driver publishes driver up events when starting for the first time or
in case of restart when recovery actions must be taken in the upper layers.
- Driver up events are published by drivers through DS.
- For regular drivers, VFS is normally the only subscriber, but not necessarily.
For instance, when the filter driver is in use, it must subscribe to driver
up events to initiate recovery.
- For network drivers, inet is the only subscriber for now.
- Every VFS driver is statically linked with libdriver, every network driver
is statically linked with libnetdriver.
DRIVER LIBRARIES CHANGES:
- Libdriver is extended to provide generic receive() and ds_publish() interfaces
for VFS drivers.
- driver_receive() is a wrapper for sef_receive() also used in driver_task()
to discard spurious messages that were meant to be delivered to a previous
version of the driver.
- driver_receive_mq() is the same as driver_receive() but integrates support
for queued messages.
- driver_announce() publishes a driver up event for VFS drivers and marks
the driver as initialized and expecting a DEV_OPEN message.
- Libnetdriver is introduced to provide similar receive() and ds_publish()
interfaces for network drivers (netdriver_announce() and netdriver_receive()).
- Network drivers all support live update with no state transfer now.
KERNEL CHANGES:
- Added kernel call statectl for state management. Used by driver_announce() to
unblock eventual callers sendrecing to the driver.
this patch does not add or change any functionality of do_ipc(), it
only makes things a little cleaner (hopefully).
Until now do_ipc() was responsible for handling all ipc calls. The
catch is that SENDA is fairly different which results in some ugly
code like this typecasting and variables naming which does not make
much sense for SENDA and makes the code hard to read.
result = mini_senda(caller_ptr, (asynmsg_t *)m_ptr, (size_t)src_dst_e);
As it is called directly from assembly, the new do_ipc() takes as
input values of 3 registers in reg_t variables (it used to be 4,
however, bit_map wasn't used so I removed it), does the checks common
to all ipc calls and call the appropriate handler either for
do_sync_ipc() (all except SENDA) or mini_senda() (for SENDA) while
typecasting the reg_t values correctly. As a result, handling SENDA
differences in do_sync_ipc() is no more needed. Also the code that
uses msg_size variable is improved a little bit.
arch_do_syscall() is simplified too.
- cotributed by Bjorn Swift
- In this first phase, scheduling is moved from the kernel to the PM
server. The next steps are to a) moving scheduling to its own server
and b) include useful information in the "out of quantum" message,
so that the scheduler can make use of this information.
- The kernel process table now keeps record of who is responsible for
scheduling each process (p_scheduler). When this pointer is NULL,
the process will be scheduled by the kernel. If such a process runs
out of quantum, the kernel will simply renew its quantum an requeue
it.
- When PM loads, it will take over scheduling of all running
processes, except system processes, using sys_schedctl().
Essentially, this only results in taking over init. As children
inherit a scheduler from their parent, user space programs forked by
init will inherit PM (for now) as their scheduler.
- Once a process has been assigned a scheduler, and runs out of
quantum, its RTS_NO_QUANTUM flag will be set and the process
dequeued. The kernel will send a message to the scheduler, on the
process' behalf, informing the scheduler that it has run out of
quantum. The scheduler can take what ever action it pleases, based
on its policy, and then reschedule the process using the
sys_schedule() system call.
- Balance queues does not work as before. While the old in-kernel
function used to renew the quantum of processes in the highest
priority run queue, the user-space implementation only acts on
processes that have been bumped down to a lower priority queue.
This approach reacts slower to changes than the old one, but saves
us sending a sys_schedule message for each process every time we
balance the queues. Currently, when processes are moved up a
priority queue, their quantum is also renewed, but this can be
fiddled with.
- do_nice has been removed from kernel. PM answers to get- and
setpriority calls, updates it's own nice variable as well as the
max_run_queue. This will be refactored once scheduling is moved to a
separate server. We will probably have PM update it's local nice
value and then send a message to whoever is scheduling the process.
- changes to fix an issue in do_fork() where processes could run out
of quantum but bypassing the code path that handles it correctly.
The future plan is to remove the policy from do_fork() and implement
it in userspace too.
- before enabling paging VM asks kernel to resize its segments. This
may cause kernel to segfault if APIC is used and an interrupt
happens between this and paging enabled. As these are 2 separate
vmctl calls it is not atomic. This patch fixes this problem. VM does
not ask kernel to resize the segments in a separate call anymore.
The new segments limit is part of the "enable paging" call. It
generalizes this call in such a way that more information can be
passed as need be or the information may be completely different if
another architecture requires this.
Move archtypes.h to include/ dir, since several servers require it. Move
fpu.h and stackframe.h to arch-specific header directory. Make source
files and makefiles aware of the new header locations.
this change
- makes panic() variadic, doing full printf() formatting -
no more NO_NUM, and no more separate printf() statements
needed to print extra info (or something in hex) before panicing
- unifies panic() - same panic() name and usage for everyone -
vm, kernel and rest have different names/syntax currently
in order to implement their own luxuries, but no longer
- throws out the 1st argument, to make source less noisy.
the panic() in syslib retrieves the server name from the kernel
so it should be clear enough who is panicing; e.g.
panic("sigaction failed: %d", errno);
looks like:
at_wini(73130): panic: sigaction failed: 0
syslib:panic.c: stacktrace: 0x74dc 0x2025 0x100a
- throws out report() - printf() is more convenient and powerful
- harmonizes/fixes the use of panic() - there were a few places
that used printf-style formatting (didn't work) and newlines
(messes up the formatting) in panic()
- throws out a few per-server panic() functions
- cleans up a tie-in of tty with panic()
merging printf() and panic() statements to be done incrementally.
- as thre are still KERNEL and IDLE entries, time accounting for
kernel and idle time works the same as for any other process
- everytime we stop accounting for the currently running process,
kernel or idle, we read the TSC counter and increment the p_cycles
entry.
- the process cycles inherently include some of the kernel cycles as
we can stop accounting for the process only after we save its
context and we start accounting just before we restore its context
- this assumes that the system does not scale the CPU frequency which
will be true for ... long time ;-)
- no kernel tasks are runnable
- clock initialization moved to the end of main()
- the rest of the body of clock_task() is moved to bsp_timer_int_handler() as
for now we are going to handle this on the bootstrap cpu. A change later is
possible.
* Userspace change to use the new kernel calls
- _taskcall(SYSTASK...) changed to _kernel_call(...)
- int 32 reused for the kernel calls
- _do_kernel_call() to make the trap to kernel
- kernel_call() to make the actuall kernel call from C using
_do_kernel_call()
- unlike ipc call the kernel call always succeeds as kernel is
always available, however, kernel may return an error
* Kernel side implementation of kernel calls
- the SYSTEm task does not run, only the proc table entry is
preserved
- every data_copy(SYSTEM is no data_copy(KERNEL
- "locking" is an empty operation now as everything runs in
kernel
- sys_task() is replaced by kernel_call() which copies the
message into kernel, dispatches the call to its handler and
finishes by either copying the results back to userspace (if
need be) or by suspending the process because of VM
- suspended processes are later made runnable once the memory
issue is resolved, picked up by the scheduler and only at
this time the call is resumed (in fact restarted) which does
not need to copy the message from userspace as the message
is already saved in the process structure.
- no ned for the vmrestart queue, the scheduler will restart
the system calls
- no special case in do_vmctl(), all requests remove the
RTS_VMREQUEST flag
- copies a mesage from/to userspace without need of translating
addresses
- the assumption is that the address space is installed, i.e. ldt and
cr3 are loaded correctly
- if a pagefault or a general protection occurs while copying from
userland to kernel (or vice versa) and error is returned which gives
the caller a chance to respond in a proper way
- error happens _only_ because of a wrong user pointer if the function
is used correctly
- if the prerequisites of the function do no hold, the function will
most likely fail as the user address becomes random
- switch_address_space() implements a switch of the user address space
for the destination process
- this makes memory of this process easily accessible, e.g. a pointer
valid in the userspace can be used with a little complexity to
access the process's memory
- the switch does not happed only just before we return to userspace,
however, it happens right after we know which process we are going
to schedule. This happens before we start processing the misc flags
of this process so its memory is available
- if the process becomes not runnable while processing the mics flags
we pick a new process and we switch the address space again which
introduces possibly a little bit more overhead, however, it is
hopefully hidden by reducing the overheads when we actually access
the memory
- the syscalls are pretty much just ipc calls, however, sendrec() is
used to implement system task (sys) calls
- sendrec() won't be used anymore for this, therefore ipc calls will
become pure ipc calls
- the system task initialization code does not really need to be part
of the system task process. An earlier initialization in kernel is
cleaner as it does not only initialize the syscalls but also irq
hooks etc.
kernel (sys task). The main reason is that these would have to become
cpu local variables on SMP. Once the system task is not a task but a
genuine part of the kernel there is even less reason to have these
extra variables as proc_ptr will already contain all neccessary
information. In addition converting who_e to the process pointer and
back again all the time will be avoided.
Although proc_ptr will contain all important information, accessing it
as a cpu local variable will be fairly expensive, hence the value
would be assigned to some on stack local variable. Therefore it is
better to add the 'caller' argument to the syscall handlers to pass
the value on stack anyway. It also clearly denotes on who's behalf is
the syscall being executed.
This patch also ANSIfies the syscall function headers.
Last but not least, it also fixes a potential bug in virtual_copy_f()
in case the check is disabled. So far the function in case of a
failure could possible reuse an old who_p in case this function had
not been called from the system task.
virtual_copy_f() takes the caller as a parameter too. In case the
checking is disabled, the caller must be NULL and non NULL if it is
enabled as we must be able to suspend the caller.
Some cases were fixed by declaring the function void, others were fixed
by adding a return <value> statement, thereby avoiding potentially
incorrect behavior (usually in error handling).
Some enum correctness in boot.c.
Main changes:
- COW optimization for safecopy.
- safemap, a grant-based interface for sharing memory regions between processes.
- Integration with safemap and complete rework of DS, supporting new data types
natively (labels, memory ranges, memory mapped ranges).
- For further information:
http://wiki.minix3.org/en/SummerOfCode2009/MemoryGrants
Additional changes not included in the original Wu's branch:
- Fixed unhandled case in VM when using COW optimization for safecopy in case
of a block that has already been shared as SMAP.
- Better interface and naming scheme for sys_saferevmap and ds_retrieve_map
calls.
- Better input checking in syslib: check for page alignment when creating
memory mapping grants.
- DS notifies subscribers when an entry is deleted.
- Documented the behavior of indirect grants in case of memory mapping.
- Test suite in /usr/src/test/safeperf|safecopy|safemap|ds/* reworked
and extended.
- Minor fixes and general cleanup.
- TO-DO: Grant ids should be generated and managed the way endpoints are to make
sure grant slots are never misreused.
- local APIC timer used as the source of time
- PIC is still used as the hw interrupt controller as we don't have
enough info without ACPI or MPS to set up IO APICs
- remapping of APIC when switching paging on, uses the new mechanism
to tell VM what phys areas to map in kernel's virtual space
- one more step to SMP
based on code by Arun C.
- idle task becomes a pseudo task which is never scheduled. It is never put on
any run queue and never enters userspace. An entry for this task still remains
in the process table for time accounting
- Instead of panicing if there is not process to schedule, pick_proc() returns
NULL which is a signal to put the cpu in an idle state and set everything in
such a way that after receiving and interrupt it looks like idle task was
preempted
- idle task is set non-preemptible to avoid handling in the timer interrupt code
which make userspace scheduling simpler as idle task does not need to be
handled as a special case.
- after a trap to kernel, the code automatically switches to kernel
stack, in the future local to the CPU
- k_reenter variable replaced by a test whether the CS is kernel cs or
not. The information is passed further if needed. Removes a global
variable which would need to be cpu local
- no need for global variables describing the exception or trap
context. This information is kept on stack and a pointer to this
structure is passed to the C code as a single structure
- removed loadedcr3 variable and its use replaced by reading the %cr3
register
- no need to redisable interrupts in restart() as they are already
disabled.
- unified handling of traps that push and don't push errorcode
- removed save() function as the process context is not saved directly
to process table but saved as required by the trap code. Essentially
it means that save() code is inlined everywhere not only in the
exception handling routine
- returning from syscall is more arch independent - it sets the retger
in C
- top of the x86 stack contains the current CPU id and pointer to the
currently scheduled process (the one right interrupted) so the mode
switch code can find where to save the context without need to use
proc_ptr which will be cpu local in the future and therefore
difficult to access in assembler and expensive to access in general
- some more clean up of level0 code. No need to read-back the argument
passed in
%eax from the proc structure. The mode switch code does not clobber
%the general registers and hence we can just call what is in %eax
- many assebly macros in sconst.h as they will be reused by the apic
assembly
- preemption handled in the clock timer interrupt handler, not in the clock task
- more achitecture independent clock timer handling code
- smp ready as each CPU can have its own timer
- the PIC master and slave irq handlers don't pass the irq hook pointer but just
the irq number. It gives a little bit more information to the C handler as the
irq number is not lost
- the irq code path is more achitecture independent. i386 hw interrupts are
called irq and whereever the code is arch independent enough hw_intr_
functions are called to mask/unmask interrupts
- the legacy PIC is not the only possible interrupt controller in the x86 world,
therefore the intr_(un)mask functions were renamed to signal their
functionality explicitly. APIC will add their own.
- masking and unmasking PIC interrupt lines is removed from assembler and all
the functionality is rewriten in C and moved to i8259.c
- interrupt handlers have to unmask the interrupt line if all irq handlers are
done. Assembler does not do it anymore
o Support for ptrace T_ATTACH/T_DETACH and T_SYSCALL
o PM signal handling logic should now work properly, even with debuggers
being present
o Asynchronous PM/VFS protocol, full IPC support for senda(), and
AMF_NOREPLY senda() flag
DETAILS
Process stop and delay call handling of PM:
o Added sys_runctl() kernel call with sys_stop() and sys_resume()
aliases, for PM to stop and resume a process
o Added exception for sending/syscall-traced processes to sys_runctl(),
and matching SIGKREADY pseudo-signal to PM
o Fixed PM signal logic to deal with requests from a process after
stopping it (so-called "delay calls"), using the SIGKREADY facility
o Fixed various PM panics due to race conditions with delay calls versus
VFS calls
o Removed special PRIO_STOP priority value
o Added SYS_LOCK RTS kernel flag, to stop an individual process from
running while modifying its process structure
Signal and debugger handling in PM:
o Fixed debugger signals being dropped if a second signal arrives when
the debugger has not retrieved the first one
o Fixed debugger signals being sent to the debugger more than once
o Fixed debugger signals unpausing process in VFS; removed PM_UNPAUSE_TR
protocol message
o Detached debugger signals from general signal logic and from being
blocked on VFS calls, meaning that even VFS can now be traced
o Fixed debugger being unable to receive more than one pending signal in
one process stop
o Fixed signal delivery being delayed needlessly when multiple signals
are pending
o Fixed wait test for tracer, which was returning for children that were
not waited for
o Removed second parallel pending call from PM to VFS for any process
o Fixed process becoming runnable between exec() and debugger trap
o Added support for notifying the debugger before the parent when a
debugged child exits
o Fixed debugger death causing child to remain stopped forever
o Fixed consistently incorrect use of _NSIG
Extensions to ptrace():
o Added T_ATTACH and T_DETACH ptrace request, to attach and detach a
debugger to and from a process
o Added T_SYSCALL ptrace request, to trace system calls
o Added T_SETOPT ptrace request, to set trace options
o Added TO_TRACEFORK trace option, to attach automatically to children
of a traced process
o Added TO_ALTEXEC trace option, to send SIGSTOP instead of SIGTRAP upon
a successful exec() of the tracee
o Extended T_GETUSER ptrace support to allow retrieving a process's priv
structure
o Removed T_STOP ptrace request again, as it does not help implementing
debuggers properly
o Added MINIX3-specific ptrace test (test42)
o Added proper manual page for ptrace(2)
Asynchronous PM/VFS interface:
o Fixed asynchronous messages not being checked when receive() is called
with an endpoint other than ANY
o Added AMF_NOREPLY senda() flag, preventing such messages from
satisfying the receive part of a sendrec()
o Added asynsend3() that takes optional flags; asynsend() is now a
#define passing in 0 as third parameter
o Made PM/VFS protocol asynchronous; reintroduced tell_fs()
o Made PM_BASE request/reply number range unique
o Hacked in a horrible temporary workaround into RS to deal with newly
revealed RS-PM-VFS race condition triangle until VFS is asynchronous
System signal handling:
o Fixed shutdown logic of device drivers; removed old SIGKSTOP signal
o Removed is-superuser check from PM's do_procstat() (aka getsigset())
o Added sigset macros to allow system processes to deal with the full
signal set, rather than just the POSIX subset
Miscellaneous PM fixes:
o Split do_getset into do_get and do_set, merging common code and making
structure clearer
o Fixed setpriority() being able to put to sleep processes using an
invalid parameter, or revive zombie processes
o Made find_proc() global; removed obsolete proc_from_pid()
o Cleanup here and there
Also included:
o Fixed false-positive boot order kernel warning
o Removed last traces of old NOTIFY_FROM code
THINGS OF POSSIBLE INTEREST
o It should now be possible to run PM at any priority, even lower than
user processes
o No assumptions are made about communication speed between PM and VFS,
although communication must be FIFO
o A debugger will now receive incoming debuggee signals at kill time
only; the process may not yet be fully stopped
o A first step has been made towards making the SYSTEM task preemptible
- no longer have kernel have its own page table that is loaded
on every kernel entry (trap, interrupt, exception). the primary
purpose is to reduce the number of required reloads.
Result:
- kernel can only access memory of process that was running when
kernel was entered
- kernel must be mapped into every process page table, so traps to
kernel keep working
Problem:
- kernel must often access memory of arbitrary processes (e.g. send
arbitrary processes messages); this can't happen directly any more;
usually because that process' page table isn't loaded at all, sometimes
because that memory isn't mapped in at all, sometimes because it isn't
mapped in read-write.
So:
- kernel must be able to map in memory of any process, in its own
address space.
Implementation:
- VM and kernel share a range of memory in which addresses of
all page tables of all processes are available. This has two purposes:
. Kernel has to know what data to copy in order to map in a range
. Kernel has to know where to write the data in order to map it in
That last point is because kernel has to write in the currently loaded
page table.
- Processes and kernel are separated through segments; kernel segments
haven't changed.
- The kernel keeps the process whose page table is currently loaded
in 'ptproc.'
- If it wants to map in a range of memory, it writes the value of the
page directory entry for that range into the page directory entry
in the currently loaded map. There is a slot reserved for such
purposes. The kernel can then access this memory directly.
- In order to do this, its segment has been increased (and the
segments of processes start where it ends).
- In the pagefault handler, detect if the kernel is doing
'trappable' memory access (i.e. a pagefault isn't a fatal
error) and if so,
- set the saved instruction pointer to phys_copy_fault,
breaking out of phys_copy
- set the saved eax register to the address of the page
fault, both for sanity checking and for checking in
which of the two ranges that phys_copy was called
with the fault occured
- Some boot-time processes do not have their own page table,
and are mapped in with the kernel, and separated with
segments. The kernel detects this using HASPT. If such a
process has to be scheduled, any page table will work and
no page table switch is done.
Major changes in kernel are
- When accessing user processes memory, kernel no longer
explicitly checks before it does so if that memory is OK.
It simply makes the mapping (if necessary), tries to do the
operation, and traps the pagefault if that memory isn't present;
if that happens, the copy function returns EFAULT.
So all of the CHECKRANGE_OR_SUSPEND macros are gone.
- Kernel no longer has to copy/read and parse page tables.
- A message copying optimisation: when messages are copied, and
the recipient isn't mapped in, they are copied into a buffer
in the kernel. This is done in QueueMess. The next time
the recipient is scheduled, this message is copied into
its memory. This happens in schedcheck().
This eliminates the mapping/copying step for messages, and makes
it easier to deliver messages. This eliminates soft_notify.
- Kernel no longer creates a page table at all, so the vm_setbuf
and pagetable writing in memory.c is gone.
Minor changes in kernel are
- ipc_stats thrown out, wasn't used
- misc flags all renamed to MF_*
- NOREC_* macros to enter and leave functions that should not
be called recursively; just sanity checks really
- code to fully decode segment selectors and descriptors
to print on exceptions
- lots of vmassert()s added, only executed if DEBUG_VMASSERT is 1
- a better name for architecture specific init function
- some of x86 init code must execute in protected mode
- prot_init() removed from this function and still called in cstart() Imho this
should be called from the architecture specific assembly not cstart. cstart
perform Minix monitor specific tasks and will be touched once another
bootloader is in use, e.g. booting via tftp, therefore we keep it as is for
now.
- this is a backport from the SMP code which requires this. Merging will be simpler