The global synchronization event used to be scheduled at
simQuantum. This prevented repeated entries into gem5 from Python as
it can be scheduled in the past. This changeset ensures that the first
global synchronization happens at curTick() + simQuantum instead.
Note: AArch64 and AArch32 interworking is not supported. If you use an AArch64
kernel you are restricted to AArch64 user-mode binaries. This will be addressed
in a later patch.
Note: Virtualization is only supported in AArch32 mode. This will also be fixed
in a later patch.
Contributors:
Giacomo Gabrielli (TrustZone, LPAE, system-level AArch64, AArch64 NEON, validation)
Thomas Grocutt (AArch32 Virtualization, AArch64 FP, validation)
Mbou Eyole (AArch64 NEON, validation)
Ali Saidi (AArch64 Linux support, code integration, validation)
Edmund Grimley-Evans (AArch64 FP)
William Wang (AArch64 Linux support)
Rene De Jong (AArch64 Linux support, performance opt.)
Matt Horsnell (AArch64 MP, validation)
Matt Evans (device models, code integration, validation)
Chris Adeniyi-Jones (AArch64 syscall-emulation)
Prakash Ramrakhyani (validation)
Dam Sunwoo (validation)
Chander Sudanthi (validation)
Stephan Diestelhorst (validation)
Andreas Hansson (code integration, performance opt.)
Eric Van Hensbergen (performance opt.)
Gabe Black
The probe patch is motivated by the desire to move analytical and trace code
away from functional code. This is achieved by the probe interface which is
essentially a glorified observer model.
What this means to users:
* add a probe point and a "notify" call at the source of an "event"
* add an isolated module, that is being used to carry out *your* analysis (e.g. generate a trace)
* register that module as a probe listener
Note: an example is given for reference in src/cpu/o3/simple_trace.[hh|cc] and src/cpu/SimpleTrace.py
What is happening under the hood:
* every SimObject maintains has a ProbeManager.
* during initialization (src/python/m5/simulate.py) first regProbePoints and
the regProbeListeners is called on each SimObject. this hooks up the probe
point notify calls with the listeners.
FAQs:
Why did you develop probe points:
* to remove trace, stats gathering, analytical code out of the functional code.
* the belief that probes could be generically useful.
What is a probe point:
* a probe point is used to notify upon a given event (e.g. cpu commits an instruction)
What is a probe listener:
* a class that handles whatever the user wishes to do when they are notified
about an event.
What can be passed on notify:
* probe points are templates, and so the user can generate probes that pass any
type of argument (by const reference) to a listener.
What relationships can be generated (1:1, 1:N, N:M etc):
* there isn't a restriction. You can hook probe points and listeners up in a
1:1, 1:N, N:M relationship. They become useful when a number of modules
listen to the same probe points. The idea being that you can add a small
number of probes into the source code and develop a larger number of useful
analysis modules that use information passed by the probes.
Can you give examples:
* adding a probe point to the cpu's commit method allows you to build a trace
module (outputting assembler), you could re-use this to gather instruction
distribution (arithmetic, load/store, conditional, control flow) stats.
Why is the probe interface currently restricted to passing a const reference:
* the desire, initially at least, is to allow an interface to observe
functionality, but not to change functionality.
* of course this can be subverted by const-casting.
What is the performance impact of adding probes:
* when nothing is actively listening to the probes they should have a
relatively minor impact. Profiling has suggested even with a large number of
probes (60) the impact of them (when not active) is very minimal (<1%).
This patch adds observability to the clock period of the clock domains
by including it as a stat.
As a result of adding this, the regressions will be updated in a
separate patch.
This patch provides support for DFS by having ClockedObjects register
themselves with their clock domain at construction time in a member list.
Using this list, a clock domain can update each member's tick to the
curTick() before modifying the clock period.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
There is a race between enabling asynchronous IO for a file descriptor
and IO events happening on that descriptor. A SIGIO won't normally be
delivered if an event is pending when asynchronous IO is
enabled. Instead, the signal will be raised the next time there is an
event on the FD. This changeset simulates a SIGIO by setting the
async_io flag when setting up asynchronous IO for an FD. This causes
the main event loop to poll all file descriptors to check for pending
IO. As a consequence of this, the old SIGALRM hack should no longer be
needed and is therefore removed.
The PollEvent class dynamically installs a SIGIO and SIGALRM handler
when a file handler is registered. Most signal handlers currently get
registered in the initSignals() function. This changeset moves the
SIGIO/SIGALRM handlers to initSignals() to live with the other signal
handlers. The original code installs SIGIO and SIGALRM with the
SA_RESTART option to prevent syscalls from returning EINTR. This
changeset consistently uses this flag for all signal handlers to
ensure that other signals that trigger asynchronous behavior (e.g.,
statistics dumping) do not cause undesirable EINTR returns.
This patch adds support for simulating with multiple threads, each of
which operates on an event queue. Each sim object specifies which eventq
is would like to be on. A custom barrier implementation is being added
using which eventqs synchronize.
The patch was tested in two different configurations:
1. ruby_network_test.py: in this simulation L1 cache controllers receive
requests from the cpu. The requests are replied to immediately without
any communication taking place with any other level.
2. twosys-tsunami-simple-atomic: this configuration simulates a client-server
system which are connected by an ethernet link.
We still lack the ability to communicate using message buffers or ports. But
other things like simulation start and end, synchronizing after every quantum
are working.
Committed by: Nilay Vaish
This patch changes the name the command-line options related to debug
output to all start with "debug" rather than being a mix of that and
"trace". It also makes it clear that the breakpoint time is specified
in ticks and not in cycles.
Thumb2 ARM kernels may access the TEEHBR via thumbee_notifier
in arch/arm/kernel/thumbee.c. The Linux kernel code just seems
to be saving and restoring the register. This patch adds support
for the TEEHBR cp14 register. Note, this may be a special case
when restoring from an image that was run on a system that
supports ThumbEE.
Newer linux kernels and distros exercise more functionality in the IDE device
than previously, exposing 2 races. The first race is the handling of aborted
DMA commands would immediately report the device is ready back to the kernel
and cause already in flight commands to assert the simulator when they returned
and discovered an inconsitent device state. The second race was due to the
Status register not being handled correctly, the interrupt status bit would get
stuck at 1 and the driver eventually views this as a bad state and logs the
condition to the terminal. This patch fixes these two conditions by making the
device handle aborted commands gracefully and properly handles clearing the
interrupt status bit in the Status register.
SimLoopExitEvents weren't serialized by default. Some benchmarks
utilize a delayed m5 exit pseudo op call to terminate the simulation
and this event was lost when resuming from a checkpoint generated
after the pseudo op call. This patch adds the capability to serialize
the SimLoopExitEvents and enable serialization for m5_exit and m5_fail
pseudo ops by default. Does not affect other generic
SimLoopExitEvents.
The order between updating and using arg_num in
PseudoInst::pseudoInst() is currently undefined. This changeset
explicitly updates arg_num after it has been used to extract an
argument.
--HG--
extra : rebase_source : 67c46dc3333d16ce56687ee8aea41ce6c6d133bb
This patch makes it possible to once again build gem5 without any
ISA. The main purpose is to enable work around the interconnect and
memory system without having to build any CPU models or device models.
The regress script is updated to include the NULL ISA target. Currently
no regressions make use of it, but all the testers could (and perhaps
should) transition to it.
--HG--
rename : build_opts/NOISA => build_opts/NULL
rename : src/arch/noisa/SConsopts => src/arch/null/SConsopts
rename : src/arch/noisa/cpu_dummy.hh => src/arch/null/cpu_dummy.hh
rename : src/cpu/intr_control.cc => src/cpu/intr_control_noisa.cc
This patch is a first step to getting NOISA working again. A number of
redundant includes make life more difficult than it has to be and this
patch simply removes them. There are also some redundant forward
declarations removed.
This patch moves the system virtual port proxy to the Alpha system
only to make the resurrection of the NOISA slightly less
painful. Alpha is the only ISA that is actually using it.
This patch adds the notion of voltage domains, and groups clock
domains that operate under the same voltage (i.e. power supply) into
domains. Each clock domain is required to be associated with a voltage
domain, and the latter requires the voltage to be explicitly set.
A voltage domain is an independently controllable voltage supply being
provided to section of the design. Thus, if you wish to perform
dynamic voltage scaling on a CPU, its clock domain should be
associated with a separate voltage domain.
The current implementation of the voltage domain does not take into
consideration cases where there are derived voltage domains running at
ratio of native voltage domains, as with the case where there can be
on-chip buck/boost (charge pumps) voltage regulation logic.
The regression and configuration scripts are updated with a generic
voltage domain for the system, and one for the CPUs.
This patch adds checkpointing support to x86 tlb. It upgrades the
cpt_upgrader.py script so that previously created checkpoints can
be updated. It moves the checkpoint version to 6.
This patch removes the notion of a peer block size and instead sets
the cache line size on the system level.
Previously the size was set per cache, and communicated through the
interconnect. There were plenty checks to ensure that everyone had the
same size specified, and these checks are now removed. Another benefit
that is not yet harnessed is that the cache line size is now known at
construction time, rather than after the port binding. Hence, the
block size can be locally stored and does not have to be queried every
time it is used.
A follow-on patch updates the configuration scripts accordingly.
This patch adds the notion of source- and derived-clock domains to the
ClockedObjects. As such, all clock information is moved to the clock
domain, and the ClockedObjects are grouped into domains.
The clock domains are either source domains, with a specific clock
period, or derived domains that have a parent domain and a divider
(potentially chained). For piece of logic that runs at a derived clock
(a ratio of the clock its parent is running at) the necessary derived
clock domain is created from its corresponding parent clock
domain. For now, the derived clock domain only supports a divider,
thus ensuring a lower speed compared to its parent. Multiplier
functionality implies a PLL logic that has not been modelled yet
(create a separate clock instead).
The clock domains should be used as a mechanism to provide a
controllable clock source that affects clock for every clocked object
lying beneath it. The clock of the domain can (in a future patch) be
controlled by a handler responsible for dynamic frequency scaling of
the respective clock domains.
All the config scripts have been retro-fitted with clock domains. For
the System a default SrcClockDomain is created. For CPUs that run at a
different speed than the system, there is a seperate clock domain
created. This domain incorporates the CPU and the associated
caches. As before, Ruby runs under its own clock domain.
The clock period of all domains are pre-computed, such that no virtual
functions or multiplications are needed when calling
clockPeriod. Instead, the clock period is pre-computed when any
changes occur. For this to be possible, each clock domain tracks its
children.
This patch adds a 'sys_clock' command-line option and use it to assign
clocks to the system during instantiation.
As part of this change, the default clock in the System class is
removed and whenever a system is instantiated a system clock value
must be set. A default value is provided for the command-line option.
The configs and tests are updated accordingly.
HG changset 34e3295b0e39 introduced a check in the main simulation
loop that discards exit events that happen at the same tick as another
exit event. This was supposed to fix a problem where a simulation
script got confused by multiple exit events. This obviously breaks the
simulator since it can hide important simulation events, such as a
simulation failure, that happen at the same time as a non-fatal
simulation event.
in the TLB
Some architectures (currently only x86) require some fixing-up of
physical addresses after a normal address translation. This is usually
to remap devices such as the APIC, but could be used for other memory
mapped devices as well. When running the CPU in a using hardware
virtualization, we still need to do these address fix-ups before
inserting the request into the memory system. This patch moves this
patch allows that code to be used by such CPUs without doing full
address translations.
All architectures execute m5 pseudo instructions by setting up
arguments according to the ABI and executing a magic instruction that
contains an operation number. Handling of such instructions is
currently spread across the different ISA implementations. This
changeset introduces the PseudoInst::pseudoInst function which handles
most of this in an architecture independent way. This is function is
mainly intended to be used from KVM, but can also be used from the
simulated CPUs.
Previously, nextCycle() could return the *current* cycle if the current tick was
already aligned with the clock edge. This behavior is not only confusing (not
quite what the function name implies), but also caused problems in the
drainResume() function. When exiting/re-entering the sim loop (e.g., to take
checkpoints), the CPUs will drain and resume. Due to the previous behavior of
nextCycle(), the CPU tick events were being rescheduled in the same ticks that
were already processed before draining. This caused divergence from runs that
did not exit/re-entered the sim loop. (Initially a cycle difference, but a
significant impact later on.)
This patch separates out the two behaviors (nextCycle() and clockEdge()),
uses nextCycle() in drainResume, and uses clockEdge() everywhere else.
Nothing (other than name) should change except for the drainResume timing.
This changeset adds support for forwarding arguments to the PC
event constructors to following methods:
addKernelFuncEvent
addFuncEvent
Additionally, this changeset adds the following helper method to the
System base class:
addFuncEventOrPanic - Hook a PCEvent to a symbol, panic on failure.
addKernelFuncEventOrPanic - Hook a PCEvent to a kernel symbol, panic
on failure.
System implementations have been updated to use the new functionality
where appropriate.
Without loading weak symbols into gem5, some function names and the given PC
cannot correspond correctly, because the binding attributes of unction names
in an ELF file are not only STB_GLOBAL or STB_LOCAL, but also STB_WEAK. This
patch adds a function for loading weak symbols.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
This patch address the most important name shadowing warnings (as
produced when using gcc/clang with -Wshadow). There are many
locations where constructor parameters and function parameters shadow
local variables, but these are left unchanged.
This patch makes the clock member private to the ClockedObject and
forces all children to access it using clockPeriod(). This makes it
impossible to inadvertently change the clock, and also makes it easier
to transition to a situation where the clock is derived from e.g. a
clock domain, or through a multiplier.
Virtualized CPUs and the fastmem mode of the atomic CPU require direct
access to physical memory. We currently require caches to be disabled
when using them to prevent chaos. This is not ideal when switching
between hardware virutalized CPUs and other CPU models as it would
require a configuration change on each switch. This changeset
introduces a new version of the atomic memory mode,
'atomic_noncaching', where memory accesses are inserted into the
memory system as atomic accesses, but bypass caches.
To make memory mode tests cleaner, the following methods are added to
the System class:
* isAtomicMode() -- True if the memory mode is 'atomic' or 'direct'.
* isTimingMode() -- True if the memory mode is 'timing'.
* bypassCaches() -- True if caches should be bypassed.
The old getMemoryMode() and setMemoryMode() methods should never be
used from the C++ world anymore.
Used as a command in full-system scripts helps the user ensure the benchmarks have finished successfully.
For example, one can use:
/path/to/benchmark args || /sbin/m5 fail 1
and thus ensure gem5 will exit with an error if the benchmark fails.
When "-I" (maximum instruction number) and "-F" (fastforward instruction
number) are applied together, gem5 immediately exits after the cpu switching.
The reason is that multiple exit events may be generated in the same cycle by
Atomic CPU and inserted to mainEventQueue. However, mainEventQueue can only
serve one exit event in one cycle. Therefore, the rest exit events are left in
mainEventQueue without being descheduled or deleted, which causes gem5 exits
immediately after the system resumes by cpu switching.
Tick was not correctly wrapped for the stats system, and therefore it was not
possible to configure the stats dumping from the python scripts without
defining Ticks as long long. This patch fixes the wrapping of Tick by copying
the typemap of uint64_t to Tick.
In order to see all registers independent of the current CPU mode, the
ARM architecture model uses the magic MISCREG_CPSR_MODE register to
change the register mappings without actually updating the CPU
mode. This hack is no longer needed since the thread context now
provides a flat interface to the register file. This patch replaces
the CPSR_MODE hack with the flat register interface.
After making the ISA an independent SimObject, it is serialized
automatically by the Python world. Previously, this just resulted in
an empty ISA section. This patch moves the contents of the ISA to that
section and removes the explicit ISA serialization from the thread
contexts, which makes it behave like a normal SimObject during
serialization.
Note: This patch breaks checkpoint backwards compatibility! Use the
cpt_upgrader.py utility to upgrade old checkpoints to the new format.
This patch adds support for the memInvalidate() drain method. TLB
flushing is requested by calling the virtual flushAll() method on the
TLB.
Note: This patch renames invalidateAll() to flushAll() on x86 and
SPARC to make the interface consistent across all supported
architectures.
This patch generalises the address range resolution for the I/O cache
and I/O bridge such that they do not assume a single memory. The patch
involves adding a parameter to the system which is then defined based
on the memories that are to be visible from the I/O subsystem, whether
behind a cache or a bridge.
The change is needed to allow interleaved memory controllers in the
system.
This patch adds support for outputting protobuf messages through a
ProtoOutputStream which hides the internal streams used by the
library. The stream is created based on the name of an output file and
optionally includes compression using gzip.
The output stream will start by putting a magic number in the file,
and then for every message that is serialized prepend the size such
that the stream can be written and read incrementally. At this point
this merely serves as a proof of concept.
This patch adds a _curTick variable to an eventq. This variable is updated
whenever an event is serviced in function serviceOne(), or all events upto
a particular time are processed in function serviceEvents(). This change
helps when there are eventqs that do not make use of curTick for scheduling
events.
This patch adds the following two methods to the Drainable base class:
memWriteback() - Write back all dirty cache lines to memory using
functional accesses.
memInvalidate() - Invalidate memory system buffers. Dirty data
won't be written back.
Specifying calling memWriteback() after draining will allow us to
checkpoint systems with caches. memInvalidate() can be used to drop
memory system buffers in preparation for switching to an accelerated
CPU model that bypasses the gem5 memory system (e.g., hardware
virtualized CPUs).
Note: This patch only adds the methods to Drainable, the code for
flushing the TLB and the cache is committed separately.
This patch moves the draining interface from SimObject to a separate
class that can be used by any object needing draining. However,
objects not visible to the Python code (i.e., objects not deriving
from SimObject) still depend on their parents informing them when to
drain. This patch also gets rid of the CountedDrainEvent (which isn't
really an event) and replaces it with a DrainManager.
When casting objects in the generated SWIG interfaces, SWIG uses
classical C-style casts ( (Foo *)bar; ). In some cases, this can
degenerate into the equivalent of a reinterpret_cast (mainly if only a
forward declaration of the type is available). This usually works for
most compilers, but it is known to break if multiple inheritance is
used anywhere in the object hierarchy.
This patch introduces the cxx_header attribute to Python SimObject
definitions, which should be used to specify a header to include in
the SWIG interface. The header should include the declaration of the
wrapped object. We currently don't enforce header the use of the
header attribute, but a warning will be generated for objects that do
not use it.
This patch enables dumping statistics and Linux process information on
context switch boundaries (__switch_to() calls) that are used for
Streamline integration (a graphical statistics viewer from ARM).
This patch changes the default system clock from 1THz to 1GHz. This
clock is used by all modules that do not override the default (parent
clock), and primarily affects the IO subsystem. Every DMA device uses
its clock to schedule the next transfer, and the change will thus
cause this inter-transfer delay to be longer.
The default clock of the bus is removed, as the clock inherited from
the system provides exactly the same value.
A follow-on patch will bump the stats.
Ruby system was recently converted to a clocked object. Such objects maintain
state related to the time that has passed so far. During the cache warmup, Ruby
system changes its own time and the global time. Later on, the global time is
restored. So Ruby system also needs to reset its own time.
This patch adds an additional level of ports in the inheritance
hierarchy, separating out the protocol-specific and protocl-agnostic
parts. All the functionality related to the binding of ports is now
confined to use BaseMaster/BaseSlavePorts, and all the
protocol-specific parts stay in the Master/SlavePort. In the future it
will be possible to add other protocol-specific implementations.
The functions used in the binding of ports, i.e. getMaster/SlavePort
now use the base classes, and the index parameter is updated to use
the PortID typedef with the symbolic InvalidPortID as the default.
This patch moves all the memory backing store operations from the
independent memory controllers to the global physical memory. The main
reason for this patch is to allow address striping in a future set of
patches, but at this point it already provides some useful
functionality in that it is now possible to change the number of
memory controllers and their address mapping in combination with
checkpointing. Thus, the host and guest view of the memory backing
store are now completely separate.
With this patch, the individual memory controllers are far simpler as
all responsibility for serializing/unserializing is moved to the
physical memory. Currently, the functionality is more or less moved
from AbstractMemory to PhysicalMemory without any major
changes. However, in a future patch the physical memory will also
resolve any ranges that are interleaved and properly assign the
backing store to the memory controllers, and keep the host memory as a
single contigous chunk per address range.
Functionality for future extensions which involve CPU virtualization
also enable the host to get pointers to the backing store.
This patch changes how the serialization of the system works. The base
class had a non-virtual serialize and unserialize, that was hidden by
a function with the same name for a number of subclasses (most likely
not intentional as the base class should have been virtual). A few of
the derived systems had no specialization at all (e.g. Power and x86
that simply called the System::serialize), but MIPS and Alpha adds
additional symbol table entries to the checkpoint.
Instead of overriding the virtual function, the additional entries are
now printed through a virtual function (un)serializeSymtab. The reason
for not calling System::serialize from the two related systems is that
a follow up patch will require the system to also serialize the
PhysicalMemory, and if this is done in the base class if ends up being
between the general parts and the specialized symbol table.
With this patch, the checkpoint is not modified, as the order of the
segments is unchanged.
This patch changes the default 1 Tick clock period to a proxy that
resolves the parents clock. As a result of this, the caches and
L1-to-L2 bus, for example, will automatically use the clock period of
the CPU unless explicitly overridden.
To ensure backwards compatibility, the System class overrides the
proxy and specifies a 1 Tick clock. We could change this to something
more reasonable in a follow-on patch, perhaps 1 GHz or something
similar.
With this patch applied, all clocked objects should have a reasonable
clock period set, and could start specifying delays in Cycles instead
of absolute time.
This patch adds a function, periodicStatDump(long long period), which will dump
and reset the statistics every period. This function is designed to be called
from the python configuration scripts. This allows the periodic stats dumping to
be configured more easilly at run time.
The period is currently specified as a long long as there are issues passing
Tick into the C++ from the python as they have conflicting definitions. If the
period is less than curTick, the first occurance occurs at curTick. If the
period is set to 0, then the event is descheduled and the stats are not
periodically dumped.
Due to issues when resumung from a checkpoint, the StatDump event must be moved
forward such that it occues AFTER the current tick. As the function is called
from the python, the event is scheduled before the system resumes from the
checkpoint. Therefore, the event is moved using the updateEvents() function.
This is called from simulate.py once the system has resumed from the checkpoint.
NOTE: It should be noted that this is a fairly temporary patch which re-adds the
capability to extract temporal information from the communication monitors. It
should not be used at the same time as anything that relies on dumping the
statistics based on in simulation events i.e. a context switch.
Remove SimObject::setMemoryMode from the main SimObject class since it
is only valid for the System class. In addition to removing the method
from the C++ sources, this patch also removes getMemoryMode and
changeTiming from SimObject.py and updates the simulation code to call
the (get|set)MemoryMode method on the System object instead.
This patch ignores the FUTEX_PRIVATE_FLAG of the sys_futex system call
in SE mode.
With this patch, when sys_futex with the options FUTEX_WAIT_PRIVATE or
FUTEX_WAKE_PRIVATE is emulated, the FUTEX_PRIVATE_FLAG is ignored and
so their behaviours are the regular FUTEX_WAIT and FUTEX_WAKE.
Emulating FUTEX_WAIT_PRIVATE and FUTEX_WAKE_PRIVATE as if they were
non-private is safe from a functional point of view. The
FUTEX_PRIVATE_FLAG does not change the semantics of the futex, it's
just a mechanism to improve performance under certain circunstances
that can be ignored in SE mode.
Includes a small change in sim_object.cc that adds the name space to
the output stream parameter in serializeAll. Leaving out the name
space unfortunately confuses Doxygen.
Simulation objects normally register derived statistics, presumably
what regFormulas originally was meant for, in regStats(). This patch
removes regRegformulas since there is no need to have a separate
method call to register formulas.
This patch addresses the comments and feedback on the preceding patch
that reworks the clocks and now more clearly shows where cycles
(relative cycle counts) are used to express time.
Instead of bumping the existing patch I chose to make this a separate
patch, merely to try and focus the discussion around a smaller set of
changes. The two patches will be pushed together though.
This changes done as part of this patch are mostly following directly
from the introduction of the wrapper class, and change enough code to
make things compile and run again. There are definitely more places
where int/uint/Tick is still used to represent cycles, and it will
take some time to chase them all down. Similarly, a lot of parameters
should be changed from Param.Tick and Param.Unsigned to
Param.Cycles.
In addition, the use of curTick is questionable as there should not be
an absolute cycle. Potential solutions can be built on top of this
patch. There is a similar situation in the o3 CPU where
lastRunningCycle is currently counting in Cycles, and is still an
absolute time. More discussion to be had in other words.
An additional change that would be appropriate in the future is to
perform a similar wrapping of Tick and probably also introduce a
Ticks class along with suitable operators for all these classes.
This patch introduces the notion of a clock update function that aims
to avoid costly divisions when turning the current tick into a
cycle. Each clocked object advances a private (hidden) cycle member
and a tick member and uses these to implement functions for getting
the tick of the next cycle, or the tick of a cycle some time in the
future.
In the different modules using the clocks, changes are made to avoid
counting in ticks only to later translate to cycles. There are a few
oddities in how the O3 and inorder CPU count idle cycles, as seen by a
few locations where a cycle is subtracted in the calculation. This is
done such that the regression does not change any stats, but should be
revisited in a future patch.
Another, much needed, change that is not done as part of this patch is
to introduce a new typedef uint64_t Cycle to be able to at least hint
at the unit of the variables counting Ticks vs Cycles. This will be
done as a follow-up patch.
As an additional follow up, the thread context still uses ticks for
the book keeping of last activate and last suspend and this should
probably also be changed into cycles as well.
This patch tidies up the EventManager constructor and prunes a corner
case where the EventManager would initialise its eventq pointer to
NULL. This would cause segmentation faults on actual use and should
never happen.
This patch makes the Tick unsigned and removes the UTick typedef. The
ticks should never be negative, and there was only one major issue
with removing it, caused by the o3 CPU using a -1 as an initial value.
The patch has no impact on any regressions.
This patch moves the clock of the CPU, bus, and numerous devices to
the new class ClockedObject, that sits in between the SimObject and
MemObject in the class hierarchy. Although there are currently a fair
amount of MemObjects that do not make use of the clock, they
potentially should do so, e.g. the caches should at some point have
the same clock as the CPU, potentially with a 1:n ratio. This patch
does not introduce any new clock objects or object hierarchies
(clusters, clock domains etc), but is still a step in the direction of
having a more structured approach clock domains.
The most contentious part of this patch is the serialisation of clocks
that some of the modules (but not all) did previously. This
serialisation should not be needed as the clock is set through the
parameters even when restoring from the checkpoint. In other words,
the state is "stored" in the Python code that creates the modules.
The nextCycle methods are also simplified and the clock phase
parameter of the CPU is removed (this could be part of a clock object
once they are introduced).
This patch fixes some problems with the drain/switchout functionality
for the O3 cpu and for the ARM ISA and adds some useful debug print
statements.
This is an incremental fix as there are still a few bugs/mem leaks with the
switchout code. Particularly when switching from an O3CPU to a
TimingSimpleCPU. However, when switching from O3 to O3 cores with the ARM ISA
I haven't encountered any more assertion failures; now the kernel will
typically panic inside of simulation.
This replaces a (potentially uninitialized) string
field with a virtual function so that we can have
a safe interface without requiring changes to the
eio code.
Enable different whitelists for different OS/arch combinations,
since some use the generic Linux definitions only, and others
use definitions inherited from earlier Unix flavors on those
architectures.
Also update x86 function pointers so ioctl is no longer
unimplemented on that platform.
This patch is a revised version of Vince Weaver's earlier patch.
This enables configuration scripts to set up mappings
from process virtual addresses to specific physical
addresses in SE mode. This feature is needed to
support modeling of user-accessible memories or
devices in SE mode, avoiding the complexities of FS
mode and the need to write a device driver.
This patch renames the queue() accessor to the less ambigious
eventQueue, and also removes the cast operator. The queue() member
function cause problems in derived classes that declare members with
the same name, e.g. a MemObject subclass that has a packet queue on
its own. The operator is not causing any harm at this point, but as it
is not used there is little point in keeping it.
This patch is the result of static analysis identifying a number of
memory leaks. The leaks are all benign as they are a result of not
deallocating memory in the desctructor. The fix still has value as it
removes false positives in the static analysis.
While FastAlloc provides a small performance increase (~1.5%) over regular malloc it isn't thread safe.
After removing FastAlloc and using tcmalloc I've seen a performance increase of 12% over libc malloc
when running twolf for ARM.
This patch makes two very minor changes to please gcc 4.7. The
CopyData function no longer exists and this has been replaced. For
some reason previous versions of gcc did not complain on the const
char casting not having an implementation, but this is now addressed.
If the length argument to mmap is larger than the arbitrary but reasonable
limit of 4GB, there's a good chance that the value is nonsense and not
intentional. Rather than attempting to satisfy the mmap anyway, this change
makes gem5 warn to make it more apparent what's going wrong.
Track the point in the initialization where statistics have been registered.
After this point registering new masterIds can no longer work as some
SimObjects may have sized stats vectors based on the previous value. If someone
tries to register a masterId after this point the simulator executes fatal().
This patch moves send/recvTiming and send/recvTimingSnoop from the
Port base class to the MasterPort and SlavePort, and also splits them
into separate member functions for requests and responses:
send/recvTimingReq, send/recvTimingResp, and send/recvTimingSnoopReq,
send/recvTimingSnoopResp. A master port sends requests and receives
responses, and also receives snoop requests and sends snoop
responses. A slave port has the reciprocal behaviour as it receives
requests and sends responses, and sends snoop requests and receives
snoop responses.
For all MemObjects that have only master ports or slave ports (but not
both), e.g. a CPU, or a PIO device, this patch merely adds more
clarity to what kind of access is taking place. For example, a CPU
port used to call sendTiming, and will now call
sendTimingReq. Similarly, a response previously came back through
recvTiming, which is now recvTimingResp. For the modules that have
both master and slave ports, e.g. the bus, the behaviour was
previously relying on branches based on pkt->isRequest(), and this is
now replaced with a direct call to the apprioriate member function
depending on the type of access. Please note that send/recvRetry is
still shared by all the timing accessors and remains in the Port base
class for now (to maintain the current bus functionality and avoid
changing the statistics of all regressions).
The packet queue is split into a MasterPort and SlavePort version to
facilitate the use of the new timing accessors. All uses of the
PacketQueue are updated accordingly.
With this patch, the type of packet (request or response) is now well
defined for each type of access, and asserts on pkt->isRequest() and
pkt->isResponse() are now moved to the appropriate send member
functions. It is also worth noting that sendTimingSnoopReq no longer
returns a boolean, as the semantics do not alow snoop requests to be
rejected or stalled. All these assumptions are now excplicitly part of
the port interface itself.
This patch introduces port access methods that separates snoop
request/responses from normal memory request/responses. The
differentiation is made for functional, atomic and timing accesses and
builds on the introduction of master and slave ports.
Before the introduction of this patch, the packets belonging to the
different phases of the protocol (request -> [forwarded snoop request
-> snoop response]* -> response) all use the same port access
functions, even though the snoop packets flow in the opposite
direction to the normal packet. That is, a coherent master sends
normal request and receives responses, but receives snoop requests and
sends snoop responses (vice versa for the slave). These two distinct
phases now use different access functions, as described below.
Starting with the functional access, a master sends a request to a
slave through sendFunctional, and the request packet is turned into a
response before the call returns. In a system without cache coherence,
this is all that is needed from the functional interface. For the
cache-coherent scenario, a slave also sends snoop requests to coherent
masters through sendFunctionalSnoop, with responses returned within
the same packet pointer. This is currently used by the bus and caches,
and the LSQ of the O3 CPU. The send/recvFunctional and
send/recvFunctionalSnoop are moved from the Port super class to the
appropriate subclass.
Atomic accesses follow the same flow as functional accesses, with
request being sent from master to slave through sendAtomic. In the
case of cache-coherent ports, a slave can send snoop requests to a
master through sendAtomicSnoop. Just as for the functional access
methods, the atomic send and receive member functions are moved to the
appropriate subclasses.
The timing access methods are different from the functional and atomic
in that requests and responses are separated in time and
send/recvTiming are used for both directions. Hence, a master uses
sendTiming to send a request to a slave, and a slave uses sendTiming
to send a response back to a master, at a later point in time. Snoop
requests and responses travel in the opposite direction, similar to
what happens in functional and atomic accesses. With the introduction
of this patch, it is possible to determine the direction of packets in
the bus, and no longer necessary to look for both a master and a slave
port with the requested port id.
In contrast to the normal recvFunctional, recvAtomic and recvTiming
that are pure virtual functions, the recvFunctionalSnoop,
recvAtomicSnoop and recvTimingSnoop have a default implementation that
calls panic. This is to allow non-coherent master and slave ports to
not implement these functions.
This patch addresses a number of minor issues that cause problems when
compiling with clang >= 3.0 and gcc >= 4.6. Most importantly, it
avoids using the deprecated ext/hash_map and instead uses
unordered_map (and similarly so for the hash_set). To make use of the
new STL containers, g++ and clang has to be invoked with "-std=c++0x",
and this is now added for all gcc versions >= 4.6, and for clang >=
3.0. For gcc >= 4.3 and <= 4.5 and clang <= 3.0 we use the tr1
unordered_map to avoid the deprecation warning.
The addition of c++0x in turn causes a few problems, as the
compiler is more stringent and adds a number of new warnings. Below,
the most important issues are enumerated:
1) the use of namespaces is more strict, e.g. for isnan, and all
headers opening the entire namespace std are now fixed.
2) another other issue caused by the more stringent compiler is the
narrowing of the embedded python, which used to be a char array,
and is now unsigned char since there were values larger than 128.
3) a particularly odd issue that arose with the new c++0x behaviour is
found in range.hh, where the operator< causes gcc to complain about
the template type parsing (the "<" is interpreted as the beginning
of a template argument), and the problem seems to be related to the
begin/end members introduced for the range-type iteration, which is
a new feature in c++11.
As a minor update, this patch also fixes the build flags for the clang
debug target that used to be shared with gcc and incorrectly use
"-ggdb".
This patch removes the assumption on having on single instance of
PhysicalMemory, and enables a distributed memory where the individual
memories in the system are each responsible for a single contiguous
address range.
All memories inherit from an AbstractMemory that encompasses the basic
behaviuor of a random access memory, and provides untimed access
methods. What was previously called PhysicalMemory is now
SimpleMemory, and a subclass of AbstractMemory. All future types of
memory controllers should inherit from AbstractMemory.
To enable e.g. the atomic CPU and RubyPort to access the now
distributed memory, the system has a wrapper class, called
PhysicalMemory that is aware of all the memories in the system and
their associated address ranges. This class thus acts as an
infinitely-fast bus and performs address decoding for these "shortcut"
accesses. Each memory can specify that it should not be part of the
global address map (used e.g. by the functional memories by some
testers). Moreover, each memory can be configured to be reported to
the OS configuration table, useful for populating ATAG structures, and
any potential ACPI tables.
Checkpointing support currently assumes that all memories have the
same size and organisation when creating and resuming from the
checkpoint. A future patch will enable a more flexible
re-organisation.
--HG--
rename : src/mem/PhysicalMemory.py => src/mem/AbstractMemory.py
rename : src/mem/PhysicalMemory.py => src/mem/SimpleMemory.py
rename : src/mem/physical.cc => src/mem/abstract_mem.cc
rename : src/mem/physical.hh => src/mem/abstract_mem.hh
rename : src/mem/physical.cc => src/mem/simple_mem.cc
rename : src/mem/physical.hh => src/mem/simple_mem.hh
This patch introduces the notion of a master and slave port in the C++
code, thus bringing the previous classification from the Python
classes into the corresponding simulation objects and memory objects.
The patch enables us to classify behaviours into the two bins and add
assumptions and enfore compliance, also simplifying the two
interfaces. As a starting point, isSnooping is confined to a master
port, and getAddrRanges to slave ports. More of these specilisations
are to come in later patches.
The getPort function is not getMasterPort and getSlavePort, and
returns a port reference rather than a pointer as NULL would never be
a valid return value. The default implementation of these two
functions is placed in MemObject, and calls fatal.
The one drawback with this specific patch is that it requires some
code duplication, e.g. QueuedPort becomes QueuedMasterPort and
QueuedSlavePort, and BusPort becomes BusMasterPort and BusSlavePort
(avoiding multiple inheritance). With the later introduction of the
port interfaces, moving the functionality outside the port itself, a
lot of the duplicated code will disappear again.
This patch cleans up a number of minor issues aiming to get closer to
compliance with the C++0x standard as interpreted by gcc and clang
(compile with std=c++0x and -pedantic-errors). In particular, the
patch cleans up enums where the last item was succeded by a comma,
namespaces closed by a curcly brace followed by a semi-colon, and the
use of the GNU-extension typeof (replaced by templated functions). It
does not address variable-length arrays, zero-size arrays, anonymous
structs, range expressions in switch statements, and the use of long
long. The generated CPU code also has a large number of issues that
remain to be fixed, mainly related to overflows in implicit constant
conversion (due to shifts).
This patch makes the code compile with clang 2.9 and 3.0 again by
making two very minor changes. Firt, it maintains a strict typing in
the forward declaration of the BaseCPUParams. Second, it adds a
FullSystemInt flag of the type unsigned int next to the boolean
FullSystem flag. The FullSystemInt variable can be used in
decode-statements (expands to switch statements) in the instruction
decoder.
The change to port proxies recently moved code out of the constructor into
initState(). This is needed for code that loads data into memory, however
for code that setups symbol tables, kernel based events, etc this is the wrong
thing to do as that code is only called when a checkpoint isn't being restored
from.
This patch is adding a clearer design intent to all objects that would
not be complete without a port proxy by making the proxies members
rathen than dynamically allocated. In essence, if NULL would not be a
valid value for the proxy, then we avoid using a pointer to make this
clear.
The same approach is used for the methods using these proxies, such as
loadSections, that now use references rather than pointers to better
reflect the fact that NULL would not be an acceptable value (in fact
the code would break and that is how this patch started out).
Overall the concept of "using a reference to express unconditional
composition where a NULL pointer is never valid" could be done on a
much broader scale throughout the code base, but for now it is only
done in the locations affected by the proxies.
This patch classifies all ports in Python as either Master or Slave
and enforces a binding of master to slave. Conceptually, a master (such
as a CPU or DMA port) issues requests, and receives responses, and
conversely, a slave (such as a memory or a PIO device) receives
requests and sends back responses. Currently there is no
differentiation between coherent and non-coherent masters and slaves.
The classification as master/slave also involves splitting the dual
role port of the bus into a master and slave port and updating all the
system assembly scripts to use the appropriate port. Similarly, the
interrupt devices have to have their int_port split into a master and
slave port. The intdev and its children have minimal changes to
facilitate the extra port.
Note that this patch does not enforce any port typing in the C++
world, it merely ensures that the Python objects have a notion of the
port roles and are connected in an appropriate manner. This check is
carried when two ports are connected, e.g. bus.master =
memory.port. The following patches will make use of the
classifications and specialise the C++ ports into masters and slaves.
This change adds a master id to each request object which can be
used identify every device in the system that is capable of issuing a request.
This is part of the way to removing the numCpus+1 stats in the cache and
replacing them with the master ids. This is one of a series of changes
that make way for the stats output to be changed to python.
The code that checks whether pages allocated by allocPhysPages only checks
that the first page fits into physical memory, not that all of them do. This
change makes the code check the last page which should work properly. This
function used to only allocate one page at a time, so the first page and last
page used to be the same thing.
This patch adds the necessary flags to the SConstruct and SConscript
files for compiling using clang 2.9 and later (on Ubuntu et al and OSX
XCode 4.2), and also cleans up a bunch of compiler warnings found by
clang. Most of the warnings are related to hidden virtual functions,
comparisons with unsigneds >= 0, and if-statements with empty
bodies. A number of mismatches between struct and class are also
fixed. clang 2.8 is not working as it has problems with class names
that occur in multiple namespaces (e.g. Statistics in
kernel_stats.hh).
clang has a bug (http://llvm.org/bugs/show_bug.cgi?id=7247) which
causes confusion between the container std::set and the function
Packet::set, and this is currently addressed by not including the
entire namespace std, but rather selecting e.g. "using std::vector" in
the appropriate places.
Usage: m5 writefile <filename>
File will be created in the gem5 output folder with the identical filename.
Implementation is largely based on the existing "readfile" functionality.
Currently does not support exporting of folders.
This patch simplifies the address-range determination mechanism and
also unifies the naming across ports and devices. It further splits
the queries for determining if a port is snooping and what address
ranges it responds to (aiming towards a separation of
cache-maintenance ports and pure memory-mapped ports). Default
behaviours are such that most ports do not have to define isSnooping,
and master ports need not implement getAddrRanges.
Port proxies are used to replace non-structural ports, and thus enable
all ports in the system to correspond to a structural entity. This has
the advantage of accessing memory through the normal memory subsystem
and thus allowing any constellation of distributed memories, address
maps, etc. Most accesses are done through the "system port" that is
used for loading binaries, debugging etc. For the entities that belong
to the CPU, e.g. threads and thread contexts, they wrap the CPU data
port in a port proxy.
The following replacements are made:
FunctionalPort > PortProxy
TranslatingPort > SETranslatingPortProxy
VirtualPort > FSTranslatingPortProxy
--HG--
rename : src/mem/vport.cc => src/mem/fs_translating_port_proxy.cc
rename : src/mem/vport.hh => src/mem/fs_translating_port_proxy.hh
rename : src/mem/translating_port.cc => src/mem/se_translating_port_proxy.cc
rename : src/mem/translating_port.hh => src/mem/se_translating_port_proxy.hh
The system port is used as a globally reachable access point to the
memory subsystem. The benefit of using an actual port is that the
usual infrastructure is used to resolve any access and thus makes the
overall system able to handle distributed memories in any
configuration, and also makes the accesses agnostic to the address
map. This patch only introduces the port and does not actually use it
for anything.
This patch adds a mechanism to collect run time samples for specific portions
of a benchmark, using work_begin and work_end pseudo instructions.It also enhances
the histogram stat to report geometric mean.
This patch adds a function for replacing the event at the head of the queue
with another event. This helps in running a different set of events. Events
already scheduled can processed by replacing the original head event back.
This function has been specifically added to support cache warmup and
cooldown required for creating and restoring checkpoints.
--HG--
extra : rebase_source : ed6e2905720b6bfdefd020fab76235ccf33d28d1
This parameter depends on a number of coincidences to work properly. First,
there must be an array assigned to system called "cpu" even though there's no
parameter called that. Second, the items in the "cpu" array have to have a
"clock" parameter which has a "frequency" member. This is true of the normal
CPUs, but isn't true of the memory tester CPUs. This happened to work before
because the memory tester CPUs were only used in SE mode where this parameter
was being excluded. Since everything is being pulled into a common binary,
this won't work any more. Since the boot_cpu_frequency parameter is only used
by Alpha's Linux System object (and Mips's through copy and paste), the
definition of that parameter is moved down to those objects specifically.
PageTable supported an allocate() call that called back
through the Process to allocate memory, but did not have
a method to map addresses without allocating new pages.
It makes more sense for Process to do the allocation, so
this method was renamed allocateMem() and moved to Process,
and uses a new map() call on PageTable.
The remaining uses of the process pointer in PageTable
were only to get the name and the PID, so by passing these
in directly in the constructor, we can make PageTable
completely independent of Process.
Replace the (broken as of previous changeset) swig_objdecl() method
that allowed/forced you to substitute a whole new C++ struct
definition for SWIG to wrap with a set of export_method* hooks
that let you just declare a set of C++ methods (or other declarations)
that get inserted in the auto-generated struct.
Restore the System get/setMemoryMode methods, and use this mechanism
to specialize SimObject as well, eliminating teh need for sim_object.i.
Needed bits of sim_object.i are moved to the new pyobject.i.
Also sucked a little SimObject specialization into cxx_param_decl()
allowing us to get rid of src/sim/sim_object_params.hh. Now the
generation and wrapping of the base SimObject param struct is more
in line with how derived objects are handled.
--HG--
rename : src/python/swig/sim_object.i => src/python/swig/pyobject.i
- Move the random bits of SWIG code generation out of src/SConscript
file and into methods on the objects being wrapped.
- Cleaned up some variable naming and added some comments to make
the process a little clearer.
- Did a little generated file/module renaming:
- vptype_Foo now Foo_vector
- init_Foo is now Foo_init
This makes it easier to see all the Foo-related files in a
sorted directory listing.
- Made cxx_predecls and swig_predecls normal SimObject classmethods.
- Got rid of swig_objdecls hook, even though this breaks the System
objects get/setMemoryMode method exports. Will be fixing this in
a future changeset.
In order for a system object to work in SE mode and FS mode, it has to either
always require a platform object even in SE mode, or get rid of the
requirement all together. Making SE mode carry around unnecessary/unused bits
of FS seems less than ideal, so I decided to go with the second option. The
platform pointer in the System class was used for exactly one purpose, a path
for the Alpha Linux system object to get to the real time clock and read its
frequency so that it could short cut the loops_per_jiffy calculation. There
was also a copy and pasted implementation in MIPS, but since it was only there
because it was there in Alpha I still count that as one use.
This change reverses the mechanism that communicates the RTC frequency so that
the Tsunami platform object pushes it up to the AlphaSystem object. This is
slightly less specific than it could be because really only the
AlphaLinuxSystem uses it. Because the intrFrequency function on the Platform
class was no longer necessary (and unimplemented on anything but Alpha) it was
eliminated.
After this change, a platform will need to have a system, but a system won't
have to have a platform.
All of the classes will now be available in both modes, and only
GenericPageTableFault will continue to check the mode for conditional
compilation. It uses a process object to handle the fault in SE mode, and
for now those aren't available in FS mode.
This constant will have the same value as FULL_SYSTEM but will not be usable
by the preprocessor. It can be substituted into places where FULL_SYSTEM is
used in a C++ context and will make it easier to find which parts of the
simulator still use FULL_SYSTEM with the preprocessor using grep.
Initialize flags via the Event constructor instead of calling
setFlags() in the body of the derived class's constructor. I
forget exactly why, but this made life easier when implementing
multi-queue support.
Also rename Event::getFlags() to isFlagSet() to better match
common usage, and get rid of some unused Event methods.
Use exitSimLoop() function instead of explicitly scheduling
on mainEventQueue (which won't work once we go to multiple
event queues). Also introduced a local params variable to
shorten a lot of expressions.