This patch adds two MemoryObject's: ExternalMaster and ExternalSlave.
Each object has a single port which can be bound to an externally-
provided bridge to a port of another simulation system at
initialisation.
This patch adds a 'wakeup' member function to EventQueue which should be
called on an event queue whenever an event is scheduled on the event queue
from outside code within the call tree of the gem5 event loop.
This clearly isn't necessary for normal gem5 EventQueue operation but
becomes the minimum necessary interface to allow hosting gem5's event loop
onto other schedulers where there may be calls into gem5 from external
code which schedules events onto an EventQueue between the current time and
the time of the next scheduled event.
The use case I have in mind is a SystemC hosting where the event loop is:
while (more events) {
wait(time_to_next_event or wakeup)
setCurTick
service events at this time
}
where the 'wait' needs to be woken up if time_to_next_event becomes shorter
due to a scheduled event from SystemC arriving in a gem5 object.
Requiring 'wakeup' to be called is a more efficient interface than
requiring all gem5 event scheduling actions to affect the host scheduler.
This interface could be located elsewhere, say on another global object,
or by being passed by the host scheduler to objects which will schedule
such events, but it seems cleanest to put it on EventQueue as it is
actually a signal to the queue.
EventQueue::wakeup is called for async_event events on event queue 0 as
it's only important that *some* queue be triggered for such events.
This patch adds a Logger class encapsulating dprintf. This allows
variants of DPRINTF logging to be constructed and substituted in
place of the default behaviour.
The Logger provides a logMessage(when, name, format, ...) member
function like Trace::dprintf and a getOstream member function to
use a raw ostream for logging.
A class OstreamLogger is provided which generates the customary
debugging output with Trace::OstreamLogger::logMessage being the
old Trace::dprintf.
This patch takes quite a large step in transitioning from the ad-hoc
RefCountingPtr to the c++11 shared_ptr by adopting its use for all
Faults. There are no changes in behaviour, and the code modifications
are mostly just replacing "new" with "make_shared".
This patch transitions the o3 MemDepEntry from the ad-hoc
RefCountingPtr to the c++11 shared_ptr. There are no changes in
behaviour, and the code modifications are mainly replacing "new" with
"make_shared".
This patch transitions the Ruby Message and its derived classes from
the ad-hoc RefCountingPtr to the c++11 shared_ptr. There are no
changes in behaviour, and the code modifications are mainly replacing
"new" with "make_shared".
The cloning of derived messages is slightly changed as they previously
relied on overriding the base-class through covariant return types.
This patch transitions the stat Node and its derived classes from
the ad-hoc RefCountingPtr to the c++11 shared_ptr. There are no
changes in behaviour, and the code modifications are mainly replacing
"new" with "make_shared".
This patch transitions the EthPacketData from the ad-hoc
RefCountingPtr to the c++11 shared_ptr. There are no changes in
behaviour, and the code modifications are mainly replacing "new" with
"make_shared".
The bool casting operator for the shared_ptr is explicit, and we must
therefore either cast it, compare it to NULL (p != nullptr), double
negate it (!!p) or do a (p ? true : false).
This patch takes a first few steps in transitioning from the ad-hoc
RefCountingPtr to the c++11 shared_ptr. There are no changes in
behaviour, and the code modifications are mainly introducing the
use of make_shared.
Note that the class could use unique_ptr rather than shared_ptr, was
it not for the postfix increment and decrement operators.
This patch makes the memory system ISA-agnostic by enabling the Ruby
Sequencer to dynamically determine if it has to do a store check. To
enable this check, the ISA is encoded as an enum, and the system
is able to provide the ISA to the Sequencer at run time.
--HG--
rename : src/arch/x86/insts/microldstop.hh => src/arch/x86/ldstflags.hh
This patch takes a step towards an ISA-agnostic memory
system by enabling the components to establish the page size after
instantiation. The swap operation in the memory is now also allowing
any granularity to avoid depending on the IntReg of the ISA.
This changeset adds probe points that can be used to implement PMU
counters for CPU stats. The following probes are supported:
* BaseCPU::ppCycles / Cycles
* BaseCPU::ppRetiredInsts / RetiredInsts
* BaseCPU::ppRetiredLoads / RetiredLoads
* BaseCPU::ppRetiredStores / RetiredStores
* BaseCPU::ppRetiredBranches RetiredBranches
This changeset adds probe points that can be used to implement PMU
counters for TLB stats. The following probes are supported:
* ArmISA::TLB::ppRefills / TLB Refills (TLB insertions)
This changeset adds probe points that can be used to implement PMU
counters for branch predictor stats. The following probes are
supported:
* BPRedUnit::ppBranches / Branches
* BPRedUnit::ppMisses / Misses
This class implements a subset of the ARM PMU v3 specification as
described in the ARMv8 reference manual. It supports most of the
features of the PMU, however the following features are known to be
missing:
* Event filtering (e.g., from different privilege levels).
* Access controls (the PMU currently ignores the execution level).
* The chain counter (event no. 0x1E) is unimplemented.
The PMU itself does not implement any events, it merely provides an
interface for the configuration scripts to hook up probes that drive
events. Configuration scripts should call addEventProbe() to configure
custom events or high-level methods to configure architected
events. The Python implementation of addEventProbe() automatically
delays event type registration until after instantiation.
In order to support CPU switching and some combined counters (e.g.,
memory references synthesized from loads and stores), the PMU allows
multiple probes per event type. When creating a system that switches
between CPU models that share the same PMU, PMU events for all of the
CPU models can be registered with the PMU.
Kudos to Matt Horsnell for the initial gem5 implementation of the PMU.
In order to show make PMU probe points usable across different PMU
implementations, we want a common probe interface. This patch the
namespace ProbePoins that contains typedefs for probe points that are
shared between multiple SimObjects. It also adds typedefs for the PMU
probe interface.
BitUnion instances can normally not be used with the SERIALIZE_SCALAR
and UNSERIALIZE_SCALAR macros due to the way they are converted
between their storage type and their actual type. This changeset adds
a set of parm(In|Out) functions specifically for gem5 bit unions to
work around the issue.
This patch adds the ability to load in config.ini files generated from
gem5 into another instance of gem5 built without Python configuration
support. The intended use case is for configuring gem5 when it is a
library embedded in another simulation system.
A parallel config file reader is also provided purely in Python to
demonstrate the approach taken and to provided similar functionality
for as-yet-unknown use models. The Python configuration file reader
can read both .ini and .json files.
C++ configuration file reading:
A command line option has been added for scons to enable C++ configuration
file reading: --with-cxx-config
There is an example in util/cxx_config that shows C++ configuration in action.
util/cxx_config/README explains how to build the example.
Configuration is achieved by the object CxxConfigManager. It handles
reading object descriptions from a CxxConfigFileBase object which
wraps a config file reader. The wrapper class CxxIniFile is provided
which wraps an IniFile for reading .ini files. Reading .json files
from C++ would be possible with a similar wrapper and a JSON parser.
After reading object descriptions, CxxConfigManager creates
SimObjectParam-derived objects from the classes in the (generated with this
patch) directory build/ARCH/cxx_config
CxxConfigManager can then build SimObjects from those SimObjectParams (in an
order dictated by the SimObject-value parameters on other objects) and bind
ports of the produced SimObjects.
A minimal set of instantiate-replacing member functions are provided by
CxxConfigManager and few of the member functions of SimObject (such as drain)
are extended onto CxxConfigManager.
Python configuration file reading (configs/example/read_config.py):
A Python version of the reader is also supplied with a similar interface to
CxxConfigFileBase (In Python: ConfigFile) to config file readers.
The Python config file reading will handle both .ini and .json files.
The object construction strategy is slightly different in Python from the C++
reader as you need to avoid objects prematurely becoming the children of other
objects when setting parameters.
Port binding also needs to be strictly in the same port-index order as the
original instantiation.
This patch adds the Undefined Behavior Sanitizer (UBSan) for clang and
gcc >= 4.9. Due to the performance impact, the usage is guarded by a
command-line option.
Add the ability to build libgem5 without embedded Python or the
ability to configure with Python.
This is a prelude to a patch to allow config.ini files to be loaded
into libgem5 using only C++ which would make embedding gem5 within
other simulation systems easier.
This adds a few registration interfaces to things which cross
between Python and C++. Namely: stats dumping and SimObject resolving
This patch adds some statistics to garnet that record the activity
of certain structures in the on-chip network. These statistics, in a later
patch, will be used for computing the energy consumed by the on-chip network.
Orion is being dropped from ruby. It would be replaced with DSENT
which has better models. Note that the power / energy numbers reported
after this patch has been applied are not for use.
This patch takes the final step in integrating DRAMPower and adds the
appropriate calls in the DRAM controller to provide the command trace
and extract the power and energy stats. The debug printouts are still
left in place, but will eventually be removed.
At the moment the DRAM power calculation is always on when using the
DRAM controller model. The run-time impact of this addition is around
1.5% when looking at the total host seconds of the regressions. We
deem this a sensible trade-off to avoid the complication of adding an
enable/disable mechanism.
This patch adds a class to wrap DRAMPower Library in gem5.
This class initiates an object of class MemorySpecification
of the DRAMPower Library, passes the parameters from DRAMCtrl.py
to this object and creates an object of drampower library using
the memory specification.
This patch adds missing timing and current parameters to the existing
DRAM configs. These missing timing and current parameters are required
by DRAMPower for the DRAM power calculations. The missing values are
datasheet values of the specified DRAMs, and the appropriate
references are added for the variuos configs.
This patch prunes the DDR3 config that was initially created to match
the default config of DRAMSim2. The config is not complete as it is,
and to avoid having to maintain it, the easiest way forward is to
simply prune it. Going forward we are adding power number etc to the
other configurations.
This patch adds the Python parameter type Current, which is used for
the DRAM power modelling (to start with). With this addition we avoid
implicit unit assumptions.
The Ozone CPU is now very much out of date and completely
non-functional, with no one actively working on restoring it. It is a
source of confusion for new users who attempt to use it before
realizing its current state. RIP
This patch fixes the runtime errors highlighted by the undefined
behaviour sanitizer. In the end there were two issues. First, when
rotating an immediate, we ended up shifting an uint32_t by 32 in some
cases. This case is fixed by checking for a rotation by 0
positions. Second, the Mrc15 and Mcr15 are operating on an IntReg and
a MiscReg, but we used the type RegRegImmOp and passed a MiscRegIndex
as an IntRegIndex. This issue is resolved by introducing a
MiscRegRegImmOp and RegMiscRegImmOp with the appropriate types.
With these fixes there are no runtime errors identified for the full
ARM regressions.
This patch optimises the passing of StaticInstPtr by avoiding copying
the reference-counting pointer. This avoids first incrementing and
then decrementing the reference-counting pointer.
Fix a number few minor issues to please gcc 4.9.1. Removing the
'-fuse-linker-plugin' flag means no libraries are part of the LTO
process, but hopefully this is an acceptable loss, as the flag causes
issues on a lot of systems (only certain combinations of gcc, ld and
ar work).
The call paths for de-scheduling a thread are halt() and suspend(), from
the thread context. There is no call to deallocateContext() in general,
though some CPUs chose to define it. This patch removes the function
from BaseCPU and the cores which do not require it.
activate(), suspend(), and halt() used on thread contexts had an optional
delay parameter. However this parameter was often ignored. Also, when used,
the delay was seemily arbitrarily set to 0 or 1 cycle (no other delays were
ever specified). This patch removes the delay parameter and 'Events'
associated with them across all ISAs and cores. Unused activate logic
is also removed.
This patch changes the name of the Bus classes to XBar to better
reflect the actual timing behaviour. The actual instances in the
config scripts are not renamed, and remain as e.g. iobus or membus.
As part of this renaming, the code has also been clean up slightly,
making use of range-based for loops and tidying up some comments. The
only changes outside the bus/crossbar code is due to the delay
variables in the packet.
--HG--
rename : src/mem/Bus.py => src/mem/XBar.py
rename : src/mem/coherent_bus.cc => src/mem/coherent_xbar.cc
rename : src/mem/coherent_bus.hh => src/mem/coherent_xbar.hh
rename : src/mem/noncoherent_bus.cc => src/mem/noncoherent_xbar.cc
rename : src/mem/noncoherent_bus.hh => src/mem/noncoherent_xbar.hh
rename : src/mem/bus.cc => src/mem/xbar.cc
rename : src/mem/bus.hh => src/mem/xbar.hh
Adds a simple access counter for requests and snoops for the snoop filter and
also classifies hits based on whether a single other holder existed or whether
multiple shares held the line.
This patch adds a simple counter for both total messages and a histogram for
the fan-out of snoop messages. The fan-out describes to how many ports snoops
had to be sent per incoming request / snoop-from-below. Without any
cleverness, this usually means to either all, or all but the requesting port.
Adds two public domain algorithms for determining number of set bits and also
whether a value is a power of two, uses the builtin that is available in GCC
and clang for popcount.
This is a first cut at a simple snoop filter that tracks presence of lines in
the caches "above" it. The snoop filter can be applied at any given cache
hierarchy and will then handle the caches above it appropriately; there is no
need to use this only in the last-level bus.
This design currently has some limitations: missing stats, no notion of clean
evictions (these will not update the underlying snoop filter, because they are
not sent from the evicting cache down), no notion of capacity for the snoop
filter and thus no need for invalidations caused by capacity pressure in the
snoop filter. These are planned to be added on top with future change sets.
There are cases where users might by accident / intention specify less voltage
operating points thatn frequency points. We consider one of these cases
special: giving only a single voltage to a voltage domain effectively renders
it as a static domain. This patch adds additional logic in the auxiliary parts
of the functionality to handle these cases properly (simple driver asking for
N>1 operating levels, we should return the same voltage for all of them) and
adds error checking code in the voltage domain.
This patch provides an Energy Controller device that provides software
(driver) access to a DVFS handler. The device is currently residing in
the dev/arm tree, but there is nothing inherently ARM specific in the
behaviour. It is currently only tested and supported for ARM Linux,
hence the location.
These additions allow easier interoperability with and querying from an
additional controller which will be in a separate patch. Also adding warnings
for changing the enabled state of the handler across checkpoint / resume and
deviating from the state in the configuration.
Contributed-by: Akash Bagdia <akash.bagdia@arm.com>
Added the following parameter to the DRAMCtrl class:
- bank_groups_per_rank
This defaults to 1. For the DDR4 case, the default is overridden to indicate
bank group architecture, with multiple bank groups per rank.
Added the following delays to the DRAMCtrl class:
- tCCD_L : CAS-to-CAS, same bank group delay
- tRRD_L : RAS-to-RAS, same bank group delay
These parameters are only applied when bank group timing is enabled. Bank
group timing is currently enabled only for DDR4 memories.
For all other memories, these delays will default to '0 ns'
In the DRAM controller model, applied the bank group timing to the per bank
parameters actAllowedAt and colAllowedAt.
The actAllowedAt will be updated based on bank group when an ACT is issued.
The colAllowedAt will be updated based on bank group when a RD/WR burst is
issued.
At the moment no modifications are made to the scheduling.
Add the following delay to the DRAM controller:
- tCS : Different rank bus turnaround delay
This will be applied for
1) read-to-read,
2) write-to-write,
3) write-to-read, and
4) read-to-write
command sequences, where the new command accesses a different rank
than the previous burst.
The delay defaults to 2*tCK for each defined memory class. Note that
this does not correspond to one particular timing constraint, but is a
way of modelling all the associated constraints.
The DRAM controller has some minor changes to prioritize commands to
the same rank. This prioritization will only occur when the command
stream is not switching from a read to write or vice versa (in the
case of switching we have a gap in any case).
To prioritize commands to the same rank, the model will determine if there are
any commands queued (same type) to the same rank as the previous command.
This check will ensure that the 'same rank' command will be able to execute
without adding bubbles to the command flow, e.g. any ACT delay requirements
can be done under the hoods, allowing the burst to issue seamlessly.
Add new DRAM_ROTATE mode to traffic generator.
This mode will generate DRAM traffic that rotates across
banks per rank, command types, and ranks per channel
The looping order is illustrated below:
for (ranks per channel)
for (command types)
for (banks per rank)
// Generate DRAM Command Series
This patch also adds the read percentage as an input argument to the
DRAM sweep script. If the simulated read percentage is 0 or 100, the
middle for loop does not generate additional commands. This loop is
used only when the read percentage is set to 50, in which case the
middle loop will toggle between read and write commands.
Modified sweep.py script, which generates DRAM traffic.
Added input arguments and support for new DRAM_ROTATE mode.
The script now has input arguments for:
1) Read percentage
2) Number of ranks
3) Address mapping
4) Traffic generator mode (DRAM or DRAM_ROTATE)
The default values are:
100% reads, 1 rank, RoRaBaCoCh address mapping, and DRAM traffic gen mode
For the DRAM traffic mode, added multi-rank support.
This patch adds support for 9p filesystem proxying over VirtIO. It can
currently operate by connecting to a 9p server over a socket
(VirtIO9PSocket) or by starting the diod 9p server and connecting over
pipe (VirtIO9PDiod).
*WARNING*: Checkpoints are currently not supported for systems with 9p
proxies!
This patch adds support for VirtIO over the PCI bus. It does so by
providing the following new SimObjects:
* VirtIODeviceBase - Abstract base class for VirtIO devices.
* PciVirtIO - VirtIO PCI transport interface.
A VirtIO device is hooked up to the guest system by adding a PciVirtIO
device to the PCI bus and connecting it to a VirtIO device using the
vio parameter.
New VirtIO devices should inherit from VirtIODevice base and
implementing one or more VirtQueues. The VirtQueues are usually
device-specific and all derive from the VirtQueue class. Queues must
be registered with the base class from the constructor since the
device assumes that the number of queues stay constant.
The terminal currently assumes that the transport to the guest always
inherits from the Uart class. This assumption breaks when
implementing, for example, a VirtIO consoles. This patch removes this
assumption by adding pointer to the from the terminal to the uart and
replacing it with a more general callback interface. The Uart, or any
other class using the terminal, class implements an instance of the
callbacks class and registers it with the terminal.
This patch does a bit of housekeeping on the string helper functions
and relies on the C++11 standard library where possible. It also does
away with our custom string hash as an implementation is already part
of the standard library.
There are two primary issues with this code which make it deserving of deletion.
1) GHB is a way to structure a prefetcher, not a definitive type of prefetcher
2) This prefetcher isn't even structured like a GHB prefetcher.
It's basically a worse version of the stride prefetcher.
It primarily serves to confuse new gem5 users and most functionality is already
present in the stride prefetcher.
This patch 'completes' .json config files generation by adding in the
SimObject references and String-valued parameters not currently
printed.
TickParamValues are also changed to print in the same tick-value
format as in .ini files.
This allows .json files to describe a system as fully as the .ini files
currently do.
This patch adds a new function config_value (which mirrors ini_str) to
each ParamValue and to SimObject. This function can then be explicitly
changed to give different .json and .ini printing behaviour rather than
being written in terms of ini_str.
This patch changes how faults are passed between methods in an attempt
to copy as few reference-counting pointer instances as possible. This
should avoid unecessary copies being created, contributing to the
increment/decrement of the reference counters.
The changeset ad9c042dce54 made changes to the structures under the network
directory to use a map of buffers instead of vector of buffers.
The reasoning was that not all vnets that are created are used and we
needlessly allocate more buffers than required and then iterate over them
while processing network messages. But the move to map resulted in a slow
down which was pointed out by Andreas Hansson. This patch moves things
back to using vector of message buffers.
This patch fixes cases where uncacheable/memory type flags are not set
correctly on a memory op which is split in the LSQ. Without this
patch, request->request if freely used to check flags where the flags
should actually come from the accumulation of request fragment flags.
This patch also fixes a bug where an uncacheable access which passes
through tryToSendRequest more than once can increment
LSQ::numAccessesInMemorySystem more than once.
This patch closes a number of space gaps in debug messages caused by
the incorrect use of line continuation within strings. (There's also
one consistency change to a similar, but correct, use of line
continuation)
The ProbeListener base class automatically registers itself with a
probe manager. Currently, the class does not unregister a itself when
it is destroyed, which makes removing probes listeners somewhat
cumbersome. This patch adds an automatic call to
manager->removeListener in the ProbeListener destructor, which solves
the problem.
Parsing vectorparams from the command was slightly broken
in that it wouldn't accept the input that the help message
provided to the user and it didn't do the conversion
on the second code path used to convert the string input
to the actual internal representation. This patch fixes these bugs.
Static analysis revealed that BaseGlobalEvent::barrier was never
deallocated. This changeset solves this leak by making the barrier
allocation a part of the BaseGlobalEvent instead of storing a pointer
to a separate heap-allocated barrier.
Static analysis unearther a bunch of uninitialised variables and
members, and this patch addresses the problem. In all cases these
omissions seem benign in the end, but at least fixing them means less
false positives next time round.
The PC platform has a single IO range that is used both legacy IO and PCI IO
while other platforms may use seperate regions. Provide another mechanism to
configure the legacy IO base address range and set it to the PCI IO address
range for x86.
This change adds support for a generic pci host bus driver that
has been included in recent Linux kernel instead of the more
bespoke one we've been using to date. It also works with
aarch64 so it provides PCI support for 64-bit ARM Linux.
To make this work a new configuration option pci_io_base is added
to the RealView platform that should be set to the start of
the memory used as memory mapped IO ports (IO ports that are
memory mapped, not regular memory mapped IO). And a parameter
pci_cfg_gen_offsets which specifies if the config space
offsets should be used that the generic driver expects.
To use the pci-host-generic device you need to:
pci_io_base = 0x2f000000 (Valid for VExpress EMM)
pci_cfg_gen_offsets = True
and add the following to your device tree:
pci {
compatible = "pci-host-ecam-generic";
device_type = "pci";
#address-cells = <0x3>;
#size-cells = <0x2>;
#interrupt-cells = <0x1>;
//bus-range = <0x0 0x1>;
// CPU_PHYSICAL(2) SIZE(2)
// Note, some DTS blobs only support 1 size
reg = <0x0 0x30000000 0x0 0x10000000>;
// IO (1), no bus address (2), cpu address (2), size (2)
// MMIO (1), at address (2), cpu address (2), size (2)
ranges = <0x01000000 0x0 0x00000000 0x0 0x2f000000 0x0 0x10000>,
<0x02000000 0x0 0x40000000 0x0 0x40000000 0x0 0x10000000>;
// With gem5 we typically use INTA/B/C/D one per device
interrupt-map = <0x0000 0x0 0x0 0x1 0x1 0x0 0x11 0x1
0x0000 0x0 0x0 0x2 0x1 0x0 0x12 0x1
0x0000 0x0 0x0 0x3 0x1 0x0 0x13 0x1
0x0000 0x0 0x0 0x4 0x1 0x0 0x14 0x1>;
// Only match INTA/B/C/D and not BDF
interrupt-map-mask = <0x0000 0x0 0x0 0x7>;
};
The new configuration scripts need the ability to splice
a simobject between a pair of ports that are already connected.
The primary use case is when a CommMonitor needs to be
created after the system is configured and then spliced between
the pair of ports it will monitor.
This patch changes the random number generator from the in-house
Mersenne twister to an implementation relying entirely on C++11 STL.
The format for the checkpointing of the twister is simplified. As the
functionality was never used this should not matter. Note that this
patch does not actually make use of the checkpointing
functionality. As the random number generator is not thread safe, it
may be sensible to create one generator per thread, system, or even
object. Until this is decided the status quo is maintained in that no
generator state is part of the checkpoint.
This patch tidies up random number generation to ensure that it is
done consistently throughout the code base. In essence this involves a
clean-up of Ruby, and some code simplifications in the traffic
generator.
As part of this patch a bunch of skewed distributions (off-by-one etc)
have been fixed.
Note that a single global random number generator is used, and that
the object instantiation order will impact the behaviour (the sequence
of numbers will be unaffected, but if module A calles random before
module B then they would obviously see a different outcome). The
dependency on the instantiation order is true in any case due to the
execution-model of gem5, so we leave it as is. Also note that the
global ranom generator is not thread safe at this point.
Regressions using the memtest, TrafficGen or any Ruby tester are
affected and will be updated accordingly.
This patch removes unecessary retries that happened when the bus layer
itself was no longer busy, but the the peer was not yet ready. Instead
of sending a retry that will inevitably not succeed, the bus now
silenty waits until the peer sends a retry.
Multiple instructions assume only 32-bit load operations are available,
this patch increases load sizes to 64-bit or 128-bit for many load pair and
load multiple instructions.
Support full-block writes directly rather than requiring RMW:
* a cache line is allocated in the cache upon receipt of a
WriteInvalidateReq, not the WriteInvalidateResp.
* only top-level caches allocate the line; the others just pass
the request along and invalidate as necessary.
* to close a timing window between the *Req and the *Resp, a new
metadata bit tracks whether another cache has read a copy of
the new line before the writeback to memory.
This patch fixes a bug in the cache port where the retry flag was
reset too early, allowing new requests to arrive before the retry was
actually sent, but with the event already scheduled. This caused a
deadlock in the interactions with the O3 LSQ.
The patche fixes the underlying issue by shifting the resetting of the
flag to be done by the event that also calls sendRetry(). The patch
also tidies up the flow control in recvTimingReq and ensures that we
also check if we already have a retry outstanding.
Previously, they were treated so much like loads that they could stall
at the head of the ROB. Now they are always treated like L1 hits.
If they actually miss, a new request is created at the L1 and tracked
from the MSHRs there if necessary (i.e. if it didn't coalesce with
an existing outstanding load).
The o3 cpu relies upon instructions that suspend a thread context being
flagged as "IsQuiesce". If they are not, unpredictable behavior can occur.
This patch fixes that for the x86 ISA.
For X86, the o3 CPU would get stuck with the commit stage not being
drained if an interrupt arrived while drain was pending. isDrained()
makes sure that pcState.microPC() == 0, thus ensuring that we are at
an instruction boundary. However, when we take an interrupt we
execute:
pcState.upc(romMicroPC(entry));
pcState.nupc(romMicroPC(entry) + 1);
tc->pcState(pcState);
As a result, the MicroPC is no longer zero. This patch ensures the drain is
delayed until no interrupts are present. Once draining, non-synchronous
interrupts are deffered until after the switch.
Neon memory ops that operate on multiple registers currently have very poor
performance because of interleave/deinterleave micro-ops.
This patch marks the deinterleave/interleave micro-ops as "No_OpClass" such
that they take minumum cycles to execute and are never resource constrained.
Additionaly the micro-ops over-read registers. Although one form may need
to read up to 20 sources, not all do. This adds in new forms so false
dependencies are not modeled. Instructions read their minimum number of
sources.
Analogous to ee049bf (for x86). Requires a bump of the checkpoint version
and corresponding upgrader code to move the condition code register values
to the new register file.
A small bug in the bimodal predictor caused significant degradation in
performance on some benchmarks. This was caused by using the wrong
globalHistoryReg during the update phase. This patches fixes the bug
and brings the performance to normal level.
This patch fixes the load blocked/replay mechanism in the o3 cpu. Rather than
flushing the entire pipeline, this patch replays loads once the cache becomes
unblocked.
Additionally, deferred memory instructions (loads which had conflicting stores),
when replayed would not respect the number of functional units (only respected
issue width). This patch also corrects that.
Improvements over 20% have been observed on a microbenchmark designed to
exercise this behavior.
O3 is supposed to stop fetching instructions once a quiesce is encountered.
However due to a bug, it would continue fetching instructions from the current
fetch buffer. This is because of a break statment that only broke out of the
first of 2 nested loops. It should have broken out of both.
The o3 cpu could attempt to schedule inactive threads under round-robin SMT
mode.
This is because it maintained an independent priority list of threads from the
active thread list. This priority list could be come stale once threads were
inactive, leading to the cpu trying to fetch/commit from inactive threads.
Additionally the fetch queue is now forcibly flushed of instrctuctions
from the de-scheduled thread.
Relevant output:
24557000: system.cpu: [tid:1]: Calling deactivate thread.
24557000: system.cpu: [tid:1]: Removing from active threads list
24557500: system.cpu:
FullO3CPU: Ticking main, FullO3CPU.
24557500: system.cpu.fetch: Running stage.
24557500: system.cpu.fetch: Attempting to fetch from [tid:1]
When a branch mispredicted gem5 would squash all history after and including
the mispredicted branch. However, the mispredicted branch is still speculative
and its history is required to rollback state if another, older, branch
mispredicts. This leads to things like RAS corruption.
This patch adds a fetch queue that sits between fetch and decode to the
o3 cpu. This effectively decouples fetch from decode stalls allowing it
to be more aggressive, running futher ahead in the instruction stream.
The o3 pipeline interlock/stall logic is incorrect. o3 unnecessicarily stalled
fetch and decode due to later stages in the pipeline. In general, a stage
should usually only consider if it is stalled by the adjacent, downstream stage.
Forcing stalls due to later stages creates and results in bubbles in the
pipeline. Additionally, o3 stalled the entire frontend (fetch, decode, rename)
on a branch mispredict while the ROB is being serially walked to update the
RAT (robSquashing). Only should have stalled at rename.
As highlighed on the mailing list gem5's writeback modeling can impact
performance. This patch removes the limitation on maximum outstanding issued
instructions, however the number that can writeback in a single cycle is still
respected in instToCommit().
isa_parser.py guesses the OpClass if none were given based upon the StaticInst
flags. The existing code does not take into account optionally set flags.
This code hoists the setting of optional flags so OpClass is properly assigned.
If a set of LL/SC requests contend on the same cache block we
can get into a situation where CPUs will deadlock if they expect
a failed SC to supply them data. This case happens where 3 or
more cores are contending for a cache block using LL/SC and the system
is configured where 2 cores are connected to a local bus and the
third is connected to a remote bus. If a core on the local bus
sends an SCUpgrade and the core on the remote bus sends and SCUpgrade
they will race to see who will win the SC access. In the meantime
if the other core appends a read to one of the SCUpgrades it will expect
to be supplied data by that SCUpgrade transaction. If it happens that
the SCUpgrade that was picked to supply the data is failed, it will
drop the appended request for data and never respond, leaving the requesting
core to deadlock. This patch makes all SC's behave as normal stores to
prevent this case but still makes sure to check whether it can perform
the update.
The first DPRINTF() in PL390::writeDistributor always read a uint32_t, though a
packet may have only been 1 or 2 bytes. This caused an assertion in
packet->get().
This patch makes restoring the 'lastStopped' value for Ticked-containing
objects (including MinorCPU) optional so that Ticked-containing objects
can be restored from non-Ticked-containing objects (such as AtomicSimpleCPU).
We currently generate and compile one version of the ISA code per CPU
model. This is obviously wasting a lot of resources at compile
time. This changeset factors out the interface into a separate
ExecContext class, which also serves as documentation for the
interface between CPUs and the ISA code. While doing so, this
changeset also fixes up interface inconsistencies between the
different CPU models.
The main argument for using one set of ISA code per CPU model has
always been performance as this avoid indirect branches in the
generated code. However, this argument does not hold water. Booting
Linux on a simulated ARM system running in atomic mode
(opt/10.linux-boot/realview-simple-atomic) is actually 2% faster
(compiled using clang 3.4) after applying this patch. Additionally,
compilation time is decreased by 35%.
This patch prunes unused values, and also unifies how the values are
defined (not using an enum for ALPHA), aligning the use of int vs Addr
etc.
The patch also removes the duplication of PageBytes/PageShift and
VMPageSize/LogVMPageSize. For all ISAs the two pairs had identical
values and the latter has been removed.
When passed from a configuration script with a hexadecimal value (like
"0x80000000"), gem5 would error out. This is because it would call
"toMemorySize" which requires the argument to end with a size specifier (like
1MB, etc).
This modification makes it so raw hex values can be passed through Addr
parameters from the configuration scripts.
This patch fixes the hash operator used for ARM ExtMachInst, which
incorrectly was still using uint32_t. Instead of changing it to
uint64_t it is not using the underlying data type of the BitUnion.
The Index type defined as typedef int64 does not really provide any help
since in most places we use primitive types instead of Index. Also, the name
Index is very generic that it does not merit being used as a typename.
This patch is the final patch in a series of patches. The aim of the series
is to make ruby more configurable than it was. More specifically, the
connections between controllers are not at all possible (unless one is ready
to make significant changes to the coherence protocol). Moreover the buffers
themselves are magically connected to the network inside the slicc code.
These connections are not part of the configuration file.
This patch makes changes so that these connections will now be made in the
python configuration files associated with the protocols. This requires
each state machine to expose the message buffers it uses for input and output.
So, the patch makes these buffers configurable members of the machines.
The patch drops the slicc code that usd to connect these buffers to the
network. Now these buffers are exposed to the python configuration system
as Master and Slave ports. In the configuration files, any master port
can be connected any slave port. The file pyobject.cc has been modified to
take care of allocating the actual message buffer. This is inline with how
other port connections work.
A later changeset changes the file src/python/swig/pyobject.cc to include
a header file that includes a header file generated at build time depending
on the PROTOCOL in use. Since NULL ISA was not specifying any protocol,
this resulted in compilation problems. Hence, the changeset.
The namespace Message conflicts with the Message data type used extensively
in Ruby. Since Ruby is being moved to the same Master/Slave ports based
configuration style as the rest of gem5, this conflict needs to be resolved.
Hence, the namespace is being renamed to ProtoMessage.
There are two changes this patch makes to the way configurable members of a
state machine are specified in SLICC. The first change is that the data
member declarations will need to be separated by a semi-colon instead of a
comma. Secondly, the default value to be assigned would now use SLICC's
assignment operator i.e. ':='.
This patch changes the grammar for SLICC so as to remove some of the
redundant / duplicate rules. In particular rules for object/variable
declaration and class member declaration have been unified. Similarly, the
rules for a general function and a class method have been unified.
One more change is in the priority of two rules. The first rule is on
declaring a function with all the params typed and named. The second rule is
on declaring a function with all the params only typed. Earlier the second
rule had a higher priority. Now the first rule has a higher priority.
This patch enables the use of page tables that are stored in system memory
and respect x86 specification, in SE mode. It defines an architectural
page table for x86 as a MultiLevelPageTable class and puts a placeholder
class for other ISAs page tables, giving the possibility for future
implementation.
This patch defines a multi-level page table class that stores the page table in
system memory, consistent with ISA specifications. In this way, cpu models that
use the actual hardware to execute (e.g. KvmCPU), are able to traverse the page
table.
This patch ensures the cycle check is still valid even restoring from
a checkpoint. In this case the DRAMSim2 cycle count is relative to the
startTick rather than 0.
We currently use our own home-baked support for type-safe variadic
functions. This is confusing and somewhat limited (e.g., cprintf only
supports a limited number of arguments). This changeset converts all
uses of our internal varargs support to use C++11 variadic macros.
Add the macros M5_ATTR_FINAL and M5_ATTR_OVERRIDE which are defined to
final and override respectively if supported by the compiler. This is
done to allow a smooth transition to gcc >= 4.7.
If a bit field in a bit union specified as Bitfield<LSB, MSB> instead
of Bitfield<MSB, LSB> the code silently fails and the field is read as
zero. This changeset introduces a static assert that tests, at compile
time, that the bit order is correct.
The order of the MSB and LSB bit of the mm field in the PSTATE union
is wrong. Any access to this field will currently be ignored and reads
will always return zero. This patch fixes the ordering so it is <MSB,
LSB> instead of <LSB, MSB>.
This patch fixes a bug in the DRAM controller address decoding. In
cases where the DRAM burst size (e.g. 32 bytes in a rank with a single
LPDDR3 x32) was smaller than the channel interleaving size
(e.g. systems with a 64-byte cache line) one address bit effectively
got used as a channel bit when it should have been a low-order column
bit.
This patch adds a notion of "columns per stripe", and more clearly
deals with the low-order column bits and high-order column bits. The
patch also relaxes the granularity check such that it is possible to
use interleaving granularities other than the cache line size.
The patch also adds a missing M5_CLASS_VAR_USED to the tCK member as
it is only used in the debug build for now.
This patch adds a fix for older checkpoints before support for
multiple event queues were added in changeset 2cce74fe359e. The change
in checkpoint version should really hav ebeen part of the
aforementioned changeset.
Some newer binaries compiled for Versatile Express TC2 contain access
to implementation specific L2MERRSR registers. This causes an infinite
loop of undefined exceptions. This patch changes the behavior to "warn
not fail" to keep the workloads going.
Baremetal workloads are specified using the "kernel" parameter, but
don't always have the correct address mappings. This patch adds a
boolean flag to the system and bypasses the kernel addr mapping checks
when running in baremetal mode.
The branch predictor is normally only built when a CPU that uses a
branch predictor is built. The list of CPUs is currently incomplete as
the simple CPUs support branch predictors (for warming, branch stats,
etc). In practice, all CPU models now use branch predictors, so this
changeset removes the CPU model check and replaces it with a check for
the NULL ISA.
Certain versions of clang complain about unused private members if
they are not used. This changeset removes such members from the
MIPS-specific classes to silence the warning.
Certain versions of clang complain about unused private members if
they are not used. This changeset removes such members from the
POWER-specific ProcessInfo struct to silence the warning.
This changeset fixes three types of warnings that occur in clang 3.4
on Ubuntu 12.04:
* Certain versions of libstdc++ (primarily 4.8) use struct and class
interchangeably. This triggers a warning in clang.
* Swig has a tendency to generate code with the register class which
was deprecated in C++11. This triggers a deprecation warning in
clang.
* Swig sometimes generates Python wrapper code which returns
uninitialized values. It's unclear if this is actually a problem
(the cases might be limited to failure paths). We'll silence these
warnings for now since there is little we can do about the
generated code.
The M5_PRAGMA_NORETURN macro was only used in for
__exit_message. Since the macro only holds a stub definition and all
functions with noreturn semantics use the M5_ATTR_NORETURN, this
macros is completely redundant.
RefCountingPtr is sometimes forward declared to avoid having to
include refcnt.hh. This does not work since we typically return
instances of RefCountingPtr rather than references to instances. The
only reason this currently works is that we include refcnt.hh in
cprintf.hh, which "leaks" the header to most other source files. This
changeset replaces such forward declarations with an include of
refcnt.hh.
When a cacheline is written back to a lower-level cache,
tags->insertBlock() sets various status parameters. However these
status bits were cleared immediately after calling. This patch makes
it so that these status fields are not cleared by moving them outside
of the tags->insertBlock() call.
This patch does some minor house keeping of the branch predictor by
adopting STL containers, and shifting some iterator to use range-based
for loops.
The predictor history is also changed from a list to a deque as we
never to insertion/deletion other than at the front and back.
This patch adds the SubSystem container for grouping
simobjects together in logical subsystems to facilitate
building a larger system from constituent parts. The container
is simply a non-abstract empty simobject to hold the components
that will be connected as its children. In simulation the
object does not participate, its only use is during configuration
of the system.
This patch adds helper functions to SimObject.py, params.py and
simulate.py to enable the new configuration system. Functions like
enumerateParams() in SimObject lets the config system auto-generate
command line options for simobjects to be modified on the command
line.
Params in params.py have __call__() added
to their definition to allow the argparse module to use them
as a type to check command input is in the proper format.
This patch adds a check to ensure that packets which are not going to
a memory range are suppressed in the traffic generator. Thus, if a
trace is collected in full-system, the packets destined for devices
are not played back.
this patch implements a new tags class that uses a random replacement policy.
these tags prefer to evict invalid blocks first, if none are available a
replacement candidate is chosen at random.
this patch factors out the common code in the LRU class and creates a new
abstract class: the BaseSetAssoc class. any set associative tag class must
implement the functionality related to the actual replacement policy in the
following methods:
accessBlock()
findVictim()
insertBlock()
invalidate()
This patch contains a new CPU model named `Minor'. Minor models a four
stage in-order execution pipeline (fetch lines, decompose into
macroops, decompose macroops into microops, execute).
The model was developed to support the ARM ISA but should be fixable
to support all the remaining gem5 ISAs. It currently also works for
Alpha, and regressions are included for ARM and Alpha (including Linux
boot).
Documentation for the model can be found in src/doc/inside-minor.doxygen and
its internal operations can be visualised using the Minorview tool
utils/minorview.py.
Minor was designed to be fairly simple and not to engage in a lot of
instruction annotation. As such, it currently has very few gathered
stats and may lack other gem5 features.
Minor is faster than the o3 model. Sample results:
Benchmark | Stat host_seconds (s)
---------------+--------v--------v--------
(on ARM, opt) | simple | o3 | minor
| timing | timing | timing
---------------+--------+--------+--------
10.linux-boot | 169 | 1883 | 1075
10.mcf | 117 | 967 | 491
20.parser | 668 | 6315 | 3146
30.eon | 542 | 3413 | 2414
40.perlbmk | 2339 | 20905 | 11532
50.vortex | 122 | 1094 | 588
60.bzip2 | 2045 | 18061 | 9662
70.twolf | 207 | 2736 | 1036
Stop setting the use_default_range flag in PioBus in order to
have random bad addresses result in a BadAddress response and
not a gem5 fatal error. This is necessary in Ruby as Ruby is
connected directly to PioBus, so misspeculated addresses will
be sent there directly. For the classic memory system, this
change has no effect, as bad addresses are caught by the
memory bus before being sent to the PioBus.
This work was done while Binh was an intern at AMD Research.
The System object has a static MemoryModeStrings array
that's (1) unused and (2) redundant, since there's an
auto-generated version in the Enums namespace. No
point in leaving it in.
When we switched getSyscallArg() from explicit arg indices to
the implicit method, some DPRINTF arguments were left as calls
to getSyscallArg(), even though C/C++ doesn't guarantee
anything about the order of invocation of these calls. As a
result, the args could be printed out in arbitrary orders.
Interestingly, this bug has been around since 2009:
http://repo.gem5.org/gem5/rev/4842482e1bd1
this operator uses memcmp() to detect if two EthAddr object have the same
address, however memcmp() will return 0 if all bytes are equal. operator==
returns the return value of memcmp() to indicate whether or not two
address are equal. this is incorrect as it will always give the opposite of
the intended behavior. this patch fixes that problem.
per the IEEE 802 spec:
1) fixed broadcast() to ensure that all bytes are equal to 0xff.
2) fixed unicast() to ensure that bit 0 of the first byte is equal to 0
3) fixed multicast() to ensure that bit 0 of the first byte is equal to 1, and
that it is not a broadcast.
also the constructors in EthAddr are fixed so that all bytes of data are
initialized.
Adds DVFS capabilities to gem5, by allowing users to specify lists for
frequencies and voltages in SrcClockDomains and VoltageDomains respectively.
A separate component, DVFSHandler, provides a small interface to change
operating points of the associated domains.
Clock domains will be linked to voltage domains and thus allow separate clock,
but shared voltage lines.
Currently all the valid performance-level updates are performed with a fixed
transition latency as specified for the domain.
Config file example:
...
vd = VoltageDomain(voltage = ['1V','0.95V','0.90V','0.85V'])
tsys.cluster1.clk_domain.clock = ['1GHz','700MHz','400MHz','230MHz']
tsys.cluster2.clk_domain.clock = ['1GHz','700MHz','400MHz','230MHz']
tsys.cluster1.clk_domain.domain_id = 0
tsys.cluster2.clk_domain.domain_id = 1
tsys.cluster1.clk_domain.voltage_domain = vd
tsys.cluster2.clk_domain.voltage_domain = vd
tsys.dvfs_handler.domains = [tsys.cluster1.clk_domain,
tsys.cluster2.clk_domain]
tsys.dvfs_handler.enable = True
This patch adds a DRAMPower flag to enable off-line DRAM power
analysis using the DRAMPower tool. A new DRAMPower flag is added
and a follow-on patch adds a Python script to post-process the output
and order it based on time stamps.
The long-term goal is to link DRAMPower as a library and provide the
commands through function calls to the model rather than first
printing and then parsing the commands. At the moment it is also up to
the user to ensure that the same DRAM configuration is used by the
gem5 controller model and DRAMPower.
This patch adds the index of the bank and rank as a field so that we can
determine the identity of a given bank (reference or pointer) for the
power tracing. We also grab the opportunity of cleaning up the
arguments used for identifying the bank when activating.
In a cycle, we could see a R and W requests corresponding to the same
page walk being sent to the memory. During the cycle that assertion
happens, we have 2 responses corresponding to the R and W above. We
also have a 'read' variable to keep track of the inflight Read
request, this gets reset to NULL right after we send out any R
request; and gets set to the next R in the page walk when a response
comes back.
The issue we are seeing here is when we get a response for W request,
assert(!read) fires because we got a response for R request right
before this, hence we set 'read' to NOT NULL value, pointing to the
next R request in the pagewalk!
This work was done while Binh was an intern at AMD Research.
Check for free entries in Load Queue and Store Queue separately to
avoid cases when load cannot be renamed due to full Store Queue and
vice versa.
This work was done while Binh was an intern at AMD Research.
This patch bumps the supported version of gcc from 4.4 to 4.6, and
clang from 2.9 to 3.0. This enables, amongst other things, range-based
for loops, lambda expressions, etc. The STL implementation shipping
with 4.6 also has a full functional implementation of unique_ptr and
shared_ptr.
The language describing the clockEdge and nextCycle functions were ambiguous,
and so were prone to misinterpretation/misuse. Clear up the comments to more
rigorously describe their functionality.
Using '== true' in a boolean expression is totally redundant,
and using '== false' is pretty verbose (and arguably less
readable in most cases) compared to '!'.
It's somewhat of a pet peeve, perhaps, but I had some time
waiting for some tests to run and decided to clean these up.
Unfortunately, SLICC appears not to have the '!' operator,
so I had to leave the '== false' tests in the SLICC code.
This patch removes the stat totalCommittedInsts. This variable was used for
recording the total number of instructions committed across all the threads
of a core. The instructions committed by each thread are recorded invidually.
The total would now be generated by summing these individual counts.
This patch makes a more firm connection between the DDR3-1600
configuration and the corresponding datasheet, and also adds a
DDR3-2133 and a DDR4-2400 configuration. At the moment there is also
an ongoing effort to align the choice of datasheets to what is
available in DRAMPower.
This patch extends the current timing parameters with the DRAM cycle
time. This is needed as the DRAMPower tool expects timestamps in DRAM
cycles. At the moment we could get away with doing this in a
post-processing step as the DRAMPower execution is separate from the
simulation run. However, in the long run we want the tool to be called
during the simulation, and then the cycle time is needed.
This patch adds the basic ingredients for a precharge all operation,
to be used in conjunction with DRAM power modelling.
Currently we do not try and apply any cleverness when precharging all
banks, thus even if only a single bank is open we use PREA as opposed
to PRE. At the moment we only have a single tRP (tRPpb), and do not
model the slightly longer all-bank precharge constraint (tRPab).
This patch adds the tRTP timing constraint, governing the minimum time
between a read command and a precharge. Default values are provided
for the existing DRAM types.
This patch merges the two control paths used to estimate the latency
and update the bank state. As a result of this merging the computation
is now in one place only, and should be easier to follow as it is all
done in absolute (rather than relative) time.
As part of this change, the scheduling is also refined to ensure that
we look at a sensible estimate of the bank ready time in choosing the
next request. The bank latency stat is removed as it ends up being
misleading when the DRAM access code gets evaluated ahead of time (due
to the eagerness of waking the model up for scheduling the next
request).
This patch adds the write recovery time to the DRAM timing
constraints, and changes the current tRASDoneAt to a more generic
preAllowedAt, capturing when a precharge is allowed to take place.
The part of the DRAM access code that accounts for the precharge and
activate constraints is updated accordingly.
This patch adds power states to the controller. These states and the
transitions can be used together with the Micron power model. As a
more elaborate use-case, the transitions can be used to drive the
DRAMPower tool.
At the moment, the power-down modes are not used, and this patch
simply serves to capture the idle, auto refresh and active modes. The
patch adds a third state machine that interacts with the refresh state
machine.
This patch adds a state machine for the refresh scheduling to
ensure that no accesses are allowed while the refresh is in progress,
and that all banks are propely precharged.
As part of this change, the precharging of banks of broken out into a
method of its own, making is similar to how activations are dealt
with. The idle accounting is also updated to ensure that the refresh
duration is not added to the time that the DRAM is in the idle state
with all banks precharged.
This patch changes the read/write event loop to use a single event
(nextReqEvent), along with a state variable, thus joining the two
control flows. This change makes it easier to follow the state
transitions, and control what happens when.
With the new loop we modify the overly conservative switching times
such that the write-to-read switch allows bank preparation to happen
in parallel with the bus turn around. Similarly, the read-to-write
switch uses the introduced tRTW constraint.
This patch adds a the member function StaticInst::printFlags to allow all
of an instruction's flags to be printed without using the individual
is... member functions or resorting to exposing the 'flags' vector
It also replaces the enum definition StaticInst::Flags with a
Python-generated enumeration and adds to the enum generation mechanism
in src/python/m5/params.py to allow Enums to be placed in namespaces
other than Enums or, alternatively, in wrapper structs allowing them to
be inherited by other classes (so populating that class's name-space
with the enumeration element names).
This patch encompasses several interrelated and interdependent changes
to the ISA generation step. The end goal is to reduce the size of the
generated compilation units for instruction execution and decoding so
that batch compilation can proceed with all CPUs active without
exhausting physical memory.
The ISA parser (src/arch/isa_parser.py) has been improved so that it can
accept 'split [output_type];' directives at the top level of the grammar
and 'split(output_type)' python calls within 'exec {{ ... }}' blocks.
This has the effect of "splitting" the files into smaller compilation
units. I use air-quotes around "splitting" because the files themselves
are not split, but preprocessing directives are inserted to have the same
effect.
Architecturally, the ISA parser has had some changes in how it works.
In general, it emits code sooner. It doesn't generate per-CPU files,
and instead defers to the C preprocessor to create the duplicate copies
for each CPU type. Likewise there are more files emitted and the C
preprocessor does more substitution that used to be done by the ISA parser.
Finally, the build system (SCons) needs to be able to cope with a
dynamic list of source files coming out of the ISA parser. The changes
to the SCons{cript,truct} files support this. In broad strokes, the
targets requested on the command line are hidden from SCons until all
the build dependencies are determined, otherwise it would try, realize
it can't reach the goal, and terminate in failure. Since build steps
(i.e. running the ISA parser) must be taken to determine the file list,
several new build stages have been inserted at the very start of the
build. First, the build dependencies from the ISA parser will be emitted
to arch/$ISA/generated/inc.d, which is then read by a new SCons builder
to finalize the dependencies. (Once inc.d exists, the ISA parser will not
need to be run to complete this step.) Once the dependencies are known,
the 'Environments' are made by the makeEnv() function. This function used
to be called before the build began but now happens during the build.
It is easy to see that this step is quite slow; this is a known issue
and it's important to realize that it was already slow, but there was
no obvious cause to attribute it to since nothing was displayed to the
terminal. Since new steps that used to be performed serially are now in a
potentially-parallel build phase, the pathname handling in the SCons scripts
has been tightened up to deal with chdir() race conditions. In general,
pathnames are computed earlier and more likely to be stored, passed around,
and processed as absolute paths rather than relative paths. In the end,
some of these issues had to be fixed by inserting serializing dependencies
in the build.
Minor note:
For the null ISA, we just provide a dummy inc.d so SCons is never
compelled to try to generate it. While it seems slightly wrong to have
anything in src/arch/*/generated (i.e. a non-generated 'generated' file),
it's by far the simplest solution.
The unproxy code for Parent.any can generate a circular reference
in certain situations with classes hierarchies like those in ClockDomain.py.
This patch solves this by marking ouself as visited to make sure the
search does not resolve to a self-reference.
The ARM TLBs have a bootUncacheability flag used to make some loads
and stores become uncacheable when booting in FS mode. Later the
flag is cleared to let those loads and stores operate as normal. When
doing a takeOverFrom(), this flag's state is not preserved and is
momentarily reset until the CPSR is touched. On single core runs this
is a non-issue. On multi-core runs this can lead to crashes on the O3
CPU model from the following series of events:
1) takeOverFrom executed to switch from Atomic -> O3
2) All bootUncacheability flags are reset to true
3) Core2 tries to execute a load covered by bootUncacheability, it
is flagged as uncacheable
4) Core2's load needs to replay due to a pipeline flush
3) Core1 core does an action on CPSR
4) The handling code for CPSR then checks all other cores
to determine if bootUncacheability can be set to false
5) Asynchronously set bootUncacheability on all cores to false
6) Core2 replays load previously set as uncacheable and notices
it is now flagged as cacheable, leads to a panic.
This patch implements takeOverFrom() functionality for the ARM TLBs
to preserve flag values when switching from atomic -> detailed.
For the o3, add instruction mix (OpClass) histogram at commit (stats
also already collected at issue). For the simple CPUs we add a
histogram of executed instructions
This patch squashes prefetch requests from downstream caches,
so that they do not steal cachelines away from caches closer
to the cpu. It was originally coded by Mitch Hayenga and
modified by Aasheesh Kolli.
This source for stats binds an object and a method / function from the object
to a stats object. This allows pulling out stats from object methods without
needing to go through a global, or static shim.
Syntax is somewhat unpleasant, but the templates and method pointer type
specification were quite tricky. Interface is very clean though; and similar
to .functor
Allow the specification of a socket ID for every core that is reflected in the
MPIDR field in ARM systems. This allows studying multi-socket / cluster
systems with ARM CPUs.
Splits the CommMonitor trace_file parameter into three parameters. Previously,
the trace was only enabled if the trace_file parameter was set, and would be
written to this file. This patch adds in a trace_enable and trace_compress
parameter to the CommMonitor.
No trace is generated if trace_enable is set to False. If it is set to True, the
trace is written to a file based on the name of the SimObject in the simulation
hierarchy. For example, system.cluster.il1_commmonitor.trc. This filename can be
overridden by additionally specifying a file name to the trace_file parameter
(more on this later).
The trace_compress parameter will append .gz to any filename if set to True.
This enables compression of the generated traces. If the file name already ends
in .gz, then no changes are made.
The trace_file parameter will override the name set by the trace_enable
parameter. In the case that the specified name does not end in .gz but
trace_compress is set to true, .gz is appended to the supplied file name.
Unimplemented miscregs for the generic timer were guarded by panics
in arm/isa.cc which can be tripped by the O3 model if it speculatively
executes a wrong path containing a mrs instruction with a bad miscreg
index. These registers were flagged as implemented and accessible.
This patch changes the miscreg info bit vector to flag them as
unimplemented and inaccessible. In this case, and UndefinedInst
fault will be generated if the register access is not trapped
by a hypervisor.
This patch changes the default pixel clock to effectively generate
1080p resolution at 60 frames per second. It is dependent upon the
kernel device tree file using the specified resolution / display
string in the comments.
This is a quick hack to communicate a greater number of CPUs to a guest OS via
the ARM A9 SCU config register. Some OSes (Linux) just look at the bottom field
to count CPUs and with a small change can look at bits [3:0] to learn about up
to 16 CPUs.
Very much unsupported (and contains warning messages as such) but useful for
running 8 core sims without hardwiring CPU count in the guest OS.
With (upcoming) separate compilation, they are useless. Only
link-time optimization could re-inline them, but ideally
feedback-directed optimization would choose to do so only for
profitable (i.e. common) instructions.
The MicroMemOp class generates the disassembly for both integer
and floating point instructions, but it would always print its
first operand as an integer register without considering that the
op may be a floating instruction in which case a float register
should be displayed instead.
In the O3 LSQ, data read/written is printed out in DPRINTFs. However,
the data field is treated as a character string with a null terminated.
However the data field is not encoded this way. This patch removes
that possibility by removing the data part of the print.
This never actually worked since it was printing out only a word
of the cache block and not the entire thing and doubly didn't work
csprintf overrides the %#x specifier and assumes a char* array is
actually a string.
O3CPU has a compile-time maximum width set in o3/impl.hh, but checking
the configuration against this limit was not implemented anywhere
except for fetch. Configuring a wider pipe than the limit can silently
cause various issues during the simulation. This patch adds the proper
checking in the constructor of the various pipeline stages.
Without this declaration, new clangs will complain about this value
being unused. It has no explicit use in the codebase, but it can be
useful to turn on all debugging flags while in a debugger to greatly
increase simulator verbosity.
A number of calls to isEmpty() and numFreeEntries()
should be thread-specific.
In cpu.cc, the fact that tid is /*commented*/ out is a bug. Say the rob
has instructions from thread 0 (isEmpty() returns false), and none from
thread 1. If we are trying to squash all of thread 1, then
readTailInst(thread 1) will be called because rob->isEmpty() returns
false. The result is end_it is not in the list and the while
statement loops indefinitely back over the cpu's instList.
In iew_impl.hh, all threads are told they have the entire remaining IQ, when
each thread actually has a certain allocation. The result is extra stalls at
the iew dispatch stage which the rename stage usually takes care of.
In commit_impl.hh, rob->readHeadInst(thread 1) can be called if the rob only
contains instructions from thread 0. This returns a dummyInst (which may work
since we are trying to squash all instructions, but hardly seems like the right
way to do it).
In rob_impl.hh this fix skips the rest of the function more frequently and is
more efficient.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Upon aggregating records, serialize system's cache-block size, as the
cache-block size can be different when restoring from a checkpoint. This way,
we can correctly read all records when restoring from a checkpoints, even if
the cache-block size is different.
Note, that it is only possible to restore from a checkpoint if the
desired cache-block size is smaller or equal to the cache-block size
when the checkpoint was taken; we can split one larger request into
multiple small ones, but it is not reliable to do the opposite.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Simulating a SMP or multicore requires devices to be shared between
multiple KVM vCPUs. This means that locking is required when accessing
devices. This changeset adds the necessary locking to allow devices to
execute correctly. It is implemented by temporarily migrating the KVM
CPU to the VM's (and devices) event queue when handling
MMIO. Similarly, the VM migrates to the interrupt controller's event
queue when delivering an interrupt.
The support for fast-forwarding of multicore simulations added by this
changeset assumes that all devices in a system are simulated in the
same thread and each vCPU has its own thread. Special care must be
taken to ensure that devices living under the CPU in the object
hierarchy (e.g., the interrupt controller) do not inherit the parent
CPUs thread and are assigned to device thread. The KvmVM object is
assumed to live in the same thread as the other devices in the system.
The calling thread is undefined when the PollQueue services events.
This implies that PollEvents need to handle the case where they are
processed from a different thread than the thread that created the
event. This changeset adds temporary event queue migrations to the VNC
server, the ethernet tap device, and the terminal to protect them from
inter-thread calls.
As of now, the enqueue statement can take in any number of 'pairs' as
argument. But we only use the pair in which latency is the key. This
latency is allowed to be either a fixed integer or a member variable of
controller in which the expression appears. This patch drops the use of pairs
in an enqueue statement. Instead, an expression is allowed which will be
interpreted to be the latency of the enqueue. This expression can anything
allowed by slicc including a constant integer or a member variable.
We need the ability to lock event queues to enable device accesses
across threads. The serviceOne() method now takes a service lock prior
to handling a new event. By locking an event queue, a different
thread/eq can effectively execute in the context of the locked event
queue. To simplify temporary event queue migrations, this changeset
introduces the EventQueue::ScopedMigration class that unlocks the
current event queue, locks a new event queue, and updates the current
event queue variable.
In order to prevent deadlocks, event queues need to be released when
waiting on barriers. This is implemented using the
EventQueue::ScopedRelease class. An instance of this class is, for
example, used in the BaseGlobalEvent class to release the event queue
when waiting on the synchronization barrier.
The intended use for this functionality is when devices need to be
accessed across thread boundaries. For example, when fast-forwarding,
it might be useful to run devices and CPUs in separate threads. In
such a case, the CPU locks the device queue whenever it needs to
perform IO. This functionality is primarily intended for KVM.
Note: Migrating between event queues can lead to non-deterministic
timing. Use with extreme care!
--HG--
extra : rebase_source : 23e3a741a1fd73861d1339782dbbe1bc76285315
This patch fixes violation of TSO in the O3CPU, as all loads must be
ordered with all other loads. In the LQ, if a snoop is observed, all
subsequent loads need to be squashed if the system is TSO.
Prior to this patch, the following case could be violated:
P0 | P1 ;
MOV [x],mail=/usr/spool/mail/nilay | MOV EAX,[y] ;
MOV [y],mail=/usr/spool/mail/nilay | MOV EBX,[x] ;
exists (1:EAX=1 /\ 1:EBX=0) [is a violation]
The problem was found using litmus [http://diy.inria.fr].
Committed by: Nilay Vaish <nilay@cs.wisc.edu
This patch adds stats for tracking the number of reads/writes per bus
turn around, and also adds hysteresis to the write-to-read switching
to ensure that the queue does not oscilate around the low threshold.
This patch renames the not-so-simple SimpleDRAM to a more suitable
DRAMCtrl. The name change is intended to ensure that we do not send
the wrong message (although the "simple" in SimpleDRAM was originally
intended as in cleverly simple, or elegant).
As the DRAM controller modelling work is being presented at ISPASS'14
our hope is that a broader audience will use the model in the future.
--HG--
rename : src/mem/SimpleDRAM.py => src/mem/DRAMCtrl.py
rename : src/mem/simple_dram.cc => src/mem/dram_ctrl.cc
rename : src/mem/simple_dram.hh => src/mem/dram_ctrl.hh
Make the default memory type DDR3-1600 x64, and use the open-adaptive
page policy. This change is aiming to ensure that users by default are
using a realistic memory system.
This patch adds a basic starvation-prevention mechanism where a DRAM
page is forced to close after a certain number of accesses. The limit
is combined with the open and open-adaptive page policy and if reached
causes an auto-precharge.
This patch changes the triggering condition for the write draining
such that we grab the opportunity to issue writes if there are no
reads waiting (as opposed to waiting for the writes to reach the high
threshold). As a result, we potentially drain some of the writes in read
idle periods (if any).
A low threshold is added to be able to control how many write bursts
are kept in the memory controller queue (acting as on-chip storage).
The high and low thresholds are updated to sensible values for a 32/64
size write buffer. Note that the thresholds should be adjusted along
with the queue sizes.
This patch also adds some basic initialisation sanity checks and moves
part of the initialisation to the constructor.
This patch enables a new 'DRAM' mode to the existing traffic
generator, catered to generate specific requests to DRAM based on
required hit length (stride size) and bank utilization. It is an add on
to the Random mode.
The basic idea is to control how many successive packets target the
same page, and how many banks are being used in parallel. This gives a
two-dimensional space that stresses different aspects of the DRAM
timing.
The configuration file needed to use this patch has to be changed as
follow: (reference to Random Mode, LPDDR3 memory type)
'STATE 0 10000000000 RANDOM 50 0 134217728 64 3004 5002 0'
-> 'STATE 0 10000000000 DRAM 50 0 134217728 32 3004 5002 0 96 1024 8 6 1'
The last 4 parameters to be added are:
<stride size (bytes), page size(bytes), number of banks available in DRAM,
number of banks to be utilized, address mapping scheme>
The address mapping information is used to get the stride address
stream of the specified size and to know where to find the bank
bits. The configuration file has a parameter where '0'-> RoCoRaBaCh,
'1'-> RoRaBaCoCh/RoRaBaChCo address-mapping schemes. Note that the
generator currently assumes a single channel and a single rank. This
is to avoid overwhelming the traffic generator with information about
the memory organisation.
This patch adds the row bits to the name of the address mapping
schemes to make it more clear that all the current schemes places the
row bits as the most significant bits.
This patch moves the Ruby-related debug flags to the ruby
sub-directory, and also removes the state SConsopts that add the
no-longer-used NO_VECTOR_BOUNDS_CHECK.
Prevent incomplete configuration of TrafficGen class from causing
segmentation faults. If an 'INIT' line is not present in the
configuration file then the currState variable will remain
uninitialized which may result in a crash.
There were several sections of the m5ops code which were
essentially copy/pasted versions of the 32-bit code. The
problem is that some of these didn't account fo4 64-bit
registers leading to arguments being in the wrong registers.
This patch addresses the args for readfile64, writefile64,
and addsymbol64 -- all of which seemed to suffer from a
similar set of problems when moving to 64-bit.
Each consumer object maintains a set of tick values when the object is supposed
to wakeup and do some processing. As of now, the object accesses this set both
when scheduling a wakeup event and when the object actually wakes up. The set
is accessed during wakeup to remove the current tick value from the set. This
functionality is now being moved to the scheduling function where ticks are
removed at a later time.
This helps in configuring the network interfaces from the python script and
these objects no longer rely on the network object for the timing information.
Piobus was recently added to se scripts for ruby so that the interrupt
controller can be connected to something (required since the interrupt
controller sends address range messages). This patch removes the piobus
and instead, the pio port of ruby port will now ignore the range change
messages in se mode.
KVM used to use two signals, one for instruction count exits and one
for timer exits. There is really no need to distinguish between the
two since they only trigger exits from KVM. This changeset unifies and
renames the signals and adds a method, kick(), that can be used to
raise the control signal in the vCPU thread. It also removes the early
timer warning since we do not normally see if the signal was
delivered.
--HG--
extra : rebase_source : cd0e45ca90894c3d6f6aa115b9b06a1d8f0fda4d
gem5 seems to store the PC as RIP+CS_BASE. This is not what KVM
expects, so we need to subtract CS_BASE prior to transferring the PC
into KVM. This changeset adds the necessary PC manipulation and
refactors thread context updates slightly to avoid reading registers
multiple times from KVM.
--HG--
extra : rebase_source : 3f0569dca06a1fcd8694925f75c8918d954ada44
This changeset adds support for INIT and STARTUP IPI handling. We
currently handle both of these interrupts in gem5 and transfer the
state to KVM. Since we do not have a BIOS loaded, we pretend that the
INIT interrupt suspends the CPU after reset.
--HG--
extra : rebase_source : 7f3b25f3801d68f668b6cd91eaf50d6f48ee2a6a
The table walker code currently accounts for two types of walks,
Atomic and Timing, and treats them differently. Atomic walks keep a
single instance of WalkerState around for all walks to use in
currState. Timing mode keeps a queue of in-flight WalkerStates and
maintains currState as NULL between walks.
If a functional walk is done during Timing mode, it is treated as an
atomic walk and either creates a persistent WalkerState if in between
Timing walks, or stomps an existing currState for an in-progress
Timing walk.
This patch distinguishes functional walks as being able to exist at
any time and sets up a temporary WalkerState for its exclusive use and
then cleans up when finished, leaving any in progress Atomic or Timing
walks undisturbed.
This patch fixes an assert condition that is not true at all
times. There are valid situations that arise in dual-core
dual-workload runs where the assert condition is false. The function
call following the assert however needs to be called only when the
condition is true (a block cannot be invalidated in the tags structure
if has not been allocated in the structure, and the tempBlock is never
allocated). Hence the 'assert' has been replaced with an 'if'.
This patch changes the decode script to output the optional fields of
the proto message Packet, namely id and flags. The flags field is set
by the communication monitor.
The id field is useful for CPU trace experiments, e.g. linking the
fetch side to decode side. It had to be renamed because it clashes
with a built in python function id() for getting the "identity" of an
object.
This patch also takes a few common function definitions out from the
multiple scripts and adds them to a protolib python module.
This snippet can be used to replace if + {panics, fatals, asserts} constructs.
The idea is to have both the condition checking and a verbose printout in a single statement. The interface is as follows:
panic_if(foo != bar, "These should be equal: foo %i bar %i", foo, bar);
fatal_if(foo != bar, "These should be equal: foo %i bar %i", foo, bar);
chatty_assert(foo == bar, "These should be equal: foo %i bar %i", foo, bar);
Small fix for a warning that prevents compilation with gcc 4.8.1 due
to detecting that a variable might be uninitialised. The fix is to
assign a safe default.
The global synchronization event used to be scheduled at
simQuantum. This prevented repeated entries into gem5 from Python as
it can be scheduled in the past. This changeset ensures that the first
global synchronization happens at curTick() + simQuantum instead.
The TSL/LDT & TR/TSS segments didn't contain valid attributes. This
caused problems when transfering the state into KVM where invalid
state is a no-go. Fixup the attributes with values from AMD's
architecture programmer's manual.
When transferring segment registers into kvm, we need to find the
value of the unusable bit. We used to assume that this could be
inferred from the selector since segments are generally unusable if
their selector is 0. This assumption breaks in some weird corner
cases. Instead, we just assume that segments are always usable. This
is what qemu does so it should work.
Signal handlers in KVM are controlled per thread and should be
initialized from the thread that is going to execute the CPU. This
changeset moves the initialization call from startup() to
startupThread().