This patch fixes the load blocked/replay mechanism in the o3 cpu. Rather than
flushing the entire pipeline, this patch replays loads once the cache becomes
unblocked.
Additionally, deferred memory instructions (loads which had conflicting stores),
when replayed would not respect the number of functional units (only respected
issue width). This patch also corrects that.
Improvements over 20% have been observed on a microbenchmark designed to
exercise this behavior.
O3 is supposed to stop fetching instructions once a quiesce is encountered.
However due to a bug, it would continue fetching instructions from the current
fetch buffer. This is because of a break statment that only broke out of the
first of 2 nested loops. It should have broken out of both.
The o3 cpu could attempt to schedule inactive threads under round-robin SMT
mode.
This is because it maintained an independent priority list of threads from the
active thread list. This priority list could be come stale once threads were
inactive, leading to the cpu trying to fetch/commit from inactive threads.
Additionally the fetch queue is now forcibly flushed of instrctuctions
from the de-scheduled thread.
Relevant output:
24557000: system.cpu: [tid:1]: Calling deactivate thread.
24557000: system.cpu: [tid:1]: Removing from active threads list
24557500: system.cpu:
FullO3CPU: Ticking main, FullO3CPU.
24557500: system.cpu.fetch: Running stage.
24557500: system.cpu.fetch: Attempting to fetch from [tid:1]
When a branch mispredicted gem5 would squash all history after and including
the mispredicted branch. However, the mispredicted branch is still speculative
and its history is required to rollback state if another, older, branch
mispredicts. This leads to things like RAS corruption.
This patch adds a fetch queue that sits between fetch and decode to the
o3 cpu. This effectively decouples fetch from decode stalls allowing it
to be more aggressive, running futher ahead in the instruction stream.
The o3 pipeline interlock/stall logic is incorrect. o3 unnecessicarily stalled
fetch and decode due to later stages in the pipeline. In general, a stage
should usually only consider if it is stalled by the adjacent, downstream stage.
Forcing stalls due to later stages creates and results in bubbles in the
pipeline. Additionally, o3 stalled the entire frontend (fetch, decode, rename)
on a branch mispredict while the ROB is being serially walked to update the
RAT (robSquashing). Only should have stalled at rename.
As highlighed on the mailing list gem5's writeback modeling can impact
performance. This patch removes the limitation on maximum outstanding issued
instructions, however the number that can writeback in a single cycle is still
respected in instToCommit().
isa_parser.py guesses the OpClass if none were given based upon the StaticInst
flags. The existing code does not take into account optionally set flags.
This code hoists the setting of optional flags so OpClass is properly assigned.
If a set of LL/SC requests contend on the same cache block we
can get into a situation where CPUs will deadlock if they expect
a failed SC to supply them data. This case happens where 3 or
more cores are contending for a cache block using LL/SC and the system
is configured where 2 cores are connected to a local bus and the
third is connected to a remote bus. If a core on the local bus
sends an SCUpgrade and the core on the remote bus sends and SCUpgrade
they will race to see who will win the SC access. In the meantime
if the other core appends a read to one of the SCUpgrades it will expect
to be supplied data by that SCUpgrade transaction. If it happens that
the SCUpgrade that was picked to supply the data is failed, it will
drop the appended request for data and never respond, leaving the requesting
core to deadlock. This patch makes all SC's behave as normal stores to
prevent this case but still makes sure to check whether it can perform
the update.
The first DPRINTF() in PL390::writeDistributor always read a uint32_t, though a
packet may have only been 1 or 2 bytes. This caused an assertion in
packet->get().
This patch makes restoring the 'lastStopped' value for Ticked-containing
objects (including MinorCPU) optional so that Ticked-containing objects
can be restored from non-Ticked-containing objects (such as AtomicSimpleCPU).
We currently generate and compile one version of the ISA code per CPU
model. This is obviously wasting a lot of resources at compile
time. This changeset factors out the interface into a separate
ExecContext class, which also serves as documentation for the
interface between CPUs and the ISA code. While doing so, this
changeset also fixes up interface inconsistencies between the
different CPU models.
The main argument for using one set of ISA code per CPU model has
always been performance as this avoid indirect branches in the
generated code. However, this argument does not hold water. Booting
Linux on a simulated ARM system running in atomic mode
(opt/10.linux-boot/realview-simple-atomic) is actually 2% faster
(compiled using clang 3.4) after applying this patch. Additionally,
compilation time is decreased by 35%.
This patch prunes unused values, and also unifies how the values are
defined (not using an enum for ALPHA), aligning the use of int vs Addr
etc.
The patch also removes the duplication of PageBytes/PageShift and
VMPageSize/LogVMPageSize. For all ISAs the two pairs had identical
values and the latter has been removed.
When passed from a configuration script with a hexadecimal value (like
"0x80000000"), gem5 would error out. This is because it would call
"toMemorySize" which requires the argument to end with a size specifier (like
1MB, etc).
This modification makes it so raw hex values can be passed through Addr
parameters from the configuration scripts.
This patch fixes the hash operator used for ARM ExtMachInst, which
incorrectly was still using uint32_t. Instead of changing it to
uint64_t it is not using the underlying data type of the BitUnion.
The Index type defined as typedef int64 does not really provide any help
since in most places we use primitive types instead of Index. Also, the name
Index is very generic that it does not merit being used as a typename.
This patch is the final patch in a series of patches. The aim of the series
is to make ruby more configurable than it was. More specifically, the
connections between controllers are not at all possible (unless one is ready
to make significant changes to the coherence protocol). Moreover the buffers
themselves are magically connected to the network inside the slicc code.
These connections are not part of the configuration file.
This patch makes changes so that these connections will now be made in the
python configuration files associated with the protocols. This requires
each state machine to expose the message buffers it uses for input and output.
So, the patch makes these buffers configurable members of the machines.
The patch drops the slicc code that usd to connect these buffers to the
network. Now these buffers are exposed to the python configuration system
as Master and Slave ports. In the configuration files, any master port
can be connected any slave port. The file pyobject.cc has been modified to
take care of allocating the actual message buffer. This is inline with how
other port connections work.
A later changeset changes the file src/python/swig/pyobject.cc to include
a header file that includes a header file generated at build time depending
on the PROTOCOL in use. Since NULL ISA was not specifying any protocol,
this resulted in compilation problems. Hence, the changeset.
The namespace Message conflicts with the Message data type used extensively
in Ruby. Since Ruby is being moved to the same Master/Slave ports based
configuration style as the rest of gem5, this conflict needs to be resolved.
Hence, the namespace is being renamed to ProtoMessage.
There are two changes this patch makes to the way configurable members of a
state machine are specified in SLICC. The first change is that the data
member declarations will need to be separated by a semi-colon instead of a
comma. Secondly, the default value to be assigned would now use SLICC's
assignment operator i.e. ':='.
This patch changes the grammar for SLICC so as to remove some of the
redundant / duplicate rules. In particular rules for object/variable
declaration and class member declaration have been unified. Similarly, the
rules for a general function and a class method have been unified.
One more change is in the priority of two rules. The first rule is on
declaring a function with all the params typed and named. The second rule is
on declaring a function with all the params only typed. Earlier the second
rule had a higher priority. Now the first rule has a higher priority.
This patch enables the use of page tables that are stored in system memory
and respect x86 specification, in SE mode. It defines an architectural
page table for x86 as a MultiLevelPageTable class and puts a placeholder
class for other ISAs page tables, giving the possibility for future
implementation.
This patch defines a multi-level page table class that stores the page table in
system memory, consistent with ISA specifications. In this way, cpu models that
use the actual hardware to execute (e.g. KvmCPU), are able to traverse the page
table.
This patch ensures the cycle check is still valid even restoring from
a checkpoint. In this case the DRAMSim2 cycle count is relative to the
startTick rather than 0.
We currently use our own home-baked support for type-safe variadic
functions. This is confusing and somewhat limited (e.g., cprintf only
supports a limited number of arguments). This changeset converts all
uses of our internal varargs support to use C++11 variadic macros.
Add the macros M5_ATTR_FINAL and M5_ATTR_OVERRIDE which are defined to
final and override respectively if supported by the compiler. This is
done to allow a smooth transition to gcc >= 4.7.
If a bit field in a bit union specified as Bitfield<LSB, MSB> instead
of Bitfield<MSB, LSB> the code silently fails and the field is read as
zero. This changeset introduces a static assert that tests, at compile
time, that the bit order is correct.
The order of the MSB and LSB bit of the mm field in the PSTATE union
is wrong. Any access to this field will currently be ignored and reads
will always return zero. This patch fixes the ordering so it is <MSB,
LSB> instead of <LSB, MSB>.
This patch fixes a bug in the DRAM controller address decoding. In
cases where the DRAM burst size (e.g. 32 bytes in a rank with a single
LPDDR3 x32) was smaller than the channel interleaving size
(e.g. systems with a 64-byte cache line) one address bit effectively
got used as a channel bit when it should have been a low-order column
bit.
This patch adds a notion of "columns per stripe", and more clearly
deals with the low-order column bits and high-order column bits. The
patch also relaxes the granularity check such that it is possible to
use interleaving granularities other than the cache line size.
The patch also adds a missing M5_CLASS_VAR_USED to the tCK member as
it is only used in the debug build for now.
This patch adds a fix for older checkpoints before support for
multiple event queues were added in changeset 2cce74fe359e. The change
in checkpoint version should really hav ebeen part of the
aforementioned changeset.
Some newer binaries compiled for Versatile Express TC2 contain access
to implementation specific L2MERRSR registers. This causes an infinite
loop of undefined exceptions. This patch changes the behavior to "warn
not fail" to keep the workloads going.
Baremetal workloads are specified using the "kernel" parameter, but
don't always have the correct address mappings. This patch adds a
boolean flag to the system and bypasses the kernel addr mapping checks
when running in baremetal mode.
The branch predictor is normally only built when a CPU that uses a
branch predictor is built. The list of CPUs is currently incomplete as
the simple CPUs support branch predictors (for warming, branch stats,
etc). In practice, all CPU models now use branch predictors, so this
changeset removes the CPU model check and replaces it with a check for
the NULL ISA.
Certain versions of clang complain about unused private members if
they are not used. This changeset removes such members from the
MIPS-specific classes to silence the warning.
Certain versions of clang complain about unused private members if
they are not used. This changeset removes such members from the
POWER-specific ProcessInfo struct to silence the warning.
This changeset fixes three types of warnings that occur in clang 3.4
on Ubuntu 12.04:
* Certain versions of libstdc++ (primarily 4.8) use struct and class
interchangeably. This triggers a warning in clang.
* Swig has a tendency to generate code with the register class which
was deprecated in C++11. This triggers a deprecation warning in
clang.
* Swig sometimes generates Python wrapper code which returns
uninitialized values. It's unclear if this is actually a problem
(the cases might be limited to failure paths). We'll silence these
warnings for now since there is little we can do about the
generated code.
The M5_PRAGMA_NORETURN macro was only used in for
__exit_message. Since the macro only holds a stub definition and all
functions with noreturn semantics use the M5_ATTR_NORETURN, this
macros is completely redundant.
RefCountingPtr is sometimes forward declared to avoid having to
include refcnt.hh. This does not work since we typically return
instances of RefCountingPtr rather than references to instances. The
only reason this currently works is that we include refcnt.hh in
cprintf.hh, which "leaks" the header to most other source files. This
changeset replaces such forward declarations with an include of
refcnt.hh.
When a cacheline is written back to a lower-level cache,
tags->insertBlock() sets various status parameters. However these
status bits were cleared immediately after calling. This patch makes
it so that these status fields are not cleared by moving them outside
of the tags->insertBlock() call.
This patch does some minor house keeping of the branch predictor by
adopting STL containers, and shifting some iterator to use range-based
for loops.
The predictor history is also changed from a list to a deque as we
never to insertion/deletion other than at the front and back.
This patch adds the SubSystem container for grouping
simobjects together in logical subsystems to facilitate
building a larger system from constituent parts. The container
is simply a non-abstract empty simobject to hold the components
that will be connected as its children. In simulation the
object does not participate, its only use is during configuration
of the system.
This patch adds helper functions to SimObject.py, params.py and
simulate.py to enable the new configuration system. Functions like
enumerateParams() in SimObject lets the config system auto-generate
command line options for simobjects to be modified on the command
line.
Params in params.py have __call__() added
to their definition to allow the argparse module to use them
as a type to check command input is in the proper format.
This patch adds a check to ensure that packets which are not going to
a memory range are suppressed in the traffic generator. Thus, if a
trace is collected in full-system, the packets destined for devices
are not played back.
this patch implements a new tags class that uses a random replacement policy.
these tags prefer to evict invalid blocks first, if none are available a
replacement candidate is chosen at random.
this patch factors out the common code in the LRU class and creates a new
abstract class: the BaseSetAssoc class. any set associative tag class must
implement the functionality related to the actual replacement policy in the
following methods:
accessBlock()
findVictim()
insertBlock()
invalidate()
This patch contains a new CPU model named `Minor'. Minor models a four
stage in-order execution pipeline (fetch lines, decompose into
macroops, decompose macroops into microops, execute).
The model was developed to support the ARM ISA but should be fixable
to support all the remaining gem5 ISAs. It currently also works for
Alpha, and regressions are included for ARM and Alpha (including Linux
boot).
Documentation for the model can be found in src/doc/inside-minor.doxygen and
its internal operations can be visualised using the Minorview tool
utils/minorview.py.
Minor was designed to be fairly simple and not to engage in a lot of
instruction annotation. As such, it currently has very few gathered
stats and may lack other gem5 features.
Minor is faster than the o3 model. Sample results:
Benchmark | Stat host_seconds (s)
---------------+--------v--------v--------
(on ARM, opt) | simple | o3 | minor
| timing | timing | timing
---------------+--------+--------+--------
10.linux-boot | 169 | 1883 | 1075
10.mcf | 117 | 967 | 491
20.parser | 668 | 6315 | 3146
30.eon | 542 | 3413 | 2414
40.perlbmk | 2339 | 20905 | 11532
50.vortex | 122 | 1094 | 588
60.bzip2 | 2045 | 18061 | 9662
70.twolf | 207 | 2736 | 1036
Stop setting the use_default_range flag in PioBus in order to
have random bad addresses result in a BadAddress response and
not a gem5 fatal error. This is necessary in Ruby as Ruby is
connected directly to PioBus, so misspeculated addresses will
be sent there directly. For the classic memory system, this
change has no effect, as bad addresses are caught by the
memory bus before being sent to the PioBus.
This work was done while Binh was an intern at AMD Research.
The System object has a static MemoryModeStrings array
that's (1) unused and (2) redundant, since there's an
auto-generated version in the Enums namespace. No
point in leaving it in.
When we switched getSyscallArg() from explicit arg indices to
the implicit method, some DPRINTF arguments were left as calls
to getSyscallArg(), even though C/C++ doesn't guarantee
anything about the order of invocation of these calls. As a
result, the args could be printed out in arbitrary orders.
Interestingly, this bug has been around since 2009:
http://repo.gem5.org/gem5/rev/4842482e1bd1
this operator uses memcmp() to detect if two EthAddr object have the same
address, however memcmp() will return 0 if all bytes are equal. operator==
returns the return value of memcmp() to indicate whether or not two
address are equal. this is incorrect as it will always give the opposite of
the intended behavior. this patch fixes that problem.
per the IEEE 802 spec:
1) fixed broadcast() to ensure that all bytes are equal to 0xff.
2) fixed unicast() to ensure that bit 0 of the first byte is equal to 0
3) fixed multicast() to ensure that bit 0 of the first byte is equal to 1, and
that it is not a broadcast.
also the constructors in EthAddr are fixed so that all bytes of data are
initialized.
Adds DVFS capabilities to gem5, by allowing users to specify lists for
frequencies and voltages in SrcClockDomains and VoltageDomains respectively.
A separate component, DVFSHandler, provides a small interface to change
operating points of the associated domains.
Clock domains will be linked to voltage domains and thus allow separate clock,
but shared voltage lines.
Currently all the valid performance-level updates are performed with a fixed
transition latency as specified for the domain.
Config file example:
...
vd = VoltageDomain(voltage = ['1V','0.95V','0.90V','0.85V'])
tsys.cluster1.clk_domain.clock = ['1GHz','700MHz','400MHz','230MHz']
tsys.cluster2.clk_domain.clock = ['1GHz','700MHz','400MHz','230MHz']
tsys.cluster1.clk_domain.domain_id = 0
tsys.cluster2.clk_domain.domain_id = 1
tsys.cluster1.clk_domain.voltage_domain = vd
tsys.cluster2.clk_domain.voltage_domain = vd
tsys.dvfs_handler.domains = [tsys.cluster1.clk_domain,
tsys.cluster2.clk_domain]
tsys.dvfs_handler.enable = True
This patch adds a DRAMPower flag to enable off-line DRAM power
analysis using the DRAMPower tool. A new DRAMPower flag is added
and a follow-on patch adds a Python script to post-process the output
and order it based on time stamps.
The long-term goal is to link DRAMPower as a library and provide the
commands through function calls to the model rather than first
printing and then parsing the commands. At the moment it is also up to
the user to ensure that the same DRAM configuration is used by the
gem5 controller model and DRAMPower.
This patch adds the index of the bank and rank as a field so that we can
determine the identity of a given bank (reference or pointer) for the
power tracing. We also grab the opportunity of cleaning up the
arguments used for identifying the bank when activating.
In a cycle, we could see a R and W requests corresponding to the same
page walk being sent to the memory. During the cycle that assertion
happens, we have 2 responses corresponding to the R and W above. We
also have a 'read' variable to keep track of the inflight Read
request, this gets reset to NULL right after we send out any R
request; and gets set to the next R in the page walk when a response
comes back.
The issue we are seeing here is when we get a response for W request,
assert(!read) fires because we got a response for R request right
before this, hence we set 'read' to NOT NULL value, pointing to the
next R request in the pagewalk!
This work was done while Binh was an intern at AMD Research.
Check for free entries in Load Queue and Store Queue separately to
avoid cases when load cannot be renamed due to full Store Queue and
vice versa.
This work was done while Binh was an intern at AMD Research.
This patch bumps the supported version of gcc from 4.4 to 4.6, and
clang from 2.9 to 3.0. This enables, amongst other things, range-based
for loops, lambda expressions, etc. The STL implementation shipping
with 4.6 also has a full functional implementation of unique_ptr and
shared_ptr.
The language describing the clockEdge and nextCycle functions were ambiguous,
and so were prone to misinterpretation/misuse. Clear up the comments to more
rigorously describe their functionality.
Using '== true' in a boolean expression is totally redundant,
and using '== false' is pretty verbose (and arguably less
readable in most cases) compared to '!'.
It's somewhat of a pet peeve, perhaps, but I had some time
waiting for some tests to run and decided to clean these up.
Unfortunately, SLICC appears not to have the '!' operator,
so I had to leave the '== false' tests in the SLICC code.
This patch removes the stat totalCommittedInsts. This variable was used for
recording the total number of instructions committed across all the threads
of a core. The instructions committed by each thread are recorded invidually.
The total would now be generated by summing these individual counts.
This patch makes a more firm connection between the DDR3-1600
configuration and the corresponding datasheet, and also adds a
DDR3-2133 and a DDR4-2400 configuration. At the moment there is also
an ongoing effort to align the choice of datasheets to what is
available in DRAMPower.
This patch extends the current timing parameters with the DRAM cycle
time. This is needed as the DRAMPower tool expects timestamps in DRAM
cycles. At the moment we could get away with doing this in a
post-processing step as the DRAMPower execution is separate from the
simulation run. However, in the long run we want the tool to be called
during the simulation, and then the cycle time is needed.
This patch adds the basic ingredients for a precharge all operation,
to be used in conjunction with DRAM power modelling.
Currently we do not try and apply any cleverness when precharging all
banks, thus even if only a single bank is open we use PREA as opposed
to PRE. At the moment we only have a single tRP (tRPpb), and do not
model the slightly longer all-bank precharge constraint (tRPab).
This patch adds the tRTP timing constraint, governing the minimum time
between a read command and a precharge. Default values are provided
for the existing DRAM types.
This patch merges the two control paths used to estimate the latency
and update the bank state. As a result of this merging the computation
is now in one place only, and should be easier to follow as it is all
done in absolute (rather than relative) time.
As part of this change, the scheduling is also refined to ensure that
we look at a sensible estimate of the bank ready time in choosing the
next request. The bank latency stat is removed as it ends up being
misleading when the DRAM access code gets evaluated ahead of time (due
to the eagerness of waking the model up for scheduling the next
request).
This patch adds the write recovery time to the DRAM timing
constraints, and changes the current tRASDoneAt to a more generic
preAllowedAt, capturing when a precharge is allowed to take place.
The part of the DRAM access code that accounts for the precharge and
activate constraints is updated accordingly.
This patch adds power states to the controller. These states and the
transitions can be used together with the Micron power model. As a
more elaborate use-case, the transitions can be used to drive the
DRAMPower tool.
At the moment, the power-down modes are not used, and this patch
simply serves to capture the idle, auto refresh and active modes. The
patch adds a third state machine that interacts with the refresh state
machine.
This patch adds a state machine for the refresh scheduling to
ensure that no accesses are allowed while the refresh is in progress,
and that all banks are propely precharged.
As part of this change, the precharging of banks of broken out into a
method of its own, making is similar to how activations are dealt
with. The idle accounting is also updated to ensure that the refresh
duration is not added to the time that the DRAM is in the idle state
with all banks precharged.
This patch changes the read/write event loop to use a single event
(nextReqEvent), along with a state variable, thus joining the two
control flows. This change makes it easier to follow the state
transitions, and control what happens when.
With the new loop we modify the overly conservative switching times
such that the write-to-read switch allows bank preparation to happen
in parallel with the bus turn around. Similarly, the read-to-write
switch uses the introduced tRTW constraint.
This patch adds a the member function StaticInst::printFlags to allow all
of an instruction's flags to be printed without using the individual
is... member functions or resorting to exposing the 'flags' vector
It also replaces the enum definition StaticInst::Flags with a
Python-generated enumeration and adds to the enum generation mechanism
in src/python/m5/params.py to allow Enums to be placed in namespaces
other than Enums or, alternatively, in wrapper structs allowing them to
be inherited by other classes (so populating that class's name-space
with the enumeration element names).
This patch encompasses several interrelated and interdependent changes
to the ISA generation step. The end goal is to reduce the size of the
generated compilation units for instruction execution and decoding so
that batch compilation can proceed with all CPUs active without
exhausting physical memory.
The ISA parser (src/arch/isa_parser.py) has been improved so that it can
accept 'split [output_type];' directives at the top level of the grammar
and 'split(output_type)' python calls within 'exec {{ ... }}' blocks.
This has the effect of "splitting" the files into smaller compilation
units. I use air-quotes around "splitting" because the files themselves
are not split, but preprocessing directives are inserted to have the same
effect.
Architecturally, the ISA parser has had some changes in how it works.
In general, it emits code sooner. It doesn't generate per-CPU files,
and instead defers to the C preprocessor to create the duplicate copies
for each CPU type. Likewise there are more files emitted and the C
preprocessor does more substitution that used to be done by the ISA parser.
Finally, the build system (SCons) needs to be able to cope with a
dynamic list of source files coming out of the ISA parser. The changes
to the SCons{cript,truct} files support this. In broad strokes, the
targets requested on the command line are hidden from SCons until all
the build dependencies are determined, otherwise it would try, realize
it can't reach the goal, and terminate in failure. Since build steps
(i.e. running the ISA parser) must be taken to determine the file list,
several new build stages have been inserted at the very start of the
build. First, the build dependencies from the ISA parser will be emitted
to arch/$ISA/generated/inc.d, which is then read by a new SCons builder
to finalize the dependencies. (Once inc.d exists, the ISA parser will not
need to be run to complete this step.) Once the dependencies are known,
the 'Environments' are made by the makeEnv() function. This function used
to be called before the build began but now happens during the build.
It is easy to see that this step is quite slow; this is a known issue
and it's important to realize that it was already slow, but there was
no obvious cause to attribute it to since nothing was displayed to the
terminal. Since new steps that used to be performed serially are now in a
potentially-parallel build phase, the pathname handling in the SCons scripts
has been tightened up to deal with chdir() race conditions. In general,
pathnames are computed earlier and more likely to be stored, passed around,
and processed as absolute paths rather than relative paths. In the end,
some of these issues had to be fixed by inserting serializing dependencies
in the build.
Minor note:
For the null ISA, we just provide a dummy inc.d so SCons is never
compelled to try to generate it. While it seems slightly wrong to have
anything in src/arch/*/generated (i.e. a non-generated 'generated' file),
it's by far the simplest solution.
The unproxy code for Parent.any can generate a circular reference
in certain situations with classes hierarchies like those in ClockDomain.py.
This patch solves this by marking ouself as visited to make sure the
search does not resolve to a self-reference.
The ARM TLBs have a bootUncacheability flag used to make some loads
and stores become uncacheable when booting in FS mode. Later the
flag is cleared to let those loads and stores operate as normal. When
doing a takeOverFrom(), this flag's state is not preserved and is
momentarily reset until the CPSR is touched. On single core runs this
is a non-issue. On multi-core runs this can lead to crashes on the O3
CPU model from the following series of events:
1) takeOverFrom executed to switch from Atomic -> O3
2) All bootUncacheability flags are reset to true
3) Core2 tries to execute a load covered by bootUncacheability, it
is flagged as uncacheable
4) Core2's load needs to replay due to a pipeline flush
3) Core1 core does an action on CPSR
4) The handling code for CPSR then checks all other cores
to determine if bootUncacheability can be set to false
5) Asynchronously set bootUncacheability on all cores to false
6) Core2 replays load previously set as uncacheable and notices
it is now flagged as cacheable, leads to a panic.
This patch implements takeOverFrom() functionality for the ARM TLBs
to preserve flag values when switching from atomic -> detailed.
For the o3, add instruction mix (OpClass) histogram at commit (stats
also already collected at issue). For the simple CPUs we add a
histogram of executed instructions
This patch squashes prefetch requests from downstream caches,
so that they do not steal cachelines away from caches closer
to the cpu. It was originally coded by Mitch Hayenga and
modified by Aasheesh Kolli.
This source for stats binds an object and a method / function from the object
to a stats object. This allows pulling out stats from object methods without
needing to go through a global, or static shim.
Syntax is somewhat unpleasant, but the templates and method pointer type
specification were quite tricky. Interface is very clean though; and similar
to .functor
Allow the specification of a socket ID for every core that is reflected in the
MPIDR field in ARM systems. This allows studying multi-socket / cluster
systems with ARM CPUs.
Splits the CommMonitor trace_file parameter into three parameters. Previously,
the trace was only enabled if the trace_file parameter was set, and would be
written to this file. This patch adds in a trace_enable and trace_compress
parameter to the CommMonitor.
No trace is generated if trace_enable is set to False. If it is set to True, the
trace is written to a file based on the name of the SimObject in the simulation
hierarchy. For example, system.cluster.il1_commmonitor.trc. This filename can be
overridden by additionally specifying a file name to the trace_file parameter
(more on this later).
The trace_compress parameter will append .gz to any filename if set to True.
This enables compression of the generated traces. If the file name already ends
in .gz, then no changes are made.
The trace_file parameter will override the name set by the trace_enable
parameter. In the case that the specified name does not end in .gz but
trace_compress is set to true, .gz is appended to the supplied file name.
Unimplemented miscregs for the generic timer were guarded by panics
in arm/isa.cc which can be tripped by the O3 model if it speculatively
executes a wrong path containing a mrs instruction with a bad miscreg
index. These registers were flagged as implemented and accessible.
This patch changes the miscreg info bit vector to flag them as
unimplemented and inaccessible. In this case, and UndefinedInst
fault will be generated if the register access is not trapped
by a hypervisor.
This patch changes the default pixel clock to effectively generate
1080p resolution at 60 frames per second. It is dependent upon the
kernel device tree file using the specified resolution / display
string in the comments.
This is a quick hack to communicate a greater number of CPUs to a guest OS via
the ARM A9 SCU config register. Some OSes (Linux) just look at the bottom field
to count CPUs and with a small change can look at bits [3:0] to learn about up
to 16 CPUs.
Very much unsupported (and contains warning messages as such) but useful for
running 8 core sims without hardwiring CPU count in the guest OS.
With (upcoming) separate compilation, they are useless. Only
link-time optimization could re-inline them, but ideally
feedback-directed optimization would choose to do so only for
profitable (i.e. common) instructions.
The MicroMemOp class generates the disassembly for both integer
and floating point instructions, but it would always print its
first operand as an integer register without considering that the
op may be a floating instruction in which case a float register
should be displayed instead.
In the O3 LSQ, data read/written is printed out in DPRINTFs. However,
the data field is treated as a character string with a null terminated.
However the data field is not encoded this way. This patch removes
that possibility by removing the data part of the print.
This never actually worked since it was printing out only a word
of the cache block and not the entire thing and doubly didn't work
csprintf overrides the %#x specifier and assumes a char* array is
actually a string.
O3CPU has a compile-time maximum width set in o3/impl.hh, but checking
the configuration against this limit was not implemented anywhere
except for fetch. Configuring a wider pipe than the limit can silently
cause various issues during the simulation. This patch adds the proper
checking in the constructor of the various pipeline stages.
Without this declaration, new clangs will complain about this value
being unused. It has no explicit use in the codebase, but it can be
useful to turn on all debugging flags while in a debugger to greatly
increase simulator verbosity.
A number of calls to isEmpty() and numFreeEntries()
should be thread-specific.
In cpu.cc, the fact that tid is /*commented*/ out is a bug. Say the rob
has instructions from thread 0 (isEmpty() returns false), and none from
thread 1. If we are trying to squash all of thread 1, then
readTailInst(thread 1) will be called because rob->isEmpty() returns
false. The result is end_it is not in the list and the while
statement loops indefinitely back over the cpu's instList.
In iew_impl.hh, all threads are told they have the entire remaining IQ, when
each thread actually has a certain allocation. The result is extra stalls at
the iew dispatch stage which the rename stage usually takes care of.
In commit_impl.hh, rob->readHeadInst(thread 1) can be called if the rob only
contains instructions from thread 0. This returns a dummyInst (which may work
since we are trying to squash all instructions, but hardly seems like the right
way to do it).
In rob_impl.hh this fix skips the rest of the function more frequently and is
more efficient.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Upon aggregating records, serialize system's cache-block size, as the
cache-block size can be different when restoring from a checkpoint. This way,
we can correctly read all records when restoring from a checkpoints, even if
the cache-block size is different.
Note, that it is only possible to restore from a checkpoint if the
desired cache-block size is smaller or equal to the cache-block size
when the checkpoint was taken; we can split one larger request into
multiple small ones, but it is not reliable to do the opposite.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Simulating a SMP or multicore requires devices to be shared between
multiple KVM vCPUs. This means that locking is required when accessing
devices. This changeset adds the necessary locking to allow devices to
execute correctly. It is implemented by temporarily migrating the KVM
CPU to the VM's (and devices) event queue when handling
MMIO. Similarly, the VM migrates to the interrupt controller's event
queue when delivering an interrupt.
The support for fast-forwarding of multicore simulations added by this
changeset assumes that all devices in a system are simulated in the
same thread and each vCPU has its own thread. Special care must be
taken to ensure that devices living under the CPU in the object
hierarchy (e.g., the interrupt controller) do not inherit the parent
CPUs thread and are assigned to device thread. The KvmVM object is
assumed to live in the same thread as the other devices in the system.
The calling thread is undefined when the PollQueue services events.
This implies that PollEvents need to handle the case where they are
processed from a different thread than the thread that created the
event. This changeset adds temporary event queue migrations to the VNC
server, the ethernet tap device, and the terminal to protect them from
inter-thread calls.
As of now, the enqueue statement can take in any number of 'pairs' as
argument. But we only use the pair in which latency is the key. This
latency is allowed to be either a fixed integer or a member variable of
controller in which the expression appears. This patch drops the use of pairs
in an enqueue statement. Instead, an expression is allowed which will be
interpreted to be the latency of the enqueue. This expression can anything
allowed by slicc including a constant integer or a member variable.
We need the ability to lock event queues to enable device accesses
across threads. The serviceOne() method now takes a service lock prior
to handling a new event. By locking an event queue, a different
thread/eq can effectively execute in the context of the locked event
queue. To simplify temporary event queue migrations, this changeset
introduces the EventQueue::ScopedMigration class that unlocks the
current event queue, locks a new event queue, and updates the current
event queue variable.
In order to prevent deadlocks, event queues need to be released when
waiting on barriers. This is implemented using the
EventQueue::ScopedRelease class. An instance of this class is, for
example, used in the BaseGlobalEvent class to release the event queue
when waiting on the synchronization barrier.
The intended use for this functionality is when devices need to be
accessed across thread boundaries. For example, when fast-forwarding,
it might be useful to run devices and CPUs in separate threads. In
such a case, the CPU locks the device queue whenever it needs to
perform IO. This functionality is primarily intended for KVM.
Note: Migrating between event queues can lead to non-deterministic
timing. Use with extreme care!
--HG--
extra : rebase_source : 23e3a741a1fd73861d1339782dbbe1bc76285315
This patch fixes violation of TSO in the O3CPU, as all loads must be
ordered with all other loads. In the LQ, if a snoop is observed, all
subsequent loads need to be squashed if the system is TSO.
Prior to this patch, the following case could be violated:
P0 | P1 ;
MOV [x],mail=/usr/spool/mail/nilay | MOV EAX,[y] ;
MOV [y],mail=/usr/spool/mail/nilay | MOV EBX,[x] ;
exists (1:EAX=1 /\ 1:EBX=0) [is a violation]
The problem was found using litmus [http://diy.inria.fr].
Committed by: Nilay Vaish <nilay@cs.wisc.edu
This patch adds stats for tracking the number of reads/writes per bus
turn around, and also adds hysteresis to the write-to-read switching
to ensure that the queue does not oscilate around the low threshold.
This patch renames the not-so-simple SimpleDRAM to a more suitable
DRAMCtrl. The name change is intended to ensure that we do not send
the wrong message (although the "simple" in SimpleDRAM was originally
intended as in cleverly simple, or elegant).
As the DRAM controller modelling work is being presented at ISPASS'14
our hope is that a broader audience will use the model in the future.
--HG--
rename : src/mem/SimpleDRAM.py => src/mem/DRAMCtrl.py
rename : src/mem/simple_dram.cc => src/mem/dram_ctrl.cc
rename : src/mem/simple_dram.hh => src/mem/dram_ctrl.hh
Make the default memory type DDR3-1600 x64, and use the open-adaptive
page policy. This change is aiming to ensure that users by default are
using a realistic memory system.
This patch adds a basic starvation-prevention mechanism where a DRAM
page is forced to close after a certain number of accesses. The limit
is combined with the open and open-adaptive page policy and if reached
causes an auto-precharge.
This patch changes the triggering condition for the write draining
such that we grab the opportunity to issue writes if there are no
reads waiting (as opposed to waiting for the writes to reach the high
threshold). As a result, we potentially drain some of the writes in read
idle periods (if any).
A low threshold is added to be able to control how many write bursts
are kept in the memory controller queue (acting as on-chip storage).
The high and low thresholds are updated to sensible values for a 32/64
size write buffer. Note that the thresholds should be adjusted along
with the queue sizes.
This patch also adds some basic initialisation sanity checks and moves
part of the initialisation to the constructor.
This patch enables a new 'DRAM' mode to the existing traffic
generator, catered to generate specific requests to DRAM based on
required hit length (stride size) and bank utilization. It is an add on
to the Random mode.
The basic idea is to control how many successive packets target the
same page, and how many banks are being used in parallel. This gives a
two-dimensional space that stresses different aspects of the DRAM
timing.
The configuration file needed to use this patch has to be changed as
follow: (reference to Random Mode, LPDDR3 memory type)
'STATE 0 10000000000 RANDOM 50 0 134217728 64 3004 5002 0'
-> 'STATE 0 10000000000 DRAM 50 0 134217728 32 3004 5002 0 96 1024 8 6 1'
The last 4 parameters to be added are:
<stride size (bytes), page size(bytes), number of banks available in DRAM,
number of banks to be utilized, address mapping scheme>
The address mapping information is used to get the stride address
stream of the specified size and to know where to find the bank
bits. The configuration file has a parameter where '0'-> RoCoRaBaCh,
'1'-> RoRaBaCoCh/RoRaBaChCo address-mapping schemes. Note that the
generator currently assumes a single channel and a single rank. This
is to avoid overwhelming the traffic generator with information about
the memory organisation.
This patch adds the row bits to the name of the address mapping
schemes to make it more clear that all the current schemes places the
row bits as the most significant bits.
This patch moves the Ruby-related debug flags to the ruby
sub-directory, and also removes the state SConsopts that add the
no-longer-used NO_VECTOR_BOUNDS_CHECK.
Prevent incomplete configuration of TrafficGen class from causing
segmentation faults. If an 'INIT' line is not present in the
configuration file then the currState variable will remain
uninitialized which may result in a crash.
There were several sections of the m5ops code which were
essentially copy/pasted versions of the 32-bit code. The
problem is that some of these didn't account fo4 64-bit
registers leading to arguments being in the wrong registers.
This patch addresses the args for readfile64, writefile64,
and addsymbol64 -- all of which seemed to suffer from a
similar set of problems when moving to 64-bit.
Each consumer object maintains a set of tick values when the object is supposed
to wakeup and do some processing. As of now, the object accesses this set both
when scheduling a wakeup event and when the object actually wakes up. The set
is accessed during wakeup to remove the current tick value from the set. This
functionality is now being moved to the scheduling function where ticks are
removed at a later time.
This helps in configuring the network interfaces from the python script and
these objects no longer rely on the network object for the timing information.
Piobus was recently added to se scripts for ruby so that the interrupt
controller can be connected to something (required since the interrupt
controller sends address range messages). This patch removes the piobus
and instead, the pio port of ruby port will now ignore the range change
messages in se mode.
KVM used to use two signals, one for instruction count exits and one
for timer exits. There is really no need to distinguish between the
two since they only trigger exits from KVM. This changeset unifies and
renames the signals and adds a method, kick(), that can be used to
raise the control signal in the vCPU thread. It also removes the early
timer warning since we do not normally see if the signal was
delivered.
--HG--
extra : rebase_source : cd0e45ca90894c3d6f6aa115b9b06a1d8f0fda4d
gem5 seems to store the PC as RIP+CS_BASE. This is not what KVM
expects, so we need to subtract CS_BASE prior to transferring the PC
into KVM. This changeset adds the necessary PC manipulation and
refactors thread context updates slightly to avoid reading registers
multiple times from KVM.
--HG--
extra : rebase_source : 3f0569dca06a1fcd8694925f75c8918d954ada44
This changeset adds support for INIT and STARTUP IPI handling. We
currently handle both of these interrupts in gem5 and transfer the
state to KVM. Since we do not have a BIOS loaded, we pretend that the
INIT interrupt suspends the CPU after reset.
--HG--
extra : rebase_source : 7f3b25f3801d68f668b6cd91eaf50d6f48ee2a6a
The table walker code currently accounts for two types of walks,
Atomic and Timing, and treats them differently. Atomic walks keep a
single instance of WalkerState around for all walks to use in
currState. Timing mode keeps a queue of in-flight WalkerStates and
maintains currState as NULL between walks.
If a functional walk is done during Timing mode, it is treated as an
atomic walk and either creates a persistent WalkerState if in between
Timing walks, or stomps an existing currState for an in-progress
Timing walk.
This patch distinguishes functional walks as being able to exist at
any time and sets up a temporary WalkerState for its exclusive use and
then cleans up when finished, leaving any in progress Atomic or Timing
walks undisturbed.
This patch fixes an assert condition that is not true at all
times. There are valid situations that arise in dual-core
dual-workload runs where the assert condition is false. The function
call following the assert however needs to be called only when the
condition is true (a block cannot be invalidated in the tags structure
if has not been allocated in the structure, and the tempBlock is never
allocated). Hence the 'assert' has been replaced with an 'if'.
This patch changes the decode script to output the optional fields of
the proto message Packet, namely id and flags. The flags field is set
by the communication monitor.
The id field is useful for CPU trace experiments, e.g. linking the
fetch side to decode side. It had to be renamed because it clashes
with a built in python function id() for getting the "identity" of an
object.
This patch also takes a few common function definitions out from the
multiple scripts and adds them to a protolib python module.
This snippet can be used to replace if + {panics, fatals, asserts} constructs.
The idea is to have both the condition checking and a verbose printout in a single statement. The interface is as follows:
panic_if(foo != bar, "These should be equal: foo %i bar %i", foo, bar);
fatal_if(foo != bar, "These should be equal: foo %i bar %i", foo, bar);
chatty_assert(foo == bar, "These should be equal: foo %i bar %i", foo, bar);
Small fix for a warning that prevents compilation with gcc 4.8.1 due
to detecting that a variable might be uninitialised. The fix is to
assign a safe default.
The global synchronization event used to be scheduled at
simQuantum. This prevented repeated entries into gem5 from Python as
it can be scheduled in the past. This changeset ensures that the first
global synchronization happens at curTick() + simQuantum instead.
The TSL/LDT & TR/TSS segments didn't contain valid attributes. This
caused problems when transfering the state into KVM where invalid
state is a no-go. Fixup the attributes with values from AMD's
architecture programmer's manual.
When transferring segment registers into kvm, we need to find the
value of the unusable bit. We used to assume that this could be
inferred from the selector since segments are generally unusable if
their selector is 0. This assumption breaks in some weird corner
cases. Instead, we just assume that segments are always usable. This
is what qemu does so it should work.
Signal handlers in KVM are controlled per thread and should be
initialized from the thread that is going to execute the CPU. This
changeset moves the initialization call from startup() to
startupThread().
A copyRegs() function is added to MIPS utilities
to copy architectural state from the old CPU to
the new CPU during fast-forwarding. This
addition alone enables fast-forwarding for the
o3 cpu model running MIPS.
The patch also adds takeOverFrom() and
drainResume() functions to the InOrderCPU to
enable it to take over from another CPU. This
change enables fast-forwarding for the inorder
cpu model running MIPS, but not for Alpha.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Couple of users observed segmentation fault when the simulator tries to
register the statistical variable m_IncompleteTimes. It seems that there
is some problem with the initialization of these variables when allocated
in the constructor.
Currently, the interrupt controller in x86 is connected to the io bus
directly. Therefore the packets between the io devices and the interrupt
controller do not go through ruby. This patch changes ruby port so that
these packets arrive at the ruby port first, which then routes them to their
destination. Note that the patch does not make these packets go through the
ruby network. That would happen in a subsequent patch.
This patch simplfies the retry logic in the RubyPort, avoiding
redundant attributes, and enforcing more stringent checks on the
interactions with the normal ports. The patch also simplifies the
routing done by the RubyPort, using the port identifiers instead of a
heavy-weight sender state.
The patch also fixes a bug in the sending of responses from PIO
ports. Previously these responses bypassed the queue in the queued
port, and ignored the return value, potentially leading to response
packets being lost.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Code in two of the functions was exactly the same. This patch moves
this code to a new function which is called from the two functions
mentioned initially.
At several places, there are functions that take a cycle value as input
and performs some computation. Along with each such function, another
function was being defined that simply added one more cycle to input and
computed the same function. This patch removes this second copy of the
function. Places where these functions were being called have been updated
to use the original function with argument being current cycle + 1.
Two files had been incorrectly named with a .cache suffix.
--HG--
rename : src/mem/protocol/MESI_Three_Level-L0.cache => src/mem/protocol/MESI_Three_Level-L0cache.sm
rename : src/mem/protocol/MESI_Three_Level-L1.cache => src/mem/protocol/MESI_Three_Level-L1cache.sm