This patch changes the random number generator from the in-house
Mersenne twister to an implementation relying entirely on C++11 STL.
The format for the checkpointing of the twister is simplified. As the
functionality was never used, this should not matter. Note that this
patch does not actually make use of the checkpointing
functionality. As the random number generator is not thread safe, it
may be sensible to create one generator per thread, system, or even
object. Until this is decided, the status quo is maintained in that no
generator state is part of the checkpoint.
This patch tidies up random number generation to ensure that it is
done consistently throughout the code base. In essence this involves a
clean-up of Ruby, and some code simplifications in the traffic
generator.
As part of this patch a bunch of skewed distributions (off-by-one, etc.)
have been fixed.
Note that a single global random number generator is used, and that
the object instantiation order will impact the behaviour (the sequence
of numbers will be unaffected, but if module A calls random before
module B then they would obviously see a different outcome). The
dependency on the instantiation order is true in any case due to the
execution model of gem5, so we leave it as is. Also note that the
global random generator is not thread safe at this point.
Regressions using the memtest, TrafficGen or any Ruby tester are
affected and will be updated accordingly.
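As a minimal sketch of the C++11 facilities involved (illustrative names
only, not the actual gem5 wrapper):

  #include <cstdint>
  #include <random>

  // A single global engine, matching the behaviour described above:
  // shared state, not thread safe.
  std::mt19937_64 globalEngine(5489);   // fixed seed for reproducibility

  // Uniform integer in [min, max]. std::uniform_int_distribution uses
  // inclusive bounds, which is what avoids the off-by-one skew mentioned
  // above.
  uint64_t
  randomInRange(uint64_t min, uint64_t max)
  {
      std::uniform_int_distribution<uint64_t> dist(min, max);
      return dist(globalEngine);
  }

  // Uniform real in [0, 1).
  double
  randomReal()
  {
      std::uniform_real_distribution<double> dist(0.0, 1.0);
      return dist(globalEngine);
  }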
This patch removes unnecessary retries that happened when the bus layer
itself was no longer busy, but the peer was not yet ready. Instead
of sending a retry that will inevitably not succeed, the bus now
silently waits until the peer sends a retry.
Multiple instructions assume only 32-bit load operations are available;
this patch increases load sizes to 64-bit or 128-bit for many load pair and
load multiple instructions.
Support full-block writes directly rather than requiring RMW:
* a cache line is allocated in the cache upon receipt of a
WriteInvalidateReq, not the WriteInvalidateResp.
* only top-level caches allocate the line; the others just pass
the request along and invalidate as necessary.
* to close a timing window between the *Req and the *Resp, a new
metadata bit tracks whether another cache has read a copy of
the new line before the writeback to memory.
This patch fixes a bug in the cache port where the retry flag was
reset too early, allowing new requests to arrive before the retry was
actually sent, but with the event already scheduled. This caused a
deadlock in the interactions with the O3 LSQ.
The patch fixes the underlying issue by shifting the resetting of the
flag to be done by the event that also calls sendRetry(). The patch
also tidies up the flow control in recvTimingReq and ensures that we
check whether a retry is already outstanding.
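A rough, self-contained sketch of the idea, using hypothetical names
rather than the actual cache port code:

  #include <functional>
  #include <queue>

  // The flag is cleared inside the event that performs the retry, so
  // requests arriving between scheduling and sending the retry are still
  // refused.
  struct RetryPortModel {
      bool blocked = false;        // e.g. the cache is currently busy
      bool mustSendRetry = false;  // a retry is owed to the peer
      std::queue<std::function<void()>> events;  // stand-in for the event queue

      void sendRetry() { /* tell the peer to resend its request */ }

      // Scheduled once the port becomes free again.
      void scheduleSendRetryEvent() {
          events.push([this] {
              mustSendRetry = false;   // reset here, together with the retry,
              sendRetry();             // rather than when the event is scheduled
          });
      }

      bool recvTimingReq() {
          // Refuse new requests while busy or while a retry is still
          // outstanding (checking the latter is part of the tidy-up).
          if (blocked || mustSendRetry) {
              mustSendRetry = true;
              return false;
          }
          return true;   // accept the request
      }
  };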
Previously, they were treated so much like loads that they could stall
at the head of the ROB. Now they are always treated like L1 hits.
If they actually miss, a new request is created at the L1 and tracked
from the MSHRs there if necessary (i.e. if it didn't coalesce with
an existing outstanding load).
The o3 cpu relies upon instructions that suspend a thread context being
flagged as "IsQuiesce". If they are not, unpredictable behavior can occur.
This patch fixes that for the x86 ISA.
For X86, the o3 CPU would get stuck with the commit stage not being
drained if an interrupt arrived while drain was pending. isDrained()
makes sure that pcState.microPC() == 0, thus ensuring that we are at
an instruction boundary. However, when we take an interrupt we
execute:
pcState.upc(romMicroPC(entry));
pcState.nupc(romMicroPC(entry) + 1);
tc->pcState(pcState);
As a result, the MicroPC is no longer zero. This patch ensures the drain is
delayed until no interrupts are present. Once draining, non-synchronous
interrupts are deferred until after the switch.
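A simplified, self-contained illustration of the resulting drain
condition (not the exact commit code):

  // The commit stage only reports itself drained at a macro-instruction
  // boundary, and draining is deferred while an interrupt is pending,
  // because taking the interrupt moves the micro PC into the ROM as
  // shown above.
  struct MicroPCModel {
      unsigned upc = 0;
      unsigned microPC() const { return upc; }
  };

  bool
  commitIsDrained(const MicroPCModel &pc, bool interruptPending)
  {
      return pc.microPC() == 0 && !interruptPending;
  }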
Neon memory ops that operate on multiple registers currently have very poor
performance because of interleave/deinterleave micro-ops.
This patch marks the deinterleave/interleave micro-ops as "No_OpClass" such
that they take minimum cycles to execute and are never resource constrained.
Additionally, the micro-ops over-read registers: although one form may need
to read up to 20 sources, not all do. This patch adds new forms so that false
dependencies are not modeled; instructions read only their minimum number of
sources.
Analogous to ee049bf (for x86). Requires a bump of the checkpoint version
and corresponding upgrader code to move the condition code register values
to the new register file.
This patch changes the CPU configuration used for the full-system ARM
regressions to increase the test coverage. Note that only the
core configuration changes, not the caches etc.
A small bug in the bimodal predictor caused significant degradation in
performance on some benchmarks. This was caused by using the wrong
globalHistoryReg during the update phase. This patch fixes the bug
and brings performance back to its normal level.
This patch fixes the load blocked/replay mechanism in the o3 cpu. Rather than
flushing the entire pipeline, this patch replays loads once the cache becomes
unblocked.
Additionally, deferred memory instructions (loads with conflicting stores)
did not respect the number of functional units when replayed (only the
issue width was respected). This patch also corrects that.
Improvements over 20% have been observed on a microbenchmark designed to
exercise this behavior.
O3 is supposed to stop fetching instructions once a quiesce is encountered.
However due to a bug, it would continue fetching instructions from the current
fetch buffer. This is because of a break statement that only broke out of the
first of 2 nested loops. It should have broken out of both.
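As a generic illustration of the pattern (hypothetical helpers, not the
actual fetch code):

  #include <cstddef>

  // Stand-ins for the real fetch logic.
  static bool isQuiesce(std::size_t b, std::size_t i) { return b == 1 && i == 2; }
  static void fetchInst(std::size_t, std::size_t) { /* hand to decode */ }

  // A bare 'break' only leaves the inner loop, so fetch kept consuming
  // the rest of the current fetch buffer after seeing a quiesce. A flag
  // checked by the outer loop (or an equivalent construct) is needed to
  // leave both loops.
  void
  fetchFromBuffer(std::size_t numBlocks, std::size_t instsPerBlock)
  {
      bool quiesceSeen = false;
      for (std::size_t b = 0; b < numBlocks && !quiesceSeen; ++b) {
          for (std::size_t i = 0; i < instsPerBlock; ++i) {
              if (isQuiesce(b, i)) {
                  quiesceSeen = true;   // outer loop sees this and stops
                  break;                // this alone only exits the inner loop
              }
              fetchInst(b, i);
          }
      }
  }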
The o3 cpu could attempt to schedule inactive threads under round-robin SMT
mode.
This is because it maintained a priority list of threads independent of the
active thread list. This priority list could become stale once threads were
inactive, leading to the cpu trying to fetch/commit from inactive threads.
Additionally, the fetch queue is now forcibly flushed of instructions
from the de-scheduled thread. A sketch of the corrected selection follows
the trace below.
Relevant output:
24557000: system.cpu: [tid:1]: Calling deactivate thread.
24557000: system.cpu: [tid:1]: Removing from active threads list
24557500: system.cpu:
FullO3CPU: Ticking main, FullO3CPU.
24557500: system.cpu.fetch: Running stage.
24557500: system.cpu.fetch: Attempting to fetch from [tid:1]
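A minimal sketch of the corrected selection, with hypothetical names and
containers:

  #include <algorithm>
  #include <cstddef>
  #include <deque>
  #include <list>

  // Round-robin selection driven by the active thread list, so a thread
  // that has been de-scheduled can no longer be picked from a stale,
  // independently maintained priority list.
  int
  nextActiveThread(std::deque<int> &rrOrder, const std::list<int> &activeThreads)
  {
      for (std::size_t n = 0; n < rrOrder.size(); ++n) {
          int tid = rrOrder.front();
          rrOrder.pop_front();
          rrOrder.push_back(tid);   // rotate: lowest priority next time
          if (std::find(activeThreads.begin(), activeThreads.end(), tid) !=
              activeThreads.end()) {
              return tid;           // first active thread in priority order
          }
          // otherwise the entry is stale (thread inactive); skip it
      }
      return -1;                    // no active thread to schedule
  }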
When a branch mispredicted, gem5 would squash all history after and including
the mispredicted branch. However, the mispredicted branch is still speculative
and its history is required to rollback state if another, older, branch
mispredicts. This leads to things like RAS corruption.
This patch adds a fetch queue, sitting between fetch and decode, to the
o3 cpu. This effectively decouples fetch from decode stalls, allowing it
to be more aggressive, running further ahead in the instruction stream.
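Conceptually, assuming a simple bounded-queue model (hypothetical names,
not the real implementation):

  #include <cstddef>
  #include <deque>
  #include <string>

  // A bounded queue between fetch and decode. Fetch stalls only when the
  // queue itself is full, so a momentary decode stall no longer stops
  // fetch from running ahead.
  struct FetchQueueModel {
      std::deque<std::string> insts;   // fetched, not yet decoded
      std::size_t maxSize;

      explicit FetchQueueModel(std::size_t sz) : maxSize(sz) {}

      // Fetch side: keep running ahead while there is room.
      bool push(const std::string &inst) {
          if (insts.size() >= maxSize)
              return false;            // only now does fetch stall
          insts.push_back(inst);
          return true;
      }

      // Decode side: drain at its own pace.
      bool pop(std::string &inst) {
          if (insts.empty())
              return false;
          inst = insts.front();
          insts.pop_front();
          return true;
      }
  };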
The o3 pipeline interlock/stall logic is incorrect. o3 unnecessarily stalled
fetch and decode due to later stages in the pipeline. In general, a stage
should usually only consider whether it is stalled by the adjacent, downstream
stage. Forcing stalls due to later stages results in bubbles in the
pipeline. Additionally, o3 stalled the entire frontend (fetch, decode, rename)
on a branch mispredict while the ROB is being serially walked to update the
RAT (robSquashing). It should only have stalled at rename.
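The corrected rule, as a generic sketch with illustrative stage names:

  // A stage stalls only when its immediate downstream neighbour cannot
  // accept more work, not when some later stage (or the ROB squash walk)
  // is busy. The old behaviour effectively OR-ed in stalls from all later
  // stages, creating unnecessary bubbles in the front end.
  enum Stage { Fetch = 0, Decode, Rename, IEW, Commit, NumStages };

  bool
  stageStalled(Stage s, const bool stageFull[NumStages])
  {
      // stageFull[x] == true means stage x cannot accept new work this cycle.
      return s + 1 < NumStages && stageFull[s + 1];
  }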
As highlighted on the mailing list, gem5's writeback modeling can impact
performance. This patch removes the limitation on maximum outstanding issued
instructions, however the number that can writeback in a single cycle is still
respected in instToCommit().
isa_parser.py guesses the OpClass, if none is given, based upon the StaticInst
flags. The existing code does not take into account optionally set flags.
This code hoists the setting of optional flags so OpClass is properly assigned.
If a set of LL/SC requests contend on the same cache block we
can get into a situation where CPUs will deadlock if they expect
a failed SC to supply them data. This case happens when 3 or
more cores are contending for a cache block using LL/SC and the system
is configured such that 2 cores are connected to a local bus and the
third is connected to a remote bus. If a core on the local bus
sends an SCUpgrade and the core on the remote bus sends an SCUpgrade
they will race to see who will win the SC access. In the meantime
if the other core appends a read to one of the SCUpgrades it will expect
to be supplied data by that SCUpgrade transaction. If it happens that
the SCUpgrade that was picked to supply the data is failed, it will
drop the appended request for data and never respond, leaving the requesting
core to deadlock. This patch makes all SCs behave as normal stores to
prevent this case, but still makes sure to check whether the SC can
perform the update.
This patch adds basic functionality to quickly visualise the output
from the DRAM efficiency script. There are some unfortunate hacks
needed to communicate the needed information from one script to the
other, and we fall back on (ab)using the simout to do this.
As part of this patch we also trim the efficiency sweep to stop at 512
bytes as this should be sufficient for all foreseeable DRAMs.
The first DPRINTF() in PL390::writeDistributor always read a uint32_t, though a
packet may have been only 1 or 2 bytes. This triggered an assertion failure
in packet->get().
This patch makes restoring the 'lastStopped' value for Ticked-containing
objects (including MinorCPU) optional so that Ticked-containing objects
can be restored from non-Ticked-containing objects (such as AtomicSimpleCPU).
We currently generate and compile one version of the ISA code per CPU
model. This is obviously wasting a lot of resources at compile
time. This changeset factors out the interface into a separate
ExecContext class, which also serves as documentation for the
interface between CPUs and the ISA code. While doing so, this
changeset also fixes up interface inconsistencies between the
different CPU models.
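A heavily simplified sketch of what such an interface class might look
like (method names here are illustrative, not the exact gem5 ExecContext
API):

  #include <cstdint>

  // A single abstract interface between the generated ISA code and the
  // CPU models, so one compiled copy of the ISA code can drive any CPU
  // model through virtual calls.
  class ExecContextSketch
  {
    public:
      virtual ~ExecContextSketch() {}

      // Register access used by the generated instruction definitions.
      virtual uint64_t readIntRegOperand(int idx) = 0;
      virtual void setIntRegOperand(int idx, uint64_t val) = 0;

      // Memory access helpers.
      virtual bool readMem(uint64_t addr, uint8_t *data, unsigned size) = 0;
      virtual bool writeMem(uint64_t addr, const uint8_t *data,
                            unsigned size) = 0;
  };

  // Each CPU model (atomic, timing, o3, ...) then provides its own
  // concrete implementation of this interface.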
The main argument for using one set of ISA code per CPU model has
always been performance, as this avoids indirect branches in the
generated code. However, this argument does not hold water. Booting
Linux on a simulated ARM system running in atomic mode
(opt/10.linux-boot/realview-simple-atomic) is actually 2% faster
(compiled using clang 3.4) after applying this patch. Additionally,
compilation time is decreased by 35%.
This patch prunes unused values, and also unifies how the values are
defined (not using an enum for ALPHA), aligning the use of int vs Addr
etc.
The patch also removes the duplication of PageBytes/PageShift and
VMPageSize/LogVMPageSize. For all ISAs the two pairs had identical
values and the latter has been removed.
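For example, after the clean-up each ISA needs only one pair of
definitions, with the byte count derived from the shift (the values
shown are illustrative):

  #include <cstdint>

  typedef uint64_t Addr;   // gem5's address type

  // One definition pair per ISA; the removed VMPageSize/LogVMPageSize
  // duplicates carried exactly the same values.
  const Addr PageShift = 12;
  const Addr PageBytes = 1ULL << PageShift;   // 4096 bytes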
When an Addr parameter is passed from a configuration script as a
hexadecimal value (like "0x80000000"), gem5 would error out. This is
because it would call
"toMemorySize" which requires the argument to end with a size specifier (like
1MB, etc).
This modification makes it so raw hex values can be passed through Addr
parameters from the configuration scripts.
This patch fixes the hash operator used for ARM ExtMachInst, which
incorrectly was still using uint32_t. Instead of changing it to
uint64_t, it is now using the underlying data type of the BitUnion.
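Schematically (a simplified stand-in; the real ExtMachInst is a BitUnion
rather than a plain struct):

  #include <cstddef>
  #include <cstdint>
  #include <functional>

  // Stand-in for the ARM ExtMachInst, backed by 64-bit storage.
  struct ExtMachInstSketch { uint64_t storage; };

  namespace std {
  template <>
  struct hash<ExtMachInstSketch> {
      size_t operator()(const ExtMachInstSketch &emi) const {
          // Hash the full underlying value; hashing it as a uint32_t
          // would silently discard the upper bits, which was the bug.
          return hash<uint64_t>()(emi.storage);
      }
  };
  } // namespace std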
The Index type defined as typedef int64 does not really provide any help
since in most places we use primitive types instead of Index. Also, the name
Index is so generic that it does not merit being used as a typename.
This patch is the final patch in a series of patches. The aim of the series
is to make ruby more configurable than it was. More specifically,
configuring the connections between controllers is not at all possible
(unless one is ready to make significant changes to the coherence
protocol). Moreover the buffers
themselves are magically connected to the network inside the slicc code.
These connections are not part of the configuration file.
This patch makes changes so that these connections will now be made in the
python configuration files associated with the protocols. This requires
each state machine to expose the message buffers it uses for input and output.
So, the patch makes these buffers configurable members of the machines.
The patch drops the slicc code that used to connect these buffers to the
network. Now these buffers are exposed to the python configuration system
as Master and Slave ports. In the configuration files, any master port
can be connected to any slave port. The file pyobject.cc has been modified
to take care of allocating the actual message buffer. This is in line with
how other port connections work.
A later changeset changes the file src/python/swig/pyobject.cc to include
a header file that includes a header file generated at build time depending
on the PROTOCOL in use. Since the NULL ISA was not specifying any protocol,
this resulted in compilation problems; hence this changeset.
The namespace Message conflicts with the Message data type used extensively
in Ruby. Since Ruby is being moved to the same Master/Slave ports based
configuration style as the rest of gem5, this conflict needs to be resolved.
Hence, the namespace is being renamed to ProtoMessage.
There are two changes this patch makes to the way configurable members of a
state machine are specified in SLICC. The first change is that the data
member declarations will need to be separated by a semi-colon instead of a
comma. Secondly, the default value to be assigned now uses SLICC's
assignment operator, i.e. ':='.
This patch changes the grammar for SLICC so as to remove some of the
redundant / duplicate rules. In particular, rules for object/variable
declaration and class member declaration have been unified. Similarly, the
rules for a general function and a class method have been unified.
One more change is in the relative priority of two rules: the first is for
declaring a function with all the params typed and named, the second for
declaring a function with all the params only typed. Earlier the second
rule had the higher priority; now the first rule does.
This patch fixes scripts related to ruby by adding the ruby clock domain.
Now the L1 controllers and the Sequencer share the cpu clock domain,
while the rest of the components use the ruby clock domain.
Before this patch, running simulations with the cpu clock set at 2GHz or
1GHz would output the same time results and could distort power measurements.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>