Have the traffic generator add its masterID as the PC address to the
requests. That way, prefetchers (and other components) that use a PC
for request classification will see per-tester streams of requests.
This enables us to test strided prefetchers with the memchecker, too.
To be able to use the TrafficGen in a system with caches we need to
allow it to sink incoming snoop requests. By default the master port
panics, so silently ignore any snoops.
The MemTest class really only tests false sharing, and as such there
was a lot of old cruft that could be removed. This patch cleans up the
tester, and also makes it more clear what the assumptions are. As part
of this simplification the reference functional memory is also
removed.
The regression configs using MemTest are updated to reflect the
changes, and the stats will be bumped in a separate patch. The example
config will be updated in a separate patch due to more extensive
re-work.
In a follow-on patch a new tester will be introduced that uses the
MemChecker to implement true sharing.
The TLB-related code is generally architecture dependent and should
live in the arch directory to signify that.
--HG--
rename : src/sim/BaseTLB.py => src/arch/generic/BaseTLB.py
rename : src/sim/tlb.cc => src/arch/generic/tlb.cc
rename : src/sim/tlb.hh => src/arch/generic/tlb.hh
This patch changes how the timing CPU deals with processing responses,
always scheduling an event, even if it is for the current tick. This
helps to avoid situations where a new request shows up before a
response is finished in the crossbar, and also is more in line with
any realistic behaviour.
While the IsFirstMicroop flag exists it was only occasionally used in the ARM
instructions that gem5 microOps and therefore couldn't be relied on to be correct.
This patch tidies up how we create and set the fields of a Request. In
essence it tries to use the constructor where possible (as opposed to
setPhys and setVirt), thus avoiding spreading the information across a
number of locations. In fact, setPhys is made private as part of this
patch, and a number of places where we callede setVirt instead uses
the appropriate constructor.
The ppCommit should notify the attached listener every time the cpu commits
a microop or non microcoded insturction. The listener can then decide
whether it will process only the last microop (eg. SimPoint probe).
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Minor was reporting the data cache access as ".inst" accesses.
This just switches the MasterPortID to dataMasterPortId.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Only the instruction address is actually checked, so there's no need to check
repeatedly while we're working through the microops of a macroop and that's
not changing.
This patch fixes a case where a store in Minor's store buffer never
leaves the store buffer as it is pre-maturely counted as having been
issued, leading to the store buffer idling.
LSQ::StoreBuffer::numUnissuedAccesses should count the number of accesses
either in memory, or still in the store buffer after being completed.
For stores which are also barriers, the store will stay in the store
buffer for a cycle after it is completed and will be cleaned up by the
barrier clearing code (to ensure that barriers are completed in-order).
To acheive this, numUnissuedAccesses is not decremented when a store-barrier
is issued to memory, but when its barrier effect is cleared.
Without this patch, the correct behaviour happens when a memory transaction
is immediately accepted, but not if it needs a retry.
In case the memory subsystem sends a combined response with invalidate
(e.g. ReadRespWithInvalidate), we cannot ignore the invalidate part
of the response.
If we were to ignore the invalidate part, under certain circumstances
this effectively leads to reordering of loads to the same address
which is not permitted under any memory consistency model implemented
in gem5.
Consider the case where a later load's address is computed before an
earlier load in program order, and is therefore sent to the memory
subsystem first. At some point the earlier load's address is computed
and in doing so correctly marks the later load as a
possibleLoadViolation. In the meantime some other node writes and
sends invalidations to all other nodes. The invalidation races with
the later load's ReadResp, and arrives before ReadResp and is
deferred. Upon receipt of the ReadResp, the response is changed to
ReadRespWithInvalidate, and sent to the CPU. If we ignore the
invalidate part of the packet, we let the later load read the old
value of the address. Eventually the earlier load's ReadResp arrives,
but with new data. As there was no invalidate snoop (sunk into the
ReadRespWithInvalidate), and if we did not process the invalidate of
the ReadRespWithInvalidate, we obtain a load reordering.
A similar scenario can be constructed where the earlier load's address
is computed after ReadRespWithInvalidate arrives for the younger
load. In this case hitExternalSnoop needs to be set to true on the
ReadRespWithInvalidate, so that upon knowing the address of the
earlier load, checkViolations will cause the later load to be
squashed.
Finally we must account for the case where both loads are sent to the
memory subsystem (reordered), a snoop invalidate arrives and correctly
sets the later loads fault to ReExec. However, before the CPU
processes the fault, the later load's ReadResp arrives and the
writeback discards the outstanding fault. We must add a check to
ensure that we do not skip any unprocessed faults.
Move the packet deallocations in the O3 CPU so that the completeDataAccess
deals only with the LSQ specific parts and the generic recvTimingResp frees the
packet in all other cases.
This patch simplifies how we deal with dynamically allocated data in
the packet, always assuming that it is array allocated, and hence
should be array deallocated (delete[] as opposed to delete). The only
uses of dataDynamic was in the Ruby testers.
The ARRAY_DATA flag in the packet is removed accordingly. No
defragmentation of the flags is done at this point, leaving a gap in
the bit masks.
As the last part the patch, it renames dataDynamicArray to dataDynamic.
This patch takes a first step in tightening up how we use the data
pointer in write packets. A const getter is added for the pointer
itself (getConstPtr), and a number of member functions are also made
const accordingly. In a range of places throughout the memory system
the new member is used.
The patch also removes the unused isReadWrite function.
This patch removes the parameter that enables bypassing the null check
in the Packet::getPtr method. A number of call sites assume the value
to be non-null.
The one odd case is the RubyTester, which issues zero-sized
prefetches(!), and despite being reads they had no valid data
pointer. This is now fixed, but the size oddity remains (unless anyone
object or has any good suggestions).
Finally, in the Ruby Sequencer, appropriate checks are made for flush
packets as they have no valid data pointer.
This patch adds methods in KvmCPU model to handle KVM exits caused by syscall
instructions and page faults. These types of exits will be encountered if
KvmCPU is run in SE mode.
Another churn to clean up undefined behaviour, mostly ARM, but some
parts also touching the generic part of the code base.
Most of the fixes are simply ensuring that proper intialisation. One
of the more subtle changes is the return type of the sign-extension,
which is changed to uint64_t. This is to avoid shifting negative
values (undefined behaviour) in the ISA code.
Mwait works as follows:
1. A cpu monitors an address of interest (monitor instruction)
2. A cpu calls mwait - this loads the cache line into that cpu's cache.
3. The cpu goes to sleep.
4. When another processor requests write permission for the line, it is
evicted from the sleeping cpu's cache. This eviction is forwarded to the
sleeping cpu, which then wakes up.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Fixes a bug where Minor drains in the midst of committing a
conditional store.
While committing a conditional store, lastCommitWasEndOfMacroop is true
(from the previous instruction) as we still haven't finished the conditional
store. If a drain occurs before the cache response, Minor would check just
lastCommitWasEndOfMacroop, which was true, and set drainState=DrainHaltFetch,
which increases the streamSeqNum. This caused the conditional store to be
squashed when the memory responded and it completed. However, to the memory
the store succeeded, while to the instruction sequence it never occurred.
In the case of an LLSC, the instruction sequence will replay the squashed
STREX, which will fail as the cache is no longer in LLSC. Then the
instruction sequence will loop back to a LDREX, which receives the updated
(incorrect) value.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
It is possible for the O3 CPU to consider itself drained and
later have a squashed instruction perform a writeback. This
patch re-adds tracking of in-flight instructions to prevent
falsely signaling a drained event.
IEW did not check the instQueue and memDepUnit to ensure
they were drained. This caused issues when drainSanityCheck()
did check those structures after asserting IEW was drained.
This patch fixes a bug where a completing load or store which is also a
barrier can push a barrier into the store buffer without first checking
that there is a free slot.
The bug was not fatal but would print a warning that the store buffer
was full when inserting.
This patch takes quite a large step in transitioning from the ad-hoc
RefCountingPtr to the c++11 shared_ptr by adopting its use for all
Faults. There are no changes in behaviour, and the code modifications
are mostly just replacing "new" with "make_shared".
This patch transitions the o3 MemDepEntry from the ad-hoc
RefCountingPtr to the c++11 shared_ptr. There are no changes in
behaviour, and the code modifications are mainly replacing "new" with
"make_shared".
This changeset adds probe points that can be used to implement PMU
counters for CPU stats. The following probes are supported:
* BaseCPU::ppCycles / Cycles
* BaseCPU::ppRetiredInsts / RetiredInsts
* BaseCPU::ppRetiredLoads / RetiredLoads
* BaseCPU::ppRetiredStores / RetiredStores
* BaseCPU::ppRetiredBranches RetiredBranches
This changeset adds probe points that can be used to implement PMU
counters for branch predictor stats. The following probes are
supported:
* BPRedUnit::ppBranches / Branches
* BPRedUnit::ppMisses / Misses
The Ozone CPU is now very much out of date and completely
non-functional, with no one actively working on restoring it. It is a
source of confusion for new users who attempt to use it before
realizing its current state. RIP
This patch optimises the passing of StaticInstPtr by avoiding copying
the reference-counting pointer. This avoids first incrementing and
then decrementing the reference-counting pointer.
Fix a number few minor issues to please gcc 4.9.1. Removing the
'-fuse-linker-plugin' flag means no libraries are part of the LTO
process, but hopefully this is an acceptable loss, as the flag causes
issues on a lot of systems (only certain combinations of gcc, ld and
ar work).
The call paths for de-scheduling a thread are halt() and suspend(), from
the thread context. There is no call to deallocateContext() in general,
though some CPUs chose to define it. This patch removes the function
from BaseCPU and the cores which do not require it.
activate(), suspend(), and halt() used on thread contexts had an optional
delay parameter. However this parameter was often ignored. Also, when used,
the delay was seemily arbitrarily set to 0 or 1 cycle (no other delays were
ever specified). This patch removes the delay parameter and 'Events'
associated with them across all ISAs and cores. Unused activate logic
is also removed.
This patch changes the name of the Bus classes to XBar to better
reflect the actual timing behaviour. The actual instances in the
config scripts are not renamed, and remain as e.g. iobus or membus.
As part of this renaming, the code has also been clean up slightly,
making use of range-based for loops and tidying up some comments. The
only changes outside the bus/crossbar code is due to the delay
variables in the packet.
--HG--
rename : src/mem/Bus.py => src/mem/XBar.py
rename : src/mem/coherent_bus.cc => src/mem/coherent_xbar.cc
rename : src/mem/coherent_bus.hh => src/mem/coherent_xbar.hh
rename : src/mem/noncoherent_bus.cc => src/mem/noncoherent_xbar.cc
rename : src/mem/noncoherent_bus.hh => src/mem/noncoherent_xbar.hh
rename : src/mem/bus.cc => src/mem/xbar.cc
rename : src/mem/bus.hh => src/mem/xbar.hh
Add new DRAM_ROTATE mode to traffic generator.
This mode will generate DRAM traffic that rotates across
banks per rank, command types, and ranks per channel
The looping order is illustrated below:
for (ranks per channel)
for (command types)
for (banks per rank)
// Generate DRAM Command Series
This patch also adds the read percentage as an input argument to the
DRAM sweep script. If the simulated read percentage is 0 or 100, the
middle for loop does not generate additional commands. This loop is
used only when the read percentage is set to 50, in which case the
middle loop will toggle between read and write commands.
Modified sweep.py script, which generates DRAM traffic.
Added input arguments and support for new DRAM_ROTATE mode.
The script now has input arguments for:
1) Read percentage
2) Number of ranks
3) Address mapping
4) Traffic generator mode (DRAM or DRAM_ROTATE)
The default values are:
100% reads, 1 rank, RoRaBaCoCh address mapping, and DRAM traffic gen mode
For the DRAM traffic mode, added multi-rank support.
This patch does a bit of housekeeping on the string helper functions
and relies on the C++11 standard library where possible. It also does
away with our custom string hash as an implementation is already part
of the standard library.
This patch changes how faults are passed between methods in an attempt
to copy as few reference-counting pointer instances as possible. This
should avoid unecessary copies being created, contributing to the
increment/decrement of the reference counters.
This patch fixes cases where uncacheable/memory type flags are not set
correctly on a memory op which is split in the LSQ. Without this
patch, request->request if freely used to check flags where the flags
should actually come from the accumulation of request fragment flags.
This patch also fixes a bug where an uncacheable access which passes
through tryToSendRequest more than once can increment
LSQ::numAccessesInMemorySystem more than once.
This patch closes a number of space gaps in debug messages caused by
the incorrect use of line continuation within strings. (There's also
one consistency change to a similar, but correct, use of line
continuation)
Static analysis unearther a bunch of uninitialised variables and
members, and this patch addresses the problem. In all cases these
omissions seem benign in the end, but at least fixing them means less
false positives next time round.
This patch tidies up random number generation to ensure that it is
done consistently throughout the code base. In essence this involves a
clean-up of Ruby, and some code simplifications in the traffic
generator.
As part of this patch a bunch of skewed distributions (off-by-one etc)
have been fixed.
Note that a single global random number generator is used, and that
the object instantiation order will impact the behaviour (the sequence
of numbers will be unaffected, but if module A calles random before
module B then they would obviously see a different outcome). The
dependency on the instantiation order is true in any case due to the
execution-model of gem5, so we leave it as is. Also note that the
global ranom generator is not thread safe at this point.
Regressions using the memtest, TrafficGen or any Ruby tester are
affected and will be updated accordingly.
For X86, the o3 CPU would get stuck with the commit stage not being
drained if an interrupt arrived while drain was pending. isDrained()
makes sure that pcState.microPC() == 0, thus ensuring that we are at
an instruction boundary. However, when we take an interrupt we
execute:
pcState.upc(romMicroPC(entry));
pcState.nupc(romMicroPC(entry) + 1);
tc->pcState(pcState);
As a result, the MicroPC is no longer zero. This patch ensures the drain is
delayed until no interrupts are present. Once draining, non-synchronous
interrupts are deffered until after the switch.
Analogous to ee049bf (for x86). Requires a bump of the checkpoint version
and corresponding upgrader code to move the condition code register values
to the new register file.
A small bug in the bimodal predictor caused significant degradation in
performance on some benchmarks. This was caused by using the wrong
globalHistoryReg during the update phase. This patches fixes the bug
and brings the performance to normal level.
This patch fixes the load blocked/replay mechanism in the o3 cpu. Rather than
flushing the entire pipeline, this patch replays loads once the cache becomes
unblocked.
Additionally, deferred memory instructions (loads which had conflicting stores),
when replayed would not respect the number of functional units (only respected
issue width). This patch also corrects that.
Improvements over 20% have been observed on a microbenchmark designed to
exercise this behavior.
O3 is supposed to stop fetching instructions once a quiesce is encountered.
However due to a bug, it would continue fetching instructions from the current
fetch buffer. This is because of a break statment that only broke out of the
first of 2 nested loops. It should have broken out of both.
The o3 cpu could attempt to schedule inactive threads under round-robin SMT
mode.
This is because it maintained an independent priority list of threads from the
active thread list. This priority list could be come stale once threads were
inactive, leading to the cpu trying to fetch/commit from inactive threads.
Additionally the fetch queue is now forcibly flushed of instrctuctions
from the de-scheduled thread.
Relevant output:
24557000: system.cpu: [tid:1]: Calling deactivate thread.
24557000: system.cpu: [tid:1]: Removing from active threads list
24557500: system.cpu:
FullO3CPU: Ticking main, FullO3CPU.
24557500: system.cpu.fetch: Running stage.
24557500: system.cpu.fetch: Attempting to fetch from [tid:1]
When a branch mispredicted gem5 would squash all history after and including
the mispredicted branch. However, the mispredicted branch is still speculative
and its history is required to rollback state if another, older, branch
mispredicts. This leads to things like RAS corruption.
This patch adds a fetch queue that sits between fetch and decode to the
o3 cpu. This effectively decouples fetch from decode stalls allowing it
to be more aggressive, running futher ahead in the instruction stream.
The o3 pipeline interlock/stall logic is incorrect. o3 unnecessicarily stalled
fetch and decode due to later stages in the pipeline. In general, a stage
should usually only consider if it is stalled by the adjacent, downstream stage.
Forcing stalls due to later stages creates and results in bubbles in the
pipeline. Additionally, o3 stalled the entire frontend (fetch, decode, rename)
on a branch mispredict while the ROB is being serially walked to update the
RAT (robSquashing). Only should have stalled at rename.
As highlighed on the mailing list gem5's writeback modeling can impact
performance. This patch removes the limitation on maximum outstanding issued
instructions, however the number that can writeback in a single cycle is still
respected in instToCommit().
We currently generate and compile one version of the ISA code per CPU
model. This is obviously wasting a lot of resources at compile
time. This changeset factors out the interface into a separate
ExecContext class, which also serves as documentation for the
interface between CPUs and the ISA code. While doing so, this
changeset also fixes up interface inconsistencies between the
different CPU models.
The main argument for using one set of ISA code per CPU model has
always been performance as this avoid indirect branches in the
generated code. However, this argument does not hold water. Booting
Linux on a simulated ARM system running in atomic mode
(opt/10.linux-boot/realview-simple-atomic) is actually 2% faster
(compiled using clang 3.4) after applying this patch. Additionally,
compilation time is decreased by 35%.
The namespace Message conflicts with the Message data type used extensively
in Ruby. Since Ruby is being moved to the same Master/Slave ports based
configuration style as the rest of gem5, this conflict needs to be resolved.
Hence, the namespace is being renamed to ProtoMessage.
The branch predictor is normally only built when a CPU that uses a
branch predictor is built. The list of CPUs is currently incomplete as
the simple CPUs support branch predictors (for warming, branch stats,
etc). In practice, all CPU models now use branch predictors, so this
changeset removes the CPU model check and replaces it with a check for
the NULL ISA.
RefCountingPtr is sometimes forward declared to avoid having to
include refcnt.hh. This does not work since we typically return
instances of RefCountingPtr rather than references to instances. The
only reason this currently works is that we include refcnt.hh in
cprintf.hh, which "leaks" the header to most other source files. This
changeset replaces such forward declarations with an include of
refcnt.hh.
This patch does some minor house keeping of the branch predictor by
adopting STL containers, and shifting some iterator to use range-based
for loops.
The predictor history is also changed from a list to a deque as we
never to insertion/deletion other than at the front and back.
This patch adds a check to ensure that packets which are not going to
a memory range are suppressed in the traffic generator. Thus, if a
trace is collected in full-system, the packets destined for devices
are not played back.
This patch contains a new CPU model named `Minor'. Minor models a four
stage in-order execution pipeline (fetch lines, decompose into
macroops, decompose macroops into microops, execute).
The model was developed to support the ARM ISA but should be fixable
to support all the remaining gem5 ISAs. It currently also works for
Alpha, and regressions are included for ARM and Alpha (including Linux
boot).
Documentation for the model can be found in src/doc/inside-minor.doxygen and
its internal operations can be visualised using the Minorview tool
utils/minorview.py.
Minor was designed to be fairly simple and not to engage in a lot of
instruction annotation. As such, it currently has very few gathered
stats and may lack other gem5 features.
Minor is faster than the o3 model. Sample results:
Benchmark | Stat host_seconds (s)
---------------+--------v--------v--------
(on ARM, opt) | simple | o3 | minor
| timing | timing | timing
---------------+--------+--------+--------
10.linux-boot | 169 | 1883 | 1075
10.mcf | 117 | 967 | 491
20.parser | 668 | 6315 | 3146
30.eon | 542 | 3413 | 2414
40.perlbmk | 2339 | 20905 | 11532
50.vortex | 122 | 1094 | 588
60.bzip2 | 2045 | 18061 | 9662
70.twolf | 207 | 2736 | 1036
Check for free entries in Load Queue and Store Queue separately to
avoid cases when load cannot be renamed due to full Store Queue and
vice versa.
This work was done while Binh was an intern at AMD Research.
Using '== true' in a boolean expression is totally redundant,
and using '== false' is pretty verbose (and arguably less
readable in most cases) compared to '!'.
It's somewhat of a pet peeve, perhaps, but I had some time
waiting for some tests to run and decided to clean these up.
Unfortunately, SLICC appears not to have the '!' operator,
so I had to leave the '== false' tests in the SLICC code.
This patch removes the stat totalCommittedInsts. This variable was used for
recording the total number of instructions committed across all the threads
of a core. The instructions committed by each thread are recorded invidually.
The total would now be generated by summing these individual counts.
This patch adds a the member function StaticInst::printFlags to allow all
of an instruction's flags to be printed without using the individual
is... member functions or resorting to exposing the 'flags' vector
It also replaces the enum definition StaticInst::Flags with a
Python-generated enumeration and adds to the enum generation mechanism
in src/python/m5/params.py to allow Enums to be placed in namespaces
other than Enums or, alternatively, in wrapper structs allowing them to
be inherited by other classes (so populating that class's name-space
with the enumeration element names).
The ARM TLBs have a bootUncacheability flag used to make some loads
and stores become uncacheable when booting in FS mode. Later the
flag is cleared to let those loads and stores operate as normal. When
doing a takeOverFrom(), this flag's state is not preserved and is
momentarily reset until the CPSR is touched. On single core runs this
is a non-issue. On multi-core runs this can lead to crashes on the O3
CPU model from the following series of events:
1) takeOverFrom executed to switch from Atomic -> O3
2) All bootUncacheability flags are reset to true
3) Core2 tries to execute a load covered by bootUncacheability, it
is flagged as uncacheable
4) Core2's load needs to replay due to a pipeline flush
3) Core1 core does an action on CPSR
4) The handling code for CPSR then checks all other cores
to determine if bootUncacheability can be set to false
5) Asynchronously set bootUncacheability on all cores to false
6) Core2 replays load previously set as uncacheable and notices
it is now flagged as cacheable, leads to a panic.
This patch implements takeOverFrom() functionality for the ARM TLBs
to preserve flag values when switching from atomic -> detailed.
For the o3, add instruction mix (OpClass) histogram at commit (stats
also already collected at issue). For the simple CPUs we add a
histogram of executed instructions
Allow the specification of a socket ID for every core that is reflected in the
MPIDR field in ARM systems. This allows studying multi-socket / cluster
systems with ARM CPUs.
In the O3 LSQ, data read/written is printed out in DPRINTFs. However,
the data field is treated as a character string with a null terminated.
However the data field is not encoded this way. This patch removes
that possibility by removing the data part of the print.
O3CPU has a compile-time maximum width set in o3/impl.hh, but checking
the configuration against this limit was not implemented anywhere
except for fetch. Configuring a wider pipe than the limit can silently
cause various issues during the simulation. This patch adds the proper
checking in the constructor of the various pipeline stages.
A number of calls to isEmpty() and numFreeEntries()
should be thread-specific.
In cpu.cc, the fact that tid is /*commented*/ out is a bug. Say the rob
has instructions from thread 0 (isEmpty() returns false), and none from
thread 1. If we are trying to squash all of thread 1, then
readTailInst(thread 1) will be called because rob->isEmpty() returns
false. The result is end_it is not in the list and the while
statement loops indefinitely back over the cpu's instList.
In iew_impl.hh, all threads are told they have the entire remaining IQ, when
each thread actually has a certain allocation. The result is extra stalls at
the iew dispatch stage which the rename stage usually takes care of.
In commit_impl.hh, rob->readHeadInst(thread 1) can be called if the rob only
contains instructions from thread 0. This returns a dummyInst (which may work
since we are trying to squash all instructions, but hardly seems like the right
way to do it).
In rob_impl.hh this fix skips the rest of the function more frequently and is
more efficient.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Simulating a SMP or multicore requires devices to be shared between
multiple KVM vCPUs. This means that locking is required when accessing
devices. This changeset adds the necessary locking to allow devices to
execute correctly. It is implemented by temporarily migrating the KVM
CPU to the VM's (and devices) event queue when handling
MMIO. Similarly, the VM migrates to the interrupt controller's event
queue when delivering an interrupt.
The support for fast-forwarding of multicore simulations added by this
changeset assumes that all devices in a system are simulated in the
same thread and each vCPU has its own thread. Special care must be
taken to ensure that devices living under the CPU in the object
hierarchy (e.g., the interrupt controller) do not inherit the parent
CPUs thread and are assigned to device thread. The KvmVM object is
assumed to live in the same thread as the other devices in the system.
This patch fixes violation of TSO in the O3CPU, as all loads must be
ordered with all other loads. In the LQ, if a snoop is observed, all
subsequent loads need to be squashed if the system is TSO.
Prior to this patch, the following case could be violated:
P0 | P1 ;
MOV [x],mail=/usr/spool/mail/nilay | MOV EAX,[y] ;
MOV [y],mail=/usr/spool/mail/nilay | MOV EBX,[x] ;
exists (1:EAX=1 /\ 1:EBX=0) [is a violation]
The problem was found using litmus [http://diy.inria.fr].
Committed by: Nilay Vaish <nilay@cs.wisc.edu
This patch enables a new 'DRAM' mode to the existing traffic
generator, catered to generate specific requests to DRAM based on
required hit length (stride size) and bank utilization. It is an add on
to the Random mode.
The basic idea is to control how many successive packets target the
same page, and how many banks are being used in parallel. This gives a
two-dimensional space that stresses different aspects of the DRAM
timing.
The configuration file needed to use this patch has to be changed as
follow: (reference to Random Mode, LPDDR3 memory type)
'STATE 0 10000000000 RANDOM 50 0 134217728 64 3004 5002 0'
-> 'STATE 0 10000000000 DRAM 50 0 134217728 32 3004 5002 0 96 1024 8 6 1'
The last 4 parameters to be added are:
<stride size (bytes), page size(bytes), number of banks available in DRAM,
number of banks to be utilized, address mapping scheme>
The address mapping information is used to get the stride address
stream of the specified size and to know where to find the bank
bits. The configuration file has a parameter where '0'-> RoCoRaBaCh,
'1'-> RoRaBaCoCh/RoRaBaChCo address-mapping schemes. Note that the
generator currently assumes a single channel and a single rank. This
is to avoid overwhelming the traffic generator with information about
the memory organisation.
Prevent incomplete configuration of TrafficGen class from causing
segmentation faults. If an 'INIT' line is not present in the
configuration file then the currState variable will remain
uninitialized which may result in a crash.
KVM used to use two signals, one for instruction count exits and one
for timer exits. There is really no need to distinguish between the
two since they only trigger exits from KVM. This changeset unifies and
renames the signals and adds a method, kick(), that can be used to
raise the control signal in the vCPU thread. It also removes the early
timer warning since we do not normally see if the signal was
delivered.
--HG--
extra : rebase_source : cd0e45ca90894c3d6f6aa115b9b06a1d8f0fda4d
gem5 seems to store the PC as RIP+CS_BASE. This is not what KVM
expects, so we need to subtract CS_BASE prior to transferring the PC
into KVM. This changeset adds the necessary PC manipulation and
refactors thread context updates slightly to avoid reading registers
multiple times from KVM.
--HG--
extra : rebase_source : 3f0569dca06a1fcd8694925f75c8918d954ada44
This changeset adds support for INIT and STARTUP IPI handling. We
currently handle both of these interrupts in gem5 and transfer the
state to KVM. Since we do not have a BIOS loaded, we pretend that the
INIT interrupt suspends the CPU after reset.
--HG--
extra : rebase_source : 7f3b25f3801d68f668b6cd91eaf50d6f48ee2a6a
When transferring segment registers into kvm, we need to find the
value of the unusable bit. We used to assume that this could be
inferred from the selector since segments are generally unusable if
their selector is 0. This assumption breaks in some weird corner
cases. Instead, we just assume that segments are always usable. This
is what qemu does so it should work.
Signal handlers in KVM are controlled per thread and should be
initialized from the thread that is going to execute the CPU. This
changeset moves the initialization call from startup() to
startupThread().
A copyRegs() function is added to MIPS utilities
to copy architectural state from the old CPU to
the new CPU during fast-forwarding. This
addition alone enables fast-forwarding for the
o3 cpu model running MIPS.
The patch also adds takeOverFrom() and
drainResume() functions to the InOrderCPU to
enable it to take over from another CPU. This
change enables fast-forwarding for the inorder
cpu model running MIPS, but not for Alpha.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
The introduction of parallel event queues added most of the support
needed to run multiple VMs (systems) within the same gem5
instance. This changeset fixes up signal delivery so that KVM's
control signals are delivered to the thread that executes the CPU's
event queue. Specifically:
* Timers and counters are now initialized from a separate method
(startupThread) that is scheduled as the first event in the
thread-specific event queue. This ensures that they are
initialized from the thread that is going to execute the CPUs
event queue and enables signal delivery to the right thread when
exiting from KVM.
* The POSIX-timer-based KVM timer (used to force exits from KVM) has
been updated to deliver signals to the thread that's executing KVM
instead of the process (thread is undefined in that case). This
assumes that the timer is instantiated from the thread that is
going to execute the KVM vCPU.
* Signal masking is now done using pthread_sigmask instead of
sigprocmask. The behavior of the latter is undefined in threaded
applications.
* Since signal masks can be inherited, make sure to actively unmask
the control signals when setting up the KVM signal mask.
There are currently no facilities to multiplex between multiple KVM
CPUs in the same event queue, we are therefore limited to
configurations where there is only one KVM CPU per event queue. In
practice, this means that multi-system configurations can be
simulated, but not multiple CPUs in a shared-memory configuration.
This changesets adds branch predictor support to the
BaseSimpleCPU. The simple CPUs normally don't need a branch predictor,
however, there are at least two cases where it can be desirable:
1) A simple CPU can be used to warm the branch predictor of an O3
CPU before switching to the slower O3 model.
2) The simple CPU can be used as a quick way of evaluating/debugging
new branch predictors since it exposes branch predictor
statistics.
Limitations:
* Since the simple CPU doesn't speculate, only one instruction will
be active in the branch predictor at a time (i.e., the branch
predictor will never see speculative branches).
* The outcome of a branch prediction does not affect the performance
of the simple CPU.
Note: AArch64 and AArch32 interworking is not supported. If you use an AArch64
kernel you are restricted to AArch64 user-mode binaries. This will be addressed
in a later patch.
Note: Virtualization is only supported in AArch32 mode. This will also be fixed
in a later patch.
Contributors:
Giacomo Gabrielli (TrustZone, LPAE, system-level AArch64, AArch64 NEON, validation)
Thomas Grocutt (AArch32 Virtualization, AArch64 FP, validation)
Mbou Eyole (AArch64 NEON, validation)
Ali Saidi (AArch64 Linux support, code integration, validation)
Edmund Grimley-Evans (AArch64 FP)
William Wang (AArch64 Linux support)
Rene De Jong (AArch64 Linux support, performance opt.)
Matt Horsnell (AArch64 MP, validation)
Matt Evans (device models, code integration, validation)
Chris Adeniyi-Jones (AArch64 syscall-emulation)
Prakash Ramrakhyani (validation)
Dam Sunwoo (validation)
Chander Sudanthi (validation)
Stephan Diestelhorst (validation)
Andreas Hansson (code integration, performance opt.)
Eric Van Hensbergen (performance opt.)
Gabe Black
The CheckerCPU model in pre-v8 code was not checking the
updates to miscellaneous registers due to some methods
for setting misc regs were not instrumented. The v8 patches
exposed this by calling the instrumented misc reg update
methods and then invoking the checker before the main CPU had
updated its misc regs, leading to false positives about
register mismatches. This patch fixes the non-instrumented
misc reg update methods and places calls to the checker in
the proper places in the O3 model.
With ARMv8 support the same misc register id results in accessing different
registers depending on the current mode of the processor. This patch adds
the same orthogonality to the misc register file as the others (int, float, cc).
For all the othre ISAs this is currently a null-implementation.
Additionally, a system variable is added to all the ISA objects.
This patch add support for generating wake-up events in the CPU when an address
that is currently in the exclusive state is hit by a snoop. This mechanism is required
for ARMv8 multi-processor support.
This patch enables tracking of cache occupancy per thread along with
ages (in buckets) per cache blocks. Cache occupancy stats are
recalculated on each stat dump.
The probe patch is motivated by the desire to move analytical and trace code
away from functional code. This is achieved by the probe interface which is
essentially a glorified observer model.
What this means to users:
* add a probe point and a "notify" call at the source of an "event"
* add an isolated module, that is being used to carry out *your* analysis (e.g. generate a trace)
* register that module as a probe listener
Note: an example is given for reference in src/cpu/o3/simple_trace.[hh|cc] and src/cpu/SimpleTrace.py
What is happening under the hood:
* every SimObject maintains has a ProbeManager.
* during initialization (src/python/m5/simulate.py) first regProbePoints and
the regProbeListeners is called on each SimObject. this hooks up the probe
point notify calls with the listeners.
FAQs:
Why did you develop probe points:
* to remove trace, stats gathering, analytical code out of the functional code.
* the belief that probes could be generically useful.
What is a probe point:
* a probe point is used to notify upon a given event (e.g. cpu commits an instruction)
What is a probe listener:
* a class that handles whatever the user wishes to do when they are notified
about an event.
What can be passed on notify:
* probe points are templates, and so the user can generate probes that pass any
type of argument (by const reference) to a listener.
What relationships can be generated (1:1, 1:N, N:M etc):
* there isn't a restriction. You can hook probe points and listeners up in a
1:1, 1:N, N:M relationship. They become useful when a number of modules
listen to the same probe points. The idea being that you can add a small
number of probes into the source code and develop a larger number of useful
analysis modules that use information passed by the probes.
Can you give examples:
* adding a probe point to the cpu's commit method allows you to build a trace
module (outputting assembler), you could re-use this to gather instruction
distribution (arithmetic, load/store, conditional, control flow) stats.
Why is the probe interface currently restricted to passing a const reference:
* the desire, initially at least, is to allow an interface to observe
functionality, but not to change functionality.
* of course this can be subverted by const-casting.
What is the performance impact of adding probes:
* when nothing is actively listening to the probes they should have a
relatively minor impact. Profiling has suggested even with a large number of
probes (60) the impact of them (when not active) is very minimal (<1%).
Add some values and methods to the request object to track the translation
and access latency for a request and which level of the cache hierarchy responded
to the request.
This patch relaxes the check performed when squashing non-speculative
instructions, as it caused problems with loads that were marked ready,
and then stalled on a blocked cache. The assertion is now allowing
memory references to be non-faulting.
This patch removes an assertion in the simpoint profiling code that
asserts that a previously-seen basic block has the exact same number
of instructions executed as before. This can be false if the basic
block generates aborts or takes interrupts at different locations
within the basic block. The basic block profiling are not affected
significantly as these events are rare in general.
The performance counting framework in Linux 3.2 and onwards supports
an attribute to exclude events generated by the host when running
KVM. Setting this attribute allows us to get more reliable
measurements of the guest machine. For example, on a highly loaded
system, the instruction counts from the guest can be severely
distorted by the host kernel (e.g., by page fault handlers).
This changeset introduces a check for the attribute and enables it in
the KVM CPU if present.
This patch adds support for simulating with multiple threads, each of
which operates on an event queue. Each sim object specifies which eventq
is would like to be on. A custom barrier implementation is being added
using which eventqs synchronize.
The patch was tested in two different configurations:
1. ruby_network_test.py: in this simulation L1 cache controllers receive
requests from the cpu. The requests are replied to immediately without
any communication taking place with any other level.
2. twosys-tsunami-simple-atomic: this configuration simulates a client-server
system which are connected by an ethernet link.
We still lack the ability to communicate using message buffers or ports. But
other things like simulation start and end, synchronizing after every quantum
are working.
Committed by: Nilay Vaish
the current implementation of the fetch buffer in the o3 cpu
is only allowed to be the size of a cache line. some
architectures, e.g., ARM, have fetch buffers smaller than a cache
line, see slide 22 at:
http://www.arm.com/files/pdf/at-exploring_the_design_of_the_cortex-a15.pdf
this patch allows the fetch buffer to be set to values smaller
than a cache line.
This patch fixes an issue in the checker CPU register indexing. The
code will not even compile using LTO as deep inlining causes the used
index to be outside the array bounds.
Fix a problem in the O3 CPU for instructions that are both
memory loads and memory barriers (e.g. load acquire) and
to uncacheable memory. This combination can confuse the
commit stage into commitng an instruction that hasn't
executed and got it's value yet. At the same time refactor
the code slightly to remove duplication between two of
the cases.
LSQSenderState represents the LQ/SQ index using uint8_t, which supports up to
256 entries (including the sentinel entry). Sending packets to memory with a
higher index than 255 truncates the index, such that the response matches the
wrong entry. For instance, this can result in a deadlock if a store completion
does not clear the head entry.
This change fixes an issue in the O3 CPU where an uncachable instruction
is attempted to be executed before it reaches the head of the ROB. It is
determined to be uncacheable, and is replayed, but a PanicFault is attached
to the instruction to make sure that it is properly executed before
committing. If the TLB entry it was using is replaced in the interveaning
time, the TLB returns a delayed translation when the load is replayed at
the head of the ROB, however the LSQ code can't differntiate between the
old fault and the new one. If the translation isn't complete it can't
be faulting, so clear the fault.
When handling IPR accesses in doMMIOAccess, the KVM CPU used
clockEdge() to convert between cycles and ticks. This is incorrect
since doMMIOAccess is supposed to return a latency in ticks rather
than when the access is done. This changeset fixes this issue by
returning clockPeriod() * ipr_delay instead.
Convert condition code registers from being specialized
("pseudo") integer registers to using the recently
added CC register class.
Nilay Vaish also contributed to this patch.
Restructured rename map and free list to clean up some
extraneous code and separate out common code that can
be reused across different register classes (int and fp
at this point). Both components now consist of a set
of Simple* objects that are stand-alone rename map &
free list for each class, plus a Unified* object that
presents a unified interface across all register
classes and then redirects accesses to the appropriate
Simple* object as needed.
Moved free list initialization to PhysRegFile to better
isolate knowledge of physical register index mappings
to that class (and remove the need to pass a number
of parameters to the free list constructor).
Causes a small change to these stats:
cpu.rename.int_rename_lookups
cpu.rename.fp_rename_lookups
because they are now categorized on a per-operand basis
rather than a per-instruction basis.
That is, an instruction with mixed fp/int/misc operand
types will have each operand categorized independently,
where previously the lookup was categorized based on
the instruction type.
Make these names more meaningful.
Specifically, made these substitutions:
s/FP_Base_DepTag/FP_Reg_Base/g;
s/Ctrl_Base_DepTag/Misc_Reg_Base/g;
s/Max_DepTag/Max_Reg_Index/g;
It had a bunch of fields (and associated constructor
parameters) thet it didn't really use, and the array
initialization was needlessly verbose.
Also just hardwired the getReg() method to aleays
return true for misc regs, rather than having an array
of bits that we always kept marked as ready.
No need for PhysRegFile to be a template class, or
have a pointer back to the CPU. Also made some methods
for checking the physical register type (int vs. float)
based on the phys reg index, which will come in handy later.
The previous patch introduced a RegClass enum to clean
up register classification. The inorder model already
had an equivalent enum (RegType) that was used internally.
This patch replaces RegType with RegClass to get rid
of the now-redundant code.
Move from a poorly documented scheme where the mapping
of unified architectural register indices to register
classes is hardcoded all over to one where there's an
enum for the register classes and a function that
encapsulates the mapping.
This changset adds calls to the service the instruction event queues
that accidentally went missing from commit [0063c7dd18ec]. The
original commit only included the code needed to schedule instruction
stops from KVM and missed the functionality to actually service the
events.
Instruction events are currently ignored when executing in KVM. This
changeset adds support for triggering KVM exits based on instruction
counts using hardware performance counters. Depending on the
underlying performance counter implementation, there might be some
inaccuracies due to instructions being counted in the host kernel when
entering/exiting KVM.
Due to limitations/bugs in Linux's performance counter interface, we
can't reliably change the period of an overflow counter. We work
around this issue by detaching and reattaching the counter if we need
to reconfigure it.
This changeset adds support for synchronizing the FPU and SIMD state
of a virtual x86 CPU with gem5. It supports both the XSave API and the
KVM_(GET|SET)_FPU kernel API. The XSave interface can be disabled
using the useXSave parameter (in case of kernel
issues). Unfortunately, KVM_(GET|SET)_FPU interface seems to be buggy
in some kernels (specifically, the MXCSR register isn't always
synchronized), which means that it might not be possible to
synchronize MXCSR on old kernels without the XSave interface.
This changeset depends on the __float80 type in gcc and might not
build using llvm.
There are cases when the segment registers in gem5 are not compatible
with VMX. This changeset works around all known such issues. Specifically:
* The accessed bits in CS, SS, DD, ES, FS, GS are forced to 1.
* The busy bit in TR is forced to 1.
* The protection level of SS is forced to the same protection level as
CS. The difference /seems/ to be caused by a bug in gem5's x86
implementation.