Currently if there are shell special characters in a
command-line argument, you can't copy and paste the
echoed command line onto a shell prompt because the
characters aren't quoted properly. This patch fixes
that problem.
This patch accomplishes two things:
1. Makes simulate()'s GlobalSimLoopExitEvent a singleton reused
across calls. This is slightly more efficient than recreating
it every time.
2. Gives callers to simulate() (especially other simulators) a
foolproof way of knowing that the simulation period ended
successfully by hitting the limit event. They can call
getLimitEvent() and compare it to the return
value of simulate().
This change was motivated by an ongoing effort to integrate gem5
and SST, with SST as the master sim and gem5 as the slave sim.
This patch adds a new PIO-accessible GICv2m shim. This shim has a PIO
slave port on one side, and SPI 'wires' on the other. It accepts MSIs
from the system and triggers SPIs on the GIC. It is configurable with
a number of frames, each of which has a number of SPIs and a base SPI
offset.
A Linux driver for GICv2m is available upstream.
This patch removes the code that added this magic register. A
follow-up patch provides a GICv2m MSI shim that gives the same
functionality in a standard ARM system architecture way.
Fix erroneous message format for fatal error.
Previously, code did not have type indicator (% instead of %d).
Also removed redundant fatal check.
Ran modified sweep.py with in range and out of range values to test.
The CommMonitor by default only allows memory traces to be gathered in
timing mode. This patch allows memory traces to be gathered in atomic
mode if all one needs is a functional trace of memory addresses used
and timing information is of a secondary concern.
For some reason we were checking mshr->hasTargets() even though
we had already called mshr->getTarget() unconditionally earlier
in the same function (which asserts if there are no targets).
Get rid of this useless check, and while we're at it get rid
of the redundant call to mshr->getTarget(), since we still have
the value saved in a local var.
Refactor the way that specific MemCmd values are generated for packets.
The new approach is a little more elegant in that we assign the right
value up front, and it's also more amenable to non-heap-allocated
Packet objects.
Also replaced the code in the Minor model that was still doing it the
ad-hoc way.
This is basically a refinement of http://repo.gem5.org/gem5/rev/711eb0e64249.
The 'if (writebacks.size)' check was redundant, because
writeBuffer.findMatches() would return false if the
writebacks list was empty.
Also renamed 'mshr' to 'wb_entry' in this context since
we are pointing at a writebuffer entry and not an MSHR
(even though it's the same C++ class).
The variable is used in only one place and a whole new function setNextStatus()
has been defined just to compute the value of the variable. Instead of calling
the function, the value is now computed in the loop that preceded the function
call.
This patch changes all the DPRINTF messages in the cache to use
'%#llx' every time a packet address is printed. The inclusion of '#'
ensures '0x' is prepended, and since the address type is a uint64_t %x
really should be %llx.
This patch fixes a rather subtle issue in the sending of MSHR requests
in the cache, where the logic previously did not check for conflicts
between the MSRH queue and the write queue when requests were not
ready. The correct thing to do is to always check, since not having a
ready MSHR does not guarantee that there is no conflict.
The underlying problem seems to have slipped past due to the symmetric
timings used for the write queue and MSHR queue. However, with the
recent timing changes the bug caused regressions to fail.
This patch changes the valid-bytes start/end to a proper byte
mask. With the changes in timing introduced in previous patches there
are more packets waiting in queues, and there are regressions using
the checker CPU failing due to non-contigous read data being found in
the various cache queues.
This patch also adds some more comments explaining what is going on,
and adds the fourth and missing case to Packet::checkFunctional.
By default, the packet queue is ordered by the ticks of the to-be-sent
packages. With the recent modifications of packages sinking their header time
when their resposne leaves the caches, there could be cases of MSHR targets
being allocated and ordered A, B, but their responses being sent out in the
order B,A. This led to inconsistencies in bus traffic, in particular the snoop
filter observing first a ReadExResp and later a ReadRespWithInv. Logically,
these were ordered the other way around behind the MSHR, but due to the timing
adjustments when inserting into the PacketQueue, they were sent out in the
wrong order on the bus, confusing the snoop filter.
This patch adds a flag (off by default) such that these special cases can
request in-order insertion into the packet queue, which might offset timing
slighty. This is expected to occur rarely and not affect timing results.
This patch makes the caches and memory controllers consume the delay
that is annotated to a packet by the crossbar. Previously many
components simply threw these delays away. Note that the devices still
do not pay for these delays.
This patch introduces a few subclasses to the CoherentXBar and
NoncoherentXBar to distinguish the different uses in the system. We
use the crossbar in a wide range of places: interfacing cores to the
L2, as a system interconnect, connecting I/O and peripherals,
etc. Needless to say, these crossbars have very different performance,
and the clock frequency alone is not enough to distinguish these
scenarios.
Instead of trying to capture every possible case, this patch
introduces dedicated subclasses for the three primary use-cases:
L2XBar, SystemXBar and IOXbar. More can be added if needed, and the
defaults can be overridden.
This patch introduces latencies in crossbar that were neglected
before. In particular, it adds three parameters in crossbar model:
front_end_latency, forward_latency, and response_latency. Along with
these parameters, three corresponding members are added:
frontEndLatency, forwardLatency, and responseLatency. The coherent
crossbar has an additional snoop_response_latency.
The latency of the request path through the xbar is set as
--> frontEndLatency + forwardLatency
In case the snoop filter is enabled, the request path latency is charged
also by look-up latency of the snoop filter.
--> frontEndLatency + SF(lookupLatency) + forwardLatency.
The latency of the response path through the xbar is set instead as
--> responseLatency.
In case of snoop response, if the response is treated as a normal response
the latency associated is again
--> responseLatency;
If instead it is forwarded as snoop response we add an additional variable
+ snoopResponseLatency
and the latency associated is
--> snoopResponseLatency;
Furthermore, this patch lets the crossbar progress on the next clock
edge after an unused retry, changing the time the crossbar considers
itself busy after sending a retry that was not acted upon.
The ARM PL011 UART model didn't clear and raise interrupts
correctly. This changeset rewrites the whole interrupt handling and
makes it both simpler and fixes several cases where the correct
interrupts weren't raised or cleared. Additionally, it cleans up many
other aspects of the code.
This patch changes how the MMU and table walkers are created such that
a single port is used to connect the MMU and the TLBs to the memory
system. Previously two ports were needed as there are two table walker
objects (stage one and stage two), and they both had a port. Now the
port itself is moved to the Stage2MMU, and each TableWalker is simply
using the port from the parent.
By using the same port we also remove the need for having an
additional crossbar joining the two ports before the walker cache or
the L2. This simplifies the creation of the CPU cache topology in
BaseCPU.py considerably. Moreover, for naming and symmetry reasons,
the TLB walker port is connected through the stage-one table walker
thus making the naming identical to x86. Along the same line, we use
the stage-one table walker to generate the master id that is used by
all TLB-related requests.
Now, prior to the renaming, the instruction requests the exact amount of
registers it will need, and the rename_map decides whether the instruction is
allowed to proceed or not.
This patch fixes a long-standing isue with the port flow
control. Before this patch the retry mechanism was shared between all
different packet classes. As a result, a snoop response could get
stuck behind a request waiting for a retry, even if the send/recv
functions were split. This caused message-dependent deadlocks in
stress-test scenarios.
The patch splits the retry into one per packet (message) class. Thus,
sendTimingReq has a corresponding recvReqRetry, sendTimingResp has
recvRespRetry etc. Most of the changes to the code involve simply
clarifying what type of request a specific object was accepting.
The biggest change in functionality is in the cache downstream packet
queue, facing the memory. This queue was shared by requests and snoop
responses, and it is now split into two queues, each with their own
flow control, but the same physical MasterPort. These changes fixes
the previously seen deadlocks.
This patch resolves a bug with hardware prefetches. Before a hardware prefetch
is sent towards the memory, the system generates a snoop request to check all
caches above the prefetch generating cache for the presence of the prefetth
target. If the prefetch target is found in the tags or the MSHRs of the upper
caches, the cache sets the prefetchSquashed flag in the snoop packet. When the
snoop packet returns with the prefetchSquashed flag set, the prefetch
generating cache deallocates the MSHR reserved for the prefetch. If the
prefetch target is found in the writeback buffer of the upper cache, the cache
sets the memInhibit flag, which signals the prefetch generating cache to
expect the data from the writeback. When the snoop packet returns with the
memInhibitAsserted flag set, it marks the allocated MSHR as inService and
waits for the data from the writeback.
If the prefetch target is found in multiple upper level caches, specifically
in the tags or MSHRs of one upper level cache and the writeback buffer of
another, the snoop packet will return with both prefetchSquashed and
memInhibitAsserted set, while the current code is not written to handle such
an outcome. Current code checks for the prefetchSquashed flag first, if it
finds the flag, it deallocates the reserved MSHR. This leads to assert failure
when the data from the writeback appears at cache. In this fix, we simply
switch the order of checks. We first check for memInhibitAsserted and then for
prefetch squashed.
Have the traffic generator add its masterID as the PC address to the
requests. That way, prefetchers (and other components) that use a PC
for request classification will see per-tester streams of requests.
This enables us to test strided prefetchers with the memchecker, too.
The ISA code sometimes stores 16-bit ASIDs as 8-bit unsigned integers
and has a couple of inverted checks that mask out the high 8 bits of
an ASID if 16-bit ASIDs have been /enabled/. This changeset fixes both
of those issues.
We curently use INTREG_X31 instead of INTREG_SPX when accessing the
stack pointer in GDB. gem5 normally uses INTREG_SPX to access the
stack pointer, which gets mapped to the stack pointer corresponding
(INTREG_SPn) to the current exception level. This changeset updates
the GDB interface to use SPX instead of X31 (which is always zero)
when transfering CPU state to gdb.
The remote GDB interface currently doesn't check if translations are
valid before reading memory. This causes a panic when GDB tries to
access unmapped memory (e.g., when getting a stack trace). There are
two reasons for this: 1) The function used to check for valid
translations (virtvalid()) doesn't work and panics on invalid
translations. 2) The method in the GDB interface used to test if a
translation is valid (RemoteGDB::acc) always returns true regardless
of the return from virtvalid().
This changeset fixes both of these issues.
Previously, the user would have to manually set access_backing_store=True
on all RubyPorts (Sequencers) in the config files.
Now, instead there is one global option that each RubyPort checks on
initialization.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
To be able to use the TrafficGen in a system with caches we need to
allow it to sink incoming snoop requests. By default the master port
panics, so silently ignore any snoops.
In highly loaded cases, reads might actually overlap with writes to the
initial memory state. The mem checker needs to detect such cases and
permit the read reading either from the writes (what it is doing now) or
read from the initial, unknown value.
This patch adds this logic.
This patch fixes a rather unfortunate oversight where the annotation
pointer was used even though it is null. Somehow the code still works,
but UBSan is rather unhappy. The use is now guarded, and the variable
is initialised in the constructor (as well as init()).
Move the (common) GIC initialization code that notifies the platform
code of the new GIC to the base class (BaseGic) instead of the Pl390
implementation.
This patch ensures we can run simulations with very large simulated
memories (at least 64 TB based on some quick runs on a Linux
workstation). In essence this allows us to efficiently deal with
sparse address maps without having to implement a redirection layer in
the backing store.
This opens up for run-time errors if we eventually exhausts the hosts
memory and swap space, but this should hopefully never happen.
This patch changes the range cache used in the global physical memory
to be an iterator so that we can use it not only as part of isMemAddr,
but also access and functionalAccess. This matches use-cases where a
core is using the atomic non-caching memory mode, and repeatedly calls
isMemAddr and access.
Linux boot on aarch32, with a single atomic CPU, is now more than 30%
faster when using "--fastmem" compared to not using the direct memory
access.
This changeset moves the pseudo instructions used to signal unknown
instructions and unimplemented instructions to the same source files
as the decoder fault.
This patch clarifies the packet timings annotated
when going through a crossbar.
The old 'firstWordDelay' is replaced by 'headerDelay' that represents
the delay associated to the delivery of the header of the packet.
The old 'lastWordDelay' is replaced by 'payloadDelay' that represents
the delay needed to processing the payload of the packet.
For now the uses and values remain identical. However, going forward
the payloadDelay will be additive, and not include the
headerDelay. Follow-on patches will make the headerDelay capture the
pipeline latency incurred in the crossbar, whereas the payloadDelay
will capture the additional serialisation delay.
This patch adds some much-needed clarity in the specification of the
cache timing. For now, hit_latency and response_latency are kept as
top-level parameters, but the cache itself has a number of local
variables to better map the individual timing variables to different
behaviours (and sub-components).
The introduced variables are:
- lookupLatency: latency of tag lookup, occuring on any access
- forwardLatency: latency that occurs in case of outbound miss
- fillLatency: latency to fill a cache block
We keep the existing responseLatency
The forwardLatency is used by allocateInternalBuffer() for:
- MSHR allocateWriteBuffer (unchached write forwarded to WriteBuffer);
- MSHR allocateMissBuffer (cacheable miss in MSHR queue);
- MSHR allocateUncachedReadBuffer (unchached read allocated in MSHR
queue)
It is our assumption that the time for the above three buffers is the
same. Similarly, for snoop responses passing through the cache we use
forwardLatency.
The MemTest class really only tests false sharing, and as such there
was a lot of old cruft that could be removed. This patch cleans up the
tester, and also makes it more clear what the assumptions are. As part
of this simplification the reference functional memory is also
removed.
The regression configs using MemTest are updated to reflect the
changes, and the stats will be bumped in a separate patch. The example
config will be updated in a separate patch due to more extensive
re-work.
In a follow-on patch a new tester will be introduced that uses the
MemChecker to implement true sharing.
The TLB-related code is generally architecture dependent and should
live in the arch directory to signify that.
--HG--
rename : src/sim/BaseTLB.py => src/arch/generic/BaseTLB.py
rename : src/sim/tlb.cc => src/arch/generic/tlb.cc
rename : src/sim/tlb.hh => src/arch/generic/tlb.hh
Gcc and clang both provide an attribute that can be used to flag a
function as deprecated at compile time. This changeset adds a gem5
compiler macro for that compiler feature. The macro can be used to
indicate that a legacy API within gem5 has been deprecated and provide
a graceful migration to the new API.
This patch fixes the CompoundFlag constructor, ensuring that it does
not dereference NULL. Doing so has undefined behaviuor, and both clang
and gcc's undefined-behaviour sanitiser was rather unhappy.
The Platform base class contains a pointer to an instance of the
System which is never initialized. This can lead to subtle bugs since
some architecture-specific platform implementations contain their own
system pointer which is normally used. However, if the platform is
accessed through a pointer to its base class, the dangling pointer
will be used instead.
This patch adds a bit of clarification around the assumptions made in
the cache when packets are sent out, and dirty responses are
pending. As part of the change, the marking of an MSHR as in service
is simplified slightly, and comments are added to explain what
assumptions are made.
This patch extends the current address interleaving with basic hashing
support. Instead of directly comparing a number of address bits with a
matching value, it is now possible to use two independent set of
address bits XOR'ed together. This avoids issues where strided address
patterns are heavily biased to a subset of the interleaved ranges.
This patch changes the DRAM channel interleaving default behaviour to
be more representative. The default address mapping (RoRaBaCoCh) moves
the channel bits towards the least significant bits, and uses 128 byte
as the default channel interleaving granularity.
These defaults can be overridden if desired, but should serve as a
sensible starting point for most use-cases.
The method Event::initialized() tests if this != NULL as a part of the
expression that tests if an event is initialized. The only case when
this check could be false is if the method is called on a null
pointer, which is illegal and leads to undefined behavior (such as
eating your pets) according to the C++ standard. Because of this,
modern compilers (specifically, recent versions of clang) warn about
this which we treat as an error. This changeset removes the redundant
check to fix said warning.
This patch changes how the timing CPU deals with processing responses,
always scheduling an event, even if it is for the current tick. This
helps to avoid situations where a new request shows up before a
response is finished in the crossbar, and also is more in line with
any realistic behaviour.
The Float param was not settable on the command line
due to a typo in the class definition in
python/m5/params.py. This corrects the typo and allows
floats to be set on the command line as intended.
While the IsFirstMicroop flag exists it was only occasionally used in the ARM
instructions that gem5 microOps and therefore couldn't be relied on to be correct.
When gem5 is a slave to another simulator and the Python is only used
to initialize the configuration (and not perform actual simulation), a
"debug start" (--debug-start) event will get freed during or immediately
after the initial Python frame's execution rather than remaining in the
event queue. This tricky patch fixes the GC issue causing this.
This patch takes the final step in removing the src and dest fields in
the packet. These fields were rather confusing in that they only
remember a single multiplexing component, and pushed the
responsibility to the bridge and caches to store the fields in a
senderstate, thus effectively creating a stack. With the recent
changes to the crossbar response routing the crossbar is now
responsible without relying on the packet fields. Thus, these
variables are now unused and can be removed.
This patch removes the source field from the ForwardResponseRecord,
but keeps the class as it is part of how the cache identifies
responses to hardware prefetches that are snooped upwards.
This patch aligns how the response routing is done in the RubyPort,
using the SenderState for both memory and I/O accesses. Before this
patch, only the I/O used the SenderState, whereas the memory accesses
relied on the src field in the packet. With this patch we shift to
using SenderState in both cases, thus not relying on the src field any
longer.
This patch removes the need for a source and destination field in the
packet by shifting the onus of the tracking to the crossbar, much like
a real implementation. This change in behaviour also means we no
longer need a SenderState to remember the source/dest when ever we
have multiple crossbars in the system. Thus, the stack that was
created by the SenderState is not needed, and each crossbar locally
tracks the response routing.
The fields in the packet are still left behind as the RubyPort (which
also acts as a crossbar) does routing based on them. In the succeeding
patches the uses of the src and dest field will be removed. Combined,
these patches improve the simulation performance by roughly 2%.
This patch fixes a minor issue in the X86 page table walker where it
ended up sending new request packets to the crossbar before the
response processing was finished (recvTimingResp is directly calling
sendTimingReq). Under certain conditions this caused the crossbar to
see illegal combinations of request/response overlap, in turn causing
problems with a slightly modified crossbar implementation.
This patch tidies up how we create and set the fields of a Request. In
essence it tries to use the constructor where possible (as opposed to
setPhys and setVirt), thus avoiding spreading the information across a
number of locations. In fact, setPhys is made private as part of this
patch, and a number of places where we callede setVirt instead uses
the appropriate constructor.
The ppCommit should notify the attached listener every time the cpu commits
a microop or non microcoded insturction. The listener can then decide
whether it will process only the last microop (eg. SimPoint probe).
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
This patch fixes a bug where the DRAM controller tried to access the
system cacheline size before the system pointer was initialised. It
also fixes a bug where the granularity is 0 (no interleaving).
This patch corrects the FXSAVE and FXRSTOR Macroops. The actual code used for
saving/restore the FP registers is in the file but it was not used.
The FXSAVE and FXRSTOR instructions are used in the kernel for saving and
loading the state of the mmx,xmm and fpu registers.
This operation is triggered in FS by issuing a Device Not Available Fault. The
cr0 register has a TS flag that is set upon each context change. Every time a
task access any FP related register (SIMD as well) if the TS flag is set to
one, the device not available fault is issued. The kernel saves the current
state of the registers, and restore the previous state of the currently running
task.
Right now Gem5 lacks of this capability. the Device Not Available Fault is
never issued, leading to several problems when different threads share the same
CPU and SMT is not used. The PARSEC Ferret benchmark is an example of this
behavior.
In order to test this a hack in the atomic cpu code was done to detect if a
static instruction has any FP operands and the cr0 reg TS bit is set. This
check must be done in the ISA dependent code. But it seems to be tricky to
access the cr0 register while executing an instruction.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
This change includes edits to Intel8254Timer to prevent counter events firing
before startup to comply with SimObject initialization call sequence.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
If two bitfields are of the same type, also implying that they have the same
first and last bit positions, the existing implementation would copy the
entire bitfield. That includes the __data member which is shared among all the
bitfields, effectively overwritting the entire bitunion.
This change also adjusts the write only signed bitfield assignment operator to
be like the unsigned version, using "using" instead of implementing it again
and calling down to the underlying implementation.
That change enables CPUID bits for features that aren't implemented in gem5.
If a simulated system tries to use those features because it was told it
could, bad things can happen.
Minor was reporting the data cache access as ".inst" accesses.
This just switches the MasterPortID to dataMasterPortId.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
added ARM aarch64 unlinkat syscall support, modeled on other <xxx>at syscalls.
This gets all of the cpu2006 int workloads passing in SE mode on aarch64.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
This patch implements the simd128 ADDSUBPD instruction for the x86 architecture.
Tested with a simple program in assembly language which executes the
instruction. Checked that different versions of the instruction are executed
by using the execution tracing option.
Committed by: Nilay Vaish <nilay@cs.wisc.edu
This change includes edits to MC146818 timer to prevent RTC events
firing before startup to comply with SimObject initialization call sequence.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
According to Linux man pages, if writev is successful, it returns the total
number of bytes written. Otherwise, it returns an error code. Instead of
returning 0, return the result from the actual call to writev in the system
call.
The cache's MemSidePacketQueue schedules a sendEvent based upon
nextMSHRReadyTime() which is the time when the next MSHR is ready or whenever
a future prefetch is ready. However, a prefetch being ready does not guarentee
that it can obtain an MSHR. So, when all MSHRs are full,
the simulation ends up unnecessiciarly scheduling a sendEvent every picosecond
until an MSHR is finally freed and the prefetch can happen.
This patch fixes this by not signaling the prefetch ready time if the prefetch
could not be generated. The event is rescheduled as soon as a MSHR becomes
available.
Previously the code commented about an unhandled case where it might be
possible for a writeback to arrive after a prefetch was generated but
before it was sent to the memory system. I hit that case. Luckily
the prefetchSquash() logic already in the code handles dropping prefetch
request in certian circumstances.
Re-organizes the prefetcher class structure. Previously the
BasePrefetcher forced multiple assumptions on the prefetchers that
inherited from it. This patch makes the BasePrefetcher class truly
representative of base functionality. For example, the base class no
longer enforces FIFO order. Instead, prefetchers with FIFO requests
(like the existing stride and tagged prefetchers) now inherit from a
new QueuedPrefetcher base class.
Finally, the stride-based prefetcher now assumes a custimizable lookup table
(sets/ways) rather than the previous fully associative structure.
Adds a new parameter that reserves some number of MSHR entries for demand
accesses. This helps prevent prefetchers from taking all MSHRs, forcing demand
requests from the CPU to stall.
This patch adds table walker stats for:
- Walk events
- Instruction vs Data
- Page size histogram
- Wait time and service time histograms
- Pending requests histogram (per cycle) - measures dist. of L
(p(1..) = how often busy, p(0) = how often idle)
- Squashes, before starting and after completion
This patch gives the user direct influence over the number of DRAM
ranks to make it easier to tune the memory density without affecting
the bandwidth (previously the only means of scaling the device count
was through the number of channels).
The patch also adds some basic sanity checks to ensure that the number
of ranks is a power of two (since we rely on bit slices in the address
decoding).
This patch addresses an issue seen with the KVM CPU where the refresh
events scheduled by the DRAM controller forces the simulator to switch
out of the KVM mode, thus killing performance.
The current patch works around the fact that we currently have no
proper API to inform a SimObject of the mode switches. Instead we rely
on drainResume being called after any switch, and cache the previous
mode locally to be able to decide on appropriate actions.
The switcheroo regression require a minor stats bump as a result.
This patch adds rank-wise refresh to the controller, as opposed to the
channel-wide refresh currently in place. In essence each rank can be
refreshed independently, and for this to be possible the controller
is extended with a state machine per rank.
Without this patch the data bus is always idle during a refresh, as
all the ranks are refreshing at the same time. With the rank-wise
refresh it is possible to use one rank while another one is
refreshing, and thus the data bus can be kept busy.
The patch introduces a Rank class to encapsulate the state per rank,
and also shifts all the relevant banks, activation tracking etc to the
rank. The arbitration is also updated to consider the state of the rank.
This patch adds a stand-alone stack distance calculator. The stack
distance calculator is a passive SimObject that observes the addresses
passed to it. It calculates stack distances (LRU Distances) of
incoming addresses based on the partial sum hierarchy tree algorithm
described by Alamasi et al. http://doi.acm.org/10.1145/773039.773043.
For each transaction a hashtable look-up is performed. At every
non-unique transaction the tree is traversed from the leaf at the
returned index to the root, the old node is deleted from the tree, and
the sums (to the right) are collected and decremented. The collected
sum represets the stack distance of the found node. At every unique
transaction the stack distance is returned as
numeric_limits<uint64>::max().
In addition to the basic stack distance calculation, a feature to mark
an old node in the tree is added. This is useful if it is required to
see the reuse pattern. For example, Writebacks to the lower level
(e.g. membus from L2), can be marked instead of being removed from the
stack (isMarked flag of Node set to True). And then later if this same
address is accessed (by L1), the value of the isMarked flag would be
True. This gives some insight on how the Writeback policy of the
lower level affect the read/write accesses in an application.
Debugging is enabled by setting the verify flag to true. Debugging is
implemented using a dummy stack that behaves in a naive way, using STL
vectors. Note that this has a large impact on run time.
This patch adds the MemChecker and MemCheckerMonitor classes. While
MemChecker can be integrated anywhere in the system and is independent,
the most convenient usage is through the MemCheckerMonitor -- this
however, puts limitations on where the MemChecker is able to observe
read/write transactions.
We currently don't handle unaligned PCs correctly. There is one check
for unaligned PCs in the TLB when running in aarch64 mode, but this
check does not cover cases where the CPU does not do a TLB lookup when
decoding an instruction (e.g., a branch stays within the same cache
line). Additionally, the Decoder class sometimes throws an assertion
for unaligned PCs which breaks speculation.
This changeset introduces a decoder fault bit field in the ExtMachInst
structure. This field can be used to signal a decoder failure. If set,
the decoder generates an internal gem5fault instruction instead of a
normal instruction. This instruction in turns either panics (fault
type PANIC), returns an PCAlignmentFault (fault type UNALIGNED,
aarch64) or PrefetchAbort (fault type UNALIGNED, aarch32).
The patch causes minor changes to the realview64 regressions, and a
stats bump will follow.
This patch adds support for filtering events in the PMU. In order to
do so, it updates the ISADevice base class to forward an ISA pointer
to ISA devices. This enables such devices to access the MiscReg file
to determine the current execution level.
The aarch64 system register decoder is currently not decoding
PMXEVTYPER_EL0 and PMCCFILTR_EL0 correctly. This changeset updates the
decoder so that they are decoded using the values in table C5-6 in ARM
DDI 0478A.c.
Add an assert in the PioPort that checks if a response packet from a
device has the right flags set before passing it to them rest of the
memory system.
The new single stepping implementation for x86 doesn't rely on any ISA
specific properties or functionality. This change pulls out the per ISA
implementation of those functions and promotes the X86 implementation to the
base class.
One drawback of that implementation is that the CPU might stop on an
instruction twice if it's affected by both breakpoints and single stepping.
While that might be a little surprising, it's harmless and would only happen
under somewhat unlikely circumstances.
This stub should allow remote debugging of 32 bit and 64 bit targets. Single
stepping seems to work, as do breakpoints. If both breakpoints and single
stepping affect an instruction, gdb will stop at the instruction twice before
continuing. That's a little surprising, but is generally harmless.
Only the instruction address is actually checked, so there's no need to check
repeatedly while we're working through the microops of a macroop and that's
not changing.
Not all ISAs have 64 bit sized registers, so it's not always very convenient
to access the GDB register cache in 64 bit sized chunks. This change makes it
accessible in 8, 16, 32, or 64 bit chunks. The MIPS and ARM implementations
were working around that limitation by bundling and unbundling 32 bit values
into 64 bit values. That code has been removed.
Instead of counting the number of opcode bytes in an instruction and recording
each byte before the actual opcode, we can represent the path we took to get to
the actual opcode byte by using a type code. That has a couple of advantages.
First, we can disambiguate the properties of opcodes of the same length which
have different properties. Second, it reduces the amount of data stored in an
ExtMachInst, making them slightly easier/faster to create and process. This
also adds some flexibility as far as how different types of opcodes are
handled, which might come in handy if we decide to support VEX or XOP
instructions.
This change also adds tables to support properly decoding 3 byte opcodes.
Before we would fall off the end of some arrays, on top of the ambiguity
described above.
This change doesn't measureably affect performance on the twolf benchmark.
--HG--
rename : src/arch/x86/isa/decoder/three_byte_opcodes.isa => src/arch/x86/isa/decoder/three_byte_0f38_opcodes.isa
rename : src/arch/x86/isa/decoder/three_byte_opcodes.isa => src/arch/x86/isa/decoder/three_byte_0f3a_opcodes.isa
The values in a "bitfield" or in an ExtMachInst structure member may not be a
literal value, it might select from an arbitrary collection of options. Instead
of using the raw value of those constants in the decoder, it's easier to tell
what's going on if they can be referred to as a symbolic constant/enum.
To support that, the ISA description language is extended slightly so that in
addition to integer literals, the case value for decode blobs can also be a
string literal. It's up to the ISA author to ensure that the string evaluates
to a legal constant value when interpretted as C++.
The check which makes sure the length of the breakpoint being written is the
same as a MachInst is only correct on fixed instruction width ISAs. Instead of
incorrectly applying that check to all ISAs, this change makes that the
default check and lets ISA specific GDB classes override it.
This command is supposed to set up a timer which will put the drive into a
standby mode if it isn't sent a command within a given time out. Since most of
the timeouts are generally significantly longer than a simulation would run
anyway, and we don't have an implementation for standby mode to begin with,
we can accept the command, do nothing, and report success.
This patch adds sorting based on the SimObject name or parameter name
for all situations where we iterate over dictionaries. This should
ensure a deterministic and consistent order across the host systems
and hopefully avoid regression results differing across python
versions.
This patch takes a clean-slate approach to providing WriteInvalidate
(write streaming, full cache line writes without first reading)
support.
Unlike the prior attempt, which took an aggressive approach of directly
writing into the cache before handling the coherence actions, this
approach follows the existing cache flows as closely as possible.
This patch fixes a case where a store in Minor's store buffer never
leaves the store buffer as it is pre-maturely counted as having been
issued, leading to the store buffer idling.
LSQ::StoreBuffer::numUnissuedAccesses should count the number of accesses
either in memory, or still in the store buffer after being completed.
For stores which are also barriers, the store will stay in the store
buffer for a cycle after it is completed and will be cleaned up by the
barrier clearing code (to ensure that barriers are completed in-order).
To acheive this, numUnissuedAccesses is not decremented when a store-barrier
is issued to memory, but when its barrier effect is cleared.
Without this patch, the correct behaviour happens when a memory transaction
is immediately accepted, but not if it needs a retry.
This patch fixes a case where the Minor CPU can deadlock due to the lack
of a response to TLB request because of a bug in fault handling in the ARM
table walker.
TableWalker::processWalkWrapper is the scheduler-called wrapper which
handles deferred walks which calls to TableWalker::wait cannot immediately
process. The handling of faults generated by processWalk{AArch64,LPAE,}
calls in those two functions is is different. processWalkWrapper ignores
fault returns from processWalk... which can lead to ::finish not being
called on a translation.
This fix provides fault handling in processWalkWrapper similar to that
found in the leaf functions which BaseTLB::Translation::finish.
In case the memory subsystem sends a combined response with invalidate
(e.g. ReadRespWithInvalidate), we cannot ignore the invalidate part
of the response.
If we were to ignore the invalidate part, under certain circumstances
this effectively leads to reordering of loads to the same address
which is not permitted under any memory consistency model implemented
in gem5.
Consider the case where a later load's address is computed before an
earlier load in program order, and is therefore sent to the memory
subsystem first. At some point the earlier load's address is computed
and in doing so correctly marks the later load as a
possibleLoadViolation. In the meantime some other node writes and
sends invalidations to all other nodes. The invalidation races with
the later load's ReadResp, and arrives before ReadResp and is
deferred. Upon receipt of the ReadResp, the response is changed to
ReadRespWithInvalidate, and sent to the CPU. If we ignore the
invalidate part of the packet, we let the later load read the old
value of the address. Eventually the earlier load's ReadResp arrives,
but with new data. As there was no invalidate snoop (sunk into the
ReadRespWithInvalidate), and if we did not process the invalidate of
the ReadRespWithInvalidate, we obtain a load reordering.
A similar scenario can be constructed where the earlier load's address
is computed after ReadRespWithInvalidate arrives for the younger
load. In this case hitExternalSnoop needs to be set to true on the
ReadRespWithInvalidate, so that upon knowing the address of the
earlier load, checkViolations will cause the later load to be
squashed.
Finally we must account for the case where both loads are sent to the
memory subsystem (reordered), a snoop invalidate arrives and correctly
sets the later loads fault to ReExec. However, before the CPU
processes the fault, the later load's ReadResp arrives and the
writeback discards the outstanding fault. We must add a check to
ensure that we do not skip any unprocessed faults.
Move the packet deallocations in the O3 CPU so that the completeDataAccess
deals only with the LSQ specific parts and the generic recvTimingResp frees the
packet in all other cases.
This patch allows objects to get the src/dest of a packet even if it
is not set to a valid port id. This simplifies (ab)using the bridge as
a buffer and latency adapter in situations where the neighbouring
MemObjects are not crossbars.
The checks that were done in the packet are now shifted to the
crossbar where the fields are used to index into the port
arrays. Thus, the carrier of the information is not burdened with
checking, and the crossbar can check not only that the destination is
set, but also that the port index is within limits.
This patch attempts to make the rules for data allocation in the
packet explicit, understandable, and easy to verify. The constructor
that copies a packet is extended with an additional flag "alloc_data"
to enable the call site to explicitly say whether the newly created
packet is short-lived (a zero-time snoop), or has an unknown life-time
and therefore should allocate its own data (or copy a static pointer
in the case of static data).
The tricky case is the static data. In essence this is a
copy-avoidance scheme where the original source of the request (DMA,
CPU etc) does not ask the memory system to return data as part of the
packet, but instead provides a pointer, and then the memory system
carries this pointer around, and copies the appropriate data to the
location itself. Thus any derived packet actually never copies any
data. As the original source does not copy any data from the response
packet when arriving back at the source, we must maintain the copy of
the original pointer to not break the system. We might want to revisit
this one day and pay the price for a few extra memcpy invocations.
All in all this patch should make it easier to grok what is going on
in the memory system and how data is actually copied (or not).
This patch cleans up the use of hasData and checkFunctional in the
packet. The hasData function is unfortunately suggesting that it
checks if the packet has a valid data pointer, when it does in fact
only check if the specific packet type is specified to have a data
payload. The confusion led to a bug in checkFunctional. The latter
function is also tidied up to avoid name overloading.
This adds a basic level of sanity checking to the packet by ensuring
that a request is not modified once the packet is created. The only
issue that had to be worked around is the relaying of
software-prefetches in the cache. The specific situation is now solved
by first copying the request, and then creating a new packet
accordingly.
This patch tidies up the Request class, making all getters const. The
odd one out is incAccessDepth which is called by the memory system as
packets carry the request around. This is also const to enable the
packet to hold on to a const Request.
This patch simplifies how we deal with dynamically allocated data in
the packet, always assuming that it is array allocated, and hence
should be array deallocated (delete[] as opposed to delete). The only
uses of dataDynamic was in the Ruby testers.
The ARRAY_DATA flag in the packet is removed accordingly. No
defragmentation of the flags is done at this point, leaving a gap in
the bit masks.
As the last part the patch, it renames dataDynamicArray to dataDynamic.
This patch cleans up the packet memory allocation confusion. The data
is always allocated at the requesting side, when a packet is created
(or copied), and there is never a need for any device to allocate any
space if it is merely responding to a paket. This behaviour is in line
with how SystemC and TLM works as well, thus increasing
interoperability, and matching established conventions.
The redundant calls to Packet::allocate are removed, and the checks in
the function are tightened up to make sure data is only ever allocated
once. There are still some oddities in the packet copy constructor
where we copy the data pointer if it is static (without ownership),
and allocate new space if the data is dynamic (with ownership). The
latter is being worked on further in a follow-on patch.
This patch changes the various write functions in the port proxies
to use const pointers for all sources (similar to how memcpy works).
The one unfortunate aspect is the need for a const_cast in the packet,
to avoid having to juggle a const and a non-const data pointer. This
design decision can always be re-evaluated at a later stage.
This patch takes a first step in tightening up how we use the data
pointer in write packets. A const getter is added for the pointer
itself (getConstPtr), and a number of member functions are also made
const accordingly. In a range of places throughout the memory system
the new member is used.
The patch also removes the unused isReadWrite function.
This patch removes the parameter that enables bypassing the null check
in the Packet::getPtr method. A number of call sites assume the value
to be non-null.
The one odd case is the RubyTester, which issues zero-sized
prefetches(!), and despite being reads they had no valid data
pointer. This is now fixed, but the size oddity remains (unless anyone
object or has any good suggestions).
Finally, in the Ruby Sequencer, appropriate checks are made for flush
packets as they have no valid data pointer.
This patch adds a first cut GDDR5 config to accommodate the users
combining gem5 and GPUSim. The config is based on a SK Hynix
datasheet, and the Nvidia GTX580 specification. Someone from the
GPUSim user-camp should tweak the default page-policy and static
frontend and backend latencies.
This patch adds uncacheable/cacheable and read-only/read-write attributes to
the map method of PageTableBase. It also modifies the constructor of TlbEntry
structs for all architectures to consider the new attributes.
This patch sets up low and high privilege code and data segments and places them
in the following order: cs low, ds low, ds, cs, in the GDT. Additionally, a
syscall and page fault handler for KvmCPU in SE mode are defined. The order of
the segment selectors in GDT is required in this manner for interrupt handling
to work properly. Segment initialization is done for all the thread
contexts.
This patch adds methods in KvmCPU model to handle KVM exits caused by syscall
instructions and page faults. These types of exits will be encountered if
KvmCPU is run in SE mode.
There was already a stub device at 0x80, the port traditionally used for an IO
delay. 0x80 is also the port used for POST codes sent by firmware, and that
may have prompted adding this port as a second option.
The data size used for actually writing the base value for the segment was the
default size, but really it should set the entire value without any possible
truncation.
The far pointer should be shifted right to get the selector value, not left.
Also, when calculating the width of the offset, the wrong register was used in
one spot.
The getRegArrayBit function extracts a bit from a series of registers which
are treated as a single large bit array. A previous change had modified the
logic which figured out which bit to extract from ">> 5" to "% 5" which seems
wrong, especially when other, similar functions were changed to use "% 32".
The value in EAX has an 8 bit field for the linear address size and one for
the physical address size when calling that function. A recent change
implemented it but returned 0xff for both of those fields. That implies that
linear and physical addresses are 255 bits wide which is wrong. When using the
KVM CPU model this causes an error, presumably because some of those bits are
actually reserved, or the CPU or kernel realizes 255 bits is a bad value.
This change makes those values 48.
Another churn to clean up undefined behaviour, mostly ARM, but some
parts also touching the generic part of the code base.
Most of the fixes are simply ensuring that proper intialisation. One
of the more subtle changes is the return type of the sign-extension,
which is changed to uint64_t. This is to avoid shifting negative
values (undefined behaviour) in the ISA code.
This patch reverts changeset 9277177eccff which does not do what it
was intended to do. In essence, we go back to implementing mkutctime
much like the non-standard timegm extension.
Mwait works as follows:
1. A cpu monitors an address of interest (monitor instruction)
2. A cpu calls mwait - this loads the cache line into that cpu's cache.
3. The cpu goes to sleep.
4. When another processor requests write permission for the line, it is
evicted from the sleeping cpu's cache. This eviction is forwarded to the
sleeping cpu, which then wakes up.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Fixes a bug where Minor drains in the midst of committing a
conditional store.
While committing a conditional store, lastCommitWasEndOfMacroop is true
(from the previous instruction) as we still haven't finished the conditional
store. If a drain occurs before the cache response, Minor would check just
lastCommitWasEndOfMacroop, which was true, and set drainState=DrainHaltFetch,
which increases the streamSeqNum. This caused the conditional store to be
squashed when the memory responded and it completed. However, to the memory
the store succeeded, while to the instruction sequence it never occurred.
In the case of an LLSC, the instruction sequence will replay the squashed
STREX, which will fail as the cache is no longer in LLSC. Then the
instruction sequence will loop back to a LDREX, which receives the updated
(incorrect) value.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Ruby's functional accesses are not guaranteed to succeed as of now. While
this is not a problem for the protocols that are currently in the mainline
repo, it seems that coherence protocols for gpus rely on a backing store to
supply the correct data. The aim of this patch is to make this backing store
configurable i.e. it comes into play only when a particular option:
--access-backing-store is invoked.
The backing store has been there since M5 and GEMS were integrated. The only
difference is that earlier the system used to maintain the backing store and
ruby's copy was write-only. Sometime last year, we moved to data being
supplied supplied by ruby in SE mode simulations. And now we have patches on
the reviewboard, which remove ruby's copy of memory altogether and rely
completely on the system's memory to supply data. This patch adds back a
SimpleMemory member to RubySystem. This member is used only if the option:
access-backing-store is set to true. By default, the memory would not be
accessed.
This patch is the final in the series. The whole series and this patch in
particular were written with the aim of interfacing ruby's directory controller
with the memory controller in the classic memory system. This is being done
since ruby's memory controller has not being kept up to date with the changes
going on in DRAMs. Classic's memory controller is more up to date and
supports multiple different types of DRAM. This also brings classic and
ruby ever more close. The patch also changes ruby's memory controller to
expose the same interface.
This function was added when I had incorrectly arrived at the conclusion
that such a function can improve the chances of a functional read succeeding.
As was later realized, this is not possible in the current setup. While the
code using this function was dropped long back, this function was not. Hence
the patch.
This patch removes the data block present in the directory entry structure
of each protocol in gem5's mainline. Firstly, this is required for moving
towards common set of memory controllers for classic and ruby memory systems.
Secondly, the data block was being misused in several places. It was being
used for having free access to the physical memory instead of calling on the
memory controller.
From now on, the directory controller will not have a direct visibility into
the physical memory. The Memory Vector object now resides in the
Memory Controller class. This also means that some significant changes are
being made to the functional accesses in ruby.
In my opinion, it creates needless complications in rest of the code.
Also, this structure hinders the move towards common set of code for
physical memory controllers.
Both ruby and the system used to maintain memory copies. With the changes
carried for programmed io accesses, only one single memory is required for
fs simulations. This patch sets the copy of memory that used to reside
with the system to null, so that no space is allocated, but address checks
can still be carried out. All the memory accesses now source and sink values
to the memory maintained by ruby.
As of now DMASequencer inherits from the RubyPort class. But the code in
RubyPort class is heavily tailored for the CPU Sequencer. There are parts of
the code that are not required at all for the DMA sequencer. Moreover, the
next patch uses the dma sequencer for carrying out memory accesses for all the
io devices. Hence, it is better to have a leaner dma sequencer.
This changes the default ARM system to a Versatile Express-like system that supports
2GB of memory and PCI devices and updates the default kernels/file-systems for
AArch64 ARM systems (64-bit) to support up to 32GB of memory and PCI devices. Some
platforms that are no longer supported have been pruned from the configuration files.
In addition a set of 64-bit ARM regressions have been added to the regression system.
It is possible for the O3 CPU to consider itself drained and
later have a squashed instruction perform a writeback. This
patch re-adds tracking of in-flight instructions to prevent
falsely signaling a drained event.
IEW did not check the instQueue and memDepUnit to ensure
they were drained. This caused issues when drainSanityCheck()
did check those structures after asserting IEW was drained.
The checker can't verify timer registers, so it should just grab the version
from the executing CPU, otherwise it could get a larger value and diverge
execution.
This patch fixes a bug where a completing load or store which is also a
barrier can push a barrier into the store buffer without first checking
that there is a free slot.
The bug was not fatal but would print a warning that the store buffer
was full when inserting.
WriteInvalidate semantics depend on the unconditional writeback
or they won't complete. Also, there's no point in deferring snoops
on their MSHRs, as they don't get new data at the end of their life
cycle the way other transactions do.
Add comment in the cache about a minor inefficiency re: WriteInvalidate.
Since WriteInvalidate directly writes into the cache, it can
create tricky timing interleavings with reads and writes to the
same cache line that haven't yet completed. This patch ensures
that these requests, when completed, don't overwrite the newer
data from the WriteInvalidate.
This hook allows blocking emulated system calls to indicate
that they would block, but return control to the simulator
so that the simulation does not hang. The actual retry
functionality requires additional support, to be provided
in a future changeset.
Move the BufferArg classes that support syscall buffer args
(i.e., pointers into simulated user space) out of syscall_emul.hh
and into a new header syscall_emul_buf.hh so they are accessible
to emulated driver implementations.
Take the opportunity to add some comments as well.
The identifier SYS_getdents is not available on Mac OS X. Therefore, its use
results in compilation failure. It seems there is no straight forward way to
implement the system call getdents using readdir() or similar C functions.
Hence the commit 6709bbcf564d is being rolled back.
This patch fixes a few minor issues that caused link-time warnings
when using LTO, mainly for x86. The most important change is how the
syscall array is created. Previously gcc and clang would complain that
the declaration and definition types did not match. The organisation
is now changed to match how it is done for ARM, moving the code that
was previously in syscalls.cc into process.cc, and having a class
variable pointing to the static array.
With these changes, there are no longer any warnings using gcc 4.6.3
with LTO.
This patch changes how we turn time into UTC. Previously we
manipulated the TZ environment variable, but this has issues as the
strings that are manipulated could be tainted (see e.g. CERT
ENV34-C). Now we simply rely on the built-in gmtime function and avoid
touching getenv/setenv all together.
This patch adds the size of the DRAM device to the DRAM config. It
also compares the actual DRAM size (calculated using information from
the config) to the size defined in the system. If these two values do
not match gem5 will print a warning. In order to do correct DRAM
research the size of the memory defined in the system should match the
size of the DRAM in the config. The timing and current parameters
found in the DRAM configs are defined for a DRAM device with a
specific size and would differ for another device with a different
size.
Presently, the alignment checks in the mmap and mremap implementations
in syscall_emul.hh are wrong. The checks are implemented as:
if ((start % TheISA::PageBytes) != 0 ||
(length % TheISA::PageBytes) != 0) {
warn("mmap failing: arguments not page-aligned: "
"start 0x%x length 0x%x",
start, length);
return -EINVAL;
}
This checks that both the start and the length arguments of the mmap
syscall are checked for page-alignment. However, the POSIX specification says:
The off argument is constrained to be aligned and sized according to the value
returned by sysconf() when passed _SC_PAGESIZE or _SC_PAGE_SIZE. When MAP_FIXED
is specified, the application shall ensure that the argument addr also meets
these constraints. The implementation performs mapping operations over whole
pages. Thus, while the argument len need not meet a size or alignment
constraint, the implementation shall include, in any mapping operation, any
partial page specified by the range [pa,pa+len).
So the length parameter should not be checked for page-alignment. By contrast,
the current implementation fails to check the offset argument, which must be
page aligned.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Change mmap fixed address request to return an error if the mapping is
impossible due to conflict instead of what I believe used to be silent
corruption.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
On exit_group syscall, we used to exit the simulator. But now we will only
halt the execution of threads that belong to the group.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Sysfs on ubuntu scrapes the entire PCI config space
when it discovers a device using 4 byte accesses.
This was not supported by our devices, in particular the NIC
that implemented the extended PCI config space. This change
allows the extended PCI config space to be accessed by
sysfs properly.
This patch adds two MemoryObject's: ExternalMaster and ExternalSlave.
Each object has a single port which can be bound to an externally-
provided bridge to a port of another simulation system at
initialisation.
This patch adds a 'wakeup' member function to EventQueue which should be
called on an event queue whenever an event is scheduled on the event queue
from outside code within the call tree of the gem5 event loop.
This clearly isn't necessary for normal gem5 EventQueue operation but
becomes the minimum necessary interface to allow hosting gem5's event loop
onto other schedulers where there may be calls into gem5 from external
code which schedules events onto an EventQueue between the current time and
the time of the next scheduled event.
The use case I have in mind is a SystemC hosting where the event loop is:
while (more events) {
wait(time_to_next_event or wakeup)
setCurTick
service events at this time
}
where the 'wait' needs to be woken up if time_to_next_event becomes shorter
due to a scheduled event from SystemC arriving in a gem5 object.
Requiring 'wakeup' to be called is a more efficient interface than
requiring all gem5 event scheduling actions to affect the host scheduler.
This interface could be located elsewhere, say on another global object,
or by being passed by the host scheduler to objects which will schedule
such events, but it seems cleanest to put it on EventQueue as it is
actually a signal to the queue.
EventQueue::wakeup is called for async_event events on event queue 0 as
it's only important that *some* queue be triggered for such events.
This patch adds a Logger class encapsulating dprintf. This allows
variants of DPRINTF logging to be constructed and substituted in
place of the default behaviour.
The Logger provides a logMessage(when, name, format, ...) member
function like Trace::dprintf and a getOstream member function to
use a raw ostream for logging.
A class OstreamLogger is provided which generates the customary
debugging output with Trace::OstreamLogger::logMessage being the
old Trace::dprintf.
This patch takes quite a large step in transitioning from the ad-hoc
RefCountingPtr to the c++11 shared_ptr by adopting its use for all
Faults. There are no changes in behaviour, and the code modifications
are mostly just replacing "new" with "make_shared".
This patch transitions the o3 MemDepEntry from the ad-hoc
RefCountingPtr to the c++11 shared_ptr. There are no changes in
behaviour, and the code modifications are mainly replacing "new" with
"make_shared".
This patch transitions the Ruby Message and its derived classes from
the ad-hoc RefCountingPtr to the c++11 shared_ptr. There are no
changes in behaviour, and the code modifications are mainly replacing
"new" with "make_shared".
The cloning of derived messages is slightly changed as they previously
relied on overriding the base-class through covariant return types.
This patch transitions the stat Node and its derived classes from
the ad-hoc RefCountingPtr to the c++11 shared_ptr. There are no
changes in behaviour, and the code modifications are mainly replacing
"new" with "make_shared".
This patch transitions the EthPacketData from the ad-hoc
RefCountingPtr to the c++11 shared_ptr. There are no changes in
behaviour, and the code modifications are mainly replacing "new" with
"make_shared".
The bool casting operator for the shared_ptr is explicit, and we must
therefore either cast it, compare it to NULL (p != nullptr), double
negate it (!!p) or do a (p ? true : false).
This patch takes a first few steps in transitioning from the ad-hoc
RefCountingPtr to the c++11 shared_ptr. There are no changes in
behaviour, and the code modifications are mainly introducing the
use of make_shared.
Note that the class could use unique_ptr rather than shared_ptr, was
it not for the postfix increment and decrement operators.
This patch makes the memory system ISA-agnostic by enabling the Ruby
Sequencer to dynamically determine if it has to do a store check. To
enable this check, the ISA is encoded as an enum, and the system
is able to provide the ISA to the Sequencer at run time.
--HG--
rename : src/arch/x86/insts/microldstop.hh => src/arch/x86/ldstflags.hh
This patch takes a step towards an ISA-agnostic memory
system by enabling the components to establish the page size after
instantiation. The swap operation in the memory is now also allowing
any granularity to avoid depending on the IntReg of the ISA.
This changeset adds probe points that can be used to implement PMU
counters for CPU stats. The following probes are supported:
* BaseCPU::ppCycles / Cycles
* BaseCPU::ppRetiredInsts / RetiredInsts
* BaseCPU::ppRetiredLoads / RetiredLoads
* BaseCPU::ppRetiredStores / RetiredStores
* BaseCPU::ppRetiredBranches RetiredBranches
This changeset adds probe points that can be used to implement PMU
counters for TLB stats. The following probes are supported:
* ArmISA::TLB::ppRefills / TLB Refills (TLB insertions)
This changeset adds probe points that can be used to implement PMU
counters for branch predictor stats. The following probes are
supported:
* BPRedUnit::ppBranches / Branches
* BPRedUnit::ppMisses / Misses
This class implements a subset of the ARM PMU v3 specification as
described in the ARMv8 reference manual. It supports most of the
features of the PMU, however the following features are known to be
missing:
* Event filtering (e.g., from different privilege levels).
* Access controls (the PMU currently ignores the execution level).
* The chain counter (event no. 0x1E) is unimplemented.
The PMU itself does not implement any events, it merely provides an
interface for the configuration scripts to hook up probes that drive
events. Configuration scripts should call addEventProbe() to configure
custom events or high-level methods to configure architected
events. The Python implementation of addEventProbe() automatically
delays event type registration until after instantiation.
In order to support CPU switching and some combined counters (e.g.,
memory references synthesized from loads and stores), the PMU allows
multiple probes per event type. When creating a system that switches
between CPU models that share the same PMU, PMU events for all of the
CPU models can be registered with the PMU.
Kudos to Matt Horsnell for the initial gem5 implementation of the PMU.
In order to show make PMU probe points usable across different PMU
implementations, we want a common probe interface. This patch the
namespace ProbePoins that contains typedefs for probe points that are
shared between multiple SimObjects. It also adds typedefs for the PMU
probe interface.
BitUnion instances can normally not be used with the SERIALIZE_SCALAR
and UNSERIALIZE_SCALAR macros due to the way they are converted
between their storage type and their actual type. This changeset adds
a set of parm(In|Out) functions specifically for gem5 bit unions to
work around the issue.
This patch adds the ability to load in config.ini files generated from
gem5 into another instance of gem5 built without Python configuration
support. The intended use case is for configuring gem5 when it is a
library embedded in another simulation system.
A parallel config file reader is also provided purely in Python to
demonstrate the approach taken and to provided similar functionality
for as-yet-unknown use models. The Python configuration file reader
can read both .ini and .json files.
C++ configuration file reading:
A command line option has been added for scons to enable C++ configuration
file reading: --with-cxx-config
There is an example in util/cxx_config that shows C++ configuration in action.
util/cxx_config/README explains how to build the example.
Configuration is achieved by the object CxxConfigManager. It handles
reading object descriptions from a CxxConfigFileBase object which
wraps a config file reader. The wrapper class CxxIniFile is provided
which wraps an IniFile for reading .ini files. Reading .json files
from C++ would be possible with a similar wrapper and a JSON parser.
After reading object descriptions, CxxConfigManager creates
SimObjectParam-derived objects from the classes in the (generated with this
patch) directory build/ARCH/cxx_config
CxxConfigManager can then build SimObjects from those SimObjectParams (in an
order dictated by the SimObject-value parameters on other objects) and bind
ports of the produced SimObjects.
A minimal set of instantiate-replacing member functions are provided by
CxxConfigManager and few of the member functions of SimObject (such as drain)
are extended onto CxxConfigManager.
Python configuration file reading (configs/example/read_config.py):
A Python version of the reader is also supplied with a similar interface to
CxxConfigFileBase (In Python: ConfigFile) to config file readers.
The Python config file reading will handle both .ini and .json files.
The object construction strategy is slightly different in Python from the C++
reader as you need to avoid objects prematurely becoming the children of other
objects when setting parameters.
Port binding also needs to be strictly in the same port-index order as the
original instantiation.
This patch adds the Undefined Behavior Sanitizer (UBSan) for clang and
gcc >= 4.9. Due to the performance impact, the usage is guarded by a
command-line option.
Add the ability to build libgem5 without embedded Python or the
ability to configure with Python.
This is a prelude to a patch to allow config.ini files to be loaded
into libgem5 using only C++ which would make embedding gem5 within
other simulation systems easier.
This adds a few registration interfaces to things which cross
between Python and C++. Namely: stats dumping and SimObject resolving
This patch adds some statistics to garnet that record the activity
of certain structures in the on-chip network. These statistics, in a later
patch, will be used for computing the energy consumed by the on-chip network.
Orion is being dropped from ruby. It would be replaced with DSENT
which has better models. Note that the power / energy numbers reported
after this patch has been applied are not for use.
This patch takes the final step in integrating DRAMPower and adds the
appropriate calls in the DRAM controller to provide the command trace
and extract the power and energy stats. The debug printouts are still
left in place, but will eventually be removed.
At the moment the DRAM power calculation is always on when using the
DRAM controller model. The run-time impact of this addition is around
1.5% when looking at the total host seconds of the regressions. We
deem this a sensible trade-off to avoid the complication of adding an
enable/disable mechanism.
This patch adds a class to wrap DRAMPower Library in gem5.
This class initiates an object of class MemorySpecification
of the DRAMPower Library, passes the parameters from DRAMCtrl.py
to this object and creates an object of drampower library using
the memory specification.
This patch adds missing timing and current parameters to the existing
DRAM configs. These missing timing and current parameters are required
by DRAMPower for the DRAM power calculations. The missing values are
datasheet values of the specified DRAMs, and the appropriate
references are added for the variuos configs.
This patch prunes the DDR3 config that was initially created to match
the default config of DRAMSim2. The config is not complete as it is,
and to avoid having to maintain it, the easiest way forward is to
simply prune it. Going forward we are adding power number etc to the
other configurations.
This patch adds the Python parameter type Current, which is used for
the DRAM power modelling (to start with). With this addition we avoid
implicit unit assumptions.
The Ozone CPU is now very much out of date and completely
non-functional, with no one actively working on restoring it. It is a
source of confusion for new users who attempt to use it before
realizing its current state. RIP
This patch fixes the runtime errors highlighted by the undefined
behaviour sanitizer. In the end there were two issues. First, when
rotating an immediate, we ended up shifting an uint32_t by 32 in some
cases. This case is fixed by checking for a rotation by 0
positions. Second, the Mrc15 and Mcr15 are operating on an IntReg and
a MiscReg, but we used the type RegRegImmOp and passed a MiscRegIndex
as an IntRegIndex. This issue is resolved by introducing a
MiscRegRegImmOp and RegMiscRegImmOp with the appropriate types.
With these fixes there are no runtime errors identified for the full
ARM regressions.
This patch optimises the passing of StaticInstPtr by avoiding copying
the reference-counting pointer. This avoids first incrementing and
then decrementing the reference-counting pointer.
Fix a number few minor issues to please gcc 4.9.1. Removing the
'-fuse-linker-plugin' flag means no libraries are part of the LTO
process, but hopefully this is an acceptable loss, as the flag causes
issues on a lot of systems (only certain combinations of gcc, ld and
ar work).
The call paths for de-scheduling a thread are halt() and suspend(), from
the thread context. There is no call to deallocateContext() in general,
though some CPUs chose to define it. This patch removes the function
from BaseCPU and the cores which do not require it.
activate(), suspend(), and halt() used on thread contexts had an optional
delay parameter. However this parameter was often ignored. Also, when used,
the delay was seemily arbitrarily set to 0 or 1 cycle (no other delays were
ever specified). This patch removes the delay parameter and 'Events'
associated with them across all ISAs and cores. Unused activate logic
is also removed.
This patch changes the name of the Bus classes to XBar to better
reflect the actual timing behaviour. The actual instances in the
config scripts are not renamed, and remain as e.g. iobus or membus.
As part of this renaming, the code has also been clean up slightly,
making use of range-based for loops and tidying up some comments. The
only changes outside the bus/crossbar code is due to the delay
variables in the packet.
--HG--
rename : src/mem/Bus.py => src/mem/XBar.py
rename : src/mem/coherent_bus.cc => src/mem/coherent_xbar.cc
rename : src/mem/coherent_bus.hh => src/mem/coherent_xbar.hh
rename : src/mem/noncoherent_bus.cc => src/mem/noncoherent_xbar.cc
rename : src/mem/noncoherent_bus.hh => src/mem/noncoherent_xbar.hh
rename : src/mem/bus.cc => src/mem/xbar.cc
rename : src/mem/bus.hh => src/mem/xbar.hh
Adds a simple access counter for requests and snoops for the snoop filter and
also classifies hits based on whether a single other holder existed or whether
multiple shares held the line.
This patch adds a simple counter for both total messages and a histogram for
the fan-out of snoop messages. The fan-out describes to how many ports snoops
had to be sent per incoming request / snoop-from-below. Without any
cleverness, this usually means to either all, or all but the requesting port.
Adds two public domain algorithms for determining number of set bits and also
whether a value is a power of two, uses the builtin that is available in GCC
and clang for popcount.
This is a first cut at a simple snoop filter that tracks presence of lines in
the caches "above" it. The snoop filter can be applied at any given cache
hierarchy and will then handle the caches above it appropriately; there is no
need to use this only in the last-level bus.
This design currently has some limitations: missing stats, no notion of clean
evictions (these will not update the underlying snoop filter, because they are
not sent from the evicting cache down), no notion of capacity for the snoop
filter and thus no need for invalidations caused by capacity pressure in the
snoop filter. These are planned to be added on top with future change sets.
There are cases where users might by accident / intention specify less voltage
operating points thatn frequency points. We consider one of these cases
special: giving only a single voltage to a voltage domain effectively renders
it as a static domain. This patch adds additional logic in the auxiliary parts
of the functionality to handle these cases properly (simple driver asking for
N>1 operating levels, we should return the same voltage for all of them) and
adds error checking code in the voltage domain.
This patch provides an Energy Controller device that provides software
(driver) access to a DVFS handler. The device is currently residing in
the dev/arm tree, but there is nothing inherently ARM specific in the
behaviour. It is currently only tested and supported for ARM Linux,
hence the location.
These additions allow easier interoperability with and querying from an
additional controller which will be in a separate patch. Also adding warnings
for changing the enabled state of the handler across checkpoint / resume and
deviating from the state in the configuration.
Contributed-by: Akash Bagdia <akash.bagdia@arm.com>
Added the following parameter to the DRAMCtrl class:
- bank_groups_per_rank
This defaults to 1. For the DDR4 case, the default is overridden to indicate
bank group architecture, with multiple bank groups per rank.
Added the following delays to the DRAMCtrl class:
- tCCD_L : CAS-to-CAS, same bank group delay
- tRRD_L : RAS-to-RAS, same bank group delay
These parameters are only applied when bank group timing is enabled. Bank
group timing is currently enabled only for DDR4 memories.
For all other memories, these delays will default to '0 ns'
In the DRAM controller model, applied the bank group timing to the per bank
parameters actAllowedAt and colAllowedAt.
The actAllowedAt will be updated based on bank group when an ACT is issued.
The colAllowedAt will be updated based on bank group when a RD/WR burst is
issued.
At the moment no modifications are made to the scheduling.
Add the following delay to the DRAM controller:
- tCS : Different rank bus turnaround delay
This will be applied for
1) read-to-read,
2) write-to-write,
3) write-to-read, and
4) read-to-write
command sequences, where the new command accesses a different rank
than the previous burst.
The delay defaults to 2*tCK for each defined memory class. Note that
this does not correspond to one particular timing constraint, but is a
way of modelling all the associated constraints.
The DRAM controller has some minor changes to prioritize commands to
the same rank. This prioritization will only occur when the command
stream is not switching from a read to write or vice versa (in the
case of switching we have a gap in any case).
To prioritize commands to the same rank, the model will determine if there are
any commands queued (same type) to the same rank as the previous command.
This check will ensure that the 'same rank' command will be able to execute
without adding bubbles to the command flow, e.g. any ACT delay requirements
can be done under the hoods, allowing the burst to issue seamlessly.
Add new DRAM_ROTATE mode to traffic generator.
This mode will generate DRAM traffic that rotates across
banks per rank, command types, and ranks per channel
The looping order is illustrated below:
for (ranks per channel)
for (command types)
for (banks per rank)
// Generate DRAM Command Series
This patch also adds the read percentage as an input argument to the
DRAM sweep script. If the simulated read percentage is 0 or 100, the
middle for loop does not generate additional commands. This loop is
used only when the read percentage is set to 50, in which case the
middle loop will toggle between read and write commands.
Modified sweep.py script, which generates DRAM traffic.
Added input arguments and support for new DRAM_ROTATE mode.
The script now has input arguments for:
1) Read percentage
2) Number of ranks
3) Address mapping
4) Traffic generator mode (DRAM or DRAM_ROTATE)
The default values are:
100% reads, 1 rank, RoRaBaCoCh address mapping, and DRAM traffic gen mode
For the DRAM traffic mode, added multi-rank support.
This patch adds support for 9p filesystem proxying over VirtIO. It can
currently operate by connecting to a 9p server over a socket
(VirtIO9PSocket) or by starting the diod 9p server and connecting over
pipe (VirtIO9PDiod).
*WARNING*: Checkpoints are currently not supported for systems with 9p
proxies!
This patch adds support for VirtIO over the PCI bus. It does so by
providing the following new SimObjects:
* VirtIODeviceBase - Abstract base class for VirtIO devices.
* PciVirtIO - VirtIO PCI transport interface.
A VirtIO device is hooked up to the guest system by adding a PciVirtIO
device to the PCI bus and connecting it to a VirtIO device using the
vio parameter.
New VirtIO devices should inherit from VirtIODevice base and
implementing one or more VirtQueues. The VirtQueues are usually
device-specific and all derive from the VirtQueue class. Queues must
be registered with the base class from the constructor since the
device assumes that the number of queues stay constant.
The terminal currently assumes that the transport to the guest always
inherits from the Uart class. This assumption breaks when
implementing, for example, a VirtIO consoles. This patch removes this
assumption by adding pointer to the from the terminal to the uart and
replacing it with a more general callback interface. The Uart, or any
other class using the terminal, class implements an instance of the
callbacks class and registers it with the terminal.
This patch does a bit of housekeeping on the string helper functions
and relies on the C++11 standard library where possible. It also does
away with our custom string hash as an implementation is already part
of the standard library.
There are two primary issues with this code which make it deserving of deletion.
1) GHB is a way to structure a prefetcher, not a definitive type of prefetcher
2) This prefetcher isn't even structured like a GHB prefetcher.
It's basically a worse version of the stride prefetcher.
It primarily serves to confuse new gem5 users and most functionality is already
present in the stride prefetcher.
This patch 'completes' .json config files generation by adding in the
SimObject references and String-valued parameters not currently
printed.
TickParamValues are also changed to print in the same tick-value
format as in .ini files.
This allows .json files to describe a system as fully as the .ini files
currently do.
This patch adds a new function config_value (which mirrors ini_str) to
each ParamValue and to SimObject. This function can then be explicitly
changed to give different .json and .ini printing behaviour rather than
being written in terms of ini_str.
This patch changes how faults are passed between methods in an attempt
to copy as few reference-counting pointer instances as possible. This
should avoid unecessary copies being created, contributing to the
increment/decrement of the reference counters.
The changeset ad9c042dce54 made changes to the structures under the network
directory to use a map of buffers instead of vector of buffers.
The reasoning was that not all vnets that are created are used and we
needlessly allocate more buffers than required and then iterate over them
while processing network messages. But the move to map resulted in a slow
down which was pointed out by Andreas Hansson. This patch moves things
back to using vector of message buffers.
This patch fixes cases where uncacheable/memory type flags are not set
correctly on a memory op which is split in the LSQ. Without this
patch, request->request if freely used to check flags where the flags
should actually come from the accumulation of request fragment flags.
This patch also fixes a bug where an uncacheable access which passes
through tryToSendRequest more than once can increment
LSQ::numAccessesInMemorySystem more than once.
This patch closes a number of space gaps in debug messages caused by
the incorrect use of line continuation within strings. (There's also
one consistency change to a similar, but correct, use of line
continuation)
The ProbeListener base class automatically registers itself with a
probe manager. Currently, the class does not unregister a itself when
it is destroyed, which makes removing probes listeners somewhat
cumbersome. This patch adds an automatic call to
manager->removeListener in the ProbeListener destructor, which solves
the problem.
Parsing vectorparams from the command was slightly broken
in that it wouldn't accept the input that the help message
provided to the user and it didn't do the conversion
on the second code path used to convert the string input
to the actual internal representation. This patch fixes these bugs.
Static analysis revealed that BaseGlobalEvent::barrier was never
deallocated. This changeset solves this leak by making the barrier
allocation a part of the BaseGlobalEvent instead of storing a pointer
to a separate heap-allocated barrier.
Static analysis unearther a bunch of uninitialised variables and
members, and this patch addresses the problem. In all cases these
omissions seem benign in the end, but at least fixing them means less
false positives next time round.
The PC platform has a single IO range that is used both legacy IO and PCI IO
while other platforms may use seperate regions. Provide another mechanism to
configure the legacy IO base address range and set it to the PCI IO address
range for x86.
This change adds support for a generic pci host bus driver that
has been included in recent Linux kernel instead of the more
bespoke one we've been using to date. It also works with
aarch64 so it provides PCI support for 64-bit ARM Linux.
To make this work a new configuration option pci_io_base is added
to the RealView platform that should be set to the start of
the memory used as memory mapped IO ports (IO ports that are
memory mapped, not regular memory mapped IO). And a parameter
pci_cfg_gen_offsets which specifies if the config space
offsets should be used that the generic driver expects.
To use the pci-host-generic device you need to:
pci_io_base = 0x2f000000 (Valid for VExpress EMM)
pci_cfg_gen_offsets = True
and add the following to your device tree:
pci {
compatible = "pci-host-ecam-generic";
device_type = "pci";
#address-cells = <0x3>;
#size-cells = <0x2>;
#interrupt-cells = <0x1>;
//bus-range = <0x0 0x1>;
// CPU_PHYSICAL(2) SIZE(2)
// Note, some DTS blobs only support 1 size
reg = <0x0 0x30000000 0x0 0x10000000>;
// IO (1), no bus address (2), cpu address (2), size (2)
// MMIO (1), at address (2), cpu address (2), size (2)
ranges = <0x01000000 0x0 0x00000000 0x0 0x2f000000 0x0 0x10000>,
<0x02000000 0x0 0x40000000 0x0 0x40000000 0x0 0x10000000>;
// With gem5 we typically use INTA/B/C/D one per device
interrupt-map = <0x0000 0x0 0x0 0x1 0x1 0x0 0x11 0x1
0x0000 0x0 0x0 0x2 0x1 0x0 0x12 0x1
0x0000 0x0 0x0 0x3 0x1 0x0 0x13 0x1
0x0000 0x0 0x0 0x4 0x1 0x0 0x14 0x1>;
// Only match INTA/B/C/D and not BDF
interrupt-map-mask = <0x0000 0x0 0x0 0x7>;
};
The new configuration scripts need the ability to splice
a simobject between a pair of ports that are already connected.
The primary use case is when a CommMonitor needs to be
created after the system is configured and then spliced between
the pair of ports it will monitor.
This patch changes the random number generator from the in-house
Mersenne twister to an implementation relying entirely on C++11 STL.
The format for the checkpointing of the twister is simplified. As the
functionality was never used this should not matter. Note that this
patch does not actually make use of the checkpointing
functionality. As the random number generator is not thread safe, it
may be sensible to create one generator per thread, system, or even
object. Until this is decided the status quo is maintained in that no
generator state is part of the checkpoint.
This patch tidies up random number generation to ensure that it is
done consistently throughout the code base. In essence this involves a
clean-up of Ruby, and some code simplifications in the traffic
generator.
As part of this patch a bunch of skewed distributions (off-by-one etc)
have been fixed.
Note that a single global random number generator is used, and that
the object instantiation order will impact the behaviour (the sequence
of numbers will be unaffected, but if module A calles random before
module B then they would obviously see a different outcome). The
dependency on the instantiation order is true in any case due to the
execution-model of gem5, so we leave it as is. Also note that the
global ranom generator is not thread safe at this point.
Regressions using the memtest, TrafficGen or any Ruby tester are
affected and will be updated accordingly.
This patch removes unecessary retries that happened when the bus layer
itself was no longer busy, but the the peer was not yet ready. Instead
of sending a retry that will inevitably not succeed, the bus now
silenty waits until the peer sends a retry.
Multiple instructions assume only 32-bit load operations are available,
this patch increases load sizes to 64-bit or 128-bit for many load pair and
load multiple instructions.
Support full-block writes directly rather than requiring RMW:
* a cache line is allocated in the cache upon receipt of a
WriteInvalidateReq, not the WriteInvalidateResp.
* only top-level caches allocate the line; the others just pass
the request along and invalidate as necessary.
* to close a timing window between the *Req and the *Resp, a new
metadata bit tracks whether another cache has read a copy of
the new line before the writeback to memory.
This patch fixes a bug in the cache port where the retry flag was
reset too early, allowing new requests to arrive before the retry was
actually sent, but with the event already scheduled. This caused a
deadlock in the interactions with the O3 LSQ.
The patche fixes the underlying issue by shifting the resetting of the
flag to be done by the event that also calls sendRetry(). The patch
also tidies up the flow control in recvTimingReq and ensures that we
also check if we already have a retry outstanding.
Previously, they were treated so much like loads that they could stall
at the head of the ROB. Now they are always treated like L1 hits.
If they actually miss, a new request is created at the L1 and tracked
from the MSHRs there if necessary (i.e. if it didn't coalesce with
an existing outstanding load).
The o3 cpu relies upon instructions that suspend a thread context being
flagged as "IsQuiesce". If they are not, unpredictable behavior can occur.
This patch fixes that for the x86 ISA.
For X86, the o3 CPU would get stuck with the commit stage not being
drained if an interrupt arrived while drain was pending. isDrained()
makes sure that pcState.microPC() == 0, thus ensuring that we are at
an instruction boundary. However, when we take an interrupt we
execute:
pcState.upc(romMicroPC(entry));
pcState.nupc(romMicroPC(entry) + 1);
tc->pcState(pcState);
As a result, the MicroPC is no longer zero. This patch ensures the drain is
delayed until no interrupts are present. Once draining, non-synchronous
interrupts are deffered until after the switch.
Neon memory ops that operate on multiple registers currently have very poor
performance because of interleave/deinterleave micro-ops.
This patch marks the deinterleave/interleave micro-ops as "No_OpClass" such
that they take minumum cycles to execute and are never resource constrained.
Additionaly the micro-ops over-read registers. Although one form may need
to read up to 20 sources, not all do. This adds in new forms so false
dependencies are not modeled. Instructions read their minimum number of
sources.
Analogous to ee049bf (for x86). Requires a bump of the checkpoint version
and corresponding upgrader code to move the condition code register values
to the new register file.
A small bug in the bimodal predictor caused significant degradation in
performance on some benchmarks. This was caused by using the wrong
globalHistoryReg during the update phase. This patches fixes the bug
and brings the performance to normal level.
This patch fixes the load blocked/replay mechanism in the o3 cpu. Rather than
flushing the entire pipeline, this patch replays loads once the cache becomes
unblocked.
Additionally, deferred memory instructions (loads which had conflicting stores),
when replayed would not respect the number of functional units (only respected
issue width). This patch also corrects that.
Improvements over 20% have been observed on a microbenchmark designed to
exercise this behavior.
O3 is supposed to stop fetching instructions once a quiesce is encountered.
However due to a bug, it would continue fetching instructions from the current
fetch buffer. This is because of a break statment that only broke out of the
first of 2 nested loops. It should have broken out of both.
The o3 cpu could attempt to schedule inactive threads under round-robin SMT
mode.
This is because it maintained an independent priority list of threads from the
active thread list. This priority list could be come stale once threads were
inactive, leading to the cpu trying to fetch/commit from inactive threads.
Additionally the fetch queue is now forcibly flushed of instrctuctions
from the de-scheduled thread.
Relevant output:
24557000: system.cpu: [tid:1]: Calling deactivate thread.
24557000: system.cpu: [tid:1]: Removing from active threads list
24557500: system.cpu:
FullO3CPU: Ticking main, FullO3CPU.
24557500: system.cpu.fetch: Running stage.
24557500: system.cpu.fetch: Attempting to fetch from [tid:1]
When a branch mispredicted gem5 would squash all history after and including
the mispredicted branch. However, the mispredicted branch is still speculative
and its history is required to rollback state if another, older, branch
mispredicts. This leads to things like RAS corruption.
This patch adds a fetch queue that sits between fetch and decode to the
o3 cpu. This effectively decouples fetch from decode stalls allowing it
to be more aggressive, running futher ahead in the instruction stream.
The o3 pipeline interlock/stall logic is incorrect. o3 unnecessicarily stalled
fetch and decode due to later stages in the pipeline. In general, a stage
should usually only consider if it is stalled by the adjacent, downstream stage.
Forcing stalls due to later stages creates and results in bubbles in the
pipeline. Additionally, o3 stalled the entire frontend (fetch, decode, rename)
on a branch mispredict while the ROB is being serially walked to update the
RAT (robSquashing). Only should have stalled at rename.
As highlighed on the mailing list gem5's writeback modeling can impact
performance. This patch removes the limitation on maximum outstanding issued
instructions, however the number that can writeback in a single cycle is still
respected in instToCommit().
isa_parser.py guesses the OpClass if none were given based upon the StaticInst
flags. The existing code does not take into account optionally set flags.
This code hoists the setting of optional flags so OpClass is properly assigned.
If a set of LL/SC requests contend on the same cache block we
can get into a situation where CPUs will deadlock if they expect
a failed SC to supply them data. This case happens where 3 or
more cores are contending for a cache block using LL/SC and the system
is configured where 2 cores are connected to a local bus and the
third is connected to a remote bus. If a core on the local bus
sends an SCUpgrade and the core on the remote bus sends and SCUpgrade
they will race to see who will win the SC access. In the meantime
if the other core appends a read to one of the SCUpgrades it will expect
to be supplied data by that SCUpgrade transaction. If it happens that
the SCUpgrade that was picked to supply the data is failed, it will
drop the appended request for data and never respond, leaving the requesting
core to deadlock. This patch makes all SC's behave as normal stores to
prevent this case but still makes sure to check whether it can perform
the update.
The first DPRINTF() in PL390::writeDistributor always read a uint32_t, though a
packet may have only been 1 or 2 bytes. This caused an assertion in
packet->get().
This patch makes restoring the 'lastStopped' value for Ticked-containing
objects (including MinorCPU) optional so that Ticked-containing objects
can be restored from non-Ticked-containing objects (such as AtomicSimpleCPU).
We currently generate and compile one version of the ISA code per CPU
model. This is obviously wasting a lot of resources at compile
time. This changeset factors out the interface into a separate
ExecContext class, which also serves as documentation for the
interface between CPUs and the ISA code. While doing so, this
changeset also fixes up interface inconsistencies between the
different CPU models.
The main argument for using one set of ISA code per CPU model has
always been performance as this avoid indirect branches in the
generated code. However, this argument does not hold water. Booting
Linux on a simulated ARM system running in atomic mode
(opt/10.linux-boot/realview-simple-atomic) is actually 2% faster
(compiled using clang 3.4) after applying this patch. Additionally,
compilation time is decreased by 35%.
This patch prunes unused values, and also unifies how the values are
defined (not using an enum for ALPHA), aligning the use of int vs Addr
etc.
The patch also removes the duplication of PageBytes/PageShift and
VMPageSize/LogVMPageSize. For all ISAs the two pairs had identical
values and the latter has been removed.
When passed from a configuration script with a hexadecimal value (like
"0x80000000"), gem5 would error out. This is because it would call
"toMemorySize" which requires the argument to end with a size specifier (like
1MB, etc).
This modification makes it so raw hex values can be passed through Addr
parameters from the configuration scripts.
This patch fixes the hash operator used for ARM ExtMachInst, which
incorrectly was still using uint32_t. Instead of changing it to
uint64_t it is not using the underlying data type of the BitUnion.
The Index type defined as typedef int64 does not really provide any help
since in most places we use primitive types instead of Index. Also, the name
Index is very generic that it does not merit being used as a typename.
This patch is the final patch in a series of patches. The aim of the series
is to make ruby more configurable than it was. More specifically, the
connections between controllers are not at all possible (unless one is ready
to make significant changes to the coherence protocol). Moreover the buffers
themselves are magically connected to the network inside the slicc code.
These connections are not part of the configuration file.
This patch makes changes so that these connections will now be made in the
python configuration files associated with the protocols. This requires
each state machine to expose the message buffers it uses for input and output.
So, the patch makes these buffers configurable members of the machines.
The patch drops the slicc code that usd to connect these buffers to the
network. Now these buffers are exposed to the python configuration system
as Master and Slave ports. In the configuration files, any master port
can be connected any slave port. The file pyobject.cc has been modified to
take care of allocating the actual message buffer. This is inline with how
other port connections work.
A later changeset changes the file src/python/swig/pyobject.cc to include
a header file that includes a header file generated at build time depending
on the PROTOCOL in use. Since NULL ISA was not specifying any protocol,
this resulted in compilation problems. Hence, the changeset.
The namespace Message conflicts with the Message data type used extensively
in Ruby. Since Ruby is being moved to the same Master/Slave ports based
configuration style as the rest of gem5, this conflict needs to be resolved.
Hence, the namespace is being renamed to ProtoMessage.
There are two changes this patch makes to the way configurable members of a
state machine are specified in SLICC. The first change is that the data
member declarations will need to be separated by a semi-colon instead of a
comma. Secondly, the default value to be assigned would now use SLICC's
assignment operator i.e. ':='.
This patch changes the grammar for SLICC so as to remove some of the
redundant / duplicate rules. In particular rules for object/variable
declaration and class member declaration have been unified. Similarly, the
rules for a general function and a class method have been unified.
One more change is in the priority of two rules. The first rule is on
declaring a function with all the params typed and named. The second rule is
on declaring a function with all the params only typed. Earlier the second
rule had a higher priority. Now the first rule has a higher priority.
This patch enables the use of page tables that are stored in system memory
and respect x86 specification, in SE mode. It defines an architectural
page table for x86 as a MultiLevelPageTable class and puts a placeholder
class for other ISAs page tables, giving the possibility for future
implementation.
This patch defines a multi-level page table class that stores the page table in
system memory, consistent with ISA specifications. In this way, cpu models that
use the actual hardware to execute (e.g. KvmCPU), are able to traverse the page
table.
This patch ensures the cycle check is still valid even restoring from
a checkpoint. In this case the DRAMSim2 cycle count is relative to the
startTick rather than 0.
We currently use our own home-baked support for type-safe variadic
functions. This is confusing and somewhat limited (e.g., cprintf only
supports a limited number of arguments). This changeset converts all
uses of our internal varargs support to use C++11 variadic macros.
Add the macros M5_ATTR_FINAL and M5_ATTR_OVERRIDE which are defined to
final and override respectively if supported by the compiler. This is
done to allow a smooth transition to gcc >= 4.7.
If a bit field in a bit union specified as Bitfield<LSB, MSB> instead
of Bitfield<MSB, LSB> the code silently fails and the field is read as
zero. This changeset introduces a static assert that tests, at compile
time, that the bit order is correct.
The order of the MSB and LSB bit of the mm field in the PSTATE union
is wrong. Any access to this field will currently be ignored and reads
will always return zero. This patch fixes the ordering so it is <MSB,
LSB> instead of <LSB, MSB>.
This patch fixes a bug in the DRAM controller address decoding. In
cases where the DRAM burst size (e.g. 32 bytes in a rank with a single
LPDDR3 x32) was smaller than the channel interleaving size
(e.g. systems with a 64-byte cache line) one address bit effectively
got used as a channel bit when it should have been a low-order column
bit.
This patch adds a notion of "columns per stripe", and more clearly
deals with the low-order column bits and high-order column bits. The
patch also relaxes the granularity check such that it is possible to
use interleaving granularities other than the cache line size.
The patch also adds a missing M5_CLASS_VAR_USED to the tCK member as
it is only used in the debug build for now.
This patch adds a fix for older checkpoints before support for
multiple event queues were added in changeset 2cce74fe359e. The change
in checkpoint version should really hav ebeen part of the
aforementioned changeset.
Some newer binaries compiled for Versatile Express TC2 contain access
to implementation specific L2MERRSR registers. This causes an infinite
loop of undefined exceptions. This patch changes the behavior to "warn
not fail" to keep the workloads going.
Baremetal workloads are specified using the "kernel" parameter, but
don't always have the correct address mappings. This patch adds a
boolean flag to the system and bypasses the kernel addr mapping checks
when running in baremetal mode.
The branch predictor is normally only built when a CPU that uses a
branch predictor is built. The list of CPUs is currently incomplete as
the simple CPUs support branch predictors (for warming, branch stats,
etc). In practice, all CPU models now use branch predictors, so this
changeset removes the CPU model check and replaces it with a check for
the NULL ISA.
Certain versions of clang complain about unused private members if
they are not used. This changeset removes such members from the
MIPS-specific classes to silence the warning.
Certain versions of clang complain about unused private members if
they are not used. This changeset removes such members from the
POWER-specific ProcessInfo struct to silence the warning.
This changeset fixes three types of warnings that occur in clang 3.4
on Ubuntu 12.04:
* Certain versions of libstdc++ (primarily 4.8) use struct and class
interchangeably. This triggers a warning in clang.
* Swig has a tendency to generate code with the register class which
was deprecated in C++11. This triggers a deprecation warning in
clang.
* Swig sometimes generates Python wrapper code which returns
uninitialized values. It's unclear if this is actually a problem
(the cases might be limited to failure paths). We'll silence these
warnings for now since there is little we can do about the
generated code.
The M5_PRAGMA_NORETURN macro was only used in for
__exit_message. Since the macro only holds a stub definition and all
functions with noreturn semantics use the M5_ATTR_NORETURN, this
macros is completely redundant.
RefCountingPtr is sometimes forward declared to avoid having to
include refcnt.hh. This does not work since we typically return
instances of RefCountingPtr rather than references to instances. The
only reason this currently works is that we include refcnt.hh in
cprintf.hh, which "leaks" the header to most other source files. This
changeset replaces such forward declarations with an include of
refcnt.hh.
When a cacheline is written back to a lower-level cache,
tags->insertBlock() sets various status parameters. However these
status bits were cleared immediately after calling. This patch makes
it so that these status fields are not cleared by moving them outside
of the tags->insertBlock() call.
This patch does some minor house keeping of the branch predictor by
adopting STL containers, and shifting some iterator to use range-based
for loops.
The predictor history is also changed from a list to a deque as we
never to insertion/deletion other than at the front and back.
This patch adds the SubSystem container for grouping
simobjects together in logical subsystems to facilitate
building a larger system from constituent parts. The container
is simply a non-abstract empty simobject to hold the components
that will be connected as its children. In simulation the
object does not participate, its only use is during configuration
of the system.
This patch adds helper functions to SimObject.py, params.py and
simulate.py to enable the new configuration system. Functions like
enumerateParams() in SimObject lets the config system auto-generate
command line options for simobjects to be modified on the command
line.
Params in params.py have __call__() added
to their definition to allow the argparse module to use them
as a type to check command input is in the proper format.
This patch adds a check to ensure that packets which are not going to
a memory range are suppressed in the traffic generator. Thus, if a
trace is collected in full-system, the packets destined for devices
are not played back.
this patch implements a new tags class that uses a random replacement policy.
these tags prefer to evict invalid blocks first, if none are available a
replacement candidate is chosen at random.
this patch factors out the common code in the LRU class and creates a new
abstract class: the BaseSetAssoc class. any set associative tag class must
implement the functionality related to the actual replacement policy in the
following methods:
accessBlock()
findVictim()
insertBlock()
invalidate()
This patch contains a new CPU model named `Minor'. Minor models a four
stage in-order execution pipeline (fetch lines, decompose into
macroops, decompose macroops into microops, execute).
The model was developed to support the ARM ISA but should be fixable
to support all the remaining gem5 ISAs. It currently also works for
Alpha, and regressions are included for ARM and Alpha (including Linux
boot).
Documentation for the model can be found in src/doc/inside-minor.doxygen and
its internal operations can be visualised using the Minorview tool
utils/minorview.py.
Minor was designed to be fairly simple and not to engage in a lot of
instruction annotation. As such, it currently has very few gathered
stats and may lack other gem5 features.
Minor is faster than the o3 model. Sample results:
Benchmark | Stat host_seconds (s)
---------------+--------v--------v--------
(on ARM, opt) | simple | o3 | minor
| timing | timing | timing
---------------+--------+--------+--------
10.linux-boot | 169 | 1883 | 1075
10.mcf | 117 | 967 | 491
20.parser | 668 | 6315 | 3146
30.eon | 542 | 3413 | 2414
40.perlbmk | 2339 | 20905 | 11532
50.vortex | 122 | 1094 | 588
60.bzip2 | 2045 | 18061 | 9662
70.twolf | 207 | 2736 | 1036
Stop setting the use_default_range flag in PioBus in order to
have random bad addresses result in a BadAddress response and
not a gem5 fatal error. This is necessary in Ruby as Ruby is
connected directly to PioBus, so misspeculated addresses will
be sent there directly. For the classic memory system, this
change has no effect, as bad addresses are caught by the
memory bus before being sent to the PioBus.
This work was done while Binh was an intern at AMD Research.
The System object has a static MemoryModeStrings array
that's (1) unused and (2) redundant, since there's an
auto-generated version in the Enums namespace. No
point in leaving it in.
When we switched getSyscallArg() from explicit arg indices to
the implicit method, some DPRINTF arguments were left as calls
to getSyscallArg(), even though C/C++ doesn't guarantee
anything about the order of invocation of these calls. As a
result, the args could be printed out in arbitrary orders.
Interestingly, this bug has been around since 2009:
http://repo.gem5.org/gem5/rev/4842482e1bd1
this operator uses memcmp() to detect if two EthAddr object have the same
address, however memcmp() will return 0 if all bytes are equal. operator==
returns the return value of memcmp() to indicate whether or not two
address are equal. this is incorrect as it will always give the opposite of
the intended behavior. this patch fixes that problem.
per the IEEE 802 spec:
1) fixed broadcast() to ensure that all bytes are equal to 0xff.
2) fixed unicast() to ensure that bit 0 of the first byte is equal to 0
3) fixed multicast() to ensure that bit 0 of the first byte is equal to 1, and
that it is not a broadcast.
also the constructors in EthAddr are fixed so that all bytes of data are
initialized.
Adds DVFS capabilities to gem5, by allowing users to specify lists for
frequencies and voltages in SrcClockDomains and VoltageDomains respectively.
A separate component, DVFSHandler, provides a small interface to change
operating points of the associated domains.
Clock domains will be linked to voltage domains and thus allow separate clock,
but shared voltage lines.
Currently all the valid performance-level updates are performed with a fixed
transition latency as specified for the domain.
Config file example:
...
vd = VoltageDomain(voltage = ['1V','0.95V','0.90V','0.85V'])
tsys.cluster1.clk_domain.clock = ['1GHz','700MHz','400MHz','230MHz']
tsys.cluster2.clk_domain.clock = ['1GHz','700MHz','400MHz','230MHz']
tsys.cluster1.clk_domain.domain_id = 0
tsys.cluster2.clk_domain.domain_id = 1
tsys.cluster1.clk_domain.voltage_domain = vd
tsys.cluster2.clk_domain.voltage_domain = vd
tsys.dvfs_handler.domains = [tsys.cluster1.clk_domain,
tsys.cluster2.clk_domain]
tsys.dvfs_handler.enable = True
This patch adds a DRAMPower flag to enable off-line DRAM power
analysis using the DRAMPower tool. A new DRAMPower flag is added
and a follow-on patch adds a Python script to post-process the output
and order it based on time stamps.
The long-term goal is to link DRAMPower as a library and provide the
commands through function calls to the model rather than first
printing and then parsing the commands. At the moment it is also up to
the user to ensure that the same DRAM configuration is used by the
gem5 controller model and DRAMPower.
This patch adds the index of the bank and rank as a field so that we can
determine the identity of a given bank (reference or pointer) for the
power tracing. We also grab the opportunity of cleaning up the
arguments used for identifying the bank when activating.
In a cycle, we could see a R and W requests corresponding to the same
page walk being sent to the memory. During the cycle that assertion
happens, we have 2 responses corresponding to the R and W above. We
also have a 'read' variable to keep track of the inflight Read
request, this gets reset to NULL right after we send out any R
request; and gets set to the next R in the page walk when a response
comes back.
The issue we are seeing here is when we get a response for W request,
assert(!read) fires because we got a response for R request right
before this, hence we set 'read' to NOT NULL value, pointing to the
next R request in the pagewalk!
This work was done while Binh was an intern at AMD Research.
Check for free entries in Load Queue and Store Queue separately to
avoid cases when load cannot be renamed due to full Store Queue and
vice versa.
This work was done while Binh was an intern at AMD Research.
This patch bumps the supported version of gcc from 4.4 to 4.6, and
clang from 2.9 to 3.0. This enables, amongst other things, range-based
for loops, lambda expressions, etc. The STL implementation shipping
with 4.6 also has a full functional implementation of unique_ptr and
shared_ptr.
The language describing the clockEdge and nextCycle functions were ambiguous,
and so were prone to misinterpretation/misuse. Clear up the comments to more
rigorously describe their functionality.
Using '== true' in a boolean expression is totally redundant,
and using '== false' is pretty verbose (and arguably less
readable in most cases) compared to '!'.
It's somewhat of a pet peeve, perhaps, but I had some time
waiting for some tests to run and decided to clean these up.
Unfortunately, SLICC appears not to have the '!' operator,
so I had to leave the '== false' tests in the SLICC code.
This patch removes the stat totalCommittedInsts. This variable was used for
recording the total number of instructions committed across all the threads
of a core. The instructions committed by each thread are recorded invidually.
The total would now be generated by summing these individual counts.
This patch makes a more firm connection between the DDR3-1600
configuration and the corresponding datasheet, and also adds a
DDR3-2133 and a DDR4-2400 configuration. At the moment there is also
an ongoing effort to align the choice of datasheets to what is
available in DRAMPower.
This patch extends the current timing parameters with the DRAM cycle
time. This is needed as the DRAMPower tool expects timestamps in DRAM
cycles. At the moment we could get away with doing this in a
post-processing step as the DRAMPower execution is separate from the
simulation run. However, in the long run we want the tool to be called
during the simulation, and then the cycle time is needed.
This patch adds the basic ingredients for a precharge all operation,
to be used in conjunction with DRAM power modelling.
Currently we do not try and apply any cleverness when precharging all
banks, thus even if only a single bank is open we use PREA as opposed
to PRE. At the moment we only have a single tRP (tRPpb), and do not
model the slightly longer all-bank precharge constraint (tRPab).
This patch adds the tRTP timing constraint, governing the minimum time
between a read command and a precharge. Default values are provided
for the existing DRAM types.
This patch merges the two control paths used to estimate the latency
and update the bank state. As a result of this merging the computation
is now in one place only, and should be easier to follow as it is all
done in absolute (rather than relative) time.
As part of this change, the scheduling is also refined to ensure that
we look at a sensible estimate of the bank ready time in choosing the
next request. The bank latency stat is removed as it ends up being
misleading when the DRAM access code gets evaluated ahead of time (due
to the eagerness of waking the model up for scheduling the next
request).
This patch adds the write recovery time to the DRAM timing
constraints, and changes the current tRASDoneAt to a more generic
preAllowedAt, capturing when a precharge is allowed to take place.
The part of the DRAM access code that accounts for the precharge and
activate constraints is updated accordingly.
This patch adds power states to the controller. These states and the
transitions can be used together with the Micron power model. As a
more elaborate use-case, the transitions can be used to drive the
DRAMPower tool.
At the moment, the power-down modes are not used, and this patch
simply serves to capture the idle, auto refresh and active modes. The
patch adds a third state machine that interacts with the refresh state
machine.
This patch adds a state machine for the refresh scheduling to
ensure that no accesses are allowed while the refresh is in progress,
and that all banks are propely precharged.
As part of this change, the precharging of banks of broken out into a
method of its own, making is similar to how activations are dealt
with. The idle accounting is also updated to ensure that the refresh
duration is not added to the time that the DRAM is in the idle state
with all banks precharged.
This patch changes the read/write event loop to use a single event
(nextReqEvent), along with a state variable, thus joining the two
control flows. This change makes it easier to follow the state
transitions, and control what happens when.
With the new loop we modify the overly conservative switching times
such that the write-to-read switch allows bank preparation to happen
in parallel with the bus turn around. Similarly, the read-to-write
switch uses the introduced tRTW constraint.
This patch adds a the member function StaticInst::printFlags to allow all
of an instruction's flags to be printed without using the individual
is... member functions or resorting to exposing the 'flags' vector
It also replaces the enum definition StaticInst::Flags with a
Python-generated enumeration and adds to the enum generation mechanism
in src/python/m5/params.py to allow Enums to be placed in namespaces
other than Enums or, alternatively, in wrapper structs allowing them to
be inherited by other classes (so populating that class's name-space
with the enumeration element names).
This patch encompasses several interrelated and interdependent changes
to the ISA generation step. The end goal is to reduce the size of the
generated compilation units for instruction execution and decoding so
that batch compilation can proceed with all CPUs active without
exhausting physical memory.
The ISA parser (src/arch/isa_parser.py) has been improved so that it can
accept 'split [output_type];' directives at the top level of the grammar
and 'split(output_type)' python calls within 'exec {{ ... }}' blocks.
This has the effect of "splitting" the files into smaller compilation
units. I use air-quotes around "splitting" because the files themselves
are not split, but preprocessing directives are inserted to have the same
effect.
Architecturally, the ISA parser has had some changes in how it works.
In general, it emits code sooner. It doesn't generate per-CPU files,
and instead defers to the C preprocessor to create the duplicate copies
for each CPU type. Likewise there are more files emitted and the C
preprocessor does more substitution that used to be done by the ISA parser.
Finally, the build system (SCons) needs to be able to cope with a
dynamic list of source files coming out of the ISA parser. The changes
to the SCons{cript,truct} files support this. In broad strokes, the
targets requested on the command line are hidden from SCons until all
the build dependencies are determined, otherwise it would try, realize
it can't reach the goal, and terminate in failure. Since build steps
(i.e. running the ISA parser) must be taken to determine the file list,
several new build stages have been inserted at the very start of the
build. First, the build dependencies from the ISA parser will be emitted
to arch/$ISA/generated/inc.d, which is then read by a new SCons builder
to finalize the dependencies. (Once inc.d exists, the ISA parser will not
need to be run to complete this step.) Once the dependencies are known,
the 'Environments' are made by the makeEnv() function. This function used
to be called before the build began but now happens during the build.
It is easy to see that this step is quite slow; this is a known issue
and it's important to realize that it was already slow, but there was
no obvious cause to attribute it to since nothing was displayed to the
terminal. Since new steps that used to be performed serially are now in a
potentially-parallel build phase, the pathname handling in the SCons scripts
has been tightened up to deal with chdir() race conditions. In general,
pathnames are computed earlier and more likely to be stored, passed around,
and processed as absolute paths rather than relative paths. In the end,
some of these issues had to be fixed by inserting serializing dependencies
in the build.
Minor note:
For the null ISA, we just provide a dummy inc.d so SCons is never
compelled to try to generate it. While it seems slightly wrong to have
anything in src/arch/*/generated (i.e. a non-generated 'generated' file),
it's by far the simplest solution.
The unproxy code for Parent.any can generate a circular reference
in certain situations with classes hierarchies like those in ClockDomain.py.
This patch solves this by marking ouself as visited to make sure the
search does not resolve to a self-reference.
The ARM TLBs have a bootUncacheability flag used to make some loads
and stores become uncacheable when booting in FS mode. Later the
flag is cleared to let those loads and stores operate as normal. When
doing a takeOverFrom(), this flag's state is not preserved and is
momentarily reset until the CPSR is touched. On single core runs this
is a non-issue. On multi-core runs this can lead to crashes on the O3
CPU model from the following series of events:
1) takeOverFrom executed to switch from Atomic -> O3
2) All bootUncacheability flags are reset to true
3) Core2 tries to execute a load covered by bootUncacheability, it
is flagged as uncacheable
4) Core2's load needs to replay due to a pipeline flush
3) Core1 core does an action on CPSR
4) The handling code for CPSR then checks all other cores
to determine if bootUncacheability can be set to false
5) Asynchronously set bootUncacheability on all cores to false
6) Core2 replays load previously set as uncacheable and notices
it is now flagged as cacheable, leads to a panic.
This patch implements takeOverFrom() functionality for the ARM TLBs
to preserve flag values when switching from atomic -> detailed.
For the o3, add instruction mix (OpClass) histogram at commit (stats
also already collected at issue). For the simple CPUs we add a
histogram of executed instructions
This patch squashes prefetch requests from downstream caches,
so that they do not steal cachelines away from caches closer
to the cpu. It was originally coded by Mitch Hayenga and
modified by Aasheesh Kolli.
This source for stats binds an object and a method / function from the object
to a stats object. This allows pulling out stats from object methods without
needing to go through a global, or static shim.
Syntax is somewhat unpleasant, but the templates and method pointer type
specification were quite tricky. Interface is very clean though; and similar
to .functor
Allow the specification of a socket ID for every core that is reflected in the
MPIDR field in ARM systems. This allows studying multi-socket / cluster
systems with ARM CPUs.
Splits the CommMonitor trace_file parameter into three parameters. Previously,
the trace was only enabled if the trace_file parameter was set, and would be
written to this file. This patch adds in a trace_enable and trace_compress
parameter to the CommMonitor.
No trace is generated if trace_enable is set to False. If it is set to True, the
trace is written to a file based on the name of the SimObject in the simulation
hierarchy. For example, system.cluster.il1_commmonitor.trc. This filename can be
overridden by additionally specifying a file name to the trace_file parameter
(more on this later).
The trace_compress parameter will append .gz to any filename if set to True.
This enables compression of the generated traces. If the file name already ends
in .gz, then no changes are made.
The trace_file parameter will override the name set by the trace_enable
parameter. In the case that the specified name does not end in .gz but
trace_compress is set to true, .gz is appended to the supplied file name.
Unimplemented miscregs for the generic timer were guarded by panics
in arm/isa.cc which can be tripped by the O3 model if it speculatively
executes a wrong path containing a mrs instruction with a bad miscreg
index. These registers were flagged as implemented and accessible.
This patch changes the miscreg info bit vector to flag them as
unimplemented and inaccessible. In this case, and UndefinedInst
fault will be generated if the register access is not trapped
by a hypervisor.
This patch changes the default pixel clock to effectively generate
1080p resolution at 60 frames per second. It is dependent upon the
kernel device tree file using the specified resolution / display
string in the comments.
This is a quick hack to communicate a greater number of CPUs to a guest OS via
the ARM A9 SCU config register. Some OSes (Linux) just look at the bottom field
to count CPUs and with a small change can look at bits [3:0] to learn about up
to 16 CPUs.
Very much unsupported (and contains warning messages as such) but useful for
running 8 core sims without hardwiring CPU count in the guest OS.
With (upcoming) separate compilation, they are useless. Only
link-time optimization could re-inline them, but ideally
feedback-directed optimization would choose to do so only for
profitable (i.e. common) instructions.
The MicroMemOp class generates the disassembly for both integer
and floating point instructions, but it would always print its
first operand as an integer register without considering that the
op may be a floating instruction in which case a float register
should be displayed instead.
In the O3 LSQ, data read/written is printed out in DPRINTFs. However,
the data field is treated as a character string with a null terminated.
However the data field is not encoded this way. This patch removes
that possibility by removing the data part of the print.
This never actually worked since it was printing out only a word
of the cache block and not the entire thing and doubly didn't work
csprintf overrides the %#x specifier and assumes a char* array is
actually a string.
O3CPU has a compile-time maximum width set in o3/impl.hh, but checking
the configuration against this limit was not implemented anywhere
except for fetch. Configuring a wider pipe than the limit can silently
cause various issues during the simulation. This patch adds the proper
checking in the constructor of the various pipeline stages.
Without this declaration, new clangs will complain about this value
being unused. It has no explicit use in the codebase, but it can be
useful to turn on all debugging flags while in a debugger to greatly
increase simulator verbosity.
A number of calls to isEmpty() and numFreeEntries()
should be thread-specific.
In cpu.cc, the fact that tid is /*commented*/ out is a bug. Say the rob
has instructions from thread 0 (isEmpty() returns false), and none from
thread 1. If we are trying to squash all of thread 1, then
readTailInst(thread 1) will be called because rob->isEmpty() returns
false. The result is end_it is not in the list and the while
statement loops indefinitely back over the cpu's instList.
In iew_impl.hh, all threads are told they have the entire remaining IQ, when
each thread actually has a certain allocation. The result is extra stalls at
the iew dispatch stage which the rename stage usually takes care of.
In commit_impl.hh, rob->readHeadInst(thread 1) can be called if the rob only
contains instructions from thread 0. This returns a dummyInst (which may work
since we are trying to squash all instructions, but hardly seems like the right
way to do it).
In rob_impl.hh this fix skips the rest of the function more frequently and is
more efficient.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Upon aggregating records, serialize system's cache-block size, as the
cache-block size can be different when restoring from a checkpoint. This way,
we can correctly read all records when restoring from a checkpoints, even if
the cache-block size is different.
Note, that it is only possible to restore from a checkpoint if the
desired cache-block size is smaller or equal to the cache-block size
when the checkpoint was taken; we can split one larger request into
multiple small ones, but it is not reliable to do the opposite.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Simulating a SMP or multicore requires devices to be shared between
multiple KVM vCPUs. This means that locking is required when accessing
devices. This changeset adds the necessary locking to allow devices to
execute correctly. It is implemented by temporarily migrating the KVM
CPU to the VM's (and devices) event queue when handling
MMIO. Similarly, the VM migrates to the interrupt controller's event
queue when delivering an interrupt.
The support for fast-forwarding of multicore simulations added by this
changeset assumes that all devices in a system are simulated in the
same thread and each vCPU has its own thread. Special care must be
taken to ensure that devices living under the CPU in the object
hierarchy (e.g., the interrupt controller) do not inherit the parent
CPUs thread and are assigned to device thread. The KvmVM object is
assumed to live in the same thread as the other devices in the system.
The calling thread is undefined when the PollQueue services events.
This implies that PollEvents need to handle the case where they are
processed from a different thread than the thread that created the
event. This changeset adds temporary event queue migrations to the VNC
server, the ethernet tap device, and the terminal to protect them from
inter-thread calls.
As of now, the enqueue statement can take in any number of 'pairs' as
argument. But we only use the pair in which latency is the key. This
latency is allowed to be either a fixed integer or a member variable of
controller in which the expression appears. This patch drops the use of pairs
in an enqueue statement. Instead, an expression is allowed which will be
interpreted to be the latency of the enqueue. This expression can anything
allowed by slicc including a constant integer or a member variable.
We need the ability to lock event queues to enable device accesses
across threads. The serviceOne() method now takes a service lock prior
to handling a new event. By locking an event queue, a different
thread/eq can effectively execute in the context of the locked event
queue. To simplify temporary event queue migrations, this changeset
introduces the EventQueue::ScopedMigration class that unlocks the
current event queue, locks a new event queue, and updates the current
event queue variable.
In order to prevent deadlocks, event queues need to be released when
waiting on barriers. This is implemented using the
EventQueue::ScopedRelease class. An instance of this class is, for
example, used in the BaseGlobalEvent class to release the event queue
when waiting on the synchronization barrier.
The intended use for this functionality is when devices need to be
accessed across thread boundaries. For example, when fast-forwarding,
it might be useful to run devices and CPUs in separate threads. In
such a case, the CPU locks the device queue whenever it needs to
perform IO. This functionality is primarily intended for KVM.
Note: Migrating between event queues can lead to non-deterministic
timing. Use with extreme care!
--HG--
extra : rebase_source : 23e3a741a1fd73861d1339782dbbe1bc76285315
This patch fixes violation of TSO in the O3CPU, as all loads must be
ordered with all other loads. In the LQ, if a snoop is observed, all
subsequent loads need to be squashed if the system is TSO.
Prior to this patch, the following case could be violated:
P0 | P1 ;
MOV [x],mail=/usr/spool/mail/nilay | MOV EAX,[y] ;
MOV [y],mail=/usr/spool/mail/nilay | MOV EBX,[x] ;
exists (1:EAX=1 /\ 1:EBX=0) [is a violation]
The problem was found using litmus [http://diy.inria.fr].
Committed by: Nilay Vaish <nilay@cs.wisc.edu
This patch adds stats for tracking the number of reads/writes per bus
turn around, and also adds hysteresis to the write-to-read switching
to ensure that the queue does not oscilate around the low threshold.
This patch renames the not-so-simple SimpleDRAM to a more suitable
DRAMCtrl. The name change is intended to ensure that we do not send
the wrong message (although the "simple" in SimpleDRAM was originally
intended as in cleverly simple, or elegant).
As the DRAM controller modelling work is being presented at ISPASS'14
our hope is that a broader audience will use the model in the future.
--HG--
rename : src/mem/SimpleDRAM.py => src/mem/DRAMCtrl.py
rename : src/mem/simple_dram.cc => src/mem/dram_ctrl.cc
rename : src/mem/simple_dram.hh => src/mem/dram_ctrl.hh
Make the default memory type DDR3-1600 x64, and use the open-adaptive
page policy. This change is aiming to ensure that users by default are
using a realistic memory system.
This patch adds a basic starvation-prevention mechanism where a DRAM
page is forced to close after a certain number of accesses. The limit
is combined with the open and open-adaptive page policy and if reached
causes an auto-precharge.
This patch changes the triggering condition for the write draining
such that we grab the opportunity to issue writes if there are no
reads waiting (as opposed to waiting for the writes to reach the high
threshold). As a result, we potentially drain some of the writes in read
idle periods (if any).
A low threshold is added to be able to control how many write bursts
are kept in the memory controller queue (acting as on-chip storage).
The high and low thresholds are updated to sensible values for a 32/64
size write buffer. Note that the thresholds should be adjusted along
with the queue sizes.
This patch also adds some basic initialisation sanity checks and moves
part of the initialisation to the constructor.
This patch enables a new 'DRAM' mode to the existing traffic
generator, catered to generate specific requests to DRAM based on
required hit length (stride size) and bank utilization. It is an add on
to the Random mode.
The basic idea is to control how many successive packets target the
same page, and how many banks are being used in parallel. This gives a
two-dimensional space that stresses different aspects of the DRAM
timing.
The configuration file needed to use this patch has to be changed as
follow: (reference to Random Mode, LPDDR3 memory type)
'STATE 0 10000000000 RANDOM 50 0 134217728 64 3004 5002 0'
-> 'STATE 0 10000000000 DRAM 50 0 134217728 32 3004 5002 0 96 1024 8 6 1'
The last 4 parameters to be added are:
<stride size (bytes), page size(bytes), number of banks available in DRAM,
number of banks to be utilized, address mapping scheme>
The address mapping information is used to get the stride address
stream of the specified size and to know where to find the bank
bits. The configuration file has a parameter where '0'-> RoCoRaBaCh,
'1'-> RoRaBaCoCh/RoRaBaChCo address-mapping schemes. Note that the
generator currently assumes a single channel and a single rank. This
is to avoid overwhelming the traffic generator with information about
the memory organisation.
This patch adds the row bits to the name of the address mapping
schemes to make it more clear that all the current schemes places the
row bits as the most significant bits.
This patch moves the Ruby-related debug flags to the ruby
sub-directory, and also removes the state SConsopts that add the
no-longer-used NO_VECTOR_BOUNDS_CHECK.
Prevent incomplete configuration of TrafficGen class from causing
segmentation faults. If an 'INIT' line is not present in the
configuration file then the currState variable will remain
uninitialized which may result in a crash.
There were several sections of the m5ops code which were
essentially copy/pasted versions of the 32-bit code. The
problem is that some of these didn't account fo4 64-bit
registers leading to arguments being in the wrong registers.
This patch addresses the args for readfile64, writefile64,
and addsymbol64 -- all of which seemed to suffer from a
similar set of problems when moving to 64-bit.
Each consumer object maintains a set of tick values when the object is supposed
to wakeup and do some processing. As of now, the object accesses this set both
when scheduling a wakeup event and when the object actually wakes up. The set
is accessed during wakeup to remove the current tick value from the set. This
functionality is now being moved to the scheduling function where ticks are
removed at a later time.
This helps in configuring the network interfaces from the python script and
these objects no longer rely on the network object for the timing information.
Piobus was recently added to se scripts for ruby so that the interrupt
controller can be connected to something (required since the interrupt
controller sends address range messages). This patch removes the piobus
and instead, the pio port of ruby port will now ignore the range change
messages in se mode.
KVM used to use two signals, one for instruction count exits and one
for timer exits. There is really no need to distinguish between the
two since they only trigger exits from KVM. This changeset unifies and
renames the signals and adds a method, kick(), that can be used to
raise the control signal in the vCPU thread. It also removes the early
timer warning since we do not normally see if the signal was
delivered.
--HG--
extra : rebase_source : cd0e45ca90894c3d6f6aa115b9b06a1d8f0fda4d
gem5 seems to store the PC as RIP+CS_BASE. This is not what KVM
expects, so we need to subtract CS_BASE prior to transferring the PC
into KVM. This changeset adds the necessary PC manipulation and
refactors thread context updates slightly to avoid reading registers
multiple times from KVM.
--HG--
extra : rebase_source : 3f0569dca06a1fcd8694925f75c8918d954ada44
This changeset adds support for INIT and STARTUP IPI handling. We
currently handle both of these interrupts in gem5 and transfer the
state to KVM. Since we do not have a BIOS loaded, we pretend that the
INIT interrupt suspends the CPU after reset.
--HG--
extra : rebase_source : 7f3b25f3801d68f668b6cd91eaf50d6f48ee2a6a
The table walker code currently accounts for two types of walks,
Atomic and Timing, and treats them differently. Atomic walks keep a
single instance of WalkerState around for all walks to use in
currState. Timing mode keeps a queue of in-flight WalkerStates and
maintains currState as NULL between walks.
If a functional walk is done during Timing mode, it is treated as an
atomic walk and either creates a persistent WalkerState if in between
Timing walks, or stomps an existing currState for an in-progress
Timing walk.
This patch distinguishes functional walks as being able to exist at
any time and sets up a temporary WalkerState for its exclusive use and
then cleans up when finished, leaving any in progress Atomic or Timing
walks undisturbed.
This patch fixes an assert condition that is not true at all
times. There are valid situations that arise in dual-core
dual-workload runs where the assert condition is false. The function
call following the assert however needs to be called only when the
condition is true (a block cannot be invalidated in the tags structure
if has not been allocated in the structure, and the tempBlock is never
allocated). Hence the 'assert' has been replaced with an 'if'.
This patch changes the decode script to output the optional fields of
the proto message Packet, namely id and flags. The flags field is set
by the communication monitor.
The id field is useful for CPU trace experiments, e.g. linking the
fetch side to decode side. It had to be renamed because it clashes
with a built in python function id() for getting the "identity" of an
object.
This patch also takes a few common function definitions out from the
multiple scripts and adds them to a protolib python module.
This snippet can be used to replace if + {panics, fatals, asserts} constructs.
The idea is to have both the condition checking and a verbose printout in a single statement. The interface is as follows:
panic_if(foo != bar, "These should be equal: foo %i bar %i", foo, bar);
fatal_if(foo != bar, "These should be equal: foo %i bar %i", foo, bar);
chatty_assert(foo == bar, "These should be equal: foo %i bar %i", foo, bar);
Small fix for a warning that prevents compilation with gcc 4.8.1 due
to detecting that a variable might be uninitialised. The fix is to
assign a safe default.
The global synchronization event used to be scheduled at
simQuantum. This prevented repeated entries into gem5 from Python as
it can be scheduled in the past. This changeset ensures that the first
global synchronization happens at curTick() + simQuantum instead.
The TSL/LDT & TR/TSS segments didn't contain valid attributes. This
caused problems when transfering the state into KVM where invalid
state is a no-go. Fixup the attributes with values from AMD's
architecture programmer's manual.
When transferring segment registers into kvm, we need to find the
value of the unusable bit. We used to assume that this could be
inferred from the selector since segments are generally unusable if
their selector is 0. This assumption breaks in some weird corner
cases. Instead, we just assume that segments are always usable. This
is what qemu does so it should work.
Signal handlers in KVM are controlled per thread and should be
initialized from the thread that is going to execute the CPU. This
changeset moves the initialization call from startup() to
startupThread().
A copyRegs() function is added to MIPS utilities
to copy architectural state from the old CPU to
the new CPU during fast-forwarding. This
addition alone enables fast-forwarding for the
o3 cpu model running MIPS.
The patch also adds takeOverFrom() and
drainResume() functions to the InOrderCPU to
enable it to take over from another CPU. This
change enables fast-forwarding for the inorder
cpu model running MIPS, but not for Alpha.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Couple of users observed segmentation fault when the simulator tries to
register the statistical variable m_IncompleteTimes. It seems that there
is some problem with the initialization of these variables when allocated
in the constructor.
Currently, the interrupt controller in x86 is connected to the io bus
directly. Therefore the packets between the io devices and the interrupt
controller do not go through ruby. This patch changes ruby port so that
these packets arrive at the ruby port first, which then routes them to their
destination. Note that the patch does not make these packets go through the
ruby network. That would happen in a subsequent patch.
This patch simplfies the retry logic in the RubyPort, avoiding
redundant attributes, and enforcing more stringent checks on the
interactions with the normal ports. The patch also simplifies the
routing done by the RubyPort, using the port identifiers instead of a
heavy-weight sender state.
The patch also fixes a bug in the sending of responses from PIO
ports. Previously these responses bypassed the queue in the queued
port, and ignored the return value, potentially leading to response
packets being lost.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Code in two of the functions was exactly the same. This patch moves
this code to a new function which is called from the two functions
mentioned initially.
At several places, there are functions that take a cycle value as input
and performs some computation. Along with each such function, another
function was being defined that simply added one more cycle to input and
computed the same function. This patch removes this second copy of the
function. Places where these functions were being called have been updated
to use the original function with argument being current cycle + 1.
Two files had been incorrectly named with a .cache suffix.
--HG--
rename : src/mem/protocol/MESI_Three_Level-L0.cache => src/mem/protocol/MESI_Three_Level-L0cache.sm
rename : src/mem/protocol/MESI_Three_Level-L1.cache => src/mem/protocol/MESI_Three_Level-L1cache.sm
The introduction of parallel event queues added most of the support
needed to run multiple VMs (systems) within the same gem5
instance. This changeset fixes up signal delivery so that KVM's
control signals are delivered to the thread that executes the CPU's
event queue. Specifically:
* Timers and counters are now initialized from a separate method
(startupThread) that is scheduled as the first event in the
thread-specific event queue. This ensures that they are
initialized from the thread that is going to execute the CPUs
event queue and enables signal delivery to the right thread when
exiting from KVM.
* The POSIX-timer-based KVM timer (used to force exits from KVM) has
been updated to deliver signals to the thread that's executing KVM
instead of the process (thread is undefined in that case). This
assumes that the timer is instantiated from the thread that is
going to execute the KVM vCPU.
* Signal masking is now done using pthread_sigmask instead of
sigprocmask. The behavior of the latter is undefined in threaded
applications.
* Since signal masks can be inherited, make sure to actively unmask
the control signals when setting up the KVM signal mask.
There are currently no facilities to multiplex between multiple KVM
CPUs in the same event queue, we are therefore limited to
configurations where there is only one KVM CPU per event queue. In
practice, this means that multi-system configurations can be
simulated, but not multiple CPUs in a shared-memory configuration.
This patch fixes a bug in how physical memory used to be mapped and
unmapped. Previously we unmapped and re-mapped if restoring from a
checkpoint. However, we never checked that the new mapping was
actually the same, it was just magically working as the OS seems to
fairly reliably give us the same chunk back. This patch fixes this
issue by relying entirely on the mmap call in the constructor.
This patch enbles use of the basic PIO devices as part of the NULL
build. Although it might seem counter intuitive to have a PIO device
without being able to execute a driver, this change enables us to
break a device class hierarchy into an ISA-agnostic part, and an
ISA-specific part, without requiring multiple-inheritance. The
ISA-agnostic base class is a PIO device, but does not make use of the
port.
This patch adds a filter to the cache to drop snoop requests that are
not for a range covered by the cache. This fixes an issue observed
when multiple caches are placed in parallel, covering different
address ranges. Without this patch, all the caches will forward the
snoop upwards, when only one should do so.
This patch adds DRAMSim2 as a memory controller by wrapping the
external library and creating a sublass of AbstractMemory that bridges
between the semantics of gem5 and the DRAMSim2 interface.
The DRAMSim2 wrapper extracts the clock period from the config
file. There is no way of extracting this information from DRAMSim2
itself, so we simply read the same config file and get it from there.
To properly model the response queue, the wrapper keeps track of how
many transactions are in the actual controller, and how many are
stacking up waiting to be sent back as responses (in the wrapper). The
latter requires us to move away from the queued port and manage the
packets ourselves. This is due to DRAMSim2 not having any flow control
on the response path.
DRAMSim2 assumes that the transactions it is given are matching the
burst size of the choosen memory. The wrapper checks to ensure the
cache line size of the system matches the burst size of DRAMSim2 as
there are currently no provisions to split the system requests. In
theory we could allow a cache line size smaller than the burst size,
but that would lead to inefficient use of the DRAM, so for not we
fatal also in this case.
This changesets adds branch predictor support to the
BaseSimpleCPU. The simple CPUs normally don't need a branch predictor,
however, there are at least two cases where it can be desirable:
1) A simple CPU can be used to warm the branch predictor of an O3
CPU before switching to the slower O3 model.
2) The simple CPU can be used as a quick way of evaluating/debugging
new branch predictors since it exposes branch predictor
statistics.
Limitations:
* Since the simple CPU doesn't speculate, only one instruction will
be active in the branch predictor at a time (i.e., the branch
predictor will never see speculative branches).
* The outcome of a branch prediction does not affect the performance
of the simple CPU.
Currently fatal() ends the simulation in a normal fashion. This results in
the call stack getting lost when using a debugger and it is not always
possible to debug the simulation just from the information provided by the
printed error message. Even though the error is likely due to a user's fault,
the information available should not be thrown away. Hence, this patch to
call abort() from fatal().
Changeset 7274310be1bb (isa: clean up register constants) increased
the value of NumFloatRegs, which triggered a bug in
X86ISA::copyRegs(). This bug is caused by the x87 stack being copied
twice since register indexes past NUM_FLOATREGS are mapped into the
x87 stack relative to the top of the stack, which is undefined when
the copy takes place.
This changeset updates the copyRegs() function to use access registers
using the non-flattening interface, which guarantees that undesirable
register folding does not happen.
The getRFlags and setRFlags utility functions were not updated
correctly when condition registers were separated into their own
register class. This lead to incorrect state transfer in calls from
kvm into the simulator (e.g., m5 readfile ended up in an infinite
loop) and when switching CPUs. This patch makes these utility
functions use getCCReg and setCCReg instead of getIntReg and setIntReg
which read and write the integer registers.
Reviewed-by: Andreas Sandberg <andreas@sandberg.pp.se>
Forces the prefetcher to mispredict twice in a row before resetting the
confidence of prefetching. This helps cases where a load PC strides by a
constant factor, however it may operate on different arrays at times.
Avoids the cost of retraining. Primarily helps with small iteration loops.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
For systems with a tightly coupled L2, a stride-based prefetcher may observe
access requests from both instruction and data L1 caches. However, the PC
address of an instruction miss gives no relevant training information to the
stride based prefetcher(there is no stride to train). In theses cases, its
better if the L2 stride prefetcher simply reverted back to a simple N-block
ahead prefetcher. This patch enables this option.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
This patch extends the classic prefetcher to work on non-block aligned
addresses. Because the existing prefetchers in gem5 mask off the lower
address bits of cache accesses, many predictable strides fail to be
detected. For example, if a load were to stride by 48 bytes, with 64 byte
cachelines, the current stride based prefetcher would see an access pattern
of 0, 64, 64, 128, 192.... Thus not detecting a constant stride pattern. This
patch fixes this, by training the prefetcher on access and not masking off the
lower address bits.
It also adds the following configuration options:
1) Training/prefetching only on cache misses,
2) Training/prefetching only on data acceses,
3) Optionally tagging prefetches with a PC address.
#3 allows prefetchers to train off of prefetch requests in systems with
multiple cache levels and PC-based prefetchers present at multiple levels.
It also effectively allows a pipelining of prefetch requests (like in POWER4)
across multiple levels of cache hierarchy.
Improves performance on my gem5 configuration by 4.3% for SPECINT and 4.7% for SPECFP (geomean).
gem5 makes the incorrect assumption that by binding a socket, it
effectively has allocated a port. Linux only allocates ports once you call
listen on the given socket, not when you call bind. So even if the port was
free when bind was called, another process (gem5 instance) could race in
between the bind & listen calls and steal the port. In the current code, if
the call to bind fails due to the port being in use (EADDRINUSE), gem5 retries
for a different port. However if listen fails, gem5 just panics. The fix is
testing the return value of listen and re-trying if it was due to EADDRINUSE.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
The patch
(1) removes the redundant writeback argument from findVictim()
(2) fixes the description of access() function
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Note: AArch64 and AArch32 interworking is not supported. If you use an AArch64
kernel you are restricted to AArch64 user-mode binaries. This will be addressed
in a later patch.
Note: Virtualization is only supported in AArch32 mode. This will also be fixed
in a later patch.
Contributors:
Giacomo Gabrielli (TrustZone, LPAE, system-level AArch64, AArch64 NEON, validation)
Thomas Grocutt (AArch32 Virtualization, AArch64 FP, validation)
Mbou Eyole (AArch64 NEON, validation)
Ali Saidi (AArch64 Linux support, code integration, validation)
Edmund Grimley-Evans (AArch64 FP)
William Wang (AArch64 Linux support)
Rene De Jong (AArch64 Linux support, performance opt.)
Matt Horsnell (AArch64 MP, validation)
Matt Evans (device models, code integration, validation)
Chris Adeniyi-Jones (AArch64 syscall-emulation)
Prakash Ramrakhyani (validation)
Dam Sunwoo (validation)
Chander Sudanthi (validation)
Stephan Diestelhorst (validation)
Andreas Hansson (code integration, performance opt.)
Eric Van Hensbergen (performance opt.)
Gabe Black
The CheckerCPU model in pre-v8 code was not checking the
updates to miscellaneous registers due to some methods
for setting misc regs were not instrumented. The v8 patches
exposed this by calling the instrumented misc reg update
methods and then invoking the checker before the main CPU had
updated its misc regs, leading to false positives about
register mismatches. This patch fixes the non-instrumented
misc reg update methods and places calls to the checker in
the proper places in the O3 model.
With ARMv8 support the same misc register id results in accessing different
registers depending on the current mode of the processor. This patch adds
the same orthogonality to the misc register file as the others (int, float, cc).
For all the othre ISAs this is currently a null-implementation.
Additionally, a system variable is added to all the ISA objects.
This patch add support for generating wake-up events in the CPU when an address
that is currently in the exclusive state is hit by a snoop. This mechanism is required
for ARMv8 multi-processor support.
Previously we were casting the result type to the the memory type which
is incorrect for things like dual-memory operations which still return a
single result.
Adds very basic statistics on the number of tag and data accesses within the
cache, which is important for power modelling. For the tags, simply count
the associativity of the cache each time. For the data, this depends on
whether tags and data are accessed sequentially, which is given by a new
parameter. In the parallel case, all data blocks are accessed each time, but
with sequential accesses, a single data block is accessed only on a hit.
This patch enables tracking of cache occupancy per thread along with
ages (in buckets) per cache blocks. Cache occupancy stats are
recalculated on each stat dump.
The probe patch is motivated by the desire to move analytical and trace code
away from functional code. This is achieved by the probe interface which is
essentially a glorified observer model.
What this means to users:
* add a probe point and a "notify" call at the source of an "event"
* add an isolated module, that is being used to carry out *your* analysis (e.g. generate a trace)
* register that module as a probe listener
Note: an example is given for reference in src/cpu/o3/simple_trace.[hh|cc] and src/cpu/SimpleTrace.py
What is happening under the hood:
* every SimObject maintains has a ProbeManager.
* during initialization (src/python/m5/simulate.py) first regProbePoints and
the regProbeListeners is called on each SimObject. this hooks up the probe
point notify calls with the listeners.
FAQs:
Why did you develop probe points:
* to remove trace, stats gathering, analytical code out of the functional code.
* the belief that probes could be generically useful.
What is a probe point:
* a probe point is used to notify upon a given event (e.g. cpu commits an instruction)
What is a probe listener:
* a class that handles whatever the user wishes to do when they are notified
about an event.
What can be passed on notify:
* probe points are templates, and so the user can generate probes that pass any
type of argument (by const reference) to a listener.
What relationships can be generated (1:1, 1:N, N:M etc):
* there isn't a restriction. You can hook probe points and listeners up in a
1:1, 1:N, N:M relationship. They become useful when a number of modules
listen to the same probe points. The idea being that you can add a small
number of probes into the source code and develop a larger number of useful
analysis modules that use information passed by the probes.
Can you give examples:
* adding a probe point to the cpu's commit method allows you to build a trace
module (outputting assembler), you could re-use this to gather instruction
distribution (arithmetic, load/store, conditional, control flow) stats.
Why is the probe interface currently restricted to passing a const reference:
* the desire, initially at least, is to allow an interface to observe
functionality, but not to change functionality.
* of course this can be subverted by const-casting.
What is the performance impact of adding probes:
* when nothing is actively listening to the probes they should have a
relatively minor impact. Profiling has suggested even with a large number of
probes (60) the impact of them (when not active) is very minimal (<1%).
This patch adds observability to the clock period of the clock domains
by including it as a stat.
As a result of adding this, the regressions will be updated in a
separate patch.
Add some values and methods to the request object to track the translation
and access latency for a request and which level of the cache hierarchy responded
to the request.
This patch makes the Clock a TickParamValue just like
Latency/Frequency. There is no longer any need to distinguish it
(originally needed to support multiplication).
This patch fixes a memory leak in the table walker, by ensuring that
the sender state is deleted again if the request packet cannot be
successfully sent.
This patch relaxes the check performed when squashing non-speculative
instructions, as it caused problems with loads that were marked ready,
and then stalled on a blocked cache. The assertion is now allowing
memory references to be non-faulting.
This patch removes an assertion in the simpoint profiling code that
asserts that a previously-seen basic block has the exact same number
of instructions executed as before. This can be false if the basic
block generates aborts or takes interrupts at different locations
within the basic block. The basic block profiling are not affected
significantly as these events are rare in general.
This patch adds a function to the HistStor class for adding two histograms.
This functionality is required for Ruby. It also adds support for printing
histograms in a single line.
The first two levels (L0, L1) are private to the core, the third level (L2)is
possibly shared. The protocol supports clustered designs. For example, one
can have two sets of two cores. Each core has an L0 and L1 cache. There are
two L2 controllers where each set accesses only one of the L2 controllers.
A cluster over here means a set of controllers that can be accessed only by a
certain set of cores. For example, consider a two level hierarchy. Assume
there are 4 L1 controllers (private) and 2 L2 controllers. We can have two
different hierarchies here:
a. the address space is partitioned between the two L2 controllers. Each L1
controller accesses both the L2 controllers. In this case, each L1 controller
is a cluster initself.
b. both the L2 controllers can cache any address. An L1 controller has access
to only one of the L2 controllers. In this case, each L2 controller
along with the L1 controllers that access it, form a cluster.
This patch allows for each controller to have a cluster ID, which is 0 by
default. By setting the cluster ID properly, one can instantiate hierarchies
with clusters. Note that the coherence protocol might have to be changed as
well.
If you successfully export a C++ SimObject method, but try to
invoke it from Python before the C++ object is created, you
get a confusing error that says the attribute does not exist,
making you question whether you successfully exported the
method at all. In reality, your only problem is that you're
calling the method too soon. This patch enhances the error
message to give you a better clue.
Updating the SimObject topology of a cloned hierarchy is a little
dangerous, in that cloning is a "deep copy" and the clone does not
inherit SimObject updates the same way it would inherit scalar
variable assignments.
However, because of various SimObject-valued proxy parameters,
like 'memories', 'clk_domain', and 'system', it turns out that
there are a number of implicit topology changes that happen at
instantiation, which means that these changes are impossible to
avoid. So in order to make cloning systems useful, this error
has to go. Changing it to a warning produces a lot of noise,
so it seems best just to delete it.
This patch provides support for DFS by having ClockedObjects register
themselves with their clock domain at construction time in a member list.
Using this list, a clock domain can update each member's tick to the
curTick() before modifying the clock period.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
In mips architecture, floating point convert instructions use the
FloatConvertOp format defined in src/arch/mips/isa/formats/fp.isa. The type
of the operands in the ISA description file (_sw for signed word, or _sf for
signed float, etc.) is used to create a type for the operand in C++. Then the
operand is converted using the fpConvert() function in src/arch/mips/utility.cc.
If we are converting from a word to a float, and we want to convert 0xffffffff,
we expect -1 to be passed into fpConvert(). Instead, we see MAX_INT passed in.
Then fpConvert() converts _val_ to MAX_INT in single-precision floating point,
and we get the wrong value.
To fix it, the signs of the convert operands are being changed from unsigned to
signed in the MIPS ISA description.
Then, the FloatConvertOp format is being changed to insert a int32_t into the
C++ code instead of a uint32_t.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
This patch fixes couple of bugs in the L2 controller of the mesi cmp
directory protocol.
1. The state MT_I was transitioning to NP on receiving a clean writeback
from the L1 controller. This patch makes it inform the directory controller
about the writeback.
2. The L2 controller was sending the dirty bit to the L1 controller and the
L2 controller used writeback from the L1 controller to update the dirty bit
unconditionally. Now, the L1 controller always assumes that the incoming
data is clean. The L2 controller updates the dirty bit only when the L1
controller writes to the block.
3. Certain unused functions and events are being removed.
This patch replaces max_in_port_rank with the number of inports. The use of
max_in_port_rank was causing spurious re-builds and incorrect initialization
of variables in ruby related regression tests. This was due to the variable
value being used across threads while compiling when it was not meant to be.
Since the number of inports is state machine specific value, this problem
should get solved.
The directory controller should not have the sharer field since there is
only one level 2 cache. Anyway the field was not in use. The owner field
was being used to track the l2 cache version (in case of distributed l2) that
has the cache block under consideration. The information is not required
since the version of the level 2 cache can be obtained from a subset of the
address bits.
Currently statistics are reset after the initial / checkpoint state
has been loaded. But ruby does some checkpoint processing in its
startup() function. So the stats need to be reset after the startup()
function has been called. This patch moves the class to stats.reset()
to achieve this change in functionality.
There is a race between enabling asynchronous IO for a file descriptor
and IO events happening on that descriptor. A SIGIO won't normally be
delivered if an event is pending when asynchronous IO is
enabled. Instead, the signal will be raised the next time there is an
event on the FD. This changeset simulates a SIGIO by setting the
async_io flag when setting up asynchronous IO for an FD. This causes
the main event loop to poll all file descriptors to check for pending
IO. As a consequence of this, the old SIGALRM hack should no longer be
needed and is therefore removed.
The PollEvent class dynamically installs a SIGIO and SIGALRM handler
when a file handler is registered. Most signal handlers currently get
registered in the initSignals() function. This changeset moves the
SIGIO/SIGALRM handlers to initSignals() to live with the other signal
handlers. The original code installs SIGIO and SIGALRM with the
SA_RESTART option to prevent syscalls from returning EINTR. This
changeset consistently uses this flag for all signal handlers to
ensure that other signals that trigger asynchronous behavior (e.g.,
statistics dumping) do not cause undesirable EINTR returns.
The performance counting framework in Linux 3.2 and onwards supports
an attribute to exclude events generated by the host when running
KVM. Setting this attribute allows us to get more reliable
measurements of the guest machine. For example, on a highly loaded
system, the instruction counts from the guest can be severely
distorted by the host kernel (e.g., by page fault handlers).
This changeset introduces a check for the attribute and enables it in
the KVM CPU if present.
This patch adds support for simulating with multiple threads, each of
which operates on an event queue. Each sim object specifies which eventq
is would like to be on. A custom barrier implementation is being added
using which eventqs synchronize.
The patch was tested in two different configurations:
1. ruby_network_test.py: in this simulation L1 cache controllers receive
requests from the cpu. The requests are replied to immediately without
any communication taking place with any other level.
2. twosys-tsunami-simple-atomic: this configuration simulates a client-server
system which are connected by an ethernet link.
We still lack the ability to communicate using message buffers or ports. But
other things like simulation start and end, synchronizing after every quantum
are working.
Committed by: Nilay Vaish
the current implementation of the fetch buffer in the o3 cpu
is only allowed to be the size of a cache line. some
architectures, e.g., ARM, have fetch buffers smaller than a cache
line, see slide 22 at:
http://www.arm.com/files/pdf/at-exploring_the_design_of_the_cortex-a15.pdf
this patch allows the fetch buffer to be set to values smaller
than a cache line.
This patch fixes an issue in the checker CPU register indexing. The
code will not even compile using LTO as deep inlining causes the used
index to be outside the array bounds.
The output from the switcheroo tests is voluminous and
(because it includes timestamps) highly sensitive to
minor changes, leading to extremely large updates to the
reference outputs. This patch addresses this problem
by suppressing output from the tests. An internal
parameter can be set to enable the output. Wiring that
up to a command-line flag (perhaps even the rudimantary
-v/-q options in m5/main.py) is left for future work.
This patch fixes a number of stats accounting issues in the DRAM
controller. Most importantly, it separates the system interface and
DRAM interface so that it is clearer what the actual DRAM bandwidth
(and consequently utilisation) is.
This patch unifies the request selection across read and write queues
for FR-FCFS scheduling policy. It also fixes the request selection
code to prioritize the row hits present in the request queues over the
selection based on earliest bank availability.
This patch adds a basic adaptive version of the open-page policy that
guides the decision to keep open or close by looking at the contents
of the controller queues. If no row hits are found, and bank conflicts
are present, then the row is closed by means of an auto
precharge. This is a well-known technique that should improve
performance in most use-cases.
This patch removes the untimed while loop in the write scheduling
mechanism and now schedule commands taking into account the minimum
timing constraint. It also introduces an optimization to track write
queue size and switch from writes to reads if the number of write
requests fall below write low threshold.
This patch adds the tRRD parameter to the DRAM controller. With the
recent addition of the actAllowedAt member for each bank, this
addition is trivial.
This patch changes the tXAW constraint so that it is enforced per rank
rather than globally for all ranks in the channel. It also avoids
using the bank freeAt to enforce the activation limit, as doing so
also precludes performing any column or row command to the
DRAM. Instead the patch introduces a new variable actAllowedAt for the
banks and use this to track when a potential activation can occur.
This patch fixes the controller when a write threshold of 100% is
used. Earlier for 100% write threshold no data is written to memory
as writes never get triggered since this corner case is not
considered.
This patch changes the FCFS bit of FR-FCFS such that requests that
target the earliest available bank are picked first (as suggested in
the original work on FR-FCFS by Rixner et al). To accommodate this we
add functionality to identify a bank through a one-dimensional
identifier (bank id). The member names of the DRAMPacket are also
update to match the style guide.
This patch changes the time the controller is woken up to take the
next scheduling decisions. tRAS is now handled in estimateLatency and
doDRAMAccess and we do not need to worry about it at scheduling
time. The earliest we need to wake up is to do a pre-charge, row
access and column access before the bus becomes free for use.
This patch adds an explicit tRAS parameter to the DRAM controller
model. Previously tRAS was, rather conservatively, assumed to be tRCD
+ tCL + tRP. The default values for tRAS are chosen to match the
previous behaviour and will be updated later.
This patch changes the name the command-line options related to debug
output to all start with "debug" rather than being a mix of that and
"trace". It also makes it clear that the breakpoint time is specified
in ticks and not in cycles.
Thumb2 ARM kernels may access the TEEHBR via thumbee_notifier
in arch/arm/kernel/thumbee.c. The Linux kernel code just seems
to be saving and restoring the register. This patch adds support
for the TEEHBR cp14 register. Note, this may be a special case
when restoring from an image that was run on a system that
supports ThumbEE.
The VE motherboard provides a set of system control registers through which
various motherboard and coretile registers are accessed. Voltage regulators and
oscillator (DLL/PLL) config are examples. These registers must be impleted to
boot Linux 3.9+ kernels.
Newer linux kernels and distros exercise more functionality in the IDE device
than previously, exposing 2 races. The first race is the handling of aborted
DMA commands would immediately report the device is ready back to the kernel
and cause already in flight commands to assert the simulator when they returned
and discovered an inconsitent device state. The second race was due to the
Status register not being handled correctly, the interrupt status bit would get
stuck at 1 and the driver eventually views this as a bad state and logs the
condition to the terminal. This patch fixes these two conditions by making the
device handle aborted commands gracefully and properly handles clearing the
interrupt status bit in the Status register.
SimObjectVector objects did not provide the same interface to
the _parent attribute through get_parent() like a normal
SimObject. It also handled assigning a _parent incorrectly
if objects in a SimObjectVector were changed post-creation,
leading to errors later when the simulator tried to execute.
This patch fixes these two omissions.
SimLoopExitEvents weren't serialized by default. Some benchmarks
utilize a delayed m5 exit pseudo op call to terminate the simulation
and this event was lost when resuming from a checkpoint generated
after the pseudo op call. This patch adds the capability to serialize
the SimLoopExitEvents and enable serialization for m5_exit and m5_fail
pseudo ops by default. Does not affect other generic
SimLoopExitEvents.
Fix a problem in the O3 CPU for instructions that are both
memory loads and memory barriers (e.g. load acquire) and
to uncacheable memory. This combination can confuse the
commit stage into commitng an instruction that hasn't
executed and got it's value yet. At the same time refactor
the code slightly to remove duplication between two of
the cases.
This patch adds missing initializations of the SenderMachine field of
out_msg's when thery are created in the L2 cache controller of the
MOESI_CMP_directory coherence protocol. When an out_msg is created and this
field is left uninitialized, it is set to the default value MachineType_NUM.
This causes a panic in the MachineType_to_string function when gem5 is
executed with the Ruby debug flag on and it tries to print the message.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
This patch fixes a problem where in Garnet, the enqueue time in the
VCallocator and the SWallocator which is of type Cycles was being stored
inside a variable with int type.
This lead to a known problem restoring checkpoints with garnet & the fixed
pipeline enabled. That value was really big and didn't fit in the variable
overflowing it, therefore some conditions on the VC allocation stage & the
SW allocation stage were not met and the packets didn't advance through the
network, leading to a deadlock panic right after the checkpoint was restored.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
The CoherentBus eventually got virtual methods for its interface. The
"virtuality" of the CoherentBus, however, comes already from the virtual
interface of the bus' ports. There is no need to add another layer of virtual
functions, here.
The underlying assumption that all PPIs must be edge-triggered is
strained when the architected timers and VGIC interfaces make
level-behaviour observable. For example, a virtual timer interrupt
'goes away' when the hypervisor is entered and the vtimer is disabled;
this requires a PPI to be de-activated.
The new method simply clears the interrupt pending state.
The ethernet address param tries to convert a hexadecimal
string using int() in python, which defaults to base 10,
need to specify base 16 in this case.
SimObjects are expected to only generate one port reference per
port belonging to them. There is a subtle bug with using "not"
here as a VectorPort is seen as not having a reference if it is
either None or empty as per Python docs sec 9.9 for Standard operators.
Intended behavior is to only check if we have not created the reference.
There is an option to enable/disable all framebuffer dumps, but the
last frame always gets dumped in the run folder with no other way to
disable it. These files can add up very quickly running many experiments.
This patch adds an option to disable them. The default behavior
remains unchanged.
LSQSenderState represents the LQ/SQ index using uint8_t, which supports up to
256 entries (including the sentinel entry). Sending packets to memory with a
higher index than 255 truncates the index, such that the response matches the
wrong entry. For instance, this can result in a deadlock if a store completion
does not clear the head entry.
This change fixes an issue in the O3 CPU where an uncachable instruction
is attempted to be executed before it reaches the head of the ROB. It is
determined to be uncacheable, and is replayed, but a PanicFault is attached
to the instruction to make sure that it is properly executed before
committing. If the TLB entry it was using is replaced in the interveaning
time, the TLB returns a delayed translation when the load is replayed at
the head of the ROB, however the LSQ code can't differntiate between the
old fault and the new one. If the translation isn't complete it can't
be faulting, so clear the fault.
This patch changes the ProtoBuf builder such that the generated source
and header is placed in the build directory of the proto file. This
was previously not the case for the directories included as EXTRAS. To
make this work, we also ensure that the build directory for the EXTRAS
are added to the include path (which does not seem to automatically be
the case).
When handling IPR accesses in doMMIOAccess, the KVM CPU used
clockEdge() to convert between cycles and ticks. This is incorrect
since doMMIOAccess is supposed to return a latency in ticks rather
than when the access is done. This changeset fixes this issue by
returning clockPeriod() * ipr_delay instead.
Get rid of non-deterministic "stats" in ruby.stats output
such as time & date of run, elapsed & CPU time used,
and memory usage. These values cause spurious
miscomparisons when looking at output diffs (though
they don't affect regressions, since the regressions
pass/fail status currently ignores ruby.stats entirely).
Most of this information is already captured in other
places (time & date in stdout, elapsed time & mem usage
in stats.txt), where the regression script is smart
enough to filter it out. It seems easier to get rid of
the redundant output rather than teaching the
regression tester to ignore the same information in
two different places.
Convert condition code registers from being specialized
("pseudo") integer registers to using the recently
added CC register class.
Nilay Vaish also contributed to this patch.
Restructured rename map and free list to clean up some
extraneous code and separate out common code that can
be reused across different register classes (int and fp
at this point). Both components now consist of a set
of Simple* objects that are stand-alone rename map &
free list for each class, plus a Unified* object that
presents a unified interface across all register
classes and then redirects accesses to the appropriate
Simple* object as needed.
Moved free list initialization to PhysRegFile to better
isolate knowledge of physical register index mappings
to that class (and remove the need to pass a number
of parameters to the free list constructor).
Causes a small change to these stats:
cpu.rename.int_rename_lookups
cpu.rename.fp_rename_lookups
because they are now categorized on a per-operand basis
rather than a per-instruction basis.
That is, an instruction with mixed fp/int/misc operand
types will have each operand categorized independently,
where previously the lookup was categorized based on
the instruction type.
Make these names more meaningful.
Specifically, made these substitutions:
s/FP_Base_DepTag/FP_Reg_Base/g;
s/Ctrl_Base_DepTag/Misc_Reg_Base/g;
s/Max_DepTag/Max_Reg_Index/g;
Clean up and add some consistency to the *_Base_DepTag
constants as well as some related register constants:
- Get rid of NumMiscArchRegs, TotalArchRegs, and TotalDataRegs
since they're never used and not always defined
- Set FP_Base_DepTag = NumIntRegs when possible (i.e.,
every case except x86)
- Set Ctrl_Base_DepTag = FP_Base_DepTag + NumFloatRegs
(this was true before, but wasn't always expressed
that way)
- Drastically reduce the number of arbitrary constants
appearing in these calculations
It had a bunch of fields (and associated constructor
parameters) thet it didn't really use, and the array
initialization was needlessly verbose.
Also just hardwired the getReg() method to aleays
return true for misc regs, rather than having an array
of bits that we always kept marked as ready.
No need for PhysRegFile to be a template class, or
have a pointer back to the CPU. Also made some methods
for checking the physical register type (int vs. float)
based on the phys reg index, which will come in handy later.
The previous patch introduced a RegClass enum to clean
up register classification. The inorder model already
had an equivalent enum (RegType) that was used internally.
This patch replaces RegType with RegClass to get rid
of the now-redundant code.
Move from a poorly documented scheme where the mapping
of unified architectural register indices to register
classes is hardcoded all over to one where there's an
enum for the register classes and a function that
encapsulates the mapping.
ASI_BITS in the Request object were originally used to store a memory
request's ASI on SPARC. This is not the case any more since other ISAs
use the ASI bits to store architecture-dependent information. This
changeset renames the ASI_BITS to ARCH_BITS which better describes
their use. Additionally, the getAsi() accessor is renamed to
getArchFlags().
Using address bit 63 to identify generic IPRs caused problems on
SPARC, where IPRs are heavily used. This changeset redefines how
generic IPRs are identified. Instead of using bit 63, we now use a
separate flag (GENERIC_IPR) a memory request.
There is a potential race between enabling asynchronous IO and
selecting the target for the SIGIO signal. This changeset move the
F_SETOWN call to before the F_SETFL call that enables SIGIO
delivery. This ensures that signals are always sent to the correct
process.
This changset adds calls to the service the instruction event queues
that accidentally went missing from commit [0063c7dd18ec]. The
original commit only included the code needed to schedule instruction
stops from KVM and missed the functionality to actually service the
events.
In order to support m5ops in virtualized environments, we need to use
a memory mapped interface. This changeset adds support for that by
reserving 0xFFFF0000-0xFFFFFFFF and mapping those to the generic IPR
interface for m5ops. The mapping is done in the
X86ISA::TLB::finalizePhysical() which means that it just works for all
of the CPU models, including virtualized ones.
In order to support m5ops on virtualized CPUs, we need to either
intercept hypercall instructions or provide a memory mapped m5ops
interface. Since KVM does not normally pass the results of hypercalls
to userspace, which makes that method unfeasible. This changeset
introduces support for m5ops using memory mapped mmapped IPRs. This is
implemented by adding a class of "generic" IPRs which are handled by
architecture-independent code. Such IPRs always have bit 63 set and
are handled by handleGenericIprRead() and
handleGenericIprWrite(). Platform specific impementations of
handleIprRead and handleIprWrite should use
GenericISA::isGenericIprAccess to determine if an IPR address should
be handled by the generic code instead of the architecture-specific
code. Platforms that don't need their own IPR support can reuse
GenericISA::handleIprRead() and GenericISA::handleIprWrite().
The x87 FPU supports three floating point formats: 32-bit, 64-bit, and
80-bit floats. The current gem5 implementation supports 32-bit and
64-bit floats, but only works correctly for 64-bit floats. This
changeset fixes the 32-bit float handling by correctly loading and
rounding (using truncation) 32-bit floats instead of simply truncating
the bit pattern.
80-bit floats are loaded by first loading the 80-bits of the float to
two temporary integer registers. A micro-op (cvtint_fp80) then
converts the contents of the two integer registers to the internal FP
representation (double). Similarly, when storing an 80-bit float,
there are two conversion routines (ctvfp80h_int and cvtfp80l_int) that
convert an internal FP register to 80-bit and stores the upper 64-bits
or lower 32-bits to an integer register, which is the written to
memory using normal integer stores.
X87 store instructions typically loads and pops the top value of the
stack and stores it in memory. The current implementation pops the
stack at the same time as the floating point value is loaded to a
temporary register. This will corrupt the state of the x87 stack if
the store fails. This changeset introduces a pop87 micro-instruction
that pops the stack and uses this instruction in the affected
macro-instructions to pop the stack after storing the value to memory.
Instruction events are currently ignored when executing in KVM. This
changeset adds support for triggering KVM exits based on instruction
counts using hardware performance counters. Depending on the
underlying performance counter implementation, there might be some
inaccuracies due to instructions being counted in the host kernel when
entering/exiting KVM.
Due to limitations/bugs in Linux's performance counter interface, we
can't reliably change the period of an overflow counter. We work
around this issue by detaching and reattaching the counter if we need
to reconfigure it.
This changeset adds support for synchronizing the FPU and SIMD state
of a virtual x86 CPU with gem5. It supports both the XSave API and the
KVM_(GET|SET)_FPU kernel API. The XSave interface can be disabled
using the useXSave parameter (in case of kernel
issues). Unfortunately, KVM_(GET|SET)_FPU interface seems to be buggy
in some kernels (specifically, the MXCSR register isn't always
synchronized), which means that it might not be possible to
synchronize MXCSR on old kernels without the XSave interface.
This changeset depends on the __float80 type in gcc and might not
build using llvm.
The x87 FPU on x86 supports extended floating point. We currently
handle all floating point on x86 as double and don't support 80-bit
loads/stores. This changeset add a utility function to load and
convert 80-bit floats to doubles (loadFloat80) and another function to
store doubles as 80-bit floats (storeFloat80). Both functions use
libfputils to do the conversion in software. The functions are
currently not used, but are required to handle floating point in KVM
and to properly support all x87 loads/stores.
There are cases when the segment registers in gem5 are not compatible
with VMX. This changeset works around all known such issues. Specifically:
* The accessed bits in CS, SS, DD, ES, FS, GS are forced to 1.
* The busy bit in TR is forced to 1.
* The protection level of SS is forced to the same protection level as
CS. The difference /seems/ to be caused by a bug in gem5's x86
implementation.
This changeset adds support for KVM on x86. Full support is split
across a number of commits since some features are relatively
complex. This changeset includes support for:
* Integer state synchronization (including segment regs)
* CPUID (gem5's CPUID values are inserted into KVM)
* x86 legacy IO (remapped and handled by gem5's memory system)
* Memory mapped IO
* PCI
* MSRs
* State dumping
Most of the functionality is fairly straight forward. There are some
quirks to support PCI enumerations since this is done in the TLB(!) in
the simulated CPUs. We currently replicate some of that code.
Unlike the ARM implementation, the x86 implementation of the virtual
CPU does not use the cycles hardware counter. KVM on x86 simulates the
time stamp counter (TSC) in the kernel. If we just measure host cycles
using perfevent, we might end up measuring a slightly different number
of cycles. If we don't get the cycle accounting right, we might end up
rewinding the TSC, with all kinds of chaos as a result.
An additional feature of the KVM CPU on x86 is extended state
dumping. This enables Python scripts controlling the simulator to
request dumping of a subset of the processor state. The following
methods are currenlty supported:
* dumpFpuRegs
* dumpIntRegs
* dumpSpecRegs
* dumpDebugRegs
* dumpXCRs
* dumpXSave
* dumpVCpuEvents
* dumpMSRs
Known limitations:
* M5 ops are currently not supported.
* FPU synchronization is not supported (only affects CPU switching).
Both of the limitations will be addressed in separate commits.
The KVM base class incorrectly assumed that handleIprRead and
handleIprWrite both return ticks. This is not the case, instead they
return cycles. This changeset converts the returned cycles to ticks
when handling IPR accesses.
There is a possibility that the timespec used to arm a timer becomes
zero if the number of ticks used when arming a timer is close to the
resolution of the timer. Due to the semantics of POSIX timers, this
actually disarms the timer. This changeset fixes this issue by
eliminating the rounding error (we always round away from zero
now). It also reuses the minimum number of cycles, which were
previously only used for cycle-based timers, to calculate a more
useful resolution.
This changeset adds the convX87XTagsToTags() and convX87TagsToXTags()
which convert between the tag formats in the FTW register and the
format used in the xsave area. The conversion from to the x87 FTW
representation is currently loses some information since it does not
reconstruct the valid/zero/special flags which are not included in the
xsave representation.
The order between updating and using arg_num in
PseudoInst::pseudoInst() is currently undefined. This changeset
explicitly updates arg_num after it has been used to extract an
argument.
--HG--
extra : rebase_source : 67c46dc3333d16ce56687ee8aea41ce6c6d133bb
This patch ensures that a dequeue event is not scheduled if the memory
controller is waiting for a retry already. Without this check it is
possible for the controller to attempt sending something whilst
already having one packet that is in retry, thus causing the bus to
have an assertion failure.
This patch fixes an issue which prevented gem5 from running when built
using swig 2.0.9 and 2.0.10. The generated event.py tried to import
m5.internal which in turn relied on importing event. This patch seems
to fix the problem, and so far has not caused any other issues.
In order to support hardware virtualization, we need to be able to
check if there are any interrupts pending irregardless of the
rflags.intf value. This changeset adds the checkInterruptsRaw() method
to the x86 interrupt control. It returns true if there are pending
interrupts that can be delivered as soon as the CPU is ready for
interrupt delivery.
The Topology source sets up input and output buffers for each of the external
nodes of a topology by indexing on Ruby's generated controller unique IDs.
These unique IDs are found by adding the MachineType_base_number to the version
number of each controller (see any generated *_Controller.cc - init() calls
getToNetQueue and getFromNetQueue using m_version + base). However, the
Topology object used the cntrl_id - which is required to be unique across all
controllers - to index the controllers list as they are being connected to
their input and output buffers. If the cntrl_ids did not match the Ruby unique
ID, the throttles end up connected to incorrectly indexed nodes in the network,
resulting in packets traversing incorrect network paths. This patch fixes the
Topology indexing scheme by using the Ruby unique ID to match that of the
SimpleNetwork buffer vectors.
Previously, the LSQ would instantiate MaxThreads LSQUnits in the body of it's
object, but it would only initialize numThreads LSQUnits as specified by the
user. This had the effect of leaving some LSQUnits uninitialized when the
number of threads was less than MaxThreads, and when adding statistics to the
LSQUnit that must be initialized, this caused the stats initialization check to
fail. By dynamically instantiating LSQUnits, they are all initialized and this
avoids uninitialized LSQUnits from floating around during runtime.
The previous changeset (9863:9483739f83ee) used STL vector containers to
dynamically allocate stats in the Ruby SimpleNetwork, Switch and Throttle. For
gcc versions before at least 4.6.3, this causes the standard vector allocator
to call Stats copy constructors (a no-no, since stats should be allocated in
the body of each SimObject instance). Since the size of these stats arrays is
known at compile time (NOTE: after code generation), this patch changes their
allocation to be static rather than using an STL vector.
This patch adds the config ini string as a tooltip that can be
displayed in most browsers rendering the resulting svg. Certain
characters are modified for HTML output.
Tested on chrome and firefox.
This patch is adding a splash of colour to the dot output to make it
easier to distinguish objects of different types. As a bonus, the
pastel-colour palette also makes the output look like a something from
the 21st century.
This patch adds the class name to the label, creates some more space
by increasing the rank separation, and additionally outputs the graph
as an editable SVG in addition to the PDF.
This patch makes it possible to once again build gem5 without any
ISA. The main purpose is to enable work around the interconnect and
memory system without having to build any CPU models or device models.
The regress script is updated to include the NULL ISA target. Currently
no regressions make use of it, but all the testers could (and perhaps
should) transition to it.
--HG--
rename : build_opts/NOISA => build_opts/NULL
rename : src/arch/noisa/SConsopts => src/arch/null/SConsopts
rename : src/arch/noisa/cpu_dummy.hh => src/arch/null/cpu_dummy.hh
rename : src/cpu/intr_control.cc => src/cpu/intr_control_noisa.cc
The branch predictor is guarded by having either the in-order or
out-of-order CPU as one of the available CPU models and therefore
should not be used in the BaseCPU. This patch moves the parameter to
the relevant CPU classes.
This patch is a first step to getting NOISA working again. A number of
redundant includes make life more difficult than it has to be and this
patch simply removes them. There are also some redundant forward
declarations removed.
This patch moves the system virtual port proxy to the Alpha system
only to make the resurrection of the NOISA slightly less
painful. Alpha is the only ISA that is actually using it.
This patch changes the SConscript to build gem5 with libc++ on OSX as
the conventional libstdc++ does not have the C++11 constructs that the
current code base makes use of (e.g. std::forward).
Since this was the last use of the transitional TR1, the unordered map
and set header can now be simplified as well.
This patch updates the stats to reflect the: 1) addition of the
internal queue in SimpleMemory, 2) moving of the memory class outside
FSConfig, 3) fixing up of the 2D vector printing format, 4) specifying
burst size and interface width for the DRAM instead of relying on
cache-line size, 5) performing merging in the DRAM controller write
buffer, and 6) fixing how idle cycles are counted in the atomic and
timing CPU models.
The main reason for bundling them up is to minimise the changeset
size.
Added a couple missing updates to the notIdleFraction stat. Without
these, it sometimes gives a (not) idle fraction that is greater than 1
or less than 0.
This patch adds support for specifying multi-channel memory
configurations on the command line, e.g. 'se/fs.py
--mem-type=ddr3_1600_x64 --mem-channels=4'. To enable this, it
enhances the functionality of MemConfig and moves the existing
makeMultiChannel class method from SimpleDRAM to the support scripts.
The se/fs.py example scripts are updated to make use of the new
feature.
This patch changes the default parameter value of conf_table_reported
to match the common case. It also simplifies the regression and config
scripts to reflect this change.
This patch addresses an issue with trace playback in the TrafficGen
where the trace was reset but the header was not read from the trace
when a captured trace was played back for a second time. This resulted
in parsing errors as the expected message was not found in the trace
file.
The header check is moved to an init funtion which is called by the
constructor and when the trace is reset. This ensures that the trace
header is read each time when the trace is replayed.
This patch also addresses a small formatting issue in a panic.
This patch changes the data structure used for the DRAM read, write
and response queues from an STL list to deque. This optimisation is
based on the observation that the size is small (and fixed), and that
the structures are frequently iterated over in a linear fashion.
This patch implements basic write merging in the DRAM to avoid
redundant bursts. When a new access is added to the queue it is
compared against the existing entries, and if it is either
intersecting or immediately succeeding/preceeding an existing item it
is merged.
There is currently no attempt made at avoiding iterating over the
existing items in determining whether merging is possible or not.
This patch gets rid of bytesPerCacheLine parameter and makes the DRAM
configuration separate from cache line size. Instead of
bytesPerCacheLine, we define a parameter for the DRAM called
burst_length. The burst_length parameter shows the length of a DRAM
device burst in bits. Also, lines_per_rowbuffer is replaced with
device_rowbuffer_size to improve code portablity.
This patch adds a burst length in beats for each memory type, an
interface width for each memory type, and the memory controller model
is extended to reason about "system" packets vs "dram" packets and
assemble the responses properly. It means that system packets larger
than a full burst are split into multiple dram packets.
This patch modifies the SimpleTimingCPU drain check to also consider
the fetch event. Previously, there was an assumption that there is
never a fetch event scheduled if the CPU is not executing
microcode. However, when a context is activated, a fetch even is
scheduled, and microPC() is zero.
This patch adds a check to the quiesce operation to ensure that the
CPU does not suspend itself when there are unmasked interrupts
pending. Without this patch there are corner cases when the CPU gets
an interrupt before the quiesce is executed and then never wakes up
again.
This patch addresses an issue with the text-based stats output which
resulted in Vector2D stats being printed without subnames in the event
that one of the dimensions was of length 1.
This patch also fixes the total printing for the 2D vector. Previously
totals were printed without explicitly stating that a total was being
printed. This has been rectified in this patch.
This patch adds the notion of voltage domains, and groups clock
domains that operate under the same voltage (i.e. power supply) into
domains. Each clock domain is required to be associated with a voltage
domain, and the latter requires the voltage to be explicitly set.
A voltage domain is an independently controllable voltage supply being
provided to section of the design. Thus, if you wish to perform
dynamic voltage scaling on a CPU, its clock domain should be
associated with a separate voltage domain.
The current implementation of the voltage domain does not take into
consideration cases where there are derived voltage domains running at
ratio of native voltage domains, as with the case where there can be
on-chip buck/boost (charge pumps) voltage regulation logic.
The regression and configuration scripts are updated with a generic
voltage domain for the system, and one for the CPUs.
This patch adds a packet queue in SimpleMemory to avoid using the
packet queue in the port (and thus have no involvement in the flow
control). The port queue was bound to 100 packets, and as the
SimpleMemory is modelling both a controller and an actual RAM, it
potentially has a large number of packets in flight. There is
currently no limit on the number of packets in the memory controller,
but this could easily be added in a follow-on patch.
As a result of the added internal storage, the functional access and
draining is updated. Some minor cleaning up and renaming has also been
done.
The memtest regression changes as a result of this patch and the stats
will be updated.
This patch fixes a bug in the O3 fetch stage that was introduced when
the cache line size was moved to the system. By mistake, the
initialisation and resetting of the fetch stage was merged and put in
the constructor. The resetting is now re-added where it should be.
Some of the code in StateMachine.py file is added to all the controllers and
is independent of the controller definition. This code is being moved to the
AbstractController class which is the parent class of all controllers.
This patch adds checkpointing support to x86 tlb. It upgrades the
cpt_upgrader.py script so that previously created checkpoints can
be updated. It moves the checkpoint version to 6.
This patch removes the notion of a peer block size and instead sets
the cache line size on the system level.
Previously the size was set per cache, and communicated through the
interconnect. There were plenty checks to ensure that everyone had the
same size specified, and these checks are now removed. Another benefit
that is not yet harnessed is that the cache line size is now known at
construction time, rather than after the port binding. Hence, the
block size can be locally stored and does not have to be queried every
time it is used.
A follow-on patch updates the configuration scripts accordingly.
Instead of relying on derived classes explicitly assigning
to the BasicPioDevice pioSize field, require them to pass
a size value in to the constructor.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
PciDev and IntDev stuck out as the only device classes that
ended in 'Dev' rather than 'Device'. This patch takes care
of that inconsistency.
Note that you may need to delete pre-existing files matching
build/*/python/m5/internal/param_* as scons does not pick up
indirect dependencies on imported python modules when generating
params, and the PciDev -> PciDevice rename takes place in a
file (dev/Device.py) that gets imported quite a bit.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
It was confusing having an AmbaDev namespace along with an
AmbaDevice class. The namespace stuff is now moved in to
a new base AmbaDevice class, which is a mixin for classes
AmbaPioDevice (the former AmbaDevice) and AmbaDmaDevice
to provide the readId function as an inherited member function.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
A couple of devices that have single fixed memory mapped regions
were not derived from BasicPioDevice, when that's exactly
the functionality that BasicPioDevice provides. This patch
gets rid of a little bit of redundant code by making those
devices actually do so.
Also fixed the weird case of X86ISA::Interrupts, where
the class already did derive from BasicPioDevice but
didn't actually use all the features it could have.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
This code seems not to be of any use now. There is no path in the simulator
that allows for reconfiguring the network. A better approach would be to
take a checkpoint and start the simulation from the checkpoint with the new
configuration.
This patch reorganizes the cache tags to allow more flexibility to
implement new replacement policies. The base tags class is now a
clocked object so that derived classes can use a clock if they need
one. Also having deriving from SimObject allows specialized Tag
classes to be swapped in/out in .py files.
The cache set is now templatized to allow it to contain customized
cache blocks with additional informaiton. This involved moving code to
the .hh file and removing cacheset.cc.
The statistics belonging to the cache tags are now including ".tags"
in their name. Hence, the stats need an update to reflect the change
in naming.
This patch removes the multiplication operator support for Clock
parameters as this functionality is now achieved by creating derived
clock domains.
Nate, this one is for you.
This patch adds the notion of source- and derived-clock domains to the
ClockedObjects. As such, all clock information is moved to the clock
domain, and the ClockedObjects are grouped into domains.
The clock domains are either source domains, with a specific clock
period, or derived domains that have a parent domain and a divider
(potentially chained). For piece of logic that runs at a derived clock
(a ratio of the clock its parent is running at) the necessary derived
clock domain is created from its corresponding parent clock
domain. For now, the derived clock domain only supports a divider,
thus ensuring a lower speed compared to its parent. Multiplier
functionality implies a PLL logic that has not been modelled yet
(create a separate clock instead).
The clock domains should be used as a mechanism to provide a
controllable clock source that affects clock for every clocked object
lying beneath it. The clock of the domain can (in a future patch) be
controlled by a handler responsible for dynamic frequency scaling of
the respective clock domains.
All the config scripts have been retro-fitted with clock domains. For
the System a default SrcClockDomain is created. For CPUs that run at a
different speed than the system, there is a seperate clock domain
created. This domain incorporates the CPU and the associated
caches. As before, Ruby runs under its own clock domain.
The clock period of all domains are pre-computed, such that no virtual
functions or multiplications are needed when calling
clockPeriod. Instead, the clock period is pre-computed when any
changes occur. For this to be possible, each clock domain tracks its
children.
This patch adds a 'sys_clock' command-line option and use it to assign
clocks to the system during instantiation.
As part of this change, the default clock in the System class is
removed and whenever a system is instantiated a system clock value
must be set. A default value is provided for the command-line option.
The configs and tests are updated accordingly.
This patch removes the explicit setting of the clock period for
certain instances of CoherentBus, NonCoherentBus and IOCache where the
specified clock is same as the default value of the system clock. As
all the values used are the defaults, there are no performance
changes. There are similar cases where the toL2Bus is set to use the
parent CPU clock which is already the default behaviour.
The main motivation for these simplifications is to ease the
introduction of clock domains.
This patch does a bit of tidying up in the bridge code, adding const
where appropriate and also removing redundant checks and adding a few
new ones.
There are no changes to the behaviour of any regressions.
This patch fixes the CommMonitor local variable names, and also
introduces a variable to capture if it expects to see a response. The
latter check considers both needsResponse and memInhibitAsserted.
This patch changes the IEW drain check to include the FU pool as there
can be instructions that are "stored" in FU completion events and thus
not covered by the existing checks. With this patch, we simply include
a check to see if all the FUs are considered non-busy in the next
tick.
Without this patch, the pc-switcheroo-full regression fails after
minor changes to the cache timing (aligning to clock edge).
This patch fixes an outstanding issue in the cache timing calculations
where an atomic access returned a time in Cycles, but the port
forwarded it on as if it was in Ticks.
A separate patch will update the regression stats.
This patch fixes a bug in the granularity calculation. For example, if
the high bit is 6 (counting from 0) and we have one interleaving bit,
then the granularity is now 2 ** (6 - 1 + 1) = 64.
This patch changes the updards snoop packet to avoid allocating and
later deleting it. As the code executes in 0 time and the lifetime of
the packet does not extend beyond the block there is no reason to heap
allocate it.
This patch removes the printing of the SparseHist total in the
stats.txt output file. This has been removed as a sparse histogram has
no total, and therefore this was printing out the value of a
non-local, unrelated variable.
This patch adds separate actions for requests that missed in the local cache
and messages were sent out to get the requested line. These separate actions
are required for differentiating between the hit and miss latencies in the
statistics collected.
This patch adds separate actions for requests that missed in the local cache
and messages were sent out to get the requested line. These separate actions
are required for differentiating between the hit and miss latencies in the
statistics collected.
The patch started of with removing the global variables from the profiler for
profiling the miss latency of requests made to the cache. The corrresponding
histograms have been moved to the Sequencer. These are combined together when
the histograms are printed. Separate histograms are now maintained for
tracking latency of all requests together, of hits only and of misses only.
A particular set of histograms used to use the type GenericMachineType defined
in one of the protocol files. This patch removes this type. Now, everything
that relied on this type would use MachineType instead. To do this, SLICC has
been changed so that multiple machine types can be declared by a controller
in its preamble.
This patch removes the following three files: RubySlicc_Profiler.sm,
RubySlicc_Profiler_interface.cc and RubySlicc_Profiler_interface.hh.
Only one function prototyped in the file RubySlicc_Profiler.sm. Rest of the
code appearing in any of these files is not in use. Therefore, these files
are being removed.
That one single function, profileMsgDelay(), is being moved to the protocol
files where it is in use. If we need any of these deleted functions, I think
the right way to make them visible is to have the AbstractController class in
a .sm and let the controller state machine inherit from this class. The
AbstractController class can then have the prototypes of these profiling
functions in its definition.
2013-06-24 08:59:08 -05:00
Joel Hestness ext:(%2C%20Nilay%20Vaish%20%3Cnilay%40cs.wisc.edu%3E)
The m_size variable attempted to track m_prio_heap.size(), but it did so
incorrectly due to the functions reanalyzeMessages and reanalyzeAllMessages().
Since this variable is intended to track m_prio_heap.size(), we can simply
replace instances where m_size is referenced with m_prio_heap.size(), which
has the added bonus of removing the need for m_size.
Note: This patch also removes an extraneous DPRINTF format string designator
from reanalyzeAllMessages()
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Previously, .sm files were allowed to use the same name for a type and a
variable. This is unnecessarily confusing and has some bad side effects, like
not being able to declare later variables in the same scope with the same type.
This causes the compiler to complain and die on things like Address Address.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Change all occurrances of Address as a variable name to instead use Addr.
Address is an allowed name in slicc even when Address is also being used as a
type, leading to declarations of "Address Address". While this works, it
prevents adding another field of type Address because the compiler then thinks
Address is a variable name, not type.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
The current implementation of the x87 never updates the x87 tag
word. This is currently not a big issue since the simulated x87 never
checks for stack overflows, however this becomes an issue when
switching between a virtualized CPU and a simulated CPU. This
changeset adds support, which is enabled by default, for updating the
tag register to every floating point microop that updates the stack
top using the spm mechanism.
The new tag words is generated by the helper function
X86ISA::genX87Tags(). This function is currently limited to flagging a
stack position as valid or invalid and does not try to distinguish
between the valid, zero, and special states.
This changeset actually fixes two issues:
* The lfpimm instruction didn't work correctly when applied to a
floating point constant (it did work for integers containing the
bit string representation of a constant) since it used
reinterpret_cast to convert a double to a uint64_t. This caused a
compilation error, at least, in gcc 4.6.3.
* The instructions loading floating point constants in the x87
processor didn't work correctly since they just stored a truncated
integer instead of a double in the floating point register. This
changeset fixes the old microcode by using lfpimm instruction
instead of the limm instructions.
The current implementation of fprem simply does an fmod and doesn't
simulate any of the iterative behavior in a real fprem. This isn't
normally a problem, however, it can lead to problems when switching
between CPU models. If switching from a real CPU in the middle of an
fprem loop to a simulated CPU, the output of the fprem loop becomes
correupted. This changeset changes the fprem implementation to work
like the one on real hardware.
Reuse the address finalization code in the TLB instead of replicating
it when handling MMIO. This patch also adds support for injecting
memory mapped IPR requests into the memory system.
The rflags register is spread across several different registers. Most
of the flags are stored in MISCREG_RFLAGS, but some are stored in
microcode registers. When accessing RFLAGS, we need to reconstruct it
from these registers. This changeset adds two functions,
X86ISA::getRFlags() and X86ISA::setRFlags(), that take care of this
magic.
This changeset fixes two problems in the FABS and FCHS
implementation. First, the ISA parser expects the assignment in
flag_code to be a pure assignment and not an and-assignment, which
leads to the isa_parser omitting the misc reg update. Second, the FCHS
and FABS macro-ops don't set the SetStatus flag, which means that the
default micro-op version, which doesn't update FSW, is executed.
This changeset adds the following stats to KVM:
* numVMHalfEntries: Number of entries into KVM to finalize pending
IO operations without executing guest instructions. These typically
happen as a result of a drain where the guest must finalize some
operations before the guest state is consistent.
* numExitSignal: Number of VM exits that have been triggered by a
signal. These usually happen as a result of the timer that limits
the time spent in KVM.
We used to use the KVM CPU's clock to specify the host frequency. This
was not ideal for several reasons. One of them being that the clock
parameter of a CPU determines the frequency of some of the components
connected to the CPU. This changeset adds a separate hostFreq
parameter that should be used to specify the host frequency until we
add code to autodetect it. The hostFactor should still be used to
specify the conversion factor between the host performance and that of
the simulated system.
We currently execute instructions in the guest and then handle any IO
request right after we break out of the virtualized environment. This
has the effect of executing IO requests in the exact same tick as the
first instruction in the sequence that was just run. There seem to be
cases where this simplification upsets some timing-sensitive devices.
This changeset splits execute and IO (and other services) across
multiple ticks. This is implemented by adding a separate
RunningService state to the CPU state machine. When a VM requires
service, it enters into this state and pending IO is then serviced in
the future instead of immediately. The delay between getting the
request and servicing it depends on the number of cycles executed in
the guest, which allows other components to catch up with the CPU.
Update the system's totalNumInst counter when exiting from KVM and
maintain an internal absolute instruction count instead of relying on
the one from perf.
The TSC value stored in MISCREG_TSC is actually just an offset from
the current CPU cycle to the actual TSC value. Writes with
side-effects to the TSC subtract the current cycle count before
storing the new value, while reads add the current cycle count. When
switching CPUs, the current value is copied without side-effects. This
works as long as the source and the destination CPUs have the same
clock frequencies. The TSC will jump, sometimes backwards, if they
have different clock frequencies. Most OSes assume the TSC to be
monotonic and break when this happens.
This changeset makes sure that the TSC is copied with side-effects to
ensure that the offset is updated to match the new CPU.
HG changset 34e3295b0e39 introduced a check in the main simulation
loop that discards exit events that happen at the same tick as another
exit event. This was supposed to fix a problem where a simulation
script got confused by multiple exit events. This obviously breaks the
simulator since it can hide important simulation events, such as a
simulation failure, that happen at the same time as a non-fatal
simulation event.
Currently, the only way to get a CPU to stop after a fixed number of
instructions/loads is to set a property on the CPU that causes a
SimLoopExitEvent to be scheduled when the CPU is constructed. This is
clearly not ideal in cases where the simulation script wants the CPU
to stop at multiple instruction counts (e.g., SimPoint generation).
This changeset adds the methods scheduleInstStop() and
scheduleLoadStop() to the BaseCPU. These methods are exported to
Python and are designed to be used from the simulation script. By
using these methods instead of the old properties, a simulation script
can schedule a stop at any point during simulation or schedule
multiple stops. The number of instructions specified when scheduling a
stop is relative to the current point of execution.
This patch removes per processor cycle count, histogram for filter stats,
histogram for multicasts, histogram for prefetch wait, some function
prototypes that do not have definitions.
The Profiler class does not need an event for dumping statistics
periodically. This is because there is a method for dumping statistics
for all the sim objects periodically. Since Ruby is a sim object, its
statistics are also included.
This moves event and transition count statistics for cache controllers to
gem5's statistics. It does the same for the statistics associated with the
memory controller in ruby.
All the cache/directory/dma controllers individually collect the event and
transition counts. A callback function, collateStats(), has been added that
is invoked on the controller version 0 of each controller class. This
function adds all the individual controller statistics to a vector
variables. All the code for registering the statistical variables and
collating them is generated by SLICC. The patch removes the files
*_Profiler.{cc,hh} and *_ProfileDumper.{cc,hh} which were earlier used for
collecting and dumping statistics respectively.
This patch adds a new flag to specify if the data values for a given vector
should be printed in one line in the stats.txt file. The default behavior
will be to print the data in multiple lines. It makes changes to print
functions to enforce this behavior.
in the TLB
Some architectures (currently only x86) require some fixing-up of
physical addresses after a normal address translation. This is usually
to remap devices such as the APIC, but could be used for other memory
mapped devices as well. When running the CPU in a using hardware
virtualization, we still need to do these address fix-ups before
inserting the request into the memory system. This patch moves this
patch allows that code to be used by such CPUs without doing full
address translations.
The custom Python loader didn't comply with PEP302 for two reasons:
* Previously, we would overwrite old modules on name
conflicts. PEP302 explicitly states that: "If there is an existing
module object named 'fullname' in sys.modules, the loader must use
that existing module".
* The "__package__" attribute wasn't set. PEP302: "The __package__
attribute must be set."
This changeset addresses both of these issues.
Some architectures have special registers in the guest that can be
used to do cycle accounting. This is generally preferrable since the
prevents the guest from seeing a non-monotonic clock. This changeset
adds a virtual method, getHostCycles(), that the architecture-specific
code can override to implement this functionallity. The default
implementation uses the hwCycles counter.
timer_create can apparently return -1 and set errno to EAGAIN if the
kernel suffered a temporary failure when allocating a timer. This
happens from time to time, so we need to handle it.
It is now required to initialize the thread context by calling
startup() on it. Failing to do so currently causes decoder in
x86-based CPUs to get very confused when restoring from checkpoints.
Some Linux versions disable updates (regB.set = 1) to prevent the chip
from updating its internal state while the OS is updating it. Support
for this was already there, this patch merely disables the check in
writeReg that prevented it from being enabled. The patch also includes
support for disabling the divider, which is used to control when clock
updates should start after setting the internal RTC state.
These changes are required to boot most vanilla Linux distributions
that update the RTC settings at boot.
Rewrite reg A & B handling to use the bitunion stuff instead of bit
masking. Add better error messages when the kernel tries to enable
unsupported stuff.
This patch changes the class names of the variuos DRAM configurations
to better reflect what memory they are based on. The speed and
interface width is now part of the name, and also the alias that is
used to select them on the command line.
Some minor changes are done to the actual parameters, to better
reflect the named configurations. As a result of these changes the
regressions change slightly and the stats will be bumped in a separate
patch.
This patch adds a histogram to track how many bytes are accessed in an
open row before it is closed. This metric is useful in characterising
a workload and the efficiency of the DRAM scheduler. For example, a
DDR3-1600 device requires 44 cycles (tRC) before it can activate
another row in the same bank. For a x32 interface (8 bytes per cycle)
that means 8 x 44 = 352 bytes must be transferred to hide the
preparation time.
This patch adds a frontend and backend static latency to the DRAM
controller by delaying the responses. Two parameters expressing the
frontend and backend contributions in absolute time are added to the
controller, and the appropriate latency is added to the responses when
adding them to the (infinite) queued port for sending.
For writes and reads that hit in the write buffer, only the frontend
latency is added. For reads that are serviced by the DRAM, the static
latency is the sum of the pipeline latencies of the entire frontend,
backend and PHY. The default values are chosen based on having roughly
10 pipeline stages in total at 500 MHz.
In the future, it would be sensible to make the controller use its
clock and convert these latencies (and a few of the DRAM timings) to
cycles.
This patch does some minor tidying up of the MSHR and MSHRQueue. The
clean up started as part of some ad-hoc tracing and debugging, but
seems worthwhile enough to go in as a separate patch.
The highlights of the changes are reduced scoping (private) members
where possible, avoiding redundant new/delete, and constructor
initialisation to please static code analyzers.
This patch prunes the TraceCPU as the code is stale and the
functionality that it provided can now be achieved with the TrafficGen
using its trace playback mode.
The TraceCPU was able to play back pre-recorded memory traces of a few
different formats, and to achieve this level of flexibility with the
TrafficGen, use the util/encode_packet_trace (with suitable
modifications) to create a protobuf trace off-line.
Add a check which ensures that the minumum period for the LINEAR and
RANDOM traffic generator states is less than or equal to the maximum
period. If the minimum period is greater than the maximum period a
fatal is triggered.
This patch fixes a bug with the traffic generator which occured when
reading in the state transitions from the configuration
file. Previously, the size of the vector which stored the transitions
was used to get the size of the transitions matrix, rather than using
the number of states. Therefore, if there were more transitions than
states, i.e. some transitions has a probability of less than 1, then
the traffic generator would fatal when trying to check the
transitions.
This issue has been addressed by using the number of input states,
rather then the number of transitions.
This patch adds an optional request elasticity to the traffic
generator, effectievly compensating for it in the case of the linear
and random generators, and adding it in the case of the trace
generator. The accounting is left with the top-level traffic
generator, and the individual generators do the necessary math as part
of determining the next packet tick.
Note that in the linear and random generators we have to compensate
for the blocked time to not be elastic, i.e. without this patch the
aforementioned generators will slow down in the case of back-pressure.
This patch changes the queued port for a conventional master port and
stalls the traffic generator when requests are not immediately
accepted. This is a first step to allowing elasticity in the injection
of requests.
The patch also adds stats for the sent packets and retries, and
slightly changes how the nextPacketTick and getNextPacket
interact. The advancing of the trace is now moved to getNextPacket and
nextPacketTick is only responsible for answering the question when the
next packet should be sent.
This patch moves the responsibility for sending packets out of the
generator states and leaves it with the top-level traffic
generator. The main aim of this patch is to enable a transition to
non-queued ports, i.e. with send/retry flow control, and to do so it
is much more convenient to not wrap the port interactions and instead
leave it all local to the traffic generator.
The generator states now only govern when they are ready to send
something new, and the generation of the packets to send. They thus
have no knowledge of the port that is used.
This patch simplifies the object hierarchy of the traffic generator by
getting rid of the StateGraph class and folding this functionality
into the traffic generator itself.
The main goal of this patch is to facilitate upcoming changes by
reducing the number of affected layers.
This patch introduces a mirrored internal snoop port to facilitate
easy addition of flow control for the snoop responses that are turned
into normal responses on their return. To perform this, the slave
ports of the coherent bus are wrapped in internal master ports that
are passed as the source ports to the response layer in question.
As a result of this patch, there is more contention for the response
resources, and as such system performance will decrease slightly.
A consequence of the mirrored internal port is that the port the bus
tells to retry (the internal one) and the port actually retrying (the
mirrored) one are not the same. Thus, the existing check in tryTiming
is not longer correct. In fact, the test is redundant as the layer is
only in the retry state while calling sendRetry on the waiting port,
and if the latter does not immediately call the bus then the retry
state is left. Consequently the check is removed.
This patch makes the buses multi layered, and effectively creates a
crossbar structure with distributed contention ports at the
destination ports. Before this patch, a bus could have a single
request, response and snoop response in flight at any time, and with
these changes there can be as many requests as connected slaves (bus
master ports), and as many responses as connected masters (bus slave
ports).
Together with address interleaving, this patch enables us to create
high-throughput memory interconnects, e.g. 50+ GByte/s.
This patch makes the flow control and state updates of the coherent
bus more clear by separating the two cases, i.e. forward as a snoop
response, or turn it into a normal response.
With this change it is also more clear what resources are being
occupied, and that we effectively bypass the busy check for the second
case. As a result of the change in resource usage some stats change.
This patch does some minor housekeeping on the bus code, removing
redundant code, and moving the extraction of the destination id to the
top of the functions using it.
This patch adds a basic set of stats which are hard to impossible to
implement using only communication monitors, and are needed for
insight such as bus utilization, transactions through the bus etc.
Stats added include throughput and transaction distribution, and also
a two-dimensional vector capturing how many packets and how much data
is exchanged between the masters and slaves connected to the bus.
This patch changes the set used to track outstanding requests to an
unordered set (part of C++11 STL). There is no need to maintain the
order, and hopefully there might even be a small performance benefit.
This patch adds a typical (leaning towards fast) LPDDR3 configuration
based on publically available data. As expected, it looks very similar
to the LPDDR2-S4 configuration, only with a slightly lower burst time.
This patch adapts the existing LPDDR2 configuration to make use of the
multi-channel functionality. Thus, to get a x64 interface two
controllers should be instantiated using the makeMultiChannel method.
The page size and ranks are also adapted to better suit with a typical
LPDDR2 part.
This patch removes the explicit memset as it is redundant and causes
the simulator to touch the entire space, forcing the host system to
allocate the pages.
Anonymous pages are mapped on the first access, and the page-fault
handler is responsible for zeroing them. Thus, the pages are still
zeroed, but we avoid touching the entire allocated space which enables
us to use much larger memory sizes as long as not all the memory is
actually used.
This patch changes how the streams are created to avoid the size
limitation on the coded streams. As we only read/write a single
message at a time, there is never any message larger than a few
bytes. However, the coded stream eventually complains that its
internal counter reaches 64+ MByte if the total file size exceeds this
value.
Based on suggestions in the protobuf discussion forums, the coded
stream is now created for every message that is read/written. The
result is that the internal byte count never goes about tens of bytes,
and we can read/write any size file that the underlying file I/O can
handle.
This patch changes the type of the hash function for BasicBlockRanges
to match the original definition of the templatized type. Without
this, clang raises a warning and combined with the "-Werror" flag this
causes compilation to fail.
This is the x86 version of the ARM changeset baa17ba80e06. In case an
instruction has been squashed by the o3 cpu, this patch allows page
table walker to avoid carrying out a pending translation that the
instruction requested for.
Currently call and return instructions are marked as IsCall and IsReturn. Thus, the
branch predictor does not use RAS for these instructions. Similarly, the number of
function calls that took place is recorded as 0. This patch marks these instructions
as they should be.
Currently all the integer microops are marked as IntAluOp and the floating
point microops are marked as FloatAddOp. This patch adds support for marking
different microops differently. Now IntMultOp, IntDivOp, FloatDivOp,
FloatMultOp, FloatCvtOp, FloatSqrtOp classes will be used as well. This will
help in providing different latencies for different op class.
This patch changes the way cache statistics are collected in ruby.
As of now, there is separate entity called CacheProfiler which holds
statistical variables for caches. The CacheMemory class defines different
functions for accessing the CacheProfiler. These functions are then invoked
in the .sm files. I find this approach opaque and prone to error. Secondly,
we probably should not be paying the cost of a function call for recording
statistics.
Instead, this patch allows for accessing statistical variables in the
.sm files. The collection would become transparent. Secondly, it would happen
in place, so no function calls. The patch also removes the CacheProfiler class.
--HG--
rename : src/mem/slicc/ast/InfixOperatorExprAST.py => src/mem/slicc/ast/OperatorExprAST.py
having separate params for the local/globalHistoryBits and the
local/globalPredictorSize can lead to inconsistencies if they
are not carefully set. this patch dervies the number of bits
necessary to index into the local/global predictors based on
their size.
the value of the localHistoryTableSize for the ARM O3 CPU has been
increased to 1024 from 64, which is more accurate for an A15 based
on some correlation against A15 hardware.
The CpuPort class was removed before the KVM patches were committed,
which means that the KVM interface currently doesn't compile. This
changeset adds the BaseKvmCPU::KVMCpuPort class which derives from
MasterPort. This class is used on the data and instruction ports
instead of the old CpuPort.
This changeset adds a 'numInsts' stat to the KVM-based CPU. It also
cleans up the variable names in kvmRun to make the distinction between
host cycles and estimated simulated cycles clearer. As a bonus
feature, it also fixes a warning (unreferenced variable) when
compiling in fast mode.
Add a debug print (when the Checkpoint debug flag is set) on serialize
and unserialize. Additionally, dump the KVM state before
serializing. The KVM state isn't dumped after unserializing since the
state is loaded lazily on the next KVM entry.
Device accesses are normally uncacheable. This change probably doesn't
make any difference since we normally disable caching when KVM is
active. However, there might be devices that check this, so we'd
better enable this flag to be safe.
The vsyscall address for gettimeofday is 0xffffffffff600000ul. The offset
therefore should be 0x0 instead of 0x410. This can be cross checked with
the file sysdeps/unix/sysv/linux/x86_64/gettimeofday.c in source of glibc.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
The existing implementation can read uninitialized data or stale information
from the cached PageTable entries.
1) Add a valid bit for the cache entries. Simply using zero for the virtual
address to signify invalid entries is not sufficient. Speculative, wrong-path
accesses frequently access page zero. The current implementation would return
a uninitialized TLB entry when address zero was accessed and the PageTable
cache entry was invalid.
2) When unmapping/mapping/remaping a page, invalidate the corresponding
PageTable cache entry if one already exists.
The 'lret' instruction reloads instruction pointer and code segment from the
stack and then pops them. But the popping part is missing from the current
implementation. This caused incorrect behavior in some code related to the
Fiasco OS. Microops are being added to rectify the behavior of the instruction.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Due to recent changes to clocking system in Ruby and the way Ruby restores
state from a checkpoint, garnet was failing to run from a checkpointed state.
The problem is that Ruby resets the time to zero while warming up the caches.
If any component records a local copy of the time (read calls curCycle())
before the simulation has started, then that component will not operate until
that time is reached. In the context of this particular patch, the Garnet
Network class calls curCycle() at multiple places. Any non-operational
component can block in requests in the memory system, which the system
interprets as a deadlock. This patch makes changes so that Garnet can
successfully run from checkpointed state.
It adds a globally visible time at which the actual execution started. This
time is initialized in RubySystem::startup() function. This variable is only
meant for components with in Ruby. This replaces the private variable that
was maintained within Garnet since it is not possible to figure out the
correct time when the value of this variable can be set.
The patch also does away with all cases where curCycle() is called with in
some Ruby component before the system has actually started executing. This
is required due to the quirky manner in which ruby restores from a checkpoint.
This patch adds an address mapping scheme where the channel
interleaving takes place on a cache line granularity. It is similar to
the existing RaBaChCo that interleaves on a DRAM page, but should give
higher performance when there is less locality in the address
stream.
This patch changes the slightly ambigious names used for the address
mapping scheme to be more descriptive, and actually spell out what
they do. With this patch we also open up for adding more flavours of
open- and close-type mappings, i.e. interleaving across channels with
the open map.
This patch enables the use of the generator behaviours outside the
TrafficGen module. This is useful e.g. to allow packet replay modes
for other devices in the system without having to replace them with a
TrafficGen in the configuration files.
This change also enables more specific behaviours to be composed as
specific modules, e.g. BaseBandModem can use a number of generators
and have application-specific parameters based around a specific set
of generators.
This patch adds a WideIO 200 MHz configuration that can be used as a
baseline to compare with DDRx and LPDDRx. Note that it is a single
channel and that it should be replicated 4 times. It is based on
publically available information and attempts to capture an envisioned
8 Gbit single-die part (i.e. without TSVs).
This patch provides useful printouts throughut the memory system. This
includes pretty-printed cache tags and function call messages
(call-stack like).
This patch changes the SimpleTimingPort and RubyPort to panic on
inhibited requests as this should never happen in either of the
cases. The SimpleTimingPort is only used for the I/O devices PIO port
and the DMA devices config port and should thus never see an inhibited
request. Similarly, the SimpleTimingPort is also used for the
MessagePort in x86, and there should also not be any cases where the
port sees an inhibited request.
This changeset adds support for m5 pseudo-ops when running in
kvm-mode. Unfortunately, we can't trap the normal gem5 co-processor
entry in KVM (it doesn't seem to be possible to trap accesses to
non-existing co-processors). We therefore use BZJ instructions to
cause a trap from virtualized mode into gem5. The BZJ instruction is
becomes a normal branch to the gem5 fallback code when running in
simulated mode, which means that this patch does not need to change
the ARM ISA-specific code.
Note: This requires a patched host kernel.
All architectures execute m5 pseudo instructions by setting up
arguments according to the ABI and executing a magic instruction that
contains an operation number. Handling of such instructions is
currently spread across the different ISA implementations. This
changeset introduces the PseudoInst::pseudoInst function which handles
most of this in an architecture independent way. This is function is
mainly intended to be used from KVM, but can also be used from the
simulated CPUs.