This patch attempts to make the rules for data allocation in the
packet explicit, understandable, and easy to verify. The constructor
that copies a packet is extended with an additional flag "alloc_data"
to enable the call site to explicitly say whether the newly created
packet is short-lived (a zero-time snoop), or has an unknown life-time
and therefore should allocate its own data (or copy a static pointer
in the case of static data).
The tricky case is the static data. In essence this is a
copy-avoidance scheme where the original source of the request (DMA,
CPU etc) does not ask the memory system to return data as part of the
packet, but instead provides a pointer, and then the memory system
carries this pointer around, and copies the appropriate data to the
location itself. Thus any derived packet actually never copies any
data. As the original source does not copy any data from the response
packet when arriving back at the source, we must maintain the copy of
the original pointer to not break the system. We might want to revisit
this one day and pay the price for a few extra memcpy invocations.
All in all this patch should make it easier to grok what is going on
in the memory system and how data is actually copied (or not).
This patch cleans up the use of hasData and checkFunctional in the
packet. The hasData function is unfortunately suggesting that it
checks if the packet has a valid data pointer, when it does in fact
only check if the specific packet type is specified to have a data
payload. The confusion led to a bug in checkFunctional. The latter
function is also tidied up to avoid name overloading.
This adds a basic level of sanity checking to the packet by ensuring
that a request is not modified once the packet is created. The only
issue that had to be worked around is the relaying of
software-prefetches in the cache. The specific situation is now solved
by first copying the request, and then creating a new packet
accordingly.
This patch cleans up the packet memory allocation confusion. The data
is always allocated at the requesting side, when a packet is created
(or copied), and there is never a need for any device to allocate any
space if it is merely responding to a paket. This behaviour is in line
with how SystemC and TLM works as well, thus increasing
interoperability, and matching established conventions.
The redundant calls to Packet::allocate are removed, and the checks in
the function are tightened up to make sure data is only ever allocated
once. There are still some oddities in the packet copy constructor
where we copy the data pointer if it is static (without ownership),
and allocate new space if the data is dynamic (with ownership). The
latter is being worked on further in a follow-on patch.
This patch takes a first step in tightening up how we use the data
pointer in write packets. A const getter is added for the pointer
itself (getConstPtr), and a number of member functions are also made
const accordingly. In a range of places throughout the memory system
the new member is used.
The patch also removes the unused isReadWrite function.
WriteInvalidate semantics depend on the unconditional writeback
or they won't complete. Also, there's no point in deferring snoops
on their MSHRs, as they don't get new data at the end of their life
cycle the way other transactions do.
Add comment in the cache about a minor inefficiency re: WriteInvalidate.
Since WriteInvalidate directly writes into the cache, it can
create tricky timing interleavings with reads and writes to the
same cache line that haven't yet completed. This patch ensures
that these requests, when completed, don't overwrite the newer
data from the WriteInvalidate.
This patch takes a step towards an ISA-agnostic memory
system by enabling the components to establish the page size after
instantiation. The swap operation in the memory is now also allowing
any granularity to avoid depending on the IntReg of the ISA.
This patch changes the name of the Bus classes to XBar to better
reflect the actual timing behaviour. The actual instances in the
config scripts are not renamed, and remain as e.g. iobus or membus.
As part of this renaming, the code has also been clean up slightly,
making use of range-based for loops and tidying up some comments. The
only changes outside the bus/crossbar code is due to the delay
variables in the packet.
--HG--
rename : src/mem/Bus.py => src/mem/XBar.py
rename : src/mem/coherent_bus.cc => src/mem/coherent_xbar.cc
rename : src/mem/coherent_bus.hh => src/mem/coherent_xbar.hh
rename : src/mem/noncoherent_bus.cc => src/mem/noncoherent_xbar.cc
rename : src/mem/noncoherent_bus.hh => src/mem/noncoherent_xbar.hh
rename : src/mem/bus.cc => src/mem/xbar.cc
rename : src/mem/bus.hh => src/mem/xbar.hh
There are two primary issues with this code which make it deserving of deletion.
1) GHB is a way to structure a prefetcher, not a definitive type of prefetcher
2) This prefetcher isn't even structured like a GHB prefetcher.
It's basically a worse version of the stride prefetcher.
It primarily serves to confuse new gem5 users and most functionality is already
present in the stride prefetcher.
Static analysis unearther a bunch of uninitialised variables and
members, and this patch addresses the problem. In all cases these
omissions seem benign in the end, but at least fixing them means less
false positives next time round.
Support full-block writes directly rather than requiring RMW:
* a cache line is allocated in the cache upon receipt of a
WriteInvalidateReq, not the WriteInvalidateResp.
* only top-level caches allocate the line; the others just pass
the request along and invalidate as necessary.
* to close a timing window between the *Req and the *Resp, a new
metadata bit tracks whether another cache has read a copy of
the new line before the writeback to memory.
This patch fixes a bug in the cache port where the retry flag was
reset too early, allowing new requests to arrive before the retry was
actually sent, but with the event already scheduled. This caused a
deadlock in the interactions with the O3 LSQ.
The patche fixes the underlying issue by shifting the resetting of the
flag to be done by the event that also calls sendRetry(). The patch
also tidies up the flow control in recvTimingReq and ensures that we
also check if we already have a retry outstanding.
Previously, they were treated so much like loads that they could stall
at the head of the ROB. Now they are always treated like L1 hits.
If they actually miss, a new request is created at the L1 and tracked
from the MSHRs there if necessary (i.e. if it didn't coalesce with
an existing outstanding load).
If a set of LL/SC requests contend on the same cache block we
can get into a situation where CPUs will deadlock if they expect
a failed SC to supply them data. This case happens where 3 or
more cores are contending for a cache block using LL/SC and the system
is configured where 2 cores are connected to a local bus and the
third is connected to a remote bus. If a core on the local bus
sends an SCUpgrade and the core on the remote bus sends and SCUpgrade
they will race to see who will win the SC access. In the meantime
if the other core appends a read to one of the SCUpgrades it will expect
to be supplied data by that SCUpgrade transaction. If it happens that
the SCUpgrade that was picked to supply the data is failed, it will
drop the appended request for data and never respond, leaving the requesting
core to deadlock. This patch makes all SC's behave as normal stores to
prevent this case but still makes sure to check whether it can perform
the update.
This patch prunes unused values, and also unifies how the values are
defined (not using an enum for ALPHA), aligning the use of int vs Addr
etc.
The patch also removes the duplication of PageBytes/PageShift and
VMPageSize/LogVMPageSize. For all ISAs the two pairs had identical
values and the latter has been removed.
When a cacheline is written back to a lower-level cache,
tags->insertBlock() sets various status parameters. However these
status bits were cleared immediately after calling. This patch makes
it so that these status fields are not cleared by moving them outside
of the tags->insertBlock() call.
this patch implements a new tags class that uses a random replacement policy.
these tags prefer to evict invalid blocks first, if none are available a
replacement candidate is chosen at random.
this patch factors out the common code in the LRU class and creates a new
abstract class: the BaseSetAssoc class. any set associative tag class must
implement the functionality related to the actual replacement policy in the
following methods:
accessBlock()
findVictim()
insertBlock()
invalidate()
This patch squashes prefetch requests from downstream caches,
so that they do not steal cachelines away from caches closer
to the cpu. It was originally coded by Mitch Hayenga and
modified by Aasheesh Kolli.
This never actually worked since it was printing out only a word
of the cache block and not the entire thing and doubly didn't work
csprintf overrides the %#x specifier and assumes a char* array is
actually a string.
This patch fixes an assert condition that is not true at all
times. There are valid situations that arise in dual-core
dual-workload runs where the assert condition is false. The function
call following the assert however needs to be called only when the
condition is true (a block cannot be invalidated in the tags structure
if has not been allocated in the structure, and the tempBlock is never
allocated). Hence the 'assert' has been replaced with an 'if'.
This patch adds a filter to the cache to drop snoop requests that are
not for a range covered by the cache. This fixes an issue observed
when multiple caches are placed in parallel, covering different
address ranges. Without this patch, all the caches will forward the
snoop upwards, when only one should do so.
Forces the prefetcher to mispredict twice in a row before resetting the
confidence of prefetching. This helps cases where a load PC strides by a
constant factor, however it may operate on different arrays at times.
Avoids the cost of retraining. Primarily helps with small iteration loops.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
For systems with a tightly coupled L2, a stride-based prefetcher may observe
access requests from both instruction and data L1 caches. However, the PC
address of an instruction miss gives no relevant training information to the
stride based prefetcher(there is no stride to train). In theses cases, its
better if the L2 stride prefetcher simply reverted back to a simple N-block
ahead prefetcher. This patch enables this option.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
This patch extends the classic prefetcher to work on non-block aligned
addresses. Because the existing prefetchers in gem5 mask off the lower
address bits of cache accesses, many predictable strides fail to be
detected. For example, if a load were to stride by 48 bytes, with 64 byte
cachelines, the current stride based prefetcher would see an access pattern
of 0, 64, 64, 128, 192.... Thus not detecting a constant stride pattern. This
patch fixes this, by training the prefetcher on access and not masking off the
lower address bits.
It also adds the following configuration options:
1) Training/prefetching only on cache misses,
2) Training/prefetching only on data acceses,
3) Optionally tagging prefetches with a PC address.
#3 allows prefetchers to train off of prefetch requests in systems with
multiple cache levels and PC-based prefetchers present at multiple levels.
It also effectively allows a pipelining of prefetch requests (like in POWER4)
across multiple levels of cache hierarchy.
Improves performance on my gem5 configuration by 4.3% for SPECINT and 4.7% for SPECFP (geomean).
The patch
(1) removes the redundant writeback argument from findVictim()
(2) fixes the description of access() function
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Adds very basic statistics on the number of tag and data accesses within the
cache, which is important for power modelling. For the tags, simply count
the associativity of the cache each time. For the data, this depends on
whether tags and data are accessed sequentially, which is given by a new
parameter. In the parallel case, all data blocks are accessed each time, but
with sequential accesses, a single data block is accessed only on a hit.
This patch enables tracking of cache occupancy per thread along with
ages (in buckets) per cache blocks. Cache occupancy stats are
recalculated on each stat dump.
Add some values and methods to the request object to track the translation
and access latency for a request and which level of the cache hierarchy responded
to the request.
This patch makes it possible to once again build gem5 without any
ISA. The main purpose is to enable work around the interconnect and
memory system without having to build any CPU models or device models.
The regress script is updated to include the NULL ISA target. Currently
no regressions make use of it, but all the testers could (and perhaps
should) transition to it.
--HG--
rename : build_opts/NOISA => build_opts/NULL
rename : src/arch/noisa/SConsopts => src/arch/null/SConsopts
rename : src/arch/noisa/cpu_dummy.hh => src/arch/null/cpu_dummy.hh
rename : src/cpu/intr_control.cc => src/cpu/intr_control_noisa.cc
This patch removes the notion of a peer block size and instead sets
the cache line size on the system level.
Previously the size was set per cache, and communicated through the
interconnect. There were plenty checks to ensure that everyone had the
same size specified, and these checks are now removed. Another benefit
that is not yet harnessed is that the cache line size is now known at
construction time, rather than after the port binding. Hence, the
block size can be locally stored and does not have to be queried every
time it is used.
A follow-on patch updates the configuration scripts accordingly.
This patch reorganizes the cache tags to allow more flexibility to
implement new replacement policies. The base tags class is now a
clocked object so that derived classes can use a clock if they need
one. Also having deriving from SimObject allows specialized Tag
classes to be swapped in/out in .py files.
The cache set is now templatized to allow it to contain customized
cache blocks with additional informaiton. This involved moving code to
the .hh file and removing cacheset.cc.
The statistics belonging to the cache tags are now including ".tags"
in their name. Hence, the stats need an update to reflect the change
in naming.
This patch fixes an outstanding issue in the cache timing calculations
where an atomic access returned a time in Cycles, but the port
forwarded it on as if it was in Ticks.
A separate patch will update the regression stats.
This patch changes the updards snoop packet to avoid allocating and
later deleting it. As the code executes in 0 time and the lifetime of
the packet does not extend beyond the block there is no reason to heap
allocate it.
This patch does some minor tidying up of the MSHR and MSHRQueue. The
clean up started as part of some ad-hoc tracing and debugging, but
seems worthwhile enough to go in as a separate patch.
The highlights of the changes are reduced scoping (private) members
where possible, avoiding redundant new/delete, and constructor
initialisation to please static code analyzers.
This patch provides useful printouts throughut the memory system. This
includes pretty-printed cache tags and function call messages
(call-stack like).
Fixes a latency calculation bug for accesses during a cache line fill.
Under a cache miss, before the line is filled, accesses to the cache are
associated with a MSHR and marked as targets. Once the line fill completes,
MSHR target packets pay an additional latency of
"responseLatency + busSerializationLatency". However, the "whenReady"
field of the cache line is only set to an additional delay of
"busSerializationLatency". This lacks the responseLatency component of
the fill. It is possible for accesses that occur on the cycle of
(or briefly after) the line fill to respond without properly paying the
responseLatency. This also creates the situation where two accesses to the
same address may be serviced in an order opposite of how they were received
by the cache. For stores to the same address, this means that although the
cache performs the stores in the order they were received, acknowledgements
may be sent in a different order.
Adding the responseLatency component to the whenReady field preserves the
penalty that should be paid and prevents these ordering issues.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
This patch solves the corner case scenario where the sendRetryEvent could be
scheduled twice, when an io device stresses the IOcache in the system. This
should not be possible in the cache system.
This patch fixes a newly introduced bug where the sender state was
popped before checking that it should be. Amazingly all regressions
pass, but Linux fails to boot on the detailed CPU with caches enabled.
This patch address the most important name shadowing warnings (as
produced when using gcc/clang with -Wshadow). There are many
locations where constructor parameters and function parameters shadow
local variables, but these are left unchanged.
This patch adds a check to ensure that the delay incurred by
the bus is not simply disregarded, but accounted for by someone. At
this point, all the modules do is to zero it out, and no additional
time is spent. This highlights where the bus timing is simply dropped
instead of being paid for.
As a follow up, the locations identified in this patch should add this
additional time to the packets in one way or another. For now it
simply acts as a sanity check and highlights where the delay is simply
ignored.
Since no time is added, all regressions remain the same.
This patch changes the names of the cache accessor functions to be in
line with those used by the ports. This is done to avoid confusion and
get closer to a one-to-one correspondence between the interface of the
memory object (the cache in this case) and the port itself.
The member function timingAccess has been split into a snoop/non-snoop
part to avoid branching on the isResponse() of the packet.
This patch changes the bus-related time accounting done in the packet
to be relative. Besides making it easier to align the cache timing to
cache clock cycles, it also makes it possible to create a Last-Level
Cache (LLC) directly to a memory controller without a bus inbetween.
The bus is unique in that it does not ever make the packets wait to
reflect the time spent forwarding them. Instead, the cache is
currently responsible for making the packets wait. Thus, the bus
annotates the packets with the time needed for the first word to
appear, and also the last word. The cache then delays the packets in
its queues before passing them on. It is worth noting that every
object attached to a bus (devices, memories, bridges, etc) should be
doing this if we opt for keeping this way of accounting for the bus
timing.
This patch removes the time field from the packet as it was only used
by the preftecher. Similar to the packet queue, the prefetcher now
wraps the packet in a deferred packet, which also has a tick
representing the absolute time when the packet should be sent.
This patch makes the clock member private to the ClockedObject and
forces all children to access it using clockPeriod(). This makes it
impossible to inadvertently change the clock, and also makes it easier
to transition to a situation where the clock is derived from e.g. a
clock domain, or through a multiplier.
This patch fixes a potential deadlock in the caches. This deadlock
could occur when more than one cache is used in a system, and
pkt->senderState is modified in between the two caches. This happened
as the caches relied on the senderState remaining unchanged, and used
it for instantaneous upstream communication with other caches.
This issue has been addressed by iterating over the linked list of
senderStates until we are either able to cast to a MSHR* or
senderState is NULL. If the cast is successful, we know that the
packet has previously passed through another cache, and therefore
update the downstreamPending flag accordingly. Otherwise, we do
nothing.
This patch adds a predecessor field to the SenderState base class to
make the process of linking them up more uniform, and enable a
traversal of the stack without knowing the specific type of the
subclasses.
There are a number of simplifications done as part of changing the
SenderState, particularly in the RubyTest.
This patch merely adopts a more strict use of const for the cache
member functions and variables, and also moves a large portion of the
member functions from public to protected.
Virtualized CPUs and the fastmem mode of the atomic CPU require direct
access to physical memory. We currently require caches to be disabled
when using them to prevent chaos. This is not ideal when switching
between hardware virutalized CPUs and other CPU models as it would
require a configuration change on each switch. This changeset
introduces a new version of the atomic memory mode,
'atomic_noncaching', where memory accesses are inserted into the
memory system as atomic accesses, but bypass caches.
To make memory mode tests cleaner, the following methods are added to
the System class:
* isAtomicMode() -- True if the memory mode is 'atomic' or 'direct'.
* isTimingMode() -- True if the memory mode is 'timing'.
* bypassCaches() -- True if caches should be bypassed.
The old getMemoryMode() and setMemoryMode() methods should never be
used from the C++ world anymore.
the cache drainManager is set but never cleared, this is because
the cache itself does not need to be drained and thus never
triggers a signalDrainDone(). because the drainManager variable
is not used properly and does not appear to be necessary it has
been removed with this patch.
The current implementation in gem5 just keeps a list of locks per cacheline.
Due to this, a store to a non-overlapping portion of the cacheline can cause an
LL/SC pair to fail. This patch simply adds an address range to the lock
structure, so that the lock is only invalidated if the store overlaps the lock
range.
When the classic gem5 cache sees an uncacheable memory access, it used
to ignore it or silently drop the cache line in case of a
write. Normally, there shouldn't be any data in the cache belonging to
an uncacheable address range. However, since some architecture models
don't implement cache maintenance instructions, there might be some
dirty data in the cache that is discarded when this happens. The
reason it has mostly worked before is because such cache lines were
most likely evicted by normal memory activity before a TLB flush was
requested by the OS.
Previously, the cache model would invalidate cache lines when they
were accessed by an uncacheable write. This changeset alters this
behavior so all uncacheable memory accesses cause a cache flush with
an associated writeback if necessary. This is implemented by reusing
the cache flushing machinery used when draining the cache, which
implies that writebacks are performed using functional accesses.
The IIC replacement policy seems to be unused and has probably
gathered too much bit rot to be useful. This patch removes the IIC and
its associated cache parameters.
This patch adds support for the following optional drain methods in
the classical memory system's cache model:
memWriteback() - Write back all dirty cache lines to memory using
functional accesses.
memInvalidate() - Invalidate all cache lines. Dirty cache lines
are lost unless a writeback is requested.
Since memWriteback() is called when checkpointing systems, this patch
adds support for checkpointing systems with caches. The serialization
code now checks whether there are any dirty lines in the cache. If
there are dirty lines in the cache, the checkpoint is flagged as bad
and a warning is printed.
This patch moves the draining interface from SimObject to a separate
class that can be used by any object needing draining. However,
objects not visible to the Python code (i.e., objects not deriving
from SimObject) still depend on their parents informing them when to
drain. This patch also gets rid of the CountedDrainEvent (which isn't
really an event) and replaces it with a DrainManager.
When casting objects in the generated SWIG interfaces, SWIG uses
classical C-style casts ( (Foo *)bar; ). In some cases, this can
degenerate into the equivalent of a reinterpret_cast (mainly if only a
forward declaration of the type is available). This usually works for
most compilers, but it is known to break if multiple inheritance is
used anywhere in the object hierarchy.
This patch introduces the cxx_header attribute to Python SimObject
definitions, which should be used to specify a header to include in
the SWIG interface. The header should include the declaration of the
wrapped object. We currently don't enforce header the use of the
header attribute, but a warning will be generated for objects that do
not use it.
This patch adds an additional level of ports in the inheritance
hierarchy, separating out the protocol-specific and protocl-agnostic
parts. All the functionality related to the binding of ports is now
confined to use BaseMaster/BaseSlavePorts, and all the
protocol-specific parts stay in the Master/SlavePort. In the future it
will be possible to add other protocol-specific implementations.
The functions used in the binding of ports, i.e. getMaster/SlavePort
now use the base classes, and the index parameter is updated to use
the PortID typedef with the symbolic InvalidPortID as the default.
This patch addresses a number of smaller issues identified by the code
inspection utility cppcheck. There are a number of identified leaks in
the arm/linux/system.cc (although the function only get's called once
so it is not a major problem), a few deletes in dev/x86/i8042.cc that
were not array deletes, and sprintfs where the character array had one
element less than needed. In the IIC tags there was a function
allocating an array of longs which is in fact never used.
This patch changes the cache-related latencies from an absolute time
expressed in Ticks, to a number of cycles that can be scaled with the
clock period of the caches. Ultimately this patch serves to enable
future work that involves dynamic frequency scaling. As an immediate
benefit it also makes it more convenient to specify cache performance
without implicitly assuming a specific CPU core operating frequency.
The stat blocked_cycles that actually counter in ticks is now updated
to count in cycles.
As the timing is now rounded to the clock edges of the cache, there
are some regressions that change. Plenty of them have very minor
changes, whereas some regressions with a short run-time are perturbed
quite significantly. A follow-on patch updates all the statistics for
the regressions.
In the current caches the hit latency is paid twice on a miss. This patch lets
a configurable response latency be set of the cache for the backward path.
This patch takes the final plunge and transitions from the templated
Range class to the more specific AddrRange. In doing so it changes the
obvious Range<Addr> to AddrRange, and also bumps the range_map to be
AddrRangeMap.
In addition to the obvious changes, including the removal of redundant
includes, this patch also does some house keeping in preparing for the
introduction of address interleaving support in the ranges. The Range
class is also stripped of all the functionality that is never used.
--HG--
rename : src/base/range.hh => src/base/addr_range.hh
rename : src/base/range_map.hh => src/base/addr_range_map.hh
This patch addresses a few minor issues reported by the clang static
analyzer.
The analysis was run with:
scan-build -disable-checker deadcode \
-enable-checker experimental.core \
-disable-checker experimental.core.CastToStruct \
-enable-checker experimental.cpluscplus
This seperates the functionality to clear the state in a block into
blk.hh and the functionality to udpate the tag information into the
tags. This gets rid of the case where calling invalidateBlk on an
already-invalid block does something different than calling it on a
valid block, which was confusing.
This patch is a first step to using Cycles as a parameter type. The
main affected modules are the CPUs and the Ruby caches. There are
definitely plenty more places that are affected, but this patch serves
as a starting point to making the transition.
An important part of this patch is to actually enable parameters to be
specified as Param.Cycles which involves some changes to params.py.
This patch removes the NACK frrom the packet as there is no longer any
module in the system that issues them (the bridge was the only one and
the previous patch removes that).
The handling of NACKs was mostly avoided throughout the code base, by
using e.g. panic or assert false, but in a few locations the NACKs
were actually dealt with (although NACKs never occured in any of the
regressions). Most notably, the DMA port will now never receive a NACK
and the backoff time is thus never changed. As a consequence, the
entire backoff mechanism (similar to a PCI bus) is now removed and the
DMA port entirely relies on the bus performing the arbitration and
issuing a retry when appropriate. This is more in line with e.g. PCIe.
Surprisingly, this patch has no impact on any of the regressions. As
mentioned in the patch that removes the NACK from the bridge, a
follow-up patch should change the request and response buffer size for
at least one regression to also verify that the system behaves as
expected when the bridge fills up.
This patch extends the queued port interfaces with methods for
scheduling the transmission of a timing request/response. The methods
are named similar to the corresponding sendTiming(Snoop)Req/Resp,
replacing the "send" with "sched". As the queues are currently
unbounded, the methods always succeed and hence do not return a value.
This functionality was previously provided in the subclasses by
calling PacketQueue::schedSendTiming with the appropriate
parameters. With this change, there is no need to introduce these
extra methods in the subclasses, and the use of the queued interface
is more uniform and explicit.
This patch fixes some problems with the drain/switchout functionality
for the O3 cpu and for the ARM ISA and adds some useful debug print
statements.
This is an incremental fix as there are still a few bugs/mem leaks with the
switchout code. Particularly when switching from an O3CPU to a
TimingSimpleCPU. However, when switching from O3 to O3 cores with the ARM ISA
I haven't encountered any more assertion failures; now the kernel will
typically panic inside of simulation.
removes the optimization that forwards an exclusive copy to a requester on a
read, only for the i-cache. this optimization isn't necessary because we
typically won't be writing to the i-cache.
This patch is a first step to align the port names used in the Python
world and the C++ world. Ultimately it serves to make the use of
config.json together with output from the simulation easier, including
post-processing of statistics.
Most notably, the CPU, cache, and bus is addressed in this patch, and
there might be other ports that should be updated accordingly. The
dash name separator has also been replaced with a "." which is what is
used to concatenate the names in python, and a separation is made
between the master and slave port in the bus.
This patch makes getAddrRanges const throughout the code base. There
is no reason why it should not be, and making it const prevents adding
any unintentional side-effects.
This patch adds isSnooping to the slave port, and thus avoids going
through getMasterPort to be able to ask the master. Over the course of
the next few patches, all getMasterPort/getSlavePort in Port and
MemObject are to be protocol agnostic, and the snooping is part of the
protocol layer.
The function is already present on the master port, where it is
implemented by the module itself, e.g. a cache. On the slave side, it
is merely asking the connected master port. The same name is used by
both functions despite their difference in behaviour. The initial
design used isMasterSnooping on the slave port side, but the more
verbose function name was later changed.
This patch is the last part of moving all protocol-related
functionality out of the Port base class. All the send/recv functions
are already moved, and the retry (which still governs all the timing
transport functions) is the only part that remained in the base class.
The only point where this currently causes a bit of inconvenience is
in the bus where the retry list is global and holds Port pointers (not
Master/SlavePort). This is about to change with the split into a
request/response bus and will soon be removed anyway.
The patch has no impact on any regressions.
This patch is the result of static analysis identifying a number of
memory leaks. The leaks are all benign as they are a result of not
deallocating memory in the desctructor. The fix still has value as it
removes false positives in the static analysis.
The LRU policy always evicted the least recently touched way, even if it
contained valid data and another way was invalid, as can happen if a block has
been invalidated by coherance. This can result in caches never warming up even
though they are replacing blocks. This modifies the LRU policy to move blocks
to LRU position on invalidation.
This patch is a temporary fix until Andreas' four-phase patches
get reviewed and committed. Removing FastAlloc seems to have exposed
an issue which previously was reasonable rare in which packets are freed
before the sending cache is done with them. This change puts incoming packets
no a pendingDelete queue which are deleted at the start of the next call and
thus breaks the dependency between when the caller returns true and when the
packet is actually used by the sending cache.
Running valgrind on a multi-core linux boot and the memtester results in no
valgrind warnings.
While FastAlloc provides a small performance increase (~1.5%) over regular malloc it isn't thread safe.
After removing FastAlloc and using tcmalloc I've seen a performance increase of 12% over libc malloc
when running twolf for ARM.
The main aim of this patch is to arrive at a suitable port interface
for vector ports, including both the packet and the port id. This
patch changes the bus transport functions
(recvFunctional/Atomic/Timing) to require a PortId parameter
indicating the source port. Previously this information was passed by
setting the source field of the packet, and this is only required in
the case of a timing request.
With this patch, the use of the source and destination field is also
more restrictive, as they are only needed for timing accesses. The
modifications to these fields for atomic snoops is now removed
entirely, also making minor modifications to the cache.
This patch removes the Packet::NodeID typedef and unifies it with the
Port::PortId. The src and dest fields in the packet are used to hold a
port id (e.g. in the bus), and thus the two should actually be the
same.
The typedef PortID is now global (in base/types.hh) and aligned with
the ThreadID in terms of capitalisation and naming of the
InvalidPortID constant.
Before this patch, two flags were used for valid destination and
source, rather than relying on a named value (InvalidPortID), and
this is now redundant, as the src and dest field themselves are
sufficient to tell whether the current value is a valid port
identifier or not. Consequently, the VALID_SRC and VALID_DST are
removed.
As part of the cleaning up, a number of int parameters and local
variables are updated to use PortID.
Note that Ruby still has its own NodeID typedef. Furthermore, the
MemObject getMaster/SlavePort still has an int idx parameter with a
default value of -1 which should eventually change to PortID idx =
InvalidPortID.