This changeset moves the access trace functionality from the
CommMonitor into a separate probe. The probe can be hooked up to any
component that exports probe points of the type ProbePoints::Packet.
This patch moves the dependency on Google's Protocol Buffers library
from the CommMonitor to the MemTraceProbe, which means that the
CommMonitor (including stack distance profiling) no long depends on
it.
This changeset removes the stack distance calculator hooks from the
CommMonitor class and implements a stack distance calculator as a
memory system probe instead. The probe can be hooked up to any
component that exports probe points of the type ProbePoints::Packet.
This changeset adds a standardized probe point type to monitor packets
in the memory system and adds two probe points to the CommMonitor
class. These probe points enable monitoring of successfully delivered
requests and successfully delivered responses.
Memory system probe listeners should use the BaseMemProbe base class
to provide a unified configuration interface and reuse listener
registration code. Unlike the ProbeListenerObject class, the
BaseMemProbe allows objects to be wired to multiple ProbeManager
instances as long as they use the same probe point name.
There are 2 problems with the existing checkpoint and restore code in ruby.
The first is that when the event queue is altered by ruby during serialization,
some events that are currently scheduled cannot be found (e.g. the event to
stop simulation that always lives on the queue), causing a panic.
The second is that ruby is sometimes serialized after the memory system,
meaning that the dirty data in its cache is flushed back to memory too late
and so isn't included in the checkpoint.
These are fixed by implementing memory writeback in ruby, using the same
technique of hijacking the event queue, but first descheduling all events that
are currently on it. They are saved, along with their scheduled time, so that
the event queue can be faithfully reconstructed after writeback has finished.
Events with the AutoDelete flag set will delete themselves when they
are descheduled, causing an error when attempting to schedule them again.
This is fixed by simply not recording them when taking them off the queue.
Writeback is still implemented using flushing, so the cache recorder object,
that is created to generate the trace and manage flushing, is kept
around and used during serialization to write the trace to disk.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
1. Eliminate state NP in L0 and L1 Caches: The two states 'NP' and 'I' both
mean that the cache block is not present in the cache. 'I' also means that the
cache entry has been allocated. This causes problems when we do not correctly
initialize the cache entry when it is re-used. Hence, this patch eliminates
the state NP altogether. Everytime a new block comes into the cache, a cache
entry is allocated. Everytime a block leaves, the corresponding entry is
deallocated.
2. Separate transient state for instruction fetches: purely for accouting
purposes.
3. Drop state IS_I in L1 Cache and the message type STALE_DATA: when
invalidation is received for a block in IS, the block used to be moved to IS_I.
This meant that the data that would arrive in future would be used but not
stored since the controller lost the permissions after gaining them. This
state is being dropped and now invalidation messages would not processed till
the data has arrived. This also means that STALE_DATA type is not longer
required.
The level 2 controller has a bug. In one particular action, the data block was
copied from a message irrespective whether the block is dirty or not. In cases
when L1 sends no data, the data value copied was incorrect.
For many years the slicc symbol table has supported overloaded functions in
external classes. This patch extends that support to functions that are not
part of classes (a.k.a. no parent). For example, this support allows slicc
to understand that mapAddressToRange is overloaded and the NodeID is an
optional parameter.
This patch changes the router pipeline stages from 4 to 2. The
canonical 4-stage router is conservative while a lower-latency router
with look ahead routing and speculative allocation is well acknowledged.
Sets m_stage.second to the second parameter of the function.
Then, for every place where advance_stage is called, adds
a cycle to the argument being passed.
Adds features to allow protocols to reschedule controllers when conditionally
stalling within inport logic or actions. Also insures that resource and
protocol stalls are re-evaluated the next cycle.
This patch adds support that allows the replacement policy to identify each
cache block's access permission. This information can be useful when making
replacement decisions.
The Ruby banked array resource checks (initiated from SLICC) did a check and
allocate at the same time. If a transition needs more than one resource, then
it might check/allocate resource #1, then fail to get resource #2. Another
transition might then try to get the same resources, but in reverse order.
Deadlock.
This patch separates resource checking and resource reservation into two
steps to avoid deadlock.
It was previously possible for a stalled message to be reordered after an
incomming message. This patch ensures that any stalled message stays in its
original request order.
This patch adds a few helpful functions that allow .sm files to directly
invalidate all cache blocks using a trigger queue rather than rely on each
individual cache block to be invalidated via requests from the mandatory
queue.
This patch allows DPRINTFs to be used in SLICC state machines similar to how
they are used by the rest of gem5. Previously all DPRINTFs in the .sm files
had to use the RubySlicc flag.
This patch exposes the tag and data array latencies to the SLICC state machines
so that it can be used to determine the correct enqueue latency for response
messages.
To have multiple Entry types (e.g., a cache Entry type and
a directory Entry type), just declare one of them as a secondary
type by using the pair 'main="false"', e.g.:
structure(DirEntry, desc="...", interface="AbstractCacheEntry",
main="false") {
...and the primary type would be declared:
structure(Entry, desc="...", interface="AbstractCacheEntry") {
This patch fixes the type handling when prefix operations are used. Previously
prefix operators would assume a void return type, which made it impossible to
combine prefix operations with other expressions. This patch allows SLICC
programmers to use prefix operations more naturally.
This patches adds support for transitions of the form:
transition(START, EVENTS, *) { ACTIONS }
This allows a machine to collapse states that differ only in the next state
transition to collapse into one, and can help shorten/simplfy some protocols
significantly.
When * is encountered as an end state of a transition, the next state is
determined by calling the machine-specific getNextState function. The next
state is determined before any actions of the transition execute, and
therefore the next state calculation cannot depend on any of the transition
actions.
This patch allows SLICC protocols to use more than one message type with a
message buffer. For example, you can declare two in ports as such:
in_port(ResponseQueue_in, ResponseMsg, responseFromDir, rank=3) { ... }
in_port(tgtResponseQueue_in, TgtResponseMsg, responseFromDir, rank=2) { ... }
This patch was created by Bihn Pham during his internship at AMD.
There is no need to delay hit callback response messages by a cycle because
the response latency is already incurred in the Ruby protocol. This ensures
correct timing of memory instructions.
This patch removes the RequestCause, and also simplifies how we
schedule the sending of packets through the memory-side port. The
deassertion of bus requests is removed as it is not used.
This patch makes cache sets aware of the way number. This enables
some nice features such as the ablity to restrict way allocation. The
implemented mechanism allows to set a maximum way number to be
allocated 'k' which must fulfill 0 < k <= N (where N is the number of
ways). In the future more sophisticated mechasims can be implemented.
This patch changes how writebacks communicate whether the line is
passed as modified or owned. Previously we relied on the
isSupplyExclusive mechanism, which was originally designed to avoid
unecessary snoops.
For normal cache requests we use the sharedAsserted mechanism to
determine if a block should be marked writeable or not, and with this
patch we transition the writebacks to also use this
mechanism. Conceptually this is cleaner and more consistent.
Some minor fixes and removal of dead code. Changing the flags to be
enums rather than static const (to avoid any linking issues caused by
the latter). Also adding a getBlockAddr member which hopefully can
slowly finds its way into caches, snoop filters etc.
This is another step in the process of removing global variables
from Ruby to enable multiple RubySystem instances in a single simulation.
The list of abstract controllers is per-RubySystem and should be
represented that way, rather than as a global.
Since this is the last remaining Ruby global variable, the
src/mem/ruby/Common/Global.* files are also removed.
This is another step in the process of removing global variables
from Ruby to enable multiple RubySystem instances in a single simulation.
With possibly multiple RubySystem objects, we can no longer use a global
variable to find "the" RubySystem object. Instead, each Ruby component
has to carry a pointer to the RubySystem object to which it belongs.
This patch begins the process of removing global variables from the Ruby
source with the goal of eventually allowing users to create multiple Ruby
instances in a single simulation. Currently, users cannot do so because
several global variables and static members are referenced by the RubySystem
object in a way that assumes that there will only ever be a single RubySystem.
These need to be replaced with per-RubySystem equivalents.
This specific patch replaces the global var g_ruby_start, which is used
to calculate throughput statistics for Throttles in simple networks and
links in Garnet networks, with a RubySystem instance var m_start_cycle.
The drain() call currently passes around a DrainManager pointer, which
is now completely pointless since there is only ever one global
DrainManager in the system. It also contains vestiges from the time
when SimObjects had to keep track of their child objects that needed
draining.
This changeset moves all of the DrainState handling to the Drainable
base class and changes the drain() and drainResume() calls to reflect
this. Particularly, the drain() call has been updated to take no
parameters (the DrainManager argument isn't needed) and return a
DrainState instead of an unsigned integer (there is no point returning
anything other than 0 or 1 any more). Drainable objects should return
either DrainState::Draining (equivalent to returning 1 in the old
system) if they need more time to drain or DrainState::Drained
(equivalent to returning 0 in the old system) if they are already in a
consistent state. Returning DrainState::Running is considered an
error.
Drain done signalling is now done through the signalDrainDone() method
in the Drainable class instead of using the DrainManager directly. The
new call checks if the state of the object is DrainState::Draining
before notifying the drain manager. This means that it is safe to call
signalDrainDone() without first checking if the simulator has
requested draining. The intention here is to reduce the code needed to
implement draining in simple objects.
Draining is currently done by traversing the SimObject graph and
calling drain()/drainResume() on the SimObjects. This is not ideal
when non-SimObjects (e.g., ports) need draining since this means that
SimObjects owning those objects need to be aware of this.
This changeset moves the responsibility for finding objects that need
draining from SimObjects and the Python-side of the simulator to the
DrainManager. The DrainManager now maintains a set of all objects that
need draining. To reduce the overhead in classes owning non-SimObjects
that need draining, objects inheriting from Drainable now
automatically register with the DrainManager. If such an object is
destroyed, it is automatically unregistered. This means that drain()
and drainResume() should never be called directly on a Drainable
object.
While implementing the new functionality, the DrainManager has now
been made thread safe. In practice, this means that it takes a lock
whenever it manipulates the set of Drainable objects since SimObjects
in different threads may create Drainable objects
dynamically. Similarly, the drain counter is now an atomic_uint, which
ensures that it is manipulated correctly when objects signal that they
are done draining.
A nice side effect of these changes is that it makes the drain state
changes stricter, which the simulation scripts can exploit to avoid
redundant drains.
The drain state enum is currently a part of the Drainable
interface. The same state machine will be used by the DrainManager to
identify the global state of the simulator. Make the drain state a
global typed enum to better cater for this usage scenario.
Objects that are can be serialized are supposed to inherit from the
Serializable class. This class is meant to provide a unified API for
such objects. However, so far it has mainly been used by SimObjects
due to some fundamental design limitations. This changeset redesigns
to the serialization interface to make it more generic and hide the
underlying checkpoint storage. Specifically:
* Add a set of APIs to serialize into a subsection of the current
object. Previously, objects that needed this functionality would
use ad-hoc solutions using nameOut() and section name
generation. In the new world, an object that implements the
interface has the methods serializeSection() and
unserializeSection() that serialize into a named /subsection/ of
the current object. Calling serialize() serializes an object into
the current section.
* Move the name() method from Serializable to SimObject as it is no
longer needed for serialization. The fully qualified section name
is generated by the main serialization code on the fly as objects
serialize sub-objects.
* Add a scoped ScopedCheckpointSection helper class. Some objects
need to serialize data structures, that are not deriving from
Serializable, into subsections. Previously, this was done using
nameOut() and manual section name generation. To simplify this,
this changeset introduces a ScopedCheckpointSection() helper
class. When this class is instantiated, it adds a new /subsection/
and subsequent serialization calls during the lifetime of this
helper class happen inside this section (or a subsection in case
of nested sections).
* The serialize() call is now const which prevents accidental state
manipulation during serialization. Objects that rely on modifying
state can use the serializeOld() call instead. The default
implementation simply calls serialize(). Note: The old-style calls
need to be explicitly called using the
serializeOld()/serializeSectionOld() style APIs. These are used by
default when serializing SimObjects.
* Both the input and output checkpoints now use their own named
types. This hides underlying checkpoint implementation from
objects that need checkpointing and makes it easier to change the
underlying checkpoint storage code.
This patch drops the NetworkMessage class. The relevant data members and functions
have been moved to the Message class, which was the parent of NetworkMessage.
The accessor function getDestination() for Destination variable in the
coherence message clashes with the getDestination() that is part of the Message
class. Hence the name change.
This structure's only purpose was to provide a comparison function for
ordering messages in the MessageBuffer. The comparison function is now
being moved to the Message class itself. So we no longer require this
structure.
This patch increases the default read/write buffer sizes for the DDR4
controller config to values that are more suitable for the high
bandwidth and high bank count.
This patch updates the command arbitration so that bank group timing
as well as rank-to-rank delays will be taken into account. The
resulting arbitration no longer selects commands (prepped or not) that
cannot issue seamlessly if there are commands that can issue
back-to-back, minimizing the effect of rank-to-rank (tCS) & same bank
group (tCCD_L) delays.
The arbitration selects a new command based on the following priority.
Within each priority band, the arbitration will use FCFS to select the
appropriate command:
1) Bank is prepped and burst can issue seamlessly, without a bubble
2) Bank is not prepped, but can prep and issue seamlessly, without a
bubble
3) Bank is prepped but burst cannot issue seamlessly. In this case, a
bubble will occur on the bus
Thus, to enable more parallelism in subsequent selections, an
unprepped packet is given higher priority if the bank prep can be
hidden. If the bank prep cannot be hidden, the selection logic will
choose a prepped packet that cannot issue seamlessly if one exist.
Otherwise, the default selection will choose the packet with the
minimum bank prep delay.
This patch adds a simple lookup structure to avoid iterating over the
write queue to find read matches, and for the merging of write
bursts. Instead of relying on iteration we simply store a set of
currently-buffered write-burst addresses and compare against
these. For the reads we still perform the iteration if we have a
match. For the writes, we rely entirely on the set. Note that there
are corner-cases where sub-bursts would actually not be mergeable
without a read-modify-write. We ignore these cases and opt for speed.
This patch changes how the crossbar classes deal with
responses. Instead of forwarding responses directly and burdening the
neighbouring modules in paying for the latency (through the
pkt->headerDelay), we now queue them before sending them.
The coherency protocol is not affected as requests and any snoop
requests/responses are still passed on in zero time. Thus, the
responses end up paying for any header delay accumulated when passing
through the crossbar. Any latency incurred on the request path will be
paid for on the response side, if no other module has dealt with it.
As a result of this patch, responses are returned at a later
point. This affects the number of outstanding transactions, and quite
a few regressions see an impact in blocking due to no MSHRs, increased
cache-miss latencies, etc.
Going forward we should be able to use the same concept also for snoop
responses, and any request that is not an express snoop.
This patch takes the final step in removing the is_top_level parameter
from the cache. With the recent changes to read requests and write
invalidations, the parameter is no longer needed, and consequently
removed.
This also means that asymmetric cache hierarchies are now fully
supported (and we are actually using them already with L1 caches, but
no table-walker caches, connected to a shared L2).
WriteInvalidateReq ensures that a whole-line write does not incur the
cost of first doing a read exclusive, only to later overwrite the
data. This patch splits the existing WriteInvalidateReq into a
WriteLineReq, which is done locally, and an InvalidateReq that is sent
out throughout the memory system. The WriteLineReq re-uses the normal
WriteResp.
The change allows us to better express the difference between the
cache that is performing the write, and the ones that are merely
invalidating. As a consequence, we no longer have to rely on the
isTopLevel flag. Moreover, the actual memory in the system does not
see the intitial write, only the writeback. We were marking the
written line as dirty already, so there is really no need to also push
the write all the way to the memory.
The overall flow of the write-invalidate operation remains the same,
i.e. the operation is only carried out once the response for the
invalidate comes back. This patch adds the InvalidateResp for this
very reason.
This patch adds two new read requests packets:
ReadCleanReq - For a cache to explicitly request clean data. The
response is thus exclusive or shared, but not owned or modified. The
read-only caches (see previous patch) use this request type to ensure
they do not get dirty data.
ReadSharedReq - We add this to distinguish cache read requests from
those issued by other masters, such as devices and CPUs. Thus, devices
use ReadReq, and caches use ReadCleanReq, ReadExReq, or
ReadSharedReq. For the latter, the response can be any state, shared,
exclusive, owned or even modified.
Both ReadCleanReq and ReadSharedReq re-use the normal ReadResp. The
two transactions are aligned with the emerging cache-coherent TLM
standard and the AMBA nomenclature.
With this change, the normal ReadReq should never be used by a cache,
and is reserved for the actual (non-caching) masters in the system. We
thus have a way of identifying if a request came from a cache or
not. The introduction of ReadSharedReq thus removes the need for the
current isTopLevel hack, and also allows us to stop relying on
checking the packet size to determine if the source is a cache or
not. This is fixed in follow-on patches.
This patch adds a parameter to the BaseCache to enable a read-only
cache, for example for the instruction cache, or table-walker cache
(not for x86). A number of checks are put in place in the code to
ensure a read-only cache does not end up with dirty data.
A follow-on patch adds suitable read requests to allow a read-only
cache to explicitly ask for clean data.
This patch adds eviction notices to the caches, to provide accurate
tracking of cache blocks in snoop filters. We add the CleanEvict
message to the memory heirarchy and use both CleanEvicts and
Writebacks with BLOCK_CACHED flags to propagate notice of clean and
dirty evictions respectively, down the memory hierarchy. Note that the
BLOCK_CACHED flag indicates whether there exist any copies of the
evicted block in the caches above the evicting cache.
The purpose of the CleanEvict message is to notify snoop filters of
silent evictions in the relevant caches. The CleanEvict message
behaves much like a Writeback. CleanEvict is a write and a request but
unlike a Writeback, CleanEvict does not have data and does not need
exclusive access to the block. The cache generates the CleanEvict
message on a fill resulting in eviction of a clean block. Before
travelling downwards CleanEvict requests generate zero-time snoop
requests to check if the same block is cached in upper levels of the
memory heirarchy. If the block exists, the cache discards the
CleanEvict message. The snoops check the tags, writeback queue and the
MSHRs of upper level caches in a manner similar to snoops generated
from HardPFReqs. Currently CleanEvicts keep travelling towards main
memory unless they encounter the block corresponding to their address
or reach main memory (since we have no well defined point of
serialisation). Main memory simply discards CleanEvict messages.
We have modified the behavior of Writebacks, such that they generate
snoops to check for the presence of blocks in upper level caches. It
is possible in our current implmentation for a lower level cache to be
writing back a block while a shared copy of the same block exists in
the upper level cache. If the snoops find the same block in upper
level caches, we set the BLOCK_CACHED flag in the Writeback message.
We have also added logic to account for interaction of other message
types with CleanEvicts waiting in the writeback queue. A simple
example is of a response arriving at a cache removing any CleanEvicts
to the same address from the cache's writeback queue.
This patch fixes an issue which is very wide spread in the codebase,
causing sporadic linking failures. The issue is that we declare static
const class variables in the header, without any definition (as part
of a source file). In most cases the compiler propagates the value and
we have no issues. However, especially for less optimising builds such
as debug, we get sporadic linking failures due to undefined
references.
This patch fixes the Request class, by turning the static const flags
and master IDs into C++11 typed enums.
Remove the assert when adding a port to the RubyPort retry list.
Instead of asserting, just ignore the added port, since it's
already on the list.
Without this patch, Ruby+detailed fails for even the simplest tests
Snoop packets share the request pointer with the originating
packets. We need to ensure that the snoop packet destruction does not
delete the request. Snoops are used for reads, invalidations,
HardPFReqs, Writebacks and CleansEvicts. Reads, invalidations, and
HardPFReqs need a response so their snoops do not delete the
request. For Writebacks and CleanEvicts we need to check explicitly
for whethere the current packet is an express snoop, in whcih case do
not delete the request.
Fixes missed forward eviction to CPU. With the O3CPU this can lead to load-load
reordering, as the LQ is never notified of the invalidate.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
A single HMC-2500 x32 model based on:
[1] DRAMSpec: a high-level DRAM bank modelling tool developed at the University
of Kaiserslautern. This high level tool uses RC (resistance-capacitance) and CV
(capacitance-voltage) models to estimate the DRAM bank latency and power
numbers.
[2] A Logic-base Interconnect for Supporting Near Memory Computation in the
Hybrid Memory Cube (E. Azarkhish et. al) Assumed for the HMC model is a 30 nm
technology node. The modelled HMC consists of a 4 Gbit part with 4 layers
connected with TSVs. Each layer has 16 vaults and each vault consists of 2
banks per layer. In order to be able to use the same controller used for 2D
DRAM generations for HMC, the following analogy is done: Channel (DDR) => Vault
(HMC) device_size (DDR) => size of a single layer in a vault ranks per channel
(DDR) => number of layers banks per rank (DDR) => banks per layer devices per
rank (DDR) => devices per layer ( 1 for HMC). The parameters for which no
input is available are inherited from the DDR3 configuration.
A step towards removing RubyMemoryControl and shift users to
DRAMCtrl. The latter is faster, more representative, very versatile,
and is integrated with power models.
The processes of warming up and cooling down Ruby caches are simulation-wide
processes, not just RubySystem instance-specific processes. Thus, the warm-up
and cool-down variables should be globally visible to any Ruby components
participating in either process. Make these variables static members and track
the warm-up and cool-down processes as appropriate.
This patch also has two side benefits:
1) It removes references to the RubySystem g_system_ptr, which are problematic
for allowing multiple RubySystem instances in a single simulation. Warmup and
cooldown variables being static (global) reduces the need for instance-specific
dereferences through the RubySystem.
2) From the AbstractController, it removes local RubySystem pointers, which are
used inconsistently with other uses of the RubySystem: 11 other uses reference
the RubySystem with the g_system_ptr. Only sequencers have local pointers.
Sometimes, we need to defer an express snoop in an MSHR, but the original
request might complete and deallocate the original pkt->req. In those cases,
create a copy of the request so that someone who is inspecting the delayed
snoop can also inspect the request still. All of this is rather hacky, but the
allocation / linking and general life-time management of Packet and Request is
rather tricky. Deleting the copy is another tricky area, testing so far has
shown that the right copy is deleted at the right time.
The Request::UNCACHEABLE flag currently has two different
functions. The first, and obvious, function is to prevent the memory
system from caching data in the request. The second function is to
prevent reordering and speculation in CPU models.
This changeset gives the order/speculation requirement a separate flag
(Request::STRICT_ORDER). This flag prevents CPU models from doing the
following optimizations:
* Speculation: CPU models are not allowed to issue speculative
loads.
* Write combining: CPU models and caches are not allowed to merge
writes to the same cache line.
Note: The memory system may still reorder accesses unless the
UNCACHEABLE flag is set. It is therefore expected that the
STRICT_ORDER flag is combined with the UNCACHEABLE flag to prevent
this behavior.
This patch takes a last step in fixing issues related to uncacheable
accesses. We do not separate uncacheable memory from uncacheable
devices, and in cases where it is really memory, there are valid
scenarios where we need to snoop since we do not support cache
maintenance instructions (yet). On snooping an uncacheable access we
thus provide data if possible. In essence this makes uncacheable
accesses IO coherent.
The snoop filter is also queried to steer the snoops, but not updated
since the uncacheable accesses do not allocate a block.
This patch ensures that we pass on information about a packet being
shared (rather than exclusive), when forwarding a packet downstream.
Without this patch there is a risk that a downstream cache considers
the line exclusive when it really isn't.
This patch adds a missing counter update for the uncacheable
accesses. By updating this counter we also get a meaningful average
latency for uncacheable accesses (previously inf).
This patch changes the cache implementation to rely on virtual methods
rather than using the replacement policy as a template argument.
There is no impact on the simulation performance, and overall the
changes make it easier to modify (and subclass) the cache and/or
replacement policy.
Both open_adaptive and close_adaptive page polices keep the page
open if a row hit is found. If a row hit is not found, close_adaptive
page policy precharges the row, and open_adaptive policy precharges
the row only if there is a bank conflict request waiting in the queue.
This patch makes the checks for above conditions simpler.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Restoring from a checkpoint with ruby + the DRAMCtrl memory model was not
working, because ruby and DRAMCtrl disagreed on the current tick during warmup.
Since there is no reason to do timing requests during warmup, use functional
requests instead.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
The stride prefetcher had a hardcoded number of contexts (i.e. master-IDs)
that it could handle. Since master IDs need to be unique per system, and
every core, cache etc. requires a separate master port, a static limit on
these does not make much sense.
Instead, this patch adds a small hash map that will map all master IDs to
the right prefetch state and dynamically allocates new state for new master
IDs.
This patch changes the order of writeback allocation such that any
writebacks resulting from a tag lookup (e.g. for an uncacheable
access), are added to the writebuffer before any new MSHR entries are
allocated. This ensures that the writebacks logically precedes the new
allocations.
The patch also changes the uncacheable flush to use proper timed (or
atomic) writebacks, as opposed to functional writes.
This patch simplifies the code dealing with uncacheable timing
accesses, aiming to align it with the existing miss handling. Similar
to what we do in atomic, a timing request now goes through
Cache::access (where the block is also flushed), and then proceeds to
ignore any existing MSHR for the block in question. This unifies the
flow for cacheable and uncacheable accesses, and for atomic and timing.
This patch changes how we search for matching MSHRs, ignoring any MSHR
that is allocated for an uncacheable access. By doing so, this patch
fixes a corner case in the MSHRs where incorrect data ended up being
copied into a (cacheable) read packet due to a first uncacheable MSHR
target of size 4, followed by a cacheable target to the same MSHR of
size 64. The latter target was filled with nonsense data.
This patch removes the no-longer-needed
allocateUncachedReadBuffer. Besides the checks it is exactly the same
as allocateMissBuffer and thus provides no value.
This patch aligns all MSHR queue entries to block boundaries to
simplify checks for matches. Previously there were corner cases that
could lead to existing entries not being identified as matches.
There are, rather alarmingly, a few regressions that change with this
patch.
This patch subsumes the PREFETCH_SNOOP_SQUASH flag with the more
generic BLOCK_CACHED flag. Future patches implementing cache eviction
messages can use the BLOCK_CACHED flag in almost the same manner as
hardware prefetches use the PREFETCH_SNOOP_SQUASH flag. The
PREFTECH_SNOOP_FLAG is set if the prefetch target is found in the tags
or the MSHRs in any state, so we are simply replacing calls to
setPrefetchSquashed() with setBlockCached(). The case of where the
prefetch target is found in the writeback MSHRs of upper level caches
continues to be covered by the MEM_INHIBIT flag.