"executing" isnt a very descriptive debug message and in going through the
output you get multiple messages that say "executing" but nothing to help
you parse through the code/execution.
So instead, at least print out the name of the action that is taking
place in these functions.
Overall, continue to progress Ruby debug messages to more of the normal M5
debug message style
- add a name() to the Ruby Throttle & PerfectSwitch objects so that the debug output
isn't littered w/"global:" everywhere.
- clean up messages that print over multiple lines when possible
- clean up duplicate prints in the message buffer
In certain actions of the L1 cache controller, while creating an outgoing
message, the machine type was not being set. This results in a
segmentation fault when trace is collected. Joseph Pusudesris provided
his patch for fixing this issue.
Currently the wakeup function for the PerfectSwitch contains three loops -
loop on number of virtual networks
loop on number of incoming links
loop till all messages for this (link, network) have been routed
With an 8 processor mesh network and Hammer protocol, about 11-12% of the
was observed to have been spent in this function, which is the highest
amongst all the functions. It was found that the innermost loop is executed
about 45 times per invocation of the wakeup function, when each invocation
of the wakeup function processes just about one message.
The patch tries to do away with the redundant executions of the innermost
loop. Counters have been added for each virtual network that record the
number of messages that need to be routed for that virtual network. The
inner loops are only executed when the number of messages for that particular
virtual network > 0. This does away with almost 80% of the executions of the
innermost loop. The function now consumes about 5-6% of the total execution
time.
The patch changes the order in which L1 dcache and icache are looked up when
a request comes in. Earlier, if a request came in for instruction fetch, the
dcache was looked up before the icache, to correctly handle self-modifying
code. But, in the common case, dcache is going to report a miss and the
subsequent icache lookup is going to report a hit. Given the invariant -
caches under the same controller keep track of disjoint sets of cache blocks,
we can move the icache lookup before the dcache lookup. In case of a hit in
the icache, using our invariant, we know that the dcache would have reported
a miss. In case of a miss in the icache, we know that icache would have
missed even if the dcache was looked up before looking up the icache.
Effectively, we are doing the same thing as before, though in the common case,
we expect reduction in the number of lookups. This was empirically confirmed
for MOESI hammer. The ratio lookups to access requests is now about 1.1 to 1.
The TBE pointer in the MESI CMP implementation was not being set to NULL
when the TBE is deallocated. This resulted in segmentation fault on testing
the protocol when the ProtocolTrace was switched on.
The code for Orion 2.0 makes use of printf() at several places where there as
an error in configuration of the model. These have been replaced with fatal().
By stalling and waiting the mandatory queue instead of recycling it, one can
ensure that no incoming messages are starved when the mandatory queue puts
signficant of pressure on the L1 cache controller (i.e. the ruby memtester).
--HG--
rename : src/mem/slicc/ast/WakeUpDependentsStatementAST.py => src/mem/slicc/ast/WakeUpAllDependentsStatementAST.py
The packet now identifies whether static or dynamic data has been allocated and
is used by Ruby to determine whehter to copy the data pointer into the ruby
request. Subsequently, Ruby can be told not to update phys memory when
receiving packets.
Separate data VCs and ctrl VCs in garnet, as ctrl VCs have 1 buffer per VC,
while data VCs have > 1 buffers per VC. This is for correct power estimations.
The purpose of this patch is to change the way CacheMemory interfaces with
coherence protocols. Currently, whenever a cache controller (defined in the
protocol under consideration) needs to carry out any operation on a cache
block, it looks up the tag hash map and figures out whether or not the block
exists in the cache. In case it does exist, the operation is carried out
(which requires another lookup). As observed through profiling of different
protocols, multiple such lookups take place for a given cache block. It was
noted that the tag lookup takes anything from 10% to 20% of the simulation
time. In order to reduce this time, this patch is being posted.
I have to acknowledge that the many of the thoughts that went in to this
patch belong to Brad.
Changes to CacheMemory, TBETable and AbstractCacheEntry classes:
1. The lookup function belonging to CacheMemory class now returns a pointer
to a cache block entry, instead of a reference. The pointer is NULL in case
the block being looked up is not present in the cache. Similar change has
been carried out in the lookup function of the TBETable class.
2. Function for setting and getting access permission of a cache block have
been moved from CacheMemory class to AbstractCacheEntry class.
3. The allocate function in CacheMemory class now returns pointer to the
allocated cache entry.
Changes to SLICC:
1. Each action now has implicit variables - cache_entry and tbe. cache_entry,
if != NULL, must point to the cache entry for the address on which the action
is being carried out. Similarly, tbe should also point to the transaction
buffer entry of the address on which the action is being carried out.
2. If a cache entry or a transaction buffer entry is passed on as an
argument to a function, it is presumed that a pointer is being passed on.
3. The cache entry and the tbe pointers received __implicitly__ by the
actions, are passed __explicitly__ to the trigger function.
4. While performing an action, set/unset_cache_entry, set/unset_tbe are to
be used for setting / unsetting cache entry and tbe pointers respectively.
5. is_valid() and is_invalid() has been made available for testing whether
a given pointer 'is not NULL' and 'is NULL' respectively.
6. Local variables are now available, but they are assumed to be pointers
always.
7. It is now possible for an object of the derieved class to make calls to
a function defined in the interface.
8. An OOD token has been introduced in SLICC. It is same as the NULL token
used in C/C++. If you are wondering, OOD stands for Out Of Domain.
9. static_cast can now taken an optional parameter that asks for casting the
given variable to a pointer of the given type.
10. Functions can be annotated with 'return_by_pointer=yes' to return a
pointer.
11. StateMachine has two new variables, EntryType and TBEType. EntryType is
set to the type which inherits from 'AbstractCacheEntry'. There can only be
one such type in the machine. TBEType is set to the type for which 'TBE' is
used as the name.
All the protocols have been modified to conform with the new interface.
This patch changes the manner in which data is copied from L1 to L2 cache in
the implementation of the Hammer's cache coherence protocol. Earlier, data was
copied directly from one cache entry to another. This has been broken in to
two parts. First, the data is copied from the source cache entry to a
transaction buffer entry. Then, data is copied from the transaction buffer
entry to the destination cache entry.
This has been done to maintain the invariant - at any given instant, multiple
caches under a controller are exclusive with respect to each other.
Ran all the source files through 'perl -pi' with this script:
s|\s*(};?\s*)?/\*\s*(end\s*)?namespace\s*(\S+)\s*\*/(\s*})?|} // namespace $3|;
s|\s*};?\s*//\s*(end\s*)?namespace\s*(\S+)\s*|} // namespace $2\n|;
s|\s*};?\s*//\s*(\S+)\s*namespace\s*|} // namespace $1\n|;
Also did a little manual editing on some of the arch/*/isa_traits.hh files
and src/SConscript.
Two functions in src/mem/ruby/system/PerfectCacheMemory.hh, tryCacheAccess()
and cacheProbe(), end with calls to panic(). Both of these functions have
return type other than void. Any file that includes this header file fails
to compile because of the missing return statement. This patch adds dummy
values so as to avoid the compiler warnings.
This diff is for changing the way ASSERT is handled in Ruby. m5.fast
compiles out the assert statements by using the macro NDEBUG. Ruby uses the
macro RUBY_NO_ASSERT to do so. This macro has been removed and NDEBUG has
been put in its place.
These flags were being used to identify what alignment a request needed, but
the same information is available using the request size. This change also
eliminates the isMisaligned function. If more complicated alignment checks are
needed, they can be signaled using the ASI_BITS space in the flags vector like
is currently done with ARM.
If we write back an exclusive copy, we now mark it
as such, so the cache receiving the writeback can
mark its copy as exclusive. This avoids some
unnecessary upgrade requests when a cache later
tries to re-acquire exclusive access to the block.
Also move the "Fault" reference counted pointer type into a separate file,
sim/fault.hh. It would be better to name this less similarly to sim/faults.hh
to reduce confusion, but fault.hh matches the name of the type. We could change
Fault to FaultPtr to match other pointer types, and then changing the name of
the file would make more sense.
Corrects an oversight in cset f97b62be544f. The fix there only
failed queued SCUpgradeReq packets that encountered an
invalidation, which meant that the upgrade had to reach the L2
cache. To handle pending requests in the L1 we must similarly
fail StoreCondReq packets too.
Allow lower-level caches (e.g., L2 or L3) to pass exclusive
copies to higher levels (e.g., L1). This eliminates a lot
of unnecessary upgrade transactions on read-write sequences
to non-shared data.
Also some cleanup of MSHR coherence handling and multiple
bug fixes.
This patch allows messages to be stalled in their input buffers and wait
until a corresponding address changes state. In order to make this work,
all in_ports must be ranked in order of dependence and those in_ports that
may unblock an address, must wake up the stalled messages. Alot of this
complexity is handled in slicc and the specification files simply
annotate the in_ports.
--HG--
rename : src/mem/slicc/ast/CheckAllocateStatementAST.py => src/mem/slicc/ast/StallAndWaitStatementAST.py
rename : src/mem/slicc/ast/CheckAllocateStatementAST.py => src/mem/slicc/ast/WakeUpDependentsStatementAST.py
Patch allows each individual message buffer to have different recycle latencies
and allows the overall recycle latency to be specified at the cmd line. The
patch also adds profiling info to make sure no one processor's requests are
recycled too much.
The main purpose for clearing stats in the unserialize process is so
that the profiler can correctly set its start time to the unserialized
value of curTick.
This patch allows one to disable migratory sharing for those cache blocks that
are accessed by atomic requests. While the implementations are different
between the token and hammer protocols, the motivation is the same. For
Alpha, LLSC semantics expect that normal loads do not unlock cache blocks that
have been locked by LL accesses. Therefore, locked blocks should not transfer
write permissions when responding to these load requests. Instead, only they
only transfer read permissions so that the subsequent SC access can possibly
succeed.
This patch fixes several bugs related to previous inconsistent assumptions on
how many tokens the Owner had. Mike Marty should have fixes these bugs years
ago. :)
Previously, the MOESI_hammer protocol calculated the same latency for L1 and
L2 hits. This was because the protocol was written using the old ruby
assumption that L1 hits used the sequencer fast path. Since ruby no longer
uses the fast-path, the protocol delays L2 hits by placing them on the
trigger queue.
The previous slower ruby latencies created a mismatch between the faster M5
cpu models and the much slower ruby memory system. Specifically smp
interrupts were much slower and infrequent, as well as cpus moving in and out
of spin locks. The result was many cpus were idle for large periods of time.
These changes fix the latency mismatch.
This patch adds back to ruby the capability to understand the response time
for messages that hit in different levels of the cache heirarchy.
Specifically add support for the MI_example, MOESI_hammer, and MOESI_CMP_token
protocols.
This patch adds DMA testing to the Memtester and is inherits many changes from
Polina's old tester_dma_extension patch. Since Ruby does not work in atomic
mode, the atomic mode options are removed.
Clean up some minor things left over from the default responder
change in rev 9af6fb59752f. Mostly renaming the 'responder_set'
param to 'use_default_range' to actually reflect what it does...
old name wasn't that descriptive in the first place, but now
it really doesn't make sense at all.
Also got rid of the bogus obsolete assignment to 'bus.responder'
which used to be a parameter but now is interpreted as an
implicit child assignment, and which was giving me problems in
the config restructuring to come. (A good argument for not
allowing implicit child assignments, IMO, but that's water under
the bridge, I'm afraid.)
Also moved the Bus constructor to the .cc file since that's
where it should have been all along.
Requires new "SCUpgradeReq" message that marks upgrades
for store conditionals, so downstream caches can fail
these when they run into invalidations.
See http://www.m5sim.org/flyspray/task/197
Only set the dirty bit when we actually write to a block
(not if we thought we might but didn't, as in a failed
SC or CAS). This requires makeing sure the dirty bit
stays set when we get an exclusive (writable) copy
in a cache-to-cache transfer from another owner, which
n turn requires copying the mem-inhibit flag from
timing-mode requests to their associated responses.
One big difference is that PrioHeap puts the smallest element at the
top of the heap, whereas stl puts the largest element on top, so I
changed all comparisons so they did the right thing.
Some usage of PrioHeap was simply changed to a std::vector, using sort
at the right time, other usage had me just use the various heap functions
in the stl.
This was somewhat tricky because the RefCnt API was somewhat odd. The
biggest confusion was that the the RefCnt object's constructor that
took a TYPE& cloned the object. I created an explicit virtual clone()
function for things that took advantage of this version of the
constructor. I was conservative and used clone() when I was in doubt
of whether or not it was necessary. I still think that there are
probably too many instances of clone(), but hopefully not too many.
I converted several instances of const MsgPtr & to a simple MsgPtr.
If the function wants to avoid the overhead of creating another
reference, then it should just use a regular pointer instead of a ref
counting ptr.
There were a couple of instances where refcounted objects were created
on the stack. This seems pretty dangerous since if you ever
accidentally make a reference to that object with a ref counting
pointer, bad things are bound to happen.
Further cleanup should probably be done to make this class be non-Ruby
specific and put it in src/base.
There are probably several cases where this class is used, std::bitset
could be used instead.
In addition to obvious changes, this required a slight change to the slicc
grammar to allow types with :: in them. Otherwise slicc barfs on std::string
which we need for the headers that slicc generates.
Previously, the set size was set to 4. This was mostly do to the fact that a
crazy graduate student use to create networks with 256 l2 cache banks. Now it
is far more likely that users will create systems with less than 64 of any
particular controller type. Therefore Ruby should be optimized for a set size
of 1.
On the config end, if a shared L2 is created for the system, it is
parameterized to have n sharers as defined by option.num_cpus. In addition to
making the cache sharing aware so that discriminating tag policies can make use
of context_ids to make decisions, I added an occupancy AverageStat and an occ %
stat to each cache so that you could know which contexts are occupying how much
cache on average, both in terms of blocks and percentage. Note that since
devices have context_id -1, having an array of occ stats that correspond to
each context_id will break here, so in FS mode I add an extra bucket for device
blocks. This bucket is explicitly not added in SE mode in order to not only
avoid ugliness in the stats.txt file, but to avoid broken stats (some formulas
break when a bucket is 0).