This single parameter replaces the collection of bools that set up various
flavors of microops. A flag parameter also allows other flags to be set like
the serialize before/after flags, etc., without having to change the
constructor.
Since miscellaneous registers bypass wakeup logic, force serialization
to resolve data dependencies through them
* * *
ARM: adding non-speculative/serialize flags for instructions change CPSR
THis allows the CPU to handle predicated-false instructions accordingly.
This particular patch makes loads that are predicated-false to be sent
straight to the commit stage directly, not waiting for return of the data
that was never requested since it was predicated-false.
This allows one two different OS requirements for the same ISA to be handled.
Some OSes are compiled for a virtual address and need to be loaded into physical
memory that starts at address 0, while other bare metal tools generate
images that start at address 0.
This was being done in read(), but if readBytes was called directly it
wouldn't happen. Also, instead of setting the memory blob being read to -1
which would (I believe) require using memset with -1 as a parameter, this now
uses bzero. It's hoped that it's more specialized behavior will make it
slightly faster.
bkt size isn't evenly divisible by max-min and it would round down,
it's possible to sample a distribution and have no place to put the sample.
When this case occured the simulator would assert.
This patch allows messages to be stalled in their input buffers and wait
until a corresponding address changes state. In order to make this work,
all in_ports must be ranked in order of dependence and those in_ports that
may unblock an address, must wake up the stalled messages. Alot of this
complexity is handled in slicc and the specification files simply
annotate the in_ports.
--HG--
rename : src/mem/slicc/ast/CheckAllocateStatementAST.py => src/mem/slicc/ast/StallAndWaitStatementAST.py
rename : src/mem/slicc/ast/CheckAllocateStatementAST.py => src/mem/slicc/ast/WakeUpDependentsStatementAST.py
Patch allows each individual message buffer to have different recycle latencies
and allows the overall recycle latency to be specified at the cmd line. The
patch also adds profiling info to make sure no one processor's requests are
recycled too much.
The main purpose for clearing stats in the unserialize process is so
that the profiler can correctly set its start time to the unserialized
value of curTick.
This patch allows one to disable migratory sharing for those cache blocks that
are accessed by atomic requests. While the implementations are different
between the token and hammer protocols, the motivation is the same. For
Alpha, LLSC semantics expect that normal loads do not unlock cache blocks that
have been locked by LL accesses. Therefore, locked blocks should not transfer
write permissions when responding to these load requests. Instead, only they
only transfer read permissions so that the subsequent SC access can possibly
succeed.
Added drain functions to the RTC and 8254 timer so that periodic interrupts
stop when the system is draining. This patch is needed to checkpoint in
timing mode. Otherwise under certain situations, the event queue will never
be completely empty.
This patch fixes several bugs related to previous inconsistent assumptions on
how many tokens the Owner had. Mike Marty should have fixes these bugs years
ago. :)
Previously, the MOESI_hammer protocol calculated the same latency for L1 and
L2 hits. This was because the protocol was written using the old ruby
assumption that L1 hits used the sequencer fast path. Since ruby no longer
uses the fast-path, the protocol delays L2 hits by placing them on the
trigger queue.
The previous slower ruby latencies created a mismatch between the faster M5
cpu models and the much slower ruby memory system. Specifically smp
interrupts were much slower and infrequent, as well as cpus moving in and out
of spin locks. The result was many cpus were idle for large periods of time.
These changes fix the latency mismatch.
This patch adds back to ruby the capability to understand the response time
for messages that hit in different levels of the cache heirarchy.
Specifically add support for the MI_example, MOESI_hammer, and MOESI_CMP_token
protocols.
This patch adds DMA testing to the Memtester and is inherits many changes from
Polina's old tester_dma_extension patch. Since Ruby does not work in atomic
mode, the atomic mode options are removed.
Replace direct call to unserialize() on each SimObject with a pair of
calls for better control over initialization in both ckpt and non-ckpt
cases.
If restoring from a checkpoint, loadState(ckpt) is called on each
SimObject. The default implementation simply calls unserialize() if
there is a corresponding checkpoint section, so we get backward
compatibility for existing objects. However, objects can override
loadState() to get other behaviors, e.g., doing other programmed
initializations after unserialize(), or complaining if no checkpoint
section is found. (Note that the default warning for a missing
checkpoint section is now gone.)
If not restoring from a checkpoint, we call the new initState() method
on each SimObject instead. This provides a hook for state
initializations that are only required when *not* restoring from a
checkpoint.
Given this new framework, do some cleanup of LiveProcess subclasses
and X86System, which were (in some cases) emulating initState()
behavior in startup via a local flag or (in other cases) erroneously
doing initializations in startup() that clobbered state loaded earlier
by unserialize().
The separate restoreCheckpoint() call is gone; just pass
the checkpoint dir as an optional arg to instantiate().
This change is a precursor to some more extensive
reworking of the startup code.
The old code for handling SimObject children was kind of messy,
with children stored both in _values and _children, and
inconsistent and potentially buggy handling of SimObject
vectors. Now children are always stored in _children, and
SimObject vectors are consistently handled using the
SimObjectVector class.
Also, by deferring the parenting of SimObject-valued parameters
until the end (instead of doing it at assignment), we eliminate
the hole where one could assign a vector of SimObjects to a
parameter then append to that vector, with the appended objects
never getting parented properly.
This patch induces small stats changes in tests with data races
due to changes in the object creation & initialization order.
The new code does object vectors in order and so should be more
stable.
Orphan SimObjects (not in the config hierarchy) could get
created implicitly if they have a port connection to a SimObject
that is in the hierarchy. This means that there are objects on
the C++ SimObject list (created via the C++ SimObject
constructor call) that are unknown to Python and will get
skipped if we walk the hierarchy from the Python side (as we are
about to do). This patch detects this situation and prints an
error message.
Also fix the rubytester config script which happened to rely on
this behavior.
Enforce that the Python Root SimObject is instantiated only
once. The C++ Root object already panics if more than one is
created. This change avoids the need to track what the root
object is, since it's available from Root.getInstance() (if it
exists). It's now redundant to have the user pass the root
object to functions like instantiate(), checkpoint(), and
restoreCheckpoint(), so that arg is gone. Users who use
configs/common/Simulate.py should not notice.
Clean up some minor things left over from the default responder
change in rev 9af6fb59752f. Mostly renaming the 'responder_set'
param to 'use_default_range' to actually reflect what it does...
old name wasn't that descriptive in the first place, but now
it really doesn't make sense at all.
Also got rid of the bogus obsolete assignment to 'bus.responder'
which used to be a parameter but now is interpreted as an
implicit child assignment, and which was giving me problems in
the config restructuring to come. (A good argument for not
allowing implicit child assignments, IMO, but that's water under
the bridge, I'm afraid.)
Also moved the Bus constructor to the .cc file since that's
where it should have been all along.
printMemData is only used in DPRINTFs. If those are removed by compiling
m5.fast, that function is unused, gcc generates a warning, that gets turned
into an error, and the build fails. This change surrounds the function
definition with #if TRACING_ON so it only gets compiled in if the DPRINTFs do
to.
When a request is NO_ACCESS (x86 CDA microinstruction), the memory op
doesn't go to the cache, so TimingSimpleCPU::completeDataAccess needs
to handle the case where the current status of the CPU is Running
and not DcacheWaitResponse or DTBWaitResponse
switching between O3 and another CPU, O3's tick event might still be scheduled
in the event queue (as squashed). Therefore, check for a squashed tick event
as well as a non-scheduled event when taking over from another CPU and deal
with it accordingly.
It would be nice if python had a tree class that would do this for real,
but since we don't, we'll just keep a sorted list of keys and update
it on demand.
If the user sets the environment variable M5_OVERRIDE_PY_SOURCE to
True, then imports that would normally find python code compiled into
the executable will instead first check in the absolute location where
the code was found during the build of the executable. This only
works for files in the src (or extras) directories, not automatically
generated files.
This is a developer feature!
This tidbit was pulled from a larger patch for Tim's sake, so
the comment reflects functions that haven't been exported yet.
I hope to commit them soon so it didn't seem worth cleaning up.
m5 doesnt do stats specific to binary and this resource request stat is probably only
useful for people who really know the ins/outs of the model anyway
replace priority queue with vector of lists(1 list per stage) and place inside a class
so that we have more control of when an instruction uses a particular schedule entry
...
also, this is the 1st step toward making the InOrderCPU fully parameterizable. See the
wiki for details on this process
- use InOrderBPred instead of Resource for DPRINTFs
- account for DELAY SLOT in updating RAS and in squashing
- don't let squashed instructions update the predictor
- the BTB needs to use the ASID not the TID to work for multithreaded programs
- add stats for BTB hits
Requires new "SCUpgradeReq" message that marks upgrades
for store conditionals, so downstream caches can fail
these when they run into invalidations.
See http://www.m5sim.org/flyspray/task/197
Only set the dirty bit when we actually write to a block
(not if we thought we might but didn't, as in a failed
SC or CAS). This requires makeing sure the dirty bit
stays set when we get an exclusive (writable) copy
in a cache-to-cache transfer from another owner, which
n turn requires copying the mem-inhibit flag from
timing-mode requests to their associated responses.
One big difference is that PrioHeap puts the smallest element at the
top of the heap, whereas stl puts the largest element on top, so I
changed all comparisons so they did the right thing.
Some usage of PrioHeap was simply changed to a std::vector, using sort
at the right time, other usage had me just use the various heap functions
in the stl.
This was somewhat tricky because the RefCnt API was somewhat odd. The
biggest confusion was that the the RefCnt object's constructor that
took a TYPE& cloned the object. I created an explicit virtual clone()
function for things that took advantage of this version of the
constructor. I was conservative and used clone() when I was in doubt
of whether or not it was necessary. I still think that there are
probably too many instances of clone(), but hopefully not too many.
I converted several instances of const MsgPtr & to a simple MsgPtr.
If the function wants to avoid the overhead of creating another
reference, then it should just use a regular pointer instead of a ref
counting ptr.
There were a couple of instances where refcounted objects were created
on the stack. This seems pretty dangerous since if you ever
accidentally make a reference to that object with a ref counting
pointer, bad things are bound to happen.
Expand the help text on the --remote-gdb-port option so
people know you can use it to disable remote gdb without
reading the source code, and thus don't waste any time
trying to add a separate option to do that.
Clean up some gdb-related cruft I found while looking
for where one would add a gdb disable option, before
I found the comment that told me that I didn't need
to do that.
Spec2k benchmarks seem to run with atomic or timing mode simple
CPUs. Fixed up some constants, handling of 64 bit arguments,
and marked a few more syscalls ignoreFunc.
This will help keep the high level decode together and not have it spread into
the subordinate decode stuff. The ##include lines still need to be on a line
by themselves, though.
There were four bugs in these instructions. First, the loaded value was being
stored into a floating point register as floating point, changing the value as
it was transfered. Second, the meaning of the "up" bit had been reversed.
Third, the statically sized microop array wasn't bit enough for all possible
inputs. It's now dynamically sized and should always be big enough. Fourth,
the offset was stored as an unsigned 8 bit value. Negative offsets would look
like moderately large positive offsets.
These enter and leave thumbEE mode. Currently thumbEE mode behaves exactly the
same as Thumb mode, but at least this will make it -look- like we're enter and
leaving it. The actual behavioral changes will be implemented in future
changes.
This register will always report 0 caches as implemented. It's not clear how
to find out how many there really are when dealing with an arbitrary
hierarchy.
This register controls access to the coprocessors. This doesn't actually
implement it, it allows writes which don't turn anything off. In other words,
it allows the simulated program to ask for what it already has.
This register is supposed to "Clean and invalidate data or unified cache line
by set/way." Since there isn't a good way to do that, we'll just ignore these
and warn about it.
This change moves the writeback of load multiple instructions to the beginning
of the macroop. That way, the MicroLdrRetUop that changes the mode will
necessarily happen later, ensuring the writeback happens in the original mode.
The actual value in the base register if it also shows up in the register list
is undefined, so it's fine if it gets clobbered by one of the loads. For
stores where the base register is the lowest numbered in the register list,
the original value should be written back. That means stores can't write back
at the beginning, but the mode changing problem doesn't affect them so they
can continue to write back at the end.
Instead of panic immediately when these instructions are executed, an
UndefinedInstruction fault is returned. In FS mode (not currently
implemented), this is the fault that should, to my knowledge, be triggered in
these situations and should be handled using the normal architected
mechanisms. In SE mode, the fault causes a panic when it's invoked that gives
the same information as the instruction did. When/if support for speculative
execution of ARM is supported, this will allow a mispeculated and unrecognized
and/or unimplemented instruction from causing a panic. Only once the
instruction is going to be committed will the fault be invoked, triggering the
panic.
Shifting to the right of a signed value when the MSB is one is technically
undefined behavior, even though in my experience it's done the "right thing"
and sign extended the value. This replaces the arithmetic right shift code in
ARM that uses that coincidence with some code that relies on bit math.
This allows the templates to all be available at the same time before any of
the formats, etc. This breaks an artificial circular dependence.
--HG--
rename : src/arch/arm/isa/formats/pred.isa => src/arch/arm/isa/templates/pred.isa
This isn't technically correct since the .w should only be added if there are
32 and 16 bit encodings, but always having it always is better than never
having it.
Further cleanup should probably be done to make this class be non-Ruby
specific and put it in src/base.
There are probably several cases where this class is used, std::bitset
could be used instead.
Suppose the saturating counters of a branch predictor contain n bits. When the
counter is between 0 and (2^(n-1) - 1), boundaries included, the branch is
predicted as not taken. When the counter is between 2^(n-1) and (2^n - 1),
boundaries included, the branch is predicted as taken.
Time from base/time.hh has a name clash with Time from Ruby's
TypeDefines.hh. Eventually Ruby's Time should go away, so instead of
fixing this properly just try to avoid the clash.
When doing an unsigned 64 bit division with a divisor that has its most
significant bit set, the division code would spill a bit off of the end of a
uint64_t trying to shift the dividend into position. This change adds code
that handles that case specially by purposefully letting it spill and then
going ahead assuming there was a 65th one bit.
This causes builds to happen in sorted order rather than in
declaration order. This gets annoying when you make a global change
and then you notice that the files that are being compiled are jumping
around the directory hierarchy.
when insts execute, they mark the time they finish to be used for subsequent isnts
they may need forwarding of data. However, the regdepmap was using the wrong
value to index into the destination operands of the instruction to be forwarded.
Thus, in some cases, we are checking to see if the 3rd destination register
for an instruction is executed at a certain time, when there is only 1 dest. register
valid. Thus, we get a bad, uninitialized time value that will stall forwarding
causing performance loss but still the correct execution.
- Make the initialized flag always available, not just in debug mode.
- Make the Initialized flag actually use several bits so it is very
unlikely that something that's uninitialized accidentally looks
initialized.
- Add an initialized() function that tells you if the current event is
indeed initialized.
- Clear the flags on delete so it can't be accidentally thought of as
initialized.
- Fix getFlags assert statement. "How did this ever work?"
Symbolic names should still be used, but this makes it easier to do
things like:
Event::Priority MyObject_Pri = Event::Default_Pri + 1
Remember that higher numbers are lower priority (should we fix this?)
In addition to obvious changes, this required a slight change to the slicc
grammar to allow types with :: in them. Otherwise slicc barfs on std::string
which we need for the headers that slicc generates.
make sure to only read 1 src reg. for write-hint and any other similar
'store' instruction. Reading the source reg when its not necessary
can cause the simulator to read from uninitialized values
These recordEvent() calls could cause crashes since they
access the req pointer after it's potentially been
deleted during a failed translation call. (Similar
problem to the traceData bug fixed in the previous cset.)
Moving them above the translation call (as was done
recentlyi in cset 8b2b8e5e7d35) avoids the crash
but doesn't work, since at that point we don't know if
the access is uncached or not.
It's not clear why these calls are there, and no one
seems to use them, so we'll just delete them. If they
are needed, they should be moved to somewhere that's
guaranteed to be after the translation completes but
before the request is possibly deleted, e.g., in
finishTranslation().
Accessing traceData (to call setAddress() and/or setData())
after initiating a timing translation was causing crashes,
since a failed translation could delete the traceData
object before returning.
It turns out that there was never a need to access traceData
after initiating the translation, as the traced data was
always available earlier; this ordering was merely
historical. Furthermore, traceData->setAddress() and
traceData->setData() were being called both from the CPU
model and the ISA definition, often redundantly.
This patch standardizes all setAddress and setData calls
for memory instructions to be in the CPU models and not
in the ISA definition. It also moves those calls above
the translation calls to eliminate the crashes.
Previously, the set size was set to 4. This was mostly do to the fact that a
crazy graduate student use to create networks with 256 l2 cache banks. Now it
is far more likely that users will create systems with less than 64 of any
particular controller type. Therefore Ruby should be optimized for a set size
of 1.
On the config end, if a shared L2 is created for the system, it is
parameterized to have n sharers as defined by option.num_cpus. In addition to
making the cache sharing aware so that discriminating tag policies can make use
of context_ids to make decisions, I added an occupancy AverageStat and an occ %
stat to each cache so that you could know which contexts are occupying how much
cache on average, both in terms of blocks and percentage. Note that since
devices have context_id -1, having an array of occ stats that correspond to
each context_id will break here, so in FS mode I add an extra bucket for device
blocks. This bucket is explicitly not added in SE mode in order to not only
avoid ugliness in the stats.txt file, but to avoid broken stats (some formulas
break when a bucket is 0).
Also, make Formulas work on AverageVector. First, Stat::Average (and thus
Stats::AverageVector) was broken when coming out of a checkpoint and on resets,
this fixes that. Formulas also didn't work with AverageVector, but added
support for that.
When implementing timing address translations instead of atomic, I
forgot to preserve the faults that are returned from the read and
write calls. This patch reinstates them.
When each load or store is sent to the LSQ, we check whether it will cross a
cache line boundary and, if so, split it in two. This creates two TLB
translations and two memory requests. Care has to be taken if the first
packet of a split load is sent but the second blocks the cache. Similarly,
for a store, if the first packet cannot be sent, we must store the second
one somewhere to retry later.
This modifies the LSQSenderState class to record both packets in a split
load or store.
Finally, a new const variable, HasUnalignedMemAcc, is added to each ISA
to indicate whether unaligned memory accesses are allowed. This is used
throughout the changed code so that compiler can optimise away code dealing
with split requests for ISAs that don't need them.
This initiates a timing translation and passes the read or write on to the
processor before waiting for it to finish. Once the translation is finished,
the instruction's state is updated via the 'finish' function. A new
DataTranslation class is created to handle this.
The idea is taken from the implementation of timing translations in
TimingSimpleCPU by Gabe Black. This patch also separates out the timing
translations from this CPU and uses the new DataTranslation class.
- on certain retry requests you can get an assertion failure
- fix by allowing the request to literally "Retry" itself
if it wasnt successful before, and then block any requests
through cache port while waiting for the cache to be
made available for access
when threads are switching in/out the CPU, we need to keep
track of special cases like branches. Add appropriate
variables in ThreadState t track this and then use
these variables when updating pc after context switch
this will be used for when a thread comes back from a cache miss, it needs to update the PCs
because the inst might of been a branch or delayslot in which the next PC isnt always
a straight addition
allow a thread to wakeup and be activated after
it has been in suspended state and another
thread is switched out. Need to give
pipeline stages a "activateThread" function
so that can get to their suspended instruction
when the time is right.
give resources their own specific
activity to do for a "suspend" event
instead of defaulting to deactivating the thread for a
suspend thread event. This really matters
for the fetch sequence unit which wants to remove the
thread from fetching while other units want to
ignore a thread suspension. If you deactivate a thread
in a resource then you may lose some of the allotted
bandwidth that the thread is taking up...
update/add in the use of isThreadReady & isThreadSuspended
functions.Check in activateThread what list a thread is
on so it can be managed accordingly.
-Support ability to activate next ready thread after a cache miss
through the activateNextReadyContext/Thread() functions
-To support this a "readyList" of thread ids is added
-After a cache miss, thread will suspend and then call
activitynextreadythread
allow for events to schedule themselves later if desired. this is important
because of cases like where you need to activate a thread only after the previous
thread has been deactivated. The ordering there has to be enforced
add code to recognize memory stalls in resources and the pipeline as well
as squash a thread if there is a stall and we are in the switch on cache miss
model
add buffer for instructions to switch out to in a pipeline stage
can't squash the instruction and remove the pipeline so we kind of need
to 'suspend' an instruction at the stage while the memory stall resolves
for the switch on cache miss model
- loads were happening on same cycle as the address was generated which is slightly
unrealistic. Instead, force address generation to be on separate cycle from load
initiation
- also, mark the stages in a more traditional way (F-D-X-M-W)
This patch includes the necessary regression updates to test the new ruby
configuration system. The patch includes support for multiple ruby protocols
and adds the ruby random tester. The patch removes atomic mode test for
ruby since ruby does not support atomic mode acceses. These tests can be
added back in when ruby supports atomic mode for real.
--HG--
rename : tests/quick/50.memtest/test.py => tests/quick/60.rubytest/test.py
Removed the dummy power function implementations so that Orion can implement
them correctly. Since Orion lacks modular design, this patch simply enables
scons to compile it. There are no python configuration changes in this patch.
Renamed the MESI directory file to be consistent with all other protocols.
--HG--
rename : src/mem/protocol/MESI_CMP_directory-mem.sm => src/mem/protocol/MESI_CMP_directory-dir.sm
Cleaned up the ruby profilers by moving the memory controller profiling code
out of the main profiler object and into a separate object similar to the
current CacheProfiler. Both the CacheProfiler and MemCntrlProfiler are
specific to a particular Ruby object, CacheMemory and MemoryControl
respectively. Therefore, these profilers should not be SimObjects and
created by the python configuration system, but instead private objects. This
simplifies the creation of these profilers.
Reorganized ruby python configuration so that protocol and ruby memory system
configuration code can be shared by multiple front-end configuration files
(i.e. memory tester, full system, and hopefully the regression tester). This
code works for memory tester, but have not tested fs mode.
Modified ruby's tracing support to no longer rely on the RubySystem map
to convert a sequencer string name to a sequencer pointer. As a
temporary solution, the code uses the sim_object find function.
Eventually, we should develop a better fix.
This patch includes a rather substantial change to the memory controller
profiler in order to work with the new configuration system. Most
noteably, the mem_cntrl_profiler no longer uses a string map, but instead
a vector. Eventually this support should be removed from the main
profiler and go into a separate object. Each memory controller should have
a pointer to that new mem_cntrl profile object.
This patch includes the necessary changes to connect ruby objects using
the python configuration system. Mainly it consists of removing
unnecessary ruby object pointers and connecting the necessary object
pointers using the generated param objects. This patch includes the
slicc changes necessary to connect generated ruby objects together using
the python configuraiton system.
The necessary companion conversion of Ruby objects generated by SLICC
are converted to M5 SimObjects in the following patch, so this patch
alone does not compile.
Conversion of Garnet network models is also handled in a separate
patch; that code is temporarily disabled from compiling to allow
testing of interim code.
Though OutPort's message type is not used to generate code, this fix checks
that the programmer's intent is correct. Eventually, we may want to
remove the message type from the OutPort declaration statement.
1) Move alpha-specific code out of page_table.cc:serialize().
2) Begin serializing M5_pid and unserializing it, but adding an function to do optional paramIn so that old checkpoints don't need to be fixed up.
3) Fix up alpha startup code so that the unserialized M5_pid value is properly written to DTB_IPR_ASN.
4) Fix the memory unserialize that I forgot somehow in the last changeset.
5) Add in an agg_se.py to handle aggregated checkpoints. --bench foo-bar plus positional arguments foo bar are the only changes in usage from se.py.
Note this aggregation stuff has only been tested for Alpha and nothing else, though it should take a very minimal amount of work to get it to work with another ISA.
This patch changes the way that Ruby handles atomic RMW instructions. This implementation, unlike the prior one, is protocol independent. It works by locking an address from the sequencer immediately after the read portion of an RMW completes. When that address is locked, the coherence controller will only satisfy requests coming from one port (e.g., the mandatory queue) and will ignore all others. After the write portion completed, the line is unlocked. This should also work with multi-line atomics, as long as the blocks are always acquired in the same order.