The DTB expects the correct PC in the ThreadContext
but how if the memory accesses are speculative? Shouldn't
we send along the requestor's PC to the translate functions?
if a faulting instruction reaches an execution unit,
then ignore it and pass it through the pipeline.
Once we recognize the fault in the graduation unit,
dont allow a second fault to creep in on the same cycle.
Before graduating an instruction, explicitly check fault
by making the fault check it's own separate command
that can be put on an instruction schedule.
remove events in the resource pool that can be called from the CPU event, since the CPU
event is scheduled at the same time at the resource pool event.
----
Also, match the resPool event function names to the cpu event function names
----
only update BTB on a taken branch and update branch predictor w/pcstate from instruction
---
only pay attention to branch predictor updates if the the inst. is in fact a branch
formerly, this was implicit when you accessed the execution unit
or the use-def unit but it's better that this just be something
that a user can specify.
Architectures like SPARC need to read the window pointer
in order to figure out it's register dependence. However,
this may not get updated until after an instruction gets
executed, so now we lazily detect the register dependence
in the EXE stage (execution unit or use_def). This
makes sure we get the mapping after the most current change.
Add a few constants and functions that the InOrder model wants for SPARC.
* * *
sparc: add eaComp function
InOrder separates the address generation from the actual access so give
Sparc that functionality
* * *
sparc: add control flags for branches
branch predictors and other cpu model functions need to know specific information
about branches, so add the necessary flags here
Calculation of offset to copy from storeQueue[idx].data structure for load to
store forwarding fixed to be difference in bytes between store and load virtual
addresses. Previous method would induce bug where a load would index into
buffer at the wrong location.
If a split load fails on a blocked cache wbOutstanding can be decremented
twice if the first part of the split load succeeds and the second part fails.
Condition the decrementing on not having completed the first part of the load.
This patch fixes two problems with the O3 cpu model. The first is an issue
with an instruction fetch causing a fault on the next address while the
current macro-op is being issued. This happens when the micro-ops exceed
the fetch bandwdith and then on the next cycle the fetch stage attempts
to issue a request to the next line while it still has micro-ops to issue
if the next line faults a fault is attached to a micro-op in the currently
executing macro-op rather than a "nop" from the next instruction block.
This leads to an instruction incorrectly faulting when on fetch when
it had no reason to fault.
A similar problem occurs with interrupts. When an interrupt occurs the
fetch stage nominally stops issuing instructions immediately. This is incorrect
in the case of a macro-op as the current location might not be interruptable.
Debug flags are ExecUser, ExecKernel, and ExecAsid. ExecUser and
ExecKernel are set by default when Exec is specified. Use minus
sign with ExecUser or ExecKernel to remove user or kernel tracing
respectively.
Instructions that load an address and are control instructions can
execute down the wrong path if they were predicted correctly and then
instructions following them are squashed. If an instruction is a
memory and control op use the predicted address for the next PC instead
of just advancing the PC. Without this change NPC is used for the next
instruction, but predPC is used to verify that the branch was successful
so the wrong path is silently executed.
The network tester terminates after injecting for sim_cycles
(default=1000), instead of having to explicitly pass --maxticks from the
command line as before. If fixed_pkts is enabled, the tester only
injects maxpackets number of packets, else it keeps injecting till sim_cycles.
The tester also works with zero command line arguments now.
At the same time, rename the trace flags to debug flags since they
have broader usage than simply tracing. This means that
--trace-flags is now --debug-flags and --trace-help is now --debug-help
This change fixes a small bug in the arm copyRegs() code where some registers
wouldn't be copied if the processor was in a mode other than MODE_USER.
Additionally, this change simplifies the way the O3 switchCpu code works by
utilizing TheISA::copyRegs() to copy the required context information
rather than the adhoc copying that goes on in the CPU model. The current code
makes assumptions about the visibility of int and float registers that aren't
true for all architectures in FS mode.
The comment in the code suggests that the checking granularity should be 16
bytes, however in reality the shift by 8 is 256 bytes which seems much
larger than required.
***
(1): get rid of expandForMT function
MIPS is the only ISA that cares about having a piece of ISA state integrate
multiple threads so add constants for MIPS and relieve the other ISAs from having
to define this. Also, InOrder was the only core that was actively calling
this function
* * *
(2): get rid of corespecific type
The CoreSpecific type was used as a proxy to pass in HW specific params to
a MIPS CPU, but since MIPS FS hasnt been touched for awhile, it makes sense
to not force every other ISA to use CoreSpecific as well use a special
reset function to set it. That probably should go in a PowerOn reset fault
anyway.
The tester code is in testers/networktest.
The tester can be invoked by configs/example/ruby_network_test.py.
A dummy coherence protocol called Network_test is also addded for network-only simulations and testing. The protocol takes in messages from the tester and just pushes them into the network in the appropriate vnet, without storing any state.
This change speeds up booting, especially in MP cases, by not executing
udelay() on the core but instead skipping ahead tha amount of time that is being
delayed.
This change fixes the problem for all the cases we actively use. If you want to try
more creative I/O device attachments (E.g. sharing an L2), this won't work. You
would need another level of caching between the I/O device and the cache
(which you actually need anyway with our current code to make sure writes
propagate). This is required so that you can mark the cache in between as
top level and it won't try to send ownership of a block to the I/O device.
Asserts have been added that should catch any issues.
Without this change the a store can be issued to the cache multiple times.
If this case occurs when the l1 cache is out of mshrs (and thus blocked)
the processor will never make forward progress because each cycle it will
send a single request using the recently freed mshr and not completing the
multipart store. This will continue forever.
There may not be a formally correct spelling for the past tense of mmap, but
mmapped is the spelling Google doesn't try to autocorrect. This makes sense
because it mirrors the past tense of map->mapped and not the past tense of
cape->caped.
--HG--
rename : src/arch/alpha/mmaped_ipr.hh => src/arch/alpha/mmapped_ipr.hh
rename : src/arch/arm/mmaped_ipr.hh => src/arch/arm/mmapped_ipr.hh
rename : src/arch/mips/mmaped_ipr.hh => src/arch/mips/mmapped_ipr.hh
rename : src/arch/power/mmaped_ipr.hh => src/arch/power/mmapped_ipr.hh
rename : src/arch/sparc/mmaped_ipr.hh => src/arch/sparc/mmapped_ipr.hh
rename : src/arch/x86/mmaped_ipr.hh => src/arch/x86/mmapped_ipr.hh
This patch changes DataBlock.hh so that it is not dependent on RubySystem.
This dependence seems unecessary. All those functions that depende on
RubySystem have been moved to DataBlock.cc file.
Because int and not InstSeqNum was used in a couple of places, you can
overflow the int type and thus get wierd bugs when the sequence number
is negative (or some wierd value)
remove constructors that werent being used (it just gets confusing)
use initialization list for all the variables instead of relying on initVars()
function
-use a pointer to CacheReqPacket instead of PacketPtr so correct destructors
get called on packet deletion
- make sure to delete the packet if the cache blocks the sendTiming request
or for some reason we dont use the packet
- dont overwrite memory requests since in the worst case an instruction will
be replaying a request so no need to keep allocating a new request
- we dont use retryPkt so delete it
- fetch code was split out already, so just assert that this is a memory
reference inst. and that the staticInst is available
If there is an outstanding table walk and no other activity in the CPU
it can go to sleep and never wake up. This change makes the instruction
queue always active if the CPU is waiting for a store to translate.
If Gabe changes the way this code works then the below should be removed
as indicated by the todo.
keep track of when an instruction needs the execution
behind it to be serialized. Without this, in SE Mode
instructions can execute behind a system call exit().
resources don't need to call getLatency because the latency is already a member
in the class. If there is some type of special case where different instructions
impose a different latency inside a resource then we can revisit this and
add getLatency() back in
each resource has a certain # of requests it can take per cycle. update the #s here
to be more realistic based off of the pipeline width and if the resource needs to
be accessed on multiple cycles
---
need to delete the cache request's data on clearRequest() now that we are recycling
requests
---
fetch unit needs to deallocate the fetch buffer blocks when they are replaced or
squashed.
formerly, to free up bandwidth in a resource, we could just change the pointer in that resource
but at the same time the pipeline stages had visibility to see what happened to a resource request.
Now that we are recycling these requests (to avoid too much dynamic allocation), we can't throw
away the request too early or the pipeline stage gets bad information. Instead, mark when a request
is done with the resource all together and then let the pipeline stage call back to the resource
that it's time to free up the bandwidth for more instructions
*** inteface notes ***
- When an instruction completes and is done in a resource for that cycle, call done()
- When an instruction fails and is done with a resource for that cycle, call done(false)
- When an instruction completes, but isnt finished with a resource, call completed()
- When an instruction fails, but isnt finished with a resource, call completed(false)
* * *
inorder: tlbmiss wakeup bug fix
take away all instances of reqMap in the code and make all references use the built-in
request vectors inside of each resource. The request map was dynamically allocating
a request per instruction. The request vector just allocates N number of requests
during instantiation and then the surrounding code is fixed up to reuse those N requests
***
setRequest() and clearRequest() are the new accessors needed to define a new
request in a resource
we are going to be getting away from creating new resource requests for every
instruction so no more need to keep track of a reqRemoveList and clean it up
every tick
first change in an optimization that will stop InOrder from allocating new memory for every instruction's
request to a resource. This gets expensive since every instruction needs to access ~10 requests before
graduation. Instead, the plan is to allocate just enough resource request objects to satisfy each resource's
bandwidth (e.g. the execution unit would need to allocate 3 resource request objects for a 1-issue pipeline
since on any given cycle it could have 2 read requests and 1 write request) and then let the instructions
contend and reuse those allocated requests. The end result is a smaller memory footprint for the InOrder model
and increased simulation performance
resource skeds are divided into two parts: front end (all insts) and back end (inst. specific)
each of those are implemented as separate lists, so this iterator wraps around
the traditional list iterator so that an instruction can walk it's schedule but seamlessly
transfer from front end to back end when necessary
add a stage scheduler class to replace InstStage in pipeline_traits.cc
use that class to define a default front-end, resource schedule that all
instructions will follow. This will also replace the back end schedule in
pipeline_traits.cc. The reason for adding this is so that we can cache
instruction schedules in the future instead of calling the same function
over/over again as well as constantly dynamically alllocating memory on
every instruction to try to figure out it's schedule
When a table walk is initiated by the fetch stage, the CPU can
potentially move to the idle state and never wake up.
The fetch stage must call cpu->wakeCPU() when a translation completes
(in finishTranslation()).
This change fixes an issue where a DTLB fault occurs and redirects fetch to
handle the fault and the ITLB requires a walk which delays translation. In this
case the status of the cpu isn't updated appropriately, and an additional
instruction fetch occurs. Eventually this hits an assert as multiple instruction
fetches are occuring in the system and when the second one returns the
processor is in the wrong state.
Some asserts below are removed because it was always true (typo) and the state
after the initiateAcc() the processor could be in any valid state when a
d-side fault occurs.
Some ISAs (like ARM) relies on hardware page table walkers. For those ISAs,
when a TLB miss occurs, initiateTranslation() can return with NoFault but with
the translation unfinished.
Instructions experiencing a delayed translation due to a hardware page table
walk are deferred until the translation completes and kept into the IQ. In
order to keep track of them, the IQ has been augmented with a queue of the
outstanding delayed memory instructions. When their translation completes,
instructions are re-executed (only their initiateAccess() was already
executed; their DTB translation is now skipped). The IEW stage has been
modified to support such a 2-pass execution.
In sendSplitData, keep a pointer to the senderState that may be updated after
the call to handle*Packet. This way, if the receiver updates the packet
senderState, it can still be accessed in sendSplitData.
Maintain all information about an instruction's fault in the DynInst object rather
than any cpu-request object. Also, if there is a fault during the execution stage
then just save the fault inside the instruction and trap once the instruction
tries to graduate
Give fetch unit it's own parameterizable fetch buffer to read from. Very inefficient
(architecturally and in simulation) to continually fetch at the granularity of the
wordsize. As expected, the number of fetch memory requests drops dramatically
instead of having one cache-unit class be responsible for both data and code
accesses, separate code that is just for fetch in it's own derived class off the
original base class. This makes the code easier to manage as well as handle
future cases of special fetch handling
allow the user to specify how many instructions a pipeline stage can process
on any given cycle (stageWidth...i.e.bandwidth) by setting the parameter through
the python interface rather than compile the code after changing the *.cc file.
(we always had the parameter there, but still used the static 'ThePipeline::StageWidth'
instead)
-
Since StageWidth is now dynamically defined, change the interstage communication
structure to use a vector and get rid of array and array handling index (toNextStageIndex)
since we can just make calls to the list for the same information
use skidbuffer as only location for instructions between stages. before,
we had the insts queue from the prior stage and the skidbuffer for the
current stage, but that gets confusing and this consolidation helps
when handling squash cases
manage insertion and deletion like a queue but will need
access to internal elements for future changes
Currently, skidbuffer manages any instruction that was
in a stage but could not complete processing, however
we will want to manage all blocked instructions (from prev stage
and from cur. stage) in just one buffer.
Previous code was marking CPU activity on almost every cycle due to a bug in
tracking the status of pipeline stages. This disables the CPU from sleeping
on long latency stalls and increases simulation time
This makes sure that the address ranges requested for caches and uncached ports
don't conflict with each other, and that accesses which are always uncached
(message signaled interrupts for instance) don't waste time passing through
caches.