Commit graph

145 commits

Author SHA1 Message Date
Korey Sewell
470aa289da inorder: clean up the old way of inst. scheduling
remove remnants of old way of instruction scheduling which dynamically allocated
a new resource schedule for every instruction
2011-02-12 10:14:48 -05:00
Korey Sewell
e26aee514d inorder: utilize cached skeds in pipeline
allow the pipeline and resources to use the cached instruction schedule and resource
sked iterator
2011-02-12 10:14:45 -05:00
Korey Sewell
516b611462 inorder: define iterator for resource schedules
resource skeds are divided into two parts: front end (all insts) and back end (inst. specific)
each of those are implemented as separate lists,  so this iterator wraps around
the traditional list iterator so that an instruction can walk it's schedule but seamlessly
transfer from front end to back end when necessary
2011-02-12 10:14:43 -05:00
Korey Sewell
ec9b2ec251 inorder: stage scheduler for front/back end schedule creation
add a stage scheduler class to replace InstStage in pipeline_traits.cc
use that class to define a default front-end, resource schedule that all
instructions will follow. This will also replace the back end schedule in
pipeline_traits.cc. The reason for adding this is so that we can cache
instruction schedules in the future instead of calling the same function
over/over again as well as constantly dynamically alllocating memory on
every instruction to try to figure out it's schedule
2011-02-12 10:14:40 -05:00
Korey Sewell
6713dbfe08 inorder: cache instruction schedules
first step in a optimization to not dynamically allocate an instruction schedule
for every instruction but rather used cached schedules
2011-02-12 10:14:36 -05:00
Korey Sewell
af67631790 inorder: comments for resource sked class 2011-02-12 10:14:34 -05:00
Korey Sewell
800e93f358 inorder: remove unused file
inst_buffer file isn't used , so remove it
2011-02-12 10:14:32 -05:00
Korey Sewell
e396a34b01 inorder: fault handling
Maintain all information about an instruction's fault in the DynInst object rather
than any cpu-request object. Also, if there is a fault during the execution stage
then just save the fault inside the instruction and trap once the instruction
tries to graduate
2011-02-04 00:09:20 -05:00
Korey Sewell
e57613588b inorder: pcstate and delay slots bug
not taken delay slots were not being advanced correctly to pc+8, so for those ISAs
we 'advance()' the pcstate one more time for the desired effect
2011-02-04 00:09:19 -05:00
Korey Sewell
68d962f8af inorder: add a fetch buffer to fetch unit
Give fetch unit it's own parameterizable fetch buffer to read from. Very inefficient
(architecturally and in simulation) to continually fetch at the granularity of the
wordsize. As expected, the number of fetch memory requests drops dramatically
2011-02-04 00:08:22 -05:00
Korey Sewell
56ce8acd41 inorder: overload find-req fn
no need to have separate function name findSplitRequest, just overload the function
2011-02-04 00:08:21 -05:00
Korey Sewell
ab3d37d398 inorder: implement separate fetch unit
instead of having one cache-unit class be responsible for both data and code
accesses, separate code that is just for fetch in it's own derived class off the
original base class. This makes the code easier to manage as well as handle
future cases of special fetch handling
2011-02-04 00:08:20 -05:00
Korey Sewell
f80508de65 inorder: cache port blocking
set the request to false when the cache port blocks so we dont deadlock.
also, comment out the outstanding address list sanity check for now.
2011-02-04 00:08:19 -05:00
Korey Sewell
0c6a679359 inorder: stage width as a python parameter
allow the user to specify how many instructions a pipeline stage can process
on any given cycle (stageWidth...i.e.bandwidth) by setting the parameter through
the python interface rather than compile the code after changing the *.cc file.
(we always had the parameter there, but still used the static 'ThePipeline::StageWidth'
instead)
-
Since StageWidth is now dynamically defined, change the interstage communication
structure to use a vector and get rid of array and array handling index (toNextStageIndex)
since we can just make calls to the list for the same information
2011-02-04 00:08:18 -05:00
Korey Sewell
8ac717ef4c inorder: multi-issue branch resolution
Only execute (resolve) one branch per cycle because handling more than one is
a little more complicated
2011-02-04 00:08:17 -05:00
Korey Sewell
be17617990 inorder: pipe. stage inst. buffering
use skidbuffer as only location for instructions between stages. before,
we had the insts queue from the prior stage and the skidbuffer for the
current stage, but that gets confusing and this consolidation helps
when handling squash cases
2011-02-04 00:08:16 -05:00
Korey Sewell
050944dd73 inorder: change skidBuffer to list instead of queue
manage insertion and deletion like a queue but will need
access to internal elements for future changes
Currently, skidbuffer manages any instruction that was
in a stage but could not complete processing, however
we will want to manage all blocked instructions (from prev stage
and from cur. stage) in just one buffer.
2011-02-04 00:08:15 -05:00
Korey Sewell
7f937e11e2 inorder: activity tracking bug
Previous code was marking CPU activity on almost every cycle due to a bug in
tracking the status of pipeline stages. This disables the CPU from sleeping
on long latency stalls and increases simulation time
2011-02-04 00:08:13 -05:00
Gabe Black
00f24ae92c Config: Keep track of uncached and cached ports separately.
This makes sure that the address ranges requested for caches and uncached ports
don't conflict with each other, and that accesses which are always uncached
(message signaled interrupts for instance) don't waste time passing through
caches.
2011-02-03 20:23:00 -08:00
Korey Sewell
cd5a7f7221 inorder: fix RUBY_FS build
the current code was using incorrect dummy instruction in interrupts function
2011-01-12 11:52:29 -05:00
Steve Reinhardt
6f1187943c Replace curTick global variable with accessor functions.
This step makes it easy to replace the accessor functions
(which still access a global variable) with ones that access
per-thread curTick values.
2011-01-07 21:50:29 -08:00
Steve Reinhardt
d60c293bbc inorder: replace schedEvent() code with reschedule().
There were several copies of similar functions that looked
like they all replicated reschedule(), so I replaced them
with direct calls.  Keeping this separate from the previous
cset since there may be some subtle functional differences
if the code ever reschedules an event that is scheduled but
not squashed (though none were detected in the regressions).
2011-01-07 21:50:29 -08:00
Steve Reinhardt
214cc0fafc inorder: get rid of references to mainEventQueue.
Events need to be scheduled on the queue assigned
to the SimObject, not on the global queue (which
should be going away).
Also cleaned up a number of redundant expressions
that made the code unnecessarily verbose.
2011-01-07 21:50:29 -08:00
Steve Reinhardt
89cf3f6e85 Move sched_list.hh and timebuf.hh from src/base to src/cpu.
These files really aren't general enough to belong in src/base.
This patch doesn't reorder include lines, leaving them unsorted
in many cases, but Nate's magic script will fix that up shortly.

--HG--
rename : src/base/sched_list.hh => src/cpu/sched_list.hh
rename : src/base/timebuf.hh => src/cpu/timebuf.hh
2011-01-03 14:35:47 -08:00
Steve Reinhardt
c69d48f007 Make commenting on close namespace brackets consistent.
Ran all the source files through 'perl -pi' with this script:

s|\s*(};?\s*)?/\*\s*(end\s*)?namespace\s*(\S+)\s*\*/(\s*})?|} // namespace $3|;
s|\s*};?\s*//\s*(end\s*)?namespace\s*(\S+)\s*|} // namespace $2\n|;
s|\s*};?\s*//\s*(\S+)\s*namespace\s*|} // namespace $1\n|;

Also did a little manual editing on some of the arch/*/isa_traits.hh files
and src/SConscript.
2011-01-03 14:35:43 -08:00
Gabe Black
672d6a4b98 Style: Replace some tabs with spaces. 2010-12-20 16:24:40 -05:00
Giacomo Gabrielli
719f9a6d4f O3: Make all instructions that write a misc. register not perform the write until commit.
ARM instructions updating cumulative flags (ARM FP exceptions and saturation
flags) are not serialized.

Added aliases for ARM FP exceptions and saturation flags in FPSCR.  Removed
write accesses to the FP condition codes for most ARM VFP instructions: only
VCMP and VCMPE instructions update the FP condition codes.  Removed a potential
cause of seg. faults in the O3 model for NEON memory macro-ops (ARM).
2010-12-07 16:19:57 -08:00
Ali Saidi
cdacbe734a ARM/Alpha/Cpu: Change prefetchs to be more like normal loads.
This change modifies the way prefetches work. They are now like normal loads
that don't writeback a register. Previously prefetches were supposed to call
prefetch() on the exection context, so they executed with execute() methods
instead of initiateAcc() completeAcc(). The prefetch() methods for all the CPUs
are blank, meaning that they get executed, but don't actually do anything.

On Alpha dead cache copy code was removed and prefetches are now normal ops.
They count as executed operations, but still don't do anything and IsMemRef is
not longer set on them.

On ARM IsDataPrefetch or IsInstructionPreftech is now set on all prefetch
instructions. The timing simple CPU doesn't try to do anything special for
prefetches now and they execute with the normal memory code path.
2010-11-08 13:58:22 -06:00
Gabe Black
6f4bd2c1da ISA,CPU,etc: Create an ISA defined PC type that abstracts out ISA behaviors.
This change is a low level and pervasive reorganization of how PCs are managed
in M5. Back when Alpha was the only ISA, there were only 2 PCs to worry about,
the PC and the NPC, and the lsb of the PC signaled whether or not you were in
PAL mode. As other ISAs were added, we had to add an NNPC, micro PC and next
micropc, x86 and ARM introduced variable length instruction sets, and ARM
started to keep track of mode bits in the PC. Each CPU model handled PCs in
its own custom way that needed to be updated individually to handle the new
dimensions of variability, or, in the case of ARMs mode-bit-in-the-pc hack,
the complexity could be hidden in the ISA at the ISA implementation's expense.
Areas like the branch predictor hadn't been updated to handle branch delay
slots or micropcs, and it turns out that had introduced a significant (10s of
percent) performance bug in SPARC and to a lesser extend MIPS. Rather than
perpetuate the problem by reworking O3 again to handle the PC features needed
by x86, this change was introduced to rework PC handling in a more modular,
transparent, and hopefully efficient way.


PC type:

Rather than having the superset of all possible elements of PC state declared
in each of the CPU models, each ISA defines its own PCState type which has
exactly the elements it needs. A cross product of canned PCState classes are
defined in the new "generic" ISA directory for ISAs with/without delay slots
and microcode. These are either typedef-ed or subclassed by each ISA. To read
or write this structure through a *Context, you use the new pcState() accessor
which reads or writes depending on whether it has an argument. If you just
want the address of the current or next instruction or the current micro PC,
you can get those through read-only accessors on either the PCState type or
the *Contexts. These are instAddr(), nextInstAddr(), and microPC(). Note the
move away from readPC. That name is ambiguous since it's not clear whether or
not it should be the actual address to fetch from, or if it should have extra
bits in it like the PAL mode bit. Each class is free to define its own
functions to get at whatever values it needs however it needs to to be used in
ISA specific code. Eventually Alpha's PAL mode bit could be moved out of the
PC and into a separate field like ARM.

These types can be reset to a particular pc (where npc = pc +
sizeof(MachInst), nnpc = npc + sizeof(MachInst), upc = 0, nupc = 1 as
appropriate), printed, serialized, and compared. There is a branching()
function which encapsulates code in the CPU models that checked if an
instruction branched or not. Exactly what that means in the context of branch
delay slots which can skip an instruction when not taken is ambiguous, and
ideally this function and its uses can be eliminated. PCStates also generally
know how to advance themselves in various ways depending on if they point at
an instruction, a microop, or the last microop of a macroop. More on that
later.

Ideally, accessing all the PCs at once when setting them will improve
performance of M5 even though more data needs to be moved around. This is
because often all the PCs need to be manipulated together, and by getting them
all at once you avoid multiple function calls. Also, the PCs of a particular
thread will have spatial locality in the cache. Previously they were grouped
by element in arrays which spread out accesses.


Advancing the PC:

The PCs were previously managed entirely by the CPU which had to know about PC
semantics, try to figure out which dimension to increment the PC in, what to
set NPC/NNPC, etc. These decisions are best left to the ISA in conjunction
with the PC type itself. Because most of the information about how to
increment the PC (mainly what type of instruction it refers to) is contained
in the instruction object, a new advancePC virtual function was added to the
StaticInst class. Subclasses provide an implementation that moves around the
right element of the PC with a minimal amount of decision making. In ISAs like
Alpha, the instructions always simply assign NPC to PC without having to worry
about micropcs, nnpcs, etc. The added cost of a virtual function call should
be outweighed by not having to figure out as much about what to do with the
PCs and mucking around with the extra elements.

One drawback of making the StaticInsts advance the PC is that you have to
actually have one to advance the PC. This would, superficially, seem to
require decoding an instruction before fetch could advance. This is, as far as
I can tell, realistic. fetch would advance through memory addresses, not PCs,
perhaps predicting new memory addresses using existing ones. More
sophisticated decisions about control flow would be made later on, after the
instruction was decoded, and handed back to fetch. If branching needs to
happen, some amount of decoding needs to happen to see that it's a branch,
what the target is, etc. This could get a little more complicated if that gets
done by the predecoder, but I'm choosing to ignore that for now.


Variable length instructions:

To handle variable length instructions in x86 and ARM, the predecoder now
takes in the current PC by reference to the getExtMachInst function. It can
modify the PC however it needs to (by setting NPC to be the PC + instruction
length, for instance). This could be improved since the CPU doesn't know if
the PC was modified and always has to write it back.


ISA parser:

To support the new API, all PC related operand types were removed from the
parser and replaced with a PCState type. There are two warts on this
implementation. First, as with all the other operand types, the PCState still
has to have a valid operand type even though it doesn't use it. Second, using
syntax like PCS.npc(target) doesn't work for two reasons, this looks like the
syntax for operand type overriding, and the parser can't figure out if you're
reading or writing. Instructions that use the PCS operand (which I've
consistently called it) need to first read it into a local variable,
manipulate it, and then write it back out.


Return address stack:

The return address stack needed a little extra help because, in the presence
of branch delay slots, it has to merge together elements of the return PC and
the call PC. To handle that, a buildRetPC utility function was added. There
are basically only two versions in all the ISAs, but it didn't seem short
enough to put into the generic ISA directory. Also, the branch predictor code
in O3 and InOrder were adjusted so that they always store the PC of the actual
call instruction in the RAS, not the next PC. If the call instruction is a
microop, the next PC refers to the next microop in the same macroop which is
probably not desirable. The buildRetPC function advances the PC intelligently
to the next macroop (in an ISA specific way) so that that case works.


Change in stats:

There were no change in stats except in MIPS and SPARC in the O3 model. MIPS
runs in about 9% fewer ticks. SPARC runs with 30%-50% fewer ticks, which could
likely be improved further by setting call/return instruction flags and taking
advantage of the RAS.


TODO:

Add != operators to the PCState classes, defined trivially to be !(a==b).
Smooth out places where PCs are split apart, passed around, and put back
together later. I think this might happen in SPARC's fault code. Add ISA
specific constructors that allow setting PC elements without calling a bunch
of accessors. Try to eliminate the need for the branching() function. Factor
out Alpha's PAL mode pc bit into a separate flag field, and eliminate places
where it's blindly masked out or tested in the PC.
2010-10-31 00:07:20 -07:00
Gabe Black
ab8d7eee76 CPU: Fix O3 and possible InOrder segfaults in FS. 2010-09-20 02:46:42 -07:00
Gabe Black
8f3fbd2d13 CPU: Get rid of the now unnecessary getInst/setInst family of functions.
This code is no longer needed because of the preceeding change which adds a
StaticInstPtr parameter to the fault's invoke method, obviating the only use
for this pair of functions.
2010-09-13 21:58:34 -07:00
Gabe Black
6833ca7eed Faults: Pass the StaticInst involved, if any, to a Fault's invoke method.
Also move the "Fault" reference counted pointer type into a separate file,
sim/fault.hh. It would be better to name this less similarly to sim/faults.hh
to reduce confusion, but fault.hh matches the name of the type. We could change
Fault to FaultPtr to match other pointer types, and then changing the name of
the file would make more sense.
2010-09-13 19:26:03 -07:00
Gabe Black
c4ba6967a5 Inorder: Fix compilation of m5.fast.
printMemData is only used in DPRINTFs. If those are removed by compiling
m5.fast, that function is unused, gcc generates a warning, that gets turned
into an error, and the build fails. This change surrounds the function
definition with #if TRACING_ON so it only gets compiled in if the DPRINTFs do
to.
2010-08-14 01:00:45 -07:00
Gabe Black
aa8c6e9c95 CPU: Add readBytes and writeBytes functions to the exec contexts. 2010-08-13 06:16:02 -07:00
Gabe Black
65dbcc6ea1 InOrder: Clean up some DPRINTFs that print data sent to/from the cache. 2010-08-13 06:16:00 -07:00
Korey Sewell
84489c5874 inorder: remove another debug stat 2010-06-28 07:33:33 -04:00
Korey Sewell
792c18a1fc inorder: remove debugging stat
m5 doesnt do stats specific to binary and this resource request stat is probably only
useful for people who really know the ins/outs of the model anyway
2010-06-26 09:41:39 -04:00
Korey Sewell
868181f24d inorder: Return Address Stack bug
the nextPC was getting sent to the branch predictor not the current PC, so
the RAS was returning the wrong PC and mispredicting everything.
2010-06-25 17:42:35 -04:00
Korey Sewell
6bfd766f2c inorder: resource scheduling backend
replace priority queue with vector of lists(1 list per stage) and place inside a class
so that we have more control of when an instruction uses a particular schedule entry
...
also, this is the 1st step toward making the InOrderCPU fully parameterizable. See the
wiki for details on this process
2010-06-25 17:42:34 -04:00
Korey Sewell
71b67d408b inorder: cleanup virtual functions
remove the annotation 'virtual' from  function declaration that isnt being derived from
2010-06-24 15:34:19 -04:00
Korey Sewell
f95430d97e inorder: enforce 78-character rule 2010-06-24 15:34:12 -04:00
Korey Sewell
ecba3074c2 inorder: exe_unit_stats for resolved branches 2010-06-24 13:58:27 -04:00
Korey Sewell
1a73764403 inorder: squash from memory stall
this applies to multithreading models which would like to squash a thread on memory stall
2010-06-23 22:09:49 -04:00
Korey Sewell
1f778b3583 inorder: record load/store trace data 2010-06-23 18:21:12 -04:00
Korey Sewell
defab3ffd5 inorder: update branch predictor
- use InOrderBPred instead of Resource for DPRINTFs
- account for DELAY SLOT in updating RAS and in squashing
- don't let squashed instructions update the predictor
- the BTB needs to use the ASID not the TID to work for multithreaded programs
- add stats for BTB hits
2010-06-23 18:19:18 -04:00
Korey Sewell
9f0d8f252c inorder-stats: add instruction type stats
also, remove inst-req stats as default.good for debugging
but in terms of pure processor stats they aren't useful
2010-06-23 18:18:20 -04:00
Korey Sewell
39ac4dce04 inorder: stall signal handling
remove stall only when necessary
add debugging printfs
2010-06-23 18:15:23 -04:00
Korey Sewell
7695d4c63f inorder: tick scheduling
use nextCycle to calculate ticks after addition
2010-06-23 18:14:59 -04:00
Korey Sewell
b49511ae48 inorder: timing for inst forwarding
when insts execute, they mark the time they finish to be used for subsequent isnts
they may need forwarding of data. However, the regdepmap was using the wrong
value to index into the destination operands of the instruction to be forwarded.
Thus, in some cases, we are checking to see if the 3rd destination register
for an instruction is executed at a certain time, when there is only 1 dest. register
valid. Thus, we get a bad, uninitialized time value that will stall forwarding
causing performance loss but still the correct execution.
2010-04-10 23:31:36 -04:00
Korey Sewell
1c98bc5a56 m5: merge inorder updates 2010-03-27 02:23:00 -04:00