It was technically possible but clumsy to determine what endianness a guest
was configured with using the state in byteswap.hh. This change makes that
information available more directly.
Also get rid of unused (and mildly redundant) ByteOrderDiffers constant.
The decoder now checks the value of FULL_SYSTEM in a switch statement to
decide whether to return a real syscall instruction or one that triggers
syscall emulation (or a panic in FS mode). The switch statement should devolve
into an if, and also should be optimized out since it's based on constant
input.
In FS mode the syscall function will panic, but the interface will be
consistent and code which calls syscall can be compiled in. This will allow,
for instance, instructions that use syscall to be built unconditionally but
then not returned by the decoder.
Some DPRINTFs were printing uninitalized values because the DPRINTFs were
always being printed even when the features they were printing weren't
being used. This change moves the DPRINTFs into the appropriate if blocks
and initializes the state variables correctly.
There also is a case where the offset into the packet could be calculated
incorrectly during a DMA that is fixed.
Check that we're not currently writing back an address the prefetcher is trying
to prefetch before issuing it. We previously checked the mshrQueue and the cache
itself, but forgot to check the writeBuffer. This fixes a memory corrucption
issue with an L2 prefetcher.
Only create a memory ordering violation when the value could have changed
between two subsequent loads, instead of just when loads go out-of-order
to the same address. While not very common in the case of Alpha, with
an architecture with a hardware table walker this can happen reasonably
frequently beacuse a translation will miss and start a table walk and
before the CPU re-schedules the faulting instruction another one will
pass it to the same address (or cache block depending on the dendency
checking).
This patch has been tested with a couple of self-checking hand crafted
programs to stress ordering between two cores.
The performance improvement on SPEC benchmarks can be substantial (2-10%).
So a mips-cross-gdb can connect with gem5(MIPS_SE), and do some remote
debugging.
Testing:
Build gem5 for MIPS_SE and make gem5 wait at beginning:
modify "rgdb_wait = -1" to "rgdb_wait = 0" in src/sim/system.cc;
scons build/MIPS_SE/gem5.opt CPU_MODELS=O3CPU
----
Build GDB-7.3 mips-cross:
./configure --target=mips-linux-gnu --prefix=xxx/gdb-7.3-install/
make
make install
----
Run:
./build/MIPS_SE/gem5.opt configs/example/se.py --detailed --caches
./mips-linux-gnu-gdb xxx/gem5/tests/test-progs/hello/bin/mips/linux/hello
(gdb) target remote :7000
(gdb) info registers
(gdb) disassemble
(gdb) si
(gdb) break main
(gdb) c
(gdb) quit
Testing done.
Having two StaticInst classes, one nominally ISA dependent and the other ISA
dependent, has not been historically useful and makes the StaticInst class
more complicated that it needs to be. This change merges StaticInstBase into
StaticInst.
This change pulls the instruction decoding machinery (including caches) out of
the StaticInst class and puts it into its own class. This has a few intrinsic
benefits. First, the StaticInst code, which has gotten to be quite large, gets
simpler. Second, the code that handles decode caching is now separated out
into its own component and can be looked at in isolation, making it easier to
understand. I took the opportunity to restructure the code a bit which will
hopefully also help.
Beyond that, this change also lays some ground work for each ISA to have its
own, potentially stateful decode object. We'd be able to include less
contextualizing information in the ExtMachInst objects since that context
would be applied at the decoder. Also, the decoder could "know" ahead of time
that all the instructions it's going to see are going to be, for instance, 64
bit mode, and it will have one less thing to check when it decodes them.
Because the decode caching mechanism has been separated out, it's now possible
to have multiple caches which correspond to different types of decoding
context. Having one cache for each element of the cross product of different
configurations may become prohibitive, so it may be desirable to clear out the
cache when relatively static state changes and not to have one for each
setting.
Because the decode function is no longer universally accessible as a static
member of the StaticInst class, a new function was added to the ThreadContexts
that returns the applicable decode object.
Do some minor cleanup of some recently added comments, a warning, and change
other instances of stack extension to be like what's now being done for x86.
The way flag bits were being set for microops in x86 ended up implicitly
calling the bitset constructor which was truncating flags beyond the width of
an unsigned long. This change sets the bits in chunks which are always small
enough to avoid being truncated. On 64 bit machines this should reduce to be
the same as before, and on 32 bit machines it should work properly and not be
unreasonably inefficient.
When an instruction is translated in the x86 TLB, a variable called
delayedResponse is passed back and forth which tracks whether a translation
could be completed immediately, or if there's going to be callback that will
finish things up. If a read was to the internal memory space, memory mapped
registers used to implement things like MSRs, the function hadn't yet gotten
to where delayedResponse was set to false, it's default. That meant that the
value was never set, and the TLB could start waiting for a callback that would
never come. This change simply moves the assignment to above where control
can divert to translateInt().
Nothing big here, but when you have an address that is not in the page table request to be allocated, if it falls outside of the maximum stack range all you get is a page fault and you don't know why. Add a little warn() to explain it a bit. Also add some comments and alter logic a little so that you don't totally ignore the return value of checkAndAllocNextPage().
Even though the code is safe, compiler flags a warning here, which are treated as errors for fast/opt. I know it's redundant but it has no side effects and fixes the compile.
In the current implementation of Functional Accesses, it's very hard to
implement broadcast or snooping protocols where the memory has no idea if it
has exclusive access to a cache block or not. Without this knowledge, making
sure the RW vs. RO permissions are right are next to impossible. So we add a
new state called Backing_Store to enable the conveyance that this is the backup
storage for a block, so that it can be written if it is the only possibly RW
block in the system, or written even if there is another RW block in the
system, without causing problems.
Also, a small change to actually set the m_name field for each Controller so
that debugging can be easier. Now you can access a controller's name just by
controller->getName().
There are a set of locations is the linux kernel that are managed via
cache maintence instructions until all processors enable their MMUs & TLBs.
Writes to these locations are manually flushed from the cache to main
memory when the occur so that cores operating without their MMU enabled
and only issuing uncached accesses can receive the correct data. Unfortuantely,
gem5 doesn't support any kind of software directed maintence of the cache.
Until such time as that support exists this patch marks the specific cache blocks
that need to be coherent as non-cacheable until all CPUs enable their MMU and
thus allows gem5 to boot MP systems with caches enabled (a requirement for
booting an O3 cpu and thus an O3 CPU regression).
The driver can read the IDE config register as a 32 bit register since
some adapters use bit 18 as a disable channel bit. If the size isn't
set in a PRD it should be 64K according to the SPEC (and driver) not
128K.
SEV instructions were originally implemented to cause asynchronous squashes
via the generateTCSquash() function in the O3 pipeline when updating the
SEV_MAILBOX miscReg. This caused race conditions between CPUs in an MP system
that would lead to a pipeline either going inactive indefinitely or not being
able to commit squashed instructions. Fixed SEV instructions to behave like
interrupts and cause synchronous sqaushes inside the pipeline, eliminating
the race conditions. Also fixed up the semantics of the WFE instruction to
behave as documented in the ARMv7 ISA description to not sleep if SEV_MAILBOX=1
or unmasked interrupts are pending.
Two issues are fixed in this patch:
1. The load and store pc passed to the predictor are passed in reverse order.
2. The flag indicating that a barrier is inflight was never cleared when
the barrier was squashed instead of committed. This made all load insts
dependent on a non-existent barrier in-flight.
Change the way instructions are squashed on memory ordering violations
to squash the violator and younger instructions, not all instructions
that are younger than the instruction they violated (no reason to throw
away valid work).
Cortex-A9 processors can have a local timer and watchdog counter. It
is enabled by default in Linux and up to this point we've had to disable
them since a model wasn't available. This change allows a default
MP ARM Linux configuration to boot.
Control register operands are set up so that writing to them is serialize
after, serialize before, and non-speculative. These are probably overboard,
but they should usually be safe. Unfortunately there are times when even these
aren't enough. If an instruction modifies state that affects fetch, later
serialized instructions which come after it might have already gone through
fetch and decode by the time it commits. These instructions may have been
translated incorrectly or interpretted incorrectly and need to be destroyed.
This change modifies instructions which will or may have this behavior so that
they use the IsSquashAfter flag when necessary.
It's possible (though until now very unlikely) for fetchAddr to get out of
sync with the actual PC of the current instruction. This change forcefull
resets fetchAddr at the end of every instruction.
Until now, the only reason a macroop would be left was because it ended at a
microop marked as the last microop. In O3 with branch prediction, it's
possible for the branch predictor to have entries which originally came from
different instructions which happened to have the same RIP. This could
theoretically happen in many ways, but it was encountered specifically when
different programs in different address spaces ran one after the other in
X86_FS.
What would happen in that case was that the macroop would continue to be
looped over and microops fetched from it until it reached the last microop
even though the macropc had moved out from under it. If things lined up
properly, this could mean that the end bytes of an instruction actually fell
into the instruction sized block of memory after the one in the predecoder.
The fetch loop implicitly assumes that the last instruction sized chunk of
memory processed was the last one needed for the instruction it just finished
executing. It would then tell the predecoder to move to an offset within the
bytes it was given that is larger than those bytes, and that would trip an
assert in the x86 predecoder.
This change fixes this problem by making fetch stop processing the current
macroop if the address it should be fetching from changed when the PC is
updated. That happens when the last microop was reached because the instruction
handled it properly, and it also catches the case where the branch predictor
makes fetch do a macro level branch when it shouldn't.
The check of isLastMicroop is retained because otherwise, a macroop that
branches back to itself would act like a single, long macroop instead of
multiple instances of the same microop. There may be situations (which may
turn out to be purely hypothetical) where that matters.
This also fixes a relatively minor issue where the curMacroop variable would
be set to NULL immediately after seeing that a microop was the last one before
curMacroop was used to build the dyninst. The traceData structure would have a
NULL pointer to the macroop for that microop.
Before this change, the commit stage would wait until the ROB and store queue
were empty before recognizing an interrupt. The fetch stage would stop
generating instructions at an appropriate point, so commit would then wait
until a valid time to interrupt the instruction stream. Instructions might be
in flight after fetch but not the in the ROB or store queue (in rename, for
instance), so this change makes commit wait until all in flight instructions
are finished.
This patch replaces RUBY with PROTOCOL in all the SConscript files as
the environment variable that decides whether or not certain components
of the simulator are compiled.
This constructor assumes that the ExtMachInst can be decoded directly into a
StaticInst that's useful to execute. With the advent of microcoded
instructions that's no longer true.
This patch drops RUBY as a compile time option. Instead the PROTOCOL option
is used to figure out whether or not to build Ruby. If the specified protocol
is 'None', then Ruby is not compiled.
When fetching from the microcode ROM, if the PC is set so that it isn't in the
cache block that's been fetched the CPU will get stuck. The fetch stage
notices that it's in the ROM so it doesn't try to fetch from the current PC.
It then later notices that it's outside of the current cache block so it skips
generating instructions expecting to continue once the right bytes have been
fetched. This change lets the fetch stage attempt to generate instructions,
and only checks if the bytes it's going to use are valid if it's really going
to use them.
Currently, functions associated with a controller go into separate files.
This patch puts all the functions in the controller's .cc file. This should
hopefully take away some time from compilation.
Prefetch requests issued from the L2 or below wouldn't check if valid data is
present higher in the system. If a prefetch into the L2 occured at the same
time as writeback from a higher-level cache the dirty data could be replaced
in by unmodified data in memory.
Implemented a pipeline activity viewer as a python script (util/o3-pipeview.py)
and modified O3 code base to support an extra trace flag (O3PipeView) for
generating traces to be used as inputs by the tool.
SWP and SWPB now throw an undefined instruction exception if
SCTLR.SW == 0. This also required the MIDR to be changed
slightly so programs can correctly determine that gem5 supports
the ARM v7 behavior of SWP/SWPB (in ARM v6, SWP/SWPB were
deprecated, but not disabled at CPU startup).
Adds MISCREG_ID_MMFR2 and removes break on access to MISCREG_CLIDR. Both
registers now return values that are consistent with current ARM
implementations.
This patch implements the copyRegs() function for the x86 architecture.
The patch assumes that no side effects other than TLB invalidation need
to be considered while copying the registers. This may not hold true in
future.
Branch predictor could not predict a branch in a nested loop because:
1. The global history was not updated after a mispredict squash.
2. The global history was updated in the fetch stage. The choice predictors
that were updated used the changed global history. This is incorrect, as
it incorporates the state of global history after the branch in
encountered. Fixed update to choice predictor using the global history
state before the branch happened.
3. The global predictor table was also updated using the global history state
before the branch happened as above.
Additionally, parameters to initialize ctr and history size were reversed.
Fixed up the patch from Yasuko Watanabe that enabled pipelining of fetch accessess to
icache to work with recent changes to main repository.
Also added in ability for fetch stage to delay issuing the fault carrying
nop when a pipeline fetch causes a fault and no fetch bandwidth is available
until the next cycle.
change hwrei back to being a non-control instruction so O3-FS mode will work
add squash in inorder that will catch a hwrei (or any other genric instruction)
that isnt a control inst but changes the PC. Additional testing still needs to be done
for inorder-FS mode but this change will free O3 development back up in the interim
This makes it possible to use the grammar multiple times and use the multiple
instances concurrently. This makes implementing an include statement as part
of a grammar possible.
This change simplifies the code surrounding operand type handling and makes it
depend only on the ctype that goes with each operand type. Future changes will
allow defining operand types by their ctypes directly, convert the ISAs over
to that style of definition, and then remove support for the old style. These
changes are to make it easier to use non-builtin types like classes or
structures as the type for operands.
Addition of functional access support to Ruby necessitated some changes to
the way coherence protocols are written. I had forgotten to update the
Network_test protocol. This patch makes those updates.
readBytes and writeBytes had the word "bytes" in their names because they
accessed blobs of bytes. This distinguished them from the read and write
functions which handled higher level data types. Because those functions don't
exist any more, this change renames readBytes and writeBytes to more general
names, readMem and writeMem, which reflect the fact that they are how you read
and write memory. This also makes their names more consistent with the
register reading/writing functions, although those are still read and set for
some reason.
This patch rpovides functional access support in Ruby. Currently only
the M5Port of RubyPort supports functional accesses. The support for
functional through the PioPort will be added as a separate patch.
The DTB expects the correct PC in the ThreadContext
but how if the memory accesses are speculative? Shouldn't
we send along the requestor's PC to the translate functions?
if a faulting instruction reaches an execution unit,
then ignore it and pass it through the pipeline.
Once we recognize the fault in the graduation unit,
dont allow a second fault to creep in on the same cycle.
Before graduating an instruction, explicitly check fault
by making the fault check it's own separate command
that can be put on an instruction schedule.
this always changes the PC and is basically an impromptu branch instruction. why
not speculate on this instead of always be forced to mispredict/squash after the
hwrei gets resolved?
The InOrder model needs this marked as "isControl" so it knows to update the PC
after the ALU executes it. If this isnt marked as control, then it's going to
force the model to check the PC of every instruction at commit (what O3 does?),
and that would be a wasteful check for a very high percentage of instructions.
remove events in the resource pool that can be called from the CPU event, since the CPU
event is scheduled at the same time at the resource pool event.
----
Also, match the resPool event function names to the cpu event function names
----
only update BTB on a taken branch and update branch predictor w/pcstate from instruction
---
only pay attention to branch predictor updates if the the inst. is in fact a branch
formerly, this was implicit when you accessed the execution unit
or the use-def unit but it's better that this just be something
that a user can specify.
Architectures like SPARC need to read the window pointer
in order to figure out it's register dependence. However,
this may not get updated until after an instruction gets
executed, so now we lazily detect the register dependence
in the EXE stage (execution unit or use_def). This
makes sure we get the mapping after the most current change.
Instead of clearing the entire TLB on initialization and flush, the code was
clearing only one element. This patch corrects the memsets in the init and
flush routines.
The code for Set class was written under the assumption that
std::numeric_limits<long>::digits returns the number of bits used for
data type long, which was presumed to be either 32 or 64. But return value
is actually one less, that is, it is either 31 or 63. The value is now
being incremented by 1 so as to correctly set it.
If there's a problem when reading the section names from a supposed ELF file,
this change makes gem5 print an error message as returned by libelf and die.
Previously these sorts of errors would make gem5 segfault when it tried to
access the section name through a NULL pointer.
this flag is only used for early branch resolution in the O3 model (of pc-relative branches)
but this isnt cleanly working even when the branch target code is added for sparc. For now,
we'll ignore this optimization and add a todo in the SPARC ISA for future developers
Add a few constants and functions that the InOrder model wants for SPARC.
* * *
sparc: add eaComp function
InOrder separates the address generation from the actual access so give
Sparc that functionality
* * *
sparc: add control flags for branches
branch predictors and other cpu model functions need to know specific information
about branches, so add the necessary flags here
The access permissions for the directory entries are not being set correctly.
This is because pointers are not used for handling directory entries.
function. get and set functions for access permissions have been added to the
Controller state machine. The changePermission() function provided by the
AbstractEntry and AbstractCacheEntry classes has been exposed to SLICC
code once again. The set_permission() functionality has been removed.
NOTE: Each protocol will have to define these get and set functions in order
to compile successfully.
The regular expressions matching filenames in the ##include directives and the
internally generated ##newfile directives where only looking for filenames
composed of alpha numeric characters, periods, and dashes. In Unix/Linux, the
rules for what characters can be in a filename are much looser than that. This
change replaces those expressions with ones that look for anything other than
a quote character. Technically quote characters are allowed as well so we
should allow escaping them somehow, but the additional complexity probably
isn't worth it.
Currently, the machine name is appended before any of the functions
defined with in the sm files. This is not necessary and it also
means that these functions cannot be used outside the sm files.
This patch does away with the prefixes. Note that the generated
C++ files in which the code for these functions is present are
still named such that the machine name is the prefix.
The default generated binary is now gem5.<type> instead of m5.<type>.
The latter does still work but gem5.<type> will be generated first and
then m5.<type> will be hard linked to it.
The end of the COPYING file was generated with:
% python ./util/find_copyrights.py configs src system tests util
Update -C command line option to spit out COPYING file
We were getting a spurious warning in the regressions that turned
out to be due to having the wrong value for TGT_MAP_ANONYMOUS for
Power Linux, but in the process of tracking it down I ended up
doing some cleanup of the mmap handling in general.
A significant contributor to the need for adoptOrphanParams()
is the practice of appending to SimObjectVectors which have
already been assigned as children. This practice sidesteps the
assignment operation for those appended SimObjects, which is
where parent/child relationships are typically established.
This patch reworks the config scripts that use append() on
SimObjectVectors, which all happen to be in the x86 system
configuration. At some point in the future, I hope to make
SimObjectVectors immutable (by deriving from tuple rather than
list), at which time this patch will be necessary for correct
operation. For now, it just avoids some of the warning
messages that get printed in adoptOrphanParams().
Re-enabling implicit parenting (see previous patch) causes current
Ruby config scripts to create some strange hierarchies and generate
several warnings. This patch makes three general changes to address
these issues.
1. The order of object creation in the ruby config files makes the L1
caches children of the sequencer rather than the controller; these
config ciles are rewritten to assign the L1 caches to the
controller first.
2. The assignment of the sequencer list to system.ruby.cpu_ruby_ports
causes the sequencers to be children of system.ruby, generating
warnings because they are already parented to their respective
controllers. Changing this attribute to _cpu_ruby_ports fixes this
because the leading underscore means this is now treated as a plain
Python attribute rather than a child assignment. As a result, the
configuration hierarchy changes such that, e.g.,
system.ruby.cpu_ruby_ports0 becomes system.l1_cntrl0.sequencer.
3. In the topology classes, the routers become children of some random
internal link node rather than direct children of the topology.
The topology classes are rewritten to assign the routers to the
topology object first.
Last summer's big rewrite of the initialization code (in
particular cset 6efc3672733b) got rid of the implicit parenting
that used to occur when an unparented SimObject was assigned as
a parameter value to another SimObject. The idea was that the
new adoptOrphanParams() step would catch these anyway so it was
unnecessary.
Unfortunately it turns out that adoptOrphanParams() has some
inherent instability in that the parent that does the adoption
depends on the config tree traversal order. Even making this
order deterministic (e.g., by traversing children in
alphabetical order) can introduce unwanted and unexpected
hierarchy changes between similar configs (e.g., when adding a
switch_cpu in place of a cpu), causing problems when trying to
restore checkpoints across similar configs. The hierarchy
created by implicit parenting is more stable and more
controllable, so this patch turns that behavior back on.
This patch also cleans up some long-standing holes regarding
parenting of SimObjects that are created in class definitions
(either in the body of the class, or as default parameters).
To avoid breaking some existing config files, this necessitated
changing the error on reparenting children to a warning. This
change fixes another bug where attempting to print the prior
error message would fail on reparenting SimObjectVectors
because they lack a _parent attribute. Some further issues
with SimObjectVectors were cleaned up by getting rid of the
get_parent() call (which could cause errors with some
SimObjectVectors where there was no single parent to return)
with has_parent() (since all the uses of get_parent() were just
boolean tests anyway).
Finally, since the adoptOrphanParam() step turned out to be so
problematic, we now issue a warning when it actually has to do
an adoption. Future cleanup of config files will get rid of
current warnings.
Calculation of offset to copy from storeQueue[idx].data structure for load to
store forwarding fixed to be difference in bytes between store and load virtual
addresses. Previous method would induce bug where a load would index into
buffer at the wrong location.
If a split load fails on a blocked cache wbOutstanding can be decremented
twice if the first part of the split load succeeds and the second part fails.
Condition the decrementing on not having completed the first part of the load.
This patch fixes two problems with the O3 cpu model. The first is an issue
with an instruction fetch causing a fault on the next address while the
current macro-op is being issued. This happens when the micro-ops exceed
the fetch bandwdith and then on the next cycle the fetch stage attempts
to issue a request to the next line while it still has micro-ops to issue
if the next line faults a fault is attached to a micro-op in the currently
executing macro-op rather than a "nop" from the next instruction block.
This leads to an instruction incorrectly faulting when on fetch when
it had no reason to fault.
A similar problem occurs with interrupts. When an interrupt occurs the
fetch stage nominally stops issuing instructions immediately. This is incorrect
in the case of a macro-op as the current location might not be interruptable.
The virtual channels within "response" vnets are made buffers_per_data_vc
deep (default=4), while virtual channels within other vnets are made
buffers_per_ctrl_vc deep (default = 1). This is for accurate power estimates.
Identifying response vnets versus other vnets will allow garnet to
determine which vnets will carry data packets, and which will carry
ctrl packets, and use appropriate buffer sizes (since data packets are larger
than ctrl packets). This in turn allows the orion power model to accurately
estimate buffer power.
Renamed (message) class to vnet for consistency with rest of ruby.
Moved some parameters specific to fixed/flexible garnet networks into their
corresponding py files.
This change further eliminates cases where condition codes were being read
just so they could be written without change because the instruction in
question was supposed to preserve them. This is done by creating the condition
code code based on the input rather than just doing a simple substitution.
If one of the condition codes isn't being used in the execution we should only
read it if the instruction might be dependent on it. With the preeceding changes
there are several more cases where we should dynamically pick instead of assuming
as we did before.
Break up the condition code bits into NZ, C, V registers. These are individually
written and this removes some incorrect dependencies between instructions.
Move the saturating bit (which is also saturating) from the renamed register
that holds the flags to the CPSR miscreg and adds a allows setting it in a
similar way to the FP saturating registers. This removes a dependency in
instructions that don't write, but need to preserve the Q bit.
This change splits out the condcodes from being one monolithic register
into three blocks that are updated independently. This allows CPUs
to not have to do RMW operations on the flags registers for instructions
that don't write all flags.
Debug flags are ExecUser, ExecKernel, and ExecAsid. ExecUser and
ExecKernel are set by default when Exec is specified. Use minus
sign with ExecUser or ExecKernel to remove user or kernel tracing
respectively.
Add registers and components to better support the VersatileEB board.
Made the MIDR and SYS_ID register parameters to ArmSystem and RealviewCtrl
respectively.
Instructions that load an address and are control instructions can
execute down the wrong path if they were predicted correctly and then
instructions following them are squashed. If an instruction is a
memory and control op use the predicted address for the next PC instead
of just advancing the PC. Without this change NPC is used for the next
instruction, but predPC is used to verify that the branch was successful
so the wrong path is silently executed.
The network tester terminates after injecting for sim_cycles
(default=1000), instead of having to explicitly pass --maxticks from the
command line as before. If fixed_pkts is enabled, the tester only
injects maxpackets number of packets, else it keeps injecting till sim_cycles.
The tester also works with zero command line arguments now.
The RubyMemory flag wasnt used in the code, creating large gaps in trace output. Replace cprintfs w/dprintfs
using RubyMemory in memory controller. DPRINTF also deprecate the usage of the setDebug() pure virtual
function in the AbstractMemoryOrCache Class as well the m_debug/cprintf functions in MemoryControl.hh/cc
The simple network's endpoint bandwidth value is used to adjust the overall
bandwidth of the network. Specifically, the ration between endpoint bandwidth
and the MESSAGE_SIZE_MULTIPLIER determines the increase. By setting the value
to 1000, that means the bandwdith factor specified in the links translates to
the link bandwidth in bytes. Previously, it was increasing that value by 10.
This patch will likely require a reset of the ruby regression tester stats.
Moved the buffer_size, endpoint_bandwidth, and adaptive_routing params out of
the top-level parent network object and to only those networks that actually
use those parameters.
This patch ensures that both Garnet and the simple networks use the bw value
specified in the topology. To do so, the patch generalizes the specification
of bw for basic links. This value is then translated to the specific value
used by the simple and Garnet networks. Since Garent does not support
non-uniformed link bandwidth, the patch also adds a check to ensure all bws are
equal.
--HG--
rename : src/mem/ruby/network/BasicLink.cc => src/mem/ruby/network/simple/SimpleLink.cc
rename : src/mem/ruby/network/BasicLink.hh => src/mem/ruby/network/simple/SimpleLink.hh
rename : src/mem/ruby/network/BasicLink.py => src/mem/ruby/network/simple/SimpleLink.py
This patch converts links and switches from second class simobjects that were
virtually ignored by the networks (both simple and Garnet) to first class
simobjects that directly correspond to c++ ojbects manipulated by the
topology and network classes. This is especially true for Garnet, where the
links and switches directly correspond to specific C++ objects.
By making this change, many aspects of the Topology class were simplified.
--HG--
rename : src/mem/ruby/network/Network.cc => src/mem/ruby/network/BasicLink.cc
rename : src/mem/ruby/network/Network.hh => src/mem/ruby/network/BasicLink.hh
rename : src/mem/ruby/network/Network.cc => src/mem/ruby/network/garnet/fixed-pipeline/GarnetLink_d.cc
rename : src/mem/ruby/network/Network.hh => src/mem/ruby/network/garnet/fixed-pipeline/GarnetLink_d.hh
rename : src/mem/ruby/network/garnet/fixed-pipeline/GarnetNetwork_d.py => src/mem/ruby/network/garnet/fixed-pipeline/GarnetLink_d.py
rename : src/mem/ruby/network/garnet/fixed-pipeline/GarnetNetwork_d.py => src/mem/ruby/network/garnet/fixed-pipeline/GarnetRouter_d.py
rename : src/mem/ruby/network/Network.cc => src/mem/ruby/network/garnet/flexible-pipeline/GarnetLink.cc
rename : src/mem/ruby/network/Network.hh => src/mem/ruby/network/garnet/flexible-pipeline/GarnetLink.hh
rename : src/mem/ruby/network/garnet/fixed-pipeline/GarnetNetwork_d.py => src/mem/ruby/network/garnet/flexible-pipeline/GarnetLink.py
rename : src/mem/ruby/network/garnet/fixed-pipeline/GarnetNetwork_d.py => src/mem/ruby/network/garnet/flexible-pipeline/GarnetRouter.py
Moved the Topology class to the top network directory because it is shared by
both the simple and Garnet networks.
--HG--
rename : src/mem/ruby/network/simple/Topology.cc => src/mem/ruby/network/Topology.cc
rename : src/mem/ruby/network/simple/Topology.hh => src/mem/ruby/network/Topology.hh
Due to certain changes made via changeset 8229, the compilation was failing
in certain cases. The compiler pointed to base/stats/mysql.hh for not naming
a certain types like uint64_t. To rectify this, base/types.hh is being
included in base/stats/mysql.hh.
This change makes the decoder figure out if an instruction that only supports
memory is using a register encoding and decodes directly to "Unknown" which will
behave appropriately. This prevents other parts of the instruction creation
process from seeing the mismatch and asserting.
This is similar to guards on mercurial queues and they're used for selecting
which files are compiled into some given object. We already do something
similar, but it's mostly hard coded for the m5 binary and the m5 library
and I'd like to make it more flexible to better support the unittests
At the same time, rename the trace flags to debug flags since they
have broader usage than simply tracing. This means that
--trace-flags is now --debug-flags and --trace-help is now --debug-help
This is basically like the range_map stuff in src/base (range already
exists in Python). This code is like a set of ranges. I'm using it
to keep track of changed lines in source code, but it could be use to
keep track of memory ranges and holes in memory regions. It could
also be used in memory allocation type stuff. (Though it's not at all
optimized.)
Frame buffer and boot linux:
./build/ARM_FS/m5.opt configs/example/fs.py --benchmark=ArmLinuxFrameBuf --kernel=vmlinux.touchkit
Linux from a CF card:
./build/ARM_FS/m5.opt configs/example/fs.py --benchmark=ArmLinuxCflash --kernel=vmlinux.touchkit
Run Android
./build/ARM_FS/m5.opt configs/example/fs.py --benchmark=ArmAndroid --kernel=vmlinux.android
Run MP
./build/ARM_FS/m5.opt configs/example/fs.py --benchmark=ArmLinuxCflash --kernel=vmlinux.mp-2.6.38
This change fixes a small bug in the arm copyRegs() code where some registers
wouldn't be copied if the processor was in a mode other than MODE_USER.
Additionally, this change simplifies the way the O3 switchCpu code works by
utilizing TheISA::copyRegs() to copy the required context information
rather than the adhoc copying that goes on in the CPU model. The current code
makes assumptions about the visibility of int and float registers that aren't
true for all architectures in FS mode.
The comment in the code suggests that the checking granularity should be 16
bytes, however in reality the shift by 8 is 256 bytes which seems much
larger than required.
Fixed an error reguarding DMA for uninprocessor systems. Basically removed an
overly agressive optimization that lead to inconsistent state between the
cache and the directory.
This function duplicates the functionality of allocate() exactly, except that it does not return
a return value. In protocols where you just want to allocate a block
but do not want that block to be your implicitly passed cache_entry, use this function.
Otherwise, SLICC will complain if you do not consume the pointer returned by allocate(),
and if you do a dummy assignment Entry foo := cache.allocate(address), the C++
compiler will complain of an unused variable. This is kind of a hack to get around
those issues, but suggestions welcome.
Before this changeset, all local variables of type Entry and TBE were considered
to be pointers, but an immediate use of said variables would not be automatically
deferenced in SLICC-generated code. Instead, deferences occurred when such
variables were passed to functions, and were automatically dereferenced in
the bodies of the functions (e.g. the implicitly passed cache_entry).
This is a more general way to do it, which leaves in place the
assumption that parameters to functions and local variables of type AbstractCacheEntry
and TBE are always pointers, but instead of dereferencing to access member variables
on a contextual basis, the dereferencing automatically occurs on a type basis at the
moment a member is being accessed. So, now, things you can do that you couldn't before
include:
Entry foo := getCacheEntry(address);
cache_entry.DataBlk := foo.DataBlk;
or
cache_entry.DataBlk := getCacheEntry(address).DataBlk;
or even
cache_entry.DataBlk := static_cast(Entry, pointer, cache.lookup(address)).DataBlk;
This is a substitute for MessageBuffers between controllers where you don't
want messages to actually go through the Network, because requests/responses can
always get reordered wrt to one another (even if you turn off Randomization and turn on Ordered)
because you are, after all, going through a network with contention. For systems where you model
multiple controllers that are very tightly coupled and do not actually go through a network,
it is a pain to have to write a coherence protocol to account for mixed up request/response orderings
despite the fact that it's completely unrealistic. This is *not* meant as a substitute for real
MessageBuffers when messages do in fact go over a network.
It is useful for Ruby to understand from whence request packets came.
This has all request packets going into Ruby pass the contextId value, if
it exists. This supplants the old libruby proc_id value passed around in
all the Messages, so I've also removed the unused unsigned proc_id; member
generated by SLICC for all Message types.
***
(1): get rid of expandForMT function
MIPS is the only ISA that cares about having a piece of ISA state integrate
multiple threads so add constants for MIPS and relieve the other ISAs from having
to define this. Also, InOrder was the only core that was actively calling
this function
* * *
(2): get rid of corespecific type
The CoreSpecific type was used as a proxy to pass in HW specific params to
a MIPS CPU, but since MIPS FS hasnt been touched for awhile, it makes sense
to not force every other ISA to use CoreSpecific as well use a special
reset function to set it. That probably should go in a PowerOn reset fault
anyway.
The goal of the patch is to do away with the CacheMsg class currently in use
in coherence protocols. In place of CacheMsg, the RubyRequest class will used.
This class is already present in slicc_interface/RubyRequest.hh. In fact,
objects of class CacheMsg are generated by copying values from a RubyRequest
object.
The tester code is in testers/networktest.
The tester can be invoked by configs/example/ruby_network_test.py.
A dummy coherence protocol called Network_test is also addded for network-only simulations and testing. The protocol takes in messages from the tester and just pushes them into the network in the appropriate vnet, without storing any state.
I had recently committed a patch that removed the WakeUp*.py files from the
slicc/ast directory. I had forgotten to remove the import calls for these
files from slicc/ast/__init__.py. This resulted in error while running
regressions on zizzer. This patch does the needful.
This patch fixes the problem where Ruby would fail to call sendRetry on ports
after it nacked the port. This patch is particularly helpful for bursty dma
requests which often include several packets.
In SLICC, in order to define a type a data type for which it should not
generate any code, the keyword external_type is used. For those data types for
which code should be generated, the keyword structure is used. This patch
eliminates the use of keyword external_type for defining structures. structure
key word can now have an optional attribute external, which would be used for
figuring out whether or not to generate the code for this structure. Also, now
structures can have functions as well data members in them.
In order to add stall and wait facility for protocols, a keyword
wake_up_dependents was introduced. This patch removes the keyword,
instead this functionality is now implemented as function call.
In order to add stall and wait facility for protocols, a keyword
wake_up_all_dependents was introduced. This patch removes the keyword,
instead this functionality is now implemented as function call.
Thanks to swig this was interfering with the standard Python
random module. The only function in that module was seed(),
which erroneously called srand48(). Moved the function to
m5.internal.core, renamed it seedRandom(), and made it call
random_mt.init() instead.
FastAlloc's reuse policies can mask allocation bugs, so
we typically want it disabled when debugging. Set
FORCE_FAST_ALLOC to enable even when debugging, and set
NO_FAST_ALLOC to disable even in non-debug builds.
The ISAR registers describe which features the processor supports.
Transcribe the values listed in section B5.2.5 of the ARM ARM
into the registers as read-only values
This change speeds up booting, especially in MP cases, by not executing
udelay() on the core but instead skipping ahead tha amount of time that is being
delayed.
This patch prevents not executed conditional instructions marked as
IsQuiesce from stalling the pipeline indefinitely. If the instruction
is not executed the quiesceSkip psuedoinst is called which schedules a
wakes up call to the fetch stage.
This changes the RFE macroop into 3 microops:
URa = [sp]; URb = [sp+4]; // load CPSR,PC values from stack
sp = sp + offset; // optionally auto-increment
PC = URa; CPSR = URb; // write to the PC and CPSR.
Importantly:
- writing to PC is handled in the last micro-op.
- loading occurs prior to state changes.
This change fixes the problem for all the cases we actively use. If you want to try
more creative I/O device attachments (E.g. sharing an L2), this won't work. You
would need another level of caching between the I/O device and the cache
(which you actually need anyway with our current code to make sure writes
propagate). This is required so that you can mark the cache in between as
top level and it won't try to send ownership of a block to the I/O device.
Asserts have been added that should catch any issues.
Without this change the a store can be issued to the cache multiple times.
If this case occurs when the l1 cache is out of mshrs (and thus blocked)
the processor will never make forward progress because each cycle it will
send a single request using the recently freed mshr and not completing the
multipart store. This will continue forever.
This causes a lot of rebuilds that could have otherwise possibly been
avoided, and, more annoyingly, a lot of unnecessary rerunning of the
regressions. The benefits of having the revision in the output haven't
materialized, so this change removes it.
None of the code in the ruby tester directory is compiled or referred to
outside of that directory. This change eliminates it. If it's needed in the
future, it can be revived from the history. In the mean time, this removes
clutter and the only use of the GEMS_ROOT scons variable.
The internet says this instruction was created by accident when an Intel CPU
failed to decode x87 instructions properly. It's been documented on a few rare
occasions and has generally worked to ensure backwards compatability. One
source claims that the gcc toolchain is basically the only thing that emits
it, and that emulators/binary translators like qemu and bochs implement it.
We won't actually implement it here since we're hardly implementing any other
x87 instructions either. If we were to implement it, it would behave the same
as ffree but then also pop the register stack.
http://www.pagetable.com/?p=16
There may not be a formally correct spelling for the past tense of mmap, but
mmapped is the spelling Google doesn't try to autocorrect. This makes sense
because it mirrors the past tense of map->mapped and not the past tense of
cape->caped.
--HG--
rename : src/arch/alpha/mmaped_ipr.hh => src/arch/alpha/mmapped_ipr.hh
rename : src/arch/arm/mmaped_ipr.hh => src/arch/arm/mmapped_ipr.hh
rename : src/arch/mips/mmaped_ipr.hh => src/arch/mips/mmapped_ipr.hh
rename : src/arch/power/mmaped_ipr.hh => src/arch/power/mmapped_ipr.hh
rename : src/arch/sparc/mmaped_ipr.hh => src/arch/sparc/mmapped_ipr.hh
rename : src/arch/x86/mmaped_ipr.hh => src/arch/x86/mmapped_ipr.hh
At a couple of places in PerfectSwitch.cc and MessageBuffer.cc, DPRINTF()
has not been provided with correct number of arguments. The patch fixes these
bugs.
This patch removes the store buffer from Ruby. It is not in use currently.
Since libruby is being and store buffer makes calls to libruby, it is not
possible to maintain it until substantial changes are made.
This patch changes Address.hh so that it is not dependent on RubySystem.
This dependence seems unecessary. All those functions that depend on
RubySystem have been moved to Address.cc file.
This patch changes DataBlock.hh so that it is not dependent on RubySystem.
This dependence seems unecessary. All those functions that depende on
RubySystem have been moved to DataBlock.cc file.
This patch integrates permissions with cache and memory states, and then
automates the setting of permissions within the generated code. No longer
does one need to manually set the permissions within the setState funciton.
This patch will faciliate easier functional access support by always correctly
setting permissions for both cache and memory states.
--HG--
rename : src/mem/slicc/ast/EnumDeclAST.py => src/mem/slicc/ast/StateDeclAST.py
rename : src/mem/slicc/ast/TypeFieldEnumAST.py => src/mem/slicc/ast/TypeFieldStateAST.py
Because int and not InstSeqNum was used in a couple of places, you can
overflow the int type and thus get wierd bugs when the sequence number
is negative (or some wierd value)
remove constructors that werent being used (it just gets confusing)
use initialization list for all the variables instead of relying on initVars()
function
-use a pointer to CacheReqPacket instead of PacketPtr so correct destructors
get called on packet deletion
- make sure to delete the packet if the cache blocks the sendTiming request
or for some reason we dont use the packet
- dont overwrite memory requests since in the worst case an instruction will
be replaying a request so no need to keep allocating a new request
- we dont use retryPkt so delete it
- fetch code was split out already, so just assert that this is a memory
reference inst. and that the staticInst is available
If there is an outstanding table walk and no other activity in the CPU
it can go to sleep and never wake up. This change makes the instruction
queue always active if the CPU is waiting for a store to translate.
If Gabe changes the way this code works then the below should be removed
as indicated by the todo.
We only support EABI binaries, so there is no reason to support OABI syscalls.
The loader detects OABI calls and fatal() so there is no reason to even check
here.
The ARM performance counters are not currently supported by the model.
This patch interprets a 'reset performance counters' command to mean 'reset
the simulator statistics' instead.
"executing" isnt a very descriptive debug message and in going through the
output you get multiple messages that say "executing" but nothing to help
you parse through the code/execution.
So instead, at least print out the name of the action that is taking
place in these functions.
Overall, continue to progress Ruby debug messages to more of the normal M5
debug message style
- add a name() to the Ruby Throttle & PerfectSwitch objects so that the debug output
isn't littered w/"global:" everywhere.
- clean up messages that print over multiple lines when possible
- clean up duplicate prints in the message buffer
In certain actions of the L1 cache controller, while creating an outgoing
message, the machine type was not being set. This results in a
segmentation fault when trace is collected. Joseph Pusudesris provided
his patch for fixing this issue.
keep track of when an instruction needs the execution
behind it to be serialized. Without this, in SE Mode
instructions can execute behind a system call exit().
resources don't need to call getLatency because the latency is already a member
in the class. If there is some type of special case where different instructions
impose a different latency inside a resource then we can revisit this and
add getLatency() back in
each resource has a certain # of requests it can take per cycle. update the #s here
to be more realistic based off of the pipeline width and if the resource needs to
be accessed on multiple cycles
---
need to delete the cache request's data on clearRequest() now that we are recycling
requests
---
fetch unit needs to deallocate the fetch buffer blocks when they are replaced or
squashed.
formerly, to free up bandwidth in a resource, we could just change the pointer in that resource
but at the same time the pipeline stages had visibility to see what happened to a resource request.
Now that we are recycling these requests (to avoid too much dynamic allocation), we can't throw
away the request too early or the pipeline stage gets bad information. Instead, mark when a request
is done with the resource all together and then let the pipeline stage call back to the resource
that it's time to free up the bandwidth for more instructions
*** inteface notes ***
- When an instruction completes and is done in a resource for that cycle, call done()
- When an instruction fails and is done with a resource for that cycle, call done(false)
- When an instruction completes, but isnt finished with a resource, call completed()
- When an instruction fails, but isnt finished with a resource, call completed(false)
* * *
inorder: tlbmiss wakeup bug fix
take away all instances of reqMap in the code and make all references use the built-in
request vectors inside of each resource. The request map was dynamically allocating
a request per instruction. The request vector just allocates N number of requests
during instantiation and then the surrounding code is fixed up to reuse those N requests
***
setRequest() and clearRequest() are the new accessors needed to define a new
request in a resource
we are going to be getting away from creating new resource requests for every
instruction so no more need to keep track of a reqRemoveList and clean it up
every tick
first change in an optimization that will stop InOrder from allocating new memory for every instruction's
request to a resource. This gets expensive since every instruction needs to access ~10 requests before
graduation. Instead, the plan is to allocate just enough resource request objects to satisfy each resource's
bandwidth (e.g. the execution unit would need to allocate 3 resource request objects for a 1-issue pipeline
since on any given cycle it could have 2 read requests and 1 write request) and then let the instructions
contend and reuse those allocated requests. The end result is a smaller memory footprint for the InOrder model
and increased simulation performance
Currently the wakeup function for the PerfectSwitch contains three loops -
loop on number of virtual networks
loop on number of incoming links
loop till all messages for this (link, network) have been routed
With an 8 processor mesh network and Hammer protocol, about 11-12% of the
was observed to have been spent in this function, which is the highest
amongst all the functions. It was found that the innermost loop is executed
about 45 times per invocation of the wakeup function, when each invocation
of the wakeup function processes just about one message.
The patch tries to do away with the redundant executions of the innermost
loop. Counters have been added for each virtual network that record the
number of messages that need to be routed for that virtual network. The
inner loops are only executed when the number of messages for that particular
virtual network > 0. This does away with almost 80% of the executions of the
innermost loop. The function now consumes about 5-6% of the total execution
time.
In x86, 32 and 64 bit writes to registers in which registers appear to be 32 or
64 bits wide overwrite all bits of the destination register. This change
removes false dependencies in these cases where the previous value of a
register doesn't need to be read to write a new value. New versions of most
microops are created that have a "Big" suffix which simply overwrite their
destination, and the right version to use is selected during microop
allocation based on the selected data size.
This does not change the performance of the O3 CPU model significantly, I
assume because there are other false dependencies from the condition code bits
in the flags register.
These faults can panic/warn/warn_once, etc., instead of instructions doing
that themselves directly. That way, instructions can be speculatively
executed, and only if they're actually going to commit will their fault be
invoked and the panic, etc., happen.
When redirecting fetch to handle branches, the npc of the current pc state
needs to be left alone. This change makes the pc state record whether or not
the npc already reflects a real value by making it keep track of the current
instruction size, or if no size has been set.
The patch changes the order in which L1 dcache and icache are looked up when
a request comes in. Earlier, if a request came in for instruction fetch, the
dcache was looked up before the icache, to correctly handle self-modifying
code. But, in the common case, dcache is going to report a miss and the
subsequent icache lookup is going to report a hit. Given the invariant -
caches under the same controller keep track of disjoint sets of cache blocks,
we can move the icache lookup before the dcache lookup. In case of a hit in
the icache, using our invariant, we know that the dcache would have reported
a miss. In case of a miss in the icache, we know that icache would have
missed even if the dcache was looked up before looking up the icache.
Effectively, we are doing the same thing as before, though in the common case,
we expect reduction in the number of lookups. This was empirically confirmed
for MOESI hammer. The ratio lookups to access requests is now about 1.1 to 1.
resource skeds are divided into two parts: front end (all insts) and back end (inst. specific)
each of those are implemented as separate lists, so this iterator wraps around
the traditional list iterator so that an instruction can walk it's schedule but seamlessly
transfer from front end to back end when necessary
add a stage scheduler class to replace InstStage in pipeline_traits.cc
use that class to define a default front-end, resource schedule that all
instructions will follow. This will also replace the back end schedule in
pipeline_traits.cc. The reason for adding this is so that we can cache
instruction schedules in the future instead of calling the same function
over/over again as well as constantly dynamically alllocating memory on
every instruction to try to figure out it's schedule
When a table walk is initiated by the fetch stage, the CPU can
potentially move to the idle state and never wake up.
The fetch stage must call cpu->wakeCPU() when a translation completes
(in finishTranslation()).
Uncacheable requests were set as such only in atomic mode.
currState->delayed is checked in place of currState->timing for resetting
currState in atomic mode.
This change fixes an issue where a DTLB fault occurs and redirects fetch to
handle the fault and the ITLB requires a walk which delays translation. In this
case the status of the cpu isn't updated appropriately, and an additional
instruction fetch occurs. Eventually this hits an assert as multiple instruction
fetches are occuring in the system and when the second one returns the
processor is in the wrong state.
Some asserts below are removed because it was always true (typo) and the state
after the initiateAcc() the processor could be in any valid state when a
d-side fault occurs.
Some ISAs (like ARM) relies on hardware page table walkers. For those ISAs,
when a TLB miss occurs, initiateTranslation() can return with NoFault but with
the translation unfinished.
Instructions experiencing a delayed translation due to a hardware page table
walk are deferred until the translation completes and kept into the IQ. In
order to keep track of them, the IQ has been augmented with a queue of the
outstanding delayed memory instructions. When their translation completes,
instructions are re-executed (only their initiateAccess() was already
executed; their DTB translation is now skipped). The IEW stage has been
modified to support such a 2-pass execution.
Setup initial timesync event in initState or loadState so that curTick has
been updated to the new value, otherwise the event is scheduled in the past.
The TBE pointer in the MESI CMP implementation was not being set to NULL
when the TBE is deallocated. This resulted in segmentation fault on testing
the protocol when the ProtocolTrace was switched on.
JMP_FAR_I was unpacking its far pointer operand using sll instead of srl like
it should, and also putting the components in the wrong registers for use by
other microcode.
During iret access LDT/GDT at CPL0 rather than after transition to user mode
(if I'm reading the Intel IA-64 architecture spec correctly, the contents of
the descriptor table are read before the CPL is updated).
The code for Orion 2.0 makes use of printf() at several places where there as
an error in configuration of the model. These have been replaced with fatal().