As highlighed on the mailing list gem5's writeback modeling can impact
performance. This patch removes the limitation on maximum outstanding issued
instructions, however the number that can writeback in a single cycle is still
respected in instToCommit().
the Cortex-A15 has a random replacement policy for its L2 cache. see the
Cortex-A15 Technical Reference Manual 1.7 About the L2 memory system. this
patch makes the PseudoLRU tags the default for the ARM O3 CPU's L2 cache.
Note: AArch64 and AArch32 interworking is not supported. If you use an AArch64
kernel you are restricted to AArch64 user-mode binaries. This will be addressed
in a later patch.
Note: Virtualization is only supported in AArch32 mode. This will also be fixed
in a later patch.
Contributors:
Giacomo Gabrielli (TrustZone, LPAE, system-level AArch64, AArch64 NEON, validation)
Thomas Grocutt (AArch32 Virtualization, AArch64 FP, validation)
Mbou Eyole (AArch64 NEON, validation)
Ali Saidi (AArch64 Linux support, code integration, validation)
Edmund Grimley-Evans (AArch64 FP)
William Wang (AArch64 Linux support)
Rene De Jong (AArch64 Linux support, performance opt.)
Matt Horsnell (AArch64 MP, validation)
Matt Evans (device models, code integration, validation)
Chris Adeniyi-Jones (AArch64 syscall-emulation)
Prakash Ramrakhyani (validation)
Dam Sunwoo (validation)
Chander Sudanthi (validation)
Stephan Diestelhorst (validation)
Andreas Hansson (code integration, performance opt.)
Eric Van Hensbergen (performance opt.)
Gabe Black
the current implementation of the fetch buffer in the o3 cpu
is only allowed to be the size of a cache line. some
architectures, e.g., ARM, have fetch buffers smaller than a cache
line, see slide 22 at:
http://www.arm.com/files/pdf/at-exploring_the_design_of_the_cortex-a15.pdf
this patch allows the fetch buffer to be set to values smaller
than a cache line.
having separate params for the local/globalHistoryBits and the
local/globalPredictorSize can lead to inconsistencies if they
are not carefully set. this patch dervies the number of bits
necessary to index into the local/global predictors based on
their size.
the value of the localHistoryTableSize for the ARM O3 CPU has been
increased to 1024 from 64, which is more accurate for an A15 based
on some correlation against A15 hardware.
This patch moves the branch predictor files in the o3 and inorder directories
to src/cpu/pred. This allows sharing the branch predictor across different
cpu models.
This patch was originally posted by Timothy Jones in July 2010
but never made it to the repository.
--HG--
rename : src/cpu/o3/bpred_unit.cc => src/cpu/pred/bpred_unit.cc
rename : src/cpu/o3/bpred_unit.hh => src/cpu/pred/bpred_unit.hh
rename : src/cpu/o3/bpred_unit_impl.hh => src/cpu/pred/bpred_unit_impl.hh
rename : src/cpu/o3/sat_counter.hh => src/cpu/pred/sat_counter.hh
The defer_registration parameter is used to prevent a CPU from
initializing at startup, leaving it in the "switched out" mode. The
name of this parameter (and the help string) is confusing. This patch
renames it to switched_out, which should be more descriptive.
globalHistoryBits, globalPredictorSize, and choicePredictorSize are decoupled.
globalHistoryBits controls how much history is kept, global and choice
predictor sizes control how much of that history is used when accessing
predictor tables. This way, global and choice predictors can actually be
different sizes, and it is no longer possible to walk off the predictor arrays
and cause a seg fault.
There are now individual thresholds for choice, global, and local saturating
counters, so that taken/not taken decisions are correct even when the
predictors' counters' sizes are different.
The interface for localPredictorSize has been removed from TournamentBP because
the value can be calculated from localHistoryBits.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
This patch changes the cache-related latencies from an absolute time
expressed in Ticks, to a number of cycles that can be scaled with the
clock period of the caches. Ultimately this patch serves to enable
future work that involves dynamic frequency scaling. As an immediate
benefit it also makes it more convenient to specify cache performance
without implicitly assuming a specific CPU core operating frequency.
The stat blocked_cycles that actually counter in ticks is now updated
to count in cycles.
As the timing is now rounded to the clock edges of the cache, there
are some regressions that change. Plenty of them have very minor
changes, whereas some regressions with a short run-time are perturbed
quite significantly. A follow-on patch updates all the statistics for
the regressions.
In the current caches the hit latency is paid twice on a miss. This patch lets
a configurable response latency be set of the cache for the backward path.