This patch adds basic functionality to quickly visualise the output
from the DRAM efficiency script. There are some unfortunate hacks
needed to communicate the needed information from one script to the
other, and we fall back on (ab)using the simout to do this.
As part of this patch we also trim the efficiency sweep to stop at 512
bytes as this should be sufficient for all forseeable DRAMs.
This patch is the final patch in a series of patches. The aim of the series
is to make ruby more configurable than it was. More specifically, the
connections between controllers are not at all possible (unless one is ready
to make significant changes to the coherence protocol). Moreover the buffers
themselves are magically connected to the network inside the slicc code.
These connections are not part of the configuration file.
This patch makes changes so that these connections will now be made in the
python configuration files associated with the protocols. This requires
each state machine to expose the message buffers it uses for input and output.
So, the patch makes these buffers configurable members of the machines.
The patch drops the slicc code that usd to connect these buffers to the
network. Now these buffers are exposed to the python configuration system
as Master and Slave ports. In the configuration files, any master port
can be connected any slave port. The file pyobject.cc has been modified to
take care of allocating the actual message buffer. This is inline with how
other port connections work.
This patch fixes scripts related to ruby by adding the ruby clock domain.
Now the L1 controllers and the Sequencer shares the cpu clock domain,
while the rest of the components use the ruby clock domain.
Before this patch, running simulations with the cpu clock set at 2GHz or
1GHz will output the same time results and could distort power measurements.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
the Cortex-A15 has a random replacement policy for its L2 cache. see the
Cortex-A15 Technical Reference Manual 1.7 About the L2 memory system. this
patch makes the PseudoLRU tags the default for the ARM O3 CPU's L2 cache.
This patch contains a new CPU model named `Minor'. Minor models a four
stage in-order execution pipeline (fetch lines, decompose into
macroops, decompose macroops into microops, execute).
The model was developed to support the ARM ISA but should be fixable
to support all the remaining gem5 ISAs. It currently also works for
Alpha, and regressions are included for ARM and Alpha (including Linux
boot).
Documentation for the model can be found in src/doc/inside-minor.doxygen and
its internal operations can be visualised using the Minorview tool
utils/minorview.py.
Minor was designed to be fairly simple and not to engage in a lot of
instruction annotation. As such, it currently has very few gathered
stats and may lack other gem5 features.
Minor is faster than the o3 model. Sample results:
Benchmark | Stat host_seconds (s)
---------------+--------v--------v--------
(on ARM, opt) | simple | o3 | minor
| timing | timing | timing
---------------+--------+--------+--------
10.linux-boot | 169 | 1883 | 1075
10.mcf | 117 | 967 | 491
20.parser | 668 | 6315 | 3146
30.eon | 542 | 3413 | 2414
40.perlbmk | 2339 | 20905 | 11532
50.vortex | 122 | 1094 | 588
60.bzip2 | 2045 | 18061 | 9662
70.twolf | 207 | 2736 | 1036
in makeDualRoot() the etherlink interfaces are set using the tsunami interface
however, they are set again a few lines later based on whether or not the system
is a realview or tsunami system; the original assignment is always overwritten
or there will be a fatal. this seems like an artifact from when tsunami was the
only type of system capable of running with the dual option.
This patch bumps the bus clock speed such that the interconnect does
not become a bottleneck with a DDR4-2400-x64 DRAM delivering 19.2
GByte/s theoretical max.
Adds the parameter --num-work-ids to Options.py and reads the parameter
into the System params in Simulation.py. This parameter enables setting
the number of possible work items to different than 16. Support for this
parameter already exists in src/sim/System.py, so this changeset only
affects the Python config files.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
A recent changeset altered the default memory class to DRAMCtrl. In se mode,
ruby uses the physical memory to check if a given address is within the bounds
of the physical memory. SimpleMemory is enough for this. Moreover,
SimpleMemory does not check whether it is connected or not, something which
DRAMCtrl does.
This patch renames the not-so-simple SimpleDRAM to a more suitable
DRAMCtrl. The name change is intended to ensure that we do not send
the wrong message (although the "simple" in SimpleDRAM was originally
intended as in cleverly simple, or elegant).
As the DRAM controller modelling work is being presented at ISPASS'14
our hope is that a broader audience will use the model in the future.
--HG--
rename : src/mem/SimpleDRAM.py => src/mem/DRAMCtrl.py
rename : src/mem/simple_dram.cc => src/mem/dram_ctrl.cc
rename : src/mem/simple_dram.hh => src/mem/dram_ctrl.hh
Make the default memory type DDR3-1600 x64, and use the open-adaptive
page policy. This change is aiming to ensure that users by default are
using a realistic memory system.
This patch adds a configuration that simplifies evaluation of DRAM
controller configurations by automating a sweep of stride size and
bank parallelism. It works in a rather unconventional way, as it needs
to print the traffic generator stimuli based on the memory
organisation. Hence, it starts by configuring the memory, then it
prints a traffic-generator config file, and loads it.
The resulting stats have one period per data point, identified by the
stride size, and the number of banks being used.
This patch adds the row bits to the name of the address mapping
schemes to make it more clear that all the current schemes places the
row bits as the most significant bits.
This helps in configuring the network interfaces from the python script and
these objects no longer rely on the network object for the timing information.
The code that creates test and drive systems is being moved to separate
functions so as to make the code more readable. Ultimately the two
functions would be combined so that the replicated code is eliminated.
The patch removes the ruby_fs.py file. The functionality is being moved to
fs.py. This would being ruby fs simulations in line with how ruby se
simulations are started (using --ruby option). The alpha fs config functions
are being combined for classing and ruby memory systems. This required
renaming the piobus in ruby to iobus. So, we will have stats being renamed
in the stats file for ruby fs regression.
Piobus was recently added to se scripts for ruby so that the interrupt
controller can be connected to something (required since the interrupt
controller sends address range messages). This patch removes the piobus
and instead, the pio port of ruby port will now ignore the range change
messages in se mode.
Couple of errors were discovered in 4eec7bdde5b0 which necessitated this patch.
Firstly, we create interrupt controllers in the se mode, but no piobus was
being created. RubyPort, which earlier used to ignore range changes now
forwards those to the piobus. The lack of piobus resulted in segmentation
fault. This patch creates a piobus even in se mode. It is not created only
when some tester is running. Secondly, I had missed out on modifying port
connections for other coherence protocols.
Currently, the interrupt controller in x86 is connected to the io bus
directly. Therefore the packets between the io devices and the interrupt
controller do not go through ruby. This patch changes ruby port so that
these packets arrive at the ruby port first, which then routes them to their
destination. Note that the patch does not make these packets go through the
ruby network. That would happen in a subsequent patch.
Modifies FSConfig.py to enable ARMv8 compatibility.
To boot gem5 with ARMv8:
Download the v8 kernel, .dtb file, and root FS from: http://gem5.org/Download
Download the ARMv8 toolchain, and add the bin dir to your path:
http://www.linaro.org/engineering/engineering-projects/armv8
Build gem5 for ARM
Build the v8 bootloader (in gem5/system/arm/aarch64_bootloader)
Make script in gem5/system/arm/aarch64_bootloader will require v8 toolchain,
drop the produced boot_emm.arm64 in $(M5_PATH)/binaries/
Run:
$ build/ARM/gem5.fast configs/example/fs.py --machine-type=VExpress_EMM64 \
--kernel=/path/to/kernel/vmlinux-linaro-tracking \
--dtb-filename=/path/to/dtb/rtsm_ve-aemv8a.dtb \
--disk-image=/path/to/img/linaro-minimal-armv8.img
This patch adds DRAMSim2 as a memory controller by wrapping the
external library and creating a sublass of AbstractMemory that bridges
between the semantics of gem5 and the DRAMSim2 interface.
The DRAMSim2 wrapper extracts the clock period from the config
file. There is no way of extracting this information from DRAMSim2
itself, so we simply read the same config file and get it from there.
To properly model the response queue, the wrapper keeps track of how
many transactions are in the actual controller, and how many are
stacking up waiting to be sent back as responses (in the wrapper). The
latter requires us to move away from the queued port and manage the
packets ourselves. This is due to DRAMSim2 not having any flow control
on the response path.
DRAMSim2 assumes that the transactions it is given are matching the
burst size of the choosen memory. The wrapper checks to ensure the
cache line size of the system matches the burst size of DRAMSim2 as
there are currently no provisions to split the system requests. In
theory we could allow a cache line size smaller than the burst size,
but that would lead to inefficient use of the DRAM, so for not we
fatal also in this case.
Note: AArch64 and AArch32 interworking is not supported. If you use an AArch64
kernel you are restricted to AArch64 user-mode binaries. This will be addressed
in a later patch.
Note: Virtualization is only supported in AArch32 mode. This will also be fixed
in a later patch.
Contributors:
Giacomo Gabrielli (TrustZone, LPAE, system-level AArch64, AArch64 NEON, validation)
Thomas Grocutt (AArch32 Virtualization, AArch64 FP, validation)
Mbou Eyole (AArch64 NEON, validation)
Ali Saidi (AArch64 Linux support, code integration, validation)
Edmund Grimley-Evans (AArch64 FP)
William Wang (AArch64 Linux support)
Rene De Jong (AArch64 Linux support, performance opt.)
Matt Horsnell (AArch64 MP, validation)
Matt Evans (device models, code integration, validation)
Chris Adeniyi-Jones (AArch64 syscall-emulation)
Prakash Ramrakhyani (validation)
Dam Sunwoo (validation)
Chander Sudanthi (validation)
Stephan Diestelhorst (validation)
Andreas Hansson (code integration, performance opt.)
Eric Van Hensbergen (performance opt.)
Gabe Black
The first two levels (L0, L1) are private to the core, the third level (L2)is
possibly shared. The protocol supports clustered designs. For example, one
can have two sets of two cores. Each core has an L0 and L1 cache. There are
two L2 controllers where each set accesses only one of the L2 controllers.
For some reason, the default x86 kernel is specified in
tests/configs/x86_generic.py and not in configs/common/FSConfig.py,
where the kernels for all the other ISAs are. This means that
running configs/example/fs.py for x86 fails because no kernel
is specified. Moving the specification over fixes this problem.
There is another problem that this uncovers, which is that going
past the init stage (i.e., past where the regression test stops)
fails because the fsck test on the disk device fails, but that's
a separate issue.
The directory controller should not have the sharer field since there is
only one level 2 cache. Anyway the field was not in use. The owner field
was being used to track the l2 cache version (in case of distributed l2) that
has the cache block under consideration. The information is not required
since the version of the level 2 cache can be obtained from a subset of the
address bits.
the current implementation of the fetch buffer in the o3 cpu
is only allowed to be the size of a cache line. some
architectures, e.g., ARM, have fetch buffers smaller than a cache
line, see slide 22 at:
http://www.arm.com/files/pdf/at-exploring_the_design_of_the_cortex-a15.pdf
this patch allows the fetch buffer to be set to values smaller
than a cache line.
This Python script generates an ARM DS-5 Streamline .apc project based
on gem5 run. To successfully convert, the gem5 runs needs to be run
with the context-switch-based stats dump option enabled (The guest
kernel also needs to be patched to allow gem5 interrogate its task
information.) See help for more information.
A couple of recent changesets added/deleted/edited some variables
that are needed for running the example ruby scripts. This changeset
edits these scripts to bring them to a working state.
In order to support m5ops in virtualized environments, we need to use
a memory mapped interface. This changeset adds support for that by
reserving 0xFFFF0000-0xFFFFFFFF and mapping those to the generic IPR
interface for m5ops. The mapping is done in the
X86ISA::TLB::finalizePhysical() which means that it just works for all
of the CPU models, including virtualized ones.
Recent changes added setting of system-wide cache line size and these settings
occur in the top-level configs (se.py and fs.py). This setting also needs to
take place in ruby_fs.py. This change sets the cache line size as appropriate.
The previous changeset (9816) that fixes the use of max ticks introduced the
variable cpt_starttick, which is used for setting the relative max tick.
Unfortunately, with checkpointing at an instruction count or with simpoints,
the checkpoint tick is not stored conveniently, so to ensure that cpt_starttick
is initialized, set it to 0. Also, if using --rel-max-tick, check the use of
instruction counts or simpoints to warn the user that the max tick setting does
not include the checkpoint ticks.
The routers are created before the network class. This results in the routers
becoming children of the first link they are connected to and they get generic
names like int_node and node_b. This patch creates the network object first
and passes it to the topology creation function. Now the routers are children
of the network object and names are much more sensible.
The number of transitions per cycle that a controller can carry out is
a proxy for the number of ports that a controller has. This value is
currently 32 which is way too high. The patch introduces an option
for the number of ports and uses this option in the protocol files
to set the number of transitions. The default value is being set to
4. None of the se regressions change. Ruby stats for the fs regression
change and are being updated.
This patch adds support for specifying multi-channel memory
configurations on the command line, e.g. 'se/fs.py
--mem-type=ddr3_1600_x64 --mem-channels=4'. To enable this, it
enhances the functionality of MemConfig and moves the existing
makeMultiChannel class method from SimpleDRAM to the support scripts.
The se/fs.py example scripts are updated to make use of the new
feature.
This patch changes the default parameter value of conf_table_reported
to match the common case. It also simplifies the regression and config
scripts to reflect this change.
This patch adds the notion of voltage domains, and groups clock
domains that operate under the same voltage (i.e. power supply) into
domains. Each clock domain is required to be associated with a voltage
domain, and the latter requires the voltage to be explicitly set.
A voltage domain is an independently controllable voltage supply being
provided to section of the design. Thus, if you wish to perform
dynamic voltage scaling on a CPU, its clock domain should be
associated with a separate voltage domain.
The current implementation of the voltage domain does not take into
consideration cases where there are derived voltage domains running at
ratio of native voltage domains, as with the case where there can be
on-chip buck/boost (charge pumps) voltage regulation logic.
The regression and configuration scripts are updated with a generic
voltage domain for the system, and one for the CPUs.
This patch moves the instantiation of the memory controller outside
FSConfig and instead relies on the mem_ranges to pass the information
to the caller (e.g. fs.py or one of the regression scripts). The main
motivation for this change is to expose the structural composition of
the memory system and allow more tuning and configuration without
adding a large number of options to the makeSystem functions.
The patch updates the relevant example scripts to maintain the current
functionality. As the order that ports are connected to the memory bus
changes (in certain regresisons), some bus stats are shuffled
around. For example, what used to be layer 0 is now layer 1.
Going forward, options will be added to support the addition of
multi-channel memory controllers.
This patch contains three fixes to max tick options handling in Options.py and
Simulation.py:
1) Since the global simulator frequency isn't bound until m5.instantiate()
is called, the maxtick resolution needs to happen after this call, since
changes to the global frequency will cause m5.simulate() to misinterpret the
maxtick value. Shuffling this also requires tweaking the checkpoint directory
handling to signal the checkpoint restore tick back to run(). Fixing this
completely and correctly will require storing the simulation frequency into
checkpoints, which is beyond the scope of this patch.
2) The maxtick option in Options.py was defaulted to MaxTicks, so the old code
would always skip over the maxtime part of the conditionals at the beginning
of run(). Change the maxtick default to None, and set the maxtick local
variable in run() appropriately.
3) To clarify whether max ticks settings are relative or absolute, split the
maxtick option into separate options, for relative and absolute. Ensure that
these two options and the maxtime option are handled appropriately to set the
maxtick variable in Simulation.py.
The configuration scripts provided for ruby assume that the available
physical memory is equally distributed amongst the directory controllers.
But there is no check to ensure this assumption has been adhered to. This
patch adds the required check.
This patch adds the notion of source- and derived-clock domains to the
ClockedObjects. As such, all clock information is moved to the clock
domain, and the ClockedObjects are grouped into domains.
The clock domains are either source domains, with a specific clock
period, or derived domains that have a parent domain and a divider
(potentially chained). For piece of logic that runs at a derived clock
(a ratio of the clock its parent is running at) the necessary derived
clock domain is created from its corresponding parent clock
domain. For now, the derived clock domain only supports a divider,
thus ensuring a lower speed compared to its parent. Multiplier
functionality implies a PLL logic that has not been modelled yet
(create a separate clock instead).
The clock domains should be used as a mechanism to provide a
controllable clock source that affects clock for every clocked object
lying beneath it. The clock of the domain can (in a future patch) be
controlled by a handler responsible for dynamic frequency scaling of
the respective clock domains.
All the config scripts have been retro-fitted with clock domains. For
the System a default SrcClockDomain is created. For CPUs that run at a
different speed than the system, there is a seperate clock domain
created. This domain incorporates the CPU and the associated
caches. As before, Ruby runs under its own clock domain.
The clock period of all domains are pre-computed, such that no virtual
functions or multiplications are needed when calling
clockPeriod. Instead, the clock period is pre-computed when any
changes occur. For this to be possible, each clock domain tracks its
children.
This patch adds a 'sys_clock' command-line option and use it to assign
clocks to the system during instantiation.
As part of this change, the default clock in the System class is
removed and whenever a system is instantiated a system clock value
must be set. A default value is provided for the command-line option.
The configs and tests are updated accordingly.
This patch adds a 'cpu_clock' command-line option and uses the value
to assign clocks to components running at the CPU speed (L1 and L2
including the L2-bus). The configuration scripts are updated
accordingly.
The 'clock' option is left unchanged in this patch as it is still used
by a number of components. In follow-on patches the latter will be
disambiguated further.
This patch removes the explicit setting of the clock period for
certain instances of CoherentBus, NonCoherentBus and IOCache where the
specified clock is same as the default value of the system clock. As
all the values used are the defaults, there are no performance
changes. There are similar cases where the toL2Bus is set to use the
parent CPU clock which is already the default behaviour.
The main motivation for these simplifications is to ease the
introduction of clock domains.
This patch moves the instantiation of system.membus in se.py to the area of
code where classic memory system has been dealt with. Ruby does not require
this bus and hence it should not be instantiated.
The --restore-with-cpu option didn't use CpuConfig.cpu_names() to
determine which CPU names are valid, instead it used a static list of
known CPU names. This changeset makes the option parsing code use the
CPU list from the CpuConfig module instead.
This patch changes the class names of the variuos DRAM configurations
to better reflect what memory they are based on. The speed and
interface width is now part of the name, and also the alias that is
used to select them on the command line.
Some minor changes are done to the actual parameters, to better
reflect the named configurations. As a result of these changes the
regressions change slightly and the stats will be bumped in a separate
patch.
This patch adds a typical (leaning towards fast) LPDDR3 configuration
based on publically available data. As expected, it looks very similar
to the LPDDR2-S4 configuration, only with a slightly lower burst time.
This patch removes the explicit memset as it is redundant and causes
the simulator to touch the entire space, forcing the host system to
allocate the pages.
Anonymous pages are mapped on the first access, and the page-fault
handler is responsible for zeroing them. Thus, the pages are still
zeroed, but we avoid touching the entire allocated space which enables
us to use much larger memory sizes as long as not all the memory is
actually used.
The option was not being passed to directory controllers for the protocols
MOESI_CMP_token and MOESI_CMP_directory. This was resulting in an error
while instantiating the directory controller as it tries to access the
wrong type of memory.
having separate params for the local/globalHistoryBits and the
local/globalPredictorSize can lead to inconsistencies if they
are not carefully set. this patch dervies the number of bits
necessary to index into the local/global predictors based on
their size.
the value of the localHistoryTableSize for the ARM O3 CPU has been
increased to 1024 from 64, which is more accurate for an A15 based
on some correlation against A15 hardware.
This fixes missing mem-type arguments to makeLinuxAlphaRubySystem and
makeLinuxX86System after a recent changeset allowing mem-type to be
configured via options missed fixing these calls.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
This patch enables selection of the memory controller class through a
mem-type command-line option. Behind the scenes, this option is
treated much like the cpu-type, and a similar framework is used to
resolve the valid options, and translate the short-hand description to
a valid class.
The regression scripts are updated with a hardcoded memory class for
the moment. The best solution going forward is probably to get the
memory out of the makeSystem functions, but Ruby complicates things as
it does not connect the memory controller to the membus.
--HG--
rename : configs/common/CpuConfig.py => configs/common/MemConfig.py
KVM-based CPUs need a KVM VM object in the system to manage
system-global KVM stuff (VM creation, interrupt delivery, memory
managment, etc.). This changeset adds a VM to the system if KVM has
been enabled at compile time (the BaseKvmCPU object exists) and a
KVM-based CPU has been selected at runtime.
This patch is based on http://reviews.m5sim.org/r/1474/ originally written by
Mitch Hayenga. Basic block vectors are generated (simpoint.bb.gz in simout
folder) based on start and end addresses of basic blocks.
Some comments to the original patch are addressed and hooks are added to create
and resume from checkpoints based on instruction counts dictated by external
SimPoint analysis tools.
SimPoint creation/resuming options will be implemented as a separate patch.
In Simulation.py, calls to m5.simulate(num_ticks) will run the simulated system
for num_ticks after the current tick. Fix calls to m5.simulate in
scriptCheckpoints() and benchCheckpoints() to appropriately handle the maxticks
variable.
As of now, we mark the top 1MB of memory space as unusable. Part of
it is actually usable and is required to be marked so by some of the
newer versions of linux kernel. This patch marks the top 639KB as usable.
This value was chosen by looking at QEMU's output for bios memory map.
The Topology class in Ruby does not need to inherit from SimObject class.
This patch turns it into a regular class. The topology object is now created
in the constructor of the Network class. All the parameters for the topology
class have been moved to the network class.
The default cache configuration script currently import the O3_ARM_v7a
model configuration, which depends on the O3 CPU. This breaks if gem5
has been compiled without O3 support. This changeset removes the
dependency by only importing the model if it is requested by the
user. As a bonus, it actually removes some code duplication in the
configuration scripts.
CPU switching consists of the following steps:
1. Drain the system
2. Switch out old CPUs (cpu.switchOut())
3. Change the system timing mode to the mode the new CPUs require
4. Flush caches if switching to hardware virtualization
5. Inform new CPUs of the handover (cpu.takeOverFrom())
6. Resume the system
m5.switchCpus() previously only did step 2 & 5. Since information
about the new processors' memory system requirements is now exposed,
do all of the steps above.
This patch adds automatic memory system switching and flush (if
needed) to switchCpus(). Additionally, it adds optional draining to
switchCpus(). This has the following implications:
* changeToTiming and changeToAtomic are no longer needed, so they have
been removed.
* changeMemoryMode is only used internally, so it is has been renamed
to be private.
* switchCpus requires a reference to the system containing the CPUs as
its first parameter.
WARNING: This changeset breaks compatibility with existing
configuration scripts since it changes the signature of
m5.switchCpus().
The CPUs supported by the configuration scripts used to be
hard-coded. This was not ideal for several reasons. For example, the
configuration scripts depend on all CPU models even though only a
subset might have been compiled.
This changeset adds a new module to the configuration scripts that
automatically discovers the available CPU models from the compiled
SimObjects. As a nice bonus, the use of introspection allows us to
automatically generate a list of available CPU models suitable for
printing. This list is augmented with the Python doc string from the
underlying class if available.
The configuration scripts currently hard-code the requirements of each
CPU. This is clearly not optimal as it makes writing new configuration
scripts painful and adding new CPU models requires existing scripts to
be updated. This patch adds the following class methods to the base
CPU and all relevant CPUs:
* memory_mode -- Return a string describing the current memory mode
(invalid/atomic/timing).
* require_caches -- Does the CPU model require caches?
* support_take_over -- Does the CPU support CPU handover?
The run() method in Simulation.py used to call sys.exit() when the
simulator exits. This is undesirable when user has requested the
simulator to be run in interactive mode since it causes the simulator
to exit rather than entering the interactive Python environment.
This patch moves the default DRAM parameters from the SimpleDRAM class
to two different subclasses, one for DDR3 and one for LPDDR2. More can
be added as we go forward.
The regressions that previously used the SimpleDRAM are now using
SimpleDDR3 as this is the most similar configuration.
This patch moves the branch predictor files in the o3 and inorder directories
to src/cpu/pred. This allows sharing the branch predictor across different
cpu models.
This patch was originally posted by Timothy Jones in July 2010
but never made it to the repository.
--HG--
rename : src/cpu/o3/bpred_unit.cc => src/cpu/pred/bpred_unit.cc
rename : src/cpu/o3/bpred_unit.hh => src/cpu/pred/bpred_unit.hh
rename : src/cpu/o3/bpred_unit_impl.hh => src/cpu/pred/bpred_unit_impl.hh
rename : src/cpu/o3/sat_counter.hh => src/cpu/pred/sat_counter.hh
This patch moves the contollers to be children of the ruby_system instead of
'system' under the python object hierarchy. This is so that these objects
can inherit some of the ruby_system's parameter values without resorting to
calling a global system pointer during run-time.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Used as a command in full-system scripts helps the user ensure the benchmarks have finished successfully.
For example, one can use:
/path/to/benchmark args || /sbin/m5 fail 1
and thus ensure gem5 will exit with an error if the benchmark fails.
The defer_registration parameter is used to prevent a CPU from
initializing at startup, leaving it in the "switched out" mode. The
name of this parameter (and the help string) is confusing. This patch
renames it to switched_out, which should be more descriptive.
This patch generalises the address range resolution for the I/O cache
and I/O bridge such that they do not assume a single memory. The patch
involves adding a parameter to the system which is then defined based
on the memories that are to be visible from the I/O subsystem, whether
behind a cache or a bridge.
The change is needed to allow interleaved memory controllers in the
system.
The ISA class on stores the contents of ID registers on many
architectures. In order to make reset values of such registers
configurable, we make the class inherit from SimObject, which allows
us to use the normal generated parameter headers.
This patch introduces a Python helper method, BaseCPU.createThreads(),
which creates a set of ISAs for each of the threads in an SMT
system. Although it is currently only needed when creating
multi-threaded CPUs, it should always be called before instantiating
the system as this is an obvious place to configure ID registers
identifying a thread/CPU.
The directed tester supports only generating only read or only write accesses. The
patch modifies the tester to support streams that have both read and write accesses.
globalHistoryBits, globalPredictorSize, and choicePredictorSize are decoupled.
globalHistoryBits controls how much history is kept, global and choice
predictor sizes control how much of that history is used when accessing
predictor tables. This way, global and choice predictors can actually be
different sizes, and it is no longer possible to walk off the predictor arrays
and cause a seg fault.
There are now individual thresholds for choice, global, and local saturating
counters, so that taken/not taken decisions are correct even when the
predictors' counters' sizes are different.
The interface for localPredictorSize has been removed from TournamentBP because
the value can be calculated from localHistoryBits.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
There is no point in exporting the old drain() method in
Simulate.py. It should only be used internally by doDrain(). This
patch moves the old drain() method into doDrain() and renames
doDrain() to drain().
Changeset 4f54b0f229b5 removed the call to doDrain in changeToTiming
based on the assumption that the system does not need draining when
running in atomic mode. This is a false assumption since at least the
System class requires the system to be drained before it allows
switching of memory modes. This patch reverts that part of the
changeset.
This patch unified the L1 and L2 caches used throughout the
regressions instead of declaring different, but very similar,
configurations in the different scripts.
The patch also changes the default L2 configuration to match what it
used to be for the fs and se scripts (until the last patch that
updated the regressions to also make use of the cache config). The
MSHRs and targets per MSHR are now set to a more realistic default of
20 and 12, respectively.
As a result of both the aforementioned changes, many of the regression
stats are changed. A follow-on patch will bump the stats.
This patch favours using SimpleDRAM with the default timing instead of
SimpleMemory for all regressions that involve the o3 or inorder CPU,
or are full system (in other words, where the actual performance of
the memory is important for the overall performance).
Moving forward, the solution for FSConfig and the users of fs.py and
se.py is probably something similar to what we use to choose the CPU
type. I envision a few pre-set configurations SimpleLPDDR2,
SimpleDDR3, etc that can be choosen by a dram_type option. Feedback on
this part is welcome.
This patch changes plenty stats and adds all the DRAM controller
related stats. A follow-on patch updates the relevant statistics. The
total run-time for the entire regression goes up with ~5% with this
patch due to the added complexity of the SimpleDRAM model. This is a
concious trade-off to ensure that the model is properly tested.
This patch uses the common L1, L2 and IOCache configuration for the
regressions that all share the same cache parameters. There are a few
regressions that use a slightly different configuration (memtest,
o3-timing=mp, simple-atomic-mp and simple-timing-mp), and the latter
are not changed in this patch. They will be updated in a future patch.
The common cache configurations are changed to match the ones used in
the regressions, and are slightly changed with respect to what they
were. Hopefully this means we can converge on a common base
configuration, used both in the normal user configurations and
regressions.
As only regressions that shared the same cache configuration are
updated, no regressions are affected.
This patch adds support to different entities in the ruby memory system
for more reliable functional read/write accesses. Only the simple network
has been augmented as of now. Later on Garnet will also support functional
accesses.
The patch adds functional access code to all the different types of messages
that protocols can send around. These messages are functionally accessed
by going through the buffers maintained by the network entities.
The patch also rectifies some of the bugs found in coherence protocols while
testing the patch.
With this patch applied, functional writes always succeed. But functional
reads can still fail.
This patch changes the cache-related latencies from an absolute time
expressed in Ticks, to a number of cycles that can be scaled with the
clock period of the caches. Ultimately this patch serves to enable
future work that involves dynamic frequency scaling. As an immediate
benefit it also makes it more convenient to specify cache performance
without implicitly assuming a specific CPU core operating frequency.
The stat blocked_cycles that actually counter in ticks is now updated
to count in cycles.
As the timing is now rounded to the clock edges of the cache, there
are some regressions that change. Plenty of them have very minor
changes, whereas some regressions with a short run-time are perturbed
quite significantly. A follow-on patch updates all the statistics for
the regressions.
This patch changes the CoherentBus between the L1s and L2 to use the
CPU clock and also four times the width compared to the default
bus. The parameters are not intending to fit every single scenario,
but rather serve as a better startingpoint than what we previously
had.
Note that the scripts that do not use the addTwoLevelCacheHiearchy are
not affected by this change.
A separate patch will update the stats.
The memtest.py script used to connect the system port directly to the
SimpleMemory, but the latter is now single ported. Since the system
port is not used for anything in this particular example, a quick fix
is to attach it to the functional bus instead.
In the current caches the hit latency is paid twice on a miss. This patch lets
a configurable response latency be set of the cache for the backward path.
This patch simplifies the Range object hierarchy in preparation for an
address range class that also allows striping (e.g. selecting a few
bits as matching in addition to the range).
To extend the AddrRange class to an AddrRegion, the first step is to
simplify the hierarchy such that we can make it as lean as possible
before adding the new functionality. The only class using Range and
MetaRange is AddrRange, and the three classes are now collapsed into
one.
In order to ensure correct functionality of switch CPUs, the TLB walker ports
must be connected to the Ruby system in x86 simulation.
This fixes x86 assertion failures that the TLB walker ports are not connected
during the CPU switch process.
When switching from an atomic CPU to any of the timing CPUs, a drain is
unnecessary since no events are scheduled in atomic mode. However, when
trying to switch CPUs starting with a timing CPU, there may be events
scheduled. This change ensures that all events are drained from the system
by calling m5.drain before switching CPUs.
This patch allows for specifying multiple programs via command line. It also
adds an option for specifying whether to use of SMT. But SMT does not work for
the o3 cpu as of now.
This patch removes the NACKing in the bridge, as the split
request/response busses now ensure that protocol deadlocks do not
occur, i.e. the message-dependency chain is broken by always allowing
responses to make progress without being stalled by requests. The
NACKs had limited support in the system with most components ignoring
their use (with a suitable call to panic), and as the NACKs are no
longer needed to avoid protocol deadlocks, the cleanest way is to
simply remove them.
The bridge is the starting point as this is the only place where the
NACKs are created. A follow-up patch will remove the code that deals
with NACKs in the endpoints, e.g. the X86 table walker and DMA
port. Ultimately the type of packet can be complete removed (until
someone sees a need for modelling more complex protocols, which can
now be done in parts of the system since the port and interface is
split).
As a consequence of the NACK removal, the bridge now has to send a
retry to a master if the request or response queue was full on the
first attempt. This change also makes the bridge ports very similar to
QueuedPorts, and a later patch will change the bridge to use these. A
first step in this direction is taken by aligning the name of the
member functions, as done by this patch.
A bit of tidying up has also been done as part of the simplifications.
Surprisingly, this patch has no impact on any of the
regressions. Hence, there was never any NACKs issued. In a follow-up
patch I would suggest changing the size of the bridge buffers set in
FSConfig.py to also test the situation where the bridge fills up.
This patch fixes the checkpointing by ensuring that the directory is
passer to the scriptCheckpoints function, and that the num_checkpoints
is not used before it is initialised.
This patch adds a --repeat-switch option that will enable repeat core
switching at a user defined period (set with --switch-freq option).
currently, a switch can only occur between like CPU types. inorder CPU
switching is not supported.
*note*
this patch simply allows a config that will perform repeat switching, it
does not fix drain/switchout functionality. if you run with repeat switching
you will hit assertion failures and/or your workload with hang or die.
This patch moves instantiateTopology into Ruby.py and removes the
mem/ruby/network/topologies directory. It also adds some extra inheritance to
the topologies to clean up some issues in the existing topologies.
This patch moves the code related to checkpointing from the run() function to
several different functions. The aim is to make the code more manageable. No
functionality changes are expected, but since the code is kind of unruly, it
is possible that some change might have creeped in.
This changes the way in which the cpu class while restoring from a checkpoint
is set. Earlier it was assumed if cpu type with which to restore is not same
as the cpu type with the which to run the simulation, then the checkpoint
should be restored with the atomic cpu. This assumption is being dropped. The
checkpoint can now be restored with any cpu type, the default being atomic cpu.
This patch changes the se and fs script to use the clock option and
not simply set the CPUs clock to 2 GHz. It also makes a minor change
to the assignment of the switch_cpus clock to allow different clocks.
This patch changes the simple memory to have a single slave port
rather than a vector port. The simple memory makes no attempts at
modelling the contention between multiple ports, and any such
multiplexing and demultiplexing could be done in a bus (or crossbar)
outside the memory controller. This scenario also matches with the
ongoing work on a SimpleDRAM model, which will be a single-ported
single-channel controller that can be used in conjunction with a bus
(or crossbar) to create a multi-port multi-channel controller.
There are only very few regressions that make use of the vector port,
and these are all for functional accesses only. To facilitate these
cases, memtest and memtest-ruby have been updated to also have a
"functional" bus to perform the (de)multiplexing of the functional
memory accesses.
Instead of just passing a list of controllers to the makeTopology function
in src/mem/ruby/network/topologies/<Topo>.py we pass in a function pointer
which knows how to make the topology, possibly with some extra state set
in the configs/ruby/<protocol>.py file. Thus, we can move all of the files
from network/topologies to configs/topologies. A new class BaseTopology
is added which all topologies in configs/topologies must inheirit from and
follow its API.
--HG--
rename : src/mem/ruby/network/topologies/Crossbar.py => configs/topologies/Crossbar.py
rename : src/mem/ruby/network/topologies/Mesh.py => configs/topologies/Mesh.py
rename : src/mem/ruby/network/topologies/MeshDirCorners.py => configs/topologies/MeshDirCorners.py
rename : src/mem/ruby/network/topologies/Pt2Pt.py => configs/topologies/Pt2Pt.py
rename : src/mem/ruby/network/topologies/Torus.py => configs/topologies/Torus.py
1) Modifies Benchmarks.py to add support for Android ICS and BBench on Android ICS.
2) An rcS script is added for BBench on ICS.
3) Separates benchmark entries and rcS scripts for GB/ICS
4) Removes the debugging output from the existing BBench run script. These
print statements were used for debugging and they seemed to confuse users
into believing they should see some terminal output.
As status matrix, MIPS fs does not work. Hence, these options are not
required. Secondly, the function is setting param values for a CPU class.
This seems strange, should probably be done in a different way.
This patch introduces a class hierarchy of buses, a non-coherent one,
and a coherent one, splitting the existing bus functionality. By doing
so it also enables further specialisation of the two types of buses.
A non-coherent bus connects a number of non-snooping masters and
slaves, and routes the request and response packets based on the
address. The request packets issued by the master connected to a
non-coherent bus could still snoop in caches attached to a coherent
bus, as is the case with the I/O bus and memory bus in most system
configurations. No snoops will, however, reach any master on the
non-coherent bus itself. The non-coherent bus can be used as a
template for modelling PCI, PCIe, and non-coherent AMBA and OCP buses,
and is typically used for the I/O buses.
A coherent bus connects a number of (potentially) snooping masters and
slaves, and routes the request and response packets based on the
address, and also forwards all requests to the snoopers and deals with
the snoop responses. The coherent bus can be used as a template for
modelling QPI, HyperTransport, ACE and coherent OCP buses, and is
typically used for the L1-to-L2 buses and as the main system
interconnect.
The configuration scripts are updated to use a NoncoherentBus for all
peripheral and I/O buses.
A bit of minor tidying up has also been done.
--HG--
rename : src/mem/bus.cc => src/mem/coherent_bus.cc
rename : src/mem/bus.hh => src/mem/coherent_bus.hh
rename : src/mem/bus.cc => src/mem/noncoherent_bus.cc
rename : src/mem/bus.hh => src/mem/noncoherent_bus.hh
Multithreaded programs did not run by just specifying the binary once on the
command line of SE mode.The default mode is multi-programmed mode. Added
check in SE mode to run multi-threaded programs in case only one program is
specified with multiple CPUS. Default mode is still multi-programmed mode.
Added the options to Options.py for FS mode with backward compatibility. It is
good to provide an option to specify the disk image and the memory size from
command line since a lot of disk images are created to support different
benchmark suites as well as per user needs. Change in program also leads to
change in memory requirements. These options provide the interface to provide
both disk image and memory size from the command line and gives more
flexibility.
This patch removes the assumption on having on single instance of
PhysicalMemory, and enables a distributed memory where the individual
memories in the system are each responsible for a single contiguous
address range.
All memories inherit from an AbstractMemory that encompasses the basic
behaviuor of a random access memory, and provides untimed access
methods. What was previously called PhysicalMemory is now
SimpleMemory, and a subclass of AbstractMemory. All future types of
memory controllers should inherit from AbstractMemory.
To enable e.g. the atomic CPU and RubyPort to access the now
distributed memory, the system has a wrapper class, called
PhysicalMemory that is aware of all the memories in the system and
their associated address ranges. This class thus acts as an
infinitely-fast bus and performs address decoding for these "shortcut"
accesses. Each memory can specify that it should not be part of the
global address map (used e.g. by the functional memories by some
testers). Moreover, each memory can be configured to be reported to
the OS configuration table, useful for populating ATAG structures, and
any potential ACPI tables.
Checkpointing support currently assumes that all memories have the
same size and organisation when creating and resuming from the
checkpoint. A future patch will enable a more flexible
re-organisation.
--HG--
rename : src/mem/PhysicalMemory.py => src/mem/AbstractMemory.py
rename : src/mem/PhysicalMemory.py => src/mem/SimpleMemory.py
rename : src/mem/physical.cc => src/mem/abstract_mem.cc
rename : src/mem/physical.hh => src/mem/abstract_mem.hh
rename : src/mem/physical.cc => src/mem/simple_mem.cc
rename : src/mem/physical.hh => src/mem/simple_mem.hh
With recent changes to the memory system, a port cannot be assigned a peer
port twice. While making use of the Ruby memory system in FS mode, DMA
ports were assigned peer twice, once for the classic memory system
and once for the Ruby memory system. This patch removes this double
assignment of peer ports.
This patch removes the physmem_port from the Atomic CPU and instead
uses the system pointer to access the physmem when using the fastmem
option. The system already keeps track of the physmem and the valid
memory address ranges, and with this patch we merely make use of that
existing functionality. As a result of this change, the overloaded
getMasterPort in the Atomic CPU can be removed, thus unifying the CPUs.
This patch removes the physMemPort from the RubySequencer and instead
uses the system pointer to access the physmem. The system already
keeps track of the physmem and the valid memory address ranges, and
with this patch we merely make use of that existing functionality. The
memory is modified so that it is possible to call the access functions
(atomic and functional) without going through the port, and the memory
is allowed to be unconnected, i.e. have no ports (since Ruby does not
attach it like the conventional memory system).
I am not too happy with the way options are added in files se.py and fs.py
currently. This patch moves all the options to the file Options.py, functions
from which are called when required.
With the SE/FS merge, interrupt controller is created irrespective of the
mode. This patch creates the interrupt controller when Ruby is used and
connects its ports.
Enables the CheckerCPU to be selected at runtime with the --checker option
from the configs/example/fs.py and configs/example/se.py configuration
files. Also merges with the SE/FS changes.
This patch cleans up a number of remaining uses of bus.port which
is now split into bus.master and bus.slave. The only non-trivial change
is the memtest where the level building now has to be aware of the role
of the ports used in the previous level.
This patch merely removes the use of the num_cpus cache parameter
which no longer exists after the introduction of the masterIds. The
affected scripts fail when trying to set the parameter. Note that this
patch does not update the regression stats.
This patch classifies all ports in Python as either Master or Slave
and enforces a binding of master to slave. Conceptually, a master (such
as a CPU or DMA port) issues requests, and receives responses, and
conversely, a slave (such as a memory or a PIO device) receives
requests and sends back responses. Currently there is no
differentiation between coherent and non-coherent masters and slaves.
The classification as master/slave also involves splitting the dual
role port of the bus into a master and slave port and updating all the
system assembly scripts to use the appropriate port. Similarly, the
interrupt devices have to have their int_port split into a master and
slave port. The intdev and its children have minimal changes to
facilitate the extra port.
Note that this patch does not enforce any port typing in the C++
world, it merely ensures that the Python objects have a notion of the
port roles and are connected in an appropriate manner. This check is
carried when two ports are connected, e.g. bus.master =
memory.port. The following patches will make use of the
classifications and specialise the C++ ports into masters and slaves.
This patch moves the connection of the system port to create_system in
Ruby.py. Thereby it allows the failing Ruby test (and other Ruby
systems) to run again.
This patch fixes the currently broken fs.py by specifying the size of
the bridge range rather than the end address. This effectively
subtracts one when determining the address range for the IO bridge
(from IO bus to membus), and thus avoids the overlapping ranges.
This patch implements the functionality for forwarding invalidations and
replacements from the L1 cache of the Ruby memory system to the O3 CPU. The
implementation adds a list of ports to RubyPort. Whenever a replacement or an
invalidation is performed, the L1 cache forwards this to all the ports, which
is the LSQ in case of the O3 CPU.
In preparation for the introduction of Master and Slave ports, this
patch removes the default port parameter in the Python port and thus
forces the argument list of the Port to contain only the
description. The drawback at this point is that the config port and
dma port of PCI and DMA devices have to be connected explicitly. This
is key for future diversification as the pio and config port are
slaves, but the dma port is a master.
This patch makes the bus bridge uni-directional and specialises the
bus ports to be a master port and a slave port. This greatly
simplifies the assumptions on both sides as either port only has to
deal with requests or responses. The following patches introduce the
notion of master and slave ports, and would not be possible without
this split of responsibilities.
In making the bridge unidirectional, the address range mechanism of
the bridge is also changed. For the cases where communication is
taking place both ways, an additional bridge is needed. This causes
issues with the existing mechanism, as the busses cannot determine
when to stop iterating the address updates from the two bridges. To
avoid this issue, and also greatly simplify the specification, the
bridge now has a fixed set of address ranges, specified at creation
time.
Port proxies are used to replace non-structural ports, and thus enable
all ports in the system to correspond to a structural entity. This has
the advantage of accessing memory through the normal memory subsystem
and thus allowing any constellation of distributed memories, address
maps, etc. Most accesses are done through the "system port" that is
used for loading binaries, debugging etc. For the entities that belong
to the CPU, e.g. threads and thread contexts, they wrap the CPU data
port in a port proxy.
The following replacements are made:
FunctionalPort > PortProxy
TranslatingPort > SETranslatingPortProxy
VirtualPort > FSTranslatingPortProxy
--HG--
rename : src/mem/vport.cc => src/mem/fs_translating_port_proxy.cc
rename : src/mem/vport.hh => src/mem/fs_translating_port_proxy.hh
rename : src/mem/translating_port.cc => src/mem/se_translating_port_proxy.cc
rename : src/mem/translating_port.hh => src/mem/se_translating_port_proxy.hh
Currently there is an assumption that restoration from a checkpoint will
happen by first restoring to an atomic CPU and then switching to a timing
CPU. This patch adds support for directly restoring to a timing CPU. It
adds a new option '--restore-with-cpu' which is used to specify the type
of CPU to which the checkpoint should be restored to. It defaults to
'atomic' which was the case before.