This patch provides the example test script to configure different HMC
architecture and run traffic through traffic generator.
Committed by Jason Lowe-Power <jason@lowepower.com>
In this new hmc configuration we have used the existing components in gem5
mainly [SerialLink] [NoncoherentXbar]& [DRAMCtrl] to define 3 different
architecture for HMC.
Highlights
1- It explores 3 different HMC architectures
2- It creates 4-HMC crossbars and attaches 16 vault controllers with it.
This will connect vaults to serial links
3- From the previous version, HMCController with round robin funtionality
is being removed and all the serial links are being accessible directly
from user ports
4- Latency incorporated by HMCController (in previous version) is being
added to SerialLink
Committed by Jason Lowe-Power <jason@lowepower.com>
This patch ensures a walker cache is instantiated if specfied.
Change-Id: I2c6b4bf3454d56bb19558c73b406e1875acbd986
Reviewed-by: Curtis Dunham <curtis.dunham@arm.com>
Reviewed-by: Mitch Hayenga <mitch.hayenga@arm.com>
Eliminate the VSZ constant that defined the Wavefront size (in numbers of work
items); replaced it with a parameter in the GPU.py configuration script.
Changed all data structures dependent on the Wavefront size to be dynamically
sized. Legal values of Wavefront size are 16, 32, 64 for now and checked at
initialization time.
Disable the default snoop filter in the SystemXBar so that the
typical membus does not have a snoop filter by default. Instead,
add the snoop filter only when there are caches added to the system
(with the caches / l2cache options).
The underlying problem is that the snoop filter grows without
bounds (for now) if there are no caches to tell it that lines have
been evicted. This causes slow regression runs for all the atomic
regressions. This patch fixes this behaviour.
--HG--
extra : source : f97c20511828209757440839ed48d741d02d428f
According to the Intel Multi Processor Specification rev 1.4 (-006) (*),
section 4.3.2 Bus Entries, Bus type strings are >>6-character ASCII
(blank-filled) strings<<.
This patch properly pads the entries with the missing spaces at the end.
(*) http://www.intel.com/design/pentium/datashts/24201606.pdf
Committed by Jason Lowe-Power <power.jg@gmail.com>
Distributed gem5 is the result of the convergence effort between
multi-gem5 and pd-gem5. It relies on the base multi-gem5 infrastructure
for packet forwarding, synchronisation and checkpointing but combines
those with the elaborated network switch model from pd-gem5.
This patch adds a config script that broadly replicates the behaviour
of lat_mem_rd. The test is based on traffic generators, and as such we
simply randomise addresses in increasingly large ranges, and play them
back using the trace functionality of the traffic generator.
The test script is accompanied by a post-processing and visualisation
script. At the moment no configurability is added to tweak the memory
hierarchy, but a follow on patch could easily extend the
functionality.
This patch introduces the ability of making the coherent crossbar the
point of coherency. If so, the crossbar does not forward packets where
a cache with ownership has already committed to responding, and also
does not forward any coherency-related packets that are not intended
for a downstream memory controller. Thus, invalidations and upgrades
are turned around in the crossbar, and the memory controller only sees
normal reads and writes.
In addition this patch moves the express snoop promotion of a packet
to the crossbar, thus allowing the downstream cache to check the
express snoop flag (as it should) for bypassing any blocking, rather
than relying on whether a cache is responding or not.
This patch changes how the cache determines if snoops should be
forwarded from the memory side to the CPU side. Instead of having a
parameter, the cache now looks at the port connected on the CPU side,
and if it is a snooping port, then snoops are forwarded. Less error
prone, and less parameters to worry about.
The patch also tidies up the CPU classes to ensure that their I-side
port is not snooping by removing overrides to the snoop request
handler, such that snoop requests will panic via the default
MasterPort implement
Add a platform with support for both aarch32 and aarch64. This
platform implements a subset of the devices in a real Versatile
Express and extends it with some gem5-specific functionality. It is in
many ways similar to the old VExpress_EMM64 platform, but supports the
following new features:
* Automatic PCI interrupt assignment
* PCI interrupts allocated in a contiguous range.
* Automatic boot loader selection (32-bit / 64-bit)
* Cleaner memory map where gem5-specific devices live in CS5 which
isn't used by current Versatile Express platforms.
* No fake devices. Devices that were previously faked will be
removed from the device tree instead.
* Support for 510 GiB contiguous memory
This patch allows the ruby random tester to use ruby ports that may only
support instr or data requests. This patch is similar to a previous changeset
(8932:1b2c17565ac8) that was unfortunately broken by subsequent changesets.
This current patch implements the support in a more straight-forward way.
Since retries are now tested when running the ruby random tester, this patch
splits up the retry and drain check behavior so that RubyPort children, such
as the GPUCoalescer, can perform those operations correctly without having to
duplicate code. Finally, the patch also includes better DPRINTFs for
debugging the tester.
This patch adds changes to the configuration scripts to support elastic
tracing and replay.
The patch adds a command line option to enable elastic tracing in SE mode
and FS mode. When enabled the Elastic Trace cpu probe is attached to O3CPU
and a few O3 CPU parameters are tuned. The Elastic Trace probe writes out
both instruction fetch and data dependency traces. The patch also enables
configuring the TraceCPU to replay traces using the SE and FS script.
The replay run is designed to resume from checkpoint using atomic cpu to
restore state keeping it consistent with FS run flow. It then switches to
TraceCPU to replay the input traces.
The gem5's current PCI host functionality is very ad hoc. The current
implementations require PCI devices to be hooked up to the
configuration space via a separate configuration port. Devices query
the platform to get their config-space address range. Un-mapped parts
of the config space are intercepted using the XBar's default port
mechanism and a magic catch-all device (PciConfigAll).
This changeset redesigns the PCI host functionality to improve code
reuse and make config-space and interrupt mapping more
transparent. Existing platform code has been updated to use the new
PCI host and configured to stay backwards compatible (i.e., no
guest-side visible changes). The current implementation does not
expose any new functionality, but it can easily be extended with
features such as automatic interrupt mapping.
PCI devices now register themselves with a PCI host controller. The
host controller interface is defined in the abstract base class
PciHost. Registration is done by PciHost::registerDevice() which takes
the device, its bus position (bus/dev/func tuple), and its interrupt
pin (INTA-INTC) as a parameter. The registration interface returns a
PciHost::DeviceInterface that the PCI device can use to query memory
mappings and signal interrupts.
The host device manages the entire PCI configuration space. Accesses
to devices decoded into the devices bus position and then forwarded to
the correct device.
Basic PCI host functionality is implemented in the GenericPciHost base
class. Most platforms can use this class as a basic PCI controller. It
provides the following functionality:
* Configurable configuration space decoding. The number of bits
dedicated to a device is a prameter, making it possible to support
both CAM, ECAM, and legacy mappings.
* Basic interrupt mapping using the interruptLine value from a
device's configuration space. This behavior is the same as in the
old implementation. More advanced controllers can override the
interrupt mapping method to dynamically assign host interrupts to
PCI devices.
* Simple (base + addr) remapping from the PCI bus's address space to
physical addresses for PIO, memory, and DMA.
Add support for automatically discover available platforms. The
Python-side uses functionality similar to what we use when
auto-detecting available CPU models. The machine IDs have been updated
to match the platform configurations. If there isn't a matching
machine ID, the configuration scripts default to -1 which Linux uses
for device tree only platforms.
Added the missing types EthernetAddr and Current to the JSON/INI file
reader example configs/example/read_config.py.
Also added __str__ to EthernetAddr to make values appear in the same form
in JSON an INI files.
This patch adds yet another twist to the memtest cache hierarchy, in that
the writeback_clean option is toggled at every level to match the
clusivity of the downstream cache.
This patch adds the necessary commands and cache functionality to
allow clean writebacks. This functionality is crucial, especially when
having exclusive (victim) caches. For example, if read-only L1
instruction caches are not sending clean writebacks, there will never
be any spills from the L1 to the L2. At the moment the cache model
defaults to not sending clean writebacks, and this should possibly be
re-evaluated.
The implementation of clean writebacks relies on a new packet command
WritebackClean, which acts much like a Writeback (renamed
WritebackDirty), and also much like a CleanEvict. On eviction of a
clean block the cache either sends a clean evict, or a clean
writeback, and if any copies are still cached upstream the clean
evict/writeback is dropped. Similarly, if a clean evict/writeback
reaches a cache where there are outstanding MSHRs for the block, the
packet is dropped. In the typical case though, the clean writeback
allocates a block in the downstream cache, and marks it writable if
the evicted block was writable.
The patch changes the O3_ARM_v7a L1 cache configuration and the
default L1 caches in config/common/Caches.py
This patch adds an new twist to the memtest cache hierarchy, in that
it switches from mostly inclusive to mostly exclusive at every level
in the tree. This has helped weed out plenty issues, and serves as a
good stress tests.
This patch adds a parameter to control the cache clusivity, that is if
the cache is mostly inclusive or exclusive. At the moment there is no
intention to support strict policies, and thus the options are: 1)
mostly inclusive, or 2) mostly exclusive.
The choice of policy guides the behaviuor on a cache fill, and a new
helper function, allocOnFill, is created to encapsulate the decision
making process. For the timing mode, the decision is annotated on the
MSHR on sending out the downstream packet, and in atomic we directly
pass the decision to handleFill. We (ab)use the tempBlock in cases
where we are not allocating on fill, leaving the rest of the cache
unaffected. Simple and effective.
This patch also makes it more explicit that multiple caches are
allowed to consider a block writable (this is the case
also before this patch). That is, for a mostly inclusive cache,
multiple caches upstream may also consider the block exclusive. The
caches considering the block writable/exclusive all appear along the
same path to memory, and from a coherency protocol point of view it
works due to the fact that we always snoop upwards in zero time before
querying any downstream cache.
Note that this patch does not introduce clean writebacks. Thus, for
clean lines we are essentially removing a cache level if it is made
mostly exclusive. For example, lines from the read-only L1 instruction
cache or table-walker cache are always clean, and simply get dropped
rather than being passed to the L2. If the L2 is mostly exclusive and
does not allocate on fill it will thus never hold the line. A follow
on patch adds the clean writebacks.
The patch changes the L2 of the O3_ARM_v7a CPU configuration to be
mostly exclusive (and stats are affected accordingly).
This patch enables modeling a complete Hybrid Memory Cube (HMC) device. It
highly reuses the existing components in gem5's general memory system with some
small modifications. This changeset requires additional patches to model a
complete HMC device.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
makeSparcSystem() in configs/common/FSConfig.py is missing the cmdLine
parameter Without the parameter the simulation fails to start. With the
parameter the simulation starts properly.
Adds SMT support to the "simple" CPU models so that they can be
used with other SMT-supported CPUs. Example usage: this enables
the TimingSimpleCPU to be used to warmup caches before swapping to
detailed mode with the in-order or out-of-order based CPU models.
Added a new directory in configs (learning_gem5) to hold the scripts that are
used in the book. See http://lowepower.com/jason/learning_gem5/ for a working
copy. For now, only the scripts in Part 1: Getting started with gem5
have been added. A separate patch adds tests for these scripts.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
We no longer use the C library based random number generator: random().
Instead we use the C++ library provided rng. So setting the random seed for
the RubySystem class has no effect. Hence the variable and the corresponding
option are being dropped.
Open up for other subclasses to BaseCache and transition to using the
explicit Cache subclass.
--HG--
rename : src/mem/cache/BaseCache.py => src/mem/cache/Cache.py
This patch serves to avoid name clashes with the classic cache. For
some reason having two 'SimObject' files with the same name creates
problems.
--HG--
rename : src/mem/ruby/structures/Cache.py => src/mem/ruby/structures/RubyCache.py
We no longer use the C library based random number generator: random().
Instead we use the C++ library provided rng. So setting the random seed for
the RubySystem class has no effect. Hence the variable and the corresponding
option are being dropped.
Expose MessageBuffers from SLICC controllers as SimObjects that can be
manipulated in Python. This patch has numerous benefits:
1) First and foremost, it exposes MessageBuffers as SimObjects that can be
manipulated in Python code. This allows parameters to be set and checked in
Python code to avoid obfuscating parameters within protocol files. Further, now
as SimObjects, MessageBuffer parameters are printed to config output files as a
way to track parameters across simulations (e.g. buffer sizes)
2) Cleans up special-case code for responseFromMemory buffers, and aligns their
instantiation and use with mandatoryQueue buffers. These two special buffers
are the only MessageBuffers that are exposed to components outside of SLICC
controllers, and they're both slave ends of these buffers. They should be
exposed outside of SLICC in the same way, and this patch does it.
3) Distinguishes buffer-specific parameters from buffer-to-network parameters.
Specifically, buffer size, randomization, ordering, recycle latency, and ports
are all specific to a MessageBuffer, while the virtual network ID and type are
intrinsics of how the buffer is connected to network ports. The former are
specified in the Python object, while the latter are specified in the
controller *.sm files. Unlike buffer-specific parameters, which may need to
change depending on the simulated system structure, buffer-to-network
parameters can be specified statically for most or all different simulated
systems.
The RubyCache (CacheMemory) latency parameter is only used for top-level caches
instantiated for Ruby coherence protocols. However, the top-level cache hit
latency is assessed by the Sequencer as accesses flow through to the cache
hierarchy. Further, protocol state machines should be enforcing these cache hit
latencies, but RubyCaches do not expose their latency to any existng state
machines through the SLICC/C++ interface. Thus, the RubyCache latency parameter
is superfluous for all caches. This is confusing for users.
As a step toward pushing L0/L1 cache hit latency into the top-level cache
controllers, move their latencies out of the RubyCache declarations and over to
their Sequencers. Eventually, these Sequencer parameters should be exposed as
parameters to the top-level cache controllers, which should assess the latency.
NOTE: Assessing these latencies in the cache controllers will require modifying
each to eliminate instantaneous Ruby hit callbacks in transitions that finish
accesses, which is likely a large undertaking.
Transaction Level Modeling (TLM2.0) is widely used in industry for creating
virtual platforms (IEEE 1666 SystemC). This patch contains a standard compliant
implementation of an external gem5 port, that enables the usage of gem5 as a
TLM initiator component in SystemC based virtual platforms. Both TLM coding
paradigms loosely timed (b_transport) and aproximately timed (nb_transport) are
supported.
Compared to the original patch a TLM memory manager was added. Furthermore, the
transaction object was removed and for each TLM payload a PacketPointer that
points to the original gem5 packet is added as an TLM extension. For event
handling single events are now created.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>