Add some values and methods to the request object to track the translation
and access latency for a request and which level of the cache hierarchy responded
to the request.
The first two levels (L0, L1) are private to the core, the third level (L2)is
possibly shared. The protocol supports clustered designs. For example, one
can have two sets of two cores. Each core has an L0 and L1 cache. There are
two L2 controllers where each set accesses only one of the L2 controllers.
A cluster over here means a set of controllers that can be accessed only by a
certain set of cores. For example, consider a two level hierarchy. Assume
there are 4 L1 controllers (private) and 2 L2 controllers. We can have two
different hierarchies here:
a. the address space is partitioned between the two L2 controllers. Each L1
controller accesses both the L2 controllers. In this case, each L1 controller
is a cluster initself.
b. both the L2 controllers can cache any address. An L1 controller has access
to only one of the L2 controllers. In this case, each L2 controller
along with the L1 controllers that access it, form a cluster.
This patch allows for each controller to have a cluster ID, which is 0 by
default. By setting the cluster ID properly, one can instantiate hierarchies
with clusters. Note that the coherence protocol might have to be changed as
well.
This patch fixes couple of bugs in the L2 controller of the mesi cmp
directory protocol.
1. The state MT_I was transitioning to NP on receiving a clean writeback
from the L1 controller. This patch makes it inform the directory controller
about the writeback.
2. The L2 controller was sending the dirty bit to the L1 controller and the
L2 controller used writeback from the L1 controller to update the dirty bit
unconditionally. Now, the L1 controller always assumes that the incoming
data is clean. The L2 controller updates the dirty bit only when the L1
controller writes to the block.
3. Certain unused functions and events are being removed.
This patch replaces max_in_port_rank with the number of inports. The use of
max_in_port_rank was causing spurious re-builds and incorrect initialization
of variables in ruby related regression tests. This was due to the variable
value being used across threads while compiling when it was not meant to be.
Since the number of inports is state machine specific value, this problem
should get solved.
The directory controller should not have the sharer field since there is
only one level 2 cache. Anyway the field was not in use. The owner field
was being used to track the l2 cache version (in case of distributed l2) that
has the cache block under consideration. The information is not required
since the version of the level 2 cache can be obtained from a subset of the
address bits.
This patch fixes a number of stats accounting issues in the DRAM
controller. Most importantly, it separates the system interface and
DRAM interface so that it is clearer what the actual DRAM bandwidth
(and consequently utilisation) is.
This patch unifies the request selection across read and write queues
for FR-FCFS scheduling policy. It also fixes the request selection
code to prioritize the row hits present in the request queues over the
selection based on earliest bank availability.
This patch adds a basic adaptive version of the open-page policy that
guides the decision to keep open or close by looking at the contents
of the controller queues. If no row hits are found, and bank conflicts
are present, then the row is closed by means of an auto
precharge. This is a well-known technique that should improve
performance in most use-cases.
This patch removes the untimed while loop in the write scheduling
mechanism and now schedule commands taking into account the minimum
timing constraint. It also introduces an optimization to track write
queue size and switch from writes to reads if the number of write
requests fall below write low threshold.
This patch adds the tRRD parameter to the DRAM controller. With the
recent addition of the actAllowedAt member for each bank, this
addition is trivial.
This patch changes the tXAW constraint so that it is enforced per rank
rather than globally for all ranks in the channel. It also avoids
using the bank freeAt to enforce the activation limit, as doing so
also precludes performing any column or row command to the
DRAM. Instead the patch introduces a new variable actAllowedAt for the
banks and use this to track when a potential activation can occur.
This patch fixes the controller when a write threshold of 100% is
used. Earlier for 100% write threshold no data is written to memory
as writes never get triggered since this corner case is not
considered.
This patch changes the FCFS bit of FR-FCFS such that requests that
target the earliest available bank are picked first (as suggested in
the original work on FR-FCFS by Rixner et al). To accommodate this we
add functionality to identify a bank through a one-dimensional
identifier (bank id). The member names of the DRAMPacket are also
update to match the style guide.
This patch changes the time the controller is woken up to take the
next scheduling decisions. tRAS is now handled in estimateLatency and
doDRAMAccess and we do not need to worry about it at scheduling
time. The earliest we need to wake up is to do a pre-charge, row
access and column access before the bus becomes free for use.
This patch adds an explicit tRAS parameter to the DRAM controller
model. Previously tRAS was, rather conservatively, assumed to be tRCD
+ tCL + tRP. The default values for tRAS are chosen to match the
previous behaviour and will be updated later.
This patch adds missing initializations of the SenderMachine field of
out_msg's when thery are created in the L2 cache controller of the
MOESI_CMP_directory coherence protocol. When an out_msg is created and this
field is left uninitialized, it is set to the default value MachineType_NUM.
This causes a panic in the MachineType_to_string function when gem5 is
executed with the Ruby debug flag on and it tries to print the message.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
This patch fixes a problem where in Garnet, the enqueue time in the
VCallocator and the SWallocator which is of type Cycles was being stored
inside a variable with int type.
This lead to a known problem restoring checkpoints with garnet & the fixed
pipeline enabled. That value was really big and didn't fit in the variable
overflowing it, therefore some conditions on the VC allocation stage & the
SW allocation stage were not met and the packets didn't advance through the
network, leading to a deadlock panic right after the checkpoint was restored.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
The CoherentBus eventually got virtual methods for its interface. The
"virtuality" of the CoherentBus, however, comes already from the virtual
interface of the bus' ports. There is no need to add another layer of virtual
functions, here.
Get rid of non-deterministic "stats" in ruby.stats output
such as time & date of run, elapsed & CPU time used,
and memory usage. These values cause spurious
miscomparisons when looking at output diffs (though
they don't affect regressions, since the regressions
pass/fail status currently ignores ruby.stats entirely).
Most of this information is already captured in other
places (time & date in stdout, elapsed time & mem usage
in stats.txt), where the regression script is smart
enough to filter it out. It seems easier to get rid of
the redundant output rather than teaching the
regression tester to ignore the same information in
two different places.
ASI_BITS in the Request object were originally used to store a memory
request's ASI on SPARC. This is not the case any more since other ISAs
use the ASI bits to store architecture-dependent information. This
changeset renames the ASI_BITS to ARCH_BITS which better describes
their use. Additionally, the getAsi() accessor is renamed to
getArchFlags().
Using address bit 63 to identify generic IPRs caused problems on
SPARC, where IPRs are heavily used. This changeset redefines how
generic IPRs are identified. Instead of using bit 63, we now use a
separate flag (GENERIC_IPR) a memory request.
This patch ensures that a dequeue event is not scheduled if the memory
controller is waiting for a retry already. Without this check it is
possible for the controller to attempt sending something whilst
already having one packet that is in retry, thus causing the bus to
have an assertion failure.
The Topology source sets up input and output buffers for each of the external
nodes of a topology by indexing on Ruby's generated controller unique IDs.
These unique IDs are found by adding the MachineType_base_number to the version
number of each controller (see any generated *_Controller.cc - init() calls
getToNetQueue and getFromNetQueue using m_version + base). However, the
Topology object used the cntrl_id - which is required to be unique across all
controllers - to index the controllers list as they are being connected to
their input and output buffers. If the cntrl_ids did not match the Ruby unique
ID, the throttles end up connected to incorrectly indexed nodes in the network,
resulting in packets traversing incorrect network paths. This patch fixes the
Topology indexing scheme by using the Ruby unique ID to match that of the
SimpleNetwork buffer vectors.
The previous changeset (9863:9483739f83ee) used STL vector containers to
dynamically allocate stats in the Ruby SimpleNetwork, Switch and Throttle. For
gcc versions before at least 4.6.3, this causes the standard vector allocator
to call Stats copy constructors (a no-no, since stats should be allocated in
the body of each SimObject instance). Since the size of these stats arrays is
known at compile time (NOTE: after code generation), this patch changes their
allocation to be static rather than using an STL vector.
This patch makes it possible to once again build gem5 without any
ISA. The main purpose is to enable work around the interconnect and
memory system without having to build any CPU models or device models.
The regress script is updated to include the NULL ISA target. Currently
no regressions make use of it, but all the testers could (and perhaps
should) transition to it.
--HG--
rename : build_opts/NOISA => build_opts/NULL
rename : src/arch/noisa/SConsopts => src/arch/null/SConsopts
rename : src/arch/noisa/cpu_dummy.hh => src/arch/null/cpu_dummy.hh
rename : src/cpu/intr_control.cc => src/cpu/intr_control_noisa.cc
This patch updates the stats to reflect the: 1) addition of the
internal queue in SimpleMemory, 2) moving of the memory class outside
FSConfig, 3) fixing up of the 2D vector printing format, 4) specifying
burst size and interface width for the DRAM instead of relying on
cache-line size, 5) performing merging in the DRAM controller write
buffer, and 6) fixing how idle cycles are counted in the atomic and
timing CPU models.
The main reason for bundling them up is to minimise the changeset
size.
This patch adds support for specifying multi-channel memory
configurations on the command line, e.g. 'se/fs.py
--mem-type=ddr3_1600_x64 --mem-channels=4'. To enable this, it
enhances the functionality of MemConfig and moves the existing
makeMultiChannel class method from SimpleDRAM to the support scripts.
The se/fs.py example scripts are updated to make use of the new
feature.
This patch changes the default parameter value of conf_table_reported
to match the common case. It also simplifies the regression and config
scripts to reflect this change.
This patch changes the data structure used for the DRAM read, write
and response queues from an STL list to deque. This optimisation is
based on the observation that the size is small (and fixed), and that
the structures are frequently iterated over in a linear fashion.
This patch implements basic write merging in the DRAM to avoid
redundant bursts. When a new access is added to the queue it is
compared against the existing entries, and if it is either
intersecting or immediately succeeding/preceeding an existing item it
is merged.
There is currently no attempt made at avoiding iterating over the
existing items in determining whether merging is possible or not.
This patch gets rid of bytesPerCacheLine parameter and makes the DRAM
configuration separate from cache line size. Instead of
bytesPerCacheLine, we define a parameter for the DRAM called
burst_length. The burst_length parameter shows the length of a DRAM
device burst in bits. Also, lines_per_rowbuffer is replaced with
device_rowbuffer_size to improve code portablity.
This patch adds a burst length in beats for each memory type, an
interface width for each memory type, and the memory controller model
is extended to reason about "system" packets vs "dram" packets and
assemble the responses properly. It means that system packets larger
than a full burst are split into multiple dram packets.
This patch adds a packet queue in SimpleMemory to avoid using the
packet queue in the port (and thus have no involvement in the flow
control). The port queue was bound to 100 packets, and as the
SimpleMemory is modelling both a controller and an actual RAM, it
potentially has a large number of packets in flight. There is
currently no limit on the number of packets in the memory controller,
but this could easily be added in a follow-on patch.
As a result of the added internal storage, the functional access and
draining is updated. Some minor cleaning up and renaming has also been
done.
The memtest regression changes as a result of this patch and the stats
will be updated.
Some of the code in StateMachine.py file is added to all the controllers and
is independent of the controller definition. This code is being moved to the
AbstractController class which is the parent class of all controllers.
This patch removes the notion of a peer block size and instead sets
the cache line size on the system level.
Previously the size was set per cache, and communicated through the
interconnect. There were plenty checks to ensure that everyone had the
same size specified, and these checks are now removed. Another benefit
that is not yet harnessed is that the cache line size is now known at
construction time, rather than after the port binding. Hence, the
block size can be locally stored and does not have to be queried every
time it is used.
A follow-on patch updates the configuration scripts accordingly.
This code seems not to be of any use now. There is no path in the simulator
that allows for reconfiguring the network. A better approach would be to
take a checkpoint and start the simulation from the checkpoint with the new
configuration.
This patch reorganizes the cache tags to allow more flexibility to
implement new replacement policies. The base tags class is now a
clocked object so that derived classes can use a clock if they need
one. Also having deriving from SimObject allows specialized Tag
classes to be swapped in/out in .py files.
The cache set is now templatized to allow it to contain customized
cache blocks with additional informaiton. This involved moving code to
the .hh file and removing cacheset.cc.
The statistics belonging to the cache tags are now including ".tags"
in their name. Hence, the stats need an update to reflect the change
in naming.
This patch adds the notion of source- and derived-clock domains to the
ClockedObjects. As such, all clock information is moved to the clock
domain, and the ClockedObjects are grouped into domains.
The clock domains are either source domains, with a specific clock
period, or derived domains that have a parent domain and a divider
(potentially chained). For piece of logic that runs at a derived clock
(a ratio of the clock its parent is running at) the necessary derived
clock domain is created from its corresponding parent clock
domain. For now, the derived clock domain only supports a divider,
thus ensuring a lower speed compared to its parent. Multiplier
functionality implies a PLL logic that has not been modelled yet
(create a separate clock instead).
The clock domains should be used as a mechanism to provide a
controllable clock source that affects clock for every clocked object
lying beneath it. The clock of the domain can (in a future patch) be
controlled by a handler responsible for dynamic frequency scaling of
the respective clock domains.
All the config scripts have been retro-fitted with clock domains. For
the System a default SrcClockDomain is created. For CPUs that run at a
different speed than the system, there is a seperate clock domain
created. This domain incorporates the CPU and the associated
caches. As before, Ruby runs under its own clock domain.
The clock period of all domains are pre-computed, such that no virtual
functions or multiplications are needed when calling
clockPeriod. Instead, the clock period is pre-computed when any
changes occur. For this to be possible, each clock domain tracks its
children.
This patch removes the explicit setting of the clock period for
certain instances of CoherentBus, NonCoherentBus and IOCache where the
specified clock is same as the default value of the system clock. As
all the values used are the defaults, there are no performance
changes. There are similar cases where the toL2Bus is set to use the
parent CPU clock which is already the default behaviour.
The main motivation for these simplifications is to ease the
introduction of clock domains.
This patch does a bit of tidying up in the bridge code, adding const
where appropriate and also removing redundant checks and adding a few
new ones.
There are no changes to the behaviour of any regressions.
This patch fixes the CommMonitor local variable names, and also
introduces a variable to capture if it expects to see a response. The
latter check considers both needsResponse and memInhibitAsserted.
This patch fixes an outstanding issue in the cache timing calculations
where an atomic access returned a time in Cycles, but the port
forwarded it on as if it was in Ticks.
A separate patch will update the regression stats.
This patch changes the updards snoop packet to avoid allocating and
later deleting it. As the code executes in 0 time and the lifetime of
the packet does not extend beyond the block there is no reason to heap
allocate it.
This patch adds separate actions for requests that missed in the local cache
and messages were sent out to get the requested line. These separate actions
are required for differentiating between the hit and miss latencies in the
statistics collected.
This patch adds separate actions for requests that missed in the local cache
and messages were sent out to get the requested line. These separate actions
are required for differentiating between the hit and miss latencies in the
statistics collected.
The patch started of with removing the global variables from the profiler for
profiling the miss latency of requests made to the cache. The corrresponding
histograms have been moved to the Sequencer. These are combined together when
the histograms are printed. Separate histograms are now maintained for
tracking latency of all requests together, of hits only and of misses only.
A particular set of histograms used to use the type GenericMachineType defined
in one of the protocol files. This patch removes this type. Now, everything
that relied on this type would use MachineType instead. To do this, SLICC has
been changed so that multiple machine types can be declared by a controller
in its preamble.
This patch removes the following three files: RubySlicc_Profiler.sm,
RubySlicc_Profiler_interface.cc and RubySlicc_Profiler_interface.hh.
Only one function prototyped in the file RubySlicc_Profiler.sm. Rest of the
code appearing in any of these files is not in use. Therefore, these files
are being removed.
That one single function, profileMsgDelay(), is being moved to the protocol
files where it is in use. If we need any of these deleted functions, I think
the right way to make them visible is to have the AbstractController class in
a .sm and let the controller state machine inherit from this class. The
AbstractController class can then have the prototypes of these profiling
functions in its definition.
2013-06-24 08:59:08 -05:00
Joel Hestness ext:(%2C%20Nilay%20Vaish%20%3Cnilay%40cs.wisc.edu%3E)
The m_size variable attempted to track m_prio_heap.size(), but it did so
incorrectly due to the functions reanalyzeMessages and reanalyzeAllMessages().
Since this variable is intended to track m_prio_heap.size(), we can simply
replace instances where m_size is referenced with m_prio_heap.size(), which
has the added bonus of removing the need for m_size.
Note: This patch also removes an extraneous DPRINTF format string designator
from reanalyzeAllMessages()
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Previously, .sm files were allowed to use the same name for a type and a
variable. This is unnecessarily confusing and has some bad side effects, like
not being able to declare later variables in the same scope with the same type.
This causes the compiler to complain and die on things like Address Address.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Change all occurrances of Address as a variable name to instead use Addr.
Address is an allowed name in slicc even when Address is also being used as a
type, leading to declarations of "Address Address". While this works, it
prevents adding another field of type Address because the compiler then thinks
Address is a variable name, not type.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Reuse the address finalization code in the TLB instead of replicating
it when handling MMIO. This patch also adds support for injecting
memory mapped IPR requests into the memory system.
This patch removes per processor cycle count, histogram for filter stats,
histogram for multicasts, histogram for prefetch wait, some function
prototypes that do not have definitions.
The Profiler class does not need an event for dumping statistics
periodically. This is because there is a method for dumping statistics
for all the sim objects periodically. Since Ruby is a sim object, its
statistics are also included.
This moves event and transition count statistics for cache controllers to
gem5's statistics. It does the same for the statistics associated with the
memory controller in ruby.
All the cache/directory/dma controllers individually collect the event and
transition counts. A callback function, collateStats(), has been added that
is invoked on the controller version 0 of each controller class. This
function adds all the individual controller statistics to a vector
variables. All the code for registering the statistical variables and
collating them is generated by SLICC. The patch removes the files
*_Profiler.{cc,hh} and *_ProfileDumper.{cc,hh} which were earlier used for
collecting and dumping statistics respectively.
This patch changes the class names of the variuos DRAM configurations
to better reflect what memory they are based on. The speed and
interface width is now part of the name, and also the alias that is
used to select them on the command line.
Some minor changes are done to the actual parameters, to better
reflect the named configurations. As a result of these changes the
regressions change slightly and the stats will be bumped in a separate
patch.
This patch adds a histogram to track how many bytes are accessed in an
open row before it is closed. This metric is useful in characterising
a workload and the efficiency of the DRAM scheduler. For example, a
DDR3-1600 device requires 44 cycles (tRC) before it can activate
another row in the same bank. For a x32 interface (8 bytes per cycle)
that means 8 x 44 = 352 bytes must be transferred to hide the
preparation time.
This patch adds a frontend and backend static latency to the DRAM
controller by delaying the responses. Two parameters expressing the
frontend and backend contributions in absolute time are added to the
controller, and the appropriate latency is added to the responses when
adding them to the (infinite) queued port for sending.
For writes and reads that hit in the write buffer, only the frontend
latency is added. For reads that are serviced by the DRAM, the static
latency is the sum of the pipeline latencies of the entire frontend,
backend and PHY. The default values are chosen based on having roughly
10 pipeline stages in total at 500 MHz.
In the future, it would be sensible to make the controller use its
clock and convert these latencies (and a few of the DRAM timings) to
cycles.
This patch does some minor tidying up of the MSHR and MSHRQueue. The
clean up started as part of some ad-hoc tracing and debugging, but
seems worthwhile enough to go in as a separate patch.
The highlights of the changes are reduced scoping (private) members
where possible, avoiding redundant new/delete, and constructor
initialisation to please static code analyzers.
This patch introduces a mirrored internal snoop port to facilitate
easy addition of flow control for the snoop responses that are turned
into normal responses on their return. To perform this, the slave
ports of the coherent bus are wrapped in internal master ports that
are passed as the source ports to the response layer in question.
As a result of this patch, there is more contention for the response
resources, and as such system performance will decrease slightly.
A consequence of the mirrored internal port is that the port the bus
tells to retry (the internal one) and the port actually retrying (the
mirrored) one are not the same. Thus, the existing check in tryTiming
is not longer correct. In fact, the test is redundant as the layer is
only in the retry state while calling sendRetry on the waiting port,
and if the latter does not immediately call the bus then the retry
state is left. Consequently the check is removed.
This patch makes the buses multi layered, and effectively creates a
crossbar structure with distributed contention ports at the
destination ports. Before this patch, a bus could have a single
request, response and snoop response in flight at any time, and with
these changes there can be as many requests as connected slaves (bus
master ports), and as many responses as connected masters (bus slave
ports).
Together with address interleaving, this patch enables us to create
high-throughput memory interconnects, e.g. 50+ GByte/s.
This patch makes the flow control and state updates of the coherent
bus more clear by separating the two cases, i.e. forward as a snoop
response, or turn it into a normal response.
With this change it is also more clear what resources are being
occupied, and that we effectively bypass the busy check for the second
case. As a result of the change in resource usage some stats change.
This patch does some minor housekeeping on the bus code, removing
redundant code, and moving the extraction of the destination id to the
top of the functions using it.
This patch adds a basic set of stats which are hard to impossible to
implement using only communication monitors, and are needed for
insight such as bus utilization, transactions through the bus etc.
Stats added include throughput and transaction distribution, and also
a two-dimensional vector capturing how many packets and how much data
is exchanged between the masters and slaves connected to the bus.
This patch changes the set used to track outstanding requests to an
unordered set (part of C++11 STL). There is no need to maintain the
order, and hopefully there might even be a small performance benefit.