Adds DVFS capabilities to gem5, by allowing users to specify lists for
frequencies and voltages in SrcClockDomains and VoltageDomains respectively.
A separate component, DVFSHandler, provides a small interface to change
operating points of the associated domains.
Clock domains will be linked to voltage domains and thus allow separate clock,
but shared voltage lines.
Currently all the valid performance-level updates are performed with a fixed
transition latency as specified for the domain.
Config file example:
...
vd = VoltageDomain(voltage = ['1V','0.95V','0.90V','0.85V'])
tsys.cluster1.clk_domain.clock = ['1GHz','700MHz','400MHz','230MHz']
tsys.cluster2.clk_domain.clock = ['1GHz','700MHz','400MHz','230MHz']
tsys.cluster1.clk_domain.domain_id = 0
tsys.cluster2.clk_domain.domain_id = 1
tsys.cluster1.clk_domain.voltage_domain = vd
tsys.cluster2.clk_domain.voltage_domain = vd
tsys.dvfs_handler.domains = [tsys.cluster1.clk_domain,
tsys.cluster2.clk_domain]
tsys.dvfs_handler.enable = True
This patch adds a DRAMPower flag to enable off-line DRAM power
analysis using the DRAMPower tool. A new DRAMPower flag is added
and a follow-on patch adds a Python script to post-process the output
and order it based on time stamps.
The long-term goal is to link DRAMPower as a library and provide the
commands through function calls to the model rather than first
printing and then parsing the commands. At the moment it is also up to
the user to ensure that the same DRAM configuration is used by the
gem5 controller model and DRAMPower.
This patch adds the index of the bank and rank as a field so that we can
determine the identity of a given bank (reference or pointer) for the
power tracing. We also grab the opportunity of cleaning up the
arguments used for identifying the bank when activating.
In a cycle, we could see a R and W requests corresponding to the same
page walk being sent to the memory. During the cycle that assertion
happens, we have 2 responses corresponding to the R and W above. We
also have a 'read' variable to keep track of the inflight Read
request, this gets reset to NULL right after we send out any R
request; and gets set to the next R in the page walk when a response
comes back.
The issue we are seeing here is when we get a response for W request,
assert(!read) fires because we got a response for R request right
before this, hence we set 'read' to NOT NULL value, pointing to the
next R request in the pagewalk!
This work was done while Binh was an intern at AMD Research.
Check for free entries in Load Queue and Store Queue separately to
avoid cases when load cannot be renamed due to full Store Queue and
vice versa.
This work was done while Binh was an intern at AMD Research.
This patch bumps the supported version of gcc from 4.4 to 4.6, and
clang from 2.9 to 3.0. This enables, amongst other things, range-based
for loops, lambda expressions, etc. The STL implementation shipping
with 4.6 also has a full functional implementation of unique_ptr and
shared_ptr.
The language describing the clockEdge and nextCycle functions were ambiguous,
and so were prone to misinterpretation/misuse. Clear up the comments to more
rigorously describe their functionality.
Using '== true' in a boolean expression is totally redundant,
and using '== false' is pretty verbose (and arguably less
readable in most cases) compared to '!'.
It's somewhat of a pet peeve, perhaps, but I had some time
waiting for some tests to run and decided to clean these up.
Unfortunately, SLICC appears not to have the '!' operator,
so I had to leave the '== false' tests in the SLICC code.
This patch removes the stat totalCommittedInsts. This variable was used for
recording the total number of instructions committed across all the threads
of a core. The instructions committed by each thread are recorded invidually.
The total would now be generated by summing these individual counts.
This patch makes a more firm connection between the DDR3-1600
configuration and the corresponding datasheet, and also adds a
DDR3-2133 and a DDR4-2400 configuration. At the moment there is also
an ongoing effort to align the choice of datasheets to what is
available in DRAMPower.
This patch extends the current timing parameters with the DRAM cycle
time. This is needed as the DRAMPower tool expects timestamps in DRAM
cycles. At the moment we could get away with doing this in a
post-processing step as the DRAMPower execution is separate from the
simulation run. However, in the long run we want the tool to be called
during the simulation, and then the cycle time is needed.
This patch adds the basic ingredients for a precharge all operation,
to be used in conjunction with DRAM power modelling.
Currently we do not try and apply any cleverness when precharging all
banks, thus even if only a single bank is open we use PREA as opposed
to PRE. At the moment we only have a single tRP (tRPpb), and do not
model the slightly longer all-bank precharge constraint (tRPab).
This patch adds the tRTP timing constraint, governing the minimum time
between a read command and a precharge. Default values are provided
for the existing DRAM types.
This patch merges the two control paths used to estimate the latency
and update the bank state. As a result of this merging the computation
is now in one place only, and should be easier to follow as it is all
done in absolute (rather than relative) time.
As part of this change, the scheduling is also refined to ensure that
we look at a sensible estimate of the bank ready time in choosing the
next request. The bank latency stat is removed as it ends up being
misleading when the DRAM access code gets evaluated ahead of time (due
to the eagerness of waking the model up for scheduling the next
request).
This patch adds the write recovery time to the DRAM timing
constraints, and changes the current tRASDoneAt to a more generic
preAllowedAt, capturing when a precharge is allowed to take place.
The part of the DRAM access code that accounts for the precharge and
activate constraints is updated accordingly.
This patch adds power states to the controller. These states and the
transitions can be used together with the Micron power model. As a
more elaborate use-case, the transitions can be used to drive the
DRAMPower tool.
At the moment, the power-down modes are not used, and this patch
simply serves to capture the idle, auto refresh and active modes. The
patch adds a third state machine that interacts with the refresh state
machine.
This patch adds a state machine for the refresh scheduling to
ensure that no accesses are allowed while the refresh is in progress,
and that all banks are propely precharged.
As part of this change, the precharging of banks of broken out into a
method of its own, making is similar to how activations are dealt
with. The idle accounting is also updated to ensure that the refresh
duration is not added to the time that the DRAM is in the idle state
with all banks precharged.
This patch changes the read/write event loop to use a single event
(nextReqEvent), along with a state variable, thus joining the two
control flows. This change makes it easier to follow the state
transitions, and control what happens when.
With the new loop we modify the overly conservative switching times
such that the write-to-read switch allows bank preparation to happen
in parallel with the bus turn around. Similarly, the read-to-write
switch uses the introduced tRTW constraint.
This patch adds a the member function StaticInst::printFlags to allow all
of an instruction's flags to be printed without using the individual
is... member functions or resorting to exposing the 'flags' vector
It also replaces the enum definition StaticInst::Flags with a
Python-generated enumeration and adds to the enum generation mechanism
in src/python/m5/params.py to allow Enums to be placed in namespaces
other than Enums or, alternatively, in wrapper structs allowing them to
be inherited by other classes (so populating that class's name-space
with the enumeration element names).
This patch encompasses several interrelated and interdependent changes
to the ISA generation step. The end goal is to reduce the size of the
generated compilation units for instruction execution and decoding so
that batch compilation can proceed with all CPUs active without
exhausting physical memory.
The ISA parser (src/arch/isa_parser.py) has been improved so that it can
accept 'split [output_type];' directives at the top level of the grammar
and 'split(output_type)' python calls within 'exec {{ ... }}' blocks.
This has the effect of "splitting" the files into smaller compilation
units. I use air-quotes around "splitting" because the files themselves
are not split, but preprocessing directives are inserted to have the same
effect.
Architecturally, the ISA parser has had some changes in how it works.
In general, it emits code sooner. It doesn't generate per-CPU files,
and instead defers to the C preprocessor to create the duplicate copies
for each CPU type. Likewise there are more files emitted and the C
preprocessor does more substitution that used to be done by the ISA parser.
Finally, the build system (SCons) needs to be able to cope with a
dynamic list of source files coming out of the ISA parser. The changes
to the SCons{cript,truct} files support this. In broad strokes, the
targets requested on the command line are hidden from SCons until all
the build dependencies are determined, otherwise it would try, realize
it can't reach the goal, and terminate in failure. Since build steps
(i.e. running the ISA parser) must be taken to determine the file list,
several new build stages have been inserted at the very start of the
build. First, the build dependencies from the ISA parser will be emitted
to arch/$ISA/generated/inc.d, which is then read by a new SCons builder
to finalize the dependencies. (Once inc.d exists, the ISA parser will not
need to be run to complete this step.) Once the dependencies are known,
the 'Environments' are made by the makeEnv() function. This function used
to be called before the build began but now happens during the build.
It is easy to see that this step is quite slow; this is a known issue
and it's important to realize that it was already slow, but there was
no obvious cause to attribute it to since nothing was displayed to the
terminal. Since new steps that used to be performed serially are now in a
potentially-parallel build phase, the pathname handling in the SCons scripts
has been tightened up to deal with chdir() race conditions. In general,
pathnames are computed earlier and more likely to be stored, passed around,
and processed as absolute paths rather than relative paths. In the end,
some of these issues had to be fixed by inserting serializing dependencies
in the build.
Minor note:
For the null ISA, we just provide a dummy inc.d so SCons is never
compelled to try to generate it. While it seems slightly wrong to have
anything in src/arch/*/generated (i.e. a non-generated 'generated' file),
it's by far the simplest solution.
The unproxy code for Parent.any can generate a circular reference
in certain situations with classes hierarchies like those in ClockDomain.py.
This patch solves this by marking ouself as visited to make sure the
search does not resolve to a self-reference.
The ARM TLBs have a bootUncacheability flag used to make some loads
and stores become uncacheable when booting in FS mode. Later the
flag is cleared to let those loads and stores operate as normal. When
doing a takeOverFrom(), this flag's state is not preserved and is
momentarily reset until the CPSR is touched. On single core runs this
is a non-issue. On multi-core runs this can lead to crashes on the O3
CPU model from the following series of events:
1) takeOverFrom executed to switch from Atomic -> O3
2) All bootUncacheability flags are reset to true
3) Core2 tries to execute a load covered by bootUncacheability, it
is flagged as uncacheable
4) Core2's load needs to replay due to a pipeline flush
3) Core1 core does an action on CPSR
4) The handling code for CPSR then checks all other cores
to determine if bootUncacheability can be set to false
5) Asynchronously set bootUncacheability on all cores to false
6) Core2 replays load previously set as uncacheable and notices
it is now flagged as cacheable, leads to a panic.
This patch implements takeOverFrom() functionality for the ARM TLBs
to preserve flag values when switching from atomic -> detailed.
For the o3, add instruction mix (OpClass) histogram at commit (stats
also already collected at issue). For the simple CPUs we add a
histogram of executed instructions
This patch squashes prefetch requests from downstream caches,
so that they do not steal cachelines away from caches closer
to the cpu. It was originally coded by Mitch Hayenga and
modified by Aasheesh Kolli.
This source for stats binds an object and a method / function from the object
to a stats object. This allows pulling out stats from object methods without
needing to go through a global, or static shim.
Syntax is somewhat unpleasant, but the templates and method pointer type
specification were quite tricky. Interface is very clean though; and similar
to .functor
Allow the specification of a socket ID for every core that is reflected in the
MPIDR field in ARM systems. This allows studying multi-socket / cluster
systems with ARM CPUs.
Splits the CommMonitor trace_file parameter into three parameters. Previously,
the trace was only enabled if the trace_file parameter was set, and would be
written to this file. This patch adds in a trace_enable and trace_compress
parameter to the CommMonitor.
No trace is generated if trace_enable is set to False. If it is set to True, the
trace is written to a file based on the name of the SimObject in the simulation
hierarchy. For example, system.cluster.il1_commmonitor.trc. This filename can be
overridden by additionally specifying a file name to the trace_file parameter
(more on this later).
The trace_compress parameter will append .gz to any filename if set to True.
This enables compression of the generated traces. If the file name already ends
in .gz, then no changes are made.
The trace_file parameter will override the name set by the trace_enable
parameter. In the case that the specified name does not end in .gz but
trace_compress is set to true, .gz is appended to the supplied file name.
Unimplemented miscregs for the generic timer were guarded by panics
in arm/isa.cc which can be tripped by the O3 model if it speculatively
executes a wrong path containing a mrs instruction with a bad miscreg
index. These registers were flagged as implemented and accessible.
This patch changes the miscreg info bit vector to flag them as
unimplemented and inaccessible. In this case, and UndefinedInst
fault will be generated if the register access is not trapped
by a hypervisor.
This patch changes the default pixel clock to effectively generate
1080p resolution at 60 frames per second. It is dependent upon the
kernel device tree file using the specified resolution / display
string in the comments.
This is a quick hack to communicate a greater number of CPUs to a guest OS via
the ARM A9 SCU config register. Some OSes (Linux) just look at the bottom field
to count CPUs and with a small change can look at bits [3:0] to learn about up
to 16 CPUs.
Very much unsupported (and contains warning messages as such) but useful for
running 8 core sims without hardwiring CPU count in the guest OS.
With (upcoming) separate compilation, they are useless. Only
link-time optimization could re-inline them, but ideally
feedback-directed optimization would choose to do so only for
profitable (i.e. common) instructions.
The MicroMemOp class generates the disassembly for both integer
and floating point instructions, but it would always print its
first operand as an integer register without considering that the
op may be a floating instruction in which case a float register
should be displayed instead.
In the O3 LSQ, data read/written is printed out in DPRINTFs. However,
the data field is treated as a character string with a null terminated.
However the data field is not encoded this way. This patch removes
that possibility by removing the data part of the print.
This never actually worked since it was printing out only a word
of the cache block and not the entire thing and doubly didn't work
csprintf overrides the %#x specifier and assumes a char* array is
actually a string.
O3CPU has a compile-time maximum width set in o3/impl.hh, but checking
the configuration against this limit was not implemented anywhere
except for fetch. Configuring a wider pipe than the limit can silently
cause various issues during the simulation. This patch adds the proper
checking in the constructor of the various pipeline stages.
Without this declaration, new clangs will complain about this value
being unused. It has no explicit use in the codebase, but it can be
useful to turn on all debugging flags while in a debugger to greatly
increase simulator verbosity.
A number of calls to isEmpty() and numFreeEntries()
should be thread-specific.
In cpu.cc, the fact that tid is /*commented*/ out is a bug. Say the rob
has instructions from thread 0 (isEmpty() returns false), and none from
thread 1. If we are trying to squash all of thread 1, then
readTailInst(thread 1) will be called because rob->isEmpty() returns
false. The result is end_it is not in the list and the while
statement loops indefinitely back over the cpu's instList.
In iew_impl.hh, all threads are told they have the entire remaining IQ, when
each thread actually has a certain allocation. The result is extra stalls at
the iew dispatch stage which the rename stage usually takes care of.
In commit_impl.hh, rob->readHeadInst(thread 1) can be called if the rob only
contains instructions from thread 0. This returns a dummyInst (which may work
since we are trying to squash all instructions, but hardly seems like the right
way to do it).
In rob_impl.hh this fix skips the rest of the function more frequently and is
more efficient.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Upon aggregating records, serialize system's cache-block size, as the
cache-block size can be different when restoring from a checkpoint. This way,
we can correctly read all records when restoring from a checkpoints, even if
the cache-block size is different.
Note, that it is only possible to restore from a checkpoint if the
desired cache-block size is smaller or equal to the cache-block size
when the checkpoint was taken; we can split one larger request into
multiple small ones, but it is not reliable to do the opposite.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Simulating a SMP or multicore requires devices to be shared between
multiple KVM vCPUs. This means that locking is required when accessing
devices. This changeset adds the necessary locking to allow devices to
execute correctly. It is implemented by temporarily migrating the KVM
CPU to the VM's (and devices) event queue when handling
MMIO. Similarly, the VM migrates to the interrupt controller's event
queue when delivering an interrupt.
The support for fast-forwarding of multicore simulations added by this
changeset assumes that all devices in a system are simulated in the
same thread and each vCPU has its own thread. Special care must be
taken to ensure that devices living under the CPU in the object
hierarchy (e.g., the interrupt controller) do not inherit the parent
CPUs thread and are assigned to device thread. The KvmVM object is
assumed to live in the same thread as the other devices in the system.
The calling thread is undefined when the PollQueue services events.
This implies that PollEvents need to handle the case where they are
processed from a different thread than the thread that created the
event. This changeset adds temporary event queue migrations to the VNC
server, the ethernet tap device, and the terminal to protect them from
inter-thread calls.
As of now, the enqueue statement can take in any number of 'pairs' as
argument. But we only use the pair in which latency is the key. This
latency is allowed to be either a fixed integer or a member variable of
controller in which the expression appears. This patch drops the use of pairs
in an enqueue statement. Instead, an expression is allowed which will be
interpreted to be the latency of the enqueue. This expression can anything
allowed by slicc including a constant integer or a member variable.
We need the ability to lock event queues to enable device accesses
across threads. The serviceOne() method now takes a service lock prior
to handling a new event. By locking an event queue, a different
thread/eq can effectively execute in the context of the locked event
queue. To simplify temporary event queue migrations, this changeset
introduces the EventQueue::ScopedMigration class that unlocks the
current event queue, locks a new event queue, and updates the current
event queue variable.
In order to prevent deadlocks, event queues need to be released when
waiting on barriers. This is implemented using the
EventQueue::ScopedRelease class. An instance of this class is, for
example, used in the BaseGlobalEvent class to release the event queue
when waiting on the synchronization barrier.
The intended use for this functionality is when devices need to be
accessed across thread boundaries. For example, when fast-forwarding,
it might be useful to run devices and CPUs in separate threads. In
such a case, the CPU locks the device queue whenever it needs to
perform IO. This functionality is primarily intended for KVM.
Note: Migrating between event queues can lead to non-deterministic
timing. Use with extreme care!
--HG--
extra : rebase_source : 23e3a741a1fd73861d1339782dbbe1bc76285315
This patch fixes violation of TSO in the O3CPU, as all loads must be
ordered with all other loads. In the LQ, if a snoop is observed, all
subsequent loads need to be squashed if the system is TSO.
Prior to this patch, the following case could be violated:
P0 | P1 ;
MOV [x],mail=/usr/spool/mail/nilay | MOV EAX,[y] ;
MOV [y],mail=/usr/spool/mail/nilay | MOV EBX,[x] ;
exists (1:EAX=1 /\ 1:EBX=0) [is a violation]
The problem was found using litmus [http://diy.inria.fr].
Committed by: Nilay Vaish <nilay@cs.wisc.edu
This patch adds stats for tracking the number of reads/writes per bus
turn around, and also adds hysteresis to the write-to-read switching
to ensure that the queue does not oscilate around the low threshold.
This patch renames the not-so-simple SimpleDRAM to a more suitable
DRAMCtrl. The name change is intended to ensure that we do not send
the wrong message (although the "simple" in SimpleDRAM was originally
intended as in cleverly simple, or elegant).
As the DRAM controller modelling work is being presented at ISPASS'14
our hope is that a broader audience will use the model in the future.
--HG--
rename : src/mem/SimpleDRAM.py => src/mem/DRAMCtrl.py
rename : src/mem/simple_dram.cc => src/mem/dram_ctrl.cc
rename : src/mem/simple_dram.hh => src/mem/dram_ctrl.hh
Make the default memory type DDR3-1600 x64, and use the open-adaptive
page policy. This change is aiming to ensure that users by default are
using a realistic memory system.
This patch adds a basic starvation-prevention mechanism where a DRAM
page is forced to close after a certain number of accesses. The limit
is combined with the open and open-adaptive page policy and if reached
causes an auto-precharge.
This patch changes the triggering condition for the write draining
such that we grab the opportunity to issue writes if there are no
reads waiting (as opposed to waiting for the writes to reach the high
threshold). As a result, we potentially drain some of the writes in read
idle periods (if any).
A low threshold is added to be able to control how many write bursts
are kept in the memory controller queue (acting as on-chip storage).
The high and low thresholds are updated to sensible values for a 32/64
size write buffer. Note that the thresholds should be adjusted along
with the queue sizes.
This patch also adds some basic initialisation sanity checks and moves
part of the initialisation to the constructor.
This patch enables a new 'DRAM' mode to the existing traffic
generator, catered to generate specific requests to DRAM based on
required hit length (stride size) and bank utilization. It is an add on
to the Random mode.
The basic idea is to control how many successive packets target the
same page, and how many banks are being used in parallel. This gives a
two-dimensional space that stresses different aspects of the DRAM
timing.
The configuration file needed to use this patch has to be changed as
follow: (reference to Random Mode, LPDDR3 memory type)
'STATE 0 10000000000 RANDOM 50 0 134217728 64 3004 5002 0'
-> 'STATE 0 10000000000 DRAM 50 0 134217728 32 3004 5002 0 96 1024 8 6 1'
The last 4 parameters to be added are:
<stride size (bytes), page size(bytes), number of banks available in DRAM,
number of banks to be utilized, address mapping scheme>
The address mapping information is used to get the stride address
stream of the specified size and to know where to find the bank
bits. The configuration file has a parameter where '0'-> RoCoRaBaCh,
'1'-> RoRaBaCoCh/RoRaBaChCo address-mapping schemes. Note that the
generator currently assumes a single channel and a single rank. This
is to avoid overwhelming the traffic generator with information about
the memory organisation.
This patch adds the row bits to the name of the address mapping
schemes to make it more clear that all the current schemes places the
row bits as the most significant bits.
This patch moves the Ruby-related debug flags to the ruby
sub-directory, and also removes the state SConsopts that add the
no-longer-used NO_VECTOR_BOUNDS_CHECK.
Prevent incomplete configuration of TrafficGen class from causing
segmentation faults. If an 'INIT' line is not present in the
configuration file then the currState variable will remain
uninitialized which may result in a crash.
There were several sections of the m5ops code which were
essentially copy/pasted versions of the 32-bit code. The
problem is that some of these didn't account fo4 64-bit
registers leading to arguments being in the wrong registers.
This patch addresses the args for readfile64, writefile64,
and addsymbol64 -- all of which seemed to suffer from a
similar set of problems when moving to 64-bit.
Each consumer object maintains a set of tick values when the object is supposed
to wakeup and do some processing. As of now, the object accesses this set both
when scheduling a wakeup event and when the object actually wakes up. The set
is accessed during wakeup to remove the current tick value from the set. This
functionality is now being moved to the scheduling function where ticks are
removed at a later time.
This helps in configuring the network interfaces from the python script and
these objects no longer rely on the network object for the timing information.
Piobus was recently added to se scripts for ruby so that the interrupt
controller can be connected to something (required since the interrupt
controller sends address range messages). This patch removes the piobus
and instead, the pio port of ruby port will now ignore the range change
messages in se mode.
KVM used to use two signals, one for instruction count exits and one
for timer exits. There is really no need to distinguish between the
two since they only trigger exits from KVM. This changeset unifies and
renames the signals and adds a method, kick(), that can be used to
raise the control signal in the vCPU thread. It also removes the early
timer warning since we do not normally see if the signal was
delivered.
--HG--
extra : rebase_source : cd0e45ca90894c3d6f6aa115b9b06a1d8f0fda4d
gem5 seems to store the PC as RIP+CS_BASE. This is not what KVM
expects, so we need to subtract CS_BASE prior to transferring the PC
into KVM. This changeset adds the necessary PC manipulation and
refactors thread context updates slightly to avoid reading registers
multiple times from KVM.
--HG--
extra : rebase_source : 3f0569dca06a1fcd8694925f75c8918d954ada44
This changeset adds support for INIT and STARTUP IPI handling. We
currently handle both of these interrupts in gem5 and transfer the
state to KVM. Since we do not have a BIOS loaded, we pretend that the
INIT interrupt suspends the CPU after reset.
--HG--
extra : rebase_source : 7f3b25f3801d68f668b6cd91eaf50d6f48ee2a6a
The table walker code currently accounts for two types of walks,
Atomic and Timing, and treats them differently. Atomic walks keep a
single instance of WalkerState around for all walks to use in
currState. Timing mode keeps a queue of in-flight WalkerStates and
maintains currState as NULL between walks.
If a functional walk is done during Timing mode, it is treated as an
atomic walk and either creates a persistent WalkerState if in between
Timing walks, or stomps an existing currState for an in-progress
Timing walk.
This patch distinguishes functional walks as being able to exist at
any time and sets up a temporary WalkerState for its exclusive use and
then cleans up when finished, leaving any in progress Atomic or Timing
walks undisturbed.
This patch fixes an assert condition that is not true at all
times. There are valid situations that arise in dual-core
dual-workload runs where the assert condition is false. The function
call following the assert however needs to be called only when the
condition is true (a block cannot be invalidated in the tags structure
if has not been allocated in the structure, and the tempBlock is never
allocated). Hence the 'assert' has been replaced with an 'if'.
This patch changes the decode script to output the optional fields of
the proto message Packet, namely id and flags. The flags field is set
by the communication monitor.
The id field is useful for CPU trace experiments, e.g. linking the
fetch side to decode side. It had to be renamed because it clashes
with a built in python function id() for getting the "identity" of an
object.
This patch also takes a few common function definitions out from the
multiple scripts and adds them to a protolib python module.
This snippet can be used to replace if + {panics, fatals, asserts} constructs.
The idea is to have both the condition checking and a verbose printout in a single statement. The interface is as follows:
panic_if(foo != bar, "These should be equal: foo %i bar %i", foo, bar);
fatal_if(foo != bar, "These should be equal: foo %i bar %i", foo, bar);
chatty_assert(foo == bar, "These should be equal: foo %i bar %i", foo, bar);
Small fix for a warning that prevents compilation with gcc 4.8.1 due
to detecting that a variable might be uninitialised. The fix is to
assign a safe default.
The global synchronization event used to be scheduled at
simQuantum. This prevented repeated entries into gem5 from Python as
it can be scheduled in the past. This changeset ensures that the first
global synchronization happens at curTick() + simQuantum instead.
The TSL/LDT & TR/TSS segments didn't contain valid attributes. This
caused problems when transfering the state into KVM where invalid
state is a no-go. Fixup the attributes with values from AMD's
architecture programmer's manual.
When transferring segment registers into kvm, we need to find the
value of the unusable bit. We used to assume that this could be
inferred from the selector since segments are generally unusable if
their selector is 0. This assumption breaks in some weird corner
cases. Instead, we just assume that segments are always usable. This
is what qemu does so it should work.
Signal handlers in KVM are controlled per thread and should be
initialized from the thread that is going to execute the CPU. This
changeset moves the initialization call from startup() to
startupThread().
A copyRegs() function is added to MIPS utilities
to copy architectural state from the old CPU to
the new CPU during fast-forwarding. This
addition alone enables fast-forwarding for the
o3 cpu model running MIPS.
The patch also adds takeOverFrom() and
drainResume() functions to the InOrderCPU to
enable it to take over from another CPU. This
change enables fast-forwarding for the inorder
cpu model running MIPS, but not for Alpha.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Couple of users observed segmentation fault when the simulator tries to
register the statistical variable m_IncompleteTimes. It seems that there
is some problem with the initialization of these variables when allocated
in the constructor.
Currently, the interrupt controller in x86 is connected to the io bus
directly. Therefore the packets between the io devices and the interrupt
controller do not go through ruby. This patch changes ruby port so that
these packets arrive at the ruby port first, which then routes them to their
destination. Note that the patch does not make these packets go through the
ruby network. That would happen in a subsequent patch.
This patch simplfies the retry logic in the RubyPort, avoiding
redundant attributes, and enforcing more stringent checks on the
interactions with the normal ports. The patch also simplifies the
routing done by the RubyPort, using the port identifiers instead of a
heavy-weight sender state.
The patch also fixes a bug in the sending of responses from PIO
ports. Previously these responses bypassed the queue in the queued
port, and ignored the return value, potentially leading to response
packets being lost.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Code in two of the functions was exactly the same. This patch moves
this code to a new function which is called from the two functions
mentioned initially.
At several places, there are functions that take a cycle value as input
and performs some computation. Along with each such function, another
function was being defined that simply added one more cycle to input and
computed the same function. This patch removes this second copy of the
function. Places where these functions were being called have been updated
to use the original function with argument being current cycle + 1.
Two files had been incorrectly named with a .cache suffix.
--HG--
rename : src/mem/protocol/MESI_Three_Level-L0.cache => src/mem/protocol/MESI_Three_Level-L0cache.sm
rename : src/mem/protocol/MESI_Three_Level-L1.cache => src/mem/protocol/MESI_Three_Level-L1cache.sm
The introduction of parallel event queues added most of the support
needed to run multiple VMs (systems) within the same gem5
instance. This changeset fixes up signal delivery so that KVM's
control signals are delivered to the thread that executes the CPU's
event queue. Specifically:
* Timers and counters are now initialized from a separate method
(startupThread) that is scheduled as the first event in the
thread-specific event queue. This ensures that they are
initialized from the thread that is going to execute the CPUs
event queue and enables signal delivery to the right thread when
exiting from KVM.
* The POSIX-timer-based KVM timer (used to force exits from KVM) has
been updated to deliver signals to the thread that's executing KVM
instead of the process (thread is undefined in that case). This
assumes that the timer is instantiated from the thread that is
going to execute the KVM vCPU.
* Signal masking is now done using pthread_sigmask instead of
sigprocmask. The behavior of the latter is undefined in threaded
applications.
* Since signal masks can be inherited, make sure to actively unmask
the control signals when setting up the KVM signal mask.
There are currently no facilities to multiplex between multiple KVM
CPUs in the same event queue, we are therefore limited to
configurations where there is only one KVM CPU per event queue. In
practice, this means that multi-system configurations can be
simulated, but not multiple CPUs in a shared-memory configuration.
This patch fixes a bug in how physical memory used to be mapped and
unmapped. Previously we unmapped and re-mapped if restoring from a
checkpoint. However, we never checked that the new mapping was
actually the same, it was just magically working as the OS seems to
fairly reliably give us the same chunk back. This patch fixes this
issue by relying entirely on the mmap call in the constructor.
This patch enbles use of the basic PIO devices as part of the NULL
build. Although it might seem counter intuitive to have a PIO device
without being able to execute a driver, this change enables us to
break a device class hierarchy into an ISA-agnostic part, and an
ISA-specific part, without requiring multiple-inheritance. The
ISA-agnostic base class is a PIO device, but does not make use of the
port.
This patch adds a filter to the cache to drop snoop requests that are
not for a range covered by the cache. This fixes an issue observed
when multiple caches are placed in parallel, covering different
address ranges. Without this patch, all the caches will forward the
snoop upwards, when only one should do so.
This patch adds DRAMSim2 as a memory controller by wrapping the
external library and creating a sublass of AbstractMemory that bridges
between the semantics of gem5 and the DRAMSim2 interface.
The DRAMSim2 wrapper extracts the clock period from the config
file. There is no way of extracting this information from DRAMSim2
itself, so we simply read the same config file and get it from there.
To properly model the response queue, the wrapper keeps track of how
many transactions are in the actual controller, and how many are
stacking up waiting to be sent back as responses (in the wrapper). The
latter requires us to move away from the queued port and manage the
packets ourselves. This is due to DRAMSim2 not having any flow control
on the response path.
DRAMSim2 assumes that the transactions it is given are matching the
burst size of the choosen memory. The wrapper checks to ensure the
cache line size of the system matches the burst size of DRAMSim2 as
there are currently no provisions to split the system requests. In
theory we could allow a cache line size smaller than the burst size,
but that would lead to inefficient use of the DRAM, so for not we
fatal also in this case.
This changesets adds branch predictor support to the
BaseSimpleCPU. The simple CPUs normally don't need a branch predictor,
however, there are at least two cases where it can be desirable:
1) A simple CPU can be used to warm the branch predictor of an O3
CPU before switching to the slower O3 model.
2) The simple CPU can be used as a quick way of evaluating/debugging
new branch predictors since it exposes branch predictor
statistics.
Limitations:
* Since the simple CPU doesn't speculate, only one instruction will
be active in the branch predictor at a time (i.e., the branch
predictor will never see speculative branches).
* The outcome of a branch prediction does not affect the performance
of the simple CPU.
Currently fatal() ends the simulation in a normal fashion. This results in
the call stack getting lost when using a debugger and it is not always
possible to debug the simulation just from the information provided by the
printed error message. Even though the error is likely due to a user's fault,
the information available should not be thrown away. Hence, this patch to
call abort() from fatal().
Changeset 7274310be1bb (isa: clean up register constants) increased
the value of NumFloatRegs, which triggered a bug in
X86ISA::copyRegs(). This bug is caused by the x87 stack being copied
twice since register indexes past NUM_FLOATREGS are mapped into the
x87 stack relative to the top of the stack, which is undefined when
the copy takes place.
This changeset updates the copyRegs() function to use access registers
using the non-flattening interface, which guarantees that undesirable
register folding does not happen.
The getRFlags and setRFlags utility functions were not updated
correctly when condition registers were separated into their own
register class. This lead to incorrect state transfer in calls from
kvm into the simulator (e.g., m5 readfile ended up in an infinite
loop) and when switching CPUs. This patch makes these utility
functions use getCCReg and setCCReg instead of getIntReg and setIntReg
which read and write the integer registers.
Reviewed-by: Andreas Sandberg <andreas@sandberg.pp.se>
Forces the prefetcher to mispredict twice in a row before resetting the
confidence of prefetching. This helps cases where a load PC strides by a
constant factor, however it may operate on different arrays at times.
Avoids the cost of retraining. Primarily helps with small iteration loops.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
For systems with a tightly coupled L2, a stride-based prefetcher may observe
access requests from both instruction and data L1 caches. However, the PC
address of an instruction miss gives no relevant training information to the
stride based prefetcher(there is no stride to train). In theses cases, its
better if the L2 stride prefetcher simply reverted back to a simple N-block
ahead prefetcher. This patch enables this option.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
This patch extends the classic prefetcher to work on non-block aligned
addresses. Because the existing prefetchers in gem5 mask off the lower
address bits of cache accesses, many predictable strides fail to be
detected. For example, if a load were to stride by 48 bytes, with 64 byte
cachelines, the current stride based prefetcher would see an access pattern
of 0, 64, 64, 128, 192.... Thus not detecting a constant stride pattern. This
patch fixes this, by training the prefetcher on access and not masking off the
lower address bits.
It also adds the following configuration options:
1) Training/prefetching only on cache misses,
2) Training/prefetching only on data acceses,
3) Optionally tagging prefetches with a PC address.
#3 allows prefetchers to train off of prefetch requests in systems with
multiple cache levels and PC-based prefetchers present at multiple levels.
It also effectively allows a pipelining of prefetch requests (like in POWER4)
across multiple levels of cache hierarchy.
Improves performance on my gem5 configuration by 4.3% for SPECINT and 4.7% for SPECFP (geomean).
gem5 makes the incorrect assumption that by binding a socket, it
effectively has allocated a port. Linux only allocates ports once you call
listen on the given socket, not when you call bind. So even if the port was
free when bind was called, another process (gem5 instance) could race in
between the bind & listen calls and steal the port. In the current code, if
the call to bind fails due to the port being in use (EADDRINUSE), gem5 retries
for a different port. However if listen fails, gem5 just panics. The fix is
testing the return value of listen and re-trying if it was due to EADDRINUSE.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
The patch
(1) removes the redundant writeback argument from findVictim()
(2) fixes the description of access() function
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Note: AArch64 and AArch32 interworking is not supported. If you use an AArch64
kernel you are restricted to AArch64 user-mode binaries. This will be addressed
in a later patch.
Note: Virtualization is only supported in AArch32 mode. This will also be fixed
in a later patch.
Contributors:
Giacomo Gabrielli (TrustZone, LPAE, system-level AArch64, AArch64 NEON, validation)
Thomas Grocutt (AArch32 Virtualization, AArch64 FP, validation)
Mbou Eyole (AArch64 NEON, validation)
Ali Saidi (AArch64 Linux support, code integration, validation)
Edmund Grimley-Evans (AArch64 FP)
William Wang (AArch64 Linux support)
Rene De Jong (AArch64 Linux support, performance opt.)
Matt Horsnell (AArch64 MP, validation)
Matt Evans (device models, code integration, validation)
Chris Adeniyi-Jones (AArch64 syscall-emulation)
Prakash Ramrakhyani (validation)
Dam Sunwoo (validation)
Chander Sudanthi (validation)
Stephan Diestelhorst (validation)
Andreas Hansson (code integration, performance opt.)
Eric Van Hensbergen (performance opt.)
Gabe Black
The CheckerCPU model in pre-v8 code was not checking the
updates to miscellaneous registers due to some methods
for setting misc regs were not instrumented. The v8 patches
exposed this by calling the instrumented misc reg update
methods and then invoking the checker before the main CPU had
updated its misc regs, leading to false positives about
register mismatches. This patch fixes the non-instrumented
misc reg update methods and places calls to the checker in
the proper places in the O3 model.
With ARMv8 support the same misc register id results in accessing different
registers depending on the current mode of the processor. This patch adds
the same orthogonality to the misc register file as the others (int, float, cc).
For all the othre ISAs this is currently a null-implementation.
Additionally, a system variable is added to all the ISA objects.
This patch add support for generating wake-up events in the CPU when an address
that is currently in the exclusive state is hit by a snoop. This mechanism is required
for ARMv8 multi-processor support.
Previously we were casting the result type to the the memory type which
is incorrect for things like dual-memory operations which still return a
single result.
Adds very basic statistics on the number of tag and data accesses within the
cache, which is important for power modelling. For the tags, simply count
the associativity of the cache each time. For the data, this depends on
whether tags and data are accessed sequentially, which is given by a new
parameter. In the parallel case, all data blocks are accessed each time, but
with sequential accesses, a single data block is accessed only on a hit.
This patch enables tracking of cache occupancy per thread along with
ages (in buckets) per cache blocks. Cache occupancy stats are
recalculated on each stat dump.
The probe patch is motivated by the desire to move analytical and trace code
away from functional code. This is achieved by the probe interface which is
essentially a glorified observer model.
What this means to users:
* add a probe point and a "notify" call at the source of an "event"
* add an isolated module, that is being used to carry out *your* analysis (e.g. generate a trace)
* register that module as a probe listener
Note: an example is given for reference in src/cpu/o3/simple_trace.[hh|cc] and src/cpu/SimpleTrace.py
What is happening under the hood:
* every SimObject maintains has a ProbeManager.
* during initialization (src/python/m5/simulate.py) first regProbePoints and
the regProbeListeners is called on each SimObject. this hooks up the probe
point notify calls with the listeners.
FAQs:
Why did you develop probe points:
* to remove trace, stats gathering, analytical code out of the functional code.
* the belief that probes could be generically useful.
What is a probe point:
* a probe point is used to notify upon a given event (e.g. cpu commits an instruction)
What is a probe listener:
* a class that handles whatever the user wishes to do when they are notified
about an event.
What can be passed on notify:
* probe points are templates, and so the user can generate probes that pass any
type of argument (by const reference) to a listener.
What relationships can be generated (1:1, 1:N, N:M etc):
* there isn't a restriction. You can hook probe points and listeners up in a
1:1, 1:N, N:M relationship. They become useful when a number of modules
listen to the same probe points. The idea being that you can add a small
number of probes into the source code and develop a larger number of useful
analysis modules that use information passed by the probes.
Can you give examples:
* adding a probe point to the cpu's commit method allows you to build a trace
module (outputting assembler), you could re-use this to gather instruction
distribution (arithmetic, load/store, conditional, control flow) stats.
Why is the probe interface currently restricted to passing a const reference:
* the desire, initially at least, is to allow an interface to observe
functionality, but not to change functionality.
* of course this can be subverted by const-casting.
What is the performance impact of adding probes:
* when nothing is actively listening to the probes they should have a
relatively minor impact. Profiling has suggested even with a large number of
probes (60) the impact of them (when not active) is very minimal (<1%).
This patch adds observability to the clock period of the clock domains
by including it as a stat.
As a result of adding this, the regressions will be updated in a
separate patch.
Add some values and methods to the request object to track the translation
and access latency for a request and which level of the cache hierarchy responded
to the request.
This patch makes the Clock a TickParamValue just like
Latency/Frequency. There is no longer any need to distinguish it
(originally needed to support multiplication).
This patch fixes a memory leak in the table walker, by ensuring that
the sender state is deleted again if the request packet cannot be
successfully sent.
This patch relaxes the check performed when squashing non-speculative
instructions, as it caused problems with loads that were marked ready,
and then stalled on a blocked cache. The assertion is now allowing
memory references to be non-faulting.
This patch removes an assertion in the simpoint profiling code that
asserts that a previously-seen basic block has the exact same number
of instructions executed as before. This can be false if the basic
block generates aborts or takes interrupts at different locations
within the basic block. The basic block profiling are not affected
significantly as these events are rare in general.
This patch adds a function to the HistStor class for adding two histograms.
This functionality is required for Ruby. It also adds support for printing
histograms in a single line.
The first two levels (L0, L1) are private to the core, the third level (L2)is
possibly shared. The protocol supports clustered designs. For example, one
can have two sets of two cores. Each core has an L0 and L1 cache. There are
two L2 controllers where each set accesses only one of the L2 controllers.
A cluster over here means a set of controllers that can be accessed only by a
certain set of cores. For example, consider a two level hierarchy. Assume
there are 4 L1 controllers (private) and 2 L2 controllers. We can have two
different hierarchies here:
a. the address space is partitioned between the two L2 controllers. Each L1
controller accesses both the L2 controllers. In this case, each L1 controller
is a cluster initself.
b. both the L2 controllers can cache any address. An L1 controller has access
to only one of the L2 controllers. In this case, each L2 controller
along with the L1 controllers that access it, form a cluster.
This patch allows for each controller to have a cluster ID, which is 0 by
default. By setting the cluster ID properly, one can instantiate hierarchies
with clusters. Note that the coherence protocol might have to be changed as
well.
If you successfully export a C++ SimObject method, but try to
invoke it from Python before the C++ object is created, you
get a confusing error that says the attribute does not exist,
making you question whether you successfully exported the
method at all. In reality, your only problem is that you're
calling the method too soon. This patch enhances the error
message to give you a better clue.
Updating the SimObject topology of a cloned hierarchy is a little
dangerous, in that cloning is a "deep copy" and the clone does not
inherit SimObject updates the same way it would inherit scalar
variable assignments.
However, because of various SimObject-valued proxy parameters,
like 'memories', 'clk_domain', and 'system', it turns out that
there are a number of implicit topology changes that happen at
instantiation, which means that these changes are impossible to
avoid. So in order to make cloning systems useful, this error
has to go. Changing it to a warning produces a lot of noise,
so it seems best just to delete it.
This patch provides support for DFS by having ClockedObjects register
themselves with their clock domain at construction time in a member list.
Using this list, a clock domain can update each member's tick to the
curTick() before modifying the clock period.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
In mips architecture, floating point convert instructions use the
FloatConvertOp format defined in src/arch/mips/isa/formats/fp.isa. The type
of the operands in the ISA description file (_sw for signed word, or _sf for
signed float, etc.) is used to create a type for the operand in C++. Then the
operand is converted using the fpConvert() function in src/arch/mips/utility.cc.
If we are converting from a word to a float, and we want to convert 0xffffffff,
we expect -1 to be passed into fpConvert(). Instead, we see MAX_INT passed in.
Then fpConvert() converts _val_ to MAX_INT in single-precision floating point,
and we get the wrong value.
To fix it, the signs of the convert operands are being changed from unsigned to
signed in the MIPS ISA description.
Then, the FloatConvertOp format is being changed to insert a int32_t into the
C++ code instead of a uint32_t.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
This patch fixes couple of bugs in the L2 controller of the mesi cmp
directory protocol.
1. The state MT_I was transitioning to NP on receiving a clean writeback
from the L1 controller. This patch makes it inform the directory controller
about the writeback.
2. The L2 controller was sending the dirty bit to the L1 controller and the
L2 controller used writeback from the L1 controller to update the dirty bit
unconditionally. Now, the L1 controller always assumes that the incoming
data is clean. The L2 controller updates the dirty bit only when the L1
controller writes to the block.
3. Certain unused functions and events are being removed.
This patch replaces max_in_port_rank with the number of inports. The use of
max_in_port_rank was causing spurious re-builds and incorrect initialization
of variables in ruby related regression tests. This was due to the variable
value being used across threads while compiling when it was not meant to be.
Since the number of inports is state machine specific value, this problem
should get solved.
The directory controller should not have the sharer field since there is
only one level 2 cache. Anyway the field was not in use. The owner field
was being used to track the l2 cache version (in case of distributed l2) that
has the cache block under consideration. The information is not required
since the version of the level 2 cache can be obtained from a subset of the
address bits.
Currently statistics are reset after the initial / checkpoint state
has been loaded. But ruby does some checkpoint processing in its
startup() function. So the stats need to be reset after the startup()
function has been called. This patch moves the class to stats.reset()
to achieve this change in functionality.
There is a race between enabling asynchronous IO for a file descriptor
and IO events happening on that descriptor. A SIGIO won't normally be
delivered if an event is pending when asynchronous IO is
enabled. Instead, the signal will be raised the next time there is an
event on the FD. This changeset simulates a SIGIO by setting the
async_io flag when setting up asynchronous IO for an FD. This causes
the main event loop to poll all file descriptors to check for pending
IO. As a consequence of this, the old SIGALRM hack should no longer be
needed and is therefore removed.
The PollEvent class dynamically installs a SIGIO and SIGALRM handler
when a file handler is registered. Most signal handlers currently get
registered in the initSignals() function. This changeset moves the
SIGIO/SIGALRM handlers to initSignals() to live with the other signal
handlers. The original code installs SIGIO and SIGALRM with the
SA_RESTART option to prevent syscalls from returning EINTR. This
changeset consistently uses this flag for all signal handlers to
ensure that other signals that trigger asynchronous behavior (e.g.,
statistics dumping) do not cause undesirable EINTR returns.
The performance counting framework in Linux 3.2 and onwards supports
an attribute to exclude events generated by the host when running
KVM. Setting this attribute allows us to get more reliable
measurements of the guest machine. For example, on a highly loaded
system, the instruction counts from the guest can be severely
distorted by the host kernel (e.g., by page fault handlers).
This changeset introduces a check for the attribute and enables it in
the KVM CPU if present.