Make these names more meaningful.
Specifically, made these substitutions:
s/FP_Base_DepTag/FP_Reg_Base/g;
s/Ctrl_Base_DepTag/Misc_Reg_Base/g;
s/Max_DepTag/Max_Reg_Index/g;
Move from a poorly documented scheme where the mapping
of unified architectural register indices to register
classes is hardcoded all over to one where there's an
enum for the register classes and a function that
encapsulates the mapping.
In order to support m5ops on virtualized CPUs, we need to either
intercept hypercall instructions or provide a memory mapped m5ops
interface. Since KVM does not normally pass the results of hypercalls
to userspace, which makes that method unfeasible. This changeset
introduces support for m5ops using memory mapped mmapped IPRs. This is
implemented by adding a class of "generic" IPRs which are handled by
architecture-independent code. Such IPRs always have bit 63 set and
are handled by handleGenericIprRead() and
handleGenericIprWrite(). Platform specific impementations of
handleIprRead and handleIprWrite should use
GenericISA::isGenericIprAccess to determine if an IPR address should
be handled by the generic code instead of the architecture-specific
code. Platforms that don't need their own IPR support can reuse
GenericISA::handleIprRead() and GenericISA::handleIprWrite().
This patch removes the notion of a peer block size and instead sets
the cache line size on the system level.
Previously the size was set per cache, and communicated through the
interconnect. There were plenty checks to ensure that everyone had the
same size specified, and these checks are now removed. Another benefit
that is not yet harnessed is that the cache line size is now known at
construction time, rather than after the port binding. Hence, the
block size can be locally stored and does not have to be queried every
time it is used.
A follow-on patch updates the configuration scripts accordingly.
in the TLB
Some architectures (currently only x86) require some fixing-up of
physical addresses after a normal address translation. This is usually
to remap devices such as the APIC, but could be used for other memory
mapped devices as well. When running the CPU in a using hardware
virtualization, we still need to do these address fix-ups before
inserting the request into the memory system. This patch moves this
patch allows that code to be used by such CPUs without doing full
address translations.
Add the method checkRaw to ArmISA::Interrupts. This method can be used
to query the raw state (ignoring CPSR masks) of an interrupt. It is
primarily intended for hardware virtualized CPUs.
Add the options 'panic_on_panic' and 'panic_on_oops' to the
LinuxArmSystem SimObject. When these option are enabled, the simulator
panics when the guest kernel panics or oopses. Enable panic on panic
and panic on oops in ARM-based test cases.
This changeset adds support for forwarding arguments to the PC
event constructors to following methods:
addKernelFuncEvent
addFuncEvent
Additionally, this changeset adds the following helper method to the
System base class:
addFuncEventOrPanic - Hook a PCEvent to a symbol, panic on failure.
addKernelFuncEventOrPanic - Hook a PCEvent to a kernel symbol, panic
on failure.
System implementations have been updated to use the new functionality
where appropriate.
This patch adds a missing flag to the ldr_ret_uop microop instruction.
The flag is added when the instruction is used, not directly in the
constructor of the instruction.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>"
This patch fixes the warnings that clang3.2svn emit due to the "-Wall"
flag. There is one case of an uninitialised value in the ARM neon ISA
description, and then a whole range of unused private fields that are
pruned.
A derived function with a different signature than a base class
function will result in the base class function of the same name being
hidden. The parameter list and return type for the member function in
the derived class must match those of the member function in the base
class, otherwise the function in the derived class will hide the
function in the base class and no polymorphic behaviour will occur.
This patch addresses these warnings by ensuring a unique function name
to avoid (unintentionally) hiding any functions.
This patch address the most important name shadowing warnings (as
produced when using gcc/clang with -Wshadow). There are many
locations where constructor parameters and function parameters shadow
local variables, but these are left unchanged.
If multiple memory operations to the same page are miss the TLB they are
all inserted into the page table queue and before this change could result
in multiple uncessesary walks as well as duplicate enteries being inserted
into the TLB.
Virtualized CPUs and the fastmem mode of the atomic CPU require direct
access to physical memory. We currently require caches to be disabled
when using them to prevent chaos. This is not ideal when switching
between hardware virutalized CPUs and other CPU models as it would
require a configuration change on each switch. This changeset
introduces a new version of the atomic memory mode,
'atomic_noncaching', where memory accesses are inserted into the
memory system as atomic accesses, but bypass caches.
To make memory mode tests cleaner, the following methods are added to
the System class:
* isAtomicMode() -- True if the memory mode is 'atomic' or 'direct'.
* isTimingMode() -- True if the memory mode is 'timing'.
* bypassCaches() -- True if caches should be bypassed.
The old getMemoryMode() and setMemoryMode() methods should never be
used from the C++ world anymore.
The explict tests in the follwing fp comparison operations were
incorrect as they checked for only signaling NaNs and not quite-NaNs
as well. When compiled with gcc, the comparison generates a fp exception
that causes the FE_INVALID flag to be set and we check for it, so even
though the check was incorrect, the correct exception was set. With clang
this behavior seems to not occur. The checks are updated to test for nans and
the behavior is now correct with both clang and gcc.
Clang generated executables would enter the if condition when it wasn't
supposted to, resulting in the wrong simulated behavior.
Implementing the operation this way is a bit faster anyway.
The changes made by the changeset 270c9a75e91f do not work well with switching
of cpus. The problem is that decoder for the old thread context holds state
that is not taken over by the new decoder.
This patch adds a takeOverFrom() function to Decoder class in each ISA. Except
for x86, functions in other ISAs are blank. For x86, the function copies state
from the old decoder to the new decoder.
The changes made by the changeset 9376 were not quite correct. The patch made
changes to the code which resulted in decoder not getting initialized correctly
when the state was restored from a checkpoint.
This patch adds a startup function to each ISA object. For x86, this function
sets the required state in the decoder. For other ISAs, the function is empty
right now.
Currently, we invalidate the cached miscregs in
TLB::unserialize(). The intended use of the drainResume() method is to
invalidate cached state and prepare the system to resume after a CPU
handover or (un)serialization. This patch moves the TLB miscregs
invalidation code to the drainResume() method to avoid surprising
behavior.
Since the page table walker only checks if a drain has completed in
doL1DescriptorWrapper() and doL2DescriptorWrapper(), it sometimes
looses track of a drain request if there is a squash. This changeset
adds a completeDrain() call after squashing requests in the pending
queue, which fixes this issue.
In order to see all registers independent of the current CPU mode, the
ARM architecture model uses the magic MISCREG_CPSR_MODE register to
change the register mappings without actually updating the CPU
mode. This hack is no longer needed since the thread context now
provides a flat interface to the register file. This patch replaces
the CPSR_MODE hack with the flat register interface.
After making the ISA an independent SimObject, it is serialized
automatically by the Python world. Previously, this just resulted in
an empty ISA section. This patch moves the contents of the ISA to that
section and removes the explicit ISA serialization from the thread
contexts, which makes it behave like a normal SimObject during
serialization.
Note: This patch breaks checkpoint backwards compatibility! Use the
cpt_upgrader.py utility to upgrade old checkpoints to the new format.
This patch makes the start and end address private in a move to
prevent direct manipulation and matching of ranges based on these
fields. This is done so that a transition to ranges with interleaving
support is possible.
As a result of hiding the start and end, a number of member functions
are needed to perform the comparisons and manipulations that
previously took place directly on the members. An accessor function is
provided for the start address, and a function is added to test if an
address is within a range. As a result of the latter the != and ==
operator is also removed in favour of the member function. A member
function that returns a string representation is also created to allow
debug printing.
In general, this patch does not add any functionality, but it does
take us closer to a situation where interleaving (and more cleverness)
can be added under the bonnet without exposing it to the user. More on
that in a later patch.
This patch makes the values of ID_ISARx, MIDR, and FPSID configurable
as ISA parameter values. Additionally, setMiscReg now ignores writes
to all of the ID registers.
Note: This moves the MIDR parameter from ArmSystem to ArmISA for
consistency.
The ISA class on stores the contents of ID registers on many
architectures. In order to make reset values of such registers
configurable, we make the class inherit from SimObject, which allows
us to use the normal generated parameter headers.
This patch introduces a Python helper method, BaseCPU.createThreads(),
which creates a set of ISAs for each of the threads in an SMT
system. Although it is currently only needed when creating
multi-threaded CPUs, it should always be called before instantiating
the system as this is an obvious place to configure ID registers
identifying a thread/CPU.
This patch unlocks the cpu-local monitor when the CPU sees a snoop to a locked
address. Previously we relied on the cache to handle the locking for us, however
some users on the gem5 mailing list reported a case where the cpu speculatively
executes a ll operation after a pending sc operation in the pipeline and that
makes the cache monitor valid. This should handle that case by invaliding the
local monitor.
This interface is no longer used, and getting rid of it simplifies the
decoders and code that sets up the decoders. The thread context had been used
to read architectural state which was used to contextualize the instruction
memory as it came in. That was changed so that the state is now sent to the
decoders to keep locally if/when it changes. That's significantly more
efficient.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
Avoid reading them every instruction, and also eliminate the last use of the
thread context in the decoders.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
uopSet_uop is microop instruction that has the IsControl flags set, but the
IsCondControl or IsUncondControl flags seems not to be set, neither in
the construction nor where the microop is used. This patch adds the the
flags in the constructor of the instruction (MicroUopSetPCCPSR).
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
A flag was missing for the movret_uop microop instruction. This patch adds
that flag when the instruction is used, not directly in the constructor of
the instruction.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
This patch moves the draining interface from SimObject to a separate
class that can be used by any object needing draining. However,
objects not visible to the Python code (i.e., objects not deriving
from SimObject) still depend on their parents informing them when to
drain. This patch also gets rid of the CountedDrainEvent (which isn't
really an event) and replaces it with a DrainManager.
When casting objects in the generated SWIG interfaces, SWIG uses
classical C-style casts ( (Foo *)bar; ). In some cases, this can
degenerate into the equivalent of a reinterpret_cast (mainly if only a
forward declaration of the type is available). This usually works for
most compilers, but it is known to break if multiple inheritance is
used anywhere in the object hierarchy.
This patch introduces the cxx_header attribute to Python SimObject
definitions, which should be used to specify a header to include in
the SWIG interface. The header should include the declaration of the
wrapped object. We currently don't enforce header the use of the
header attribute, but a warning will be generated for objects that do
not use it.
This patch enables dumping statistics and Linux process information on
context switch boundaries (__switch_to() calls) that are used for
Streamline integration (a graphical statistics viewer from ARM).
This patch takes the Linux thread info support scattered across
different ISA implementations (currently in ARM, ALPHA, and MIPS), and
unifies them into a single file.
Adds a few more helper functions to read out TGID, mm, etc.
ISA-specific information (e.g., ALPHA PCBB register) is now moved to
the corresponding isa_traits.hh files.
This patch simplifies the scheduling of the next walk for the ARM
table walker. Previously it used the CPU clock, but as the table
walker inherits the clock from the CPU, it is cleaner to simply use
its own clock (which is the same).
This patch adds an additional level of ports in the inheritance
hierarchy, separating out the protocol-specific and protocl-agnostic
parts. All the functionality related to the binding of ports is now
confined to use BaseMaster/BaseSlavePorts, and all the
protocol-specific parts stay in the Master/SlavePort. In the future it
will be possible to add other protocol-specific implementations.
The functions used in the binding of ports, i.e. getMaster/SlavePort
now use the base classes, and the index parameter is updated to use
the PortID typedef with the symbolic InvalidPortID as the default.
This patch addresses a number of smaller issues identified by the code
inspection utility cppcheck. There are a number of identified leaks in
the arm/linux/system.cc (although the function only get's called once
so it is not a major problem), a few deletes in dev/x86/i8042.cc that
were not array deletes, and sprintfs where the character array had one
element less than needed. In the IIC tags there was a function
allocating an array of longs which is in fact never used.
Newer Linux kernels require DTB (device tree blobs) to specify platform
configurations. The input DTB filename can be specified through gem5 parameters
in LinuxArmSystem.
Instead of statically defining miscRegName to contain NUM_MISCREGS
elements, let the compiler determine the length of the array. This
allows us to use a static_assert to test that all registers are listed
in the name vector.
This patch addresses the comments and feedback on the preceding patch
that reworks the clocks and now more clearly shows where cycles
(relative cycle counts) are used to express time.
Instead of bumping the existing patch I chose to make this a separate
patch, merely to try and focus the discussion around a smaller set of
changes. The two patches will be pushed together though.
This changes done as part of this patch are mostly following directly
from the introduction of the wrapper class, and change enough code to
make things compile and run again. There are definitely more places
where int/uint/Tick is still used to represent cycles, and it will
take some time to chase them all down. Similarly, a lot of parameters
should be changed from Param.Tick and Param.Unsigned to
Param.Cycles.
In addition, the use of curTick is questionable as there should not be
an absolute cycle. Potential solutions can be built on top of this
patch. There is a similar situation in the o3 CPU where
lastRunningCycle is currently counting in Cycles, and is still an
absolute time. More discussion to be had in other words.
An additional change that would be appropriate in the future is to
perform a similar wrapping of Tick and probably also introduce a
Ticks class along with suitable operators for all these classes.
This patch introduces the notion of a clock update function that aims
to avoid costly divisions when turning the current tick into a
cycle. Each clocked object advances a private (hidden) cycle member
and a tick member and uses these to implement functions for getting
the tick of the next cycle, or the tick of a cycle some time in the
future.
In the different modules using the clocks, changes are made to avoid
counting in ticks only to later translate to cycles. There are a few
oddities in how the O3 and inorder CPU count idle cycles, as seen by a
few locations where a cycle is subtracted in the calculation. This is
done such that the regression does not change any stats, but should be
revisited in a future patch.
Another, much needed, change that is not done as part of this patch is
to introduce a new typedef uint64_t Cycle to be able to at least hint
at the unit of the variables counting Ticks vs Cycles. This will be
done as a follow-up patch.
As an additional follow up, the thread context still uses ticks for
the book keeping of last activate and last suspend and this should
probably also be changed into cycles as well.
This patch removes the NACK frrom the packet as there is no longer any
module in the system that issues them (the bridge was the only one and
the previous patch removes that).
The handling of NACKs was mostly avoided throughout the code base, by
using e.g. panic or assert false, but in a few locations the NACKs
were actually dealt with (although NACKs never occured in any of the
regressions). Most notably, the DMA port will now never receive a NACK
and the backoff time is thus never changed. As a consequence, the
entire backoff mechanism (similar to a PCI bus) is now removed and the
DMA port entirely relies on the bus performing the arbitration and
issuing a retry when appropriate. This is more in line with e.g. PCIe.
Surprisingly, this patch has no impact on any of the regressions. As
mentioned in the patch that removes the NACK from the bridge, a
follow-up patch should change the request and response buffer size for
at least one regression to also verify that the system behaves as
expected when the bridge fills up.
This patch fixes some problems with the drain/switchout functionality
for the O3 cpu and for the ARM ISA and adds some useful debug print
statements.
This is an incremental fix as there are still a few bugs/mem leaks with the
switchout code. Particularly when switching from an O3CPU to a
TimingSimpleCPU. However, when switching from O3 to O3 cores with the ARM ISA
I haven't encountered any more assertion failures; now the kernel will
typically panic inside of simulation.
Enable different whitelists for different OS/arch combinations,
since some use the generic Linux definitions only, and others
use definitions inherited from earlier Unix flavors on those
architectures.
Also update x86 function pointers so ioctl is no longer
unimplemented on that platform.
This patch is a revised version of Vince Weaver's earlier patch.
According to the A15 TRM the value of this register is as follows (assuming 16 word = 64 byte lines)
[31:29] Format - b100 specifies v7
[28] RAZ - b0
[27:24] CWG log2(max writeback size #words) - 0x4 16 words
[23:20] ERG log2(max reservation size #words) - 0x4 16 words
[19:16] DminLine log2(smallest dcache line #words) - 0x4 16 words
[15:14] L1Ip L1 index/tagging policy - b11 specifies PIPT
[13:4] RAZ - b0000000000
[3:0] IminLine log2(smallest icache line #words) - 0x4 16 words
npc in PCState for ARM was being calculated before the current flags were
updated with the next flags. This causes an issue as the npc is incremented by
two or four depending on the current flags (thumb or not) and was leading to
branches that were predicted correctly being identified as mispredicted.
initCPU() will be called to initialize switched out CPUs for the simple and
inorder CPU models. this patch prevents those CPUs from being initialized
because they should get their state from the active CPU when it is switched
out.
This change allows designating a system as MP capable or not as some
bootloaders/kernels care that it's set right. You can have a single
processor MP capable system, but you can't have a multi-processor
UP only system. This change also fixes the initialization of the MIDR
register.
This will allow it to be specialized by the ISAs. The existing caching scheme
is provided by the BasicDecodeCache in the GenericISA namespace and is built
from the generalized components.
--HG--
rename : src/cpu/decode_cache.cc => src/arch/generic/decode_cache.cc
These classes are always used together, and merging them will give the ISAs
more flexibility in how they cache things and manage the process.
--HG--
rename : src/arch/x86/predecoder_tables.cc => src/arch/x86/decoder_tables.cc
This patch moves the DMA device to its own set of files, splitting it
from the IO device. There are no behavioural changes associated with
this patch.
The patch also grabs the opportunity to do some very minor tidying up,
including some white space removal and pruning some redundant
parameters.
Besides the immediate benefits of the separation-of-concerns, this
patch also makes upcoming changes more streamlined as it split the
devices that are only slaves and the DMA device that also acts as a
master.
--HG--
rename : src/dev/io_device.cc => src/dev/dma_device.cc
rename : src/dev/io_device.hh => src/dev/dma_device.hh
This patch makes the (device) DmaPort non-snooping and removes the
recvSnoop constructor parameter and instead introduces a
SnoopingDmaPort subclass for the ARM table walker.
Functionality is unchanged, as are the stats, and the patch merely
clarifies that the normal DMA ports are not snooping (although they
may issue requests that are snooped by others, as done with PCI, PCIe,
AMBA4 ACE etc).
Currently this port is declared in the ARM table walker as it is not
used anywhere else. If other ports were to have similar behaviour it
could be moved in a future patch.
Symbol tables masked with the loadAddrMask create redundant entries
that could conflict with kernel function events that rely on the
original addresses. This patch guards the creation of those masked
symbol tables by default, with an option to enable them when needed
(for early-stage kernel debugging, etc.)
This patch simplifies the packet by removing the broadcast flag and
instead more firmly relying on (and enforcing) the semantics of
transactions in the classic memory system, i.e. request packets are
routed from a master to a slave based on the address, and when they
are created they have neither a valid source, nor destination. On
their way to the slave, the request packet is updated with a source
field for all modules that multiplex packets from multiple master
(e.g. a bus). When a request packet is turned into a response packet
(at the final slave), it moves the potentially populated source field
to the destination field, and the response packet is routed through
any multiplexing components back to the master based on the
destination field.
Modules that connect multiplexing components, such as caches and
bridges store any existing source and destination field in the sender
state as a stack (just as before).
The packet constructor is simplified in that there is no longer a need
to pass the Packet::Broadcast as the destination (this was always the
case for the classic memory system). In the case of Ruby, rather than
using the parameter to the constructor we now rely on setDest, as
there is already another three-argument constructor in the packet
class.
In many places where the packet information was printed as part of
DPRINTFs, request packets would be printed with a numeric "dest" that
would always be -1 (Broadcast) and that field is now removed from the
printing.
This patch addresses a number of minor issues that cause problems when
compiling with clang >= 3.0 and gcc >= 4.6. Most importantly, it
avoids using the deprecated ext/hash_map and instead uses
unordered_map (and similarly so for the hash_set). To make use of the
new STL containers, g++ and clang has to be invoked with "-std=c++0x",
and this is now added for all gcc versions >= 4.6, and for clang >=
3.0. For gcc >= 4.3 and <= 4.5 and clang <= 3.0 we use the tr1
unordered_map to avoid the deprecation warning.
The addition of c++0x in turn causes a few problems, as the
compiler is more stringent and adds a number of new warnings. Below,
the most important issues are enumerated:
1) the use of namespaces is more strict, e.g. for isnan, and all
headers opening the entire namespace std are now fixed.
2) another other issue caused by the more stringent compiler is the
narrowing of the embedded python, which used to be a char array,
and is now unsigned char since there were values larger than 128.
3) a particularly odd issue that arose with the new c++0x behaviour is
found in range.hh, where the operator< causes gcc to complain about
the template type parsing (the "<" is interpreted as the beginning
of a template argument), and the problem seems to be related to the
begin/end members introduced for the range-type iteration, which is
a new feature in c++11.
As a minor update, this patch also fixes the build flags for the clang
debug target that used to be shared with gcc and incorrectly use
"-ggdb".
This patch removes the assumption on having on single instance of
PhysicalMemory, and enables a distributed memory where the individual
memories in the system are each responsible for a single contiguous
address range.
All memories inherit from an AbstractMemory that encompasses the basic
behaviuor of a random access memory, and provides untimed access
methods. What was previously called PhysicalMemory is now
SimpleMemory, and a subclass of AbstractMemory. All future types of
memory controllers should inherit from AbstractMemory.
To enable e.g. the atomic CPU and RubyPort to access the now
distributed memory, the system has a wrapper class, called
PhysicalMemory that is aware of all the memories in the system and
their associated address ranges. This class thus acts as an
infinitely-fast bus and performs address decoding for these "shortcut"
accesses. Each memory can specify that it should not be part of the
global address map (used e.g. by the functional memories by some
testers). Moreover, each memory can be configured to be reported to
the OS configuration table, useful for populating ATAG structures, and
any potential ACPI tables.
Checkpointing support currently assumes that all memories have the
same size and organisation when creating and resuming from the
checkpoint. A future patch will enable a more flexible
re-organisation.
--HG--
rename : src/mem/PhysicalMemory.py => src/mem/AbstractMemory.py
rename : src/mem/PhysicalMemory.py => src/mem/SimpleMemory.py
rename : src/mem/physical.cc => src/mem/abstract_mem.cc
rename : src/mem/physical.hh => src/mem/abstract_mem.hh
rename : src/mem/physical.cc => src/mem/simple_mem.cc
rename : src/mem/physical.hh => src/mem/simple_mem.hh
This patch introduces the notion of a master and slave port in the C++
code, thus bringing the previous classification from the Python
classes into the corresponding simulation objects and memory objects.
The patch enables us to classify behaviours into the two bins and add
assumptions and enfore compliance, also simplifying the two
interfaces. As a starting point, isSnooping is confined to a master
port, and getAddrRanges to slave ports. More of these specilisations
are to come in later patches.
The getPort function is not getMasterPort and getSlavePort, and
returns a port reference rather than a pointer as NULL would never be
a valid return value. The default implementation of these two
functions is placed in MemObject, and calls fatal.
The one drawback with this specific patch is that it requires some
code duplication, e.g. QueuedPort becomes QueuedMasterPort and
QueuedSlavePort, and BusPort becomes BusMasterPort and BusSlavePort
(avoiding multiple inheritance). With the later introduction of the
port interfaces, moving the functionality outside the port itself, a
lot of the duplicated code will disappear again.
This patch cleans up a number of minor issues aiming to get closer to
compliance with the C++0x standard as interpreted by gcc and clang
(compile with std=c++0x and -pedantic-errors). In particular, the
patch cleans up enums where the last item was succeded by a comma,
namespaces closed by a curcly brace followed by a semi-colon, and the
use of the GNU-extension typeof (replaced by templated functions). It
does not address variable-length arrays, zero-size arrays, anonymous
structs, range expressions in switch statements, and the use of long
long. The generated CPU code also has a large number of issues that
remain to be fixed, mainly related to overflows in implicit constant
conversion (due to shifts).
Enables the CheckerCPU to be selected at runtime with the --checker option
from the configs/example/fs.py and configs/example/se.py configuration
files. Also merges with the SE/FS changes.
The change to port proxies recently moved code out of the constructor into
initState(). This is needed for code that loads data into memory, however
for code that setups symbol tables, kernel based events, etc this is the wrong
thing to do as that code is only called when a checkpoint isn't being restored
from.
New kernels attempt to read CP14 what debug architecture is available.
These changes add the debug registers and return that none is currently
available.
With the recent series of patches, the symbol table loading moved from
"construct" time to "init" time, but the kernel function event
callback registration was left behind. This patch moves it to the
proper location.
Add extra declarations to allow the compiler to pick up the right function.
Please note that these declarations have been added as part of the
clang-related changes.
This patch is adding a clearer design intent to all objects that would
not be complete without a port proxy by making the proxies members
rathen than dynamically allocated. In essence, if NULL would not be a
valid value for the proxy, then we avoid using a pointer to make this
clear.
The same approach is used for the methods using these proxies, such as
loadSections, that now use references rather than pointers to better
reflect the fact that NULL would not be an acceptable value (in fact
the code would break and that is how this patch started out).
Overall the concept of "using a reference to express unconditional
composition where a NULL pointer is never valid" could be done on a
much broader scale throughout the code base, but for now it is only
done in the locations affected by the proxies.
This patch moves all port creation from the getPort method to be
consistently done in the MemObject's constructor. This is possible
thanks to the Swig interface passing the length of the vector ports.
Previously there was a mix of: 1) creating the ports as members (at
object construction time) and using getPort for the name resolution,
or 2) dynamically creating the ports in the getPort call. This is now
uniform. Furthermore, objects that would not be complete without a
port have these ports as members rather than having pointers to
dynamically allocated ports.
This patch also enables an elaboration-time enumeration of all the
ports in the system which can be used to determine the masterId.
This patch classifies all ports in Python as either Master or Slave
and enforces a binding of master to slave. Conceptually, a master (such
as a CPU or DMA port) issues requests, and receives responses, and
conversely, a slave (such as a memory or a PIO device) receives
requests and sends back responses. Currently there is no
differentiation between coherent and non-coherent masters and slaves.
The classification as master/slave also involves splitting the dual
role port of the bus into a master and slave port and updating all the
system assembly scripts to use the appropriate port. Similarly, the
interrupt devices have to have their int_port split into a master and
slave port. The intdev and its children have minimal changes to
facilitate the extra port.
Note that this patch does not enforce any port typing in the C++
world, it merely ensures that the Python objects have a notion of the
port roles and are connected in an appropriate manner. This check is
carried when two ports are connected, e.g. bus.master =
memory.port. The following patches will make use of the
classifications and specialise the C++ ports into masters and slaves.
This change adds a master id to each request object which can be
used identify every device in the system that is capable of issuing a request.
This is part of the way to removing the numCpus+1 stats in the cache and
replacing them with the master ids. This is one of a series of changes
that make way for the stats output to be changed to python.
This patch adds the necessary flags to the SConstruct and SConscript
files for compiling using clang 2.9 and later (on Ubuntu et al and OSX
XCode 4.2), and also cleans up a bunch of compiler warnings found by
clang. Most of the warnings are related to hidden virtual functions,
comparisons with unsigneds >= 0, and if-statements with empty
bodies. A number of mismatches between struct and class are also
fixed. clang 2.8 is not working as it has problems with class names
that occur in multiple namespaces (e.g. Statistics in
kernel_stats.hh).
clang has a bug (http://llvm.org/bugs/show_bug.cgi?id=7247) which
causes confusion between the container std::set and the function
Packet::set, and this is currently addressed by not including the
entire namespace std, but rather selecting e.g. "using std::vector" in
the appropriate places.
Usage: m5 writefile <filename>
File will be created in the gem5 output folder with the identical filename.
Implementation is largely based on the existing "readfile" functionality.
Currently does not support exporting of folders.
Brings the CheckerCPU back to life to allow FS and SE checking of the
O3CPU. These changes have only been tested with the ARM ISA. Other
ISAs potentially require modification.
This patch cleans up forward declarations and a member-function
prototype that still referred to the old FunctionalPort, VirtualPort
and TranslatingPort. There is no change in functionality.
Port proxies are used to replace non-structural ports, and thus enable
all ports in the system to correspond to a structural entity. This has
the advantage of accessing memory through the normal memory subsystem
and thus allowing any constellation of distributed memories, address
maps, etc. Most accesses are done through the "system port" that is
used for loading binaries, debugging etc. For the entities that belong
to the CPU, e.g. threads and thread contexts, they wrap the CPU data
port in a port proxy.
The following replacements are made:
FunctionalPort > PortProxy
TranslatingPort > SETranslatingPortProxy
VirtualPort > FSTranslatingPortProxy
--HG--
rename : src/mem/vport.cc => src/mem/fs_translating_port_proxy.cc
rename : src/mem/vport.hh => src/mem/fs_translating_port_proxy.hh
rename : src/mem/translating_port.cc => src/mem/se_translating_port_proxy.cc
rename : src/mem/translating_port.hh => src/mem/se_translating_port_proxy.hh
Adds the flag 'recvSnoops' which enables pagewalkers using DmaPorts,
to properly configure snoops.
--HG--
extra : rebase_source : 64207bef62c3268ddff2236ee4adae873812325f
Squashes the subsequent instructions in O3 pipe after the service call, so that
they see the effect of the system call when re-executed. This isn't really an issue
with FS mode, but can show up in SE mode.
--HG--
extra : rebase_source : 613a69fe1d9834261e25a8cd340aa6b47578e1fe
PageTable supported an allocate() call that called back
through the Process to allocate memory, but did not have
a method to map addresses without allocating new pages.
It makes more sense for Process to do the allocation, so
this method was renamed allocateMem() and moved to Process,
and uses a new map() call on PageTable.
The remaining uses of the process pointer in PageTable
were only to get the name and the PID, so by passing these
in directly in the constructor, we can make PageTable
completely independent of Process.
By using an underscore, the "." is still available and can unambiguously be
used to refer to members of a structure if an operand is a structure, class,
etc. This change mostly just replaces the appropriate "."s with "_"s, but
there were also a few places where the ISA descriptions where handling the
extensions themselves and had their own regular expressions to update. The
regular expressions in the isa parser were updated as well. It also now
looks for one of the defined type extensions specifically after connecting "_"
where before it would look for any sequence of characters after a "."
following an operand name and try to use it as the extension. This helps to
disambiguate cases where a "_" may legitimately be part of an operand name but
not separate the name from the type suffix.
Because leaving the "_" and suffix on the variable name still leaves a valid
C++ identifier and all extensions need to be consistent in a given context, I
considered leaving them on as a breadcrumb that would show what the intended
type was for that operand. Unfortunately the operands can be referred to in
code templates, the Mem operand in particular, and since the exact type of Mem
can be different for different uses of the same template, that broke things.
Only create a memory ordering violation when the value could have changed
between two subsequent loads, instead of just when loads go out-of-order
to the same address. While not very common in the case of Alpha, with
an architecture with a hardware table walker this can happen reasonably
frequently beacuse a translation will miss and start a table walk and
before the CPU re-schedules the faulting instruction another one will
pass it to the same address (or cache block depending on the dendency
checking).
This patch has been tested with a couple of self-checking hand crafted
programs to stress ordering between two cores.
The performance improvement on SPEC benchmarks can be substantial (2-10%).
Having two StaticInst classes, one nominally ISA dependent and the other ISA
dependent, has not been historically useful and makes the StaticInst class
more complicated that it needs to be. This change merges StaticInstBase into
StaticInst.
This change pulls the instruction decoding machinery (including caches) out of
the StaticInst class and puts it into its own class. This has a few intrinsic
benefits. First, the StaticInst code, which has gotten to be quite large, gets
simpler. Second, the code that handles decode caching is now separated out
into its own component and can be looked at in isolation, making it easier to
understand. I took the opportunity to restructure the code a bit which will
hopefully also help.
Beyond that, this change also lays some ground work for each ISA to have its
own, potentially stateful decode object. We'd be able to include less
contextualizing information in the ExtMachInst objects since that context
would be applied at the decoder. Also, the decoder could "know" ahead of time
that all the instructions it's going to see are going to be, for instance, 64
bit mode, and it will have one less thing to check when it decodes them.
Because the decode caching mechanism has been separated out, it's now possible
to have multiple caches which correspond to different types of decoding
context. Having one cache for each element of the cross product of different
configurations may become prohibitive, so it may be desirable to clear out the
cache when relatively static state changes and not to have one for each
setting.
Because the decode function is no longer universally accessible as a static
member of the StaticInst class, a new function was added to the ThreadContexts
that returns the applicable decode object.
There are a set of locations is the linux kernel that are managed via
cache maintence instructions until all processors enable their MMUs & TLBs.
Writes to these locations are manually flushed from the cache to main
memory when the occur so that cores operating without their MMU enabled
and only issuing uncached accesses can receive the correct data. Unfortuantely,
gem5 doesn't support any kind of software directed maintence of the cache.
Until such time as that support exists this patch marks the specific cache blocks
that need to be coherent as non-cacheable until all CPUs enable their MMU and
thus allows gem5 to boot MP systems with caches enabled (a requirement for
booting an O3 cpu and thus an O3 CPU regression).
SEV instructions were originally implemented to cause asynchronous squashes
via the generateTCSquash() function in the O3 pipeline when updating the
SEV_MAILBOX miscReg. This caused race conditions between CPUs in an MP system
that would lead to a pipeline either going inactive indefinitely or not being
able to commit squashed instructions. Fixed SEV instructions to behave like
interrupts and cause synchronous sqaushes inside the pipeline, eliminating
the race conditions. Also fixed up the semantics of the WFE instruction to
behave as documented in the ARMv7 ISA description to not sleep if SEV_MAILBOX=1
or unmasked interrupts are pending.
SWP and SWPB now throw an undefined instruction exception if
SCTLR.SW == 0. This also required the MIDR to be changed
slightly so programs can correctly determine that gem5 supports
the ARM v7 behavior of SWP/SWPB (in ARM v6, SWP/SWPB were
deprecated, but not disabled at CPU startup).
Adds MISCREG_ID_MMFR2 and removes break on access to MISCREG_CLIDR. Both
registers now return values that are consistent with current ARM
implementations.
readBytes and writeBytes had the word "bytes" in their names because they
accessed blobs of bytes. This distinguished them from the read and write
functions which handled higher level data types. Because those functions don't
exist any more, this change renames readBytes and writeBytes to more general
names, readMem and writeMem, which reflect the fact that they are how you read
and write memory. This also makes their names more consistent with the
register reading/writing functions, although those are still read and set for
some reason.
Instead of clearing the entire TLB on initialization and flush, the code was
clearing only one element. This patch corrects the memsets in the init and
flush routines.
This patch fixes two problems with the O3 cpu model. The first is an issue
with an instruction fetch causing a fault on the next address while the
current macro-op is being issued. This happens when the micro-ops exceed
the fetch bandwdith and then on the next cycle the fetch stage attempts
to issue a request to the next line while it still has micro-ops to issue
if the next line faults a fault is attached to a micro-op in the currently
executing macro-op rather than a "nop" from the next instruction block.
This leads to an instruction incorrectly faulting when on fetch when
it had no reason to fault.
A similar problem occurs with interrupts. When an interrupt occurs the
fetch stage nominally stops issuing instructions immediately. This is incorrect
in the case of a macro-op as the current location might not be interruptable.
This change further eliminates cases where condition codes were being read
just so they could be written without change because the instruction in
question was supposed to preserve them. This is done by creating the condition
code code based on the input rather than just doing a simple substitution.
If one of the condition codes isn't being used in the execution we should only
read it if the instruction might be dependent on it. With the preeceding changes
there are several more cases where we should dynamically pick instead of assuming
as we did before.
Break up the condition code bits into NZ, C, V registers. These are individually
written and this removes some incorrect dependencies between instructions.
Move the saturating bit (which is also saturating) from the renamed register
that holds the flags to the CPSR miscreg and adds a allows setting it in a
similar way to the FP saturating registers. This removes a dependency in
instructions that don't write, but need to preserve the Q bit.
This change splits out the condcodes from being one monolithic register
into three blocks that are updated independently. This allows CPUs
to not have to do RMW operations on the flags registers for instructions
that don't write all flags.
Debug flags are ExecUser, ExecKernel, and ExecAsid. ExecUser and
ExecKernel are set by default when Exec is specified. Use minus
sign with ExecUser or ExecKernel to remove user or kernel tracing
respectively.
Add registers and components to better support the VersatileEB board.
Made the MIDR and SYS_ID register parameters to ArmSystem and RealviewCtrl
respectively.
At the same time, rename the trace flags to debug flags since they
have broader usage than simply tracing. This means that
--trace-flags is now --debug-flags and --trace-help is now --debug-help
This change fixes a small bug in the arm copyRegs() code where some registers
wouldn't be copied if the processor was in a mode other than MODE_USER.
Additionally, this change simplifies the way the O3 switchCpu code works by
utilizing TheISA::copyRegs() to copy the required context information
rather than the adhoc copying that goes on in the CPU model. The current code
makes assumptions about the visibility of int and float registers that aren't
true for all architectures in FS mode.
***
(1): get rid of expandForMT function
MIPS is the only ISA that cares about having a piece of ISA state integrate
multiple threads so add constants for MIPS and relieve the other ISAs from having
to define this. Also, InOrder was the only core that was actively calling
this function
* * *
(2): get rid of corespecific type
The CoreSpecific type was used as a proxy to pass in HW specific params to
a MIPS CPU, but since MIPS FS hasnt been touched for awhile, it makes sense
to not force every other ISA to use CoreSpecific as well use a special
reset function to set it. That probably should go in a PowerOn reset fault
anyway.
The ISAR registers describe which features the processor supports.
Transcribe the values listed in section B5.2.5 of the ARM ARM
into the registers as read-only values
This change speeds up booting, especially in MP cases, by not executing
udelay() on the core but instead skipping ahead tha amount of time that is being
delayed.
This patch prevents not executed conditional instructions marked as
IsQuiesce from stalling the pipeline indefinitely. If the instruction
is not executed the quiesceSkip psuedoinst is called which schedules a
wakes up call to the fetch stage.
This changes the RFE macroop into 3 microops:
URa = [sp]; URb = [sp+4]; // load CPSR,PC values from stack
sp = sp + offset; // optionally auto-increment
PC = URa; CPSR = URb; // write to the PC and CPSR.
Importantly:
- writing to PC is handled in the last micro-op.
- loading occurs prior to state changes.
There may not be a formally correct spelling for the past tense of mmap, but
mmapped is the spelling Google doesn't try to autocorrect. This makes sense
because it mirrors the past tense of map->mapped and not the past tense of
cape->caped.
--HG--
rename : src/arch/alpha/mmaped_ipr.hh => src/arch/alpha/mmapped_ipr.hh
rename : src/arch/arm/mmaped_ipr.hh => src/arch/arm/mmapped_ipr.hh
rename : src/arch/mips/mmaped_ipr.hh => src/arch/mips/mmapped_ipr.hh
rename : src/arch/power/mmaped_ipr.hh => src/arch/power/mmapped_ipr.hh
rename : src/arch/sparc/mmaped_ipr.hh => src/arch/sparc/mmapped_ipr.hh
rename : src/arch/x86/mmaped_ipr.hh => src/arch/x86/mmapped_ipr.hh
We only support EABI binaries, so there is no reason to support OABI syscalls.
The loader detects OABI calls and fatal() so there is no reason to even check
here.
The ARM performance counters are not currently supported by the model.
This patch interprets a 'reset performance counters' command to mean 'reset
the simulator statistics' instead.
Uncacheable requests were set as such only in atomic mode.
currState->delayed is checked in place of currState->timing for resetting
currState in atomic mode.
Some ISAs (like ARM) relies on hardware page table walkers. For those ISAs,
when a TLB miss occurs, initiateTranslation() can return with NoFault but with
the translation unfinished.
Instructions experiencing a delayed translation due to a hardware page table
walk are deferred until the translation completes and kept into the IQ. In
order to keep track of them, the IQ has been augmented with a queue of the
outstanding delayed memory instructions. When their translation completes,
instructions are re-executed (only their initiateAccess() was already
executed; their DTB translation is now skipped). The IEW stage has been
modified to support such a 2-pass execution.
Any change of control flow now resets the itstate to 0 mask and 0 condition,
except where the control flow alteration write into the cpsr register. These
case, for example return from an iterrupt, require the predecoder to recover
the itstate.
As there is a window of opportunity between the return from an interrupt
changing the control flow at the head of the pipe and the commit of the update
to the CPSR, the predecoder needs to be able to grab the ITstate early. This
is now handled by setting the forcedItState inside a PCstate for the control
flow altering instruction.
That instruction will have the correct mask/cond, but will not have a valid
itstate until advancePC is called (note this happens to advance the execution).
When the new PCstate is copy constructed it gets the itstate cond/mask, and
upon advancing the PC the itstate becomes valid.
Subsequent advancing invalidates the state and zeroes the cond/mask. This is
handled in isolation for the ARM ISA and should have no impact on other ISAs.
Refer arch/arm/types.hh and arch/arm/predecoder.cc for the details.
When this condition occurs the cpu should restart the fetch stage to fetch from
the original execution path. Fault handling in the commit stage is cleaned up a
little bit so the control flow is simplier. Finally, if an instruction is being
used to carry a fault it isn't executed, so the fault propagates appropriately.
Ran all the source files through 'perl -pi' with this script:
s|\s*(};?\s*)?/\*\s*(end\s*)?namespace\s*(\S+)\s*\*/(\s*})?|} // namespace $3|;
s|\s*};?\s*//\s*(end\s*)?namespace\s*(\S+)\s*|} // namespace $2\n|;
s|\s*};?\s*//\s*(\S+)\s*namespace\s*|} // namespace $1\n|;
Also did a little manual editing on some of the arch/*/isa_traits.hh files
and src/SConscript.
ARM instructions updating cumulative flags (ARM FP exceptions and saturation
flags) are not serialized.
Added aliases for ARM FP exceptions and saturation flags in FPSCR. Removed
write accesses to the FP condition codes for most ARM VFP instructions: only
VCMP and VCMPE instructions update the FP condition codes. Removed a potential
cause of seg. faults in the O3 model for NEON memory macro-ops (ARM).
The L1 cache may have been accessed to provide this data, which confuses
it, if it ends up being accesses twice in one cycle. Instead wait 1 tick
which will force the timing simple CPU to forward to its next clock cycle
when the translation completes.
Also prevent multiple outstanding table walks from occuring at once.
This change modifies the way prefetches work. They are now like normal loads
that don't writeback a register. Previously prefetches were supposed to call
prefetch() on the exection context, so they executed with execute() methods
instead of initiateAcc() completeAcc(). The prefetch() methods for all the CPUs
are blank, meaning that they get executed, but don't actually do anything.
On Alpha dead cache copy code was removed and prefetches are now normal ops.
They count as executed operations, but still don't do anything and IsMemRef is
not longer set on them.
On ARM IsDataPrefetch or IsInstructionPreftech is now set on all prefetch
instructions. The timing simple CPU doesn't try to do anything special for
prefetches now and they execute with the normal memory code path.
This change is a low level and pervasive reorganization of how PCs are managed
in M5. Back when Alpha was the only ISA, there were only 2 PCs to worry about,
the PC and the NPC, and the lsb of the PC signaled whether or not you were in
PAL mode. As other ISAs were added, we had to add an NNPC, micro PC and next
micropc, x86 and ARM introduced variable length instruction sets, and ARM
started to keep track of mode bits in the PC. Each CPU model handled PCs in
its own custom way that needed to be updated individually to handle the new
dimensions of variability, or, in the case of ARMs mode-bit-in-the-pc hack,
the complexity could be hidden in the ISA at the ISA implementation's expense.
Areas like the branch predictor hadn't been updated to handle branch delay
slots or micropcs, and it turns out that had introduced a significant (10s of
percent) performance bug in SPARC and to a lesser extend MIPS. Rather than
perpetuate the problem by reworking O3 again to handle the PC features needed
by x86, this change was introduced to rework PC handling in a more modular,
transparent, and hopefully efficient way.
PC type:
Rather than having the superset of all possible elements of PC state declared
in each of the CPU models, each ISA defines its own PCState type which has
exactly the elements it needs. A cross product of canned PCState classes are
defined in the new "generic" ISA directory for ISAs with/without delay slots
and microcode. These are either typedef-ed or subclassed by each ISA. To read
or write this structure through a *Context, you use the new pcState() accessor
which reads or writes depending on whether it has an argument. If you just
want the address of the current or next instruction or the current micro PC,
you can get those through read-only accessors on either the PCState type or
the *Contexts. These are instAddr(), nextInstAddr(), and microPC(). Note the
move away from readPC. That name is ambiguous since it's not clear whether or
not it should be the actual address to fetch from, or if it should have extra
bits in it like the PAL mode bit. Each class is free to define its own
functions to get at whatever values it needs however it needs to to be used in
ISA specific code. Eventually Alpha's PAL mode bit could be moved out of the
PC and into a separate field like ARM.
These types can be reset to a particular pc (where npc = pc +
sizeof(MachInst), nnpc = npc + sizeof(MachInst), upc = 0, nupc = 1 as
appropriate), printed, serialized, and compared. There is a branching()
function which encapsulates code in the CPU models that checked if an
instruction branched or not. Exactly what that means in the context of branch
delay slots which can skip an instruction when not taken is ambiguous, and
ideally this function and its uses can be eliminated. PCStates also generally
know how to advance themselves in various ways depending on if they point at
an instruction, a microop, or the last microop of a macroop. More on that
later.
Ideally, accessing all the PCs at once when setting them will improve
performance of M5 even though more data needs to be moved around. This is
because often all the PCs need to be manipulated together, and by getting them
all at once you avoid multiple function calls. Also, the PCs of a particular
thread will have spatial locality in the cache. Previously they were grouped
by element in arrays which spread out accesses.
Advancing the PC:
The PCs were previously managed entirely by the CPU which had to know about PC
semantics, try to figure out which dimension to increment the PC in, what to
set NPC/NNPC, etc. These decisions are best left to the ISA in conjunction
with the PC type itself. Because most of the information about how to
increment the PC (mainly what type of instruction it refers to) is contained
in the instruction object, a new advancePC virtual function was added to the
StaticInst class. Subclasses provide an implementation that moves around the
right element of the PC with a minimal amount of decision making. In ISAs like
Alpha, the instructions always simply assign NPC to PC without having to worry
about micropcs, nnpcs, etc. The added cost of a virtual function call should
be outweighed by not having to figure out as much about what to do with the
PCs and mucking around with the extra elements.
One drawback of making the StaticInsts advance the PC is that you have to
actually have one to advance the PC. This would, superficially, seem to
require decoding an instruction before fetch could advance. This is, as far as
I can tell, realistic. fetch would advance through memory addresses, not PCs,
perhaps predicting new memory addresses using existing ones. More
sophisticated decisions about control flow would be made later on, after the
instruction was decoded, and handed back to fetch. If branching needs to
happen, some amount of decoding needs to happen to see that it's a branch,
what the target is, etc. This could get a little more complicated if that gets
done by the predecoder, but I'm choosing to ignore that for now.
Variable length instructions:
To handle variable length instructions in x86 and ARM, the predecoder now
takes in the current PC by reference to the getExtMachInst function. It can
modify the PC however it needs to (by setting NPC to be the PC + instruction
length, for instance). This could be improved since the CPU doesn't know if
the PC was modified and always has to write it back.
ISA parser:
To support the new API, all PC related operand types were removed from the
parser and replaced with a PCState type. There are two warts on this
implementation. First, as with all the other operand types, the PCState still
has to have a valid operand type even though it doesn't use it. Second, using
syntax like PCS.npc(target) doesn't work for two reasons, this looks like the
syntax for operand type overriding, and the parser can't figure out if you're
reading or writing. Instructions that use the PCS operand (which I've
consistently called it) need to first read it into a local variable,
manipulate it, and then write it back out.
Return address stack:
The return address stack needed a little extra help because, in the presence
of branch delay slots, it has to merge together elements of the return PC and
the call PC. To handle that, a buildRetPC utility function was added. There
are basically only two versions in all the ISAs, but it didn't seem short
enough to put into the generic ISA directory. Also, the branch predictor code
in O3 and InOrder were adjusted so that they always store the PC of the actual
call instruction in the RAS, not the next PC. If the call instruction is a
microop, the next PC refers to the next microop in the same macroop which is
probably not desirable. The buildRetPC function advances the PC intelligently
to the next macroop (in an ISA specific way) so that that case works.
Change in stats:
There were no change in stats except in MIPS and SPARC in the O3 model. MIPS
runs in about 9% fewer ticks. SPARC runs with 30%-50% fewer ticks, which could
likely be improved further by setting call/return instruction flags and taking
advantage of the RAS.
TODO:
Add != operators to the PCState classes, defined trivially to be !(a==b).
Smooth out places where PCs are split apart, passed around, and put back
together later. I think this might happen in SPARC's fault code. Add ISA
specific constructors that allow setting PC elements without calling a bunch
of accessors. Try to eliminate the need for the branching() function. Factor
out Alpha's PAL mode pc bit into a separate flag field, and eliminate places
where it's blindly masked out or tested in the PC.