While FastAlloc provides a small performance increase (~1.5%) over regular malloc it isn't thread safe.
After removing FastAlloc and using tcmalloc I've seen a performance increase of 12% over libc malloc
when running twolf for ARM.
This patch makes two very minor changes to please gcc 4.7. The
CopyData function no longer exists and this has been replaced. For
some reason previous versions of gcc did not complain on the const
char casting not having an implementation, but this is now addressed.
If the length argument to mmap is larger than the arbitrary but reasonable
limit of 4GB, there's a good chance that the value is nonsense and not
intentional. Rather than attempting to satisfy the mmap anyway, this change
makes gem5 warn to make it more apparent what's going wrong.
Track the point in the initialization where statistics have been registered.
After this point registering new masterIds can no longer work as some
SimObjects may have sized stats vectors based on the previous value. If someone
tries to register a masterId after this point the simulator executes fatal().
This patch moves send/recvTiming and send/recvTimingSnoop from the
Port base class to the MasterPort and SlavePort, and also splits them
into separate member functions for requests and responses:
send/recvTimingReq, send/recvTimingResp, and send/recvTimingSnoopReq,
send/recvTimingSnoopResp. A master port sends requests and receives
responses, and also receives snoop requests and sends snoop
responses. A slave port has the reciprocal behaviour as it receives
requests and sends responses, and sends snoop requests and receives
snoop responses.
For all MemObjects that have only master ports or slave ports (but not
both), e.g. a CPU, or a PIO device, this patch merely adds more
clarity to what kind of access is taking place. For example, a CPU
port used to call sendTiming, and will now call
sendTimingReq. Similarly, a response previously came back through
recvTiming, which is now recvTimingResp. For the modules that have
both master and slave ports, e.g. the bus, the behaviour was
previously relying on branches based on pkt->isRequest(), and this is
now replaced with a direct call to the apprioriate member function
depending on the type of access. Please note that send/recvRetry is
still shared by all the timing accessors and remains in the Port base
class for now (to maintain the current bus functionality and avoid
changing the statistics of all regressions).
The packet queue is split into a MasterPort and SlavePort version to
facilitate the use of the new timing accessors. All uses of the
PacketQueue are updated accordingly.
With this patch, the type of packet (request or response) is now well
defined for each type of access, and asserts on pkt->isRequest() and
pkt->isResponse() are now moved to the appropriate send member
functions. It is also worth noting that sendTimingSnoopReq no longer
returns a boolean, as the semantics do not alow snoop requests to be
rejected or stalled. All these assumptions are now excplicitly part of
the port interface itself.
This patch introduces port access methods that separates snoop
request/responses from normal memory request/responses. The
differentiation is made for functional, atomic and timing accesses and
builds on the introduction of master and slave ports.
Before the introduction of this patch, the packets belonging to the
different phases of the protocol (request -> [forwarded snoop request
-> snoop response]* -> response) all use the same port access
functions, even though the snoop packets flow in the opposite
direction to the normal packet. That is, a coherent master sends
normal request and receives responses, but receives snoop requests and
sends snoop responses (vice versa for the slave). These two distinct
phases now use different access functions, as described below.
Starting with the functional access, a master sends a request to a
slave through sendFunctional, and the request packet is turned into a
response before the call returns. In a system without cache coherence,
this is all that is needed from the functional interface. For the
cache-coherent scenario, a slave also sends snoop requests to coherent
masters through sendFunctionalSnoop, with responses returned within
the same packet pointer. This is currently used by the bus and caches,
and the LSQ of the O3 CPU. The send/recvFunctional and
send/recvFunctionalSnoop are moved from the Port super class to the
appropriate subclass.
Atomic accesses follow the same flow as functional accesses, with
request being sent from master to slave through sendAtomic. In the
case of cache-coherent ports, a slave can send snoop requests to a
master through sendAtomicSnoop. Just as for the functional access
methods, the atomic send and receive member functions are moved to the
appropriate subclasses.
The timing access methods are different from the functional and atomic
in that requests and responses are separated in time and
send/recvTiming are used for both directions. Hence, a master uses
sendTiming to send a request to a slave, and a slave uses sendTiming
to send a response back to a master, at a later point in time. Snoop
requests and responses travel in the opposite direction, similar to
what happens in functional and atomic accesses. With the introduction
of this patch, it is possible to determine the direction of packets in
the bus, and no longer necessary to look for both a master and a slave
port with the requested port id.
In contrast to the normal recvFunctional, recvAtomic and recvTiming
that are pure virtual functions, the recvFunctionalSnoop,
recvAtomicSnoop and recvTimingSnoop have a default implementation that
calls panic. This is to allow non-coherent master and slave ports to
not implement these functions.
This patch addresses a number of minor issues that cause problems when
compiling with clang >= 3.0 and gcc >= 4.6. Most importantly, it
avoids using the deprecated ext/hash_map and instead uses
unordered_map (and similarly so for the hash_set). To make use of the
new STL containers, g++ and clang has to be invoked with "-std=c++0x",
and this is now added for all gcc versions >= 4.6, and for clang >=
3.0. For gcc >= 4.3 and <= 4.5 and clang <= 3.0 we use the tr1
unordered_map to avoid the deprecation warning.
The addition of c++0x in turn causes a few problems, as the
compiler is more stringent and adds a number of new warnings. Below,
the most important issues are enumerated:
1) the use of namespaces is more strict, e.g. for isnan, and all
headers opening the entire namespace std are now fixed.
2) another other issue caused by the more stringent compiler is the
narrowing of the embedded python, which used to be a char array,
and is now unsigned char since there were values larger than 128.
3) a particularly odd issue that arose with the new c++0x behaviour is
found in range.hh, where the operator< causes gcc to complain about
the template type parsing (the "<" is interpreted as the beginning
of a template argument), and the problem seems to be related to the
begin/end members introduced for the range-type iteration, which is
a new feature in c++11.
As a minor update, this patch also fixes the build flags for the clang
debug target that used to be shared with gcc and incorrectly use
"-ggdb".
This patch removes the assumption on having on single instance of
PhysicalMemory, and enables a distributed memory where the individual
memories in the system are each responsible for a single contiguous
address range.
All memories inherit from an AbstractMemory that encompasses the basic
behaviuor of a random access memory, and provides untimed access
methods. What was previously called PhysicalMemory is now
SimpleMemory, and a subclass of AbstractMemory. All future types of
memory controllers should inherit from AbstractMemory.
To enable e.g. the atomic CPU and RubyPort to access the now
distributed memory, the system has a wrapper class, called
PhysicalMemory that is aware of all the memories in the system and
their associated address ranges. This class thus acts as an
infinitely-fast bus and performs address decoding for these "shortcut"
accesses. Each memory can specify that it should not be part of the
global address map (used e.g. by the functional memories by some
testers). Moreover, each memory can be configured to be reported to
the OS configuration table, useful for populating ATAG structures, and
any potential ACPI tables.
Checkpointing support currently assumes that all memories have the
same size and organisation when creating and resuming from the
checkpoint. A future patch will enable a more flexible
re-organisation.
--HG--
rename : src/mem/PhysicalMemory.py => src/mem/AbstractMemory.py
rename : src/mem/PhysicalMemory.py => src/mem/SimpleMemory.py
rename : src/mem/physical.cc => src/mem/abstract_mem.cc
rename : src/mem/physical.hh => src/mem/abstract_mem.hh
rename : src/mem/physical.cc => src/mem/simple_mem.cc
rename : src/mem/physical.hh => src/mem/simple_mem.hh
This patch introduces the notion of a master and slave port in the C++
code, thus bringing the previous classification from the Python
classes into the corresponding simulation objects and memory objects.
The patch enables us to classify behaviours into the two bins and add
assumptions and enfore compliance, also simplifying the two
interfaces. As a starting point, isSnooping is confined to a master
port, and getAddrRanges to slave ports. More of these specilisations
are to come in later patches.
The getPort function is not getMasterPort and getSlavePort, and
returns a port reference rather than a pointer as NULL would never be
a valid return value. The default implementation of these two
functions is placed in MemObject, and calls fatal.
The one drawback with this specific patch is that it requires some
code duplication, e.g. QueuedPort becomes QueuedMasterPort and
QueuedSlavePort, and BusPort becomes BusMasterPort and BusSlavePort
(avoiding multiple inheritance). With the later introduction of the
port interfaces, moving the functionality outside the port itself, a
lot of the duplicated code will disappear again.
This patch cleans up a number of minor issues aiming to get closer to
compliance with the C++0x standard as interpreted by gcc and clang
(compile with std=c++0x and -pedantic-errors). In particular, the
patch cleans up enums where the last item was succeded by a comma,
namespaces closed by a curcly brace followed by a semi-colon, and the
use of the GNU-extension typeof (replaced by templated functions). It
does not address variable-length arrays, zero-size arrays, anonymous
structs, range expressions in switch statements, and the use of long
long. The generated CPU code also has a large number of issues that
remain to be fixed, mainly related to overflows in implicit constant
conversion (due to shifts).
This patch makes the code compile with clang 2.9 and 3.0 again by
making two very minor changes. Firt, it maintains a strict typing in
the forward declaration of the BaseCPUParams. Second, it adds a
FullSystemInt flag of the type unsigned int next to the boolean
FullSystem flag. The FullSystemInt variable can be used in
decode-statements (expands to switch statements) in the instruction
decoder.
The change to port proxies recently moved code out of the constructor into
initState(). This is needed for code that loads data into memory, however
for code that setups symbol tables, kernel based events, etc this is the wrong
thing to do as that code is only called when a checkpoint isn't being restored
from.
This patch is adding a clearer design intent to all objects that would
not be complete without a port proxy by making the proxies members
rathen than dynamically allocated. In essence, if NULL would not be a
valid value for the proxy, then we avoid using a pointer to make this
clear.
The same approach is used for the methods using these proxies, such as
loadSections, that now use references rather than pointers to better
reflect the fact that NULL would not be an acceptable value (in fact
the code would break and that is how this patch started out).
Overall the concept of "using a reference to express unconditional
composition where a NULL pointer is never valid" could be done on a
much broader scale throughout the code base, but for now it is only
done in the locations affected by the proxies.
This patch classifies all ports in Python as either Master or Slave
and enforces a binding of master to slave. Conceptually, a master (such
as a CPU or DMA port) issues requests, and receives responses, and
conversely, a slave (such as a memory or a PIO device) receives
requests and sends back responses. Currently there is no
differentiation between coherent and non-coherent masters and slaves.
The classification as master/slave also involves splitting the dual
role port of the bus into a master and slave port and updating all the
system assembly scripts to use the appropriate port. Similarly, the
interrupt devices have to have their int_port split into a master and
slave port. The intdev and its children have minimal changes to
facilitate the extra port.
Note that this patch does not enforce any port typing in the C++
world, it merely ensures that the Python objects have a notion of the
port roles and are connected in an appropriate manner. This check is
carried when two ports are connected, e.g. bus.master =
memory.port. The following patches will make use of the
classifications and specialise the C++ ports into masters and slaves.
This change adds a master id to each request object which can be
used identify every device in the system that is capable of issuing a request.
This is part of the way to removing the numCpus+1 stats in the cache and
replacing them with the master ids. This is one of a series of changes
that make way for the stats output to be changed to python.
The code that checks whether pages allocated by allocPhysPages only checks
that the first page fits into physical memory, not that all of them do. This
change makes the code check the last page which should work properly. This
function used to only allocate one page at a time, so the first page and last
page used to be the same thing.
This patch adds the necessary flags to the SConstruct and SConscript
files for compiling using clang 2.9 and later (on Ubuntu et al and OSX
XCode 4.2), and also cleans up a bunch of compiler warnings found by
clang. Most of the warnings are related to hidden virtual functions,
comparisons with unsigneds >= 0, and if-statements with empty
bodies. A number of mismatches between struct and class are also
fixed. clang 2.8 is not working as it has problems with class names
that occur in multiple namespaces (e.g. Statistics in
kernel_stats.hh).
clang has a bug (http://llvm.org/bugs/show_bug.cgi?id=7247) which
causes confusion between the container std::set and the function
Packet::set, and this is currently addressed by not including the
entire namespace std, but rather selecting e.g. "using std::vector" in
the appropriate places.
Usage: m5 writefile <filename>
File will be created in the gem5 output folder with the identical filename.
Implementation is largely based on the existing "readfile" functionality.
Currently does not support exporting of folders.
This patch simplifies the address-range determination mechanism and
also unifies the naming across ports and devices. It further splits
the queries for determining if a port is snooping and what address
ranges it responds to (aiming towards a separation of
cache-maintenance ports and pure memory-mapped ports). Default
behaviours are such that most ports do not have to define isSnooping,
and master ports need not implement getAddrRanges.
Port proxies are used to replace non-structural ports, and thus enable
all ports in the system to correspond to a structural entity. This has
the advantage of accessing memory through the normal memory subsystem
and thus allowing any constellation of distributed memories, address
maps, etc. Most accesses are done through the "system port" that is
used for loading binaries, debugging etc. For the entities that belong
to the CPU, e.g. threads and thread contexts, they wrap the CPU data
port in a port proxy.
The following replacements are made:
FunctionalPort > PortProxy
TranslatingPort > SETranslatingPortProxy
VirtualPort > FSTranslatingPortProxy
--HG--
rename : src/mem/vport.cc => src/mem/fs_translating_port_proxy.cc
rename : src/mem/vport.hh => src/mem/fs_translating_port_proxy.hh
rename : src/mem/translating_port.cc => src/mem/se_translating_port_proxy.cc
rename : src/mem/translating_port.hh => src/mem/se_translating_port_proxy.hh
The system port is used as a globally reachable access point to the
memory subsystem. The benefit of using an actual port is that the
usual infrastructure is used to resolve any access and thus makes the
overall system able to handle distributed memories in any
configuration, and also makes the accesses agnostic to the address
map. This patch only introduces the port and does not actually use it
for anything.
This patch adds a mechanism to collect run time samples for specific portions
of a benchmark, using work_begin and work_end pseudo instructions.It also enhances
the histogram stat to report geometric mean.
This patch adds a function for replacing the event at the head of the queue
with another event. This helps in running a different set of events. Events
already scheduled can processed by replacing the original head event back.
This function has been specifically added to support cache warmup and
cooldown required for creating and restoring checkpoints.
--HG--
extra : rebase_source : ed6e2905720b6bfdefd020fab76235ccf33d28d1
This parameter depends on a number of coincidences to work properly. First,
there must be an array assigned to system called "cpu" even though there's no
parameter called that. Second, the items in the "cpu" array have to have a
"clock" parameter which has a "frequency" member. This is true of the normal
CPUs, but isn't true of the memory tester CPUs. This happened to work before
because the memory tester CPUs were only used in SE mode where this parameter
was being excluded. Since everything is being pulled into a common binary,
this won't work any more. Since the boot_cpu_frequency parameter is only used
by Alpha's Linux System object (and Mips's through copy and paste), the
definition of that parameter is moved down to those objects specifically.
PageTable supported an allocate() call that called back
through the Process to allocate memory, but did not have
a method to map addresses without allocating new pages.
It makes more sense for Process to do the allocation, so
this method was renamed allocateMem() and moved to Process,
and uses a new map() call on PageTable.
The remaining uses of the process pointer in PageTable
were only to get the name and the PID, so by passing these
in directly in the constructor, we can make PageTable
completely independent of Process.
Replace the (broken as of previous changeset) swig_objdecl() method
that allowed/forced you to substitute a whole new C++ struct
definition for SWIG to wrap with a set of export_method* hooks
that let you just declare a set of C++ methods (or other declarations)
that get inserted in the auto-generated struct.
Restore the System get/setMemoryMode methods, and use this mechanism
to specialize SimObject as well, eliminating teh need for sim_object.i.
Needed bits of sim_object.i are moved to the new pyobject.i.
Also sucked a little SimObject specialization into cxx_param_decl()
allowing us to get rid of src/sim/sim_object_params.hh. Now the
generation and wrapping of the base SimObject param struct is more
in line with how derived objects are handled.
--HG--
rename : src/python/swig/sim_object.i => src/python/swig/pyobject.i
- Move the random bits of SWIG code generation out of src/SConscript
file and into methods on the objects being wrapped.
- Cleaned up some variable naming and added some comments to make
the process a little clearer.
- Did a little generated file/module renaming:
- vptype_Foo now Foo_vector
- init_Foo is now Foo_init
This makes it easier to see all the Foo-related files in a
sorted directory listing.
- Made cxx_predecls and swig_predecls normal SimObject classmethods.
- Got rid of swig_objdecls hook, even though this breaks the System
objects get/setMemoryMode method exports. Will be fixing this in
a future changeset.
In order for a system object to work in SE mode and FS mode, it has to either
always require a platform object even in SE mode, or get rid of the
requirement all together. Making SE mode carry around unnecessary/unused bits
of FS seems less than ideal, so I decided to go with the second option. The
platform pointer in the System class was used for exactly one purpose, a path
for the Alpha Linux system object to get to the real time clock and read its
frequency so that it could short cut the loops_per_jiffy calculation. There
was also a copy and pasted implementation in MIPS, but since it was only there
because it was there in Alpha I still count that as one use.
This change reverses the mechanism that communicates the RTC frequency so that
the Tsunami platform object pushes it up to the AlphaSystem object. This is
slightly less specific than it could be because really only the
AlphaLinuxSystem uses it. Because the intrFrequency function on the Platform
class was no longer necessary (and unimplemented on anything but Alpha) it was
eliminated.
After this change, a platform will need to have a system, but a system won't
have to have a platform.
All of the classes will now be available in both modes, and only
GenericPageTableFault will continue to check the mode for conditional
compilation. It uses a process object to handle the fault in SE mode, and
for now those aren't available in FS mode.
This constant will have the same value as FULL_SYSTEM but will not be usable
by the preprocessor. It can be substituted into places where FULL_SYSTEM is
used in a C++ context and will make it easier to find which parts of the
simulator still use FULL_SYSTEM with the preprocessor using grep.
Initialize flags via the Event constructor instead of calling
setFlags() in the body of the derived class's constructor. I
forget exactly why, but this made life easier when implementing
multi-queue support.
Also rename Event::getFlags() to isFlagSet() to better match
common usage, and get rid of some unused Event methods.
Use exitSimLoop() function instead of explicitly scheduling
on mainEventQueue (which won't work once we go to multiple
event queues). Also introduced a local params variable to
shorten a lot of expressions.
It was technically possible but clumsy to determine what endianness a guest
was configured with using the state in byteswap.hh. This change makes that
information available more directly.
Also get rid of unused (and mildly redundant) ByteOrderDiffers constant.
Only create a memory ordering violation when the value could have changed
between two subsequent loads, instead of just when loads go out-of-order
to the same address. While not very common in the case of Alpha, with
an architecture with a hardware table walker this can happen reasonably
frequently beacuse a translation will miss and start a table walk and
before the CPU re-schedules the faulting instruction another one will
pass it to the same address (or cache block depending on the dendency
checking).
This patch has been tested with a couple of self-checking hand crafted
programs to stress ordering between two cores.
The performance improvement on SPEC benchmarks can be substantial (2-10%).
Do some minor cleanup of some recently added comments, a warning, and change
other instances of stack extension to be like what's now being done for x86.
Nothing big here, but when you have an address that is not in the page table request to be allocated, if it falls outside of the maximum stack range all you get is a page fault and you don't know why. Add a little warn() to explain it a bit. Also add some comments and alter logic a little so that you don't totally ignore the return value of checkAndAllocNextPage().
We were getting a spurious warning in the regressions that turned
out to be due to having the wrong value for TGT_MAP_ANONYMOUS for
Power Linux, but in the process of tracking it down I ended up
doing some cleanup of the mmap handling in general.
This is similar to guards on mercurial queues and they're used for selecting
which files are compiled into some given object. We already do something
similar, but it's mostly hard coded for the m5 binary and the m5 library
and I'd like to make it more flexible to better support the unittests
At the same time, rename the trace flags to debug flags since they
have broader usage than simply tracing. This means that
--trace-flags is now --debug-flags and --trace-help is now --debug-help
This patch prevents not executed conditional instructions marked as
IsQuiesce from stalling the pipeline indefinitely. If the instruction
is not executed the quiesceSkip psuedoinst is called which schedules a
wakes up call to the fetch stage.
Some ISAs (like ARM) relies on hardware page table walkers. For those ISAs,
when a TLB miss occurs, initiateTranslation() can return with NoFault but with
the translation unfinished.
Instructions experiencing a delayed translation due to a hardware page table
walk are deferred until the translation completes and kept into the IQ. In
order to keep track of them, the IQ has been augmented with a queue of the
outstanding delayed memory instructions. When their translation completes,
instructions are re-executed (only their initiateAccess() was already
executed; their DTB translation is now skipped). The IEW stage has been
modified to support such a 2-pass execution.
Setup initial timesync event in initState or loadState so that curTick has
been updated to the new value, otherwise the event is scheduled in the past.
Moving the definition of NoFault into fault.hh doesn't bring any new
dependencies with it, and allows some files to include just fault.hh which has
less baggage. NoFault will still be available to everything that includes
faults.hh because it includes fault.hh.
M5 skips over any simulated time where it doesn't have any work to do. When
the simulation is active, the time skipped is short and the work done at any
point in time is relatively substantial. If the time between events is long
and/or the work to do at each event is small, it's possible for simulated time
to pass faster than real time. When running a benchmark that can be good
because it means the simulation will finish sooner in real time. When
interacting with the real world through, for instance, a serial terminal or
bridge to a real network, this can be a problem. Human or network response time
could be greatly exagerated from the perspective of the simulation and make
simulated events happen "too soon" from an external perspective.
This change adds the capability to force the simulation to run no faster than
real time. It does so by scheduling a periodic event that checks to see if
its simulated period is shorter than its real period. If it is, it stalls the
simulation until they're equal. This is called time syncing.
A future change could add pseudo instructions which turn time syncing on and
off from within the simulation. That would allow time syncing to be used for
the interactive parts of a session but then turned off when running a
benchmark using the m5 utility program inside a script. Time syncing would
probably not happen anyway while running a benchmark because there would be
plenty of work for M5 to do, but the event overhead could be avoided.
There's no reason for it to derive from SimLoopExitEvent.
This whole drain thing needs to be redone eventually,
but this is a stopgap to make later changes to
SimLoopExitEvent feasible.
Avoid direct references to mainEventQueue in pseudo-insts
by indirecting through associated CPU object.
Made exitSimLoop() more flexible to enable some of these.
Ran all the source files through 'perl -pi' with this script:
s|\s*(};?\s*)?/\*\s*(end\s*)?namespace\s*(\S+)\s*\*/(\s*})?|} // namespace $3|;
s|\s*};?\s*//\s*(end\s*)?namespace\s*(\S+)\s*|} // namespace $2\n|;
s|\s*};?\s*//\s*(\S+)\s*namespace\s*|} // namespace $1\n|;
Also did a little manual editing on some of the arch/*/isa_traits.hh files
and src/SConscript.
This change is a low level and pervasive reorganization of how PCs are managed
in M5. Back when Alpha was the only ISA, there were only 2 PCs to worry about,
the PC and the NPC, and the lsb of the PC signaled whether or not you were in
PAL mode. As other ISAs were added, we had to add an NNPC, micro PC and next
micropc, x86 and ARM introduced variable length instruction sets, and ARM
started to keep track of mode bits in the PC. Each CPU model handled PCs in
its own custom way that needed to be updated individually to handle the new
dimensions of variability, or, in the case of ARMs mode-bit-in-the-pc hack,
the complexity could be hidden in the ISA at the ISA implementation's expense.
Areas like the branch predictor hadn't been updated to handle branch delay
slots or micropcs, and it turns out that had introduced a significant (10s of
percent) performance bug in SPARC and to a lesser extend MIPS. Rather than
perpetuate the problem by reworking O3 again to handle the PC features needed
by x86, this change was introduced to rework PC handling in a more modular,
transparent, and hopefully efficient way.
PC type:
Rather than having the superset of all possible elements of PC state declared
in each of the CPU models, each ISA defines its own PCState type which has
exactly the elements it needs. A cross product of canned PCState classes are
defined in the new "generic" ISA directory for ISAs with/without delay slots
and microcode. These are either typedef-ed or subclassed by each ISA. To read
or write this structure through a *Context, you use the new pcState() accessor
which reads or writes depending on whether it has an argument. If you just
want the address of the current or next instruction or the current micro PC,
you can get those through read-only accessors on either the PCState type or
the *Contexts. These are instAddr(), nextInstAddr(), and microPC(). Note the
move away from readPC. That name is ambiguous since it's not clear whether or
not it should be the actual address to fetch from, or if it should have extra
bits in it like the PAL mode bit. Each class is free to define its own
functions to get at whatever values it needs however it needs to to be used in
ISA specific code. Eventually Alpha's PAL mode bit could be moved out of the
PC and into a separate field like ARM.
These types can be reset to a particular pc (where npc = pc +
sizeof(MachInst), nnpc = npc + sizeof(MachInst), upc = 0, nupc = 1 as
appropriate), printed, serialized, and compared. There is a branching()
function which encapsulates code in the CPU models that checked if an
instruction branched or not. Exactly what that means in the context of branch
delay slots which can skip an instruction when not taken is ambiguous, and
ideally this function and its uses can be eliminated. PCStates also generally
know how to advance themselves in various ways depending on if they point at
an instruction, a microop, or the last microop of a macroop. More on that
later.
Ideally, accessing all the PCs at once when setting them will improve
performance of M5 even though more data needs to be moved around. This is
because often all the PCs need to be manipulated together, and by getting them
all at once you avoid multiple function calls. Also, the PCs of a particular
thread will have spatial locality in the cache. Previously they were grouped
by element in arrays which spread out accesses.
Advancing the PC:
The PCs were previously managed entirely by the CPU which had to know about PC
semantics, try to figure out which dimension to increment the PC in, what to
set NPC/NNPC, etc. These decisions are best left to the ISA in conjunction
with the PC type itself. Because most of the information about how to
increment the PC (mainly what type of instruction it refers to) is contained
in the instruction object, a new advancePC virtual function was added to the
StaticInst class. Subclasses provide an implementation that moves around the
right element of the PC with a minimal amount of decision making. In ISAs like
Alpha, the instructions always simply assign NPC to PC without having to worry
about micropcs, nnpcs, etc. The added cost of a virtual function call should
be outweighed by not having to figure out as much about what to do with the
PCs and mucking around with the extra elements.
One drawback of making the StaticInsts advance the PC is that you have to
actually have one to advance the PC. This would, superficially, seem to
require decoding an instruction before fetch could advance. This is, as far as
I can tell, realistic. fetch would advance through memory addresses, not PCs,
perhaps predicting new memory addresses using existing ones. More
sophisticated decisions about control flow would be made later on, after the
instruction was decoded, and handed back to fetch. If branching needs to
happen, some amount of decoding needs to happen to see that it's a branch,
what the target is, etc. This could get a little more complicated if that gets
done by the predecoder, but I'm choosing to ignore that for now.
Variable length instructions:
To handle variable length instructions in x86 and ARM, the predecoder now
takes in the current PC by reference to the getExtMachInst function. It can
modify the PC however it needs to (by setting NPC to be the PC + instruction
length, for instance). This could be improved since the CPU doesn't know if
the PC was modified and always has to write it back.
ISA parser:
To support the new API, all PC related operand types were removed from the
parser and replaced with a PCState type. There are two warts on this
implementation. First, as with all the other operand types, the PCState still
has to have a valid operand type even though it doesn't use it. Second, using
syntax like PCS.npc(target) doesn't work for two reasons, this looks like the
syntax for operand type overriding, and the parser can't figure out if you're
reading or writing. Instructions that use the PCS operand (which I've
consistently called it) need to first read it into a local variable,
manipulate it, and then write it back out.
Return address stack:
The return address stack needed a little extra help because, in the presence
of branch delay slots, it has to merge together elements of the return PC and
the call PC. To handle that, a buildRetPC utility function was added. There
are basically only two versions in all the ISAs, but it didn't seem short
enough to put into the generic ISA directory. Also, the branch predictor code
in O3 and InOrder were adjusted so that they always store the PC of the actual
call instruction in the RAS, not the next PC. If the call instruction is a
microop, the next PC refers to the next microop in the same macroop which is
probably not desirable. The buildRetPC function advances the PC intelligently
to the next macroop (in an ISA specific way) so that that case works.
Change in stats:
There were no change in stats except in MIPS and SPARC in the O3 model. MIPS
runs in about 9% fewer ticks. SPARC runs with 30%-50% fewer ticks, which could
likely be improved further by setting call/return instruction flags and taking
advantage of the RAS.
TODO:
Add != operators to the PCState classes, defined trivially to be !(a==b).
Smooth out places where PCs are split apart, passed around, and put back
together later. I think this might happen in SPARC's fault code. Add ISA
specific constructors that allow setting PC elements without calling a bunch
of accessors. Try to eliminate the need for the branching() function. Factor
out Alpha's PAL mode pc bit into a separate flag field, and eliminate places
where it's blindly masked out or tested in the PC.
In the process make add skipFuction() to handle isa specific function skipping
instead of ifdefs and other ugliness. For almost all ABIs, 64 bit arguments can
only start in even registers. Size is now passed to getArgument() so that 32
bit systems can make decisions about register selection for 64 bit arguments.
The number argument is now passed by reference because getArgument() will need
to change it based on the size of the argument and the current argument number.
For ARM, if the argument number is odd and a 64-bit register is requested the
number must first be incremented to because all 64 bit arguments are passed
in an even argument register. Then the number will be incremented again to
access both halves of the argument.
This reduces the scope of those includes and makes it less likely for there to
be a dependency loop. This also moves the hashing functions associated with
ExtMachInst objects to be with the ExtMachInst definitions and out of
utility.hh.
Also move the "Fault" reference counted pointer type into a separate file,
sim/fault.hh. It would be better to name this less similarly to sim/faults.hh
to reduce confusion, but fault.hh matches the name of the type. We could change
Fault to FaultPtr to match other pointer types, and then changing the name of
the file would make more sense.
Instead of putting all object files into m5/object/__init__.py, interrogate
the importer to find out what should be imported.
Instead of creating a single file that lists all of the embedded python
modules, use static object construction to put those objects onto a list.
Do something similar for embedded swig (C++) code.
This allows one two different OS requirements for the same ISA to be handled.
Some OSes are compiled for a virtual address and need to be loaded into physical
memory that starts at address 0, while other bare metal tools generate
images that start at address 0.
Replace direct call to unserialize() on each SimObject with a pair of
calls for better control over initialization in both ckpt and non-ckpt
cases.
If restoring from a checkpoint, loadState(ckpt) is called on each
SimObject. The default implementation simply calls unserialize() if
there is a corresponding checkpoint section, so we get backward
compatibility for existing objects. However, objects can override
loadState() to get other behaviors, e.g., doing other programmed
initializations after unserialize(), or complaining if no checkpoint
section is found. (Note that the default warning for a missing
checkpoint section is now gone.)
If not restoring from a checkpoint, we call the new initState() method
on each SimObject instead. This provides a hook for state
initializations that are only required when *not* restoring from a
checkpoint.
Given this new framework, do some cleanup of LiveProcess subclasses
and X86System, which were (in some cases) emulating initState()
behavior in startup via a local flag or (in other cases) erroneously
doing initializations in startup() that clobbered state loaded earlier
by unserialize().
Enforce that the Python Root SimObject is instantiated only
once. The C++ Root object already panics if more than one is
created. This change avoids the need to track what the root
object is, since it's available from Root.getInstance() (if it
exists). It's now redundant to have the user pass the root
object to functions like instantiate(), checkpoint(), and
restoreCheckpoint(), so that arg is gone. Users who use
configs/common/Simulate.py should not notice.
If the user sets the environment variable M5_OVERRIDE_PY_SOURCE to
True, then imports that would normally find python code compiled into
the executable will instead first check in the absolute location where
the code was found during the build of the executable. This only
works for files in the src (or extras) directories, not automatically
generated files.
This is a developer feature!
Expand the help text on the --remote-gdb-port option so
people know you can use it to disable remote gdb without
reading the source code, and thus don't waste any time
trying to add a separate option to do that.
Clean up some gdb-related cruft I found while looking
for where one would add a gdb disable option, before
I found the comment that told me that I didn't need
to do that.
Time from base/time.hh has a name clash with Time from Ruby's
TypeDefines.hh. Eventually Ruby's Time should go away, so instead of
fixing this properly just try to avoid the clash.
- Make the initialized flag always available, not just in debug mode.
- Make the Initialized flag actually use several bits so it is very
unlikely that something that's uninitialized accidentally looks
initialized.
- Add an initialized() function that tells you if the current event is
indeed initialized.
- Clear the flags on delete so it can't be accidentally thought of as
initialized.
- Fix getFlags assert statement. "How did this ever work?"
Symbolic names should still be used, but this makes it easier to do
things like:
Event::Priority MyObject_Pri = Event::Default_Pri + 1
Remember that higher numbers are lower priority (should we fix this?)
1) Move alpha-specific code out of page_table.cc:serialize().
2) Begin serializing M5_pid and unserializing it, but adding an function to do optional paramIn so that old checkpoints don't need to be fixed up.
3) Fix up alpha startup code so that the unserialized M5_pid value is properly written to DTB_IPR_ASN.
4) Fix the memory unserialize that I forgot somehow in the last changeset.
5) Add in an agg_se.py to handle aggregated checkpoints. --bench foo-bar plus positional arguments foo bar are the only changes in usage from se.py.
Note this aggregation stuff has only been tested for Alpha and nothing else, though it should take a very minimal amount of work to get it to work with another ISA.
When accessing arguments for a syscall, the position of an argument depends on
the policies of the ISA, how much space preceding arguments took up, and the
"alignment" of the index for this particular argument into the number of
possible storate locations. This change adjusts getSyscallArg to take its
index parameter by reference instead of value and to adjust it to point to the
possible location of the next argument on the stack, basically just after the
current one. This way, the rules for the new argument can be applied locally
without knowing about other arguments since those have already been taken into
account implicitly.
All system calls have also been changed to reflect the new interface. In a
number of cases this made the implementation clearer since it encourages
arguments to be collected in one place in order and then used as necessary
later, as opposed to scattering them throughout the function or using them in
place in long expressions. It also discourages using getSyscallArg over and
over to retrieve the same value when a temporary would do the job.
This adds support for the 32-bit, big endian Power ISA. This supports both
integer and floating point instructions based on the Power ISA Book I v2.06.
Glibc often assumes that memory it receives from the kernel after a brk
system call will contain only zeros. This is important during a calloc,
because it won't clear the new memory itself. In the simulator, if the
new page exists, it will be cleared using this patch, to mimic the kernel's
functionality.
Get rid of misc.py and just stick misc things in __init__.py
Move utility functions out of SCons files and into m5.util
Move utility type stuff from m5/__init__.py to m5/util/__init__.py
Remove buildEnv from m5 and allow access only from m5.defines
Rename AddToPath to addToPath while we're moving it to m5.util
Rename read_command to readCommand while we're moving it
Rename compare_versions to compareVersions while we're moving it.
--HG--
rename : src/python/m5/convert.py => src/python/m5/util/convert.py
rename : src/python/m5/smartdict.py => src/python/m5/util/smartdict.py
Using a look up table changed the run time of the SPARC_FS solaris boot
regression from:
real 14m45.951s
user 13m57.528s
sys 0m3.452s
to:
real 12m19.777s
user 12m2.685s
sys 0m2.420s
Start by turning all of the *Source functions into classes
so we can do more calculations and more easily collect the data we need.
Add parameters to the new classes for indicating what sorts of flags the
objects should be compiled with so we can allow certain files to be compiled
without Werror for example.
This patch adds limited multithreading support in syscall-emulation
mode, by using the clone system call. The clone system call works
for Alpha, SPARC and x86, and multithreaded applications run
correctly in Alpha and SPARC.
Basically merge it in with Halted.
Also had to get rid of a few other functions that
called ThreadContext::deallocate(), including:
- InOrderCPU's setThreadRescheduleCondition.
- ThreadContext::exit(). This function was there to avoid terminating
simulation when one thread out of a multi-thread workload exits, but we
need to find a better (non-cpu-centric) way.
This is mainly to allow the unit test to run without requiring the standard
M5 stats from being initialized (e.g. sim_seconds, sim_ticks, host_seconds)
Bogus calls to ChunkGenerator with negative size were triggering
a new assertion that was added there.
Also did a little renaming and cleanup in the process.
We need to add a reference when an object is put on the C++ queue, and remove
a reference when the object is removed from the queue. This was not happening
before and caused a memory problem.
the primary identifier for a hardware context should be contextId(). The
concept of threads within a CPU remains, in the form of threadId() because
sometimes you need to know which context within a cpu to manipulate.
SE. Process still keeps track of the tc's it owns, but registration occurs
with the System, this eases the way for system-wide context Ids based on
registration.
across the subclasses. generally make it so that member data is _cpuId and
accessor functions are cpuId(). The ID val comes from the python (default -1 if
none provided), and if it is -1, the index of cpuList will be given. this has
passed util/regress quick and se.py -n4 and fs.py -n4 as well as standard
switch.
Since I never implemented a proper solution, put it back to something that
at least works for now. Once I add more event queues, I'll have to really
fix this though
The major thrust of this change is to limit the amount of code
duplication surrounding the code for these functions. This code also
adds two new message types called info and hack. Info is meant to be
less harsh than warn so people don't get confused and start thinking
that the simulator is broken. Hack is a way for people to add runtime
messages indicating that the simulator just executed a code "hack"
that should probably be fixed. The benefit of knowing about these
code hacks is that it will let people know what sorts of inaccuracies
or potential bugs might be entering their experiments. Finally, I've
added some flags to turn on and off these message types so command
line options can change them.
Make them easier to express by only having the cxx_type parameter which
has the full namespace name, and drop the cxx_namespace thing.
Add support for multiple levels of namespace.
Since the early days of M5, an event needed to know which event queue
it was on, and that data was required at the time of construction of
the event object. In the future parallelized M5, this sort of
requirement does not work well since the proper event queue will not
always be known at the time of construction of an event. Now, events
are created, and the EventQueue itself has the schedule function,
e.g. eventq->schedule(event, when). To simplify the syntax, I created
a class called EventManager which holds a pointer to an EventQueue and
provides the schedule interface that is a proxy for the EventQueue.
The intent is that objects that frequently schedule events can be
derived from EventManager and then they have the schedule interface.
SimObject and Port are examples of objects that will become
EventManagers. The end result is that any SimObject can just call
schedule(event, when) and it will just call that SimObject's
eventq->schedule function. Of course, some objects may have more than
one EventQueue, so this interface might not be perfect for those, but
they should be relatively few.
Targets look like libm5_debug.so. This target can be dynamically
linked into another C++ program and provide just about all of the M5
features. Additionally, this library is a standalone module that can
be imported into python with an "import libm5_debug" type command
line.
A whole bunch of stuff has been converted to use the new params stuff, but
the CPU wasn't one of them. While we're at it, make some things a bit
more stylish. Most of the work was done by Gabe, I just cleaned stuff up
a bit more at the end.
This should allow m5 to be more easily embedded into other simulators.
The m5 binary adds a simple main function which then calls into the m5
libarary to start the simulation. In order to make this work
correctly, it was necessary embed python code directly into the
library instead of the zipfile hack. This is because you can't just
append the zipfile to the end of a library the way you can a binary.
As a result, Python files that are part of the m5 simulator are now
compile, marshalled, compressed, and then inserted into the library's
data section with a certain symbol name. Additionally, a new Importer
was needed to allow python to get at the embedded python code.
Small additional changes include:
- Get rid of the PYTHONHOME stuff since I don't think anyone ever used
it, and it just confuses things. Easy enough to add back if I'm wrong.
- Create a few new functions that are key to initializing and running
the simulator: initSignals, initM5Python, m5Main.
The original code for creating libm5 was inspired by a patch Michael
Adler, though the code here was done by me.
- Add the option of redirecting stderr to a file. With the old
behaviour, stderr would follow stdout if stdout was to a file, but
stderr went to the host stderr if stdout went to the host stdout. The
new default maintains stdout and stderr going to the host. Now the
two can specify different files, but they will share a file descriptor
if the name of the files is the same.
- Add --output and --errout options to se.py to go with --input.
- insert warnings for deprecated m5ops
- reserve opcodes for Ali's stuff
- remove code for stuff that has been deprecated forever
- simplify m5op_alpha