1423 commits

Author SHA1 Message Date
Andreas Sandberg 4b8be6a90b kvm: Set the perf exclude_host attribute if available
The performance counting framework in Linux 3.2 and onwards supports
an attribute to exclude events generated by the host when running
KVM. Setting this attribute allows us to get more reliable
measurements of the guest machine. For example, on a highly loaded
system, the instruction counts from the guest can be severely
distorted by the host kernel (e.g., by page fault handlers).

This changeset introduces a check for the attribute and enables it in
the KVM CPU if present.
2013-10-15 10:09:23 +02:00
Andreas Sandberg e5d63d0535 kvm: Remove the unused hostFreq member from BaseKvmCPU 2013-11-26 17:40:58 +01:00
Steve Reinhardt, Nilay Vaish <nilay@cs.wisc.edu>, Ali Saidi <Ali.Saidi@ARM.com> de366a16f1 sim: simulate with multiple threads and event queues
This patch adds support for simulating with multiple threads, each of
which operates on an event queue.  Each sim object specifies which eventq
it would like to be on.  A custom barrier implementation is added, which
the eventqs use to synchronize.

The patch was tested in two different configurations:
1. ruby_network_test.py: in this simulation L1 cache controllers receive
   requests from the cpu. The requests are replied to immediately without
   any communication taking place with any other level.
2. twosys-tsunami-simple-atomic: this configuration simulates a client
   system and a server system connected by an Ethernet link.

We still lack the ability to communicate using message buffers or ports, but
other things, such as simulation start and end and synchronizing after every
quantum, are working.

Committed by: Nilay Vaish
2013-11-25 11:21:00 -06:00
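The quantum-based synchronization described in this commit can be sketched as follows. This is a toy model, not gem5's implementation: `EventQueue`, `service_until`, and `simulate` are hypothetical names, and Python's `threading.Barrier` stands in for the custom barrier the patch mentions.

```python
# Illustrative model only: not gem5 code; all names here are hypothetical.
import heapq
import threading


class EventQueue:
    """A per-thread event queue ordered by tick."""

    def __init__(self):
        self._heap = []   # entries are (tick, sequence number, callback)
        self._seq = 0     # tie-breaker so callbacks are never compared
        self.cur_tick = 0

    def schedule(self, tick, callback):
        heapq.heappush(self._heap, (tick, self._seq, callback))
        self._seq += 1

    def service_until(self, limit):
        """Run every event scheduled strictly before `limit`."""
        while self._heap and self._heap[0][0] < limit:
            tick, _, callback = heapq.heappop(self._heap)
            self.cur_tick = tick
            callback()
        self.cur_tick = limit


def simulate(queues, quantum, end_tick):
    """Run one thread per queue, synchronizing after every quantum."""
    barrier = threading.Barrier(len(queues))

    def worker(eq):
        limit = quantum
        while True:
            eq.service_until(min(limit, end_tick))
            barrier.wait()   # every queue reaches the quantum boundary together
            if limit >= end_tick:
                break
            limit += quantum

    threads = [threading.Thread(target=worker, args=(eq,)) for eq in queues]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


# Usage: two queues, each logging only to its own list.
log_a, log_b = [], []
qa, qb = EventQueue(), EventQueue()
qa.schedule(5, lambda: log_a.append(5))
qa.schedule(25, lambda: log_a.append(25))
qb.schedule(12, lambda: log_b.append(12))
simulate([qa, qb], quantum=10, end_tick=30)
```

Each queue only touches its own state here, which matches the commit's note that communication across queues (message buffers, ports) is not yet supported.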
Anthony Gutierrez 8a53da22c2 cpu: allow the fetch buffer to be smaller than a cache line
The fetch buffer in the current O3 CPU implementation
is only allowed to be the size of a cache line. Some
architectures, e.g., ARM, have fetch buffers smaller than a cache
line; see slide 22 at:
http://www.arm.com/files/pdf/at-exploring_the_design_of_the_cortex-a15.pdf

This patch allows the fetch buffer to be set to values smaller
than a cache line.
2013-11-15 13:21:15 -05:00
Andreas Hansson f028da7af7 cpu: Fix Checker register index use
This patch fixes an issue in the checker CPU register indexing. The
code will not even compile using LTO as deep inlining causes the used
index to be outside the array bounds.
2013-11-15 03:47:10 -05:00
Faissal Sleiman 397dc784fd cpu: Construct ROB with cpu params struct instead of each variable
Most other structures/stages get passed the cpu params struct.
2013-10-31 13:41:13 -05:00
Ali Saidi 79f81e2641 cpu: Fix O3 issue with load+barrier instructions.
Fix a problem in the O3 CPU for instructions that are both
memory loads and memory barriers (e.g. load acquire) and
that target uncacheable memory. This combination can confuse the
commit stage into committing an instruction that hasn't
executed and gotten its value yet. At the same time, refactor
the code slightly to remove duplication between two of
the cases.
2013-10-31 13:41:13 -05:00
Matt Horsnell 6decd70bfb cpu: add consistent guarding to *_impl.hh files. 2013-10-17 10:20:45 -05:00
Faissal Sleiman 1746eb4a11 cpu: Removing an unused variable in rename 2013-10-17 10:20:45 -05:00
Faissal Sleiman 9195f1fbfd cpu: Change IEW DPRINTF to use IEW debug flag
IEW DPRINTF uses Decode debug flag, which appears to be a copying error. This
patch changes this to the IEW Debug flag.
2013-10-17 10:20:45 -05:00
Faissal Sleiman e516531bd0 cpu: Put in assertions to check for maximum supported LQ/SQ size
LSQSenderState represents the LQ/SQ index using uint8_t, which supports up to
256 entries (including the sentinel entry). Sending packets to memory with a
higher index than 255 truncates the index, such that the response matches the
wrong entry. For instance, this can result in a deadlock if a store completion
does not clear the head entry.
2013-10-17 10:20:45 -05:00
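The truncation described in this commit can be illustrated with a small sketch. It is not gem5 code: `to_uint8`, `checked_index`, and `LSQ_INDEX_LIMIT` are hypothetical names that only model storing an index in an 8-bit field and guarding it with an assertion.

```python
# Illustrative model only: mimics storing an LQ/SQ index in a uint8_t field.
LSQ_INDEX_LIMIT = 256  # number of distinct values a uint8_t can hold (0..255)


def to_uint8(index: int) -> int:
    """Return the value a uint8_t would hold after `index` is assigned to it."""
    return index & 0xFF


def checked_index(index: int) -> int:
    """Reject indices that would not survive a round trip through uint8_t."""
    assert 0 <= index < LSQ_INDEX_LIMIT, "LQ/SQ index does not fit in uint8_t"
    return index


# Without a check, index 300 silently wraps around and aliases another entry.
truncated = to_uint8(300)
```

With the check in place, an oversized queue fails loudly at the point of use instead of matching a response against the wrong entry.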
Ali Saidi cf266f05a9 cpu: Fix O3 uncacheable load that is replayed but misses the TLB
This change fixes an issue in the O3 CPU where execution of an uncacheable
instruction is attempted before it reaches the head of the ROB. It is
determined to be uncacheable, and is replayed, but a PanicFault is attached
to the instruction to make sure that it is properly executed before
committing. If the TLB entry it was using is replaced in the intervening
time, the TLB returns a delayed translation when the load is replayed at
the head of the ROB; however, the LSQ code can't differentiate between the
old fault and the new one. If the translation isn't complete it can't
be faulting, so clear the fault.
2013-10-17 10:20:45 -05:00
Andreas Sandberg cc42e87b85 kvm: Fix latency calculation of IPR accesses
When handling IPR accesses in doMMIOAccess, the KVM CPU used
clockEdge() to convert between cycles and ticks. This is incorrect
since doMMIOAccess is supposed to return a latency in ticks rather
than when the access is done. This changeset fixes this issue by
returning clockPeriod() * ipr_delay instead.
2013-10-16 18:12:15 +02:00
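The difference between an absolute clock edge and a latency can be seen in a short sketch. `ClockedModel` is a hypothetical stand-in, and its `clock_edge` behaviour (align up to the next edge, then add whole cycles) is an assumption made for illustration rather than a statement about gem5's exact API.

```python
# Illustrative model only; method names echo the commit message, not a real API.
class ClockedModel:
    def __init__(self, period_ticks: int, cur_tick: int = 0):
        self.period_ticks = period_ticks  # ticks per clock cycle
        self.cur_tick = cur_tick          # current absolute simulation time

    def clock_period(self) -> int:
        return self.period_ticks

    def clock_edge(self, cycles: int = 0) -> int:
        """Absolute tick of the edge `cycles` cycles after the next clock edge."""
        # Ceiling division aligns the current tick up to the next edge.
        next_edge = -(-self.cur_tick // self.period_ticks) * self.period_ticks
        return next_edge + cycles * self.period_ticks


cpu = ClockedModel(period_ticks=500, cur_tick=10_000)
ipr_delay = 3  # delay in cycles

absolute_time = cpu.clock_edge(ipr_delay)   # a point in time, not a duration
latency = cpu.clock_period() * ipr_delay    # a duration in ticks
```

Returning `absolute_time` where a latency is expected overstates the delay by roughly the current simulation time, which grows as the simulation progresses.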
Yasuko Eckert 1bb293d1e7 arch/x86: add support for explicit CC register file
Convert condition code registers from being specialized
("pseudo") integer registers to using the recently
added CC register class.

Nilay Vaish also contributed to this patch.
2013-10-15 14:22:44 -04:00
Yasuko Eckert 2c293823aa cpu: add a condition-code register class
Add a third register class for condition codes,
in parallel with the integer and FP classes.
No ISAs use the CC class at this point though.
2013-10-15 14:22:44 -04:00
Steve Reinhardt 5526221847 cpu/o3: clean up rename map and free list
Restructured rename map and free list to clean up some
extraneous code and separate out common code that can
be reused across different register classes (int and fp
at this point).  Both components now consist of a set
of Simple* objects that are stand-alone rename map &
free list for each class, plus a Unified* object that
presents a unified interface across all register
classes and then redirects accesses to the appropriate
Simple* object as needed.

Moved free list initialization to PhysRegFile to better
isolate knowledge of physical register index mappings
to that class (and remove the need to pass a number
of parameters to the free list constructor).

Causes a small change to these stats:
  cpu.rename.int_rename_lookups
  cpu.rename.fp_rename_lookups
because they are now categorized on a per-operand basis
rather than a per-instruction basis.
That is, an instruction with mixed fp/int/misc operand
types will have each operand categorized independently,
where previously the lookup was categorized based on
the instruction type.
2013-10-15 14:22:44 -04:00
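The Simple*/Unified* split described in this commit might look roughly like the sketch below. The class names follow the commit's wording, but the methods, the two-class layout, and the lookup counters are assumptions for illustration only.

```python
# Illustrative model only: structure inferred from the commit text, not gem5 code.
class SimpleRenameMap:
    """Stand-alone rename map and free list for a single register class."""

    def __init__(self, num_arch_regs, free_list):
        self.free_list = free_list
        # Give every architectural register an initial physical register.
        self.map = [free_list.pop(0) for _ in range(num_arch_regs)]

    def lookup(self, arch_reg):
        return self.map[arch_reg]

    def rename(self, arch_reg):
        """Allocate a new physical register; return (new, previous)."""
        prev_phys = self.map[arch_reg]
        self.map[arch_reg] = self.free_list.pop(0)
        return self.map[arch_reg], prev_phys


class UnifiedRenameMap:
    """Unified interface that redirects each access to the right simple map."""

    def __init__(self, int_map, fp_map, num_int_regs):
        self.int_map = int_map
        self.fp_map = fp_map
        self.num_int_regs = num_int_regs
        self.lookups = {"int": 0, "fp": 0}  # counted per operand

    def _select(self, unified_idx):
        if unified_idx < self.num_int_regs:
            return "int", self.int_map, unified_idx
        return "fp", self.fp_map, unified_idx - self.num_int_regs

    def lookup(self, unified_idx):
        cls, simple_map, rel_idx = self._select(unified_idx)
        self.lookups[cls] += 1
        return simple_map.lookup(rel_idx)

    def rename(self, unified_idx):
        _, simple_map, rel_idx = self._select(unified_idx)
        return simple_map.rename(rel_idx)


int_map = SimpleRenameMap(2, list(range(0, 8)))
fp_map = SimpleRenameMap(2, list(range(8, 16)))
rmap = UnifiedRenameMap(int_map, fp_map, num_int_regs=2)

# One instruction with an int source (index 1) and an fp source (index 2).
int_src = rmap.lookup(1)
fp_src = rmap.lookup(2)
new_phys, old_phys = rmap.rename(3)
```

Because the counters are updated per operand, the mixed int/fp instruction above contributes one lookup to each class.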
Steve Reinhardt 219c423f1f cpu: rename *_DepTag constants to *_Reg_Base
Make these names more meaningful.

Specifically, made these substitutions:

s/FP_Base_DepTag/FP_Reg_Base/g;
s/Ctrl_Base_DepTag/Misc_Reg_Base/g;
s/Max_DepTag/Max_Reg_Index/g;
2013-10-15 14:22:43 -04:00
Steve Reinhardt 9bd017b8ae cpu/o3: clean up scoreboard object
It had a bunch of fields (and associated constructor
parameters) that it didn't really use, and the array
initialization was needlessly verbose.

Also just hardwired the getReg() method to always
return true for misc regs, rather than having an array
of bits that we always kept marked as ready.
2013-10-15 14:22:43 -04:00
Steve Reinhardt c009d0eb2a cpu/o3: clean up physical register file
No need for PhysRegFile to be a template class, or
have a pointer back to the CPU.  Also made some methods
for checking the physical register type (int vs. float)
based on the phys reg index, which will come in handy later.
2013-10-15 14:22:43 -04:00
Steve Reinhardt 06d246ab4a cpu/inorder: merge register class enums
The previous patch introduced a RegClass enum to clean
up register classification.  The inorder model already
had an equivalent enum (RegType) that was used internally.
This patch replaces RegType with RegClass to get rid
of the now-redundant code.
2013-10-15 14:22:43 -04:00
Steve Reinhardt 7aa423acad cpu: clean up architectural register classification
Move from a poorly documented scheme where the mapping
of unified architectural register indices to register
classes is hardcoded all over to one where there's an
enum for the register classes and a function that
encapsulates the mapping.
2013-10-15 14:22:42 -04:00
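A minimal sketch of an enum plus a single classification function is shown below. The constant names are the ones from the later `*_Reg_Base` rename in this log; their values, the `RegClass` members, and the function name `reg_idx_to_class` are hypothetical.

```python
# Illustrative model only: boundary values are made up for the example.
from enum import Enum

FP_Reg_Base = 32      # first unified index of the floating-point registers
Misc_Reg_Base = 64    # first unified index of the misc registers
Max_Reg_Index = 128   # one past the last valid unified index


class RegClass(Enum):
    INT = "int"
    FLOAT = "float"
    MISC = "misc"


def reg_idx_to_class(reg_idx):
    """Map a unified architectural index to (class, class-relative index)."""
    if not 0 <= reg_idx < Max_Reg_Index:
        raise ValueError(f"register index {reg_idx} out of range")
    if reg_idx < FP_Reg_Base:
        return RegClass.INT, reg_idx
    if reg_idx < Misc_Reg_Base:
        return RegClass.FLOAT, reg_idx - FP_Reg_Base
    return RegClass.MISC, reg_idx - Misc_Reg_Base
```

Keeping the boundaries in one function means callers never compare against the base constants directly, which is the encapsulation the commit describes.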
Andreas Sandberg 0dd6f87e63 kvm: Service events in the instruction event queues
This changeset adds calls to service the instruction event queues
that accidentally went missing from commit [0063c7dd18ec]. The
original commit only included the code needed to schedule instruction
stops from KVM and missed the functionality to actually service the
events.
2013-10-03 11:00:18 +02:00
Andreas Sandberg 469f2e31cf kvm: Add support for thread-specific instruction events
Instruction events are currently ignored when executing in KVM. This
changeset adds support for triggering KVM exits based on instruction
counts using hardware performance counters. Depending on the
underlying performance counter implementation, there might be some
inaccuracies due to instructions being counted in the host kernel when
entering/exiting KVM.

Due to limitations/bugs in Linux's performance counter interface, we
can't reliably change the period of an overflow counter. We work
around this issue by detaching and reattaching the counter if we need
to reconfigure it.
2013-09-30 09:53:52 +02:00
Andreas Sandberg 86bade714e kvm: FPU synchronization support on x86
This changeset adds support for synchronizing the FPU and SIMD state
of a virtual x86 CPU with gem5. It supports both the XSave API and the
KVM_(GET|SET)_FPU kernel API. The XSave interface can be disabled
using the useXSave parameter (in case of kernel
issues). Unfortunately, KVM_(GET|SET)_FPU interface seems to be buggy
in some kernels (specifically, the MXCSR register isn't always
synchronized), which means that it might not be possible to
synchronize MXCSR on old kernels without the XSave interface.

This changeset depends on the __float80 type in gcc and might not
build using llvm.
2013-09-30 09:43:43 +02:00
Andreas Sandberg 30841926a3 kvm: x86: Fix segment registers to make them VMX compatible
There are cases when the segment registers in gem5 are not compatible
with VMX. This changeset works around all known such issues. Specifically:

* The accessed bits in CS, SS, DS, ES, FS, GS are forced to 1.
* The busy bit in TR is forced to 1.
* The protection level of SS is forced to the same protection level as
  CS. The difference /seems/ to be caused by a bug in gem5's x86
  implementation.
2013-09-30 09:36:54 +02:00
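The three workarounds listed in this commit can be sketched as a pure function over a simplified segment state. The bit positions follow the x86 descriptor type field (bit 0 is the accessed bit for code/data segments, bit 1 separates a busy TSS from an available one); the `Segment` type and `force_vmx_compatible` are hypothetical and do not reflect how gem5 stores segment attributes.

```python
# Illustrative model only: a simplified view of segment attributes.
from dataclasses import dataclass, replace

SEG_TYPE_ACCESSED = 0x1  # bit 0 of the type field for code/data segments
TSS_TYPE_BUSY = 0x2      # bit 1: type 0x9 is an available TSS, 0xB a busy one


@dataclass(frozen=True)
class Segment:
    type: int  # 4-bit descriptor type field
    dpl: int   # descriptor privilege level


def force_vmx_compatible(segs):
    """Return a copy of `segs` with the three workarounds applied."""
    fixed = dict(segs)
    for name in ("cs", "ss", "ds", "es", "fs", "gs"):
        seg = fixed[name]
        fixed[name] = replace(seg, type=seg.type | SEG_TYPE_ACCESSED)
    fixed["tr"] = replace(fixed["tr"], type=fixed["tr"].type | TSS_TYPE_BUSY)
    # SS takes the protection level of CS.
    fixed["ss"] = replace(fixed["ss"], dpl=fixed["cs"].dpl)
    return fixed


state = {
    "cs": Segment(type=0xA, dpl=0),
    "ss": Segment(type=0x2, dpl=3),
    "ds": Segment(type=0x2, dpl=0),
    "es": Segment(type=0x2, dpl=0),
    "fs": Segment(type=0x2, dpl=0),
    "gs": Segment(type=0x2, dpl=0),
    "tr": Segment(type=0x9, dpl=0),
}
fixed_state = force_vmx_compatible(state)
```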
Andreas Sandberg e5c319db43 kvm: Add x86 segment register verification to help debugging 2013-09-25 12:35:21 +02:00
Andreas Sandberg 599b59b387 kvm: Initial x86 support
This changeset adds support for KVM on x86. Full support is split
across a number of commits since some features are relatively
complex. This changeset includes support for:

 * Integer state synchronization (including segment regs)
 * CPUID (gem5's CPUID values are inserted into KVM)
 * x86 legacy IO (remapped and handled by gem5's memory system)
 * Memory mapped IO
 * PCI
 * MSRs
 * State dumping

Most of the functionality is fairly straightforward. There are some
quirks to support PCI enumerations since this is done in the TLB(!) in
the simulated CPUs. We currently replicate some of that code.

Unlike the ARM implementation, the x86 implementation of the virtual
CPU does not use the cycles hardware counter. KVM on x86 simulates the
time stamp counter (TSC) in the kernel. If we just measure host cycles
using perfevent, we might end up measuring a slightly different number
of cycles. If we don't get the cycle accounting right, we might end up
rewinding the TSC, with all kinds of chaos as a result.

An additional feature of the KVM CPU on x86 is extended state
dumping. This enables Python scripts controlling the simulator to
request dumping of a subset of the processor state. The following
methods are currently supported:

 * dumpFpuRegs
 * dumpIntRegs
 * dumpSpecRegs
 * dumpDebugRegs
 * dumpXCRs
 * dumpXSave
 * dumpVCpuEvents
 * dumpMSRs

Known limitations:
  * M5 ops are currently not supported.
  * FPU synchronization is not supported (only affects CPU switching).

Both of the limitations will be addressed in separate commits.
2013-09-25 12:24:26 +02:00
Andreas Sandberg cd9cd85ce9 kvm: Correctly handle the return value from handleIpr(Read|Write)
The KVM base class incorrectly assumed that handleIprRead and
handleIprWrite both return ticks. This is not the case, instead they
return cycles. This changeset converts the returned cycles to ticks
when handling IPR accesses.
2013-09-19 17:55:04 +02:00
Andreas Sandberg 211c10b46d kvm: Fix a case where the run timers weren't armed properly
There is a possibility that the timespec used to arm a timer becomes
zero if the number of ticks used when arming a timer is close to the
resolution of the timer. Due to the semantics of POSIX timers, this
actually disarms the timer. This changeset fixes this issue by
eliminating the rounding error (we always round away from zero
now). It also reuses the minimum number of cycles, which were
previously only used for cycle-based timers, to calculate a more
useful resolution.
2013-09-19 17:55:03 +02:00
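The rounding problem in this commit can be reproduced with integer arithmetic. In the sketch below, `ticks_to_timespec` is a hypothetical helper, and the tick rate of 10^12 ticks per second (1 ps per tick) is an assumption chosen for the example.

```python
# Illustrative model only: converts simulator ticks to a (seconds, ns) pair.
NS_PER_SECOND = 1_000_000_000


def ticks_to_timespec(ticks, ticks_per_second, round_away_from_zero):
    """Return (tv_sec, tv_nsec) for a non-negative number of ticks."""
    scaled = ticks * NS_PER_SECOND
    if round_away_from_zero:
        total_ns = -(-scaled // ticks_per_second)  # ceiling division
    else:
        total_ns = scaled // ticks_per_second      # truncation toward zero
    return divmod(total_ns, NS_PER_SECOND)


TICKS_PER_SECOND = 10**12  # assumed: one tick is one picosecond

# 500 ticks is half a nanosecond, below the timer's resolution.
truncated = ticks_to_timespec(500, TICKS_PER_SECOND, round_away_from_zero=False)
rounded = ticks_to_timespec(500, TICKS_PER_SECOND, round_away_from_zero=True)
```

A `(0, 0)` value passed to a POSIX timer disarms it, so the truncated result would silently turn the timer off, while the rounded result arms it for the smallest representable interval.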
Joel Hestness a1f9081bab cpu: Dynamically instantiate O3 CPU LSQUnits
Previously, the LSQ would instantiate MaxThreads LSQUnits in the body of its
object, but it would only initialize numThreads LSQUnits as specified by the
user. This had the effect of leaving some LSQUnits uninitialized when the
number of threads was less than MaxThreads, and when adding statistics to the
LSQUnit that must be initialized, this caused the stats initialization check to
fail. By dynamically instantiating LSQUnits, they are all initialized and this
avoids uninitialized LSQUnits from floating around during runtime.
2013-09-11 15:34:50 -05:00
Andreas Hansson 19a5b68db7 arch: Resurrect the NOISA build target and rename it NULL
This patch makes it possible to once again build gem5 without any
ISA. The main purpose is to enable work around the interconnect and
memory system without having to build any CPU models or device models.

The regress script is updated to include the NULL ISA target. Currently
no regressions make use of it, but all the testers could (and perhaps
should) transition to it.

--HG--
rename : build_opts/NOISA => build_opts/NULL
rename : src/arch/noisa/SConsopts => src/arch/null/SConsopts
rename : src/arch/noisa/cpu_dummy.hh => src/arch/null/cpu_dummy.hh
rename : src/cpu/intr_control.cc => src/cpu/intr_control_noisa.cc
2013-09-04 13:22:57 -04:00
Andreas Hansson ea40297018 cpu: Move the branch predictor out of the BaseCPU
The branch predictor is guarded by having either the in-order or
out-of-order CPU as one of the available CPU models and therefore
should not be used in the BaseCPU. This patch moves the parameter to
the relevant CPU classes.
2013-09-04 13:22:56 -04:00
Andreas Hansson bb1d2f3957 arch: Header clean up for NOISA resurrection
This patch is a first step to getting NOISA working again. A number of
redundant includes make life more difficult than it has to be and this
patch simply removes them. There are also some redundant forward
declarations removed.
2013-09-04 13:22:55 -04:00
Andreas Hansson c6062a3981 cpu: Fix timing CPU isDrained comment formatting
This patch fixes up the comment formatting for isDrained in the timing
CPU.
2013-08-20 11:21:27 -04:00
Lena Olson 646c4a23ca cpu: Accurately count idle cycles for simple cpu
Added a couple of missing updates to the notIdleFraction stat. Without
these, it sometimes gives a (not) idle fraction that is greater than 1
or less than 0.
2013-08-19 03:52:35 -04:00
Sascha Bischoff e553844efc cpu: Fix TrafficGen trace playback
This patch addresses an issue with trace playback in the TrafficGen
where the trace was reset but the header was not read from the trace
when a captured trace was played back for a second time. This resulted
in parsing errors as the expected message was not found in the trace
file.

The header check is moved to an init function which is called by the
constructor and when the trace is reset. This ensures that the trace
header is read each time when the trace is replayed.

This patch also addresses a small formatting issue in a panic.
2013-08-19 03:52:32 -04:00
Andreas Hansson 7a61f667f0 cpu: Fix timing CPU drain check
This patch modifies the SimpleTimingCPU drain check to also consider
the fetch event. Previously, there was an assumption that there is
never a fetch event scheduled if the CPU is not executing
microcode. However, when a context is activated, a fetch event is
scheduled, and microPC() is zero.
2013-08-19 03:52:30 -04:00
Andreas Hansson 9b2effd9e2 cpu: Fix a bug in the O3 CPU introduced by the cache line patch
This patch fixes a bug in the O3 fetch stage that was introduced when
the cache line size was moved to the system. By mistake, the
initialisation and resetting of the fetch stage was merged and put in
the constructor. The resetting is now re-added where it should be.
2013-08-19 03:52:24 -04:00
Andreas Sandberg b5bb2a25aa cpu: Remove unused getBranchPred() method from BaseCPU
Remove unused virtual getBranchPred() method from BaseCPU as it is not
implemented by any of the CPU models. It used to always return NULL.
2013-07-19 11:52:07 +02:00
Andreas Hansson d4273cc9a6 mem: Set the cache line size on a system level
This patch removes the notion of a peer block size and instead sets
the cache line size on the system level.

Previously the size was set per cache, and communicated through the
interconnect. There were plenty of checks to ensure that everyone had the
same size specified, and these checks are now removed. Another benefit
that is not yet harnessed is that the cache line size is now known at
construction time, rather than after the port binding. Hence, the
block size can be locally stored and does not have to be queried every
time it is used.

A follow-on patch updates the configuration scripts accordingly.
2013-07-18 08:31:16 -04:00
Umesh Bhaskar 5ba9e7afe2 debug: Fixes the issue wherein debug symbols were not getting dumped into trace files for SE mode 2013-07-15 11:08:34 -04:00
Akash Bagdia 7d7ab73862 sim: Add the notion of clock domains to all ClockedObjects
This patch adds the notion of source- and derived-clock domains to the
ClockedObjects. As such, all clock information is moved to the clock
domain, and the ClockedObjects are grouped into domains.

The clock domains are either source domains, with a specific clock
period, or derived domains that have a parent domain and a divider
(potentially chained). For a piece of logic that runs at a derived clock
(a ratio of the clock its parent is running at) the necessary derived
clock domain is created from its corresponding parent clock
domain. For now, the derived clock domain only supports a divider,
thus ensuring a lower speed compared to its parent. Multiplier
functionality implies a PLL logic that has not been modelled yet
(create a separate clock instead).

The clock domains should be used as a mechanism to provide a
controllable clock source that affects clock for every clocked object
lying beneath it. The clock of the domain can (in a future patch) be
controlled by a handler responsible for dynamic frequency scaling of
the respective clock domains.

All the config scripts have been retro-fitted with clock domains. For
the System a default SrcClockDomain is created. For CPUs that run at a
different speed than the system, there is a separate clock domain
created. This domain incorporates the CPU and the associated
caches. As before, Ruby runs under its own clock domain.

The clock periods of all domains are pre-computed, such that no virtual
functions or multiplications are needed when calling
clockPeriod. Instead, the clock period is pre-computed when any
changes occur. For this to be possible, each clock domain tracks its
children.
2013-06-27 05:49:49 -04:00
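The source/derived relationship and the pre-computed period can be modelled in a few lines. The class names echo the ones used in the commit, but this is a simplified Python model with hypothetical method names, not the simulator's classes.

```python
# Illustrative model only: periods are cached and pushed down to children.
class ClockDomain:
    def __init__(self):
        self.children = []   # derived domains that depend on this one
        self._period = None

    def clock_period(self):
        # No multiplication or parent walk at query time.
        return self._period

    def _propagate(self):
        for child in self.children:
            child.update()


class SrcClockDomain(ClockDomain):
    def __init__(self, period):
        super().__init__()
        self.set_period(period)

    def set_period(self, period):
        self._period = period
        self._propagate()


class DerivedClockDomain(ClockDomain):
    def __init__(self, parent, divider):
        super().__init__()
        self.parent = parent
        self.divider = divider
        parent.children.append(self)
        self.update()

    def update(self):
        # A divider lengthens the period, so the derived clock is slower.
        self._period = self.parent.clock_period() * self.divider
        self._propagate()


src = SrcClockDomain(period=1000)
half = DerivedClockDomain(src, divider=2)
quarter = DerivedClockDomain(half, divider=2)   # chained derivation
before = (src.clock_period(), half.clock_period(), quarter.clock_period())

src.set_period(500)   # e.g. a frequency change applied at the source
after = (src.clock_period(), half.clock_period(), quarter.clock_period())
```

Tracking children is what makes the eager update possible: a change at the source walks down the tree once, and every later `clock_period()` call is a plain attribute read.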
Akash Bagdia 7eccb1b779 config: Remove redundant explicit setting of default clocks
This patch removes the explicit setting of the clock period for
certain instances of CoherentBus, NonCoherentBus and IOCache where the
specified clock is same as the default value of the system clock. As
all the values used are the defaults, there are no performance
changes. There are similar cases where the toL2Bus is set to use the
parent CPU clock which is already the default behaviour.

The main motivation for these simplifications is to ease the
introduction of clock domains.
2013-06-27 05:49:49 -04:00
Andreas Hansson 10650fc525 cpu: Consider instructions waiting for FU completion in draining
This patch changes the IEW drain check to include the FU pool as there
can be instructions that are "stored" in FU completion events and thus
not covered by the existing checks. With this patch, we simply include
a check to see if all the FUs are considered non-busy in the next
tick.

Without this patch, the pc-switcheroo-full regression fails after
minor changes to the cache timing (aligning to clock edge).
2013-06-27 05:49:49 -04:00
Andreas Sandberg 6151c0f7f4 kvm: Use the address finalization code in the TLB
Reuse the address finalization code in the TLB instead of replicating
it when handling MMIO. This patch also adds support for injecting
memory mapped IPR requests into the memory system.
2013-06-18 16:10:22 +02:00
Andreas Sandberg 64270b19c3 kvm: Add more VM stats
This changeset adds the following stats to KVM:
 * numVMHalfEntries: Number of entries into KVM to finalize pending
   IO operations without executing guest instructions. These typically
   happen as a result of a drain where the guest must finalize some
   operations before the guest state is consistent.
 * numExitSignal: Number of VM exits that have been triggered by a
   signal. These usually happen as a result of the timer that limits
   the time spent in KVM.
2013-06-11 09:43:05 +02:00
Andreas Sandberg c97a99110b kvm: Separate host frequency from simulated CPU frequency
We used to use the KVM CPU's clock to specify the host frequency. This
was not ideal for several reasons. One of them is that the clock
parameter of a CPU determines the frequency of some of the components
connected to the CPU. This changeset adds a separate hostFreq
parameter that should be used to specify the host frequency until we
add code to autodetect it. The hostFactor should still be used to
specify the conversion factor between the host performance and that of
the simulated system.
2013-06-11 09:24:55 +02:00
Andreas Sandberg 4f002930bc kvm: Don't handle IO and execute in the same tick
We currently execute instructions in the guest and then handle any IO
request right after we break out of the virtualized environment. This
has the effect of executing IO requests in the exact same tick as the
first instruction in the sequence that was just run. There seem to be
cases where this simplification upsets some timing-sensitive devices.

This changeset splits execute and IO (and other services) across
multiple ticks. This is implemented by adding a separate
RunningService state to the CPU state machine. When a VM requires
service, it enters into this state and pending IO is then serviced in
the future instead of immediately. The delay between getting the
request and servicing it depends on the number of cycles executed in
the guest, which allows other components to catch up with the CPU.
2013-06-11 09:24:51 +02:00
Andreas Sandberg df059f45a0 kvm: Maintain a local instruction counter and update totalNumInsts
Update the system's totalNumInsts counter when exiting from KVM and
maintain an internal absolute instruction count instead of relying on
the one from perf.
2013-06-11 09:24:40 +02:00
Andreas Sandberg 0793d0727b cpu: Add support for scheduling multiple inst/load stop events
Currently, the only way to get a CPU to stop after a fixed number of
instructions/loads is to set a property on the CPU that causes a
SimLoopExitEvent to be scheduled when the CPU is constructed. This is
clearly not ideal in cases where the simulation script wants the CPU
to stop at multiple instruction counts (e.g., SimPoint generation).

This changeset adds the methods scheduleInstStop() and
scheduleLoadStop() to the BaseCPU. These methods are exported to
Python and are designed to be used from the simulation script. By
using these methods instead of the old properties, a simulation script
can schedule a stop at any point during simulation or schedule
multiple stops. The number of instructions specified when scheduling a
stop is relative to the current point of execution.
2013-06-11 09:18:25 +02:00
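The relative-count semantics described in this commit can be shown with a toy model. `ToyCPU`, `schedule_inst_stop`, and `run` are hypothetical stand-ins written for this sketch; they do not reproduce the signature of the real `scheduleInstStop()` method.

```python
# Illustrative model only: a CPU that jumps straight to its next stop point.
class ToyCPU:
    def __init__(self):
        self.num_insts = 0
        self._stops = []   # (absolute instruction count, cause)

    def schedule_inst_stop(self, insts, cause):
        """Schedule a stop `insts` instructions after the current point."""
        self._stops.append((self.num_insts + insts, cause))
        self._stops.sort()

    def run(self):
        """Execute until the next scheduled stop and return its cause."""
        if not self._stops:
            return None
        target, cause = self._stops.pop(0)
        self.num_insts = target
        return cause


cpu = ToyCPU()
cpu.schedule_inst_stop(1000, "first sample")
cpu.schedule_inst_stop(5000, "end of region")

first = cpu.run()                          # stops at instruction 1000
cpu.schedule_inst_stop(500, "extra stop")  # relative: lands at 1500
second = cpu.run()
count_at_second = cpu.num_insts
third = cpu.run()
```

Scheduling the extra stop mid-run is the case the old constructor-time properties could not express.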