doc: Reorganize documentation site content
Take the existing ACRN technical documentation and reorganize its presentation to be persona and use-case based, in preparation for adding new scenario/use-case based architecture introduction and getting started documents. Introduce a more graphical home page and theme color tweaks.

Signed-off-by: David B. Kinder <david.b.kinder@intel.com>

@@ -1,954 +0,0 @@
.. _APL_GVT-g-hld:

GVT-g high-level design
#######################

Introduction
************

Purpose of this Document
========================

This high-level design (HLD) document describes the usage requirements and high level design for Intel |reg| Graphics Virtualization Technology for shared virtual :term:`GPU` technology (:term:`GVT-g`) on Apollo Lake-I SoCs.

This document describes:

- The different GPU virtualization techniques
- GVT-g mediated pass-through
- High level design
- Key components
- GVT-g new architecture differentiation

Audience
========

This document is for developers, validation teams, architects and maintainers of Intel |reg| GVT-g for the Apollo Lake SoCs.

The reader should have some familiarity with the basic concepts of system virtualization and Intel processor graphics.

Reference Documents
===================

The following documents were used as references for this specification:

- Paper in USENIX ATC '14 - *Full GPU Virtualization Solution with Mediated Pass-Through* - https://www.usenix.org/node/183932

- Hardware Specification - PRMs - https://01.org/linuxgraphics/documentation/hardware-specification-prms

Background
**********

Intel GVT-g is an enabling technology in emerging graphics virtualization scenarios. It adopts a full GPU virtualization approach based on mediated pass-through technology to achieve good performance, scalability, and secure isolation among Virtual Machines (VMs). A virtual GPU (vGPU), with full GPU features, is presented to each VM so that a native graphics driver can run directly inside a VM.

Intel GVT-g technology for Apollo Lake (APL) has been implemented in open source hypervisors or Virtual Machine Monitors (VMMs):

- Intel GVT-g for ACRN, also known as "AcrnGT"
- Intel GVT-g for KVM, also known as "KVMGT"
- Intel GVT-g for Xen, also known as "XenGT"

The core vGPU device model is released under the BSD/MIT dual license, so it can be reused in other proprietary hypervisors.

Intel has a portfolio of graphics virtualization technologies (:term:`GVT-g`, :term:`GVT-d`, and :term:`GVT-s`). GVT-d and GVT-s are outside the scope of this document.

This HLD applies to the Apollo Lake platform only. Support of other hardware is outside the scope of this HLD.

Targeted Usages
===============

The main targeted usage of GVT-g is in automotive applications, such as:

- An Instrument cluster running in one domain
- An In-Vehicle Infotainment (IVI) solution running in another domain
- Additional domains for specific purposes, such as Rear Seat Entertainment or video camera capturing

.. figure:: images/APL_GVT-g-ive-use-case.png
   :width: 900px
   :align: center
   :name: ive-use-case

   IVE Use Case

Existing Techniques
===================

A graphics device is no different from any other I/O device with respect to how the device I/O interface is virtualized. Therefore, existing I/O virtualization techniques can be applied to graphics virtualization. However, none of the existing techniques can meet the general requirements of performance, scalability, and secure isolation simultaneously. In this section, we review the pros and cons of each technique in detail, enabling the audience to understand the rationale behind the entire GVT-g effort.

Emulation
---------

A device can be emulated fully in software, including its I/O registers and internal functional blocks. Because there is no dependency on the underlying hardware capability, compatibility can be achieved across platforms. However, due to the CPU emulation cost, this technique is usually used only for legacy devices such as a keyboard, mouse, and VGA card. Fully emulating a modern accelerator such as a GPU would involve great complexity and extremely low performance. It may be acceptable for use in a simulation environment, but it is definitely not suitable for production usage.

API Forwarding
--------------

API forwarding, or a split driver model, is another widely used I/O virtualization technique. It has been used in commercial virtualization products, for example, VMware*, PCoIP*, and Microsoft* RemoteFx*. It is a natural path when researchers study a new type of I/O virtualization usage, for example, when GPGPU computing in a VM was initially proposed. Intel GVT-s is based on this approach.

The architecture of API forwarding is shown in :numref:`api-forwarding`:

.. figure:: images/APL_GVT-g-api-forwarding.png
   :width: 400px
   :align: center
   :name: api-forwarding

   API Forwarding

A frontend driver is employed to forward high-level API calls (OpenGL, DirectX, and so on) from inside a VM to a backend driver in the Hypervisor for acceleration. The backend may be using a different graphics stack, so API translation between different graphics protocols may be required. The backend driver allocates a physical GPU resource for each VM, behaving like a normal graphics application in the Hypervisor. Shared memory may be used to reduce memory copying between the host and guest graphics stacks.

API forwarding can bring hardware acceleration capability into a VM, with other merits such as vendor independence and high density. However, it also suffers from the following intrinsic limitations:

- Lagging features - Every new API version needs to be specifically handled, which means slow time-to-market (TTM) in supporting new standards. For example, only DirectX9 may be supported while DirectX11 is already in the market. There is also a big gap in supporting media and compute usages.

- Compatibility issues - A GPU is very complex, and consequently so are high-level graphics APIs. Different protocols are not 100% compatible on every subtle API, so the customer can observe feature/quality loss for specific applications.

- Maintenance burden - The maintenance effort grows as supported protocols and specific versions are incremented.

- Performance overhead - Different API forwarding implementations exhibit quite different performance, which gives rise to a need for a fine-grained graphics tuning effort.

Direct Pass-Through
-------------------

"Direct pass-through" dedicates the GPU to a single VM, providing full features and good performance, but at the cost of device sharing capability among VMs. Only one VM at a time can use the hardware acceleration capability of the GPU, which is a major limitation of this technique. However, it is still a good approach as an intermediate solution to enable graphics virtualization usages on Intel server platforms. Intel GVT-d uses this mechanism.

.. figure:: images/APL_GVT-g-pass-through.png
   :width: 400px
   :align: center
   :name: gvt-pass-through

   Pass-Through

SR-IOV
------

Single Root I/O Virtualization (SR-IOV) implements I/O virtualization directly on a device. Multiple Virtual Functions (VFs) are implemented, with each VF directly assignable to a VM.

.. _Graphic_mediation:

Mediated Pass-Through
*********************

Intel GVT-g achieves full GPU virtualization using a "mediated pass-through" technique.

Concept
=======

Mediated pass-through allows a VM to access performance-critical I/O resources (usually partitioned) directly, without intervention from the hypervisor in most cases. Privileged operations from this VM are trapped-and-emulated to provide secure isolation among VMs.

.. figure:: images/APL_GVT-g-mediated-pass-through.png
   :width: 400px
   :align: center
   :name: mediated-pass-through

   Mediated Pass-Through

The Hypervisor must ensure that no vulnerability is exposed when assigning performance-critical resources to each VM. When a performance-critical resource cannot be partitioned, a scheduler must be implemented (either in software or hardware) to allow time-based sharing among multiple VMs. In this case, the device must allow the hypervisor to save and restore the hardware state associated with the shared resource, either through direct I/O register reads and writes (when there is no software-invisible state) or through a device-specific context save and restore mechanism (when there is software-invisible state).

Examples of performance-critical I/O resources include the following:

.. figure:: images/APL_GVT-g-perf-critical.png
   :width: 800px
   :align: center
   :name: perf-critical

   Performance-Critical I/O Resources

The key to implementing mediated pass-through for a specific device is to define the right policy for various I/O resources.

Virtualization Policies for GPU Resources
=========================================

:numref:`graphics-arch` shows how Intel Processor Graphics works at a high level. Software drivers write commands into a command buffer through the CPU. The Render Engine in the GPU fetches these commands and executes them. The Display Engine fetches pixel data from the Frame Buffer and sends it to the external monitors for display.

.. figure:: images/APL_GVT-g-graphics-arch.png
   :width: 400px
   :align: center
   :name: graphics-arch

   Architecture of Intel Processor Graphics

This architecture abstraction applies to most modern GPUs, but they may differ in how graphics memory is implemented. Intel Processor Graphics uses system memory as graphics memory. System memory can be mapped into multiple virtual address spaces by GPU page tables. A 4 GB global virtual address space called "global graphics memory", accessible from both the GPU and CPU, is mapped through a global page table. Local graphics memory spaces are supported in the form of multiple 4 GB local virtual address spaces, but access to them is limited to the Render Engine through local page tables. Global graphics memory is mostly used for the Frame Buffer and also serves as the Command Buffer. Massive data accesses are made to local graphics memory when hardware acceleration is in progress. Other GPUs have a similar page table mechanism accompanying their on-die memory.

The CPU programs the GPU through GPU-specific commands, shown in :numref:`graphics-arch`, using a producer-consumer model. The graphics driver programs GPU commands into the Command Buffer, including the primary buffer and batch buffer, according to the high-level programming APIs, such as OpenGL* or DirectX*. Then, the GPU fetches and executes the commands. The primary buffer (also called the ring buffer) may chain other batch buffers together; the terms primary buffer and ring buffer are used interchangeably hereafter. The batch buffer conveys the majority of the commands (up to ~98% of them) per programming model. A register tuple (head, tail) is used to control the ring buffer. The CPU submits commands to the GPU by updating the tail, while the GPU fetches commands from the head and then notifies the CPU by updating the head after the commands have finished execution. Therefore, when the GPU has executed all commands from the ring buffer, the head and tail pointers are the same.

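As an illustration of this head/tail protocol, the following minimal C sketch models the register tuple and the producer side; the structure, sizes, and names are purely illustrative and do not reflect the real hardware register layout or the i915 implementation.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   #define RING_SIZE 4096u          /* illustrative ring size in bytes */

   /* Simplified model of the (head, tail) register tuple. */
   struct ring_buffer {
           uint32_t head;           /* advanced by the GPU as it consumes commands */
           uint32_t tail;           /* advanced by the CPU as it produces commands */
           uint8_t  cmds[RING_SIZE];
   };

   /* CPU side: copy a command into the ring and publish it by moving the tail. */
   static bool ring_submit(struct ring_buffer *ring, const uint8_t *cmd, uint32_t len)
   {
           uint32_t used = (ring->tail - ring->head + RING_SIZE) % RING_SIZE;
           uint32_t free = RING_SIZE - 1 - used;

           if (len > free)
                   return false;    /* ring full; try again later */

           for (uint32_t i = 0; i < len; i++)
                   ring->cmds[(ring->tail + i) % RING_SIZE] = cmd[i];

           ring->tail = (ring->tail + len) % RING_SIZE;  /* GPU may now fetch up to tail */
           return true;
   }

   /* The ring is drained when the GPU's head has caught up with the CPU's tail. */
   static bool ring_idle(const struct ring_buffer *ring)
   {
           return ring->head == ring->tail;
   }
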
Having introduced the GPU architecture abstraction, it is important for us to understand how real-world graphics applications use the GPU hardware so that we can virtualize it in VMs efficiently. To do so, we characterized, for some representative GPU-intensive 3D workloads (the Phoronix Test Suite), the usage of the four critical interfaces:

1) the Frame Buffer,
2) the Command Buffer,
3) the GPU Page Table Entries (PTEs), which carry the GPU page tables, and
4) the I/O registers, including Memory-Mapped I/O (MMIO) registers, Port I/O (PIO) registers, and PCI configuration space registers for internal state.

:numref:`access-patterns` shows the average access frequency of running Phoronix 3D workloads on the four interfaces.

The Frame Buffer and Command Buffer are the most performance-critical resources, as shown in :numref:`access-patterns`. When the applications are being loaded, lots of source vertices and pixels are written by the CPU, so the Frame Buffer accesses occur in the range of hundreds of thousands per second. Then at run-time, the CPU programs the GPU through the commands to render the Frame Buffer, so the Command Buffer accesses become the largest group, also in the hundreds of thousands per second. PTE and I/O accesses are minor in both load and run-time phases, ranging in the tens of thousands per second.

.. figure:: images/APL_GVT-g-access-patterns.png
   :width: 400px
   :align: center
   :name: access-patterns

   Access Patterns of Running 3D Workloads

High Level Architecture
***********************

:numref:`gvt-arch` shows the overall architecture of GVT-g, based on the ACRN hypervisor, with the SOS as the privileged VM and multiple user guests. A GVT-g device model working with the ACRN hypervisor implements the policies of trap and pass-through. Each guest runs the native graphics driver and can directly access the performance-critical resources, the Frame Buffer and Command Buffer, with resource partitioning (as presented later). To protect privileged resources, that is, the I/O registers and PTEs, corresponding accesses from the graphics driver in user VMs are trapped and forwarded to the GVT device model in the SOS for emulation. The device model leverages i915 interfaces to access the physical GPU.

In addition, the device model implements a GPU scheduler that runs concurrently with the CPU scheduler in ACRN to share the physical GPU timeslot among the VMs. GVT-g uses the physical GPU to directly execute all the commands submitted from a VM, so it avoids the complexity of emulating the Render Engine, which is the most complex part of the GPU. In the meantime, the resource pass-through of both the Frame Buffer and Command Buffer minimizes the hypervisor's intervention in CPU accesses, while the GPU scheduler guarantees every VM a quantum time-slice for direct GPU execution. With that, GVT-g can achieve near-native performance for a VM workload.

In :numref:`gvt-arch`, the yellow GVT device model works as a client on top of the i915 driver in the SOS. It has a generic Mediated Pass-Through (MPT) interface, compatible with all types of hypervisors. For ACRN, some extra development work is needed for such MPT interfaces; for example, we need some changes in ACRN-DM to make ACRN compatible with the MPT framework. The vGPU lifecycle is the same as the lifecycle of the guest VM created through ACRN-DM. They interact through sysfs, exposed by the GVT device model.

.. figure:: images/APL_GVT-g-arch.png
   :width: 600px
   :align: center
   :name: gvt-arch

   AcrnGT High-level Architecture

Key Techniques
**************

vGPU Device Model
=================

The vGPU device model is the main component: it constructs the vGPU instance for each guest to satisfy every GPU request from the guest and gives the corresponding result back to the guest.

The vGPU device model provides the basic framework to do trap-and-emulation, including MMIO virtualization, interrupt virtualization, and display virtualization. It also handles and processes all the requests internally (such as command scan and shadow), schedules them in the proper manner, and finally submits them to the SOS i915 driver.

.. figure:: images/APL_GVT-g-DM.png
   :width: 800px
   :align: center
   :name: GVT-DM

   GVT-g Device Model

MMIO Virtualization
-------------------

Intel Processor Graphics implements two PCI MMIO BARs:

- **GTTMMADR BAR**: Combines both the :term:`GGTT` modification range and the Memory Mapped IO range. It is 16 MB on :term:`BDW`, with 2 MB used by MMIO, 6 MB reserved, and 8 MB allocated to the GGTT. The GGTT starts at :term:`GTTMMADR` + 8 MB. This section focuses on virtualization of the MMIO range; GGTT virtualization is discussed later.

- **GMADR BAR**: As the PCI aperture is used by the CPU to access tiled graphics memory, GVT-g partitions this aperture range among VMs for performance reasons.

A 2 MB virtual MMIO structure is allocated per vGPU instance.

All the virtual MMIO registers are emulated as simple in-memory read-write; that is, the guest driver will read back the same value that was programmed earlier. A common emulation handler (for example, intel_gvt_emulate_read/write) is enough to handle such general emulation requirements. However, some registers must be emulated with specific logic, for example, when they are affected by changes of other states or need additional audit or translation when the virtual register is updated. Therefore, a specific emulation handler must be installed for those special registers.

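The split between a default in-memory handler and per-register overrides can be sketched as follows. This is an illustration only; the structure and function names are assumptions and are much simpler than the actual GVT-g register-handling code.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>
   #include <string.h>

   #define VGPU_MMIO_SIZE (2u * 1024u * 1024u)    /* 2 MB virtual MMIO block per vGPU */

   struct vgpu;
   typedef void (*mmio_handler_t)(struct vgpu *vgpu, uint32_t offset,
                                  uint32_t *val, bool is_write);

   /* Registers that need specific emulation logic register an override here. */
   struct mmio_override {
           uint32_t       offset;
           mmio_handler_t handler;
   };

   struct vgpu {
           uint8_t mmio[VGPU_MMIO_SIZE];          /* per-vGPU virtual register file */
           const struct mmio_override *overrides; /* special registers, if any      */
           size_t nr_overrides;
   };

   /* Default policy: a read returns whatever the guest programmed earlier. */
   static void vgpu_emulate_mmio(struct vgpu *vgpu, uint32_t offset,
                                 uint32_t *val, bool is_write)
   {
           for (size_t i = 0; i < vgpu->nr_overrides; i++) {
                   if (vgpu->overrides[i].offset == offset) {
                           vgpu->overrides[i].handler(vgpu, offset, val, is_write);
                           return;
                   }
           }
           if (is_write)
                   memcpy(&vgpu->mmio[offset], val, sizeof(*val));
           else
                   memcpy(val, &vgpu->mmio[offset], sizeof(*val));
   }
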
The graphics driver may have assumptions about the initial device state, which corresponds to the state at the point when the BIOS transitions to the OS. To meet the driver's expectations, we need to provide an initial vGPU state that a driver could observe on a pGPU. So the host graphics driver is expected to generate a snapshot of the physical GPU state, which it does before the guest driver's initialization. This snapshot is used as the initial vGPU state by the device model.

PCI Configuration Space Virtualization
--------------------------------------

PCI configuration space also needs to be virtualized in the device model. Different implementations may choose to implement the logic within the vGPU device model or in the default system device model (for example, ACRN-DM). GVT-g emulates the logic in the device model.

Some information is vital for the vGPU device model, including: Guest PCI BAR, Guest PCI MSI, and Base of ACPI OpRegion.

Legacy VGA Port I/O Virtualization
----------------------------------

Legacy VGA is not supported in the vGPU device model. We rely on the default device model (for example, :term:`QEMU`) to provide legacy VGA emulation, which means either ISA VGA emulation or PCI VGA emulation.

Interrupt Virtualization
------------------------

The GVT device model does not touch the hardware interrupt in the new architecture, since it is hard to combine the interrupt controlling logic between the virtual device model and the host driver. To prevent architectural changes in the host driver, the host GPU interrupt does not go to the virtual device model, and the virtual device model has to handle the GPU interrupt virtualization by itself. Virtual GPU interrupts are categorized into three types:

- Periodic GPU interrupts are emulated by timers. However, a notable exception is the VBlank interrupt. Due to the demands of user space compositors, such as Wayland, which require a flip done event to be synchronized with a VBlank, this interrupt is forwarded from the SOS to the UOS when the SOS receives it from the hardware.

- Event-based GPU interrupts are emulated by the emulation logic, for example, the AUX Channel Interrupt.

- GPU command interrupts are emulated by a command parser and workload dispatcher. The command parser marks out which GPU command interrupts are generated during the command execution, and the workload dispatcher injects those interrupts into the VM after the workload is finished.

.. figure:: images/APL_GVT-g-interrupt-virt.png
   :width: 400px
   :align: center
   :name: interrupt-virt

   Interrupt Virtualization

Workload Scheduler
------------------

The scheduling policy and workload scheduler are decoupled for scalability reasons. For example, a future QoS enhancement will only impact the scheduling policy; any i915 interface change or HW submission interface change (from execlist to :term:`GuC`) will only need workload scheduler updates.

The scheduling policy framework is the core of the vGPU workload scheduling system. It controls all of the scheduling actions and provides the developer with a generic framework for easy development of scheduling policies. The scheduling policy framework controls the work scheduling process without caring about how the workload is dispatched or completed. All the detailed workload dispatching is hidden in the workload scheduler, which is the actual executor of a vGPU workload.

The workload scheduler handles everything about one vGPU workload. Each hardware ring is backed by one workload scheduler kernel thread. The workload scheduler picks the workload from the current vGPU workload queue and communicates with the virtual HW submission interface to emulate the "schedule-in" status for the vGPU. It performs context shadowing, Command Buffer scan and shadow, and PPGTT page table pin/unpin/out-of-sync operations before submitting this workload to the host i915 driver. When the vGPU workload is completed, the workload scheduler asks the virtual HW submission interface to emulate the "schedule-out" status for the vGPU. The VM graphics driver then knows that a GPU workload is finished.

.. figure:: images/APL_GVT-g-scheduling.png
   :width: 500px
   :align: center
   :name: scheduling

   GVT-g Scheduling Framework

Workload Submission Path
------------------------

Software submits the workload using the legacy ring buffer mode on Intel Processor Graphics before Broadwell, which is no longer supported by the GVT-g virtual device model. A new HW submission interface named "Execlist" was introduced with Broadwell. With the new HW submission interface, software can achieve better programmability and easier context management. In Intel GVT-g, the vGPU submits the workload through the virtual HW submission interface. Each workload in submission is represented as an ``intel_vgpu_workload`` data structure (a vGPU workload), which is put on a per-vGPU and per-engine workload queue after a few basic checks and verifications.

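A heavily simplified sketch of such a workload descriptor and its per-vGPU, per-engine queue is shown below. The field names are illustrative only; the real ``intel_vgpu_workload`` structure in the i915/GVT code carries far more state.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   #define MAX_ENGINES 4    /* render, blitter, video, ... (illustrative) */

   /* Simplified stand-in for the intel_vgpu_workload descriptor. */
   struct vgpu_workload {
           uint64_t ring_context_gpa;      /* guest context referenced by the ELSP write */
           uint32_t ring_head, ring_tail;  /* command range to scan and shadow           */
           bool     shadowed;              /* set once command scan/shadow is done       */
           struct vgpu_workload *next;
   };

   struct vgpu {
           /* One FIFO of pending workloads per engine, per vGPU. */
           struct vgpu_workload *queue_head[MAX_ENGINES];
           struct vgpu_workload *queue_tail[MAX_ENGINES];
   };

   /* Basic sanity checks, then append the workload to the per-engine queue.
    * The real code validates the guest context much more thoroughly. */
   static bool vgpu_queue_workload(struct vgpu *vgpu, unsigned int engine,
                                   struct vgpu_workload *w)
   {
           if (w == NULL || engine >= MAX_ENGINES)
                   return false;

           w->next = NULL;
           if (vgpu->queue_tail[engine])
                   vgpu->queue_tail[engine]->next = w;
           else
                   vgpu->queue_head[engine] = w;
           vgpu->queue_tail[engine] = w;
           return true;
   }
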
.. figure:: images/APL_GVT-g-workload.png
   :width: 800px
   :align: center
   :name: workload

   GVT-g Workload Submission

Display Virtualization
----------------------

GVT-g reuses the i915 graphics driver in the SOS to initialize the Display Engine, and then manages the Display Engine to show different VM frame buffers. When two vGPUs have the same resolution, only the frame buffer locations are switched.

.. figure:: images/APL_GVT-g-display-virt.png
   :width: 800px
   :align: center
   :name: display-virt

   Display Virtualization

Direct Display Model
--------------------

.. figure:: images/APL_GVT-g-direct-display.png
   :width: 600px
   :align: center
   :name: direct-display

   Direct Display Model

A typical automotive use case has two displays in the car, each of which needs to show one domain's content, with the two domains being the Instrument cluster and the In-Vehicle Infotainment (IVI) system. As shown in :numref:`direct-display`, this can be accomplished through the direct display model of GVT-g, where the SOS and UOS are each assigned all HW planes of two different pipes. GVT-g has a concept of display owner on a per-HW-plane basis. If it determines that a particular domain is the owner of a HW plane, then it allows the domain's MMIO register write that flips a frame buffer to that plane to go through to the HW. Otherwise, such writes are blocked by GVT-g.

Indirect Display Model
----------------------

.. figure:: images/APL_GVT-g-indirect-display.png
   :width: 600px
   :align: center
   :name: indirect-display

   Indirect Display Model

For security or fast-boot reasons, it may be determined that the UOS is either not allowed to display its content directly on the HW or that it boots up too late to display its content in time. In such a scenario, the responsibility of displaying content on all displays lies with the SOS. One of the use cases that can be realized is to display the entire frame buffer of the UOS on a secondary display. GVT-g allows for this model by first trapping all MMIO writes by the UOS to the HW. A proxy application can then capture the address in the GGTT where the UOS has written its frame buffer and, with the help of the Hypervisor and the SOS's i915 driver, convert the Guest Physical Addresses (GPAs) into Host Physical Addresses (HPAs) before making a texture source or EGL image out of the frame buffer and then either post-processing it further or simply displaying it on a HW plane of the secondary display.

GGTT-Based Surface Sharing
--------------------------

One of the major automotive use cases is called "surface sharing". This use case requires that the SOS access an individual surface, or a set of surfaces, from the UOS without having to access the entire frame buffer of the UOS. Unlike the previous two models, where the UOS did not have to do anything to show its content and therefore a completely unmodified UOS could continue to run, this model requires changes to the UOS.

This model can be considered an extension of the indirect display model. Under the indirect display model, the UOS's frame buffer was temporarily pinned by it in video memory, accessed through the Global Graphics Translation Table (GGTT). This GGTT-based surface sharing model takes this a step further by having a compositor in the UOS temporarily pin all application buffers into the GGTT. It then also requires the compositor to create a metadata table with relevant surface information, such as width, height, and GGTT offset, and flip that in lieu of the frame buffer. In the SOS, the proxy application knows that the GGTT offset has been flipped, maps it, and through it can access the GGTT offset of an application that it wants to access. It is worth mentioning that in this model, UOS applications did not require any changes; only the compositor, Mesa, and the i915 driver had to be modified.

This model has a major benefit and a major limitation. The benefit is that since it builds on top of the indirect display model, there are no special drivers necessary for it on either the SOS or the UOS. Therefore, any Real Time Operating System (RTOS) that uses this model can simply do so without having to implement a driver, the infrastructure for which may not be present in that operating system. The limitation of this model is that the video memory dedicated to a UOS is generally limited to a couple of hundred MBs. This can easily be exhausted by a few application buffers, so the number and size of buffers is limited. Since it is not a highly scalable model, in general, Intel recommends the Hyper DMA buffer sharing model, described next.

Hyper DMA Buffer Sharing
------------------------

.. figure:: images/APL_GVT-g-hyper-dma.png
   :width: 800px
   :align: center
   :name: hyper-dma

   Hyper DMA Buffer Design

Another approach to surface sharing is Hyper DMA Buffer sharing. This model extends the Linux DMA buffer sharing mechanism, in which one driver is able to share its pages with another driver within one domain.

Application buffers are backed by i915 Graphics Execution Manager Buffer Objects (GEM BOs). As in GGTT surface sharing, this model also requires compositor changes. The compositor of the UOS requests i915 to export these application GEM BOs and then passes them on to a special driver called the Hyper DMA Buf exporter, whose job is to create a scatter-gather list of pages mapped by PDEs and PTEs and export a Hyper DMA Buf ID back to the compositor.

The compositor then shares this Hyper DMA Buf ID with the SOS's Hyper DMA Buf importer driver, which maps the memory represented by this ID in the SOS. A proxy application in the SOS can then provide this ID to the SOS i915 driver, which can create its own GEM BO. Finally, the application can use it as an EGL image and do any post-processing required before either providing it to the SOS compositor or directly flipping it on a HW plane in the compositor's absence.

This model is highly scalable and can be used to share up to 4 GB worth of pages. It is also not limited to sharing graphics buffers; other buffers, such as those for the IPU, can also be shared with it. However, it does require that the SOS port the Hyper DMA Buffer importer driver. Also, the SOS OS must comprehend and implement the DMA buffer sharing model.

For detailed information about this model, please refer to the `Linux HYPER_DMABUF Driver High Level Design <https://github.com/downor/linux_hyper_dmabuf/blob/hyper_dmabuf_integration_v4/Documentation/hyper-dmabuf-sharing.txt>`_.

.. _plane_restriction:

Plane-Based Domain Ownership
----------------------------

.. figure:: images/APL_GVT-g-plane-based.png
   :width: 600px
   :align: center
   :name: plane-based

   Plane-Based Domain Ownership

Yet another mechanism for showing content of both the SOS and UOS on the same physical display is called plane-based domain ownership. Under this model, both the SOS and UOS are provided a set of HW planes that they can flip their contents onto. Since each domain provides its own content, there is no need for any extra composition to be done through the SOS. The display controller handles alpha blending contents of different domains on a single pipe. This saves on any complexity in either the SOS or the UOS SW stack.

It is important to provide only specific planes and have them statically assigned to different Domains. To achieve this, the i915 driver of both domains is provided a command line parameter that specifies the exact planes that this domain has access to. The i915 driver then enumerates only those HW planes and exposes them to its compositor. It is then left to the compositor configuration to use these planes appropriately and show the correct content on them. No other changes are necessary.

While the biggest benefit of this model is that it is extremely simple and quick to implement, it also has some drawbacks. First, since each domain is responsible for showing its content on the screen, there is no control of the UOS by the SOS. If the UOS is untrusted, this could potentially cause some unwanted content to be displayed. Also, there is no post-processing capability, except that provided by the display controller (for example, scaling, rotation, and so on). So each domain must provide finished buffers with the expectation that alpha blending with another domain will not cause any corruption or unwanted artifacts.

Graphics Memory Virtualization
==============================

To achieve near-to-native graphics performance, GVT-g passes through the performance-critical operations, such as Frame Buffer and Command Buffer accesses from the VM. For the global graphics memory space, GVT-g uses graphics memory resource partitioning and an address space ballooning mechanism. For local graphics memory spaces, GVT-g implements per-VM local graphics memory through a render context switch because local graphics memory is only accessible by the GPU.

Global Graphics Memory
----------------------

Graphics Memory Resource Partitioning
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

GVT-g partitions the global graphics memory among VMs. Splitting the CPU/GPU scheduling mechanism requires that the global graphics memory of different VMs can be accessed by the CPU and the GPU simultaneously. Consequently, GVT-g must, at any time, present each VM with its own resource, leading to the resource partitioning approach for global graphics memory shown in :numref:`mem-part`.

.. figure:: images/APL_GVT-g-mem-part.png
   :width: 800px
   :align: center
   :name: mem-part

   Memory Partition and Ballooning

The performance impact of the reduced global graphics memory resource due to memory partitioning is very limited, according to various test results.

Address Space Ballooning
%%%%%%%%%%%%%%%%%%%%%%%%

The address space ballooning technique is introduced to eliminate the address translation overhead, as shown in :numref:`mem-part`. GVT-g exposes the partitioning information to the VM graphics driver through the PVINFO MMIO window. The graphics driver marks the other VMs' regions as 'ballooned' and reserves them as unusable in its graphics memory allocator. Under this design, the guest view of the global graphics memory space is exactly the same as the host view, and the driver-programmed addresses, using guest physical addresses, can be directly used by the hardware. Address space ballooning is different from traditional memory ballooning techniques: memory ballooning is for memory usage control, concerning the number of ballooned pages, while address space ballooning balloons special memory address ranges.

Another benefit of address space ballooning is that there is no address translation overhead, as the guest Command Buffer can be used directly for GPU execution.

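The ballooning step in the guest driver can be sketched as follows: read this VM's partition from the PVINFO window and reserve everything outside it in the graphics memory allocator. The structure, field names, and helper below are assumptions for illustration; they do not match the actual PVINFO register layout or the i915 ballooning code.

.. code-block:: c

   #include <stdint.h>

   /* Partition layout as it might be read from the PVINFO MMIO window
    * (illustrative field names, not the real register layout). */
   struct gvt_partition_info {
           uint64_t mappable_base,   mappable_size;    /* aperture (low) GM */
           uint64_t unmappable_base, unmappable_size;  /* high GM           */
   };

   /* Stand-in for the guest graphics memory allocator. */
   struct gm_allocator {
           uint64_t reserved_bytes;
   };

   static void gm_reserve_range(struct gm_allocator *alloc, uint64_t start, uint64_t size)
   {
           (void)start;
           alloc->reserved_bytes += size;   /* real code would carve a hole here */
   }

   #define GGTT_TOTAL_SIZE (4ull << 30)     /* 4 GB global graphics memory space */

   /* Balloon out everything that does not belong to this VM. The guest view
    * then matches the host view, so guest-programmed addresses need no
    * translation before the hardware uses them. */
   static void balloon_global_gm(struct gm_allocator *alloc,
                                 const struct gvt_partition_info *p)
   {
           uint64_t mappable_end   = p->mappable_base + p->mappable_size;
           uint64_t unmappable_end = p->unmappable_base + p->unmappable_size;

           gm_reserve_range(alloc, 0, p->mappable_base);
           gm_reserve_range(alloc, mappable_end, p->unmappable_base - mappable_end);
           gm_reserve_range(alloc, unmappable_end, GGTT_TOTAL_SIZE - unmappable_end);
   }
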
Per-VM Local Graphics Memory
----------------------------

GVT-g allows each VM to use the full local graphics memory spaces of its own, similar to the virtual address spaces on the CPU. The local graphics memory spaces are only visible to the Render Engine in the GPU. Therefore, any valid local graphics memory address programmed by a VM can be used directly by the GPU. The GVT-g device model switches the local graphics memory spaces between VMs when switching render ownership.

GPU Page Table Virtualization
=============================

Shared Shadow GGTT
------------------

To achieve resource partitioning and address space ballooning, GVT-g implements a shared shadow global page table for all VMs. Each VM has its own guest global page table, which translates from the graphics memory page number to the Guest memory Page Number (GPN). The shadow global page table translates from the graphics memory page number to the Host memory Page Number (HPN).

The shared shadow global page table maintains the translations for all VMs to support concurrent accesses from the CPU and GPU. Therefore, GVT-g implements a single, shared shadow global page table by trapping guest PTE updates, as shown in :numref:`shared-shadow`. The global page table, in MMIO space, has 1024K PTE entries, each pointing to a 4 KB system memory page, so the global page table overall creates a 4 GB global graphics memory space. GVT-g audits the guest PTE values according to the address space ballooning information before updating the shadow PTE entries.

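A trapped GGTT PTE write might be handled along the following lines: audit the entry against this VM's partition, translate the GPN to an HPN, and then update the shadow entry. This is a sketch under assumptions; the PTE encoding, helper functions, and structure fields are illustrative and not the actual GVT-g implementation.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   typedef uint64_t gtt_entry_t;        /* one GGTT PTE (illustrative encoding) */

   struct vgpu {
           uint64_t gm_base, gm_size;   /* this VM's slice of global graphics memory */
           gtt_entry_t *shadow_ggtt;    /* host-maintained shadow table              */
   };

   /* Stand-in for the real GPN -> HPN lookup done with hypervisor help. */
   static uint64_t gpn_to_hpn(struct vgpu *vgpu, uint64_t gpn)
   {
           (void)vgpu;
           return gpn;                  /* identity mapping as a placeholder */
   }

   /* Trapped guest write to GGTT PTE number 'index'. */
   static bool shadow_ggtt_write(struct vgpu *vgpu, uint64_t index, gtt_entry_t guest_pte)
   {
           uint64_t gm_addr = index * 4096ull;   /* each PTE maps a 4 KB page */

           /* Audit: the guest may only touch entries inside its own partition. */
           if (gm_addr < vgpu->gm_base || gm_addr >= vgpu->gm_base + vgpu->gm_size)
                   return false;                 /* ballooned range: reject the update */

           uint64_t gpn = guest_pte >> 12;       /* illustrative PTE layout */
           uint64_t hpn = gpn_to_hpn(vgpu, gpn);

           vgpu->shadow_ggtt[index] = (hpn << 12) | (guest_pte & 0xfffull);
           return true;
   }
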
.. figure:: images/APL_GVT-g-shared-shadow.png
   :width: 600px
   :align: center
   :name: shared-shadow

   Shared Shadow Global Page Table

Per-VM Shadow PPGTT
-------------------

To support local graphics memory access pass-through, GVT-g implements per-VM shadow local page tables. The local graphics memory is only accessible from the Render Engine. The local page tables have two-level paging structures, as shown in :numref:`per-vm-shadow`.

The first level, Page Directory Entries (PDEs), located in the global page table, points to the second level, Page Table Entries (PTEs), in system memory, so guest accesses to the PDEs are trapped and emulated through the implementation of the shared shadow global page table.

GVT-g also write-protects a list of guest PTE pages for each VM. The GVT-g device model synchronizes the shadow page with the guest page at the time of the write-protection page fault, and switches the shadow local page tables at render context switches.

.. figure:: images/APL_GVT-g-per-vm-shadow.png
   :width: 800px
   :align: center
   :name: per-vm-shadow

   Per-VM Shadow PPGTT

.. _GVT-g-prioritized-rendering:

Prioritized Rendering and Preemption
====================================

Different Schedulers and Their Roles
------------------------------------

.. figure:: images/APL_GVT-g-scheduling-policy.png
   :width: 800px
   :align: center
   :name: scheduling-policy

   Scheduling Policy

In the system, there are three different schedulers for the GPU:

- i915 UOS scheduler
- Mediator GVT scheduler
- i915 SOS scheduler

Since the UOS always uses the host-based command submission (ELSP) model and never accesses the GPU or the Graphics Micro Controller (GuC) directly, its scheduler cannot do any preemption by itself. The i915 scheduler does ensure that batch buffers are submitted in dependency order; that is, if a compositor has to wait for an application buffer to finish before its workload can be submitted to the GPU, then the i915 scheduler of the UOS ensures that this happens.

The UOS assumes that by submitting its batch buffers to the Execlist Submission Port (ELSP), the GPU will start working on them. However, the MMIO write to the ELSP is captured by the Hypervisor, which forwards these requests to the GVT module. GVT then creates a shadow context based on this batch buffer and submits the shadow context to the SOS i915 driver.

However, this submission depends on a second scheduler called the GVT scheduler. This scheduler is time based and uses a round robin algorithm to provide a specific time slice during which each UOS can submit its workload, when it is considered the "render owner". The workloads of the UOSs that are not render owners during a specific time period end up waiting in the virtual GPU context until the GVT scheduler makes them render owners. The GVT shadow context submits only one workload at a time and, once the workload is finished by the GPU, copies any context state back to DomU and sends the appropriate interrupts before picking up any other workloads from either this UOS or another one. This also implies that this scheduler does not do any preemption of workloads.

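The round-robin handover of render ownership can be sketched with a simple timer callback, shown below. The quantum, structure, and names are illustrative assumptions rather than the actual GVT scheduler code.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>

   #define MAX_VGPU      4
   #define TIME_SLICE_MS 16    /* illustrative scheduling quantum */

   struct sched_vgpu {
           bool active;             /* has workloads pending in its queue */
           bool is_render_owner;
   };

   struct gvt_scheduler {
           struct sched_vgpu vgpus[MAX_VGPU];
           size_t current;          /* index of the current render owner */
   };

   /* Called every TIME_SLICE_MS: hand render ownership to the next vGPU
    * that has work pending (simple round robin, no preemption). */
   static void gvt_sched_tick(struct gvt_scheduler *sched)
   {
           sched->vgpus[sched->current].is_render_owner = false;

           for (size_t i = 1; i <= MAX_VGPU; i++) {
                   size_t next = (sched->current + i) % MAX_VGPU;

                   if (sched->vgpus[next].active) {
                           sched->current = next;
                           break;
                   }
           }
           sched->vgpus[sched->current].is_render_owner = true;
   }
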
Finally, there is the i915 scheduler in the SOS. This scheduler uses the GuC or ELSP to do command submission of SOS local content as well as any content that GVT submits to it on behalf of the UOSs. This scheduler uses the GuC or ELSP to preempt workloads. The GuC has four different priority queues, but the SOS i915 driver uses only two of them: one is considered high priority and the other normal priority, with the GuC rule being that any command submitted on the high-priority queue immediately tries to preempt any workload submitted on the normal-priority queue. For ELSP submission, the i915 driver submits a preempt context to preempt the currently running context and then waits for the GPU engine to become idle.

While the identification of workloads to be preempted is decided by customizable scheduling policies, once a candidate for preemption is identified, the i915 scheduler simply submits a preemption request to the GuC high-priority queue. Based on the HW's ability to preempt (on an Apollo Lake SoC, a 3D workload is preemptible at the 3D primitive level, with some exceptions), the currently executing workload is saved and preempted. The GuC informs the driver of the preemption event with an interrupt. After handling the interrupt, the driver submits the high-priority workload through the normal-priority GuC queue. As such, the normal-priority GuC queue is used for actual execbuf submission most of the time, with the high-priority GuC queue only being used for the preemption of lower-priority workloads.

Scheduling policies are customizable and left to customers to change if they are not satisfied with the built-in i915 driver policy, where all workloads of the SOS are considered higher priority than those of the UOS. This policy can be enforced through an SOS i915 kernel command line parameter and can replace the default in-order command submission (no preemption) policy.

AcrnGT
*******

ACRN is a flexible, lightweight reference hypervisor, built with real-time and safety-criticality in mind, optimized to streamline embedded development through an open source platform.

AcrnGT is the GVT-g implementation on the ACRN hypervisor. It adapts the MPT interface of GVT-g onto ACRN by using the kernel APIs provided by ACRN.

:numref:`full-pic` shows the full architecture of AcrnGT with a Linux Guest OS and an Android Guest OS.

.. figure:: images/APL_GVT-g-full-pic.png
   :width: 800px
   :align: center
   :name: full-pic

   Full picture of the AcrnGT

AcrnGT in kernel
=================

The AcrnGT module in the SOS kernel acts as an adaptation layer connecting GVT-g in the i915 driver, the VHM module, and the ACRN-DM user space application:

- The AcrnGT module implements the MPT interface of GVT-g to provide services to it, including setting and unsetting trap areas, setting and unsetting write-protected pages, etc. (a sketch of such an interface follows this list).

- It calls the VHM APIs provided by the ACRN VHM module in the SOS kernel, to eventually call into the routines provided by the ACRN hypervisor through hypercalls.

- It provides user space interfaces through ``sysfs`` to the user space ACRN-DM, so that the DM can manage the lifecycle of the virtual GPUs.

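As a rough illustration, the MPT-style contract between GVT-g and a hypervisor adaptation layer can be thought of as a table of callbacks like the one below. The operation names are paraphrased from the services listed above and are not the actual kernel interface definition.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Illustrative MPT-style operations a hypervisor adaptation layer
    * (such as AcrnGT) would provide to the core GVT-g device model. */
   struct gvt_mpt_ops {
           /* trap or untrap a guest MMIO range so accesses reach GVT-g */
           bool (*set_trap_area)(uint32_t vm_id, uint64_t start, uint64_t end, bool trap);

           /* write-protect or unprotect a guest page (e.g. a PTE page) */
           bool (*set_wp_page)(uint32_t vm_id, uint64_t gfn);
           bool (*unset_wp_page)(uint32_t vm_id, uint64_t gfn);

           /* inject a virtual interrupt into the guest */
           bool (*inject_msi)(uint32_t vm_id, uint64_t addr, uint16_t data);

           /* translate a guest frame number to a host frame number */
           uint64_t (*gfn_to_mfn)(uint32_t vm_id, uint64_t gfn);
   };
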
AcrnGT in DM
=============

To emulate a PCI device for a Guest, we need an AcrnGT sub-module in the ACRN-DM. This sub-module is responsible for:

- registering the virtual GPU device in the PCI device tree presented to the guest;

- registering the MMIO resources with ACRN-DM so that it can reserve resources in the ACPI table;

- managing the lifecycle of the virtual GPU device, such as creation, destruction, and resetting according to the state of the virtual machine.

@@ -1,18 +0,0 @@
.. _hld-emulated-devices:

Emulated devices high-level design
##################################

Full virtualization device models can typically reuse existing native device drivers, avoiding the need to implement front-end drivers. ACRN implements several fully virtualized devices, as documented in this section.

.. toctree::
   :maxdepth: 1

   usb-virt-hld
   UART virtualization <uart-virt-hld>
   Watchdog virtualization <watchdog-hld>
   random-virt-hld
   GVT-g GPU Virtualization <hld-APL_GVT-g>

@@ -1,24 +0,0 @@
.. _hld-hypervisor:

Hypervisor high-level design
############################

.. toctree::
   :maxdepth: 1

   hv-startup
   hv-cpu-virt
   Memory management <hv-memmgt>
   I/O Emulation <hv-io-emulation>
   IOC Virtualization <hv-ioc-virt>
   Physical Interrupt <hv-interrupt>
   Timer <hv-timer>
   Virtual Interrupt <hv-virt-interrupt>
   VT-d <hv-vt-d>
   Device Passthrough <hv-dev-passthrough>
   hv-partitionmode
   Power Management <hv-pm>
   Console, Shell, and vUART <hv-console>
   Hypercall / VHM upcall <hv-hypercall>
   Compile-time configuration <hv-config>

@@ -1,529 +0,0 @@
.. _hld-overview:

ACRN high-level design overview
###############################

ACRN is an open source reference hypervisor (HV) running on top of Intel Apollo Lake platforms for Software Defined Cockpit (SDC) or In-Vehicle Experience (IVE) solutions. ACRN provides embedded hypervisor vendors with a reference I/O mediation solution with a permissive license and provides auto makers a reference software stack for in-vehicle use.

ACRN Supported Use Cases
************************

Software Defined Cockpit
========================

The SDC system consists of multiple systems: the instrument cluster (IC) system, the In-Vehicle Infotainment (IVI) system, and one or more rear seat entertainment (RSE) systems. Each system runs as a VM for better isolation.

The Instrument Cluster (IC) system manages the graphics display of:

- driving speed, engine RPM, temperature, fuel level, odometer, trip mileage, etc.
- alerts of low fuel or tire pressure
- rear-view camera (RVC) and surround-camera view for driving assistance

In-Vehicle Infotainment
=======================

A typical In-Vehicle Infotainment (IVI) system would support:

- Navigation systems
- Radio, audio, and video playback
- Mobile device connection for calls, music, and applications via voice recognition and/or gesture recognition/touch
- Rear-seat RSE services such as:

  - entertainment system
  - virtual office
  - connection to the IVI front system and mobile devices (cloud connectivity)

ACRN supports guest OSes of Clear Linux OS and Android. OEMs can use the ACRN hypervisor and Linux or Android guest OS reference code to implement their own VMs for a customized IC/IVI/RSE.

Hardware Requirements
*********************

Mandatory IA CPU features include support for:

- Long mode
- MTRR
- TSC deadline timer
- NX, SMAP, SMEP
- Intel-VT including VMX, EPT, VT-d, APICv, VPID, invept and invvpid

Recommended memory: 4 GB, 8 GB preferred.

ACRN Architecture
*****************

ACRN is a type-1 hypervisor, running on top of bare metal. It supports Intel Apollo Lake platforms and can be easily extended to support future platforms. ACRN implements a hybrid VMM architecture, using a privileged service VM running the service OS (SOS) to manage I/O devices and provide I/O mediation. Multiple user VMs can be supported, running Clear Linux OS or Android OS as the user OS (UOS).

Instrument cluster applications are critical in the Software Defined Cockpit (SDC) use case and may require functional safety certification in the future. Running the IC system in a separate VM can isolate it from other VMs and their applications, thereby reducing the attack surface and minimizing potential interference. However, running the IC system in a separate VM introduces additional latency for the IC applications. Some countries' regulations require an IVE system to show a rear-view camera (RVC) within 2 seconds, which is difficult to achieve if a separate instrument cluster VM is started after the SOS is booted.

:numref:`overview-arch` shows the architecture of ACRN together with the IC VM and service VM. As shown, the SOS owns most of the platform devices and provides I/O mediation to VMs. Some of the PCIe devices are passed through to UOSs according to the VM configuration. In addition, the SOS can run the IC applications and HV helper applications such as the Device Model and the VM manager, where the VM manager is responsible for VM start/stop/pause, virtual CPU pause/resume, etc.

.. figure:: images/over-image34.png
   :align: center
   :name: overview-arch

   ACRN Architecture

.. _intro-io-emulation:

Device Emulation
================

ACRN adopts various approaches for emulating devices for UOS:

- **Emulated device**: A virtual device using this approach is emulated in the SOS by trapping accesses to the device in UOS. Two sub-categories exist for emulated device:

  - fully emulated, allowing native drivers to be used unmodified in the UOS, and
  - para-virtualized, requiring front-end drivers in the UOS to function.

- **Pass-through device**: A device passed through to UOS is fully accessible to UOS without interception. However, interrupts are first handled by the hypervisor before being injected to the UOS.

- **Mediated pass-through device**: A mediated pass-through device is a hybrid of the previous two approaches. Performance-critical resources (mostly data-plane related) are passed-through to UOSes and others (mostly control-plane related) are emulated.

I/O Emulation
-------------

The device model (DM) is a place for managing UOS devices: it allocates memory for UOSes, configures and initializes the devices shared by the guest, loads the virtual BIOS and initializes the virtual CPU state, and invokes hypervisor service to execute the guest instructions.

The following diagram illustrates the control flow of emulating a port I/O read from UOS.

.. figure:: images/over-image29.png
   :align: center
   :name: overview-io-emu-path

   I/O (PIO/MMIO) Emulation Path

:numref:`overview-io-emu-path` shows an example I/O emulation flow path. When a guest executes an I/O instruction (port I/O or MMIO), a VM exit happens. The HV takes control and handles the request based on the VM exit reason, for example ``VMX_EXIT_REASON_IO_INSTRUCTION`` for a port I/O access. The HV then fetches any additional guest instructions needed, processes the port I/O instruction at a pre-configured port address (``in AL, 20h``, for example), places the decoded information, such as the port I/O address, size of access, read/write, and target register, into the I/O request in the I/O request buffer (shown in :numref:`overview-io-emu-path`), and notifies/interrupts the SOS to process it.

The virtio and HV service module (VHM) in the SOS intercepts HV interrupts and accesses the I/O request buffer for the port I/O instructions. It then checks whether any kernel device claims ownership of the I/O port. The owning device, if any, executes the requested APIs from a VM. Otherwise, the VHM module leaves the I/O request in the request buffer and wakes up the DM thread for processing.

The DM follows the same mechanism as the VHM. The I/O processing thread of the DM queries the I/O request buffer to get the PIO instruction details and checks whether any (guest) device emulation module claims ownership of the I/O port. If yes, the owning module is invoked to execute the requested APIs.

When the DM completes the emulation (port I/O 20h access in this example) of a device such as uDev1, uDev1 puts the result into the request buffer (register AL). The DM then returns control to the HV, indicating completion of an I/O instruction emulation, typically through VHM/hypercall. The HV then stores the result into the guest register context, advances the guest IP to indicate the completion of instruction execution, and resumes the guest.

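The shape of such a request record and the DM-side dispatch can be sketched as follows. The structure layout and function names are illustrative assumptions and do not match the actual ACRN I/O request (``vhm_request``) definition.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   enum req_state { REQ_FREE, REQ_PENDING, REQ_PROCESSING, REQ_COMPLETE };

   /* Illustrative port I/O request record, one slot per vCPU. */
   struct pio_request {
           uint16_t port;          /* e.g. 0x20 for the "in AL, 20h" example */
           uint8_t  size;          /* access width in bytes                  */
           bool     is_read;
           uint32_t value;         /* filled in by the emulator on a read    */
           volatile enum req_state state;
   };

   /* One registered (guest) device emulation module in the DM. */
   struct pio_emul {
           uint16_t base, len;     /* claimed port range                     */
           void (*handler)(struct pio_request *req);
   };

   /* DM I/O thread: find the owning emulator and let it complete the request. */
   static bool dm_dispatch_pio(struct pio_request *req,
                               const struct pio_emul *devs, size_t ndevs)
   {
           req->state = REQ_PROCESSING;

           for (size_t i = 0; i < ndevs; i++) {
                   if (req->port >= devs[i].base &&
                       req->port < devs[i].base + devs[i].len) {
                           devs[i].handler(req);        /* e.g. fills req->value     */
                           req->state = REQ_COMPLETE;   /* HV copies the value to AL */
                           return true;
                   }
           }
           return false;   /* unclaimed port: handled per DM policy */
   }
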
The MMIO access path is similar, except that the VM exit reason is *EPT violation*.

DMA Emulation
-------------

Currently, the only fully virtualized devices for UOS are the USB xHCI, UART, and Automotive I/O controller. None of these require emulating DMA transactions. ACRN does not currently support virtual DMA.

Hypervisor
**********

ACRN takes advantage of Intel Virtualization Technology (Intel VT). The ACRN HV runs in Virtual Machine Extension (VMX) root operation, also called host mode or VMM mode, while the SOS and UOS guests run in VMX non-root operation, or guest mode. (We'll use "root mode" and "non-root mode" for simplicity.)

The VMM mode has 4 rings. ACRN runs the HV in ring 0 privilege only and leaves rings 1-3 unused. A guest running in non-root mode has its own full set of rings (ring 0 to 3). The guest kernel runs in ring 0 in guest mode, while guest user-land applications run in ring 3 of guest mode (rings 1 and 2 are usually not used by commercial OSes).

.. figure:: images/over-image11.png
   :align: center
   :name: overview-arch-hv

   Architecture of ACRN hypervisor

:numref:`overview-arch-hv` shows an overview of the ACRN hypervisor architecture.
|
||||
|
||||
- A platform initialization layer provides an entry
|
||||
point, checking hardware capabilities and initializing the
|
||||
processors, memory, and interrupts. Relocation of the hypervisor
|
||||
image, derivation of encryption seeds are also supported by this
|
||||
component.
|
||||
|
||||
- A hardware management and utilities layer provides services for
|
||||
managing physical resources at runtime. Examples include handling
|
||||
physical interrupts and low power state changes.
|
||||
|
||||
- A layer siting on top of hardware management enables virtual
|
||||
CPUs (or vCPUs), leveraging Intel VT. A vCPU loop runs a vCPU in
|
||||
non-root mode and handles VM exit events triggered by the vCPU.
|
||||
This layer handles CPU and memory related VM
|
||||
exits and provides a way to inject exceptions or interrupts to a
|
||||
vCPU.
|
||||
|
||||
- On top of vCPUs are three components for device emulation: one for
|
||||
emulation inside the hypervisor, another for communicating with
|
||||
SOS for mediation, and the third for managing pass-through
|
||||
devices.
|
||||
|
||||
- The highest layer is a VM management module providing
|
||||
VM lifecycle and power operations.
|
||||
|
||||
- A library component provides basic utilities for the rest of the
|
||||
hypervisor, including encryption algorithms, mutual-exclusion
|
||||
primitives, etc.
|
||||
|
||||
There are three ways that the hypervisor interacts with SOS:
|
||||
VM exits (including hypercalls), upcalls, and through the I/O request buffer.
|
||||
Interaction between the hypervisor and UOS is more restricted, including
|
||||
only VM exits and hypercalls related to trusty.
|
||||
|
||||
SOS
***

The SOS (Service OS) is an important guest OS in the ACRN architecture. It
runs in non-root mode and contains many critical components, including the VM
manager, the device model (DM), ACRN services, kernel mediators, and the virtio
and hypercall module (VHM). The DM manages the UOS (User OS) and
provides device emulation for it. The SOS also provides services
for system power lifecycle management through the ACRN service and VM manager,
and services for system debugging through the ACRN log/trace tools.

DM
==

The DM (Device Model) is a user-level, QEMU-like application in the SOS
responsible for creating a UOS VM and then performing device emulation
based on command line configurations.

Based on a VHM kernel module, the DM interacts with the VM manager to create
the UOS VM. It then emulates devices through full virtualization at the DM user
level, para-virtualization based on a kernel mediator (such as virtio or
GVT), or pass-through based on kernel VHM APIs.

Refer to :ref:`hld-devicemodel` for more details.

VM Manager
==========

The VM Manager is a user-level service in the SOS that handles UOS VM creation
and VM state management, according to the application requirements or system
power operations.

The VM Manager creates the UOS VM based on the DM application, and manages the
UOS VM state by interacting with the lifecycle service in the ACRN service.

Please refer to the VM management chapter for more details.

ACRN Service
============

The ACRN service provides
system lifecycle management based on IOC polling. It communicates with the
VM manager to handle UOS VM states, such as S3 and power-off.

VHM
===

The VHM (virtio and hypercall module) is an SOS kernel driver
supporting UOS VM management and device emulation. The Device Model uses
the standard Linux char device API (ioctl) to access VHM
functionalities. The VHM communicates with the ACRN hypervisor through
hypercalls or upcall interrupts.

Please refer to the VHM chapter for more details.

Kernel Mediators
================

Kernel mediators are kernel modules providing a para-virtualization method
for the UOS VMs, for example, the i915 GVT driver.

Log/Trace Tools
===============

The ACRN log/trace tools are user-level applications used to
capture ACRN hypervisor log and trace data. The VHM kernel module provides a
middle layer to support these tools.

Refer to :ref:`hld-trace-log` for more details.

UOS
***

Currently, ACRN can boot Linux and Android guest OSes. For an Android guest OS,
ACRN provides a VM environment with two worlds: normal world and trusty
world. The Android OS runs in the normal world. The trusty OS and
security-sensitive applications run in the trusty world. The trusty
world can see the memory of the normal world, but the normal world cannot see
the trusty world.

Guest Physical Memory Layout - UOS E820
=======================================

The DM creates the E820 table for a User OS VM based on these simple rules,
also expressed as the sketch that follows the list:

- If the requested VM memory size < the low memory limit (currently 2 GB,
  defined in the DM), then the low memory range = [0, requested VM memory
  size]

- If the requested VM memory size > the low memory limit, then the low
  memory range = [0, 2G], and the high memory range =
  [4G, 4G + requested VM memory size - 2G]

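The two rules map directly onto a few lines of C. This is a minimal sketch
for illustration only, using a hypothetical helper name and the 2 GB / 4 GB
constants taken from the rules above; it is not the actual DM code.

.. code-block:: c

   #include <stdint.h>

   #define UOS_LOWMEM_LIMIT  (2ULL * 1024ULL * 1024ULL * 1024ULL)  /* 2 GB */
   #define UOS_HIGHMEM_BASE  (4ULL * 1024ULL * 1024ULL * 1024ULL)  /* 4 GB */

   /* Hypothetical helper: derive the UOS low/high RAM ranges from the
    * requested VM memory size, following the two rules above. */
   static void uos_e820_ranges(uint64_t req_size,
                               uint64_t *low_end, uint64_t *high_end)
   {
       if (req_size <= UOS_LOWMEM_LIMIT) {
           *low_end  = req_size;          /* low range:  [0, req_size] */
           *high_end = 0;                 /* no high memory range      */
       } else {
           *low_end  = UOS_LOWMEM_LIMIT;  /* low range:  [0, 2G]       */
           /* high range: [4G, 4G + req_size - 2G] */
           *high_end = UOS_HIGHMEM_BASE + (req_size - UOS_LOWMEM_LIMIT);
       }
   }
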
.. figure:: images/over-image13.png
   :align: center

   UOS Physical Memory Layout

UOS Memory Allocation
=====================

The DM allocates UOS memory using the hugetlb mechanism by default.
The real memory mapping may be scattered across the SOS physical
memory space, as shown in :numref:`overview-mem-layout`:

.. figure:: images/over-image15.png
   :align: center
   :name: overview-mem-layout

   UOS Physical Memory Layout Based on Hugetlb

The User OS's memory is allocated by the Service OS DM application; it may come
from different huge pages in the Service OS, as shown in
:numref:`overview-mem-layout`.

As the Service OS has full knowledge of each huge page's size,
GPA\ :sup:`SOS`, and GPA\ :sup:`UOS`, it works with the hypervisor
to complete the UOS's host-to-guest mapping using this pseudo code:

.. code-block:: none

   for x in allocated huge pages do
      x.hpa = gpa2hpa_for_sos(x.sos_gpa)
      host2guest_map_for_uos(x.hpa, x.uos_gpa, x.size)
   end

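For readers who prefer C, the pseudo code above corresponds to a loop like
the following. This is an illustrative sketch only: the structure and the
two helper names are taken from the pseudo code, not from the actual ACRN
sources.

.. code-block:: c

   #include <stdint.h>
   #include <stddef.h>

   /* Illustrative type; field names follow the pseudo code above. */
   struct huge_page {
       uint64_t sos_gpa;  /* guest-physical address of the page in the SOS */
       uint64_t uos_gpa;  /* guest-physical address assigned to the UOS    */
       uint64_t hpa;      /* host-physical address, resolved below         */
       uint64_t size;     /* huge page size in bytes                       */
   };

   /* Hypothetical helpers standing in for the services used above. */
   uint64_t gpa2hpa_for_sos(uint64_t sos_gpa);
   void host2guest_map_for_uos(uint64_t hpa, uint64_t uos_gpa, uint64_t size);

   static void map_uos_memory(struct huge_page *pages, size_t count)
   {
       for (size_t i = 0; i < count; i++) {
           pages[i].hpa = gpa2hpa_for_sos(pages[i].sos_gpa);
           host2guest_map_for_uos(pages[i].hpa, pages[i].uos_gpa,
                                  pages[i].size);
       }
   }
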
Virtual Slim bootloader
=======================

The Virtual Slim bootloader (vSBL) is the virtual bootloader that supports
booting the UOS on the ACRN hypervisor platform. The vSBL design is
derived from Slim Bootloader. It follows a staged design approach:
one stage provides hardware initialization, and a payload stage provides
the boot logic. As shown in :numref:`overview-sbl`, the virtual SBL has an
initialization unit to initialize virtual hardware, and a payload unit
to boot a Linux or Android guest OS.

.. figure:: images/over-image110.png
   :align: center
   :name: overview-sbl

   vSBL System Context Diagram

The vSBL image is released as a part of the Service OS (SOS) root
filesystem (rootfs). The vSBL is copied to UOS memory by the VM manager
in the SOS while creating the UOS virtual BSP. The SOS passes the
start address of the vSBL and related information to the HV. The HV sets
the guest RIP of the UOS virtual BSP to the start of the vSBL, sets the
related guest registers, and launches the UOS virtual BSP. The vSBL starts
running in virtual real mode within the UOS. Conceptually, the vSBL is part
of the UOS runtime.

In the current design, the vSBL supports booting an Android guest OS or a
Linux guest OS using the same vSBL image.

For an Android VM, the vSBL loads and verifies the trusty OS first, and
the trusty OS then loads and verifies the Android OS according to the Android
OS verification mechanism.

Freedom From Interference
*************************

The hypervisor is critical for preventing inter-VM interference, using
the following mechanisms:

- Each physical CPU is dedicated to one vCPU.

  Sharing a physical CPU among multiple vCPUs gives rise to multiple
  sources of interference, such as the vCPU of one VM flushing the
  L1 & L2 cache for another, or a flood of interrupts for one VM
  delaying the execution of another. It also requires vCPU
  scheduling in the hypervisor to consider more complexities such as
  scheduling latency and vCPU priority, exposing more opportunities
  for one VM to interfere with another.

  To prevent such interference, the ACRN hypervisor adopts static
  core partitioning by dedicating each physical CPU to one vCPU. The
  physical CPU loops in idle when the vCPU is paused by I/O
  emulation. This makes vCPU scheduling deterministic, and physical
  resource sharing is minimized.

- Hardware mechanisms including EPT, VT-d, SMAP, and SMEP are leveraged
  to prevent unintended memory accesses.

  Memory corruption can be a common failure mode. The ACRN hypervisor properly
  sets up the memory-related hardware mechanisms to ensure that:

  1. The SOS cannot access the memory of the hypervisor, unless explicitly
     allowed,

  2. The UOS cannot access the memory of the SOS and the hypervisor, and

  3. The hypervisor does not unintentionally access the memory of the SOS or UOS.

- The destination of external interrupts is set to the physical core
  where the VM that handles them is running.

  External interrupts are always handled by the hypervisor in ACRN.
  Excessive interrupts to one VM (say VM A) could slow down another
  VM (VM B) if they are handled by the physical core running VM B
  instead of VM A. Two mechanisms are designed to mitigate such
  interference:

  1. The destination of an external interrupt is set to the physical core
     that runs the vCPU where virtual interrupts will be injected.

  2. The hypervisor maintains statistics on the total number of interrupts
     received on behalf of the SOS, exposes them via a hypercall, and has a
     delay mechanism to temporarily block certain virtual interrupts from
     being injected. This allows the SOS to detect the occurrence of an
     interrupt storm and control the interrupt injection rate when necessary.

- Mitigation of DMA storms.

  (To be documented later.)

Boot Flow
*********

.. figure:: images/over-image85.png
   :align: center

   ACRN Boot Flow

Power Management
****************

CPU P-state & C-state
=====================

In ACRN, CPU P-states and C-states (Px/Cx) are controlled by the guest OS.
The corresponding governors are managed in the SOS/UOS for best power
efficiency and simplicity.

Guests should be able to process ACPI P/C-state requests from OSPM.
The ACPI objects needed for P/C-state management should be present in the
guest ACPI table.

The hypervisor can restrict a guest's P/C-state requests (per customer
requirements). MSR accesses for P-state requests can be intercepted by
the hypervisor and forwarded to the host directly if the requested
P-state is valid. Guest MWAIT/port I/O accesses for C-state control can
be passed through to the host with no hypervisor interception to minimize
performance impacts.

This diagram shows the CPU P/C-state management blocks:

.. figure:: images/over-image4.png
   :align: center

   CPU P/C-state management block diagram

System power state
==================

ACRN supports the ACPI standard-defined system-level power states S3 and S5.
For each guest, ACRN assumes the guest implements OSPM and controls its
own power state accordingly. ACRN doesn't involve itself in guest OSPM. Instead,
it traps the power state transition requests from the guest and emulates them.

.. figure:: images/over-image21.png
   :align: center
   :name: overview-pm-block

   ACRN Power Management Diagram Block

:numref:`overview-pm-block` shows the basic block diagram for ACRN PM.
The OSPM in each guest manages the guest power state transition. The
Device Model running in the SOS traps and emulates the power state
transitions of the UOS (Linux VM or Android VM in
:numref:`overview-pm-block`). The VM Manager knows all UOS power states and
notifies the OSPM of the SOS (Service OS in :numref:`overview-pm-block`) once
the active UOS is in the required power state.

Then the OSPM of the SOS starts the power state transition of the SOS, which is
trapped to the "Sx Agency" in ACRN, and the Sx Agency then starts the power
state transition.

Some details about the ACPI tables for the UOS and SOS:

- The ACPI table in the UOS is emulated by the Device Model. The Device Model
  knows which register the UOS writes to trigger power state
  transitions and must register an I/O handler for it.

- The ACPI table in the SOS is passed through. There is no ACPI parser
  in the ACRN HV. The power management related ACPI table is
  generated offline and hardcoded in the ACRN HV.

@@ -1,179 +0,0 @@
.. _hld-power-management:

Power Management high-level design
##################################

P-state/C-state management
**************************

ACPI Px/Cx data
===============

CPU P-states and C-states are controlled by the guest OS. The ACPI
P/C-state driver relies on some P/C-state-related ACPI data in the guest
ACPI table.

The SOS can run its ACPI driver with no problem because it can access the
native ACPI table. For the UOS, though, we need to prepare the corresponding
ACPI data for the Device Model to build a virtual ACPI table.

The Px/Cx data includes four
ACPI objects: _PCT, _PPC, and _PSS for P-state management, and _CST for
C-state management. All these ACPI data must be consistent with the
native data, because the control method is a kind of pass-through.

These ACPI object data are parsed by an offline tool and hard-coded in a
hypervisor module named CPU state table:

.. code-block:: c

   struct cpu_px_data {
       uint64_t core_frequency;      /* megahertz */
       uint64_t power;               /* milliWatts */
       uint64_t transition_latency;  /* microseconds */
       uint64_t bus_master_latency;  /* microseconds */
       uint64_t control;             /* control value */
       uint64_t status;              /* success indicator */
   } __attribute__((aligned(8)));

   struct acpi_generic_address {
       uint8_t  space_id;
       uint8_t  bit_width;
       uint8_t  bit_offset;
       uint8_t  access_size;
       uint64_t address;
   } __attribute__((aligned(8)));

   struct cpu_cx_data {
       struct acpi_generic_address cx_reg;
       uint8_t  type;
       uint32_t latency;
       uint64_t power;
   } __attribute__((aligned(8)));

With these Px/Cx data, the hypervisor is able to intercept a guest's
P/C-state requests with the desired restrictions.

Virtual ACPI table build flow
=============================

:numref:`vACPItable` shows how to build the virtual ACPI table with
Px/Cx data for UOS P/C-state management:

.. figure:: images/hld-pm-image28.png
   :align: center
   :name: vACPItable

   System block for building vACPI table with Px/Cx data

Some ioctl APIs are defined for the Device Model to query Px/Cx data from
the SOS VHM. The hypervisor needs to provide hypercall APIs to transfer the
Px/Cx data from the CPU state table to the SOS VHM.

The build flow is:

1) Use an offline tool (e.g. **iasl**) to parse the Px/Cx data and hard-code it
   into the CPU state table in the hypervisor. The hypervisor loads the data at
   system boot.
2) Before UOS launch, the Device Model queries the Px/Cx data from the SOS VHM
   via the ioctl interface.
3) The VHM transmits the query request to the hypervisor by hypercall.
4) The hypervisor returns the Px/Cx data.
5) The Device Model builds the virtual ACPI table with these Px/Cx data.

Intercept Policy
================

The hypervisor should be able to restrict a guest's
P/C-state requests with a user-customized policy.

The hypervisor should intercept guest P-state requests and validate whether
each is a valid P-state. Any invalid P-state (e.g. one that doesn't exist in the
CPU state table) should be rejected, as in the sketch below.

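A minimal sketch of such a check, reusing ``struct cpu_px_data`` from the
CPU state table above; the function name and the table parameters are
hypothetical and do not reflect the actual hypervisor implementation.

.. code-block:: c

   #include <stdint.h>
   #include <stdbool.h>

   /* struct cpu_px_data as defined in the CPU state table above. */
   struct cpu_px_data {
       uint64_t core_frequency;
       uint64_t power;
       uint64_t transition_latency;
       uint64_t bus_master_latency;
       uint64_t control;
       uint64_t status;
   } __attribute__((aligned(8)));

   /* Hypothetical: accept a guest P-state request only if its control
    * value matches an entry in the hard-coded CPU state table. */
   static bool is_valid_px_request(const struct cpu_px_data *px_table,
                                   uint32_t px_count,
                                   uint64_t requested_control)
   {
       for (uint32_t i = 0; i < px_count; i++) {
           if (px_table[i].control == requested_control) {
               return true;   /* forward the MSR write to the host */
           }
       }
       return false;          /* invalid request: reject it */
   }
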
It is better not to intercept C-state requests because the trap would
impact both power and performance.

.. note:: For P-state control, pay attention to the SoC core
   voltage domain design when doing P-state measurements. The highest
   P-state request wins if different P-states are requested on cores sharing
   the same voltage domain. In this case, APERF/MPERF must be used to see
   which P-state was granted on that core.

S3/S5
*****

ACRN assumes the guest has complete S3/S5 power state management and follows
the ACPI standard exactly. System S3/S5 needs to follow well-defined
enter/exit paths and requires cooperation among different components.

System low power state enter process
====================================

Each time the OSPM of the UOS starts a power state transition, it
finally writes the ACPI register per the ACPI spec requirements.
With the help of the ACRN I/O emulation framework, the UOS ACPI
register write is dispatched to the Device Model, and the Device Model
emulates the UOS power state (pausing the UOS VM for S3 and powering off
the UOS VM for S5).

The VM Manager monitors all UOSes. If all active UOSes are in the required
power state, the VM Manager notifies the OSPM of the SOS to start the SOS power
state transition. The OSPM of the SOS follows a process very similar to the UOS
for the power state transition. The difference is that the SOS ACPI register
write is trapped to the ACRN HV, and the ACRN HV emulates the SOS power state
(pausing the SOS VM for S3 and taking no special action for S5).

Once the SOS low power state is done, the ACRN HV goes through its own low
power state enter path.

The whole system is finally put into the low power state.

System low power state exit process
===================================

The low power state exit process is in reverse order. The ACRN
hypervisor is woken up first. It goes through its own low power
state exit path. Then the ACRN hypervisor resumes the SOS to let the SOS go
through the SOS low power state exit path. After that, the DM is resumed and
lets the UOS go through the UOS low power state exit path. The system is
resumed to the running state after at least one UOS is resumed to the running
state.

:numref:`pmworkflow` shows the flow of the low power S3 enter/exit process (S5
follows a very similar process):

.. figure:: images/hld-pm-image62.png
   :align: center
   :name: pmworkflow

   ACRN system power management workflow

For system power state entry:

1. The UOS OSPM starts UOS S3 entry.
2. The UOS S3 entry request is trapped to the ACPI PM device of the DM.
3. The DM pauses the UOS VM to emulate UOS S3 and notifies the VM Manager that
   the UOS dedicated to it is in S3.
4. If all UOSes are in S3, the VM Manager notifies the OSPM of the SOS.
5. The SOS OSPM starts SOS S3 entry.
6. The SOS S3 entry request is trapped to the Sx Agency in the ACRN HV.
7. The ACRN HV pauses the SOS VM to emulate SOS S3 and starts ACRN HV S3 entry.

For system power state exit:

1. When the system is resumed from S3, the native bootloader jumps to the wake
   up vector of the HV.
2. The HV resumes from S3 and jumps to the wake up vector to emulate the SOS
   resume from S3.
3. The OSPM of the SOS is running.
4. The OSPM of the SOS notifies the VM Manager that it's ready to wake up the
   UOS.
5. The VM Manager notifies the DM to resume the UOS.
6. The DM resets the UOS VM to emulate the UOS resume from S3.

According to the ACPI standard, S3 is mapped to suspend-to-RAM and S5 is
mapped to shutdown. So the S5 process is a little different:

- UOS enters S3 -> UOS powers off
- System enters S3 -> System powers off
- System resumes from S3 -> System fresh start
- UOS resumes from S3 -> UOS fresh startup

@@ -1,241 +0,0 @@
.. _hld-trace-log:

Tracing and Logging high-level design
#####################################

Both Trace and Log are built on top of a mechanism named shared
buffer (sbuf).

Shared Buffer
*************

The shared buffer is a ring buffer divided into predetermined-size slots. There
are two usage scenarios for sbuf:

- sbuf can serve as a lockless ring buffer to share data from the ACRN HV to
  the SOS in non-overwritten mode. (Writing will fail if an overrun
  happens.)
- sbuf can serve as a conventional ring buffer in the hypervisor in
  over-written mode. A lock is required to synchronize access by the
  producer and consumer.

Both ACRNTrace and ACRNLog use sbuf as a lockless ring buffer. The sbuf
is allocated by the SOS and assigned to the HV via a hypercall. To hold the
pointers to sbufs passed down via hypercall, an array ``sbuf[ACRN_SBUF_ID_MAX]``
is defined in the per_cpu region of the HV, with a predefined sbuf id to
identify the usage, such as ACRNTrace, ACRNLog, etc.

For each physical CPU there is a dedicated sbuf. Only a single producer
is allowed to put data into that sbuf in the HV, and a single consumer is
allowed to get data from the sbuf in the SOS. Therefore, no lock is required
to synchronize access by the producer and consumer, as the sketch below
illustrates.

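The following is a minimal sketch of such a single-producer/single-consumer
ring buffer. The structure layout, slot size, and function names are
illustrative and do not reflect the actual sbuf header defined in
``hypervisor/include/debug/sbuf.h``; memory barriers are omitted for brevity.

.. code-block:: c

   #include <stdint.h>
   #include <string.h>
   #include <stdbool.h>

   #define SLOT_SIZE  80   /* illustrative fixed slot size */
   #define SLOT_COUNT 256  /* illustrative number of slots */

   /* One producer (HV side) and one consumer (SOS side) share this ring. */
   struct demo_sbuf {
       volatile uint32_t head;               /* next slot the consumer reads  */
       volatile uint32_t tail;               /* next slot the producer writes */
       uint8_t slots[SLOT_COUNT][SLOT_SIZE];
   };

   /* Producer: non-overwritten mode -- fail instead of overrunning. */
   static bool sbuf_put(struct demo_sbuf *s, const void *data)
   {
       uint32_t next = (s->tail + 1U) % SLOT_COUNT;

       if (next == s->head) {
           return false;                     /* full: write fails on overrun */
       }
       memcpy((void *)s->slots[s->tail], data, SLOT_SIZE);
       s->tail = next;                       /* publish after the copy */
       return true;
   }

   /* Consumer: returns false when the ring is empty. */
   static bool sbuf_get(struct demo_sbuf *s, void *data)
   {
       if (s->head == s->tail) {
           return false;                     /* empty */
       }
       memcpy(data, (const void *)s->slots[s->head], SLOT_SIZE);
       s->head = (s->head + 1U) % SLOT_COUNT;
       return true;
   }

Because only one side ever writes ``tail`` and only the other side ever
writes ``head``, no lock is needed, which is exactly the property the
per-CPU sbuf relies on.
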
sbuf APIs
=========

.. note:: Reference the APIs defined in ``hypervisor/include/debug/sbuf.h``.

ACRN Trace
**********

ACRNTrace is a tool running on the Service OS (SOS) to capture trace
data. It allows developers to add performance profiling trace points at
key locations to get a picture of what is going on inside the
hypervisor. Scripts to analyze the collected trace data are also
provided.

As shown in :numref:`acrntrace-arch`, ACRNTrace is built using
shared buffers (sbuf), and consists of three parts from the bottom layer
up:

- **ACRNTrace userland app**: a userland application collecting trace data to
  files (one per physical CPU)

- **SOS Trace Module**: allocates/frees sbufs, creates a device for each
  sbuf, sets up the sbuf shared between the SOS and HV, and provides a dev node
  for the userland app to retrieve trace data from the sbuf

- **Trace APIs**: provide APIs to generate a trace event and insert it into
  the sbuf.

.. figure:: images/log-image50.png
   :align: center
   :name: acrntrace-arch

   Architectural diagram of ACRNTrace

Trace APIs
==========

.. note:: Reference the APIs defined in ``hypervisor/include/debug/trace.h``
   for the trace_entry struct and functions.

SOS Trace Module
================

The SOS trace module is responsible for:

- allocating an sbuf in the SOS memory range for each physical CPU, and
  assigning the GPA of the sbuf to ``per_cpu sbuf[ACRN_TRACE]``
- creating a misc device for each physical CPU
- providing an mmap operation to map the entire sbuf to userspace for highly
  flexible and efficient access.

On SOS shutdown, the trace module is responsible for removing the misc devices,
freeing the sbufs, and setting ``per_cpu sbuf[ACRN_TRACE]`` to null.

ACRNTrace Application
=====================

The ACRNTrace application includes a binary to retrieve trace data from the
sbuf, and Python scripts to convert trace data from raw format into
readable text and do analysis.

The sequence of trace initialization and trace data collection is as
follows: with a debug build, trace components are initialized at boot
time. After initialization, the HV writes trace event data into the sbuf
until the sbuf is full, which can happen easily if the ACRNTrace app is not
consuming trace data from the sbuf in SOS user space.

Once ACRNTrace is launched, a consumer thread is created for each physical CPU
to periodically read raw trace data from the sbuf and write it to a
file.

.. note:: The figure *Sequence of trace init and trace data collection*
   (Figure 2.2) is missing from this document.

These are the Python scripts provided:

- **acrntrace_format.py** converts raw trace data to human-readable
  text offline according to a given format;

- **acrnalyze.py** analyzes trace data (as output by acrntrace)
  based on given analyzer filters, such as vm_exit or irq, and generates a
  report.

See :ref:`acrntrace` for details and usage.

ACRN Log
********

acrnlog is a tool used to capture ACRN hypervisor logs to files on the
SOS filesystem. It can run as an SOS service at boot, capturing two
kinds of logs:

- current runtime logs;
- logs remaining in the buffer from the last crashed run.

Architectural diagram
=====================

Similar to the design of ACRN Trace, ACRN Log is built on top of the
shared buffer (sbuf), and consists of three parts from the bottom layer
up:

- **ACRN Log app**: a userland application collecting hypervisor logs to
  files;
- **SOS ACRN Log Module**: constructs/frees sbufs at a reserved memory
  area, creates devices for the current/last logs, sets up the sbuf shared
  between the SOS and HV, and provides a dev node for the userland app to
  retrieve logs;
- **ACRN log support in HV**: puts logs at the specified loglevel into the sbuf.

.. figure:: images/log-image73.png
   :align: center

   Architectural diagram of ACRN Log

ACRN log support in Hypervisor
==============================

To support acrnlog, the following adaptations were made to the hypervisor log
system:

- log messages with a severity level higher than a specified value are
  put into the sbuf when logmsg is called in the hypervisor
- an sbuf is allocated to accommodate early hypervisor logs before the SOS
  can allocate and set up the sbuf

There are 6 different loglevels, as shown below. The specified
severity loglevel is stored in ``mem_loglevel``, initialized
by :option:`CONFIG_MEM_LOGLEVEL_DEFAULT`. The loglevel can
be set to a new value
at runtime via the hypervisor shell command "loglevel".

.. code-block:: c

   #define LOG_FATAL     1U
   #define LOG_ACRN      2U
   #define LOG_ERROR     3U
   #define LOG_WARNING   4U
   #define LOG_INFO      5U
   #define LOG_DEBUG     6U

The element size of the sbuf for logs is fixed at 80 bytes, and the max size
of a single log message is 320 bytes. Log messages with a length between
80 and 320 bytes are split into multiple sbuf elements. Log
messages longer than 320 bytes are truncated. The sketch below illustrates
how a message passes the loglevel check and is split into elements.

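A minimal sketch of that logic, using the loglevel constants and sizes given
above; the function and helper names are hypothetical, and the comparison
direction is inferred from the numbering (smaller value = higher severity),
so the real logmsg implementation will differ in detail.

.. code-block:: c

   #include <stdint.h>
   #include <string.h>

   #define LOG_ELEMENT_SIZE   80U    /* fixed sbuf element size for logs */
   #define LOG_MESSAGE_MAX    320U   /* longer messages are truncated    */

   extern uint32_t mem_loglevel;     /* CONFIG_MEM_LOGLEVEL_DEFAULT at boot */

   /* Hypothetical helper: copy one element into the per-cpu ACRN_LOG sbuf. */
   void sbuf_put_log_element(const uint8_t element[LOG_ELEMENT_SIZE]);

   static void demo_logmsg(uint32_t severity, const char *msg)
   {
       /* Smaller value means higher severity (LOG_FATAL == 1U), so only
        * messages at or above the configured threshold are kept. */
       if (severity > mem_loglevel) {
           return;
       }

       size_t len = strlen(msg);
       if (len > LOG_MESSAGE_MAX) {
           len = LOG_MESSAGE_MAX;                   /* truncate at 320 bytes */
       }

       for (size_t off = 0; off < len; off += LOG_ELEMENT_SIZE) {
           uint8_t element[LOG_ELEMENT_SIZE] = {0};
           size_t chunk = (len - off < LOG_ELEMENT_SIZE) ? (len - off)
                                                         : LOG_ELEMENT_SIZE;
           memcpy(element, msg + off, chunk);       /* 80-byte slices */
           sbuf_put_log_element(element);
       }
   }
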
For security, the SOS allocates the sbuf in its memory range and assigns it to
the hypervisor. To handle log messages before the SOS boots, an sbuf for each
physical CPU is allocated in the ACRN hypervisor memory range for any
early log entries. Once the sbuf in the SOS memory range is allocated and
assigned to the hypervisor via hypercall, the hypervisor logmsg switches
to the SOS-allocated sbuf, the early logs are copied over, and the early sbuf
in the hypervisor memory range is freed.

SOS ACRN Log Module
===================

To enable retrieving log messages from a crash, 4 MB of memory starting at
0x6DE00000 is reserved for acrnlog. This space is further divided into
two ranges, one for the current run and one for the last (previous) run:

.. figure:: images/log-image59.png
   :align: center

   ACRN Log crash log/current log buffers

On SOS boot, the SOS acrnlog module is responsible for:

- examining whether there are log messages remaining from the last crashed
  run by checking the magic number of each sbuf

- if there are previous crash logs, constructing sbufs and creating misc
  devices for these last logs

- constructing an sbuf in the usable buffer range for each physical CPU,
  assigning the GPA of the sbuf to ``per_cpu sbuf[ACRN_LOG]``, and creating a
  misc device for each physical CPU

- the misc devices implement the read() file operation to allow a
  userspace app to read one sbuf element at a time.

When checking the validity of the sbufs for last-log examination, the module
sets the current sbuf's magic number to ``0x5aa57aa71aa13aa3`` and changes the
magic number of the last sbuf to ``0x5aa57aa71aa13aa2``, to distinguish
the current sbuf from the last one.

On SOS shutdown, the module is responsible for removing the misc devices,
freeing the sbufs, and setting ``per_cpu sbuf[ACRN_LOG]`` to null.

ACRN Log Application
====================

The ACRNLog application reads log messages from the sbuf for each physical
CPU and combines them into log files, with log messages in ascending
order by the global sequence number. If the sequence number is not
continuous, a warning of "incontinuous logs" is inserted.

To avoid using up storage space, the size of a single log file and
the total number of log files are both limited. By default, the log file
size limit is 1 MB and the file number limit is 4.

If there are last-log devices, ACRN log reads out the log
messages, combines them, and saves them into last-log files.

See :ref:`acrnlog` for usage details.

@@ -1,763 +0,0 @@
.. _hld-virtio-devices:
.. _virtio-hld:

Virtio devices high-level design
################################

The ACRN Hypervisor follows the `Virtual I/O Device (virtio)
specification
<http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html>`_ to
realize I/O virtualization for many performance-critical devices
supported in the ACRN project. Adopting the virtio specification lets us
reuse many frontend virtio drivers already available in a Linux-based
User OS, drastically reducing potential development effort for frontend
virtio drivers. To further reduce the development effort of backend
virtio drivers, the hypervisor provides the virtio backend service
(VBS) APIs, which make it very straightforward to implement a virtio
device in the hypervisor.

The virtio APIs can be divided into 3 groups: DM APIs, virtio backend
service (VBS) APIs, and virtqueue (VQ) APIs, as shown in
:numref:`be-interface`.

.. figure:: images/virtio-hld-image0.png
   :width: 900px
   :align: center
   :name: be-interface

   ACRN Virtio Backend Service Interface

- **DM APIs** are exported by the DM, and are mainly used during the
  device initialization phase and at runtime. The DM APIs also include
  PCIe emulation APIs because each virtio device is a PCIe device in
  the SOS and UOS.
- **VBS APIs** are mainly exported by the VBS and related modules.
  Generally they are callbacks to be
  registered into the DM.
- **VQ APIs** are used by a virtio backend device to access and parse
  information from the shared memory between the frontend and backend
  device drivers.

The virtio framework is the para-virtualization specification that ACRN
follows to implement I/O virtualization of performance-critical
devices such as audio, eAVB/TSN, IPU, and CSMU devices. This section gives
an overview of virtio history, motivation, and advantages, and then
highlights key virtio concepts. Second, this section describes
ACRN's virtio architectures and elaborates on the ACRN virtio APIs. Finally,
this section introduces all the virtio devices currently supported
by ACRN.

Virtio introduction
*******************

Virtio is an abstraction layer over devices in a para-virtualized
hypervisor. Virtio was developed by Rusty Russell when he worked at IBM
Research to support his lguest hypervisor in 2007, and it quickly became
the de facto standard for KVM's para-virtualized I/O devices.

Virtio is very popular for virtual I/O devices because it provides a
straightforward, efficient, standard, and extensible mechanism, and
eliminates the need for boutique, per-environment, or per-OS mechanisms.
For example, rather than having a variety of device emulation
mechanisms, virtio provides a common frontend driver framework that
standardizes device interfaces and increases code reuse across
different virtualization platforms.

Given the advantages of virtio, ACRN also follows the virtio
specification.

Key Concepts
************

To better understand virtio, especially its usage in ACRN, we'll
highlight several key virtio concepts important to ACRN:

Frontend virtio driver (FE)
  Virtio adopts a frontend-backend architecture that enables a simple but
  flexible framework for both frontend and backend virtio drivers. The FE
  driver merely needs to offer services to configure the interface, pass
  messages, produce requests, and kick the backend virtio driver. As a result,
  the FE driver is easy to implement and the performance overhead of emulating
  a device is eliminated.

Backend virtio driver (BE)
  Similar to the FE driver, the BE driver, running either in user-land or
  kernel-land of the host OS, consumes requests from the FE driver and sends
  them to the host native device driver. Once the requests are done by the host
  native device driver, the BE driver notifies the FE driver that the
  request is complete.

  Note: to distinguish the BE driver from the host native device driver, the
  host native device driver is called the "native driver" in this document.

Straightforward: virtio devices as standard devices on existing buses
  Instead of creating new device buses from scratch, virtio devices are
  built on existing buses. This gives a straightforward way for both FE
  and BE drivers to interact with each other. For example, the FE driver could
  read/write registers of the device, and the virtual device could
  interrupt the FE driver, on behalf of the BE driver, in case something of
  interest is happening.

  Currently virtio supports the PCI/PCIe bus and the MMIO bus. In ACRN, only
  the PCI/PCIe bus is supported, and all the virtio devices share the same
  vendor ID 0x1AF4.

  Note: For MMIO, the "bus" is a bit of an overstatement since
  basically it is a few descriptors describing the devices.

Efficient: batching operation is encouraged
  Batching operations and deferred notification are important to achieve
  high-performance I/O, since notification between the FE and BE driver
  usually involves an expensive exit of the guest. Therefore batching
  operations and notification suppression are highly encouraged where
  possible. This gives an efficient implementation for
  performance-critical devices.

Standard: virtqueue
  All virtio devices share a standard ring buffer and descriptor
  mechanism, called a virtqueue, shown in :numref:`virtqueue`. A virtqueue is a
  queue of scatter-gather buffers. There are three important methods on
  virtqueues:

  - **add_buf** is for adding a request/response buffer to a virtqueue,
  - **get_buf** is for getting a response/request from a virtqueue, and
  - **kick** is for notifying the other side that there are buffers to consume
    in a virtqueue.

  The virtqueues are created in guest physical memory by the FE drivers.
  BE drivers only need to parse the virtqueue structures to obtain
  the requests and process them. How a virtqueue is organized is
  specific to the Guest OS. In the Linux implementation of virtio, the
  virtqueue is implemented as a ring buffer structure called vring.

  In ACRN, the virtqueue APIs can be leveraged directly so that users
  don't need to worry about the details of the virtqueue. (Refer to the guest
  OS for more details about the virtqueue implementation.)

  .. figure:: images/virtio-hld-image2.png
     :width: 900px
     :align: center
     :name: virtqueue

     Virtqueue

Extensible: feature bits
  A simple, extensible feature negotiation mechanism exists for each
  virtual device and its driver. Each virtual device can claim its
  device-specific features while the corresponding driver can respond to
  the device with the subset of features the driver understands. The
  feature mechanism enables forward and backward compatibility for the
  virtual device and driver.

Virtio Device Modes
  The virtio specification defines three modes of virtio devices:
  a legacy mode device, a transitional mode device, and a modern mode
  device. A legacy mode device is compliant with virtio specification
  version 0.95, a transitional mode device is compliant with both
  the 0.95 and 1.0 spec versions, and a modern mode
  device is only compatible with the version 1.0 specification.

  In ACRN, all the virtio devices are transitional devices, meaning that
  they should be compatible with both the 0.95 and 1.0 versions of the virtio
  specification.

Virtio Device Discovery
  Virtio devices are commonly implemented as PCI/PCIe devices. A
  virtio device using virtio over the PCI/PCIe bus must expose an interface to
  the Guest OS that meets the PCI/PCIe specifications.

  Conventionally, any PCI device with Vendor ID 0x1AF4,
  PCI_VENDOR_ID_REDHAT_QUMRANET, and Device ID 0x1000 through 0x107F
  inclusive is a virtio device. Among the Device IDs, the
  legacy/transitional mode virtio devices occupy the first 64 IDs ranging
  from 0x1000 to 0x103F, while the range 0x1040-0x107F belongs to
  virtio modern devices. In addition, the Subsystem Vendor ID should
  reflect the PCI/PCIe vendor ID of the environment, and the Subsystem
  Device ID indicates which virtio device is supported by the device.

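The discovery rules above translate directly into a check on the PCI IDs.
The following is a small illustrative helper, not taken from the ACRN
sources:

.. code-block:: c

   #include <stdint.h>
   #include <stdbool.h>

   #define VIRTIO_VENDOR_ID          0x1AF4u  /* PCI_VENDOR_ID_REDHAT_QUMRANET */
   #define VIRTIO_DEV_ID_MIN         0x1000u
   #define VIRTIO_DEV_ID_LEGACY_MAX  0x103Fu  /* legacy/transitional range */
   #define VIRTIO_DEV_ID_MAX         0x107Fu  /* modern range ends here    */

   /* True if the PCI vendor/device ID pair identifies a virtio device. */
   static bool is_virtio_device(uint16_t vendor_id, uint16_t device_id)
   {
       return (vendor_id == VIRTIO_VENDOR_ID) &&
              (device_id >= VIRTIO_DEV_ID_MIN) &&
              (device_id <= VIRTIO_DEV_ID_MAX);
   }

   /* True for the modern-only device ID range (0x1040-0x107F). */
   static bool is_modern_virtio_id(uint16_t device_id)
   {
       return (device_id > VIRTIO_DEV_ID_LEGACY_MAX) &&
              (device_id <= VIRTIO_DEV_ID_MAX);
   }
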
Virtio Frameworks
*****************

This section describes the overall architecture of virtio, and then
introduces the ACRN-specific implementations of the virtio framework.

Architecture
============

Virtio adopts a frontend-backend
architecture, as shown in :numref:`virtio-arch`. Basically the FE and BE
drivers communicate with each other through shared memory, via the
virtqueues. The FE driver talks to the BE driver in the same way it
would talk to a real PCIe device. The BE driver handles requests
from the FE driver, and notifies the FE driver if the request has been
processed.

.. figure:: images/virtio-hld-image1.png
   :width: 900px
   :align: center
   :name: virtio-arch

   Virtio Architecture

In addition to virtio's frontend-backend architecture, both FE and BE
drivers follow a layered architecture, as shown in
:numref:`virtio-fe-be`. Each
side has three layers: transports, core models, and device types.
All virtio devices share the same virtio infrastructure, including
virtqueues, feature mechanisms, configuration space, and buses.

.. figure:: images/virtio-hld-image4.png
   :width: 900px
   :align: center
   :name: virtio-fe-be

   Virtio Frontend/Backend Layered Architecture

Virtio Framework Considerations
===============================

How the virtio framework is realized is specific to a
hypervisor implementation. In ACRN, the virtio framework implementations
can be classified into two types, virtio backend service in user-land
(VBS-U) and virtio backend service in kernel-land (VBS-K), according to
where the virtio backend service (VBS) is located. Although different in their
BE drivers, both VBS-U and VBS-K share the same FE drivers. The reason
behind the two virtio implementations is to meet the requirement of
supporting a large number of diverse I/O devices in the ACRN project.

When developing a virtio BE device driver, the device owner should choose
carefully between VBS-U and VBS-K. Generally VBS-U targets
non-performance-critical devices, but enables easy development and
debugging. VBS-K targets performance-critical devices.

The next two sections introduce ACRN's two implementations of the virtio
framework.

User-Land Virtio Framework
==========================

The architecture of the ACRN user-land virtio framework (VBS-U) is shown in
:numref:`virtio-userland`.

The FE driver talks to the BE driver as if it were talking with a PCIe
device. This means that for the "control plane", the FE driver could poke
device registers through PIO or MMIO, and the device will interrupt the FE
driver when something happens. For the "data plane", the communication
between the FE and BE driver is through shared memory, in the form of
virtqueues.

On the service OS side, where the BE driver is located, there are several
key components in ACRN, including the device model (DM), the virtio and HV
service module (VHM), VBS-U, and user-level vring service API helpers.

The DM bridges the FE driver and BE driver since each VBS-U module emulates
a PCIe virtio device. The VHM bridges the DM and the hypervisor by providing
remote memory map APIs and notification APIs. VBS-U accesses the
virtqueue through the user-level vring service API helpers.

.. figure:: images/virtio-hld-image3.png
   :width: 900px
   :align: center
   :name: virtio-userland

   ACRN User-Land Virtio Framework

Kernel-Land Virtio Framework
============================

ACRN supports two kernel-land virtio frameworks: VBS-K, designed from
scratch for ACRN, and Vhost, which is compatible with Linux Vhost.

VBS-K framework
---------------

The architecture of ACRN VBS-K is shown in
:numref:`kernel-virtio-framework` below.

Generally VBS-K provides acceleration for performance-critical
devices emulated by VBS-U modules by handling the "data plane" of the
devices directly in the kernel. When VBS-K is enabled for certain
devices, the kernel-land vring service API helpers, instead of the
user-land helpers, are used to access the virtqueues shared by the FE
driver. Compared to VBS-U, this eliminates the overhead of copying data
back and forth between user-land and kernel-land within the service OS, but
pays for it with the extra implementation complexity of the BE drivers.

Except for the differences mentioned above, VBS-K still relies on VBS-U
for feature negotiation between the FE and BE drivers. This means the
"control plane" of the virtio device still remains in VBS-U. When
feature negotiation is done, which is determined by the FE driver setting up
an indicative flag, the VBS-K module is initialized by VBS-U.
Afterwards, all request handling is offloaded to VBS-K in the
kernel.

Finally, the FE driver is not aware of how the BE driver is implemented,
whether in VBS-U or VBS-K. This saves engineering effort regarding FE
driver development.

.. figure:: images/virtio-hld-image54.png
   :align: center
   :name: kernel-virtio-framework

   ACRN Kernel Land Virtio Framework

Vhost framework
---------------

Vhost is similar to VBS-K. Vhost is a common solution upstreamed in the
Linux kernel, with several kernel mediators based on it.

Architecture
~~~~~~~~~~~~

Vhost/virtio is a semi-virtualized device abstraction interface
specification that has been widely applied in various virtualization
solutions. Vhost is a specific kind of virtio where the data plane is
put into host kernel space to reduce context switching while processing
I/O requests. It is usually called "virtio" when used as a frontend
driver in a guest operating system, or "vhost" when used as a backend
driver in a host. Compared with a pure virtio solution on a host, vhost
uses the same frontend driver as the virtio solution and can achieve better
performance. :numref:`vhost-arch` shows the vhost architecture on ACRN.

.. figure:: images/virtio-hld-image71.png
   :align: center
   :name: vhost-arch

   Vhost Architecture on ACRN

Compared with a userspace virtio solution, vhost moves the data plane
from user space to kernel space. The general vhost data plane workflow
can be described as follows (see the eventfd sketch after this list):

1. The vhost proxy creates two eventfds per virtqueue: one for kick
   (an ioeventfd), the other for call (an irqfd).
2. The vhost proxy registers the two eventfds to the VHM through the VHM
   character device:

   a) The ioeventfd is bound to a PIO/MMIO range. If it is a PIO, it is
      registered with (fd, port, len, value). If it is an MMIO, it is
      registered with (fd, addr, len).
   b) The irqfd is registered with an MSI vector.

3. The vhost proxy sets the two fds in the vhost kernel through ioctls on the
   vhost device.
4. vhost starts polling the kick fd and wakes up when the guest kicks a
   virtqueue, which results in an event_signal on the kick fd by the VHM
   ioeventfd.
5. The vhost device in the kernel signals on the irqfd to notify the guest.

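A minimal sketch of step 1, using the standard Linux ``eventfd(2)`` API. The
registration steps that follow (the VHM character device and the vhost
ioctls) are only indicated by comments, since the VHM request codes are not
reproduced here.

.. code-block:: c

   #include <sys/eventfd.h>
   #include <stdio.h>

   /* Create the two notification fds used per virtqueue, as in step 1. */
   static int create_virtqueue_eventfds(int *kick_fd, int *call_fd)
   {
       *kick_fd = eventfd(0, EFD_NONBLOCK);  /* ioeventfd: guest -> host kick  */
       *call_fd = eventfd(0, EFD_NONBLOCK);  /* irqfd: host -> guest interrupt */

       if (*kick_fd < 0 || *call_fd < 0) {
           perror("eventfd");
           return -1;
       }

       /* Next (not shown): register kick_fd/call_fd with the VHM character
        * device (step 2) and hand them to the vhost device with the
        * VHOST_SET_VRING_KICK / VHOST_SET_VRING_CALL ioctls (step 3). */
       return 0;
   }
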
Ioeventfd implementation
~~~~~~~~~~~~~~~~~~~~~~~~

The ioeventfd module is implemented in the VHM, and can enhance a registered
eventfd to listen to I/O requests (PIO/MMIO) from the VHM ioreq module and
signal the eventfd when needed. :numref:`ioeventfd-workflow` shows the
general workflow of ioeventfd.

.. figure:: images/virtio-hld-image58.png
   :align: center
   :name: ioeventfd-workflow

   ioeventfd general workflow

The workflow can be summarized as:

1. The vhost device initializes. The vhost proxy creates two eventfds for
   ioeventfd and irqfd.
2. The ioeventfd is passed to the vhost kernel driver.
3. The ioeventfd is passed to the VHM driver.
4. The UOS FE driver triggers an ioreq, which is forwarded to the SOS by the
   hypervisor.
5. The ioreq is dispatched by the VHM driver to the related VHM client.
6. The ioeventfd VHM client traverses the io_range list and finds the
   corresponding eventfd.
7. The signal is triggered on the related eventfd.

Irqfd implementation
~~~~~~~~~~~~~~~~~~~~

The irqfd module is implemented in the VHM, and can enhance a registered
eventfd to inject an interrupt into a guest OS when the eventfd gets
signaled. :numref:`irqfd-workflow` shows the general flow for irqfd.

.. figure:: images/virtio-hld-image60.png
   :align: center
   :name: irqfd-workflow

   irqfd general flow

The workflow can be summarized as:

1. The vhost device initializes. The vhost proxy creates two eventfds for
   ioeventfd and irqfd.
2. The irqfd is passed to the vhost kernel driver.
3. The irqfd is passed to the VHM driver.
4. The vhost device driver signals the irq eventfd once the related native
   transfer is completed.
5. The irqfd-related logic traverses the irqfd list to retrieve the related
   irq information.
6. The irqfd-related logic injects an interrupt through the VHM interrupt API.
7. The interrupt is delivered to the UOS FE driver through the hypervisor.

Virtio APIs
***********

This section provides details on the ACRN virtio APIs. As outlined previously,
the ACRN virtio APIs can be divided into three groups: DM APIs,
VBS APIs, and VQ APIs. The following sections elaborate on
these APIs.

VBS-U Key Data Structures
=========================

The key data structures for VBS-U are listed as follows, and their
relationships are shown in :numref:`VBS-U-data`.

``struct pci_virtio_blk``
  An example virtio device, such as virtio-blk.
``struct virtio_common``
  A common component of any virtio device.
``struct virtio_ops``
  Virtio-specific operation functions for this type of virtio device.
``struct pci_vdev``
  Instance of a virtual PCIe device; any virtio
  device is a virtual PCIe device.
``struct pci_vdev_ops``
  The PCIe device's operation functions for this type
  of device.
``struct vqueue_info``
  Instance of a virtqueue.

.. figure:: images/virtio-hld-image5.png
   :width: 900px
   :align: center
   :name: VBS-U-data

   VBS-U Key Data Structures

Each virtio device is a PCIe device. In addition, each virtio device
could have none or multiple virtqueues, depending on the device type.
The ``struct virtio_common`` is a key data structure to be manipulated by the
DM, and the DM finds other key data structures through it. The ``struct
virtio_ops`` abstracts a series of virtio callbacks to be provided by the
device owner.

VBS-K Key Data Structures
=========================

The key data structures for VBS-K are listed as follows, and their
relationships are shown in :numref:`VBS-K-data`.

``struct vbs_k_rng``
  In-kernel VBS-K component handling the data plane of a
  VBS-U virtio device, for example the virtio random number generator.
``struct vbs_k_dev``
  In-kernel VBS-K component common to all VBS-K modules.
``struct vbs_k_vq``
  In-kernel VBS-K component working with the kernel
  vring service API helpers.
``struct vbs_k_dev_info``
  Virtio device information to be synchronized
  from VBS-U to the VBS-K kernel module.
``struct vbs_k_vq_info``
  Information about a single virtqueue to be
  synchronized from VBS-U to the VBS-K kernel module.
``struct vbs_k_vqs_info``
  Information about the virtqueue(s) of a virtio device,
  to be synchronized from VBS-U to the VBS-K kernel module.

.. figure:: images/virtio-hld-image8.png
   :width: 900px
   :align: center
   :name: VBS-K-data

   VBS-K Key Data Structures

In VBS-K, the struct vbs_k_xxx represents the in-kernel component
handling a virtio device's data plane. It presents a char device for VBS-U
to open and to register the device status after feature negotiation with the
FE driver.

The device status includes the negotiated features, the number of virtqueues,
interrupt information, and more. All of this status is synchronized
from VBS-U to VBS-K. In VBS-U, the ``struct vbs_k_dev_info`` and ``struct
vbs_k_vqs_info`` collect all the information and notify VBS-K through
ioctls. In VBS-K, the ``struct vbs_k_dev`` and ``struct vbs_k_vq``, which are
common to all VBS-K modules, are the counterparts that preserve the
related information. The related information is necessary for the kernel-land
vring service API helpers.

VHOST Key Data Structures
=========================

The key data structures for vhost are listed as follows.

.. doxygenstruct:: vhost_dev
   :project: Project ACRN

.. doxygenstruct:: vhost_vq
   :project: Project ACRN

DM APIs
=======

The DM APIs are exported by the DM, and they should be used when realizing
BE device drivers on ACRN.

.. doxygenfunction:: paddr_guest2host
   :project: Project ACRN

.. doxygenfunction:: pci_set_cfgdata8
   :project: Project ACRN

.. doxygenfunction:: pci_set_cfgdata16
   :project: Project ACRN

.. doxygenfunction:: pci_set_cfgdata32
   :project: Project ACRN

.. doxygenfunction:: pci_get_cfgdata8
   :project: Project ACRN

.. doxygenfunction:: pci_get_cfgdata16
   :project: Project ACRN

.. doxygenfunction:: pci_get_cfgdata32
   :project: Project ACRN

.. doxygenfunction:: pci_lintr_assert
   :project: Project ACRN

.. doxygenfunction:: pci_lintr_deassert
   :project: Project ACRN

.. doxygenfunction:: pci_generate_msi
   :project: Project ACRN

.. doxygenfunction:: pci_generate_msix
   :project: Project ACRN

VBS APIs
========

The VBS APIs are exported by VBS-related modules, including VBS, DM, and
SOS kernel modules. They can be classified into the VBS-U and VBS-K APIs
listed as follows.

VBS-U APIs
----------

These APIs provided by VBS-U are callbacks to be registered to the DM, and
the virtio framework within the DM will invoke them appropriately.

.. doxygenstruct:: virtio_ops
   :project: Project ACRN

.. doxygenfunction:: virtio_pci_read
   :project: Project ACRN

.. doxygenfunction:: virtio_pci_write
   :project: Project ACRN

.. doxygenfunction:: virtio_interrupt_init
   :project: Project ACRN

.. doxygenfunction:: virtio_linkup
   :project: Project ACRN

.. doxygenfunction:: virtio_reset_dev
   :project: Project ACRN

.. doxygenfunction:: virtio_set_io_bar
   :project: Project ACRN

.. doxygenfunction:: virtio_set_modern_bar
   :project: Project ACRN

.. doxygenfunction:: virtio_config_changed
   :project: Project ACRN

VBS-K APIs
|
||||
----------
|
||||
|
||||
The VBS-K APIs are exported by VBS-K related modules. Users could use
|
||||
the following APIs to implement their VBS-K modules.
|
||||
|
||||
APIs provided by DM
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. doxygenfunction:: vbs_kernel_reset
|
||||
:project: Project ACRN
|
||||
|
||||
.. doxygenfunction:: vbs_kernel_start
|
||||
:project: Project ACRN
|
||||
|
||||
.. doxygenfunction:: vbs_kernel_stop
|
||||
:project: Project ACRN
|
||||
|
||||
APIs provided by VBS-K modules in service OS
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. kernel-doc:: include/linux/vbs/vbs.h
|
||||
:functions: virtio_dev_init
|
||||
virtio_dev_ioctl
|
||||
virtio_vqs_ioctl
|
||||
virtio_dev_register
|
||||
virtio_dev_deregister
|
||||
virtio_vqs_index_get
|
||||
virtio_dev_reset
|
||||
|
||||
VHOST APIS
|
||||
==========
|
||||
|
||||
APIs provided by DM
|
||||
-------------------
|
||||
|
||||
.. doxygenfunction:: vhost_dev_init
|
||||
:project: Project ACRN
|
||||
|
||||
.. doxygenfunction:: vhost_dev_deinit
|
||||
:project: Project ACRN
|
||||
|
||||
.. doxygenfunction:: vhost_dev_start
|
||||
:project: Project ACRN
|
||||
|
||||
.. doxygenfunction:: vhost_dev_stop
|
||||
:project: Project ACRN
|
||||
|
||||
Linux vhost IOCTLs
|
||||
------------------
|
||||
|
||||
``#define VHOST_GET_FEATURES _IOR(VHOST_VIRTIO, 0x00, __u64)``
|
||||
This IOCTL is used to get the supported feature flags by vhost kernel driver.
|
||||
``#define VHOST_SET_FEATURES _IOW(VHOST_VIRTIO, 0x00, __u64)``
|
||||
This IOCTL is used to set the supported feature flags to vhost kernel driver.
|
||||
``#define VHOST_SET_OWNER _IO(VHOST_VIRTIO, 0x01)``
|
||||
This IOCTL is used to set current process as the exclusive owner of the vhost
|
||||
char device. It must be called before any other vhost commands.
|
||||
``#define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)``
|
||||
This IOCTL is used to give up the ownership of the vhost char device.
|
||||
``#define VHOST_SET_MEM_TABLE _IOW(VHOST_VIRTIO, 0x03, struct vhost_memory)``
|
||||
This IOCTL is used to convey the guest OS memory layout to vhost kernel driver.
|
||||
``#define VHOST_SET_VRING_NUM _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring_state)``
|
||||
This IOCTL is used to set the number of descriptors in virtio ring. It cannot
|
||||
be modified while the virtio ring is running.
|
||||
``#define VHOST_SET_VRING_ADDR _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring_addr)``
|
||||
This IOCTL is used to set the address of the virtio ring.
|
||||
``#define VHOST_SET_VRING_BASE _IOW(VHOST_VIRTIO, 0x12, struct vhost_vring_state)``
|
||||
This IOCTL is used to set the base value where virtqueue looks for available
|
||||
descriptors.
|
||||
``#define VHOST_GET_VRING_BASE _IOWR(VHOST_VIRTIO, 0x12, struct vhost_vring_state)``
|
||||
This IOCTL is used to get the base value where virtqueue looks for available
|
||||
descriptors.
|
||||
``#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring_file)``
|
||||
This IOCTL is used to set the eventfd on which vhost can poll for guest
|
||||
virtqueue kicks.
|
||||
``#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring_file)``
|
||||
This IOCTL is used to set the eventfd which is used by vhost do inject
|
||||
virtual interrupt.
|
||||
|
||||
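
For orientation, the fragment below strings these IOCTLs together in the
order a user-land back end typically drives a vhost char device such as
``/dev/vhost-net``; error handling is trimmed, and the kick/call eventfds
are assumed to have been created with ``eventfd(2)`` by the caller.

.. code-block:: c

   #include <fcntl.h>
   #include <sys/ioctl.h>
   #include <linux/vhost.h>

   /* Minimal vhost bring-up sequence for one virtqueue (index 0).
    * 'mem' describes the guest memory layout; 'kick_fd'/'call_fd' are
    * eventfds created by the caller. */
   static int vhost_bringup(struct vhost_memory *mem, int kick_fd, int call_fd,
                            unsigned int num_descs, struct vhost_vring_addr *addr)
   {
       int fd = open("/dev/vhost-net", O_RDWR);
       __u64 features;
       struct vhost_vring_state state = { .index = 0 };
       struct vhost_vring_file  file  = { .index = 0 };

       if (fd < 0)
           return -1;

       ioctl(fd, VHOST_SET_OWNER);                 /* claim the char device   */
       ioctl(fd, VHOST_GET_FEATURES, &features);   /* what the kernel offers  */
       ioctl(fd, VHOST_SET_FEATURES, &features);   /* accept (or mask) them   */
       ioctl(fd, VHOST_SET_MEM_TABLE, mem);        /* guest memory layout     */

       state.num = num_descs;
       ioctl(fd, VHOST_SET_VRING_NUM, &state);     /* ring size               */
       ioctl(fd, VHOST_SET_VRING_ADDR, addr);      /* ring addresses          */

       state.num = 0;
       ioctl(fd, VHOST_SET_VRING_BASE, &state);    /* start from index 0      */

       file.fd = kick_fd;
       ioctl(fd, VHOST_SET_VRING_KICK, &file);     /* FE -> vhost doorbell    */
       file.fd = call_fd;
       ioctl(fd, VHOST_SET_VRING_CALL, &file);     /* vhost -> FE interrupt   */

       return fd;
   }
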
VHM eventfd IOCTLs
------------------

.. doxygenstruct:: acrn_ioeventfd
   :project: Project ACRN

``#define IC_EVENT_IOEVENTFD _IC_ID(IC_ID, IC_ID_EVENT_BASE + 0x00)``
   Register/unregister an ioeventfd with the appropriate address,
   length, and data value.

.. doxygenstruct:: acrn_irqfd
   :project: Project ACRN

``#define IC_EVENT_IRQFD _IC_ID(IC_ID, IC_ID_EVENT_BASE + 0x01)``
   Register/unregister an irqfd with the appropriate MSI information.

VQ APIs
=======

The virtqueue APIs, or VQ APIs, are used by a BE device driver to
access the virtqueues shared by the FE driver. The VQ APIs abstract the
details of virtqueues so that users don't need to worry about the data
structures within the virtqueues. In addition, the VQ APIs are designed
to be identical between VBS-U and VBS-K, so that users don't need to
learn different APIs when implementing BE drivers based on VBS-U and
VBS-K.

.. doxygenfunction:: vq_interrupt
   :project: Project ACRN

.. doxygenfunction:: vq_getchain
   :project: Project ACRN

.. doxygenfunction:: vq_retchain
   :project: Project ACRN

.. doxygenfunction:: vq_relchain
   :project: Project ACRN

.. doxygenfunction:: vq_endchains
   :project: Project ACRN

Below is an example showing the typical logic of how a BE driver handles
requests from an FE driver.

.. code-block:: c

   static void BE_callback(struct pci_virtio_xxx *pv, struct vqueue_info *vq)
   {
      while (vq_has_descs(vq)) {
         vq_getchain(vq, &idx, &iov, 1, NULL);
         /* handle requests in iov */
         request_handle_proc();
         /* release this chain and handle more */
         vq_relchain(vq, idx, len);
      }
      /* generate an interrupt if appropriate; 1 means "ring empty" */
      vq_endchains(vq, 1);
   }

Supported Virtio Devices
************************

All the BE virtio drivers are implemented using the
ACRN virtio APIs, and the FE drivers reuse the standard Linux FE
virtio drivers. Devices with FE drivers available in the Linux
kernel should use the standard virtio Vendor ID/Device ID and
Subsystem Vendor ID/Subsystem Device ID. For other devices within ACRN,
their temporary IDs are listed in the following table.

.. table:: Virtio Devices without existing FE drivers in Linux
   :align: center
   :name: virtio-device-table

   +--------------+-------------+-------------+-------------+-------------+
   | virtio       | Vendor ID   | Device ID   | Subvendor   | Subdevice   |
   | device       |             |             | ID          | ID          |
   +--------------+-------------+-------------+-------------+-------------+
   | RPMB         | 0x8086      | 0x8601      | 0x8086      | 0xFFFF      |
   +--------------+-------------+-------------+-------------+-------------+
   | HECI         | 0x8086      | 0x8602      | 0x8086      | 0xFFFE      |
   +--------------+-------------+-------------+-------------+-------------+
   | audio        | 0x8086      | 0x8603      | 0x8086      | 0xFFFD      |
   +--------------+-------------+-------------+-------------+-------------+
   | IPU          | 0x8086      | 0x8604      | 0x8086      | 0xFFFC      |
   +--------------+-------------+-------------+-------------+-------------+
   | TSN/AVB      | 0x8086      | 0x8605      | 0x8086      | 0xFFFB      |
   +--------------+-------------+-------------+-------------+-------------+
   | hyper_dmabuf | 0x8086      | 0x8606      | 0x8086      | 0xFFFA      |
   +--------------+-------------+-------------+-------------+-------------+
   | HDCP         | 0x8086      | 0x8607      | 0x8086      | 0xFFF9      |
   +--------------+-------------+-------------+-------------+-------------+
   | COREU        | 0x8086      | 0x8608      | 0x8086      | 0xFFF8      |
   +--------------+-------------+-------------+-------------+-------------+

The following sections introduce the status of virtio devices currently
supported in ACRN.

.. toctree::
   :maxdepth: 1

   virtio-blk
   virtio-net
   virtio-input
   virtio-console
   virtio-rnd
@@ -1,161 +0,0 @@

.. _hld-vm-management:

VM Management high-level design
###############################

Managing a Virtual Machine (VM) means switching the VM to the right
state, according to the requirements of applications or system power
operations.

VM state
********

Generally, a VM is not running at the beginning: it is in a 'stopped'
state. After its UOS is launched successfully, the VM enters a 'running'
state. When the UOS powers off, the VM returns to a 'stopped' state again.
A UOS can sleep while it is running, so there is also a 'paused' state.

Because VMs are designed to work under an SOS environment, a VM can
only run and change its state when the SOS is running. A VM must be put
into the 'paused' or 'stopped' state before the SOS can sleep or power
off. Otherwise the VM may be damaged and user data would be lost.

Scenarios of VM state change
****************************

Button-initiated System Power On
================================

When the user presses the power button to power on the system,
everything is started from the beginning. VMs that run user applications
are launched automatically after the SOS is ready.

Button-initiated VM Power on
============================

At SOS boot up, SOS-Life-Cycle-Service and Acrnd are automatically started
as system services. SOS-Life-Cycle-Service notifies Acrnd that the SOS has
started, then Acrnd starts an Acrn-DM to launch each UOS, whose state
changes from 'stopped' to 'running'.

Button-initiated VM Power off
=============================

When the SOS is about to shut down, the IOC powers off all VMs.
SOS-Life-Cycle-Service delays the SOS shutdown operation using the
heartbeat, and waits for Acrnd to notify it that it can shut down.

Acrnd keeps querying the states of all VMs. When all of them are 'stopped',
it notifies SOS-Life-Cycle-Service. SOS-Life-Cycle-Service then stops
sending the delay-shutdown heartbeat, allowing the SOS to continue the
shutdown process.

RTC S3/S5 entry
===============

The UOS asks Acrnd to resume/restart itself later by sending an RTC timer
request, and then suspends/powers off. The SOS suspends/powers off before
that RTC timer expires. Acrnd stores the RTC resume/restart time in a file
and sends the RTC timer request to SOS-Life-Cycle-Service.
SOS-Life-Cycle-Service sets the RTC timer in the IOC. Finally, the SOS is
suspended/powered off.

RTC S3/S5 exiting
=================

The SOS is resumed/started by the IOC RTC timer. SOS-Life-Cycle-Service
notifies Acrnd that the SOS has become alive again. Acrnd checks that the
wakeup reason was the SOS being resumed/started by the IOC RTC. It then
reads the UOS resume/restart time from the file, and resumes/restarts the
UOS when that time expires.

VM State management
*******************

Overview of VM State Management
===============================

Management of VMs on the SOS uses the
SOS-Life-Cycle-Service, Acrnd, and Acrn-DM, working together and using
the Acrn-Manager API as the IPC interface.

* The Lifecycle-Service gets the Wakeup-Reason from the IOC controller. It
  can set a different power cycle method, and the RTC timer, by sending a
  heartbeat to the IOC with the proper data.

* Acrnd gets the Wakeup Reason from the Lifecycle-Service and forwards it to
  Acrn-DM. It coordinates the lifecycle of the VMs and the SOS and handles
  IOC-timed wakeup/poweron.

* Acrn-DM is the device model of a VM running on the SOS. The virtual IOC
  inside Acrn-DM is responsible for controlling the VM power state, usually
  triggered by Acrnd.

SOS Life Cycle Service
======================

SOS-Life-Cycle-Service (SOS-LCS) is a daemon service running on the SOS.

SOS-LCS listens on the ``/dev/cbc-lifecycle`` tty port to receive "wakeup
reason" information from the IOC controller. SOS-LCS keeps reading the
system status from the IOC, to discover which power cycle method the IOC is
performing. SOS-LCS replies to the IOC with a heartbeat. This heartbeat can
tell the IOC to keep using this power cycle method, or to change to another
power cycle method. The SOS-LCS heartbeat can also set an RTC timer in the
IOC.

SOS-LCS handles SHUTDOWN, SUSPEND, and REBOOT acrn-manager message
requests from Acrnd. When these messages are received, SOS-LCS switches the
IOC power cycle method to shutdown, suspend, and reboot, respectively.

SOS-LCS handles WAKEUP_REASON acrn-manager message requests from Acrnd.
When it receives this message, SOS-LCS sends the "wakeup reason" to Acrnd.

SOS-LCS handles RTC_TIMER acrn-manager message requests from Acrnd.
When it receives this message, SOS-LCS sets up the IOC RTC timer for Acrnd.

SOS-LCS notifies Acrnd at the moment the system becomes alive again from
another state.

Acrnd
=====

Acrnd is a daemon service running on the SOS.

Acrnd can start/resume VMs and query VM states for SOS-LCS, helping
SOS-LCS decide which power cycle method is right. It also helps a UOS
be started/resumed by a timer, as required by the S3/S5 feature.

Acrnd forwards the wakeup reason to acrn-dm. Acrnd is responsible for
retrieving the wakeup reason from the SOS-LCS service and attaching the
wakeup reason to the acrn-dm parameters for ioc-dm.

When the SOS is about to suspend/shutdown, the SOS lifecycle service sends
a request to Acrnd to guarantee all guest VMs are suspended or shut down
before the SOS suspend/shutdown process continues. On receiving the
request, Acrnd starts polling the guest VM states, and notifies the SOS
lifecycle service when all guest VMs have been put into the proper state
gracefully.

A guest UOS may need to resume/start at a future time for some tasks. To
set up a timed resume/start, ioc-dm sends a request to acrnd, which
maintains a list of timed requests from guest VMs. acrnd selects the
nearest request and sends it to the SOS lifecycle service, which sets up
the physical IOC.

Acrn-DM
=======

Acrn-DM is the device model of a VM running on the SOS. Dm-IOC inside
Acrn-DM operates the virtual IOC to control the VM power state, and
collects VM power state information. The Acrn-DM monitor abstracts these
virtual IOC functions into monitor-vm-ops, and allows Acrnd to use them via
the Acrn-Manager IPC helper functions.

Acrn-manager IPC helper
=======================

SOS-LCS, Acrnd, and Acrn-DM use sockets for IPC. The Acrn-Manager IPC
helper API makes the sockets transparent for them. These helpers are:

- ``int mngr_open_un()`` - create a descriptor for VM management IPC
- ``void mngr_close()`` - close the descriptor and release its resources
- ``int mngr_add_handler()`` - add a handler for a specified message
- ``int mngr_send_msg()`` - send a message and wait for acknowledgement
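
A rough sketch of how a client might use these helpers follows. The
``MNGR_CLIENT`` flag, the ``WAKEUP_REASON`` message id, the ``struct
mngr_msg`` layout, and the exact prototypes are assumptions made for this
fragment; the authoritative definitions are in the acrn-manager headers.

.. code-block:: c

   /* Sketch only: prototypes and message layout are assumed, not authoritative. */
   #include <string.h>

   struct mngr_msg { unsigned int msgid; unsigned long timestamp; /* ... */ };

   int  mngr_open_un(const char *name, int flags);   /* create IPC descriptor */
   void mngr_close(int fd);                          /* release it            */
   int  mngr_send_msg(int fd, struct mngr_msg *req,
                      struct mngr_msg *ack, unsigned int timeout);

   #define MNGR_CLIENT    1                          /* assumed flag value    */
   #define WAKEUP_REASON  0x100                      /* assumed message id    */

   /* Ask SOS-LCS for the wakeup reason, roughly the way Acrnd does it over
    * the Acrn-Manager IPC channel. */
   static int query_wakeup_reason(void)
   {
       struct mngr_msg req, ack;
       int fd = mngr_open_un("sos-lcs", MNGR_CLIENT);

       if (fd < 0)
           return -1;

       memset(&req, 0, sizeof(req));
       req.msgid = WAKEUP_REASON;

       if (mngr_send_msg(fd, &req, &ack, 2 /* seconds */) < 0) {
           mngr_close(fd);
           return -1;
       }

       mngr_close(fd);
       return 0;
   }
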
@@ -1,4 +0,0 @@

.. _hld-vsbl:

Virtual Slim-Bootloader high-level design
#########################################
@@ -1,51 +0,0 @@

.. _hv-config:

Compile-time Configuration
##########################

The hypervisor provides a kconfig-like way of manipulating compile-time
configurations. Basically, the hypervisor defines a set of configuration
symbols and declares their default values. A configuration file,
containing the values of each symbol, is created before building the
sources.

Similar to Linux kconfig, there are three files involved:

- **.config** This file stores the values of all configuration
  symbols.

- **config.mk** This file is a conversion of .config in Makefile
  syntax, and can be included in makefiles so that the build
  process can rely on the configurations.

- **config.h** This file is a conversion of .config in C syntax, and is
  automatically included in every source file so that the values of
  the configuration symbols are available in the sources, as
  illustrated in the sketch below.

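
As a hedged illustration of that conversion, a .config fragment and the
config.h lines it could turn into might look like the following; the symbol
names are made up for this example and are not taken from the actual ACRN
Kconfig.

.. code-block:: c

   /* Hypothetical excerpt of a generated config.h.  The corresponding
    * .config could contain:
    *
    *   CONFIG_RELEASE=n
    *   CONFIG_NR_IOAPICS=1
    *   CONFIG_SERIAL_PCI_BDF="0:18.2"
    *
    * Boolean symbols set to 'n' simply stay undefined, while numeric and
    * string symbols become plain macros usable anywhere in the sources. */
   #ifndef CONFIG_H
   #define CONFIG_H

   #define CONFIG_NR_IOAPICS      1U
   #define CONFIG_SERIAL_PCI_BDF  "0:18.2"
   /* CONFIG_RELEASE is not defined because it was set to 'n' in .config */

   #endif /* CONFIG_H */
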
.. figure:: images/config-image103.png
   :align: center
   :name: config-build-workflow

   Hypervisor configuration and build workflow

:numref:`config-build-workflow` shows the workflow of building the
hypervisor:

1. Three targets are introduced for manipulating the configurations.

   a. **defconfig** creates a .config based on a predefined
      configuration file.

   b. **oldconfig** updates an existing .config after creating one if it
      does not exist.

   c. **menuconfig** presents a terminal UI to navigate and modify the
      configurations in an interactive manner.

2. The target oldconfig is also used to create a .config if a .config
   file does not exist when building the source directly.

3. The other two files for makefiles and C sources are regenerated after
   .config changes.

Refer to :ref:`configuration` for a complete list of configuration symbols.
@@ -1,101 +0,0 @@

.. _hv-console-shell-uart:

Hypervisor console, hypervisor shell, and virtual UART
######################################################

.. _hv-console:

Hypervisor console
******************

The hypervisor console is a text-based terminal accessible from UART.
:numref:`console-processing` shows the workflow of the console:

.. figure:: images/console-image93.png
   :align: center
   :name: console-processing

   Periodic console processing

A periodic timer is set on initialization to trigger console processing
every 40 ms. Processing behavior depends on whether the vUART is active,
as sketched below:

- If the vUART is not active, the hypervisor shell is kicked to handle
  inputs from the physical UART, if there are any.

- If the vUART is active, the bytes from the physical UART are redirected
  to the RX FIFO of the vUART, and those in the vUART TX FIFO to the
  physical UART.

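
In pseudo-C, the 40 ms timer callback boils down to the branching below.
Every helper name here (``vuart_active``, ``shell_kick``, and the redirect
routines) is a placeholder invented for this sketch, not an actual
hypervisor symbol.

.. code-block:: c

   /* Sketch of the periodic console processing described above. */
   #include <stdbool.h>

   static bool vuart_active(void);                       /* is a vUART attached?   */
   static void shell_kick(void);                         /* feed hypervisor shell  */
   static void redirect_phys_rx_to_vuart_rx_fifo(void);
   static void flush_vuart_tx_fifo_to_phys_uart(void);

   /* Invoked by the periodic timer every 40 ms. */
   static void console_timer_callback(void)
   {
       if (!vuart_active()) {
           /* No active vUART: let the hypervisor shell consume any
            * pending input from the physical UART. */
           shell_kick();
       } else {
           /* Active vUART: shuttle bytes between the physical UART and
            * the vUART RX/TX FIFOs. */
           redirect_phys_rx_to_vuart_rx_fifo();
           flush_vuart_tx_fifo_to_phys_uart();
       }
   }
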
.. note:: The console is only available in the debug version of the
   hypervisor, configured at compile time. In the release version, the
   console is disabled and the physical UART is not used by the hypervisor
   or SOS.

Hypervisor shell
****************

For debugging, the hypervisor shell provides commands to list some
internal states and statistics of the hypervisor. It is accessible on
the physical UART only when the vUART is deactivated. See
:ref:`acrnshell` for the list of available hypervisor shell commands.

Virtual UART
************

Currently the UART 16550 is owned by the hypervisor itself and used for
debugging purposes; its properties are configured on the hypervisor
command line. The hypervisor emulates a UART device at address 0x3F8 for
the SOS, which acts as the SOS console with these features:

- The vUART is exposed via I/O port 0x3f8.
- It incorporates a 256-byte RX buffer and a 65536-byte TX buffer.
- Input/output bytes and the related interrupts are fully emulated.
- For other read-write registers, the value is stored without effect
  and reads get the latest stored value. For read-only registers,
  writes are ignored.
- The vUART is activated via a shell command and deactivated via a hotkey.

The following diagram shows the activation state transition of the vUART.

.. figure:: images/console-image41.png
   :align: center

   vUART activation state transition

Specifically:

- After initialization, the vUART is disabled.
- The vUART is activated after the command ``vm_console`` is executed on
  the hypervisor shell. Inputs to the physical UART will be
  redirected to the vUART starting from the next timer event.

- The vUART is deactivated after a :kbd:`Ctrl + Space` hotkey is received
  from the physical UART. Inputs to the physical UART will be
  handled by the hypervisor shell starting from the next timer
  event.

The workflows are described as follows:

- RX flow:

  - Characters are read from the UART hardware into a 2048-byte sbuf,
    triggered by console_read.

  - Characters are read from this sbuf and put into the rxFIFO,
    triggered by vuart_console_rx_chars.

  - A virtual interrupt is sent to SOS, triggered by a read from
    SOS. Characters in the rxFIFO are sent to SOS by emulation of
    reads of register UART16550_RBR.

- TX flow:

  - Characters are put into the txFIFO by emulation of writes of register
    UART16550_THR.

  - Characters in the txFIFO are read out one by one and sent to the
    console by printf, triggered by vuart_console_tx_chars.

  - The implementation of printf is based on the console, which finally
    sends characters to the UART hardware by writing to register
    UART16550_RBR.
@@ -1,261 +0,0 @@

.. _hv-device-passthrough:

Device Passthrough
##################

A critical part of virtualization is virtualizing devices: exposing all
aspects of a device including its I/O, interrupts, DMA, and configuration.
There are three typical device
virtualization methods: emulation, para-virtualization, and passthrough.
Both emulation and passthrough are used in the ACRN project. Device
emulation is discussed in :ref:`hld-io-emulation`;
device passthrough is discussed here.

In the ACRN project, device emulation means emulating all existing hardware
resources through a software component device model running in the
Service OS (SOS). Device
emulation must maintain the same SW interface as a native device,
providing transparency to the VM software stack. Passthrough, implemented
in the hypervisor, assigns a physical device to a VM so the VM can access
the hardware device directly with minimal (if any) VMM involvement.

The difference between device emulation and passthrough is shown in
:numref:`emu-passthru-diff`. Device emulation has
a longer access path, which causes worse performance compared with
passthrough. Passthrough can deliver near-native performance, but
can't support device sharing.

.. figure:: images/passthru-image30.png
   :align: center
   :name: emu-passthru-diff

   Difference between emulation and passthrough

Passthrough in the hypervisor provides the following functionalities to
allow a VM to access PCI devices directly:

- DMA remapping by VT-d for PCI devices: the hypervisor sets up DMA
  remapping during the VM initialization phase.
- MMIO remapping between virtual and physical BARs
- Device configuration emulation
- Interrupt remapping for PCI devices
- ACPI configuration virtualization
- GSI sharing violation check

The following diagram details the passthrough initialization control flow
in ACRN:

.. figure:: images/passthru-image22.png
   :align: center

   Passthrough devices initialization control flow

Passthrough Device Status
*************************

Most common devices on supported platforms are enabled for
passthrough, as detailed here:

.. figure:: images/passthru-image77.png
   :align: center

   Passthrough Device Status

DMA Remapping
*************

To enable passthrough, DMA accesses from a VM can only use GPAs, while
physical DMA requires HPAs. One work-around
is building an identity mapping so that GPA is equal to HPA, but this
is not recommended as some VMs don't support relocation well. To
address this issue, Intel introduced VT-d in the chipset to add a
remapping engine that translates GPA to HPA for DMA operations.

Each VT-d engine (DMAR unit) maintains a remapping structure
similar to a page table, with the device BDF (Bus/Dev/Func) as input and
the final page table for GPA/HPA translation as output. The GPA/HPA
translation page table is similar to a normal multi-level page table.

VM DMA depends on Intel VT-d to do the translation from GPA to HPA, so the
VT-d IOMMU engine must be enabled in ACRN before any device can be passed
through. The SOS in ACRN is a VM running in non-root mode which also
depends on VT-d to access a device. In the SOS DMA remapping
engine settings, GPA is equal to HPA.

The ACRN hypervisor checks the DMA-Remapping Hardware unit Definition
(DRHD) in the host DMAR ACPI table to get basic info, then sets up each
DMAR unit. For simplicity, ACRN reuses the EPT table as the translation
table in the DMAR unit for each passthrough device. The control flow is
shown in the following figures:

.. figure:: images/passthru-image72.png
   :align: center

   DMA Remapping control flow during HV init

.. figure:: images/passthru-image86.png
   :align: center

   ptdev assignment control flow

.. figure:: images/passthru-image42.png
   :align: center

   ptdev de-assignment control flow

MMIO Remapping
**************

For a PCI MMIO BAR, the hypervisor builds an EPT mapping between the
virtual BAR and the physical BAR, so the VM can access MMIO directly.

Device configuration emulation
******************************

PCI configuration is based on accesses of ports 0xCF8/0xCFC. ACRN
implements PCI configuration emulation to handle 0xCF8/0xCFC and control
PCI devices through two paths: implemented in the hypervisor or in the SOS
device model.

- When configuration emulation is done in the hypervisor, the interception
  of the 0xCF8/0xCFC ports and emulation of PCI configuration space
  accesses are tricky and unclean. Therefore the final solution is to reuse
  the PCI emulation infrastructure of the SOS device model. The hypervisor
  routes the UOS 0xCF8/0xCFC accesses to the device model, and remains
  blind to the physical PCI devices. Upon receiving a UOS PCI configuration
  space access request, the device model emulates some critical space, for
  instance, BAR, MSI capability, and INTLINE/INTPIN.

- For other accesses, the device model
  reads/writes the physical configuration space on behalf of the UOS. To do
  this, the device model is linked with lib pci access to access the
  physical PCI device.


Interrupt Remapping
*******************

When the physical interrupt of a passthrough device happens, the hypervisor
has to distribute it to the relevant VM according to the interrupt
remapping relationships. The structure ``ptirq_remapping_info`` is used to
define the subordination relation between the physical interrupt and the
VM, the virtual destination, etc. See the following figure for details:

.. figure:: images/passthru-image91.png
   :align: center

   Remapping of physical interrupts

There are two different types of interrupt sources: IOAPIC and MSI.
The hypervisor will record different information for interrupt
distribution: the physical and virtual IOAPIC pin for an IOAPIC source,
and the physical and virtual BDF and other info for an MSI source.

SOS passthrough is also in the scope of interrupt remapping, which is
done on demand rather than on hypervisor initialization.

.. figure:: images/passthru-image102.png
   :align: center
   :name: init-remapping

   Initialization of remapping of virtual IOAPIC interrupts for SOS

:numref:`init-remapping` above illustrates how (virtual) IOAPIC
interrupts are remapped for the SOS. A VM exit occurs whenever the SOS
tries to unmask an interrupt in the (virtual) IOAPIC by writing to the
Redirection Table Entry (or RTE). The hypervisor then invokes the IOAPIC
emulation handler (refer to :ref:`hld-io-emulation` for details on I/O
emulation) which calls APIs to set up a remapping for the to-be-unmasked
interrupt.

Remapping of (virtual) PIC interrupts is set up in a similar sequence:

.. figure:: images/passthru-image98.png
   :align: center

   Initialization of remapping of virtual MSI for SOS

This figure illustrates how mappings of MSI or MSI-X are set up for the
SOS. The SOS is responsible for issuing a hypercall to notify the
hypervisor before it configures the PCI configuration space to enable an
MSI. The hypervisor takes this opportunity to set up a remapping for the
given MSI or MSI-X before it is actually enabled by the SOS.

When the UOS needs to access a physical device by passthrough, it uses
the following steps:

- The UOS gets a virtual interrupt.
- A VM exit happens and the trapped vCPU is the target where the interrupt
  will be injected.
- The hypervisor handles the interrupt and translates the vector
  according to ptirq_remapping_info.
- The hypervisor delivers the interrupt to the UOS.

When the SOS needs to use the physical device, the passthrough is also
active because the SOS is the first VM. The detailed steps are:

- The SOS gets all physical interrupts. It assigns different interrupts to
  different VMs during initialization and reassigns them when a VM is
  created or deleted.
- When a physical interrupt is trapped, an exception happens after the
  VMCS has been set.
- The hypervisor handles the VM exit according to
  ptirq_remapping_info and translates the vector.
- The interrupt is injected the same as a virtual interrupt.

ACPI Virtualization
*******************

ACPI virtualization is designed in ACRN with these assumptions:

- HV has no knowledge of ACPI,
- SOS owns all physical ACPI resources,
- UOS sees virtual ACPI resources emulated by the device model.

Some passthrough devices require a physical ACPI table entry for
initialization. The device model creates such a device entry based on
the physical one according to the vendor ID and device ID. Virtualization
is implemented in the SOS device model and is not in the scope of the
hypervisor.

GSI Sharing Violation Check
***************************

All the PCI devices that share the same GSI should be assigned to
the same VM to avoid physical GSI sharing between multiple VMs. For
devices that don't support MSI, the ACRN DM
puts the devices sharing the same GSI pin into a GSI
sharing group. The devices in the same group should be assigned together
to the current VM; otherwise, none of them should be assigned to the
current VM. A device that violates this rule will be rejected for
passthrough. The checking logic is implemented in the Device Model and is
not in the scope of the hypervisor; a sketch of the check follows.
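
The fragment below is a minimal sketch of that check as it could be written
in the DM. The ``gsi_group`` structure and helper name are invented for
illustration and do not mirror the actual acrn-dm code.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>

   /* Illustrative only: a GSI sharing group records which VM (if any) the
    * devices behind one GSI pin have been assigned to. */
   struct gsi_group {
       int gsi;              /* the shared GSI pin                  */
       int assigned_vm;      /* -1 while no device is assigned yet  */
   };

   /* Returns true if assigning one more device of this group to 'vm' keeps
    * the rule "all devices of a group go to the same VM"; false means the
    * passthrough request must be rejected. */
   static bool gsi_group_may_assign(struct gsi_group *g, int vm)
   {
       if (g == NULL)
           return true;                   /* device has its own GSI */
       if (g->assigned_vm == -1)
           g->assigned_vm = vm;           /* first claimer wins     */
       return g->assigned_vm == vm;
   }
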
Data structures and interfaces
******************************

The following APIs are provided to initialize interrupt remapping for
SOS:

.. doxygenfunction:: ptirq_intx_pin_remap
   :project: Project ACRN

.. doxygenfunction:: ptirq_msix_remap
   :project: Project ACRN

The following APIs are provided to manipulate the interrupt remapping
for UOS.

.. doxygenfunction:: ptirq_add_intx_remapping
   :project: Project ACRN

.. doxygenfunction:: ptirq_remove_intx_remapping
   :project: Project ACRN

.. doxygenfunction:: ptirq_add_msix_remapping
   :project: Project ACRN

.. doxygenfunction:: ptirq_remove_msix_remapping
   :project: Project ACRN

The following APIs are provided to acknowledge a virtual interrupt.

.. doxygenfunction:: ptirq_intx_ack
   :project: Project ACRN
@@ -1,21 +0,0 @@

.. _hv-hypercall:

Hypercall / VHM upcall
######################

HV currently supports hypercall APIs for VM management, I/O request
distribution, and guest memory mapping.

HV and the Service OS (SOS) also use vector 0xF7, reserved as the x86
platform IPI vector for HV notification to the SOS. This upcall is
necessary whenever device emulation is required from the SOS. The upcall
vector 0xF7 is injected to SOS vCPU0.

The SOS registers the IRQ handler for vector 0xF7 and notifies the I/O
emulation module in the SOS once the IRQ is triggered.

.. note:: Add API doc references for General interface, VM management
   interface, IRQ and Interrupts, Device Model IO request distribution,
   Guest memory management, PCI assignment and IOMMU, Debug, Trusty, Power
   management
@@ -1,423 +0,0 @@

.. _interrupt-hld:

Physical Interrupt high-level design
####################################

Overview
********

The ACRN hypervisor implements a simple but fully functional framework
to manage interrupts and exceptions, as shown in
:numref:`interrupt-modules-overview`. In its native layer, it configures
the physical PIC, IOAPIC, and LAPIC to support different interrupt
sources from local timer/IPI to external INTx/MSI. In its virtual guest
layer, it emulates virtual PIC, virtual IOAPIC, and virtual LAPIC, and
provides full APIs allowing virtual interrupt injection from emulated or
pass-thru devices.

.. figure:: images/interrupt-image3.png
   :align: center
   :width: 600px
   :name: interrupt-modules-overview

   ACRN Interrupt Modules Overview

In the software modules view shown in :numref:`interrupt-sw-modules`,
the ACRN hypervisor sets up the physical interrupt in its basic
interrupt modules (e.g., IOAPIC/LAPIC/IDT). It dispatches the interrupt
in the hypervisor interrupt flow control layer to the corresponding
handlers, which could be a pre-defined IPI notification, the timer, or a
runtime registered pass-thru device. The ACRN hypervisor then uses its VM
interfaces, based on the vPIC, vIOAPIC, and vMSI modules, to inject the
necessary virtual interrupt into the specific VM.

.. figure:: images/interrupt-image2.png
   :align: center
   :width: 600px
   :name: interrupt-sw-modules

   ACRN Interrupt SW Modules Overview

The hypervisor implements the following functionalities for handling
physical interrupts:

- Configure interrupt-related hardware including IDT, PIC, LAPIC, and
  IOAPIC on startup.

- Provide APIs to manipulate the registers of LAPIC and IOAPIC.

- Acknowledge physical interrupts.

- Set up a callback mechanism for the other components in the
  hypervisor to request an interrupt vector and register a
  handler for that interrupt.

HV owns all native physical interrupts and manages 256 vectors per CPU.
All physical interrupts are first handled in VMX root-mode. The
"external-interrupt exiting" bit in the VM-Execution controls field is set
to support this. The ACRN hypervisor also initializes all the interrupt
related modules like IDT, PIC, IOAPIC, and LAPIC.

HV does not own any host devices (except the UART). All devices are by
default assigned to the SOS. Any interrupts received by guest VM (SOS or
UOS) device drivers are virtual interrupts injected by HV (via vLAPIC).
HV manages a Host-to-Guest mapping. When a native IRQ/interrupt occurs,
HV decides whether this IRQ/interrupt should be forwarded to a VM and
which VM to forward it to (if any). Refer to section 3.7.6 for virtual
interrupt injection and section 3.9.6 for the management of interrupt
remapping.

HV does not own any exceptions. Guest VMCSs are configured so that no VM
Exit happens, with some exceptions such as #INT3 and #MC. This is to
simplify the design as HV does not support any exception handling
itself. HV supports only static memory mapping, so there should be no
#PF or #GP. If HV receives an exception indicating an error, an assert
function is executed with an error message printed out, and the
system then halts.

Native interrupts can be generated from one of the following
sources:

- GSI interrupts

  - PIC or legacy device IRQs (0~15)
  - IOAPIC pins

- PCI MSI/MSI-X vectors
- Inter-CPU IPIs
- The LAPIC timer

Physical Interrupt Initialization
*********************************

After the ACRN hypervisor gets control from the bootloader, it
initializes all physical interrupt-related modules for all the CPUs. The
ACRN hypervisor creates a framework to manage the physical interrupts for
hypervisor local devices, pass-thru devices, and IPI between CPUs, as
shown in :numref:`hv-interrupt-init`:

.. figure:: images/interrupt-image66.png
   :align: center
   :name: hv-interrupt-init

   Physical Interrupt Initialization


IDT Initialization
==================

The ACRN hypervisor builds its native IDT (interrupt descriptor table)
during interrupt initialization and sets up the following handlers:

- On an exception, the hypervisor dumps its context and halts the current
  physical processor (because physical exceptions are not expected).

- For external interrupts, HV may mask the interrupt (depending on the
  trigger mode), followed by interrupt acknowledgement and dispatch
  to the registered handler, if any.

Most interrupts and exceptions are handled without a stack switch,
except for machine-check, double fault, and stack fault exceptions, which
have their own stack set in the TSS.

PIC/IOAPIC Initialization
=========================

The ACRN hypervisor masks all interrupts from the PIC. All legacy
interrupts from the PIC (<16) will be linked to the IOAPIC, as shown in
the connections in :numref:`hv-pic-config`.

ACRN will pre-allocate vectors and mask them for these legacy interrupts
in the IOAPIC RTEs. For others (>= 16), ACRN will mask them with vector 0
in the RTE, and the vector will be dynamically allocated on demand.

All external IOAPIC pins are categorized as GSI interrupts according to
the ACPI definition. HV supports multiple IOAPIC components. IRQ pin to
GSI mappings are maintained internally to determine the GSI source IOAPIC.
The native PIC is not used in the system.

.. figure:: images/interrupt-image46.png
   :align: center
   :name: hv-pic-config

   HV PIC/IOAPIC/LAPIC configuration

LAPIC Initialization
====================

Physical LAPICs are in xAPIC mode in the ACRN hypervisor. The hypervisor
initializes the LAPIC for each physical CPU by masking all interrupts in
the local vector table (LVT), clearing all ISRs, and enabling the LAPIC.

APIs are provided to access the LAPIC for the other components in the
hypervisor, aiming for further usage such as programming the local timer
(TSC deadline) and IPI notification. See :ref:`hv_interrupt-data-api`
for a complete list.

HV Interrupt Vectors and Delivery Mode
======================================

The interrupt vectors are assigned as shown here:

**Vector 0-0x1F**
   Exceptions that are not handled by HV. If such an exception does
   occur, the system halts.

**Vector 0x20-0x2F**
   Statically allocated for legacy IRQ0-15.

**Vector 0x30-0xDF**
   Dynamically allocated vectors for PCI device INTx or MSI/MSI-X usage.
   Depending on the interrupt delivery mode (FLAT or PER_CPU mode), an
   interrupt is assigned to a vector for all the CPUs or for a particular
   CPU.

**Vector 0xE0-0xFE**
   High priority vectors reserved by HV for dedicated purposes. For
   example, 0xEF is used for the timer and 0xF0 for IPI.

.. list-table::
   :widths: 30 70
   :header-rows: 1

   * - Vectors
     - Usage

   * - 0x0-0x13
     - Exceptions: NMI, INT3, page fault, GP, debug.

   * - 0x14-0x1F
     - Reserved

   * - 0x20-0x2F
     - Statically allocated for external IRQs (IRQ0-IRQ15)

   * - 0x30-0xDF
     - Dynamically allocated for IOAPIC IRQs from PCI INTx/MSI

   * - 0xE0-0xFE
     - Statically allocated for HV

   * - 0xEF
     - Timer

   * - 0xF0
     - IPI

   * - 0xFF
     - SPURIOUS_APIC_VECTOR


Interrupts from either the IOAPIC or MSI can be delivered to a target CPU.
By default they are configured as Lowest Priority (FLAT mode), i.e. they
are delivered to a CPU core that is currently idle or executing the lowest
priority ISR. There is no guarantee that a device's interrupt will be
delivered to a specific guest's CPU. Timer interrupts are an exception:
they are always delivered to the CPU that programs the LAPIC timer.

There are two interrupt delivery modes: FLAT mode and PER_CPU mode. ACRN
uses FLAT mode, where the interrupt/IRQ to vector mapping is the same on
all CPUs. Every CPU receives the same interrupts. The IOAPIC and LAPIC MSI
delivery modes are configured to Lowest Priority.

Vector allocation for CPUs is shown here:

.. figure:: images/interrupt-image89.png
   :align: center

   FLAT mode vector allocation

IRQ Descriptor Table
====================

The ACRN hypervisor maintains a global IRQ Descriptor Table shared among
the physical CPUs. ACRN uses FLAT mode to manage the interrupts, so the
same vector links to the same IRQ number for all CPUs.

.. note:: need to reference API doc for irq_desc

The *irq_desc[]* array's index represents the IRQ number. An *irq_handler*
field can be set to a common edge/level/quick handler, which will be
called from *interrupt_dispatch*. The *irq_desc* structure also
contains the *dev_list* field to maintain this IRQ's action handler
list.

Another reverse mapping from vector to IRQ is used in addition to the
IRQ descriptor table, which maintains the mapping from IRQ to vector.

On initialization, the descriptors of the legacy IRQs are initialized with
the proper vectors and the corresponding reverse mapping is set up.
The descriptors of other IRQs are filled with an invalid
vector, which will be updated on IRQ allocation.

For example, if the local timer registers an interrupt with IRQ number 271
and vector 0xEF, then this data will be set up:

.. code-block:: c

   irq_desc[271].irq = 271
   irq_desc[271].vector = 0xEF
   vector_to_irq[0xEF] = 271

External Interrupt Handling
***************************

The CPU runs under VMX non-root mode inside guest VMs.
``MSR_IA32_VMX_PINBASED_CTLS.bit[0]`` and
``MSR_IA32_VMX_EXIT_CTLS.bit[15]`` are set to allow a vCPU VM Exit to HV
whenever there are interrupts to that physical CPU under
non-root mode. HV ACKs the interrupts in VMX non-root mode and saves the
interrupt vector to the relevant VM Exit field for HV IRQ processing.

Note that, as discussed above, an external interrupt causing a vCPU VM
Exit to HV does not mean that the interrupt belongs to that guest VM. When
the CPU executes a VM Exit into root mode, interrupt handling is enabled
and the interrupt is delivered and processed as quickly as possible
inside HV. HV may emulate a virtual interrupt and inject it to a guest if
necessary.

When a physical interrupt happens on a CPU, that CPU could be running
under VMX root mode or non-root mode. If the CPU is running under VMX
root mode, the interrupt is triggered from the standard native IRQ flow:
interrupt gate to IRQ handler. If the CPU is running under VMX non-root
mode, an external interrupt triggers a VM exit for the reason
"external-interrupt".

Interrupt and IRQ processing flow diagrams are shown below:

.. figure:: images/interrupt-image48.png
   :align: center
   :name: phy-interrupt-processing

   Processing of physical interrupts

.. figure:: images/interrupt-image39.png
   :align: center

   IRQ processing control flow

When a physical interrupt is raised and delivered to a physical CPU, the
CPU may be running under either VMX root mode or non-root mode.

- If the CPU is running under VMX root mode, the interrupt is handled
  following the standard native IRQ flow: interrupt gate to
  dispatch_interrupt(), IRQ handler, and finally the registered callback.
- If the CPU is running under VMX non-root mode, an external interrupt
  causes a VM exit for the reason "external-interrupt", and then the VM
  exit processing flow calls dispatch_interrupt() to dispatch and
  handle the interrupt.

After an interrupt occurs from either path shown in
:numref:`phy-interrupt-processing`, the ACRN hypervisor jumps to
dispatch_interrupt. This function gets the vector of the generated
interrupt from the context, gets the IRQ number from vector_to_irq[], and
then gets the corresponding irq_desc, roughly as sketched below.
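
A compressed sketch of that lookup is shown below; the structures and
signatures are simplified stand-ins for the real hypervisor code, which
should be consulted for the actual definitions.

.. code-block:: c

   /* Simplified stand-ins for the hypervisor's real structures. */
   #define NR_MAX_VECTOR 0xFF

   struct irq_desc {
       unsigned int irq;
       unsigned int vector;
       void (*irq_handler)(struct irq_desc *desc);
   };

   extern struct irq_desc irq_desc[];
   extern int vector_to_irq[NR_MAX_VECTOR + 1];

   /* Reached from both the root-mode path (interrupt gate) and the
    * non-root path (VM exit for "external-interrupt"). */
   static void dispatch_interrupt(unsigned int vector)
   {
       int irq = vector_to_irq[vector];     /* reverse map: vector -> IRQ */

       if (irq < 0)
           return;                          /* spurious / unregistered    */

       struct irq_desc *desc = &irq_desc[irq];
       if (desc->irq_handler != NULL)
           desc->irq_handler(desc);         /* edge/level/quick handler   */
   }
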
Though there is only one generic IRQ handler for each registered interrupt,
there are three different handling flows according to the flags:

- ``!IRQF_LEVEL``

- ``IRQF_LEVEL && !IRQF_PT``

  To avoid continuous interrupt triggers, the handler masks the IOAPIC pin
  and unmasks it only after the IRQ action callback has executed.

- ``IRQF_LEVEL && IRQF_PT``

  For pass-thru devices, to avoid continuous interrupt triggers, the
  handler masks the IOAPIC pin and leaves it masked until the corresponding
  vIOAPIC pin gets an explicit EOI ACK from the guest.

Since interrupts are not shared among multiple devices, there is only one
IRQ action registered for each interrupt.

The IRQ number inside HV is a software concept used to identify GSIs and
vectors. Each GSI is mapped to one IRQ. The GSI number is usually the same
as the IRQ number. IRQ numbers greater than the max GSI (nr_gsi) number are
dynamically assigned. For example, when HV allocates an interrupt vector to
a PCI device, an IRQ number is then assigned to that vector. When the
vector later reaches a CPU, the corresponding IRQ routine is located and
executed.

See :numref:`request-irq` for the request IRQ control flow under different
conditions:

.. figure:: images/interrupt-image76.png
   :align: center
   :name: request-irq

   Request IRQ for different conditions

.. _ipi-management:

IPI Management
**************

The only purpose of IPI use in HV is to kick a vCPU out of non-root mode
and into HV mode. This requires that I/O requests and virtual interrupt
injection be distributed to different IPI vectors. I/O requests use
IPI vector 0xF4 for the upcall (refer to Chapter 5.4). Virtual interrupt
injection uses IPI vector 0xF0.

0xF4 upcall
   A guest vCPU exits (VM Exit) due to an EPT violation or an I/O
   instruction trap. This requires the Device Model to emulate the
   MMIO/port I/O instruction. However, it could be that the Service OS
   (SOS) vCPU0 is still in non-root mode. So an IPI (the 0xF4 upcall
   vector) is sent to physical CPU0 (running vCPU0 of the SOS in non-root
   mode) to force vCPU0 to VM Exit due to the external interrupt. The
   virtual upcall vector is then injected to the SOS, and vCPU0 inside the
   SOS picks up the I/O request and performs the emulation for the other
   guest.

0xF0 IPI flow
   If the Device Model inside the SOS needs to inject an interrupt to
   another guest, such as into vCPU1, it first issues an IPI to kick CPU1
   (assuming vCPU1 is running on CPU1) to root mode. CPU1 then injects the
   interrupt before VM Enter.

.. _hv_interrupt-data-api:

Data structures and interfaces
******************************

IOAPIC
======

The following APIs are external interfaces for IOAPIC related
operations.

.. doxygengroup:: ioapic_ext_apis
   :project: Project ACRN
   :content-only:

LAPIC
=====

The following APIs are external interfaces for LAPIC related operations.

.. doxygengroup:: lapic_ext_apis
   :project: Project ACRN
   :content-only:

IPI
===

The following APIs are external interfaces for IPI related operations.

.. doxygengroup:: ipi_ext_apis
   :project: Project ACRN
   :content-only:

Physical Interrupt
==================

The following APIs are external interfaces for physical interrupt
related operations.

.. doxygengroup:: phys_int_ext_apis
   :project: Project ACRN
   :content-only:
@@ -1,329 +0,0 @@

.. _hld-io-emulation:

I/O Emulation high-level design
###############################

As discussed in :ref:`intro-io-emulation`, there are multiple ways and
places to handle I/O emulation, including HV, the SOS kernel VHM, and the
SOS user-land device model (acrn-dm).

I/O emulation in the hypervisor provides these functionalities:

- Maintain lists of port I/O or MMIO handlers in the hypervisor for
  emulating trapped I/O accesses in a certain range.

- Forward I/O accesses to SOS when they cannot be handled by the
  hypervisor by any registered handlers.

:numref:`io-control-flow` illustrates the main control flow steps of I/O
emulation inside the hypervisor:

1. Trap the I/O access by a VM exit and decode the access from the
   exit qualification or by invoking the instruction decoder.

2. If the range of the I/O access overlaps with any registered handler,
   call that handler if it completely covers the range of the
   access, or ignore the access if the access crosses the boundary.

3. If the range of the I/O access does not overlap the range of any I/O
   handler, deliver an I/O request to SOS.

.. figure:: images/ioem-image101.png
   :align: center
   :name: io-control-flow

   Control flow of I/O emulation in the hypervisor

I/O emulation does not rely on any calibration data.


Trap Path
*********

Port I/O accesses are trapped by VM exits with the basic exit reason
"I/O instruction". The port address to be accessed, the size, and the
direction (read or write) are fetched from the VM exit qualification. For
writes, the value to be written to the I/O port is fetched from the guest
registers al, ax, or eax, depending on the access size.

MMIO accesses are trapped by VM exits with the basic exit reason "EPT
violation". The instruction emulator is invoked to decode the
instruction that triggers the VM exit to get the memory address being
accessed, the size, the direction (read or write), and the involved
register.

The I/O bitmaps and EPT are used to configure the addresses that will
trigger VM exits when accessed by a VM. Refer to
:ref:`io-mmio-emulation` for details.

I/O Emulation in the Hypervisor
*******************************

When a port I/O or MMIO access is trapped, the hypervisor first checks
whether the to-be-accessed address falls in the range of any registered
handler, and calls the handler when such a handler exists.

Handler Management
==================

Each VM has two lists of I/O handlers, one for port I/O and the other
for MMIO. Each element of the list contains a memory range and a pointer
to the handler which emulates the accesses falling in that range. See
:ref:`io-handler-init` for descriptions of the related data structures.

The I/O handlers are registered on VM creation and never changed until
the destruction of that VM, when the handlers are unregistered. If
multiple handlers are registered for the same address, the one
registered later wins. See :ref:`io-handler-init` for the interfaces
used to register and unregister I/O handlers.

I/O Dispatching
===============

When a port I/O or MMIO access is trapped, the hypervisor first walks
through the corresponding I/O handler list in the reverse order of
registration, looking for a proper handler to emulate the access. The
following cases exist (a sketch of this lookup follows the list):

- If a handler whose range overlaps the range of the I/O access is
  found:

  - If the range of the I/O access falls completely in the range the
    handler can emulate, that handler is called.

  - Otherwise it is implied that the access crosses the boundary of
    multiple devices which the hypervisor does not emulate. Thus
    no handler is called and no I/O request will be delivered to
    SOS. I/O reads get all 1's and I/O writes are dropped.

- If the range of the I/O access does not overlap with any range of the
  handlers, the I/O access is delivered to SOS as an I/O request
  for further processing.
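
The lookup described above can be condensed into the following sketch; the
structure layout and names are illustrative only, the real definitions
being those referenced in :ref:`io-handler-init`.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Illustrative handler-list entry: a range plus the emulation callback. */
   struct io_handler {
       uint64_t base;
       uint64_t len;
       void   (*emulate)(struct io_handler *hdlr, uint64_t addr, size_t size);
       struct io_handler *next;       /* newest registration at the head */
   };

   /* Returns true if the access was emulated in the hypervisor; otherwise
    * *forward_to_sos says whether to deliver an I/O request to SOS or to
    * drop the access because it crosses a handler boundary. */
   static bool dispatch_io(struct io_handler *list, uint64_t addr, size_t size,
                           bool *forward_to_sos)
   {
       /* Walking head-first visits handlers in reverse registration order,
        * so a later registration wins over an earlier one. */
       for (struct io_handler *h = list; h != NULL; h = h->next) {
           bool overlaps = (addr < h->base + h->len) && (addr + size > h->base);

           if (!overlaps)
               continue;

           if (addr >= h->base && addr + size <= h->base + h->len) {
               h->emulate(h, addr, size);  /* fully covered: emulate here  */
               *forward_to_sos = false;
               return true;
           }
           *forward_to_sos = false;        /* crosses a boundary: ignore   */
           return false;
       }
       *forward_to_sos = true;             /* no overlap: deliver to SOS   */
       return false;
   }
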
I/O Requests
************

An I/O request is delivered to SOS vCPU 0 if the hypervisor does not
find any handler that overlaps the range of a trapped I/O access. This
section describes the initialization of the I/O request mechanism and
how an I/O access is emulated via I/O requests in the hypervisor.

Initialization
==============

For each UOS the hypervisor shares a page with SOS to exchange I/O
requests. The 4-KByte page consists of 16 256-Byte slots, indexed by
vCPU ID. It is required for the DM to allocate and set up the request
buffer on VM creation, otherwise I/O accesses from UOS cannot be
emulated by SOS, and all I/O accesses not handled by the I/O handlers in
the hypervisor will be dropped (reads get all 1's).

Refer to Section 4.4.1 for the details of I/O requests and the
initialization of the I/O request buffer.
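
The layout described above can be pictured with the following hedged C
sketch. The field names and the state enumerators are assumptions made for
illustration, while the 16-slot, 256-byte-per-slot geometry of the 4-KByte
page comes straight from the text.

.. code-block:: c

   #include <stdint.h>

   /* States of one request slot, as described under "I/O Request State
    * Transitions" below; the enumerator values are illustrative. */
   enum vhm_req_state {
       REQ_STATE_FREE = 0,
       REQ_STATE_PENDING,
       REQ_STATE_PROCESSING,
       REQ_STATE_COMPLETE,
   };

   /* One 256-byte slot of the shared page, indexed by vCPU ID.  Field names
    * are placeholders; only the geometry is taken from the text. */
   struct vhm_request {
       uint32_t type;                 /* PIO, MMIO, PCI, or WP               */
       uint32_t state;                /* enum vhm_req_state, atomic access   */
       uint64_t addr;                 /* port or GPA being accessed          */
       uint64_t size;
       uint64_t value;                /* write data in, read data out        */
       uint8_t  reserved[256 - 32];   /* pad the slot to exactly 256 bytes   */
   };

   /* The whole 4-KByte shared page: 16 slots, one per vCPU. */
   struct vhm_request_buffer {
       struct vhm_request req_slot[16];
   };

   _Static_assert(sizeof(struct vhm_request) == 256, "slot must be 256 bytes");
   _Static_assert(sizeof(struct vhm_request_buffer) == 4096, "page must be 4 KB");
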
Types of I/O Requests
|
||||
=====================
|
||||
|
||||
There are four types of I/O requests:
|
||||
|
||||
.. list-table::
|
||||
:widths: 50 50
|
||||
:header-rows: 1
|
||||
|
||||
* - I/O Request Type
|
||||
- Description
|
||||
|
||||
* - PIO
|
||||
- A port I/O access.
|
||||
|
||||
* - MMIO
|
||||
- A MMIO access to a GPA with no mapping in EPT.
|
||||
|
||||
* - PCI
|
||||
- A PCI configuration space access.
|
||||
|
||||
* - WP
|
||||
- A MMIO access to a GPA with a read-only mapping in EPT.
|
||||
|
||||
|
||||
For port I/O accesses, the hypervisor will always deliver an I/O request
|
||||
of type PIO to SOS. For MMIO accesses, the hypervisor will deliver an
|
||||
I/O request of either MMIO or WP, depending on the mapping of the
|
||||
accessed address (in GPA) in the EPT of the vCPU. The hypervisor will
|
||||
never deliver any I/O request of type PCI, but will handle such I/O
|
||||
requests in the same ways as port I/O accesses on their completion.
|
||||
|
||||
Refer to :ref:`io-structs-interfaces` for a detailed description of the
|
||||
data held by each type of I/O request.
|
||||
|
||||
I/O Request State Transitions
|
||||
=============================
|
||||
|
||||
Each slot in the I/O request buffer is managed by a finite state machine
|
||||
with four states. The following figure illustrates the state transitions
|
||||
and the events that trigger them.
|
||||
|
||||
.. figure:: images/ioem-image92.png
|
||||
:align: center
|
||||
|
||||
State Transition of I/O Requests
|
||||
|
||||
The four states are:
|
||||
|
||||
FREE
|
||||
The I/O request slot is not used and new I/O requests can be
|
||||
delivered. This is the initial state on UOS creation.
|
||||
|
||||
PENDING
|
||||
The I/O request slot is occupied with an I/O request pending
|
||||
to be processed by SOS.
|
||||
|
||||
PROCESSING
|
||||
The I/O request has been dispatched to a client but the
|
||||
client has not finished handling it yet.
|
||||
|
||||
COMPLETE
|
||||
The client has completed the I/O request but the hypervisor
|
||||
has not consumed the results yet.
|
||||
|
||||
The contents of an I/O request slot are owned by the hypervisor when the
|
||||
state of an I/O request slot is FREE or COMPLETE. In such cases SOS can
|
||||
only access the state of that slot. Similarly the contents are owned by
|
||||
SOS when the state is PENDING or PROCESSING, when the hypervisor can
|
||||
only access the state of that slot.
|
||||
|
||||
The states are transferred as follow:
|
||||
|
||||
1. To deliver an I/O request, the hypervisor takes the slot
|
||||
corresponding to the vCPU triggering the I/O access, fills the
|
||||
contents, changes the state to PENDING and notifies SOS via
|
||||
upcall.
|
||||
|
||||
2. On upcalls, SOS dispatches each I/O request in the PENDING state to
|
||||
clients and changes the state to PROCESSING.
|
||||
|
||||
3. The client assigned an I/O request changes the state to COMPLETE
|
||||
after it completes the emulation of the I/O request. A hypercall
|
||||
is made to notify the hypervisor on I/O request completion after
|
||||
the state change.
|
||||
|
||||
4. The hypervisor finishes the post-work of a I/O request after it is
|
||||
notified on its completion and change the state back to FREE.
|
||||
|
||||
States are accessed using atomic operations to avoid getting unexpected
|
||||
states on one core when it is written on another.
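
As a minimal sketch (not the ACRN implementation; the enum and function
names are illustrative), the state field and one such atomic transition
could look like this:

.. code-block:: c

   #include <stdatomic.h>
   #include <stdbool.h>
   #include <stdint.h>

   enum io_req_state {
       REQ_STATE_FREE       = 0,  /* slot empty, owned by the hypervisor */
       REQ_STATE_PENDING    = 1,  /* delivered, waiting for SOS dispatch */
       REQ_STATE_PROCESSING = 2,  /* being emulated by a client in SOS   */
       REQ_STATE_COMPLETE   = 3,  /* results ready, owned by hypervisor  */
   };

   /* Hypervisor side: publish a new request only if the slot is FREE.
    * The caller fills the slot contents before calling this, so a single
    * compare-and-swap hands ownership over to SOS atomically. */
   static bool deliver_request(_Atomic uint32_t *state)
   {
       uint32_t expected = REQ_STATE_FREE;

       return atomic_compare_exchange_strong(state, &expected,
                                             REQ_STATE_PENDING);
   }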

Note that there is no state to represent a 'failed' I/O request. SOS
should return all 1's for reads and ignore writes whenever it cannot
handle the I/O request, and change the state of the request to COMPLETE.

Post-work
=========

After an I/O request is completed, some more work needs to be done for
I/O reads to update guest registers accordingly. Currently the
hypervisor re-enters the vCPU thread every time a vCPU is scheduled back
in, rather than switching to where the vCPU is scheduled out. As a result,
post-work is introduced for this purpose.

The hypervisor pauses a vCPU before an I/O request is delivered to SOS.
Once the I/O request emulation is completed, a client notifies the
hypervisor by a hypercall. The hypervisor will pick up that request, do
the post-work, and resume the guest vCPU. The post-work takes care of
updating the vCPU guest state to reflect the effect of the I/O reads.

.. figure:: images/ioem-image100.png
   :align: center

   Workflow of MMIO I/O request completion

The figure above illustrates the workflow to complete an I/O
request for MMIO. Once the I/O request is completed, SOS makes a
hypercall to notify the hypervisor which resumes the UOS vCPU triggering
the access after requesting post-work on that vCPU. After the UOS vCPU
resumes, it does the post-work first to update the guest registers if
the access reads an address, changes the state of the corresponding I/O
request slot to FREE, and continues execution of the vCPU.

.. figure:: images/ioem-image106.png
   :align: center
   :name: port-io-completion

   Workflow of port I/O request completion

Completion of a port I/O request (shown in :numref:`port-io-completion`
above) is similar to the MMIO case, except the post-work is done before
resuming the vCPU. This is because the post-work for port I/O reads needs
to update the general register eax of the vCPU, while the post-work for
MMIO reads needs further emulation of the trapped instruction. This is
much more complex and may impact the performance of SOS.

.. _io-structs-interfaces:

Data Structures and Interfaces
******************************

External Interfaces
===================

The following structures represent an I/O request. *struct vhm_request*
is the main structure and the others are detailed representations of I/O
requests of different kinds. Refer to Section 4.4.4 for the usage of
*struct pci_request*.

.. doxygenstruct:: mmio_request
   :project: Project ACRN

.. doxygenstruct:: pio_request
   :project: Project ACRN

.. doxygenstruct:: pci_request
   :project: Project ACRN

.. doxygenunion:: vhm_io_request
   :project: Project ACRN

.. doxygenstruct:: vhm_request
   :project: Project ACRN

For hypercalls related to I/O emulation, refer to Section 3.11.4.

.. _io-handler-init:

Initialization and Deinitialization
===================================

The following structure represents a port I/O handler:

.. doxygenstruct:: vm_io_handler_desc
   :project: Project ACRN

The following structure represents an MMIO handler:

.. doxygenstruct:: mem_io_node
   :project: Project ACRN

The following APIs are provided to initialize, deinitialize or configure
I/O bitmaps and register or unregister I/O handlers:

.. doxygenfunction:: allow_guest_pio_access
   :project: Project ACRN

.. doxygenfunction:: register_pio_emulation_handler
   :project: Project ACRN

.. doxygenfunction:: register_mmio_emulation_handler
   :project: Project ACRN

I/O Emulation
=============

The following APIs are provided for I/O emulation at runtime:

.. doxygenfunction:: acrn_insert_request
   :project: Project ACRN

.. doxygenfunction:: pio_instr_vmexit_handler
   :project: Project ACRN

.. doxygenfunction:: ept_violation_vmexit_handler
   :project: Project ACRN

@@ -1,728 +0,0 @@

.. _IOC_virtualization_hld:

IOC Virtualization high-level design
####################################


.. author: Yuan Liu

The I/O Controller (IOC) is an SoC bridge we can use to communicate
with a Vehicle Bus in automotive applications, routing Vehicle Bus
signals, such as those extracted from CAN messages, from the IOC to the
SoC and back, as well as signals the SoC uses to control onboard
peripherals.

.. note::
   NUC and UP2 platforms do not support IOC hardware, and as such, IOC
   virtualization is not supported on these platforms.

The main purpose of IOC virtualization is to transfer data between
native Carrier Board Communication (CBC) char devices and a virtual
UART. IOC virtualization is implemented as full virtualization so the
user OS can directly reuse the native CBC driver.

The IOC Mediator has several virtualization requirements, such as S3/S5
wakeup reason emulation, CBC link frame packing/unpacking, signal
whitelist, and RTC configuration.

IOC Mediator Design
*******************

Architecture Diagrams
=====================

IOC introduction
----------------

.. figure:: images/ioc-image12.png
   :width: 600px
   :align: center
   :name: ioc-mediator-arch

   IOC Mediator Architecture

- Vehicle Bus communication involves a wide range of individual signals
  to be used, varying from single GPIO signals on the IOC up to
  complete automotive networks that connect many external ECUs.
- IOC (I/O controller) is an SoC bridge to communicate with a Vehicle
  Bus. It routes Vehicle Bus signals (extracted from CAN
  messages for example) back and forth between the IOC and SoC. It also
  controls the onboard peripherals from the SoC.
- IOC is always turned on. The power supply of the SoC and its memory are
  controlled by the IOC. The IOC monitors wakeup reasons to control SoC
  lifecycle-related features.
- Some hardware signals are connected to the IOC, allowing the SoC to control
  them.
- In addition, an NVM (Non-Volatile Memory) is connected to the IOC for
  storing persistent data. The IOC is in charge of accessing the NVM
  following the SoC's requirements.

CBC protocol introduction
-------------------------

The Carrier Board Communication (CBC) protocol multiplexes and
prioritizes communication from the available interface between the SoC
and the IOC.

The CBC protocol offers a layered approach, which allows it to run on
different serial connections, such as SPI or UART.

.. figure:: images/ioc-image14.png
   :width: 900px
   :align: center
   :name: ioc-cbc-frame-def

   IOC Native - CBC frame definition

The CBC protocol is based on a four-layer system:

- The **Physical layer** is a serial interface with full
  duplex capabilities. A hardware handshake is required. The required
  bit rate depends on the connected interface, e.g., UART or SPI.
- The **Link layer** handles the length and payload verification.
- The **Address Layer** is used to distinguish between the general data
  transferred. It is placed in front of the underlying Service Layer
  and contains Multiplexer (MUX) and Priority fields.
- The **Service Layer** contains the payload data.

Native architecture
-------------------

In the native architecture, the IOC controller connects to UART
hardware, and communicates with the CAN bus to access peripheral
devices. ``cbc_attach`` is an application to enable the CBC ldisc
function, which creates several CBC char devices. All userspace
subsystems or services communicate with IOC firmware via the CBC char
devices.

.. figure:: images/ioc-image13.png
   :width: 900px
   :align: center
   :name: ioc-software-arch

   IOC Native - Software architecture

Virtualization architecture
---------------------------

In the virtualization architecture, the IOC Device Model (DM) is
responsible for communication between the UOS and IOC firmware. The IOC
DM communicates with several native CBC char devices and a PTY device.
The native CBC char devices only include ``/dev/cbc-lifecycle``,
``/dev/cbc-signals``, and ``/dev/cbc-raw0`` - ``/dev/cbc-raw11``. Others
are not used by the IOC DM. The IOC DM opens the ``/dev/ptmx`` device to
create a pair of devices (master and slave). The IOC DM uses these
devices to communicate with the UART DM since the UART DM needs a TTY
capable device as its backend.

.. figure:: images/ioc-image15.png
   :width: 900px
   :align: center
   :name: ioc-virt-software-arch

   IOC Virtualization - Software architecture

High-Level Design
=================

There are five parts in this high-level design:

* Software data flow introduces data transfer in the IOC mediator
* State transfer introduces IOC mediator work states
* CBC protocol illustrates the CBC data packing/unpacking
* Power management involves boot/resume/suspend/shutdown flows
* Emulated CBC commands introduces the workflow of some emulated commands

The IOC mediator has three threads to transfer data between UOS and SOS. The
core thread is responsible for data reception, and the Tx and Rx threads are
used for data transmission. Each of the transmission threads has one
data queue as a buffer, so that the IOC mediator can read data from CBC
char devices and UART DM immediately.

.. figure:: images/ioc-image16.png
   :width: 900px
   :align: center
   :name: ioc-med-sw-data-flow

   IOC Mediator - Software data flow

- For the Tx direction, the data comes from IOC firmware. The IOC mediator
  receives service data from native CBC char devices such as
  ``/dev/cbc-lifecycle``. If the service data is a CBC wakeup reason, some
  wakeup reason bits will be masked. If the service data is a CBC signal,
  the data will be dropped if it is not defined in the whitelist. If the
  service data comes from a raw channel, the data will be passed forward.
  Before transmitting to the virtual UART interface, all data needs to be
  packed with an address header and link header.
- For the Rx direction, the data comes from the UOS. The IOC mediator
  receives link data from the virtual UART interface. The data will be
  unpacked by the Core thread, and then forwarded to the Rx queue, similar
  to how the Tx direction flow is done, except that the heartbeat and RTC
  are only used by the IOC mediator and will not be transferred to IOC
  firmware.
- Currently, the IOC mediator only cares about lifecycle, signal, and raw
  data. Others, e.g. diagnosis, are not used by the IOC mediator.

State transfer
--------------

The IOC mediator has four states and five events for state transfer.

.. figure:: images/ioc-image18.png
   :width: 600px
   :align: center
   :name: ioc-state-transfer

   IOC Mediator - State Transfer

- **INIT state**: This state is the initialized state of the IOC mediator.
  All CBC protocol packets are handled normally. In this state, the UOS
  has not yet sent an active heartbeat.
- **ACTIVE state**: Enter this state if an HB ACTIVE event is triggered,
  indicating that the UOS is active and that bit 23 (the SoC bit) needs to
  be set in the wakeup reason.
- **SUSPENDING state**: Enter this state if a RAM REFRESH event or HB
  INACTIVE event is triggered. The related event handler needs to mask
  all wakeup reason bits except the SoC bit and drop the queued CBC
  protocol frames.
- **SUSPENDED state**: Enter this state if a SHUTDOWN event is triggered to
  close all native CBC char devices. The IOC mediator will be put to
  sleep until a RESUME event is triggered to re-open the closed native
  CBC char devices and transition to the INIT state.

CBC protocol
------------

The IOC mediator needs to pack/unpack the CBC link frame for IOC
virtualization, as shown in the detailed flow below:

.. figure:: images/ioc-image17.png
   :width: 900px
   :align: center
   :name: ioc-cbc-frame-usage

   IOC Native - CBC frame usage

In the native architecture, the CBC link frame is unpacked by the CBC
driver. The usage services only get the service data from the CBC char
devices. For data packing, the CBC driver will compute the checksum and set
the priority for the frame, then send the data to the UART driver.

.. figure:: images/ioc-image20.png
   :width: 900px
   :align: center
   :name: ioc-cbc-prot

   IOC Virtualization - CBC protocol virtualization

The difference between the native and virtualization architectures is
that the IOC mediator needs to re-compute the checksum and reset the
priority. Currently, priority is not supported by IOC firmware; the
priority setting by the IOC mediator is based on the priority setting of
the CBC driver. The SOS and UOS use the same CBC driver.

Power management virtualization
-------------------------------

In acrn-dm, the IOC power management architecture involves the PM DM, IOC
DM, and UART DM modules. The PM DM is responsible for UOS power management,
and the IOC DM is responsible for the heartbeat and wakeup reason flows for
IOC firmware. The heartbeat flow is used to control IOC firmware power state
and the wakeup reason flow is used to indicate the IOC power state to the OS.
The UART DM transfers all IOC data between the SOS and UOS. These modules
complete the boot/suspend/resume/shutdown functions.

Boot flow
+++++++++

.. figure:: images/ioc-image19.png
   :width: 900px
   :align: center
   :name: ioc-virt-boot

   IOC Virtualization - Boot flow

#. Press ignition button for booting.
#. SOS lifecycle service gets a "booting" wakeup reason.
#. SOS lifecycle service notifies the wakeup reason to VM Manager, and VM
   Manager starts the VM.
#. VM Manager sets the VM state to "start".
#. IOC DM forwards the wakeup reason to UOS.
#. PM DM starts UOS.
#. UOS lifecycle gets a "booting" wakeup reason.

Suspend & Shutdown flow
+++++++++++++++++++++++

.. figure:: images/ioc-image21.png
   :width: 900px
   :align: center
   :name: ioc-suspend

   IOC Virtualization - Suspend and Shutdown by Ignition

#. Press ignition button to suspend or shutdown.
#. SOS lifecycle service gets a 0x800000 wakeup reason, then keeps
   sending a shutdown delay heartbeat to IOC firmware, and notifies a
   "stop" event to VM Manager.
#. IOC DM forwards the wakeup reason to UOS lifecycle service.
#. SOS lifecycle service sends a "stop" event to VM Manager, and waits for
   the stop response before timeout.
#. UOS lifecycle service gets a 0x800000 wakeup reason and sends an inactive
   heartbeat with suspend or shutdown SUS_STAT to IOC DM.
#. UOS lifecycle service gets a 0x000000 wakeup reason, then enters
   the suspend or shutdown kernel PM flow based on SUS_STAT.
#. PM DM executes the UOS suspend/shutdown request based on ACPI.
#. VM Manager queries each VM state from PM DM. A suspend request maps
   to a paused state and a shutdown request maps to a stop state.
#. VM Manager collects all VMs' state, and reports it to SOS lifecycle
   service.
#. SOS lifecycle sends an inactive heartbeat to IOC firmware with
   suspend/shutdown SUS_STAT, based on the SOS's own lifecycle service
   policy.

Resume flow
+++++++++++

.. figure:: images/ioc-image22.png
   :width: 900px
   :align: center
   :name: ioc-resume

   IOC Virtualization - Resume flow

Both the ignition button and the RTC can trigger a resume, and they share
the same flow blocks.

For the ignition resume flow:

#. Press ignition button to resume.
#. SOS lifecycle service gets an initial wakeup reason from the IOC
   firmware. The wakeup reason is 0x000020, in which the ignition button
   bit is set. It then sends an active or initial heartbeat to IOC firmware.
#. SOS lifecycle forwards the wakeup reason and sends a start event to VM
   Manager. The VM Manager starts to resume VMs.
#. IOC DM gets the wakeup reason from the VM Manager and forwards it to UOS
   lifecycle service.
#. VM Manager sets the VM state to starting for PM DM.
#. PM DM resumes UOS.
#. UOS lifecycle service gets wakeup reason 0x000020, and then sends an
   initial or active heartbeat. The UOS gets wakeup reason 0x800020 after
   resuming.

For the RTC resume flow:

#. RTC timer expires.
#. SOS lifecycle service gets an initial wakeup reason from the IOC
   firmware. The wakeup reason is 0x000200, in which the RTC bit is set.
   It then sends an active or initial heartbeat to IOC firmware.
#. SOS lifecycle forwards the wakeup reason and sends a start event to VM
   Manager. VM Manager begins resuming VMs.
#. IOC DM gets the wakeup reason from the VM Manager, and forwards it to
   the UOS lifecycle service.
#. VM Manager sets the VM state to starting for PM DM.
#. PM DM resumes UOS.
#. UOS lifecycle service gets the wakeup reason 0x000200, and sends an
   initial or active heartbeat. The UOS gets wakeup reason 0x800200
   after resuming.

System control data
-------------------

The IOC mediator has several emulated CBC commands, including wakeup reason,
heartbeat, and RTC.

The wakeup reason, heartbeat, and RTC commands belong to the system
control frames, which are used for startup or shutdown control. System
control includes Wakeup Reasons, Heartbeat, Boot Selector, Suppress
Heartbeat Check, and Set Wakeup Timer functions. Details are in this
table:

.. list-table:: System control SVC values
   :header-rows: 1

   * - System Control
     - Value Name
     - Description
     - Data Direction

   * - 1
     - Wakeup Reasons
     - Wakeup Reasons
     - IOC to SoC

   * - 2
     - Heartbeat
     - Heartbeat
     - SoC to IOC

   * - 3
     - Boot Selector
     - Boot Selector
     - SoC to IOC

   * - 4
     - Suppress Heartbeat Check
     - Suppress Heartbeat Check
     - SoC to IOC

   * - 5
     - Set Wakeup Timer
     - Set Wakeup Timer in AIOC firmware
     - SoC to IOC

- The IOC mediator only supports the Wakeup Reasons, Heartbeat, and Set
  Wakeup Timer commands.
- The Boot Selector command is used to configure which partition the
  IOC has to use for normal and emergency boots. Additionally, the IOC
  has to report to the SoC after the CBC communication has been
  established successfully with which boot partition has been started
  and for what reason.
- The Suppress Heartbeat Check command is sent by the SoC in
  preparation for maintenance tasks which require the CBC Server to be
  shut down for a certain period of time. It instructs the IOC not to
  expect CBC heartbeat messages during the specified time. The IOC must
  disable any watchdog on the CBC heartbeat messages during this period
  of time.

Wakeup reason
+++++++++++++

The wakeup reasons command contains a bit mask of all reasons that are
currently keeping the SoC/IOC active. The SoC itself also has a wakeup
reason, which allows the SoC to keep the IOC active. The wakeup reasons
should be sent every 1000 ms by the IOC.

The wakeup reason frame definition is as below:

.. figure:: images/ioc-image24.png
   :width: 900px
   :align: center
   :name: ioc-wakeup-reason

   Wakeup Reason Frame Definition

Currently the wakeup reason bits are supported by the sources shown here:

.. list-table:: Wakeup Reason Bits
   :header-rows: 1

   * - Wakeup Reason
     - Bit
     - Source

   * - wakeup_button
     - 5
     - Get from IOC FW, forward to UOS

   * - RTC wakeup
     - 9
     - Get from IOC FW, forward to UOS

   * - car door wakeup
     - 11
     - Get from IOC FW, forward to UOS

   * - SoC wakeup
     - 23
     - Emulation (depends on UOS's heartbeat message)

- CBC_WK_RSN_BTN (bit 5): ignition button.
- CBC_WK_RSN_RTC (bit 9): RTC timer.
- CBC_WK_RSN_DOR (bit 11): Car door.
- CBC_WK_RSN_SOC (bit 23): SoC active/inactive.

.. figure:: images/ioc-image4.png
   :width: 600px
   :align: center
   :name: ioc-wakeup-flow

   IOC Mediator - Wakeup reason flow

Bit 23 is for the SoC wakeup indicator and should not be forwarded
directly because every VM has a different heartbeat status.
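
The CBC_WK_RSN_* bit numbers listed above translate into masks like the
following sketch; the helper function and its name are illustrative only,
not the IOC mediator's actual code.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   #define CBC_WK_RSN_BTN  (1U << 5)    /* 0x000020: ignition button     */
   #define CBC_WK_RSN_RTC  (1U << 9)    /* 0x000200: RTC timer           */
   #define CBC_WK_RSN_DOR  (1U << 11)   /* 0x000800: car door            */
   #define CBC_WK_RSN_SOC  (1U << 23)   /* 0x800000: SoC active/inactive */

   /* Bit 23 is emulated per VM rather than forwarded directly: it is set
    * only when that VM's heartbeat is currently active. */
   static uint32_t wakeup_reason_for_vm(uint32_t hw_reason, bool hb_active)
   {
       uint32_t reason = hw_reason & ~CBC_WK_RSN_SOC;

       if (hb_active)
           reason |= CBC_WK_RSN_SOC;
       return reason;
   }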

Heartbeat
+++++++++

The Heartbeat is used for the SoC watchdog, indicating the SoC power
reset behavior. The Heartbeat needs to be sent every 1000 ms by
the SoC.

.. figure:: images/ioc-image5.png
   :width: 900px
   :align: center
   :name: ioc-heartbeat

   System control - Heartbeat

The Heartbeat frame definition is shown here:

.. figure:: images/ioc-image6.png
   :width: 900px
   :align: center
   :name: ioc-heartbeat-frame

   Heartbeat Frame Definition

- Heartbeat active is repeatedly sent from SoC to IOC to signal that
  the SoC is active and intends to stay active. The On SUS_STAT action
  must be set to invalid.
- Heartbeat inactive is sent once from SoC to IOC to signal that the
  SoC is ready for power shutdown. The On SUS_STAT action must be set
  to a required value.
- Heartbeat delay is repeatedly sent from SoC to IOC to signal that the
  SoC has received the shutdown request, but isn't ready for
  shutdown yet (for example, a phone call or other time consuming
  action is active). The On SUS_STAT action must be set to invalid.

.. figure:: images/ioc-image7.png
   :width: 600px
   :align: center
   :name: ioc-heartbeat-commands

   Heartbeat Commands

- The SUS_STAT invalid action needs to be set with a heartbeat active
  message.
- For the heartbeat inactive message, the SoC needs to set a command
  from 1 to 7 following the related scenario. For example, the S3 case
  needs to be set to 7 to prevent the memory from being power gated.
- The difference between halt and reboot is whether the power rail
  that supplies customer peripherals (such as Fan, HDMI-in, BT/Wi-Fi,
  M.2, and Ethernet) is reset.

.. figure:: images/ioc-image8.png
   :width: 900px
   :align: center
   :name: ioc-heartbeat-flow

   IOC Mediator - Heartbeat Flow

- The IOC DM will not maintain a watchdog timer for a heartbeat message.
  This is because it already has other watchdog features, so the main use
  of the Heartbeat active command is to maintain the virtual wakeup reason
  bitmap variable.
- For Heartbeat, the IOC mediator supports Heartbeat shutdown prepared,
  Heartbeat active, Heartbeat shutdown delay, Heartbeat initial, and
  Heartbeat Standby.
- For SUS_STAT, the IOC mediator supports the invalid action and RAM
  refresh action.
- Suppress Heartbeat Check messages are also dropped directly.

RTC
+++

The RTC timer is used to wake up the SoC when the timer expires. (A use
case is an automatic software upgrade at a specific time.) The RTC frame
definition is as below.

.. figure:: images/ioc-image9.png
   :width: 600px
   :align: center

- The RTC command contains a relative time but not an absolute time.
- The SOS lifecycle service will re-compute the time offset before it is
  sent to the IOC firmware.

.. figure:: images/ioc-image10.png
   :width: 900px
   :align: center
   :name: ioc-rtc-flow

   IOC Mediator - RTC flow

Signal data
-----------

The signal channel is an API between the SoC and IOC for
miscellaneous requirements. The process data includes all vehicle bus and
carrier board data (GPIO, sensors, and so on). It supports
transportation of single signals and group signals. Each signal consists
of a signal ID (reference), its value, and its length. The IOC and SoC
need to agree on the definition of the signal IDs, which can be treated
as API interface definitions.

IOC signal type definitions are as below.

.. figure:: images/ioc-image1.png
   :width: 600px
   :align: center
   :name: ioc-process-data-svc-val

   Process Data SVC values

.. figure:: images/ioc-image2.png
   :width: 900px
   :align: center
   :name: ioc-med-signal-flow

   IOC Mediator - Signal flow

- The IOC backend needs to emulate the channel open/reset/close messages,
  which shouldn't be forwarded to the native CBC signal channel. The SOS
  signal related services should do the real open/reset/close of the
  signal channel.
- Every backend should maintain a whitelist for different VMs. The
  whitelist can be stored in the SOS file system (read only) in the
  future, but currently it is hard coded.

The IOC mediator has two whitelist tables, one is used for rx
signals (SoC->IOC), and the other one is used for tx signals. The IOC
mediator drops single signals and group signals if the signals are
not defined in the whitelist. For a multi signal, the IOC mediator
generates a new multi signal, which contains only the signals in the
whitelist.
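
A minimal sketch of this filtering is shown below. The table contents and
names are purely illustrative placeholders; the real whitelists are hard
coded in the IOC mediator per VM.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Placeholder signal IDs; the real per-VM whitelist is hard coded. */
   static const uint16_t tx_whitelist[] = { 0x0001U, 0x0002U, 0x0010U };

   static bool tx_signal_allowed(uint16_t signal_id)
   {
       size_t n = sizeof(tx_whitelist) / sizeof(tx_whitelist[0]);

       for (size_t i = 0U; i < n; i++) {
           if (tx_whitelist[i] == signal_id)
               return true;
       }
       return false;   /* signals not in the whitelist are dropped */
   }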

.. figure:: images/ioc-image3.png
   :width: 600px
   :align: center
   :name: ioc-med-multi-signal

   IOC Mediator - Multi-Signal whitelist

Raw data
--------

An OEM raw channel is only assigned to a specific UOS following the OEM
configuration. The IOC Mediator will directly forward all read/write
messages from the IOC firmware to the UOS without any modification.

Dependencies and Constraints
****************************

HW External Dependencies
========================

.. list-table::
   :widths: 50 50
   :header-rows: 1

   * - Dependency
     - Runtime Mechanism to Detect Violations

   * - VMX should be supported
     - Boot-time checks to CPUID. See section A.1 in SDM for details.

   * - EPT should be supported
     - Boot-time checks to primary and secondary processor-based
       VM-execution controls. See section A.3.2 and A.3.3 in SDM for
       details.

SW External Dependencies
========================

.. list-table::
   :widths: 50 50
   :header-rows: 1

   * - Dependency
     - Runtime Mechanism to Detect Violations

   * - When invoking the hypervisor, the bootloader should have
       established a multiboot-compliant state
     - Check the magic value in EAX. See section 3.2 & 3.3 in Multiboot
       Specification for details.

Constraints
===========

.. list-table::
   :widths: 33 33 33
   :header-rows: 1

   * - Description
     - Rationale
     - How such constraint is enforced

   * - Physical cores are exclusively assigned to vcpus.
     - To avoid interference between vcpus on the same core.
     - A bitmap indicating free pcpus; on vcpu creation a free pcpu is
       picked.

   * - Only PCI devices supporting HW reset can be passed through to a
       UOS.
     - Without HW reset it is challenging to manage devices on UOS
       crashes.
     -


Interface Specification
***********************

Doxygen-style comments in the code are used for interface specification.
This section provides some examples of how functions and structures
should be commented.

Function Header Template
========================

.. code-block:: c

   /**
    * @brief Initialize environment for Trusty-OS on a VCPU.
    *
    * More info here.
    *
    * @param[in] vcpu Pointer to VCPU data structure
    * @param[inout] param guest physical address. This gpa points to
    *               struct trusty_boot_param
    *
    * @return 0 - on success.
    * @return -EIO - (description when this error can happen)
    * @return -EINVAL - (description )
    *
    * @pre vcpu must not be NULL.
    * @pre param must ...
    *
    * @post the return value is non-zero if param is ....
    * @post
    *
    * @remark The api must be invoked with interrupt disabled.
    * @remark (Other usage constraints here)
    */


Structure
=========

.. code-block:: c

   /**
    * @brief An mmio request.
    *
    * More info here.
    */
   struct mmio_request {
       uint32_t direction;   /**< Direction of this request. */
       uint32_t reserved;    /**< Reserved. */
       int64_t address;      /**< gpa of the register to be accessed. */
       int64_t size;         /**< Width of the register to be accessed. */
       int64_t value;        /**< Value read from or to be written to the
                                  register. */
   } __aligned(8);


IOC Mediator Configuration
**************************

TBD

IOC Mediator Usage
******************

The device model configuration command syntax for the IOC mediator is as
follows::

   -i,[ioc_channel_path],[wakeup_reason]
   -l,[lpc_port],[ioc_channel_path]

The "ioc_channel_path" is an absolute path for communication between the
IOC mediator and the UART DM.

The "lpc_port" is "com1" or "com2"; the IOC mediator needs one unassigned
LPC port for data transfer between UOS and SOS.

The "wakeup_reason" is the IOC mediator boot-up reason, where each bit
represents one wakeup reason.

For example, the following commands are used to enable the IOC feature: the
initial wakeup reason is the ignition button and cbc_attach uses ttyS1
for the TTY line discipline in UOS::

   -i /run/acrn/ioc_$vm_name,0x20
   -l com2,/run/acrn/ioc_$vm_name


Porting and adaptation to different platforms
*********************************************

TBD

@@ -1,498 +0,0 @@

.. _memmgt-hld:

Memory Management high-level design
###################################

This document describes memory management for the ACRN hypervisor.

Overview
********

The hypervisor (HV) virtualizes real physical memory so an unmodified OS
(such as Linux or Android) running in a virtual machine has the view of
managing its own contiguous physical memory. The HV uses virtual-processor
identifiers (VPIDs) and the extended page-table mechanism (EPT) to
translate guest-physical addresses into host-physical addresses. The HV
enables the EPT and VPID hardware virtualization features, establishes EPT
page tables for SOS/UOS, and provides EPT page tables operation interfaces
to others.

In the ACRN hypervisor system, there are a few different memory spaces to
consider. From the hypervisor's point of view there are:

- **Host Physical Address (HPA)**: the native physical address space, and
- **Host Virtual Address (HVA)**: the native virtual address space based on
  an MMU. A page table is used to translate between HPA and HVA
  spaces.

From the Guest OS running on a hypervisor there are:

- **Guest Physical Address (GPA)**: the guest physical address space from a
  virtual machine. GPA to HPA translation is usually based on an
  MMU-like hardware module (EPT in X86), and associated with a page
  table
- **Guest Virtual Address (GVA)**: the guest virtual address space from a
  virtual machine based on a vMMU

.. figure:: images/mem-image2.png
   :align: center
   :width: 900px
   :name: mem-overview

   ACRN Memory Mapping Overview

:numref:`mem-overview` provides an overview of the ACRN system memory
mapping, showing:

- GVA to GPA mapping based on vMMU on a VCPU in a VM
- GPA to HPA mapping based on EPT for a VM in the hypervisor
- HVA to HPA mapping based on MMU in the hypervisor

This document illustrates the memory management infrastructure for the
ACRN hypervisor and how it handles the different memory space views
inside the hypervisor and from a VM:

- How ACRN hypervisor manages host memory (HPA/HVA)
- How ACRN hypervisor manages SOS guest memory (HPA/GPA)
- How ACRN hypervisor & SOS DM manage UOS guest memory (HPA/GPA)

Hypervisor Physical Memory Management
*************************************

In ACRN, the HV initializes MMU page tables to manage all physical
memory and then switches to the new MMU page tables. After the MMU page
tables are initialized at the platform initialization stage, no updates
are made to the MMU page tables.

Hypervisor Physical Memory Layout - E820
========================================

The ACRN hypervisor is the primary owner to manage system memory.
Typically the boot firmware (e.g., EFI) passes the platform physical
memory layout - E820 table to the hypervisor. The ACRN hypervisor does
its memory management based on this table using 4-level paging.

The BIOS/bootloader firmware (e.g., EFI) passes the E820 table through a
multiboot protocol. This table contains the original memory layout for
the platform.

.. figure:: images/mem-image1.png
   :align: center
   :width: 900px
   :name: mem-layout

   Physical Memory Layout Example

:numref:`mem-layout` is an example of the physical memory layout based on
a simple platform E820 table.

Hypervisor Memory Initialization
================================

The ACRN hypervisor runs under paging mode. After the bootstrap
processor (BSP) gets the platform E820 table, the BSP creates its MMU page
table based on it. This is done by the functions *init_paging()* and
*enable_smep()*. After the application processor (AP) receives the IPI CPU
startup interrupt, it uses the MMU page tables created by the BSP and
enables SMEP. :numref:`hv-mem-init` describes the hypervisor memory
initialization for the BSP and APs.

.. figure:: images/mem-image8.png
   :align: center
   :name: hv-mem-init

   Hypervisor Memory Initialization

The memory mapping policy used is:

- Identical mapping (ACRN hypervisor memory could be relocatable in
  the future)
- Map all memory regions with UNCACHED type
- Remap RAM regions to WRITE-BACK type

.. figure:: images/mem-image69.png
   :align: center
   :name: hv-mem-vm-init

   Hypervisor Virtual Memory Layout

:numref:`hv-mem-vm-init` above shows:

- Hypervisor has a view of and can access all system memory
- Hypervisor has UNCACHED MMIO/PCI hole reserved for devices such as
  LAPIC/IOAPIC accessing
- Hypervisor has its own memory with WRITE-BACK cache type for its
  code/data (< 1M part is for secondary CPU reset code)

The hypervisor should use a minimum number of memory pages to map from
virtual address space into physical address space:

- If a 1GB hugepage can be used
  for virtual address space mapping, the corresponding PDPT entry shall be
  set for this 1GB hugepage.
- If a 1GB hugepage can't be used for virtual
  address space mapping and a 2MB hugepage can be used, the corresponding
  PDT entry shall be set for this 2MB hugepage.
- If neither a 1GB hugepage
  nor a 2MB hugepage can be used for virtual address space mapping, the
  corresponding PT entry shall be set.
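
The page-size choice above can be sketched as follows. This is an
illustration only, not the hypervisor's code; the function name and the
way alignment is checked are assumptions.

.. code-block:: c

   #include <stdint.h>

   #define PAGE_SIZE_4K  0x1000ULL
   #define PAGE_SIZE_2M  0x200000ULL
   #define PAGE_SIZE_1G  0x40000000ULL

   /* Pick the largest page size usable for the next mapping step: both
    * addresses must be aligned to that size and the remaining length must
    * cover at least one full page of that size. */
   static uint64_t best_page_size(uint64_t vaddr, uint64_t paddr, uint64_t len)
   {
       if ((((vaddr | paddr) & (PAGE_SIZE_1G - 1ULL)) == 0ULL) &&
           (len >= PAGE_SIZE_1G))
           return PAGE_SIZE_1G;   /* use a PDPT (1GB) entry  */
       if ((((vaddr | paddr) & (PAGE_SIZE_2M - 1ULL)) == 0ULL) &&
           (len >= PAGE_SIZE_2M))
           return PAGE_SIZE_2M;   /* use a PDT (2MB) entry   */
       return PAGE_SIZE_4K;       /* fall back to a PT entry */
   }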

If the memory type or access rights of a page are updated, or some virtual
address space is deleted, it will lead to splitting of the corresponding
page. The hypervisor will still keep using a minimum number of memory pages
to map from virtual address space into physical address space.

Memory Pages Pool Functions
===========================

Memory pages pool functions provide dynamic management of multiple
4KB page-size memory blocks, used by the hypervisor to store internal
data. Through these functions, the hypervisor can allocate and
deallocate pages.

Data Flow Design
================

The physical memory management unit provides MMU 4-level page tables
creating and updating services, MMU page tables switching service, SMEP
enable service, and HPA/HVA retrieving service to other units.
:numref:`mem-data-flow-physical` shows the data flow diagram
of physical memory management.

.. figure:: images/mem-image45.png
   :align: center
   :name: mem-data-flow-physical

   Data Flow of Hypervisor Physical Memory Management

Interfaces Design
=================


MMU Initialization
------------------

.. doxygenfunction:: enable_smep
   :project: Project ACRN

.. doxygenfunction:: enable_paging
   :project: Project ACRN

.. doxygenfunction:: init_paging
   :project: Project ACRN

Address Space Translation
-------------------------

.. doxygenfunction:: hpa2hva
   :project: Project ACRN

.. doxygenfunction:: hva2hpa
   :project: Project ACRN


Hypervisor Memory Virtualization
********************************

The hypervisor provides a contiguous region of physical memory for SOS
and each UOS. It also guarantees that the SOS and UOS cannot access
code and internal data in the hypervisor, and each UOS cannot access
code and internal data of the SOS and other UOSs.

The hypervisor:

- enables EPT and VPID hardware virtualization features,
- establishes EPT page tables for SOS/UOS,
- provides EPT page tables operations services,
- virtualizes MTRR for SOS/UOS,
- provides VPID operations services,
- provides services for address spaces translation between GPA and HPA, and
- provides services for data transfer between hypervisor and virtual machine.

Memory Virtualization Capability Checking
=========================================

In the hypervisor, memory virtualization provides an EPT/VPID capability
checking service and an EPT hugepage supporting checking service. Before
the HV enables memory virtualization and uses EPT hugepages, these
services need to be invoked by other units.

Data Transfer between Different Address Spaces
==============================================

In ACRN, different memory space management is used in the hypervisor,
Service OS, and User OS to achieve spatial isolation. Between memory
spaces, there are different kinds of data transfer: for example, a SOS/UOS
may issue a hypercall to request hypervisor services that involve data
transfer, or the hypervisor may do instruction emulation, where the HV
needs to access the guest instruction pointer register to fetch guest
instruction data.

Access GPA from Hypervisor
--------------------------

When the hypervisor needs to access a GPA range for data transfer, the
caller from the guest must make sure this memory range's GPA is
contiguous. But the corresponding HPA in the hypervisor could be
non-contiguous (especially for a UOS under the hugetlb allocation
mechanism). For example, a 4M GPA range may map to 2
different 2M huge host-physical pages. The ACRN hypervisor must take
care of this kind of data transfer by doing EPT page walking based on
its HPA.
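
A minimal sketch of such a chunked copy is shown below. The helper
prototypes stand in for the GPA-to-HPA and HPA-to-HVA translation
services listed later in this document; their names, the ``struct vm``
type, and the fixed 4KB chunking are assumptions for illustration.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>
   #include <string.h>

   #define PAGE_SIZE_BYTES 4096ULL

   struct vm;   /* opaque VM handle for this sketch */

   /* Hypothetical translation helpers (EPT walk and HPA-to-HVA mapping). */
   uint64_t xlate_gpa_to_hpa(struct vm *vm, uint64_t gpa);
   void *xlate_hpa_to_hva(uint64_t hpa);

   /* Copy from a contiguous GPA range into a hypervisor buffer. The GPA is
    * contiguous but the backing HPA may not be, so the copy is chunked so
    * that each step stays within one page and is translated separately. */
   static void copy_from_guest(struct vm *vm, void *dst,
                               uint64_t gpa, size_t size)
   {
       uint8_t *out = dst;

       while (size > 0U) {
           uint64_t offset = gpa & (PAGE_SIZE_BYTES - 1ULL);
           size_t chunk = (size_t)(PAGE_SIZE_BYTES - offset);

           if (chunk > size)
               chunk = size;

           memcpy(out, xlate_hpa_to_hva(xlate_gpa_to_hpa(vm, gpa)), chunk);
           out += chunk;
           gpa += chunk;
           size -= chunk;
       }
   }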

Access GVA from Hypervisor
--------------------------

When the hypervisor needs to access a GVA range for data transfer, it is
likely that both the GPA and HPA ranges are non-contiguous. The ACRN
hypervisor must watch for this kind of data transfer, and handle it by
doing page walking based on both its GPA and HPA.

EPT Page Tables Operations
==========================

The hypervisor should use a minimum of memory pages to map from
guest-physical address (GPA) space into host-physical address (HPA)
space:

- If a 1GB hugepage can be used for GPA space mapping, the
  corresponding EPT PDPT entry shall be set for this 1GB hugepage.
- If a 1GB hugepage can't be used for GPA space mapping and a 2MB hugepage
  can be used, the corresponding EPT PDT entry shall be set for this 2MB
  hugepage.
- If neither a 1GB hugepage nor a 2MB hugepage can be used for GPA
  space mapping, the corresponding EPT PT entry shall be set.

If the memory type or access rights of a page are updated or some GPA space
is deleted, it will lead to the corresponding EPT page being split. The
hypervisor should still keep using a minimum number of EPT pages to map
from GPA space into HPA space.

The hypervisor provides an EPT guest-physical mappings adding service, an
EPT guest-physical mappings modifying/deleting service, an EPT page tables
deallocation service, and an EPT guest-physical mappings invalidation
service.

Virtual MTRR
************

In ACRN, the hypervisor only virtualizes the MTRRs fixed range (0~1MB).
The HV sets the MTRRs of the fixed range as Write-Back for the UOS, and
the SOS reads the native MTRRs of the fixed range set by the BIOS.

If the guest physical address is not in the fixed range (0~1MB), the
hypervisor uses the default memory type in the MTRR (Write-Back).

When the guest disables MTRRs, the HV sets the guest address memory type
as UC.

If the guest physical address is in the fixed range (0~1MB), the HV sets
the memory type according to the fixed virtual MTRRs.

When the guest enables MTRRs, the guest MTRRs have no direct effect on the
memory type used for accesses to GPA. Instead, the HV intercepts MTRR MSR
register accesses through MSR access VM exits and updates the EPT memory
type field in the EPT PTE according to the memory type selected by the
MTRRs. This combines with the PAT entry in the PAT MSR (which is
determined by the PAT, PCD, and PWT bits from the guest paging structures)
to determine the effective memory type.
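
As a simplified sketch (illustrative names; the real fixed-range MTRRs
cover several sub-ranges below 1MB rather than a single setting), the
type selection described above looks roughly like this:

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   #define MTRR_FIXED_RANGE_END 0x100000ULL   /* 1 MB */

   enum mem_type { MEM_TYPE_UC, MEM_TYPE_WB };

   /* UC when the guest disabled MTRRs, the virtualized fixed-range setting
    * below 1MB, and the Write-Back default type everywhere else. */
   static enum mem_type vmtrr_mem_type(bool mtrr_enabled, uint64_t gpa,
                                       enum mem_type fixed_range_type)
   {
       if (!mtrr_enabled)
           return MEM_TYPE_UC;
       if (gpa < MTRR_FIXED_RANGE_END)
           return fixed_range_type;
       return MEM_TYPE_WB;
   }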

VPID operations
===============

Virtual-processor identifier (VPID) is a hardware feature to optimize
TLB management. When VPID is enabled, the hardware adds a tag to the TLB
entries of a logical processor and caches information for multiple
linear-address spaces. VMX transitions may retain cached information when
the logical processor switches to a different address space, avoiding
unnecessary TLB flushes.

In ACRN, a unique VPID must be allocated for each virtual CPU
when a virtual CPU is created. The logical processor invalidates linear
mappings and combined mappings associated with all VPIDs (except VPID
0000H), and with all PCIDs, when the logical processor launches the virtual
CPU. The logical processor invalidates all linear mappings and combined
mappings associated with the specified VPID when the interrupt pending
request handling needs to invalidate the cached mapping of the specified
VPID.
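
A trivial sketch of the per-vCPU allocation rule is shown below; the
function name and the lack of locking and recycling are simplifications
for illustration only.

.. code-block:: c

   #include <stdint.h>

   /* VPID 0000H is reserved for the hypervisor's own context, so allocation
    * starts at 1. A real implementation would need synchronization and a
    * bound check against the 16-bit VPID space. */
   static uint16_t allocate_vpid(void)
   {
       static uint16_t next_vpid = 1U;

       return next_vpid++;
   }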

Data Flow Design
================

The memory virtualization unit includes address space translation
functions, data transferring functions, VM EPT operations functions,
VPID operations functions, VM exit handling for EPT violation and EPT
misconfiguration, and MTRR virtualization functions. This unit handles
guest-physical mapping updates by creating or updating related EPT page
tables. It virtualizes MTRR for the guest OS by updating related EPT page
tables. It handles address translation from GPA to HPA by walking EPT
page tables. It copies data from VM into the HV or from the HV to VM by
walking guest MMU page tables and EPT page tables. It provides services
to allocate a VPID for each virtual CPU and TLB invalidation related to
VPID. It handles VM exits caused by EPT violation and EPT
misconfiguration. The following :numref:`mem-flow-mem-virt` describes the
data flow diagram of the memory virtualization unit.

.. figure:: images/mem-image84.png
   :align: center
   :name: mem-flow-mem-virt

   Data Flow of Hypervisor Memory Virtualization

Data Structure Design
=====================

EPT Memory Type Definition:

.. doxygengroup:: ept_mem_type
   :project: Project ACRN
   :content-only:

EPT Memory Access Right Definition:

.. doxygengroup:: ept_mem_access_right
   :project: Project ACRN
   :content-only:


Interfaces Design
=================

The memory virtualization unit interacts with external units through VM
exit and APIs.

VM Exit about EPT
=================

There are two VM exit handlers for EPT violation and EPT
misconfiguration in the hypervisor. EPT page tables are
always configured correctly for SOS and UOS. If an EPT misconfiguration is
detected, a fatal error is reported by the HV. The hypervisor
uses EPT violations to intercept MMIO accesses to do device emulation. EPT
violation handling data flow is described in
:ref:`instruction-emulation`.

Memory Virtualization APIs
==========================

Here is a list of major memory related APIs in the HV:

EPT/VPID Capability Checking
----------------------------

Data Transferring between hypervisor and VM
-------------------------------------------

.. doxygenfunction:: copy_from_gpa
   :project: Project ACRN

.. doxygenfunction:: copy_to_gpa
   :project: Project ACRN

.. doxygenfunction:: copy_from_gva
   :project: Project ACRN

Address Space Translation
-------------------------

.. doxygenfunction:: gpa2hpa
   :project: Project ACRN

.. doxygenfunction:: sos_vm_hpa2gpa
   :project: Project ACRN

EPT
---

.. doxygenfunction:: ept_add_mr
   :project: Project ACRN

.. doxygenfunction:: ept_del_mr
   :project: Project ACRN

.. doxygenfunction:: ept_modify_mr
   :project: Project ACRN

.. doxygenfunction:: destroy_ept
   :project: Project ACRN

.. doxygenfunction:: invept
   :project: Project ACRN

.. doxygenfunction:: ept_misconfig_vmexit_handler
   :project: Project ACRN

Virtual MTRR
------------

.. doxygenfunction:: init_vmtrr
   :project: Project ACRN

.. doxygenfunction:: write_vmtrr
   :project: Project ACRN

.. doxygenfunction:: read_vmtrr
   :project: Project ACRN

VPID
----

.. doxygenfunction:: flush_vpid_single
   :project: Project ACRN

.. doxygenfunction:: flush_vpid_global
   :project: Project ACRN

Service OS Memory Management
****************************

After the ACRN hypervisor starts, it creates the Service OS as its first
VM. The Service OS runs all the native device drivers, manages the
hardware devices, and provides I/O mediation to guest VMs. The Service
OS is in charge of the memory allocation for Guest VMs as well.

The ACRN hypervisor passes access to the whole system memory (except its
own part) to the Service OS. The Service OS must be able to access all of
the system memory except the hypervisor part.

Guest Physical Memory Layout - E820
===================================

The ACRN hypervisor passes the original E820 table to the Service OS
after filtering out its own part. So from the Service OS's view, it sees
almost all the system memory as shown here:

.. figure:: images/mem-image3.png
   :align: center
   :width: 900px
   :name: sos-mem-layout

   SOS Physical Memory Layout

Host to Guest Mapping
=====================

The ACRN hypervisor creates the Service OS's host (HPA) to guest (GPA)
mapping (EPT mapping) through the function ``prepare_sos_vm_memmap()``
when it creates the SOS VM. It follows these rules:

- Identical mapping
- Map all memory range with UNCACHED type
- Remap RAM entries in E820 (revised) with WRITE-BACK type
- Unmap ACRN hypervisor memory range
- Unmap ACRN hypervisor emulated vLAPIC/vIOAPIC MMIO range

The host to guest mapping is static for the Service OS; it will not
change after the Service OS begins running. Each native device driver
can access its MMIO through this static mapping. EPT violations only
serve the vLAPIC/vIOAPIC emulation in the hypervisor for the Service OS
VM.

Trusty
******

For an Android User OS, there is secure world (trusty world) support,
whose memory must be secured by the ACRN hypervisor and
must not be accessible by the SOS and UOS normal world.

.. figure:: images/mem-image18.png
   :align: center

   UOS Physical Memory Layout with Trusty

@@ -1,367 +0,0 @@

.. _partition-mode-hld:

Partition mode
##############

ACRN is a type-1 hypervisor that supports running multiple guest operating
systems (OS). Typically, the platform BIOS/boot-loader boots ACRN, and
ACRN loads a single guest OS or multiple guest OSes. Refer to
:ref:`hv-startup` for details on the start-up flow of the ACRN hypervisor.

ACRN supports two modes of operation: Sharing mode and Partition mode.
This document describes ACRN's high-level design for Partition mode
support.

.. contents::
   :depth: 2
   :local:

Introduction
************

In partition mode, ACRN provides guests with exclusive access to cores,
memory, cache, and peripheral devices. Partition mode enables developers
to dedicate resources exclusively among the guests. However, there is no
support today in x86 hardware or in ACRN to partition resources such as
peripheral buses (e.g. PCI) or memory bandwidth. Cache partitioning
technology, such as Cache Allocation Technology (CAT) in x86, can be
used by developers to partition the Last Level Cache (LLC) among the
guests. (Note: ACRN support for x86 CAT is on the roadmap, but not
currently supported).

ACRN expects static partitioning of resources either by code
modification for guest configuration or through compile-time config
options. All the devices exposed to the guests are either physical
resources or emulated in the hypervisor. So, there is no need for a
device-model and Service OS. :numref:`pmode2vms` shows a partition mode
example of two VMs with exclusive access to physical resources.

.. figure:: images/partition-image3.png
   :align: center
   :name: pmode2vms

   Partition Mode example with two VMs

Guest info
**********

ACRN uses multi-boot info passed from the platform boot-loader to know
the location of each guest kernel in memory. ACRN creates a copy of each
guest kernel into each of the guests' memory. The current implementation of
ACRN requires developers to specify kernel parameters for the guests as
part of the guest configuration. ACRN picks up the kernel parameters from
the guest configuration and copies them to the corresponding guest memory.

.. figure:: images/partition-image18.png
   :align: center

ACRN set-up for guests
**********************

Cores
=====

ACRN requires the developer to specify the number of guests and the
cores dedicated for each guest. Also the developer needs to specify
the physical core used as the Boot Strap Processor (BSP) for each guest. As
the processors are brought to life in the hypervisor, it checks if they are
configured as the BSP for any of the guests. If a processor is the BSP of
any of the guests, ACRN proceeds to build the memory mapping for the guest,
mptable, E820 entries, and zero page for the guest. As described in
`Guest info`_, ACRN creates copies of the guest kernel and kernel
parameters into guest memory. :numref:`partBSPsetup` explains these
events in chronological order.

.. figure:: images/partition-image7.png
   :align: center
   :name: partBSPsetup

Memory
======

For each guest in partition mode, the ACRN developer specifies the size of
memory for the guest and the starting address in the host physical
address space in the guest configuration. There is no support for HIGHMEM
for partition mode guests. The developer needs to take care of two aspects
when assigning host memory to the guests:

1) The sum of the guest PCI hole and guest "System RAM" is less than 4GB.

2) Pick the starting address in the host physical address space and the
   size, so that it does not overlap with any reserved regions in the
   host E820.

ACRN creates an EPT mapping for the guest between GPA (0, memory size) and
HPA (starting address in guest configuration, memory size).
|
||||
|
||||
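
The two constraints above can be checked ahead of time. The following is a
small, hypothetical sketch of such a check; the structure layouts and values
are illustrative only and do not match ACRN's actual guest configuration
code.

.. code-block:: c

   /* Hedged sketch of validating a partition-mode guest's memory
    * assignment against the host E820. Types and values are examples. */
   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>

   struct host_region { uint64_t base, len; bool reserved; };
   struct guest_mem_cfg { uint64_t start_hpa, size, pci_hole; };

   static bool overlaps(uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1)
   {
       return (a0 < b1) && (b0 < a1);
   }

   static bool guest_mem_cfg_ok(const struct guest_mem_cfg *g,
                                const struct host_region *e820, int n)
   {
       /* Rule 1: guest RAM plus PCI hole must fit under 4 GB. */
       if (g->size + g->pci_hole > (4ULL << 30))
           return false;

       /* Rule 2: the assigned HPA range must not overlap reserved regions. */
       for (int i = 0; i < n; i++) {
           if (e820[i].reserved &&
               overlaps(g->start_hpa, g->start_hpa + g->size,
                        e820[i].base, e820[i].base + e820[i].len))
               return false;
       }
       return true;
   }

   int main(void)
   {
       const struct host_region e820[] = {
           { 0x00000000, 0x0009F000, false },
           { 0x000F0000, 0x00010000, true  },   /* reserved */
           { 0x00100000, 0x7FF00000, false },
       };
       const struct guest_mem_cfg vm1 = { 0x20000000, 1ULL << 30, 512ULL << 20 };

       printf("vm1 config %s\n", guest_mem_cfg_ok(&vm1, e820, 3) ? "OK" : "invalid");
       return 0;
   }
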
E820 and zero page info
=======================

A default E820 is used for all the guests in partition mode. The table
below shows the reference E820 layout. The zero page is created with this
E820 info for all the guests.

+------------------------+
| RAM                    |
|                        |
| 0 - 0xEFFFFH           |
+------------------------+
| RESERVED (MPTABLE)     |
|                        |
| 0xF0000H - 0x100000H   |
+------------------------+
| RAM                    |
|                        |
| 0x100000H - LOWMEM     |
+------------------------+
| RESERVED               |
+------------------------+
| PCI HOLE               |
+------------------------+
| RESERVED               |
+------------------------+

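
Expressed as data, the reference layout above might look like the sketch
below. The entry types follow the usual E820 conventions (1 = usable RAM,
2 = reserved); the LOWMEM, PCI-hole, and top-of-memory bounds are
placeholders chosen per guest configuration, not ACRN constants.

.. code-block:: c

   /* Hedged sketch of the default partition-mode guest E820,
    * mirroring the reference table above. Bounds are examples only. */
   #include <stdint.h>
   #include <stdio.h>

   #define E820_RAM      1U
   #define E820_RESERVED 2U

   struct e820_entry { uint64_t base; uint64_t length; uint32_t type; };

   #define GUEST_LOWMEM    0x40000000ULL   /* example: 1 GB of guest RAM       */
   #define GUEST_PCI_HOLE  0x80000000ULL   /* example: PCI hole starts at 2 GB */
   #define GUEST_TOP       0x100000000ULL  /* example: reserved up to 4 GB     */

   static const struct e820_entry default_guest_e820[] = {
       { 0x00000000ULL,  0x000F0000ULL,                 E820_RAM      },
       { 0x000F0000ULL,  0x00010000ULL,                 E820_RESERVED }, /* mptable */
       { 0x00100000ULL,  GUEST_LOWMEM - 0x00100000ULL,  E820_RAM      },
       { GUEST_LOWMEM,   GUEST_PCI_HOLE - GUEST_LOWMEM, E820_RESERVED },
       { GUEST_PCI_HOLE, GUEST_TOP - GUEST_PCI_HOLE,    E820_RESERVED }, /* PCI hole + reserved */
   };

   int main(void)
   {
       for (unsigned i = 0;
            i < sizeof(default_guest_e820) / sizeof(default_guest_e820[0]); i++)
           printf("[%#11llx - %#11llx) type %u\n",
                  (unsigned long long)default_guest_e820[i].base,
                  (unsigned long long)(default_guest_e820[i].base +
                                       default_guest_e820[i].length),
                  default_guest_e820[i].type);
       return 0;
   }
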
Platform info - mptable
=======================

ACRN, in partition mode, uses an mptable to convey platform info to each
guest. Using this platform information, the number of cores used for each
guest, and whether the guest needs devices with INTx, ACRN builds the
mptable and copies it to the guest memory. In partition mode, ACRN passes
physical APIC IDs to the guests.

I/O - Virtual devices
=====================

Port I/O is supported for PCI device config space ports 0xcf8 and 0xcfc,
vUART 0x3f8, vRTC 0x70 and 0x71, and the vPIC ranges 0x20/21, 0xa0/a1, and
0x4d0/4d1. MMIO is supported for the vIOAPIC. ACRN exposes a virtual
host bridge at BDF (Bus:Device.Function) 0:0.0 to each guest. Access to
the 256 bytes of config space for the virtual host bridge is emulated.

I/O - Pass-thru devices
=======================

ACRN, in partition mode, supports passing through PCI devices on the
platform. All the pass-thru devices are exposed as child devices under
the virtual host bridge. ACRN does not support passing through
bridges or emulating virtual bridges. Pass-thru devices should be
statically allocated to each guest using the guest configuration. ACRN
expects the developer to provide the mapping from virtual BDF to the BDF
of the physical device for all the pass-thru devices as part of each
guest configuration.

Run-time ACRN support for guests
********************************

ACRN, in partition mode, supports an option to pass through the LAPIC of
the physical CPUs to the guest. ACRN expects developers to specify whether
the guest needs LAPIC pass-thru using the guest configuration. When the
guest configures the vLAPIC as x2APIC, and the guest configuration has
LAPIC pass-thru enabled, ACRN passes the LAPIC through to the guest. The
guest can then access the LAPIC hardware directly without hypervisor
interception. During runtime of the guest, this option determines how ACRN
supports inter-processor interrupt handling and device interrupt handling.
This is discussed in detail in the corresponding sections below.

.. figure:: images/partition-image16.png
   :align: center


Guest SMP boot flow
===================

The core APIC IDs are reported to the guest using the mptable info. The SMP
boot flow is similar to sharing mode. Refer to :ref:`vm-startup` for the
guest SMP boot flow in ACRN. Partition mode guest startup is the same as
SOS startup in sharing mode.

Inter-processor Interrupt (IPI) Handling
========================================

Guests w/o LAPIC pass-thru
--------------------------

For guests without LAPIC pass-thru, IPIs between guest CPUs are handled in
the same way as in sharing mode of ACRN. Refer to :ref:`virtual-interrupt-hld`
for more details.

Guests w/ LAPIC pass-thru
-------------------------

ACRN supports LAPIC pass-thru if and only if the guest is using x2APIC mode
for the vLAPIC. In LAPIC pass-thru mode, writes to the Interrupt Command
Register (ICR) x2APIC MSR are intercepted. The guest writes the IPI info,
including the vector and destination APIC IDs, to the ICR. Upon an IPI
request from the guest, ACRN does a sanity check on the destination
processors programmed into the ICR. If the destination is a valid target
for the guest, ACRN sends an IPI with the same vector from the ICR to the
physical CPUs corresponding to the destination processor info in the ICR.

.. figure:: images/partition-image14.png
   :align: center

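
A minimal sketch of that ICR interception path is shown below. The function
and field names are hypothetical and stand in for ACRN's actual MSR write
handler, which is not reproduced here; destination shorthands and broadcast
are ignored for brevity.

.. code-block:: c

   /* Hedged sketch: validate and forward a guest x2APIC ICR write when
    * LAPIC pass-thru is enabled. All names are placeholders. */
   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>

   #define NUM_GUEST_CPUS 4U

   /* Example mapping from guest (virtual) APIC ID to physical APIC ID. */
   static const uint32_t guest_to_phys_apic_id[NUM_GUEST_CPUS] = { 8, 9, 10, 11 };

   static void send_physical_ipi(uint32_t phys_apic_id, uint8_t vector)
   {
       printf("IPI vector 0x%02x -> physical APIC ID %u\n", vector, phys_apic_id);
   }

   /* Called when a write to the x2APIC ICR MSR (0x830) traps into the HV. */
   static bool handle_guest_icr_write(uint64_t icr_value)
   {
       uint8_t  vector  = (uint8_t)(icr_value & 0xFFU);
       uint32_t dest_id = (uint32_t)(icr_value >> 32);  /* x2APIC destination */

       /* Sanity check: the destination must be a CPU of this guest. */
       if (dest_id >= NUM_GUEST_CPUS)
           return false;                                /* drop invalid IPI */

       send_physical_ipi(guest_to_phys_apic_id[dest_id], vector);
       return true;
   }

   int main(void)
   {
       handle_guest_icr_write(((uint64_t)2 << 32) | 0xE0);  /* valid destination */
       handle_guest_icr_write(((uint64_t)7 << 32) | 0xE0);  /* rejected */
       return 0;
   }
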
Pass-thru device support
========================

Configuration space access
--------------------------

ACRN emulates the Configuration Space Address (0xcf8) I/O port and the
Configuration Space Data (0xcfc) I/O port for guests to access the PCI
devices' configuration space. Within the config space of a device, the
Base Address Registers (BARs), at offsets 0x10 through 0x24, provide the
information about the resources (I/O and MMIO) used by the PCI device.
ACRN virtualizes the BAR registers and, for the rest of the config space,
forwards reads and writes to the physical config space of the pass-thru
device. Refer to the `I/O`_ section below for more details.

.. figure:: images/partition-image1.png
   :align: center

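
As a rough illustration of the 0xcf8/0xcfc mechanism, the sketch below
decodes a config-address write and dispatches the following data access.
It is a simplified example with hypothetical handling, not ACRN's emulation
code; the example virtual BAR value and the pass-thru stand-in are made up.

.. code-block:: c

   /* Hedged sketch of 0xcf8/0xcfc PCI config-access decoding. The BDF and
    * offset packing follows the standard PCI config-address format. */
   #include <stdint.h>
   #include <stdio.h>

   static uint32_t config_address;     /* last value written to port 0xcf8 */

   static void pio_write_cf8(uint32_t val)
   {
       config_address = val;
   }

   static uint32_t pio_read_cfc(void)
   {
       if ((config_address & 0x80000000U) == 0U)
           return 0xFFFFFFFFU;                       /* enable bit not set */

       uint8_t bus    = (config_address >> 16) & 0xFFU;
       uint8_t dev    = (config_address >> 11) & 0x1FU;
       uint8_t func   = (config_address >>  8) & 0x07U;
       uint8_t offset =  config_address        & 0xFCU;

       printf("config read %02x:%02x.%x offset 0x%02x\n", bus, dev, func, offset);

       /* BAR range (0x10-0x24): return the virtualized BAR value;
        * everything else: forward to the physical device (placeholder). */
       if (offset >= 0x10U && offset <= 0x24U)
           return 0xC0000000U;                       /* example virtual BAR */
       return 0xFFFFFFFFU;                           /* stand-in for pass-thru */
   }

   int main(void)
   {
       pio_write_cf8(0x80000810U);  /* bus 0, dev 1, func 0, offset 0x10 (BAR0) */
       printf("value: 0x%08x\n", pio_read_cfc());
       return 0;
   }
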
DMA
---

ACRN developers need to statically define the pass-thru devices for each
guest using the guest configuration. For devices to DMA to/from guest
memory directly, ACRN parses the list of pass-thru devices for each
guest and creates context entries in the VT-d remapping hardware. The EPT
page tables created for the guest are used as the VT-d page tables.

I/O
---

ACRN supports I/O for pass-thru devices with two restrictions:

1) Only MMIO is supported, so developers must expose I/O BARs as
   not present in the guest configuration.

2) Only the 32-bit MMIO BAR type is supported.

As the guest PCI sub-system scans the PCI bus and assigns a Guest Physical
Address (GPA) to the MMIO BAR, ACRN maps the GPA to the address in the
physical BAR of the pass-thru device using EPT. The following timeline
chart explains how PCI devices are assigned to a guest and how BARs are
mapped upon guest initialization.

.. figure:: images/partition-image13.png
   :align: center


Interrupt Configuration
-----------------------

ACRN supports both legacy (INTx) and MSI interrupts for pass-thru
devices.

INTx support
~~~~~~~~~~~~

ACRN expects developers to identify the interrupt line info (config space
offset 0x3C) of the pass-thru device and build an interrupt entry in the
mptable for the corresponding guest. As the guest configures the vIOAPIC
for the interrupt RTE, ACRN writes the info from the guest RTE into the
physical IOAPIC RTE. When the guest kernel masks the RTE in the vIOAPIC,
ACRN writes to the physical RTE to mask the interrupt at the physical
IOAPIC. Level-triggered interrupts are not supported.

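
The sketch below illustrates that RTE propagation in simplified form. The
helpers for writing physical IOAPIC RTEs are placeholders, and the
virtual-to-physical pin mapping is assumed to be identity for the example.

.. code-block:: c

   /* Hedged sketch: propagate a guest vIOAPIC RTE write (including the
    * mask bit) to the physical IOAPIC. Helper names are placeholders. */
   #include <stdint.h>
   #include <stdio.h>

   #define RTE_MASK_BIT (1ULL << 16)

   /* Stand-in for a physical IOAPIC redirection table. */
   static uint64_t phys_rte[24];

   static void phys_ioapic_write_rte(uint32_t pin, uint64_t rte)
   {
       phys_rte[pin] = rte;
       printf("physical IOAPIC pin %u: RTE=0x%016llx (%s)\n", pin,
              (unsigned long long)rte,
              (rte & RTE_MASK_BIT) ? "masked" : "unmasked");
   }

   /* Called when the guest writes an RTE in its vIOAPIC. The example
    * assumes a 1:1 mapping from virtual pin to physical pin. */
   static void vioapic_rte_updated(uint32_t virt_pin, uint64_t guest_rte)
   {
       uint32_t phys_pin = virt_pin;        /* identity mapping (assumption) */

       /* Copy vector/trigger/polarity and honor the guest's mask bit. */
       phys_ioapic_write_rte(phys_pin, guest_rte);
   }

   int main(void)
   {
       vioapic_rte_updated(5, 0x31ULL);                 /* unmask, vector 0x31 */
       vioapic_rte_updated(5, 0x31ULL | RTE_MASK_BIT);  /* mask                */
       return 0;
   }
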
MSI support
~~~~~~~~~~~

The guest reads and writes the PCI configuration space to configure MSI
interrupts; the MSI address, data, and control registers are passed
through to the physical pass-thru device. Refer to
`Configuration space access`_ for details on how the PCI configuration
space is emulated.

Virtual device support
======================

ACRN provides read-only vRTC support for partition mode guests. Writes
to the data port are discarded.

For port I/O to ports other than the vPIC, vRTC, or vUART, reads return
0xFF and writes are discarded.

Interrupt delivery
==================

Guests w/o LAPIC pass-thru
--------------------------

In partition mode of ACRN, interrupts stay disabled after a vmexit. The
processor does not take interrupts while it is executing in VMX root
mode. ACRN configures the processor to take a vmexit upon an external
interrupt while the processor is executing in VMX non-root mode. Upon an
external interrupt, after sending an EOI to the physical LAPIC, ACRN
injects the vector into the vLAPIC of the vCPU currently running on the
processor. Guests using Linux as the kernel use vectors less than 0xEC
for device interrupts.

.. figure:: images/partition-image20.png
   :align: center


Guests w/ LAPIC pass-thru
-------------------------

For guests with LAPIC pass-thru, ACRN does not configure a vmexit upon
external interrupts. There is no vmexit upon device interrupts, and they
are handled by the guest IDT.

Hypervisor IPI service
======================

ACRN needs IPIs for events such as flushing TLBs across CPUs, sending
virtual device interrupts (e.g., vUART to vCPUs), and others.

Guests w/o LAPIC pass-thru
--------------------------

Hypervisor IPIs work the same way as in sharing mode.

Guests w/ LAPIC pass-thru
-------------------------

Since external interrupts are passed through to the guest IDT, IPIs do not
trigger a vmexit. Instead, ACRN uses NMI delivery mode and enables NMI
exiting for the vCPUs. When the NMI arrives at the target processor, if
the processor is in non-root mode, a vmexit happens on that processor and
the event mask is checked to service the pending events.

Debug Console
=============

For details on how the hypervisor console works, refer to
:ref:`hv-console`.

For a guest console in partition mode, ACRN provides an option to pass
``vmid`` as an argument to ``vm_console``. The vmid is the same as the one
the developer uses in the guest configuration.

Guests w/o LAPIC pass-thru
--------------------------

Works the same way as in sharing mode.

Hypervisor Console
==================

ACRN uses the TSC deadline timer to provide timer services. The hypervisor
console uses a timer on CPU0 to poll characters on the serial device. To
support LAPIC pass-thru, the TSC deadline MSR is passed through and the
local timer interrupt is also delivered to the guest IDT. Instead of the
TSC deadline timer, ACRN then uses the VMX preemption timer to poll the
serial device.

Guest Console
=============

ACRN exposes a vUART to partition mode guests. The vUART uses the vPIC to
inject interrupts to the guest BSP. When a guest has more than one core,
during runtime the vUART might need to inject an interrupt to the guest
BSP from another core (other than the BSP). As mentioned in
`Hypervisor IPI service`_, ACRN uses NMI delivery mode to notify the CPU
running the BSP of the guest.

.. _pm_hld:

Power Management
################

System PM module
****************

The PM module in the hypervisor does three things:

- When all UOSes enter a low-power state, VM management notifies the SOS
  lifecycle service and triggers the SOS to enter a low-power state.
  The SOS follows its own standard low-power state entry process and
  writes the ACPI control register to put itself into a low-power state.
  The hypervisor traps the ACPI control register write and emulates the
  SOS low-power state entry.

- Once SOS low-power emulation is done, the hypervisor handles its
  own low-power state transition.

- Once the system resumes from low-power mode, the hypervisor handles its
  own resume and emulates SOS resume too.

It is assumed that the SOS does not trigger any power state transition
until the VM manager of ACRN notifies it that all UOSes are inactive and
the SOS has offlined all its virtual APs.

:numref:`pm-low-power-transition` shows the SOS/hypervisor low-power
state transition process. The SOS triggers the power state transition by
writing the ACPI control register on its virtual BSP (which is pinned to
the physical BSP). The hypervisor then does the following in sequence
before it writes to the physical ACPI control register to trigger the
physical power state transition:

- Pause the SOS.
- Offline all physical APs.
- Save the context of the console, the SOS ioapic, the I/O MMU, the SOS
  lapic, and the virtual BSP.
- Save the context of the physical BSP.

When exiting from low-power mode, the hypervisor does similar steps in
reverse order to restore contexts, start the APs, and resume the SOS. The
SOS is responsible for starting its own virtual APs as well as the UOSes.

.. figure:: images/pm-image24-105.png
   :align: center
   :name: pm-low-power-transition

   SOS/Hypervisor low power state transition process

.. _hv-startup:

Hypervisor Startup
##################

This section is an overview of ACRN hypervisor startup. The ACRN
hypervisor compiles to a 32-bit multiboot-compliant ELF file.
The bootloader (ABL or SBL) loads the hypervisor according to the
addresses specified in the ELF header. The BSP starts the hypervisor
with an initial state compliant with the multiboot 1 specification, after
the bootloader prepares the full configuration, including ACPI, E820, etc.

The HV startup has two parts: the native startup, followed by
VM startup.

Native Startup
**************

.. figure:: images/hld-image107.png
   :align: center
   :name: hvstart-nativeflow

   Hypervisor Native Startup Flow

Native startup sets up a baseline environment for the HV, including basic
memory and interrupt initialization as shown in
:numref:`hvstart-nativeflow`. Here is a short description of the flow:

- **BSP Startup:** The starting point for the bootstrap processor.

- **Relocation:** Relocate the hypervisor image if the hypervisor image
  is not placed at the assumed base address.

- **UART Init:** Initialize a pre-configured UART device used
  as the base physical console for the HV and Service OS.

- **Shell Init:** Start a command shell for the HV accessible via the UART.

- **Memory Init:** Initialize memory type and cache policy, and create
  the MMU page table mapping for the HV.

- **Interrupt Init:** Initialize interrupts and exceptions for the native
  HV, including the IDT and the ``do_IRQ`` infrastructure; a timer
  interrupt framework is then built. The native/physical interrupts go
  through this ``do_IRQ`` infrastructure and are then distributed to
  specific targets (HV or VMs).

- **Start AP:** The BSP kicks off the ``INIT-SIPI-SIPI`` IPI sequence to
  start the other native APs (application processors). Each AP initializes
  its own memory and interrupts, notifies the BSP on completion, and
  enters the default idle loop.

Symbols in the hypervisor are placed with an assumed base address, but
the bootloader may not place the hypervisor at that specified base. In
such a case the hypervisor will relocate itself to where the bootloader
loads it.

Here is a summary of the CPU and memory initial states that are set up
after native startup.

CPU
   The ACRN hypervisor brings all physical processors to 64-bit IA32e
   mode, with the assumption that the BSP starts in protected mode where
   segmentation and paging set an identity mapping of the first 4G of
   addresses without permission restrictions. The control registers and
   some MSRs are set as follows:

   - cr0: The following features are enabled: paging, write protection,
     protected mode, numeric error, and co-processor monitoring.

   - cr3: refer to the initial state of memory.

   - cr4: The following features are enabled: physical address extension,
     machine-check, FXSAVE/FXRSTOR, SMEP, VMX operation, and unmasked
     SIMD FP exceptions. The other features are disabled.

   - MSR_IA32_EFER: only IA32e mode is enabled.

   - MSR_IA32_FS_BASE: the address of the stack canary, used for detecting
     stack smashing.

   - MSR_IA32_TSC_AUX: a unique logical ID is set for each physical
     processor.

   - stack: each physical processor has a separate stack.

Memory
   All physical processors are in 64-bit IA32e mode after
   startup. The GDT holds four entries: one unused, one for code and
   another for data, both of which have a base of all 0's and a limit of
   all 1's, and the last one for the 64-bit TSS. The TSS only holds three
   stack pointers (for machine-check, double fault, and stack fault) in
   the interrupt stack table (IST), which are different across physical
   processors. The LDT is disabled.

Refer to section 3.5.2 for a detailed description of interrupt-related
initial states, including the IDT and physical PICs.

After the BSP detects that all APs are up, the BSP starts creating the
first VM, i.e., the SOS, as explained in the next section.

.. _vm-startup:

VM Startup
**********

The SOS is created and launched on the physical BSP after the hypervisor
initializes itself. Meanwhile, the APs enter the default idle loop
(refer to :ref:`VCPU_lifecycle` for details), waiting for any vCPU to be
scheduled to them.

:numref:`hvstart-vmflow` illustrates the high-level execution flow of
creating and launching a VM, applicable to both SOS and UOS. One major
difference in the creation of SOS and UOS is that the SOS is created by
the hypervisor, while the creation of UOSes is triggered by the DM in the
SOS. The main steps include:

- **Create VM:** A VM structure is allocated and initialized. A unique
  VM ID is picked, the EPT is created, the I/O bitmap is set up, I/O
  emulation handlers are initialized and registered, and the virtual
  CPUID entries are filled. For the SOS, an additional E820 table is
  prepared.

- **Create vCPUs:** Create the vCPUs, assign the physical processor each
  is pinned to, a per-VM unique vCPU ID, and a globally unique VPID,
  and initialize their virtual LAPIC and MTRR. For the SOS, one vCPU is
  created for each physical CPU on the platform. For a UOS, the DM
  determines the number of vCPUs to be created.

- **SW Load:** The BSP of a VM also prepares each VM's SW
  configuration, including the kernel entry address, ramdisk address,
  bootargs, zero page, etc. This is done by the hypervisor for the SOS,
  and by the DM for a UOS.

- **Schedule vCPUs:** The vCPUs are scheduled to the corresponding
  physical processors for execution.

- **Init VMCS:** Initialize each vCPU's VMCS for its host state, guest
  state, execution controls, entry controls, and exit controls. It's
  the last configuration before the vCPU runs.

- **vCPU thread:** The vCPU kicks off and runs. The "Primary CPU" starts
  running the kernel image that SW Load configured; a "Non-Primary CPU"
  waits for the INIT-SIPI-SIPI IPI sequence triggered from its
  "Primary CPU".

.. figure:: images/hld-image104.png
   :align: center
   :name: hvstart-vmflow

   Hypervisor VM Startup Flow

SW configuration for the Service OS (SOS_VM):

- **ACPI:** The HV passes the entire ACPI table from the bootloader to
  the Service OS directly. Legacy mode is currently supported, as the
  ACPI table is loaded at the F-segment.

- **E820:** The HV passes the E820 table from the bootloader through the
  multi-boot information after the HV reserved memory (32M, for example)
  is filtered out.

- **Zero Page:** The HV prepares the zero page at the high end of Service
  OS memory, which is determined by the SOS_VM guest FIT binary build.
  The zero page includes the configuration for ramdisk, bootargs, and
  E820 entries. The zero page address is set in the "Primary CPU" RSI
  register before the vCPU is run.

- **Entry address:** The HV copies the Service OS kernel image to
  0x1000000 as the entry address for SOS_VM's "Primary CPU". This entry
  address is set in the "Primary CPU" RIP register before the vCPU is run.

SW configuration for a User OS (VMx):

- **ACPI:** The virtual ACPI table is built by the DM and put at VMx's
  F-segment. Refer to :ref:`hld-io-emulation` for details.

- **E820:** The virtual E820 table is built by the DM and then passed to
  the zero page. Refer to :ref:`hld-io-emulation` for details.

- **Zero Page:** The DM prepares the zero page at the location
  "lowmem_top - 4K" in VMx. This location is set in VMx's
  "Primary CPU" RSI register in **SW Load**.

- **Entry address:** The DM copies the User OS kernel image to 0x1000000
  as the entry address for VMx's "Primary CPU". This entry address is set
  in the "Primary CPU" RIP register before the vCPU is run (see the
  sketch below).

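
The following is a hedged sketch of the final step of SW Load, where the
entry point and zero page address are placed into the "Primary CPU"
registers. The structure and field names are illustrative placeholders,
not ACRN's actual definitions.

.. code-block:: c

   /* Hedged sketch of the last step of SW Load: point the "Primary CPU"
    * at the kernel entry and the zero page. Names are placeholders. */
   #include <stdint.h>
   #include <stdio.h>

   struct vcpu_regs { uint64_t rip; uint64_t rsi; };

   struct sw_load_cfg {
       uint64_t kernel_entry_gpa;  /* e.g., 0x1000000 for the copied kernel */
       uint64_t zero_page_gpa;     /* e.g., lowmem_top - 4K for a UOS       */
   };

   static void sw_load_finish(struct vcpu_regs *bsp, const struct sw_load_cfg *cfg)
   {
       bsp->rip = cfg->kernel_entry_gpa;  /* where the guest starts executing    */
       bsp->rsi = cfg->zero_page_gpa;     /* Linux boot protocol: RSI -> zero page */
   }

   int main(void)
   {
       struct vcpu_regs bsp = { 0, 0 };
       const struct sw_load_cfg uos = { 0x1000000ULL, 0x7FFFF000ULL };

       sw_load_finish(&bsp, &uos);
       printf("RIP=0x%llx RSI=0x%llx\n",
              (unsigned long long)bsp.rip, (unsigned long long)bsp.rsi);
       return 0;
   }
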
Here is the initial mode of the vCPUs:

+------------------------------+-------------------------------+
| VM and Processor Type        | Initial Mode                  |
+=============+================+===============================+
| SOS         | BSP            | Same as physical BSP          |
|             +----------------+-------------------------------+
|             | AP             | Real Mode                     |
+-------------+----------------+-------------------------------+
| UOS         | BSP            | Real Mode                     |
|             +----------------+-------------------------------+
|             | AP             | Real Mode                     |
+-------------+----------------+-------------------------------+

Note that the SOS is started with the same number of vCPUs as there are
physical CPUs to speed up boot. The SOS offlines the APs right before it
starts any UOS.

.. _timer-hld:

Timer
#####

Because ACRN is a flexible, lightweight reference hypervisor, it provides
only limited timer management services:

- Only the LAPIC tsc-deadline timer is supported as the clock source.

- A timer can only be added on the logical CPU for a process or thread.
  Timer scheduling and timer migration are not supported.

How it works
************

When the system boots, we check that the hardware supports the LAPIC
tsc-deadline timer by checking CPUID.01H:ECX.TSC_Deadline[bit 24]. If
support is missing, we output an error message and panic the hypervisor.
If it is supported, we register the timer interrupt callback that raises a
timer softirq on each logical CPU and set the LAPIC timer mode to
tsc-deadline timer mode by writing the local APIC LVT register.

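
A condensed sketch of that detection and mode-setup logic is shown below.
The CPUID bit, the x2APIC LVT timer MSR, and the TSC-deadline mode encoding
are architectural; the surrounding helper names and the vector choice are
placeholders and do not correspond to ACRN's source.

.. code-block:: c

   /* Hedged sketch: detect the TSC-deadline timer and switch the local
    * APIC timer into TSC-deadline mode. Uses the GCC/Clang cpuid helper. */
   #include <cpuid.h>
   #include <stdint.h>
   #include <stdio.h>
   #include <stdlib.h>

   #define MSR_IA32_EXT_APIC_LVT_TIMER 0x832U   /* x2APIC LVT timer register */
   #define LVT_TIMER_MODE_TSC_DEADLINE (2U << 17)
   #define TIMER_VECTOR                0xEFU    /* example vector choice */

   static void wrmsr(uint32_t msr, uint64_t val)
   {
       /* In a hypervisor this would be a real WRMSR; here it only logs. */
       printf("wrmsr(0x%x, 0x%llx)\n", msr, (unsigned long long)val);
   }

   static void timer_hw_init(void)
   {
       unsigned int eax, ebx, ecx, edx;

       __cpuid(1, eax, ebx, ecx, edx);
       if ((ecx & (1U << 24)) == 0U) {          /* CPUID.01H:ECX bit 24 */
           fprintf(stderr, "TSC-deadline timer not supported, panic\n");
           abort();
       }

       /* Program the LVT timer entry: TSC-deadline mode + timer vector. */
       wrmsr(MSR_IA32_EXT_APIC_LVT_TIMER,
             LVT_TIMER_MODE_TSC_DEADLINE | TIMER_VECTOR);
   }

   int main(void)
   {
       timer_hw_init();
       return 0;
   }
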
Data Structures and APIs
************************

Interfaces Design
=================

.. doxygenfunction:: initialize_timer
   :project: Project ACRN

.. doxygenfunction:: timer_expired
   :project: Project ACRN

.. doxygenfunction:: add_timer
   :project: Project ACRN

.. doxygenfunction:: del_timer
   :project: Project ACRN

.. doxygenfunction:: timer_init
   :project: Project ACRN

.. doxygenfunction:: calibrate_tsc
   :project: Project ACRN

.. doxygenfunction:: us_to_ticks
   :project: Project ACRN

.. doxygenfunction:: ticks_to_us
   :project: Project ACRN

.. doxygenfunction:: ticks_to_ms
   :project: Project ACRN

.. doxygenfunction:: rdtsc
   :project: Project ACRN

.. doxygenfunction:: get_tsc_khz
   :project: Project ACRN

.. _virtual-interrupt-hld:

Virtual Interrupt
#################

This section introduces ACRN guest virtual interrupt
management, which includes:

- vCPU requests for virtual interrupt kick-off,
- vPIC/vIOAPIC/vLAPIC virtual interrupt injection interfaces,
- physical-to-virtual interrupt mapping for pass-thru devices, and
- the process of VMX interrupt/exception injection.

A guest VM never owns any physical interrupts. All interrupts received by
a guest OS come from a virtual interrupt injected by the vLAPIC, vIOAPIC,
or vPIC. Such virtual interrupts are triggered either from a pass-through
device or from I/O mediators in the SOS via hypercalls. Section 3.8.6
introduces how the hypervisor manages the mapping between physical and
virtual interrupts for pass-through devices.

Emulation for devices is inside the SOS user-space device model, i.e.,
acrn-dm. However, for performance considerations, the vLAPIC, vIOAPIC,
and vPIC are emulated inside the HV directly.

From the guest OS point of view, the vPIC uses Virtual Wire Mode via the
vIOAPIC. The symmetric I/O mode is shown in
:numref:`pending-virt-interrupt` later in this section.

The following command line options to a guest Linux kernel affect whether
it uses the PIC or the IOAPIC:

- **Kernel boot parameter with vPIC**: add ``maxcpus=0``; the guest OS
  will use the PIC.
- **Kernel boot parameter with vIOAPIC**: add ``maxcpus=1`` (or any
  non-zero value); the guest OS will use the IOAPIC and keep IOAPIC pin 2
  as the source of the PIC.

vCPU Request for Interrupt Injection
************************************

The vCPU request mechanism (described in :ref:`pending-request-handlers`)
is leveraged to inject interrupts to a certain vCPU. As mentioned in
:ref:`ipi-management`, physical vector 0xF0 is used to kick the vCPU out
of VMX non-root mode, and is used to make a request for virtual interrupt
injection or other requests such as flushing the EPT.

The event IDs supported for virtual interrupt injection include:

.. doxygengroup:: virt_int_injection
   :project: Project ACRN
   :content-only:


The *vcpu_make_request* call is necessary for virtual interrupt
injection. If the target vCPU is running in VMX non-root mode, it sends
an IPI to kick it out, which leads to an external-interrupt VM-Exit. In
some cases there is no need to send an IPI when making a request, because
the CPU making the request is itself the target vCPU. For example, a #GP
exception request always happens on the current CPU when it finds that an
invalid emulation has happened. An external interrupt for a pass-thru
device always happens on the vCPUs of the VM this device belongs to, so
after it triggers an external-interrupt VM-Exit, the current CPU is also
the target vCPU.

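
The request-then-kick pattern can be summarized in a few lines. This is a
simplified, hypothetical model of the mechanism (an atomic pending-events
bitmap plus a conditional kick IPI); it does not reproduce ACRN's actual
``vcpu_make_request`` implementation or signatures.

.. code-block:: c

   /* Hedged sketch of the "set pending event, then kick" pattern used
    * for virtual interrupt injection requests. Names are placeholders. */
   #include <stdatomic.h>
   #include <stdbool.h>
   #include <stdio.h>

   #define REQUEST_EVENT  0     /* example id: inject pending vLAPIC interrupt */
   #define KICK_VECTOR    0xF0

   struct vcpu {
       int pcpu_id;                    /* physical CPU this vCPU is pinned to */
       bool in_non_root_mode;          /* currently executing guest code?     */
       atomic_ulong pending_requests;  /* bitmap of pending request ids       */
   };

   static void send_kick_ipi(int pcpu_id)
   {
       printf("send vector 0x%x IPI to pcpu %d\n", KICK_VECTOR, pcpu_id);
   }

   static void make_request(struct vcpu *v, int request_id, int current_pcpu)
   {
       atomic_fetch_or(&v->pending_requests, 1UL << request_id);

       /* Kick only if the target vCPU is running guest code on another CPU;
        * the request is then handled on the resulting VM-Exit path. */
       if (v->in_non_root_mode && v->pcpu_id != current_pcpu)
           send_kick_ipi(v->pcpu_id);
   }

   int main(void)
   {
       struct vcpu target = { .pcpu_id = 2, .in_non_root_mode = true };
       atomic_init(&target.pending_requests, 0);
       make_request(&target, REQUEST_EVENT, /*current_pcpu=*/0);
       return 0;
   }
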
Virtual LAPIC
*************

The LAPIC is virtualized for all guest types: SOS and UOS. Given support
by the physical processor, APICv Virtual Interrupt Delivery (VID) is
enabled and supports the posted-interrupt feature. Otherwise, it falls
back to the legacy virtual interrupt injection mode.

The vLAPIC provides the same features as the native LAPIC:

- Vector mask/unmask
- Virtual vector injection (level or edge trigger mode) to a vCPU
- vIOAPIC notification of EOI processing
- TSC timer service
- vLAPIC support of CR8 to update the TPR
- INIT/STARTUP handling

vLAPIC APIs
===========

APIs are provided for when an interrupt source from the vLAPIC needs to
inject an interrupt, for example:

- from an LVT, like the LAPIC timer
- from the vIOAPIC, for a pass-thru device interrupt
- from an emulated device, for an MSI

These APIs finish by making a request for *ACRN_REQUEST_EVENT*.

.. doxygenfunction:: vlapic_set_local_intr
   :project: Project ACRN

.. doxygenfunction:: vlapic_intr_msi
   :project: Project ACRN

.. doxygenfunction:: apicv_get_pir_desc_paddr
   :project: Project ACRN

EOI processing
==============

EOI virtualization is enabled if APICv virtual interrupt delivery is
supported. Except for level-triggered interrupts, the VM will not exit in
case of an EOI.

In case of no APICv virtual interrupt delivery support, the vLAPIC
requires an EOI from the guest OS whenever a vector is acknowledged and
processed by the guest. The vLAPIC behavior is the same as the HW LAPIC.
Once an EOI is received, it clears the highest-priority vector in the ISR
and TMR, and updates the PPR status. The vLAPIC then notifies the vIOAPIC
if the corresponding vector comes from the vIOAPIC. This only occurs for
level-triggered interrupts.

LAPIC passthrough based on vLAPIC
=================================

LAPIC passthrough is supported based on the vLAPIC, after the switch to
x2APIC mode. In the case of LAPIC passthrough based on vLAPIC, the system
has the following characteristics.

* IRQs received by the LAPIC can be handled by the guest VM without a
  ``vmexit``
* The guest VM always sees virtual LAPIC IDs, for security reasons
* Most MSRs are directly accessible from the guest VM except for
  ``XAPICID``, ``LDR``, and ``ICR``. Write operations to ``ICR`` are
  trapped to avoid malicious IPIs. Read operations to ``XAPICID`` and
  ``LDR`` are trapped in order to make the guest VM always see the virtual
  LAPIC IDs instead of the physical ones.

Virtual IOAPIC
**************

The vIOAPIC is emulated by the HV when the guest accesses the MMIO GPA
range 0xFEC00000-0xFEC01000. The vIOAPIC for the SOS matches the native HW
IOAPIC pin numbers. The vIOAPIC for a UOS provides 48 pins. As the vIOAPIC
is always associated with a vLAPIC, the virtual interrupt injection from
the vIOAPIC finally triggers a request for a vLAPIC event by calling
vLAPIC APIs.

**Supported APIs:**

.. doxygenfunction:: vioapic_set_irqline_lock
   :project: Project ACRN

.. doxygenfunction:: vioapic_set_irqline_nolock
   :project: Project ACRN

Virtual PIC
***********

A vPIC is required for TSC calculation. Normally a UOS boots with the
vIOAPIC and vPIC as the sources of external interrupts to the guest. On
every VM Exit, the HV checks whether there are any pending external PIC
interrupts. vPIC API usage is similar to the vIOAPIC.

The ACRN hypervisor emulates a vPIC for each VM based on the I/O ranges
0x20-0x21, 0xa0-0xa1, and 0x4d0-0x4d1.

If an interrupt source from the vPIC needs to inject an interrupt, the
following API needs to be called, which finally makes a request for
*ACRN_REQUEST_EXTINT* or *ACRN_REQUEST_EVENT*:

.. doxygenfunction:: vpic_set_irqline
   :project: Project ACRN

The following APIs are used to query the vector to be injected and to ACK
the service (that is, to move the interrupt from requesting service (IRR)
to in service (ISR)):

.. doxygenfunction:: vpic_pending_intr
   :project: Project ACRN

.. doxygenfunction:: vpic_intr_accepted
   :project: Project ACRN

Virtual Exception
*****************

When doing emulation, an exception may need to be triggered in the
hypervisor, for example:

- if the guest accesses an invalid vMSR register, the hypervisor needs to
  inject a #GP, or
- during instruction emulation, an instruction fetch may access a
  non-existent page from rip_gva; at that time a #PF needs to be injected.

The ACRN hypervisor implements virtual exception injection using these
APIs:

.. doxygenfunction:: vcpu_queue_exception
   :project: Project ACRN

.. doxygenfunction:: vcpu_inject_extint
   :project: Project ACRN

.. doxygenfunction:: vcpu_inject_nmi
   :project: Project ACRN

.. doxygenfunction:: vcpu_inject_gp
   :project: Project ACRN

.. doxygenfunction:: vcpu_inject_pf
   :project: Project ACRN

.. doxygenfunction:: vcpu_inject_ud
   :project: Project ACRN

.. doxygenfunction:: vcpu_inject_ss
   :project: Project ACRN

The ACRN hypervisor uses the *vcpu_inject_gp/vcpu_inject_pf* functions to
queue an exception request, and follows SDM Vol. 3, Section 6.15,
Table 6-5 to generate a double fault if the condition is met.

Virtual Interrupt Injection
***************************

Virtual interrupts come from either the DM or assigned devices.

- **For SOS-assigned devices**: Since all devices are assigned to the SOS
  directly, whenever a device's physical interrupt occurs, the
  corresponding virtual interrupt is injected to the SOS via the
  vLAPIC/vIOAPIC. The SOS does not use the vPIC and does not have emulated
  devices. See section 3.8.5, Device assignment.

- **For UOS-assigned devices**: Only PCI devices can be assigned to a UOS.
  Virtual interrupt injection follows the same path as for the SOS. A
  virtual interrupt injection operation is triggered when a device's
  physical interrupt occurs.

- **For UOS emulated devices**: The DM (acrn-dm) is responsible for the
  interrupt lifecycle management of UOS emulated devices. The DM knows
  when an emulated device needs to assert a virtual IOAPIC/PIC pin or
  needs to send a virtual MSI vector to the guest. This logic is entirely
  handled by the DM.

.. figure:: images/virtint-image64.png
   :align: center
   :name: pending-virt-interrupt

   Handle pending virtual interrupt

Before APICv virtual interrupt delivery, a virtual interrupt can be
injected only if the guest interrupt window is open. There are many cases
in which the guest clears ``RFLAGS.IF`` and will not accept any further
interrupts. The HV checks for an available guest IRQ window before
injection.

NMI is an unmaskable interrupt and its injection is always allowed
regardless of the guest IRQ window status. If the IRQ window is not
currently open, the HV enables
``MSR_IA32_VMX_PROCBASED_CTLS_IRQ_WIN (PROCBASED_CTRL.bit[2])`` and
VM Enters directly. The injection is then done on the next VM Exit, once
the guest issues ``STI (GuestRFLAG.IF=1)``.

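
The interrupt-window check described above can be sketched roughly as
follows. The VMCS field encodings and the interrupt-window control bit are
architectural; the ``vmcs_read``/``vmcs_write`` helpers and the toy VMCS
array are placeholders for illustration only.

.. code-block:: c

   /* Hedged sketch of injecting a virtual interrupt only when the guest
    * interrupt window is open, otherwise requesting interrupt-window
    * exiting. The vmcs_* helpers stand in for real VMREAD/VMWRITE. */
   #include <stdint.h>
   #include <stdio.h>

   #define VMCS_GUEST_RFLAGS        0x6820U
   #define VMCS_PROC_BASED_CTLS     0x4002U
   #define PROC_BASED_IRQ_WIN_EXIT  (1U << 2)
   #define RFLAGS_IF                (1ULL << 9)

   static uint64_t fake_vmcs[0x7000];                 /* toy VMCS storage */
   static uint64_t vmcs_read(uint32_t f)              { return fake_vmcs[f]; }
   static void     vmcs_write(uint32_t f, uint64_t v) { fake_vmcs[f] = v; }

   static void inject_pending_vector(uint8_t vector)
   {
       printf("inject vector 0x%02x via VM-entry interruption info\n", vector);
   }

   static void try_inject(uint8_t vector)
   {
       if (vmcs_read(VMCS_GUEST_RFLAGS) & RFLAGS_IF) {
           inject_pending_vector(vector);             /* window open */
       } else {
           /* Window closed: ask for a VM exit as soon as the guest
            * re-enables interrupts (e.g., executes STI), then retry. */
           vmcs_write(VMCS_PROC_BASED_CTLS,
                      vmcs_read(VMCS_PROC_BASED_CTLS) | PROC_BASED_IRQ_WIN_EXIT);
           printf("window closed: enabled interrupt-window exiting\n");
       }
   }

   int main(void)
   {
       vmcs_write(VMCS_GUEST_RFLAGS, 0);          /* IF = 0 */
       try_inject(0x31);
       vmcs_write(VMCS_GUEST_RFLAGS, RFLAGS_IF);  /* IF = 1 */
       try_inject(0x31);
       return 0;
   }
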
Data structures and interfaces
******************************

No data structures are exported to the other components in the hypervisor
for virtual interrupts. The APIs listed in the previous sections are meant
to be called whenever a virtual interrupt should be injected or
acknowledged.

.. _vt-d-hld:

VT-d
####

VT-d stands for Intel Virtualization Technology for Directed I/O. It
provides hardware capabilities to assign I/O devices to VMs and extends
the protection and isolation properties of VMs to I/O operations.

VT-d provides the following main functions:

- **DMA remapping**: supports address translation for DMA from devices.

- **Interrupt remapping**: supports isolation and routing of interrupts
  from devices and external interrupt controllers to the appropriate VMs.

- **Interrupt posting**: supports direct delivery of virtual interrupts
  from devices and external controllers to virtual processors.

The ACRN hypervisor supports DMA remapping, which provides address
translation capability for PCI pass-through devices, and second-level
translation, which applies to requests-without-PASID. ACRN does not
support first-level / nested translation.

DMAR Engines Discovery
**********************

DMA Remapping Report ACPI table
===============================

For generic platforms, the ACRN hypervisor retrieves DMAR information from
the ACPI table, and parses the DMAR reporting structure to discover the
number of DMA-remapping hardware units present on the platform, as well as
the devices under the scope of each remapping hardware unit, as shown in
:numref:`dma-remap-report`:

.. figure:: images/vt-d-image90.png
   :align: center
   :name: dma-remap-report

   DMA Remapping Reporting Structure

Pre-parsed DMAR information
===========================

For specific platforms, the ACRN hypervisor uses pre-parsed DMA remapping
reporting information directly to save time during hypervisor boot-up.

DMA remapping unit for integrated graphics device
=================================================

Generally, there is a dedicated remapping hardware unit for the Intel
integrated graphics device. ACRN implements GVT-g for graphics, but GVT-g
is not compatible with VT-d. The remapping hardware unit for the graphics
device is disabled on ACRN if GVT-g is enabled. If the graphics device
needs to be passed through to a VM, then the remapping hardware unit must
be enabled.

DMA Remapping
*************

DMA remapping hardware is used to isolate device access to memory,
enabling each device in the system to be assigned to a specific domain
through a distinct set of paging structures.

Domains
=======

A domain is abstractly defined as an isolated environment in the platform,
to which a subset of the host physical memory is allocated. The memory
resources of a domain are specified by its address translation tables.

Device to Domain Mapping Structure
==================================

VT-d hardware uses a root table and context tables to build the mapping
between devices and domains, as shown in :numref:`vt-d-mapping`.

.. figure:: images/vt-d-image44.png
   :align: center
   :name: vt-d-mapping

   Device to Domain Mapping structures

The root table is 4 KB in size and contains 256 root entries to cover the
PCI bus number space (0-255). Each root entry contains a context-table
pointer that references the context table for devices on the bus
identified by the root entry, if the present flag of the root entry is
set.

Each context table contains 256 entries, with each entry corresponding to
a PCI device function on the bus. For a PCI device, the device and
function numbers (8 bits) are used to index into the context table. Each
context entry contains a second-level page-table pointer, which provides
the host physical address of the address translation structure in system
memory to be used for remapping requests-without-PASID processed through
the context entry.

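
In other words, the bus number selects the root entry and the combined
device/function byte selects the context entry. The sketch below shows
that indexing; it is a simplified illustration, not ACRN's table-walking
code.

.. code-block:: c

   /* Hedged sketch of indexing the VT-d root and context tables by BDF. */
   #include <stdint.h>
   #include <stdio.h>

   struct bdf { uint8_t bus; uint8_t dev; uint8_t func; };

   static void locate_context_entry(struct bdf d)
   {
       /* Root table: one entry per bus (0-255). */
       unsigned int root_index = d.bus;

       /* Context table: one entry per device-function on that bus;
        * the 8-bit index is devfn = (device << 3) | function. */
       unsigned int context_index = ((unsigned int)d.dev << 3) | d.func;

       printf("%02x:%02x.%x -> root entry %u, context entry %u\n",
              d.bus, d.dev, d.func, root_index, context_index);
   }

   int main(void)
   {
       struct bdf nic = { 0x02, 0x00, 0x1 };   /* example device 02:00.1 */
       locate_context_entry(nic);
       return 0;
   }
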
For a given Bus, Device, and Function combination, as shown in
:numref:`bdf-passthru`, a pass-through device can be associated with the
address translation structures for a domain.

.. figure:: images/vt-d-image19.png
   :align: center
   :name: bdf-passthru

   BDF Format of Pass-through Device

Refer to the `VT-d spec`_ for more details on the device-to-domain mapping
structures.

.. _VT-d spec:
   https://software.intel.com/sites/default/files/managed/c5/15/vt-directed-io-spec.pdf

Address Translation Structures
==============================

On ACRN, the EPT table of a domain is used as the address translation
structure for the devices assigned to the domain, as shown in
:numref:`vt-d-DMA`.

.. figure:: images/vt-d-image40.png
   :align: center
   :name: vt-d-DMA

   DMA Remapping Diagram

When a device attempts to access system memory, the DMA remapping hardware
intercepts the access, uses the EPT table of the domain to determine
whether the access is allowed, and translates the DMA address according to
the EPT table from guest physical address (GPA) to host physical address
(HPA).

Domains and Memory Isolation
============================

There are no DMA operations inside the hypervisor, so ACRN doesn't create
a domain for the hypervisor. No DMA operations from pass-through devices
can access the hypervisor memory.

ACRN treats each virtual machine (VM) as a separate domain. For a VM,
there is an EPT table for the Normal world, and there may be an EPT table
for the Secure World. The Secure world can access the Normal World's
memory, but the Normal world cannot access the Secure World's memory.

SOS_VM domain
   The SOS_VM domain is created when the hypervisor creates the VM for the
   Service OS.

   The IOMMU uses the EPT table of the Normal world of SOS_VM as the
   address translation structure for the devices in the SOS_VM domain. The
   Normal world's EPT table of SOS_VM doesn't include the memory resources
   of the hypervisor or the Secure worlds, if any. So the devices in the
   SOS_VM domain can't access the memory belonging to the hypervisor or
   Secure worlds.

Other domains
   Other VM domains are created when the hypervisor creates a User OS: one
   domain for each User OS.

   The IOMMU uses the EPT table of the Normal world of a VM as the address
   translation structure for the devices in that domain. The Normal
   world's EPT table of the VM only allows devices to access the memory
   allocated for the Normal world of the VM.

Page-walk coherency
===================

For VT-d hardware that doesn't support page-walk coherency, the hypervisor
needs to make sure the updates of the VT-d tables are synced to memory:

- Device to Domain Mapping structures, including root entries and
  context entries

- The EPT table of a VM.

ACRN flushes the related cache lines after updates of these structures if
the VT-d hardware doesn't support page-walk coherency.

Super-page support
==================

ACRN VT-d reuses the EPT table as the address translation table. The VT-d
capability for super-page support should be identical to the usage of the
EPT table.

Snoop control
=============

If the VT-d hardware supports snoop control, it allows VT-d to ignore the
"no-snoop attribute" in PCIe transactions.

The following table shows the snoop behavior of a DMA operation,
controlled by the combination of:

- The Snoop Control capability of the VT-d DMAR unit
- The setting of the SNP field in the leaf PTE
- The no-snoop attribute in the PCIe request

.. list-table::
   :widths: 25 25 25 25
   :header-rows: 1

   * - SC cap of VT-d
     - SNP field in leaf PTE
     - No-snoop attribute in request
     - Snoop behavior

   * - 0
     - 0 (must be 0)
     - no snoop
     - No snoop

   * - 0
     - 0 (must be 0)
     - snoop
     - Snoop

   * - 1
     - 1
     - snoop / no snoop
     - Snoop

   * - 1
     - 0
     - no snoop
     - No snoop

   * - 1
     - 0
     - snoop
     - Snoop

ACRN enables Snoop Control by default, if all enabled VT-d DMAR units
support Snoop Control, by setting bit 11 of the leaf PTE of the EPT table.
Bit 11 of the leaf PTE of the EPT is ignored by the MMU, so there is no
side effect for the MMU.

If one of the enabled VT-d DMAR units doesn't support Snoop Control, then
bit 11 of the leaf PTE of the EPT is not set, since that field is treated
as reserved (0) by VT-d hardware implementations not supporting Snoop
Control.

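
The policy in the last two paragraphs amounts to a single conditional when
EPT leaf entries are built, roughly as sketched below. The bit position is
the one cited above; the structure and helper names are placeholders, not
ACRN code.

.. code-block:: c

   /* Hedged sketch: set the snoop-control bit (bit 11) in EPT leaf entries
    * only when every enabled DMAR unit reports Snoop Control capability. */
   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>

   #define EPT_SNOOP_CTRL (1ULL << 11)   /* ignored by the MMU, used by VT-d */

   struct dmar_unit { bool enabled; bool snoop_ctrl_cap; };

   static bool all_enabled_units_support_sc(const struct dmar_unit *u, int n)
   {
       for (int i = 0; i < n; i++) {
           if (u[i].enabled && !u[i].snoop_ctrl_cap)
               return false;
       }
       return true;
   }

   static uint64_t make_ept_leaf(uint64_t hpa, uint64_t prot, bool snoop_ok)
   {
       uint64_t entry = hpa | prot;
       if (snoop_ok)
           entry |= EPT_SNOOP_CTRL;
       return entry;
   }

   int main(void)
   {
       const struct dmar_unit units[] = {
           { true, true }, { false, false }, { true, true }
       };
       bool snoop_ok = all_enabled_units_support_sc(units, 3);

       printf("leaf PTE = 0x%llx\n",
              (unsigned long long)make_ept_leaf(0x40000000ULL, 0x7ULL, snoop_ok));
       return 0;
   }
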
Initialization
**************

During hypervisor initialization, ACRN registers the DMAR units on the
platform according to the pre-parsed information or the DMAR table. There
may be multiple DMAR units on the platform; ACRN allows some of the DMAR
units to be ignored. If some DMAR unit(s) are marked as ignored, they will
not be enabled.

The hypervisor creates the SOS_VM domain using the Normal World's EPT
table of SOS_VM as the address translation table when it creates SOS_VM as
the Service OS, and all PCI devices on the platform are added to the
SOS_VM domain. DMAR translation is then enabled for the DMAR unit(s) that
are not marked as ignored.

Device assignment
*****************

All devices are initially added to the SOS_VM domain.
To assign a device means to assign the device to a User OS. The device is
removed from the SOS_VM domain and added to the VM domain related to that
User OS, which changes the address translation table for the device from
the EPT of SOS_VM to the EPT of the User OS.

To unassign a device means to unassign the device from a User OS. The
device is removed from the VM domain related to the User OS, then added
back to the SOS_VM domain, which changes the address translation table for
the device from the EPT of the User OS back to the EPT of SOS_VM.

Power Management support for S3
*******************************

During platform S3 suspend and resume, the VT-d register values will be
lost. ACRN VT-d provides APIs to be called during S3 suspend and resume.

During S3 suspend, some register values are saved in memory, and DMAR
translation is disabled. During S3 resume, the saved register values are
restored, the root table address register is set, and DMAR translation is
enabled.

All the operations for S3 suspend and resume are performed on all DMAR
units on the platform, except for the DMAR units marked as ignored.

Error Handling
**************

ACRN VT-d supports DMA remapping error reporting. ACRN VT-d requests an
IRQ/vector for DMAR error reporting, and a DMAR fault handler is
registered for that IRQ. The DMAR unit supports reporting fault events via
MSI. When a fault event occurs, an MSI is generated, so that the DMAR
fault handler is called to report the error event.

Data structures and interfaces
******************************

initialization and deinitialization
===================================

The following APIs are provided during initialization and
deinitialization:

.. doxygenfunction:: init_iommu
   :project: Project ACRN

runtime
=======

The following APIs are provided during runtime:

.. doxygenfunction:: create_iommu_domain
   :project: Project ACRN

.. doxygenfunction:: destroy_iommu_domain
   :project: Project ACRN

.. doxygenfunction:: suspend_iommu
   :project: Project ACRN

.. doxygenfunction:: resume_iommu
   :project: Project ACRN

.. doxygenfunction:: move_pt_device
   :project: Project ACRN