doc: continue doc restructuring

Changing the folder structure will cause too many broken links for
external references (from other sites). So, let's put the content back
where it was before the reorg, and instead use the new persona-based
navigation to point to documents in the original locations.

Also, introduce redirects for some documents that no longer exist.

Signed-off-by: David B. Kinder <david.b.kinder@intel.com>


@@ -0,0 +1,954 @@
.. _APL_GVT-g-hld:
GVT-g high-level design
#######################
Introduction
************
Purpose of this Document
========================
This high-level design (HLD) document describes the usage requirements
and high level design for Intel |reg| Graphics Virtualization Technology for
shared virtual :term:`GPU` technology (:term:`GVT-g`) on Apollo Lake-I
SoCs.
This document describes:
- The different GPU virtualization techniques
- GVT-g mediated pass-through
- High level design
- Key components
- GVT-g new architecture differentiation
Audience
========
This document is for developers, validation teams, architects and
maintainers of Intel |reg| GVT-g for the Apollo Lake SoCs.
The reader should have some familiarity with the basic concepts of
system virtualization and Intel processor graphics.
Reference Documents
===================
The following documents were used as references for this specification:
- Paper in USENIX ATC '14 - *Full GPU Virtualization Solution with
Mediated Pass-Through* - https://www.usenix.org/node/183932
- Hardware Specification - PRMs -
https://01.org/linuxgraphics/documentation/hardware-specification-prms
Background
**********
Intel GVT-g is an enabling technology in emerging graphics
virtualization scenarios. It adopts a full GPU virtualization approach
based on mediated pass-through technology, to achieve good performance,
scalability and secure isolation among Virtual Machines (VMs). A virtual
GPU (vGPU), with full GPU features, is presented to each VM so that a
native graphics driver can run directly inside a VM.
Intel GVT-g technology for Apollo Lake (APL) has been implemented in
open source hypervisors or Virtual Machine Monitors (VMMs):
- Intel GVT-g for ACRN, also known as "AcrnGT"
- Intel GVT-g for KVM, also known as "KVMGT"
- Intel GVT-g for Xen, also known as "XenGT"
The core vGPU device model is released under BSD/MIT dual license, so it
can be reused in other proprietary hypervisors.
Intel has a portfolio of graphics virtualization technologies
(:term:`GVT-g`, :term:`GVT-d` and :term:`GVT-s`). GVT-d and GVT-s are
outside of the scope of this document.
This HLD applies to the Apollo Lake platform only. Support of other
hardware is outside the scope of this HLD.
Targeted Usages
===============
The main targeted usage of GVT-g is in automotive applications, such as:
- An Instrument cluster running in one domain
- An In Vehicle Infotainment (IVI) solution running in another domain
- Additional domains for specific purposes, such as Rear Seat
Entertainment or video camera capturing.
.. figure:: images/APL_GVT-g-ive-use-case.png
:width: 900px
:align: center
:name: ive-use-case
IVE Use Case
Existing Techniques
===================
A graphics device is no different from any other I/O device, with
respect to how the device I/O interface is virtualized. Therefore,
existing I/O virtualization techniques can be applied to graphics
virtualization. However, none of the existing techniques can meet the
general requirement of performance, scalability, and secure isolation
simultaneously. In this section, we review the pros and cons of each
technique in detail, enabling the audience to understand the rationale
behind the entire GVT-g effort.
Emulation
---------
A device can be emulated fully in software, including its I/O registers
and internal functional blocks. There would be no dependency on the
underlying hardware capability, therefore compatibility can be achieved
across platforms. However, due to the CPU emulation cost, this technique
is usually used for legacy devices, such as a keyboard, mouse, and VGA
card. Fully emulating a modern accelerator, such as a GPU, would involve
great complexity and deliver extremely low performance. It may be acceptable
for use in a simulation environment, but it is definitely not suitable
for production usage.
API Forwarding
--------------
API forwarding, or a split driver model, is another widely-used I/O
virtualization technology. It has been used in commercial virtualization
products, for example, VMware*, PCoIP*, and Microsoft* RemoteFx*.
It is a natural path when researchers study a new type of
I/O virtualization usage, for example, when GPGPU computing in VM was
initially proposed. Intel GVT-s is based on this approach.
The architecture of API forwarding is shown in :numref:`api-forwarding`:
.. figure:: images/APL_GVT-g-api-forwarding.png
:width: 400px
:align: center
:name: api-forwarding
API Forwarding
A frontend driver is employed to forward high-level API calls (OpenGL,
DirectX, and so on) inside a VM, to a Backend driver in the Hypervisor
for acceleration. The Backend may be using a different graphics stack,
so API translation between different graphics protocols may be required.
The Backend driver allocates a physical GPU resource for each VM,
behaving like a normal graphics application in a Hypervisor. Shared
memory may be used to reduce memory copying between the host and guest
graphic stacks.
API forwarding can bring hardware acceleration capability into a VM,
with other merits such as vendor independence and high density. However, it
also suffers from the following intrinsic limitations:
- Lagging features - Every new API version must be specifically
handled, which means a slow time-to-market (TTM) for supporting new
standards. For example, only DirectX9 may be supported when DirectX11
is already on the market. There is also a big gap in supporting media
and compute usages.
- Compatibility issues - A GPU is very complex, and consequently so are
high level graphics APIs. Different protocols are not 100% compatible
on every subtle API, so the customer can observe feature/quality loss
for specific applications.
- Maintenance burden - Grows as the number of supported protocols and
their specific versions increases.
- Performance overhead - Different API forwarding implementations
exhibit quite different performance, which gives rise to a need for a
fine-grained graphics tuning effort.
Direct Pass-Through
-------------------
"Direct pass-through" dedicates the GPU to a single VM, providing full
features and good performance, but at the cost of device sharing
capability among VMs. Only one VM at a time can use the hardware
acceleration capability of the GPU, which is a major limitation of this
technique. However, it is still a good approach to enable graphics
virtualization usages on Intel server platforms, as an intermediate
solution. Intel GVT-d uses this mechanism.
.. figure:: images/APL_GVT-g-pass-through.png
:width: 400px
:align: center
:name: gvt-pass-through
Pass-Through
SR-IOV
------
Single Root IO Virtualization (SR-IOV) implements I/O virtualization
directly on a device. Multiple Virtual Functions (VFs) are implemented,
with each VF directly assignable to a VM.
.. _Graphic_mediation:
Mediated Pass-Through
*********************
Intel GVT-g achieves full GPU virtualization using a "mediated
pass-through" technique.
Concept
=======
Mediated pass-through allows a VM to access performance-critical I/O
resources (usually partitioned) directly, without intervention from the
hypervisor in most cases. Privileged operations from this VM are
trapped-and-emulated to provide secure isolation among VMs.
.. figure:: images/APL_GVT-g-mediated-pass-through.png
:width: 400px
:align: center
:name: mediated-pass-through
Mediated Pass-Through
The Hypervisor must ensure that no vulnerability is exposed when
assigning performance-critical resource to each VM. When a
performance-critical resource cannot be partitioned, a scheduler must be
implemented (either in software or hardware) to allow time-based sharing
among multiple VMs. In this case, the device must allow the hypervisor
to save and restore the hardware state associated with the shared resource,
either through direct I/O register reads and writes (when there is no
software-invisible state) or through a device-specific context save and
restore mechanism (when there is software-invisible state).
Examples of performance-critical I/O resources include the following:
.. figure:: images/APL_GVT-g-perf-critical.png
:width: 800px
:align: center
:name: perf-critical
Performance-Critical I/O Resources
The key to implementing mediated pass-through for a specific device is
to define the right policy for various I/O resources.
Virtualization Policies for GPU Resources
=========================================
:numref:`graphics-arch` shows how Intel Processor Graphics works at a high level.
Software drivers write commands into a command buffer through the CPU.
The Render Engine in the GPU fetches these commands and executes them.
The Display Engine fetches pixel data from the Frame Buffer and sends
them to the external monitors for display.
.. figure:: images/APL_GVT-g-graphics-arch.png
:width: 400px
:align: center
:name: graphics-arch
Architecture of Intel Processor Graphics
This architecture abstraction applies to most modern GPUs, but may
differ in how graphics memory is implemented. Intel Processor Graphics
uses system memory as graphics memory. System memory can be mapped into
multiple virtual address spaces by GPU page tables. A 4 GB global
virtual address space called "global graphics memory", accessible from
both the GPU and CPU, is mapped through a global page table. Local
graphics memory spaces are supported in the form of multiple 4 GB local
virtual address spaces, but are accessible only by the Render
Engine through local page tables. Global graphics memory is mostly used
for the Frame Buffer and also serves as the Command Buffer. Massive data
accesses are made to local graphics memory when hardware acceleration is
in progress. Other GPUs have a similar page table mechanism accompanying
their on-die memory.
The CPU programs the GPU through GPU-specific commands, shown in
:numref:`graphics-arch`, using a producer-consumer model. The graphics
driver programs GPU commands into the Command Buffer, including primary
buffer and batch buffer, according to the high-level programming APIs,
such as OpenGL* or DirectX*. Then, the GPU fetches and executes the
commands. The primary buffer (also called a ring buffer) may chain
batch buffers together; the terms primary buffer and ring buffer are
used interchangeably hereafter. The batch buffer is used to convey the
majority of the commands (up to ~98% of them) per programming model. A
register tuple (head, tail) is used to control the ring buffer. The CPU
submits the commands to the GPU by updating the tail, while the GPU
fetches commands from the head, and then notifies the CPU by updating
the head, after the commands have finished execution. Therefore, when
the GPU has executed all commands from the ring buffer, the head and
tail pointers are the same.
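
The producer-consumer protocol on the (head, tail) register tuple can be
illustrated with the small sketch below. The structure and function
names are illustrative only; they are not the hardware register
definitions or the i915 driver's code.

.. code-block:: c

   #include <stdint.h>

   #define RING_DWORDS 1024u            /* illustrative ring size in dwords */

   struct ring_regs {
       uint32_t head;                   /* advanced by the GPU as it consumes */
       uint32_t tail;                   /* advanced by the CPU as it produces */
   };

   /* CPU side: place one command dword at the tail and advance the tail.
    * The tail update is what tells the GPU there is new work to fetch. */
   static int ring_emit(struct ring_regs *regs, uint32_t *ring, uint32_t dword)
   {
       uint32_t next = (regs->tail + 1) % RING_DWORDS;

       if (next == regs->head)          /* full: keep one slot free */
           return -1;
       ring[regs->tail] = dword;
       regs->tail = next;
       return 0;
   }

   /* The ring is idle once the GPU's head has caught up with the tail. */
   static int ring_idle(const struct ring_regs *regs)
   {
       return regs->head == regs->tail;
   }
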
Having introduced the GPU architecture abstraction, it is important for
us to understand how real-world graphics applications use the GPU
hardware so that we can virtualize it in VMs efficiently. To do so, we
characterized, for some representative GPU-intensive 3D workloads (the
Phoronix Test Suite), the usages of the four critical interfaces:
1) the Frame Buffer,
2) the Command Buffer,
3) the GPU Page Table Entries (PTEs), which carry the GPU page tables, and
4) the I/O registers, including Memory-Mapped I/O (MMIO) registers,
Port I/O (PIO) registers, and PCI configuration space registers
for internal state.
:numref:`access-patterns` shows the average access frequency of running
Phoronix 3D workloads on the four interfaces.
The Frame Buffer and Command Buffer exhibit the most
performance-critical resources, as shown in :numref:`access-patterns`.
When the applications are being loaded, lots of source vertices and
pixels are written by the CPU, so the Frame Buffer accesses occur in the
range of hundreds of thousands per second. Then at run-time, the CPU
programs the GPU through the commands, to render the Frame Buffer, so
the Command Buffer accesses become the largest group, also in the
hundreds of thousands per second. PTE and I/O accesses are minor in both
the load and run-time phases, ranging in the tens of thousands per second.
.. figure:: images/APL_GVT-g-access-patterns.png
:width: 400px
:align: center
:name: access-patterns
Access Patterns of Running 3D Workloads
High Level Architecture
***********************
:numref:`gvt-arch` shows the overall architecture of GVT-g, based on the
ACRN hypervisor, with SOS as the privileged VM, and multiple user
guests. A GVT-g device model, working with the ACRN hypervisor,
implements the policies of trap and pass-through. Each guest runs the
native graphics driver and can directly access performance-critical
resources: the Frame Buffer and Command Buffer, with resource
partitioning (as presented later). To protect privileged resources, that
is, the I/O registers and PTEs, corresponding accesses from the graphics
driver in user VMs are trapped and forwarded to the GVT device model in
SOS for emulation. The device model leverages i915 interfaces to access
the physical GPU.
In addition, the device model implements a GPU scheduler that runs
concurrently with the CPU scheduler in ACRN to share the physical GPU
timeslot among the VMs. GVT-g uses the physical GPU to directly execute
all the commands submitted from a VM, so it avoids the complexity of
emulating the Render Engine, which is the most complex part of the GPU.
In the meantime, the resource pass-through of both the Frame Buffer and
Command Buffer minimizes the hypervisor's intervention of CPU accesses,
while the GPU scheduler guarantees every VM a quantum time-slice for
direct GPU execution. With that, GVT-g can achieve near-native
performance for a VM workload.
In :numref:`gvt-arch`, the yellow GVT device model works as a client on
top of an i915 driver in the SOS. It has a generic Mediated Pass-Through
(MPT) interface, compatible with all types of hypervisors. For ACRN,
some extra development work is needed for such MPT interfaces. For
example, we need some changes in ACRN-DM to make ACRN compatible with
the MPT framework. The vGPU lifecycle is the same as the lifecycle of
the guest VM creation through ACRN-DM. They interact through sysfs,
exposed by the GVT device model.
.. figure:: images/APL_GVT-g-arch.png
:width: 600px
:align: center
:name: gvt-arch
AcrnGT High-level Architecture
Key Techniques
**************
vGPU Device Model
=================
The vGPU Device model is the main component because it constructs the
vGPU instance for each guest to satisfy every GPU request from the guest
and gives the corresponding result back to the guest.
The vGPU Device Model provides the basic framework to do
trap-and-emulation, including MMIO virtualization, interrupt
virtualization, and display virtualization. It also handles and
processes all the requests internally, such as command scan and shadow,
schedules them in the proper manner, and finally submits them to
the SOS i915 driver.
.. figure:: images/APL_GVT-g-DM.png
:width: 800px
:align: center
:name: GVT-DM
GVT-g Device Model
MMIO Virtualization
-------------------
Intel Processor Graphics implements two PCI MMIO BARs:
- **GTTMMADR BAR**: Combines both :term:`GGTT` modification range and Memory
Mapped IO range. It is 16 MB on :term:`BDW`, with 2 MB used by MMIO, 6 MB
reserved and 8 MB allocated to GGTT. GGTT starts from
:term:`GTTMMADR` + 8 MB. In this section, we focus on virtualization of
the MMIO range, discussing GGTT virtualization later.
- **GMADR BAR**: As the PCI aperture is used by the CPU to access tiled
graphics memory, GVT-g partitions this aperture range among VMs for
performance reasons.
A 2 MB virtual MMIO structure is allocated per vGPU instance.
All the virtual MMIO registers are emulated as simple in-memory
read-write; that is, the guest driver will read back the same value that was
programmed earlier. A common emulation handler (for example,
intel_gvt_emulate_read/write) is enough to handle such general
emulation requirements. However, some registers need to be emulated with
specific logic: for example, a register may be affected by changes to
other state, or may require additional auditing or translation when the
virtual register is updated.
Therefore, a specific emulation handler must be installed for those
special registers.
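
The split between the common handler and register-specific handlers can
be sketched as follows. The structure, mask value, and function names
are illustrative assumptions, not the actual GVT-g implementation.

.. code-block:: c

   #include <stdint.h>
   #include <string.h>

   #define VGPU_MMIO_SIZE (2u * 1024 * 1024)    /* 2 MB virtual MMIO block */

   struct vgpu {
       uint8_t mmio[VGPU_MMIO_SIZE];            /* in-memory register file */
   };

   /* Common emulation: the guest reads back whatever it programmed. */
   static uint32_t mmio_read_default(struct vgpu *v, uint32_t off)
   {
       uint32_t val;

       memcpy(&val, &v->mmio[off], sizeof(val));
       return val;
   }

   static void mmio_write_default(struct vgpu *v, uint32_t off, uint32_t val)
   {
       memcpy(&v->mmio[off], &val, sizeof(val));
   }

   /* A register with side effects gets a specific handler that can audit
    * or translate the value before it reaches the in-memory copy. */
   static void mmio_write_special(struct vgpu *v, uint32_t off, uint32_t val)
   {
       val &= 0x00ffffffu;                      /* illustrative audit/mask */
       mmio_write_default(v, off, val);
   }
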
The graphics driver may make assumptions about the initial device state,
namely the state left at the point when the BIOS hands control to the OS.
To meet the driver's expectation, we need to provide an initial vGPU state
that matches what a driver would observe on a pGPU. The host graphics
driver is therefore expected to generate a snapshot of the physical GPU
state before the guest driver's initialization. This snapshot is used as
the initial vGPU state by the device model.
PCI Configuration Space Virtualization
--------------------------------------
PCI configuration space also needs to be virtualized in the device
model. Different implementations may choose to implement the logic
within the vGPU device model or in the default system device model (for
example, ACRN-DM). GVT-g emulates the logic in the device model.
Some information is vital for the vGPU device model, including the
guest PCI BARs, the guest PCI MSI configuration, and the base of the
ACPI OpRegion.
Legacy VGA Port I/O Virtualization
----------------------------------
Legacy VGA is not supported in the vGPU device model. We rely on the
default device model (for example, :term:`QEMU`) to provide legacy VGA
emulation, which means either ISA VGA emulation or
PCI VGA emulation.
Interrupt Virtualization
------------------------
The GVT device model does not touch the hardware interrupt in the new
architecture, since it is hard to combine the interrupt controlling
logic between the virtual device model and the host driver. To prevent
architectural changes in the host driver, the host GPU interrupt does
not go to the virtual device model and the virtual device model has to
handle the GPU interrupt virtualization by itself. Virtual GPU
interrupts are categorized into three types:
- Periodic GPU interrupts are emulated by timers. However, a notable
exception to this is the VBlank interrupt. Due to the demands of user
space compositors, such as Wayland, which requires a flip done event
to be synchronized with a VBlank, this interrupt is forwarded from
SOS to UOS when SOS receives it from the hardware.
- Event-based GPU interrupts are emulated by the emulation logic. For
example, AUX Channel Interrupt.
- GPU command interrupts are emulated by a command parser and workload
dispatcher. The command parser marks out which GPU command interrupts
are generated during the command execution and the workload
dispatcher injects those interrupts into the VM after the workload is
finished.
.. figure:: images/APL_GVT-g-interrupt-virt.png
:width: 400px
:align: center
:name: interrupt-virt
Interrupt Virtualization
Workload Scheduler
------------------
The scheduling policy and workload scheduler are decoupled for
scalability reasons. For example, a future QoS enhancement will only
impact the scheduling policy, while any i915 interface change or HW
submission interface change (from execlist to :term:`GuC`) will only
need workload scheduler updates.
The scheduling policy framework is the core of the vGPU workload
scheduling system. It controls all of the scheduling actions and
provides the developer with a generic framework for easy development of
scheduling policies. The scheduling policy framework controls the work
scheduling process without caring about how the workload is dispatched
or completed. All the detailed workload dispatching is hidden in the
workload scheduler, which is the actual executer of a vGPU workload.
The workload scheduler handles everything about one vGPU workload. Each
hardware ring is backed by one workload scheduler kernel thread. The
workload scheduler picks the workload from current vGPU workload queue
and communicates with the virtual HW submission interface to emulate the
"schedule-in" status for the vGPU. It performs context shadow, Command
Buffer scan and shadow, PPGTT page table pin/unpin/out-of-sync, before
submitting this workload to the host i915 driver. When the vGPU workload
is completed, the workload scheduler asks the virtual HW submission
interface to emulate the "schedule-out" status for the vGPU. The VM
graphics driver then knows that a GPU workload is finished.
.. figure:: images/APL_GVT-g-scheduling.png
:width: 500px
:align: center
:name: scheduling
GVT-g Scheduling Framework
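
The decoupling between the scheduling policy and the workload scheduler
can be pictured as two separate ops tables, as in the sketch below; the
structure and member names are illustrative, not the actual GVT-g
interfaces.

.. code-block:: c

   struct vgpu;                        /* opaque per-vGPU state */

   /* Scheduling policy: decides which vGPU owns the render engine for the
    * next time slice; knows nothing about how workloads are dispatched. */
   struct sched_policy_ops {
       void (*init)(void);
       struct vgpu *(*pick_next_owner)(void);   /* e.g. round robin or QoS */
   };

   /* Workload scheduler: dispatches workloads of the current render owner;
    * knows nothing about why this particular vGPU was chosen. */
   struct workload_sched_ops {
       int  (*dispatch)(struct vgpu *owner);    /* shadow, scan, submit    */
       void (*complete)(struct vgpu *owner);    /* emulate "schedule-out"  */
   };
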
Workload Submission Path
------------------------
Software submits the workload using the legacy ring buffer mode on Intel
Processor Graphics before Broadwell, which is no longer supported by the
GVT-g virtual device model. A new HW submission interface named
"Execlist" was introduced with Broadwell. With the new HW submission
interface, software can achieve better programmability and easier
context management. In Intel GVT-g, the vGPU submits workloads
through the virtual HW submission interface. Each submitted workload
is represented as an ``intel_vgpu_workload`` data structure, which is
placed on a per-vGPU, per-engine workload queue after a few basic
checks and verifications.
.. figure:: images/APL_GVT-g-workload.png
:width: 800px
:align: center
:name: workload
GVT-g Workload Submission
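
A pared-down view of this submission path is sketched below: each
trapped ELSP write is decoded into a workload object and appended to a
per-vGPU, per-engine queue after basic checks. The structure here is a
simplified stand-in for the real ``intel_vgpu_workload``; field and
function names are illustrative.

.. code-block:: c

   #include <stdint.h>
   #include <stdlib.h>

   struct vgpu_workload {
       uint64_t ctx_desc;                   /* context descriptor from ELSP */
       uint64_t ring_buffer_gpa;            /* guest address of the ring    */
       struct vgpu_workload *next;          /* per-engine FIFO link         */
   };

   struct vgpu_engine_queue {
       struct vgpu_workload *head, *tail;   /* per-vGPU, per-engine queue   */
   };

   /* Called after a trapped ELSP write has passed basic checks. */
   static int queue_workload(struct vgpu_engine_queue *q,
                             uint64_t ctx_desc, uint64_t ring_gpa)
   {
       struct vgpu_workload *w = calloc(1, sizeof(*w));

       if (!w)
           return -1;
       w->ctx_desc = ctx_desc;
       w->ring_buffer_gpa = ring_gpa;

       if (q->tail)
           q->tail->next = w;
       else
           q->head = w;
       q->tail = w;                         /* scheduler thread pops head */
       return 0;
   }
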
Display Virtualization
----------------------
GVT-g reuses the i915 graphics driver in the SOS to initialize the Display
Engine, and then manages the Display Engine to show different VM frame
buffers. When two vGPUs have the same resolution, only the frame buffer
locations are switched.
.. figure:: images/APL_GVT-g-display-virt.png
:width: 800px
:align: center
:name: display-virt
Display Virtualization
Direct Display Model
--------------------
.. figure:: images/APL_GVT-g-direct-display.png
:width: 600px
:align: center
:name: direct-display
Direct Display Model
A typical automotive use case is where there are two displays in the car
and each one needs to show one domain's content, with the two domains
being the Instrument cluster and the In Vehicle Infotainment (IVI). As
shown in :numref:`direct-display`, this can be accomplished through the direct
display model of GVT-g, where the SOS and UOS are each assigned all HW
planes of two different pipes. GVT-g has a concept of display owner on a
per HW plane basis. If it determines that a particular domain is the
owner of a HW plane, then it allows the domain's MMIO register write to
flip a frame buffer to that plane to go through to the HW. Otherwise,
such writes are blocked by GVT-g.
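
A sketch of this ownership check is shown below: a trapped flip write is
forwarded to the hardware only when the writing domain owns the target
plane. All names and the static ownership table are illustrative.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   #define MAX_PIPES  3
   #define MAX_PLANES 4

   /* Which domain (VM id) owns each HW plane; configured statically. */
   static int plane_owner[MAX_PIPES][MAX_PLANES];

   static bool flip_allowed(int domain_id, int pipe, int plane)
   {
       return plane_owner[pipe][plane] == domain_id;
   }

   /* Trapped MMIO write that flips a frame buffer onto a plane. */
   static void handle_flip_write(int domain_id, int pipe, int plane,
                                 uint64_t fb_addr,
                                 void (*hw_flip)(int, int, uint64_t))
   {
       if (flip_allowed(domain_id, pipe, plane))
           hw_flip(pipe, plane, fb_addr);   /* pass through to the HW */
       /* otherwise the write is simply dropped by the mediator */
   }
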
Indirect Display Model
----------------------
.. figure:: images/APL_GVT-g-indirect-display.png
:width: 600px
:align: center
:name: indirect-display
Indirect Display Model
For security or fastboot reasons, it may be determined that the UOS is
either not allowed to display its content directly on the HW, or it may
boot too late to display its content in time. In such a
scenario, the responsibility of displaying content on all displays lies
with the SOS. One of the use cases that can be realized is to display the
entire frame buffer of the UOS on a secondary display. GVT-g allows for this
model by first trapping all MMIO writes by the UOS to the HW. A proxy
application can then capture the address in GGTT where the UOS has written
its frame buffer and, with the help of the Hypervisor and the SOS's i915
driver, convert the Guest Physical Addresses (GPAs) into Host
Physical Addresses (HPAs) before making a texture source or EGL image
out of the frame buffer, and then either post-process it further or
simply display it on a HW plane of the secondary display.
GGTT-Based Surface Sharing
--------------------------
One of the major automotive use cases is called "surface sharing". This
use case requires that the SOS accesses an individual surface or a set of
surfaces from the UOS without having to access the entire frame buffer of
the UOS. Unlike the previous two models, where the UOS did not have to do
anything to show its content and therefore a completely unmodified UOS
could continue to run, this model requires changes to the UOS.
This model can be considered an extension of the indirect display model.
Under the indirect display model, the UOS's frame buffer was temporarily
pinned by it in video memory, accessed through the Global Graphics
Translation Table. This GGTT-based surface sharing model takes this a
step further by having the compositor of the UOS temporarily pin all
application buffers into the GGTT. It then also requires the compositor to
create a metadata table with relevant surface information such as width,
height, and GGTT offset, and flip that in lieu of the frame buffer.
In the SOS, the proxy application knows that the GGTT offset has been
flipped, maps it, and through it can access the GGTT offset of an
application that it wants to access. It is worth mentioning that in this
model, UOS applications did not require any changes, and only the
compositor, Mesa, and i915 driver had to be modified.
This model has a major benefit and a major limitation. The
benefit is that since it builds on top of the indirect display model,
there are no special drivers necessary for it on either SOS or UOS.
Therefore, any Real Time Operating System (RTOS) that uses
this model can simply do so without having to implement a driver, the
infrastructure for which may not be present in its operating system.
The limitation of this model is that video memory dedicated for a UOS is
generally limited to a couple of hundred MBs. This can easily be
exhausted by a few application buffers, so the number and size of buffers
are limited. Since it is not a highly-scalable model, in general, Intel
recommends the Hyper DMA buffer sharing model, described next.
Hyper DMA Buffer Sharing
------------------------
.. figure:: images/APL_GVT-g-hyper-dma.png
:width: 800px
:align: center
:name: hyper-dma
Hyper DMA Buffer Design
Another approach to surface sharing is Hyper DMA Buffer sharing. This
model extends the Linux DMA buffer sharing mechanism where one driver is
able to share its pages with another driver within one domain.
Application buffers are backed by i915 Graphics Execution Manager
Buffer Objects (GEM BOs). As in GGTT surface
sharing, this model also requires compositor changes. The compositor of
UOS requests i915 to export these application GEM BOs and then passes
them on to a special driver called the Hyper DMA Buf exporter whose job
is to create a scatter gather list of pages mapped by PDEs and PTEs and
export a Hyper DMA Buf ID back to the compositor.
The compositor then shares this Hyper DMA Buf ID with the SOS's Hyper DMA
Buf importer driver which then maps the memory represented by this ID in
the SOS. A proxy application in the SOS can then provide the ID of this driver
to the SOS i915, which can create its own GEM BO. Finally, the application
can use it as an EGL image and do any post processing required before
either providing it to the SOS compositor or directly flipping it on a
HW plane in the compositor's absence.
This model is highly scalable and can be used to share up to 4 GB worth
of pages. It is also not limited to sharing graphics buffers only; other
buffers, such as those for the IPU, can also be shared this way. However, it
does require that the SOS port the Hyper DMA Buffer importer driver. Also,
the SOS must comprehend and implement the DMA buffer sharing model.
For detailed information about this model, please refer to the `Linux
HYPER_DMABUF Driver High Level Design
<https://github.com/downor/linux_hyper_dmabuf/blob/hyper_dmabuf_integration_v4/Documentation/hyper-dmabuf-sharing.txt>`_.
.. _plane_restriction:
Plane-Based Domain Ownership
----------------------------
.. figure:: images/APL_GVT-g-plane-based.png
:width: 600px
:align: center
:name: plane-based
Plane-Based Domain Ownership
Yet another mechanism for showing content of both the SOS and UOS on the
same physical display is called plane-based domain ownership. Under this
model, both the SOS and UOS are provided a set of HW planes that they can
flip their contents on to. Since each domain provides its content, there
is no need for any extra composition to be done through the SOS. The display
controller handles alpha blending contents of different domains on a
single pipe. This avoids extra complexity in either the SOS or the UOS
SW stack.
It is important to provide only specific planes and have them statically
assigned to different Domains. To achieve this, the i915 driver of both
domains is provided a command line parameter that specifies the exact
planes that this domain has access to. The i915 driver then enumerates
only those HW planes and exposes them to its compositor. It is then left
to the compositor configuration to use these planes appropriately and
show the correct content on them. No other changes are necessary.
While the biggest benefit of this model is that it is extremely simple
and quick to implement, it also has some drawbacks. First, since each domain
is responsible for showing the content on the screen, there is no
control of the UOS by the SOS. If the UOS is untrusted, this could
potentially cause some unwanted content to be displayed. Also, there is
no post processing capability, except that provided by the display
controller (for example, scaling, rotation, and so on). So each domain
must provide finished buffers with the expectation that alpha blending
with another domain will not cause any corruption or unwanted artifacts.
Graphics Memory Virtualization
==============================
To achieve near-to-native graphics performance, GVT-g passes through the
performance-critical operations, such as Frame Buffer and Command Buffer
accesses, from the VM. For the global graphics memory space, GVT-g uses graphics
memory resource partitioning and an address space ballooning mechanism.
For local graphics memory spaces, GVT-g implements per-VM local graphics
memory through a render context switch because local graphics memory is
only accessible by the GPU.
Global Graphics Memory
----------------------
Graphics Memory Resource Partitioning
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
GVT-g partitions the global graphics memory among VMs. Splitting the
CPU/GPU scheduling mechanism requires that the global graphics memory of
different VMs can be accessed by the CPU and the GPU simultaneously.
Consequently, GVT-g must, at any time, present each VM with its own
resource, leading to the resource partitioning approach for global
graphics memory, as shown in :numref:`mem-part`.
.. figure:: images/APL_GVT-g-mem-part.png
:width: 800px
:align: center
:name: mem-part
Memory Partition and Ballooning
The performance impact of reduced global graphics memory resource
due to memory partitioning is very limited according to various test
results.
Address Space Ballooning
%%%%%%%%%%%%%%%%%%%%%%%%
The address space ballooning technique is introduced to eliminate the
address translation overhead, shown in :numref:`mem-part`. GVT-g exposes the
partitioning information to the VM graphics driver through the PVINFO
MMIO window. The graphics driver marks the other VMs' regions as
'ballooned' and reserves them in its graphics memory allocator so that
they are never used. Under this design, the guest view of global graphics
memory space is exactly the same as the host view, and the addresses
programmed by the driver, using guest physical addresses, can be used
directly by the hardware. Address space ballooning is different from
traditional memory ballooning techniques. Memory ballooning controls the
number of ballooned memory pages, while address space ballooning balloons
specific memory address ranges.
Another benefit of address space ballooning is that there is no address
translation overhead as we use the guest Command Buffer for direct GPU
execution.
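
A minimal sketch of the guest-side ballooning step is shown below,
assuming a simple range-based reservation callback and a PVINFO layout
with one mappable and one non-mappable range per VM; the structure and
function names are illustrative.

.. code-block:: c

   #include <stdint.h>

   /* Partition information the guest driver reads from the PVINFO window
    * (illustrative layout, with the mappable range below the non-mappable
    * range). */
   struct gvt_partition {
       uint64_t mappable_base, mappable_size;
       uint64_t unmappable_base, unmappable_size;
   };

   /* Reserve ("balloon out") everything outside this VM's ranges so the
    * graphics memory allocator never hands out other VMs' addresses. */
   static void balloon_foreign_ranges(const struct gvt_partition *p,
                                      uint64_t ggtt_size,
                                      void (*reserve)(uint64_t base,
                                                      uint64_t size))
   {
       uint64_t mappable_end   = p->mappable_base + p->mappable_size;
       uint64_t unmappable_end = p->unmappable_base + p->unmappable_size;

       reserve(0, p->mappable_base);                        /* low hole  */
       reserve(mappable_end, p->unmappable_base - mappable_end);
       reserve(unmappable_end, ggtt_size - unmappable_end); /* high hole */
   }
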
Per-VM Local Graphics Memory
----------------------------
GVT-g allows each VM to use the full local graphics memory spaces of its
own, similar to the virtual address spaces on the CPU. The local
graphics memory spaces are only visible to the Render Engine in the GPU.
Therefore, any valid local graphics memory address, programmed by a VM,
can be used directly by the GPU. The GVT-g device model switches the
local graphics memory spaces between VMs when switching render
ownership.
GPU Page Table Virtualization
=============================
Shared Shadow GGTT
------------------
To achieve resource partitioning and address space ballooning, GVT-g
implements a shared shadow global page table for all VMs. Each VM has
its own guest global page table to translate the graphics memory page
number to the Guest memory Page Number (GPN). The shadow global page
table then translates from the graphics memory page number to the
Host memory Page Number (HPN).
The shared shadow global page table maintains the translations for all
VMs to support concurrent accesses from the CPU and the GPU.
Therefore, GVT-g implements a single, shared shadow global page table by
trapping guest PTE updates, as shown in :numref:`shared-shadow`. The
global page table, in MMIO space, has 1024K PTEs, each pointing
to a 4 KB system memory page, so the global page table overall creates a
4 GB global graphics memory space. GVT-g audits the guest PTE values
according to the address space ballooning information before updating
the shadow PTE entries.
.. figure:: images/APL_GVT-g-shared-shadow.png
:width: 600px
:align: center
:name: shared-shadow
Shared Shadow Global Page Table
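
The trap-and-audit path for one guest GGTT PTE write can be sketched as
follows; the helper functions, PTE layout, and audit rule are simplified
assumptions rather than the actual GVT-g code.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   #define PTE_ADDR_MASK 0xfffffffffffff000ull     /* page frame bits */

   /* Assumed helpers supplied by the rest of the device model. */
   bool     ggtt_index_in_vm_partition(int vm_id, uint64_t index);
   uint64_t gpn_to_hpn(int vm_id, uint64_t gpn);
   void     write_shadow_pte(uint64_t index, uint64_t pte);

   /* Trapped guest write of one GGTT PTE at the given entry index. */
   static int handle_guest_ggtt_write(int vm_id, uint64_t index, uint64_t gpte)
   {
       uint64_t gpn = (gpte & PTE_ADDR_MASK) >> 12;

       /* Audit: the entry must lie inside this VM's (non-ballooned) part
        * of the global graphics memory space. */
       if (!ggtt_index_in_vm_partition(vm_id, index))
           return -1;                               /* reject the update */

       /* Translate the guest page number to a host page number and
        * install the result into the shared shadow global page table. */
       write_shadow_pte(index, (gpn_to_hpn(vm_id, gpn) << 12) |
                               (gpte & ~PTE_ADDR_MASK));
       return 0;
   }
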
Per-VM Shadow PPGTT
-------------------
To support local graphics memory access pass-through, GVT-g implements
per-VM shadow local page tables. The local graphics memory is only
accessible from the Render Engine. The local page tables have two-level
paging structures, as shown in :numref:`per-vm-shadow`.
The first level, Page Directory Entries (PDEs), located in the global
page table, points to the second level, Page Table Entries (PTEs) in
system memory, so guest accesses to the PDE are trapped and emulated
through the implementation of the shared shadow global page table.
GVT-g also write-protects a list of guest PTE pages for each VM. The
GVT-g device model synchronizes the shadow page with the guest page, at
the time of write-protection page fault, and switches the shadow local
page tables at render context switches.
.. figure:: images/APL_GVT-g-per-vm-shadow.png
:width: 800px
:align: center
:name: per-vm-shadow
Per-VM Shadow PPGTT
.. _GVT-g-prioritized-rendering:
Prioritized Rendering and Preemption
====================================
Different Schedulers and Their Roles
------------------------------------
.. figure:: images/APL_GVT-g-scheduling-policy.png
:width: 800px
:align: center
:name: scheduling-policy
Scheduling Policy
In the system, there are three different schedulers for the GPU:
- i915 UOS scheduler
- Mediator GVT scheduler
- i915 SOS scheduler
Since UOS always uses the host-based command submission (ELSP) model,
and it never accesses the GPU or the Graphics Micro Controller (GuC)
directly, its scheduler cannot do any preemption by itself.
The i915 scheduler does ensure batch buffers are
submitted in dependency order, that is, if a compositor had to wait for
an application buffer to finish before its workload can be submitted to
the GPU, then the i915 scheduler of the UOS ensures that this happens.
The UOS assumes that by submitting its batch buffers to the Execlist
Submission Port (ELSP), the GPU will start working on them. However,
the MMIO write to the ELSP is captured by the Hypervisor, which forwards
these requests to the GVT module. GVT then creates a shadow context
based on this batch buffer and submits the shadow context to the SOS
i915 driver.
However, it is dependent on a second scheduler called the GVT
scheduler. This scheduler is time based and uses a round robin algorithm
to provide a specific time for each UOS to submit its workload when it
is considered the "render owner". The workloads of UOSs that are not
render owners during a specific time period end up waiting in the
virtual GPU context until the GVT scheduler makes them render owners.
The GVT shadow context submits only one workload at
a time, and once the workload is finished by the GPU, it copies any
context state back to DomU and sends the appropriate interrupts before
picking up any other workloads from either this UOS or another one. This
also implies that this scheduler does not do any preemption of
workloads.
Finally, there is the i915 scheduler in the SOS. This scheduler uses the
GuC or ELSP to do command submission of SOS local content as well as any
content that GVT is submitting to it on behalf of the UOSs. This
scheduler uses GuC or ELSP to preempt workloads. GuC has four different
priority queues, but the SOS i915 driver uses only two of them. One of
them is considered high priority and the other is normal priority with a
GuC rule being that any command submitted on the high priority queue
would immediately try to preempt any workload submitted on the normal
priority queue. For ELSP submission, the i915 driver submits a preempt
context to preempt the currently running context and then waits for the GPU
engine to be idle.
While the identification of workloads to be preempted is decided by
customizable scheduling policies, once a candidate for preemption is
identified, the i915 scheduler simply submits a preemption request to
the GuC high-priority queue. Based on the HW's ability to preempt (on an
Apollo Lake SoC, 3D workload is preemptible on a 3D primitive level with
some exceptions), the currently executing workload is saved and
preempted. The GuC informs the driver using an interrupt of a preemption
event occurring. After handling the interrupt, the driver submits the
high-priority workload through the normal priority GuC queue. As such,
the normal priority GuC queue is used for actual execbuf submission most
of the time with the high-priority GuC queue only being used for the
preemption of lower-priority workload.
Scheduling policies are customizable and left to customers to change if
they are not satisfied with the built-in i915 driver policy, where all
workloads of the SOS are considered higher priority than those of the
UOS. This policy can be enforced through an SOS i915 kernel command line
parameter, and can replace the default in-order command submission (no
preemption) policy.
AcrnGT
*******
ACRN is a flexible, lightweight reference hypervisor, built with
real-time and safety-criticality in mind, optimized to streamline
embedded development through an open source platform.
AcrnGT is the GVT-g implementation on the ACRN hypervisor. It adapts
the MPT interface of GVT-g onto ACRN by using the kernel APIs provided
by ACRN.
:numref:`full-pic` shows the full architecture of AcrnGT with a Linux Guest
OS and an Android Guest OS.
.. figure:: images/APL_GVT-g-full-pic.png
:width: 800px
:align: center
:name: full-pic
Full picture of the AcrnGT
AcrnGT in kernel
=================
The AcrnGT module in the SOS kernel acts as an adaptation layer connecting
GVT-g in the i915 driver, the VHM module, and the ACRN-DM user space
application:
- The AcrnGT module implements the MPT interface of GVT-g to provide
services to it, including setting and unsetting trap areas, setting and
unsetting write-protected pages, etc.
- It calls the VHM APIs provided by the ACRN VHM module in the SOS
kernel, to eventually call into the routines provided by the ACRN
hypervisor through hypercalls.
- It provides user space interfaces through ``sysfs`` to the user space
ACRN-DM, so that DM can manage the lifecycle of the virtual GPUs.
AcrnGT in DM
=============
To emulate a PCI device to a Guest, we need an AcrnGT sub-module in the
ACRN-DM. This sub-module is responsible for:
- registering the virtual GPU device to the PCI device tree presented to
the guest;
- registering the MMIO resources to ACRN-DM so that it can reserve
resources in the ACPI table;
- managing the lifecycle of the virtual GPU device, such as creation,
destruction, and resetting according to the state of the virtual
machine.



@@ -0,0 +1,18 @@
.. _hld-emulated-devices:
Emulated devices high-level design
##################################
Full virtualization device models can typically
reuse existing native device drivers to avoid implementing front-end
drivers. ACRN implements several fully virtualized devices, as
documented in this section.
.. toctree::
:maxdepth: 1
usb-virt-hld
UART virtualization <uart-virt-hld>
Watchdog virtualization <watchdog-hld>
random-virt-hld
GVT-g GPU Virtualization <hld-APL_GVT-g>


@@ -0,0 +1,24 @@
.. _hld-hypervisor:
Hypervisor high-level design
############################
.. toctree::
:maxdepth: 1
hv-startup
hv-cpu-virt
Memory management <hv-memmgt>
I/O Emulation <hv-io-emulation>
IOC Virtualization <hv-ioc-virt>
Physical Interrupt <hv-interrupt>
Timer <hv-timer>
Virtual Interrupt <hv-virt-interrupt>
VT-d <hv-vt-d>
Device Passthrough <hv-dev-passthrough>
hv-partitionmode
Power Management <hv-pm>
Console, Shell, and vUART <hv-console>
Hypercall / VHM upcall <hv-hypercall>
Compile-time configuration <hv-config>


@@ -0,0 +1,529 @@
.. _hld-overview:
ACRN high-level design overview
###############################
ACRN is an open source reference hypervisor (HV) running on top of Intel
Apollo Lake platforms for Software Defined Cockpit (SDC) or In-Vehicle
Experience (IVE) solutions. ACRN provides embedded hypervisor vendors
with a reference I/O mediation solution with a permissive license and
provides auto makers a reference software stack for in-vehicle use.
ACRN Supported Use Cases
************************
Software Defined Cockpit
========================
The SDC system consists of multiple systems: the instrument cluster (IC)
system, the In-vehicle Infotainment (IVI) system, and one or more rear
seat entertainment (RSE) systems. Each system runs as a VM for better
isolation.
The Instrument Cluster (IC) system manages graphics display of
- driving speed, engine RPM, temperature, fuel level, odometer, trip mile, etc.
- alerts of low fuel or tire pressure
- rear-view camera (RVC) and surround-camera view for driving assistance.
In-Vehicle Infotainment
=======================
A typical In-Vehicle Infotainment (IVI) system would support:
- Navigation systems;
- Radios, audio, and video playback;
- Mobile device connection for calls, music, and applications via voice
recognition and/or gesture recognition / touch;
- Rear-seat RSE services such as:
- entertainment system
- virtual office
- connection to IVI front system and mobile devices (cloud
connectivity)
ACRN supports Clear Linux OS and Android as guest OSes. OEMs can use the ACRN
hypervisor and Linux or Android guest OS reference code to implement their own
VMs for a customized IC/IVI/RSE.
Hardware Requirements
*********************
Mandatory IA CPU features are support for:
- Long mode
- MTRR
- TSC deadline timer
- NX, SMAP, SMEP
- Intel-VT including VMX, EPT, VT-d, APICv, VPID, invept and invvpid
Recommended memory: 4 GB (8 GB preferred).
ACRN Architecture
*****************
ACRN is a type-I hypervisor, running on top of bare metal. It supports
Intel Apollo Lake platforms and can be easily extended to support future
platforms. ACRN implements a hybrid VMM architecture, using a privileged
service VM running the service OS (SOS) to manage I/O devices and
provide I/O mediation. Multiple user VMs can be supported, running Clear
Linux OS or Android OS as the user OS (UOS).
Instrument cluster applications are critical in the Software Defined
Cockpit (SDC) use case, and may require functional safety certification
in the future. Running the IC system in a separate VM can isolate it from
other VMs and their applications, thereby reducing the attack surface
and minimizing potential interference. However, running the IC system in
a separate VM introduces additional latency for the IC applications.
Some countries' regulations require an IVE system to show a rear-view
camera (RVC) within 2 seconds, which is difficult to achieve if a
separate instrument cluster VM is started after the SOS is booted.
:numref:`overview-arch` shows the architecture of ACRN together with IC VM and
service VM. As shown, the SOS owns most of the platform devices and provides
I/O mediation to VMs. Some of the PCIe devices are passed through
to UOSs according to the VM configuration. In addition, the SOS could run
the IC applications and HV helper applications such as the Device Model,
VM manager, etc., where the VM manager is responsible for VM
start/stop/pause, virtual CPU pause/resume, etc.
.. figure:: images/over-image34.png
:align: center
:name: overview-arch
ACRN Architecture
.. _intro-io-emulation:
Device Emulation
================
ACRN adopts various approaches for emulating devices for UOS:
- **Emulated device**: A virtual device using this approach is emulated in
the SOS by trapping accesses to the device in UOS. Two sub-categories
exist for emulated devices:
- fully emulated, allowing native drivers to be used
unmodified in the UOS, and
- para-virtualized, requiring front-end drivers in
the UOS to function.
- **Pass-through device**: A device passed through to UOS is fully
accessible to UOS without interception. However, interrupts
are first handled by the hypervisor before
being injected to the UOS.
- **Mediated pass-through device**: A mediated pass-through device is a
hybrid of the previous two approaches. Performance-critical
resources (mostly data-plane related) are passed-through to UOSes and
others (mostly control-plane related) are emulated.
I/O Emulation
-------------
The device model (DM) is a place for managing UOS devices: it allocates
memory for UOSes, configures and initializes the devices shared by the
guest, loads the virtual BIOS and initializes the virtual CPU state, and
invokes hypervisor service to execute the guest instructions.
The following diagram illustrates the control flow of emulating a port
I/O read from UOS.
.. figure:: images/over-image29.png
:align: center
:name: overview-io-emu-path
I/O (PIO/MMIO) Emulation Path
:numref:`overview-io-emu-path` shows an example I/O emulation flow path.
When a guest executes an I/O instruction (port I/O or MMIO), a VM exit
happens. The HV takes control and handles the request based on the VM exit
reason, for example ``VMX_EXIT_REASON_IO_INSTRUCTION`` for port I/O access.
The HV then fetches any additional guest instructions, processes the port
I/O instruction at a pre-configured port address (``IN AL, 20h`` for
example), places the decoded information, such as the port I/O address,
size of access, read/write, and target register, into the I/O request in
the I/O request buffer (shown in :numref:`overview-io-emu-path`), and
notifies/interrupts the SOS to process it.
The virtio and HV service module (VHM) in SOS intercepts HV interrupts,
and accesses the I/O request buffer for the port I/O instructions. It will
then check if there is any kernel device claiming ownership of the
I/O port. The owning device, if any, executes the requested APIs from a
VM. Otherwise, the VHM module leaves the I/O request in the request buffer
and wakes up the DM thread for processing.
DM follows the same mechanism as VHM. The I/O processing thread of the
DM queries the I/O request buffer to get the PIO instruction details and
checks to see if any (guest) device emulation modules claim ownership of
the I/O port. If yes, the owning module is invoked to execute requested
APIs.
When the DM completes the emulation (port IO 20h access in this example)
of a device such as uDev1, uDev1 will put the result into the request
buffer (register AL). The DM will then return control to the HV,
indicating completion of an I/O instruction emulation, typically through
the VHM and a hypercall. The HV then stores the result to the guest register
context, advances the guest IP to indicate the completion of instruction
execution, and resumes the guest.
The MMIO access path is similar, except the VM exit reason is *EPT
violation*.
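
The information carried through the I/O request buffer along this path
can be sketched as below; the structure is a simplified stand-in for the
actual ACRN I/O request definition, and the dispatch helper is
hypothetical.

.. code-block:: c

   #include <stdint.h>

   enum io_type { IO_PORT, IO_MMIO };
   enum io_dir  { IO_READ, IO_WRITE };

   /* One slot of the shared I/O request buffer, filled by the HV on a
    * trapped port I/O or MMIO access (simplified). */
   struct io_request {
       enum io_type type;
       enum io_dir  dir;
       uint64_t     addr;      /* port number (e.g. 0x20) or MMIO GPA */
       uint32_t     size;      /* access width in bytes */
       uint64_t     value;     /* data written, or the result of a read */
       uint32_t     state;     /* pending / processing / completed */
   };

   /* SOS side: VHM first offers the request to kernel devices and wakes
    * the DM thread only if nothing in the kernel claims the address. */
   static void dispatch_io_request(struct io_request *req,
                                   int (*kernel_claim)(struct io_request *),
                                   void (*wake_dm)(struct io_request *))
   {
       if (!kernel_claim(req))
           wake_dm(req);       /* user-space device model emulates it */
   }
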
DMA Emulation
-------------
Currently the only fully virtualized devices to UOS are USB xHCI, UART,
and Automotive I/O controller. None of these require emulating
DMA transactions. ACRN does not currently support virtual DMA.
Hypervisor
**********
ACRN takes advantage of Intel Virtualization Technology (Intel VT).
The ACRN HV runs in Virtual Machine Extension (VMX) root operation,
host mode, or VMM mode, while the SOS and UOS guests run
in VMX non-root operation, or guest mode. (We'll use "root mode"
and "non-root mode" for simplicity).
The VMM mode has 4 rings. ACRN
runs the HV in ring 0 privilege only, and leaves rings 1-3 unused. A guest
running in non-root mode has its own full rings (ring 0 to 3). The
guest kernel runs in ring 0 in guest mode, while guest user land
applications run in ring 3 of guest mode (ring 1 and 2 are usually not
used by commercial OS).
.. figure:: images/over-image11.png
:align: center
:name: overview-arch-hv
Architecture of ACRN hypervisor
:numref:`overview-arch-hv` shows an overview of the ACRN hypervisor architecture.
- A platform initialization layer provides an entry
point, checking hardware capabilities and initializing the
processors, memory, and interrupts. Relocation of the hypervisor
image and derivation of encryption seeds are also supported by this
component.
- A hardware management and utilities layer provides services for
managing physical resources at runtime. Examples include handling
physical interrupts and low power state changes.
- A layer sitting on top of hardware management enables virtual
CPUs (or vCPUs), leveraging Intel VT. A vCPU loop runs a vCPU in
non-root mode and handles VM exit events triggered by the vCPU.
This layer handles CPU and memory related VM
exits and provides a way to inject exceptions or interrupts to a
vCPU.
- On top of vCPUs are three components for device emulation: one for
emulation inside the hypervisor, another for communicating with
SOS for mediation, and the third for managing pass-through
devices.
- The highest layer is a VM management module providing
VM lifecycle and power operations.
- A library component provides basic utilities for the rest of the
hypervisor, including encryption algorithms, mutual-exclusion
primitives, etc.
There are three ways that the hypervisor interacts with SOS:
VM exits (including hypercalls), upcalls, and through the I/O request buffer.
Interaction between the hypervisor and UOS is more restricted, including
only VM exits and hypercalls related to trusty.
SOS
***
SOS (Service OS) is an important guest OS in the ACRN architecture. It
runs in non-root mode, and contains many critical components including VM
manager, device model (DM), ACRN services, kernel mediation, and virtio
and hypercall module (VHM). The DM manages the UOS (User OS) and
provides device emulation for it. The SOS also provides services
for system power lifecycle management through ACRN service and VM manager,
and services for system debugging through ACRN log/trace tools.
DM
==
DM (Device Model) is a user-level QEMU-like application in the SOS
responsible for creating a UOS VM and then performing device emulation
based on command line configurations.
Based on a VHM kernel module, the DM interacts with the VM manager to create
the UOS VM. It then emulates devices through full virtualization at the DM
user level, through para-virtualization based on kernel mediators (such as
virtio, GVT), or through pass-through based on kernel VHM APIs.
Refer to :ref:`hld-devicemodel` for more details.
VM Manager
==========
VM Manager is a user-level service in the SOS handling UOS VM creation and
VM state management, according to the application requirements or system
power operations.
VM Manager creates the UOS VM based on the DM application, and manages the
UOS VM state by interacting with the lifecycle service in the ACRN service.
Please refer to VM management chapter for more details.
ACRN Service
============
ACRN service provides
system lifecycle management based on IOC polling. It communicates with
VM manager to handle UOS VM state, such as S3 and power-off.
VHM
===
The VHM (virtio and hypercall module) kernel module is an SOS kernel driver
supporting UOS VM management and device emulation. The Device Model follows
the standard Linux char device API (ioctl) to access VHM
functionalities. VHM communicates with the ACRN hypervisor through
hypercall or upcall interrupts.
Please refer to VHM chapter for more details.
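
As a rough illustration of this char-device interface, a DM-style client
opens the VHM node and issues ioctls for VM management. The device node
path and the request code below are placeholders for this sketch, not
the actual VHM ABI.

.. code-block:: c

   #include <fcntl.h>
   #include <stdio.h>
   #include <sys/ioctl.h>
   #include <unistd.h>

   /* Placeholder request code; the real numbers come from the SOS kernel
    * VHM headers. */
   #define VHM_IOCTL_CREATE_VM 0x4001u

   int main(void)
   {
       int fd = open("/dev/acrn_vhm", O_RDWR);   /* assumed device node */

       if (fd < 0) {
           perror("open vhm");
           return 1;
       }
       if (ioctl(fd, VHM_IOCTL_CREATE_VM, 0) < 0)
           perror("create vm");
       close(fd);
       return 0;
   }
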
Kernel Mediators
================
Kernel mediators are kernel modules providing a para-virtualization method
for the UOS VMs, for example, an i915 gvt driver.
Log/Trace Tools
===============
ACRN Log/Trace tools are user-level applications used to
capture ACRN hypervisor log and trace data. The VHM kernel module provides a
middle layer to support these tools.
Refer to :ref:`hld-trace-log` for more details.
UOS
***
Currently, ACRN can boot Linux and Android guest OSes. For Android guest OS, ACRN
provides a VM environment with two worlds: normal world and trusty
world. The Android OS runs in the normal world. The trusty OS and
security-sensitive applications run in the trusty world. The trusty
world can see the memory of the normal world, but the normal world
cannot see the memory of the trusty world.
Guest Physical Memory Layout - UOS E820
=======================================
The DM creates the E820 table for a User OS VM based on these simple rules:
- If requested VM memory size < low memory limitation (currently 2 GB,
defined in DM), then low memory range = [0, requested VM memory
size]
- If requested VM memory size > low memory limitation, then low
memory range = [0, 2G], and high memory range =
[4G, 4G + requested VM memory size - 2G]
.. figure:: images/over-image13.png
:align: center
UOS Physical Memory Layout
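
A small sketch of these two rules, using the 2 GB low-memory limit
mentioned above, is shown below; the names are illustrative and not the
DM's actual code.

.. code-block:: c

   #include <stdint.h>

   #define LOW_MEM_LIMIT (2ull << 30)   /* 2 GB, as defined in the DM */
   #define HIGH_MEM_BASE (4ull << 30)   /* high range starts at 4 GB  */

   struct uos_mem_layout {
       uint64_t low_size;               /* low range:  [0, low_size) */
       uint64_t high_start, high_size;  /* high range: [4 GB, 4 GB + high_size) */
   };

   static struct uos_mem_layout uos_e820_layout(uint64_t requested)
   {
       struct uos_mem_layout l = { 0, 0, 0 };

       if (requested <= LOW_MEM_LIMIT) {
           l.low_size = requested;
       } else {
           l.low_size   = LOW_MEM_LIMIT;
           l.high_start = HIGH_MEM_BASE;
           l.high_size  = requested - LOW_MEM_LIMIT;
       }
       return l;
   }
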
UOS Memory Allocation
=====================
The DM allocates UOS memory based on the hugetlb mechanism by default.
The real memory mapping may be scattered in the SOS physical
memory space, as shown in :numref:`overview-mem-layout`:
.. figure:: images/over-image15.png
:align: center
:name: overview-mem-layout
UOS Physical Memory Layout Based on Hugetlb
The User OS's memory is allocated by the Service OS DM application; it may
come from different huge pages in the Service OS, as shown in
:numref:`overview-mem-layout`.
As the Service OS has full knowledge of these huge pages' size,
GPA\ :sup:`SOS`, and GPA\ :sup:`UOS`, it works with the hypervisor
to complete the UOS's host-to-guest mapping using this pseudo code:
.. code-block:: none

   for x in allocated huge pages do
      x.hpa = gpa2hpa_for_sos(x.sos_gpa)
      host2guest_map_for_uos(x.hpa, x.uos_gpa, x.size)
   end
Virtual Slim bootloader
=======================
Virtual Slim bootloader (vSBL) is the virtual bootloader that supports
booting the UOS on the ACRN hypervisor platform. The vSBL design is
derived from Slim Bootloader. It follows a staged design approach in which
hardware initialization is followed by payload launching, which provides the
boot logic. As shown in :numref:`overview-sbl`, the virtual SBL has an
initialization unit to initialize virtual hardware, and a payload unit
to boot Linux or Android guest OS.
.. figure:: images/over-image110.png
:align: center
:name: overview-sbl
vSBL System Context Diagram
The vSBL image is released as a part of the Service OS (SOS) root
filesystem (rootfs). The vSBL is copied to UOS memory by the VM manager
in the SOS while creating the virtual BSP of the UOS. The SOS passes the
start of vSBL and related information to HV. HV sets guest RIP of UOS
virtual BSP as the start of vSBL and related guest registers, and
launches the UOS virtual BSP. The vSBL starts running in the virtual
real mode within the UOS. Conceptually, vSBL is part of the UOS runtime.
In the current design, the vSBL supports booting Android guest OS or
Linux guest OS using the same vSBL image.
For an Android VM, the vSBL will load and verify trusty OS first, and
trusty OS will then load and verify Android OS according to the Android
OS verification mechanism.
Freedom From Interference
*************************
The hypervisor is critical for preventing inter-VM interference, using
the following mechanisms:
- Each physical CPU is dedicated to one vCPU.
  Sharing a physical CPU among multiple vCPUs gives rise to multiple
  sources of interference, such as the vCPU of one VM flushing the
  L1 & L2 cache of another, or a flood of interrupts for one VM
  delaying the execution of another. It also requires vCPU
  scheduling in the hypervisor to consider more complexities such as
  scheduling latency and vCPU priority, exposing more opportunities
  for one VM to interfere with another.

  To prevent such interference, the ACRN hypervisor adopts static
  core partitioning by dedicating each physical CPU to one vCPU. The
  physical CPU loops in idle when the vCPU is paused by I/O
  emulation. This makes vCPU scheduling deterministic and minimizes
  physical resource sharing.
- Hardware mechanisms including EPT, VT-d, SMAP and SMEP are leveraged
to prevent unintended memory accesses.
  Memory corruption can be a common failure mode. The ACRN hypervisor properly
  sets up the memory-related hardware mechanisms to ensure that:

  1. the SOS cannot access the memory of the hypervisor, unless explicitly
     allowed,
  2. the UOS cannot access the memory of the SOS and the hypervisor, and
  3. the hypervisor does not unintentionally access the memory of the SOS or UOS.
- The destination of external interrupts is set to the physical core
  where the VM that handles them is running.
  External interrupts are always handled by the hypervisor in ACRN.
  Excessive interrupts to one VM (say VM A) could slow down another
  VM (VM B) if they are handled by the physical core running VM B
  instead of VM A. Two mechanisms are designed to mitigate such
  interference:

  1. The destination of an external interrupt is set to the physical core
     that runs the vCPU where virtual interrupts will be injected.
  2. The hypervisor exposes statistics on the total number of received
     interrupts to the SOS via a hypercall, and has a delay mechanism to
     temporarily block certain virtual interrupts from being injected.
     This allows the SOS to detect the occurrence of an interrupt storm and
     control the interrupt injection rate when necessary.
- Mitigation of DMA storm.
(To be documented later.)
Boot Flow
*********
.. figure:: images/over-image85.png
:align: center
ACRN Boot Flow
Power Management
****************
CPU P-state & C-state
=====================
In ACRN, CPU P-state and C-state (Px/Cx) are controlled by the guest OS.
The corresponding governors are managed in the SOS/UOS for best power
efficiency and simplicity.

The guest should be able to process ACPI P/C-state requests from OSPM.
The needed ACPI objects for P/C-state management should be ready in the
ACPI table.

The hypervisor can restrict a guest's P/C-state requests (per customer
requirement). MSR accesses of P-state requests can be intercepted by
the hypervisor and forwarded to the host directly if the requested
P-state is valid. Guest MWAIT/port I/O accesses for C-state control can
be passed through to the host with no hypervisor interception to minimize
the performance impact.
This diagram shows CPU P/C-state management blocks:
.. figure:: images/over-image4.png
:align: center
CPU P/C-state management block diagram
System power state
==================
ACRN supports the ACPI standard-defined system power states S3 and S5.
For each guest, ACRN assumes the guest implements OSPM and controls its
own power state accordingly. ACRN does not get involved in the guest OSPM;
instead, it traps the power state transition requests from the guest and
emulates them.
.. figure:: images/over-image21.png
:align: center
:name: overview-pm-block
ACRN Power Management Diagram Block
:numref:`overview-pm-block` shows the basic block diagram for ACRN PM.
The OSPM in each guest manages the guest power state transition. The
Device Model running in the SOS traps and emulates the power state
transitions of the UOS (Linux VM or Android VM in
:numref:`overview-pm-block`). The VM Manager tracks all UOS power states and
notifies the OSPM of the SOS (Service OS in :numref:`overview-pm-block`) once
the active UOS is in the required power state.

Then the OSPM of the SOS starts the power state transition of the SOS, which is
trapped to the "Sx Agency" in ACRN, which then carries out the power state
transition.
Some details about the ACPI tables for the UOS and SOS:

- The ACPI table in the UOS is emulated by the Device Model. The Device Model
  knows which register the UOS writes to trigger power state
  transitions, and must register an I/O handler for it.
- The ACPI table in the SOS is passed through; there is no ACPI parser
  in the ACRN HV. The power-management-related ACPI table is
  generated offline and hard-coded in the ACRN HV.
View File
@@ -0,0 +1,179 @@
.. _hld-power-management:
Power Management high-level design
##################################
P-state/C-state management
**************************
ACPI Px/Cx data
===============
CPU P-state/C-state are controlled by the guest OS. The ACPI
P/C-state driver relies on some P/C-state-related ACPI data in the guest
ACPI table.
The SOS can run the ACPI driver with no problem because it can access the
native ACPI table. For the UOS though, we need to prepare the corresponding
ACPI data for the Device Model to build a virtual ACPI table.
The Px/Cx data includes four
ACPI objects: _PCT, _PPC, and _PSS for P-state management, and _CST for
C-state management. All these ACPI data must be consistent with the
native data because the control method is a kind of pass through.
These ACPI object data are parsed by an offline tool and hard-coded in a
hypervisor module named the CPU state table:
.. code-block:: c
struct cpu_px_data {
uint64_t core_frequency; /* megahertz */
uint64_t power; /* milliWatts */
uint64_t transition_latency; /* microseconds */
uint64_t bus_master_latency; /* microseconds */
uint64_t control; /* control value */
uint64_t status; /* success indicator */
} __attribute__((aligned(8)));
struct acpi_generic_address {
uint8_t space_id;
uint8_t bit_width;
uint8_t bit_offset;
uint8_t access_size;
uint64_t address;
} __attribute__((aligned(8)));
struct cpu_cx_data {
struct acpi_generic_address cx_reg;
uint8_t type;
uint32_t latency;
uint64_t power;
} __attribute__((aligned(8)));
With these Px/Cx data, the Hypervisor is able to intercept guest's
P/C-state requests with desired restrictions.
Virtual ACPI table build flow
=============================
:numref:`vACPItable` shows how to build virtual ACPI table with
Px/Cx data for UOS P/C-state management:
.. figure:: images/hld-pm-image28.png
:align: center
:name: vACPItable
System block for building vACPI table with Px/Cx data
Some ioctl APIs are defined for the Device Model to query Px/Cx data from
the SOS VHM. The hypervisor needs to provide hypercall APIs to transfer the
Px/Cx data from the CPU state table to the SOS VHM.

The build flow is:

1) Use an offline tool (e.g. **iasl**) to parse the Px/Cx data and hard-code it
   into the CPU state table in the hypervisor. The hypervisor loads the data
   after system boot up.
2) Before UOS launching, the Device Model queries the Px/Cx data from the SOS
   VHM via the ioctl interface.
3) The VHM transmits the query request to the hypervisor by hypercall.
4) The hypervisor returns the Px/Cx data.
5) The Device Model builds the virtual ACPI table with these Px/Cx data.
Intercept Policy
================
The hypervisor should be able to restrict a guest's
P/C-state requests, with a user-customized policy.

The hypervisor should intercept a guest P-state request and validate whether
it is a valid P-state. Any invalid P-state (e.g. one that doesn't exist in the
CPU state table) should be rejected.
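A minimal sketch of such a validation check, reusing the ``struct cpu_px_data``
layout shown earlier (the function and parameter names are illustrative, not
the actual hypervisor code):

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Accept a P-state request only if its control value appears in the
    * hard-coded CPU state table; anything else is rejected. */
   static bool is_valid_px_request(const struct cpu_px_data *px_table,
                                   uint32_t px_cnt, uint64_t requested_control)
   {
           uint32_t i;

           for (i = 0U; i < px_cnt; i++) {
                   if (px_table[i].control == requested_control) {
                           return true;    /* valid: forward the MSR write to the host */
                   }
           }
           return false;                   /* invalid: drop or ignore the request */
   }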
It is better not to intercept C-state request because the trap would
impact both power and performance.
.. note:: For P-state control, pay attention to the SoC core
   voltage domain design when doing P-state measurements. The highest
   P-state wins if cores sharing the same voltage domain request
   different P-states. In this case APERF/MPERF must be used to see
   what P-state was granted on that core.
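As a reference for such a measurement, the delivered frequency can be estimated
from the APERF/MPERF ratio. This is a hedged sketch assuming an ``rdmsr()``
helper is available in the measurement environment; the MSR numbers are the
architectural IA32_MPERF (0xE7) and IA32_APERF (0xE8):

.. code-block:: c

   #include <stdint.h>

   #define MSR_IA32_MPERF  0xE7U   /* counts at a fixed reference (base) frequency */
   #define MSR_IA32_APERF  0xE8U   /* counts at the actual delivered frequency */

   uint64_t rdmsr(uint32_t msr);   /* assumed helper to read an MSR on this core */

   /* Estimate the delivered core frequency (MHz) over a sampling window.
    * base_freq_mhz is the frequency at which MPERF counts. */
   static uint64_t measured_freq_mhz(uint64_t base_freq_mhz,
                                     uint64_t aperf_start, uint64_t mperf_start)
   {
           uint64_t aperf_delta = rdmsr(MSR_IA32_APERF) - aperf_start;
           uint64_t mperf_delta = rdmsr(MSR_IA32_MPERF) - mperf_start;

           return (mperf_delta == 0U) ? 0U :
                  (base_freq_mhz * aperf_delta) / mperf_delta;
   }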
S3/S5
*****
ACRN assumes guest has complete S3/S5 power state management and follows
the ACPI standard exactly. System S3/S5 needs to follow well-defined
enter/exit paths and cooperate among different components.
System low power state enter process
====================================
Each time the OSPM of the UOS starts a power state transition, it eventually
writes the ACPI register per the ACPI spec requirements.
With the help of the ACRN I/O emulation framework, the UOS ACPI
register write is dispatched to the Device Model, and the Device Model
emulates the UOS power state (pausing the UOS VM for S3 and powering off
the UOS VM for S5).

The VM Manager monitors all UOSes. If all active UOSes are in the required power
state, the VM Manager notifies the OSPM of the SOS to start the SOS power state
transition. The OSPM of the SOS follows a very similar process as the UOS for the
power state transition. The difference is that the SOS ACPI register write is
trapped to the ACRN HV, and the ACRN HV emulates the SOS power state (pausing the
SOS VM for S3, and taking no special action for S5).
Once SOS low power state is done, ACRN HV will go through its own low
power state enter path.
The whole system is finally put into low power state.
System low power state exit process
===================================
The low power state exit process is in reverse order. The ACRN
hypervisor is woken up at first. It will go through its own low power
state exit path. Then ACRN hypervisor will resume the SOS to let SOS go
through SOS low power state exit path. After that, the DM is resumed and
let UOS go through UOS low power state exit path. The system is resumed
to running state after at least one UOS is resumed to running state.
:numref:`pmworkflow` shows the flow of low power S3 enter/exit process (S5 follows
very similar process)
.. figure:: images/hld-pm-image62.png
:align: center
:name: pmworkflow
ACRN system power management workflow
For system power state entry:
1. The UOS OSPM starts UOS S3 entry
2. The UOS S3 entry request is trapped by the ACPI PM device of the DM
3. The DM pauses the UOS VM to emulate UOS S3 and notifies the VM Manager that
   the UOS dedicated to it is in S3
4. If all UOSes are in S3, the VM Manager notifies the OSPM of the SOS
5. The SOS OSPM starts SOS S3 entry
6. The SOS S3 entry request is trapped to the Sx Agency in the ACRN HV
7. The ACRN HV pauses the SOS VM to emulate SOS S3 and starts the ACRN HV S3 entry.
For system power state exit:
1. When the system is resumed from S3, the native bootloader jumps to the
   wake-up vector of the HV
2. The HV resumes from S3 and jumps to the wake-up vector to emulate SOS resume from S3
3. The OSPM of the SOS is running
4. The OSPM of the SOS notifies the VM Manager that it's ready to wake up the UOS
5. The VM Manager notifies the DM to resume the UOS
6. The DM resets the UOS VM to emulate UOS resume from S3
According to the ACPI standard, S3 maps to suspend-to-RAM and S5 maps to
shutdown, so the S5 process is a little different:
- UOS enters S3 -> UOS powers off
- System enters S3 -> System powers off
- System resumes From S3 -> System fresh start
- UOS resumes from S3 -> UOS fresh startup
File diff suppressed because it is too large
View File
@@ -0,0 +1,241 @@
.. _hld-trace-log:
Tracing and Logging high-level design
#####################################
Both Trace and Log are built on top of a mechanism named shared
buffer (Sbuf).
Shared Buffer
*************
Shared Buffer is a ring buffer divided into predetermined-size slots. There
are two use scenarios of Sbuf:
- sbuf can serve as a lockless ring buffer to share data from ACRN HV to
SOS in non-overwritten mode. (Writing will fail if an overrun
happens.)
- sbuf can serve as a conventional ring buffer in hypervisor in
over-written mode. A lock is required to synchronize access by the
producer and consumer.
Both ACRNTrace and ACRNLog use sbuf as a lockless ring buffer. The Sbuf
is allocated by SOS and assigned to HV via a hypercall. To hold pointers
to sbuf passed down via hypercall, an array ``sbuf[ACRN_SBUF_ID_MAX]``
is defined in per_cpu region of HV, with predefined sbuf id to identify
the usage, such as ACRNTrace, ACRNLog, etc.
For each physical CPU there is a dedicated Sbuf. Only a single producer
is allowed to put data into that Sbuf in HV, and a single consumer is
allowed to get data from Sbuf in SOS. Therefore, no lock is required to
synchronize access by the producer and consumer.
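A conceptual sketch of the producer side in the non-overwritten mode is shown
below. The field and function names are illustrative only; the real definitions
live in ``hypervisor/include/debug/sbuf.h``, and memory barriers are omitted for
brevity:

.. code-block:: c

   #include <stdint.h>
   #include <string.h>

   /* Illustrative layout: a ring of fixed-size slots with head/tail indexes. */
   struct sbuf_hdr {
           uint32_t ele_size;      /* size of one slot in bytes */
           uint32_t ele_num;       /* number of slots in the ring */
           uint32_t head;          /* consumer index, advanced only by the SOS */
           uint32_t tail;          /* producer index, advanced only by the HV */
           uint8_t  data[];        /* the slots follow the header */
   };

   /* Producer side (HV): fail instead of overwriting when the ring is full. */
   static int sbuf_put(struct sbuf_hdr *sb, const void *ele)
   {
           uint32_t next_tail = (sb->tail + 1U) % sb->ele_num;

           if (next_tail == sb->head) {
                   return -1;      /* overrun: the write fails in non-overwritten mode */
           }
           memcpy(&sb->data[sb->tail * sb->ele_size], ele, sb->ele_size);
           sb->tail = next_tail;   /* publish the slot after the copy */
           return 0;
   }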
sbuf APIs
=========
.. note:: reference APIs defined in hypervisor/include/debug/sbuf.h
ACRN Trace
**********
ACRNTrace is a tool running on the Service OS (SOS) to capture trace
data. It allows developers to add performance profiling trace points at
key locations to get a picture of what is going on inside the
hypervisor. Scripts to analyze the collected trace data are also
provided.
As shown in :numref:`acrntrace-arch`, ACRNTrace is built using
Shared Buffers (Sbuf), and consists of three parts from bottom layer
up:
- **ACRNTrace userland app**: Userland application collecting trace data to
files (Per Physical CPU)
- **SOS Trace Module**: allocates/frees SBufs, creates device for each
SBuf, sets up sbuf shared between SOS and HV, and provides a dev node for the
userland app to retrieve trace data from Sbuf
- **Trace APIs**: provide APIs to generate trace event and insert to Sbuf.
.. figure:: images/log-image50.png
:align: center
:name: acrntrace-arch
Architectural diagram of ACRNTrace
Trace APIs
==========
.. note:: reference APIs defined in hypervisor/include/debug/trace.h
for trace_entry struct and functions.
SOS Trace Module
================
The SOS trace module is responsible for:

- allocating an sbuf in the SOS memory range for each physical CPU, and
  assigning the GPA of the Sbuf to ``per_cpu sbuf[ACRN_TRACE]``
- creating a misc device for each physical CPU
- providing an mmap operation to map the entire Sbuf to userspace for
  highly flexible and efficient access.

On SOS shutdown, the trace module is responsible for removing the misc devices,
freeing the SBufs, and setting ``per_cpu sbuf[ACRN_TRACE]`` to null.
ACRNTrace Application
=====================
ACRNTrace application includes a binary to retrieve trace data from
Sbuf, and Python scripts to convert trace data from raw format into
readable text, and do analysis.
Figure 2.2 shows the sequence of trace initialization and trace data
collection. With a debug build, trace components are initialized at boot
time. After initialization, the HV writes trace event data into the sbuf
until the sbuf is full, which can happen easily if the ACRNTrace app is not
consuming trace data from the Sbuf in SOS user space.

Once ACRNTrace is launched, a consumer thread is created for each physical CPU
to periodically read RAW trace data from the sbuf and write it to a file.
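A simplified sketch of one such per-CPU consumer loop is shown below; the
device path is only a placeholder for the dev node created by the SOS trace
module, and the polling interval is arbitrary:

.. code-block:: c

   #include <fcntl.h>
   #include <stdint.h>
   #include <unistd.h>

   /* Copy RAW trace data from one per-CPU trace device into an output file.
    * "/dev/acrn_trace_0" would be a placeholder name, not the actual dev node. */
   static void trace_consumer_loop(const char *dev_path, const char *out_path)
   {
           uint8_t buf[4096];
           int dev = open(dev_path, O_RDONLY);
           int out = open(out_path, O_WRONLY | O_CREAT | O_APPEND, 0644);

           if ((dev < 0) || (out < 0)) {
                   return;
           }
           for (;;) {
                   ssize_t n = read(dev, buf, sizeof(buf));

                   if (n > 0) {
                           write(out, buf, (size_t)n);
                   } else {
                           usleep(100000); /* nothing available yet: poll again later */
                   }
           }
   }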
.. note:: figure is missing
Figure 2.2 Sequence of trace init and trace data collection
These are the Python scripts provided:
- **acrntrace_format.py** converts RAW trace data to human-readable
text offline according to given format;
- **acrnalyze.py** analyzes trace data (as output by acrntrace)
based on given analyzer filters, such as vm_exit or irq, and generates a
report.
See :ref:`acrntrace` for details and usage.
ACRN Log
********
acrnlog is a tool used to capture the ACRN hypervisor log to files on the
SOS filesystem. It can run as an SOS service at boot, capturing two
kinds of logs:

- the current runtime logs;
- logs remaining in the buffer from the last crashed run.
Architectural diagram
=====================
Similar to the design of ACRN Trace, ACRN Log is built on the top of
Shared Buffer (Sbuf), and consists of three parts from bottom layer
up:
- **ACRN Log app**: Userland application collecting hypervisor log to
files;
- **SOS ACRN Log Module**: constructs/frees SBufs at reserved memory
area, creates dev for current/last logs, sets up sbuf shared between
SOS and HV, and provides a dev node for the userland app to
retrieve logs
- **ACRN log support in HV**: put logs at specified loglevel to Sbuf.
.. figure:: images/log-image73.png
:align: center
Architectural diagram of ACRN Log
ACRN log support in Hypervisor
==============================
To support acrnlog, the following adaptations were made to the hypervisor log
system:

- log messages with a severity level higher than a specified value are
  put into the Sbuf when calling logmsg in the hypervisor
- an sbuf is allocated to accommodate early hypervisor logs before the SOS
  can allocate and set up the sbuf
There are 6 different loglevels, as shown below. The specified
severity loglevel is stored in ``mem_loglevel``, initialized
by :option:`CONFIG_MEM_LOGLEVEL_DEFAULT`. The loglevel can
be set to a new value
at runtime via hypervisor shell command "loglevel".
.. code-block:: c
#define LOG_FATAL 1U
#define LOG_ACRN 2U
#define LOG_ERROR 3U
#define LOG_WARNING 4U
#define LOG_INFO 5U
#define LOG_DEBUG 6U
The element size of the sbuf for logs is fixed at 80 bytes, and the max size
of a single log message is 320 bytes. Log messages with a length between
80 and 320 bytes are split into multiple sbuf elements. Log messages longer
than 320 bytes are truncated, as sketched below.
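A sketch of how a longer message maps onto the fixed-size elements, reusing the
illustrative ``sbuf_put()`` from the shared-buffer sketch above (not the actual
hypervisor code):

.. code-block:: c

   #include <stdint.h>
   #include <string.h>

   #define LOG_ENTRY_SIZE   80U    /* fixed sbuf element size for log messages */
   #define LOG_MESSAGE_MAX  320U   /* longer messages are truncated */

   /* Split one log message into consecutive 80-byte sbuf elements. */
   static void emit_log(struct sbuf_hdr *log_sbuf, const char *msg, uint32_t len)
   {
           uint32_t off;
           char chunk[LOG_ENTRY_SIZE];

           if (len > LOG_MESSAGE_MAX) {
                   len = LOG_MESSAGE_MAX;          /* truncate overly long messages */
           }
           for (off = 0U; off < len; off += LOG_ENTRY_SIZE) {
                   uint32_t n = ((len - off) > LOG_ENTRY_SIZE) ? LOG_ENTRY_SIZE : (len - off);

                   (void)memset(chunk, 0, sizeof(chunk));
                   (void)memcpy(chunk, msg + off, n);
                   (void)sbuf_put(log_sbuf, chunk); /* the write is dropped on overrun */
           }
   }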
For security, SOS allocates sbuf in its memory range and assigns it to
the hypervisor. To handle log messages before SOS boots, sbuf for each
physical cpu will be allocated in acrn hypervisor memory range for any
early log entries. Once sbuf in the SOS memory range is allocated and
assigned to hypervisor via hypercall, the Hypervisor logmsg will switch
to use SOS allocated sbuf, early logs will be copied, and early sbuf in
hypervisor memory range will be freed.
SOS ACRN Log Module
===================
To enable retrieving log messages from a crash, 4 MB of memory starting at
0x6DE00000 is reserved for acrn log. This space is further divided into
two ranges, one for the current run and one for the previous run:
.. figure:: images/log-image59.png
:align: center
ACRN Log crash log/current log buffers
On SOS boot, the SOS acrnlog module is responsible for:

- examining whether there are log messages remaining from the last crashed
  run by checking the magic number of each sbuf
- if there are previous crash logs, constructing sbufs and creating misc devices
  for these last logs
- constructing an sbuf in the usable buffer range for each physical CPU,
  assigning the GPA of the Sbuf to ``per_cpu sbuf[ACRN_LOG]``, and creating a misc
  device for each physical CPU
- the misc devices implement the read() file operation to allow a
  userspace app to read one Sbuf element.
When checking the validity of the sbufs from the last run, the module sets the
current sbuf magic number to ``0x5aa57aa71aa13aa3`` and changes the
magic number of the last sbuf to ``0x5aa57aa71aa13aa2``, to distinguish the
current buffers from the last ones.

On SOS shutdown, the module is responsible for removing the misc devices,
freeing the SBufs, and setting ``per_cpu sbuf[ACRN_LOG]`` to null.
ACRN Log Application
====================
ACRNLog application reads log messages from sbuf for each physical
CPU and combines them into log files with log messages in ascending
order by the global sequence number. If the sequence number is not
continuous, a warning of "incontinuous logs" will be inserted.
To avoid using up storage space, the size of a single log file and
the total number of log files are both limited. By default, log file
size limitation is 1MB and file number limitation is 4.
If there are last log devices, ACRN log will read out the log
messages, combine them, and save them into last log files.
See :ref:`acrnlog` for usage details.
View File
@@ -0,0 +1,763 @@
.. _hld-virtio-devices:
.. _virtio-hld:
Virtio devices high-level design
################################
The ACRN Hypervisor follows the `Virtual I/O Device (virtio)
specification
<http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html>`_ to
realize I/O virtualization for many performance-critical devices
supported in the ACRN project. Adopting the virtio specification lets us
reuse many frontend virtio drivers already available in a Linux-based
User OS, drastically reducing potential development effort for frontend
virtio drivers. To further reduce the development effort of backend
virtio drivers, the hypervisor provides the virtio backend service
(VBS) APIs, that make it very straightforward to implement a virtio
device in the hypervisor.
The virtio APIs can be divided into 3 groups: DM APIs, virtio backend
service (VBS) APIs, and virtqueue (VQ) APIs, as shown in
:numref:`be-interface`.
.. figure:: images/virtio-hld-image0.png
:width: 900px
:align: center
:name: be-interface
ACRN Virtio Backend Service Interface
- **DM APIs** are exported by the DM, and are mainly used during the
device initialization phase and runtime. The DM APIs also include
PCIe emulation APIs because each virtio device is a PCIe device in
the SOS and UOS.
- **VBS APIs** are mainly exported by the VBS and related modules.
Generally they are callbacks to be
registered into the DM.
- **VQ APIs** are used by a virtio backend device to access and parse
information from the shared memory between the frontend and backend
device drivers.
Virtio framework is the para-virtualization specification that ACRN
follows to implement I/O virtualization of performance-critical
devices such as audio, eAVB/TSN, IPU, and CSMU devices. This section gives
an overview about virtio history, motivation, and advantages, and then
highlights virtio key concepts. Second, this section will describe
ACRN's virtio architectures, and elaborates on ACRN virtio APIs. Finally
this section will introduce all the virtio devices currently supported
by ACRN.
Virtio introduction
*******************
Virtio is an abstraction layer over devices in a para-virtualized
hypervisor. Virtio was developed by Rusty Russell when he worked at IBM
research to support his lguest hypervisor in 2007, and it quickly became
the de facto standard for KVM's para-virtualized I/O devices.
Virtio is very popular for virtual I/O devices because it provides a
straightforward, efficient, standard, and extensible mechanism, and
eliminates the need for boutique, per-environment, or per-OS mechanisms.
For example, rather than having a variety of device emulation
mechanisms, virtio provides a common frontend driver framework that
standardizes device interfaces, and increases code reuse across
different virtualization platforms.
Given the advantages of virtio, ACRN also follows the virtio
specification.
Key Concepts
************
To better understand virtio, especially its usage in ACRN, we'll
highlight several key virtio concepts important to ACRN:
Frontend virtio driver (FE)
  Virtio adopts a frontend-backend architecture that enables a simple but
  flexible framework for both frontend and backend virtio drivers. The FE
  driver merely needs to offer services to configure the interface, pass messages,
  produce requests, and kick the backend virtio driver. As a result, the FE
  driver is easy to implement and the performance overhead of emulating
  a device is eliminated.
Backend virtio driver (BE)
Similar to FE driver, the BE driver, running either in user-land or
kernel-land of the host OS, consumes requests from the FE driver and sends them
to the host native device driver. Once the requests are done by the host
native device driver, the BE driver notifies the FE driver that the
request is complete.
Note: to distinguish BE driver from host native device driver, the host
native device driver is called "native driver" in this document.
Straightforward: virtio devices as standard devices on existing buses
Instead of creating new device buses from scratch, virtio devices are
built on existing buses. This gives a straightforward way for both FE
and BE drivers to interact with each other. For example, FE driver could
read/write registers of the device, and the virtual device could
interrupt FE driver, on behalf of the BE driver, in case something of
interest is happening.
Currently virtio supports PCI/PCIe bus and MMIO bus. In ACRN, only
PCI/PCIe bus is supported, and all the virtio devices share the same
vendor ID 0x1AF4.
Note: For MMIO, the "bus" is a little bit an overstatement since
basically it is a few descriptors describing the devices.
Efficient: batching operation is encouraged
  Batching operations and deferred notification are important to achieve
  high-performance I/O, since notification between the FE and BE driver
  usually involves an expensive exit of the guest. Therefore batching
  operations and notification suppression are highly encouraged when
  possible. This gives an efficient implementation for
  performance-critical devices.
Standard: virtqueue
All virtio devices share a standard ring buffer and descriptor
mechanism, called a virtqueue, shown in :numref:`virtqueue`. A virtqueue is a
queue of scatter-gather buffers. There are three important methods on
virtqueues:
- **add_buf** is for adding a request/response buffer in a virtqueue,
- **get_buf** is for getting a response/request in a virtqueue, and
- **kick** is for notifying the other side for a virtqueue to consume buffers.
The virtqueues are created in guest physical memory by the FE drivers.
BE drivers only need to parse the virtqueue structures to obtain
the requests and process them. How a virtqueue is organized is
specific to the Guest OS. In the Linux implementation of virtio, the
virtqueue is implemented as a ring buffer structure called vring.
In ACRN, the virtqueue APIs can be leveraged directly so that users
don't need to worry about the details of the virtqueue. (Refer to guest
OS for more details about the virtqueue implementation.)
.. figure:: images/virtio-hld-image2.png
:width: 900px
:align: center
:name: virtqueue
Virtqueue
Extensible: feature bits
A simple extensible feature negotiation mechanism exists for each
virtual device and its driver. Each virtual device could claim its
device specific features while the corresponding driver could respond to
the device with the subset of features the driver understands. The
feature mechanism enables forward and backward compatibility for the
virtual device and driver.
Virtio Device Modes
The virtio specification defines three modes of virtio devices:
a legacy mode device, a transitional mode device, and a modern mode
device. A legacy mode device is compliant to virtio specification
version 0.95, a transitional mode device is compliant to both
0.95 and 1.0 spec versions, and a modern mode
device is only compatible to the version 1.0 specification.
In ACRN, all the virtio devices are transitional devices, meaning that
they should be compatible with both 0.95 and 1.0 versions of virtio
specification.
Virtio Device Discovery
Virtio devices are commonly implemented as PCI/PCIe devices. A
virtio device using virtio over PCI/PCIe bus must expose an interface to
the Guest OS that meets the PCI/PCIe specifications.
Conventionally, any PCI device with Vendor ID 0x1AF4,
PCI_VENDOR_ID_REDHAT_QUMRANET, and Device ID 0x1000 through 0x107F
inclusive is a virtio device. Among the Device IDs, the
legacy/transitional mode virtio devices occupy the first 64 IDs ranging
from 0x1000 to 0x103F, while the range 0x1040-0x107F belongs to
virtio modern devices. In addition, the Subsystem Vendor ID should
reflect the PCI/PCIe vendor ID of the environment, and the Subsystem
Device ID indicates which virtio device is supported by the device.
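These discovery rules can be summed up in a small check (a sketch only, based
on the vendor and device ID ranges above):

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   #define PCI_VENDOR_ID_REDHAT_QUMRANET  0x1AF4U

   /* Classify a PCI vendor/device ID pair according to the virtio ID ranges. */
   static bool is_virtio_device(uint16_t vendor, uint16_t device, bool *modern)
   {
           if (vendor != PCI_VENDOR_ID_REDHAT_QUMRANET) {
                   return false;
           }
           if ((device >= 0x1000U) && (device <= 0x103FU)) {
                   *modern = false;        /* legacy/transitional device ID range */
                   return true;
           }
           if ((device >= 0x1040U) && (device <= 0x107FU)) {
                   *modern = true;         /* virtio modern device ID range */
                   return true;
           }
           return false;
   }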
Virtio Frameworks
*****************
This section describes the overall architecture of virtio, and then
introduces ACRN-specific implementations of the virtio framework.
Architecture
============
Virtio adopts a frontend-backend
architecture, as shown in :numref:`virtio-arch`. Basically the FE and BE driver
communicate with each other through shared memory, via the
virtqueues. The FE driver talks to the BE driver in the same way it
would talk to a real PCIe device. The BE driver handles requests
from the FE driver, and notifies the FE driver if the request has been
processed.
.. figure:: images/virtio-hld-image1.png
:width: 900px
:align: center
:name: virtio-arch
Virtio Architecture
In addition to virtio's frontend-backend architecture, both FE and BE
drivers follow a layered architecture, as shown in
:numref:`virtio-fe-be`. Each
side has three layers: transports, core models, and device types.
All virtio devices share the same virtio infrastructure, including
virtqueues, feature mechanisms, configuration space, and buses.
.. figure:: images/virtio-hld-image4.png
:width: 900px
:align: center
:name: virtio-fe-be
Virtio Frontend/Backend Layered Architecture
Virtio Framework Considerations
===============================
How to realize the virtio framework is specific to a
hypervisor implementation. In ACRN, the virtio framework implementations
can be classified into two types, virtio backend service in user-land
(VBS-U) and virtio backend service in kernel-land (VBS-K), according to
where the virtio backend service (VBS) is located. Although different in BE
drivers, both VBS-U and VBS-K share the same FE drivers. The reason
behind the two virtio implementations is to meet the requirement of
supporting a large amount of diverse I/O devices in ACRN project.
When developing a virtio BE device driver, the device owner should choose
carefully between the VBS-U and VBS-K. Generally VBS-U targets
non-performance-critical devices, but enables easy development and
debugging. VBS-K targets performance critical devices.
The next two sections introduce ACRN's two implementations of the virtio
framework.
User-Land Virtio Framework
==========================
The architecture of ACRN user-land virtio framework (VBS-U) is shown in
:numref:`virtio-userland`.
The FE driver talks to the BE driver as if it were talking with a PCIe
device. This means for "control plane", the FE driver could poke device
registers through PIO or MMIO, and the device will interrupt the FE
driver when something happens. For "data plane", the communication
between the FE and BE driver is through shared memory, in the form of
virtqueues.
On the service OS side where the BE driver is located, there are several
key components in ACRN, including device model (DM), virtio and HV
service module (VHM), VBS-U, and user-level vring service API helpers.
DM bridges the FE driver and BE driver since each VBS-U module emulates
a PCIe virtio device. VHM bridges DM and the hypervisor by providing
remote memory map APIs and notification APIs. VBS-U accesses the
virtqueue through the user-level vring service API helpers.
.. figure:: images/virtio-hld-image3.png
:width: 900px
:align: center
:name: virtio-userland
ACRN User-Land Virtio Framework
Kernel-Land Virtio Framework
============================
ACRN supports two kernel-land virtio frameworks: one is VBS-K, designed from
scratch for ACRN; the other is Vhost, which is compatible with Linux Vhost.
VBS-K framework
---------------
The architecture of ACRN VBS-K is shown in
:numref:`kernel-virtio-framework` below.
Generally VBS-K provides acceleration towards performance critical
devices emulated by VBS-U modules by handling the "data plane" of the
devices directly in the kernel. When VBS-K is enabled for certain
devices, the kernel-land vring service API helpers, instead of the
user-land helpers, are used to access the virtqueues shared by the FE
driver. Compared to VBS-U, this eliminates the overhead of copying data
back-and-forth between user-land and kernel-land within service OS, but
pays with the extra implementation complexity of the BE drivers.
Except for the differences mentioned above, VBS-K still relies on VBS-U
for feature negotiations between FE and BE drivers. This means the
"control plane" of the virtio device still remains in VBS-U. When
feature negotiation is done, which is determined by FE driver setting up
an indicative flag, VBS-K module will be initialized by VBS-U.
Afterwards, all request handling will be offloaded to the VBS-K in
kernel.
Finally the FE driver is not aware of how the BE driver is implemented,
either in VBS-U or VBS-K. This saves engineering effort regarding FE
driver development.
.. figure:: images/virtio-hld-image54.png
:align: center
:name: kernel-virtio-framework
ACRN Kernel Land Virtio Framework
Vhost framework
---------------
Vhost is similar to VBS-K. Vhost is a common solution upstreamed in the
Linux kernel, with several kernel mediators based on it.
Architecture
~~~~~~~~~~~~
Vhost/virtio is a semi-virtualized device abstraction interface
specification that has been widely applied in various virtualization
solutions. Vhost is a specific kind of virtio where the data plane is
put into host kernel space to reduce the context switch while processing
the IO request. It is usually called "virtio" when used as a front-end
driver in a guest operating system or "vhost" when used as a back-end
driver in a host. Compared with a pure virtio solution on a host, vhost
uses the same frontend driver as virtio solution and can achieve better
performance. :numref:`vhost-arch` shows the vhost architecture on ACRN.
.. figure:: images/virtio-hld-image71.png
:align: center
:name: vhost-arch
Vhost Architecture on ACRN
Compared with a userspace virtio solution, vhost decomposes data plane
from user space to kernel space. The vhost general data plane workflow
can be described as:
1. The vhost proxy creates two eventfds per virtqueue: one for kick
   (an ioeventfd), the other for call (an irqfd).
2. The vhost proxy registers the two eventfds to the VHM through the VHM character
   device:

   a) The ioeventfd is bound to a PIO/MMIO range. If it is a PIO, it is
      registered with (fd, port, len, value). If it is an MMIO, it is
      registered with (fd, addr, len).
   b) The irqfd is registered with the MSI vector.

3. The vhost proxy sets the two fds to the vhost kernel through ioctls of the vhost
   device.
4. vhost starts polling the kick fd and wakes up when the guest kicks a
   virtqueue, which results in an event_signal on the kick fd by the VHM ioeventfd.
5. The vhost device in the kernel signals on the irqfd to notify the guest.
Ioeventfd implementation
~~~~~~~~~~~~~~~~~~~~~~~~
Ioeventfd module is implemented in VHM, and can enhance a registered
eventfd to listen to IO requests (PIO/MMIO) from vhm ioreq module and
signal the eventfd when needed. :numref:`ioeventfd-workflow` shows the
general workflow of ioeventfd.
.. figure:: images/virtio-hld-image58.png
:align: center
:name: ioeventfd-workflow
ioeventfd general work flow
The workflow can be summarized as:
1. vhost device init: the vhost proxy creates two eventfds, one for ioeventfd and
   one for irqfd.
2. The vhost proxy passes the ioeventfd to the vhost kernel driver.
3. The vhost proxy passes the ioeventfd to the VHM driver.
4. The UOS FE driver triggers an ioreq, which is forwarded to the SOS by the hypervisor.
5. The ioreq is dispatched by the VHM driver to the related VHM client.
6. The ioeventfd VHM client traverses the io_range list and finds the
   corresponding eventfd.
7. The VHM client triggers the signal on the related eventfd.
Irqfd implementation
~~~~~~~~~~~~~~~~~~~~
The irqfd module is implemented in the VHM, and can enhance a registered
eventfd to inject an interrupt into a guest OS when the eventfd gets
signaled. :numref:`irqfd-workflow` shows the general flow for irqfd.
.. figure:: images/virtio-hld-image60.png
:align: center
:name: irqfd-workflow
irqfd general flow
The workflow can be summarized as:
1. vhost device init: the vhost proxy creates two eventfds, one for ioeventfd and
   one for irqfd.
2. The vhost proxy passes the irqfd to the vhost kernel driver.
3. The vhost proxy passes the irqfd to the VHM driver.
4. The vhost device driver signals the irq eventfd once the related native
   transfer is completed.
5. The irqfd-related logic traverses the irqfd list to retrieve the related irq
   information.
6. The irqfd-related logic injects an interrupt through the VHM interrupt API.
7. The interrupt is delivered to the UOS FE driver through the hypervisor.
Virtio APIs
***********
This section provides details on the ACRN virtio APIs. As outlined previously,
the ACRN virtio APIs can be divided into three groups: DM_APIs,
VBS_APIs, and VQ_APIs. The following sections will elaborate on
these APIs.
VBS-U Key Data Structures
=========================
The key data structures for VBS-U are listed as following, and their
relationships are shown in :numref:`VBS-U-data`.
``struct pci_virtio_blk``
An example virtio device, such as virtio-blk.
``struct virtio_common``
A common component to any virtio device.
``struct virtio_ops``
Virtio specific operation functions for this type of virtio device.
``struct pci_vdev``
Instance of a virtual PCIe device, and any virtio
device is a virtual PCIe device.
``struct pci_vdev_ops``
PCIe device's operation functions for this type
of device.
``struct vqueue_info``
Instance of a virtqueue.
.. figure:: images/virtio-hld-image5.png
:width: 900px
:align: center
:name: VBS-U-data
VBS-U Key Data Structures
Each virtio device is a PCIe device. In addition, each virtio device
could have none or multiple virtqueues, depending on the device type.
The ``struct virtio_common`` is a key data structure to be manipulated by
DM, and DM finds other key data structures through it. The ``struct
virtio_ops`` abstracts a series of virtio callbacks to be provided by
device owner.
VBS-K Key Data Structures
=========================
The key data structures for VBS-K are listed as follows, and their
relationships are shown in :numref:`VBS-K-data`.
``struct vbs_k_rng``
In-kernel VBS-K component handling data plane of a
VBS-U virtio device, for example virtio random_num_generator.
``struct vbs_k_dev``
In-kernel VBS-K component common to all VBS-K.
``struct vbs_k_vq``
In-kernel VBS-K component to be working with kernel
vring service API helpers.
``struct vbs_k_dev_inf``
Virtio device information to be synchronized
from VBS-U to VBS-K kernel module.
``struct vbs_k_vq_info``
A single virtqueue information to be
synchronized from VBS-U to VBS-K kernel module.
``struct vbs_k_vqs_info``
Virtqueue(s) information, of a virtio device,
to be synchronized from VBS-U to VBS-K kernel module.
.. figure:: images/virtio-hld-image8.png
:width: 900px
:align: center
:name: VBS-K-data
VBS-K Key Data Structures
In VBS-K, the struct vbs_k_xxx represents the in-kernel component
handling a virtio device's data plane. It presents a char device for VBS-U
to open and register device status after feature negotiation with the FE
driver.
The device status includes negotiated features, number of virtqueues,
interrupt information, and more. All these status will be synchronized
from VBS-U to VBS-K. In VBS-U, the ``struct vbs_k_dev_info`` and ``struct
vbs_k_vqs_info`` will collect all the information and notify VBS-K through
ioctls. In VBS-K, the ``struct vbs_k_dev`` and ``struct vbs_k_vq``, which are
common to all VBS-K modules, are the counterparts to preserve the
related information. The related information is necessary to kernel-land
vring service API helpers.
VHOST Key Data Structures
=========================
The key data structures for vhost are listed as follows.
.. doxygenstruct:: vhost_dev
:project: Project ACRN
.. doxygenstruct:: vhost_vq
:project: Project ACRN
DM APIs
=======
The DM APIs are exported by DM, and they should be used when realizing
BE device drivers on ACRN.
.. doxygenfunction:: paddr_guest2host
:project: Project ACRN
.. doxygenfunction:: pci_set_cfgdata8
:project: Project ACRN
.. doxygenfunction:: pci_set_cfgdata16
:project: Project ACRN
.. doxygenfunction:: pci_set_cfgdata32
:project: Project ACRN
.. doxygenfunction:: pci_get_cfgdata8
:project: Project ACRN
.. doxygenfunction:: pci_get_cfgdata16
:project: Project ACRN
.. doxygenfunction:: pci_get_cfgdata32
:project: Project ACRN
.. doxygenfunction:: pci_lintr_assert
:project: Project ACRN
.. doxygenfunction:: pci_lintr_deassert
:project: Project ACRN
.. doxygenfunction:: pci_generate_msi
:project: Project ACRN
.. doxygenfunction:: pci_generate_msix
:project: Project ACRN
VBS APIs
========
The VBS APIs are exported by VBS related modules, including VBS, DM, and
SOS kernel modules. They can be classified into VBS-U and VBS-K APIs
listed as follows.
VBS-U APIs
----------
These APIs provided by VBS-U are callbacks to be registered to DM, and
the virtio framework within DM will invoke them appropriately.
.. doxygenstruct:: virtio_ops
:project: Project ACRN
.. doxygenfunction:: virtio_pci_read
:project: Project ACRN
.. doxygenfunction:: virtio_pci_write
:project: Project ACRN
.. doxygenfunction:: virtio_interrupt_init
:project: Project ACRN
.. doxygenfunction:: virtio_linkup
:project: Project ACRN
.. doxygenfunction:: virtio_reset_dev
:project: Project ACRN
.. doxygenfunction:: virtio_set_io_bar
:project: Project ACRN
.. doxygenfunction:: virtio_set_modern_bar
:project: Project ACRN
.. doxygenfunction:: virtio_config_changed
:project: Project ACRN
VBS-K APIs
----------
The VBS-K APIs are exported by VBS-K related modules. Users could use
the following APIs to implement their VBS-K modules.
APIs provided by DM
~~~~~~~~~~~~~~~~~~~
.. doxygenfunction:: vbs_kernel_reset
:project: Project ACRN
.. doxygenfunction:: vbs_kernel_start
:project: Project ACRN
.. doxygenfunction:: vbs_kernel_stop
:project: Project ACRN
APIs provided by VBS-K modules in service OS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. kernel-doc:: include/linux/vbs/vbs.h
:functions: virtio_dev_init
virtio_dev_ioctl
virtio_vqs_ioctl
virtio_dev_register
virtio_dev_deregister
virtio_vqs_index_get
virtio_dev_reset
VHOST APIS
==========
APIs provided by DM
-------------------
.. doxygenfunction:: vhost_dev_init
:project: Project ACRN
.. doxygenfunction:: vhost_dev_deinit
:project: Project ACRN
.. doxygenfunction:: vhost_dev_start
:project: Project ACRN
.. doxygenfunction:: vhost_dev_stop
:project: Project ACRN
Linux vhost IOCTLs
------------------
``#define VHOST_GET_FEATURES _IOR(VHOST_VIRTIO, 0x00, __u64)``
This IOCTL is used to get the supported feature flags by vhost kernel driver.
``#define VHOST_SET_FEATURES _IOW(VHOST_VIRTIO, 0x00, __u64)``
This IOCTL is used to set the supported feature flags to vhost kernel driver.
``#define VHOST_SET_OWNER _IO(VHOST_VIRTIO, 0x01)``
This IOCTL is used to set current process as the exclusive owner of the vhost
char device. It must be called before any other vhost commands.
``#define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)``
This IOCTL is used to give up the ownership of the vhost char device.
``#define VHOST_SET_MEM_TABLE _IOW(VHOST_VIRTIO, 0x03, struct vhost_memory)``
This IOCTL is used to convey the guest OS memory layout to vhost kernel driver.
``#define VHOST_SET_VRING_NUM _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring_state)``
This IOCTL is used to set the number of descriptors in virtio ring. It cannot
be modified while the virtio ring is running.
``#define VHOST_SET_VRING_ADDR _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring_addr)``
This IOCTL is used to set the address of the virtio ring.
``#define VHOST_SET_VRING_BASE _IOW(VHOST_VIRTIO, 0x12, struct vhost_vring_state)``
This IOCTL is used to set the base value where virtqueue looks for available
descriptors.
``#define VHOST_GET_VRING_BASE _IOWR(VHOST_VIRTIO, 0x12, struct vhost_vring_state)``
This IOCTL is used to get the base value where virtqueue looks for available
descriptors.
``#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring_file)``
This IOCTL is used to set the eventfd on which vhost can poll for guest
virtqueue kicks.
``#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring_file)``
  This IOCTL is used to set the eventfd used by vhost to inject a
  virtual interrupt.
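The sketch below shows how a vhost proxy might use a subset of these IOCTLs to
wire up one virtqueue. It assumes the standard Linux ``<linux/vhost.h>``
definitions and omits error handling as well as the memory-table and
vring-address setup; it is illustrative, not the actual ACRN DM code:

.. code-block:: c

   #include <sys/eventfd.h>
   #include <sys/ioctl.h>
   #include <linux/vhost.h>

   /* Bind one virtqueue of a vhost device to a kick eventfd and a call eventfd. */
   static void setup_vhost_vq(int vhost_fd, unsigned int vq_idx, unsigned int vq_size)
   {
           struct vhost_vring_state num  = { .index = vq_idx, .num = vq_size };
           struct vhost_vring_file  kick = { .index = vq_idx, .fd = eventfd(0, 0) };
           struct vhost_vring_file  call = { .index = vq_idx, .fd = eventfd(0, 0) };

           ioctl(vhost_fd, VHOST_SET_OWNER);              /* must precede other vhost commands */
           ioctl(vhost_fd, VHOST_SET_VRING_NUM, &num);    /* number of ring descriptors */
           ioctl(vhost_fd, VHOST_SET_VRING_KICK, &kick);  /* guest-to-host notification fd */
           ioctl(vhost_fd, VHOST_SET_VRING_CALL, &call);  /* host-to-guest interrupt fd */
           /* The kick/call fds would also be registered with the VHM as an
            * ioeventfd and an irqfd, as described in the vhost sections above. */
   }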
VHM eventfd IOCTLs
------------------
.. doxygenstruct:: acrn_ioeventfd
:project: Project ACRN
``#define IC_EVENT_IOEVENTFD _IC_ID(IC_ID, IC_ID_EVENT_BASE + 0x00)``
This IOCTL is used to register/unregister ioeventfd with appropriate address,
length and data value.
.. doxygenstruct:: acrn_irqfd
:project: Project ACRN
``#define IC_EVENT_IRQFD _IC_ID(IC_ID, IC_ID_EVENT_BASE + 0x01)``
This IOCTL is used to register/unregister irqfd with appropriate MSI information.
VQ APIs
=======
The virtqueue APIs, or VQ APIs, are used by a BE device driver to
access the virtqueues shared by the FE driver. The VQ APIs abstract the
details of virtqueues so that users don't need to worry about the data
structures within the virtqueues. In addition, the VQ APIs are designed
to be identical between VBS-U and VBS-K, so that users don't need to
learn different APIs when implementing BE drivers based on VBS-U and
VBS-K.
.. doxygenfunction:: vq_interrupt
:project: Project ACRN
.. doxygenfunction:: vq_getchain
:project: Project ACRN
.. doxygenfunction:: vq_retchain
:project: Project ACRN
.. doxygenfunction:: vq_relchain
:project: Project ACRN
.. doxygenfunction:: vq_endchains
:project: Project ACRN
Below is an example showing a typical logic of how a BE driver handles
requests from a FE driver.
.. code-block:: c
   static void BE_callback(struct pci_virtio_xxx *pv, struct vqueue_info *vq) {
       struct iovec iov;
       uint16_t idx;
       uint32_t len = 0;  /* length of the processed request/response data */

       while (vq_has_descs(vq)) {
           vq_getchain(vq, &idx, &iov, 1, NULL);
           /* handle requests in iov */
           request_handle_proc();
           /* Release this chain and handle more */
           vq_relchain(vq, idx, len);
       }
       /* Generate interrupt if appropriate. 1 means ring empty */
       vq_endchains(vq, 1);
   }
Supported Virtio Devices
************************
All the BE virtio drivers are implemented using the
ACRN virtio APIs, and the FE drivers are reusing the standard Linux FE
virtio drivers. For the devices with FE drivers available in the Linux
kernel, they should use standard virtio Vendor ID/Device ID and
Subsystem Vendor ID/Subsystem Device ID. For other devices within ACRN,
their temporary IDs are listed in the following table.
.. table:: Virtio Devices without existing FE drivers in Linux
:align: center
:name: virtio-device-table
+--------------+-------------+-------------+-------------+-------------+
| virtio | Vendor ID | Device ID | Subvendor | Subdevice |
| device | | | ID | ID |
+--------------+-------------+-------------+-------------+-------------+
| RPMB | 0x8086 | 0x8601 | 0x8086 | 0xFFFF |
+--------------+-------------+-------------+-------------+-------------+
| HECI | 0x8086 | 0x8602 | 0x8086 | 0xFFFE |
+--------------+-------------+-------------+-------------+-------------+
| audio | 0x8086 | 0x8603 | 0x8086 | 0xFFFD |
+--------------+-------------+-------------+-------------+-------------+
| IPU | 0x8086 | 0x8604 | 0x8086 | 0xFFFC |
+--------------+-------------+-------------+-------------+-------------+
| TSN/AVB | 0x8086 | 0x8605 | 0x8086 | 0xFFFB |
+--------------+-------------+-------------+-------------+-------------+
| hyper_dmabuf | 0x8086 | 0x8606 | 0x8086 | 0xFFFA |
+--------------+-------------+-------------+-------------+-------------+
| HDCP | 0x8086 | 0x8607 | 0x8086 | 0xFFF9 |
+--------------+-------------+-------------+-------------+-------------+
| COREU | 0x8086 | 0x8608 | 0x8086 | 0xFFF8 |
+--------------+-------------+-------------+-------------+-------------+
The following sections introduce the status of virtio devices currently
supported in ACRN.
.. toctree::
:maxdepth: 1
virtio-blk
virtio-net
virtio-input
virtio-console
virtio-rnd
View File
@@ -0,0 +1,161 @@
.. _hld-vm-management:
VM Management high-level design
###############################
Management of a Virtual Machine (VM) means to switch a VM to the right
state, according to the requirements of applications or system power
operations.
VM state
********
Generally, a VM is not running at the beginning: it is in a 'stopped'
state. After its UOS is launched successfully, the VM enters a 'running'
state. When the UOS powers off, the VM returns to the 'stopped' state.
A UOS can sleep while it is running, so there is also a 'paused' state.

Because VMs are designed to work under an SOS environment, a VM can
only run and change its state when the SOS is running. A VM must be put into
the 'paused' or 'stopped' state before the SOS can sleep or power off.
Otherwise the VM may be damaged and user data could be lost.
Scenarios of VM state change
****************************
Button-initiated System Power On
================================
When the user presses the power button to power on the system,
everything is started at the beginning. VMs that run user applications
are launched automatically after the SOS is ready.
Button-initiated VM Power on
============================
At SOS boot up, SOS-Life-Cycle-Service and Acrnd are automatically started
as system services. SOS-Life-Cycle-Service notifies Acrnd that SOS is
started, then Acrnd starts an Acrn-DM for launching each UOS, whose state
changes from 'stopped' to 'running'.
Button-initiated VM Power off
=============================
When the SOS is about to shut down, the IOC powers off all VMs.
SOS-Life-Cycle-Service delays the SOS shutdown operation using the heartbeat,
and waits for Acrnd to notify it that it can shut down.

Acrnd keeps querying the states of all VMs. When all of them are 'stopped',
it notifies SOS-Life-Cycle-Service. SOS-Life-Cycle-Service then stops sending
the delay-shutdown heartbeat, allowing the SOS to continue the shutdown process.
RTC S3/S5 entry
===============
The UOS asks Acrnd to resume/restart it later by sending an RTC timer request,
and then suspends/powers off. The SOS suspends/powers off before that RTC
timer expires. Acrnd stores the RTC resume/restart time in a file, and
sends the RTC timer request to SOS-Life-Cycle-Service.
SOS-Life-Cycle-Service sets the RTC timer in the IOC. Finally, the SOS is
suspended/powered off.
RTC S3/S5 exiting
=================
The SOS is resumed/started by the IOC RTC timer. SOS-Life-Cycle-Service notifies
Acrnd that the SOS has become alive again. Acrnd checks that the wakeup reason
is that the SOS was resumed/started by the IOC RTC. It then reads the UOS
resume/restart time from the file, and resumes/restarts the UOS when that
time expires.
VM State management
*******************
Overview of VM State Management
===============================
Management of VMs on the SOS uses the
SOS-Life-Cycle-Service, Acrnd, and Acrn-dm, working together and using the
Acrn-Manager API as the IPC interface.

* The Lifecycle-Service gets the wakeup reason from the IOC controller. It can set
  a different power cycle method, and the RTC timer, by sending a heartbeat to the IOC
  with the proper data.
* Acrnd gets the wakeup reason from the Lifecycle-Service and forwards it to
  Acrn-dm. It coordinates the lifecycle of the VMs and the SOS and handles IOC-timed
  wakeup/poweron.
* Acrn-Dm is the device model of a VM running on the SOS. The virtual IOC
  inside Acrn-DM is responsible for controlling the VM power state, usually triggered by Acrnd.
SOS Life Cycle Service
======================
SOS-Life-Cycle-Service (SOS-LCS) is a daemon service running on SOS.
SOS-LCS listens on the ``/dev/cbc-lifecycle`` tty port to receive "wakeup
reason" information from the IOC controller. SOS-LCS keeps reading the system
status from the IOC to discover which power cycle method the IOC is
performing. SOS-LCS replies with a heartbeat to the IOC. This heartbeat can tell
the IOC to keep using this power cycle method, or to change to another power
cycle method. The SOS-LCS heartbeat can also set an RTC timer in the IOC.
SOS-LCS handles SHUTDOWN, SUSPEND, and REBOOT acrn-manager message
requests from Acrnd. When these messages are received, SOS-LCS switches the IOC
power cycle method to shutdown, suspend, or reboot, respectively.

SOS-LCS handles WAKEUP_REASON acrn-manager message requests from Acrnd.
When it receives this message, SOS-LCS sends the "wakeup reason" to Acrnd.

SOS-LCS handles RTC_TIMER acrn-manager message requests from Acrnd.
When it receives this message, SOS-LCS sets up the IOC RTC timer for Acrnd.

SOS-LCS notifies Acrnd the moment the system becomes alive again from another
state.
Acrnd
=====
Acrnd is a daemon service running on SOS.
Acrnd can start/resume VMs and query VM states for SOS-LCS, helping
SOS-LCS to decide which power cycle method is right. It also helps UOS
to be started/resumed by timer, required by S3/S5 feature.
Acrnd forwards the wakeup reason to acrn-dm. Acrnd is responsible for retrieving
the wakeup reason from the SOS-LCS service and attaching the wakeup reason to
the acrn-dm parameters for ioc-dm.

When the SOS is about to suspend/shutdown, the SOS lifecycle service sends a
request to Acrnd to guarantee all guest VMs are suspended or shut down
before the SOS suspend/shutdown process continues. On receiving the
request, Acrnd starts polling the guest VM states, and notifies the SOS
lifecycle service when all guest VMs have been put into the proper state gracefully.
A guest UOS may need to resume/start at a future time for some tasks. To
set up a timed resume/start, ioc-dm sends a request to acrnd, which
maintains a list of timed requests from the guest VMs. acrnd selects the
nearest request and sends it to the SOS lifecycle service, which sets up the
physical IOC.
Acrn-DM
=======
Acrn-Dm is the device model of VM running on SOS. Dm-IOC inside Acrn-DM
operates virtual IOC to control VM power state, and collects VM power
state information. Acrn-DM Monitor abstracts these Virtual IOC
functions into monitor-vm-ops, and allows Acrnd to use them via
Acrn-Manager IPC helper functions.
Acrn-manager IPC helper
=======================
SOS-LCS, Acrnd, and Acrn-DM use sockets for IPC. The Acrn-Manager IPC helper API
makes the socket transparent for them. These are:

- int mngr_open_un() - create a descriptor for VM management IPC
- void mngr_close() - close the descriptor and release its resources
- int mngr_add_handler() - add a handler for a specified message
- int mngr_send_msg() - send a message and wait for acknowledgement
View File
@@ -0,0 +1,4 @@
.. _hld-vsbl:
Virtual Slim-Bootloader high-level design
#########################################
View File
@@ -0,0 +1,51 @@
.. _hv-config:
Compile-time Configuration
##########################
The hypervisor provides a kconfig-like way of manipulating compile-time
configurations. Basically, the hypervisor defines a set of configuration
symbols and declares their default values. A configuration file containing
the values of each symbol is created before building the sources.
Similar to Linux kconfig, there are three files involved:
- **.config** This file stores the values of all configuration
  symbols.
- **config.mk** This file is a conversion of .config into Makefile
  syntax, and can be included in makefiles so that the build
  process can rely on the configurations.
- **config.h** This file is a conversion of .config into C syntax, and is
  automatically included in every source file so that the values of
  the configuration symbols are available in the sources.
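For illustration, a single integer symbol (the value here is made up) would
appear roughly as follows across the three generated files:

.. code-block:: c

   /* .config   : CONFIG_MEM_LOGLEVEL_DEFAULT=5   (value stored by defconfig/menuconfig) */
   /* config.mk : CONFIG_MEM_LOGLEVEL_DEFAULT=5   (the same assignment in Makefile syntax) */
   /* config.h  : the C form below, automatically included in every source file */
   #define CONFIG_MEM_LOGLEVEL_DEFAULT 5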
.. figure:: images/config-image103.png
:align: center
:name: config-build-workflow
Hypervisor configuration and build workflow
:numref:`config-build-workflow` shows the workflow of building the
hypervisor:
1. Three targets are introduced for manipulating the configurations.
a. **defconfig** creates a .config based on a predefined
configuration file.
b. **oldconfig** updates an existing .config after creating one if it
does not exist.
c. **menuconfig** presents a terminal UI to navigate and modify the
configurations in an interactive manner.
2. The target oldconfig is also used to create a .config if a .config
file does not exist when building the source directly.
3. The other two files for makefiles and C sources are regenerated after
.config changes.
Refer to :ref:`configuration` for a complete list of configuration symbols.
View File
@@ -0,0 +1,101 @@
.. _hv-console-shell-uart:
Hypervisor console, hypervisor shell, and virtual UART
######################################################
.. _hv-console:
Hypervisor console
******************
The hypervisor console is a text-based terminal accessible from UART.
:numref:`console-processing` shows the workflow of the console:
.. figure:: images/console-image93.png
:align: center
:name: console-processing
Periodic console processing
A periodic timer is set on initialization to trigger console processing every 40ms.
Processing behavior depends on whether the vUART
is active:
- If it is not active, the hypervisor shell is kicked to handle
inputs from the physical UART, if there are any.
- If the vUART is active, the bytes from
the physical UART are redirected to the RX fifo of the vUART, and those
in the vUART TX fifo to the physical UART.
.. note:: The console is only available in the debug version of the hypervisor,
configured at compile time. In the release version, the console is
disabled and the physical UART is not used by the hypervisor or SOS.
Hypervisor shell
****************
For debugging, the hypervisor shell provides commands to list some
internal states and statistics of the hypervisor. It is accessible on
the physical UART only when the vUART is deactivated. See
:ref:`acrnshell` for the list of available hypervisor shell commands.
Virtual UART
************
Currently the UART 16550 is owned by the hypervisor itself and is used for
debugging purposes. Its properties are configured by the hypervisor command
line. The hypervisor emulates a UART device at address 0x3F8 for the SOS; it
acts as the console of the SOS with these features:

- The vUART is exposed via I/O port 0x3f8.
- It incorporates a 256-byte RX buffer and a 65536-byte TX buffer.
- Input/output bytes and the related interrupts are fully emulated.
- For other read-write registers the value is stored without effect
  and reads return the latest stored value. For read-only registers,
  writes are ignored.
- The vUART is activated via a shell command and deactivated via a hotkey.
The following diagram shows the activation state transition of vUART.
.. figure:: images/console-image41.png
:align: center
   vUART activation state transition
Specifically:
- After initialization vUART is disabled.
- The vUART is activated after the command "vm_console" is executed on
the hypervisor shell. Inputs to the physical UART will be
redirected to the vUART starting from the next timer event.
- The vUART is deactivated after a :kbd:`Ctrl + Space` hotkey is received
from the physical UART. Inputs to the physical UART will be
handled by the hypervisor shell starting from the next timer
event.
The workflows are described as follows:
- RX flow:
- Characters are read from the UART HW into an sbuf whose size is 2048
  bytes, triggered by console_read
- Characters are read from this sbuf and put to rxFIFO,
triggered by vuart_console_rx_chars
- A virtual interrupt is sent to SOS, triggered by a read from
SOS. Characters in rxFIFO are sent to SOS by emulation of
read of register UART16550_RBR
- TX flow:
- Characters are put to txFIFO by emulation of write of register
UART16550_THR
- Characters in txFIFO are read out one by one and sent to console
by printf, triggered by vuart_console_tx_chars
- Implementation of printf is based on console, which finally sends
characters to UART HW by writing to register UART16550_RBR
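As an illustration of the RX path above, here is a minimal sketch. It assumes
hypothetical helper names (``sbuf_get_byte``, ``vuart_trigger_virq``) and is
not the actual ACRN implementation:

.. code-block:: c

   /* Sketch of the vUART RX path: drain the sbuf filled by console_read()
    * into the 256-byte rxFIFO, then raise a virtual interrupt to the SOS. */
   #include <stdbool.h>
   #include <stdint.h>

   #define RX_FIFO_SIZE 256U

   struct vuart_fifo {
       uint8_t  buf[RX_FIFO_SIZE];
       uint32_t tail;
       uint32_t count;
   };

   /* Assumed helpers (hypothetical). */
   extern bool sbuf_get_byte(uint8_t *ch);
   extern void vuart_trigger_virq(void);

   static void rx_fifo_put(struct vuart_fifo *f, uint8_t ch)
   {
       if (f->count < RX_FIFO_SIZE) {
           f->buf[f->tail] = ch;
           f->tail = (f->tail + 1U) % RX_FIFO_SIZE;
           f->count++;
       } /* else: the FIFO is full and the character is dropped */
   }

   /* Called from the periodic console timer while the vUART is active. */
   void vuart_console_rx_chars_sketch(struct vuart_fifo *rxfifo)
   {
       uint8_t ch;
       bool received = false;

       while (sbuf_get_byte(&ch)) {
           rx_fifo_put(rxfifo, ch);
           received = true;
       }
       if (received) {
           /* The SOS later drains rxFIFO via emulated reads of the RBR. */
           vuart_trigger_virq();
       }
   }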

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,261 @@
.. _hv-device-passthrough:
Device Passthrough
##################
A critical part of virtualization is virtualizing devices: exposing all
aspects of a device including its I/O, interrupts, DMA, and configuration.
There are three typical device
virtualization methods: emulation, para-virtualization, and passthrough.
Both emulation and passthrough are used in the ACRN project. Device
emulation is discussed in :ref:`hld-io-emulation` and
device passthrough is discussed here.

In the ACRN project, device emulation means emulating all existing hardware
resources through a software component, the device model, running in the
Service OS (SOS). Device
emulation must maintain the same SW interface as a native device,
providing transparency to the VM software stack. Passthrough, implemented in
the hypervisor, assigns a physical device to a VM so the VM can access
the hardware device directly with minimal (if any) VMM involvement.

The difference between device emulation and passthrough is shown in
:numref:`emu-passthru-diff`. Notice that device emulation has
a longer access path, which results in worse performance compared with
passthrough. Passthrough can deliver near-native performance, but
can't support device sharing.
.. figure:: images/passthru-image30.png
:align: center
:name: emu-passthru-diff
Difference between Emulation and passthrough
Passthrough in the hypervisor provides the following functionalities to
allow a VM to access PCI devices directly:

- DMA remapping by VT-d for PCI devices: the hypervisor will set up DMA
  remapping during the VM initialization phase.
- MMIO remapping between virtual and physical BAR
- Device configuration emulation
- Remapping interrupts for PCI devices
- ACPI configuration virtualization
- GSI sharing violation check
The following diagram details passthrough initialization control flow in ACRN:
.. figure:: images/passthru-image22.png
:align: center
Passthrough devices initialization control flow
Passthrough Device status
*************************
Most common devices on supported platforms are enabled for
passthrough, as detailed here:
.. figure:: images/passthru-image77.png
:align: center
Passthrough Device Status
DMA Remapping
*************
To enable passthrough, the DMA addresses used by a VM must be translated:
the VM can only provide GPAs, while physical DMA requires HPAs. One workaround
is building an identity mapping so that GPA is equal to HPA, but this
is not recommended as some VMs don't support relocation well. To
address this issue, Intel introduced VT-d in the chipset, adding a
remapping engine that translates GPA to HPA for DMA operations.

Each VT-d engine (DMAR unit) maintains a remapping structure
similar to a page table with the device BDF (Bus/Dev/Func) as input and the final
page table for GPA/HPA translation as output. The GPA/HPA translation
page table is similar to a normal multi-level page table.

VM DMA depends on Intel VT-d to do the translation from GPA to HPA, so we
need to enable the VT-d IOMMU engine in ACRN before we can pass through any device. SOS
in ACRN is a VM running in non-root mode which also depends
on VT-d to access a device. In the SOS DMA remapping
engine settings, GPA is equal to HPA.
The ACRN hypervisor checks the DMA-Remapping Hardware unit Definition (DRHD) in
the host DMAR ACPI table to get basic info, then sets up each DMAR unit. For
simplicity, ACRN reuses the EPT table as the translation table in the DMAR
unit for each passthrough device. The control flow is shown in the
following figures:
.. figure:: images/passthru-image72.png
:align: center
DMA Remapping control flow during HV init
.. figure:: images/passthru-image86.png
:align: center
ptdev assignment control flow
.. figure:: images/passthru-image42.png
:align: center
ptdev de-assignment control flow
MMIO Remapping
**************
For a PCI MMIO BAR, the hypervisor builds an EPT mapping between the virtual BAR
and the physical BAR, so the VM can access the MMIO directly.
Device configuration emulation
******************************
PCI configuration is based on accesses to ports 0xCF8/0xCFC. ACRN
implements PCI configuration emulation to handle 0xCF8/0xCFC and control
PCI devices through two paths: implemented in the hypervisor or in the SOS device
model.

- When configuration emulation is in the hypervisor, the interception of
  the 0xCF8/0xCFC ports and emulation of PCI configuration space accesses are
  tricky and unclean. Therefore the final solution is to reuse the
  PCI emulation infrastructure of the SOS device model. The hypervisor
  routes the UOS 0xCF8/0xCFC accesses to the device model, and remains blind to the
  physical PCI devices. Upon receiving a UOS PCI configuration space access
  request, the device model needs to emulate some critical space, for instance,
  BAR, MSI capability, and INTLINE/INTPIN.
- For other accesses, the device model
  reads/writes the physical configuration space on behalf of the UOS. To do
  this, the device model is linked with a PCI access library to access the
  physical PCI device.
Interrupt Remapping
*******************
When the physical interrupt of a passthrough device happens, the hypervisor has
to distribute it to the relevant VM according to the interrupt remapping
relationships. The structure ``ptirq_remapping_info`` is used to define
the subordination relation between the physical interrupt and the VM, the
virtual destination, etc. See the following figure for details:
.. figure:: images/passthru-image91.png
:align: center
Remapping of physical interrupts
There are two different types of interrupt source: IOAPIC and MSI.
The hypervisor will record different information for interrupt
distribution: physical and virtual IOAPIC pin for IOAPIC source,
physical and virtual BDF and other info for MSI source.
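To make this concrete, the kind of per-entry information can be sketched as
shown below. The field names are illustrative only and do not reflect the
actual ``ptirq_remapping_info`` definition:

.. code-block:: c

   /* Illustrative sketch only -- not the actual ACRN ptirq_remapping_info. */
   #include <stdint.h>

   enum ptirq_intr_type_sketch {
       INTR_SRC_INTX,   /* IOAPIC/PIC pin based interrupt */
       INTR_SRC_MSI     /* MSI/MSI-X based interrupt */
   };

   struct ptirq_remapping_sketch {
       enum ptirq_intr_type_sketch type;
       uint16_t vm_id;               /* VM the interrupt is routed to */
       union {
           struct {
               uint32_t phys_pin;    /* physical IOAPIC pin */
               uint32_t virt_pin;    /* virtual IOAPIC pin seen by the guest */
           } intx;
           struct {
               uint16_t phys_bdf;    /* physical Bus/Dev/Func */
               uint16_t virt_bdf;    /* virtual Bus/Dev/Func seen by the guest */
               uint32_t entry_nr;    /* MSI-X table entry index */
           } msi;
       } src;
       uint32_t phys_vector;         /* vector programmed into the hardware */
       uint32_t virt_vector;         /* vector injected into the guest */
   };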
SOS passthrough is also in the scope of interrupt remapping which is
done on-demand rather than on hypervisor initialization.
.. figure:: images/passthru-image102.png
:align: center
:name: init-remapping
Initialization of remapping of virtual IOAPIC interrupts for SOS
:numref:`init-remapping` above illustrates how (virtual) IOAPIC
interrupts are remapped for the SOS. A VM exit occurs whenever the SOS tries to
unmask an interrupt in the (virtual) IOAPIC by writing to the Redirection
Table Entry (or RTE). The hypervisor then invokes the IOAPIC emulation
handler (refer to :ref:`hld-io-emulation` for details on I/O emulation) which
calls APIs to set up a remapping for the to-be-unmasked interrupt.

Remapping of (virtual) PIC interrupts is set up in a similar sequence:
.. figure:: images/passthru-image98.png
:align: center
Initialization of remapping of virtual MSI for SOS
This figure illustrates how mappings of MSI or MSI-X are set up for
the SOS. The SOS is responsible for issuing a hypercall to notify the
hypervisor before it configures the PCI configuration space to enable an
MSI. The hypervisor takes this opportunity to set up a remapping for the
given MSI or MSI-X before it is actually enabled by the SOS.
When the UOS needs to access the physical device by passthrough, it uses
the following steps:

- The UOS gets a virtual interrupt.
- A VM exit happens, and the trapped vCPU is the target where the interrupt
  will be injected.
- The hypervisor handles the interrupt and translates the vector
  according to ptirq_remapping_info.
- The hypervisor delivers the interrupt to the UOS.
When the SOS needs to use the physical device, the passthrough is also
active because the SOS is the first VM. The detailed steps are:

- SOS gets all physical interrupts. It assigns different interrupts to
  different VMs during initialization and reassigns them when a VM is created or
  deleted.
- When a physical interrupt is trapped, a VM exit happens once the VMCS
  has been set up.
- The hypervisor handles the VM exit according to
  ptirq_remapping_info and translates the vector.
- The interrupt is injected the same way as a virtual interrupt.
ACPI Virtualization
*******************
ACPI virtualization is designed in ACRN with these assumptions:
- HV has no knowledge of ACPI,
- SOS owns all physical ACPI resources,
- UOS sees virtual ACPI resources emulated by device model.
Some passthrough devices require physical ACPI table entry for
initialization. The device model will create such device entry based on
the physical one according to vendor ID and device ID. Virtualization is
implemented in SOS device model and not in scope of the hypervisor.
GSI Sharing Violation Check
***************************
All the PCI devices that share the same GSI should be assigned to
the same VM to avoid physical GSI sharing between multiple VMs. For
devices that don't support MSI, the ACRN DM
puts the devices sharing the same GSI pin into a GSI
sharing group. The devices in the same group should either all be assigned to
the current VM, or none of them should be assigned to the
current VM. A device that violates this rule is rejected for
passthrough. The checking logic is implemented in the Device Model and is not
in the scope of the hypervisor. A sketch of this check follows.
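The sketch below illustrates the group rule under hypothetical types and
names; the real logic lives in the ACRN Device Model:

.. code-block:: c

   /* Minimal sketch of the GSI sharing check described above. */
   #include <stdbool.h>
   #include <stdint.h>

   #define MAX_GROUP_DEVS 8

   struct gsi_group {
       uint32_t gsi;                      /* shared GSI pin */
       uint16_t bdf[MAX_GROUP_DEVS];      /* devices wired to this GSI */
       int      ndevs;
   };

   /* Returns true if assigning the requested devices keeps the group rule:
    * either every device in the group is requested, or none of them is. */
   static bool gsi_group_assignment_ok(const struct gsi_group *grp,
                                       const uint16_t *requested, int nreq)
   {
       int matched = 0;

       for (int i = 0; i < grp->ndevs; i++) {
           for (int j = 0; j < nreq; j++) {
               if (grp->bdf[i] == requested[j]) {
                   matched++;
                   break;
               }
           }
       }
       return (matched == 0) || (matched == grp->ndevs);
   }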
Data structures and interfaces
******************************
The following APIs are provided to initialize interrupt remapping for
SOS:
.. doxygenfunction:: ptirq_intx_pin_remap
:project: Project ACRN
.. doxygenfunction:: ptirq_msix_remap
:project: Project ACRN
The following APIs are provided to manipulate the interrupt remapping
for UOS.
.. doxygenfunction:: ptirq_add_intx_remapping
:project: Project ACRN
.. doxygenfunction:: ptirq_remove_intx_remapping
:project: Project ACRN
.. doxygenfunction:: ptirq_add_msix_remapping
:project: Project ACRN
.. doxygenfunction:: ptirq_remove_msix_remapping
:project: Project ACRN
The following APIs are provided to acknowledge a virtual interrupt.
.. doxygenfunction:: ptirq_intx_ack
:project: Project ACRN

View File

@@ -0,0 +1,21 @@
.. _hv-hypercall:
Hypercall / VHM upcall
######################
HV currently supports hypercall APIs for VM management, I/O request
distribution, and guest memory mapping.
HV and the Service OS (SOS) also use vector 0xF7, reserved as the x86 platform
IPI vector for HV notification to the SOS. This upcall is necessary whenever
there is a device emulation requirement for the SOS. The upcall vector 0xF7 is
injected to SOS vCPU0.

The SOS registers the IRQ handler for vector 0xF7 and notifies the I/O emulation
module in the SOS once the IRQ is triggered.
.. note:: Add API doc references for General interface, VM management
interface, IRQ and Interrupts, Device Model IO request distribution,
Guest memory management, PCI assignment and IOMMU, Debug, Trusty, Power
management

View File

@@ -0,0 +1,423 @@
.. _interrupt-hld:
Physical Interrupt high-level design
####################################
Overview
********
The ACRN hypervisor implements a simple but fully functional framework
to manage interrupts and exceptions, as shown in
:numref:`interrupt-modules-overview`. In its native layer, it configures
the physical PIC, IOAPIC, and LAPIC to support different interrupt
sources from local timer/IPI to external INTx/MSI. In its virtual guest
layer, it emulates virtual PIC, virtual IOAPIC and virtual LAPIC, and
provides full APIs allowing virtual interrupt injection from emulated or
pass-thru devices.
.. figure:: images/interrupt-image3.png
:align: center
:width: 600px
:name: interrupt-modules-overview
ACRN Interrupt Modules Overview
In the software modules view shown in :numref:`interrupt-sw-modules`,
the ACRN hypervisor sets up the physical interrupt in its basic
interrupt modules (e.g., IOAPIC/LAPIC/IDT). It dispatches the interrupt
in the hypervisor interrupt flow control layer to the corresponding
handlers, that could be pre-defined IPI notification, timer, or runtime
registered pass-thru devices. The ACRN hypervisor then uses its VM
interfaces based on vPIC, vIOAPIC, and vMSI modules, to inject the
necessary virtual interrupt into the specific VM.
.. figure:: images/interrupt-image2.png
:align: center
:width: 600px
:name: interrupt-sw-modules
ACRN Interrupt SW Modules Overview
The hypervisor implements the following functionalities for handling
physical interrupts:
- Configure interrupt-related hardware including IDT, PIC, LAPIC, and
IOAPIC on startup.
- Provide APIs to manipulate the registers of LAPIC and IOAPIC.
- Acknowledge physical interrupts.
- Set up a callback mechanism for the other components in the
hypervisor to request an interrupt vector and register a
handler for that interrupt.
HV owns all native physical interrupts and manages 256 vectors per CPU.
All physical interrupts are first handled in VMX root-mode. The
"external-interrupt exiting" bit in VM-Execution controls field is set
to support this. The ACRN hypervisor also initializes all the interrupt
related modules like IDT, PIC, IOAPIC, and LAPIC.
HV does not own any host devices (except UART). All devices are by
default assigned to SOS. Any interrupts received by Guest VM (SOS or
UOS) device drivers are virtual interrupts injected by HV (via vLAPIC).
HV manages a Host-to-Guest mapping. When a native IRQ/interrupt occurs,
HV decides whether this IRQ/interrupt should be forwarded to a VM and
which VM to forward to (if any). Refer to section 3.7.6 for virtual
interrupt injection and section 3.9.6 for the management of interrupt
remapping.
HV does not own any exceptions. Guest VMCS are configured so no VM Exit
happens, with some exceptions such as #INT3 and #MC. This is to
simplify the design as HV does not support any exception handling
itself. HV supports only static memory mapping, so there should be no
#PF or #GP. If HV receives an exception indicating an error, an assert
function is then executed with an error message printed out, and the
system then halts.
Native interrupts could be generated from one of the following
sources:
- GSI interrupts
- PIC or Legacy devices IRQ (0~15)
- IOAPIC pin
- PCI MSI/MSI-X vectors
- Inter CPU IPI
- LAPIC timer
Physical Interrupt Initialization
*********************************
After ACRN hypervisor gets control from the bootloader, it
initializes all physical interrupt-related modules for all the CPUs. ACRN
hypervisor creates a framework to manage the physical interrupt for
hypervisor local devices, pass-thru devices, and IPI between CPUs, as
shown in :numref:`hv-interrupt-init`:
.. figure:: images/interrupt-image66.png
:align: center
:name: hv-interrupt-init
Physical Interrupt Initialization
IDT Initialization
==================
ACRN hypervisor builds its native IDT (interrupt descriptor table)
during interrupt initialization and sets up the following handlers:
- On an exception, the hypervisor dumps its context and halts the current
physical processor (because physical exceptions are not expected).
- For external interrupts, HV may mask the interrupt (depending on the
trigger mode), followed by interrupt acknowledgement and dispatch
to the registered handler, if any.
Most interrupts and exceptions are handled without a stack switch,
except for machine-check, double fault, and stack fault exceptions which
have their own stack set in TSS.
PIC/IOAPIC Initialization
=========================
ACRN hypervisor masks all interrupts from the PIC. All legacy interrupts
from PIC (<16) will be linked to IOAPIC, as shown in the connections in
:numref:`hv-pic-config`.
ACRN will pre-allocate vectors and mask them for these legacy interrupts
in the IOAPIC RTEs. For others (>= 16), ACRN will mask them with vector 0 in the
RTE, and the vector will be dynamically allocated on demand.

All external IOAPIC pins are categorized as GSI interrupts according to
the ACPI definition. HV supports multiple IOAPIC components. IRQ PIN to GSI
mappings are maintained internally to determine the GSI source IOAPIC.
The native PIC is not used in the system.
.. figure:: images/interrupt-image46.png
:align: center
:name: hv-pic-config
HV PIC/IOAPIC/LAPIC configuration
LAPIC Initialization
====================
Physical LAPICs are in xAPIC mode in ACRN hypervisor. The hypervisor
initializes LAPIC for each physical CPU by masking all interrupts in the
local vector table (LVT), clearing all ISRs, and enabling LAPIC.
APIs are provided to access LAPIC for the other components in the
hypervisor, aiming for further usage of local timer (TSC Deadline)
program, IPI notification program, etc. See :ref:`hv_interrupt-data-api`
for a complete list.
HV Interrupt Vectors and Delivery Mode
======================================
The interrupt vectors are assigned as shown here:
**Vector 0-0x1F**
are exceptions that are not handled by HV. If
such an exception does occur, the system then halts.
**Vector: 0x20-0x2F**
are allocated statically for legacy IRQ0-15.
**Vector: 0x30-0xDF**
are dynamically allocated vectors for PCI device
INTx or MSI/MSI-X usage. Depending on the interrupt delivery mode
(FLAT or PER_CPU mode), an interrupt will be assigned to a vector for
all the CPUs or for a particular CPU.
**Vector: 0xE0-0xFE**
are high priority vectors reserved by HV for
dedicated purposes. For example, 0xEF is used for timer, 0xF0 is used
for IPI.
.. list-table::
:widths: 30 70
:header-rows: 1
* - Vectors
- Usage
* - 0x0-0x13
- Exceptions: NMI, INT3, page fault, GP, debug.
* - 0x14-0x1F
- Reserved
* - 0x20-0x2F
- Statically allocated for external IRQ (IRQ0-IRQ15)
* - 0x30-0xDF
- Dynamically allocated for IOAPIC IRQ from PCI INTx/MSI
* - 0xE0-0xFE
- Statically allocated for HV
* - 0xEF
- Timer
* - 0xF0
- IPI
* - 0xFF
- SPURIOUS_APIC_VECTOR
Interrupts from either IOAPIC or MSI can be delivered to a target CPU.
By default they are configured as Lowest Priority (FLAT mode), i.e. they
are delivered to a CPU core that is currently idle or executing lowest
priority ISR. There is no guarantee a device's interrupt will be
delivered to a specific Guest's CPU. Timer interrupts are an exception -
these are always delivered to the CPU which programs the LAPIC timer.
There are two interrupt delivery modes: FLAT mode and PER_CPU mode. ACRN
uses FLAT mode, where the interrupt/IRQ to vector mapping is the same on all CPUs. Every
CPU receives the same interrupts. The IOAPIC and LAPIC MSI delivery modes are
configured to Lowest Priority.
Vector allocation for CPUs is shown here:
.. figure:: images/interrupt-image89.png
:align: center
FLAT mode vector allocation
IRQ Descriptor Table
====================
ACRN hypervisor maintains a global IRQ Descriptor Table shared among the
physical CPUs. ACRN uses FLAT mode to manage the interrupts, so the
same vector links to the same IRQ number on all CPUs.
.. note:: need to reference API doc for irq_desc
The *irq_desc[]* array's index represents IRQ number. An *irq_handler*
field could be set to common edge/level/quick handler which will be
called from *interrupt_dispatch*. The *irq_desc* structure also
contains the *dev_list* field to maintain this IRQ's action handler
list.
Another reverse mapping from vector to IRQ is used in addition to the
IRQ descriptor table which maintains the mapping from IRQ to vector.
On initialization, the descriptors of the legacy IRQs are initialized with
proper vectors and the corresponding reverse mappings are set up.
The descriptors of other IRQs are filled with an invalid
vector which will be updated on IRQ allocation.
For example, if the local timer registers an interrupt with IRQ number 271 and
vector 0xEF, then this data will be set up:
.. code-block:: c
irq_desc[271].irq = 271
irq_desc[271].vector = 0xEF
vector_to_irq[0xEF] = 271
External Interrupt Handling
***************************
The CPU runs under VMX non-root mode inside Guest VMs.
``MSR_IA32_VMX_PINBASED_CTLS.bit[0]`` and
``MSR_IA32_VMX_EXIT_CTLS.bit[15]`` are set to allow vCPU VM Exit to HV
whenever there are interrupts to that physical CPU under
non-root mode. HV acknowledges the interrupt on VM exit and saves the
interrupt vector to the relevant VM Exit field for HV IRQ processing.
Note that as discussed above, an external interrupt causing vCPU VM Exit
to HV does not mean that the interrupt belongs to that Guest VM. When
CPU executes VM Exit into root-mode, interrupt handling will be enabled
and the interrupt will be delivered and processed as quickly as possible
inside HV. HV may emulate a virtual interrupt and inject to Guest if
necessary.
When a physical interrupt happens on a CPU, that CPU could be running
under VMX root mode or non-root mode. If the CPU is running under VMX
root mode, the interrupt is handled through the standard native IRQ flow -
interrupt gate to IRQ handler. If the CPU is running under VMX non-root
mode, an external interrupt triggers a VM exit for the reason
"external-interrupt".
Interrupt and IRQ processing flow diagrams are shown below:
.. figure:: images/interrupt-image48.png
:align: center
:name: phy-interrupt-processing
Processing of physical interrupts
.. figure:: images/interrupt-image39.png
:align: center
IRQ processing control flow
When a physical interrupt is raised and delivered to a physical CPU, the
CPU may be running under either VMX root mode or non-root mode.
- If the CPU is running under VMX root mode, the interrupt is handled
following the standard native IRQ flow: interrupt gate to
dispatch_interrupt(), IRQ handler, and finally the registered callback.
- If the CPU is running under VMX non-root mode, an external interrupt
calls a VM exit for reason "external-interrupt", and then the VM
exit processing flow will call dispatch_interrupt() to dispatch and
handle the interrupt.
After an interrupt occurs from either path shown in
:numref:`phy-interrupt-processing`, ACRN hypervisor will jump to
dispatch_interrupt. This function gets the vector of the generated
interrupt from the context, gets IRQ number from vector_to_irq[], and
then gets the corresponding irq_desc.
Though there is only one generic IRQ handler for registered interrupt,
there are three different handling flows according to flags:
- ``!IRQF_LEVEL``
- ``IRQF_LEVEL && !IRQF_PT``
To avoid continuous interrupt triggers, it masks the IOAPIC pin and
unmask it only after IRQ action callback is executed
- ``IRQF_LEVEL && IRQF_PT``
For pass-thru devices, to avoid continuous interrupt triggers, it masks
the IOAPIC pin and leaves it unmasked until corresponding vIOAPIC
pin gets an explicit EOI ACK from guest.
Since interrupts are not shared among multiple devices, there is only one
IRQ action registered for each interrupt. A sketch of this dispatch flow is
shown below.
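A simplified sketch of this dispatch flow, using hypothetical names; the real
``dispatch_interrupt()`` in ACRN differs in detail:

.. code-block:: c

   /* Sketch: vector -> IRQ -> irq_desc lookup, with level/passthrough
    * masking rules as described above. */
   #include <stddef.h>
   #include <stdint.h>

   #define NR_IRQS    256U
   #define IRQF_LEVEL (1U << 0)
   #define IRQF_PT    (1U << 1)

   struct irq_desc_sketch {
       uint32_t irq;
       uint32_t vector;
       uint32_t flags;
       void (*action)(uint32_t irq, void *data);
       void *priv_data;
   };

   extern uint32_t vector_to_irq[];                 /* reverse mapping */
   extern struct irq_desc_sketch irq_desc[NR_IRQS];
   extern void ioapic_mask_pin(uint32_t irq);
   extern void ioapic_unmask_pin(uint32_t irq);

   void dispatch_interrupt_sketch(uint32_t vector)
   {
       uint32_t irq = vector_to_irq[vector];
       struct irq_desc_sketch *desc = &irq_desc[irq];

       if ((desc->flags & IRQF_LEVEL) != 0U) {
           /* Mask the IOAPIC pin to avoid repeated triggers. */
           ioapic_mask_pin(irq);
       }

       if (desc->action != NULL) {
           desc->action(irq, desc->priv_data);
       }

       if ((desc->flags & (IRQF_LEVEL | IRQF_PT)) == IRQF_LEVEL) {
           /* Non-passthrough level-triggered: unmask after the action runs.
            * For passthrough (IRQF_PT) the pin stays masked until the guest
            * EOIs the corresponding vIOAPIC pin. */
           ioapic_unmask_pin(irq);
       }
   }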
The IRQ number inside HV is a software concept to identify GSIs and
vectors. Each GSI will be mapped to one IRQ. The GSI number is usually the same
as the IRQ number. IRQ numbers greater than the max GSI number (nr_gsi) are dynamically
assigned. For example, when HV allocates an interrupt vector to a PCI device,
an IRQ number is then assigned to that vector. When the vector later
reaches a CPU, the corresponding IRQ routine is located and executed.
See :numref:`request-irq` for request IRQ control flow for different
conditions:
.. figure:: images/interrupt-image76.png
:align: center
:name: request-irq
Request IRQ for different conditions
.. _ipi-management:
IPI Management
**************
The only purpose of IPI use in HV is to kick a vCPU out of non-root mode
and enter HV mode. This requires the I/O request and virtual interrupt
injection to be distributed to different IPI vectors. The I/O request uses
IPI vector 0xF4 upcall (refer to Chapter 5.4). The virtual interrupt
injection uses IPI vector 0xF0.

0xF4 upcall

A guest vCPU exits (VM Exit) due to an EPT violation or I/O instruction trap.
The Device Model is required to emulate the MMIO/PortIO instruction.
However it could be that the Service OS (SOS) vCPU0 is still in non-root
mode. So an IPI (0xF4 upcall vector) should be sent to the physical CPU0
(running vCPU0 of the SOS in non-root mode) to force vCPU0 to VM Exit due
to the external interrupt. The virtual upcall vector is then injected to
the SOS, and vCPU0 inside the SOS will then pick up the I/O request and do
emulation for the other guest.
0xF0 IPI flow

If the Device Model inside the SOS needs to inject an interrupt into another guest
such as vCPU1, it first issues an IPI to kick CPU1 (assuming vCPU1 is
running on CPU1) to root mode. CPU1 will then inject the
interrupt before VM Enter.
.. _hv_interrupt-data-api:
Data structures and interfaces
******************************
IOAPIC
======
The following APIs are external interfaces for IOAPIC related
operations.
.. doxygengroup:: ioapic_ext_apis
:project: Project ACRN
:content-only:
LAPIC
=====
The following APIs are external interfaces for LAPIC related operations.
.. doxygengroup:: lapic_ext_apis
:project: Project ACRN
:content-only:
IPI
===
The following APIs are external interfaces for IPI related operations.
.. doxygengroup:: ipi_ext_apis
:project: Project ACRN
:content-only:
Physical Interrupt
==================
The following APIs are external interfaces for physical interrupt
related operations.
.. doxygengroup:: phys_int_ext_apis
:project: Project ACRN
:content-only:

View File

@@ -0,0 +1,329 @@
.. _hld-io-emulation:
I/O Emulation high-level design
###############################
As discussed in :ref:`intro-io-emulation`, there are multiple ways and
places to handle I/O emulation, including HV, SOS Kernel VHM, and SOS
user-land device model (acrn-dm).
I/O emulation in the hypervisor provides these functionalities:
- Maintain lists of port I/O or MMIO handlers in the hypervisor for
emulating trapped I/O accesses in a certain range.
- Forward I/O accesses to SOS when they cannot be handled by the
hypervisor by any registered handlers.
:numref:`io-control-flow` illustrates the main control flow steps of I/O emulation
inside the hypervisor:
1. Trap the I/O access by a VM exit, and decode the access from the
   exit qualification or by invoking the instruction decoder.
2. If the range of the I/O access overlaps with any registered handler,
call that handler if it completely covers the range of the
access, or ignore the access if the access crosses the boundary.
3. If the range of the I/O access does not overlap the range of any I/O
handler, deliver an I/O request to SOS.
.. figure:: images/ioem-image101.png
:align: center
:name: io-control-flow
Control flow of I/O emulation in the hypervisor
I/O emulation does not rely on any calibration data.
Trap Path
*********
Port I/O accesses are trapped by VM exits with the basic exit reason
"I/O instruction". The port address to be accessed, size, and direction
(read or write) are fetched from the VM exit qualification. For writes
the value to be written to the I/O port is fetched from guest registers
al, ax or eax, depending on the access size.
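As an illustration, the port, size, and direction can be extracted roughly as
follows. This sketch follows the I/O-instruction exit qualification layout
documented in the Intel SDM and is not the actual ACRN handler:

.. code-block:: c

   /* Sketch of decoding a port I/O VM exit qualification. */
   #include <stdbool.h>
   #include <stdint.h>

   struct pio_access {
       uint16_t port;       /* I/O port number */
       uint32_t size;       /* access width in bytes: 1, 2, or 4 */
       bool     is_read;    /* true for IN, false for OUT */
   };

   static struct pio_access decode_pio_exit_qual(uint64_t qual, uint64_t rax)
   {
       struct pio_access acc;

       acc.size    = (uint32_t)(qual & 0x7U) + 1U;       /* bits 2:0 = size - 1 */
       acc.is_read = ((qual & (1U << 3)) != 0U);         /* bit 3: 1 = IN (read) */
       acc.port    = (uint16_t)((qual >> 16) & 0xFFFFU); /* bits 31:16 = port */

       if (!acc.is_read) {
           /* For writes, the value comes from guest AL/AX/EAX by size. */
           uint64_t mask = (acc.size == 4U) ? 0xFFFFFFFFU
                         : (acc.size == 2U) ? 0xFFFFU : 0xFFU;
           uint64_t value = rax & mask;
           (void)value;  /* forwarded to the emulation handler in real code */
       }
       return acc;
   }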
MMIO accesses are trapped by VM exits with the basic exit reason "EPT
violation". The instruction emulator is invoked to decode the
instruction that triggers the VM exit to get the memory address being
accessed, size, direction (read or write), and the involved register.
The I/O bitmaps and EPT are used to configure the addresses that will
trigger VM exits when accessed by a VM. Refer to
:ref:`io-mmio-emulation` for details.
I/O Emulation in the Hypervisor
*******************************
When a port I/O or MMIO access is trapped, the hypervisor first checks
whether the to-be-accessed address falls in the range of any registered
handler, and calls the handler when such a handler exists.
Handler Management
==================
Each VM has two lists of I/O handlers, one for port I/O and the other
for MMIO. Each element of the list contains a memory range and a pointer
to the handler which emulates the accesses falling in the range. See
:ref:`io-handler-init` for descriptions of the related data structures.
The I/O handlers are registered on VM creation and never changed until
the destruction of that VM, when the handlers are unregistered. If
multiple handlers are registered for the same address, the one
registered later wins. See :ref:`io-handler-init` for the interfaces
used to register and unregister I/O handlers.
I/O Dispatching
===============
When a port I/O or MMIO access is trapped, the hypervisor first walks
through the corresponding I/O handler list in the reverse order of
registration, looking for a proper handler to emulate the access. The
following cases exist (a sketch of this dispatching logic follows the list):
- If a handler whose range overlaps the range of the I/O access is
found,
- If the range of the I/O access falls completely in the range the
handler can emulate, that handler is called.
- Otherwise it is implied that the access crosses the boundary of
multiple devices which the hypervisor does not emulate. Thus
no handler is called and no I/O request will be delivered to
SOS. I/O reads get all 1's and I/O writes are dropped.
- If the range of the I/O access does not overlap with any range of the
handlers, the I/O access is delivered to SOS as an I/O request
for further processing.
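A sketch of the dispatching rules above, with hypothetical types; this is not
the actual ACRN implementation:

.. code-block:: c

   /* Sketch of the reverse-order handler walk. */
   #include <stdbool.h>
   #include <stdint.h>

   struct io_range_handler {
       uint64_t base;
       uint64_t len;
       void (*emulate)(uint64_t addr, uint64_t size, void *data);
       void *data;
   };

   enum io_dispatch_result { IO_HANDLED, IO_DROPPED, IO_DEFER_TO_SOS };

   static enum io_dispatch_result
   dispatch_io(struct io_range_handler *handlers, int count,
               uint64_t addr, uint64_t size)
   {
       /* Walk in reverse order of registration: the last registered wins. */
       for (int i = count - 1; i >= 0; i--) {
           struct io_range_handler *h = &handlers[i];
           uint64_t end = h->base + h->len;
           bool overlaps = (addr < end) && ((addr + size) > h->base);

           if (overlaps) {
               if ((addr >= h->base) && ((addr + size) <= end)) {
                   h->emulate(addr, size, h->data);   /* fully covered */
                   return IO_HANDLED;
               }
               /* Crosses a handler boundary: reads get all 1's, writes
                * are dropped, and no request is sent to the SOS. */
               return IO_DROPPED;
           }
       }
       return IO_DEFER_TO_SOS;   /* no overlap: deliver an I/O request to SOS */
   }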
I/O Requests
************
An I/O request is delivered to SOS vCPU 0 if the hypervisor does not
find any handler that overlaps the range of a trapped I/O access. This
section describes the initialization of the I/O request mechanism and
how an I/O access is emulated via I/O requests in the hypervisor.
Initialization
==============
For each UOS the hypervisor shares a page with SOS to exchange I/O
requests. The 4-KByte page consists of 16 256-Byte slots, indexed by
vCPU ID. The DM is required to allocate and set up the request
buffer on VM creation; otherwise, I/O accesses from the UOS cannot be
emulated by the SOS, and all I/O accesses not handled by the I/O handlers in
the hypervisor will be dropped (reads get all 1's).
Refer to Section 4.4.1 for the details of I/O requests and the
initialization of the I/O request buffer.
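As a small illustration of this layout, the slot of a given vCPU can be
located as follows (illustrative only; the actual ACRN structures differ):

.. code-block:: c

   /* Sketch of locating a vCPU's slot in the shared I/O request page
    * (16 slots of 256 bytes each in a 4-KByte page). */
   #include <stdint.h>

   #define IO_REQ_SLOT_SIZE   256U
   #define IO_REQ_SLOT_COUNT  16U

   static inline void *io_req_slot(void *req_page_base, uint16_t vcpu_id)
   {
       /* vcpu_id must be < IO_REQ_SLOT_COUNT; callers validate this. */
       return (uint8_t *)req_page_base + (uint32_t)vcpu_id * IO_REQ_SLOT_SIZE;
   }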
Types of I/O Requests
=====================
There are four types of I/O requests:
.. list-table::
:widths: 50 50
:header-rows: 1
* - I/O Request Type
- Description
* - PIO
- A port I/O access.
* - MMIO
- An MMIO access to a GPA with no mapping in EPT.
* - PCI
- A PCI configuration space access.
* - WP
- An MMIO access to a GPA with a read-only mapping in EPT.
For port I/O accesses, the hypervisor will always deliver an I/O request
of type PIO to SOS. For MMIO accesses, the hypervisor will deliver an
I/O request of either MMIO or WP, depending on the mapping of the
accessed address (in GPA) in the EPT of the vCPU. The hypervisor will
never deliver any I/O request of type PCI, but will handle such I/O
requests in the same way as port I/O accesses on their completion.
Refer to :ref:`io-structs-interfaces` for a detailed description of the
data held by each type of I/O request.
I/O Request State Transitions
=============================
Each slot in the I/O request buffer is managed by a finite state machine
with four states. The following figure illustrates the state transitions
and the events that trigger them.
.. figure:: images/ioem-image92.png
:align: center
State Transition of I/O Requests
The four states are:
FREE
The I/O request slot is not used and new I/O requests can be
delivered. This is the initial state on UOS creation.
PENDING
The I/O request slot is occupied with an I/O request pending
to be processed by SOS.
PROCESSING
The I/O request has been dispatched to a client but the
client has not finished handling it yet.
COMPLETE
The client has completed the I/O request but the hypervisor
has not consumed the results yet.
The contents of an I/O request slot are owned by the hypervisor when the
state of an I/O request slot is FREE or COMPLETE. In such cases SOS can
only access the state of that slot. Similarly, the contents are owned by
the SOS when the state is PENDING or PROCESSING, in which case the hypervisor can
only access the state of that slot.
The state transitions are as follows:
1. To deliver an I/O request, the hypervisor takes the slot
corresponding to the vCPU triggering the I/O access, fills the
contents, changes the state to PENDING and notifies SOS via
upcall.
2. On upcalls, SOS dispatches each I/O request in the PENDING state to
clients and changes the state to PROCESSING.
3. The client assigned an I/O request changes the state to COMPLETE
after it completes the emulation of the I/O request. A hypercall
is made to notify the hypervisor on I/O request completion after
the state change.
4. The hypervisor finishes the post-work of an I/O request after it is
   notified of its completion and changes the state back to FREE.

States are accessed using atomic operations to avoid getting unexpected
states on one core when they are written on another.
Note that there is no state to represent a 'failed' I/O request. SOS
should return all 1's for reads and ignore writes whenever it cannot
handle the I/O request, and change the state of the request to COMPLETE.
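A minimal sketch of such an atomic state change is shown below, using
hypothetical types; it is not the actual ACRN code:

.. code-block:: c

   /* Sketch: deliver an I/O request by atomically moving its slot from
    * FREE to PENDING. The contents are filled while the slot is FREE and
    * therefore still owned by the hypervisor. */
   #include <stdatomic.h>
   #include <stdbool.h>
   #include <stdint.h>

   enum req_state { REQ_FREE = 0, REQ_PENDING, REQ_PROCESSING, REQ_COMPLETE };

   struct io_req_slot {
       _Atomic uint32_t state;
       /* ... request contents (type, address, size, value) ... */
   };

   /* Returns true if the request was delivered; the caller then notifies
    * the SOS via upcall. */
   bool deliver_io_request(struct io_req_slot *slot)
   {
       uint32_t expected = REQ_FREE;

       /* Fill in the request contents here, before publishing the slot. */

       return atomic_compare_exchange_strong(&slot->state, &expected,
                                             REQ_PENDING);
   }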
Post-work
=========
After an I/O request is completed, some more work needs to be done for
I/O reads to update guest registers accordingly. Currently the
hypervisor re-enters the vCPU thread every time a vCPU is scheduled back
in, rather than switching to where the vCPU is scheduled out. As a result,
post-work is introduced for this purpose.
The hypervisor pauses a vCPU before an I/O request is delivered to SOS.
Once the I/O request emulation is completed, a client notifies the
hypervisor by a hypercall. The hypervisor will pick up that request, do
the post-work, and resume the guest vCPU. The post-work takes care of
updating the vCPU guest state to reflect the effect of the I/O reads.
.. figure:: images/ioem-image100.png
:align: center
Workflow of MMIO I/O request completion
The figure above illustrates the workflow to complete an I/O
request for MMIO. Once the I/O request is completed, the SOS makes a
hypercall to notify the hypervisor, which resumes the UOS vCPU that triggered
the access after requesting post-work on that vCPU. After the UOS vCPU
resumes, it first does the post-work to update the guest registers if
the access was a read, changes the state of the corresponding I/O
request slot to FREE, and continues execution of the vCPU.
.. figure:: images/ioem-image106.png
:align: center
:name: port-io-completion
Workflow of port I/O request completion
Completion of a port I/O request (shown in :numref:`port-io-completion`
above) is
similar to the MMIO case, except the post-work is done before resuming
the vCPU. This is because the post-work for port I/O reads only needs to update
the general register eax of the vCPU, while the post-work for MMIO reads
needs further emulation of the trapped instruction. The latter is much more
complex and may impact the performance of the SOS.
.. _io-structs-interfaces:
Data Structures and Interfaces
******************************
External Interfaces
===================
The following structures represent an I/O request. *struct vhm_request*
is the main structure and the others are detailed representations of I/O
requests of different kinds. Refer to Section 4.4.4 for the usage of
*struct pci_request*.
.. doxygenstruct:: mmio_request
:project: Project ACRN
.. doxygenstruct:: pio_request
:project: Project ACRN
.. doxygenstruct:: pci_request
:project: Project ACRN
.. doxygenunion:: vhm_io_request
:project: Project ACRN
.. doxygenstruct:: vhm_request
:project: Project ACRN
For hypercalls related to I/O emulation, refer to Section 3.11.4.
.. _io-handler-init:
Initialization and Deinitialization
===================================
The following structure represents a port I/O handler:
.. doxygenstruct:: vm_io_handler_desc
:project: Project ACRN
The following structure represents a MMIO handler.
.. doxygenstruct:: mem_io_node
:project: Project ACRN
The following APIs are provided to initialize, deinitialize or configure
I/O bitmaps and register or unregister I/O handlers:
.. doxygenfunction:: allow_guest_pio_access
:project: Project ACRN
.. doxygenfunction:: register_pio_emulation_handler
:project: Project ACRN
.. doxygenfunction:: register_mmio_emulation_handler
:project: Project ACRN
I/O Emulation
=============
The following APIs are provided for I/O emulation at runtime:
.. doxygenfunction:: acrn_insert_request
:project: Project ACRN
.. doxygenfunction:: pio_instr_vmexit_handler
:project: Project ACRN
.. doxygenfunction:: ept_violation_vmexit_handler
:project: Project ACRN

View File

@@ -0,0 +1,728 @@
.. _IOC_virtualization_hld:
IOC Virtualization high-level design
####################################
.. author: Yuan Liu
The I/O Controller (IOC) is an SoC bridge we can use to communicate
with a Vehicle Bus in automotive applications, routing Vehicle Bus
signals, such as those extracted from CAN messages, from the IOC to the
SoC and back, as well as signals the SoC uses to control onboard
peripherals.
.. note::
NUC and UP2 platforms do not support IOC hardware, and as such, IOC
virtualization is not supported on these platforms.
The main purpose of IOC virtualization is to transfer data between
native Carrier Board Communication (CBC) char devices and a virtual
UART. IOC virtualization is implemented as full virtualization so the
user OS can directly reuse the native CBC driver.
The IOC Mediator has several virtualization requirements, such as S3/S5
wakeup reason emulation, CBC link frame packing/unpacking, signal
whitelist, and RTC configuration.
IOC Mediator Design
*******************
Architecture Diagrams
=====================
IOC introduction
----------------
.. figure:: images/ioc-image12.png
:width: 600px
:align: center
:name: ioc-mediator-arch
IOC Mediator Architecture
- Vehicle Bus communication involves a wide range of individual signals
to be used, varying from single GPIO signals on the IOC up to
complete automotive networks that connect many external ECUs.
- IOC (I/O controller) is an SoC bridge to communicate with a Vehicle
Bus. It routes Vehicle Bus signals (extracted from CAN
messages for example) back and forth between the IOC and SoC. It also
controls the onboard peripherals from the SoC.
- IOC is always turned on. The power supply of the SoC and its memory are
controlled by the IOC. IOC monitors some wakeup reason to control SoC
lifecycle-related features.
- Some hardware signals are connected to the IOC, allowing the SoC to control
them.
- Besides, there is one NVM (Non-Volatile Memory) that is connected to
IOC for storing persistent data. The IOC is in charge of accessing NVM
following the SoC's requirements.
CBC protocol introduction
-------------------------
The Carrier Board Communication (CBC) protocol multiplexes and
prioritizes communication from the available interface between the SoC
and the IOC.
The CBC protocol offers a layered approach, which allows it to run on
different serial connections, such as SPI or UART.
.. figure:: images/ioc-image14.png
:width: 900px
:align: center
:name: ioc-cbc-frame-def
IOC Native - CBC frame definition
The CBC protocol is based on a four-layer system:
- The **Physical layer** is a serial interface with full
duplex capabilities. A hardware handshake is required. The required
bit rate depends on the peripherals connected, e.g. UART, and SPI.
- The **Link layer** handles the length and payload verification.
- The **Address Layer** is used to distinguish between the general data
transferred. It is placed in front of the underlying Service Layer
and contains Multiplexer (MUX) and Priority fields.
- The **Service Layer** contains the payload data.
Native architecture
-------------------
In the native architecture, the IOC controller connects to UART
hardware, and communicates with the CAN bus to access peripheral
devices. ``cbc_attach`` is an application to enable the CBC ldisc
function, which creates several CBC char devices. All userspace
subsystems or services communicate with IOC firmware via the CBC char
devices.
.. figure:: images/ioc-image13.png
:width: 900px
:align: center
:name: ioc-software-arch
IOC Native - Software architecture
Virtualization architecture
---------------------------
In the virtualization architecture, the IOC Device Model (DM) is
responsible for communication between the UOS and IOC firmware. The IOC
DM communicates with several native CBC char devices and a PTY device.
The native CBC char devices only include ``/dev/cbc-lifecycle``,
``/dev/cbc-signals``, and ``/dev/cbc-raw0`` - ``/dev/cbc-raw11``. Others
are not used by the IOC DM. IOC DM opens the ``/dev/ptmx`` device to
create a pair of devices (master and slave). The IOC DM uses these
devices to communicate with the UART DM since the UART DM needs a TTY-capable
device as its backend.
.. figure:: images/ioc-image15.png
:width: 900px
:align: center
:name: ioc-virt-software-arch
IOC Virtualization - Software architecture
High-Level Design
=================
There are five parts in this high-level design:
* Software data flow introduces data transfer in the IOC mediator
* State transfer introduces IOC mediator work states
* CBC protocol illustrates the CBC data packing/unpacking
* Power management involves boot/resume/suspend/shutdown flows
* Emulated CBC commands introduces the workflow of some emulated commands
IOC mediator has three threads to transfer data between UOS and SOS. The
core thread is responsible for data reception, and Tx and Rx threads are
used for data transmission. Each of the transmission threads has one
data queue as a buffer, so that the IOC mediator can read data from CBC
char devices and UART DM immediately.
.. figure:: images/ioc-image16.png
:width: 900px
:align: center
:name: ioc-med-sw-data-flow
IOC Mediator - Software data flow
- For the Tx direction, the data comes from the IOC firmware. The IOC mediator
  receives service data from native CBC char devices such as
  ``/dev/cbc-lifecycle``. If the service data is a CBC wakeup reason, some wakeup
  reason bits will be masked. If the service data is a CBC signal that is not
  defined in the whitelist, the data will be dropped. If the service
  data comes from a raw channel, the data will be passed forward. Before
  transmitting to the virtual UART interface, all data needs to be
  packed with an address header and link header.
- For the Rx direction, the data comes from the UOS. The IOC mediator receives link
  data from the virtual UART interface. The data is unpacked by the Core
  thread and then forwarded to the Rx queue, similar to the Tx direction flow,
  except that the heartbeat and RTC are only used by the IOC
  mediator and are not transferred to the IOC
  firmware.
- Currently, the IOC mediator only cares about lifecycle, signal, and raw data.
  Others, e.g. diagnosis, are not used by the IOC mediator.
State transfer
--------------
The IOC mediator has four states and five events for state transfer; a code sketch follows the state list.
.. figure:: images/ioc-image18.png
:width: 600px
:align: center
:name: ioc-state-transfer
IOC Mediator - State Transfer
- **INIT state**: This state is the initialized state of the IOC mediator.
All CBC protocol packets are handled normally. In this state, the UOS
has not yet sent an active heartbeat.
- **ACTIVE state**: Enter this state if an HB ACTIVE event is triggered,
  indicating that the UOS has become active; the mediator then needs to set
  bit 23 (the SoC bit) in the wakeup reason.
- **SUSPENDING state**: Enter this state if a RAM REFRESH event or HB
INACTIVE event is triggered. The related event handler needs to mask
all wakeup reason bits except SoC bit and drop the queued CBC
protocol frames.
- **SUSPENDED state**: Enter this state if a SHUTDOWN event is triggered to
close all native CBC char devices. The IOC mediator will be put to
sleep until a RESUME event is triggered to re-open the closed native
CBC char devices and transition to the INIT state.
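A compact sketch of this state machine is shown below. The state and event
names come from the description above; the code structure itself is
illustrative only:

.. code-block:: c

   /* Illustrative sketch of the IOC mediator state machine. */
   enum ioc_state { IOC_INIT, IOC_ACTIVE, IOC_SUSPENDING, IOC_SUSPENDED };
   enum ioc_event { EVT_HB_ACTIVE, EVT_HB_INACTIVE, EVT_RAM_REFRESH,
                    EVT_SHUTDOWN, EVT_RESUME };

   static enum ioc_state ioc_next_state(enum ioc_state cur, enum ioc_event evt)
   {
       switch (evt) {
       case EVT_HB_ACTIVE:
           return IOC_ACTIVE;          /* UOS heartbeat became active */
       case EVT_HB_INACTIVE:
       case EVT_RAM_REFRESH:
           return IOC_SUSPENDING;      /* mask wakeup reasons, drop queued frames */
       case EVT_SHUTDOWN:
           return IOC_SUSPENDED;       /* close native CBC char devices */
       case EVT_RESUME:
           return (cur == IOC_SUSPENDED) ? IOC_INIT : cur;  /* re-open devices */
       default:
           return cur;
       }
   }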
CBC protocol
------------
IOC mediator needs to pack/unpack the CBC link frame for IOC
virtualization, as shown in the detailed flow below:
.. figure:: images/ioc-image17.png
:width: 900px
:align: center
:name: ioc-cbc-frame-usage
IOC Native - CBC frame usage
In the native architecture, the CBC link frame is unpacked by CBC
driver. The usage services only get the service data from the CBC char
devices. For data packing, CBC driver will compute the checksum and set
priority for the frame, then send data to the UART driver.
.. figure:: images/ioc-image20.png
:width: 900px
:align: center
:name: ioc-cbc-prot
IOC Virtualization - CBC protocol virtualization
The difference between the native and virtualization architectures is
that the IOC mediator needs to re-compute the checksum and reset
priority. Currently, priority is not supported by IOC firmware; the
priority setting by the IOC mediator is based on the priority setting of
the CBC driver. The SOS and UOS use the same CBC driver.
Power management virtualization
-------------------------------
In acrn-dm, the IOC power management architecture involves PM DM, IOC
DM, and UART DM modules. PM DM is responsible for UOS power management,
and IOC DM is responsible for heartbeat and wakeup reason flows for IOC
firmware. The heartbeat flow is used to control IOC firmware power state
and wakeup reason flow is used to indicate IOC power state to the OS.
UART DM transfers all IOC data between the SOS and UOS. These modules
complete boot/suspend/resume/shutdown functions.
Boot flow
+++++++++
.. figure:: images/ioc-image19.png
:width: 900px
:align: center
:name: ioc-virt-boot
IOC Virtualization - Boot flow
#. Press ignition button for booting.
#. SOS lifecycle service gets a "booting" wakeup reason.
#. SOS lifecycle service notifies wakeup reason to VM Manager, and VM
Manager starts VM.
#. VM Manager sets the VM state to "start".
#. IOC DM forwards the wakeup reason to UOS.
#. PM DM starts UOS.
#. UOS lifecycle gets a "booting" wakeup reason.
Suspend & Shutdown flow
+++++++++++++++++++++++
.. figure:: images/ioc-image21.png
:width: 900px
:align: center
:name: ioc-suspend
IOC Virtualization - Suspend and Shutdown by Ignition
#. Press ignition button to suspend or shutdown.
#. SOS lifecycle service gets a 0x800000 wakeup reason, then keeps
sending a shutdown delay heartbeat to IOC firmware, and notifies a
"stop" event to VM Manager.
#. IOC DM forwards the wakeup reason to UOS lifecycle service.
#. SOS lifecycle service sends a "stop" event to VM Manager, and waits for
the stop response before timeout.
#. UOS lifecycle service gets a 0x800000 wakeup reason and sends inactive
heartbeat with suspend or shutdown SUS_STAT to IOC DM.
#. UOS lifecycle service gets a 0x000000 wakeup reason, then enters
suspend or shutdown kernel PM flow based on SUS_STAT.
#. PM DM executes UOS suspend/shutdown request based on ACPI.
#. VM Manager queries each VM state from PM DM. Suspend request maps
to a paused state and shutdown request maps to a stop state.
#. VM Manager collects all VMs state, and reports it to SOS lifecycle
service.
#. SOS lifecycle sends inactive heartbeat to IOC firmware with
suspend/shutdown SUS_STAT, based on the SOS' own lifecycle service
policy.
Resume flow
+++++++++++
.. figure:: images/ioc-image22.png
:width: 900px
:align: center
:name: ioc-resume
   IOC Virtualization - Resume flow

The resume reason can be either the ignition button or the RTC; both
have the same flow blocks.
For ignition resume flow:
#. Press ignition button to resume.
#. SOS lifecycle service gets an initial wakeup reason from the IOC
firmware. The wakeup reason is 0x000020, from which the ignition button
bit is set. It then sends active or initial heartbeat to IOC firmware.
#. SOS lifecycle forwards the wakeup reason and sends start event to VM
Manager. The VM Manager starts to resume VMs.
#. IOC DM gets the wakeup reason from the VM Manager and forwards it to UOS
lifecycle service.
#. VM Manager sets the VM state to starting for PM DM.
#. PM DM resumes UOS.
#. UOS lifecycle service gets wakeup reason 0x000020, and then sends an initial
or active heartbeat. The UOS gets wakeup reason 0x800020 after
resuming.
For the RTC resume flow:
#. RTC timer expires.
#. SOS lifecycle service gets initial wakeup reason from the IOC
firmware. The wakeup reason is 0x000200, from which RTC bit is set.
It then sends active or initial heartbeat to IOC firmware.
#. SOS lifecycle forwards the wakeup reason and sends start event to VM
Manager. VM Manager begins resuming VMs.
#. IOC DM gets the wakeup reason from the VM Manager, and forwards it to
the UOS lifecycle service.
#. VM Manager sets the VM state to starting for PM DM.
#. PM DM resumes UOS.
#. UOS lifecycle service gets the wakeup reason 0x000200, and sends
initial or active heartbeat. The UOS gets wakeup reason 0x800200
after resuming.
System control data
-------------------
IOC mediator has several emulated CBC commands, including wakeup reason,
heartbeat, and RTC.
The wakeup reason, heartbeat, and RTC commands belong to the system
control frames, which are used for startup or shutdown control. System
control includes Wakeup Reasons, Heartbeat, Boot Selector, Suppress
Heartbeat Check, and Set Wakeup Timer functions. Details are in this
table:
.. list-table:: System control SVC values
:header-rows: 1
* - System Control
- Value Name
- Description
- Data Direction
* - 1
- Wakeup Reasons
- Wakeup Reasons
- IOC to SoC
* - 2
- Heartbeat
- Heartbeat
- SoC to IOC
* - 3
- Boot Selector
- Boot Selector
- SoC to IOC
* - 4
- Suppress Heartbeat Check
- Suppress Heartbeat Check
- SoC to IOC
* - 5
- Set Wakeup Timer
- Set Wakeup Timer in AIOC firmware
- SoC to IOC
- The IOC mediator only supports the Wakeup Reasons, Heartbeat, and Set Wakeup
  Timer commands.
- The Boot Selector command is used to configure which partition the
  IOC has to use for normal and emergency boots. Additionally, after CBC
  communication has been established successfully, the IOC has to report to the
  SoC which boot partition has been started and for what reason.
- The Suppress Heartbeat Check command is sent by the SoC in
preparation for maintenance tasks which require the CBC Server to be
shut down for a certain period of time. It instructs the IOC not to
expect CBC heartbeat messages during the specified time. The IOC must
disable any watchdog on the CBC heartbeat messages during this period
of time.
Wakeup reason
+++++++++++++
The wakeup reasons command contains a bit mask of all reasons that are
currently keeping the SoC/IOC active. The SoC itself also has a wakeup
reason, which allows the SoC to keep the IOC active. The wakeup reasons
should be sent every 1000 ms by the IOC.
Wakeup reason frame definition is as below:
.. figure:: images/ioc-image24.png
:width: 900px
:align: center
:name: ioc-wakeup-reason
Wakeup Reason Frame Definition
Currently the wakeup reason bits are supported by sources shown here:
.. list-table:: Wakeup Reason Bits
:header-rows: 1
* - Wakeup Reason
- Bit
- Source
* - wakeup_button
- 5
- Get from IOC FW, forward to UOS
* - RTC wakeup
- 9
- Get from IOC FW, forward to UOS
* - car door wakeup
- 11
- Get from IOC FW, forward to UOS
* - SoC wakeup
- 23
- Emulation (depends on the UOS's heartbeat message)
- CBC_WK_RSN_BTN (bit 5): ignition button.
- CBC_WK_RSN_RTC (bit 9): RTC timer.
- CBC_WK_RSN_DOR (bit 11): Car door.
- CBC_WK_RSN_SOC (bit 23): SoC active/inactive.
.. figure:: images/ioc-image4.png
:width: 600px
:align: center
:name: ioc-wakeup-flow
IOC Mediator - Wakeup reason flow
Bit 23 is for the SoC wakeup indicator and should not be forwarded
directly because every VM has a different heartbeat status.
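For illustration, the per-VM handling of bit 23 might look like the sketch
below (a hypothetical helper; the actual mediator code differs). The
CBC_WK_RSN_* values match the bit definitions listed above:

.. code-block:: c

   /* Sketch: forward a wakeup reason to a UOS, replacing the SoC bit
    * (bit 23) with the per-VM heartbeat status. */
   #include <stdbool.h>
   #include <stdint.h>

   #define CBC_WK_RSN_BTN  (1U << 5)    /* ignition button */
   #define CBC_WK_RSN_RTC  (1U << 9)    /* RTC timer */
   #define CBC_WK_RSN_DOR  (1U << 11)   /* car door */
   #define CBC_WK_RSN_SOC  (1U << 23)   /* SoC active/inactive (emulated) */

   static uint32_t wakeup_reason_for_vm(uint32_t hw_reason, bool vm_hb_active)
   {
       /* Keep only the bits forwarded from the IOC firmware... */
       uint32_t reason = hw_reason & (CBC_WK_RSN_BTN | CBC_WK_RSN_RTC |
                                      CBC_WK_RSN_DOR);

       /* ...and emulate the SoC bit from this VM's heartbeat status. */
       if (vm_hb_active) {
           reason |= CBC_WK_RSN_SOC;
       }
       return reason;
   }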
Heartbeat
+++++++++
The Heartbeat is used as the SoC watchdog, indicating the SoC power
reset behavior. The heartbeat needs to be sent every 1000 ms by
the SoC.
.. figure:: images/ioc-image5.png
:width: 900px
:align: center
:name: ioc-heartbeat
System control - Heartbeat
Heartbeat frame definition is shown here:
.. figure:: images/ioc-image6.png
:width: 900px
:align: center
:name: ioc-heartbeat-frame
Heartbeat Frame Definition
- Heartbeat active is repeatedly sent from SoC to IOC to signal that
the SoC is active and intends to stay active. The On SUS_STAT action
must be set to invalid.
- Heartbeat inactive is sent once from SoC to IOC to signal that the
SoC is ready for power shutdown. The On SUS_STAT action must be set
to a required value.
- Heartbeat delay is repeatedly sent from SoC to IOC to signal that the
SoC has received the shutdown request, but isn't ready for
shutdown yet (for example, a phone call or other time consuming
action is active). The On SUS_STAT action must be set to invalid.
.. figure:: images/ioc-image7.png
:width: 600px
:align: center
:name: ioc-heartbeat-commands
Heartbeat Commands
- SUS_STAT invalid action needs to be set with a heartbeat active
message.
- For the heartbeat inactive message, the SoC needs to set the SUS_STAT
  action to a command from 1 to 7 according to the related scenario. For
  example, the S3 case needs to use 7 to prevent power gating of the memory.
- The difference between halt and reboot is whether the power rail
  that supplies customer peripherals (such as Fan, HDMI-in, BT/Wi-Fi,
  M.2, and Ethernet) is reset.
.. figure:: images/ioc-image8.png
:width: 900px
:align: center
:name: ioc-heartbeat-flow
IOC Mediator - Heartbeat Flow
- The IOC DM does not maintain a watchdog timer for heartbeat messages. This
  is because it already has other watchdog features, so the main use of the
  Heartbeat active command is to maintain the virtual wakeup reason
  bitmap variable.
- For Heartbeat, the IOC mediator supports Heartbeat shutdown prepared,
  Heartbeat active, Heartbeat shutdown delay, Heartbeat initial, and
  Heartbeat Standby.
- For SUS_STAT, the IOC mediator supports the invalid action and RAM refresh
  action.
- The Suppress Heartbeat Check command is also dropped directly.
RTC
+++
The RTC timer is used to wake up the SoC when the timer expires. (A use
case is an automatic software upgrade at a specific time.) The RTC frame
definition is as below.
.. figure:: images/ioc-image9.png
:width: 600px
:align: center
- The RTC command contains a relative time but not an absolute time.
- SOS lifecycle service will re-compute the time offset before it is
sent to the IOC firmware.
.. figure:: images/ioc-image10.png
:width: 900px
:align: center
:name: ioc-rtc-flow
IOC Mediator - RTC flow
Signal data
-----------
The signal channel is an API between the SoC and IOC for
miscellaneous requirements. The process data includes all vehicle bus and
carrier board data (GPIO, sensors, and so on). It supports
transportation of single signals and group signals. Each signal consists
of a signal ID (reference), its value, and its length. The IOC and SoC need to
agree on the definition of signal IDs, which can be treated as API
interface definitions.
IOC signal type definitions are as below.
.. figure:: images/ioc-image1.png
:width: 600px
:align: center
:name: ioc-process-data-svc-val
Process Data SVC values
.. figure:: images/ioc-image2.png
:width: 900px
:align: center
:name: ioc-med-signal-flow
IOC Mediator - Signal flow
- The IOC backend needs to emulate the channel open/reset/close messages, which
  shouldn't be forwarded to the native CBC signal channel. The SOS signal-related
  services do the real open/reset/close of the signal channel.
- Every backend should maintain a whitelist for different VMs. The
  whitelist can be stored in the SOS file system (read only) in the
  future, but currently it is hard-coded.

The IOC mediator has two whitelist tables: one is used for rx
signals (SoC->IOC), and the other one is used for tx signals. The IOC
mediator drops single signals and group signals if the signals are
not defined in the whitelist. For a multi-signal, the IOC mediator generates a
new multi-signal that contains only the signals in the whitelist.
.. figure:: images/ioc-image3.png
:width: 600px
:align: center
:name: ioc-med-multi-signal
IOC Mediator - Multi-Signal whitelist
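The whitelist filtering for a multi-signal can be sketched as below, with
hypothetical types; the real IOC mediator implementation differs:

.. code-block:: c

   /* Sketch of whitelist filtering for a multi-signal. */
   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   struct cbc_signal {
       uint16_t id;      /* signal ID agreed between SoC and IOC */
       uint16_t len;     /* signal length in bytes */
   };

   static bool signal_in_whitelist(const uint16_t *whitelist, size_t wl_count,
                                   uint16_t id)
   {
       for (size_t i = 0; i < wl_count; i++) {
           if (whitelist[i] == id) {
               return true;
           }
       }
       return false;
   }

   /* Build a new multi-signal that contains only whitelisted signals;
    * returns the number of signals kept. */
   static size_t filter_multi_signal(const struct cbc_signal *in, size_t in_count,
                                     struct cbc_signal *out,
                                     const uint16_t *whitelist, size_t wl_count)
   {
       size_t kept = 0;

       for (size_t i = 0; i < in_count; i++) {
           if (signal_in_whitelist(whitelist, wl_count, in[i].id)) {
               out[kept++] = in[i];
           }
       }
       return kept;
   }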
Raw data
--------
An OEM raw channel is only assigned to a specific UOS following the OEM
configuration. The IOC mediator directly forwards all read/write
messages from the IOC firmware to the UOS without any modification.
Dependencies and Constraints
****************************
HW External Dependencies
========================
+--------------------------------------+--------------------------------------+
| Dependency | Runtime Mechanism to Detect |
| | Violations |
+======================================+======================================+
| VMX should be supported | Boot-time checks to CPUID. See |
| | section A.1 in SDM for details. |
+--------------------------------------+--------------------------------------+
| EPT should be supported | Boot-time checks to primary and |
| | secondary processor-based |
| | VM-execution controls. See section |
| | A.3.2 and A.3.3 in SDM for details. |
+--------------------------------------+--------------------------------------+
SW External Dependencies
========================
+--------------------------------------+--------------------------------------+
| Dependency | Runtime Mechanism to Detect |
| | Violations |
+======================================+======================================+
| When invoking the hypervisor, the | Check the magic value in EAX. See |
| bootloader should have established a | section 3.2 & 3.3 in Multiboot |
| multiboot-compliant state | Specification for details. |
+--------------------------------------+--------------------------------------+
Constraints
===========
+--------------------------+--------------------------+--------------------------+
| Description | Rationale | How such constraint is |
| | | enforced |
+==========================+==========================+==========================+
| Physical cores are | To avoid interference | A bitmap indicating free |
| exclusively assigned to | between vcpus on the | pcpus; on vcpu creation |
| vcpus. | same core. | a free pcpu is picked. |
+--------------------------+--------------------------+--------------------------+
| Only PCI devices | Without HW reset it is | |
| supporting HW reset can | challenging to manage | |
| be passed through to a | devices on UOS crashes | |
| UOS. | | |
+--------------------------+--------------------------+--------------------------+
Interface Specification
***********************
Doxygen-style comments in the code are used for interface specification.
This section provides some examples of how functions and structures
should be commented.
Function Header Template
========================
.. code-block:: c
/**
* @brief Initialize environment for Trusty-OS on a VCPU.
*
* More info here.
*
* @param[in] vcpu Pointer to VCPU data structure
* @param[inout] param guest physical address. This gpa points to
* struct trusty_boot_param
*
* @return 0 - on success.
* @return -EIO - (description when this error can happen)
* @return -EINVAL - (description )
*
* @pre vcpu must not be NULL.
* @pre param must ...
*
* @post the return value is non-zero if param is ....
* @post
*
* @remark The api must be invoked with interrupt disabled.
* @remark (Other usage constraints here)
*/
Structure
=========
.. code-block:: c
/**
* @brief An mmio request.
*
* More info here.
*/
struct mmio_request {
uint32_t direction; /**< Direction of this request. */
uint32_t reserved; /**< Reserved. */
int64_t address; /**< gpa of the register to be accessed. */
int64_t size; /**< Width of the register to be accessed. */
int64_t value; /**< Value read from or to be written to the
register. */
} __aligned(8);
IOC Mediator Configuration
**************************
TBD
IOC Mediator Usage
******************
The device model configuration command syntax for IOC mediator is as
follows::
-i,[ioc_channel_path],[wakeup_reason]
-l,[lpc_port],[ioc_channel_path]
The "ioc_channel_path" is an absolute path for communication between
IOC mediator and UART DM.
The "lpc_port" is "com1" or "com2", IOC mediator needs one unassigned
lpc port for data transfer between UOS and SOS.
The "wakeup_reason" is IOC mediator boot up reason, each bit represents
one wakeup reason.
For example, the following commands are used to enable IOC feature, the
initial wakeup reason is the ignition button and cbc_attach uses ttyS1
for TTY line discipline in UOS::
-i /run/acrn/ioc_$vm_name,0x20
-l com2,/run/acrn/ioc_$vm_name
Porting and adaptation to different platforms
*********************************************
TBD
@@ -0,0 +1,498 @@
.. _memmgt-hld:
Memory Management high-level design
###################################
This document describes memory management for the ACRN hypervisor.
Overview
********
The hypervisor (HV) virtualizes real physical memory so that an unmodified OS
(such as Linux or Android) running in a virtual machine has the view of
managing its own contiguous physical memory. The HV uses virtual-processor
identifiers (VPIDs) and the extended page-table mechanism (EPT) to
translate guest-physical addresses into host-physical addresses. The HV enables
the EPT and VPID hardware virtualization features, establishes EPT page
tables for the SOS/UOS, and provides EPT page table operation interfaces to
other components.
In the ACRN hypervisor system, there are a few different memory spaces to
consider. From the hypervisor's point of view there are:
- **Host Physical Address (HPA)**: the native physical address space, and
- **Host Virtual Address (HVA)**: the native virtual address space based on
  an MMU. A page table is used to translate between HPA and HVA
  spaces.
From the Guest OS running on a hypervisor there are:
- **Guest Physical Address (GPA)**: the guest physical address space from a
  virtual machine. GPA to HPA translation is usually based on an
  MMU-like hardware module (EPT in x86), and associated with a page
  table
- **Guest Virtual Address (GVA)**: the guest virtual address space from a
virtual machine based on a vMMU
.. figure:: images/mem-image2.png
:align: center
:width: 900px
:name: mem-overview
ACRN Memory Mapping Overview
:numref:`mem-overview` provides an overview of the ACRN system memory
mapping, showing:
- GVA to GPA mapping based on vMMU on a VCPU in a VM
- GPA to HPA mapping based on EPT for a VM in the hypervisor
- HVA to HPA mapping based on MMU in the hypervisor
This document illustrates the memory management infrastructure for the
ACRN hypervisor and how it handles the different memory space views
inside the hypervisor and from a VM:
- How ACRN hypervisor manages host memory (HPA/HVA)
- How ACRN hypervisor manages SOS guest memory (HPA/GPA)
- How ACRN hypervisor & SOS DM manage UOS guest memory (HPA/GPA)
Hypervisor Physical Memory Management
*************************************
In ACRN, the HV initializes MMU page tables to manage all physical
memory and then switches to the new MMU page tables. After the MMU page
tables are initialized at the platform initialization stage, no further
updates are made to them.
Hypervisor Physical Memory Layout - E820
========================================
The ACRN hypervisor is the primary owner to manage system memory.
Typically the boot firmware (e.g., EFI) passes the platform physical
memory layout - E820 table to the hypervisor. The ACRN hypervisor does
its memory management based on this table using 4-level paging.
The BIOS/bootloader firmware (e.g., EFI) passes the E820 table through a
multiboot protocol. This table contains the original memory layout for
the platform.
.. figure:: images/mem-image1.png
:align: center
:width: 900px
:name: mem-layout
Physical Memory Layout Example
:numref:`mem-layout` is an example of the physical memory layout based on a simple
platform E820 table.
Hypervisor Memory Initialization
================================
The ACRN hypervisor runs in paging mode. After the bootstrap
processor (BSP) gets the platform E820 table, the BSP creates its MMU page
table based on it. This is done by the functions *init_paging()* and
*enable_smep()*. After an application processor (AP) receives the IPI CPU
startup interrupt, it uses the MMU page tables created by the BSP and enables SMEP.
:numref:`hv-mem-init` describes the hypervisor memory initialization for BSP
and APs.
.. figure:: images/mem-image8.png
:align: center
:name: hv-mem-init
Hypervisor Memory Initialization
The memory mapping policy used is:
- Identical mapping (ACRN hypervisor memory could be relocatable in
the future)
- Map all memory regions with UNCACHED type
- Remap RAM regions to WRITE-BACK type
.. figure:: images/mem-image69.png
:align: center
:name: hv-mem-vm-init
Hypervisor Virtual Memory Layout
:numref:`hv-mem-vm-init` above shows:
- Hypervisor has a view of and can access all system memory
- Hypervisor has UNCACHED MMIO/PCI hole reserved for devices such as
LAPIC/IOAPIC accessing
- Hypervisor has its own memory with WRITE-BACK cache type for its
code/data (< 1M part is for secondary CPU reset code)
The hypervisor should use the minimum number of memory pages to map from virtual
address space into physical address space.

- If a 1GB hugepage can be used for the virtual address space mapping, the
  corresponding PDPT entry shall be set for this 1GB hugepage.
- If a 1GB hugepage can't be used for the virtual address space mapping but a
  2MB hugepage can be, the corresponding PDT entry shall be set for this
  2MB hugepage.
- If neither a 1GB hugepage nor a 2MB hugepage can be used for the virtual
  address space mapping, the corresponding PT entry shall be set.
If the memory type or access rights of a page are updated, or some virtual
address space is deleted, this will lead to splitting of the corresponding
page. The hypervisor will still keep using the minimum number of memory pages
to map from virtual address space into physical address space, as sketched below.
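
The page-size selection policy described above can be sketched as the helper
below. This is an illustrative sketch only; the constants and function name
are assumptions, and hardware support for the chosen hugepage size must also
hold before such an entry is used.

.. code-block:: c

   #include <stdint.h>

   #define PAGE_SIZE_4K   0x1000UL
   #define PAGE_SIZE_2M   0x200000UL
   #define PAGE_SIZE_1G   0x40000000UL

   /* Pick the largest page size usable for the next chunk of the mapping. */
   static uint64_t choose_page_size(uint64_t hva, uint64_t hpa, uint64_t remaining)
   {
       if ((((hva | hpa) & (PAGE_SIZE_1G - 1UL)) == 0UL) &&
           (remaining >= PAGE_SIZE_1G)) {
           return PAGE_SIZE_1G;   /* set a PDPT entry for a 1GB hugepage */
       }

       if ((((hva | hpa) & (PAGE_SIZE_2M - 1UL)) == 0UL) &&
           (remaining >= PAGE_SIZE_2M)) {
           return PAGE_SIZE_2M;   /* set a PDT entry for a 2MB hugepage */
       }

       return PAGE_SIZE_4K;       /* otherwise fall back to a 4KB PT entry */
   }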
Memory Pages Pool Functions
===========================
Memory pages pool functions provide dynamic management of multiple
4KB page-size memory blocks, used by the hypervisor to store internal
data. Through these functions, the hypervisor can allocate and
deallocate pages.
Data Flow Design
================
The physical memory management unit provides MMU 4-level page tables
creating and updating services, MMU page tables switching service, SMEP
enable service, and HPA/HVA retrieving service to other units.
:numref:`mem-data-flow-physical` shows the data flow diagram
of physical memory management.
.. figure:: images/mem-image45.png
:align: center
:name: mem-data-flow-physical
Data Flow of Hypervisor Physical Memory Management
Interfaces Design
=================
MMU Initialization
------------------
.. doxygenfunction:: enable_smep
:project: Project ACRN
.. doxygenfunction:: enable_paging
:project: Project ACRN
.. doxygenfunction:: init_paging
:project: Project ACRN
Address Space Translation
-------------------------
.. doxygenfunction:: hpa2hva
:project: Project ACRN
.. doxygenfunction:: hva2hpa
:project: Project ACRN
Hypervisor Memory Virtualization
********************************
The hypervisor provides a contiguous region of physical memory for the SOS
and each UOS. It also guarantees that the SOS and UOS cannot access
code and internal data in the hypervisor, and that each UOS cannot access
code and internal data of the SOS and other UOSes.
The hypervisor:
- enables EPT and VPID hardware virtualization features,
- establishes EPT page tables for SOS/UOS,
- provides EPT page tables operations services,
- virtualizes MTRR for SOS/UOS,
- provides VPID operations services,
- provides services for address spaces translation between GPA and HPA, and
- provides services for data transfer between hypervisor and virtual machine.
Memory Virtualization Capability Checking
=========================================
In the hypervisor, memory virtualization provides an EPT/VPID capability
checking service and an EPT hugepage support checking service. Before the HV
enables memory virtualization and uses EPT hugepages, these services need
to be invoked by other units.
Data Transfer between Different Address Spaces
==============================================
In ACRN, different memory space management is used in the hypervisor,
Service OS, and User OS to achieve spatial isolation. Between memory
spaces, there are different kinds of data transfer; for example, a SOS/UOS
may issue a hypercall to request hypervisor services that involve data
transfer, or, when the hypervisor does instruction emulation, the HV
needs to access the guest instruction pointer register to fetch guest
instruction data.
Access GPA from Hypervisor
--------------------------
When the hypervisor needs to access a GPA for data transfer, the caller from
the guest must make sure this memory range's GPA is contiguous. But the
corresponding HPA in the hypervisor could be discontiguous (especially for a
UOS under the hugetlb allocation mechanism). For example, a 4M GPA range may
map to 2 different 2M host-physical hugepages. The ACRN hypervisor must take
care of this kind of data transfer by doing an EPT page walk based on the
HPA, as sketched below.
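
A minimal sketch of this chunked copy is shown below. The translation
services ``gpa2hpa()`` and ``hpa2hva()`` are the APIs listed later in this
document, but their exact signatures here, the chunk size, and the
surrounding helper are assumptions for illustration.

.. code-block:: c

   #include <stdint.h>
   #include <string.h>

   #define CHUNK_2M 0x200000UL   /* illustrative chunk size (one 2MB hugepage) */

   /* Copy 'size' bytes starting at guest physical address 'gpa' into 'dst'.
    * One EPT walk is done per chunk because a contiguous GPA range may map
    * to discontiguous HPA (e.g. different 2MB host hugepages). */
   static void sketch_copy_from_gpa(struct acrn_vm *vm, void *dst,
                                    uint64_t gpa, uint64_t size)
   {
       while (size > 0UL) {
           uint64_t offset = gpa & (CHUNK_2M - 1UL);
           uint64_t len = CHUNK_2M - offset;
           uint64_t hpa = gpa2hpa(vm, gpa);   /* EPT walk for this chunk */

           if (len > size) {
               len = size;
           }

           (void)memcpy(dst, hpa2hva(hpa), (size_t)len);

           dst = (uint8_t *)dst + len;
           gpa += len;
           size -= len;
       }
   }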
Access GVA from Hypervisor
--------------------------
When the hypervisor needs to access a GVA for data transfer, both the GPA and
HPA ranges could be discontiguous. The ACRN hypervisor must
watch for this kind of data transfer and handle it by doing page
walks based on both the GPA and the HPA.
EPT Page Tables Operations
==========================
The hypervisor should use the minimum number of memory pages to map from
guest-physical address (GPA) space into host-physical address (HPA)
space.

- If a 1GB hugepage can be used for the GPA space mapping, the
  corresponding EPT PDPT entry shall be set for this 1GB hugepage.
- If a 1GB hugepage can't be used for the GPA space mapping but a 2MB hugepage
  can be, the corresponding EPT PDT entry shall be set for this 2MB
  hugepage.
- If neither a 1GB hugepage nor a 2MB hugepage can be used for the GPA
  space mapping, the corresponding EPT PT entry shall be set.
If the memory type or access rights of a page are updated, or some GPA space
is deleted, this will lead to the corresponding EPT page being split. The
hypervisor should still keep using the minimum number of EPT pages to map from
GPA space into HPA space.
The hypervisor provides services to add, modify, and delete EPT
guest-physical mappings, to deallocate EPT page tables, and to invalidate
EPT guest-physical mappings.
Virtual MTRR
************
In ACRN, the hypervisor only virtualizes the MTRR fixed range (0~1MB).
The HV sets the fixed-range MTRRs as Write-Back for the UOS, and the SOS reads
the native fixed-range MTRRs set by the BIOS.
If the guest physical address is not in the fixed range (0~1MB), the
hypervisor uses the default memory type in the MTRR (Write-Back).

When the guest disables MTRRs, the HV sets the guest address memory type
as UC.

If the guest physical address is in the fixed range (0~1MB), the HV sets the
memory type according to the fixed virtual MTRRs.

When the guest enables MTRRs, the guest MTRRs themselves have no direct effect
on the memory type used for accesses to the GPA. Instead, the HV intercepts
MTRR MSR register accesses through MSR-access VM exits and updates the memory
type field in the EPT PTE according to the memory type selected by the MTRRs.
This combines with the PAT entry in the PAT MSR (which is determined by the
PAT, PCD, and PWT bits from the guest paging structures) to determine the
effective memory type.
VPID operations
===============
The virtual-processor identifier (VPID) is a hardware feature to optimize
TLB management. When VPID is enabled, hardware adds a tag to the TLB entries of
a logical processor and can cache information for multiple linear-address
spaces. VMX transitions may then retain cached information when the logical
processor switches to a different address space, avoiding unnecessary
TLB flushes.
In ACRN, a unique VPID must be allocated for each virtual CPU
when it is created. The logical processor invalidates linear
mappings and combined mappings associated with all VPIDs (except VPID
0000H), and with all PCIDs, when it launches the virtual
CPU. The logical processor invalidates all linear mappings and combined
mappings associated with a specified VPID when pending-request handling needs
to invalidate the cached mappings of that VPID. A minimal allocation sketch
is shown below.
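
This sketch shows one way to allocate a unique, non-zero VPID per vCPU; the
allocator itself is an illustrative assumption (VPID 0000H is reserved, so
allocation starts at 1).

.. code-block:: c

   #include <stdint.h>

   #define VPID_MAX 0xFFFFU   /* VPIDs are 16 bits wide; 0000H is reserved */

   static uint32_t next_vpid = 1U;

   /* Allocate a unique, non-zero VPID for a newly created virtual CPU.
    * Returns 0 if the VPID space is exhausted. */
   static uint16_t allocate_vpid(void)
   {
       if (next_vpid > VPID_MAX) {
           return 0U;
       }
       return (uint16_t)next_vpid++;
   }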
Data Flow Design
================
The memory virtualization unit includes address space translation
functions, data transfer functions, VM EPT operation functions,
VPID operation functions, VM exit handling for EPT violation and EPT
misconfiguration, and MTRR virtualization functions. This unit handles
guest-physical mapping updates by creating or updating the related EPT page
tables. It virtualizes MTRRs for the guest OS by updating the related EPT page
tables. It handles address translation from GPA to HPA by walking the EPT
page tables. It copies data from the VM into the HV, or from the HV to the VM,
by walking the guest MMU page tables and the EPT page tables. It provides
services to allocate a VPID for each virtual CPU and to invalidate the TLB
entries associated with a VPID. It handles VM exits caused by EPT violation
and EPT misconfiguration. :numref:`mem-flow-mem-virt` describes the data flow
diagram of the memory virtualization unit.
.. figure:: images/mem-image84.png
:align: center
:name: mem-flow-mem-virt
Data Flow of Hypervisor Memory Virtualization
Data Structure Design
=====================
EPT Memory Type Definition:
.. doxygengroup:: ept_mem_type
:project: Project ACRN
:content-only:
EPT Memory Access Right Definition:
.. doxygengroup:: ept_mem_access_right
:project: Project ACRN
:content-only:
Interfaces Design
=================
The memory virtualization unit interacts with external units through VM
exit and APIs.
VM Exit about EPT
=================
There are two VM exit handlers for EPT violation and EPT
misconfiguration in the hypervisor. EPT page tables are
always configured correctly for SOS and UOS. If EPT misconfiguration is
detected, a fatal error is reported by HV. The hypervisor
uses EPT violation to intercept MMIO access to do device emulation. EPT
violation handling data flow is described in
:ref:`instruction-emulation`.
Memory Virtualization APIs
==========================
Here is a list of major memory related APIs in HV:
EPT/VPID Capability Checking
----------------------------
Data Transferring between hypervisor and VM
-------------------------------------------
.. doxygenfunction:: copy_from_gpa
:project: Project ACRN
.. doxygenfunction:: copy_to_gpa
:project: Project ACRN
.. doxygenfunction:: copy_from_gva
:project: Project ACRN
Address Space Translation
-------------------------
.. doxygenfunction:: gpa2hpa
:project: Project ACRN
.. doxygenfunction:: sos_vm_hpa2gpa
:project: Project ACRN
EPT
---
.. doxygenfunction:: ept_add_mr
:project: Project ACRN
.. doxygenfunction:: ept_del_mr
:project: Project ACRN
.. doxygenfunction:: ept_modify_mr
:project: Project ACRN
.. doxygenfunction:: destroy_ept
:project: Project ACRN
.. doxygenfunction:: invept
:project: Project ACRN
.. doxygenfunction:: ept_misconfig_vmexit_handler
:project: Project ACRN
Virtual MTRR
------------
.. doxygenfunction:: init_vmtrr
:project: Project ACRN
.. doxygenfunction:: write_vmtrr
:project: Project ACRN
.. doxygenfunction:: read_vmtrr
:project: Project ACRN
VPID
----
.. doxygenfunction:: flush_vpid_single
:project: Project ACRN
.. doxygenfunction:: flush_vpid_global
:project: Project ACRN
Service OS Memory Management
****************************
After the ACRN hypervisor starts, it creates the Service OS as its first
VM. The Service OS runs all the native device drivers, manages the
hardware devices, and provides I/O mediation to guest VMs. The Service
OS is in charge of memory allocation for guest VMs as well.
ACRN hypervisor passes the whole system memory access (except its own
part) to the Service OS. The Service OS must be able to access all of
the system memory except the hypervisor part.
Guest Physical Memory Layout - E820
===================================
The ACRN hypervisor passes the original E820 table to the Service OS
after filtering out its own part. So from Service OS's view, it sees
almost all the system memory as shown here:
.. figure:: images/mem-image3.png
:align: center
:width: 900px
:name: sos-mem-layout
SOS Physical Memory Layout
Host to Guest Mapping
=====================
ACRN hypervisor creates Service OS's host (HPA) to guest (GPA) mapping
(EPT mapping) through the function ``prepare_sos_vm_memmap()``
when it creates the SOS VM. It follows these rules:
- Identical mapping
- Map all memory range with UNCACHED type
- Remap RAM entries in E820 (revised) with WRITE-BACK type
- Unmap ACRN hypervisor memory range
- Unmap ACRN hypervisor emulated vLAPIC/vIOAPIC MMIO range
The host-to-guest mapping is static for the Service OS; it does not
change after the Service OS begins running. Each native device driver
can access its MMIO through this static mapping. EPT violations are used only
for vLAPIC/vIOAPIC emulation in the hypervisor for the Service OS
VM. A set-up sketch is shown below.
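
Under the rules listed above, the SOS mapping set-up can be sketched as the
sequence below. The wrapper names (``map_region()``, ``unmap_region()``) and
the platform variables are illustrative assumptions around the EPT mapping
services described earlier; they are not the actual ``prepare_sos_vm_memmap()``
code.

.. code-block:: c

   /* Illustrative SOS host-to-guest (EPT) mapping set-up; all names and
    * variables here are assumptions for this sketch. */
   static void sketch_prepare_sos_memmap(struct acrn_vm *sos_vm)
   {
       /* 1. identical mapping of the whole platform, UNCACHED by default */
       map_region(sos_vm, 0UL, 0UL, platform_top_address, MEM_TYPE_UC);

       /* 2. remap RAM entries from the (revised) E820 as WRITE-BACK */
       for (uint32_t i = 0U; i < e820_entry_count; i++) {
           if (e820[i].type == E820_TYPE_RAM) {
               map_region(sos_vm, e820[i].baseaddr, e820[i].baseaddr,
                          e820[i].length, MEM_TYPE_WB);
           }
       }

       /* 3. unmap the hypervisor's own memory range */
       unmap_region(sos_vm, hv_start_hpa, hv_mem_size);

       /* 4. unmap the emulated vLAPIC/vIOAPIC MMIO ranges */
       unmap_region(sos_vm, vlapic_mmio_base, vlapic_mmio_size);
       unmap_region(sos_vm, vioapic_mmio_base, vioapic_mmio_size);
   }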
Trusty
******
For an Android User OS, there is a secure world named trusty world,
whose memory must be secured by the ACRN hypervisor and
must not be accessible by the SOS or by the UOS normal world.
.. figure:: images/mem-image18.png
:align: center
UOS Physical Memory Layout with Trusty
@@ -0,0 +1,367 @@
.. _partition-mode-hld:
Partition mode
##############
ACRN is a type-1 hypervisor that supports running multiple guest operating
systems (OSes). Typically, the platform BIOS/boot-loader boots ACRN, and
ACRN loads one or more guest OSes. Refer to :ref:`hv-startup` for
details on the start-up flow of the ACRN hypervisor.
ACRN supports two modes of operation: Sharing mode and Partition mode.
This document describes ACRN's high-level design for Partition mode
support.
.. contents::
:depth: 2
:local:
Introduction
************
In partition mode, ACRN provides guests with exclusive access to cores,
memory, cache, and peripheral devices. Partition mode enables developers
to dedicate resources exclusively among the guests. However there is no
support today in x86 hardware or in ACRN to partition resources such as
peripheral buses (e.g. PCI) or memory bandwidth. Cache partitioning
technology, such as Cache Allocation Technology (CAT) in x86, can be
used by developers to partition Last Level Cache (LLC) among the guests.
(Note: ACRN support for x86 CAT is on the roadmap, but not currently
supported).
ACRN expects static partitioning of resources either by code
modification for guest configuration or through compile-time config
options. All the devices exposed to the guests are either physical
resources or emulated in the hypervisor. So, there is no need for
device-model and Service OS. :numref:`pmode2vms` shows a partition mode
example of two VMs with exclusive access to physical resources.
.. figure:: images/partition-image3.png
:align: center
:name: pmode2vms
Partition Mode example with two VMs
Guest info
**********
ACRN uses multi-boot info passed from the platform boot-loader to know
the location of each guest kernel in memory. ACRN creates a copy of each
guest kernel into each of the guests' memory. Current implementation of
ACRN requires developers to specify kernel parameters for the guests as
part of guest configuration. ACRN picks up kernel parameters from guest
configuration and copies them to the corresponding guest memory.
.. figure:: images/partition-image18.png
:align: center
ACRN set-up for guests
**********************
Cores
=====
ACRN requires the developer to specify the number of guests and the
cores dedicated for each guest. Also the developer needs to specify
the physical core used as the Boot Strap Processor (BSP) for each guest. As
the processors are brought up in the hypervisor, it checks whether they are
configured as the BSP of any guest. If a processor is the BSP of a guest,
ACRN proceeds to build the memory mapping, mptable, E820 entries, and zero
page for that guest. As described in
`Guest info`_, ACRN creates copies of guest kernel and kernel
parameters into guest memory. :numref:`partBSPsetup` explains these
events in chronological order.
.. figure:: images/partition-image7.png
:align: center
:name: partBSPsetup
Memory
======
For each guest in partition mode, the ACRN developer specifies the guest
memory size and its starting host physical address in the guest
configuration. There is no support for HIGHMEM for
partition mode guests. The developer needs to take care of two aspects
when assigning host memory to the guests:

1) The sum of the guest PCI hole and guest "System RAM" is less than 4GB.
2) The starting host physical address and the size are chosen so that the
   region does not overlap with any reserved regions in the host E820.
ACRN creates an EPT mapping for the guest between GPA (0, memory size) and
HPA (starting address in the guest configuration, memory size); a
configuration sketch is shown below.
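
A per-guest memory configuration entry might look like the sketch below; the
structure and field names are illustrative assumptions, not the actual ACRN
configuration format.

.. code-block:: c

   #include <stdint.h>

   /* Hypothetical per-guest memory configuration for partition mode. */
   struct vm_memory_config {
       uint64_t start_hpa;   /* host physical start address; must not overlap
                                reserved regions in the host E820 */
       uint64_t size;        /* guest memory size; guest PCI hole plus
                                "System RAM" must stay below 4GB */
   };

   /* Example: guest GPA range [0, 2GB) backed by HPA starting at 4GB. */
   static const struct vm_memory_config vm1_mem_config = {
       .start_hpa = 0x100000000UL,
       .size      = 0x80000000UL,
   };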
E820 and zero page info
=======================
A default E820 is used for all the guests in partition mode. The following
table shows the reference E820 layout. The zero page is created with this
E820 info for all the guests; a sketch of the corresponding entries follows
the table.
+------------------------+
| RAM |
| |
| 0 - 0xEFFFFH |
+------------------------+
| RESERVED (MPTABLE) |
| |
| 0xF0000H - 0x100000H |
+------------------------+
| RAM |
| |
| 0x100000H - LOWMEM |
+------------------------+
| RESERVED |
+------------------------+
| PCI HOLE |
+------------------------+
| RESERVED |
+------------------------+
Platform info - mptable
=======================
ACRN, in partition mode, uses an mptable to convey platform info to each
guest. Using this platform information, the number of cores used by each
guest, and whether the guest needs devices with INTx, ACRN builds the
mptable and copies it to the guest memory. In partition mode, ACRN passes
physical APIC IDs to the guests.
I/O - Virtual devices
=====================
Port I/O is supported for PCI device config space 0xcfc and 0xcf8, vUART
0x3f8, vRTC 0x70 and 0x71, and vPIC ranges 0x20/21, 0xa0/a1, and
0x4d0/4d1. MMIO is supported for vIOAPIC. ACRN exposes a virtual
host-bridge at BDF (Bus Device Function) 0.0:0 to each guest. Access to
256 bytes of config space for virtual host bridge is emulated.
I/O - Pass-thru devices
=======================
ACRN, in partition mode, supports passing thru PCI devices on the
platform. All the pass-thru devices are exposed as child devices under
the virtual host bridge. ACRN does not support either passing thru
bridges or emulating virtual bridges. Pass-thru devices should be
statically allocated to each guest using the guest configuration. ACRN
expects the developer to provide the mapping from each device's virtual BDF
to its physical BDF for all the pass-thru devices as
part of each guest configuration.
Run-time ACRN support for guests
********************************
ACRN, in partition mode, supports an option to pass through the LAPIC of the
physical CPUs to the guest. ACRN expects developers to specify whether the
guest needs LAPIC pass-thru in the guest configuration. When the guest
configures the vLAPIC as x2APIC, and the guest configuration has LAPIC
pass-thru enabled, ACRN passes the LAPIC to the guest. The guest can then
access the LAPIC hardware directly without hypervisor interception. During
guest runtime, this option differentiates how ACRN supports
inter-processor interrupt handling and device interrupt handling, as
discussed in detail in the corresponding sections below.
.. figure:: images/partition-image16.png
:align: center
Guest SMP boot flow
===================
The core APIC IDs are reported to the guest using mptable info. SMP boot
flow is similar to sharing mode. Refer to :ref:`vm-startup`
for the guest SMP boot flow in ACRN. Partition mode guest startup is the same
as SOS startup in sharing mode.
Inter-processor Interrupt (IPI) Handling
========================================
Guests w/o LAPIC pass-thru
--------------------------
For guests without LAPIC pass-thru, IPIs between guest CPUs are handled in
the same way as sharing mode of ACRN. Refer to :ref:`virtual-interrupt-hld`
for more details.
Guests w/ LAPIC pass-thru
-------------------------
ACRN supports LAPIC pass-thru if and only if the guest is using x2APIC mode
for the vLAPIC. In LAPIC pass-thru mode, writes to the Interrupt Command
Register (ICR) x2APIC MSR are intercepted. The guest writes the IPI info,
including the vector and destination APIC IDs, to the ICR. Upon an IPI request
from the guest, ACRN does a sanity check on the destination processors
programmed into the ICR. If the destination is a valid target for the guest,
ACRN sends an IPI with the same vector from the ICR to the physical CPUs
corresponding to the destination processor info in the ICR. The sketch after
the figure below illustrates this check.
.. figure:: images/partition-image14.png
:align: center
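
A sketch of this destination check is shown below; the helper names and
structure fields are assumptions for illustration, not the actual ACRN
implementation.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   /* Illustrative handling of a trapped x2APIC ICR write for a guest with
    * LAPIC pass-thru; all helpers here are hypothetical. */
   static void sketch_handle_icr_write(struct acrn_vcpu *vcpu, uint64_t icr_value)
   {
       uint32_t vector  = (uint32_t)(icr_value & 0xFFUL);
       uint32_t dest_id = (uint32_t)(icr_value >> 32U);  /* x2APIC destination */

       /* Sanity check: the destination must be a vCPU of this guest. */
       struct acrn_vcpu *target = find_vcpu_by_lapic_id(vcpu->vm, dest_id);

       if (target != NULL) {
           /* Forward the IPI with the same vector to the physical CPU
            * backing the destination vCPU. */
           send_ipi_to_pcpu(pcpu_id_of(target), vector);
       }
       /* Otherwise the request is not forwarded. */
   }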
Pass-thru device support
========================
Configuration space access
--------------------------
ACRN emulates the Configuration Space Address (0xcf8) and Configuration Space
Data (0xcfc) I/O ports for guests to access the configuration space of PCI
devices. Within the config space of a device, the Base Address Registers
(BARs), at offsets 0x10 to 0x24, provide information about the resources
(I/O and MMIO) used by the PCI device. ACRN virtualizes the BAR registers
and, for the rest of the config space, forwards reads and writes to the
physical config space of the pass-thru devices. Refer to the `I/O`_ section
below for more details.
.. figure:: images/partition-image1.png
:align: center
DMA
---
ACRN developers need to statically define the pass-thru devices for each
guest using the guest configuration. For devices to DMA to/from guest
memory directly, ACRN parses the list of pass-thru devices for each
guest and creates context entries in the VT-d remapping hardware. The EPT
page tables created for the guest are used as the VT-d page tables.
I/O
---
ACRN supports I/O for pass-thru devices with two restrictions:

1) Only MMIO is supported, so developers need to expose I/O BARs as
   not present in the guest configuration.
2) Only the 32-bit MMIO BAR type is supported.
As the guest PCI sub-system scans the PCI bus and assigns a Guest Physical
Address (GPA) to the MMIO BAR, ACRN maps the GPA to the address in the
physical BAR of the pass-thru device using EPT. The following timeline chart
explains how PCI devices are assigned to a guest and how BARs are mapped upon
guest initialization.
.. figure:: images/partition-image13.png
:align: center
Interrupt Configuration
-----------------------
ACRN supports both legacy (INTx) and MSI interrupts for pass-thru
devices.
INTx support
~~~~~~~~~~~~
ACRN expects developers to identify the interrupt line info (config space
offset 0x3C) of the pass-thru device and build an interrupt entry in
the mptable for the corresponding guest. As the guest configures the vIOAPIC
interrupt RTE, ACRN writes the info from the guest RTE into the
physical IOAPIC RTE. When the guest masks the RTE in the vIOAPIC, ACRN masks
the interrupt RTE in the physical IOAPIC. Level-triggered interrupts are not
supported.
MSI support
~~~~~~~~~~~
Guest reads/writes to the PCI configuration space for configuring MSI
interrupts using the address, data, and control registers are passed through
to the physical device. Refer to `Configuration
space access`_ for details on how PCI configuration space is emulated.
Virtual device support
======================
ACRN provides read-only vRTC support for partition mode guests. Writes
to the data port are discarded.
For port I/O to ports other than vPIC, vRTC, or vUART, reads return 0xFF and
writes are discarded.
Interrupt delivery
==================
Guests w/o LAPIC pass-thru
--------------------------
In ACRN partition mode, interrupts stay disabled after a vmexit. The
processor does not take interrupts while it is executing in VMX root
mode. ACRN configures the processor to take a vmexit upon external
interrupt if the processor is executing in VMX non-root mode. Upon an
external interrupt, after sending an EOI to the physical LAPIC, ACRN
injects the vector into the vLAPIC of the vCPU currently running on the
processor. Guests using Linux as the kernel use vectors less than 0xEC
for device interrupts.
.. figure:: images/partition-image20.png
:align: center
Guests w/ LAPIC pass-thru
-------------------------
For guests with LAPIC pass-thru, ACRN does not configure vmexit upon
external interrupts. There is no vmexit upon device interrupts and they are
handled by the guest IDT.
Hypervisor IPI service
======================
ACRN needs IPIs for events such as flushing TLBs across CPUs, sending virtual
device interrupts (e.g. vUART to vCPUs), and others.
Guests w/o LAPIC pass-thru
--------------------------
Hypervisor IPIs work the same way as in sharing mode.
Guests w/ LAPIC pass-thru
-------------------------
Since external interrupts are passed through to the guest IDT, IPIs do not
trigger a vmexit. ACRN instead uses NMI delivery mode, and NMI exiting is
enabled for the vCPUs. At the time of an NMI interrupt on the target
processor, if the processor is in non-root mode, a vmexit happens on the
processor and the event mask is checked for servicing the events.
Debug Console
=============
For details on how hypervisor console works, refer to
:ref:`hv-console`.
For a guest console in partition mode, ACRN provides an option to pass
``vmid`` as an argument to ``vm_console``. The vmid is the same as the one
the developer uses in the guest configuration.
Guests w/o LAPIC pass-thru
--------------------------
Works the same way as sharing mode.
Hypervisor Console
==================
ACRN uses the TSC deadline timer to provide timer services. The hypervisor
console uses a timer on CPU0 to poll characters on the serial device. To
support LAPIC pass-thru, the TSC deadline MSR is passed through and the local
timer interrupt is also delivered to the guest IDT. Instead of the TSC
deadline timer, ACRN then uses the VMX preemption timer to poll the serial
device.
Guest Console
=============
ACRN exposes a vUART to partition mode guests. The vUART uses the vPIC to
inject interrupts to the guest BSP. When a guest has more than one core, at
runtime the vUART might need to inject an interrupt to the guest BSP from
another core (other than the BSP). As mentioned in `Hypervisor IPI
service`_, ACRN uses NMI delivery mode to notify the CPU running the guest's
BSP.
@@ -0,0 +1,49 @@
.. _pm_hld:
Power Management
################
System PM module
****************
The PM module in the hypervisor does three things:
- When all UOSes enter a low-power state, VM management notifies the SOS
  lifecycle service and triggers the SOS to enter a low-power state.
  The SOS follows its own standard low-power state entry process and
  writes the ACPI control register to put itself into the low-power state.
  The hypervisor traps the ACPI control register write and
  emulates the SOS low-power state entry.

- Once the SOS low-power emulation is done, the hypervisor handles its
  own low-power state transition.

- Once the system resumes from low-power mode, the hypervisor handles its
  own resume and emulates the SOS resume too.
It is assumed that SOS does not trigger any power state transition until
the VM manager of ACRN notifies it that all UOSes are inactive and SOS
offlines all its virtual APs.
:numref:`pm-low-power-transition` shows the SOS/Hypervisor low-power
state transition process. SOS triggers power state transition by
writing ACPI control register on its virtual BSP (which is pinned to the
physical BSP). The hypervisor then does the following in sequence before
it writes to the physical ACPI control register to trigger physical
power state transition:
- Pauses the SOS.
- Offlines all physical APs.
- Saves the contexts of the console, the SOS ioapic, the I/O MMU, the SOS
  lapic, and the virtual BSP.
- Saves the context of the physical BSP.
When exiting from low-power mode, the hypervisor does similar steps in
reverse order to restore contexts, start APs and resume SOS. SOS is
responsible for starting its own virtual APs as well as UOSes.
.. figure:: images/pm-image24-105.png
:align: center
:name: pm-low-power-transition
SOS/Hypervisor low power state transition process
@@ -0,0 +1,207 @@
.. _hv-startup:
Hypervisor Startup
##################
This section is an overview of the ACRN hypervisor startup.
The ACRN hypervisor
compiles to a 32-bit multiboot-compliant ELF file.
The bootloader (ABL or SBL) loads the hypervisor according to the
addresses specified in the ELF header. The BSP starts the hypervisor
with an initial state compliant with the multiboot 1 specification, after the
bootloader prepares full configurations including ACPI, E820, etc.
The HV startup has two parts: the native startup followed by
VM startup.
Native Startup
**************
.. figure:: images/hld-image107.png
:align: center
:name: hvstart-nativeflow
Hypervisor Native Startup Flow
Native startup sets up a baseline environment for HV, including basic
memory and interrupt initialization as shown in
:numref:`hvstart-nativeflow`. Here is a short
description for the flow:
- **BSP Startup:** The starting point for bootstrap processor.
- **Relocation**: relocate the hypervisor image if the hypervisor image
is not placed at the assumed base address.
- **UART Init:** Initialize a pre-configured UART device used
as the base physical console for HV and Service OS.
- **Shell Init:** Start a command shell for HV accessible via the UART.
- **Memory Init:** Initialize memory type and cache policy, and creates
MMU page table mapping for HV.
- **Interrupt Init:** Initialize interrupt and exception for native HV
including IDT and ``do_IRQ`` infrastructure; a timer interrupt
framework is then built. The native/physical interrupts will go
through this ``do_IRQ`` infrastructure then distribute to special
targets (HV or VMs).
- **Start AP:** The BSP sends the ``INIT-SIPI-SIPI`` IPI sequence to start the
  other native APs (application processors). Each AP initializes its
  own memory and interrupts, notifies the BSP on completion, and
  enters the default idle loop.
Symbols in the hypervisor are placed with an assumed base address, but
the bootloader may not place the hypervisor at that specified base. In
such case the hypervisor will relocate itself to where the bootloader
loads it.
Here is a summary of CPU and memory initial states that are set up after
native startup.
CPU
The ACRN hypervisor brings all physical processors to 64-bit IA32e
mode, with the assumption that the BSP starts in protected mode where
segmentation and paging set an identical mapping of the first 4G of
addresses without permission restrictions. The control registers and
some MSRs are set as follows:
- cr0: The following features are enabled: paging, write protection,
protection mode, numeric error and co-processor monitoring.
- cr3: refer to the initial state of memory.
- cr4: The following features are enabled: physical address extension,
machine-check, FXSAVE/FXRSTOR, SMEP, VMX operation and unmask
SIMD FP exception. The other features are disabled.
- MSR_IA32_EFER: only IA32e mode is enabled.
- MSR_IA32_FS_BASE: the address of stack canary, used for detecting
stack smashing.
- MSR_IA32_TSC_AUX: a unique logical ID is set for each physical
processor.
- stack: each physical processor has a separate stack.
Memory
All physical processors are in 64-bit IA32e mode after
startup. The GDT holds four entries, one unused, one for code and
another for data, both of which have a base of all 0's and a limit of
all 1's, and the other for 64-bit TSS. The TSS only holds three stack
pointers (for machine-check, double fault and stack fault) in the
interrupt stack table (IST) which are different across physical
processors. LDT is disabled.
Refer to section 3.5.2 for a detailed description of interrupt-related
initial states, including IDT and physical PICs.
After BSP detects that all APs are up, BSP will start creating the first
VM, i.e. SOS, as explained in the next section.
.. _vm-startup:
VM Startup
**********
SOS is created and launched on the physical BSP after the hypervisor
initializes itself. Meanwhile, the APs enter the default idle loop
(refer to :ref:`VCPU_lifecycle` for details), waiting for any vCPU to be
scheduled to them.
:numref:`hvstart-vmflow` illustrates a high-level execution flow of
creating and launching a VM, applicable to both SOS and UOS. One major
difference in the creation of SOS and UOS is that SOS is created by the
hypervisor, while the creation of UOSes is triggered by the DM in SOS.
The main steps include:
- **Create VM**: A VM structure is allocated and initialized. A unique
  VM ID is picked, the EPT is created, the I/O bitmap is set up, I/O
  emulation handlers are initialized and registered, and virtual CPUID
  entries are filled. For the SOS an additional E820 table is prepared.
- **Create vCPUs:** Create the vCPUs, assign each one the physical processor
  it is pinned to, a unique-per-VM vCPU ID, and a globally unique VPID,
  and initialize its virtual lapic and MTRR. For the SOS one vCPU is
  created for each physical CPU on the platform. For a UOS the DM
  determines the number of vCPUs to be created.
- **SW Load:** The BSP of a VM also prepares for each VM's SW
configuration including kernel entry address, ramdisk address,
bootargs, zero page etc. This is done by the hypervisor for SOS
while by DM for UOS.
- **Schedule vCPUs:** The vCPUs are scheduled to the corresponding
physical processors for execution.
- **Init VMCS:** Initialize vCPU's VMCS for its host state, guest
state, execution control, entry control and exit control. It's
the last configuration before vCPU runs.
- **vCPU thread:** The vCPU starts to run. The "Primary CPU" starts running
  the kernel image configured by SW Load; a "Non-Primary CPU" waits for the
  INIT-SIPI-SIPI IPI sequence triggered by its "Primary CPU".
.. figure:: images/hld-image104.png
:align: center
:name: hvstart-vmflow
Hypervisor VM Startup Flow
SW configuration for Service OS (SOS_VM):
- **ACPI**: HV passes the entire ACPI table from bootloader to Service
OS directly. Legacy mode is currently supported as the ACPI table
is loaded at F-Segment.
- **E820**: HV passes e820 table from bootloader through multi-boot
information after the HV reserved memory (32M for example) is
filtered out.
- **Zero Page**: HV prepares the zero page at the high end of Service
OS memory which is determined by SOS_VM guest FIT binary build. The
zero page includes configuration for ramdisk, bootargs and e820
entries. The zero page address will be set to "Primary CPU" RSI
register before VCPU gets run.
- **Entry address**: HV will copy Service OS kernel image to 0x1000000
as entry address for SOS_VM's "Primary CPU". This entry address will
be set to "Primary CPU" RIP register before VCPU gets run.
SW configuration for User OS (VMx):
- **ACPI**: the virtual ACPI table is built by DM and put at VMx's
F-Segment. Refer to :ref:`hld-io-emulation` for details.
- **E820**: the virtual E820 table is built by the DM then passed to
the zero page. Refer to :ref:`hld-io-emulation` for details.
- **Zero Page**: the DM prepares the zero page at location of
"lowmem_top - 4K" in VMx. This location is set into VMx's
"Primary CPU" RSI register in **SW Load**.
- **Entry address**: the DM will copy User OS kernel image to 0x1000000
as entry address for VMx's "Primary CPU". This entry address will
be set to "Primary CPU" RIP register before VCPU gets run.
Here is initial mode of vCPUs:
+------------------------------+-------------------------------+
| VM and Processor Type | Initial Mode |
+=============+================+===============================+
| SOS | BSP | Same as physical BSP |
| +----------------+-------------------------------+
| | AP | Real Mode |
+-------------+----------------+-------------------------------+
| UOS | BSP | Real Mode |
| +----------------+-------------------------------+
| | AP | Real Mode |
+-------------+----------------+-------------------------------+
Note that SOS is started with the same number of vCPUs as the physical
CPUs to boost the boot-up. SOS will offline the APs right before it
starts any UOS.
@@ -0,0 +1,61 @@
.. _timer-hld:
Timer
#####
Because ACRN is a flexible, lightweight reference hypervisor, we provide
limited timer management services:
- Only lapic tsc-deadline timer is supported as the clock source.
- A timer can only be added on the logical CPU for a process or thread. Timer
  scheduling and timer migration are not supported.
How it works
************
When the system boots, we check that the hardware supports lapic
tsc-deadline timer by checking CPUID.01H:ECX.TSC_Deadline[bit 24]. If
support is missing, we output an error message and panic the hypervisor.
If supported, we register the timer interrupt callback that raises a
timer softirq on each logical CPU and set the lapic timer mode to
tsc-deadline timer mode by writing the local APIC LVT register.
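
A minimal sketch of that capability check is shown below, using the GCC/Clang
``__get_cpuid`` helper for illustration; the hypervisor uses its own CPUID
wrapper.

.. code-block:: c

   #include <cpuid.h>
   #include <stdbool.h>

   /* Check CPUID.01H:ECX.TSC_Deadline[bit 24]. */
   static bool tsc_deadline_supported(void)
   {
       unsigned int eax, ebx, ecx, edx;

       if (__get_cpuid(1U, &eax, &ebx, &ecx, &edx) == 0) {
           return false;   /* CPUID leaf 1 not available */
       }
       return (ecx & (1U << 24)) != 0U;
   }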
Data Structures and APIs
************************
Interfaces Design
=================
.. doxygenfunction:: initialize_timer
:project: Project ACRN
.. doxygenfunction:: timer_expired
:project: Project ACRN
.. doxygenfunction:: add_timer
:project: Project ACRN
.. doxygenfunction:: del_timer
:project: Project ACRN
.. doxygenfunction:: timer_init
:project: Project ACRN
.. doxygenfunction:: calibrate_tsc
:project: Project ACRN
.. doxygenfunction:: us_to_ticks
:project: Project ACRN
.. doxygenfunction:: ticks_to_us
:project: Project ACRN
.. doxygenfunction:: ticks_to_ms
:project: Project ACRN
.. doxygenfunction:: rdtsc
:project: Project ACRN
.. doxygenfunction:: get_tsc_khz
:project: Project ACRN
@@ -0,0 +1,264 @@
.. _virtual-interrupt-hld:
Virtual Interrupt
#################
This section introduces ACRN guest virtual interrupt
management, which includes:
- VCPU request for virtual interrupt kick off,
- vPIC/vIOAPIC/vLAPIC for virtual interrupt injection interfaces,
- physical-to-virtual interrupt mapping for a pass-thru device, and
- the process of VMX interrupt/exception injection.
A guest VM never owns any physical interrupts. All interrupts received by
Guest OS come from a virtual interrupt injected by vLAPIC, vIOAPIC or
vPIC. Such virtual interrupts are triggered either from a pass-through
device or from I/O mediators in SOS via hypercalls. Section 3.8.6
introduces how the hypervisor manages the mapping between physical and
virtual interrupts for pass-through devices.
Emulation for devices is inside the SOS user-space device model, i.e.,
acrn-dm. However, for performance considerations, vLAPIC, vIOAPIC, and vPIC
are emulated inside the HV directly.
From the guest OS point of view, the vPIC uses Virtual Wire Mode via the
vIOAPIC. The Symmetric I/O Mode is shown in :numref:`pending-virt-interrupt`
later in this section.

The following command line options to guest Linux affect whether it uses the
PIC or the IOAPIC:

- **Kernel boot param with vPIC**: add "maxcpu=0" and the guest OS will use
  the PIC
- **Kernel boot param with vIOAPIC**: add "maxcpu=1" (any value other than
  "0") and the guest OS will use the IOAPIC, keeping IOAPIC pin 2 as the
  source of the PIC.
vCPU Request for Interrupt Injection
************************************
The vCPU request mechanism (described in :ref:`pending-request-handlers`) is leveraged
to inject interrupts to a certain vCPU. As mentioned in
:ref:`ipi-management`,
physical vector 0xF0 is used to kick VCPU out of its VMX non-root mode,
used to make a request for virtual interrupt injection or other
requests such as flush EPT.
The eventid supported for virtual interrupt injection includes:
.. doxygengroup:: virt_int_injection
:project: Project ACRN
:content-only:
A *vcpu_make_request* call is necessary for a virtual interrupt
injection. If the target vCPU is running in VMX non-root mode, the request
sends an IPI to kick it out, which leads to an external-interrupt
VM-Exit. In some cases there is no need to send an IPI when making a request,
because the CPU making the request is itself the target vCPU. For
example, the #GP exception request always happens on the current CPU when it
finds that an invalid emulation has happened. An external interrupt for a
pass-thru device always happens on the vCPU the device belongs to, so after it
triggers an external-interrupt VM-Exit, the current CPU is also the
target vCPU.
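
For illustration, a virtual interrupt injection path typically ends with a
request such as the sketch below. ``vcpu_make_request`` and
``ACRN_REQUEST_EVENT`` are the names used in this section; the surrounding
helper and the pending-vector call are hypothetical.

.. code-block:: c

   /* Illustrative tail of a virtual interrupt injection path: record the
    * pending vector, then ask the target vCPU to process an event. If the
    * target vCPU is in VMX non-root mode, the request sends an IPI that
    * forces an external-interrupt VM-Exit. */
   static void sketch_inject_vlapic_intr(struct acrn_vcpu *vcpu, uint32_t vector)
   {
       vlapic_set_pending_vector(vcpu, vector);      /* hypothetical helper */
       vcpu_make_request(vcpu, ACRN_REQUEST_EVENT);  /* kick the target vCPU */
   }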
Virtual LAPIC
*************
The LAPIC is virtualized for all guest types: SOS and UOS. If supported by the
physical processor, APICv Virtual Interrupt Delivery (VID) is enabled
and the Posted-Interrupt feature is supported. Otherwise, the vLAPIC falls
back to the legacy virtual interrupt injection mode.
vLAPIC provides the same features as the native LAPIC:
- Vector mask/unmask
- Virtual vector injections (Level or Edge trigger mode) to vCPU
- vIOAPIC notification of EOI processing
- TSC Timer service
- CR8 support to update the TPR
- INIT/STARTUP handling
vLAPIC APIs
===========
APIs are provided when an interrupt source from vLAPIC needs to inject
an interrupt, for example:
- from LVT like LAPIC timer
- from vIOAPIC for a pass-thru device interrupt
- from an emulated device for an MSI
These APIs will finish by making a request for *ACRN_REQUEST_EVENT.*
.. doxygenfunction:: vlapic_set_local_intr
:project: Project ACRN
.. doxygenfunction:: vlapic_intr_msi
:project: Project ACRN
.. doxygenfunction:: apicv_get_pir_desc_paddr
:project: Project ACRN
EOI processing
==============
EOI virtualization is enabled if APICv virtual interrupt delivery is
supported. Except for level-triggered interrupts, the VM will not exit on
EOI.
Without APICv virtual interrupt delivery support, the vLAPIC requires an
EOI from the guest OS whenever a vector has been acknowledged and processed by
the guest. The vLAPIC behavior is the same as a hardware LAPIC. Once an EOI is
received, it clears the highest-priority vector in the ISR and TMR, and
updates the PPR status. The vLAPIC then notifies the vIOAPIC if the
corresponding vector comes from the vIOAPIC; this only occurs for
level-triggered interrupts.
LAPIC passthrough based on vLAPIC
=================================
LAPIC passthrough is supported based on the vLAPIC after the guest switches to
x2APIC mode. With LAPIC passthrough based on the vLAPIC, the system has the
following characteristics:

* IRQs received by the LAPIC can be handled by the guest VM without a ``vmexit``
* The guest VM always sees virtual LAPIC IDs for security reasons
* Most MSRs are directly accessible from the guest VM except for ``XAPICID``,
  ``LDR``, and ``ICR``. Write operations to ``ICR`` are trapped to avoid
  malicious IPIs. Read operations to ``XAPICID`` and ``LDR`` are trapped so
  that the guest VM always sees the virtual LAPIC IDs instead of the
  physical ones.
Virtual IOAPIC
**************
The vIOAPIC is emulated by the HV when a guest accesses the MMIO GPA range
0xFEC00000-0xFEC01000. The vIOAPIC for the SOS should match the native HW
IOAPIC pin numbers. The vIOAPIC for a UOS provides 48 pins. As the vIOAPIC is
always associated with the vLAPIC, a virtual interrupt injection from the
vIOAPIC finally triggers a request for a vLAPIC event by calling the
vLAPIC APIs.
**Supported APIs:**
.. doxygenfunction:: vioapic_set_irqline_lock
:project: Project ACRN
.. doxygenfunction:: vioapic_set_irqline_nolock
:project: Project ACRN
Virtual PIC
***********
The vPIC is required for TSC calculation. Normally the UOS boots with the
vIOAPIC and vPIC as the sources of external interrupts to the guest. On every
VM Exit, the HV checks whether there are any pending external PIC interrupts.
The vPIC API usage is similar to the vIOAPIC.
ACRN hypervisor emulates a vPIC for each VM based on IO range 0x20~0x21,
0xa0~0xa1 and 0x4d0~0x4d1.
If an interrupt source from the vPIC needs to inject an interrupt, the
following API needs to be called; it will finally make a request for
*ACRN_REQUEST_EXTINT* or *ACRN_REQUEST_EVENT*:
.. doxygenfunction:: vpic_set_irqline
:project: Project ACRN
The following APIs are used to query the vector to be injected and to ACK
the service (that is, to move the interrupt from the interrupt request
register, IRR, to the in-service register, ISR):
.. doxygenfunction:: vpic_pending_intr
:project: Project ACRN
.. doxygenfunction:: vpic_intr_accepted
:project: Project ACRN
Virtual Exception
*****************
When doing emulation, an exception may need to be triggered in the
hypervisor; for example:

- if a guest accesses an invalid vMSR register, the hypervisor needs to
  inject a #GP, or
- during instruction emulation, an instruction fetch may access
  a non-existent page from rip_gva; at that time a #PF needs to be injected.
ACRN hypervisor implements virtual exception injection using these APIs:
.. doxygenfunction:: vcpu_queue_exception
:project: Project ACRN
.. doxygenfunction:: vcpu_inject_extint
:project: Project ACRN
.. doxygenfunction:: vcpu_inject_nmi
:project: Project ACRN
.. doxygenfunction:: vcpu_inject_gp
:project: Project ACRN
.. doxygenfunction:: vcpu_inject_pf
:project: Project ACRN
.. doxygenfunction:: vcpu_inject_ud
:project: Project ACRN
.. doxygenfunction:: vcpu_inject_ss
:project: Project ACRN
The ACRN hypervisor uses the *vcpu_inject_gp/vcpu_inject_pf* functions
to queue exception requests, and follows SDM Vol. 3, Section 6.15, Table 6-5
to generate a double fault if the condition is met.
Virtual Interrupt Injection
***************************
The source of virtual interrupts comes from either DM or assigned
devices.
- **For SOS assigned devices**: all devices are assigned to the SOS
  directly. Whenever a device's physical interrupt occurs, the
  corresponding virtual interrupt is injected to the SOS via
  vLAPIC/vIOAPIC. The SOS does not use the vPIC and does not have emulated
  devices. See section 3.8.5 Device assignment.

- **For UOS assigned devices**: only PCI devices can be assigned to a
  UOS. Virtual interrupt injection follows the same path as for the SOS. A
  virtual interrupt injection operation is triggered when a
  device's physical interrupt occurs.

- **For UOS emulated devices**: the DM (acrn-dm) is responsible for the
  interrupt lifecycle management of UOS emulated devices. The DM knows when
  an emulated device needs to assert a virtual IOAPIC/PIC pin or
  needs to send a virtual MSI vector to the guest. This logic is
  entirely handled by the DM.
.. figure:: images/virtint-image64.png
:align: center
:name: pending-virt-interrupt
Handle pending virtual interrupt
Without APICv virtual interrupt delivery, a virtual interrupt can be
injected only if guest interrupts are allowed. There are many cases
where the guest ``RFLAGS.IF`` is cleared and the guest will not accept any
further interrupts. The HV checks for an available guest IRQ window before
injection.

An NMI is an unmasked interrupt and its injection is always allowed
regardless of the guest IRQ window status. If the guest IRQ
window is not available, the HV enables
``MSR_IA32_VMX_PROCBASED_CTLS_IRQ_WIN (PROCBASED_CTRL.bit[2])`` and
performs VM Entry directly. The injection is then done on the next VM Exit
once the guest issues ``STI (GuestRFLAG.IF=1)``. A sketch of this decision is
shown below.
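
A sketch of this injection decision is shown below; the helper names are
assumptions for illustration.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Illustrative injection decision before VM entry; helpers are hypothetical. */
   static void sketch_try_inject(struct acrn_vcpu *vcpu, uint32_t vector, bool is_nmi)
   {
       if (is_nmi) {
           inject_event(vcpu, vector);       /* NMI injection is always allowed */
           return;
       }

       if (guest_irq_window_open(vcpu)) {
           inject_event(vcpu, vector);       /* guest can accept the interrupt */
       } else {
           /* Enable interrupt-window exiting (PROCBASED_CTRL bit 2) so the
            * injection can be done on the next VM Exit after the guest STI. */
           enable_irq_window_exiting(vcpu);
       }
   }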
Data structures and interfaces
******************************
There is no data structure exported to the other components in the
hypervisor for virtual interrupts. The APIs listed in the previous
sections are meant to be called whenever a virtual interrupt should be
injected or acknowledged.
@@ -0,0 +1,327 @@
.. _vt-d-hld:
VT-d
####
VT-d stands for Intel Virtualization Technology for Directed I/O. It provides
hardware capabilities to assign I/O devices to VMs and to extend the
protection and isolation properties of VMs for I/O operations.
VT-d provides the following main functions:
- **DMA remapping**: for supporting address translations for DMA from
devices.
- **Interrupt remapping**: for supporting isolation and routing of
interrupts from devices and external interrupt controllers to
appropriate VMs.
- **Interrupt posting**: for supporting direct delivery of virtual
interrupts from devices and external controllers to virtual
processors.
ACRN hypervisor supports DMA remapping that provides address translation
capability for PCI pass-through devices, and second-level translation,
which applies to requests-without-PASID. ACRN does not support
First-level / nested translation.
DMAR Engines Discovery
**********************
DMA Remapping Report ACPI table
===============================
For generic platforms, ACRN hypervisor retrieves DMAR information from
the ACPI table, and parses the DMAR reporting structure to discover the
number of DMA-remapping hardware units present in the platform as well as
the devices under the scope of a remapping hardware unit, as shown in
:numref:`dma-remap-report`:
.. figure:: images/vt-d-image90.png
:align: center
:name: dma-remap-report
DMA Remapping Reporting Structure
Pre-parsed DMAR information
===========================
For specific platforms, ACRN hypervisor uses pre-parsed DMA remapping
reporting information directly to save time for hypervisor boot-up.
DMA remapping unit for integrated graphics device
=================================================
Generally, there is a dedicated remapping hardware unit for the Intel
integrated graphics device. ACRN implements GVT-g for graphics, but
GVT-g is not compatible with VT-d. The remapping hardware unit for the
graphics device is disabled on ACRN if GVT-g is enabled. If the graphics
device needs to be passed through to a VM, then the remapping hardware unit
must be enabled.
DMA Remapping
*************
DMA remapping hardware is used to isolate device access to memory,
enabling each device in the system to be assigned to a specific domain
through a distinct set of paging structures.
Domains
=======
A domain is abstractly defined as an isolated environment in the
platform, to which a subset of the host physical memory is allocated.
The memory resource of a domain is specified by the address translation
tables.
Device to Domain Mapping Structure
==================================
VT-d hardware uses root-table and context-tables to build the mapping
between devices and domains as shown in :numref:`vt-d-mapping`.
.. figure:: images/vt-d-image44.png
:align: center
:name: vt-d-mapping
Device to Domain Mapping structures
The root-table is 4-KByte in size and contains 256 root-entries to cover
the PCI bus number space (0-255). Each root-entry contains a
context-table pointer to reference the context-table for devices on the
bus identified by the root-entry, if the present flag of the root-entry
is set.
Each context-table contains 256 entries, with each entry corresponding
to a PCI device function on the bus. For a PCI device, the device and
function numbers (8-bits) are used to index into the context-table. Each
context-entry contains a Second-level Page-table Pointer, which provides
the host physical address of the address translation structure in system
memory to be used for remapping requests-without-PASID processed through
the context-entry.
For a given Bus, Device, and Function combination as shown in
:numref:`bdf-passthru`, a pass-through device can be associated with
address translation structures for a domain.
.. figure:: images/vt-d-image19.png
:align: center
:name: bdf-passthru
BDF Format of Pass-through Device
Refer to the `VT-d spec`_ for more details on the device-to-domain mapping
structures.
.. _VT-d spec:
https://software.intel.com/sites/default/files/managed/c5/15/vt-directed-io-spec.pdf
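
To make the lookup concrete, the sketch below models how a requester's bus,
device, and function numbers index the root-table and context-table to reach
the second-level page-table pointer. The entry layouts are simplified (real
entries are 128-bit) and ``hpa_to_hva()`` is a hypothetical helper, so this is
not ACRN's actual code:

.. code-block:: c

   #include <stdint.h>

   /* Hypothetical helper that maps a host physical address to a hypervisor
    * virtual address.
    */
   extern void *hpa_to_hva(uint64_t hpa);

   struct root_entry {
           uint64_t lo;     /* bit 0: present; bits 12-63: context-table HPA */
           uint64_t hi;
   };

   struct context_entry {
           uint64_t lo;     /* bit 0: present; bits 12-63: second-level PT HPA */
           uint64_t hi;
   };

   #define ENTRY_PRESENT   0x1ULL
   #define ADDR_MASK       (~0xFFFULL)

   /* Return the second-level page-table pointer for a bus/devfun pair, or 0
    * if the mapping is not present.
    */
   static uint64_t lookup_slpt(struct root_entry *root_table,
                               uint8_t bus, uint8_t devfun)
   {
           struct root_entry *re = &root_table[bus];    /* 256 root-entries */
           struct context_entry *ctx_table, *ce;

           if ((re->lo & ENTRY_PRESENT) == 0ULL) {
                   return 0ULL;
           }

           ctx_table = hpa_to_hva(re->lo & ADDR_MASK);
           ce = &ctx_table[devfun];   /* devfun = device(5 bits) | function(3 bits) */

           if ((ce->lo & ENTRY_PRESENT) == 0ULL) {
                   return 0ULL;
           }
           return ce->lo & ADDR_MASK;
   }
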
Address Translation Structures
==============================
On ACRN, the EPT table of a domain is used as the address translation
structure for the devices assigned to that domain, as shown in
:numref:`vt-d-DMA`.
.. figure:: images/vt-d-image40.png
:align: center
:name: vt-d-DMA
DMA Remapping Diagram
When a device attempts to access system memory, the DMA remapping hardware
intercepts the access, uses the EPT table of the domain to determine whether
the access is allowed, and translates the DMA address from a guest physical
address (GPA) to a host physical address (HPA) according to that EPT table.
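
Conceptually, the remapping hardware performs a table walk similar to the
software model below. This is a minimal sketch that assumes 4-level paging,
4-KByte mappings only (no super-pages), and a hypothetical ``hpa_to_hva()``
helper; the real walk is done entirely in hardware:

.. code-block:: c

   #include <stdint.h>

   extern void *hpa_to_hva(uint64_t hpa);   /* hypothetical HPA-to-VA helper */

   #define EPT_PERM_MASK   0x7ULL                   /* read/write/execute bits */
   #define EPT_ADDR_MASK   0x000FFFFFFFFFF000ULL    /* next-level table HPA */

   /* Model of the GPA-to-HPA walk through a 4-level EPT-style table.
    * Returns the HPA, or UINT64_MAX if the access is blocked.
    */
   static uint64_t dma_translate(uint64_t eptp, uint64_t gpa)
   {
           uint64_t table_hpa = eptp & EPT_ADDR_MASK;
           int level;

           for (level = 3; level >= 0; level--) {
                   uint64_t *table = hpa_to_hva(table_hpa);
                   uint64_t index = (gpa >> (12 + 9 * level)) & 0x1FFULL;
                   uint64_t entry = table[index];

                   if ((entry & EPT_PERM_MASK) == 0ULL) {
                           return UINT64_MAX;   /* no permission: DMA blocked */
                   }
                   table_hpa = entry & EPT_ADDR_MASK;
           }
           return table_hpa | (gpa & 0xFFFULL); /* page frame + page offset */
   }
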
Domains and Memory Isolation
============================
There are no DMA operations inside the hypervisor, so ACRN doesn't
create a domain for the hypervisor. No DMA operations from pass-through
devices can access the hypervisor memory.
ACRN treats each virtual machine (VM) as a separate domain. For a VM, there
is an EPT table for the Normal World, and there may be an EPT table for the
Secure World. The Secure World can access the Normal World's memory, but the
Normal World cannot access the Secure World's memory.

SOS_VM domain
   The SOS_VM domain is created when the hypervisor creates the VM for the
   Service OS.

   The IOMMU uses the Normal World EPT table of SOS_VM as the address
   translation structure for the devices in the SOS_VM domain. The Normal
   World EPT table of SOS_VM doesn't include the memory resources of the
   hypervisor and the Secure Worlds (if any), so the devices in the SOS_VM
   domain can't access memory belonging to the hypervisor or the Secure
   Worlds.

Other domains
   Other VM domains are created when the hypervisor creates a User OS, one
   domain for each User OS.

   The IOMMU uses the Normal World EPT table of a VM as the address
   translation structure for the devices in that domain. The Normal World
   EPT table of the VM only allows devices to access the memory allocated
   for the Normal World of that VM.
Page-walk coherency
===================
For VT-d hardware that doesn't support page-walk coherency, the hypervisor
must make sure that updates to the following VT-d structures are flushed to
memory:

- Device to Domain Mapping Structures, including root-entries and
  context-entries
- The EPT table of a VM

ACRN flushes the related cache lines after updating these structures if the
VT-d hardware doesn't support page-walk coherency, as sketched below.
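
A minimal sketch of such a flush helper is shown below, assuming 64-byte
cache lines and x86 inline assembly; the actual ACRN routine and its name may
differ:

.. code-block:: c

   #include <stdint.h>
   #include <stddef.h>

   #define CACHE_LINE_SIZE  64UL    /* assumed cache-line size */

   /* Flush every cache line covering [p, p + size) back to memory so that
    * non-coherent VT-d hardware observes the updated remapping structures.
    */
   static void iommu_flush_cache(const void *p, size_t size)
   {
           uintptr_t addr = (uintptr_t)p & ~(CACHE_LINE_SIZE - 1UL);
           uintptr_t end = (uintptr_t)p + size;

           for (; addr < end; addr += CACHE_LINE_SIZE) {
                   __asm__ volatile ("clflush (%0)" : : "r" (addr) : "memory");
           }
   }
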
Super-page support
==================
ACRN VT-d reuses the EPT table as the address translation table. The VT-d
super-page capability must therefore be consistent with the super-page usage
of the EPT table.
Snoop control
=============
If the VT-d hardware supports snoop control, VT-d can be told to ignore the
"no-snoop attribute" in PCIe transactions and force the DMA access to snoop
the processor caches.

The following table shows the snoop behavior of a DMA operation, controlled
by the combination of:

- The Snoop Control (SC) capability of the VT-d DMAR unit
- The setting of the SNP field in the leaf PTE
- The no-snoop attribute in the PCIe request

.. list-table::
   :widths: 25 25 25 25
   :header-rows: 1

   * - SC capability of VT-d
     - SNP field in leaf PTE
     - No-snoop attribute in request
     - Snoop behavior
   * - 0
     - 0 (must be 0)
     - no snoop
     - No snoop
   * - 0
     - 0 (must be 0)
     - snoop
     - Snoop
   * - 1
     - 1
     - snoop / no snoop
     - Snoop
   * - 1
     - 0
     - no snoop
     - No snoop
   * - 1
     - 0
     - snoop
     - Snoop
ACRN enables Snoop Control by default, if all enabled VT-d DMAR units support
Snoop Control, by setting bit 11 (the SNP field) in the leaf PTEs of the EPT
table. Bit 11 of an EPT leaf PTE is ignored by the MMU, so setting it has no
side effect on the MMU.

If any of the enabled VT-d DMAR units doesn't support Snoop Control, bit 11
of the EPT leaf PTEs is not set, since that field is treated as reserved (0)
by VT-d hardware implementations that do not support Snoop Control.
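
The sketch below illustrates how this policy could be applied to a leaf PTE
value; the bit definition and helper name are illustrative only, not ACRN's
actual code:

.. code-block:: c

   #include <stdint.h>
   #include <stdbool.h>

   #define EPT_SNP_BIT  (1ULL << 11)  /* SNP field in a leaf PTE; ignored by the MMU */

   /* Return the leaf PTE value with the SNP bit set or cleared, depending on
    * whether every enabled DMAR unit reports the Snoop Control capability.
    */
   static uint64_t apply_snoop_policy(uint64_t leaf_pte, bool all_units_support_sc)
   {
           if (all_units_support_sc) {
                   /* Force snooping regardless of the no-snoop attribute. */
                   leaf_pte |= EPT_SNP_BIT;
           } else {
                   /* Bit 11 is treated as reserved (0) on units without SC. */
                   leaf_pte &= ~EPT_SNP_BIT;
           }
           return leaf_pte;
   }
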
Initialization
**************
During initialization, the hypervisor registers the DMAR units on the
platform according to the pre-parsed information or the DMAR ACPI table.
There may be multiple DMAR units on the platform; ACRN allows some of them to
be marked as ignored. DMAR units marked as ignored are not enabled.

When the hypervisor creates SOS_VM for the Service OS, it creates the SOS_VM
domain, using the Normal World EPT table of SOS_VM as the address translation
table, and adds all PCI devices on the platform to the SOS_VM domain. It then
enables DMAR translation for the DMAR units that are not marked as ignored.
The overall flow is sketched below.
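
This is a simplified sketch of the boot-time flow only. Except for
``init_iommu`` (whose assumed prototype should be checked against the API
reference at the end of this document), the helper names, prototypes, and the
``sos_vm_domain`` variable are hypothetical:

.. code-block:: c

   struct iommu_domain;

   /* Assumed prototype for the real init_iommu API; the remaining helpers
    * are hypothetical and exist only for this sketch.
    */
   extern void init_iommu(void);
   extern struct iommu_domain *create_sos_vm_domain(void);
   extern void add_all_pci_devices(struct iommu_domain *d);
   extern void enable_dmar_translation(void);

   static struct iommu_domain *sos_vm_domain;

   static void vtd_boot_flow(void)
   {
           /* 1. Register DMAR units from pre-parsed info or the ACPI DMAR
            *    table; units marked as ignored are skipped.
            */
           init_iommu();

           /* 2. Create the SOS_VM domain backed by SOS_VM's Normal World EPT. */
           sos_vm_domain = create_sos_vm_domain();

           /* 3. Add every PCI device on the platform to the SOS_VM domain. */
           add_all_pci_devices(sos_vm_domain);

           /* 4. Enable DMAR translation on all non-ignored DMAR units. */
           enable_dmar_translation();
   }
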
Device assignment
*****************
All devices are initially added to the SOS_VM domain.

To assign a device means to assign it to a User OS. The device is removed
from the SOS_VM domain and added to the VM domain related to that User OS,
which changes the address translation table used for the device from the EPT
of SOS_VM to the EPT of the User OS.

To unassign a device means to unassign it from a User OS. The device is
removed from the VM domain related to that User OS and added back to the
SOS_VM domain, which changes the address translation table used for the
device from the EPT of the User OS back to the EPT of SOS_VM.
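
The sketch below shows this assign/unassign flow in terms of moving a device
between domains. The ``move_pt_device`` prototype and the ``get_uos_domain``
helper used here are assumptions for illustration; the actual ACRN signature
is generated in the API reference at the end of this document:

.. code-block:: c

   #include <stdint.h>

   struct iommu_domain;

   /* Hypothetical prototypes used only for this sketch. */
   extern int move_pt_device(struct iommu_domain *from, struct iommu_domain *to,
                             uint8_t bus, uint8_t devfun);
   extern struct iommu_domain *sos_vm_domain;
   extern struct iommu_domain *get_uos_domain(uint16_t vm_id);

   /* Assign a pass-through device (bus/devfun) to a User OS: the device
    * leaves the SOS_VM domain and is translated through the User OS EPT.
    */
   static int assign_device(uint16_t vm_id, uint8_t bus, uint8_t devfun)
   {
           return move_pt_device(sos_vm_domain, get_uos_domain(vm_id),
                                 bus, devfun);
   }

   /* Unassign: move the device back to the SOS_VM domain / SOS_VM EPT. */
   static int unassign_device(uint16_t vm_id, uint8_t bus, uint8_t devfun)
   {
           return move_pt_device(get_uos_domain(vm_id), sos_vm_domain,
                                 bus, devfun);
   }
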
Power Management support for S3
*******************************
During platform S3 suspend and resume, the VT-d register values are lost.
ACRN VT-d provides APIs that are called during S3 suspend and resume. During
S3 suspend, the necessary register values are saved in memory and DMAR
translation is disabled. During S3 resume, the saved register values are
restored, the root table address register is set, and DMAR translation is
re-enabled.

All the operations for S3 suspend and resume are performed on all DMAR units
on the platform, except for the DMAR units marked as ignored.
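
The per-DMAR-unit handling can be pictured as in the sketch below. The real
entry points are ``suspend_iommu`` and ``resume_iommu`` (documented at the end
of this document); the state fields and helper functions here are
hypothetical:

.. code-block:: c

   #include <stdint.h>

   /* Saved per-unit state that must survive S3 (illustrative fields only). */
   struct dmar_unit_state {
           uint64_t root_table_addr;
           uint32_t fault_event_ctl;
           /* ... other registers that must be preserved ... */
   };

   /* Hypothetical low-level helpers for this sketch. */
   extern void dmar_disable_translation(int unit);
   extern void dmar_enable_translation(int unit);
   extern void dmar_save_regs(int unit, struct dmar_unit_state *s);
   extern void dmar_restore_regs(int unit, const struct dmar_unit_state *s);
   extern void dmar_set_root_table(int unit, uint64_t root_table_addr);

   static void dmar_unit_suspend(int unit, struct dmar_unit_state *s)
   {
           dmar_save_regs(unit, s);          /* keep register values in memory */
           dmar_disable_translation(unit);   /* stop remapping before S3 */
   }

   static void dmar_unit_resume(int unit, const struct dmar_unit_state *s)
   {
           dmar_restore_regs(unit, s);                     /* restore saved values */
           dmar_set_root_table(unit, s->root_table_addr);  /* reprogram root table */
           dmar_enable_translation(unit);                  /* re-enable remapping */
   }
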
Error Handling
**************
ACRN VT-d supports DMA remapping error reporting. ACRN VT-d requests an
IRQ/vector for DMAR error reporting and registers a DMAR fault handler for
that IRQ. The DMAR unit reports fault events via MSI: when a fault event
occurs, an MSI is generated and the DMAR fault handler is called to report
the error event.
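
A minimal sketch of the fault-reporting setup is shown below; the interrupt
and DMAR register helpers are hypothetical and differ from ACRN's actual
interfaces:

.. code-block:: c

   #include <stdint.h>

   /* Hypothetical helpers used only for this sketch. */
   extern int  request_irq_vector(void (*handler)(uint32_t irq, void *data),
                                  void *data);
   extern void dmar_program_fault_msi(int unit, uint32_t vector);
   extern void dmar_clear_fault(int unit);
   extern void log_dmar_fault(int unit);

   static void dmar_fault_handler(uint32_t irq, void *data)
   {
           int unit = *(int *)data;

           log_dmar_fault(unit);     /* report source-id, fault reason, address */
           dmar_clear_fault(unit);   /* acknowledge so further faults can be raised */
   }

   static int dmar_setup_fault_reporting(int unit, int *unit_id)
   {
           int vector = request_irq_vector(dmar_fault_handler, unit_id);

           if (vector < 0) {
                   return vector;
           }
           dmar_program_fault_msi(unit, (uint32_t)vector);  /* deliver via MSI */
           return 0;
   }
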
Data structures and interfaces
******************************
Initialization and deinitialization
===================================
The following APIs are provided during initialization and
deinitialization:
.. doxygenfunction:: init_iommu
:project: Project ACRN
Runtime
=======

The following APIs are provided during runtime:
.. doxygenfunction:: create_iommu_domain
:project: Project ACRN
.. doxygenfunction:: destroy_iommu_domain
:project: Project ACRN
.. doxygenfunction:: suspend_iommu
:project: Project ACRN
.. doxygenfunction:: resume_iommu
:project: Project ACRN
.. doxygenfunction:: move_pt_device
:project: Project ACRN