mirror of
https://github.com/projectacrn/acrn-hypervisor.git
synced 2025-05-29 18:35:01 +00:00
While we hoped to make the headings consistent over time while doing other edits, we should instead just make the squirrels happy and do them all at once or they'll likely never be made consistent. A python script was used to find the headings, and then a call to https://pypi.org/project/titlecase to transform the title. A visual inspection was used to tweak a few unexpected resulting titles. Signed-off-by: David B. Kinder <david.b.kinder@intel.com>
447 lines
18 KiB
ReStructuredText
447 lines
18 KiB
ReStructuredText
.. _hv-device-passthrough:
|
|
|
|
Device Passthrough
|
|
##################
|
|
|
|
A critical part of virtualization is virtualizing devices: exposing all
|
|
aspects of a device including its I/O, interrupts, DMA, and
|
|
configuration. There are three typical device virtualization methods:
|
|
emulation, para-virtualization, and passthrough. All emulation,
|
|
para-virtualization and passthrough are used in ACRN project. Device
|
|
emulation is discussed in :ref:`hld-io-emulation`, para-virtualization
|
|
is discussed in :ref:`hld-virtio-devices` and device passthrough will be
|
|
discussed here.
|
|
|
|
In the ACRN project, device emulation means emulating all existing
|
|
hardware resource through a software component device model running in
|
|
the Service OS (SOS). Device emulation must maintain the same SW
|
|
interface as a native device, providing transparency to the VM software
|
|
stack. Passthrough implemented in hypervisor assigns a physical device
|
|
to a VM so the VM can access the hardware device directly with minimal
|
|
(if any) VMM involvement.
|
|
|
|
The difference between device emulation and passthrough is shown in
|
|
:numref:`emu-passthru-diff`. You can notice device emulation has
|
|
a longer access path which causes worse performance compared with
|
|
passthrough. Passthrough can deliver near-native performance, but
|
|
can't support device sharing.
|
|
|
|
.. figure:: images/passthru-image30.png
|
|
:align: center
|
|
:name: emu-passthru-diff
|
|
|
|
Difference between Emulation and passthrough
|
|
|
|
Passthrough in the hypervisor provides the following functionalities to
|
|
allow VM to access PCI devices directly:
|
|
|
|
- VT-d DMA Remapping for PCI devices: hypervisor will setup DMA
|
|
remapping during VM initialization phase.
|
|
- VT-d Interrupt-remapping for PCI devices: hypervisor will enable
|
|
VT-d interrupt-remapping for PCI devices for security considerations.
|
|
- MMIO Remapping between virtual and physical BAR
|
|
- Device configuration Emulation
|
|
- Remapping interrupts for PCI devices
|
|
- ACPI configuration Virtualization
|
|
- GSI sharing violation check
|
|
|
|
The following diagram details passthrough initialization control flow in ACRN
|
|
for post-launched VM:
|
|
|
|
.. figure:: images/passthru-image22.png
|
|
:align: center
|
|
|
|
Passthrough devices initialization control flow
|
|
|
|
Passthrough Device Status
|
|
*************************
|
|
|
|
Most common devices on supported platforms are enabled for
|
|
passthrough, as detailed here:
|
|
|
|
.. figure:: images/passthru-image77.png
|
|
:align: center
|
|
|
|
Passthrough Device Status
|
|
|
|
Owner of Passthrough Devices
|
|
****************************
|
|
|
|
ACRN hypervisor will do PCI enumeration to discover the PCI devices on the platform.
|
|
According to the hypervisor/VM configurations, the owner of these PCI devices can be
|
|
one the following 4 cases:
|
|
|
|
- **Hypervisor**: hypervisor uses UART device as the console in debug version for
|
|
debug purpose, so the UART device is owned by hypervisor and is not visible
|
|
to any VM. For now, UART is the only pci device could be owned by hypervisor.
|
|
- **Pre-launched VM**: The passthrough devices will be used in a pre-launched VM is
|
|
pre-defined in VM configuration. These passthrough devices are owned by the
|
|
pre-launched VM after the VM is created. These devices will not be removed
|
|
from the pre-launched VM. There could be pre-launched VM(s) in logical partition
|
|
mode and hybrid mode.
|
|
- **Service VM**: All the passthrough devices except these described above (owned by
|
|
hypervisor or pre-launched VM(s)) are assigned to Service VM. And some of these devices
|
|
can be assigned to a post-launched VM according to the passthrough device list
|
|
specified in the parameters of the ACRN DM.
|
|
- **Post-launched VM**: A list of passthrough devices can be specified in the parameters of
|
|
the ACRN DM. When creating a post-launched VM, these specified devices will be moved
|
|
from Service VM domain to the post-launched VM domain. After the post-launched VM is
|
|
powered-off, these devices will be moved back to Service VM domain.
|
|
|
|
|
|
VT-d DMA Remapping
|
|
******************
|
|
|
|
To enable passthrough, for VM DMA access the VM can only
|
|
support GPA, while physical DMA requires HPA. One work-around
|
|
is building identity mapping so that GPA is equal to HPA, but this
|
|
is not recommended as some VM don't support relocation well. To
|
|
address this issue, Intel introduces VT-d in the chipset to add one
|
|
remapping engine to translate GPA to HPA for DMA operations.
|
|
|
|
Each VT-d engine (DMAR Unit), maintains a remapping structure
|
|
similar to a page table with device BDF (Bus/Dev/Func) as input and final
|
|
page table for GPA/HPA translation as output. The GPA/HPA translation
|
|
page table is similar to a normal multi-level page table.
|
|
|
|
VM DMA depends on Intel VT-d to do the translation from GPA to HPA, so we
|
|
need to enable VT-d IOMMU engine in ACRN before we can passthrough any device. Service VM
|
|
in ACRN is a VM running in non-root mode which also depends
|
|
on VT-d to access a device. In Service VM DMA remapping
|
|
engine settings, GPA is equal to HPA.
|
|
|
|
ACRN hypervisor checks DMA-Remapping Hardware unit Definition (DRHD) in
|
|
host DMAR ACPI table to get basic info, then sets up each DMAR unit. For
|
|
simplicity, ACRN reuses EPT table as the translation table in DMAR
|
|
unit for each passthrough device. The control flow of assigning and de-assigning
|
|
a passthrough device to/from a post-launched VM is shown in the following figures:
|
|
|
|
.. figure:: images/passthru-image86.png
|
|
:align: center
|
|
|
|
ptdev assignment control flow
|
|
|
|
.. figure:: images/passthru-image42.png
|
|
:align: center
|
|
|
|
ptdev de-assignment control flow
|
|
|
|
.. _vtd-posted-interrupt:
|
|
|
|
|
|
VT-d Interrupt-Remapping
|
|
************************
|
|
|
|
The VT-d interrupt-remapping architecture enables system software to
|
|
control and censor external interrupt requests generated by all sources
|
|
including those from interrupt controllers (I/OxAPICs), MSI/MSI-X capable
|
|
devices including endpoints, root-ports and Root-Complex integrated
|
|
end-points.
|
|
ACRN forces to enabled VT-d interrupt-remapping feature for security reasons.
|
|
If the VT-d hardware doesn't support interrupt-remapping, then ACRN will
|
|
refuse to boot VMs.
|
|
VT-d Interrupt-remapping is NOT related to the translation from physical
|
|
interrupt to virtual interrupt or vice versa. The term VT-d interrupt-remapping
|
|
remaps the interrupt index in the VT-d interrupt-remapping table to the physical
|
|
interrupt vector after checking the external interrupt request is valid. Translation
|
|
physical vector to virtual vector is still needed to be done by hypervisor, which is
|
|
also described in the below section :ref:`interrupt-remapping`.
|
|
|
|
VT-d posted interrupt (PI) enables direct delivery of external interrupts from
|
|
passthrough devices to VMs without having to exit to hypervisor, thereby improving
|
|
interrupt performance. ACRN uses VT-d posted interrupts if the platform
|
|
supports them. VT-d distinguishes between remapped
|
|
and posted interrupt modes by bit 15 in the low 64-bit of the IRTE. If cleared the
|
|
entry is remapped, if set it's posted.
|
|
The idea for posted interrupt is to keep a Posted Interrupt Descriptor (PID) in memory.
|
|
The PID is a 64-byte data structure that contains several fields:
|
|
|
|
Posted Interrupt Request (PIR):
|
|
a 256-bit field, one bit per request vector;
|
|
this is where the interrupts are posted;
|
|
|
|
Suppress Notification (SN):
|
|
determines whether to notify (``SN=0``) or not notify (``SN=1``)
|
|
the CPU for non-urgent interrupts. For ACRN,
|
|
all interrupts are treated as non-urgent. ACRN sets SN=0 during initialization
|
|
and then never changes it at runtime;
|
|
|
|
Notification Vector (NV):
|
|
the CPU must be notified with an interrupt and this
|
|
field specifies the vector for notification;
|
|
|
|
Notification Destination (NDST):
|
|
the physical APIC-ID of the destination.
|
|
ACRN does not support vCPU migration, one vCPU always runs on the same pCPU,
|
|
so for ACRN, NDST is never changed after initialization.
|
|
|
|
Outstanding Notification (ON):
|
|
indicates if a notification event is outstanding
|
|
|
|
The ACRN scheduler supports vCPU scheduling, where two or more vCPUs can
|
|
share the same pCPU using a time sharing technique. One issue emerges
|
|
here for VT-d posted interrupt handling process, where IRQs could happen
|
|
when the target vCPU is in a halted state. We need to handle the case
|
|
where the running vCPU disrupted by the external interrupt, is not the
|
|
target vCPU that an external interrupt should be delivered.
|
|
|
|
Consider this scenario:
|
|
|
|
* vCPU0 runs on pCPU0 and then enters a halted state,
|
|
* ACRN scheduler now chooses vCPU1 to run on pCPU0.
|
|
|
|
If an external interrupt from an assigned device destined to vCPU0
|
|
happens at this time, we do not want this interrupt to be incorrectly
|
|
consumed by vCPU1 currently running on pCPU0. This would happen if we
|
|
allocate the same Activation Notification Vector (ANV) to all vCPUs.
|
|
|
|
To circumvent this issue, ACRN allocates unique ANVs for each vCPU that
|
|
belongs to the same pCPU. The ANVs need only be unique within each pCPU,
|
|
not across all vCPUs. Since vCPU0's ANV is different from vCPU1's ANV,
|
|
if a vCPU0 is in a halted state, external interrupts from an assigned
|
|
device destined to vCPU0 delivered through the PID will not trigger the
|
|
posted interrupt processing. Instead, a VMExit to ACRN happens that can
|
|
then process the event such as waking up the halted vCPU0 and kick it
|
|
to run on pCPU0.
|
|
|
|
For ACRN, ``CONFIG_MAX_VM_NUM`` vCPUs may be running on top of a pCPU. ACRN
|
|
does not support two vCPUs of the same VM running on top of the same
|
|
pCPU. This reduces the number of pre-allocated ANVs for posted
|
|
interrupts to ``CONFIG_MAX_VM_NUM``, and enables ACRN to avoid switching
|
|
between active and wake-up vector values in the posted interrupt
|
|
descriptor on vCPU scheduling state changes. ACRN uses the following
|
|
formula to assign posted interrupt vectors to vCPUs::
|
|
|
|
NV = POSTED_INTR_VECTOR + vcpu->vm->vm_id
|
|
|
|
where ``POSTED_INTR_VECTOR`` is the starting vector (0xe3) for posted interrupts.
|
|
|
|
ACRN maintains a per-PCPU vCPU array that stores the pointers to
|
|
assigned vCPUs for each pCPU and is indexed by ``vcpu->vm->vm_id``.
|
|
When the vCPU is created, ACRN adds the vCPU to the containing pCPU's
|
|
vCPU array. When the vCPU is offline, ACRN removes the vCPU from the
|
|
related vCPU array.
|
|
|
|
An example to illustrate our solution:
|
|
|
|
.. figure:: images/passthru-image50.png
|
|
:align: center
|
|
|
|
ACRN sets ``SN=0`` during initialization and then never change it at
|
|
runtime. This means posted interrupt notification is never suppressed.
|
|
After posting the interrupt in Posted Interrupt Request (PIR), VT-d will
|
|
always notify the CPU using the interrupt vector NV, in both root and
|
|
non-root mode. With this scheme, if the target vCPU is running under
|
|
VMX non-root mode, it will receive the interrupts coming from
|
|
passed-through device without a VMExit (and therefore without any
|
|
intervention of the ACRN hypervisor).
|
|
|
|
If the target vCPU is in a halted state (under VMX non-root mode), a
|
|
scheduling request will be raised to wake it up. This is needed to
|
|
achieve real time behavior. If an RT-VM is waiting for an event, when
|
|
the event is fired (a PI interrupt fires), we need to wake up the VM
|
|
immediately.
|
|
|
|
|
|
MMIO Remapping
|
|
**************
|
|
|
|
For PCI MMIO BAR, hypervisor builds EPT mapping between virtual BAR and
|
|
physical BAR, then VM can access MMIO directly.
|
|
There is one exception, MSI-X table is also in a MMIO BAR. Hypervisor needs to trap the
|
|
accesses to MSI-X table. So the page(s) having MSI-X table should not be accessed by guest
|
|
directly. EPT mapping is not built for these pages having MSI-X table.
|
|
|
|
Device Configuration Emulation
|
|
******************************
|
|
|
|
The PCI configuration space can be accessed by a PCI-compatible
|
|
Configuration Mechanism (IO port 0xCF8/CFC) and the PCI Express Enhanced
|
|
Configuration Access Mechanism (PCI MMCONFIG). The ACRN hypervisor traps
|
|
this PCI configuration space access and emulate it. Refer to :ref:`split-device-model` for details.
|
|
|
|
MSI-X Table Emulation
|
|
*********************
|
|
|
|
VM accesses to MSI-X table should be trapped so that hypervisor has the
|
|
information to map the virtual vector and physical vector. EPT mapping should
|
|
be skipped for the 4KB pages having MSI-X table.
|
|
|
|
There are three situations for the emulation of MSI-X table:
|
|
|
|
- **Service VM**: accesses to MSI-X table are handled by HV MMIO handler (4KB adjusted up
|
|
and down). HV will remap interrupts.
|
|
- **Post-launched VM**: accesses to MSI-X Tables are handled by DM MMIO handler
|
|
(4KB adjusted up and down) and when DM (Service VM) writes to the table, it will be
|
|
intercepted by HV MMIO handler and HV will remap interrupts.
|
|
- **Pre-launched VM**: Writes to MMIO region in MSI-X Table BAR handled by HV MMIO
|
|
handler. If the offset falls within the MSI-X table (offset, offset+tables_size),
|
|
HV remaps interrupts.
|
|
|
|
|
|
.. _interrupt-remapping:
|
|
|
|
Interrupt Remapping
|
|
*******************
|
|
|
|
When the physical interrupt of a passthrough device happens, hypervisor has
|
|
to distribute it to the relevant VM according to interrupt remapping
|
|
relationships. The structure ``ptirq_remapping_info`` is used to define
|
|
the subordination relation between physical interrupt and VM, the
|
|
virtual destination, etc. See the following figure for details:
|
|
|
|
.. figure:: images/passthru-image91.png
|
|
:align: center
|
|
|
|
Remapping of physical interrupts
|
|
|
|
There are two different types of interrupt source: IOAPIC and MSI.
|
|
The hypervisor will record different information for interrupt
|
|
distribution: physical and virtual IOAPIC pin for IOAPIC source,
|
|
physical and virtual BDF and other info for MSI source.
|
|
|
|
Service VM passthrough is also in the scope of interrupt remapping which is
|
|
done on-demand rather than on hypervisor initialization.
|
|
|
|
.. figure:: images/passthru-image102.png
|
|
:align: center
|
|
:name: init-remapping
|
|
|
|
Initialization of remapping of virtual IOAPIC interrupts for Service VM
|
|
|
|
:numref:`init-remapping` above illustrates how remapping of (virtual) IOAPIC
|
|
interrupts are remapped for Service VM. VM exit occurs whenever Service VM tries to
|
|
unmask an interrupt in (virtual) IOAPIC by writing to the Redirection
|
|
Table Entry (or RTE). The hypervisor then invokes the IOAPIC emulation
|
|
handler (refer to :ref:`hld-io-emulation` for details on I/O emulation) which
|
|
calls APIs to set up a remapping for the to-be-unmasked interrupt.
|
|
|
|
Remapping of (virtual) MSI interrupts are set up in a similar sequence:
|
|
|
|
.. figure:: images/passthru-image98.png
|
|
:align: center
|
|
|
|
Initialization of remapping of virtual MSI for Service VM
|
|
|
|
This figure illustrates how mappings of MSI or MSI-X are set up for
|
|
Service VM. Service VM is responsible for issuing a hypercall to notify the
|
|
hypervisor before it configures the PCI configuration space to enable an
|
|
MSI. The hypervisor takes this opportunity to set up a remapping for the
|
|
given MSI or MSI-X before it is actually enabled by Service VM.
|
|
|
|
When the User VM needs to access the physical device by passthrough, it uses
|
|
the following steps:
|
|
|
|
- User VM gets a virtual interrupt
|
|
- VM exit happens and the trapped vCPU is the target where the interrupt
|
|
will be injected.
|
|
- Hypervisor will handle the interrupt and translate the vector
|
|
according to ptirq_remapping_info.
|
|
- Hypervisor delivers the interrupt to User VM.
|
|
|
|
When the Service VM needs to use the physical device, the passthrough is also
|
|
active because the Service VM is the first VM. The detail steps are:
|
|
|
|
- Service VM get all physical interrupts. It assigns different interrupts for
|
|
different VMs during initialization and reassign when a VM is created or
|
|
deleted.
|
|
- When physical interrupt is trapped, an exception will happen after VMCS
|
|
has been set.
|
|
- Hypervisor will handle the VM exit issue according to
|
|
ptirq_remapping_info and translates the vector.
|
|
- The interrupt will be injected the same as a virtual interrupt.
|
|
|
|
ACPI Virtualization
|
|
*******************
|
|
|
|
ACPI virtualization is designed in ACRN with these assumptions:
|
|
|
|
- HV has no knowledge of ACPI,
|
|
- Service VM owns all physical ACPI resources,
|
|
- User VM sees virtual ACPI resources emulated by device model.
|
|
|
|
Some passthrough devices require physical ACPI table entry for
|
|
initialization. The device model will create such device entry based on
|
|
the physical one according to vendor ID and device ID. Virtualization is
|
|
implemented in Service VM device model and not in scope of the hypervisor.
|
|
For pre-launched VM, ACRN hypervisor doesn't support the ACPI virtualization,
|
|
so devices relying on ACPI table are not supported.
|
|
|
|
GSI Sharing Violation Check
|
|
***************************
|
|
|
|
All the PCI devices that are sharing the same GSI should be assigned to
|
|
the same VM to avoid physical GSI sharing between multiple VMs.
|
|
In logical partition mode or hybrid mode, the PCI devices assigned to
|
|
pre-launched VM is statically pre-defined. Developers should take care not to
|
|
violate the rule.
|
|
For post-launched VM, devices that don't support MSI, ACRN DM puts the devices
|
|
sharing the same GSI pin to a GSI
|
|
sharing group. The devices in the same group should be assigned together to
|
|
the current VM, otherwise, none of them should be assigned to the
|
|
current VM. A device that violates the rule will be rejected to be
|
|
passed-through. The checking logic is implemented in Device Model and not
|
|
in scope of hypervisor.
|
|
The platform GSI information is in devicemodel/hw/pci/platform_gsi_info.c
|
|
for limited platform (currently, only APL MRB). For other platforms, the platform
|
|
specific GSI information should be added to activate the checking of GSI sharing violation.
|
|
|
|
Data Structures and Interfaces
|
|
******************************
|
|
|
|
The following APIs are common APIs provided to initialize interrupt remapping for
|
|
VMs:
|
|
|
|
.. doxygenfunction:: ptirq_intx_pin_remap
|
|
:project: Project ACRN
|
|
|
|
.. doxygenfunction:: ptirq_prepare_msix_remap
|
|
:project: Project ACRN
|
|
|
|
Post-launched VM needs to pre-allocate interrupt entries during VM initialization.
|
|
Post-launched VM needs to free interrupt entries during VM de-initialization.
|
|
The following APIs are provided to pre-allocate/free interrupt entries for post-launched VM:
|
|
|
|
.. doxygenfunction:: ptirq_add_intx_remapping
|
|
:project: Project ACRN
|
|
|
|
.. doxygenfunction:: ptirq_remove_intx_remapping
|
|
:project: Project ACRN
|
|
|
|
.. doxygenfunction:: ptirq_remove_msix_remapping
|
|
:project: Project ACRN
|
|
|
|
The following APIs are provided to acknowledge a virtual interrupt.
|
|
|
|
.. doxygenfunction:: ptirq_intx_ack
|
|
:project: Project ACRN
|
|
|
|
The following APIs are provided to handle ptdev interrupt:
|
|
|
|
.. doxygenfunction:: ptdev_init
|
|
:project: Project ACRN
|
|
|
|
.. doxygenfunction:: ptirq_softirq
|
|
:project: Project ACRN
|
|
|
|
.. doxygenfunction:: ptirq_alloc_entry
|
|
:project: Project ACRN
|
|
|
|
.. doxygenfunction:: ptirq_release_entry
|
|
:project: Project ACRN
|
|
|
|
.. doxygenfunction:: ptdev_release_all_entries
|
|
:project: Project ACRN
|
|
|
|
.. doxygenfunction:: ptirq_activate_entry
|
|
:project: Project ACRN
|
|
|
|
.. doxygenfunction:: ptirq_deactivate_entry
|
|
:project: Project ACRN
|
|
|
|
.. doxygenfunction:: ptirq_dequeue_softirq
|
|
:project: Project ACRN
|
|
|
|
.. doxygenfunction:: ptirq_get_intr_data
|
|
:project: Project ACRN
|