doc: reorganize HLD docs
Reorganize the high-level design docs to align with a work-in-progress
HLD document. Migrate the previous web content (and images) into the new
organization. From here we'll continue adding new design chapters as
they're reviewed and edited.

Signed-off-by: David B. Kinder <david.b.kinder@intel.com>
doc/developer-guides/hld/acpi-virt.rst (new file, 359 lines)
@@ -0,0 +1,359 @@
.. _acpi-virt-HLD:

ACPI Virtualization high-level design
#####################################

ACPI introduction
*****************

Advanced Configuration and Power Interface (ACPI) provides an open
standard that operating systems can use to discover and configure
computer hardware components and to perform power management, for
example, by monitoring status and putting unused components to sleep.

Functions implemented by ACPI include:

- System/Device/Processor power management
- Device/Processor performance management
- Configuration / Plug and Play
- System events
- Battery management
- Thermal management

ACPI enumerates and lists the different DMA engines in the platform, and
the device scope relationships between PCI devices and the DMA engine
that controls them. All these critical functions depend on ACPI tables.
Here's an example on an Apollo Lake platform (APL) with Linux installed:

.. code-block:: none

   root@:Dom0 ~ $ ls /sys/firmware/acpi/tables/
   APIC data DMAR DSDT dynamic FACP FACS HPET MCFG NHLT TPM2

These tables provide different information and functions:

- Advanced Programmable Interrupt Controller (APIC) for Symmetric
  Multiprocessor systems (SMP),
- DMA remapping (DMAR) for Intel |reg| Virtualization Technology for
  Directed I/O (VT-d),
- Non-HD Audio Link Table (NHLT) for supporting audio devices,
- and Differentiated System Description Table (DSDT) for system
  configuration info. DSDT is a major ACPI table used to describe what
  peripherals the machine has, along with information on PCI IRQ
  mappings and power management.

Most of the ACPI functionality is provided in ACPI Machine Language
(AML) bytecode stored in the ACPI tables. To make use of these tables,
Linux implements an interpreter for the AML bytecode. At BIOS
development time, the AML bytecode is compiled from the ASL (ACPI Source
Language) code. The ``iasl`` command is used to disassemble the ACPI table
and display its contents:

.. code-block:: none

   root@:Dom0 ~ $ cp /sys/firmware/acpi/tables/DMAR .
   root@:Dom0 ~ $ iasl -d DMAR

   Intel ACPI Component Architecture
   ASL+ Optimizing Compiler/Disassembler version 20170728
   Copyright (c) 2000 - 2017 Intel Corporation
   Input file DMAR, Length 0xB0 (176) bytes
   ACPI: DMAR 0x0000000000000000 0000B0 (v01 INTEL BDW 00000001 INTL 00000001)
   Acpi Data Table [DMAR] decoded
   Formatted output: DMAR.dsl - 5286 bytes

   root@:Dom0 ~ $ cat DMAR.dsl
   [000h 0000 4] Signature : "DMAR" [DMA Remapping table]
   [004h 0004 4] Table Length : 000000B0
   [008h 0008 1] Revision : 01
   ...
   [030h 0048 2] Subtable Type : 0000 [Hardware Unit Definition]
   [032h 0050 2] Length : 0018
   [034h 0052 1] Flags : 00
   [035h 0053 1] Reserved : 00
   [036h 0054 2] PCI Segment Number : 0000
   [038h 0056 8] Register Base Address : 00000000FED64000

From the displayed ASL, we can see some generic table fields, such as
the version information, and one VT-d remapping engine description with
0xFED64000 as its base address.

We can modify DMAR.dsl and assemble it back to AML:

.. code-block:: none

   root@:Dom0 ~ $ iasl DMAR.dsl
   Intel ACPI Component Architecture
   ASL+ Optimizing Compiler/Disassembler version 20170728
   Copyright (c) 2000 - 2017 Intel Corporation
   Table Input: DMAR.dsl - 113 lines, 5286 bytes, 72 fields
   Binary Output: DMAR.aml - 176 bytes
   Compilation complete. 0 Errors, 0 Warnings, 0 Remarks

We can see the new AML file ``DMAR.aml`` is created.

There are many ACPI tables in the system, linked together via table
pointers. In any ACPI-compatible system, the OS can enumerate all
needed tables starting from the Root System Description Pointer (RSDP)
provided at a known place in the system low address space, which points
to an XSDT (Extended System Description Table). The following picture
shows a typical ACPI table layout on an Intel APL platform:

.. figure:: images/acpi-image1.png
   :width: 700px
   :align: center
   :name: acpi-layout

   Typical ACPI table layout in an Intel APL platform

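To make this enumeration chain concrete, here is a minimal sketch of the
RSDP structure as defined by the public ACPI specification (illustrative
reference only, not ACRN code); an OS locates this structure in low
memory and follows its ``xsdt_address`` field to the XSDT, which in turn
points to the remaining tables:

.. code-block:: c

   /* Minimal sketch of the ACPI 2.0+ Root System Description Pointer,
    * following the public ACPI specification. */
   #include <stdint.h>

   struct acpi_rsdp {
       char     signature[8];       /* "RSD PTR " */
       uint8_t  checksum;           /* covers the first 20 bytes */
       char     oem_id[6];
       uint8_t  revision;           /* 2 or higher for ACPI 2.0+ */
       uint32_t rsdt_address;       /* 32-bit physical address of the RSDT */
       uint32_t length;             /* length of this structure */
       uint64_t xsdt_address;       /* 64-bit physical address of the XSDT */
       uint8_t  extended_checksum;  /* covers the entire structure */
       uint8_t  reserved[3];
   } __attribute__((packed));

Each table reachable from the XSDT begins with the same standard header
(signature, length, revision, checksum, and OEM fields), which is what
``iasl`` decodes in the DMAR example above.
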
ACPI virtualization
*******************

Most modern OSes require ACPI, so ACRN provides ACPI virtualization to
emulate an ACPI-capable virtual platform for the guest OS. To achieve
this, there are two options, depending on how physical devices and ACPI
resources are abstracted: Partitioning and Emulation.

Partitioning
============

One option is to assign and partition physical devices and ACPI
resources among all guest OSes. That means each guest OS owns specific
devices through passthrough, as shown below:

+--------------------------+--------------------------+--------------------------+
| PCI Devices              | VM0 (Cluster VM)         | VM1 (IVI VM)             |
+--------------------------+--------------------------+--------------------------+
| I2C                      | I2C3, I2C0               | I2C1, I2C2, I2C4, I2C5,  |
|                          |                          | I2C6, I2C7               |
+--------------------------+--------------------------+--------------------------+
| SPI                      | SPI1                     | SPI0, SPI2               |
+--------------------------+--------------------------+--------------------------+
| USB                      |                          | USB-Host (xHCI) and      |
|                          |                          | USB-Device (xDCI)        |
+--------------------------+--------------------------+--------------------------+
| SDIO                     |                          | SDIO                     |
+--------------------------+--------------------------+--------------------------+
| IPU                      |                          | IPU                      |
+--------------------------+--------------------------+--------------------------+
| Ethernet                 | Ethernet                 |                          |
+--------------------------+--------------------------+--------------------------+
| WIFI                     |                          | WIFI                     |
+--------------------------+--------------------------+--------------------------+
| Bluetooth                |                          | Bluetooth                |
+--------------------------+--------------------------+--------------------------+
| Audio                    |                          | Audio                    |
+--------------------------+--------------------------+--------------------------+
| GPIO                     | GPIO                     |                          |
+--------------------------+--------------------------+--------------------------+
| UART                     | UART                     |                          |
+--------------------------+--------------------------+--------------------------+

In an early ACRN development phase, partitioning was used for
simplicity. To implement partitioning, we need to hack the PCI logic to
make different VMs see a different subset of devices, and create one
copy of the ACPI tables for each of them, as shown in the following
picture:

.. figure:: images/acpi-image3.png
   :width: 900px
   :align: center

For each VM, its ACPI tables are standalone copies, not related to those
of other VMs. The OpRegion also needs to be copied for each VM.

For each table, we make modifications, based on the physical table, to
reflect the devices assigned to a particular VM. In the picture below,
we keep SP2(0:19.1) for VM0, and SP1(0:19.0)/SP3(0:19.2) for VM1. Any
time the partition policy changes, we need to modify both sets of tables
again, including disassembling, modifying, and reassembling them, which
is tricky and bug-prone.

.. figure:: images/acpi-image2.png
   :width: 900px
   :align: center

Emulation
=========

A second option is for the SOS (VM0) to "own" all devices and emulate a
set of virtual devices for each UOS (VM1). This is the most popular
model for virtualization, as shown below. ACRN currently uses device
emulation plus some device passthrough for the UOS.

.. figure:: images/acpi-image5.png
   :width: 400px
   :align: center

Regarding ACPI virtualization in ACRN, different policies are used for
different components:

- Hypervisor - ACPI is transparent to the Hypervisor, which has no
  knowledge of ACPI at all.
- SOS - All ACPI resources are physically owned by the SOS, which
  enumerates all ACPI tables and devices.
- UOS - Virtual ACPI resources, exposed by the device model, are owned
  by the UOS.

Source for the ACPI emulation code for the device model is found in
``hw/platform/acpi/acpi.c``.

Each entry in ``basl_ftables`` corresponds to one virtual ACPI table and
includes the following elements:

- wsect - output handler that writes the related ACPI table contents to
  a specific file
- offset - the related ACPI table offset in memory
- valid - dynamically indicates whether this table is needed

.. code-block:: c

   static struct {
       int (*wsect)(FILE *fp, struct vmctx *ctx);
       uint64_t offset;
       bool valid;
   } basl_ftables[] = {
       { basl_fwrite_rsdp, 0,           true  },
       { basl_fwrite_rsdt, RSDT_OFFSET, true  },
       { basl_fwrite_xsdt, XSDT_OFFSET, true  },
       { basl_fwrite_madt, MADT_OFFSET, true  },
       { basl_fwrite_fadt, FADT_OFFSET, true  },
       { basl_fwrite_hpet, HPET_OFFSET, true  },
       { basl_fwrite_mcfg, MCFG_OFFSET, true  },
       { basl_fwrite_facs, FACS_OFFSET, true  },
       { basl_fwrite_nhlt, NHLT_OFFSET, false }, /* valid with audio ptdev */
       { basl_fwrite_dsdt, DSDT_OFFSET, true  }
   };

The main function to create virtual ACPI tables is ``acpi_build``, which
calls ``basl_compile`` for each table and performs the following:

#. create two temp files: infile and outfile
#. with the output handler, write the table contents stream to infile
#. use the ``iasl`` tool to assemble infile into outfile
#. load the outfile contents to the required memory offset

.. code-block:: c

   static int
   basl_compile(struct vmctx *ctx,
                int (*fwrite_section)(FILE *, struct vmctx *),
                uint64_t offset)
   {
       struct basl_fio io[2];
       static char iaslbuf[3*MAXPATHLEN + 10];
       int err;

       err = basl_start(&io[0], &io[1]);
       if (!err) {
           err = (*fwrite_section)(io[0].fp, ctx);

           if (!err) {
               /*
                * iasl sends the results of the compilation to
                * stdout. Shut this down by using the shell to
                * redirect stdout to /dev/null, unless the user
                * has requested verbose output for debugging
                * purposes
                */
               if (basl_verbose_iasl)
                   snprintf(iaslbuf, sizeof(iaslbuf),
                            "%s -p %s %s",
                            ASL_COMPILER,
                            io[1].f_name, io[0].f_name);
               else
                   snprintf(iaslbuf, sizeof(iaslbuf),
                            "/bin/sh -c \"%s -p %s %s\" 1> /dev/null",
                            ASL_COMPILER,
                            io[1].f_name, io[0].f_name);

               err = system(iaslbuf);

               if (!err) {
                   /*
                    * Copy the aml output file into guest
                    * memory at the specified location
                    */
                   err = basl_load(ctx, io[1].fd, offset);
               } else
                   err = -1;
           }
           basl_end(&io[0], &io[1]);
       }
       return err;
   }

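For orientation, the per-table flow that ``acpi_build`` drives can be
pictured with the following sketch. This is a hypothetical illustration,
not the actual ACRN implementation; it only uses the ``basl_ftables``
fields described earlier:

.. code-block:: c

   /* Hypothetical sketch of the per-table flow driven by acpi_build():
    * skip entries not valid for this VM, then compile and load each
    * remaining table at its fixed guest-memory offset. */
   #define BASL_NTABLES (sizeof(basl_ftables) / sizeof(basl_ftables[0]))

   static int build_all_tables(struct vmctx *ctx)
   {
       int i, err = 0;

       for (i = 0; i < BASL_NTABLES && !err; i++) {
           if (!basl_ftables[i].valid)
               continue;   /* e.g. NHLT is skipped without an audio ptdev */

           err = basl_compile(ctx, basl_ftables[i].wsect,
                              basl_ftables[i].offset);
       }
       return err;
   }
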
After processing each entry, the virtual ACPI tables are present in UOS
memory.

For pass-through devices in the UOS, we likely need to add some ACPI
description in the UOS virtual DSDT table. There is one hook
(``passthru_write_dsdt``) in ``hw/pci/passthrough.c`` for this. The
following source code shows calls to different functions that add
different contents for each vendor and device id:

.. code-block:: c

   static void
   passthru_write_dsdt(struct pci_vdev *dev)
   {
       struct passthru_dev *ptdev = (struct passthru_dev *) dev->arg;
       uint32_t vendor = 0, device = 0;

       vendor = read_config(ptdev->phys_dev, PCIR_VENDOR, 2);

       if (vendor != 0x8086)
           return;

       device = read_config(ptdev->phys_dev, PCIR_DEVICE, 2);

       /* Provides ACPI extra info */
       if (device == 0x5aaa)
           /* XDCI @ 00:15.1 to enable ADB */
           write_dsdt_xhci(dev);
       else if (device == 0x5ab4)
           /* HDAC @ 00:17.0 as codec */
           write_dsdt_hdac(dev);
       else if (device == 0x5a98)
           /* HDAS @ 00:e.0 */
           write_dsdt_hdas(dev);
       else if (device == 0x5aac)
           /* i2c @ 00:16.0 for ipu */
           write_dsdt_ipu_i2c(dev);
       else if (device == 0x5abc)
           /* URT1 @ 00:18.0 for bluetooth */
           write_dsdt_urt1(dev);

   }

For instance, ``write_dsdt_urt1`` provides the ACPI contents for the
Bluetooth UART device when it is passed through to the UOS. It provides
the virtual PCI device/function as ``_ADR``, with other descriptions
possible for Bluetooth UART enumeration.

.. code-block:: c

   static void
   write_dsdt_urt1(struct pci_vdev *dev)
   {
       printf("write virt-%x:%x.%x in dsdt for URT1 @ 00:18.0\n",
              dev->bus,
              dev->slot,
              dev->func);
       dsdt_line("Device (URT1)");
       dsdt_line("{");
       dsdt_line("    Name (_ADR, 0x%04X%04X)", dev->slot, dev->func);
       dsdt_line("    Name (_DDN, \"Intel(R) HS-UART Controller #1\")");
       dsdt_line("    Name (_UID, One)");
       dsdt_line("    Name (RBUF, ResourceTemplate ()");
       dsdt_line("    {");
       dsdt_line("    })");
       dsdt_line("    Method (_CRS, 0, NotSerialized)");
       dsdt_line("    {");
       dsdt_line("        Return (RBUF)");
       dsdt_line("    }");
       dsdt_line("}");
   }

This document introduces basic ACPI virtualization. Other topics, such
as power management virtualization, add more requirements for ACPI and
will be discussed in the power management documentation.

doc/developer-guides/hld/hld-APL_GVT-g.rst (new file, 948 lines)
@@ -0,0 +1,948 @@
.. _APL_GVT-g-hld:

GVT-g high-level design
#######################
|
||||
|
||||
Introduction
************
|
||||
|
||||
Purpose of this Document
========================
|
||||
|
||||
This high-level design (HLD) document describes the usage requirements
|
||||
and high level design for Intel® Graphics Virtualization Technology for
|
||||
shared virtual :term:`GPU` technology (:term:`GVT-g`) on Apollo Lake-I
|
||||
SoCs.
|
||||
|
||||
This document describes:
|
||||
|
||||
- The different GPU virtualization techniques
|
||||
- GVT-g mediated pass-through
|
||||
- High level design
|
||||
- Key components
|
||||
- GVT-g new architecture differentiation
|
||||
|
||||
Audience
========
|
||||
|
||||
This document is for developers, validation teams, architects and
|
||||
maintainers of Intel® GVT-g for the Apollo Lake SoCs.
|
||||
|
||||
The reader should have some familiarity with the basic concepts of
|
||||
system virtualization and Intel® processor graphics.
|
||||
|
||||
Reference Documents
===================
|
||||
|
||||
The following documents were used as references for this specification:
|
||||
|
||||
- Paper in USENIX ATC '14 - *Full GPU Virtualization Solution with
|
||||
Mediated Pass-Through* - https://www.usenix.org/node/183932
|
||||
|
||||
- Hardware Specification - PRMs -
|
||||
https://01.org/linuxgraphics/documentation/hardware-specification-prms
|
||||
|
||||
Background
**********
|
||||
|
||||
Intel® GVT-g is an enabling technology in emerging graphics
|
||||
virtualization scenarios. It adopts a full GPU virtualization approach
|
||||
based on mediated pass-through technology, to achieve good performance,
|
||||
scalability and secure isolation among Virtual Machines (VMs). A virtual
|
||||
GPU (vGPU), with full GPU features, is presented to each VM so that a
|
||||
native graphics driver can run directly inside a VM.
|
||||
|
||||
Intel® GVT-g technology for Apollo Lake (APL) has been implemented in
|
||||
open source hypervisors or Virtual Machine Monitors (VMMs):
|
||||
|
||||
- Intel® GVT-g for ACRN, also known as "AcrnGT"
- Intel® GVT-g for KVM, also known as "KVMGT"
- Intel® GVT-g for Xen, also known as "XenGT"
|
||||
|
||||
The core vGPU device model is released under BSD/MIT dual license, so it
|
||||
can be reused in other proprietary hypervisors.
|
||||
|
||||
Intel has a portfolio of graphics virtualization technologies
|
||||
(:term:`GVT-g`, :term:`GVT-d` and :term:`GVT-s`). GVT-d and GVT-s are
|
||||
outside of the scope of this document.
|
||||
|
||||
This HLD applies to the Apollo Lake platform only. Support of other
|
||||
hardware is outside the scope of this HLD.
|
||||
|
||||
Targeted Usages
===============
|
||||
|
||||
The main targeted usage of GVT-g is in automotive applications, such as:
|
||||
|
||||
- An Instrument cluster running in one domain
|
||||
- An In Vehicle Infotainment (IVI) solution running in another domain
|
||||
- Additional domains for specific purposes, such as Rear Seat
|
||||
Entertainment or video camera capturing.
|
||||
|
||||
.. figure:: images/APL_GVT-g-ive-use-case.png
|
||||
:width: 900px
|
||||
:align: center
|
||||
:name: ive-use-case
|
||||
|
||||
IVE Use Case
|
||||
|
||||
Existing Techniques
===================
|
||||
|
||||
A graphics device is no different from any other I/O device, with
|
||||
respect to how the device I/O interface is virtualized. Therefore,
|
||||
existing I/O virtualization techniques can be applied to graphics
|
||||
virtualization. However, none of the existing techniques can meet the
|
||||
general requirement of performance, scalability, and secure isolation
|
||||
simultaneously. In this section, we review the pros and cons of each
|
||||
technique in detail, enabling the audience to understand the rationale
|
||||
behind the entire GVT-g effort.
|
||||
|
||||
Emulation
---------
|
||||
|
||||
A device can be emulated fully in software, including its I/O registers
|
||||
and internal functional blocks. There would be no dependency on the
|
||||
underlying hardware capability, therefore compatibility can be achieved
|
||||
across platforms. However, due to the CPU emulation cost, this technique
|
||||
is usually used for legacy devices, such as a keyboard, mouse, and VGA
|
||||
card. Fully emulating a modern accelerator, such as a GPU, would involve
great complexity and deliver extremely low performance. It may be acceptable
|
||||
for use in a simulation environment, but it is definitely not suitable
|
||||
for production usage.
|
||||
|
||||
API Forwarding
--------------
|
||||
|
||||
API forwarding, or a split driver model, is another widely-used I/O
virtualization technology. It has been used in commercial virtualization
products, for example, VMware*, PCoIP*, and Microsoft* RemoteFx*.
|
||||
It is a natural path when researchers study a new type of
|
||||
I/O virtualization usage, for example, when GPGPU computing in VM was
|
||||
initially proposed. Intel® GVT-s is based on this approach.
|
||||
|
||||
The architecture of API forwarding is shown in :numref:`api-forwarding`:
|
||||
|
||||
.. figure:: images/APL_GVT-g-api-forwarding.png
|
||||
:width: 400px
|
||||
:align: center
|
||||
:name: api-forwarding
|
||||
|
||||
API Forwarding
|
||||
|
||||
A frontend driver is employed to forward high-level API calls (OpenGL,
DirectX, and so on) inside a VM, to a Backend driver in the Hypervisor
|
||||
for acceleration. The Backend may be using a different graphics stack,
|
||||
so API translation between different graphics protocols may be required.
|
||||
The Backend driver allocates a physical GPU resource for each VM,
|
||||
behaving like a normal graphics application in a Hypervisor. Shared
|
||||
memory may be used to reduce memory copying between the host and guest
|
||||
graphic stacks.
|
||||
|
||||
API forwarding can bring hardware acceleration capability into a VM,
|
||||
with other merits such as vendor independence and high density. However, it
|
||||
also suffers from the following intrinsic limitations:
|
||||
|
||||
- Lagging features - Every new API version needs to be specifically
|
||||
handled, so it means slow time-to-market (TTM) to support new standards.
|
||||
For example,
|
||||
only DirectX9 is supported, when DirectX11 is already in the market.
|
||||
Also, there is a big gap in supporting media and compute usages.
|
||||
|
||||
- Compatibility issues - A GPU is very complex, and consequently so are
|
||||
high level graphics APIs. Different protocols are not 100% compatible
|
||||
on every subtle API, so the customer can observe feature/quality loss
|
||||
for specific applications.
|
||||
|
||||
- Maintenance burden - Grows as supported protocols and specific
  versions are added.
|
||||
|
||||
- Performance overhead - Different API forwarding implementations
|
||||
exhibit quite different performance, which gives rise to a need for a
|
||||
fine-grained graphics tuning effort.
|
||||
|
||||
Direct Pass-Through
-------------------
|
||||
|
||||
"Direct pass-through" dedicates the GPU to a single VM, providing full
|
||||
features and good performance, but at the cost of device sharing
|
||||
capability among VMs. Only one VM at a time can use the hardware
|
||||
acceleration capability of the GPU, which is a major limitation of this
|
||||
technique. However, it is still a good approach to enable graphics
|
||||
virtualization usages on Intel server platforms, as an intermediate
|
||||
solution. Intel® GVT-d uses this mechanism.
|
||||
|
||||
.. figure:: images/APL_GVT-g-pass-through.png
|
||||
:width: 400px
|
||||
:align: center
|
||||
:name: gvt-pass-through
|
||||
|
||||
Pass-Through
|
||||
|
||||
SR-IOV
|
||||
------
|
||||
|
||||
Single Root IO Virtualization (SR-IOV) implements I/O virtualization
|
||||
directly on a device. Multiple Virtual Functions (VFs) are implemented,
|
||||
with each VF directly assignable to a VM.
|
||||
|
||||
Mediated Pass-Through
|
||||
*********************
|
||||
|
||||
Intel® GVT-g achieves full GPU virtualization using a "mediated
|
||||
pass-through" technique.
|
||||
|
||||
Concept
|
||||
=======
|
||||
|
||||
Mediated pass-through allows a VM to access performance-critical I/O
|
||||
resources (usually partitioned) directly, without intervention from the
|
||||
hypervisor in most cases. Privileged operations from this VM are
|
||||
trapped-and-emulated to provide secure isolation among VMs.
|
||||
|
||||
.. figure:: images/APL_GVT-g-mediated-pass-through.png
|
||||
:width: 400px
|
||||
:align: center
|
||||
:name: mediated-pass-through
|
||||
|
||||
Mediated Pass-Through
|
||||
|
||||
The Hypervisor must ensure that no vulnerability is exposed when
|
||||
assigning performance-critical resource to each VM. When a
|
||||
performance-critical resource cannot be partitioned, a scheduler must be
|
||||
implemented (either in software or hardware) to allow time-based sharing
|
||||
among multiple VMs. In this case, the device must allow the hypervisor
|
||||
to save and restore the hardware state associated with the shared resource,
|
||||
either through direct I/O register reads and writes (when there is no software
|
||||
invisible state) or through a device-specific context save and restore
|
||||
mechanism (where there is a software invisible state).
|
||||
|
||||
Examples of performance-critical I/O resources include the following:
|
||||
|
||||
.. figure:: images/APL_GVT-g-perf-critical.png
|
||||
:width: 800px
|
||||
:align: center
|
||||
:name: perf-critical
|
||||
|
||||
Performance-Critical I/O Resources
|
||||
|
||||
|
||||
The key to implementing mediated pass-through for a specific device is
|
||||
to define the right policy for various I/O resources.
|
||||
|
||||
Virtualization Policies for GPU Resources
=========================================
|
||||
|
||||
:numref:`graphics-arch` shows how Intel Processor Graphics works at a high level.
|
||||
Software drivers write commands into a command buffer through the CPU.
|
||||
The Render Engine in the GPU fetches these commands and executes them.
|
||||
The Display Engine fetches pixel data from the Frame Buffer and sends
|
||||
them to the external monitors for display.
|
||||
|
||||
.. figure:: images/APL_GVT-g-graphics-arch.png
|
||||
:width: 400px
|
||||
:align: center
|
||||
:name: graphics-arch
|
||||
|
||||
Architecture of Intel Processor Graphics
|
||||
|
||||
This architecture abstraction applies to most modern GPUs, but may
differ in how graphics memory is implemented. Intel Processor Graphics
uses system memory as graphics memory. System memory can be mapped into
multiple virtual address spaces by GPU page tables. A 4 GB global
virtual address space called "global graphics memory", accessible from
both the GPU and CPU, is mapped through a global page table. Local
graphics memory spaces are supported in the form of multiple 4 GB local
virtual address spaces, but are limited to access only by the Render
Engine through local page tables. Global graphics memory is mostly used
for the Frame Buffer and also serves as the Command Buffer. Massive data
accesses are made to local graphics memory when hardware acceleration is
in progress. Other GPUs have a similar page table mechanism accompanying
the on-die memory.
|
||||
|
||||
The CPU programs the GPU through GPU-specific commands, shown in
:numref:`graphics-arch`, using a producer-consumer model. The graphics
driver programs GPU commands into the Command Buffer, including primary
buffer and batch buffer, according to the high-level programming APIs,
such as OpenGL* or DirectX*. Then, the GPU fetches and executes the
commands. The primary buffer (also called the ring buffer) may chain
other batch buffers together; the terms primary buffer and ring buffer
are used interchangeably hereafter. The batch buffer is used to convey
the majority of the commands (up to ~98% of them) per programming model.
A register tuple (head, tail) is used to control the ring buffer. The
CPU submits the commands to the GPU by updating the tail, while the GPU
fetches commands from the head, and then notifies the CPU by updating
the head, after the commands have finished execution. Therefore, when
the GPU has executed all commands from the ring buffer, the head and
tail pointers are the same.

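To make the (head, tail) handshake concrete, here is a minimal,
hypothetical sketch of the producer side. The register offsets, helper
names, and ring size are assumptions for illustration and are not the
actual i915 programming interface:

.. code-block:: c

   /* Hypothetical command submission into a ring buffer: the CPU copies
    * commands behind the tail, then publishes the new tail via MMIO so
    * the GPU can fetch from the head. */
   #include <stdint.h>
   #include <string.h>

   #define RING_SIZE     0x10000u             /* assumed ring size      */
   #define RING_HEAD_REG 0x2034u              /* illustrative offsets   */
   #define RING_TAIL_REG 0x2030u

   extern uint8_t *ring_base;                 /* ring in graphics memory */
   extern uint32_t ring_tail;                 /* CPU-owned write offset  */

   uint32_t mmio_read32(uint32_t reg);        /* assumed MMIO helpers    */
   void     mmio_write32(uint32_t reg, uint32_t val);

   int submit_commands(const void *cmds, uint32_t len)
   {
       uint32_t head = mmio_read32(RING_HEAD_REG);
       uint32_t free = (head - ring_tail - 8) & (RING_SIZE - 1);

       /* This sketch does not handle wrapping around the ring end. */
       if (len > free || ring_tail + len > RING_SIZE)
           return -1;                         /* ring full, try later   */

       memcpy(ring_base + ring_tail, cmds, len); /* pass-through CB write */
       ring_tail = (ring_tail + len) & (RING_SIZE - 1);
       mmio_write32(RING_TAIL_REG, ring_tail);   /* privileged MMIO access */
       return 0;
   }
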
Having introduced the GPU architecture abstraction, it is important for
|
||||
us to understand how real-world graphics applications use the GPU
|
||||
hardware so that we can virtualize it in VMs efficiently. To do so, we
|
||||
characterized, for some representative GPU-intensive 3D workloads (the
|
||||
Phoronix Test Suite), the usages of the four critical interfaces:
|
||||
|
||||
1) the Frame Buffer,
|
||||
2) the Command Buffer,
|
||||
3) the GPU Page Table Entries (PTEs), which carry the GPU page tables, and
|
||||
4) the I/O registers, including Memory-Mapped I/O (MMIO) registers,
|
||||
Port I/O (PIO) registers, and PCI configuration space registers
|
||||
for internal state.
|
||||
|
||||
:numref:`access-patterns` shows the average access frequency of running
|
||||
Phoronix 3D workloads on the four interfaces.
|
||||
|
||||
The Frame Buffer and Command Buffer are the most performance-critical
resources, as shown in :numref:`access-patterns`. When the applications
are being loaded, lots of source vertices and pixels are written by the
CPU, so the Frame Buffer accesses occur in the range of hundreds of
thousands per second. Then at run-time, the CPU programs the GPU through
the commands to render the Frame Buffer, so the Command Buffer accesses
become the largest group, also in the hundreds of thousands per second.
PTE and I/O accesses are minor in both load and run-time phases, ranging
in the tens of thousands per second.
|
||||
|
||||
.. figure:: images/APL_GVT-g-access-patterns.png
|
||||
:width: 400px
|
||||
:align: center
|
||||
:name: access-patterns
|
||||
|
||||
Access Patterns of Running 3D Workloads
|
||||
|
||||
High Level Architecture
***********************
|
||||
|
||||
:numref:`gvt-arch` shows the overall architecture of GVT-g, based on the
|
||||
ACRN hypervisor, with SOS as the privileged VM, and multiple user
|
||||
guests. A GVT-g device model working with the ACRN hypervisor,
|
||||
implements the policies of trap and pass-through. Each guest runs the
|
||||
native graphics driver and can directly access performance-critical
|
||||
resources: the Frame Buffer and Command Buffer, with resource
|
||||
partitioning (as presented later). To protect privileged resources, that
|
||||
is, the I/O registers and PTEs, corresponding accesses from the graphics
|
||||
driver in user VMs are trapped and forwarded to the GVT device model in
|
||||
SOS for emulation. The device model leverages i915 interfaces to access
|
||||
the physical GPU.
|
||||
|
||||
In addition, the device model implements a GPU scheduler that runs
|
||||
concurrently with the CPU scheduler in ACRN to share the physical GPU
|
||||
timeslot among the VMs. GVT-g uses the physical GPU to directly execute
|
||||
all the commands submitted from a VM, so it avoids the complexity of
|
||||
emulating the Render Engine, which is the most complex part of the GPU.
|
||||
In the meantime, the resource pass-through of both the Frame Buffer and
|
||||
Command Buffer minimizes the hypervisor's intervention of CPU accesses,
|
||||
while the GPU scheduler guarantees every VM a quantum time-slice for
|
||||
direct GPU execution. With that, GVT-g can achieve near-native
|
||||
performance for a VM workload.
|
||||
|
||||
In :numref:`gvt-arch`, the yellow GVT device model works as a client on
|
||||
top of an i915 driver in the SOS. It has a generic Mediated Pass-Through
|
||||
(MPT) interface, compatible with all types of hypervisors. For ACRN,
|
||||
some extra development work is needed for such MPT interfaces. For
|
||||
example, we need some changes in ACRN-DM to make ACRN compatible with
|
||||
the MPT framework. The vGPU lifecycle is the same as the lifecycle of
|
||||
the guest VM creation through ACRN-DM. They interact through sysfs,
|
||||
exposed by the GVT device model.
|
||||
|
||||
.. figure:: images/APL_GVT-g-arch.png
|
||||
:width: 600px
|
||||
:align: center
|
||||
:name: gvt-arch
|
||||
|
||||
AcrnGT High-level Architecture
|
||||
|
||||
Key Techniques
**************
|
||||
|
||||
vGPU Device Model
=================
|
||||
|
||||
The vGPU Device model is the main component because it constructs the
|
||||
vGPU instance for each guest to satisfy every GPU request from the guest
|
||||
and gives the corresponding result back to the guest.
|
||||
|
||||
The vGPU Device Model provides the basic framework to do
trap-and-emulation, including MMIO virtualization, interrupt
virtualization, and display virtualization. It also handles and
processes all the requests internally (such as command scan and shadow),
schedules them in the proper manner, and finally submits them to
the SOS i915 driver.
|
||||
|
||||
.. figure:: images/APL_GVT-g-DM.png
|
||||
:width: 800px
|
||||
:align: center
|
||||
:name: GVT-DM
|
||||
|
||||
GVT-g Device Model
|
||||
|
||||
MMIO Virtualization
-------------------
|
||||
|
||||
Intel Processor Graphics implements two PCI MMIO BARs:
|
||||
|
||||
- **GTTMMADR BAR**: Combines both :term:`GGTT` modification range and Memory
|
||||
Mapped IO range. It is 16 MB on :term:`BDW`, with 2 MB used by MMIO, 6 MB
|
||||
reserved and 8 MB allocated to GGTT. GGTT starts from
|
||||
:term:`GTTMMADR` + 8 MB. In this section, we focus on virtualization of
|
||||
the MMIO range, discussing GGTT virtualization later.
|
||||
|
||||
- **GMADR BAR**: As the PCI aperture is used by the CPU to access tiled
|
||||
graphics memory, GVT-g partitions this aperture range among VMs for
|
||||
performance reasons.
|
||||
|
||||
A 2 MB virtual MMIO structure is allocated per vGPU instance.
|
||||
|
||||
All the virtual MMIO registers are emulated as simple in-memory
read-write; that is, the guest driver will read back the same value that
was programmed earlier. A common emulation handler (for example,
``intel_gvt_emulate_read/write``) is enough to handle such general
emulation requirements. However, some registers need to be emulated with
specific logic, for example, registers affected by changes in other
state, or registers needing additional auditing or translation when the
virtual register is updated. Therefore, a specific emulation handler
must be installed for those special registers.

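One way to picture the split between the common handler and
register-specific handlers is sketched below. The types, offsets, and
function names are assumptions for illustration, not the actual GVT
code:

.. code-block:: c

   /* Hypothetical per-register MMIO dispatch: most registers use the
    * default in-memory emulation; special registers install their own
    * write handlers. */
   #include <stddef.h>
   #include <stdint.h>

   struct vgpu { uint8_t mmio[2 * 1024 * 1024]; };  /* 2 MB virtual MMIO */

   typedef int (*mmio_write_fn)(struct vgpu *vgpu, uint32_t off,
                                uint32_t val);

   /* Default behavior: remember the value so the guest reads it back. */
   static int default_mmio_write(struct vgpu *vgpu, uint32_t off,
                                 uint32_t val)
   {
       *(uint32_t *)(vgpu->mmio + off) = val;
       return 0;
   }

   /* A register needing extra audit/translation before it is reflected. */
   static int special_mmio_write(struct vgpu *vgpu, uint32_t off,
                                 uint32_t val)
   {
       /* ... audit or translate val here ... */
       return default_mmio_write(vgpu, off, val);
   }

   static const struct {
       uint32_t      offset;
       mmio_write_fn write;             /* NULL means "use the default" */
   } mmio_handlers[] = {
       { 0x2030, NULL },                /* plain in-memory emulation    */
       { 0x2230, special_mmio_write },  /* register with specific logic */
   };

   static int emulate_mmio_write(struct vgpu *vgpu, uint32_t off,
                                 uint32_t val)
   {
       size_t i;

       for (i = 0; i < sizeof(mmio_handlers) / sizeof(mmio_handlers[0]); i++)
           if (mmio_handlers[i].offset == off && mmio_handlers[i].write)
               return mmio_handlers[i].write(vgpu, off, val);

       return default_mmio_write(vgpu, off, val);  /* common handler path */
   }
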
The graphics driver may have assumptions about the initial device state,
which is the state left behind when the BIOS hands control over to the
OS. To meet the driver's expectation, we need to provide an initial vGPU
state that matches what a driver would observe on a pGPU. So the host
graphics driver is expected to generate a snapshot of the physical GPU
state, which it does before the guest driver's initialization. This
snapshot is used as the initial vGPU state by the device model.
|
||||
|
||||
PCI Configuration Space Virtualization
--------------------------------------
|
||||
|
||||
PCI configuration space also needs to be virtualized in the device
|
||||
model. Different implementations may choose to implement the logic
|
||||
within the vGPU device model or in default system device model (for
|
||||
example, ACRN-DM). GVT-g emulates the logic in the device model.
|
||||
|
||||
Some information is vital for the vGPU device model, including:
|
||||
Guest PCI BAR, Guest PCI MSI, and Base of ACPI OpRegion.
|
||||
|
||||
Legacy VGA Port I/O Virtualization
----------------------------------
|
||||
|
||||
Legacy VGA is not supported in the vGPU device model. We rely on the
|
||||
default device model (for example, :term:`QEMU`) to provide legacy VGA
|
||||
emulation, which means either ISA VGA emulation or
|
||||
PCI VGA emulation.
|
||||
|
||||
Interrupt Virtualization
------------------------
|
||||
|
||||
The GVT device model does not touch the hardware interrupt in the new
|
||||
architecture, since it is hard to combine the interrupt controlling
|
||||
logic between the virtual device model and the host driver. To prevent
|
||||
architectural changes in the host driver, the host GPU interrupt does
|
||||
not go to the virtual device model and the virtual device model has to
|
||||
handle the GPU interrupt virtualization by itself. Virtual GPU
|
||||
interrupts are categorized into three types:
|
||||
|
||||
- Periodic GPU interrupts are emulated by timers. However, a notable
|
||||
exception to this is the VBlank interrupt. Due to the demands of user
|
||||
space compositors, such as Wayland, which requires a flip done event
|
||||
to be synchronized with a VBlank, this interrupt is forwarded from
|
||||
SOS to UOS when SOS receives it from the hardware.
|
||||
|
||||
- Event-based GPU interrupts are emulated by the emulation logic. For
|
||||
example, AUX Channel Interrupt.
|
||||
|
||||
- GPU command interrupts are emulated by a command parser and workload
|
||||
dispatcher. The command parser marks out which GPU command interrupts
|
||||
are generated during the command execution and the workload
|
||||
dispatcher injects those interrupts into the VM after the workload is
|
||||
finished.
|
||||
|
||||
.. figure:: images/APL_GVT-g-interrupt-virt.png
|
||||
:width: 400px
|
||||
:align: center
|
||||
:name: interrupt-virt
|
||||
|
||||
Interrupt Virtualization
|
||||
|
||||
Workload Scheduler
------------------
|
||||
|
||||
The scheduling policy and workload scheduler are decoupled for
scalability reasons. For example, a future QoS enhancement will only
impact the scheduling policy, while any i915 interface change or HW
submission interface change (from execlist to :term:`GuC`) will only
need workload scheduler updates.
|
||||
|
||||
The scheduling policy framework is the core of the vGPU workload
|
||||
scheduling system. It controls all of the scheduling actions and
|
||||
provides the developer with a generic framework for easy development of
|
||||
scheduling policies. The scheduling policy framework controls the work
|
||||
scheduling process without caring about how the workload is dispatched
|
||||
or completed. All the detailed workload dispatching is hidden in the
|
||||
workload scheduler, which is the actual executer of a vGPU workload.
|
||||
|
||||
The workload scheduler handles everything about one vGPU workload. Each
|
||||
hardware ring is backed by one workload scheduler kernel thread. The
|
||||
workload scheduler picks the workload from current vGPU workload queue
|
||||
and communicates with the virtual HW submission interface to emulate the
|
||||
"schedule-in" status for the vGPU. It performs context shadow, Command
|
||||
Buffer scan and shadow, PPGTT page table pin/unpin/out-of-sync, before
|
||||
submitting this workload to the host i915 driver. When the vGPU workload
|
||||
is completed, the workload scheduler asks the virtual HW submission
|
||||
interface to emulate the "schedule-out" status for the vGPU. The VM
|
||||
graphics driver then knows that a GPU workload is finished.
|
||||
|
||||
.. figure:: images/APL_GVT-g-scheduling.png
   :width: 500px
   :align: center
   :name: scheduling

   GVT-g Scheduling Framework

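The per-ring flow described above can be summarized with the following
hypothetical sketch; all types and helper names are illustrative, not
the real GVT-g implementation:

.. code-block:: c

   /* Hypothetical per-ring scheduler thread following the steps
    * described above. */
   struct workload;     /* one vGPU workload picked from a per-ring queue */

   struct workload *pick_from_render_owner(int ring_id);  /* GVT scheduler   */
   void emulate_schedule_in(struct workload *w);          /* virtual HW port */
   int  shadow_context_and_commands(struct workload *w);  /* scan + shadow   */
   int  submit_to_host_i915(struct workload *w);
   void wait_for_completion(struct workload *w);
   void emulate_schedule_out(struct workload *w);         /* copy state back */

   static void workload_thread(int ring_id)
   {
       for (;;) {
           /* Only the current render owner's queue is serviced. */
           struct workload *w = pick_from_render_owner(ring_id);

           if (!w)
               continue;                      /* nothing to run yet         */

           emulate_schedule_in(w);            /* vGPU sees "running"        */
           if (shadow_context_and_commands(w) == 0 &&
               submit_to_host_i915(w) == 0)
               wait_for_completion(w);        /* GPU executes shadowed work */

           emulate_schedule_out(w);           /* send interrupts to the VM  */
       }
   }
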
Workload Submission Path
------------------------
|
||||
|
||||
Software submits the workload using the legacy ring buffer mode on Intel
|
||||
Processor Graphics before Broadwell, which is no longer supported by the
|
||||
GVT-g virtual device model. A new HW submission interface named
|
||||
"Execlist" is introduced since Broadwell. With the new HW submission
|
||||
interface, software can achieve better programmability and easier
|
||||
context management. In Intel GVT-g, the vGPU submits the workload
|
||||
through the virtual HW submission interface. Each workload in submission
|
||||
will be represented as an ``intel_vgpu_workload`` data structure, a vGPU
|
||||
workload, which will be put on a per-vGPU and per-engine workload queue
|
||||
later after performing a few basic checks and verifications.
|
||||
|
||||
.. figure:: images/APL_GVT-g-workload.png
|
||||
:width: 800px
|
||||
:align: center
|
||||
:name: workload
|
||||
|
||||
GVT-g Workload Submission
|
||||
|
||||
|
||||
Display Virtualization
----------------------
|
||||
|
||||
GVT-g reuses the i915 graphics driver in the SOS to initialize the Display
|
||||
Engine, and then manages the Display Engine to show different VM frame
|
||||
buffers. When two vGPUs have the same resolution, only the frame buffer
|
||||
locations are switched.
|
||||
|
||||
.. figure:: images/APL_GVT-g-display-virt.png
|
||||
:width: 800px
|
||||
:align: center
|
||||
:name: display-virt
|
||||
|
||||
Display Virtualization
|
||||
|
||||
Direct Display Model
--------------------
|
||||
|
||||
.. figure:: images/APL_GVT-g-direct-display.png
|
||||
:width: 600px
|
||||
:align: center
|
||||
:name: direct-display
|
||||
|
||||
Direct Display Model
|
||||
|
||||
A typical automotive use case is where there are two displays in the car
|
||||
and each one needs to show one domain's content, with the two domains
|
||||
being the Instrument cluster and the In Vehicle Infotainment (IVI). As
|
||||
shown in :numref:`direct-display`, this can be accomplished through the direct
|
||||
display model of GVT-g, where the SOS and UOS are each assigned all HW
|
||||
planes of two different pipes. GVT-g has a concept of display owner on a
|
||||
per HW plane basis. If it determines that a particular domain is the
|
||||
owner of a HW plane, then it allows the domain's MMIO register write to
|
||||
flip a frame buffer to that plane to go through to the HW. Otherwise,
|
||||
such writes are blocked by the GVT-g.
|
||||
|
||||
Indirect Display Model
----------------------
|
||||
|
||||
.. figure:: images/APL_GVT-g-indirect-display.png
|
||||
:width: 600px
|
||||
:align: center
|
||||
:name: indirect-display
|
||||
|
||||
Indirect Display Model
|
||||
|
||||
For security or fastboot reasons, it may be determined that the UOS is
|
||||
either not allowed to display its content directly on the HW or it may
|
||||
be too late before it boots up and displays its content. In such a
|
||||
scenario, the responsibility of displaying content on all displays lies
|
||||
with the SOS. One of the use cases that can be realized is to display the
|
||||
entire frame buffer of the UOS on a secondary display. GVT-g allows for this
|
||||
model by first trapping all MMIO writes by the UOS to the HW. A proxy
|
||||
application can then capture the address in GGTT where the UOS has written
|
||||
its frame buffer and using the help of the Hypervisor and the SOS's i915
|
||||
driver, can convert the Guest Physical Addresses (GPAs) into Host
|
||||
Physical Addresses (HPAs) before making a texture source or EGL image
|
||||
out of the frame buffer and then either post processing it further or
|
||||
simply displaying it on a HW plane of the secondary display.
|
||||
|
||||
GGTT-Based Surface Sharing
--------------------------
|
||||
|
||||
One of the major automotive use cases is called "surface sharing". This
use case requires that the SOS accesses an individual surface or a set of
surfaces from the UOS without having to access the entire frame buffer of
the UOS. Unlike the previous two models, where the UOS did not have to do
anything to show its content and therefore a completely unmodified UOS
could continue to run, this model requires changes to the UOS.
|
||||
|
||||
This model can be considered an extension of the indirect display model.
Under the indirect display model, the UOS's frame buffer was temporarily
pinned by the UOS in video memory, accessible through the Global Graphics
Translation Table (GGTT). This GGTT-based surface sharing model takes
this a step further by having the UOS compositor temporarily pin all
application buffers into the GGTT. It then also requires the compositor
to create a metadata table with relevant surface information, such as
width, height, and GGTT offset, and to flip that in lieu of the frame
buffer. In the SOS, the proxy application knows that the GGTT offset has
been flipped, maps it, and through it can access the GGTT offset of an
application that it wants to access. It is worth mentioning that in this
model, UOS applications do not require any changes; only the compositor,
Mesa, and i915 driver had to be modified.

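A rough idea of what one entry of such a metadata table could carry is
sketched below; this layout is an assumption for illustration, not the
format actually used by the UOS compositor:

.. code-block:: c

   /* Hypothetical per-surface metadata entry flipped in lieu of a frame
    * buffer: enough for the SOS proxy application to locate and
    * interpret the pinned surface through its GGTT offset. */
   #include <stdint.h>

   struct shared_surface_meta {
       uint32_t width;        /* surface width in pixels            */
       uint32_t height;       /* surface height in pixels           */
       uint32_t stride;       /* bytes per row (assumed field)      */
       uint32_t format;       /* pixel format code (assumed field)  */
       uint64_t ggtt_offset;  /* where the compositor pinned the BO */
   };
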
This model has a major benefit and a major limitation. The
|
||||
benefit is that since it builds on top of the indirect display model,
|
||||
there are no special drivers necessary for it on either SOS or UOS.
|
||||
Therefore, any Real Time Operating System (RTOS) that uses
this model can simply do so without having to implement a driver, the
infrastructure for which may not be present in its operating system.
|
||||
The limitation of this model is that video memory dedicated for a UOS is
|
||||
generally limited to a couple of hundred MBs. This can easily be
|
||||
exhausted by a few application buffers so the number and size of buffers
|
||||
is limited. Since it is not a highly-scalable model, in general, Intel
|
||||
recommends the Hyper DMA buffer sharing model, described next.
|
||||
|
||||
Hyper DMA Buffer Sharing
------------------------
|
||||
|
||||
.. figure:: images/APL_GVT-g-hyper-dma.png
|
||||
:width: 800px
|
||||
:align: center
|
||||
:name: hyper-dma
|
||||
|
||||
Hyper DMA Buffer Design
|
||||
|
||||
Another approach to surface sharing is Hyper DMA Buffer sharing. This
|
||||
model extends the Linux DMA buffer sharing mechanism where one driver is
|
||||
able to share its pages with another driver within one domain.
|
||||
|
||||
Application buffers are backed by i915 Graphics Execution Manager
Buffer Objects (GEM BOs). As in GGTT surface
|
||||
sharing, this model also requires compositor changes. The compositor of
|
||||
UOS requests i915 to export these application GEM BOs and then passes
|
||||
them on to a special driver called the Hyper DMA Buf exporter whose job
|
||||
is to create a scatter gather list of pages mapped by PDEs and PTEs and
|
||||
export a Hyper DMA Buf ID back to the compositor.
|
||||
|
||||
The compositor then shares this Hyper DMA Buf ID with the SOS's Hyper DMA
Buf importer driver, which then maps the memory represented by this ID in
the SOS. A proxy application in the SOS can then provide this ID
to the SOS i915, which can create its own GEM BO. Finally, the application
|
||||
can use it as an EGL image and do any post processing required before
|
||||
either providing it to the SOS compositor or directly flipping it on a
|
||||
HW plane in the compositor's absence.
|
||||
|
||||
This model is highly scalable and can be used to share up to 4 GB worth
of pages. It is also not limited to sharing graphics buffers; other
buffers, such as those for the IPU, can also be shared this way. However,
it does require that the SOS port the Hyper DMA Buffer importer driver.
Also, the SOS must comprehend and implement the DMA buffer sharing model.
|
||||
|
||||
For detailed information about this model, please refer to the `Linux
|
||||
HYPER_DMABUF Driver High Level Design
|
||||
<https://github.com/downor/linux_hyper_dmabuf/blob/hyper_dmabuf_integration_v4/Documentation/hyper-dmabuf-sharing.txt>`_.
|
||||
|
||||
Plane-Based Domain Ownership
----------------------------
|
||||
|
||||
.. figure:: images/APL_GVT-g-plane-based.png
|
||||
:width: 600px
|
||||
:align: center
|
||||
:name: plane-based
|
||||
|
||||
Plane-Based Domain Ownership
|
||||
|
||||
Yet another mechanism for showing content of both the SOS and UOS on the
|
||||
same physical display is called plane-based domain ownership. Under this
|
||||
model, both the SOS and UOS are provided a set of HW planes that they can
|
||||
flip their contents on to. Since each domain provides its content, there
|
||||
is no need for any extra composition to be done through the SOS. The display
|
||||
controller handles alpha blending contents of different domains on a
|
||||
single pipe. This saves on any complexity on either the SOS or the UOS
|
||||
SW stack.
|
||||
|
||||
It is important to provide only specific planes and have them statically
|
||||
assigned to different Domains. To achieve this, the i915 driver of both
|
||||
domains is provided a command line parameter that specifies the exact
|
||||
planes that this domain has access to. The i915 driver then enumerates
|
||||
only those HW planes and exposes them to its compositor. It is then left
|
||||
to the compositor configuration to use these planes appropriately and
|
||||
show the correct content on them. No other changes are necessary.
|
||||
|
||||
While the biggest benefit of this model is that it is extremely simple
and quick to implement, it also has some drawbacks. First, since each domain
|
||||
is responsible for showing the content on the screen, there is no
|
||||
control of the UOS by the SOS. If the UOS is untrusted, this could
|
||||
potentially cause some unwanted content to be displayed. Also, there is
|
||||
no post processing capability, except that provided by the display
|
||||
controller (for example, scaling, rotation, and so on). So each domain
|
||||
must provide finished buffers with the expectation that alpha blending
|
||||
with another domain will not cause any corruption or unwanted artifacts.
|
||||
|
||||
Graphics Memory Virtualization
==============================
|
||||
|
||||
To achieve near-to-native graphics performance, GVT-g passes through the
|
||||
performance-critical operations, such as Frame Buffer and Command Buffer
|
||||
from the VM. For the global graphics memory space, GVT-g uses graphics
|
||||
memory resource partitioning and an address space ballooning mechanism.
|
||||
For local graphics memory spaces, GVT-g implements per-VM local graphics
|
||||
memory through a render context switch because local graphics memory is
|
||||
only accessible by the GPU.
|
||||
|
||||
Global Graphics Memory
----------------------
|
||||
|
||||
Graphics Memory Resource Partitioning
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
GVT-g partitions the global graphics memory among VMs. Splitting the
CPU/GPU scheduling mechanism requires that the global graphics memory of
different VMs can be accessed by the CPU and the GPU simultaneously.
Consequently, GVT-g must, at any time, present each VM with its own
resource, leading to the resource partitioning approach for global
graphics memory, as shown in :numref:`mem-part`.
|
||||
|
||||
.. figure:: images/APL_GVT-g-mem-part.png
|
||||
:width: 800px
|
||||
:align: center
|
||||
:name: mem-part
|
||||
|
||||
Memory Partition and Ballooning
|
||||
|
||||
The performance impact of reduced global graphics memory resource
|
||||
due to memory partitioning is very limited according to various test
|
||||
results.
|
||||
|
||||
Address Space Ballooning
%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
The address space ballooning technique is introduced to eliminate the
address translation overhead, shown in :numref:`mem-part`. GVT-g exposes
the partitioning information to the VM graphics driver through the
PVINFO MMIO window. The graphics driver marks the other VMs' regions as
'ballooned' and reserves them so they are never used by its graphics
memory allocator. Under this design, the guest view of the global
graphics memory space is exactly the same as the host view, and the
driver-programmed addresses, using guest physical addresses, can be
directly used by the hardware. Address space ballooning is different
from traditional memory ballooning techniques. Memory ballooning is for
memory usage control, concerning the number of ballooned pages, while
address space ballooning is used to balloon special memory address
ranges.

Another benefit of address space ballooning is that there is no address
translation overhead, as we use the guest Command Buffer for direct GPU
execution.

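A minimal sketch of how a guest driver could act on that partitioning
information is shown below, assuming hypothetical helper names for the
PVINFO read and the allocator reservation; this is not the actual i915
ballooning code:

.. code-block:: c

   /* Hypothetical sketch: the guest driver learns its slice of the 4 GB
    * global graphics memory from the PVINFO window and balloons the
    * rest, so its allocator never touches other VMs' regions. */
   #include <stdint.h>

   #define GGTT_TOTAL_SIZE (4ULL << 30)             /* 4 GB global space */

   uint64_t pvinfo_read64(uint32_t offset);         /* trapped MMIO read */
   void     reserve_range(uint64_t start, uint64_t size); /* mark ballooned */

   void balloon_foreign_ranges(uint32_t base_off, uint32_t size_off)
   {
       uint64_t my_base = pvinfo_read64(base_off);  /* this VM's partition */
       uint64_t my_size = pvinfo_read64(size_off);

       /* Everything below and above our slice belongs to other VMs (or
        * the host) and is reserved away from the allocator. */
       reserve_range(0, my_base);
       reserve_range(my_base + my_size,
                     GGTT_TOTAL_SIZE - (my_base + my_size));
   }
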
Per-VM Local Graphics Memory
----------------------------
|
||||
|
||||
GVT-g allows each VM to use the full local graphics memory spaces of its
|
||||
own, similar to the virtual address spaces on the CPU. The local
|
||||
graphics memory spaces are only visible to the Render Engine in the GPU.
|
||||
Therefore, any valid local graphics memory address, programmed by a VM,
|
||||
can be used directly by the GPU. The GVT-g device model switches the
|
||||
local graphics memory spaces, between VMs, when switching render
|
||||
ownership.
|
||||
|
||||
GPU Page Table Virtualization
=============================
|
||||
|
||||
Shared Shadow GGTT
------------------
|
||||
|
||||
To achieve resource partitioning and address space ballooning, GVT-g
implements a shared shadow global page table for all VMs. Each VM has
its own guest global page table to translate the graphics memory page
number to the Guest memory Page Number (GPN). The shadow global page
table then translates the graphics memory page number to the Host memory
Page Number (HPN).

The shared shadow global page table maintains the translations for all
VMs to support concurrent accesses from the CPU and GPU. Therefore,
GVT-g implements a single, shared shadow global page table by trapping
guest PTE updates, as shown in :numref:`shared-shadow`. The global page
table, in MMIO space, has 1024K PTEs, each pointing to a 4 KB system
memory page, so the global page table overall creates a 4 GB global
graphics memory space. GVT-g audits the guest PTE values according to
the address space ballooning information before updating the shadow PTE
entries.

.. figure:: images/APL_GVT-g-shared-shadow.png
   :width: 600px
   :align: center
   :name: shared-shadow

   Shared Shadow Global Page Table

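The audit-then-shadow flow for a trapped guest PTE write can be sketched
as follows; the types, helpers, and PTE layout are assumptions for
illustration, not the actual GVT-g implementation:

.. code-block:: c

   /* Hypothetical handler for a trapped guest GGTT PTE write. */
   #include <stdbool.h>
   #include <stdint.h>

   struct vgpu;

   bool     index_within_vgpu_partition(struct vgpu *vgpu, uint32_t index);
   uint64_t gpn_to_hpn(struct vgpu *vgpu, uint64_t gpn);   /* host lookup */
   void     write_shadow_ggtt_entry(uint32_t index, uint64_t hpn,
                                    uint64_t flags);

   static int handle_guest_ggtt_write(struct vgpu *vgpu, uint32_t index,
                                      uint64_t guest_pte)
   {
       uint64_t gpn   = guest_pte >> 12;        /* assumed PTE layout */
       uint64_t flags = guest_pte & 0xfff;

       /* Audit: the entry must fall inside this vGPU's slice of the
        * 4 GB global graphics memory space (per the ballooning info). */
       if (!index_within_vgpu_partition(vgpu, index))
           return -1;

       /* Translate GPN to HPN and update the single shared shadow table
        * that the physical GPU actually walks. */
       write_shadow_ggtt_entry(index, gpn_to_hpn(vgpu, gpn), flags);
       return 0;
   }
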
Per-VM Shadow PPGTT
-------------------
|
||||
|
||||
To support local graphics memory access pass-through, GVT-g implements
|
||||
per-VM shadow local page tables. The local graphics memory is only
|
||||
accessible from the Render Engine. The local page tables have two-level
|
||||
paging structures, as shown in :numref:`per-vm-shadow`.
|
||||
|
||||
The first level, Page Directory Entries (PDEs), located in the global
|
||||
page table, points to the second level, Page Table Entries (PTEs) in
|
||||
system memory, so guest accesses to the PDE are trapped and emulated,
|
||||
through the implementation of shared shadow global page table.
|
||||
|
||||
GVT-g also write-protects a list of guest PTE pages for each VM. The
|
||||
GVT-g device model synchronizes the shadow page with the guest page, at
|
||||
the time of write-protection page fault, and switches the shadow local
|
||||
page tables at render context switches.
|
||||
|
||||
.. figure:: images/APL_GVT-g-per-vm-shadow.png
|
||||
:width: 800px
|
||||
:align: center
|
||||
:name: per-vm-shadow
|
||||
|
||||
Per-VM Shadow PPGTT
|
||||
|
||||
Prioritized Rendering and Preemption
====================================
|
||||
|
||||
Different Schedulers and Their Roles
------------------------------------
|
||||
|
||||
.. figure:: images/APL_GVT-g-scheduling-policy.png
|
||||
:width: 800px
|
||||
:align: center
|
||||
:name: scheduling-policy
|
||||
|
||||
Scheduling Policy
|
||||
|
||||
In the system, there are three different schedulers for the GPU:
|
||||
|
||||
- i915 UOS scheduler
|
||||
- Mediator GVT scheduler
|
||||
- i915 SOS scheduler
|
||||
|
||||
Since UOS always uses the host-based command submission (ELSP) model,
|
||||
and it never accesses the GPU or the Graphic Micro Controller (GuC)
|
||||
directly, its scheduler cannot do any preemption by itself.
|
||||
The i915 scheduler does ensure batch buffers are
|
||||
submitted in dependency order, that is, if a compositor had to wait for
|
||||
an application buffer to finish before its workload can be submitted to
|
||||
the GPU, then the i915 scheduler of the UOS ensures that this happens.
|
||||
|
||||
The UOS assumes that by submitting its batch buffers to the Execlist
|
||||
Submission Port (ELSP), the GPU will start working on them. However,
|
||||
the MMIO write to the ELSP is captured by the Hypervisor, which forwards
|
||||
these requests to the GVT module. GVT then creates a shadow context
|
||||
based on this batch buffer and submits the shadow context to the SOS
|
||||
i915 driver.
|
||||
|
||||
However, this submission depends on a second scheduler called the GVT
scheduler. This scheduler is time based and uses a round robin algorithm
to provide a specific time window for each UOS to submit its workload
when it is considered the "render owner". The workloads of the UOSs that
are not render owners during a specific time period end up waiting in
the virtual GPU context until the GVT scheduler makes them render
owners.
|
||||
The GVT shadow context submits only one workload at
|
||||
a time, and once the workload is finished by the GPU, it copies any
|
||||
context state back to DomU and sends the appropriate interrupts before
|
||||
picking up any other workloads from either this UOS or another one. This
|
||||
also implies that this scheduler does not do any preemption of
|
||||
workloads.
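
The round-robin "render owner" behavior described above can be
illustrated with a small, self-contained sketch. The structure, names,
and quantum handling below are hypothetical and greatly simplified; they
do not reflect the real GVT scheduler implementation.

.. code-block:: c

   #include <stdbool.h>
   #include <stdio.h>

   /* Hypothetical, simplified model of a time-based round-robin
    * "render owner" scheduler; all names are illustrative only.
    */
   struct vgpu {
       int id;
       bool has_pending_workload;
   };

   #define NUM_VGPU 4
   static struct vgpu vgpus[NUM_VGPU] = {
       { 0, true }, { 1, false }, { 2, true }, { 3, false },
   };
   static int render_owner;

   /* Stand-in for shadow-context creation plus ELSP submission. */
   static void submit_shadow_context(struct vgpu *v)
   {
       printf("vGPU %d is render owner; submitting one workload\n", v->id);
       v->has_pending_workload = false;
   }

   /* Called on every scheduling quantum: rotate the render owner and
    * submit at most one queued workload for the new owner.  Workloads
    * of non-owners simply stay queued until their turn comes around.
    */
   void gvt_sched_tick(void)
   {
       render_owner = (render_owner + 1) % NUM_VGPU;
       if (vgpus[render_owner].has_pending_workload)
           submit_shadow_context(&vgpus[render_owner]);
   }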

Finally, there is the i915 scheduler in the SOS. This scheduler uses the
GuC or ELSP to do command submission of SOS local content as well as any
content that GVT is submitting to it on behalf of the UOSs. This
scheduler uses GuC or ELSP to preempt workloads. GuC has four different
priority queues, but the SOS i915 driver uses only two of them. One of
them is considered high priority and the other is normal priority, with
a GuC rule being that any command submitted on the high priority queue
would immediately try to preempt any workload submitted on the normal
priority queue. For ELSP submission, the i915 driver submits a preempt
context to preempt the currently running context and then waits for the
GPU engine to become idle.

While the identification of workloads to be preempted is decided by
customizable scheduling policies, once a candidate for preemption is
identified, the i915 scheduler simply submits a preemption request to
the GuC high-priority queue. Based on the HW's ability to preempt (on an
Apollo Lake SoC, a 3D workload is preemptible at the 3D primitive level
with some exceptions), the currently executing workload is saved and
preempted. The GuC informs the driver of the preemption event using an
interrupt. After handling the interrupt, the driver submits the
high-priority workload through the normal priority GuC queue. As such,
the normal priority GuC queue is used for actual execbuf submission most
of the time, with the high-priority GuC queue only being used for the
preemption of lower-priority workloads.

Scheduling policies are customizable and left to customers to change if
they are not satisfied with the built-in i915 driver policy, where all
workloads of the SOS are considered higher priority than those of the
UOS. This policy can be enforced through an SOS i915 kernel command line
parameter, and can replace the default in-order command submission (no
preemption) policy.

AcrnGT
*******

ACRN is a flexible, lightweight reference hypervisor, built with
real-time and safety-criticality in mind, optimized to streamline
embedded development through an open source platform.

AcrnGT is the GVT-g implementation on the ACRN hypervisor. It adapts
the MPT interface of GVT-g onto ACRN by using the kernel APIs provided
by ACRN.

:numref:`full-pic` shows the full architecture of AcrnGT with a Linux Guest
OS and an Android Guest OS.

.. figure:: images/APL_GVT-g-full-pic.png
   :width: 800px
   :align: center
   :name: full-pic

   Full picture of the AcrnGT

AcrnGT in kernel
=================

The AcrnGT module in the SOS kernel acts as an adaptation layer
connecting GVT-g in the i915 driver, the VHM module, and the ACRN-DM
user space application:

- The AcrnGT module implements the MPT interface of GVT-g to provide
  services to it, including setting and unsetting trap areas, setting
  and unsetting write-protected pages, etc.

- It calls the VHM APIs provided by the ACRN VHM module in the SOS
  kernel, to eventually call into the routines provided by the ACRN
  hypervisor through hypercalls.

- It provides user space interfaces through ``sysfs`` to the user space
  ACRN-DM, so that the DM can manage the lifecycle of the virtual GPUs.

AcrnGT in DM
=============

To emulate a PCI device to a Guest, we need an AcrnGT sub-module in the
ACRN-DM. This sub-module is responsible for:

- registering the virtual GPU device to the PCI device tree presented to
  the guest;

- registering the MMIO resources with ACRN-DM so that it can reserve
  resources in the ACPI table;

- managing the lifecycle of the virtual GPU device, such as creation,
  destruction, and resetting according to the state of the virtual
  machine.
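
A hedged sketch of the kind of operations table such a DM sub-module
might register is shown below. The structure and callback names are
illustrative assumptions made for this document only and do not match
the actual ACRN-DM interface.

.. code-block:: c

   /* Illustrative only: a hypothetical operations table for the AcrnGT
    * sub-module in the DM; names do not match the real ACRN-DM code.
    */
   struct acrn_vm;                  /* opaque handle for the guest VM */

   struct gvt_dm_ops {
       /* add the virtual GPU to the PCI device tree seen by the guest
        * and reserve its MMIO ranges in the guest ACPI tables
        */
       int  (*register_vgpu)(struct acrn_vm *vm, int bus, int slot, int func);

       /* lifecycle management, driven by the virtual machine's state */
       int  (*create_instance)(struct acrn_vm *vm);
       void (*reset_instance)(struct acrn_vm *vm);
       void (*destroy_instance)(struct acrn_vm *vm);
   };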
10
doc/developer-guides/hld/hld-devicemodel.rst
Normal file
@@ -0,0 +1,10 @@
.. _hld-devicemodel:

Device Model high-level design
##############################


.. toctree::
   :maxdepth: 1

   ACPI virtualization <acpi-virt>
11
doc/developer-guides/hld/hld-emulated-devices.rst
Normal file
@@ -0,0 +1,11 @@
.. _hld-emulated-devices:

Emulated Devices high-level design
##################################

.. toctree::
   :maxdepth: 1

   GVT-g GPU Virtualization <hld-APL_GVT-g>
   UART virtualization <uart-virt-hld>
   Watchdog virtualization <watchdog-hld>
11
doc/developer-guides/hld/hld-hypervisor.rst
Normal file
@@ -0,0 +1,11 @@
.. _hld-hypervisor:

Hypervisor high-level design
############################


.. toctree::
   :maxdepth: 1

   Memory management <memmgt-hld>
   Interrupt management <interrupt-hld>
4
doc/developer-guides/hld/hld-overview.rst
Normal file
@@ -0,0 +1,4 @@
.. _hld-overview:

Overview
########
4
doc/developer-guides/hld/hld-power-management.rst
Normal file
@@ -0,0 +1,4 @@
.. _hld-power-management:

Power Management high-level design
##################################
1079
doc/developer-guides/hld/hld-security.rst
Normal file
4
doc/developer-guides/hld/hld-trace-log.rst
Normal file
@@ -0,0 +1,4 @@
.. _hld-trace-log:

Tracing and Logging high-level design
#####################################
499
doc/developer-guides/hld/hld-virtio-devices.rst
Normal file
@@ -0,0 +1,499 @@
.. _hld-virtio-devices:
.. _virtio-hld:

Virtio devices high-level design
################################

The ACRN Hypervisor follows the `Virtual I/O Device (virtio)
specification
<http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html>`_ to
realize I/O virtualization for many performance-critical devices
supported in the ACRN project. Adopting the virtio specification lets us
reuse many frontend virtio drivers already available in a Linux-based
User OS, drastically reducing potential development effort for frontend
virtio drivers. To further reduce the development effort of backend
virtio drivers, the hypervisor provides the virtio backend service
(VBS) APIs, which make it very straightforward to implement a virtio
device in the hypervisor.

The virtio APIs can be divided into 3 groups: DM APIs, virtio backend
service (VBS) APIs, and virtqueue (VQ) APIs, as shown in
:numref:`be-interface`.

.. figure:: images/virtio-hld-image0.png
   :width: 900px
   :align: center
   :name: be-interface

   ACRN Virtio Backend Service Interface

- **DM APIs** are exported by the DM, and are mainly used during the
  device initialization phase and runtime. The DM APIs also include
  PCIe emulation APIs because each virtio device is a PCIe device in
  the SOS and UOS.
- **VBS APIs** are mainly exported by the VBS and related modules.
  Generally they are callbacks to be registered into the DM.
- **VQ APIs** are used by a virtio backend device to access and parse
  information from the shared memory between the frontend and backend
  device drivers.

The virtio framework is the para-virtualization specification that ACRN
follows to implement I/O virtualization of performance-critical
devices such as audio, eAVB/TSN, IPU, and CSMU devices. This section
gives an overview of virtio history, motivation, and advantages, and
then highlights key virtio concepts. It then describes ACRN's virtio
architecture and elaborates on the ACRN virtio APIs. Finally, it
introduces the virtio devices currently supported by ACRN.

Virtio introduction
*******************

Virtio is an abstraction layer over devices in a para-virtualized
hypervisor. Virtio was developed by Rusty Russell when he worked at IBM
research to support his lguest hypervisor in 2007, and it quickly became
the de-facto standard for KVM's para-virtualized I/O devices.

Virtio is very popular for virtual I/O devices because it provides a
straightforward, efficient, standard, and extensible mechanism, and
eliminates the need for boutique, per-environment, or per-OS mechanisms.
For example, rather than having a variety of device emulation
mechanisms, virtio provides a common frontend driver framework that
standardizes device interfaces, and increases code reuse across
different virtualization platforms.

Given the advantages of virtio, ACRN also follows the virtio
specification.

Key Concepts
************

To better understand virtio, especially its usage in ACRN, we'll
highlight several key virtio concepts important to ACRN:

Frontend virtio driver (FE)
  Virtio adopts a frontend-backend architecture that enables a simple
  but flexible framework for both frontend and backend virtio drivers.
  The FE driver merely needs to offer services to configure the
  interface, pass messages, produce requests, and kick the backend
  virtio driver. As a result, the FE driver is easy to implement and
  the performance overhead of emulating a device is eliminated.

Backend virtio driver (BE)
  Similar to the FE driver, the BE driver, running either in user-land
  or kernel-land of the host OS, consumes requests from the FE driver
  and sends them to the host native device driver. Once the requests
  are done by the host native device driver, the BE driver notifies the
  FE driver that the request is complete.

  Note: to distinguish the BE driver from the host native device
  driver, the host native device driver is called the "native driver"
  in this document.

Straightforward: virtio devices as standard devices on existing buses
  Instead of creating new device buses from scratch, virtio devices are
  built on existing buses. This gives a straightforward way for both FE
  and BE drivers to interact with each other. For example, the FE
  driver could read/write registers of the device, and the virtual
  device could interrupt the FE driver, on behalf of the BE driver, in
  case something of interest is happening.

  Currently virtio supports the PCI/PCIe bus and the MMIO bus. In ACRN,
  only the PCI/PCIe bus is supported, and all the virtio devices share
  the same vendor ID 0x1AF4.

  Note: For MMIO, the "bus" is a bit of an overstatement since it is
  basically just a few descriptors describing the devices.

Efficient: batching operation is encouraged
  Batching operations and deferred notification are important to
  achieve high-performance I/O, since notification between the FE and
  BE drivers usually involves an expensive exit of the guest. Therefore
  batching operations and notification suppression are highly
  encouraged if possible. This gives an efficient implementation for
  performance-critical devices.

Standard: virtqueue
  All virtio devices share a standard ring buffer and descriptor
  mechanism, called a virtqueue, shown in :numref:`virtqueue`. A
  virtqueue is a queue of scatter-gather buffers. There are three
  important methods on virtqueues:

  - **add_buf** is for adding a request/response buffer in a virtqueue,
  - **get_buf** is for getting a response/request in a virtqueue, and
  - **kick** is for notifying the other side for a virtqueue to consume
    buffers.

  The virtqueues are created in guest physical memory by the FE
  drivers. BE drivers only need to parse the virtqueue structures to
  obtain the requests and process them. How a virtqueue is organized is
  specific to the Guest OS. In the Linux implementation of virtio, the
  virtqueue is implemented as a ring buffer structure called vring.

  In ACRN, the virtqueue APIs can be leveraged directly so that users
  don't need to worry about the details of the virtqueue. (Refer to the
  guest OS for more details about the virtqueue implementation.)

  .. figure:: images/virtio-hld-image2.png
     :width: 900px
     :align: center
     :name: virtqueue

     Virtqueue

Extensible: feature bits
  A simple extensible feature negotiation mechanism exists for each
  virtual device and its driver. Each virtual device could claim its
  device-specific features while the corresponding driver could respond
  to the device with the subset of features the driver understands. The
  feature mechanism enables forward and backward compatibility for the
  virtual device and driver.
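
  As a minimal illustration of this mechanism (the feature-bit names
  below are hypothetical, not taken from the virtio specification),
  negotiation amounts to taking the intersection of what the device
  offers and what the driver understands:

  .. code-block:: c

     #include <stdint.h>
     #include <stdio.h>

     /* Hypothetical feature bits for an imaginary device. */
     #define HYPOTHETICAL_F_EVENT_IDX   (1u << 0)
     #define HYPOTHETICAL_F_INDIRECT    (1u << 1)
     #define HYPOTHETICAL_F_FANCY_MODE  (1u << 2)

     int main(void)
     {
         /* Features the (virtual) device claims to support ... */
         uint32_t device_features = HYPOTHETICAL_F_EVENT_IDX |
                                    HYPOTHETICAL_F_INDIRECT  |
                                    HYPOTHETICAL_F_FANCY_MODE;

         /* ... and features this driver actually understands. */
         uint32_t driver_features = HYPOTHETICAL_F_EVENT_IDX |
                                    HYPOTHETICAL_F_INDIRECT;

         /* The negotiated set is the intersection; both sides must
          * operate using only these bits afterwards.
          */
         uint32_t negotiated = device_features & driver_features;

         printf("negotiated features: 0x%x\n", negotiated);
         return 0;
     }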

Virtio Device Modes
  The virtio specification defines three modes of virtio devices:
  a legacy mode device, a transitional mode device, and a modern mode
  device. A legacy mode device is compliant with virtio specification
  version 0.95, a transitional mode device is compliant with both the
  0.95 and 1.0 spec versions, and a modern mode device is only
  compatible with the version 1.0 specification.

  In ACRN, all the virtio devices are transitional devices, meaning
  that they should be compatible with both the 0.95 and 1.0 versions of
  the virtio specification.

Virtio Device Discovery
  Virtio devices are commonly implemented as PCI/PCIe devices. A
  virtio device using virtio over the PCI/PCIe bus must expose an
  interface to the Guest OS that meets the PCI/PCIe specifications.

  Conventionally, any PCI device with Vendor ID 0x1AF4,
  PCI_VENDOR_ID_REDHAT_QUMRANET, and Device ID 0x1000 through 0x107F
  inclusive is a virtio device. Among the Device IDs, the
  legacy/transitional mode virtio devices occupy the first 64 IDs
  ranging from 0x1000 to 0x103F, while the range 0x1040-0x107F belongs
  to virtio modern devices. In addition, the Subsystem Vendor ID should
  reflect the PCI/PCIe vendor ID of the environment, and the Subsystem
  Device ID indicates which virtio device is supported by the device.
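
  The ID check described above can be written down directly; the helper
  functions below are illustrative (they are not part of any ACRN or
  Linux API), but the ID ranges are the ones listed in this section:

  .. code-block:: c

     #include <stdbool.h>
     #include <stdint.h>

     #define VIRTIO_PCI_VENDOR_ID  0x1AF4u  /* PCI_VENDOR_ID_REDHAT_QUMRANET */

     /* Any device in the 0x1000-0x107F range with the virtio vendor ID */
     bool is_virtio_device(uint16_t vendor_id, uint16_t device_id)
     {
         return (vendor_id == VIRTIO_PCI_VENDOR_ID) &&
                (device_id >= 0x1000u) && (device_id <= 0x107Fu);
     }

     /* Modern (virtio 1.0 only) devices use the upper half of the range */
     bool is_modern_virtio_device(uint16_t vendor_id, uint16_t device_id)
     {
         return (vendor_id == VIRTIO_PCI_VENDOR_ID) &&
                (device_id >= 0x1040u) && (device_id <= 0x107Fu);
     }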

Virtio Frameworks
*****************

This section describes the overall architecture of virtio, and then
introduces the ACRN-specific implementations of the virtio framework.

Architecture
============

Virtio adopts a frontend-backend
architecture, as shown in :numref:`virtio-arch`. Basically, the FE and
BE drivers communicate with each other through shared memory, via the
virtqueues. The FE driver talks to the BE driver in the same way it
would talk to a real PCIe device. The BE driver handles requests
from the FE driver, and notifies the FE driver once the request has been
processed.

.. figure:: images/virtio-hld-image1.png
   :width: 900px
   :align: center
   :name: virtio-arch

   Virtio Architecture

In addition to virtio's frontend-backend architecture, both FE and BE
drivers follow a layered architecture, as shown in
:numref:`virtio-fe-be`. Each
side has three layers: transports, core models, and device types.
All virtio devices share the same virtio infrastructure, including
virtqueues, feature mechanisms, configuration space, and buses.

.. figure:: images/virtio-hld-image4.png
   :width: 900px
   :align: center
   :name: virtio-fe-be

   Virtio Frontend/Backend Layered Architecture

Virtio Framework Considerations
===============================

How to realize the virtio framework is specific to a
hypervisor implementation. In ACRN, the virtio framework implementations
can be classified into two types, virtio backend service in user-land
(VBS-U) and virtio backend service in kernel-land (VBS-K), according to
where the virtio backend service (VBS) is located. Although different in
their BE drivers, both VBS-U and VBS-K share the same FE drivers. The
reason behind the two virtio implementations is to meet the requirement
of supporting a large number of diverse I/O devices in the ACRN project.

When developing a virtio BE device driver, the device owner should
choose carefully between VBS-U and VBS-K. Generally, VBS-U targets
non-performance-critical devices, but enables easy development and
debugging. VBS-K targets performance-critical devices.

The next two sections introduce ACRN's two implementations of the virtio
framework.

User-Land Virtio Framework
==========================

The architecture of the ACRN user-land virtio framework (VBS-U) is shown
in :numref:`virtio-userland`.

The FE driver talks to the BE driver as if it were talking with a PCIe
device. This means that for the "control plane", the FE driver could
poke device registers through PIO or MMIO, and the device will interrupt
the FE driver when something happens. For the "data plane", the
communication between the FE and BE drivers is through shared memory, in
the form of virtqueues.

On the Service OS side where the BE driver is located, there are several
key components in ACRN, including the device model (DM), the virtio and
HV service module (VHM), VBS-U, and the user-level vring service API
helpers.

The DM bridges the FE driver and BE driver since each VBS-U module
emulates a PCIe virtio device. The VHM bridges the DM and the hypervisor
by providing remote memory map APIs and notification APIs. VBS-U
accesses the virtqueue through the user-level vring service API helpers.

.. figure:: images/virtio-hld-image3.png
   :width: 900px
   :align: center
   :name: virtio-userland

   ACRN User-Land Virtio Framework

Kernel-Land Virtio Framework
============================

The architecture of the ACRN kernel-land virtio framework (VBS-K) is
shown in :numref:`virtio-kernelland`.

VBS-K provides acceleration for performance-critical devices emulated by
VBS-U modules by handling the "data plane" of the devices directly in
the kernel. When VBS-K is enabled for a certain device, the kernel-land
vring service API helpers are used to access the virtqueues shared by
the FE driver. Compared to VBS-U, this eliminates the overhead of
copying data back and forth between user-land and kernel-land within the
Service OS, at the cost of extra implementation complexity in the BE
drivers.

Except for the differences mentioned above, VBS-K still relies on VBS-U
for feature negotiations between the FE and BE drivers. This means the
"control plane" of the virtio device still remains in VBS-U. When
feature negotiation is done, which is determined by the FE driver
setting an indicative flag, the VBS-K module is initialized by VBS-U,
after which all request handling is offloaded to VBS-K in the kernel.

The FE driver is not aware of how the BE driver is implemented, either
in the VBS-U or VBS-K model. This saves engineering effort regarding FE
driver development.

.. figure:: images/virtio-hld-image6.png
   :width: 900px
   :align: center
   :name: virtio-kernelland

   ACRN Kernel-Land Virtio Framework

Virtio APIs
***********

This section provides details on the ACRN virtio APIs. As outlined
previously, the ACRN virtio APIs can be divided into three groups: DM
APIs, VBS APIs, and VQ APIs. The following sections will elaborate on
these APIs.

VBS-U Key Data Structures
=========================

The key data structures for VBS-U are listed as follows, and their
relationships are shown in :numref:`VBS-U-data`.

``struct pci_virtio_blk``
  An example virtio device, such as virtio-blk.
``struct virtio_common``
  A common component to any virtio device.
``struct virtio_ops``
  Virtio-specific operation functions for this type of virtio device.
``struct pci_vdev``
  Instance of a virtual PCIe device, and any virtio device is a virtual
  PCIe device.
``struct pci_vdev_ops``
  PCIe device's operation functions for this type of device.
``struct vqueue_info``
  Instance of a virtqueue.

.. figure:: images/virtio-hld-image5.png
   :width: 900px
   :align: center
   :name: VBS-U-data

   VBS-U Key Data Structures

Each virtio device is a PCIe device. In addition, each virtio device
could have zero or more virtqueues, depending on the device type. The
``struct virtio_common`` is a key data structure to be manipulated by
the DM, and the DM finds other key data structures through it. The
``struct virtio_ops`` abstracts a series of virtio callbacks to be
provided by the device owner.
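
The relationships among these structures can be sketched roughly as
follows. The field layout is an illustrative assumption made for this
document; the actual definitions in the ACRN device model differ.

.. code-block:: c

   #include <stdint.h>

   /* Simplified, illustrative sketch of the VBS-U structure
    * relationships; the contents are hypothetical and do not match the
    * real ACRN device model definitions.
    */
   struct vqueue_info;              /* one instance per virtqueue        */
   struct pci_vdev;                 /* the emulated PCIe device instance */

   struct virtio_ops {              /* callbacks from the device owner   */
       const char *name;
       int  (*cfgread)(void *vdev, int offset, int size, uint32_t *val);
       int  (*cfgwrite)(void *vdev, int offset, int size, uint32_t val);
       void (*reset)(void *vdev);
   };

   struct virtio_common {           /* shared by every VBS-U device      */
       struct virtio_ops  *vops;    /* device-specific callbacks         */
       struct pci_vdev    *dev;     /* backing emulated PCIe device      */
       struct vqueue_info *queues;  /* zero or more virtqueues           */
       int                 nvq;
   };

   struct pci_virtio_blk {          /* an example device: virtio-blk     */
       struct virtio_common common; /* common part comes first so the DM
                                     * can recover it from a generic
                                     * device pointer                    */
       /* ... device-specific state (backing file, request ring, ...) */
   };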

VBS-K Key Data Structures
=========================

The key data structures for VBS-K are listed as follows, and their
relationships are shown in :numref:`VBS-K-data`.

``struct vbs_k_rng``
  In-kernel VBS-K component handling the data plane of a VBS-U virtio
  device, for example the virtio random number generator.
``struct vbs_k_dev``
  In-kernel VBS-K component common to all VBS-K modules.
``struct vbs_k_vq``
  In-kernel VBS-K component that works with the kernel vring service
  API helpers.
``struct vbs_k_dev_info``
  Virtio device information to be synchronized from VBS-U to the VBS-K
  kernel module.
``struct vbs_k_vq_info``
  Information about a single virtqueue, to be synchronized from VBS-U
  to the VBS-K kernel module.
``struct vbs_k_vqs_info``
  Information about all virtqueues of a virtio device, to be
  synchronized from VBS-U to the VBS-K kernel module.

.. figure:: images/virtio-hld-image8.png
   :width: 900px
   :align: center
   :name: VBS-K-data

   VBS-K Key Data Structures

In VBS-K, a ``struct vbs_k_xxx`` represents the in-kernel component
handling a virtio device's data plane. It presents a char device for
VBS-U to open and register device status after feature negotiation with
the FE driver.

The device status includes negotiated features, number of virtqueues,
interrupt information, and more. All of this status information is
synchronized from VBS-U to VBS-K. In VBS-U, the ``struct
vbs_k_dev_info`` and ``struct vbs_k_vqs_info`` collect all the
information and notify VBS-K through ioctls. In VBS-K, the ``struct
vbs_k_dev`` and ``struct vbs_k_vq``, which are common to all VBS-K
modules, are the counterparts that preserve the related information.
This information is needed by the kernel-land vring service API helpers.

DM APIs
=======

The DM APIs are exported by the DM, and they should be used when
realizing BE device drivers on ACRN.

[API Material from doxygen comments]

VBS APIs
========

The VBS APIs are exported by VBS-related modules, including VBS, DM, and
SOS kernel modules. They can be classified into the VBS-U and VBS-K APIs
listed as follows.

VBS-U APIs
----------

These APIs provided by VBS-U are callbacks to be registered with the DM,
and the virtio framework within the DM will invoke them appropriately.

[API Material from doxygen comments]

VBS-K APIs
----------

The VBS-K APIs are exported by VBS-K related modules. Users could use
the following APIs to implement their VBS-K modules.

APIs provided by DM
~~~~~~~~~~~~~~~~~~~

[API Material from doxygen comments]

APIs provided by VBS-K modules in service OS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[API Material from doxygen comments]

VQ APIs
=======

The virtqueue APIs, or VQ APIs, are used by a BE device driver to
access the virtqueues shared by the FE driver. The VQ APIs abstract the
details of virtqueues so that users don't need to worry about the data
structures within the virtqueues. In addition, the VQ APIs are designed
to be identical between VBS-U and VBS-K, so that users don't need to
learn different APIs when implementing BE drivers based on VBS-U and
VBS-K.

[API Material from doxygen comments]

Below is an example showing the typical logic of how a BE driver handles
requests from an FE driver.

.. code-block:: c

   static void BE_callback(struct pci_virtio_xxx *pv, struct vqueue_info *vq)
   {
       struct iovec iov;
       uint16_t idx;
       uint32_t len = 0;

       while (vq_has_descs(vq)) {
           /* Fetch the next descriptor chain posted by the FE driver */
           vq_getchain(vq, &idx, &iov, 1, NULL);
           /* handle requests in iov; len is the number of bytes written
            * back to the FE driver, if any
            */
           request_handle_proc();
           /* Release this chain and handle more */
           vq_relchain(vq, idx, len);
       }
       /* Generate interrupt if appropriate. 1 means ring empty */
       vq_endchains(vq, 1);
   }

Supported Virtio Devices
************************

All the BE virtio drivers are implemented using the
ACRN virtio APIs, and the FE drivers reuse the standard Linux FE
virtio drivers. Devices with FE drivers available in the Linux
kernel should use the standard virtio Vendor ID/Device ID and
Subsystem Vendor ID/Subsystem Device ID. For other devices within ACRN,
their temporary IDs are listed in the following table.

.. table:: Virtio Devices without existing FE drivers in Linux
   :align: center
   :name: virtio-device-table

   +--------------+-------------+-------------+-------------+-------------+
   | virtio       | Vendor ID   | Device ID   | Subvendor   | Subdevice   |
   | device       |             |             | ID          | ID          |
   +--------------+-------------+-------------+-------------+-------------+
   | RPMB         | 0x8086      | 0x8601      | 0x8086      | 0xFFFF      |
   +--------------+-------------+-------------+-------------+-------------+
   | HECI         | 0x8086      | 0x8602      | 0x8086      | 0xFFFE      |
   +--------------+-------------+-------------+-------------+-------------+
   | audio        | 0x8086      | 0x8603      | 0x8086      | 0xFFFD      |
   +--------------+-------------+-------------+-------------+-------------+
   | IPU          | 0x8086      | 0x8604      | 0x8086      | 0xFFFC      |
   +--------------+-------------+-------------+-------------+-------------+
   | TSN/AVB      | 0x8086      | 0x8605      | 0x8086      | 0xFFFB      |
   +--------------+-------------+-------------+-------------+-------------+
   | hyper_dmabuf | 0x8086      | 0x8606      | 0x8086      | 0xFFFA      |
   +--------------+-------------+-------------+-------------+-------------+
   | HDCP         | 0x8086      | 0x8607      | 0x8086      | 0xFFF9      |
   +--------------+-------------+-------------+-------------+-------------+
   | COREU        | 0x8086      | 0x8608      | 0x8086      | 0xFFF8      |
   +--------------+-------------+-------------+-------------+-------------+

The following sections introduce the status of the virtio devices
currently supported in ACRN.

.. toctree::
   :maxdepth: 1

   virtio-blk
   virtio-net
   virtio-console
   virtio-rnd
4
doc/developer-guides/hld/hld-vm-management.rst
Normal file
@@ -0,0 +1,4 @@
.. _hld-vm-management:

VM Management high-level design
###############################
4
doc/developer-guides/hld/hld-vsbl.rst
Normal file
@@ -0,0 +1,4 @@
.. _hld-vsbl:

Virtual Slim-Bootloader high-level design
#########################################
BIN
doc/developer-guides/hld/images/APL_GVT-g-DM.png
Normal file
After Width: | Height: | Size: 81 KiB |
BIN
doc/developer-guides/hld/images/APL_GVT-g-access-patterns.png
Normal file
After Width: | Height: | Size: 26 KiB |
BIN
doc/developer-guides/hld/images/APL_GVT-g-api-forwarding.png
Normal file
After Width: | Height: | Size: 72 KiB |
BIN
doc/developer-guides/hld/images/APL_GVT-g-arch.png
Normal file
After Width: | Height: | Size: 60 KiB |
BIN
doc/developer-guides/hld/images/APL_GVT-g-direct-display.png
Normal file
After Width: | Height: | Size: 34 KiB |
BIN
doc/developer-guides/hld/images/APL_GVT-g-display-virt.png
Normal file
After Width: | Height: | Size: 173 KiB |
BIN
doc/developer-guides/hld/images/APL_GVT-g-full-pic.png
Normal file
After Width: | Height: | Size: 201 KiB |
BIN
doc/developer-guides/hld/images/APL_GVT-g-graphics-arch.png
Normal file
After Width: | Height: | Size: 14 KiB |
BIN
doc/developer-guides/hld/images/APL_GVT-g-hyper-dma.png
Normal file
After Width: | Height: | Size: 147 KiB |
BIN
doc/developer-guides/hld/images/APL_GVT-g-indirect-display.png
Normal file
After Width: | Height: | Size: 35 KiB |
BIN
doc/developer-guides/hld/images/APL_GVT-g-interrupt-virt.png
Normal file
After Width: | Height: | Size: 58 KiB |
BIN
doc/developer-guides/hld/images/APL_GVT-g-ive-use-case.png
Normal file
After Width: | Height: | Size: 117 KiB |
After Width: | Height: | Size: 166 KiB |
BIN
doc/developer-guides/hld/images/APL_GVT-g-mem-part.png
Normal file
After Width: | Height: | Size: 101 KiB |
BIN
doc/developer-guides/hld/images/APL_GVT-g-pass-through.png
Normal file
After Width: | Height: | Size: 71 KiB |
BIN
doc/developer-guides/hld/images/APL_GVT-g-per-vm-shadow.png
Normal file
After Width: | Height: | Size: 450 KiB |
BIN
doc/developer-guides/hld/images/APL_GVT-g-perf-critical.png
Normal file
After Width: | Height: | Size: 62 KiB |
BIN
doc/developer-guides/hld/images/APL_GVT-g-plane-based.png
Normal file
After Width: | Height: | Size: 76 KiB |
BIN
doc/developer-guides/hld/images/APL_GVT-g-scheduling-policy.png
Normal file
After Width: | Height: | Size: 75 KiB |
BIN
doc/developer-guides/hld/images/APL_GVT-g-scheduling.png
Normal file
After Width: | Height: | Size: 86 KiB |
BIN
doc/developer-guides/hld/images/APL_GVT-g-shared-shadow.png
Normal file
After Width: | Height: | Size: 34 KiB |
BIN
doc/developer-guides/hld/images/APL_GVT-g-workload.png
Normal file
After Width: | Height: | Size: 32 KiB |
BIN
doc/developer-guides/hld/images/acpi-image1.png
Normal file
After Width: | Height: | Size: 37 KiB |
BIN
doc/developer-guides/hld/images/acpi-image2.png
Normal file
After Width: | Height: | Size: 30 KiB |
BIN
doc/developer-guides/hld/images/acpi-image3.png
Normal file
After Width: | Height: | Size: 36 KiB |
BIN
doc/developer-guides/hld/images/acpi-image5.png
Normal file
After Width: | Height: | Size: 50 KiB |
BIN
doc/developer-guides/hld/images/interrupt-image2.png
Normal file
After Width: | Height: | Size: 20 KiB |
BIN
doc/developer-guides/hld/images/interrupt-image3.png
Normal file
After Width: | Height: | Size: 24 KiB |
BIN
doc/developer-guides/hld/images/interrupt-image4.png
Normal file
After Width: | Height: | Size: 36 KiB |
BIN
doc/developer-guides/hld/images/interrupt-image5.png
Normal file
After Width: | Height: | Size: 121 KiB |
BIN
doc/developer-guides/hld/images/interrupt-image6.png
Normal file
After Width: | Height: | Size: 29 KiB |
BIN
doc/developer-guides/hld/images/interrupt-image7.png
Normal file
After Width: | Height: | Size: 18 KiB |
BIN
doc/developer-guides/hld/images/mem-image1.png
Normal file
After Width: | Height: | Size: 4.7 KiB |
BIN
doc/developer-guides/hld/images/mem-image2.png
Normal file
After Width: | Height: | Size: 23 KiB |
BIN
doc/developer-guides/hld/images/mem-image3.png
Normal file
After Width: | Height: | Size: 16 KiB |
BIN
doc/developer-guides/hld/images/mem-image4.png
Normal file
After Width: | Height: | Size: 11 KiB |
BIN
doc/developer-guides/hld/images/mem-image5.png
Normal file
After Width: | Height: | Size: 35 KiB |
BIN
doc/developer-guides/hld/images/mem-image6.png
Normal file
After Width: | Height: | Size: 45 KiB |
BIN
doc/developer-guides/hld/images/mem-image7.png
Normal file
After Width: | Height: | Size: 43 KiB |
BIN
doc/developer-guides/hld/images/network-virt-arch.png
Normal file
After Width: | Height: | Size: 156 KiB |
BIN
doc/developer-guides/hld/images/network-virt-sos-infrastruct.png
Normal file
After Width: | Height: | Size: 9.0 KiB |
BIN
doc/developer-guides/hld/images/security-image1.png
Normal file
After Width: | Height: | Size: 377 KiB |
BIN
doc/developer-guides/hld/images/security-image10.png
Normal file
After Width: | Height: | Size: 61 KiB |
BIN
doc/developer-guides/hld/images/security-image11.png
Normal file
After Width: | Height: | Size: 21 KiB |
BIN
doc/developer-guides/hld/images/security-image12.png
Normal file
After Width: | Height: | Size: 52 KiB |
BIN
doc/developer-guides/hld/images/security-image13.png
Normal file
After Width: | Height: | Size: 740 KiB |
BIN
doc/developer-guides/hld/images/security-image14.png
Normal file
After Width: | Height: | Size: 11 KiB |
BIN
doc/developer-guides/hld/images/security-image2.png
Normal file
After Width: | Height: | Size: 28 KiB |
BIN
doc/developer-guides/hld/images/security-image3.png
Normal file
After Width: | Height: | Size: 12 KiB |
BIN
doc/developer-guides/hld/images/security-image4.png
Normal file
After Width: | Height: | Size: 50 KiB |
BIN
doc/developer-guides/hld/images/security-image5.png
Normal file
After Width: | Height: | Size: 47 KiB |
BIN
doc/developer-guides/hld/images/security-image6.png
Normal file
After Width: | Height: | Size: 26 KiB |
BIN
doc/developer-guides/hld/images/security-image7.png
Normal file
After Width: | Height: | Size: 46 KiB |
BIN
doc/developer-guides/hld/images/security-image8.png
Normal file
After Width: | Height: | Size: 27 KiB |
BIN
doc/developer-guides/hld/images/security-image9.png
Normal file
After Width: | Height: | Size: 18 KiB |
BIN
doc/developer-guides/hld/images/uart-image1.png
Normal file
After Width: | Height: | Size: 94 KiB |
BIN
doc/developer-guides/hld/images/virtio-blk-image01.png
Normal file
After Width: | Height: | Size: 142 KiB |
BIN
doc/developer-guides/hld/images/virtio-blk-image02.png
Normal file
After Width: | Height: | Size: 45 KiB |
BIN
doc/developer-guides/hld/images/virtio-console-arch.png
Normal file
After Width: | Height: | Size: 156 KiB |
BIN
doc/developer-guides/hld/images/virtio-hld-image0.png
Normal file
After Width: | Height: | Size: 29 KiB |
BIN
doc/developer-guides/hld/images/virtio-hld-image1.png
Normal file
After Width: | Height: | Size: 52 KiB |
BIN
doc/developer-guides/hld/images/virtio-hld-image2.png
Normal file
After Width: | Height: | Size: 66 KiB |
BIN
doc/developer-guides/hld/images/virtio-hld-image3.png
Normal file
After Width: | Height: | Size: 72 KiB |
BIN
doc/developer-guides/hld/images/virtio-hld-image4.png
Normal file
After Width: | Height: | Size: 136 KiB |
BIN
doc/developer-guides/hld/images/virtio-hld-image5.png
Normal file
After Width: | Height: | Size: 44 KiB |
BIN
doc/developer-guides/hld/images/virtio-hld-image6.png
Normal file
After Width: | Height: | Size: 70 KiB |
BIN
doc/developer-guides/hld/images/virtio-hld-image7.png
Normal file
After Width: | Height: | Size: 86 KiB |
BIN
doc/developer-guides/hld/images/virtio-hld-image8.png
Normal file
After Width: | Height: | Size: 51 KiB |
BIN
doc/developer-guides/hld/images/watchdog-image1.png
Normal file
After Width: | Height: | Size: 250 KiB |
BIN
doc/developer-guides/hld/images/watchdog-image2.png
Normal file
After Width: | Height: | Size: 135 KiB |
28
doc/developer-guides/hld/index.rst
Normal file
@@ -0,0 +1,28 @@
.. _hld:

High-Level Design Guides
########################

The ACRN Hypervisor acts as a host with full control of the processor(s)
and the hardware (physical memory, interrupt management and I/O). It
provides the User OS with an abstraction of a virtual platform, allowing
the guest to behave as if it were executing directly on a logical
processor.

These chapters describe the ACRN architecture, high-level design,
background, and motivation for specific areas within the ACRN hypervisor
system.

.. toctree::
   :maxdepth: 1

   Overview <hld-overview>
   Hypervisor <hld-hypervisor>
   Device Model <hld-devicemodel>
   Emulated Devices <hld-emulated-devices>
   Virtio Devices <hld-virtio-devices>
   VM Management <hld-vm-management>
   Power Management <hld-power-management>
   Tracing and Logging <hld-trace-log>
   Virtual Bootloader <hld-vsbl>
   Security <hld-security>
486
doc/developer-guides/hld/interrupt-hld.rst
Normal file
@@ -0,0 +1,486 @@
.. _interrupt-hld:

Interrupt Management high-level design
######################################


Overview
********

This document describes the interrupt management high-level design for
the ACRN hypervisor.

The ACRN hypervisor implements a simple but fully functional framework
to manage interrupts and exceptions, as shown in
:numref:`interrupt-modules-overview`. In its native layer, it configures
the physical PIC, IOAPIC, and LAPIC to support different interrupt
sources from local timer/IPI to external INTx/MSI. In its virtual guest
layer, it emulates a virtual PIC, virtual IOAPIC and virtual LAPIC, and
provides full APIs allowing virtual interrupt injection from emulated or
pass-thru devices.

.. figure:: images/interrupt-image3.png
   :align: center
   :width: 600px
   :name: interrupt-modules-overview

   ACRN Interrupt Modules Overview

In the software modules view shown in :numref:`interrupt-sw-modules`,
the ACRN hypervisor sets up the physical interrupt in its basic
interrupt modules (e.g., IOAPIC/LAPIC/IDT). It dispatches the interrupt
in the hypervisor interrupt flow control layer to the corresponding
handlers, which could be a pre-defined IPI notification, a timer, or a
runtime-registered pass-thru device. The ACRN hypervisor then uses its
VM interfaces, based on the vPIC, vIOAPIC, and vMSI modules, to inject
the necessary virtual interrupt into the specific VM.

.. figure:: images/interrupt-image2.png
   :align: center
   :width: 600px
   :name: interrupt-sw-modules

   ACRN Interrupt SW Modules Overview

Hypervisor Physical Interrupt Management
****************************************

The ACRN hypervisor is responsible for all the physical interrupt
handling. All physical interrupts are first handled in VMX root-mode.
The "external-interrupt exiting" bit in the VM-Execution controls field
is set to support this. The ACRN hypervisor also initializes all the
interrupt-related modules such as the IDT, PIC, IOAPIC, and LAPIC.

Only a few physical interrupts (such as the TSC-Deadline timer and
IOMMU) are fully serviced in the hypervisor. Most interrupts come from
pass-thru devices whose interrupts are remapped to a virtual INTx/MSI
source and injected into the SOS or UOS, according to the pass-thru
device configuration.

The ACRN hypervisor does handle exceptions: any exception coming from
VMX root-mode will lead to the CPU halting. For guest exceptions, the
hypervisor only traps #MC (machine check), prints a warning message, and
injects the exception back into the guest OS.

Physical Interrupt Initialization
=================================

After the ACRN hypervisor gets control from the bootloader, it
initializes all physical interrupt-related modules for all the CPUs. The
ACRN hypervisor creates a framework to manage physical interrupts for
hypervisor-local devices, pass-thru devices, and IPIs between CPUs.

IDT
---

The ACRN hypervisor builds its native Interrupt Descriptor Table (IDT)
during interrupt initialization. For exceptions, it links to the
function ``dispatch_exception``, and for external interrupts it links to
the function ``dispatch_interrupt``. Please refer to ``arch/x86/idt.S``
for more details.

LAPIC
-----

The ACRN hypervisor resets the LAPIC for each CPU, and provides basic
APIs used, for example, by the local timer (TSC Deadline) program and
the IPI notification program. These APIs include ``write_lapic_reg32``,
``send_lapic_eoi``, ``send_startup_ipi``, and ``send_single_ipi``.


.. comment

   Need reference to API doc generated from doxygen comments
   in hypervisor/include/arch/x86/lapic.h

PIC/IOAPIC
----------

The ACRN hypervisor masks all interrupts from the PIC, so all the
legacy interrupts from the PIC (<16) are linked to the IOAPIC, as shown
in :numref:`interrupt-pic-pin`.

ACRN pre-allocates vectors and masks them for these legacy interrupts
in the IOAPIC RTEs. For the others (>= 16), ACRN masks them with vector
0 in the RTE, and the vector is dynamically allocated on demand.

.. figure:: images/interrupt-image5.png
   :align: center
   :width: 600px
   :name: interrupt-pic-pin

   PIC & IOAPIC Pin Connection

Irq Desc
--------

The ACRN hypervisor maintains a global ``irq_desc[]`` array shared among
the CPUs and uses a flat mode to manage the interrupts. The same
vector is linked to the same IRQ number for all CPUs.

.. comment

   Need reference to API doc generated from doxygen comments
   for ``struct irq_desc`` in hypervisor/include/common/irq.h


The ``irq_desc[]`` array is indexed by the IRQ number. An
``irq_handler`` field can be set to a common edge, level, or quick
handler called from ``dispatch_interrupt``. The ``irq_desc`` structure
also contains the ``dev_list`` field to maintain this IRQ's action
handler list.

The global array ``vector_to_irq[]`` is used to manage the vector
resource. This array is initialized with the value ``IRQ_INVALID`` for
all vectors, and is set to a valid IRQ number after the corresponding
vector is registered.
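
The following is a simplified, illustrative view of this bookkeeping;
the real ``struct irq_desc`` in ``hypervisor/include/common/irq.h``
contains additional fields (handler, action list, locks, statistics),
so treat the layout and sizes below as assumptions.

.. code-block:: c

   #include <stdint.h>

   #define NR_IRQS       272          /* illustrative sizes only */
   #define NR_VECTORS    256
   #define IRQ_INVALID   0xFFFFFFFFu

   struct irq_desc {
       uint32_t irq;      /* IRQ number, also its index in irq_desc[] */
       uint32_t vector;   /* vector shared by all CPUs in flat mode   */
       /* irq_handler, dev_list, ... omitted in this sketch           */
   };

   /* global, shared among CPUs */
   static struct irq_desc irq_desc[NR_IRQS];

   /* maps an allocated vector back to its IRQ; IRQ_INVALID until the
    * vector is registered
    */
   static uint32_t vector_to_irq[NR_VECTORS];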

For example, if the local timer registers an interrupt with IRQ number
271 and vector 0xEF, then the arrays mentioned above will be set to::

   irq_desc[271].irq = 271;
   irq_desc[271].vector = 0xEF;
   vector_to_irq[0xEF] = 271;

Physical Interrupt Flow
=======================

When a physical interrupt occurs and the CPU is running under VMX root
mode, the interrupt follows the standard native irq flow: interrupt
gate to irq handler. However, if the CPU is running under VMX
non-root mode, an external interrupt will trigger a VM exit for reason
"external-interrupt". See :numref:`interrupt-handle-flow`.

.. figure:: images/interrupt-image4.png
   :align: center
   :width: 800px
   :name: interrupt-handle-flow

   ACRN Hypervisor Interrupt Handle Flow

After an interrupt happens (in either case noted above), the ACRN
hypervisor jumps to ``dispatch_interrupt``. This function checks
which vector caused this interrupt, and the corresponding ``irq_desc``
structure's ``irq_handler`` is called for service.

There are several irq handlers defined in the ACRN hypervisor, as shown
in :numref:`interrupt-handle-flow`, designed for different uses. For
example, ``quick_handler_nolock`` is used when no critical data needs
protection in the action handlers; the VCPU notification IPI and local
timer are good examples of this use case.

The more complicated ``common_dev_handler_level`` handler is intended
for pass-thru devices with level-triggered interrupts. To avoid
continuously triggering the interrupt, it initially masks the IOAPIC pin
and unmasks it only when the corresponding vIOAPIC pin gets an explicit
EOI ACK from the guest.

All the irq handlers finally call their own action handler list, as
shown here:

.. code-block:: c

   struct dev_handler_node *dev = desc->dev_list;

   while (dev != NULL) {
       if (dev->dev_handler != NULL)
           dev->dev_handler(desc->irq, dev->dev_data);
       dev = dev->next;
   }

The common APIs for registering, updating, and unregistering
interrupt handlers include irq_to_vector, dev_to_irq, dev_to_vector,
pri_register_handler, normal_register_handler,
unregister_handler_common, and update_irq_handler.

.. comment

   Need reference to API doc generated from doxygen comments
   in hypervisor/include/common/irq.h

.. _physical_interrupt_source:

Physical Interrupt Source
=========================

The ACRN hypervisor handles interrupts from many different sources, as
shown in :numref:`interrupt-source`:


.. list-table:: Physical Interrupt Source
   :widths: 15 10 60
   :header-rows: 1
   :name: interrupt-source

   * - Interrupt Source
     - Vector
     - Description
   * - TSC Deadline Timer
     - 0xEF
     - The TSC deadline timer implements the timer framework in
       the hypervisor based on the LAPIC TSC deadline. This interrupt's
       target is specific to the CPU to which the LAPIC belongs.
   * - CPU Startup IPI
     - N/A
     - The BSP needs to trigger an INIT-SIPI sequence to wake up the
       APs. This interrupt's target is specified by the BSP calling
       ``start_cpus()``.
   * - VCPU Notify IPI
     - 0xF0
     - When the hypervisor needs to kick the VCPU out of VMX non-root
       mode to do requests such as virtual interrupt injection, EPT
       flush, etc. This interrupt's target is specified by the function
       ``send_single_ipi()``.
   * - IOMMU MSI
     - dynamic
     - The IOMMU device supports an MSI interrupt. The vtd device driver
       in the hypervisor registers an interrupt to handle DMAR faults.
       This interrupt's target is specified by the vtd device driver.
   * - PTdev INTx
     - dynamic
     - All native devices are owned by the guest (SOS or UOS), taking
       advantage of the pass-thru method. Each pass-thru device
       connected with the IOAPIC/PIC (PTdev INTx) will register an
       interrupt when its attached interrupt controller pin first gets
       unmasked. This interrupt's target is defined by an RTE entry in
       the IOAPIC.
   * - PTdev MSI
     - dynamic
     - All native devices are owned by the guest (SOS or UOS), taking
       advantage of the pass-thru method. Each pass-thru device with
       MSI enabled (PTdev MSI) will register an interrupt when the SOS
       does an explicit hypercall. This interrupt's target is defined
       by an MSI address entry.

Softirq
=======

The ACRN hypervisor implements a simple bottom-half softirq to execute
the interrupt handler, as shown in :numref:`interrupt-handle-flow`.
The softirq is executed when interrupts are enabled. Several APIs for
softirq are defined, including enable_softirq, disable_softirq,
raise_softirq, and exec_softirq.

.. comment

   Need reference to API doc generated from doxygen comments
   in hypervisor/include/common/softirq.h

Physical Exception Handling
===========================

As mentioned earlier, the ACRN hypervisor does not handle any
physical exceptions. The VMX root mode code path should guarantee no
exceptions are triggered while the hypervisor is running.

Guest Virtual Interrupt Management
**********************************

The previous sections describe physical interrupt management in the ACRN
hypervisor. After a physical interrupt happens, a registered action
handler is executed. Usually, the action handler represents a service
for virtual interrupt injection. For example, if an interrupt is
triggered from a pass-thru device, the appropriate virtual interrupt
should be injected into its guest VM.

The virtual interrupt injection could also come from an emulated device.
The I/O mediator in the Service OS (SOS) could trigger an interrupt
through a hypercall, and the virtual interrupt injection is then done in
the hypervisor.

The following sections give an introduction to ACRN guest virtual
interrupt management, including VCPU requests for virtual interrupt
kick-off, the vPIC/vIOAPIC/vLAPIC virtual interrupt injection
interfaces, physical-to-virtual interrupt mapping for pass-thru devices,
and the process of VMX interrupt/exception injection.

VCPU Request
============

As mentioned in `physical_interrupt_source`_, physical vector 0xF0 is
used to kick the VCPU out of its VMX non-root mode, and make a request
for virtual interrupt injection or other requests such as an EPT flush.

The request-make API (``vcpu_make_request``) and its event IDs support
virtual interrupt injection.

.. comment

   Need reference to API doc generated from doxygen comments
   in hypervisor/include/common/irq.h

There are requests for exception injection (ACRN_REQUEST_EXCP), a vLAPIC
event (ACRN_REQUEST_EVENT), an external interrupt from the vPIC
(ACRN_REQUEST_EXTINT), and a non-maskable interrupt (ACRN_REQUEST_NMI).

The ``vcpu_make_request`` call is necessary for a virtual interrupt
injection. If the target VCPU is running under VMX non-root mode, an IPI
is sent to kick it out, resulting in an external-interrupt VM-Exit. The
flow shown in :numref:`interrupt-handle-flow` is then executed to
complete the injection of the virtual interrupt.

There are some cases that do not need an IPI when making a
request because the CPU making the request is the target VCPU. For
example, the #GP exception request always happens on the current CPU
when an invalid emulation happens. An external interrupt for a pass-thru
device always happens on the VCPUs the device belongs to, so after it
triggers an external-interrupt VM-Exit, the current CPU is also the
target VCPU.

Virtual PIC
===========

The ACRN hypervisor emulates a vPIC for each VM, based on the IO ranges
0x20-0x21, 0xa0-0xa1, and 0x4d0-0x4d1.

If an interrupt source from the vPIC needs to inject an interrupt,
the vpic_assert_irq, vpic_deassert_irq, or vpic_pulse_irq functions can
be called to make a request for ACRN_REQUEST_EXTINT or
ACRN_REQUEST_EVENT:

.. comment

   Need reference to API doc generated from doxygen comments
   in hypervisor/include/common/vpic.h

The vpic_pending_intr and vpic_intr_accepted APIs are used to query the
vector being injected and ACK the service, by moving the interrupt from
the request register (IRR) to the in-service register (ISR).


Virtual IOAPIC
==============

The ACRN hypervisor emulates a vIOAPIC for each VM, based at the MMIO
address VIOAPIC_BASE.

If an interrupt source from the vIOAPIC needs to inject an interrupt,
the vioapic_assert_irq, vioapic_deassert_irq, and vioapic_pulse_irq APIs
are used to make a request for ACRN_REQUEST_EVENT.

As the vIOAPIC is always associated with a vLAPIC, the virtual interrupt
injection from the vIOAPIC will finally trigger a request for a vLAPIC
event.

Virtual LAPIC
=============

The ACRN hypervisor emulates a vLAPIC for each VCPU, based at the MMIO
address DEFAULT_APIC_BASE.

If an interrupt source from the vLAPIC needs to inject an interrupt
(e.g., from an LVT source such as the LAPIC timer, from the vIOAPIC for
a pass-thru device interrupt, or from an emulated device for an MSI),
the vlapic_intr_level, vlapic_intr_edge, vlapic_set_local_intr,
vlapic_intr_msi, or vlapic_deliver_intr APIs need to be called,
resulting in a request for ACRN_REQUEST_EVENT.

.. comment

   Need reference to API doc generated from doxygen comments
   in hypervisor/include/common/vlapic.h


The vlapic_pending_intr and vlapic_intr_accepted APIs are used to query
the vector that needs to be injected and ACK the service, moving the
interrupt from the request register (IRR) to the in-service register
(ISR).

By default, the ACRN hypervisor enables vAPIC to improve the performance
of the vLAPIC emulation.
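
The following self-contained sketch illustrates how an action handler
might combine the vLAPIC APIs with ``vcpu_make_request`` to inject a
virtual MSI. The stub types and function bodies exist only so the
example compiles on its own, and the parameter lists are assumptions;
the real ACRN signatures differ.

.. code-block:: c

   #include <stdint.h>
   #include <stdio.h>

   /* Stand-ins for hypervisor types and APIs (illustrative only). */
   struct vm   { int id; };
   struct vcpu { int id; struct vm *vm; };

   #define ACRN_REQUEST_EVENT 1

   static void vlapic_intr_msi(struct vm *vm, uint64_t addr, uint32_t data)
   {
       printf("VM%d: queue virtual MSI addr=0x%llx data=0x%x in vLAPIC\n",
              vm->id, (unsigned long long)addr, data);
   }

   static void vcpu_make_request(struct vcpu *vcpu, int req)
   {
       printf("VCPU%d: request %d (notify IPI if in non-root mode)\n",
              vcpu->id, req);
   }

   /* Hypothetical action handler for a pass-thru device using MSI:
    * queue the virtual interrupt, then ask the target VCPU to inject
    * the pending vLAPIC event on its next VM-Entry.
    */
   void ptdev_msi_action(struct vcpu *vcpu, uint64_t msi_addr,
                         uint32_t msi_data)
   {
       vlapic_intr_msi(vcpu->vm, msi_addr, msi_data);
       vcpu_make_request(vcpu, ACRN_REQUEST_EVENT);
   }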

Virtual Exception
=================

When doing emulation, an exception may need to be triggered in the
hypervisor: for example, the guest may access an invalid vMSR register
so the hypervisor needs to inject a #GP, or during instruction emulation
an instruction fetch may access a non-existent page at ``rip_gva``, so a
#PF must be injected.

The ACRN hypervisor implements virtual exception injection using the
vcpu_queue_exception, vcpu_inject_gp, and vcpu_inject_pf APIs.

.. comment

   Need reference to API doc generated from doxygen comments
   in hypervisor/include/common/irq.h

The ACRN hypervisor uses the vcpu_inject_gp/vcpu_inject_pf functions to
queue exception requests, and follows the `Intel Software
Developer Manual, Vol 3 <SDM vol3_>`_, Section 6.15, Table 6-5, which
lists the conditions for generating a double fault.

.. _SDM vol3: https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html

Interrupt Mapping for a Pass-thru Device
========================================

A VM can control a PCI device directly through pass-thru device
assignment. The pass-thru entry is the major info object, and it
describes (a simplified sketch follows this list):

- a physical interrupt source, which could be an MSI/MSI-X entry, PIC
  pins, or IOAPIC pins
- pass-thru remapping information between the physical and virtual
  interrupt source; for MSI/MSI-X it is identified by a PCI device's
  BDF, and for PIC/IOAPIC it is identified by the pin number.
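
A minimal sketch of the information such an entry might record is shown
below; the structure and field names are hypothetical and do not match
the actual ACRN pass-thru device code.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Illustrative only: what a pass-thru device entry needs to track. */
   enum ptdev_intr_type {
       PTDEV_INTR_MSI,      /* identified by the PCI device's BDF    */
       PTDEV_INTR_INTX,     /* identified by a PIC/IOAPIC pin number */
   };

   struct ptdev_entry {
       enum ptdev_intr_type type;

       /* physical side */
       uint16_t phys_bdf;   /* for MSI/MSI-X                         */
       uint8_t  phys_pin;   /* for INTx                              */

       /* virtual side: where the interrupt appears in the guest     */
       uint16_t virt_bdf;
       uint8_t  virt_pin;

       bool     active;     /* registered when first unmasked/enabled */
   };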

.. figure:: images/interrupt-image7.png
   :align: center
   :width: 600px
   :name: interrupt-pass-thru

   Pass-thru Device Entry Assignment

As shown in :numref:`interrupt-pass-thru` above, a UOS gets its
pass-thru device entries assigned by the DM, and their entry info is
filled from:

- vPIC/vIOAPIC interrupt mask/unmask
- an MSI IOReq from the UOS followed by an MSI hypercall from the SOS

The SOS adds its pass-thru device entries at runtime and fills in their
info from:

- vPIC/vIOAPIC interrupt mask/unmask
- an MSI hypercall from the SOS

During the pass-thru device entry info filling, the hypervisor builds
the native IOAPIC RTE/MSI entry based on the vIOAPIC/vPIC/vMSI
configuration, and registers the physical interrupt handler for it. Then
with the pass-thru device entry as the handler's private data, the
physical interrupt can be linked to a virtual pin of a guest's
vPIC/vIOAPIC or a virtual vector of a guest's vMSI. The handler then
injects the corresponding virtual interrupt into the guest, using the
vPIC/vIOAPIC/vLAPIC APIs described earlier.

Interrupt Storm Mitigation
==========================

When the Device Model (DM) launches a User OS (UOS), the ACRN hypervisor
remaps the interrupts for this UOS's pass-through devices. When
an interrupt occurs for a pass-through device, the CPU core assigned
to that UOS gets trapped into the hypervisor. The benefit of such a
mechanism is that, should an interrupt storm happen in a particular UOS,
it will have only a minimal effect on the performance of the Service OS.

Interrupt/Exception Injection Process
=====================================

As shown in :numref:`interrupt-handle-flow`, the ACRN hypervisor injects
a virtual interrupt/exception into the guest before VM-Entry.

This is done by updating the VMX_ENTRY_INT_INFO_FIELD of the VCPU's
VMCS. As there is only one such field, interrupt/exception injections
must follow a priority rule and are handled one by one.

:numref:`interrupt-injection` below shows the rules for injecting
virtual interrupts/exceptions one by one. If a higher-priority
interrupt/exception has already been injected, the next pending
interrupt/exception enables an interrupt window, and the next
injection is done on the following VM-Exit triggered by that interrupt
window.

.. figure:: images/interrupt-image6.png
   :align: center
   :width: 600px
   :name: interrupt-injection

   ACRN Hypervisor Interrupt/Exception Injection Process
248
doc/developer-guides/hld/memmgt-hld.rst
Normal file
@@ -0,0 +1,248 @@
|
||||
.. _memmgt-hld:
|
||||
|
||||
Memory Management high-level design
|
||||
###################################
|
||||
|
||||
This document describes memory management for the ACRN hypervisor.
|
||||
|
||||
Overview
|
||||
********
|
||||
|
||||
In the ACRN hypervisor system, there are few different memory spaces to
|
||||
consider. From the hypervisor's point of view there are:
|
||||
|
||||
- Host Physical Address (HPA): the native physical address space, and
|
||||
- Host Virtual Address (HVA): the native virtual address space based on
|
||||
a MMU. A page table is used to do the translation between HPA and HVA
|
||||
spaces.
|
||||
|
||||
And from the Guest OS running on a hypervisor there are:
|
||||
|
||||
- Guest Physical Address (GPA): the guest physical address space from a
|
||||
virtual machine. GPA to HPA transition is usually based on a
|
||||
MMU-like hardware module (EPT in X86), and associated with a page
|
||||
table
|
||||
- Guest Virtual Address (GVA): the guest virtual address space from a
|
||||
virtual machine based on a vMMU
|
||||
|
||||
.. figure:: images/mem-image2.png
|
||||
:align: center
|
||||
:width: 900px
|
||||
:name: mem-overview
|
||||
|
||||
ACRN Memory Mapping Overview
|
||||
|
||||
:numref:`mem-overview` provides an overview of the ACRN system memory
|
||||
mapping, showing:
|
||||
|
||||
- GVA to GPA mapping based on vMMU on a VCPU in a VM
|
||||
- GPA to HPA mapping based on EPT for a VM in the hypervisor
|
||||
- HVA to HPA mapping based on MMU in the hypervisor
|
||||
|
||||
This document illustrates the memory management infrastructure for the
|
||||
ACRN hypervisor and how it handles the different memory space views
|
||||
inside the hypervisor and from a VM:
|
||||
|
||||
- How ACRN hypervisor manages host memory (HPA/HVA)
|
||||
- How ACRN hypervisor manages SOS guest memory (HPA/GPA)
|
||||
- How ACRN hypervisor & SOS DM manage UOS guest memory (HPA/GPA)
|
||||
|
||||
Hypervisor Memory Management
|
||||
****************************
|
||||
|
||||
The ACRN hypervisor is the primary owner to manage system
|
||||
memory. Typically the boot firmware (e.g., EFI) passes the platform physical
|
||||
memory layout - E820 table to the hypervisor. The ACRN hypervisor does its memory
|
||||
management based on this table.
|
||||
|
||||
Physical Memory Layout - E820
|
||||
=============================
|
||||
|
||||
The boot firmware (e.g., EFI) passes the E820 table through a multiboot protocol.
|
||||
This table contains the original memory layout for the platform.
|
||||
|
||||
.. figure:: images/mem-image1.png
|
||||
:align: center
|
||||
:width: 900px
|
||||
:name: mem-layout
|
||||
|
||||
Physical Memory Layout Example
|
||||
|
||||
:numref:`mem-layout` is an example of the physical memory layout based on a simple
|
||||
platform E820 table. The following sections demonstrate different memory
|
||||
space management by referencing it.
|
||||
|
||||
Physical to Virtual Mapping
|
||||
===========================
|
||||
|
||||
ACRN hypervisor is running under paging mode, so after receiving
|
||||
the platform E820 table, ACRN hypervisor creates its MMU page table
|
||||
based on it. This is done by the function init_paging() for all
|
||||
physical CPUs.
|
||||
|
||||
The memory mapping policy here is:
|
||||
|
||||
- Identical mapping for each physical CPU (ACRN hypervisor's memory
|
||||
could be relocatable in a future implementation)
|
||||
- Map all memory regions with UNCACHED type
|
||||
- Remap RAM regions to WRITE-BACK type
|
||||
|
||||
.. figure:: images/mem-image4.png
|
||||
:align: center
|
||||
:width: 900px
|
||||
:name: vm-layout
|
||||
|
||||
Hypervisor Virtual Memory Layout
|
||||
|
||||
:numref:`vm-layout` shows:
|
||||
|
||||
- Hypervisor can access all of system memory
|
||||
- Hypervisor has an UNCACHED MMIO/PCI hole reserved for devices, such
|
||||
as for LAPIC/IOAPIC access
|
||||
- Hypervisor has its own memory with WRITE-BACK cache type for its
|
||||
code and data (< 1M part is for secondary CPU reset code)
|
||||
|
||||
Service OS Memory Management
|
||||
****************************
|
||||
|
||||
After the ACRN hypervisor starts, it creates the Service OS as its first
|
||||
VM. The Service OS runs all the native device drivers, manage the
|
||||
hardware devices, and provides I/O mediation to guest VMs. The Service
|
||||
OS is in charge of the memory allocation for Guest VMs as well.
|
||||
|
||||
ACRN hypervisor passes the whole system memory access (except its own
|
||||
part) to the Service OS. The Service OS must be able to access all of
|
||||
the system memory except the hypervisor part.
|
||||
|
||||
Guest Physical Memory Layout - E820
|
||||
===================================
|
||||
|
||||
The ACRN hypervisor passes the original E820 table to the Service OS
|
||||
after filtering out its own part. So from Service OS's view, it sees
|
||||
almost all the system memory as shown here:
|
||||
|
||||
.. figure:: images/mem-image3.png
|
||||
:align: center
|
||||
:width: 900px
|
||||
:name: sos-mem-layout
|
||||
|
||||
SOS Physical Memory Layout
|
||||
|
||||
Host to Guest Mapping
|
||||
=====================
|
||||
|
||||
ACRN hypervisor creates Service OS's host (HPA) to guest (GPA) mapping
|
||||
(EPT mapping) through the function
|
||||
``prepare_vm0_memmap_and_e820()`` when it creates the SOS VM. It follows
|
||||
these rules:
|
||||
|
||||
- Identical mapping
|
||||
- Map all memory range with UNCACHED type
|
||||
- Remap RAM entries in E820 (revised) with WRITE-BACK type
|
||||
- Unmap ACRN hypervisor memory range
|
||||
- Unmap ACRN hypervisor emulated vLAPIC/vIOAPIC MMIO range
|
||||
|
||||
The host to guest mapping is static for the Service OS; it will not
|
||||
change after the Service OS begins running. Each native device driver
|
||||
can access its MMIO through this static mapping. EPT violation is only
|
||||
serving for vLAPIC/vIOAPIC's emulation in the hypervisor for Service OS
|
||||
VM.
|
||||
|
||||
User OS Memory Management
|
||||
*************************
|
||||
|
||||
User OS VM is created by the DM (Device Model) application running in
|
||||
the Service OS. DM is responsible for the memory allocation for a User
|
||||
or Guest OS VM.
|
||||
|
||||
Guest Physical Memory Layout - E820
|
||||
===================================
|
||||
|
||||
DM will create the E820 table for a User OS VM based on these simple
|
||||
rules:
|
||||
|
||||
- If requested VM memory size < low memory limitation (defined in DM,
|
||||
as 2GB), then low memory range = [0, requested VM memory size]
|
||||
- If requested VM memory size > low memory limitation (defined in DM,
|
||||
as 2GB), then low memory range = [0, 2GB], high memory range = [4GB,
|
||||
4GB + requested VM memory size - 2GB]
|
||||
|
||||
.. figure:: images/mem-image6.png
|
||||
:align: center
|
||||
:width: 900px
|
||||
:name: uos-mem-layout
|
||||
|
||||
UOS Physical Memory Layout
|
||||
|
||||
DM is doing UOS memory allocation based on hugeTLB mechanism by
|
||||
default. The real memory mapping
|
||||
may be scattered in SOS physical memory space, as shown below:
|
||||
|
||||
.. figure:: images/mem-image5.png
|
||||
:align: center
|
||||
:width: 900px
|
||||
:name: uos-mem-layout-hugetlb
|
||||
|
||||
UOS Physical Memory Layout Based on Hugetlb
|
||||
|
||||
Host to Guest Mapping
|
||||
=====================
|
||||
|
||||
A User OS VM's memory is allocated by the Service OS DM application, and
|
||||
may come from different huge pages in the Service OS as shown in
|
||||
:ref:`uos-mem-layout-hugetlb`.
|
||||
|
||||
As Service OS has the full information of these huge pages size,
|
||||
SOS-GPA and UOS-GPA, it works with the hypervisor to complete UOS's host
|
||||
to guest mapping using this pseudo code:
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
for x in allocated huge pages do
|
||||
x.hpa = gpa2hpa_for_sos(x.sos_gpa)
|
||||
host2guest_map_for_uos(x.hpa, x.uos_gpa, x.size)
|
||||
end
|
||||
|
||||
Trusty
|
||||
======
|
||||
|
||||
For an Android User OS, there is a secure world called "trusty world
|
||||
support", whose memory needs are taken care by the ACRN hypervisor for
|
||||
security consideration. From the memory management's view, the trusty
|
||||
memory space should not be accessible by SOS or UOS normal world.
|
||||
|
||||
.. figure:: images/mem-image7.png
|
||||
:align: center
|
||||
:width: 900px
|
||||
:name: uos-mem-layout-trusty
|
||||
|
||||
UOS Physical Memory Layout with Trusty
|
||||
|
||||
Memory Interaction
|
||||
******************
|
||||
|
||||
Previous sections described different memory spaces management in the
|
||||
ACRN hypervisor, Service OS, and User OS. Among these memory spaces,
|
||||
there are different kinds of interaction, for example, a VM may do a
|
||||
hypercall to the hypervisor that includes a data transfer, or an
|
||||
instruction emulation in the hypervisor may need to access the Guest
|
||||
instruction pointer register to fetch instruction data.
|
||||
|
||||
Access GPA from Hypervisor
|
||||
==========================
|
||||
|
||||
When a hypervisor needs access to the GPA for data transfers, the caller
|
||||
from the Guest must make sure this memory range's GPA is address
|
||||
continuous. But for HPA in the hypervisor, it could be address
|
||||
dis-continuous (especially for UOS under hugetlb allocation mechanism).
|
||||
For example, a 4MB GPA range may map to 2 different 2MB huge pages. The
|
||||
ACRN hypervisor needs to take care of this kind of data transfer by
|
||||
doing EPT page walking based on its HPA.
|
||||
|
||||
Access GVA from Hypervisor
|
||||
==========================
|
||||
|
||||
Likely, when hypervisor need to access GVA for data transfer, both GPA
|
||||
and HPA could be address dis-continuous. The ACRN hypervisor must pay
|
||||
attention to this kind of data transfer, and handle it by doing page
|
||||
walking based on both its GPA and HPA.
|
126
doc/developer-guides/hld/uart-virt-hld.rst
Normal file
@@ -0,0 +1,126 @@
|
||||
.. _uart_virtualization:
|
||||
|
||||
UART Virtualization
|
||||
###################
|
||||
|
||||
In ACRN, UART virtualization is implemented as a fully-emulated device.
|
||||
In the Service OS (SOS), UART virtualization is implemented in the
|
||||
hypervisor itself. In the User OS (UOS), UART virtualization is
|
||||
implemented in the Device Model (DM), and is the primary topic of this
|
||||
document. We'll summarize differences between the hypervisor and DM
|
||||
implementations at the end of this document.
|
||||
|
||||
|
||||
UART emulation is a typical full-emulation implementation and is a
|
||||
good example to learn about I/O emulation in a virtualized environment.
|
||||
There is a detailed explanation about the I/O emulation flow in
|
||||
ACRN in :ref:`ACRN-io-mediator`.
|
||||
|
||||
Architecture
|
||||
************
|
||||
|
||||
The ACRN DM architecture for UART virtualization is shown here:
|
||||
|
||||
.. figure:: images/uart-image1.png
|
||||
:align: center
|
||||
:name: uart-arch
|
||||
:width: 800px
|
||||
|
||||
Device Model's UART virtualization architecture
|
||||
|
||||
There are three objects used to emulate one UART device in DM:
|
||||
UART registers, rxFIFO, and backend tty devices.
|
||||
|
||||
**UART registers** are emulated by member variables in struct
|
||||
``uart_vdev``, one variable for each register. These variables are used
|
||||
to track the register status programed by the frontend driver. The
|
||||
handler of each register depends on the register's functionality.
|
||||
|
||||
A **FIFO** is implemented to emulate RX. Normally characters are read
|
||||
from the backend tty device when available, then put into the rxFIFO.
|
||||
When the Guest application tries to read from the UART, the access to
|
||||
register ``com_data`` causes a ``vmexit``. Device model catches the
|
||||
``vmexit`` and emulates the UART by returning one character from rxFIFO.
|
||||
|
||||
.. note:: When ``com_fcr`` is available, the Guest application can write
|
||||
``0`` to this register to disable rxFIFO. In this case the rxFIFO in
|
||||
device model degenerates to a buffer containing only one character.
|
||||
|
||||
When the Guest application tries to send a character to the UART, it
|
||||
writes to the ``com_data`` register, which will cause a ``vmexit`` as
|
||||
well. Device model catches the ``vmexit`` and emulates the UART by
|
||||
redirecting the character to the **backend tty device**.
|
||||
|
||||
The UART device emulated by the ACRN device model is connected to the system by
|
||||
the LPC bus. In the current implementation, two channel LPC UARTs are I/O mapped to
|
||||
the traditional COM port addresses of 0x3F8 and 0x2F8. These are defined in
|
||||
global variable ``uart_lres``.
|
||||
|
||||
There are two options needed for configuring the UART in the ``arcn-dm``
|
||||
command line. First, the LPC is defined as a PCI device::
|
||||
|
||||
-s 1:0,lpc
|
||||
|
||||
The other option defines a UART port::
|
||||
|
||||
-l com1,stdio
|
||||
|
||||
The first parameter here is the name of the UART (must be "com1" or
|
||||
"com2"). The second parameter is species the backend
|
||||
tty device: ``stdio`` or a path to the dedicated tty device
|
||||
node, for example ``/dev/pts/0``.
|
||||
|
||||
If you are using a specified tty device, find the name of the terminal
|
||||
connected to standard input using the ``tty`` command (e.g.,
|
||||
``/dev/pts/1``). Use this name to define the UART port on the acrn-dm
|
||||
command line, for example::
|
||||
|
||||
-l com1,/dev/pts/1
|
||||
|
||||
|
||||
When arcn-dm starts, ``pci_lpc_init`` is called as the callback of the
|
||||
``vdev_init`` of the PCI device given on the acrn-dm command line.
|
||||
Later, ``lpc_init`` is called in ``pci_lpc_init``. ``lpc_init`` iterates
|
||||
on the available UART instances defined on the command line and
|
||||
initializes them one by one. ``register_inout`` is called on the port
|
||||
region of each UART instance, enabling access to the UART ports to be
|
||||
routed to the registered handler.
|
||||
|
||||
In the case of UART emulation, the registered handlers are ``uart_read``
|
||||
and ``uart_write``.
|
||||
|
||||
A similar virtual UART device is implemented in the hypervisor.
|
||||
Currently UART16550 is owned by the hypervisor itself and is used for
|
||||
debugging purposes. (The UART properties are configured by parameters
|
||||
to the hypervisor command line.) The hypervisor emulates a UART device
|
||||
with 0x3F8 address to the SOS and acts as the SOS console. The general
|
||||
emulation is the same as used in the device model, with the following
|
||||
differences:
|
||||
|
||||
- PIO region is directly registered to the vmexit handler dispatcher via
|
||||
``vuart_register_io_handler``
|
||||
|
||||
- Two FIFOs are implemented, one for RX, the other of TX
|
||||
|
||||
- RX flow:
|
||||
|
||||
- Characters are read from the UART HW into a 2048-byte sbuf,
|
||||
triggered by ``console_read``
|
||||
|
||||
- Characters are read from the sbuf and put to rxFIFO,
|
||||
triggered by ``vuart_console_rx_chars``
|
||||
|
||||
- A virtual interrupt is sent to the SOS that triggered the read,
|
||||
and characters from rxFIFO are sent to the SOS by emulating a read
|
||||
of register ``UART16550_RBR``
|
||||
|
||||
- TX flow:
|
||||
|
||||
- Characters are put into txFIFO by emulating a write of register
|
||||
``UART16550_THR``
|
||||
|
||||
- Characters in txFIFO are read out one by one, and sent to the console
|
||||
by printf, triggered by ``vuart_console_tx_chars``
|
||||
|
||||
- Implementation of printf is based on the console, which finally sends
|
||||
characters to the UART HW by writing to register ``UART16550_RBR``
|
107
doc/developer-guides/hld/virtio-blk.rst
Normal file
@@ -0,0 +1,107 @@
|
||||
.. _virtio-blk:
|
||||
|
||||
Virtio-blk
|
||||
##########
|
||||
|
||||
The virtio-blk device is a simple virtual block device. The FE driver
|
||||
(in the UOS space) places read, write, and other requests onto the
|
||||
virtqueue, so that the BE driver (in the SOS space) can process them
|
||||
accordingly. Communication between the FE and BE is based on the virtio
|
||||
kick and notify mechanism.
|
||||
|
||||
The virtio device ID of the virtio-blk is ``2``, and it supports one
|
||||
virtqueue, the size of which is 64, configurable in the source code.
|
||||
|
||||
.. figure:: images/virtio-blk-image01.png
|
||||
:align: center
|
||||
:width: 900px
|
||||
:name: virtio-blk-arch
|
||||
|
||||
Virtio-blk architecture
|
||||
|
||||
The feature bits supported by the BE device are shown as follows:
|
||||
|
||||
``VIRTIO_BLK_F_SEG_MAX``
|
||||
Maximum number of segments in a request is in seg_max.
|
||||
``VIRTIO_BLK_F_BLK_SIZE``
|
||||
Block size of disk is in blk_size.
|
||||
``VIRTIO_BLK_F_TOPOLOGY``
|
||||
Device exports information on optimal I/O alignment.
|
||||
``VIRTIO_RING_F_INDIRECT_DESC``
|
||||
Support for indirect descriptors
|
||||
``VIRTIO_BLK_F_FLUSH``
|
||||
Cache flush command support.
|
||||
``VIRTIO_BLK_F_CONFIG_WCE``
|
||||
Device can toggle its cache between writeback and writethrough modes.
|
||||
|
||||
|
||||
Virtio-blk-BE design
|
||||
********************
|
||||
|
||||
.. figure:: images/virtio-blk-image02.png
|
||||
:align: center
|
||||
:width: 900px
|
||||
:name: virtio-blk-be
|
||||
|
||||
The virtio-blk BE device is implemented as a legacy virtio device. Its
|
||||
backend media could be a file or a partition. The virtio-blk device
|
||||
supports writeback and writethrough cache mode. In writeback mode,
|
||||
virtio-blk has good write and read performance. To be safer,
|
||||
writethrough is set as the default mode, as it can make sure every write
|
||||
operation queued to the virtio-blk FE driver layer is submitted to
|
||||
hardware storage.
|
||||
|
||||
During initialization, virito-blk will allocate 64 ioreq buffers in a
|
||||
shared ring used to store the I/O requests. The freeq, busyq, and pendq
|
||||
shown in :numref:`virtio-blk-be` are used to manage requests. Each
|
||||
virtio-blk device starts 8 worker threads to process request
|
||||
asynchronously.
|
||||
|
||||
|
||||
Usage:
|
||||
******
|
||||
|
||||
The device model configuration command syntax for virtio-blk is::
|
||||
|
||||
-s <slot>,virtio-blk,<filepath>[,options]
|
||||
|
||||
- ``filepath`` is the path of a file or disk partition
|
||||
- ``options`` include:
|
||||
|
||||
- ``writethru``: write operation is reported completed only when the
|
||||
data has been written to physical storage.
|
||||
- ``writeback``: write operation is reported completed when data is
|
||||
placed in the page cache. Needs to be flushed to the physical storage.
|
||||
- ``ro``: open file with readonly mode.
|
||||
- ``sectorsize``: configured as either
|
||||
``sectorsize=<sector size>/<physical sector size>`` or
|
||||
``sectorsize=<sector size>``.
|
||||
The default values for sector size and physical sector size are 512
|
||||
- ``range``: configured as ``range=<start lba in file>/<sub file size>``
|
||||
meaning the virtio-blk will only access part of the file, from the
|
||||
``<start lba in file>`` to ``<start lba in file> + <sub file site>``.
|
||||
|
||||
A simple example for virtio-blk:
|
||||
|
||||
1. Prepare a file in SOS folder::
|
||||
|
||||
dd if=/dev/zero of=test.img bs=1M count=1024
|
||||
mkfs.ext4 test.img
|
||||
|
||||
#. Add virtio-blk in the DM cmdline, slot number should not duplicate
|
||||
another device::
|
||||
|
||||
-s 9,virtio-blk,/root/test.img
|
||||
|
||||
#. Launch UOS, you can find ``/dev/vdx`` in UOS.
|
||||
|
||||
The ``x`` in ``/dev/vdx`` is related to the slot number used. If
|
||||
If you start DM with two virtio-blks, and the slot numbers are 9 and 10,
|
||||
then, the device with slot 9 will be recognized as ``/dev/vda``, and
|
||||
the device with slot 10 will be ``/dev/vdb``
|
||||
|
||||
#. Mount ``/dev/vdx`` to a folder in the UOS, and then you can access it.
|
||||
|
||||
|
||||
Successful booting of the User OS verifies the correctness of the
|
||||
device.
|
184
doc/developer-guides/hld/virtio-console.rst
Normal file
@@ -0,0 +1,184 @@
|
||||
.. _virtio-console:
|
||||
|
||||
Virtio-console
|
||||
##############
|
||||
|
||||
The Virtio-console is a simple device for data input and output. The
|
||||
console's virtio device ID is ``3`` and can have from 1 to 16 ports.
|
||||
Each port has a pair of input and output virtqueues used to communicate
|
||||
information between the Front End (FE) and Back end (BE) drivers.
|
||||
Currently the size of each virtqueue is 64 (configurable in the source
|
||||
code). The FE driver will place empty buffers for incoming data onto
|
||||
the receiving virtqueue, and enqueue outgoing characters onto the
|
||||
transmitting virtqueue.
|
||||
|
||||
A Virtio-console device has a pair of control IO virtqueues as well. The
|
||||
control virtqueues are used to communicate information between the
|
||||
device and the driver, including: ports being opened and closed on
|
||||
either side of the connection, indication from the host about whether a
|
||||
particular port is a console port, adding new ports, port
|
||||
hot-plug/unplug, indication from the guest about whether a port or a
|
||||
device was successfully added, or a port opened or closed.
|
||||
|
||||
The virtio-console architecture diagram in ACRN is shown below.
|
||||
|
||||
.. figure:: images/virtio-console-arch.png
|
||||
:align: center
|
||||
:width: 700px
|
||||
:name: virtio-console-arch
|
||||
|
||||
Virtio-console architecture diagram
|
||||
|
||||
|
||||
Virtio-console is implemented as a virtio legacy device in the ACRN device
|
||||
model (DM), and is registered as a PCI virtio device to the guest OS. No changes
|
||||
are required in the frontend Linux virtio-console except that the guest
|
||||
(UOS) kernel should be built with ``CONFIG_VIRTIO_CONSOLE=y``.
|
||||
|
||||
Currently the feature bits supported by the BE device are:
|
||||
|
||||
.. list-table:: Feature bits supported by BE drivers
|
||||
:widths: 30 50
|
||||
:header-rows: 0
|
||||
|
||||
* - VTCON_F_SIZE(bit 0)
|
||||
- configuration columns and rows are valid.
|
||||
* - VTCON_F_MULTIPORT(bit 1)
|
||||
- device supports multiple ports, and control virtqueues will be used.
|
||||
* - VTCON_F_EMERG_WRITE(bit 2)
|
||||
- device supports emergency write.
|
||||
|
||||
Virtio-console supports redirecting guest output to various backend
|
||||
devices. Currently the following backend devices are supported in ACRN
|
||||
device model: STDIO, TTY, PTY and regular file.
|
||||
|
||||
The device model configuration command syntax for virtio-console is::
|
||||
|
||||
virtio-console,[@]stdio|tty|pty|file:portname[=portpath]\
|
||||
[,[@]stdio|tty|pty|file:portname[=portpath]]
|
||||
|
||||
- Preceding with ``@`` marks the port as a console port, otherwise it is a
|
||||
normal virtio serial port
|
||||
|
||||
- The ``portpath`` can be omitted when backend is stdio or pty
|
||||
|
||||
- The ``stdio/tty/pty`` is tty capable, which means :kbd:`TAB` and
|
||||
:kbd:`BACKSPACE` are supported, as on a regular terminal
|
||||
|
||||
- When tty is used, please make sure the redirected tty is sleeping,
|
||||
(e.g., by ``sleep 2d`` command), and will not read input from stdin before it
|
||||
is used by virtio-console to redirect guest output.
|
||||
|
||||
- Claiming multiple virtio serial ports as consoles is supported,
|
||||
however the guest Linux OS will only use one of them, through the
|
||||
``console=hvcN`` kernel parameter. For example, the following command
|
||||
defines two backend ports, which are both console ports, but the frontend
|
||||
driver will only use the second port named ``pty_port`` as its hvc
|
||||
console (specified by ``console=hvc1`` in the kernel command
|
||||
line)::
|
||||
|
||||
-s n,virtio-console,@tty:tty_port=/dev/pts/0,@pty:pty_port \
|
||||
-B "root=/dev/vda2 rw rootwait maxcpus=$2 nohpet console=hvc1 console=ttyS0 ..."
|
||||
|
||||
|
||||
Console Backend Use Cases
|
||||
*************************
|
||||
|
||||
The following sections elaborate on each backend.
|
||||
|
||||
STDIO
|
||||
=====
|
||||
|
||||
1. Add a pci slot to the device model (``acrn-dm``) command line::
|
||||
|
||||
-s n,virtio-console,@stdio:stdio_port
|
||||
|
||||
#. Add the ``console`` parameter to the guest OS kernel command line::
|
||||
|
||||
console=hvc0
|
||||
|
||||
PTY
|
||||
===
|
||||
|
||||
1. Add a pci slot to the device model (``acrn-dm``) command line::
|
||||
|
||||
-s n,virtio-console,@pty:pty_port
|
||||
|
||||
#. Add the ``console`` parameter to the guest os kernel command line::
|
||||
|
||||
console=hvc0
|
||||
|
||||
One line of information, such as shown below, will be printed in the terminal
|
||||
after ``acrn-dm`` is launched (``/dev/pts/0`` may be different,
|
||||
depending on your use case):
|
||||
|
||||
.. code-block: console
|
||||
|
||||
virt-console backend redirected to /dev/pts/0
|
||||
|
||||
#. Use a terminal emulator, such as minicom or screen, to connect to the
|
||||
tty node::
|
||||
|
||||
minicom -D /dev/pts/0
|
||||
|
||||
or ::
|
||||
|
||||
screen /dev/pts/0
|
||||
|
||||
TTY
|
||||
===
|
||||
|
||||
1. Identify your tty that will be used as the UOS console:
|
||||
|
||||
- If you're connected to your device over the network via ssh, use
|
||||
the linux ``tty`` command, and it will report the node (may be
|
||||
different in your use case)::
|
||||
|
||||
/dev/pts/0
|
||||
sleep 2d
|
||||
|
||||
- If you do not have network access to your device, use screen
|
||||
to create a new tty::
|
||||
|
||||
screen
|
||||
tty
|
||||
|
||||
you will see (depending on your use case)::
|
||||
|
||||
/dev/pts/0
|
||||
|
||||
Prevent the tty from responding by sleeping::
|
||||
|
||||
sleep 2d
|
||||
|
||||
and detach the tty by pressing :kbd:`CTRL-A` :kbd:`d`.
|
||||
|
||||
#. Add a pci slot to the device model (``acrn-dm``) command line
|
||||
(changing the ``dev/pts/X`` to match your use case)::
|
||||
|
||||
-s n,virtio-console,@tty:tty_port=/dev/pts/X
|
||||
|
||||
#. Add the console parameter to the guest OS kernel command line::
|
||||
|
||||
console=hvc0
|
||||
|
||||
#. Go back to the previous tty. For example, if you're using
|
||||
``screen``, use::
|
||||
|
||||
screen -ls
|
||||
screen -r <pid_of_your_tty>
|
||||
|
||||
FILE
|
||||
====
|
||||
|
||||
The File backend only supports console output to a file (no input).
|
||||
|
||||
1. Add a pci slot to the device model (``acrn-dm``) command line,
|
||||
adjusting the ``</path/to/file>`` to your use case::
|
||||
|
||||
-s n,virtio-console,@file:file_port=</path/to/file>
|
||||
|
||||
#. Add the console parameter to the guest OS kernel command line::
|
||||
|
||||
console=hvc0
|
||||
|
525
doc/developer-guides/hld/virtio-net.rst
Normal file
@@ -0,0 +1,525 @@
|
||||
.. _virtio-net:
|
||||
|
||||
Virtio-net
|
||||
##########
|
||||
|
||||
Virtio-net is the para-virtualization solution used in ACRN for
|
||||
networking. The ACRN device model emulates virtual NICs for UOS and the
|
||||
frontend virtio network driver, simulating the virtual NIC and following
|
||||
the virtio specification. (Refer to :ref:`introduction` and
|
||||
:ref:`virtio-hld` background introductions to ACRN and Virtio.)
|
||||
|
||||
Here are some notes about Virtio-net support in ACRN:
|
||||
|
||||
- Legacy devices are supported, modern devices are not supported
|
||||
- Two virtqueues are used in virtio-net: RX queue and TX queue
|
||||
- Indirect descriptor is supported
|
||||
- TAP backend is supported
|
||||
- Control queue is not supported
|
||||
- NIC multiple queues are not supported
|
||||
|
||||
Network Virtualization Architecture
|
||||
***********************************
|
||||
|
||||
ACRN's network virtualization architecture is shown below in
|
||||
:numref:`net-virt-arch`, and illustrates the many necessary network
|
||||
virtualization components that must cooperate for the UOS to send and
|
||||
receive data from the outside world.
|
||||
|
||||
.. figure:: images/network-virt-arch.png
|
||||
:align: center
|
||||
:width: 900px
|
||||
:name: net-virt-arch
|
||||
|
||||
Network Virtualization Architecture
|
||||
|
||||
(The green components are parts of the ACRN solution, while the gray
|
||||
components are parts of the Linux kernel.)
|
||||
|
||||
Let's explore these components further.
|
||||
|
||||
SOS/UOS Network Stack:
|
||||
This is the standard Linux TCP/IP stack, currently the most
|
||||
feature-rich TCP/IP implementation.
|
||||
|
||||
virtio-net Frontend Driver:
|
||||
This is the standard driver in the Linux Kernel for virtual Ethernet
|
||||
devices. This driver matches devices with PCI vendor ID 0x1AF4 and PCI
|
||||
Device ID 0x1000 (for legacy devices in our case) or 0x1041 (for modern
|
||||
devices). The virtual NIC supports two virtqueues, one for transmitting
|
||||
packets and the other for receiving packets. The frontend driver places
|
||||
empty buffers into one virtqueue for receiving packets, and enqueues
|
||||
outgoing packets into another virtqueue for transmission. The size of
|
||||
each virtqueue is 1024, configurable in the virtio-net backend driver.
|
||||
|
||||
ACRN Hypervisor:
|
||||
The ACRN hypervisor is a type 1 hypervisor, running directly on the
|
||||
bare-metal hardware, and suitable for a variety of IoT and embedded
|
||||
device solutions. It fetches and analyzes the guest instructions, puts
|
||||
the decoded information into the shared page as an IOREQ, and notifies
|
||||
or interrupts the VHM module in the SOS for processing.
|
||||
|
||||
VHM Module:
|
||||
The Virtio and Hypervisor Service Module (VHM) is a kernel module in the
|
||||
Service OS (SOS) acting as a middle layer to support the device model
|
||||
and hypervisor. The VHM forwards a IOREQ to the virtio-net backend
|
||||
driver for processing.
|
||||
|
||||
ACRN Device Model and virtio-net Backend Driver:
|
||||
The ACRN Device Model (DM) gets an IOREQ from a shared page and calls
|
||||
the virtio-net backend driver to process the request. The backend driver
|
||||
receives the data in a shared virtqueue and sends it to the TAP device.
|
||||
|
||||
Bridge and Tap Device:
|
||||
Bridge and Tap are standard virtual network infrastructures. They play
|
||||
an important role in communication among the SOS, the UOS, and the
|
||||
outside world.
|
||||
|
||||
IGB Driver:
|
||||
IGB is the physical Network Interface Card (NIC) Linux kernel driver
|
||||
responsible for sending data to and receiving data from the physical
|
||||
NIC.
|
||||
|
||||
The virtual network card (NIC) is implemented as a virtio legacy device
|
||||
in the ACRN device model (DM). It is registered as a PCI virtio device
|
||||
to the guest OS (UOS) and uses the standard virtio-net in the Linux kernel as
|
||||
its driver (the guest kernel should be built with
|
||||
``CONFIG_VIRTIO_NET=y``).
|
||||
|
||||
The virtio-net backend in DM forwards the data received from the
|
||||
frontend to the TAP device, then from the TAP device to the bridge, and
|
||||
finally from the bridge to the physical NIC driver, and vice versa for
|
||||
returning data from the NIC to the frontend.
|
||||
|
||||
ACRN Virtio-Network Calling Stack
|
||||
*********************************
|
||||
|
||||
Various components of ACRN network virtualization are shown in the
|
||||
architecture diagram shows in :numref:`net-virt-arch`. In this section,
|
||||
we will use UOS data transmission (TX) and reception (RX) examples to
|
||||
explain step-by-step how these components work together to implement
|
||||
ACRN network virtualization.
|
||||
|
||||
Initialization in Device Model
|
||||
==============================
|
||||
|
||||
**virtio_net_init**
|
||||
|
||||
- Present frontend for a virtual PCI based NIC
|
||||
- Setup control plan callbacks
|
||||
- Setup data plan callbacks, including TX, RX
|
||||
- Setup tap backend
|
||||
|
||||
Initialization in virtio-net Frontend Driver
|
||||
============================================
|
||||
|
||||
**virtio_pci_probe**
|
||||
|
||||
- Construct virtio device using virtual pci device and register it to
|
||||
virtio bus
|
||||
|
||||
**virtio_dev_probe --> virtnet_probe --> init_vqs**
|
||||
|
||||
- Register network driver
|
||||
- Setup shared virtqueues
|
||||
|
||||
ACRN UOS TX FLOW
|
||||
================
|
||||
|
||||
The following shows the ACRN UOS network TX flow, using TCP as an
|
||||
example, showing the flow through each layer:
|
||||
|
||||
**UOS TCP Layer**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
tcp_sendmsg -->
|
||||
tcp_sendmsg_locked -->
|
||||
tcp_push_one -->
|
||||
tcp_write_xmit -->
|
||||
tcp_transmit_skb -->
|
||||
|
||||
**UOS IP Layer**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
ip_queue_xmit -->
|
||||
ip_local_out -->
|
||||
__ip_local_out -->
|
||||
dst_output -->
|
||||
ip_output -->
|
||||
ip_finish_output -->
|
||||
ip_finish_output2 -->
|
||||
neigh_output -->
|
||||
neigh_resolve_output -->
|
||||
|
||||
**UOS MAC Layer**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
dev_queue_xmit -->
|
||||
__dev_queue_xmit -->
|
||||
dev_hard_start_xmit -->
|
||||
xmit_one -->
|
||||
netdev_start_xmit -->
|
||||
__netdev_start_xmit -->
|
||||
|
||||
|
||||
**UOS MAC Layer virtio-net Frontend Driver**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
start_xmit --> // virtual NIC driver xmit in virtio_net
|
||||
xmit_skb -->
|
||||
virtqueue_add_outbuf --> // add out buffer to shared virtqueue
|
||||
virtqueue_add -->
|
||||
|
||||
virtqueue_kick --> // notify the backend
|
||||
virtqueue_notify -->
|
||||
vp_notify -->
|
||||
iowrite16 --> // trap here, HV will first get notified
|
||||
|
||||
**ACRN Hypervisor**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
vmexit_handler --> // vmexit because VMX_EXIT_REASON_IO_INSTRUCTION
|
||||
pio_instr_vmexit_handler -->
|
||||
emulate_io --> // ioreq cant be processed in HV, forward it to VHM
|
||||
acrn_insert_request_wait -->
|
||||
fire_vhm_interrupt --> // interrupt SOS, VHM will get notified
|
||||
|
||||
**VHM Module**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
vhm_intr_handler --> // VHM interrupt handler
|
||||
tasklet_schedule -->
|
||||
io_req_tasklet -->
|
||||
acrn_ioreq_distribute_request --> // ioreq can't be processed in VHM, forward it to device DM
|
||||
acrn_ioreq_notify_client -->
|
||||
wake_up_interruptible --> // wake up DM to handle ioreq
|
||||
|
||||
**ACRN Device Model / virtio-net Backend Driver**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
handle_vmexit -->
|
||||
vmexit_inout -->
|
||||
emulate_inout -->
|
||||
pci_emul_io_handler -->
|
||||
virtio_pci_write -->
|
||||
virtio_pci_legacy_write -->
|
||||
virtio_net_ping_txq --> // start TX thread to process, notify thread return
|
||||
virtio_net_tx_thread --> // this is TX thread
|
||||
virtio_net_proctx --> // call corresponding backend (tap) to process
|
||||
virtio_net_tap_tx -->
|
||||
writev --> // write data to tap device
|
||||
|
||||
**SOS TAP Device Forwarding**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
do_writev -->
|
||||
vfs_writev -->
|
||||
do_iter_write -->
|
||||
do_iter_readv_writev -->
|
||||
call_write_iter -->
|
||||
tun_chr_write_iter -->
|
||||
tun_get_user -->
|
||||
netif_receive_skb -->
|
||||
netif_receive_skb_internal -->
|
||||
__netif_receive_skb -->
|
||||
__netif_receive_skb_core -->
|
||||
|
||||
|
||||
**SOS Bridge Forwarding**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
br_handle_frame -->
|
||||
br_handle_frame_finish -->
|
||||
br_forward -->
|
||||
__br_forward -->
|
||||
br_forward_finish -->
|
||||
br_dev_queue_push_xmit -->
|
||||
|
||||
**SOS MAC Layer**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
dev_queue_xmit -->
|
||||
__dev_queue_xmit -->
|
||||
dev_hard_start_xmit -->
|
||||
xmit_one -->
|
||||
netdev_start_xmit -->
|
||||
__netdev_start_xmit -->
|
||||
|
||||
|
||||
**SOS MAC Layer IGB Driver**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
igb_xmit_frame --> // IGB physical NIC driver xmit function
|
||||
|
||||
ACRN UOS RX FLOW
|
||||
================
|
||||
|
||||
The following shows the ACRN UOS network RX flow, using TCP as an example.
|
||||
Let's start by receiving a device interrupt. (Note that the hypervisor
|
||||
will first get notified when receiving an interrupt even in passthrough
|
||||
cases.)
|
||||
|
||||
**Hypervisor Interrupt Dispatch**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
vmexit_handler --> // vmexit because VMX_EXIT_REASON_EXTERNAL_INTERRUPT
|
||||
external_interrupt_vmexit_handler -->
|
||||
dispatch_interrupt -->
|
||||
common_handler_edge -->
|
||||
ptdev_interrupt_handler -->
|
||||
ptdev_enqueue_softirq --> // Interrupt will be delivered in bottom-half softirq
|
||||
|
||||
|
||||
**Hypervisor Interrupt Injection**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
do_softirq -->
|
||||
ptdev_softirq -->
|
||||
vlapic_intr_msi --> // insert the interrupt into SOS
|
||||
|
||||
start_vcpu --> // VM Entry here, will process the pending interrupts
|
||||
|
||||
**SOS MAC Layer IGB Driver**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
do_IRQ -->
|
||||
...
|
||||
igb_msix_ring -->
|
||||
igbpoll -->
|
||||
napi_gro_receive -->
|
||||
napi_skb_finish -->
|
||||
netif_receive_skb_internal -->
|
||||
__netif_receive_skb -->
|
||||
__netif_receive_skb_core --
|
||||
|
||||
**SOS Bridge Forwarding**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
br_handle_frame -->
|
||||
br_handle_frame_finish -->
|
||||
br_forward -->
|
||||
__br_forward -->
|
||||
br_forward_finish -->
|
||||
br_dev_queue_push_xmit -->
|
||||
|
||||
**SOS MAC Layer**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
dev_queue_xmit -->
|
||||
__dev_queue_xmit -->
|
||||
dev_hard_start_xmit -->
|
||||
xmit_one -->
|
||||
netdev_start_xmit -->
|
||||
__netdev_start_xmit -->
|
||||
|
||||
**SOS MAC Layer TAP Driver**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
tun_net_xmit --> // Notify and wake up reader process
|
||||
|
||||
**ACRN Device Model / virtio-net Backend Driver**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
virtio_net_rx_callback --> // the tap fd get notified and this function invoked
|
||||
virtio_net_tap_rx --> // read data from tap, prepare virtqueue, insert interrupt into the UOS
|
||||
vq_endchains -->
|
||||
vq_interrupt -->
|
||||
pci_generate_msi -->
|
||||
|
||||
**VHM Module**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
vhm_dev_ioctl --> // process the IOCTL and call hypercall to inject interrupt
|
||||
hcall_inject_msi -->
|
||||
|
||||
**ACRN Hypervisor**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
vmexit_handler --> // vmexit because VMX_EXIT_REASON_VMCALL
|
||||
vmcall_vmexit_handler -->
|
||||
hcall_inject_msi --> // insert interrupt into UOS
|
||||
vlapic_intr_msi -->
|
||||
|
||||
**UOS MAC Layer virtio_net Frontend Driver**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
vring_interrupt --> // virtio-net frontend driver interrupt handler
|
||||
skb_recv_done --> //registed by virtnet_probe-->init_vqs-->virtnet_find_vqs
|
||||
virtqueue_napi_schedule -->
|
||||
__napi_schedule -->
|
||||
virtnet_poll -->
|
||||
virtnet_receive -->
|
||||
receive_buf -->
|
||||
|
||||
**UOS MAC Layer**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
napi_gro_receive -->
|
||||
napi_skb_finish -->
|
||||
netif_receive_skb_internal -->
|
||||
__netif_receive_skb -->
|
||||
__netif_receive_skb_core -->
|
||||
|
||||
**UOS IP Layer**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
ip_rcv -->
|
||||
ip_rcv_finish -->
|
||||
dst_input -->
|
||||
ip_local_deliver -->
|
||||
ip_local_deliver_finish -->
|
||||
|
||||
|
||||
**UOS TCP Layer**
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
tcp_v4_rcv -->
|
||||
tcp_v4_do_rcv -->
|
||||
tcp_rcv_established -->
|
||||
tcp_data_queue -->
|
||||
tcp_queue_rcv -->
|
||||
__skb_queue_tail -->
|
||||
|
||||
sk->sk_data_ready --> // application will get notified
|
||||
|
||||
How to Use
|
||||
==========
|
||||
|
||||
The network infrastructure shown in :numref:`net-virt-infra` needs to be
|
||||
prepared in the SOS before we start. We need to create a bridge and at
|
||||
least one tap device (two tap devices are needed to create a dual
|
||||
virtual NIC) and attach a physical NIC and tap device to the bridge.
|
||||
|
||||
.. figure:: images/network-virt-sos-infrastruct.png
|
||||
:align: center
|
||||
:width: 900px
|
||||
:name: net-virt-infra
|
||||
|
||||
Network Infrastructure in SOS
|
||||
|
||||
You can use Linux commands (e.g. ip, brctl) to create this network. In
|
||||
our case, we use systemd to automatically create the network by default.
|
||||
You can check the files with prefix 50- in the SOS
|
||||
``/usr/lib/systemd/network/``:
|
||||
|
||||
- `50-acrn.netdev <https://raw.githubusercontent.com/projectacrn/acrn-hypervisor/master/tools/acrnbridge/acrn.netdev>`__
|
||||
- `50-acrn.network <https://raw.githubusercontent.com/projectacrn/acrn-hypervisor/master/tools/acrnbridge/acrn.network>`__
|
||||
- `50-acrn_tap0.netdev <https://raw.githubusercontent.com/projectacrn/acrn-hypervisor/master/tools/acrnbridge/acrn_tap0.netdev>`__
|
||||
- `50-eth.network <https://raw.githubusercontent.com/projectacrn/acrn-hypervisor/master/tools/acrnbridge/eth.network>`__
|
||||
|
||||
When the SOS is started, run ``ifconfig`` to show the devices created by
|
||||
this systemd configuration:
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
acrn-br0 Link encap:Ethernet HWaddr B2:50:41:FE:F7:A3
|
||||
inet addr:10.239.154.43 Bcast:10.239.154.255 Mask:255.255.255.0
|
||||
inet6 addr: fe80::b050:41ff:fefe:f7a3/64 Scope:Link
|
||||
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
|
||||
RX packets:226932 errors:0 dropped:21383 overruns:0 frame:0
|
||||
TX packets:14816 errors:0 dropped:0 overruns:0 carrier:0
|
||||
collisions:0 txqueuelen:1000
|
||||
RX bytes:100457754 (95.8 Mb) TX bytes:83481244 (79.6 Mb)
|
||||
|
||||
acrn_tap0 Link encap:Ethernet HWaddr F6:A7:7E:52:50:C6
|
||||
UP BROADCAST MULTICAST MTU:1500 Metric:1
|
||||
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
|
||||
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
|
||||
collisions:0 txqueuelen:1000
|
||||
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
|
||||
|
||||
enp3s0 Link encap:Ethernet HWaddr 98:4F:EE:14:5B:74
|
||||
inet6 addr: fe80::9a4f:eeff:fe14:5b74/64 Scope:Link
|
||||
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
|
||||
RX packets:279174 errors:0 dropped:0 overruns:0 frame:0
|
||||
TX packets:69923 errors:0 dropped:0 overruns:0 carrier:0
|
||||
collisions:0 txqueuelen:1000
|
||||
RX bytes:107312294 (102.3 Mb) TX bytes:87117507 (83.0 Mb)
|
||||
Memory:82200000-8227ffff
|
||||
|
||||
lo Link encap:Local Loopback
|
||||
inet addr:127.0.0.1 Mask:255.0.0.0
|
||||
inet6 addr: ::1/128 Scope:Host
|
||||
UP LOOPBACK RUNNING MTU:65536 Metric:1
|
||||
RX packets:16 errors:0 dropped:0 overruns:0 frame:0
|
||||
TX packets:16 errors:0 dropped:0 overruns:0 carrier:0
|
||||
collisions:0 txqueuelen:1000
|
||||
RX bytes:1216 (1.1 Kb) TX bytes:1216 (1.1 Kb)
|
||||
|
||||
Run ``brctl show`` to see the bridge ``acrn-br0`` and attached devices:
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
bridge name bridge id STP enabled interfaces
|
||||
|
||||
acrn-br0 8000.b25041fef7a3 no acrn_tap0
|
||||
enp3s0
|
||||
|
||||
Add a pci slot to the device model acrn-dm command line (mac address is
|
||||
optional):
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
-s 4,virtio-net,<tap_name>,[mac=<XX:XX:XX:XX:XX:XX>]
|
||||
|
||||
When the UOS is lauched, run ``ifconfig`` to check the network. enp0s4r
|
||||
is the virtual NIC created by acrn-dm:
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
enp0s4 Link encap:Ethernet HWaddr 00:16:3E:39:0F:CD
|
||||
inet addr:10.239.154.186 Bcast:10.239.154.255 Mask:255.255.255.0
|
||||
inet6 addr: fe80::216:3eff:fe39:fcd/64 Scope:Link
|
||||
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
|
||||
RX packets:140 errors:0 dropped:8 overruns:0 frame:0
|
||||
TX packets:46 errors:0 dropped:0 overruns:0 carrier:0
|
||||
collisions:0 txqueuelen:1000
|
||||
RX bytes:110727 (108.1 Kb) TX bytes:4474 (4.3 Kb)
|
||||
|
||||
lo Link encap:Local Loopback
|
||||
inet addr:127.0.0.1 Mask:255.0.0.0
|
||||
inet6 addr: ::1/128 Scope:Host
|
||||
UP LOOPBACK RUNNING MTU:65536 Metric:1
|
||||
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
|
||||
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
|
||||
collisions:0 txqueuelen:1000
|
||||
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
|
||||
|
||||
Performance Estimation
|
||||
======================
|
||||
|
||||
We've introduced the network virtualization solution in ACRN, from the
|
||||
top level architecture to the detailed TX and RX flow. Currently, the
|
||||
control plane and data plane are all processed in ACRN device model,
|
||||
which may bring some overhead. But this is not a bottleneck for 1000Mbit
|
||||
NICs or below. Network bandwidth for virtualization can be very close to
|
||||
the native bandwidgh. For high speed NIC (e.g. 10Gb or above), it is
|
||||
necessary to separate the data plane from the control plane. We can use
|
||||
vhost for acceleration. For most IoT scenarios, processing in user space
|
||||
is simple and reasonable.
|
||||
|
||||
|
21
doc/developer-guides/hld/virtio-rnd.rst
Normal file
@@ -0,0 +1,21 @@
|
||||
.. _virtio-rnd:
|
||||
|
||||
Virtio-rnd
|
||||
##########
|
||||
|
||||
The virtio-rnd entropy device supplies high-quality randomness for guest
|
||||
use. The virtio device ID of the virtio-rnd device is 4, and it supports
|
||||
one virtqueue, the size of which is 64, configurable in the source code.
|
||||
It has no feature bits defined.
|
||||
|
||||
When the FE driver requires some random bytes, the BE device will place
|
||||
bytes of random data onto the virtqueue.
|
||||
|
||||
To launch the virtio-rnd device, use the following virtio command::
|
||||
|
||||
-s <slot>,virtio-rnd
|
||||
|
||||
To verify the correctness in user OS, use the following
|
||||
command::
|
||||
|
||||
od /dev/random
|
98
doc/developer-guides/hld/watchdog-hld.rst
Normal file
@@ -0,0 +1,98 @@
|
||||
.. _watchdog-hld:
|
||||
|
||||
Watchdog Virtualization in Device Model
|
||||
#######################################
|
||||
|
||||
This document describes the watchdog virtualization implementation in
|
||||
ACRN device model.
|
||||
|
||||
Overview
|
||||
********
|
||||
|
||||
A watchdog is an important hardware component in embedded systems, used
|
||||
to monitor the system's running status, and resets the processor if the
|
||||
software crashes. In general, hardware watchdogs rely on a piece of
|
||||
software running on the machine which must "kick" the watchdog device
|
||||
regularly, say every 10 seconds. If the watchdog doesn't get "kicked"
|
||||
after 60 seconds, for example, then the watchdog device asserts the
|
||||
RESET line which results in a hard reboot.
|
||||
|
||||
For ACRN we emulate the watchdog hardware in the Intel 6300ESB chipset
|
||||
as a PCI device called 6300ESB watchdog and is added into the Device
|
||||
Model following the PCI device framework. The following
|
||||
:numref:`watchdog-device` shows the watchdog device workflow:
|
||||
|
||||
.. figure:: images/watchdog-image2.png
|
||||
:align: center
|
||||
:width: 900px
|
||||
:name: watchdog-device
|
||||
|
||||
Watchdog device flow
|
||||
|
||||
The DM in the Service OS (SOS) treats the watchdog as a passive device.
|
||||
It receives read/write commands from the watchdog driver, does the
|
||||
actions, and returns. In ACRN, the commands are from User OS (UOS)
|
||||
watchdog driver.
|
||||
|
||||
UOS watchdog work flow
|
||||
**********************
|
||||
|
||||
When the UOS does a read or write operation on the watchdog device's
|
||||
registers or memory space (Port IO or Memory map I/O), it will trap into
|
||||
the hypervisor. The hypervisor delivers the operation to the SOS/DM
|
||||
through IPI (inter-process interrupt) or shared memory, and the DM
|
||||
dispatches the operation to the watchdog emulation code.
|
||||
|
||||
After the DM watchdog finishes emulating the read or write operation, it
|
||||
then calls ``ioctl`` to the SOS/kernel (``/dev/acrn_vhm``). VHM will call a
|
||||
hypercall to trap into the hypervisor to tell it the operation is done, and
|
||||
the hypervisor will set UOS-related VCPU registers and resume UOS so the
|
||||
UOS watchdog driver will get the return values (or return status). The
|
||||
:numref:`watchdog-workflow` below is a typical operation flow:
|
||||
from UOS to SOS and return back:
|
||||
|
||||
.. figure:: images/watchdog-image1.png
|
||||
:align: center
|
||||
:width: 900px
|
||||
:name: watchdog-workflow
|
||||
|
||||
Watchdog operation workflow
|
||||
|
||||
Implementation in ACRN and how to use it
|
||||
****************************************
|
||||
|
||||
In ACRN, the Intel 6300ESB watchdog device emulation is added into the
|
||||
DM PCI device tree. Its interface structure is (see
|
||||
``devicemodel/include/pci_core.h``):
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
struct pci_vdev_ops pci_ops_wdt = {
|
||||
.class_name = "wdt-i6300esb",
|
||||
.vdev_init = pci_wdt_init,
|
||||
.vdev_deinit = pci_wdt_deinit,
|
||||
.vdev_cfgwrite = pci_wdt_cfg_write,
|
||||
.vdev_cfgread = pci_wdt_cfg_read,
|
||||
.vdev_barwrite = pci_wdt_bar_write,
|
||||
.vdev_barread = pci_wdt_bar_read
|
||||
};
|
||||
|
||||
All functions follow the ``pci_vdev_ops`` definitions for PCI device
|
||||
emulation.
|
||||
|
||||
The main part in the watchdog emulation is the timer thread. It emulates
|
||||
the watchdog device timeout management. When it gets the kick action
|
||||
from the UOS, it resets the timer. If the timer expires before getting a
|
||||
timely kick action, it will call DM API to reboot that UOS.
|
||||
|
||||
In the UOS launch script, add: ``-s xx,wdt-i6300esb`` into DM parameters.
|
||||
(xx is the virtual PCI BDF number as with other PCI devices)
|
||||
|
||||
Make sure the UOS kernel has the I6300ESB driver enabled: ``CONFIG_I6300ESB_WDT=y``. After the UOS
|
||||
boots up, the watchdog device will be created as node ``/dev/watchdog``,
|
||||
and can be used as a normal device file.
|
||||
|
||||
Usually the UOS needs a watchdog service (daemon) to run in userland and
|
||||
kick the watchdog periodically. If something prevents the daemon from
|
||||
kicking the watchdog, for example the UOS system is hung, the watchdog
|
||||
will timeout and the DM will reboot the UOS.
|