doc: reorganize HLD docs

Reorganize the high-level design docs to align with a work-in-progress
HLD document.  Migrate previous web content (and images) into the new
organization.

From here we'll continue inclusion of new design chapters as they're
reviewed and edited.

Signed-off-by: David B. Kinder <david.b.kinder@intel.com>
Author: David B. Kinder
Committed-by: David Kinder
Date: 2018-10-04 16:39:40 -07:00
Parent: 8e21d5ee99
Commit: 1e38544112
93 changed files with 1249 additions and 1233 deletions


@@ -0,0 +1,359 @@
.. _acpi-virt-HLD:
ACPI Virtualization high-level design
#####################################
ACPI introduction
*****************
Advanced Configuration and Power Interface (ACPI) provides an open
standard that operating systems can use to discover and configure
computer hardware components, and to perform power management, for
example, by monitoring status and putting unused components to sleep.
Functions implemented by ACPI include:
- System/Device/Processor power management
- Device/Processor performance management
- Configuration / Plug and Play
- System events
- Battery management
- Thermal management
ACPI enumerates the DMA engines in the platform and describes the
device scope relationships between PCI devices and the DMA engine that
controls them. All critical functions depend on ACPI tables. Here's an
example from an Apollo Lake (APL) platform with Linux installed:
.. code-block:: none
root@:Dom0 ~ $ ls /sys/firmware/acpi/tables/
APIC data DMAR DSDT dynamic FACP FACS HPET MCFG NHLT TPM2
These tables provide different information and functions:
- Advanced Programmable Interrupt Controller (APIC) for Symmetric
Multiprocessor systems (SMP),
- DMA remapping (DMAR) for Intel |reg| Virtualization Technology for
Directed I/O (VT-d),
- Non-HD Audio Link Table (NHLT) for supporting audio devices, and
- Differentiated System Description Table (DSDT) for system
configuration info. DSDT is a major ACPI table used to describe what
peripherals the machine has, plus information on PCI IRQ mappings and
power management.
Most of the ACPI functionality is provided in ACPI Machine Language
(AML) bytecode stored in the ACPI tables. To make use of these tables,
Linux implements an interpreter for the AML bytecode. At BIOS
development time, the AML bytecode is compiled from the ASL (ACPI Source
Language) code. The ``iasl`` command is used to disassemble the ACPI table
and display its contents:
.. code-block:: none
root@:Dom0 ~ $ cp /sys/firmware/acpi/tables/DMAR .
root@:Dom0 ~ $ iasl -d DMAR
Intel ACPI Component Architecture
ASL+ Optimizing Compiler/Disassembler version 20170728
Copyright (c) 2000 - 2017 Intel Corporation
Input file DMAR, Length 0xB0 (176) bytes
ACPI: DMAR 0x0000000000000000 0000B0 (v01 INTEL BDW 00000001 INTL 00000001)
Acpi Data Table [DMAR] decoded
Formatted output: DMAR.dsl - 5286 bytes
root@:Dom0 ~ $ cat DMAR.dsl
[000h 0000 4] Signature : "DMAR" [DMA Remapping table]
[004h 0004 4] Table Length : 000000B0
[008h 0008 1] Revision : 01
...
[030h 0048 2] Subtable Type : 0000 [Hardware Unit Definition]
[032h 0050 2] Length : 0018
[034h 0052 1] Flags : 00
[035h 0053 1] Reserved : 00
[036h 0054 2] PCI Segment Number : 0000
[038h 0056 8] Register Base Address : 00000000FED64000
From the displayed ASL, we can see some generic table fields, such as
the version information, and one VT-d remapping engine description with
0xFED64000 as its register base address.
We can modify DMAR.dsl and reassemble it into AML:
.. code-block:: none
root@:Dom0 ~ $ iasl DMAR.dsl
Intel ACPI Component Architecture
ASL+ Optimizing Compiler/Disassembler version 20170728
Copyright (c) 2000 - 2017 Intel Corporation
Table Input: DMAR.dsl - 113 lines, 5286 bytes, 72 fields
Binary Output: DMAR.aml - 176 bytes
Compilation complete. 0 Errors, 0 Warnings, 0 Remarks
We can see the new AML file ``DMAR.aml`` is created.
There are many ACPI tables in the system, linked together via table
pointers. In an ACPI-compatible system, the OS can enumerate all
needed tables starting with the Root System Description Pointer (RSDP),
provided at a known place in the system's low address space, which
points to an XSDT (Extended System Description Table). The following
figure shows a typical ACPI table layout on an Intel APL platform (a
sketch of the RSDP structure that anchors this chain follows the figure):
.. figure:: images/acpi-image1.png
:width: 700px
:align: center
:name: acpi-layout
Typical ACPI table layout in an Intel APL platform
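For reference, the RSDP structure that anchors this table chain has the
following layout, as defined by the ACPI specification (shown here as a C
sketch for illustration; it is not code from ACRN, and real firmware code
declares it packed to match the in-memory layout):

.. code-block:: c

   #include <stdint.h>

   /* Root System Description Pointer (ACPI 2.0+), per the ACPI spec.
    * The OS locates this structure in low memory (or via the firmware),
    * then follows xsdt_address to the XSDT, which in turn points to the
    * other tables (FACP/FADT, APIC/MADT, DMAR, and so on).
    */
   struct acpi_rsdp {
           char     signature[8];       /* "RSD PTR " */
           uint8_t  checksum;           /* covers the first 20 bytes */
           char     oem_id[6];
           uint8_t  revision;           /* 2 for ACPI 2.0 and later */
           uint32_t rsdt_address;       /* 32-bit physical address of the RSDT */
           uint32_t length;             /* length of the whole structure */
           uint64_t xsdt_address;       /* 64-bit physical address of the XSDT */
           uint8_t  extended_checksum;  /* covers the whole structure */
           uint8_t  reserved[3];
   };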
ACPI virtualization
*******************
Most modern OSes require ACPI, so ACRN provides ACPI virtualization to
emulate an ACPI-capable virtual platform for the guest OS. To achieve
this, there are two options, depending on how physical devices and ACPI
resources are abstracted: Partitioning and Emulation.
Partitioning
============
One option is to assign and partition physical devices and ACPI
resources among all guest OSes. That means each guest OS owns specific
devices with passthrough, such as shown below:
+--------------------------+--------------------------+--------------------------+
| PCI Devices | VM0(Cluster VM) | VM1(IVI VM) |
+--------------------------+--------------------------+--------------------------+
| I2C | I2C3, I2C0 | I2C1, I2C2, I2C4, I2C5, |
| | | I2C6, I2C7 |
+--------------------------+--------------------------+--------------------------+
| SPI | SPI1 | SPI0, SPI2 |
+--------------------------+--------------------------+--------------------------+
| USB | | USB-Host (xHCI) and |
| | | USB-Device (xDCI) |
+--------------------------+--------------------------+--------------------------+
| SDIO | | SDIO |
+--------------------------+--------------------------+--------------------------+
| IPU | | IPU |
+--------------------------+--------------------------+--------------------------+
| Ethernet | Ethernet | |
+--------------------------+--------------------------+--------------------------+
| WIFI | | WIFI |
+--------------------------+--------------------------+--------------------------+
| Bluetooth | | Bluetooth |
+--------------------------+--------------------------+--------------------------+
| Audio | | Audio |
+--------------------------+--------------------------+--------------------------+
| GPIO | GPIO | |
+--------------------------+--------------------------+--------------------------+
| UART | UART | |
+--------------------------+--------------------------+--------------------------+
In an early ACRN development phase, partitioning was used for
simplicity. To implement partitioning, we need to hack the PCI logic to
make different VMs see a different subset of devices, and create one
copy of the ACPI tables for each of them, as shown in the following
picture:
.. figure:: images/acpi-image3.png
:width: 900px
:align: center
For each VM, its ACPI tables are standalone copies and are not related to
those of other VMs. The OpRegion also needs to be copied for each VM.
For each table, we make modifications, based on the physical table, to
reflect the devices assigned to a particular VM. In the picture below,
we keep SP2 (0:19.1) for VM0, and SP1 (0:19.0)/SP3 (0:19.2) for
VM1. Any time the partition policy changes, we need to modify both tables
again, including disassembling, modifying, and reassembling them, which is
tricky and error-prone.
.. figure:: images/acpi-image2.png
:width: 900px
:align: center
Emulation
=========
A second option is for the SOS (VM0) to "own" all devices and emulate a
set of virtual devices for each UOS (VM1). This is the most
popular model for virtualization, as shown below. ACRN currently uses
device emulation plus some device passthrough for the UOS.
.. figure:: images/acpi-image5.png
:width: 400px
:align: center
Regarding ACPI virtualization in ACRN, different policies are used for
different components:
- Hypervisor - ACPI is transparent to the Hypervisor, which has no
knowledge of ACPI at all.
- SOS - All ACPI resources are physically owned by the SOS, which
enumerates all ACPI tables and devices.
- UOS - Virtual ACPI resources exposed by the device model are owned by
UOS.
Source for the ACPI emulation code for the device model is found in
``hw/platform/acpi/acpi.c``.
Each entry in ``basl_ftables`` corresponds to one virtual ACPI table and
includes the following elements:
- wsect - output handler that writes the related ACPI table contents to
a specific file
- offset - the ACPI table's offset in guest memory
- valid - dynamically indicates whether this table is needed
.. code-block:: c
static struct {
int (*wsect)(FILE *fp, struct vmctx *ctx);
uint64_t offset;
bool valid;
} basl_ftables[] = {
{ basl_fwrite_rsdp, 0, true },
{ basl_fwrite_rsdt, RSDT_OFFSET, true },
{ basl_fwrite_xsdt, XSDT_OFFSET, true },
{ basl_fwrite_madt, MADT_OFFSET, true },
{ basl_fwrite_fadt, FADT_OFFSET, true },
{ basl_fwrite_hpet, HPET_OFFSET, true },
{ basl_fwrite_mcfg, MCFG_OFFSET, true },
{ basl_fwrite_facs, FACS_OFFSET, true },
{ basl_fwrite_nhlt, NHLT_OFFSET, false }, /*valid with audio ptdev*/
{ basl_fwrite_dsdt, DSDT_OFFSET, true }
};
The main function to create the virtual ACPI tables is ``acpi_build``,
which calls ``basl_compile`` for each table and performs the following steps:
#. create two temp files: infile and outfile
#. with the output handler, write the table contents stream to infile
#. use the ``iasl`` tool to assemble infile into outfile
#. load the outfile contents to the required memory offset
.. code-block:: c
static int
basl_compile(struct vmctx *ctx,
int (*fwrite_section)(FILE *, struct vmctx *),
uint64_t offset)
{
struct basl_fio io[2];
static char iaslbuf[3*MAXPATHLEN + 10];
int err;
err = basl_start(&io[0], &io[1]);
if (!err) {
err = (*fwrite_section)(io[0].fp, ctx);
if (!err) {
/*
* iasl sends the results of the compilation to
* stdout. Shut this down by using the shell to
* redirect stdout to /dev/null, unless the user
* has requested verbose output for debugging
* purposes
*/
if (basl_verbose_iasl)
snprintf(iaslbuf, sizeof(iaslbuf),
"%s -p %s %s",
ASL_COMPILER,
io[1].f_name, io[0].f_name);
else
snprintf(iaslbuf, sizeof(iaslbuf),
"/bin/sh -c \"%s -p %s %s\" 1> /dev/null",
ASL_COMPILER,
io[1].f_name, io[0].f_name);
err = system(iaslbuf);
if (!err) {
/*
* Copy the aml output file into guest
* memory at the specified location
*/
err = basl_load(ctx, io[1].fd, offset);
} else
err = -1;
}
basl_end(&io[0], &io[1]);
}
return err;
}
After processing each entry, the virtual ACPI tables are present in UOS
memory.
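Conceptually, the per-table iteration looks like the following sketch (a
simplified illustration, not the exact ``acpi_build`` code in
``hw/platform/acpi/acpi.c``): skip entries not marked valid, then compile
and load each remaining table at its offset.

.. code-block:: c

   #define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

   /* Simplified sketch of the per-table loop; the real acpi_build()
    * does additional bookkeeping around it.
    */
   static int build_all_tables(struct vmctx *ctx)
   {
           unsigned int i;
           int err = 0;

           for (i = 0; i < ARRAY_SIZE(basl_ftables) && !err; i++) {
                   if (!basl_ftables[i].valid)
                           continue;    /* e.g. NHLT without an audio ptdev */

                   err = basl_compile(ctx, basl_ftables[i].wsect,
                                      basl_ftables[i].offset);
           }
           return err;
   }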
For pass-through devices in the UOS, we likely need to add some ACPI
description in the UOS virtual DSDT table. There is one hook
(``passthru_write_dsdt``) in ``hw/pci/passthrough.c`` for this. The following
source code shows calls to different functions that add the appropriate
contents for each vendor and device ID.
.. code-block:: c
static void
passthru_write_dsdt(struct pci_vdev *dev)
{
struct passthru_dev *ptdev = (struct passthru_dev *) dev->arg;
uint32_t vendor = 0, device = 0;
vendor = read_config(ptdev->phys_dev, PCIR_VENDOR, 2);
if (vendor != 0x8086)
return;
device = read_config(ptdev->phys_dev, PCIR_DEVICE, 2);
/* Provides ACPI extra info */
if (device == 0x5aaa)
/* XDCI @ 00:15.1 to enable ADB */
write_dsdt_xhci(dev);
else if (device == 0x5ab4)
/* HDAC @ 00:17.0 as codec */
write_dsdt_hdac(dev);
else if (device == 0x5a98)
/* HDAS @ 00:e.0 */
write_dsdt_hdas(dev);
else if (device == 0x5aac)
/* i2c @ 00:16.0 for ipu */
write_dsdt_ipu_i2c(dev);
else if (device == 0x5abc)
/* URT1 @ 00:18.0 for bluetooth*/
write_dsdt_urt1(dev);
}
For instance, ``write_dsdt_urt1`` provides the ACPI contents for the
Bluetooth UART device when it is passed through to the UOS. It provides the
virtual PCI device/function number as ``_ADR``, along with other descriptions
needed for Bluetooth UART enumeration.
.. code-block:: c
static void
write_dsdt_urt1(struct pci_vdev *dev)
{
printf("write virt-%x:%x.%x in dsdt for URT1 @ 00:18.0\n",
dev->bus,
dev->slot,
dev->func);
dsdt_line("Device (URT1)");
dsdt_line("{");
dsdt_line(" Name (_ADR, 0x%04X%04X)", dev->slot, dev->func);
dsdt_line(" Name (_DDN, \"Intel(R) HS-UART Controller #1\")");
dsdt_line(" Name (_UID, One)");
dsdt_line(" Name (RBUF, ResourceTemplate ()");
dsdt_line(" {");
dsdt_line(" })");
dsdt_line(" Method (_CRS, 0, NotSerialized)");
dsdt_line(" {");
dsdt_line(" Return (RBUF)");
dsdt_line(" }");
dsdt_line("}");
}
This document introduces basic ACPI virtualization. Other topics, such as
power management virtualization, add more requirements for ACPI and
will be discussed in the power management documentation.


@@ -0,0 +1,948 @@
.. _APL_GVT-g-hld:
GVT-g high-level design
#######################
Introduction
************
Purpose of this Document
========================
This high-level design (HLD) document describes the usage requirements
and high level design for Intel® Graphics Virtualization Technology for
shared virtual :term:`GPU` technology (:term:`GVT-g`) on Apollo Lake-I
SoCs.
This document describes:
- The different GPU virtualization techniques
- GVT-g mediated pass-through
- High level design
- Key components
- GVT-g new architecture differentiation
Audience
========
This document is for developers, validation teams, architects and
maintainers of Intel® GVT-g for the Apollo Lake SoCs.
The reader should have some familiarity with the basic concepts of
system virtualization and Intel® processor graphics.
Reference Documents
===================
The following documents were used as references for this specification:
- Paper in USENIX ATC '14 - *Full GPU Virtualization Solution with
Mediated Pass-Through* - https://www.usenix.org/node/183932
- Hardware Specification - PRMs -
https://01.org/linuxgraphics/documentation/hardware-specification-prms
Background
**********
Intel® GVT-g is an enabling technology in emerging graphics
virtualization scenarios. It adopts a full GPU virtualization approach
based on mediated pass-through technology, to achieve good performance,
scalability and secure isolation among Virtual Machines (VMs). A virtual
GPU (vGPU), with full GPU features, is presented to each VM so that a
native graphics driver can run directly inside a VM.
Intel® GVT-g technology for Apollo Lake (APL) has been implemented in
open source hypervisors or Virtual Machine Monitors (VMMs):
- Intel® GVT-g for ACRN, also known as "AcrnGT"
- Intel® GVT-g for KVM, also known as "KVMGT"
- Intel® GVT-g for Xen, also known as "XenGT"
The core vGPU device model is released under BSD/MIT dual license, so it
can be reused in other proprietary hypervisors.
Intel has a portfolio of graphics virtualization technologies
(:term:`GVT-g`, :term:`GVT-d` and :term:`GVT-s`). GVT-d and GVT-s are
outside of the scope of this document.
This HLD applies to the Apollo Lake platform only. Support of other
hardware is outside the scope of this HLD.
Targeted Usages
===============
The main targeted usage of GVT-g is in automotive applications, such as:
- An Instrument cluster running in one domain
- An In Vehicle Infotainment (IVI) solution running in another domain
- Additional domains for specific purposes, such as Rear Seat
Entertainment or video camera capturing.
.. figure:: images/APL_GVT-g-ive-use-case.png
:width: 900px
:align: center
:name: ive-use-case
IVE Use Case
Existing Techniques
===================
A graphics device is no different from any other I/O device, with
respect to how the device I/O interface is virtualized. Therefore,
existing I/O virtualization techniques can be applied to graphics
virtualization. However, none of the existing techniques can meet the
general requirement of performance, scalability, and secure isolation
simultaneously. In this section, we review the pros and cons of each
technique in detail, enabling the audience to understand the rationale
behind the entire GVT-g effort.
Emulation
---------
A device can be emulated fully in software, including its I/O registers
and internal functional blocks. There would be no dependency on the
underlying hardware capability, therefore compatibility can be achieved
across platforms. However, due to the CPU emulation cost, this technique
is usually used for legacy devices, such as a keyboard, mouse, and VGA
card. There would be great complexity and extremely low performance to
fully emulate a modern accelerator, such as a GPU. It may be acceptable
for use in a simulation environment, but it is definitely not suitable
for production usage.
API Forwarding
--------------
API forwarding, or a split driver model, is another widely-used I/O
virtualization technology. It has been used in commercial virtualization
products, for example, VMware*, PCoIP*, and Microsoft* RemoteFX*.
It is a natural path when researchers study a new type of
I/O virtualization usage, for example, when GPGPU computing in VM was
initially proposed. Intel® GVT-s is based on this approach.
The architecture of API forwarding is shown in :numref:`api-forwarding`:
.. figure:: images/APL_GVT-g-api-forwarding.png
:width: 400px
:align: center
:name: api-forwarding
API Forwarding
A frontend driver is employed to forward high-level API calls (OpenGL,
DirectX, and so on) inside a VM to a Backend driver in the Hypervisor
for acceleration. The Backend may be using a different graphics stack,
so API translation between different graphics protocols may be required.
The Backend driver allocates a physical GPU resource for each VM,
behaving like a normal graphics application in a Hypervisor. Shared
memory may be used to reduce memory copying between the host and guest
graphic stacks.
API forwarding can bring hardware acceleration capability into a VM,
with other merits such as vendor independence and high density. However, it
also suffers from the following intrinsic limitations:
- Lagging features - Every new API version needs to be specifically
handled, which means a slow time-to-market (TTM) for supporting new
standards. For example, only DirectX9 is supported when DirectX11 is
already on the market. Also, there is a big gap in supporting media and
compute usages.
- Compatibility issues - A GPU is very complex, and consequently so are
high level graphics APIs. Different protocols are not 100% compatible
on every subtle API, so the customer can observe feature/quality loss
for specific applications.
- Maintenance burden - Grows as the number of supported protocols and
protocol versions increases.
- Performance overhead - Different API forwarding implementations
exhibit quite different performance, which gives rise to a need for a
fine-grained graphics tuning effort.
Direct Pass-Through
-------------------
"Direct pass-through" dedicates the GPU to a single VM, providing full
features and good performance, but at the cost of device sharing
capability among VMs. Only one VM at a time can use the hardware
acceleration capability of the GPU, which is a major limitation of this
technique. However, it is still a good approach to enable graphics
virtualization usages on Intel server platforms, as an intermediate
solution. Intel® GVT-d uses this mechanism.
.. figure:: images/APL_GVT-g-pass-through.png
:width: 400px
:align: center
:name: gvt-pass-through
Pass-Through
SR-IOV
------
Single Root IO Virtualization (SR-IOV) implements I/O virtualization
directly on a device. Multiple Virtual Functions (VFs) are implemented,
with each VF directly assignable to a VM.
Mediated Pass-Through
*********************
Intel® GVT-g achieves full GPU virtualization using a "mediated
pass-through" technique.
Concept
=======
Mediated pass-through allows a VM to access performance-critical I/O
resources (usually partitioned) directly, without intervention from the
hypervisor in most cases. Privileged operations from this VM are
trapped-and-emulated to provide secure isolation among VMs.
.. figure:: images/APL_GVT-g-mediated-pass-through.png
:width: 400px
:align: center
:name: mediated-pass-through
Mediated Pass-Through
The Hypervisor must ensure that no vulnerability is exposed when
assigning performance-critical resources to each VM. When a
performance-critical resource cannot be partitioned, a scheduler must be
implemented (either in software or hardware) to allow time-based sharing
among multiple VMs. In this case, the device must allow the hypervisor
to save and restore the hardware state associated with the shared resource,
either through direct I/O register reads and writes (when there is no
software-invisible state) or through a device-specific context save and
restore mechanism (when there is software-invisible state).
Examples of performance-critical I/O resources include the following:
.. figure:: images/APL_GVT-g-perf-critical.png
:width: 800px
:align: center
:name: perf-critical
Performance-Critical I/O Resources
The key to implementing mediated pass-through for a specific device is
to define the right policy for various I/O resources.
Virtualization Policies for GPU Resources
=========================================
:numref:`graphics-arch` shows how Intel Processor Graphics works at a high level.
Software drivers write commands into a command buffer through the CPU.
The Render Engine in the GPU fetches these commands and executes them.
The Display Engine fetches pixel data from the Frame Buffer and sends
them to the external monitors for display.
.. figure:: images/APL_GVT-g-graphics-arch.png
:width: 400px
:align: center
:name: graphics-arch
Architecture of Intel Processor Graphics
This architecture abstraction applies to most modern GPUs, but may
differ in how graphics memory is implemented. Intel Processor Graphics
uses system memory as graphics memory. System memory can be mapped into
multiple virtual address spaces by GPU page tables. A 4 GB global
virtual address space called "global graphics memory", accessible from
both the GPU and CPU, is mapped through a global page table. Local
graphics memory spaces are supported in the form of multiple 4 GB local
virtual address spaces, but access to them is limited to the Render
Engine through local page tables. Global graphics memory is mostly used
for the Frame Buffer and also serves as the Command Buffer. Massive data
accesses are made to local graphics memory when hardware acceleration is
in progress. Other GPUs have a similar page table mechanism accompanying
the on-die memory.
The CPU programs the GPU through GPU-specific commands, shown in
:numref:`graphics-arch`, using a producer-consumer model. The graphics
driver programs GPU commands into the Command Buffer, including primary
buffer and batch buffer, according to the high-level programming APIs,
such as OpenGL* or DirectX*. Then, the GPU fetches and executes the
commands. The primary buffer (called a ring buffer) may chain other
batch buffers together. The terms primary buffer and ring buffer are
used interchangeably hereafter. The batch buffer is used to convey the
majority of the commands (up to ~98% of them) per programming model. A
register tuple (head, tail) is used to control the ring buffer. The CPU
submits the commands to the GPU by updating the tail, while the GPU
fetches commands from the head, and then notifies the CPU by updating
the head, after the commands have finished execution. Therefore, when
the GPU has executed all commands from the ring buffer, the head and
tail pointers are the same.
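The head/tail protocol can be summarized with a small sketch (illustrative
types and names, not driver code): the CPU advances the tail as it submits
commands, the GPU advances the head as it completes them, and the ring is
idle when the two meet.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Illustrative model of the ring buffer register tuple; not actual
    * driver code. size must be a power of two for the mask below.
    */
   struct ring {
           uint32_t head;   /* advanced by the GPU as commands complete */
           uint32_t tail;   /* advanced by the CPU as commands are submitted */
           uint32_t size;   /* ring buffer size in bytes */
   };

   static bool ring_is_idle(const struct ring *r)
   {
           /* the GPU has executed everything the CPU submitted */
           return r->head == r->tail;
   }

   static uint32_t ring_free_space(const struct ring *r)
   {
           /* bytes the CPU may still fill before wrapping into unexecuted
            * commands; a few bytes are reserved so that a completely full
            * ring is distinguishable from an empty one */
           return (r->head - r->tail - 8) & (r->size - 1);
   }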
Having introduced the GPU architecture abstraction, it is important for
us to understand how real-world graphics applications use the GPU
hardware so that we can virtualize it in VMs efficiently. To do so, we
characterized, for some representative GPU-intensive 3D workloads (the
Phoronix Test Suite), the usages of the four critical interfaces:
1) the Frame Buffer,
2) the Command Buffer,
3) the GPU Page Table Entries (PTEs), which carry the GPU page tables, and
4) the I/O registers, including Memory-Mapped I/O (MMIO) registers,
Port I/O (PIO) registers, and PCI configuration space registers
for internal state.
:numref:`access-patterns` shows the average access frequency of running
Phoronix 3D workloads on the four interfaces.
The Frame Buffer and Command Buffer exhibit the most
performance-critical resources, as shown in :numref:`access-patterns`.
When the applications are being loaded, lots of source vertices and
pixels are written by the CPU, so the Frame Buffer accesses occur in the
range of hundreds of thousands per second. Then at run-time, the CPU
programs the GPU through the commands, to render the Frame Buffer, so
the Command Buffer accesses become the largest group, also in the
hundreds of thousands per second. PTE and I/O accesses are minor in both
load and run-time phases ranging in tens of thousands per second.
.. figure:: images/APL_GVT-g-access-patterns.png
:width: 400px
:align: center
:name: access-patterns
Access Patterns of Running 3D Workloads
High Level Architecture
***********************
:numref:`gvt-arch` shows the overall architecture of GVT-g, based on the
ACRN hypervisor, with SOS as the privileged VM, and multiple user
guests. A GVT-g device model, working with the ACRN hypervisor,
implements the policies of trap and pass-through. Each guest runs the
native graphics driver and can directly access performance-critical
resources: the Frame Buffer and Command Buffer, with resource
partitioning (as presented later). To protect privileged resources, that
is, the I/O registers and PTEs, corresponding accesses from the graphics
driver in user VMs are trapped and forwarded to the GVT device model in
SOS for emulation. The device model leverages i915 interfaces to access
the physical GPU.
In addition, the device model implements a GPU scheduler that runs
concurrently with the CPU scheduler in ACRN to share the physical GPU
timeslot among the VMs. GVT-g uses the physical GPU to directly execute
all the commands submitted from a VM, so it avoids the complexity of
emulating the Render Engine, which is the most complex part of the GPU.
In the meantime, the resource pass-through of both the Frame Buffer and
Command Buffer minimizes the hypervisor's intervention in CPU accesses,
while the GPU scheduler guarantees every VM a quantum time-slice for
direct GPU execution. With that, GVT-g can achieve near-native
performance for a VM workload.
In :numref:`gvt-arch`, the yellow GVT device model works as a client on
top of an i915 driver in the SOS. It has a generic Mediated Pass-Through
(MPT) interface, compatible with all types of hypervisors. For ACRN,
some extra development work is needed for such MPT interfaces. For
example, we need some changes in ACRN-DM to make ACRN compatible with
the MPT framework. The vGPU lifecycle is the same as the lifecycle of
the guest VM creation through ACRN-DM. They interact through sysfs,
exposed by the GVT device model.
.. figure:: images/APL_GVT-g-arch.png
:width: 600px
:align: center
:name: gvt-arch
AcrnGT High-level Architecture
Key Techniques
**************
vGPU Device Model
=================
The vGPU Device model is the main component because it constructs the
vGPU instance for each guest to satisfy every GPU request from the guest
and gives the corresponding result back to the guest.
The vGPU Device Model provides the basic framework to do
trap-and-emulation, including MMIO virtualization, interrupt
virtualization, and display virtualization. It also handles and
processes all the requests internally (such as command scan and
shadow), schedules them in the proper manner, and finally submits them to
the SOS i915 driver.
.. figure:: images/APL_GVT-g-DM.png
:width: 800px
:align: center
:name: GVT-DM
GVT-g Device Model
MMIO Virtualization
-------------------
Intel Processor Graphics implements two PCI MMIO BARs:
- **GTTMMADR BAR**: Combines both :term:`GGTT` modification range and Memory
Mapped IO range. It is 16 MB on :term:`BDW`, with 2 MB used by MMIO, 6 MB
reserved and 8 MB allocated to GGTT. GGTT starts from
:term:`GTTMMADR` + 8 MB. In this section, we focus on virtualization of
the MMIO range, discussing GGTT virtualization later.
- **GMADR BAR**: As the PCI aperture is used by the CPU to access tiled
graphics memory, GVT-g partitions this aperture range among VMs for
performance reasons.
A 2 MB virtual MMIO structure is allocated per vGPU instance.
All the virtual MMIO registers are emulated as simple in-memory
read-write; that is, the guest driver will read back the same value that
was programmed earlier. A common emulation handler (for example,
``intel_gvt_emulate_read/write``) is enough to handle such general
emulation requirements. However, some registers need to be emulated with
specific logic: for example, they may be affected by a change of other
states, or require additional auditing or translation when the virtual
register is updated. Therefore, a specific emulation handler must be
installed for those special registers, as sketched below.
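In a hedged sketch (the types and helper names such as ``struct vgpu`` and
``find_special_handler`` are illustrative, not actual GVT-g symbols), the
dispatch looks like this:

.. code-block:: c

   #include <stdint.h>
   #include <string.h>

   struct vgpu {
           uint8_t *mmio_block;    /* the per-vGPU 2 MB virtual MMIO image */
   };

   typedef int (*mmio_handler_t)(struct vgpu *vgpu, uint32_t offset,
                                 void *data, int len);

   /* Returns a specific handler for registers that need extra logic
    * (audit, translation, side effects) or NULL for plain registers.
    * The real device model keeps a lookup table built at init time.
    */
   static mmio_handler_t find_special_handler(uint32_t offset)
   {
           (void)offset;
           return NULL;    /* table omitted in this sketch */
   }

   int emulate_mmio_read(struct vgpu *vgpu, uint32_t offset,
                         void *data, int len)
   {
           mmio_handler_t handler = find_special_handler(offset);

           if (handler)
                   return handler(vgpu, offset, data, len);

           /* default: simple in-memory read-back of the virtual register */
           memcpy(data, vgpu->mmio_block + offset, len);
           return 0;
   }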
The graphics driver may have assumptions about the initial device state,
which is the state at the point when the BIOS hands control to the OS. To
meet the driver's expectation, we need to provide an initial vGPU state
that a driver would observe on a pGPU. So the host graphics driver is
expected to generate a snapshot of the physical GPU state, which it does
before the guest driver's initialization. This snapshot is used as the
initial vGPU state by the device model.
PCI Configuration Space Virtualization
--------------------------------------
PCI configuration space also needs to be virtualized in the device
model. Different implementations may choose to implement the logic
within the vGPU device model or in the default system device model (for
example, ACRN-DM). GVT-g emulates the logic in the device model.
Some information is vital for the vGPU device model, including:
Guest PCI BAR, Guest PCI MSI, and Base of ACPI OpRegion.
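An illustrative view of the state the vGPU device model keeps for this
purpose might look like the sketch below (the structure and field names are
hypothetical, not the actual GVT-g definitions):

.. code-block:: c

   #include <stdint.h>

   /* Illustrative per-vGPU PCI configuration state (hypothetical names). */
   struct vgpu_pci_cfg_sketch {
           uint64_t gttmmadr_gpa;   /* guest value written to the GTTMMADR BAR */
           uint64_t gmadr_gpa;      /* guest value written to the GMADR BAR */
           uint64_t msi_addr;       /* guest MSI address register */
           uint16_t msi_data;       /* guest MSI data register */
           uint64_t opregion_gpa;   /* guest base of the ACPI OpRegion */
   };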
Legacy VGA Port I/O Virtualization
----------------------------------
Legacy VGA is not supported in the vGPU device model. We rely on the
default device model (for example, :term:`QEMU`) to provide legacy VGA
emulation, which means either ISA VGA emulation or
PCI VGA emulation.
Interrupt Virtualization
------------------------
The GVT device model does not touch the hardware interrupt in the new
architecture, since it is hard to combine the interrupt controlling
logic between the virtual device model and the host driver. To prevent
architectural changes in the host driver, the host GPU interrupt does
not go to the virtual device model and the virtual device model has to
handle the GPU interrupt virtualization by itself. Virtual GPU
interrupts are categorized into three types:
- Periodic GPU interrupts are emulated by timers. However, a notable
exception to this is the VBlank interrupt. Due to the demands of user
space compositors, such as Wayland, which requires a flip done event
to be synchronized with a VBlank, this interrupt is forwarded from
SOS to UOS when SOS receives it from the hardware.
- Event-based GPU interrupts are emulated by the emulation logic. For
example, AUX Channel Interrupt.
- GPU command interrupts are emulated by a command parser and workload
dispatcher. The command parser marks out which GPU command interrupts
are generated during the command execution and the workload
dispatcher injects those interrupts into the VM after the workload is
finished.
.. figure:: images/APL_GVT-g-interrupt-virt.png
:width: 400px
:align: center
:name: interrupt-virt
Interrupt Virtualization
Workload Scheduler
------------------
The scheduling policy and workload scheduler are decoupled for
scalability reasons. For example, a future QoS enhancement will only
impact the scheduling policy, while an i915 interface change or HW submission
interface change (from execlist to :term:`GuC`) will only need workload
scheduler updates.
The scheduling policy framework is the core of the vGPU workload
scheduling system. It controls all of the scheduling actions and
provides the developer with a generic framework for easy development of
scheduling policies. The scheduling policy framework controls the work
scheduling process without caring about how the workload is dispatched
or completed. All the detailed workload dispatching is hidden in the
workload scheduler, which is the actual executor of a vGPU workload.
The workload scheduler handles everything about one vGPU workload. Each
hardware ring is backed by one workload scheduler kernel thread. The
workload scheduler picks the workload from current vGPU workload queue
and communicates with the virtual HW submission interface to emulate the
"schedule-in" status for the vGPU. It performs context shadow, Command
Buffer scan and shadow, PPGTT page table pin/unpin/out-of-sync, before
submitting this workload to the host i915 driver. When the vGPU workload
is completed, the workload scheduler asks the virtual HW submission
interface to emulate the "schedule-out" status for the vGPU. The VM
graphics driver then knows that a GPU workload is finished.
.. figure:: images/APL_GVT-g-scheduling.png
:width: 500px
:align: center
:name: scheduling
GVT-g Scheduling Framework
Workload Submission Path
------------------------
On Intel Processor Graphics before Broadwell, software submits the
workload using the legacy ring buffer mode, which is no longer supported
by the GVT-g virtual device model. A new HW submission interface named
"Execlist" was introduced with Broadwell. With the new HW submission
interface, software can achieve better programmability and easier
context management. In Intel GVT-g, the vGPU submits the workload
through the virtual HW submission interface. Each workload in submission
will be represented as an ``intel_vgpu_workload`` data structure, a vGPU
workload, which will be put on a per-vGPU and per-engine workload queue
later after performing a few basic checks and verifications.
.. figure:: images/APL_GVT-g-workload.png
:width: 800px
:align: center
:name: workload
GVT-g Workload Submission
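A simplified, illustrative view of what such a workload descriptor carries
is sketched below; the field names here are not the exact
``intel_vgpu_workload`` definition in the GVT-g kernel sources.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Simplified sketch of a vGPU workload descriptor; see
    * intel_vgpu_workload in the i915/GVT-g code for the real layout.
    */
   struct vgpu_workload_sketch {
           int      engine_id;            /* HW ring/engine this targets */
           uint64_t ctx_desc;             /* guest execlist context descriptor */
           uint32_t rb_start, rb_ctl;     /* guest ring buffer base and control */
           uint32_t rb_head, rb_tail;     /* guest ring buffer head and tail */
           bool     dispatched;           /* submitted to the SOS i915 yet? */
           int      status;               /* completion status reported back */
   };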
Display Virtualization
----------------------
GVT-g reuses the i915 graphics driver in the SOS to initialize the Display
Engine, and then manages the Display Engine to show different VM frame
buffers. When two vGPUs have the same resolution, only the frame buffer
locations are switched.
.. figure:: images/APL_GVT-g-display-virt.png
:width: 800px
:align: center
:name: display-virt
Display Virtualization
Direct Display Model
--------------------
.. figure:: images/APL_GVT-g-direct-display.png
:width: 600px
:align: center
:name: direct-display
Direct Display Model
A typical automotive use case is where there are two displays in the car
and each one needs to show one domain's content, with the two domains
being the Instrument cluster and the In Vehicle Infotainment (IVI). As
shown in :numref:`direct-display`, this can be accomplished through the direct
display model of GVT-g, where the SOS and UOS are each assigned all HW
planes of two different pipes. GVT-g has a concept of display owner on a
per HW plane basis. If it determines that a particular domain is the
owner of a HW plane, then it allows the domain's MMIO register write to
flip a frame buffer to that plane to go through to the HW. Otherwise,
such writes are blocked by GVT-g.
Indirect Display Model
----------------------
.. figure:: images/APL_GVT-g-indirect-display.png
:width: 600px
:align: center
:name: indirect-display
Indirect Display Model
For security or fastboot reasons, it may be determined that the UOS is
either not allowed to display its content directly on the HW or it may
be too late before it boots up and displays its content. In such a
scenario, the responsibility of displaying content on all displays lies
with the SOS. One of the use cases that can be realized is to display the
entire frame buffer of the UOS on a secondary display. GVT-g allows for this
model by first trapping all MMIO writes by the UOS to the HW. A proxy
application can then capture the address in GGTT where the UOS has written
its frame buffer and, with the help of the Hypervisor and the SOS's i915
driver, can convert the Guest Physical Addresses (GPAs) into Host
Physical Addresses (HPAs) before making a texture source or EGL image
out of the frame buffer and then either post processing it further or
simply displaying it on a HW plane of the secondary display.
GGTT-Based Surface Sharing
--------------------------
One of the major automotive use cases is called "surface sharing". This
use case requires that the SOS accesses an individual surface or a set of
surfaces from the UOS without having to access the entire frame buffer of
the UOS. Unlike the previous two models, where the UOS did not have to do
anything to show its content and therefore a completely unmodified UOS
could continue to run, this model requires changes to the UOS.
This model can be considered an extension of the indirect display model.
Under the indirect display model, the UOS's frame buffer was temporarily
pinned by it in video memory, accessible through the Global Graphics
Translation Table. This GGTT-based surface sharing model takes this a
step further by having the compositor of the UOS temporarily pin all
application buffers into the GGTT. It then also requires the compositor to
create a metadata table with relevant surface information such as width,
height, and GGTT offset, and flip that in lieu of the frame buffer.
In the SOS, the proxy application knows that the GGTT offset has been
flipped, maps it, and through it can access the GGTT offset of an
application that it wants to access. It is worth mentioning that in this
model, UOS applications do not require any changes; only the
compositor, Mesa, and the i915 driver have to be modified.
This model has a major benefit and a major limitation. The
benefit is that since it builds on top of the indirect display model,
there are no special drivers necessary for it on either SOS or UOS.
Therefore, any Real Time Operating System (RTOS) that uses
this model can simply do so without having to implement a driver, the
infrastructure for which may not be present in its operating system.
The limitation of this model is that video memory dedicated for a UOS is
generally limited to a couple of hundred MBs. This can easily be
exhausted by a few application buffers so the number and size of buffers
is limited. Since it is not a highly-scalable model, in general, Intel
recommends the Hyper DMA buffer sharing model, described next.
Hyper DMA Buffer Sharing
------------------------
.. figure:: images/APL_GVT-g-hyper-dma.png
:width: 800px
:align: center
:name: hyper-dma
Hyper DMA Buffer Design
Another approach to surface sharing is Hyper DMA Buffer sharing. This
model extends the Linux DMA buffer sharing mechanism where one driver is
able to share its pages with another driver within one domain.
Application buffers are backed by i915 Graphics Execution Manager
Buffer Objects (GEM BOs). As in GGTT surface
sharing, this model also requires compositor changes. The compositor of
UOS requests i915 to export these application GEM BOs and then passes
them on to a special driver called the Hyper DMA Buf exporter whose job
is to create a scatter gather list of pages mapped by PDEs and PTEs and
export a Hyper DMA Buf ID back to the compositor.
The compositor then shares this Hyper DMA Buf ID with the SOS's Hyper DMA
Buf importer driver which then maps the memory represented by this ID in
the SOS. A proxy application in the SOS can then provide this ID to the
SOS i915 driver, which can create its own GEM BO. Finally, the application
can use it as an EGL image and do any post processing required before
either providing it to the SOS compositor or directly flipping it on a
HW plane in the compositor's absence.
This model is highly scalable and can be used to share up to 4 GB worth
of pages. It is also not limited to sharing graphics buffers; other
buffers, such as those for the IPU, can also be shared this way. However, it
does require that the SOS port the Hyper DMA Buffer importer driver. Also,
the SOS OS must comprehend and implement the DMA buffer sharing model.
For detailed information about this model, please refer to the `Linux
HYPER_DMABUF Driver High Level Design
<https://github.com/downor/linux_hyper_dmabuf/blob/hyper_dmabuf_integration_v4/Documentation/hyper-dmabuf-sharing.txt>`_.
Plane-Based Domain Ownership
----------------------------
.. figure:: images/APL_GVT-g-plane-based.png
:width: 600px
:align: center
:name: plane-based
Plane-Based Domain Ownership
Yet another mechanism for showing content of both the SOS and UOS on the
same physical display is called plane-based domain ownership. Under this
model, both the SOS and UOS are provided a set of HW planes that they can
flip their contents on to. Since each domain provides its content, there
is no need for any extra composition to be done through the SOS. The display
controller handles alpha blending contents of different domains on a
single pipe. This avoids additional complexity in both the SOS and the UOS
SW stacks.
It is important to provide only specific planes and have them statically
assigned to different Domains. To achieve this, the i915 driver of both
domains is provided a command line parameter that specifies the exact
planes that this domain has access to. The i915 driver then enumerates
only those HW planes and exposes them to its compositor. It is then left
to the compositor configuration to use these planes appropriately and
show the correct content on them. No other changes are necessary.
While the biggest benefit of this model is that it is extremely simple
and quick to implement, it also has some drawbacks. First, since each domain
is responsible for showing the content on the screen, there is no
control of the UOS by the SOS. If the UOS is untrusted, this could
potentially cause some unwanted content to be displayed. Also, there is
no post processing capability, except that provided by the display
controller (for example, scaling, rotation, and so on). So each domain
must provide finished buffers with the expectation that alpha blending
with another domain will not cause any corruption or unwanted artifacts.
Graphics Memory Virtualization
==============================
To achieve near-to-native graphics performance, GVT-g passes through the
performance-critical operations, such as Frame Buffer and Command Buffer
accesses, from the VM. For the global graphics memory space, GVT-g uses graphics
memory resource partitioning and an address space ballooning mechanism.
For local graphics memory spaces, GVT-g implements per-VM local graphics
memory through a render context switch because local graphics memory is
only accessible by the GPU.
Global Graphics Memory
----------------------
Graphics Memory Resource Partitioning
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
GVT-g partitions the global graphics memory among VMs. Splitting the
CPU/GPU scheduling mechanism requires that the global graphics memory of
different VMs can be accessed by the CPU and the GPU simultaneously.
Consequently, GVT-g must, at any time, present each VM with its own
resources, leading to the resource partitioning approach for global
graphics memory shown in :numref:`mem-part`.
.. figure:: images/APL_GVT-g-mem-part.png
:width: 800px
:align: center
:name: mem-part
Memory Partition and Ballooning
The performance impact of reduced global graphics memory resource
due to memory partitioning is very limited according to various test
results.
Address Space Ballooning
%%%%%%%%%%%%%%%%%%%%%%%%
The address space ballooning technique is introduced to eliminate the
address translation overhead, shown in :numref:`mem-part`. GVT-g exposes the
partitioning information to the VM graphics driver through the PVINFO
MMIO window. The graphics driver marks the other VMs' regions as
'ballooned' and reserves them in its graphics memory allocator so they
are never used. Under this design, the guest view of global graphics
memory space is exactly the same as the host view, and the addresses
programmed by the driver (guest physical addresses) can be used directly
by the hardware. Address space ballooning is different from traditional
memory ballooning techniques. Memory ballooning is for memory usage
control concerning the number of ballooned pages, while address space
ballooning is to balloon special memory address ranges.
Another benefit of address space ballooning is that there is no address
translation overhead as we use the guest Command Buffer for direct GPU
execution.
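Conceptually, the partitioning information exposed through the PVINFO
window amounts to something like the following (an illustrative layout,
not the exact PVINFO register definition): the base and size of the VM's
own mappable (CPU-visible aperture) and non-mappable global graphics
memory ranges.

.. code-block:: c

   #include <stdint.h>

   /* Illustrative view of the partition information a vGPU driver reads
    * through the PVINFO MMIO window; not the exact i915 layout.
    */
   struct gvt_gm_partition_sketch {
           uint32_t mappable_base;      /* aperture-visible global GM: base */
           uint32_t mappable_size;      /* aperture-visible global GM: size */
           uint32_t nonmappable_base;   /* GPU-only global GM: base */
           uint32_t nonmappable_size;   /* GPU-only global GM: size */
   };

The guest driver reserves everything outside its own ranges as ballooned,
so its allocator never hands out those addresses, and the addresses it does
use are valid host GGTT offsets as-is.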
Per-VM Local Graphics Memory
----------------------------
GVT-g allows each VM to use the full local graphics memory spaces of its
own, similar to the virtual address spaces on the CPU. The local
graphics memory spaces are only visible to the Render Engine in the GPU.
Therefore, any valid local graphics memory address, programmed by a VM,
can be used directly by the GPU. The GVT-g device model switches the
local graphics memory spaces, between VMs, when switching render
ownership.
GPU Page Table Virtualization
=============================
Shared Shadow GGTT
------------------
To achieve resource partitioning and address space ballooning, GVT-g
implements a shared shadow global page table for all VMs. Each VM has
its own guest global page table to translate the graphics memory page
number to the Guest memory Page Number (GPN). The shadow global page
table then translates from the graphics memory page number to the
Host memory Page Number (HPN).
The shared shadow global page table maintains the translations for all
VMs to support concurrent accesses from the CPU and GPU.
Therefore, GVT-g implements a single, shared shadow global page table by
trapping guest PTE updates, as shown in :numref:`shared-shadow`. The
global page table, in MMIO space, has 1024K PTE entries, each pointing
to a 4 KB system memory page, so the global page table overall creates a
4 GB global graphics memory space. GVT-g audits the guest PTE values
according to the address space ballooning information before updating
the shadow PTE entries.
.. figure:: images/APL_GVT-g-shared-shadow.png
:width: 600px
:align: center
:name: shared-shadow
Shared Shadow Global Page Table
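The trap path can be pictured with the following sketch (parameter and
helper names are illustrative, not the actual GVT-g code): on a trapped
guest GGTT PTE write, audit that the entry falls inside the guest's own
partition, translate the guest page number to a host page number, and
only then update the shared shadow PTE.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Illustrative sketch of a shadow GGTT update on a trapped guest PTE
    * write. shadow_ggtt is the single shared shadow global page table;
    * in_partition is the result of the ballooning audit; translate()
    * converts a guest page number into a host page number.
    */
   void handle_ggtt_pte_write(uint64_t *shadow_ggtt, uint32_t index,
                              uint64_t guest_pte, bool in_partition,
                              uint64_t (*translate)(uint64_t gpn))
   {
           uint64_t gpn, hpn;

           if (!in_partition)
                   return;                 /* audit failed: drop the write */

           gpn = guest_pte >> 12;          /* guest page frame number */
           hpn = translate(gpn);           /* host page frame number */

           /* keep the low flag bits, substitute the host page frame */
           shadow_ggtt[index] = (hpn << 12) | (guest_pte & 0xfff);
   }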
Per-VM Shadow PPGTT
-------------------
To support local graphics memory access pass-through, GVT-g implements
per-VM shadow local page tables. The local graphics memory is only
accessible from the Render Engine. The local page tables have two-level
paging structures, as shown in :numref:`per-vm-shadow`.
The first level, Page Directory Entries (PDEs), located in the global
page table, points to the second level, Page Table Entries (PTEs), in
system memory, so guest accesses to the PDEs are trapped and emulated
through the implementation of the shared shadow global page table.
GVT-g also write-protects a list of guest PTE pages for each VM. The
GVT-g device model synchronizes the shadow page with the guest page, at
the time of write-protection page fault, and switches the shadow local
page tables at render context switches.
.. figure:: images/APL_GVT-g-per-vm-shadow.png
:width: 800px
:align: center
:name: per-vm-shadow
Per-VM Shadow PPGTT
Prioritized Rendering and Preemption
====================================
Different Schedulers and Their Roles
------------------------------------
.. figure:: images/APL_GVT-g-scheduling-policy.png
:width: 800px
:align: center
:name: scheduling-policy
Scheduling Policy
In the system, there are three different schedulers for the GPU:
- i915 UOS scheduler
- Mediator GVT scheduler
- i915 SOS scheduler
Since the UOS always uses the host-based command submission (ELSP) model
and never accesses the GPU or the Graphics Micro Controller (GuC)
directly, its scheduler cannot do any preemption by itself.
The i915 scheduler does ensure batch buffers are
submitted in dependency order, that is, if a compositor had to wait for
an application buffer to finish before its workload can be submitted to
the GPU, then the i915 scheduler of the UOS ensures that this happens.
The UOS assumes that by submitting its batch buffers to the Execlist
Submission Port (ELSP), the GPU will start working on them. However,
the MMIO write to the ELSP is captured by the Hypervisor, which forwards
these requests to the GVT module. GVT then creates a shadow context
based on this batch buffer and submits the shadow context to the SOS
i915 driver.
Actual execution, however, depends on a second scheduler called the GVT
scheduler. This scheduler is time based and uses a round-robin algorithm
to provide a specific time for each UOS to submit its workload when it
is considered a "render owner". The workloads of UOSs that are not
render owners during a specific time period end up waiting in the
virtual GPU context until the GVT scheduler makes them render owners.
The GVT shadow context submits only one workload at
a time, and once the workload is finished by the GPU, it copies any
context state back to the UOS and sends the appropriate interrupts before
picking up any other workloads from either this UOS or another one. This
also implies that this scheduler does not do any preemption of
workloads.
Finally, there is the i915 scheduler in the SOS. This scheduler uses the
GuC or ELSP to do command submission of SOS local content as well as any
content that GVT is submitting to it on behalf of the UOSs. This
scheduler uses GuC or ELSP to preempt workloads. GuC has four different
priority queues, but the SOS i915 driver uses only two of them. One of
them is considered high priority and the other is normal priority with a
GuC rule being that any command submitted on the high priority queue
would immediately try to preempt any workload submitted on the normal
priority queue. For ELSP submission, the i915 driver will submit a preempt
context to preempt the currently running context and then wait for the GPU
engine to be idle.
While the identification of workloads to be preempted is decided by
customizable scheduling policies, once a candidate for preemption is
identified, the i915 scheduler simply submits a preemption request to
the GuC high-priority queue. Based on the HW's ability to preempt (on an
Apollo Lake SoC, 3D workload is preemptible on a 3D primitive level with
some exceptions), the currently executing workload is saved and
preempted. The GuC informs the driver of the preemption event using an
interrupt. After handling the interrupt, the driver submits the
high-priority workload through the normal priority GuC queue. As such,
the normal priority GuC queue is used for actual execbuf submission most
of the time with the high-priority GuC queue only being used for the
preemption of lower-priority workload.
Scheduling policies are customizable and left to customers to change if
they are not satisfied with the built-in i915 driver policy, where all
workloads of the SOS are considered higher priority than those of the
UOS. This policy can be enforced through an SOS i915 kernel command line
parameter, and can replace the default in-order command submission (no
preemption) policy.
AcrnGT
*******
ACRN is a flexible, lightweight reference hypervisor, built with
real-time and safety-criticality in mind, optimized to streamline
embedded development through an open source platform.
AcrnGT is the GVT-g implementation on the ACRN hypervisor. It adapts
the MPT interface of GVT-g onto ACRN by using the kernel APIs provided
by ACRN.
:numref:`full-pic` shows the full architecture of AcrnGT with a Linux Guest
OS and an Android Guest OS.
.. figure:: images/APL_GVT-g-full-pic.png
:width: 800px
:align: center
:name: full-pic
Full picture of the AcrnGT
AcrnGT in kernel
=================
The AcrnGT module in the SOS kernel acts as an adaptation layer connecting
GVT-g in the i915 driver, the VHM module, and the ACRN-DM user space
application:
- The AcrnGT module implements the MPT interface of GVT-g to provide
services to it, including setting and unsetting trap areas, and setting
and unsetting write-protection pages.
- It calls the VHM APIs provided by the ACRN VHM module in the SOS
kernel, to eventually call into the routines provided by ACRN
hypervisor through hyper-calls.
- It provides user space interfaces through ``sysfs`` to the user space
ACRN-DM, so that DM can manage the lifecycle of the virtual GPUs.
AcrnGT in DM
=============
To emulate a PCI device to a Guest, we need an AcrnGT sub-module in the
ACRN-DM. This sub-module is responsible for:
- registering the virtual GPU device to the PCI device tree presented to
the guest;
- registering the MMIO resources to ACRN-DM so that it can reserve
resources in the ACPI table;
- managing the lifecycle of the virtual GPU device, such as creation,
destruction, and resetting according to the state of the virtual
machine.


@@ -0,0 +1,10 @@
.. _hld-devicemodel:
Device Model high-level design
##############################
.. toctree::
:maxdepth: 1
ACPI virtualization <acpi-virt>


@@ -0,0 +1,11 @@
.. _hld-emulated-devices:
Emulated Devices high-level design
##################################
.. toctree::
:maxdepth: 1
GVT-g GPU Virtualization <hld-APL_GVT-g>
UART virtualization <uart-virt-hld>
Watchdog virtualization <watchdog-hld>


@@ -0,0 +1,11 @@
.. _hld-hypervisor:
Hypervisor high-level design
############################
.. toctree::
:maxdepth: 1
Memory management <memmgt-hld>
Interrupt management <interrupt-hld>


@@ -0,0 +1,4 @@
.. _hld-overview:
Overview
########


@@ -0,0 +1,4 @@
.. _hld-power-management:
Power Management high-level design
##################################

File diff suppressed because it is too large


@@ -0,0 +1,4 @@
.. _hld-trace-log:
Tracing and Logging high-level design
#####################################


@@ -0,0 +1,499 @@
.. _hld-virtio-devices:
.. _virtio-hld:
Virtio devices high-level design
################################
The ACRN Hypervisor follows the `Virtual I/O Device (virtio)
specification
<http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html>`_ to
realize I/O virtualization for many performance-critical devices
supported in the ACRN project. Adopting the virtio specification lets us
reuse many frontend virtio drivers already available in a Linux-based
User OS, drastically reducing potential development effort for frontend
virtio drivers. To further reduce the development effort of backend
virtio drivers, the hypervisor provides the virtio backend service
(VBS) APIs, which make it very straightforward to implement a virtio
device in the hypervisor.
The virtio APIs can be divided into 3 groups: DM APIs, virtio backend
service (VBS) APIs, and virtqueue (VQ) APIs, as shown in
:numref:`be-interface`.
.. figure:: images/virtio-hld-image0.png
:width: 900px
:align: center
:name: be-interface
ACRN Virtio Backend Service Interface
- **DM APIs** are exported by the DM, and are mainly used during the
device initialization phase and runtime. The DM APIs also include
PCIe emulation APIs because each virtio device is a PCIe device in
the SOS and UOS.
- **VBS APIs** are mainly exported by the VBS and related modules.
Generally they are callbacks to be
registered into the DM.
- **VQ APIs** are used by a virtio backend device to access and parse
information from the shared memory between the frontend and backend
device drivers.
The virtio framework is the para-virtualization specification that ACRN
follows to implement I/O virtualization of performance-critical
devices such as audio, eAVB/TSN, IPU, and CSMU devices. This section gives
an overview of virtio history, motivation, and advantages, and then
highlights virtio key concepts. Second, this section describes
ACRN's virtio architectures and elaborates on the ACRN virtio APIs. Finally,
this section introduces the virtio devices currently supported
by ACRN.
Virtio introduction
*******************
Virtio is an abstraction layer over devices in a para-virtualized
hypervisor. Virtio was developed by Rusty Russell when he worked at IBM
research to support his lguest hypervisor in 2007, and it quickly became
the de-facto standard for KVM's para-virtualized I/O devices.
Virtio is very popular for virtual I/O devices because it provides a
straightforward, efficient, standard, and extensible mechanism, and
eliminates the need for boutique, per-environment, or per-OS mechanisms.
For example, rather than having a variety of device emulation
mechanisms, virtio provides a common frontend driver framework that
standardizes device interfaces, and increases code reuse across
different virtualization platforms.
Given the advantages of virtio, ACRN also follows the virtio
specification.
Key Concepts
************
To better understand virtio, especially its usage in ACRN, we'll
highlight several key virtio concepts important to ACRN:
Frontend virtio driver (FE)
Virtio adopts a frontend-backend architecture that enables a simple but
flexible framework for both frontend and backend virtio drivers. The FE
driver merely needs to offer services that configure the interface, pass
messages, produce requests, and kick the BE virtio driver. As a result, the FE
driver is easy to implement and the performance overhead of emulating
a device is eliminated.
Backend virtio driver (BE)
Similar to FE driver, the BE driver, running either in user-land or
kernel-land of the host OS, consumes requests from the FE driver and sends them
to the host native device driver. Once the requests are done by the host
native device driver, the BE driver notifies the FE driver that the
request is complete.
Note: to distinguish BE driver from host native device driver, the host
native device driver is called "native driver" in this document.
Straightforward: virtio devices as standard devices on existing buses
Instead of creating new device buses from scratch, virtio devices are
built on existing buses. This gives a straightforward way for both FE
and BE drivers to interact with each other. For example, the FE driver could
read/write registers of the device, and the virtual device could
interrupt the FE driver, on behalf of the BE driver, when something of
interest happens.
Currently virtio supports PCI/PCIe bus and MMIO bus. In ACRN, only
PCI/PCIe bus is supported, and all the virtio devices share the same
vendor ID 0x1AF4.
Note: For MMIO, the term "bus" is a bit of an overstatement since it is
basically just a few descriptors describing the devices.
Efficient: batching operation is encouraged
Batching operation and deferred notification are important to achieve
high-performance I/O, since notification between FE and BE driver
usually involves an expensive exit of the guest. Therefore batching
operations and notification suppression are highly encouraged where
possible. This enables an efficient implementation for
performance-critical devices.
Standard: virtqueue
All virtio devices share a standard ring buffer and descriptor
mechanism, called a virtqueue, shown in :numref:`virtqueue`. A virtqueue is a
queue of scatter-gather buffers. There are three important methods on
virtqueues:
- **add_buf** is for adding a request/response buffer in a virtqueue,
- **get_buf** is for getting a response/request in a virtqueue, and
- **kick** is for notifying the other side for a virtqueue to consume buffers.
The virtqueues are created in guest physical memory by the FE drivers.
BE drivers only need to parse the virtqueue structures to obtain
the requests and process them. How a virtqueue is organized is
specific to the Guest OS. In the Linux implementation of virtio, the
virtqueue is implemented as a ring buffer structure called vring.
In ACRN, the virtqueue APIs can be leveraged directly so that users
don't need to worry about the details of the virtqueue. (Refer to guest
OS for more details about the virtqueue implementation.)
.. figure:: images/virtio-hld-image2.png
:width: 900px
:align: center
:name: virtqueue
Virtqueue
Extensible: feature bits
A simple extensible feature negotiation mechanism exists for each
virtual device and its driver. Each virtual device could claim its
device specific features while the corresponding driver could respond to
the device with the subset of features the driver understands. The
feature mechanism enables forward and backward compatibility for the
virtual device and driver.
Virtio Device Modes
The virtio specification defines three modes of virtio devices:
a legacy mode device, a transitional mode device, and a modern mode
device. A legacy mode device is compliant with the virtio specification
version 0.95, a transitional mode device is compliant with both
the 0.95 and 1.0 spec versions, and a modern mode
device is compliant only with the version 1.0 specification.
In ACRN, all the virtio devices are transitional devices, meaning that
they should be compatible with both the 0.95 and 1.0 versions of the
virtio specification.
Virtio Device Discovery
Virtio devices are commonly implemented as PCI/PCIe devices. A
virtio device using virtio over PCI/PCIe bus must expose an interface to
the Guest OS that meets the PCI/PCIe specifications.
Conventionally, any PCI device with Vendor ID 0x1AF4,
PCI_VENDOR_ID_REDHAT_QUMRANET, and Device ID 0x1000 through 0x107F
inclusive is a virtio device. Among the Device IDs, the
legacy/transitional mode virtio devices occupy the first 64 IDs ranging
from 0x1000 to 0x103F, while the range 0x1040-0x107F belongs to
virtio modern devices. In addition, the Subsystem Vendor ID should
reflect the PCI/PCIe vendor ID of the environment, and the Subsystem
Device ID indicates which virtio device is supported by the device.
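As a concrete illustration of these ID ranges, the following minimal sketch
shows how a component could classify a PCI function as a virtio device and
distinguish legacy/transitional IDs from modern-only IDs. The helper below is
illustrative only and is not part of the ACRN code base.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   #define VIRTIO_VENDOR_ID 0x1AF4U   /* PCI_VENDOR_ID_REDHAT_QUMRANET */

   /* Illustrative helper: classify a PCI function by the ID ranges above. */
   static bool is_virtio_device(uint16_t vendor, uint16_t device, bool *modern)
   {
           if ((vendor != VIRTIO_VENDOR_ID) ||
               (device < 0x1000U) || (device > 0x107FU))
                   return false;

           /* 0x1000-0x103F: legacy/transitional; 0x1040-0x107F: modern only */
           *modern = (device >= 0x1040U);
           return true;
   }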
Virtio Frameworks
*****************
This section describes the overall architecture of virtio, and then
introduces the ACRN-specific implementations of the virtio framework.
Architecture
============
Virtio adopts a frontend-backend
architecture, as shown in :numref:`virtio-arch`. Basically the FE and BE driver
communicate with each other through shared memory, via the
virtqueues. The FE driver talks to the BE driver in the same way it
would talk to a real PCIe device. The BE driver handles requests
from the FE driver, and notifies the FE driver if the request has been
processed.
.. figure:: images/virtio-hld-image1.png
:width: 900px
:align: center
:name: virtio-arch
Virtio Architecture
In addition to virtio's frontend-backend architecture, both FE and BE
drivers follow a layered architecture, as shown in
:numref:`virtio-fe-be`. Each
side has three layers: transports, core models, and device types.
All virtio devices share the same virtio infrastructure, including
virtqueues, feature mechanisms, configuration space, and buses.
.. figure:: images/virtio-hld-image4.png
:width: 900px
:align: center
:name: virtio-fe-be
Virtio Frontend/Backend Layered Architecture
Virtio Framework Considerations
===============================
How to realize the virtio framework is specific to a
hypervisor implementation. In ACRN, the virtio framework implementations
can be classified into two types, virtio backend service in user-land
(VBS-U) and virtio backend service in kernel-land (VBS-K), according to
where the virtio backend service (VBS) is located. Although different in BE
drivers, both VBS-U and VBS-K share the same FE drivers. The reason
behind the two virtio implementations is to meet the requirement of
supporting a large number of diverse I/O devices in the ACRN project.
When developing a virtio BE device driver, the device owner should choose
carefully between VBS-U and VBS-K. Generally, VBS-U targets
non-performance-critical devices, but enables easy development and
debugging. VBS-K targets performance-critical devices.
The next two sections introduce ACRN's two implementations of the virtio
framework.
User-Land Virtio Framework
==========================
The architecture of ACRN user-land virtio framework (VBS-U) is shown in
:numref:`virtio-userland`.
The FE driver talks to the BE driver as if it were talking with a PCIe
device. This means for "control plane", the FE driver could poke device
registers through PIO or MMIO, and the device will interrupt the FE
driver when something happens. For "data plane", the communication
between the FE and BE driver is through shared memory, in the form of
virtqueues.
On the service OS side where the BE driver is located, there are several
key components in ACRN, including device model (DM), virtio and HV
service module (VHM), VBS-U, and user-level vring service API helpers.
DM bridges the FE driver and BE driver since each VBS-U module emulates
a PCIe virtio device. VHM bridges DM and the hypervisor by providing
remote memory map APIs and notification APIs. VBS-U accesses the
virtqueue through the user-level vring service API helpers.
.. figure:: images/virtio-hld-image3.png
:width: 900px
:align: center
:name: virtio-userland
ACRN User-Land Virtio Framework
Kernel-Land Virtio Framework
============================
The architecture of ACRN kernel-land virtio framework (VBS-K) is shown
in :numref:`virtio-kernelland`.
VBS-K provides acceleration for performance critical devices emulated by
VBS-U modules by handling the "data plane" of the devices directly in
the kernel. When VBS-K is enabled for a certain device, the kernel-land
vring service API helpers are used to access the virtqueues shared by
the FE driver. Compared to VBS-U, this eliminates the overhead of
copying data back and forth between user-land and kernel-land within the
service OS, but at the cost of extra implementation complexity in the BE
drivers.
Except for the differences mentioned above, VBS-K still relies on VBS-U
for feature negotiation between FE and BE drivers. This means the
"control plane" of the virtio device still remains in VBS-U. When
feature negotiation is done, as indicated by the FE driver setting an
indicative flag, the VBS-K module is initialized by VBS-U, after
which all request handling is offloaded to VBS-K in the kernel.
The FE driver is not aware of how the BE driver is implemented, either
in the VBS-U or VBS-K model. This saves engineering effort regarding FE
driver development.
.. figure:: images/virtio-hld-image6.png
:width: 900px
:align: center
:name: virtio-kernelland
ACRN Kernel-Land Virtio Framework
Virtio APIs
***********
This section provides details on the ACRN virtio APIs. As outlined previously,
the ACRN virtio APIs can be divided into three groups: DM APIs,
VBS APIs, and VQ APIs. The following sections will elaborate on
these APIs.
VBS-U Key Data Structures
=========================
The key data structures for VBS-U are listed below, and their
relationships are shown in :numref:`VBS-U-data`.
``struct pci_virtio_blk``
An example virtio device, such as virtio-blk.
``struct virtio_common``
A common component to any virtio device.
``struct virtio_ops``
Virtio specific operation functions for this type of virtio device.
``struct pci_vdev``
Instance of a virtual PCIe device, and any virtio
device is a virtual PCIe device.
``struct pci_vdev_ops``
PCIe device's operation functions for this type
of device.
``struct vqueue_info``
Instance of a virtqueue.
.. figure:: images/virtio-hld-image5.png
:width: 900px
:align: center
:name: VBS-U-data
VBS-U Key Data Structures
Each virtio device is a PCIe device. In addition, each virtio device
can have zero or more virtqueues, depending on the device type.
The ``struct virtio_common`` is a key data structure manipulated by the
DM, and the DM finds other key data structures through it. The ``struct
virtio_ops`` abstracts a series of virtio callbacks to be provided by the
device owner.
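To make the relationship between ``struct virtio_ops`` and a concrete device
such as virtio-blk easier to picture, here is a simplified sketch of how a
device owner might fill in such a callback table. The field names and
prototypes below are illustrative assumptions, not the exact definitions used
in the ACRN device model.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   /* Simplified stand-ins for the DM structures described above; the real
    * definitions in the ACRN device model contain more fields.
    */
   struct virtio_common;
   struct virtio_ops {
           const char *name;                           /* device name */
           int         nvq;                            /* number of virtqueues */
           size_t      cfgsize;                        /* device config space size */
           void      (*reset)(struct virtio_common *); /* device reset callback */
           void      (*qnotify)(struct virtio_common *, int qidx); /* FE kick */
           int       (*cfgread)(struct virtio_common *, int off, int len, uint32_t *val);
           int       (*cfgwrite)(struct virtio_common *, int off, int len, uint32_t val);
   };

   /* A hypothetical virtio-blk ops table a device owner could register. */
   static void vblk_reset(struct virtio_common *vc)             { (void)vc; /* reset rings, state */ }
   static void vblk_notify(struct virtio_common *vc, int qidx)  { (void)vc; (void)qidx; /* drain vq */ }

   static struct virtio_ops virtio_blk_ops = {
           .name    = "virtio-blk",
           .nvq     = 1,
           .cfgsize = 0,       /* size of the virtio-blk config struct in a real BE */
           .reset   = vblk_reset,
           .qnotify = vblk_notify,
   };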
VBS-K Key Data Structures
=========================
The key data structures for VBS-K are listed as follows, and their
relationships are shown in :numref:`VBS-K-data`.
``struct vbs_k_rng``
In-kernel VBS-K component handling the data plane of a
VBS-U virtio device, for example, the virtio random number generator.
``struct vbs_k_dev``
In-kernel VBS-K component common to all VBS-K modules.
``struct vbs_k_vq``
In-kernel VBS-K component that works with the kernel-land
vring service API helpers.
``struct vbs_k_dev_info``
Virtio device information to be synchronized
from VBS-U to the VBS-K kernel module.
``struct vbs_k_vq_info``
Information about a single virtqueue, to be
synchronized from VBS-U to the VBS-K kernel module.
``struct vbs_k_vqs_info``
Information about all of a virtio device's virtqueues, to be
synchronized from VBS-U to the VBS-K kernel module.
.. figure:: images/virtio-hld-image8.png
:width: 900px
:align: center
:name: VBS-K-data
VBS-K Key Data Structures
In VBS-K, the ``struct vbs_k_xxx`` structures represent the in-kernel
components handling a virtio device's data plane. Each presents a char device
for VBS-U to open, and registers the device status after feature negotiation
with the FE driver.
The device status includes negotiated features, number of virtqueues,
interrupt information, and more. All of this status information is synchronized
from VBS-U to VBS-K. In VBS-U, the ``struct vbs_k_dev_info`` and ``struct
vbs_k_vqs_info`` collect all the information and notify VBS-K through
ioctls. In VBS-K, the ``struct vbs_k_dev`` and ``struct vbs_k_vq``, which are
common to all VBS-K modules, are the counterparts that preserve this
information, which is required by the kernel-land vring service API helpers.
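The following sketch shows the general shape of this VBS-U to VBS-K hand-off.
The char device node name, the ioctl command values, and the structure fields
here are hypothetical placeholders used only to illustrate the flow; the real
names are defined by each VBS-K module.

.. code-block:: c

   #include <fcntl.h>
   #include <stdint.h>
   #include <sys/ioctl.h>
   #include <unistd.h>

   /* Hypothetical names for illustration only. */
   #define VBS_K_DEV_NODE      "/dev/vbs_k_rng"
   #define VBS_K_SET_DEV_INFO  0x4001
   #define VBS_K_SET_VQS_INFO  0x4002

   struct vbs_k_dev_info { uint64_t negotiated_features; int nvq; /* irq info, ... */ };
   struct vbs_k_vqs_info { uint64_t vq_ring_gpa[8]; uint16_t vq_size[8]; /* ... */ };

   /* Push the device status captured by VBS-U down to the VBS-K module. */
   static int vbs_k_sync(struct vbs_k_dev_info *dev, struct vbs_k_vqs_info *vqs)
   {
           int fd = open(VBS_K_DEV_NODE, O_RDWR);

           if (fd < 0)
                   return -1;

           if ((ioctl(fd, VBS_K_SET_DEV_INFO, dev) < 0) ||
               (ioctl(fd, VBS_K_SET_VQS_INFO, vqs) < 0)) {
                   close(fd);
                   return -1;
           }
           return fd;  /* keep the fd open while the device is active */
   }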
DM APIs
=======
The DM APIs are exported by DM, and they should be used when realizing
BE device drivers on ACRN.
[API Material from doxygen comments]
VBS APIs
========
The VBS APIs are exported by VBS related modules, including VBS, DM, and
SOS kernel modules. They can be classified into VBS-U and VBS-K APIs
listed as follows.
VBS-U APIs
----------
These APIs provided by VBS-U are callbacks to be registered with the DM,
and the virtio framework within the DM will invoke them appropriately.
[API Material from doxygen comments]
VBS-K APIs
----------
The VBS-K APIs are exported by VBS-K related modules. Users could use
the following APIs to implement their VBS-K modules.
APIs provided by DM
~~~~~~~~~~~~~~~~~~~
[API Material from doxygen comments]
APIs provided by VBS-K modules in service OS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[API Material from doxygen comments]
VQ APIs
=======
The virtqueue APIs, or VQ APIs, are used by a BE device driver to
access the virtqueues shared by the FE driver. The VQ APIs abstract the
details of virtqueues so that users don't need to worry about the data
structures within the virtqueues. In addition, the VQ APIs are designed
to be identical between VBS-U and VBS-K, so that users don't need to
learn different APIs when implementing BE drivers based on VBS-U and
VBS-K.
[API Material from doxygen comments]
Below is an example showing the typical logic of how a BE driver handles
requests from an FE driver.
.. code-block:: c

   static void BE_callback(struct pci_virtio_xxx *pv, struct vqueue_info *vq)
   {
      while (vq_has_descs(vq)) {
         vq_getchain(vq, &idx, &iov, 1, NULL);
         /* handle requests in iov */
         request_handle_proc();
         /* release this chain and handle more */
         vq_relchain(vq, idx, len);
      }
      /* generate an interrupt if appropriate; 1 means ring empty */
      vq_endchains(vq, 1);
   }
Supported Virtio Devices
************************
All the BE virtio drivers are implemented using the
ACRN virtio APIs, and the FE drivers reuse the standard Linux FE
virtio drivers. Devices with FE drivers available in the Linux
kernel should use the standard virtio Vendor ID/Device ID and
Subsystem Vendor ID/Subsystem Device ID. For other devices within ACRN,
their temporary IDs are listed in the following table.
.. table:: Virtio Devices without existing FE drivers in Linux
:align: center
:name: virtio-device-table
+--------------+-------------+-------------+-------------+-------------+
| virtio | Vendor ID | Device ID | Subvendor | Subdevice |
| device | | | ID | ID |
+--------------+-------------+-------------+-------------+-------------+
| RPMB | 0x8086 | 0x8601 | 0x8086 | 0xFFFF |
+--------------+-------------+-------------+-------------+-------------+
| HECI | 0x8086 | 0x8602 | 0x8086 | 0xFFFE |
+--------------+-------------+-------------+-------------+-------------+
| audio | 0x8086 | 0x8603 | 0x8086 | 0xFFFD |
+--------------+-------------+-------------+-------------+-------------+
| IPU | 0x8086 | 0x8604 | 0x8086 | 0xFFFC |
+--------------+-------------+-------------+-------------+-------------+
| TSN/AVB | 0x8086 | 0x8605 | 0x8086 | 0xFFFB |
+--------------+-------------+-------------+-------------+-------------+
| hyper_dmabuf | 0x8086 | 0x8606 | 0x8086 | 0xFFFA |
+--------------+-------------+-------------+-------------+-------------+
| HDCP | 0x8086 | 0x8607 | 0x8086 | 0xFFF9 |
+--------------+-------------+-------------+-------------+-------------+
| COREU | 0x8086 | 0x8608 | 0x8086 | 0xFFF8 |
+--------------+-------------+-------------+-------------+-------------+
The following sections introduce the status of virtio devices currently
supported in ACRN.
.. toctree::
:maxdepth: 1
virtio-blk
virtio-net
virtio-console
virtio-rnd

View File

@@ -0,0 +1,4 @@
.. _hld-vm-management:
VM Management high-level design
###############################

View File

@@ -0,0 +1,4 @@
.. _hld-vsbl:
Virtual Slim-Bootloader high-level design
#########################################

View File

@@ -0,0 +1,28 @@
.. _hld:
High-Level Design Guides
########################
The ACRN Hypervisor acts as a host with full control of the processor(s)
and the hardware (physical memory, interrupt management and I/O). It
provides the User OS with an abstraction of a virtual platform, allowing
the guest to behave as if it were executing directly on a logical
processor.
These chapters describe the ACRN architecture, high-level design,
background, and motivation for specific areas within the ACRN hypervisor
system.
.. toctree::
:maxdepth: 1
Overview <hld-overview>
Hypervisor <hld-hypervisor>
Device Model <hld-devicemodel>
Emulated Devices <hld-emulated-devices>
Virtio Devices <hld-virtio-devices>
VM Management <hld-vm-management>
Power Management <hld-power-management>
Tracing and Logging <hld-trace-log>
Virtual Bootloader <hld-vsbl>
Security <hld-security>

View File

@@ -0,0 +1,486 @@
.. _interrupt-hld:
Interrupt Management high-level design
######################################
Overview
********
This document describes the interrupt management high-level design for
the ACRN hypervisor.
The ACRN hypervisor implements a simple but fully functional framework
to manage interrupts and exceptions, as shown in
:numref:`interrupt-modules-overview`. In its native layer, it configures
the physical PIC, IOAPIC, and LAPIC to support different interrupt
sources from local timer/IPI to external INTx/MSI. In its virtual guest
layer, it emulates virtual PIC, virtual IOAPIC and virtual LAPIC, and
provides full APIs allowing virtual interrupt injection from emulated or
pass-thru devices.
.. figure:: images/interrupt-image3.png
:align: center
:width: 600px
:name: interrupt-modules-overview
ACRN Interrupt Modules Overview
In the software modules view shown in :numref:`interrupt-sw-modules`,
the ACRN hypervisor sets up the physical interrupts in its basic
interrupt modules (e.g., IOAPIC/LAPIC/IDT). It dispatches the interrupt
in the hypervisor interrupt flow control layer to the corresponding
handlers, which could be a pre-defined IPI notification handler, a timer
handler, or a runtime-registered pass-thru device handler. The ACRN
hypervisor then uses its VM interfaces, based on the vPIC, vIOAPIC, and vMSI
modules, to inject the necessary virtual interrupt into a specific VM.
.. figure:: images/interrupt-image2.png
:align: center
:width: 600px
:name: interrupt-sw-modules
ACRN Interrupt SW Modules Overview
Hypervisor Physical Interrupt Management
****************************************
The ACRN hypervisor is responsible for all the physical interrupt
handling. All physical interrupts are first handled in VMX root-mode.
The "external-interrupt exiting" bit in the VM-Execution controls field
is set to support this. The ACRN hypervisor also initializes all the
interrupt related modules such as IDT, PIC, IOAPIC, and LAPIC.
Only a few physical interrupts (such as TSC-Deadline timer and IOMMU)
are fully serviced in the hypervisor. Most interrupts come from pass-thru
devices whose interrupts are remapped to a virtual INTx/MSI source and
injected into the SOS or UOS, according to the pass-thru device
configuration.
The ACRN hypervisor does not service exceptions in VMX root mode; any
exception raised there will lead to the CPU halting. For guest exceptions, the
hypervisor only traps #MC (machine check), prints a warning message, and
injects the exception back into the guest OS.
Physical Interrupt Initialization
=================================
After the ACRN hypervisor gets control from the bootloader, it
initializes all physical interrupt-related modules for all the CPUs. The
ACRN hypervisor creates a framework to manage the physical interrupts for
hypervisor-local devices, pass-thru devices, and IPIs between CPUs.
IDT
---
The ACRN hypervisor builds its native Interrupt Descriptor Table (IDT) during
interrupt initialization. For exceptions, it links to function
``dispatch_exception``, and for external interrupts it links to function
``dispatch_interrupt``. Please refer to ``arch/x86/idt.S`` for more details.
LAPIC
-----
The ACRN hypervisor resets the LAPIC for each CPU, and provides basic APIs
used, for example, by the local timer (TSC deadline)
program and the IPI notification program. These APIs include
write_lapic_reg32, send_lapic_eoi, send_startup_ipi, and
send_single_ipi.
.. comment
Need reference to API doc generated from doxygen comments
in hypervisor/include/arch/x86/lapic.h
PIC/IOAPIC
----------
The ACRN hypervisor masks all interrupts from PIC, so all the
legacy interrupts from PIC (<16) are linked to IOAPIC, as shown in
:numref:`interrupt-pic-pin`.
ACRN will pre-allocate vectors and mask them for these legacy interrupts
in IOAPIC RTE. For others (>= 16) ACRN will mask them with vector 0 in
RTE, and the vector will be dynamically allocated on demand.
.. figure:: images/interrupt-image5.png
:align: center
:width: 600px
:name: interrupt-pic-pin
PIC & IOAPIC Pin Connection
Irq Desc
--------
The ACRN hypervisor maintains a global ``irq_desc[]`` array shared among the
CPUs and uses a flat mode to manage the interrupts. The same
vector is linked to the same IRQ number for all CPUs.
.. comment
Need reference to API doc generated from doxygen comments
for ``struct irq_desc`` in hypervisor/include/common/irq.h
The ``irq_desc[]`` array is indexed by the IRQ number. An
``irq_handler`` field can be set to a common edge, level, or quick
handler called from ``dispatch_interrupt``. The ``irq_desc`` structure
also contains the ``dev_list`` field to maintain this IRQ's action
handler list.
The global array ``vector_to_irq[]`` is used to manage the vector
resource. This array is initialized with value ``IRQ_INVALID`` for all
vectors, and will be set to a valid IRQ number after the corresponding
vector is registered.
For example, if the local timer registers an interrupt with IRQ number 271 and
vector 0xEF, then the arrays mentioned above will be set to::
irq_desc[271].irq = 271;
irq_desc[271].vector = 0xEF;
vector_to_irq[0xEF] = 271;
Physical Interrupt Flow
=======================
When a physical interrupt occurs while the CPU is running in VMX root
mode, the interrupt is handled through the standard native IRQ flow:
interrupt gate to IRQ handler. However, if the CPU is running in VMX
non-root mode, an external interrupt will trigger a VM exit for reason
"external-interrupt". See :numref:`interrupt-handle-flow`.
.. figure:: images/interrupt-image4.png
:align: center
:width: 800px
:name: interrupt-handle-flow
ACRN Hypervisor Interrupt Handle Flow
After an interrupt happens (in either case noted above), the ACRN
hypervisor jumps to ``dispatch_interrupt``. This function will check
which vector caused this interrupt, and the corresponding ``irq_desc``
structure's ``irq_handler`` will be called for the service.
There are several irq handlers defined in the ACRN hypervisor, as shown
in :numref:`interrupt-handle-flow`, designed for different uses. For
example, ``quick_handler_nolock`` is used when no critical data needs
protection in the action handlers; the VCPU notification IPI and local
timer are good examples of this use case.
The more complicated ``common_dev_handler_level`` handler is intended
for pass-thru devices with level triggered interrupts. To avoid
continuously triggering the interrupt, it initially masks the IOAPIC pin and
unmasks it only when the corresponding vIOAPIC pin gets an explicit EOI
ACK from the guest.
All the irq handlers finally call their registered action handler list, as
shown here:
.. code-block:: c

   struct dev_handler_node *dev = desc->dev_list;

   while (dev != NULL) {
           if (dev->dev_handler != NULL)
                   dev->dev_handler(desc->irq, dev->dev_data);
           dev = dev->next;
   }
The common APIs for registering, updating, and unregistering
interrupt handlers include irq_to_vector, dev_to_irq, dev_to_vector,
pri_register_handler, normal_register_handler,
unregister_handler_common, and update_irq_handler.
.. comment
Need reference to API doc generated from doxygen comments
in hypervisor/include/common/irq.h
.. _physical_interrupt_source:
Physical Interrupt Source
=========================
The ACRN hypervisor handles interrupts from many different sources, as
shown in :numref:`interrupt-source`:
.. list-table:: Physical Interrupt Source
:widths: 15 10 60
:header-rows: 1
:name: interrupt-source
* - Interrupt Source
- Vector
- Description
* - TSC Deadline Timer
- 0xEF
- The TSC deadline timer implements the timer framework in
the hypervisor based on the LAPIC TSC deadline. This interrupt's
target is specific to the CPU to which the LAPIC belongs.
* - CPU Startup IPI
- N/A
- The BSP needs to trigger an INIT-SIPI sequence to wake up the
APs. This interrupt's target is specified by the BSP calling
``start_cpus()``.
* - VCPU Notify IPI
- 0xF0
- Used when the hypervisor needs to kick the VCPU out of VMX non-root
mode to handle requests such as virtual interrupt injection, EPT
flush, etc. This interrupt's target is specified by the function
``send_single_ipi()``.
* - IOMMU MSI
- dynamic
- The IOMMU device supports an MSI interrupt. The VT-d driver in
the hypervisor registers an interrupt to handle DMAR faults.
This interrupt's target is specified by the VT-d driver.
* - PTdev INTx
- dynamic
- All native devices are owned by the guest (SOS or UOS), taking
advantage of the pass-thru method. Each pass-thru device connected
to the IOAPIC/PIC (PTdev INTx) will register an interrupt when
its attached interrupt controller pin first gets unmasked.
This interrupt's target is defined by an RTE entry in the IOAPIC.
* - PTdev MSI
- dynamic
- All native devices are owned by the guest (SOS or UOS), taking
advantage of the pass-thru method. Each pass-thru device with
MSI enabled (PTdev MSI) will register an interrupt when the SOS
does an explicit hypercall. This interrupt's target is defined
by an MSI address entry.
Softirq
=======
The ACRN hypervisor implements a simple bottom-half softirq to execute the
interrupt handler, as shown in :numref:`interrupt-handle-flow`.
The softirq is executed when interrupts are enabled. Several APIs for softirq
are defined, including enable_softirq, disable_softirq, raise_softirq,
and exec_softirq.
.. comment
Need reference to API doc generated from doxygen comments
in hypervisor/include/common/softirq.h
Physical Exception Handling
===========================
As mentioned earlier, the ACRN hypervisor does not handle any
physical exceptions. The VMX root mode code path should guarantee no
exceptions are triggered while the hypervisor is running.
Guest Virtual Interrupt Management
**********************************
The previous sections describe physical interrupt management in the ACRN
hypervisor. After a physical interrupt happens, a registered action
handler is executed. Usually, the action handler represents a service
for virtual interrupt injection. For example, if an interrupt is
triggered from a pass-thru device, the appropriate virtual interrupt
should be injected into its guest VM.
The virtual interrupt injection could also come from an emulated device.
The I/O mediator in the Service OS (SOS) could trigger an interrupt
through a hypercall, and then do the virtual interrupt injection in the
hypervisor.
The following sections give an introduction to the ACRN guest virtual
interrupt management, including VCPU request for virtual interrupt kick
off, vPIC/vIOAPIC/vLAPIC for virtual interrupt injection interfaces,
physical-to-virtual interrupt mapping for a pass-thru device, and the
process of VMX interrupt/exception injection.
VCPU Request
============
As mentioned in `physical_interrupt_source`_, physical vector 0xF0 is
used to kick the VCPU out of its VMX non-root mode, and make a request
for virtual interrupt injection or other requests such as flush EPT.
The request-make API (``vcpu_make_request``), together with an event ID,
supports virtual interrupt injection.
.. comment
Need reference to API doc generated from doxygen comments
in hypervisor/include/common/irq.h
There are requests for exception injection (ACRN_REQUEST_EXCP), vLAPIC
event (ACRN_REQUEST_EVENT), external interrupt from vPIC
(ACRN_REQUEST_EXTINT) and non-maskable-interrupt (ACRN_REQUEST_NMI).
A call to ``vcpu_make_request`` is necessary for a virtual interrupt
injection. If the target VCPU is running in VMX non-root mode, the request
sends an IPI to kick it out, which results in an external-interrupt
VM-Exit. The flow of :numref:`interrupt-handle-flow` is then executed
to complete the injection of the virtual interrupt.
There are some cases that do not need to send an IPI when making a
request because the CPU making the request is the target VCPU. For
example, the #GP exception request always happens on the current CPU
when an invalid emulation happens. An external interrupt for a pass-thru
device always happens on the VCPUs the device belongs to, so after it
triggers an external-interrupt VM-Exit, the current CPU is also the
target VCPU.
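As a sketch of how an emulation path or an action handler kicks the target
vCPU, the fragment below calls ``vcpu_make_request`` with the vLAPIC event
request. The prototype and the constant value shown here are assumptions for
illustration; the real definitions live in the hypervisor headers.

.. code-block:: c

   struct acrn_vcpu;                     /* hypervisor vCPU object (opaque here) */

   /* Assumed prototype, based on the API name used in this section. */
   void vcpu_make_request(struct acrn_vcpu *vcpu, int eventid);

   #define ACRN_REQUEST_EVENT  2         /* placeholder value for illustration */

   /* Ask the target vCPU to pick up a pending vLAPIC event.  If the vCPU is
    * in VMX non-root mode, this sends the 0xF0 notification IPI so the vCPU
    * exits and processes the request before its next VM entry.
    */
   static void kick_vcpu_for_event(struct acrn_vcpu *target)
   {
           vcpu_make_request(target, ACRN_REQUEST_EVENT);
   }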
Virtual PIC
===========
The ACRN hypervisor emulates a vPIC for each VM, based on the I/O ranges
0x20-0x21, 0xa0-0xa1, and 0x4d0-0x4d1.
If an interrupt source from vPIC needs to inject an interrupt,
the vpic_assert_irq, vpic_deassert_irq, or vpic_pulse_irq functions can
be called to make a request for ACRN_REQUEST_EXTINT or
ACRN_REQUEST_EVENT:
.. comment
Need reference to API doc generated from doxygen comments
in hypervisor/include/common/vpic.h
The vpic_pending_intr and vpic_intr_accepted APIs are used to query the
vector being injected and ACK the service, by moving the interrupt from
request service (IRR) to in service (ISR).
Virtual IOAPIC
==============
ACRN hypervisor emulates a vIOAPIC for each VM based on MMIO
VIOAPIC_BASE.
If an interrupt source from the vIOAPIC needs to inject an interrupt, the
vioapic_assert_irq, vioapic_deassert_irq, and vioapic_pulse_irq APIs are
used to make a request for ACRN_REQUEST_EVENT.
As the vIOAPIC is always associated with a vLAPIC, the virtual interrupt
injection from the vIOAPIC finally triggers a request for a vLAPIC
event.
Virtual LAPIC
=============
The ACRN hypervisor emulates a vLAPIC for each VCPU based on MMIO
DEFAULT_APIC_BASE.
If an interrupt source from the vLAPIC needs to inject an interrupt (e.g.,
from an LVT such as the LAPIC timer, from the vIOAPIC for a pass-thru device
interrupt, or from an emulated device for an MSI), the vlapic_intr_level,
vlapic_intr_edge, vlapic_set_local_intr, vlapic_intr_msi, or
vlapic_deliver_intr APIs need to be called, resulting in a request for
ACRN_REQUEST_EVENT.
.. comment
Need reference to API doc generated from doxygen comments
in hypervisor/include/common/vlapic.h
The vlapic_pending_intr and vlapic_intr_accepted APIs are used to query
the vector that needs to be injected and to ACK
the service, moving the interrupt from request service (IRR) to in
service (ISR).
By default, the ACRN hypervisor enables vAPIC to improve the performance of
a vLAPIC emulation.
Virtual Exception
=================
When doing emulation, an exception may need to be injected by the hypervisor.
For example, the guest may access an invalid vMSR register, requiring the
hypervisor to inject a #GP, or during instruction emulation, an
instruction fetch may access a non-existent page from rip_gva, requiring a
#PF to be injected.
The ACRN hypervisor implements virtual exception injection using the
vcpu_queue_exception, vcpu_inject_gp, and vcpu_inject_pf APIs.
.. comment
Need reference to API doc generated from doxygen comments
in hypervisor/include/common/irq.h
The ACRN hypervisor uses vcpu_inject_gp/vcpu_inject_pf functions to
queue exception requests, and follows `Intel Software
Developer Manual, Vol 3. <SDM vol3>`_ - 6.15, Table 6-5
listing conditions for generating a double fault.
.. _SDM vol3: https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html
Interrupt Mapping for a Pass-thru Device
========================================
A VM can control a PCI device directly through pass-thru device
assignment. The pass-thru entry is the major info object; it holds:
- A physical interrupt source, which could be an MSI/MSIX entry, a PIC pin, or
an IOAPIC pin
- Pass-thru remapping information between the physical and virtual interrupt
source; for MSI/MSIX it is identified by a PCI device's BDF, and for
PIC/IOAPIC by the pin number.
.. figure:: images/interrupt-image7.png
:align: center
:width: 600px
:name: interrupt-pass-thru
Pass-thru Device Entry Assignment
As shown in :numref:`interrupt-pass-thru` above, a UOS's pass-thru device
entry is assigned by the DM, and its entry info is filled from:
- vPIC/vIOAPIC interrupt mask/unmask
- MSI IOReq from the UOS, then MSI hypercall from the SOS
The SOS adds its pass-thru device entry at runtime and fills info for:
- vPIC/vIOAPIC interrupt mask/unmask
- MSI hypercall from SOS
During the pass-thru device entry info filling, the hypervisor builds
the native IOAPIC RTE/MSI entry based on the vIOAPIC/vPIC/vMSI configuration,
and registers the physical interrupt handler for it. Then, with the pass-thru
device entry as the handler's private data, the physical interrupt can
be linked to a virtual pin of a guest's vPIC/vIOAPIC or a virtual vector of
a guest's vMSI. The handler then injects the corresponding virtual
interrupt into the guest, based on the vPIC/vIOAPIC/vLAPIC APIs described
earlier.
Interrupt Storm Mitigation
==========================
When the Device Model (DM) launches a User OS (UOS), the ACRN hypervisor
will remap the interrupts for that User OS's pass-through devices. When
an interrupt occurs for a pass-through device, the CPU core assigned
to that User OS gets trapped into the hypervisor. The benefit of such a
mechanism is that, should an interrupt storm happen in a particular UOS,
it will have only a minimal effect on the performance of the Service OS.
Interrupt/Exception Injection Process
=====================================
As shown in :numref:`interrupt-handle-flow`, the ACRN hypervisor injects
virtual interrupts/exceptions into the guest before its VM entry.
This is done by updating the VMX_ENTRY_INT_INFO_FIELD of the VCPU's
VMCS. As this field is unique, interrupt/exception injection must
follow priority rules and be handled one at a time.
:numref:`interrupt-injection` below shows the rules for injecting
virtual interrupts/exceptions one by one. If a higher-priority
interrupt/exception has already been injected, the next pending
interrupt/exception enables an interrupt window, and its injection
is done on the following VM exit triggered by that interrupt window.
.. figure:: images/interrupt-image6.png
:align: center
:width: 600px
:name: interrupt-injection
ACRN Hypervisor Interrupt/Exception Injection Process

View File

@@ -0,0 +1,248 @@
.. _memmgt-hld:
Memory Management high-level design
###################################
This document describes memory management for the ACRN hypervisor.
Overview
********
In the ACRN hypervisor system, there are a few different memory spaces to
consider. From the hypervisor's point of view there are:
- Host Physical Address (HPA): the native physical address space, and
- Host Virtual Address (HVA): the native virtual address space based on
an MMU. A page table is used to do the translation between HPA and HVA
spaces.
And from the Guest OS running on a hypervisor there are:
- Guest Physical Address (GPA): the guest physical address space from a
virtual machine. GPA to HPA translation is usually based on an
MMU-like hardware module (EPT in x86), and associated with a page
table
- Guest Virtual Address (GVA): the guest virtual address space from a
virtual machine based on a vMMU
.. figure:: images/mem-image2.png
:align: center
:width: 900px
:name: mem-overview
ACRN Memory Mapping Overview
:numref:`mem-overview` provides an overview of the ACRN system memory
mapping, showing:
- GVA to GPA mapping based on vMMU on a VCPU in a VM
- GPA to HPA mapping based on EPT for a VM in the hypervisor
- HVA to HPA mapping based on MMU in the hypervisor
This document illustrates the memory management infrastructure for the
ACRN hypervisor and how it handles the different memory space views
inside the hypervisor and from a VM:
- How ACRN hypervisor manages host memory (HPA/HVA)
- How ACRN hypervisor manages SOS guest memory (HPA/GPA)
- How ACRN hypervisor & SOS DM manage UOS guest memory (HPA/GPA)
Hypervisor Memory Management
****************************
The ACRN hypervisor is the primary owner of system memory management.
Typically the boot firmware (e.g., EFI) passes the platform physical
memory layout (E820 table) to the hypervisor. The ACRN hypervisor does its
memory management based on this table.
Physical Memory Layout - E820
=============================
The boot firmware (e.g., EFI) passes the E820 table through a multiboot protocol.
This table contains the original memory layout for the platform.
.. figure:: images/mem-image1.png
:align: center
:width: 900px
:name: mem-layout
Physical Memory Layout Example
:numref:`mem-layout` is an example of the physical memory layout based on a simple
platform E820 table. The following sections demonstrate different memory
space management by referencing it.
Physical to Virtual Mapping
===========================
The ACRN hypervisor runs in paging mode. After receiving
the platform E820 table, it creates its MMU page table
based on that table. This is done by the function ``init_paging()`` for all
physical CPUs.
The memory mapping policy here is:
- Identical mapping for each physical CPU (ACRN hypervisor's memory
could be relocatable in a future implementation)
- Map all memory regions with UNCACHED type
- Remap RAM regions to WRITE-BACK type
.. figure:: images/mem-image4.png
:align: center
:width: 900px
:name: vm-layout
Hypervisor Virtual Memory Layout
:numref:`vm-layout` shows:
- Hypervisor can access all of system memory
- Hypervisor has an UNCACHED MMIO/PCI hole reserved for devices, such
as for LAPIC/IOAPIC access
- Hypervisor has its own memory with WRITE-BACK cache type for its
code and data (< 1M part is for secondary CPU reset code)
Service OS Memory Management
****************************
After the ACRN hypervisor starts, it creates the Service OS as its first
VM. The Service OS runs all the native device drivers, manages the
hardware devices, and provides I/O mediation to guest VMs. The Service
OS is also in charge of memory allocation for Guest VMs.
ACRN hypervisor passes the whole system memory access (except its own
part) to the Service OS. The Service OS must be able to access all of
the system memory except the hypervisor part.
Guest Physical Memory Layout - E820
===================================
The ACRN hypervisor passes the original E820 table to the Service OS
after filtering out its own part. So from Service OS's view, it sees
almost all the system memory as shown here:
.. figure:: images/mem-image3.png
:align: center
:width: 900px
:name: sos-mem-layout
SOS Physical Memory Layout
Host to Guest Mapping
=====================
ACRN hypervisor creates Service OS's host (HPA) to guest (GPA) mapping
(EPT mapping) through the function
``prepare_vm0_memmap_and_e820()`` when it creates the SOS VM. It follows
these rules:
- Identical mapping
- Map all memory range with UNCACHED type
- Remap RAM entries in E820 (revised) with WRITE-BACK type
- Unmap ACRN hypervisor memory range
- Unmap ACRN hypervisor emulated vLAPIC/vIOAPIC MMIO range
The host to guest mapping is static for the Service OS; it will not
change after the Service OS begins running. Each native device driver
can access its MMIO through this static mapping. EPT violations are only
used for vLAPIC/vIOAPIC emulation in the hypervisor for the Service OS
VM.
User OS Memory Management
*************************
User OS VM is created by the DM (Device Model) application running in
the Service OS. DM is responsible for the memory allocation for a User
or Guest OS VM.
Guest Physical Memory Layout - E820
===================================
DM will create the E820 table for a User OS VM based on these simple
rules:
- If requested VM memory size < low memory limitation (defined in DM,
as 2GB), then low memory range = [0, requested VM memory size]
- If requested VM memory size > low memory limitation (defined in DM,
as 2GB), then low memory range = [0, 2GB], high memory range = [4GB,
4GB + requested VM memory size - 2GB]
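A minimal sketch of these two rules, assuming the 2GB low-memory limit
mentioned above (the helper and structure below are illustrative, not the
actual DM code):

.. code-block:: c

   #include <stdint.h>

   #define GB(x)             ((uint64_t)(x) << 30)
   #define UOS_LOWMEM_LIMIT  GB(2)   /* "low memory limitation" defined in DM */

   struct uos_e820_hint {
           uint64_t lowmem_size;     /* RAM placed below 2GB */
           uint64_t highmem_size;    /* RAM placed starting at 4GB */
   };

   /* Apply the two rules above to a requested VM memory size. */
   static struct uos_e820_hint split_uos_memory(uint64_t requested)
   {
           struct uos_e820_hint hint;

           if (requested <= UOS_LOWMEM_LIMIT) {
                   hint.lowmem_size  = requested;              /* [0, requested) */
                   hint.highmem_size = 0;
           } else {
                   hint.lowmem_size  = UOS_LOWMEM_LIMIT;       /* [0, 2GB) */
                   hint.highmem_size = requested - UOS_LOWMEM_LIMIT; /* from 4GB */
           }
           return hint;
   }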
.. figure:: images/mem-image6.png
:align: center
:width: 900px
:name: uos-mem-layout
UOS Physical Memory Layout
By default, the DM allocates UOS memory using the hugeTLB mechanism.
The real memory mapping
may be scattered throughout the SOS physical memory space, as shown below:
.. figure:: images/mem-image5.png
:align: center
:width: 900px
:name: uos-mem-layout-hugetlb
UOS Physical Memory Layout Based on Hugetlb
Host to Guest Mapping
=====================
A User OS VM's memory is allocated by the Service OS DM application, and
may come from different huge pages in the Service OS as shown in
:ref:`uos-mem-layout-hugetlb`.
As the Service OS has full information about these huge pages (size,
SOS-GPA, and UOS-GPA), it works with the hypervisor to complete the UOS's
host-to-guest mapping, as illustrated by this pseudocode:
.. code-block:: none

   for each allocated huge page x:
       x.hpa = gpa2hpa_for_sos(x.sos_gpa)
       host2guest_map_for_uos(x.hpa, x.uos_gpa, x.size)
Trusty
======
For an Android User OS, there is a secure world called "trusty world
support", whose memory needs are taken care of by the ACRN hypervisor for
security reasons. From the memory management view, the trusty
memory space should not be accessible by the SOS or the UOS normal world.
.. figure:: images/mem-image7.png
:align: center
:width: 900px
:name: uos-mem-layout-trusty
UOS Physical Memory Layout with Trusty
Memory Interaction
******************
Previous sections described different memory spaces management in the
ACRN hypervisor, Service OS, and User OS. Among these memory spaces,
there are different kinds of interaction, for example, a VM may do a
hypercall to the hypervisor that includes a data transfer, or an
instruction emulation in the hypervisor may need to access the Guest
instruction pointer register to fetch instruction data.
Access GPA from Hypervisor
==========================
When the hypervisor needs to access a GPA range for data transfer, the caller
from the Guest must make sure the GPA range is address
contiguous. The backing HPA in the hypervisor, however, could be address
dis-contiguous (especially for a UOS using the hugeTLB allocation mechanism).
For example, a 4MB GPA range may map to 2 different 2MB huge pages. The
ACRN hypervisor needs to take care of this kind of data transfer by
doing EPT page walking based on its HPA.
Access GVA from Hypervisor
==========================
Likewise, when the hypervisor needs to access a GVA range for data transfer,
both the GPA and the HPA could be address dis-contiguous. The ACRN hypervisor
must pay attention to this kind of data transfer, and handles it by doing page
walking based on both its GPA and HPA.
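The following sketch illustrates the page-by-page approach described above.
The helper names ``gva2gpa``/``gpa2hva`` and their parameter lists are
simplified assumptions for illustration, not the exact hypervisor prototypes.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>
   #include <string.h>

   #define PAGE_SIZE 4096UL

   /* Assumed helpers: translate a guest virtual address to a guest physical
    * address (guest page-table walk) and a guest physical address to a host
    * virtual address mapped into the hypervisor (EPT walk).
    */
   uint64_t gva2gpa(const void *vm, uint64_t gva);
   void    *gpa2hva(const void *vm, uint64_t gpa);

   /* Copy from a guest virtual buffer page by page, since neither the GPA
    * nor the HPA backing a multi-page GVA range is guaranteed to be
    * contiguous.
    */
   static void copy_from_gva(const void *vm, void *dst, uint64_t gva, size_t len)
   {
           while (len > 0UL) {
                   uint64_t off   = gva & (PAGE_SIZE - 1UL);
                   size_t   chunk = (size_t)(PAGE_SIZE - off);

                   if (chunk > len)
                           chunk = len;

                   memcpy(dst, gpa2hva(vm, gva2gpa(vm, gva)), chunk);

                   dst  = (uint8_t *)dst + chunk;
                   gva += chunk;
                   len -= chunk;
           }
   }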

View File

@@ -0,0 +1,126 @@
.. _uart_virtualization:
UART Virtualization
###################
In ACRN, UART virtualization is implemented as a fully-emulated device.
In the Service OS (SOS), UART virtualization is implemented in the
hypervisor itself. In the User OS (UOS), UART virtualization is
implemented in the Device Model (DM), and is the primary topic of this
document. We'll summarize differences between the hypervisor and DM
implementations at the end of this document.
UART emulation is a typical full-emulation implementation and is a
good example to learn about I/O emulation in a virtualized environment.
There is a detailed explanation about the I/O emulation flow in
ACRN in :ref:`ACRN-io-mediator`.
Architecture
************
The ACRN DM architecture for UART virtualization is shown here:
.. figure:: images/uart-image1.png
:align: center
:name: uart-arch
:width: 800px
Device Model's UART virtualization architecture
There are three objects used to emulate one UART device in DM:
UART registers, rxFIFO, and backend tty devices.
**UART registers** are emulated by member variables in ``struct
uart_vdev``, one variable for each register. These variables are used
to track the register status programmed by the frontend driver. The
handler of each register depends on the register's functionality.
A **FIFO** is implemented to emulate RX. Normally characters are read
from the backend tty device when available, then put into the rxFIFO.
When the Guest application tries to read from the UART, the access to
register ``com_data`` causes a ``vmexit``. Device model catches the
``vmexit`` and emulates the UART by returning one character from rxFIFO.
.. note:: When ``com_fcr`` is available, the Guest application can write
``0`` to this register to disable rxFIFO. In this case the rxFIFO in
device model degenerates to a buffer containing only one character.
When the Guest application tries to send a character to the UART, it
writes to the ``com_data`` register, which will cause a ``vmexit`` as
well. Device model catches the ``vmexit`` and emulates the UART by
redirecting the character to the **backend tty device**.
The UART device emulated by the ACRN device model is connected to the system by
the LPC bus. In the current implementation, two LPC UART channels are I/O mapped
to the traditional COM port addresses of 0x3F8 and 0x2F8. These are defined in
the global variable ``uart_lres``.
There are two options needed for configuring the UART on the ``acrn-dm``
command line. First, the LPC is defined as a PCI device::
-s 1:0,lpc
The other option defines a UART port::
-l com1,stdio
The first parameter here is the name of the UART (must be "com1" or
"com2"). The second parameter specifies the backend
tty device: ``stdio`` or a path to a dedicated tty device
node, for example ``/dev/pts/0``.
If you are using a specified tty device, find the name of the terminal
connected to standard input using the ``tty`` command (e.g.,
``/dev/pts/1``). Use this name to define the UART port on the acrn-dm
command line, for example::
-l com1,/dev/pts/1
When acrn-dm starts, ``pci_lpc_init`` is called as the callback of the
``vdev_init`` of the PCI device given on the acrn-dm command line.
Later, ``lpc_init`` is called in ``pci_lpc_init``. ``lpc_init`` iterates
on the available UART instances defined on the command line and
initializes them one by one. ``register_inout`` is called on the port
region of each UART instance, enabling access to the UART ports to be
routed to the registered handler.
In the case of UART emulation, the registered handlers are ``uart_read``
and ``uart_write``.
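The sketch below shows the general shape of such a port registration. The
``struct inout_port`` layout and the ``register_inout()`` prototype here are
assumptions for illustration (in the actual DM, separate ``uart_read`` and
``uart_write`` handlers service IN and OUT accesses).

.. code-block:: c

   #include <stdint.h>

   /* Illustrative PIO handler type and registration structure. */
   typedef int (*inout_func_t)(int vcpu, int in, int port, int bytes,
                               uint32_t *eax, void *arg);

   struct inout_port {
           const char   *name;
           int           port;     /* base I/O port, e.g. 0x3F8 for COM1 */
           int           size;     /* number of consecutive ports, 8 for a 16550 */
           inout_func_t  handler;  /* called on every guest PIO access */
           void         *arg;      /* per-instance state, e.g. struct uart_vdev */
   };

   int register_inout(struct inout_port *iop);   /* assumed prototype */

   /* Route all accesses to one UART's port region to its PIO handler. */
   static int hook_uart(int base, inout_func_t uart_pio_handler, void *uart_state)
   {
           struct inout_port iop = {
                   .name    = "uart",
                   .port    = base,
                   .size    = 8,
                   .handler = uart_pio_handler,
                   .arg     = uart_state,
           };

           return register_inout(&iop);
   }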
A similar virtual UART device is implemented in the hypervisor.
Currently the UART16550 is owned by the hypervisor itself and is used for
debugging purposes. (The UART properties are configured by parameters
to the hypervisor command line.) The hypervisor emulates a UART device
at address 0x3F8 for the SOS, acting as the SOS console. The general
emulation is the same as used in the device model, with the following
differences:
- PIO region is directly registered to the vmexit handler dispatcher via
``vuart_register_io_handler``
- Two FIFOs are implemented, one for RX, the other for TX
- RX flow:
- Characters are read from the UART HW into a 2048-byte sbuf,
triggered by ``console_read``
- Characters are read from the sbuf and put to rxFIFO,
triggered by ``vuart_console_rx_chars``
- A virtual interrupt is sent to the SOS to trigger the read,
and characters from the rxFIFO are sent to the SOS by emulating a read
of register ``UART16550_RBR``
- TX flow:
- Characters are put into txFIFO by emulating a write of register
``UART16550_THR``
- Characters in txFIFO are read out one by one, and sent to the console
by printf, triggered by ``vuart_console_tx_chars``
- Implementation of printf is based on the console, which finally sends
characters to the UART HW by writing to register ``UART16550_THR``

View File

@@ -0,0 +1,107 @@
.. _virtio-blk:
Virtio-blk
##########
The virtio-blk device is a simple virtual block device. The FE driver
(in the UOS space) places read, write, and other requests onto the
virtqueue, so that the BE driver (in the SOS space) can process them
accordingly. Communication between the FE and BE is based on the virtio
kick and notify mechanism.
The virtio device ID of the virtio-blk is ``2``, and it supports one
virtqueue, the size of which is 64, configurable in the source code.
.. figure:: images/virtio-blk-image01.png
:align: center
:width: 900px
:name: virtio-blk-arch
Virtio-blk architecture
The feature bits supported by the BE device are shown as follows:
``VIRTIO_BLK_F_SEG_MAX``
Maximum number of segments in a request is in seg_max.
``VIRTIO_BLK_F_BLK_SIZE``
Block size of disk is in blk_size.
``VIRTIO_BLK_F_TOPOLOGY``
Device exports information on optimal I/O alignment.
``VIRTIO_RING_F_INDIRECT_DESC``
Support for indirect descriptors
``VIRTIO_BLK_F_FLUSH``
Cache flush command support.
``VIRTIO_BLK_F_CONFIG_WCE``
Device can toggle its cache between writeback and writethrough modes.
Virtio-blk-BE design
********************
.. figure:: images/virtio-blk-image02.png
:align: center
:width: 900px
:name: virtio-blk-be
The virtio-blk BE device is implemented as a legacy virtio device. Its
backend media could be a file or a partition. The virtio-blk device
supports writeback and writethrough cache mode. In writeback mode,
virtio-blk has good write and read performance. To be safer,
writethrough is set as the default mode, as it can make sure every write
operation queued to the virtio-blk FE driver layer is submitted to
hardware storage.
During initialization, virtio-blk will allocate 64 ioreq buffers in a
shared ring used to store the I/O requests. The freeq, busyq, and pendq
shown in :numref:`virtio-blk-be` are used to manage requests. Each
virtio-blk device starts 8 worker threads to process requests
asynchronously.
Usage:
******
The device model configuration command syntax for virtio-blk is::
-s <slot>,virtio-blk,<filepath>[,options]
- ``filepath`` is the path of a file or disk partition
- ``options`` include:
- ``writethru``: write operation is reported completed only when the
data has been written to physical storage.
- ``writeback``: write operation is reported completed when data is
placed in the page cache. Needs to be flushed to the physical storage.
- ``ro``: open file with readonly mode.
- ``sectorsize``: configured as either
``sectorsize=<sector size>/<physical sector size>`` or
``sectorsize=<sector size>``.
The default value for both the sector size and the physical sector size is 512.
- ``range``: configured as ``range=<start lba in file>/<sub file size>``,
meaning the virtio-blk will only access part of the file, from
``<start lba in file>`` to ``<start lba in file> + <sub file size>``.
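For example, several of these options can be combined on a single slot (the
file path and values here are purely illustrative)::

   -s 9,virtio-blk,/home/user/uos.img,writeback,sectorsize=512/4096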
A simple example for virtio-blk:
1. Prepare a file in SOS folder::
dd if=/dev/zero of=test.img bs=1M count=1024
mkfs.ext4 test.img
#. Add virtio-blk to the DM command line; the slot number should not duplicate
that of another device::
-s 9,virtio-blk,/root/test.img
#. Launch UOS, you can find ``/dev/vdx`` in UOS.
The ``x`` in ``/dev/vdx`` is related to the slot number used. If
you start the DM with two virtio-blk devices, and the slot numbers are 9 and 10,
then the device with slot 9 will be recognized as ``/dev/vda``, and
the device with slot 10 will be ``/dev/vdb``.
#. Mount ``/dev/vdx`` to a folder in the UOS, and then you can access it.
Successful booting of the User OS verifies the correctness of the
device.

View File

@@ -0,0 +1,184 @@
.. _virtio-console:
Virtio-console
##############
The Virtio-console is a simple device for data input and output. The
console's virtio device ID is ``3`` and can have from 1 to 16 ports.
Each port has a pair of input and output virtqueues used to communicate
information between the Front End (FE) and Back end (BE) drivers.
Currently the size of each virtqueue is 64 (configurable in the source
code). The FE driver will place empty buffers for incoming data onto
the receiving virtqueue, and enqueue outgoing characters onto the
transmitting virtqueue.
A Virtio-console device has a pair of control IO virtqueues as well. The
control virtqueues are used to communicate information between the
device and the driver, including: ports being opened and closed on
either side of the connection, indication from the host about whether a
particular port is a console port, adding new ports, port
hot-plug/unplug, indication from the guest about whether a port or a
device was successfully added, or a port opened or closed.
The virtio-console architecture diagram in ACRN is shown below.
.. figure:: images/virtio-console-arch.png
:align: center
:width: 700px
:name: virtio-console-arch
Virtio-console architecture diagram
Virtio-console is implemented as a virtio legacy device in the ACRN device
model (DM), and is registered as a PCI virtio device to the guest OS. No changes
are required in the frontend Linux virtio-console except that the guest
(UOS) kernel should be built with ``CONFIG_VIRTIO_CONSOLE=y``.
Currently the feature bits supported by the BE device are:
.. list-table:: Feature bits supported by BE drivers
:widths: 30 50
:header-rows: 0
* - VTCON_F_SIZE(bit 0)
- configuration columns and rows are valid.
* - VTCON_F_MULTIPORT(bit 1)
- device supports multiple ports, and control virtqueues will be used.
* - VTCON_F_EMERG_WRITE(bit 2)
- device supports emergency write.
Virtio-console supports redirecting guest output to various backend
devices. Currently the following backend devices are supported in ACRN
device model: STDIO, TTY, PTY and regular file.
The device model configuration command syntax for virtio-console is::
virtio-console,[@]stdio|tty|pty|file:portname[=portpath]\
[,[@]stdio|tty|pty|file:portname[=portpath]]
- Prefixing a port with ``@`` marks it as a console port; otherwise it is a
  normal virtio serial port
- The ``portpath`` can be omitted when the backend is stdio or pty
- The stdio/tty/pty backends are tty capable, which means :kbd:`TAB` and
  :kbd:`BACKSPACE` are supported, as on a regular terminal
- When tty is used, make sure the redirected tty is sleeping
  (e.g., by running the ``sleep 2d`` command) and will not read input from stdin
  before it is used by virtio-console to redirect guest output.
- Claiming multiple virtio serial ports as consoles is supported;
  however, the guest Linux OS will only use one of them, through the
``console=hvcN`` kernel parameter. For example, the following command
defines two backend ports, which are both console ports, but the frontend
driver will only use the second port named ``pty_port`` as its hvc
console (specified by ``console=hvc1`` in the kernel command
line)::
-s n,virtio-console,@tty:tty_port=/dev/pts/0,@pty:pty_port \
-B "root=/dev/vda2 rw rootwait maxcpus=$2 nohpet console=hvc1 console=ttyS0 ..."
Console Backend Use Cases
*************************
The following sections elaborate on each backend.
STDIO
=====
1. Add a pci slot to the device model (``acrn-dm``) command line::
-s n,virtio-console,@stdio:stdio_port
#. Add the ``console`` parameter to the guest OS kernel command line::
console=hvc0
PTY
===
1. Add a pci slot to the device model (``acrn-dm``) command line::
-s n,virtio-console,@pty:pty_port
#. Add the ``console`` parameter to the guest OS kernel command line::
console=hvc0
One line of information, such as shown below, will be printed in the terminal
after ``acrn-dm`` is launched (``/dev/pts/0`` may be different,
depending on your use case):
.. code-block:: console
virt-console backend redirected to /dev/pts/0
#. Use a terminal emulator, such as minicom or screen, to connect to the
tty node::
minicom -D /dev/pts/0
or ::
screen /dev/pts/0
TTY
===
1. Identify your tty that will be used as the UOS console:
- If you're connected to your device over the network via ssh, use
the linux ``tty`` command, and it will report the node (may be
different in your use case)::
/dev/pts/0
Prevent the tty from responding by sleeping::

   sleep 2d
- If you do not have network access to your device, use screen
to create a new tty::
screen
tty
you will see (depending on your use case)::
/dev/pts/0
Prevent the tty from responding by sleeping::
sleep 2d
and detach the tty by pressing :kbd:`CTRL-A` :kbd:`d`.
#. Add a pci slot to the device model (``acrn-dm``) command line
(changing the ``/dev/pts/X`` to match your use case)::
-s n,virtio-console,@tty:tty_port=/dev/pts/X
#. Add the console parameter to the guest OS kernel command line::
console=hvc0
#. Go back to the previous tty. For example, if you're using
``screen``, use::
screen -ls
screen -r <pid_of_your_tty>
FILE
====
The File backend only supports console output to a file (no input).
1. Add a pci slot to the device model (``acrn-dm``) command line,
adjusting the ``</path/to/file>`` to your use case::
-s n,virtio-console,@file:file_port=</path/to/file>
#. Add the console parameter to the guest OS kernel command line::
console=hvc0
.. _virtio-net:
Virtio-net
##########
Virtio-net is the para-virtualization solution used in ACRN for
networking. The ACRN device model emulates virtual NICs for the UOS, and the
frontend virtio network driver in the guest drives the virtual NIC, following
the virtio specification. (Refer to :ref:`introduction` and
:ref:`virtio-hld` for background introductions to ACRN and virtio.)
Here are some notes about Virtio-net support in ACRN:
- Legacy devices are supported; modern devices are not supported
- Two virtqueues are used in virtio-net: the RX queue and the TX queue
- Indirect descriptors are supported
- A TAP backend is supported
- The control queue is not supported
- NIC multi-queue is not supported
Network Virtualization Architecture
***********************************
ACRN's network virtualization architecture is shown below in
:numref:`net-virt-arch`, and illustrates the many necessary network
virtualization components that must cooperate for the UOS to send and
receive data from the outside world.
.. figure:: images/network-virt-arch.png
:align: center
:width: 900px
:name: net-virt-arch
Network Virtualization Architecture
(The green components are parts of the ACRN solution, while the gray
components are parts of the Linux kernel.)
Let's explore these components further.
SOS/UOS Network Stack:
This is the standard Linux TCP/IP stack, currently the most
feature-rich TCP/IP implementation.
virtio-net Frontend Driver:
This is the standard driver in the Linux Kernel for virtual Ethernet
devices. This driver matches devices with PCI vendor ID 0x1AF4 and PCI
Device ID 0x1000 (for legacy devices in our case) or 0x1041 (for modern
devices). The virtual NIC supports two virtqueues, one for transmitting
packets and the other for receiving packets. The frontend driver places
empty buffers into one virtqueue for receiving packets, and enqueues
outgoing packets into another virtqueue for transmission. The size of
each virtqueue is 1024, configurable in the virtio-net backend driver.
ACRN Hypervisor:
The ACRN hypervisor is a type 1 hypervisor, running directly on the
bare-metal hardware, and suitable for a variety of IoT and embedded
device solutions. It fetches and analyzes the guest instructions, puts
the decoded information into the shared page as an IOREQ, and notifies
or interrupts the VHM module in the SOS for processing.
VHM Module:
The Virtio and Hypervisor Service Module (VHM) is a kernel module in the
Service OS (SOS) acting as a middle layer to support the device model
and hypervisor. The VHM forwards a IOREQ to the virtio-net backend
driver for processing.
ACRN Device Model and virtio-net Backend Driver:
The ACRN Device Model (DM) gets an IOREQ from a shared page and calls
the virtio-net backend driver to process the request. The backend driver
receives the data in a shared virtqueue and sends it to the TAP device.
Bridge and Tap Device:
Bridge and Tap are standard virtual network infrastructures. They play
an important role in communication among the SOS, the UOS, and the
outside world.
IGB Driver:
IGB is the physical Network Interface Card (NIC) Linux kernel driver
responsible for sending data to and receiving data from the physical
NIC.
The virtual network card (NIC) is implemented as a virtio legacy device
in the ACRN device model (DM). It is registered as a PCI virtio device
to the guest OS (UOS) and uses the standard virtio-net in the Linux kernel as
its driver (the guest kernel should be built with
``CONFIG_VIRTIO_NET=y``).
The virtio-net backend in DM forwards the data received from the
frontend to the TAP device, then from the TAP device to the bridge, and
finally from the bridge to the physical NIC driver, and vice versa for
returning data from the NIC to the frontend.
ACRN Virtio-Network Calling Stack
*********************************
Various components of ACRN network virtualization are shown in the
architecture diagram in :numref:`net-virt-arch`. In this section,
we will use UOS data transmission (TX) and reception (RX) examples to
explain step-by-step how these components work together to implement
ACRN network virtualization.
Initialization in Device Model
==============================
**virtio_net_init**
- Present a frontend for a virtual PCI-based NIC
- Set up control plane callbacks
- Set up data plane callbacks, including TX and RX
- Set up the tap backend
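
As a rough illustration of the "set up the tap backend" step: a tap backend
typically opens ``/dev/net/tun`` and binds the file descriptor to a named tap
device with the standard ``TUNSETIFF`` ioctl. The sketch below shows that
standard Linux API usage (error handling trimmed); it is not the exact ACRN
device model code.

.. code-block:: c

   /* Minimal sketch: attach to an existing tap device (e.g. acrn_tap0) the
    * way a virtio-net backend typically would. */
   #include <fcntl.h>
   #include <string.h>
   #include <sys/ioctl.h>
   #include <net/if.h>
   #include <linux/if_tun.h>
   #include <unistd.h>

   static int open_tap(const char *name)
   {
       struct ifreq ifr;
       int fd = open("/dev/net/tun", O_RDWR);

       if (fd < 0)
           return -1;

       memset(&ifr, 0, sizeof(ifr));
       ifr.ifr_flags = IFF_TAP | IFF_NO_PI;    /* raw Ethernet frames */
       strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);

       if (ioctl(fd, TUNSETIFF, &ifr) < 0) {   /* bind fd to the tap device */
           close(fd);
           return -1;
       }
       return fd;   /* read()/write() on fd now carry Ethernet frames */
   }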
Initialization in virtio-net Frontend Driver
============================================
**virtio_pci_probe**
- Construct the virtio device using the virtual PCI device and register it
  on the virtio bus
**virtio_dev_probe --> virtnet_probe --> init_vqs**
- Register the network driver
- Set up the shared virtqueues
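
To illustrate how the frontend keeps the shared RX virtqueue stocked with
empty buffers, the kernel-style sketch below uses the standard Linux virtio
API to post one receive buffer and notify the device. It is a simplified
illustration, not the actual ``virtio_net`` driver code.

.. code-block:: c

   #include <linux/errno.h>
   #include <linux/scatterlist.h>
   #include <linux/slab.h>
   #include <linux/virtio.h>

   /* Post one empty receive buffer of 'len' bytes to the RX virtqueue. */
   static int post_rx_buffer(struct virtqueue *rx_vq, unsigned int len)
   {
       struct scatterlist sg;
       void *buf = kmalloc(len, GFP_KERNEL);
       int err;

       if (!buf)
           return -ENOMEM;

       sg_init_one(&sg, buf, len);
       /* hand the empty buffer to the device; it comes back with a packet */
       err = virtqueue_add_inbuf(rx_vq, &sg, 1, buf, GFP_KERNEL);
       if (err)
           kfree(buf);
       else
           virtqueue_kick(rx_vq);   /* notify the backend */
       return err;
   }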
ACRN UOS TX FLOW
================
The following shows the ACRN UOS network TX flow, using TCP as an
example, tracing the flow through each layer:
**UOS TCP Layer**
.. code-block:: c
tcp_sendmsg -->
tcp_sendmsg_locked -->
tcp_push_one -->
tcp_write_xmit -->
tcp_transmit_skb -->
**UOS IP Layer**
.. code-block:: c
ip_queue_xmit -->
ip_local_out -->
__ip_local_out -->
dst_output -->
ip_output -->
ip_finish_output -->
ip_finish_output2 -->
neigh_output -->
neigh_resolve_output -->
**UOS MAC Layer**
.. code-block:: c
dev_queue_xmit -->
__dev_queue_xmit -->
dev_hard_start_xmit -->
xmit_one -->
netdev_start_xmit -->
__netdev_start_xmit -->
**UOS MAC Layer virtio-net Frontend Driver**
.. code-block:: c
start_xmit --> // virtual NIC driver xmit in virtio_net
xmit_skb -->
virtqueue_add_outbuf --> // add out buffer to shared virtqueue
virtqueue_add -->
virtqueue_kick --> // notify the backend
virtqueue_notify -->
vp_notify -->
iowrite16 --> // trap here, HV will first get notified
**ACRN Hypervisor**
.. code-block:: c
vmexit_handler --> // vmexit because VMX_EXIT_REASON_IO_INSTRUCTION
pio_instr_vmexit_handler -->
emulate_io --> // ioreq can't be processed in HV, forward it to VHM
acrn_insert_request_wait -->
fire_vhm_interrupt --> // interrupt SOS, VHM will get notified
**VHM Module**
.. code-block:: c
vhm_intr_handler --> // VHM interrupt handler
tasklet_schedule -->
io_req_tasklet -->
acrn_ioreq_distribute_request --> // ioreq can't be processed in VHM, forward it to device DM
acrn_ioreq_notify_client -->
wake_up_interruptible --> // wake up DM to handle ioreq
**ACRN Device Model / virtio-net Backend Driver**
.. code-block:: c
handle_vmexit -->
vmexit_inout -->
emulate_inout -->
pci_emul_io_handler -->
virtio_pci_write -->
virtio_pci_legacy_write -->
virtio_net_ping_txq --> // start TX thread to process, notify thread return
virtio_net_tx_thread --> // this is TX thread
virtio_net_proctx --> // call corresponding backend (tap) to process
virtio_net_tap_tx -->
writev --> // write data to tap device
**SOS TAP Device Forwarding**
.. code-block:: c
do_writev -->
vfs_writev -->
do_iter_write -->
do_iter_readv_writev -->
call_write_iter -->
tun_chr_write_iter -->
tun_get_user -->
netif_receive_skb -->
netif_receive_skb_internal -->
__netif_receive_skb -->
__netif_receive_skb_core -->
**SOS Bridge Forwarding**
.. code-block:: c
br_handle_frame -->
br_handle_frame_finish -->
br_forward -->
__br_forward -->
br_forward_finish -->
br_dev_queue_push_xmit -->
**SOS MAC Layer**
.. code-block:: c
dev_queue_xmit -->
__dev_queue_xmit -->
dev_hard_start_xmit -->
xmit_one -->
netdev_start_xmit -->
__netdev_start_xmit -->
**SOS MAC Layer IGB Driver**
.. code-block:: c
igb_xmit_frame --> // IGB physical NIC driver xmit function
ACRN UOS RX FLOW
================
The following shows the ACRN UOS network RX flow, using TCP as an example.
Let's start by receiving a device interrupt. (Note that the hypervisor
will first get notified when receiving an interrupt even in passthrough
cases.)
**Hypervisor Interrupt Dispatch**
.. code-block:: c
vmexit_handler --> // vmexit because VMX_EXIT_REASON_EXTERNAL_INTERRUPT
external_interrupt_vmexit_handler -->
dispatch_interrupt -->
common_handler_edge -->
ptdev_interrupt_handler -->
ptdev_enqueue_softirq --> // Interrupt will be delivered in bottom-half softirq
**Hypervisor Interrupt Injection**
.. code-block:: c
do_softirq -->
ptdev_softirq -->
vlapic_intr_msi --> // insert the interrupt into SOS
start_vcpu --> // VM Entry here, will process the pending interrupts
**SOS MAC Layer IGB Driver**
.. code-block:: c
do_IRQ -->
...
igb_msix_ring -->
igb_poll -->
napi_gro_receive -->
napi_skb_finish -->
netif_receive_skb_internal -->
__netif_receive_skb -->
__netif_receive_skb_core -->
**SOS Bridge Forwarding**
.. code-block:: c
br_handle_frame -->
br_handle_frame_finish -->
br_forward -->
__br_forward -->
br_forward_finish -->
br_dev_queue_push_xmit -->
**SOS MAC Layer**
.. code-block:: c
dev_queue_xmit -->
__dev_queue_xmit -->
dev_hard_start_xmit -->
xmit_one -->
netdev_start_xmit -->
__netdev_start_xmit -->
**SOS MAC Layer TAP Driver**
.. code-block:: c
tun_net_xmit --> // Notify and wake up reader process
**ACRN Device Model / virtio-net Backend Driver**
.. code-block:: c
virtio_net_rx_callback --> // the tap fd gets notified and this function is invoked
virtio_net_tap_rx --> // read data from tap, prepare virtqueue, insert interrupt into the UOS
vq_endchains -->
vq_interrupt -->
pci_generate_msi -->
**VHM Module**
.. code-block:: c
vhm_dev_ioctl --> // process the IOCTL and call hypercall to inject interrupt
hcall_inject_msi -->
**ACRN Hypervisor**
.. code-block:: c
vmexit_handler --> // vmexit because VMX_EXIT_REASON_VMCALL
vmcall_vmexit_handler -->
hcall_inject_msi --> // insert interrupt into UOS
vlapic_intr_msi -->
**UOS MAC Layer virtio_net Frontend Driver**
.. code-block:: c
vring_interrupt --> // virtio-net frontend driver interrupt handler
skb_recv_done --> // registered by virtnet_probe-->init_vqs-->virtnet_find_vqs
virtqueue_napi_schedule -->
__napi_schedule -->
virtnet_poll -->
virtnet_receive -->
receive_buf -->
**UOS MAC Layer**
.. code-block:: c
napi_gro_receive -->
napi_skb_finish -->
netif_receive_skb_internal -->
__netif_receive_skb -->
__netif_receive_skb_core -->
**UOS IP Layer**
.. code-block:: c
ip_rcv -->
ip_rcv_finish -->
dst_input -->
ip_local_deliver -->
ip_local_deliver_finish -->
**UOS TCP Layer**
.. code-block:: c
tcp_v4_rcv -->
tcp_v4_do_rcv -->
tcp_rcv_established -->
tcp_data_queue -->
tcp_queue_rcv -->
__skb_queue_tail -->
sk->sk_data_ready --> // application will get notified
How to Use
==========
The network infrastructure shown in :numref:`net-virt-infra` needs to be
prepared in the SOS before we start. We need to create a bridge and at
least one tap device (two tap devices are needed to create a dual
virtual NIC) and attach a physical NIC and tap device to the bridge.
.. figure:: images/network-virt-sos-infrastruct.png
:align: center
:width: 900px
:name: net-virt-infra
Network Infrastructure in SOS
You can use Linux commands (e.g., ``ip`` or ``brctl``) to create this network. In
our case, we use systemd to automatically create the network by default.
You can check the files prefixed with ``50-`` in the SOS directory
``/usr/lib/systemd/network/``:
- `50-acrn.netdev <https://raw.githubusercontent.com/projectacrn/acrn-hypervisor/master/tools/acrnbridge/acrn.netdev>`__
- `50-acrn.network <https://raw.githubusercontent.com/projectacrn/acrn-hypervisor/master/tools/acrnbridge/acrn.network>`__
- `50-acrn_tap0.netdev <https://raw.githubusercontent.com/projectacrn/acrn-hypervisor/master/tools/acrnbridge/acrn_tap0.netdev>`__
- `50-eth.network <https://raw.githubusercontent.com/projectacrn/acrn-hypervisor/master/tools/acrnbridge/eth.network>`__
When the SOS is started, run ``ifconfig`` to show the devices created by
this systemd configuration:
.. code-block:: none
acrn-br0 Link encap:Ethernet HWaddr B2:50:41:FE:F7:A3
inet addr:10.239.154.43 Bcast:10.239.154.255 Mask:255.255.255.0
inet6 addr: fe80::b050:41ff:fefe:f7a3/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:226932 errors:0 dropped:21383 overruns:0 frame:0
TX packets:14816 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:100457754 (95.8 Mb) TX bytes:83481244 (79.6 Mb)
acrn_tap0 Link encap:Ethernet HWaddr F6:A7:7E:52:50:C6
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
enp3s0 Link encap:Ethernet HWaddr 98:4F:EE:14:5B:74
inet6 addr: fe80::9a4f:eeff:fe14:5b74/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:279174 errors:0 dropped:0 overruns:0 frame:0
TX packets:69923 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:107312294 (102.3 Mb) TX bytes:87117507 (83.0 Mb)
Memory:82200000-8227ffff
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:16 errors:0 dropped:0 overruns:0 frame:0
TX packets:16 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:1216 (1.1 Kb) TX bytes:1216 (1.1 Kb)
Run ``brctl show`` to see the bridge ``acrn-br0`` and attached devices:
.. code-block:: none
bridge name bridge id STP enabled interfaces
acrn-br0 8000.b25041fef7a3 no acrn_tap0
enp3s0
Add a pci slot to the device model ``acrn-dm`` command line (the MAC address is
optional):
.. code-block:: none
-s 4,virtio-net,<tap_name>,[mac=<XX:XX:XX:XX:XX:XX>]
When the UOS is launched, run ``ifconfig`` to check the network. ``enp0s4``
is the virtual NIC created by acrn-dm:
.. code-block:: none
enp0s4 Link encap:Ethernet HWaddr 00:16:3E:39:0F:CD
inet addr:10.239.154.186 Bcast:10.239.154.255 Mask:255.255.255.0
inet6 addr: fe80::216:3eff:fe39:fcd/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:140 errors:0 dropped:8 overruns:0 frame:0
TX packets:46 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:110727 (108.1 Kb) TX bytes:4474 (4.3 Kb)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Performance Estimation
======================
We've introduced the network virtualization solution in ACRN, from the
top-level architecture to the detailed TX and RX flows. Currently, the
control plane and data plane are both processed in the ACRN device model,
which may introduce some overhead. This is not a bottleneck for 1000Mbit
NICs or slower, and network bandwidth under virtualization can be very close to
the native bandwidth. For high-speed NICs (e.g., 10Gb or above), it is
necessary to separate the data plane from the control plane. We can use
vhost for acceleration. For most IoT scenarios, processing in user space
is simple and reasonable.
.. _virtio-rnd:
Virtio-rnd
##########
The virtio-rnd entropy device supplies high-quality randomness for guest
use. The virtio device ID of the virtio-rnd device is 4, and it supports
one virtqueue, the size of which is 64, configurable in the source code.
It has no feature bits defined.
When the FE driver requires some random bytes, the BE device will place
bytes of random data onto the virtqueue.
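
A simplified way to picture the BE side: for each request, it reads from the
SOS entropy pool and copies the bytes into the guest-supplied buffer before
returning the descriptor chain and raising an interrupt. The sketch below is
illustrative only; ``fill_guest_buffer`` stands in for the DM virtqueue
plumbing.

.. code-block:: c

   #include <fcntl.h>
   #include <unistd.h>

   /* Copy up to 'len' random bytes from the SOS into the guest buffer;
    * returns the number of bytes provided, or -1 on error. */
   static ssize_t fill_guest_buffer(void *guest_buf, size_t len)
   {
       int fd = open("/dev/random", O_RDONLY | O_NONBLOCK);
       ssize_t n = -1;

       if (fd >= 0) {
           n = read(fd, guest_buf, len);   /* may return fewer bytes */
           close(fd);
       }
       /* the DM then returns the used chain and interrupts the FE */
       return n;
   }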
To launch the virtio-rnd device, add the following option to the device model command line::
-s <slot>,virtio-rnd
To verify it is working correctly in the user OS, use the following
command::
od /dev/random
.. _watchdog-hld:
Watchdog Virtualization in Device Model
#######################################
This document describes the watchdog virtualization implementation in
ACRN device model.
Overview
********
A watchdog is an important hardware component in embedded systems, used
to monitor the system's running status and reset the processor if the
software crashes. In general, hardware watchdogs rely on a piece of
software running on the machine that must "kick" the watchdog device
regularly, say every 10 seconds. If the watchdog doesn't get "kicked"
within 60 seconds, for example, the watchdog device asserts the
RESET line, which results in a hard reboot.
For ACRN, we emulate the watchdog hardware in the Intel 6300ESB chipset
as a PCI device called the 6300ESB watchdog; it is added into the Device
Model following the PCI device framework. The following
:numref:`watchdog-device` shows the watchdog device workflow:
.. figure:: images/watchdog-image2.png
:align: center
:width: 900px
:name: watchdog-device
Watchdog device flow
The DM in the Service OS (SOS) treats the watchdog as a passive device.
It receives read/write commands from the watchdog driver, performs the
actions, and returns. In ACRN, the commands come from the User OS (UOS)
watchdog driver.
UOS watchdog work flow
**********************
When the UOS does a read or write operation on the watchdog device's
registers or memory space (Port IO or Memory map I/O), it will trap into
the hypervisor. The hypervisor delivers the operation to the SOS/DM
through an IPI (inter-processor interrupt) or shared memory, and the DM
dispatches the operation to the watchdog emulation code.
After the DM watchdog finishes emulating the read or write operation, it
then calls ``ioctl`` to the SOS/kernel (``/dev/acrn_vhm``). VHM will call a
hypercall to trap into the hypervisor to tell it the operation is done, and
the hypervisor will set UOS-related VCPU registers and resume UOS so the
UOS watchdog driver will get the return values (or return status).
:numref:`watchdog-workflow` below shows a typical operation flow,
from the UOS to the SOS and back:
.. figure:: images/watchdog-image1.png
:align: center
:width: 900px
:name: watchdog-workflow
Watchdog operation workflow
Implementation in ACRN and how to use it
****************************************
In ACRN, the Intel 6300ESB watchdog device emulation is added into the
DM PCI device tree. Its interface structure is (see
``devicemodel/include/pci_core.h``):
.. code-block:: c
struct pci_vdev_ops pci_ops_wdt = {
.class_name = "wdt-i6300esb",
.vdev_init = pci_wdt_init,
.vdev_deinit = pci_wdt_deinit,
.vdev_cfgwrite = pci_wdt_cfg_write,
.vdev_cfgread = pci_wdt_cfg_read,
.vdev_barwrite = pci_wdt_bar_write,
.vdev_barread = pci_wdt_bar_read
};
All functions follow the ``pci_vdev_ops`` definitions for PCI device
emulation.
The main part in the watchdog emulation is the timer thread. It emulates
the watchdog device timeout management. When it gets the kick action
from the UOS, it resets the timer. If the timer expires before getting a
timely kick action, it will call DM API to reboot that UOS.
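
The timeout logic of that thread can be pictured as follows. This is a
hypothetical sketch, not the ACRN implementation: ``wdt_kick()`` stands in for
the register-write emulation path, and ``reboot_uos()`` for the DM reboot
interface.

.. code-block:: c

   #include <pthread.h>
   #include <stdbool.h>
   #include <time.h>

   extern void reboot_uos(void);          /* illustrative DM call */

   static pthread_mutex_t wdt_lock = PTHREAD_MUTEX_INITIALIZER;
   static pthread_cond_t  wdt_cond = PTHREAD_COND_INITIALIZER;
   static bool            kicked;
   static int             timeout_sec = 60;

   void wdt_kick(void)                    /* called when the UOS kicks */
   {
       pthread_mutex_lock(&wdt_lock);
       kicked = true;
       pthread_cond_signal(&wdt_cond);
       pthread_mutex_unlock(&wdt_lock);
   }

   void *wdt_timer_thread(void *arg)
   {
       struct timespec deadline;

       (void)arg;
       pthread_mutex_lock(&wdt_lock);
       for (;;) {
           clock_gettime(CLOCK_REALTIME, &deadline);
           deadline.tv_sec += timeout_sec;
           kicked = false;

           /* wait for a kick; leave the wait when the deadline passes */
           while (!kicked &&
                  pthread_cond_timedwait(&wdt_cond, &wdt_lock, &deadline) == 0)
               ;

           if (!kicked)                   /* no timely kick: reboot the UOS */
               reboot_uos();
       }
       return NULL;                       /* not reached */
   }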
In the UOS launch script, add ``-s xx,wdt-i6300esb`` to the DM parameters
(where ``xx`` is the virtual PCI BDF number, as with other PCI devices).
Make sure the UOS kernel has the I6300ESB driver enabled: ``CONFIG_I6300ESB_WDT=y``. After the UOS
boots up, the watchdog device will be created as node ``/dev/watchdog``,
and can be used as a normal device file.
Usually the UOS needs a watchdog service (daemon) to run in userland and
kick the watchdog periodically. If something prevents the daemon from
kicking the watchdog, for example if the UOS system hangs, the watchdog
will time out and the DM will reboot the UOS.
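
A minimal daemon of this kind needs only the standard Linux watchdog device
API; the sketch below (with illustrative timeout and kick interval, and
minimal error handling) sets a timeout and then kicks the device in a loop.

.. code-block:: c

   #include <fcntl.h>
   #include <sys/ioctl.h>
   #include <unistd.h>
   #include <linux/watchdog.h>

   int main(void)
   {
       int timeout = 30;                      /* seconds until the DM reboots the UOS */
       int fd = open("/dev/watchdog", O_WRONLY);

       if (fd < 0)
           return 1;

       ioctl(fd, WDIOC_SETTIMEOUT, &timeout); /* program the emulated watchdog */

       for (;;) {
           ioctl(fd, WDIOC_KEEPALIVE, 0);     /* "kick" the watchdog */
           sleep(10);                         /* must stay well below the timeout */
       }
   }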