mirror of
				https://github.com/projectacrn/acrn-hypervisor.git
				synced 2025-10-26 06:43:34 +00:00 
			
		
		
		
	Fix comments referring to wrong hypervisor name Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
		
			
				
	
	
		
			941 lines
		
	
	
		
			34 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
			
		
		
	
	
		
	
	
	
	
	
.. _primer:

Developer Primer
################

This Developer Primer introduces the fundamental components of ACRN and
the virtualization technology used by this open source reference stack.
Code-level documentation and additional details can be found by
consulting the :ref:`acrn_apis` documentation and the `source code in
GitHub`_.

.. _source code in GitHub: https://github.com/projectacrn

The ACRN Hypervisor acts as a host with full control of the processor(s)
and the hardware (physical memory, interrupt management, and I/O). It
provides the User OS with an abstraction of a virtual platform, allowing
the guest to behave as if it were executing directly on a logical
processor.
| 
.. _source tree structure:

Source Tree Structure
*********************

Understanding the ACRN hypervisor and the ACRN device model source tree
structure is helpful for locating the code associated with a particular
hypervisor or device emulation feature.

The ACRN source code (and documentation) are maintained in the
https://github.com/projectacrn/acrn-hypervisor repo, with the
hypervisor, device model, tools, and documentation in their own
folders::

   acrn-hypervisor
   ├─ hypervisor
   ├─ devicemodel
   ├─ tools
   └─ doc

Here's a brief description of each of these source tree folders:

ACRN hypervisor source tree
===========================

**arch/x86/**
  hypervisor architecture, which includes the x86-specific source files
  needed to run the hypervisor, such as CPU, memory, interrupt, and VMX.

**boot/**
  boot code, mainly ACPI-related source files

**bsp/**
  board support package, used to support NUC with UEFI

**common/**
  common source files for the hypervisor, including the VM hypercall
  definitions, VM main loop, and VM software loader

**debug/**
  all debug-related source files, which are not compiled for the
  release version, mainly including console, uart, logmsg, and shell

**include/**
  include files for all public APIs (doxygen comments in these source
  files are used to generate the :ref:`acrn_apis` documentation)

**lib/**
  runtime service libraries

ACRN Device Model source tree
=============================

**arch/x86/**
  architecture-specific source files needed for the device model

**core/**
  ACRN Device Model core logic (main loop, SOS interface, etc.)

**hw/**
  Hardware emulation code, with the following subdirectories:

  **pci/**
     PCI devices, including VBS-Us (Virtio backend drivers in user-space).

  **platform/**
     platform devices such as UART and keyboard.

**include/**
  include files for all public APIs (doxygen comments in these source
  files are used to generate the :ref:`acrn_apis` documentation)

**samples/**
  scripts (included in the Clear Linux build) for setting up the network
  and launching the User OS on the platform.

ACRN Tools source tree
======================

The tools folder holds source code for ACRN-provided tools such as:

acrnlog
  a userland tool to capture the log output from the currently running
  hypervisor, and from the previous run if the hypervisor crashed.

acrnctl
  a utility to create, delete, list, launch, and stop a User OS (UOS).

acrntrace
  a Service OS (SOS) utility to capture trace data and scripts to
  analyze the collected data.
| 
ACRN documentation source tree
==============================

Project ACRN documentation is written using the reStructuredText markup
language (.rst file extension) with Sphinx extensions, and processed
using Sphinx to create a formatted stand-alone website (the one you're
reading now). Developers can view this content either in its raw form as
.rst markup files in the acrn-documentation repo, or generate the HTML
content and view it with a web browser directly on a workstation, which
is useful when contributing documentation to the project.

**api/**
  ReST files for API document generation

**custom-doxygen/**
  Customization files for doxygen-generated HTML output (while
  generated, we currently don't include the doxygen HTML output but do use
  the XML output to feed into the Sphinx-generation process)

**getting_started/**
  ReST files and images for the Getting Started Guide

**howtos/**
  ReST files and images for Technical and Process how-to articles

**images/**
  Image files not specific to a document (logos and such)

**introduction/**
  ReST files and images for the Introduction to Project ACRN

**primer/**
  ReST files and images for the Developer Primer

**scripts/**
  Files used to assist building the documentation set

**static/**
  Sphinx folder for extras added to the generated output (such as custom
  CSS additions)

**_templates/**
  Sphinx configuration updates for the standard read-the-docs templates
  used to format the generated HTML output
| 
CPU virtualization
******************

The ACRN hypervisor uses static partitioning of the physical CPU cores,
providing each User OS a virtualized environment containing at least one
statically assigned physical CPU core. The CPUID features for a
partitioned physical core are the same as the native CPU features. CPU
power management (Cx/Px) is managed by the User OS.

The supported Intel |reg| NUC platform (see :ref:`hardware`) has a CPU
with four cores. The Service OS is assigned one core and the other three
cores are assigned to the User OS. ``XSAVE`` and ``XRSTOR`` instructions
(used to perform a full save/restore of the extended state in the
processor to/from memory) are currently not supported in the User OS
(the kernel boot parameters must specify ``noxsave``). Processor core
sharing among User OSes is planned for a future release.

The following sections introduce CPU virtualization related
concepts and technologies.

Host GDT
========

The ACRN hypervisor initializes the host Global Descriptor Table (GDT),
used to define the characteristics of the various memory areas during
program execution. Code Segment ``CS:0x8`` and Data Segment ``DS:0x10``
are configured as Hypervisor selectors, with their settings in the host
GDT as shown in :numref:`host-gdt`:

.. figure:: images/primer-host-gdt.png
   :align: center
   :name: host-gdt

   Host GDT
| 
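The selector values ``0x8`` and ``0x10`` mentioned above follow the
standard x86 selector layout (descriptor index in bits 15:3, table
indicator in bit 2, requested privilege level in bits 1:0). The sketch
below is a Python illustration only, not ACRN code; the function and
constant names are hypothetical.

```python
def decode_selector(selector: int) -> dict:
    """Split a 16-bit x86 segment selector into its architectural fields."""
    if not 0 <= selector <= 0xFFFF:
        raise ValueError("selector must fit in 16 bits")
    return {
        "index": selector >> 3,                       # descriptor slot in the table
        "table": "LDT" if selector & 0x4 else "GDT",  # TI bit
        "rpl": selector & 0x3,                        # requested privilege level
    }


# Selector values quoted in the text; the constant names are illustrative.
HOST_CS = 0x8
HOST_DS = 0x10

if __name__ == "__main__":
    print(decode_selector(HOST_CS))
    print(decode_selector(HOST_DS))
```

Decoded this way, both selectors refer to the GDT with RPL 0: the code
segment is descriptor 1 and the data segment is descriptor 2.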
 | ||
Host IDT
========

The ACRN hypervisor installs interrupt gates for both exceptions and
interrupt vectors. Because an interrupt gate clears the interrupt flag
on entry, interrupts are automatically disabled while an exception or
interrupt handler runs. The ``HOST_GDT_RING0_CODE_SEL`` selector is used
in the Host IDT table.

Guest SMP Booting
=================

The Bootstrap Processor (BSP) vCPU for the User OS boots into x64 long
mode directly, while the Application Processor (AP) vCPUs boot into
real mode. The virtualized Local Advanced Programmable Interrupt
Controller (vLAPIC) for the User OS in the hypervisor emulates the
INIT/STARTUP signals.

Each AP vCPU belonging to the User OS begins in an infinite loop, waiting
for an INIT signal.  Once the User OS issues a Startup IPI (SIPI) signal
to another vCPU, the vLAPIC traps the request, resets the target vCPU,
and then enters the ``INIT->STARTUP#1->STARTUP#2`` cycle to boot the
vCPUs for the User OS.
| 
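The ``INIT->STARTUP#1->STARTUP#2`` cycle can be pictured as a small
state machine per AP vCPU. The following Python model is a sketch under
stated assumptions, not the vLAPIC implementation: the class, the state
names, and the choice to ignore the second STARTUP are illustrative. The
start address computation (SIPI vector shifted left by 12) is the
architectural x86 convention.

```python
class ApVcpu:
    """Toy model of an AP vCPU reacting to emulated INIT/STARTUP signals."""

    def __init__(self):
        self.state = "WAIT_FOR_INIT"   # hypothetical state names
        self.start_address = None

    def deliver(self, signal, vector=0):
        if signal == "INIT":
            # INIT resets the target vCPU and arms it for a STARTUP IPI.
            self.state = "WAIT_FOR_SIPI"
            self.start_address = None
        elif signal == "SIPI" and self.state == "WAIT_FOR_SIPI":
            # The first STARTUP selects the real-mode entry point.
            self.start_address = vector << 12
            self.state = "RUNNING"
        # Anything else (SIPI before INIT, second SIPI) is ignored here.
        return self.state


if __name__ == "__main__":
    ap = ApVcpu()
    for sig, vec in [("INIT", 0), ("SIPI", 0x9A), ("SIPI", 0x9A)]:
        print(sig, ap.deliver(sig, vec), ap.start_address)
```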
 | ||
VMX configuration
=================

The ACRN hypervisor uses the Virtual Machine Extensions (VMX)
configuration shown in :numref:`VMX_MSR` below. (These configuration
settings may change in the future, according to virtualization
policies.)

.. table:: VMX Configuration
   :align: center
   :widths: auto
   :name: VMX_MSR

   +----------------------------------------+----------------+---------------------------------------+
   | **VMX MSR**                            | **Bits**       | **Description**                       |
   +========================================+================+=======================================+
   | **MSR\_IA32\_VMX\_PINBASED\_CTLS**     | Bit0 set       | Enable External IRQ VM Exit           |
   +                                        +----------------+---------------------------------------+
   |                                        | Bit6 set       | Enable HV pre-40ms Preemption timer   |
   +                                        +----------------+---------------------------------------+
   |                                        | Bit7 clr       | Posted interrupts not supported       |
   +----------------------------------------+----------------+---------------------------------------+
   | **MSR\_IA32\_VMX\_PROCBASED\_CTLS**    | Bit25 set      | Enable I/O bitmap                     |
   +                                        +----------------+---------------------------------------+
   |                                        | Bit28 set      | Enable MSR bitmap                     |
   +                                        +----------------+---------------------------------------+
   |                                        | Bit19,20 set   | Enable CR8 store/load                 |
   +----------------------------------------+----------------+---------------------------------------+
   | **MSR\_IA32\_VMX\_PROCBASED\_CTLS2**   | Bit1 set       | Enable EPT                            |
   +                                        +----------------+---------------------------------------+
   |                                        | Bit7 set       | Allow guest real mode                 |
   +----------------------------------------+----------------+---------------------------------------+
   | **MSR\_IA32\_VMX\_EXIT\_CTLS**         | Bit15          | VMX Exit auto ack vector              |
   +                                        +----------------+---------------------------------------+
   |                                        | Bit18,19       | MSR IA32\_PAT save/load               |
   +                                        +----------------+---------------------------------------+
   |                                        | Bit20,21       | MSR IA32\_EFER save/load              |
   +                                        +----------------+---------------------------------------+
   |                                        | Bit9           | 64-bit mode after VM Exit             |
   +----------------------------------------+----------------+---------------------------------------+
| 
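To make the bit positions in the table concrete, the sketch below turns
each row into a bit mask. It is a Python illustration, not hypervisor
code: the helper and constant names are hypothetical, and a real
implementation would also have to honor the allowed settings reported by
the VMX capability MSRs, which this sketch does not model.

```python
def bits(*positions):
    """Return an integer with the given bit positions set."""
    mask = 0
    for position in positions:
        mask |= 1 << position
    return mask


# Bit positions copied from the table above; names are illustrative.
PINBASED_SET = bits(0, 6)              # external IRQ exit, preemption timer
PINBASED_CLEAR = bits(7)               # posted interrupts stay off
PROCBASED_SET = bits(19, 20, 25, 28)   # CR8 load/store, I/O and MSR bitmaps
PROCBASED2_SET = bits(1, 7)            # EPT, guest real mode
EXIT_SET = bits(9, 15, 18, 19, 20, 21)


def apply_controls(value, set_mask, clear_mask=0):
    """Force the requested bits on and off in a control word."""
    return (value | set_mask) & ~clear_mask


if __name__ == "__main__":
    print(hex(apply_controls(0, PINBASED_SET, PINBASED_CLEAR)))
    print(hex(PROCBASED_SET), hex(PROCBASED2_SET), hex(EXIT_SET))
```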
 | ||
CPUID and Guest TSC calibration
===============================

User OS accesses to CPUID are trapped by the ACRN hypervisor. The
hypervisor passes most of the native CPUID information through to the
guest, except for CPUID leaf 0x1, which is virtualized (to provide a
fake x86_model).

The Time Stamp Counter (TSC) is a 64-bit register present on all x86
processors that counts the number of cycles since reset. ACRN hypervisor
also virtualizes ``MSR_PLATFORM_INFO`` and ``MSR_ATOM_FSB_FREQ``.

RDTSC/RDTSCP
============

User OS vCPU reads of ``RDTSC``, ``RDTSCP``, or ``MSR_IA32_TSC_AUX``
do not cause a VM Exit to the hypervisor. Thus the vCPU ID provided by
``MSR_IA32_TSC_AUX`` can be changed by the User OS.

The ``RDTSCP`` instruction is widely used by the ACRN hypervisor to
identify the current CPU (and read the current value of the processor's
time-stamp counter). Because accesses to the ``MSR_IA32_TSC_AUX``
register do not cause a VM Exit, the hypervisor saves and restores
the ``MSR_IA32_TSC_AUX`` value on every VM Exit and VM Enter. Before the
hypervisor restores the host CPU ID, it must not use the ``RDTSCP``
instruction, because it would return the vCPU ID instead of the host
CPU ID.
| 
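The save/restore ordering described above can be modeled in a few lines.
This is a Python sketch of the idea only; the classes and function names
are hypothetical and do not correspond to ACRN symbols.

```python
class PhysicalCpu:
    """Holds the value that RDTSCP would currently report in TSC_AUX."""

    def __init__(self, host_cpu_id):
        self.host_cpu_id = host_cpu_id
        self.tsc_aux = host_cpu_id


class Vcpu:
    """Per-vCPU copy of the guest's TSC_AUX value."""

    def __init__(self, vcpu_id):
        self.saved_tsc_aux = vcpu_id


def vm_enter(pcpu, vcpu):
    # Load the guest value just before running the guest.
    pcpu.tsc_aux = vcpu.saved_tsc_aux


def vm_exit(pcpu, vcpu):
    # The guest may have rewritten TSC_AUX without trapping, so save it first.
    vcpu.saved_tsc_aux = pcpu.tsc_aux
    # Only after this restore does RDTSCP report the host CPU ID again.
    pcpu.tsc_aux = pcpu.host_cpu_id


if __name__ == "__main__":
    pcpu, vcpu = PhysicalCpu(3), Vcpu(0)
    vm_enter(pcpu, vcpu)
    pcpu.tsc_aux = 7          # guest writes the MSR directly
    vm_exit(pcpu, vcpu)
    print(pcpu.tsc_aux, vcpu.saved_tsc_aux)
```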
 | ||
CR Register virtualization
==========================

Guest CR8 access causes a VM Exit and is emulated in the
hypervisor so that the vLAPIC can update its PPR register. Guest access
to CR3 does not cause a VM Exit.

MSR BITMAP
==========

In the ACRN hypervisor, only these model-specific registers (MSRs) are
supported:

**MSR_IA32_TSC_DEADLINE**
  emulates guest TSC deadline timer programming

**MSR_PLATFORM_INFO**
  emulates a fake x86 model

**MSR_ATOM_FSB_FREQ**
  provides the CPU frequency directly via this MSR to avoid TSC calibration

I/O BITMAP
==========

All User OS I/O port accesses are trapped into the ACRN hypervisor by
default. Most of the Service OS I/O port accesses are not trapped into
the ACRN hypervisor, allowing the Service OS direct access to the
hardware port.

The Service OS I/O trap policy is:

**0x3F8/0x3FC**
  trapped for the vUART emulated inside the hypervisor (SOS only)

**0x20/0xA0/0x460**
  trapped for vPIC emulation in the hypervisor

**0xCF8/0xCFC**
  trapped for hypervisor PCI device interception
| 
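A VMX I/O bitmap holds one bit per I/O port, and a set bit means that
access to the port causes a VM Exit. The Python sketch below applies the
trap policy listed above to such a bitmap. It is an illustration, not
ACRN code: the helper names are hypothetical, the two 4 KB bitmap pages
are modeled as one 8 KB array, and each listed value is treated as a
single port number, which is an assumption about how to read the list.

```python
IO_BITMAP_BYTES = 0x10000 // 8   # 64K ports, one bit each


def new_bitmap(trap_all):
    """Create a bitmap that traps every port or no port."""
    fill = 0xFF if trap_all else 0x00
    return bytearray([fill]) * IO_BITMAP_BYTES


def trap_port(bitmap, port):
    bitmap[port >> 3] |= 1 << (port & 7)


def is_trapped(bitmap, port):
    return bool(bitmap[port >> 3] & (1 << (port & 7)))


# Port numbers from the SOS trap policy above.
SOS_TRAPPED_PORTS = [0x3F8, 0x3FC, 0x20, 0xA0, 0x460, 0xCF8, 0xCFC]

uos_bitmap = new_bitmap(trap_all=True)    # User OS: trap everything
sos_bitmap = new_bitmap(trap_all=False)   # Service OS: trap only the list
for port in SOS_TRAPPED_PORTS:
    trap_port(sos_bitmap, port)

if __name__ == "__main__":
    for port in (0x3F8, 0x80, 0xCFC):
        print(hex(port), is_trapped(sos_bitmap, port), is_trapped(uos_bitmap, port))
```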
 | ||
Exceptions
==========

The User OS handles its exceptions inside the VM, including page fault,
GP, etc. #MC and #DB exceptions cause a VM Exit to the ACRN hypervisor
console.

Memory virtualization
*********************

ACRN hypervisor provides memory virtualization by using a static
partition of system memory. Each virtual machine owns its own contiguous
partition of memory, with the Service OS staying in lower memory and the
User OS instances in high memory. (High memory is memory which is not
permanently mapped in the kernel address space, while Low Memory is
always mapped, so you can access it in the kernel simply by
dereferencing a pointer.) In future implementations, this will evolve to
utilize EPT/VT-d.

ACRN hypervisor memory is not visible to any User OS. In the ACRN
hypervisor, there are a few memory accesses that need to work
efficiently:

- ACRN hypervisor to access host memory
- vCPU per VM to access guest memory
- vCPU per VM to access host memory
- vCPU per VM to access MMIO memory

The rest of this section introduces how these kinds of memory accesses
are managed.  It gives an overview of physical memory layout,
Paravirtualization (MMU) memory mapping in the hypervisor and VMs, and
Host-Guest Extended Page Table (EPT) memory mapping for each VM.

Physical Memory Layout
======================

An example physical memory layout for the Service OS and User OS is
shown in :numref:`primer-mem-layout` below:

.. figure:: images/primer-mem-layout.png
   :align: center
   :name: primer-mem-layout

   Memory Layout

In :numref:`primer-mem-layout`, the Service OS accepts the whole e820
table (all usable memory address ranges not reserved for use by the
BIOS) after the hypervisor memory has been filtered out. From the SOS's
point of view, it takes control of all available physical memory not
used by the hypervisor (or BIOS), including User OS memory. Each User
OS's memory is allocated from (High) SOS memory, and the User OS
controls only this section of memory.

Some of the physical memory of a 32-bit machine needs to be sacrificed
by making it hidden so memory-mapped I/O (MMIO) devices have room to
communicate. This creates an MMIO hole. VMs may access some ranges of
MMIO addresses directly to communicate with devices, or they may need
the hypervisor to trap some ranges of MMIO to do device emulation. This
access control is done through EPT mapping.
| 
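Filtering the hypervisor memory out of the e820 table before handing it
to the Service OS amounts to cutting one address range out of the usable
entries. The sketch below shows that operation in Python under stated
assumptions: the tuple layout, the ``"usable"`` type label, the function
name, and the example addresses are all hypothetical and are not taken
from ACRN.

```python
def filter_e820(entries, hv_start, hv_end):
    """Remove [hv_start, hv_end) from the usable entries.

    entries is a list of (start, end, kind) tuples with end exclusive.
    """
    result = []
    for start, end, kind in entries:
        if kind != "usable" or end <= hv_start or start >= hv_end:
            result.append((start, end, kind))   # untouched entry
            continue
        if start < hv_start:
            result.append((start, hv_start, kind))   # part below the hole
        if end > hv_end:
            result.append((hv_end, end, kind))       # part above the hole
    return result


if __name__ == "__main__":
    table = [
        (0x0, 0x9F000, "usable"),
        (0x100000, 0x80000000, "usable"),
        (0xFEC00000, 0xFEC01000, "reserved"),
    ]
    for entry in filter_e820(table, 0x6000000, 0x7000000):
        print([hex(entry[0]), hex(entry[1]), entry[2]])
```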
 | ||
PV (MMU) Memory Mapping in the Hypervisor
=========================================

.. figure:: images/primer-pv-mapping.png
   :align: center
   :name: primer-pv-mapping

   ACRN Hypervisor PV Mapping Example

The ACRN hypervisor is trusted and can access and control all system
memory, as shown in :numref:`primer-pv-mapping`. Because the hypervisor
is running in protected mode, an MMU page table must be prepared for its
PV translation. To simplify things, the PV translation page table is set
as a 1:1 mapping.  Some MMIO range mappings could be removed if they are
not needed. This PV page table is created when the hypervisor memory is
first initialized.

PV (MMU) Memory Mapping in VMs
==============================

As mentioned earlier, the Primary vCPU starts to run in protected mode
when its VM is started. But before it begins, a temporary PV (MMU) page
table must be prepared.

This page table is a 1:1 mapping for 4 GB, and only lives for a short
time when the vCPU first runs. After the vCPU starts to run its kernel
image (for example Linux\*), the kernel creates its own PV page
tables, after which the temporary page table is no longer used.

Host-Guest (EPT) Memory Mapping
===============================

The VMs (both SOS and UOS) need an Extended Page Table (EPT) to
access the host physical memory based on their guest physical memory.
The guest VMs also need MMIO traps that trigger EPT violations for
device emulation (such as IOAPIC and LAPIC).  This memory layout is
shown in :numref:`primer-sos-ept-mapping`:

.. figure:: images/primer-sos-ept-mapping.png
   :align: center
   :name: primer-sos-ept-mapping

   SOS EPT Mapping Example

The SOS takes control of all the host physical memory space: its EPT
mapping covers almost all of the host memory except that reserved for
the hypervisor (HV) and a few MMIO trap ranges for IOAPIC & LAPIC
emulation. The guest to host mapping for SOS is 1:1.

.. figure:: images/primer-uos-ept-mapping.png
   :align: center
   :name: primer-uos-ept-mapping

   UOS EPT Mapping Example

However, for the UOS, its memory EPT mapping is linear but with an
offset (as shown in :numref:`primer-uos-ept-mapping`).  The MMIO hole is
left unmapped so that all MMIO accesses from the UOS are trapped (and
then emulated in the device model). To support pass-through devices in
the future, some MMIO range mappings may be added.
| 
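The difference between the two mappings (1:1 for the SOS, linear with an
offset for the UOS, with unmapped ranges raising an EPT violation) can
be sketched as a translation function. This is a Python model for
illustration only: the memory sizes, the UOS offset, and all names are
hypothetical, and only the IOAPIC range comes from this document.

```python
class EptViolation(Exception):
    """Raised when a guest physical address has no EPT mapping."""


def make_linear_ept(guest_ram_size, host_offset, unmapped=()):
    """Build a GPA -> HPA translator for a linear mapping with holes."""

    def translate(gpa):
        for start, end in unmapped:
            if start <= gpa < end:
                raise EptViolation(hex(gpa))   # trapped for emulation
        if not 0 <= gpa < guest_ram_size:
            raise EptViolation(hex(gpa))       # outside mapped memory
        return gpa + host_offset

    return translate


# SOS: 1:1 mapping with the vIOAPIC range left unmapped (sizes hypothetical).
sos_translate = make_linear_ept(
    guest_ram_size=0x1_0000_0000,
    host_offset=0,
    unmapped=[(0xFEC00000, 0xFEC01000)],
)

# UOS: 2 GB of guest memory placed at a hypothetical host offset.
uos_translate = make_linear_ept(
    guest_ram_size=0x8000_0000,
    host_offset=0x1_0000_0000,
)

if __name__ == "__main__":
    print(hex(sos_translate(0x1000)), hex(uos_translate(0x1000)))
```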
 | ||
.. _Graphic_mediation:

Graphic mediation
*****************

Intel |reg| Graphics Virtualization Technology -g (Intel |reg| GVT-g)
provides GPU sharing capability to multiple VMs by using a mediated
pass-through technique. This allows a VM to access performance-critical
I/O resources (usually partitioned) directly, without intervention from
the hypervisor in most cases.

Privileged operations from this VM are trap-and-emulated to provide
secure isolation among VMs. The Hypervisor must ensure that no
vulnerability is exposed when assigning performance-critical resources
to each VM. When a performance-critical resource cannot be partitioned,
a scheduler must be implemented (either in software or hardware) to
allow time-based sharing among multiple VMs. In this case, the device
must allow the hypervisor to save and restore the hardware state
associated with the shared resource, either through direct I/O register
read/write (when there is no software invisible state) or through a
device-specific context save/restore mechanism (where there is a
software invisible state).

In the initial release of Project ACRN, graphic mediation is not
enabled; it is planned for a future release.

I/O emulation
*************

The I/O path is explained in the :ref:`ACRN-io-mediator` section of the
:ref:`introduction`.  The following sections provide additional detail
on device assignment management and the PIO/MMIO trap flow.

Device Assignment Management
============================

The ACRN hypervisor provides the major device assignment management
functions. Since the hypervisor owns all native vectors and IRQs, there
must be a mapping table to translate each Guest IRQ/Vector to a Host
IRQ/Vector. Currently we assign all devices to VM0 except the UART.

If a PCI device (with MSI/MSI-X) is assigned to a Guest, the guest
programs the PCI config space and sets the guest vector for this
device. A hypercall, ``HC_VM_PCI_MSIX_REMAP``, is provided. Once the
guest programs the guest vector, the Service OS may call this hypercall
to notify the ACRN hypervisor. The hypervisor allocates a host vector,
creates a guest-host mapping relation, and replaces the guest vector
with a real native vector for the device:

**PCI MSI/MSI-X**
  Remapping of PCI Message Signaled Interrupts (MSI/MSI-X) from
  devices can be triggered from a hypercall when a guest programs
  vectors. All PCI devices are programmed with real vectors
  allocated by the Hypervisor.

**PCI/INTx**
  Device assignment is triggered when the guest programs
  the virtual I/O Advanced Programmable Interrupt Controller
  (vIOAPIC) Redirection Table Entries (RTE).

**Legacy**
  Legacy devices are assigned to VM0.

User OS device assignment is similar to the above, except the User OS
doesn't issue the hypercall. Instead, guest writes to the PCI
configuration space are trapped into the Device Model, and the Device
Model may issue a hypercall to notify the hypervisor that the guest
vector is changing.
| 
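The guest-to-host vector mapping table described above can be sketched
as a small allocator. The Python below is a model for illustration and
is not the ``HC_VM_PCI_MSIX_REMAP`` implementation: the class, the
method names, and the host vector range are hypothetical.

```python
class VectorRemapper:
    """Toy guest-vector to host-vector mapping table."""

    def __init__(self, first_host_vector=0x30, last_host_vector=0xEF):
        # Hypothetical pool of host vectors owned by the hypervisor.
        self._free = list(range(first_host_vector, last_host_vector + 1))
        self._map = {}   # (vm_id, guest_vector) -> host_vector

    def remap(self, vm_id, guest_vector):
        """Return the host vector for a guest vector, allocating if needed."""
        key = (vm_id, guest_vector)
        if key not in self._map:
            if not self._free:
                raise RuntimeError("no free host vectors")
            self._map[key] = self._free.pop(0)
        return self._map[key]

    def guest_vector_for(self, host_vector):
        """Reverse lookup used when a physical interrupt arrives."""
        for (vm_id, guest_vector), mapped in self._map.items():
            if mapped == host_vector:
                return vm_id, guest_vector
        return None


if __name__ == "__main__":
    table = VectorRemapper()
    print(hex(table.remap(0, 0x41)), hex(table.remap(1, 0x41)))
    print(table.guest_vector_for(0x31))
```

Keeping the VM identifier in the key lets two guests program the same
guest vector while the device is still given distinct native vectors.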
 | ||
Currently, two types of I/O emulation are supported: MMIO and
PORTIO trap handling. MMIO emulation is triggered by an EPT violation
VM Exit only. If an EPT misconfiguration VM Exit occurs, the
hypervisor halts the system. (Because the hypervisor sets up all EPT
page table mappings at the beginning of the Guest boot, there should not
be an EPT misconfiguration.)

There are multiple places where I/O emulation can happen: in the ACRN
hypervisor, in the Service OS Kernel VHM module, or in the Service OS
Userland ACRN Device Model.

PIO/MMIO trap Flow
==================

Here is a description of the PIO/MMIO trap flow:

#. Instruction decoder: get the Guest Physical Address (GPA) from the VM
   Exit, going through the gla2gpa() page walker if necessary.

#. Emulate the instruction. Here the hypervisor performs an address
   range check to see if the hypervisor is interested in this I/O
   port or MMIO GPA access.

#. The hypervisor emulates only the vLAPIC, vIOAPIC, vPIC, and vUART
   (the vUART for the Service OS only). Any other emulation requests are
   forwarded to the SOS for handling. The vCPU raising the I/O request
   halts until this I/O request is processed successfully. An IPI is
   sent to vCPU0 of the SOS to notify it that there is an I/O request
   waiting for service.

#. The Service OS VHM module takes the I/O request and dispatches the
   request to multiple clients. These clients could be SOS kernel space
   VBS-K, MPT, or the User-land Device Model. The VHM I/O request server
   selects a default fallback client responsible for handling any I/O
   request not handled by other clients. (The Device Model is the
   default fallback client.) Each client needs to register its I/O
   range or specific PCI bus/device/function (BDF) numbers. If an I/O
   request falls into the client range, the I/O request server
   sends the request to that client.

#. Multiple clients: fallback client (Device Model in user-land),
   VBS-K client, MPT client.
   Once the I/O request emulation completes, the client updates the
   request status and notifies the hypervisor by a hypercall.
   The hypervisor picks up that request, does any necessary cleanup,
   and resumes the Guest vCPU.

Most I/O emulation tasks are done by the SOS CPU, and requests come from
UOS vCPUs.
| 
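The client selection in the VHM I/O request server (registered ranges
first, then the fallback client) can be sketched as follows. This Python
model is an illustration only: the class, the client labels, and the
example address range are hypothetical, and BDF-based registration is
not modeled.

```python
class IoRequestServer:
    """Toy model of range-based I/O request dispatch with a fallback."""

    def __init__(self, fallback):
        self._fallback = fallback
        self._clients = []   # list of (start, end, name), end exclusive

    def register(self, name, start, end):
        """Register a client for the address range [start, end)."""
        self._clients.append((start, end, name))

    def dispatch(self, address):
        """Return the name of the client that should handle the request."""
        for start, end, name in self._clients:
            if start <= address < end:
                return name
        return self._fallback


if __name__ == "__main__":
    server = IoRequestServer(fallback="device-model")
    server.register("vbs-k", 0x1000, 0x1100)
    print(server.dispatch(0x1040), server.dispatch(0x2000))
```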
 | ||
Virtual interrupt
*****************

All interrupts received by the User OS come from a virtual interrupt
injected by a vLAPIC, vIOAPIC, or vPIC. All device emulation is
done inside the SOS Userspace device model. However, for performance
reasons, the vLAPIC, vIOAPIC, and vPIC devices are emulated inside the
ACRN hypervisor directly. From the guest point of view, the vPIC uses
Virtual Wire Mode via the vIOAPIC.

The symmetric I/O Mode is shown in :numref:`primer-symmetric-io`:

.. figure:: images/primer-symmetric-io.png
   :align: center
   :name: primer-symmetric-io

   Symmetric I/O Mode

**Kernel boot param with vPIC**
  add ``maxcpus=0`` to the User OS kernel command line to use the PIC

**Kernel boot param with vIOAPIC**
  add ``maxcpus=1`` (or any value other than ``0``) and the User OS will
  use the IOAPIC. Keep IOAPIC pin 2 as the source of the PIC.
| 
Virtual LAPIC
=============

The LAPIC (Local Advanced Programmable Interrupt Controller) is
virtualized for the SOS and the UOS. The vLAPIC is currently emulated by
a Guest MMIO trap on the GPA address range 0xFEE00000 - 0xFEF00000
(1 MB). ACRN hypervisor will support APICv and posted interrupts in a
future release.

The vLAPIC provides the same features as a native LAPIC:

- Mask/Unmask vectors
- Inject virtual vectors (Level or Edge trigger mode) to vCPU
- Notify vIOAPIC of EOI processing
- Provide TSC Timer service
- Support CR8 accesses to update the TPR
- INIT/STARTUP handling

Virtual IOAPIC
==============

A vIOAPIC is emulated by the hypervisor when the Guest accesses the MMIO
GPA range 0xFEC00000 - 0xFEC01000. The vIOAPIC for the SOS matches the
pin numbers of the native HW IOAPIC. The vIOAPIC for the UOS provides
only 24 pins. When a vIOAPIC pin is asserted, the vIOAPIC calls
vLAPIC APIs to inject the vector to the Guest.
| 
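The hand-off from a vIOAPIC pin to the vLAPIC can be pictured with a
small model. In an IOAPIC Redirection Table Entry, bits 7:0 hold the
vector and bit 16 is the mask bit; the sketch below uses only those two
fields. It is a Python illustration, not ACRN code: the class and method
names are hypothetical, and starting with every pin masked is an
assumption of this model.

```python
UOS_VIOAPIC_PINS = 24      # pin count for the UOS, from the text above
RTE_MASK_BIT = 1 << 16     # interrupt mask bit in an RTE
RTE_VECTOR_MASK = 0xFF     # vector field in an RTE


class VLapic:
    """Records vectors handed over for injection."""

    def __init__(self):
        self.pending = []

    def inject(self, vector):
        self.pending.append(vector)


class VIoapic:
    """Toy vIOAPIC that forwards unmasked pin assertions to a vLAPIC."""

    def __init__(self, vlapic, pins=UOS_VIOAPIC_PINS):
        self.vlapic = vlapic
        self.rte = [RTE_MASK_BIT] * pins   # assumption: all pins start masked

    def write_rte(self, pin, value):
        self.rte[pin] = value

    def assert_pin(self, pin):
        entry = self.rte[pin]
        if entry & RTE_MASK_BIT:
            return False                   # masked: nothing is injected
        self.vlapic.inject(entry & RTE_VECTOR_MASK)
        return True


if __name__ == "__main__":
    lapic = VLapic()
    ioapic = VIoapic(lapic)
    ioapic.write_rte(4, 0x31)
    print(ioapic.assert_pin(4), [hex(v) for v in lapic.pending])
```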
 | ||
Virtual PIC
===========

A vPIC is required for TSC calculation. Normally the UOS boots with a
vIOAPIC. A vPIC is a source of external interrupts to the Guest. On
every VM Exit, the hypervisor checks if there are pending external PIC
interrupts.

Virtual Interrupt Injection
===========================

Virtual interrupts come from either the Device Model or from assigned
devices:

**SOS assigned devices**
  Because all devices are assigned to the SOS directly, whenever a
  device's physical interrupt arrives, the corresponding virtual
  interrupt is injected into the SOS via the vLAPIC/vIOAPIC.  In this
  case, the SOS doesn't use the vPIC and does not have emulated devices.

**UOS assigned devices**
  Only PCI devices are assigned to the UOS, and virtual interrupt
  injection works the same way as for the SOS. A virtual interrupt
  injection operation is triggered when a device's physical interrupt is
  triggered.

**UOS emulated devices**
  The Device Model (in user-land) is responsible for the interrupt
  lifecycle management of UOS emulated devices. The Device Model knows
  when an emulated device needs to assert a virtual IOAPIC/PIC pin or
  needs to send a virtual MSI vector to the Guest. This logic is
  entirely handled by the Device Model.

:numref:`primer-hypervisor-interrupt` shows how the hypervisor handles
interrupt processing and pending interrupts (acrn_do_intr_process):

.. figure:: images/primer-hypervisor-interrupt.png
   :align: center
   :name: primer-hypervisor-interrupt

   Hypervisor Interrupt handler

There are many cases where the Guest RFLAGS.IF is cleared and interrupts
are disabled. The hypervisor checks whether the Guest IRQ window is
available before injection. An NMI is not maskable, so it is injected
regardless of the guest IRQ window status. If the IRQ window is
currently not available, the hypervisor enables
``MSR_IA32_VMX_PROCBASED_CTLS_IRQ_WIN`` (PROCBASED_CTRL.bit[2]) and
performs a VM Enter directly. The injection is done on the next VM Exit,
once the Guest issues STI (Guest RFLAGS.IF=1).
| 
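The injection decision described above reduces to a small function of
the interrupt type and the guest interrupt flag. The Python sketch below
models that decision only; the function name, the string labels, and the
return convention are hypothetical, while bit 2 of the primary
processor-based controls is the interrupt-window exiting control named
in the text.

```python
IRQ_WINDOW_EXITING = 1 << 2   # PROCBASED_CTRL.bit[2]


def handle_pending_interrupt(kind, guest_if, controls):
    """Decide what to do with a pending virtual interrupt.

    Returns (action, new_controls), where action is "inject" or "wait".
    """
    if kind == "NMI":
        # NMIs do not depend on the guest IRQ window.
        return "inject", controls
    if guest_if:
        # Window is open: inject now and stop asking for window exits.
        return "inject", controls & ~IRQ_WINDOW_EXITING
    # Window is closed: request a VM Exit as soon as the guest opens it.
    return "wait", controls | IRQ_WINDOW_EXITING


if __name__ == "__main__":
    print(handle_pending_interrupt("IRQ", False, 0))
    print(handle_pending_interrupt("IRQ", True, IRQ_WINDOW_EXITING))
    print(handle_pending_interrupt("NMI", False, 0))
```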
 | ||
| VT-x and VT-d
 | ||
| *************
 | ||
| 
 | ||
| Since 2006, Intel CPUs have supported hardware assist - VT-x
 | ||
| instructions, where the CPU itself traps specific guest instructions and
 | ||
| register accesses directly into the VMM without need for binary
 | ||
| translation (and modification) of the guest operating system. Guest
 | ||
| operating systems can be run natively without modification, although it
 | ||
| is common to still install virtualization-aware para-virtualized drivers
 | ||
| into the guests to improve functionality. One common example is access
 | ||
| to storage via emulated SCSI devices.
 | ||
| 
 | ||
| Intel CPUs and chipsets support various Virtualization Technology (VT)
 | ||
| features - such as VT-x and VT-d. Physical events on the platform, such
 | ||
| as physical device interrupts, trigger CPU **VM Exits** (traps into the
 | ||
| VMM) so that the hypervisor can handle them.
 | ||
| 
 | ||
| In the ACRN hypervisor design, VT-d can be used for DMA remapping,
 | ||
| which provides address translation and isolation.
 | ||
| :numref:`primer-dma-address-mapping` is an example of address
 | ||
| translation:
 | ||
| 
 | ||
| .. figure:: images/primer-dma-address-mapping.png
 | ||
|    :align: center
 | ||
|    :name: primer-dma-address-mapping
 | ||
| 
 | ||
|    DMA address mapping
 | ||
| 
 | ||
| Hypercall
 | ||
| *********
 | ||
| 
 | ||
| The ACRN hypervisor currently supports fewer than a dozen
 | ||
| :ref:`hypercall_apis` and VHM upcall APIs to support the necessary VM
 | ||
| management, IO request distribution and guest memory mappings. The
 | ||
| hypervisor and Service OS (SOS) reserve vector 0xF4 for hypervisor
 | ||
| notification to the SOS. This upcall is necessary whenever device
 | ||
| emulation is required by the SOS.  The upcall vector 0xF4 is injected to
 | ||
| SOS vCPU0.
 | ||
| 
 | ||
| Refer to the :ref:`acrn_apis` documentation for details.
 | ||
| 
 | ||
| Device emulation
 | ||
| ****************
 | ||
| 
 | ||
| The ACRN Device Model emulates different kinds of platform devices, such as
 | ||
| RTC, LPC, UART, PCI devices, and Virtio block devices. The most important
 | ||
| part of device emulation is handling I/O requests for the different
 | ||
| devices. An I/O request can be a PIO, MMIO, or PCI CFG SPACE access. For
 | ||
| example:
 | ||
| 
 | ||
| - a CMOS RTC device may access 0x70/0x71 PIO to get the CMOS time,
 | ||
| - a GPU PCI device may access its MMIO or PIO BAR space to complete
 | ||
|   its frame buffer rendering, or
 | ||
| - the bootloader may access PCI devices' CFG
 | ||
|   SPACE for BAR reprogramming.
 | ||
| 
 | ||
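The three request types listed above can be pictured as a lookup from (type, address range) to an emulation handler. The following Python sketch shows that idea only; ``IoType``, ``IoDispatcher``, and ``ToyCmosRtc`` are hypothetical names, the register contents are made up, and the real Device Model is written in C with its own interfaces.

```python
from enum import Enum
from typing import Callable, Dict, Tuple


class IoType(Enum):
    PIO = "pio"
    MMIO = "mmio"
    PCI_CFG = "pci_cfg"


# A handler receives (address, is_write, value) and returns the value read.
Handler = Callable[[int, bool, int], int]


class IoDispatcher:
    """Routes an I/O request to the handler that claimed its address range."""

    def __init__(self) -> None:
        self._ranges: Dict[Tuple[IoType, int, int], Handler] = {}

    def register(self, io_type: IoType, start: int, length: int,
                 handler: Handler) -> None:
        self._ranges[(io_type, start, start + length)] = handler

    def dispatch(self, io_type: IoType, addr: int,
                 is_write: bool = False, value: int = 0) -> int:
        for (kind, start, end), handler in self._ranges.items():
            if kind is io_type and start <= addr < end:
                return handler(addr, is_write, value)
        raise LookupError(f"no handler for {io_type.value} at {addr:#x}")


class ToyCmosRtc:
    """Index/data register pair at PIO 0x70/0x71, as in the RTC example."""

    def __init__(self) -> None:
        self.index = 0
        self.regs = {0x00: 0x59}  # made-up content for register 0

    def handle(self, addr: int, is_write: bool, value: int) -> int:
        if addr == 0x70:              # index port selects a register
            if is_write:
                self.index = value & 0x7F
            return 0
        if is_write:                  # data port, write path
            self.regs[self.index] = value & 0xFF
            return 0
        return self.regs.get(self.index, 0)  # data port, read path
```

For example, after ``register(IoType.PIO, 0x70, 2, rtc.handle)``, a write of an index to port 0x70 followed by a read of port 0x71 returns the selected register.
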
| The ACRN Device Model also injects interrupts/MSIs to its frontend devices
 | ||
| when necessary. For example, an RTC device needs to get its ALARM
 | ||
| interrupt, or a PCI device with MSI capability needs to get its MSI. The
 | ||
| Device Model also provides a PIRQ routing mechanism for platform devices.
 | ||
| 
 | ||
| Virtio Devices
 | ||
| **************
 | ||
| 
 | ||
| This section introduces the Virtio devices supported by ACRN. Currently
 | ||
| all the back-end (BE) Virtio drivers are implemented using the Virtio APIs,
 | ||
| and the front-end (FE) drivers reuse the standard Linux Virtio drivers.
 | ||
| 
 | ||
| Virtio-rnd
 | ||
| ==========
 | ||
| 
 | ||
| The Virtio-rnd entropy device supplies high-quality randomness for guest
 | ||
| use. The Virtio device ID of the Virtio-rnd device is 4, and it supports
 | ||
| one virtqueue of 64 entries (configurable in the source code). No
 | ||
| feature bits are defined.
 | ||
| 
 | ||
| When the FE driver requires random bytes, the BE device places bytes of
 | ||
| random data onto the virtqueue.
 | ||
| 
 | ||
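The request/response flow for the entropy device can be sketched as a toy queue in Python. This is a conceptual model only: ``ToyRndQueue`` and its methods are hypothetical, a real virtqueue uses descriptor rings in shared guest memory, and ``os.urandom`` merely stands in for whatever entropy source the BE device reads.

```python
import os
from collections import deque
from typing import Deque

QUEUE_SIZE = 64  # the virtqueue size quoted above


class ToyRndQueue:
    """Toy stand-in for the single Virtio-rnd virtqueue."""

    def __init__(self, size: int = QUEUE_SIZE) -> None:
        self.size = size
        self.avail: Deque[int] = deque()   # buffer lengths posted by the FE
        self.used: Deque[bytes] = deque()  # buffers filled by the BE

    def fe_request(self, nbytes: int) -> None:
        """FE driver posts an empty buffer of nbytes to be filled."""
        if len(self.avail) + len(self.used) >= self.size:
            raise BufferError("virtqueue is full")
        self.avail.append(nbytes)

    def be_service(self) -> int:
        """BE device fills every posted buffer with random bytes."""
        filled = 0
        while self.avail:
            nbytes = self.avail.popleft()
            self.used.append(os.urandom(nbytes))
            filled += 1
        return filled

    def fe_collect(self) -> bytes:
        """FE driver takes back one filled buffer, freeing a queue slot."""
        return self.used.popleft()
```
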
| To launch the Virtio-rnd device, you can use the following command:
 | ||
| 
 | ||
| .. code-block:: bash
 | ||
| 
 | ||
|    ./acrn-dm -A -m 1168M \
 | ||
|       -s 0:0,hostbridge \
 | ||
|       -s 1,virtio-blk,./uos.img \
 | ||
|       -s 2,virtio-rnd \
 | ||
|       -k bzImage \
 | ||
|       -B "root=/dev/vda rw rootwait noxsave maxcpus=0 nohpet \
 | ||
|           console=hvc0 no_timer_check ignore_loglevel \
 | ||
|           log_buf_len=16M consoleblank=0 tsc=reliable" vm1
 | ||
| 
 | ||
| To verify the result on the user OS side, you can use the following command:
 | ||
| 
 | ||
| .. code-block:: bash
 | ||
| 
 | ||
|    od /dev/random
 | ||
| 
 | ||
| Virtio-blk
 | ||
| ==========
 | ||
| 
 | ||
| The Virtio-blk device is a simple virtual block device. The FE driver
 | ||
| will place read, write, and other requests onto the virtqueue, so that
 | ||
| the BE driver can process them accordingly.
 | ||
| 
 | ||
| The Virtio device ID of the Virtio-blk is 2, and it supports one
 | ||
| virtqueue with 64 entries, configurable in the source code. The feature
 | ||
| bits supported by the BE device are as follows:
 | ||
| 
 | ||
| **VTBLK\_F\_SEG\_MAX(bit 2)**
 | ||
|   Maximum number of segments in a request is in seg_max.
 | ||
| 
 | ||
| **VTBLK\_F\_BLK\_SIZE(bit 6)**
 | ||
|   block size of disk is in blk\_size.
 | ||
| 
 | ||
| **VTBLK\_F\_FLUSH(bit 9)**
 | ||
|   cache flush command support.
 | ||
| 
 | ||
| **VTBLK\_F\_TOPOLOGY(bit 10)**
 | ||
|   device exports information on optimal I/O alignment.
 | ||
| 
 | ||
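Each feature above is a single bit position in the device's feature word. As a small worked example in Python, the table below copies the four bit numbers from this list and combines them into one mask; the helper names ``feature_mask`` and ``decode_features`` are illustrative and do not come from the ACRN source.

```python
from typing import Dict, Iterable, List

# Bit positions copied from the Virtio-blk feature list above.
VTBLK_FEATURES: Dict[str, int] = {
    "VTBLK_F_SEG_MAX": 2,
    "VTBLK_F_BLK_SIZE": 6,
    "VTBLK_F_FLUSH": 9,
    "VTBLK_F_TOPOLOGY": 10,
}


def feature_mask(bits: Iterable[int]) -> int:
    """OR together one bit per feature position."""
    mask = 0
    for bit in bits:
        mask |= 1 << bit
    return mask


def decode_features(mask: int, table: Dict[str, int]) -> List[str]:
    """List the names of the features whose bit is set in mask."""
    return [name for name, bit in table.items() if mask & (1 << bit)]


BE_MASK = feature_mask(VTBLK_FEATURES.values())
print(hex(BE_MASK))  # prints 0x644
```
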
| To use the Virtio-blk device, use the following command:
 | ||
| 
 | ||
| .. code-block:: bash
 | ||
| 
 | ||
|    ./acrn-dm -A -m 1168M \
 | ||
|       -s 0:0,hostbridge \
 | ||
|       -s 1,virtio-blk,./uos.img \
 | ||
|       -k bzImage -B "root=/dev/vda rw rootwait noxsave maxcpus=0 \
 | ||
|          nohpet console=hvc0 no_timer_check ignore_loglevel \
 | ||
|          log_buf_len=16M consoleblank=0 tsc=reliable" vm1
 | ||
| 
 | ||
| To verify the result, you should expect the user OS to boot
 | ||
| successfully.
 | ||
| 
 | ||
| Virtio-net
 | ||
| ==========
 | ||
| 
 | ||
| The Virtio-net device is a virtual Ethernet device. The Virtio device ID
 | ||
| of the Virtio-net is 1. The Virtio-net device supports two virtqueues,
 | ||
| one for transmitting packets and the other for receiving packets. The
 | ||
| FE driver will place empty buffers onto one virtqueue for receiving
 | ||
| packets, and enqueue outgoing packets onto the other virtqueue for
 | ||
| transmission. Currently the size of each virtqueue is 1000, configurable
 | ||
| in the source code.
 | ||
| 
 | ||
| To access the external network from the user OS, an L2 virtual switch
 | ||
| should be created in the service OS, and the BE driver is bound to a
 | ||
| tap/tun device attached to that L2 virtual switch. See
 | ||
| :numref:`primer-virtio-net`:
 | ||
| 
 | ||
| .. figure:: images/primer-virtio-net.png
 | ||
|    :align: center
 | ||
|    :name: primer-virtio-net
 | ||
| 
 | ||
|    Accessing external network from User OS
 | ||
| 
 | ||
| Currently the feature bits supported by the BE device are:
 | ||
| 
 | ||
| **VIRTIO\_NET\_F\_MAC(bit 5)**
 | ||
|   device has given MAC address.
 | ||
| 
 | ||
| **VIRTIO\_NET\_F\_MRG\_RXBUF(bit 15)**
 | ||
|   BE driver can merge receive buffers.
 | ||
| 
 | ||
| **VIRTIO\_NET\_F\_STATUS(bit 16)**
 | ||
|   configuration status field is available.
 | ||
| 
 | ||
| **VIRTIO\_F\_NOTIFY\_ON\_EMPTY(bit 24)**
 | ||
|   device will issue an interrupt if it runs out of available
 | ||
|   descriptors on a virtqueue.
 | ||
| 
 | ||
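In Virtio generally, the driver only uses features that both sides support, so the effective feature set is the intersection of what the device offers and what the driver understands. The Python sketch below illustrates that intersection using the four bits listed above; the ``NetFeature`` enum and ``negotiate`` function are hypothetical names, and the driver's feature word in the usage note is invented for the example.

```python
from enum import IntFlag


class NetFeature(IntFlag):
    """Bit values for the Virtio-net features listed above."""
    MAC = 1 << 5
    MRG_RXBUF = 1 << 15
    STATUS = 1 << 16
    NOTIFY_ON_EMPTY = 1 << 24


# Everything the BE device offers, according to the list above.
DEVICE_FEATURES = (NetFeature.MAC | NetFeature.MRG_RXBUF
                   | NetFeature.STATUS | NetFeature.NOTIFY_ON_EMPTY)


def negotiate(device_bits: int, driver_bits: int) -> int:
    """Keep only the feature bits that both sides support."""
    return int(device_bits) & int(driver_bits)
```

For instance, a driver that asks for MAC, STATUS, and some bit 0 that this BE device does not offer ends up with ``negotiate(DEVICE_FEATURES, (1 << 0) | NetFeature.MAC | NetFeature.STATUS)``, which is ``0x10020`` (MAC and STATUS only).
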
| To enable the Virtio-net device, use the following command:
 | ||
| 
 | ||
| .. code-block:: bash
 | ||
| 
 | ||
|    ./acrn-dm -A -m 1168M \
 | ||
|       -s 0:0,hostbridge \
 | ||
|       -s 1,virtio-blk,./uos.img \
 | ||
|       -s 2,virtio-net,tap0 \
 | ||
|       -k bzImage -B "root=/dev/vda rw rootwait noxsave maxcpus=0 \
 | ||
|          nohpet console=hvc0 no_timer_check ignore_loglevel \
 | ||
|          log_buf_len=16M consoleblank=0 tsc=reliable" vm1
 | ||
| 
 | ||
| To verify that the device works, check that the external
 | ||
| network is accessible from the user OS.
 | ||
| 
 | ||
| Virtio-console
 | ||
| ==============
 | ||
| 
 | ||
| The Virtio-console device is a simple device for data input and output.
 | ||
| The Virtio device ID of the Virtio-console device is 3. A device can
 | ||
| have from 1 to 16 ports. Each port has a pair of input and output
 | ||
| virtqueues used to communicate information between the FE and BE
 | ||
| drivers. Currently the size of each virtqueue is 64, configurable in the
 | ||
| source code.
 | ||
| 
 | ||
| Similar to the Virtio-net device, the two virtqueues specific to a port
 | ||
| are a transmitting virtqueue and a receiving virtqueue. The FE driver
 | ||
| places empty buffers onto the receiving virtqueue for incoming data, and
 | ||
| enqueues outgoing characters onto the transmitting virtqueue.
 | ||
| 
 | ||
| Currently the feature bits supported by the BE device are:
 | ||
| 
 | ||
| **VTCON\_F\_SIZE(bit 0)**
 | ||
|   configuration columns and rows are valid.
 | ||
| 
 | ||
| **VTCON\_F\_MULTIPORT(bit 1)**
 | ||
|   device supports multiple ports, and control virtqueues will be used.
 | ||
| 
 | ||
| **VTCON\_F\_EMERG\_WRITE(bit 2)**
 | ||
|   device supports emergency write.
 | ||
| 
 | ||
| Virtio-console supports redirecting guest output to various backend
 | ||
| devices, including stdio/pty/tty. Use the following syntax to
 | ||
| specify which backend to use:
 | ||
| 
 | ||
| .. code-block:: none
 | ||
| 
 | ||
|    virtio-console,[@]stdio|tty|pty:portname[=portpath][,[@]stdio|tty|pty:portname[=portpath]]
 | ||
| 
 | ||
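To make the option syntax concrete, here is a minimal Python sketch of a parser for it. It is not how ``acrn-dm`` parses its arguments: ``PortSpec`` and ``parse_console_opts`` are hypothetical names, the ``@`` prefix is recorded only as a flag because its meaning is not described here, and a port path containing a comma is not handled.

```python
from typing import List, NamedTuple, Optional

BACKENDS = ("stdio", "tty", "pty")
MAX_PORTS = 16  # a device can have from 1 to 16 ports


class PortSpec(NamedTuple):
    at_flag: bool        # True when the entry starts with "@"
    backend: str         # one of BACKENDS
    name: str            # portname
    path: Optional[str]  # portpath, or None when omitted


def parse_console_opts(opts: str) -> List[PortSpec]:
    """Parse 'virtio-console,[@]backend:portname[=portpath][,...]'."""
    device, _, rest = opts.partition(",")
    if device != "virtio-console" or not rest:
        raise ValueError("expected 'virtio-console,<port>[,<port>...]'")
    ports: List[PortSpec] = []
    for item in rest.split(","):
        at_flag = item.startswith("@")
        body = item[1:] if at_flag else item
        backend, colon, port = body.partition(":")
        if backend not in BACKENDS or not colon:
            raise ValueError(f"bad backend in {item!r}")
        name, equals, path = port.partition("=")
        if not name or (equals and not path):
            raise ValueError(f"bad port name or path in {item!r}")
        ports.append(PortSpec(at_flag, backend, name, path if equals else None))
    if len(ports) > MAX_PORTS:
        raise ValueError("too many ports")
    return ports
```

With this sketch, ``parse_console_opts("virtio-console,@stdio:stdio_port")`` yields a single ``PortSpec`` with backend ``stdio``, name ``stdio_port``, and no path.
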
| For example, to use stdio as a Virtio-console backend, use the following
 | ||
| command:
 | ||
| 
 | ||
| .. code-block:: bash
 | ||
| 
 | ||
|    ./acrn-dm -A -m 1168M \
 | ||
|       -s 0:0,hostbridge \
 | ||
|       -s 1,virtio-blk,./uos.img \
 | ||
|       -s 3,virtio-console,@stdio:stdio_port \
 | ||
|       -k bzImage -B "root=/dev/vda rw rootwait noxsave maxcpus=0 \
 | ||
|          nohpet console=hvc0 no_timer_check ignore_loglevel \
 | ||
|          log_buf_len=16M consoleblank=0 tsc=reliable" vm1
 | ||
| 
 | ||
| You can then log in to the user OS:
 | ||
| 
 | ||
| .. code-block:: none
 | ||
| 
 | ||
|    Ubuntu 17.04 xubuntu hvc0
 | ||
|    xubuntu login: root
 | ||
|    Password:
 | ||
| 
 | ||
| To use pty as a virtio-console backend, use the following command:
 | ||
| 
 | ||
| .. code-block:: bash
 | ||
| 
 | ||
|    ./acrn-dm -A -m 1168M \
 | ||
|       -s 0:0,hostbridge \
 | ||
|       -s 1,virtio-blk,./uos.img \
 | ||
|       -s 2,virtio-net,tap0 \
 | ||
|       -s 3,virtio-console,@pty:pty_port \
 | ||
|       -k ./bzImage -B "root=/dev/vda rw rootwait noxsave maxcpus=0 \
 | ||
|          nohpet console=hvc0 no_timer_check ignore_loglevel \
 | ||
|          log_buf_len=16M consoleblank=0 tsc=reliable" vm1 &
 | ||
| 
 | ||
| When ACRN-DM boots the User OS successfully, a log similar to the
 | ||
| following is shown:
 | ||
| 
 | ||
| .. code-block:: none
 | ||
| 
 | ||
|    **************************************************************
 | ||
|    virt-console backend redirected to /dev/pts/0
 | ||
|    **************************************************************
 | ||
| 
 | ||
| You can then use the following command to log in to the User OS:
 | ||
| 
 | ||
| .. code-block:: bash
 | ||
| 
 | ||
|    minicom -D /dev/pts/0
 | ||
| 
 | ||
| or
 | ||
| 
 | ||
| .. code-block:: bash
 | ||
| 
 | ||
|    screen /dev/pts/0
 |