diff --git a/doc/developer-guides/hld/hld-hypervisor.rst b/doc/developer-guides/hld/hld-hypervisor.rst
index 73d36d873..edc0de8bf 100644
--- a/doc/developer-guides/hld/hld-hypervisor.rst
+++ b/doc/developer-guides/hld/hld-hypervisor.rst
@@ -7,6 +7,8 @@ Hypervisor high-level design
 .. toctree::
    :maxdepth: 1
 
+   hv-startup
+   hv-cpu-virt
    static-core-hld
    Memory management
    Interrupt management
diff --git a/doc/developer-guides/hld/hv-cpu-virt.rst b/doc/developer-guides/hld/hv-cpu-virt.rst
new file mode 100644
index 000000000..204a76ff0
--- /dev/null
+++ b/doc/developer-guides/hld/hv-cpu-virt.rst
@@ -0,0 +1,1192 @@
+.. _hv-cpu-virt:
+
+CPU Virtualization
+##################
+
+.. figure:: images/hld-image47.png
+   :align: center
+   :name: hv-cpu-virt-components
+
+   ACRN Hypervisor CPU Virtualization Components
+
+Based on Intel VT-x virtualization technology, ACRN emulates a virtual CPU
+(vCPU) with the following methods:
+
+- **core partition**: one vCPU is dedicated to and associated with one
+  physical CPU (pCPU),
+  making much of the hardware register emulation simple
+  pass-through, and providing good isolation for physical interrupts
+  and guest execution.
+
+- **simple schedule**: only two thread loops are maintained for a CPU:
+  a vCPU thread and a default idle thread. A CPU runs most of the time in
+  the vCPU thread emulating a guest CPU, switching between VMX root
+  mode and non-root mode. A CPU schedules out to default idle when an
+  operation needs it to stay in VMX root mode, such as when waiting for
+  an I/O request from the DM or when the vCPU is ready to be destroyed.
+
+The following sections discuss the major modules (shown in blue) in the
+CPU virtualization overview shown in :numref:`hv-cpu-virt-components`.
+
+.. 
_vCPU_lifecycle:
+
+vCPU Lifecycle
+**************
+
+A vCPU lifecycle is shown in :numref:`hv-vcpu-transitions` below, where
+the major states are:
+
+- **VCPU_INIT**: vCPU is in an initialized state, and its associated CPU
+  is running in default_idle
+
+- **VCPU_RUNNING**: vCPU is running, and its associated CPU is running in
+  vcpu_thread
+
+- **VCPU_PAUSED**: vCPU is paused, and its associated CPU is running in
+  default_idle
+
+- **VCPU_ZOMBIE**: vCPU is being destroyed, and its associated CPU
+  is running in default_idle
+
+.. figure:: images/hld-image17.png
+   :align: center
+   :name: hv-vcpu-transitions
+
+   ACRN vCPU state transitions
+
+This table shows the functions that drive the state machine of the vCPU
+lifecycle:
+
+.. list-table::
+   :widths: 20 80
+   :header-rows: 1
+
+   * - **Function**
+     - **Description**
+
+   * - create_vcpu
+     - Creates/allocates a vCPU instance, with initialization of its
+       vcpu_id, vpid, vmcs, vlapic, etc. It sets the initial vCPU state
+       to VCPU_INIT.
+
+   * - schedule_vcpu
+     - Adds a vCPU to the run queue and makes a reschedule request for it.
+       It sets the vCPU state to VCPU_RUNNING.
+
+   * - pause_vcpu
+     - Changes a vCPU state to VCPU_PAUSED or VCPU_ZOMBIE, and makes a
+       reschedule request for it.
+
+   * - resume_vcpu
+     - Changes a vCPU state to VCPU_RUNNING, and makes a reschedule request
+       for it.
+
+   * - reset_vcpu
+     - Resets all fields in a vCPU instance; the vCPU state is reset to
+       VCPU_INIT.
+
+   * - destroy_vcpu
+     - Destroys/frees a vCPU instance.
+
+   * - start/run_vcpu
+     - An interface in the vCPU thread to implement VM entry and VM exit.
+       A CPU switches between VMX root mode and non-root mode based on it.
+
+
+vCPU Scheduling
+***************
+
+.. figure:: images/hld-image35.png
+   :align: center
+   :name: hv-vcpu-schedule
+
+   ACRN vCPU scheduling flow
+
+As described in the CPU virtualization overview, ACRN implements a simple
+scheduling mechanism based on two threads: vcpu_thread and
+default_idle. 
A vCPU in the VCPU_RUNNING state always runs in
+a vcpu_thread loop, while a vCPU in the VCPU_PAUSED or VCPU_ZOMBIE
+state runs in the default_idle loop. The detailed behaviors of the
+vcpu_thread and default_idle threads are illustrated in
+:numref:`hv-vcpu-schedule`:
+
+- The **vcpu_thread** loop initializes a vCPU's VMCS during
+  its first launch and then loops handling its associated
+  softirqs, VM exits, and pending requests around the VM entry/exit.
+  It also checks for a reschedule request and schedules out to
+  default_idle if necessary. See `vCPU Thread`_ for more details
+  of vcpu_thread.
+
+- The **default_idle** loop simply does do_cpu_idle while also
+  checking for need-offline and reschedule requests.
+  If a CPU is marked as need-offline, it goes to cpu_dead.
+  If a reschedule request is made for this CPU, it
+  schedules out to vcpu_thread if necessary.
+
+- The function ``make_reschedule_request`` drives the thread
+  switch between vcpu_thread and default_idle.
+
+Some example scenario flows are shown here:
+
+.. figure:: images/hld-image7.png
+   :align: center
+
+   ACRN vCPU scheduling scenarios
+
+- **Starting a VM**: after creating a vCPU, the BSP calls *schedule_vcpu*
+  through *start_vm*, and an AP calls *schedule_vcpu* through vlapic
+  INIT-SIPI emulation; finally, the vCPU runs in a
+  *vcpu_thread* loop.
+
+- **Shutting down a VM**: a *pause_vm* function call makes a vCPU
+  running in *vcpu_thread* schedule out to *default_idle*. The
+  following *reset_vcpu* and *destroy_vcpu* de-initialize and then destroy
+  this vCPU instance.
+
+- **IOReq handling**: after an IOReq is sent to the DM for emulation, a
+  vCPU running in *vcpu_thread* schedules out to *default_idle*
+  through *acrn_insert_request_wait->pause_vcpu*. After the DM
+  completes the emulation of this IOReq, it calls
+  *hcall_notify_ioreq_finish->resume_vcpu*, making the vCPU
+  schedule back to *vcpu_thread* to continue guest execution. 
+
+vCPU Thread
+***********
+
+The vCPU thread flow is a loop, as shown and described below:
+
+.. figure:: images/hld-image68.png
+   :align: center
+
+   ACRN vCPU thread
+
+
+1. Check if this is the vCPU's first launch. If yes, do VMCS
+   initialization. (See `VMX Initialization`_.)
+
+2. Handle softirqs by calling *do_softirq*.
+
+3. Handle pending requests by calling *acrn_handle_pending_request*.
+   (See `Pending Request Handlers`_.)
+
+4. Check whether *vcpu_thread* needs to schedule out to *default_idle*
+   due to a reschedule request. If so, schedule out to
+   *default_idle*.
+
+5. VM enter by calling *start/run_vcpu*, then enter non-root mode for
+   guest execution.
+
+6. VM exit from *start/run_vcpu* when the guest triggers a VM exit in
+   non-root mode.
+
+7. Handle the VM exit based on its specific reason.
+
+8. Loop back to step 2.
+
+vCPU Run Context
+================
+
+During a vCPU switch between root and non-root mode, the run context of
+the vCPU is saved and restored using this structure:
+
+.. code-block:: c
+
+   struct run_context {
+       /* Contains the guest register set.
+        * NOTE: This must be the first element in the structure, so that
+        * the offsets in vmx_asm.S match
+        */
+       union {
+           struct cpu_gp_regs regs;
+           uint64_t longs[NUM_GPRS];
+       } guest_cpu_regs;
+
+       /* The guest's CR registers 0, 2, 3 and 4. */
+       uint64_t cr0;
+
+       /* CPU_CONTEXT_OFFSET_CR2 =
+        * offsetof(struct run_context, cr2) = 136
+        */
+       uint64_t cr2;
+       uint64_t cr4;
+
+       uint64_t rip;
+       uint64_t rflags;
+
+       /* CPU_CONTEXT_OFFSET_IA32_SPEC_CTRL =
+        * offsetof(struct run_context, ia32_spec_ctrl) = 168
+        */
+       uint64_t ia32_spec_ctrl;
+       uint64_t ia32_efer;
+   };
+
+
+The vCPU run context is saved and restored in three different
+categories:
+
+- Always saved/restored on VM exit/entry:
+
+  - These registers must be saved on every VM exit and restored
+    on every VM entry
+  - Registers include: general purpose registers, CR2, and
+    IA32_SPEC_CTRL
+  - Defined in *vcpu->run_context*
+  - Get/set them through *vcpu_get/set_xxx*
+
+- Cached/updated on demand during VM exit/entry:
+
+  - These registers are used frequently. They are cached from the
+    VMCS on first access after a VM exit, and written back to the VMCS
+    on VM entry if marked dirty
+  - Registers include: RSP, RIP, EFER, RFLAGS, CR0, and CR4
+  - Defined in *vcpu->run_context*
+  - Get/set them through *vcpu_get/set_xxx*
+
+- Always read/written from/to the VMCS:
+
+  - These registers are rarely used. Access to them is always
+    from/to the VMCS.
+  - Registers are in the VMCS but not listed in the two categories above.
+  - Not defined in *vcpu->run_context*
+  - Get/set them through the VMCS API
+
+For the first two categories above, ACRN provides these get/set APIs:
+
+.. 
list-table:: + :widths: 30 70 + :header-rows: 1 + + * - **APIs** + - **Desc** + + * - vcpu_get_gpreg + - Get target vCPU's general purpose registers value in run_context + + * - vcpu_set_gpreg + - Set target vCPU's general purpose registers value in run_context + + * - vcpu_get_rip + - Get & cache target vCPU's RIP in run_context + + * - vcpu_set_rip + - Update target vCPU's RIP in run_context + + * - vcpu_get_rsp + - Get & cache target vCPU's RSP in run_context + + * - vcpu_set_rsp + - Update target vCPU's RSP in run_context + + * - vcpu_get_efer + - Get & cache target vCPU's EFER in run_context + + * - vcpu_set_efer + - Update target vCPU's EFER in run_context + + * - vcpu_get_rflags + - Get & cache target vCPU's RFLAGS in run_context + + * - vcpu_set_rflags + - Update target vCPU's RFLAGS in run_context + + * - vcpu_get_cr2 + - Get target vCPU's CR2 register value in run_context + + * - vcpu_set_cr2 + - Set target vCPU's CR2 register value in run_context + + * - vcpu_get_cr0/4 + - Get & cache target vCPU's CR0/4 register in run_context + + * - vcpu_set_cr0/4 + - Update target vCPU's CR0/4 register in run_context + + +VM Exit Handlers +================ + +ACRN implements its VM exit handlers with a static table. Except for the +exit reasons listed below, a default *unhandled_vmexit_handler* is used +that will trigger an error message and return without handling: + +.. 
list-table:: + :widths: 33 33 33 + :header-rows: 1 + + * - **VM Exit Reason** + - **Handler** + - **Desc** + + * - VMX_EXIT_REASON_EXCEPTION_OR_NMI + - exception_vmexit_handler + - Only trap #MC, print error then inject back to guest + + * - VMX_EXIT_REASON_EXTERNAL_INTERRUPT + - external_interrupt_vmexit_handler + - External interrupt handler for physical interrupt happening in non-root mode + + * - VMX_EXIT_REASON_INTERRUPT_WINDOW + - interrupt_window_vmexit_handler + - To support interrupt window if VID is disabled + + * - VMX_EXIT_REASON_CPUID + - cpuid_vmexit_handler + - Handle CPUID access from guest + + * - VMX_EXIT_REASON_VMCALL + - vmcall_vmexit_handler + - Handle hypercall from guest + + * - VMX_EXIT_REASON_CR_ACCESS + - cr_access_vmexit_handler + - Handle CR registers access from guest + + * - VMX_EXIT_REASON_IO_INSTRUCTION + - pio_instr_vmexit_handler + - Emulate I/O access with range in IO_BITMAP, + which may have a handler in hypervisor (such as vuart or vpic), + or need to create an I/O request to DM + + * - VMX_EXIT_REASON_RDMSR + - rdmsr_vmexit_handler + - Read MSR from guest in MSR_BITMAP + + * - VMX_EXIT_REASON_WRMSR + - wrmsr_vmexit_handler + - Write MSR from guest in MSR_BITMAP + + * - VMX_EXIT_REASON_APIC_ACCESS + - apic_access_vmexit_handler + - APIC access for APICv + + * - VMX_EXIT_REASON_VIRTUALIZED_EOI + - veoi_vmexit_handler + - Trap vLAPIC EOI for specific vector with level trigger mode + in vIOAPIC, required for supporting PTdev + + * - VMX_EXIT_REASON_EPT_VIOLATION + - ept_violation_vmexit_handler + - MMIO emulation, which may have handler in hypervisor + (such as vLAPIC or vIOAPIC), or need to create an I/O + request to DM + + * - VMX_EXIT_REASON_XSETBV + - xsetbv_vmexit_handler + - Set host owned XCR0 for supporting xsave + + * - VMX_EXIT_REASON_APIC_WRITE + - apic_write_vmexit_handler + - APIC write for APICv + + +Details of each vm exit reason handler are described in other sections. 
+ +Pending Request Handlers +======================== + +ACRN uses the function *acrn_handle_pending_request* to handle +requests before VM entry in *vcpu_thread*. + +A bitmap in the vCPU structure lists the different requests: + +.. code-block:: c + + #define ACRN_REQUEST_EXCP 0U + #define ACRN_REQUEST_EVENT 1U + #define ACRN_REQUEST_EXTINT 2U + #define ACRN_REQUEST_NMI 3U + #define ACRN_REQUEST_TMR_UPDATE 4U + #define ACRN_REQUEST_EPT_FLUSH 5U + #define ACRN_REQUEST_TRP_FAULT 6U + #define ACRN_REQUEST_VPID_FLUSH 7U /* flush vpid tlb */ + + +ACRN provides the function *vcpu_make_request* to make different +requests, set the bitmap of corresponding request, and notify the target vCPU +through IPI if necessary (when the target vCPU is not currently running). See +section 3.5.5 for details. + +.. code-block:: c + + void vcpu_make_request(struct vcpu *vcpu, uint16_t eventid) + { + bitmap_set_lock(eventid, &vcpu->arch_vcpu.pending_req); + /* + * if current hostcpu is not the target vcpu's hostcpu, we need + * to invoke IPI to wake up target vcpu + * + * TODO: Here we just compare with cpuid, since cpuid currently is + * global under pCPU / vCPU 1:1 mapping. If later we enabled vcpu + * scheduling, we need change here to determine it target vcpu is + * VMX non-root or root mode + */ + if (get_cpu_id() != vcpu->pcpu_id) { + send_single_ipi(vcpu->pcpu_id, VECTOR_NOTIFY_VCPU); + } + } + +For each request, function *acrn_handle_pending_request* handles each +request as shown below. + + +.. 
list-table:: + :widths: 25 25 25 25 + :header-rows: 1 + + * - **Request** + - **Desc** + - **Request Maker** + - **Request Handler** + + * - ACRN_REQUEST_EXCP + - Request for exception injection + - vcpu_inject_gp, vcpu_inject_pf, vcpu_inject_ud, vcpu_inject_ac, + or vcpu_inject_ss and then queue corresponding exception by + vcpu_queue_exception + - vcpu_inject_hi_exception, vcpu_inject_lo_exception based + on exception priority + + * - ACRN_REQUEST_EVENT + - Request for vlapic interrupt vector injection + - vlapic_fire_lvt or vlapic_set_intr, which could be triggered + by vlapic lvt, vioapic, or vmsi + - vcpu_do_pending_event + + * - ACRN_REQUEST_EXTINT + - Request for extint vector injection + - vcpu_inject_extint, triggered by vpic + - vcpu_do_pending_extint + + * - ACRN_REQUEST_NMI + - Request for nmi injection + - vcpu_inject_nmi + - program VMX_ENTRY_INT_INFO_FIELD directly + + * - ACRN_REQUEST_TMR_UPDATE + - Request for update vIOAPIC TMR, which also leads to vLAPIC + VEOI bitmap update for level triggered vector + - vlapic_reset_tmr or vioapic_indirect_write change trigger mode in RTC + - vioapic_update_tmr + + * - ACRN_REQUEST_EPT_FLUSH + - Request for EPT flush + - ept_mr_add, ept_mr_modify, ept_mr_del, or vmx_write_cr0 disable cache + - invept + + * - ACRN_REQUEST_TRP_FAULT + - Request for handling triple fault + - vcpu_queue_exception meet triple fault + - fatal error + + * - ACRN_REQUEST_VPID_FLUSH + - Request for VPID flush + - None + - flush_vpid_single + +.. note:: Refer to the interrupt management chapter for request + handling order for exception, nmi, and interrupts. For other requests + such as tmr update, or EPT flush, there is no mandatory order. + +VMX Initialization +****************** + +ACRN will attempt to initialize the vCPU's VMCS before its first +launch with the host state, execution control, guest state, +entry control and exit control, as shown in the table below. + +The table briefly shows how each field got configured. 
+The guest state field is critical for a guest CPU start to run +based on different CPU modes. One structure *boot_ctx* is used to pass +the necessary initialized guest state to VMX, +used only for the BSP of a guest. + +For a guest vCPU's state initialization: + +- If it's BSP, the guest state configuration is based on *boot_ctx*, + which could be initialized on different objects: + + - SOS BSP based on SBL: booting up context saved at the entry of + system boot up + + - UOS BSP: DM context initialization through hypercall + +- If it's AP, then it will always start from real mode, and the start + vector will always come from vlapic INIT-SIPI emulation. So there + is no need for *boot_ctx*. Instead we use a static guest state setting + pre-defined for real mode. + +.. code-block:: c + + struct acrn_vcpu_state { + struct acrn_gp_regs gprs; + struct acrn_dt_desc gdt; + uint64_t rip; + uint64_t cs_base; + uint64_t cr0; + uint64_t cr4; + uint64_t reserved_64[4]; + + uint32_t cs_ar; + uint32_t reserved_32[4]; + + /* don't change the order of following sel */ + uint16_t cs_sel; + uint16_t ss_sel; + uint16_t ds_sel; + uint16_t es_sel; + uint16_t fs_sel; + uint16_t gs_sel; + + uint16_t reserved_16[4]; + }; + + struct boot_ctx { + struct acrn_vcpu_state vcpu_state; + struct acrn_dt_desc idt; + uint64_t cr3; + uint64_t ia32_efer; + uint64_t rflags; + + void *rsdp; + void *ap_trampoline_buf; + + uint16_t ldt_sel; + uint16_t tr_sel; + }__attribute__((aligned(8))); + + +.. 
list-table:: + :widths: 20 40 10 30 + :header-rows: 1 + + * - **VMX Domain** + - **Fields** + - **Bits** + - **Description** + + * - **host state** + - CS, DS, ES, FS, GS, TR, LDTR, GDTR, IDTR + - n/a + - According to host + + * - + - MSR_IA32_PAT, MSR_IA32_EFER + - n/a + - According to host + + * - + - CR0, CR3, CR4 + - n/a + - According to host + + * - + - RIP + - n/a + - Set to vm_exit pointer + + * - + - IA32_SYSENTER_CS/ESP/EIP + - n/a + - Set to 0 + + * - **exec control** + - VMX_PIN_VM_EXEC_CONTROLS + - 0 + - Enable external-interrupt exiting + + * - + - + - 7 + - Enable posted interrupts + + * - + - VMX_PROC_VM_EXEC_CONTROLS + - 3 + - Use TSC offsetting + + * - + - + - 21 + - Use TPR shadow + + * - + - + - 25 + - Use I/O bitmaps + + * - + - + - 28 + - Use MSR bitmaps + + * - + - + - 31 + - Activate secondary controls + + * - + - VMX_PROC_VM_EXEC_CONTROLS2 + - 0 + - Virtualize APIC accesses + + * - + - + - 1 + - Enable EPT + + * - + - + - 3 + - Enable RDTSCP + + * - + - + - 5 + - Enable VPID + + * - + - + - 7 + - Unrestricted guest + + * - + - + - 8 + - APIC-register virtualization + + * - + - + - 9 + - Virtual-interrupt delivery + + * - + - + - 20 + - Enable XSAVES/XRSTORS + + * - **guest state** + - CS, DS, ES, FS, GS, TR, LDTR, GDTR, IDTR + - n/a + - According to vCPU mode and init_ctx + + * - + - RIP, RSP + - n/a + - According to vCPU mode and init_ctx + + * - + - CR0, CR3, CR4 + - n/a + - According to vCPU mode and init_ctx + + * - + - GUEST_IA32_SYSENTER_CS/ESP/EIP + - n/a + - Set to 0 + + * - + - GUEST_IA32_PAT + - n/a + - Set to PAT_POWER_ON_VALUE + + * - **entry control** + - VMX_ENTRY_CONTROLS + - 2 + - Load debug controls + + * - + - + - 14 + - Load IA32_PAT + + * - + - + - 15 + - Load IA23_EFER + + * - **exit control** + - VMX_EXIT_CONTROLS + - 2 + - Save debug controls + + * - + - + - 9 + - Host address space size + + * - + - + - 15 + - Acknowledge Interrupt on exit + + * - + - + - 18 + - Save IA32_PAT + + * - + - + - 19 + - Load IA32_PAT + + * 
- + - + - 20 + - Save IA32_EFER + + * - + - + - 21 + - Load IA32_EFER + + +CPUID Virtualization +******************** + +CPUID access from guest would cause VM exits unconditionally if executed +as a VMX non-root operation. ACRN must return the emulated processor +identification and feature information in the EAX, EBX, ECX, and EDX +registers. + +To simplify, ACRN returns the same values from the physical CPU for most +of the CPUID, and specially handle a few CPUID features which are APIC +ID related such as CPUID.01H. + +ACRN emulates some extra CPUID features for the hypervisor as well. + +There is a per-vm *vcpuid_entries* array, initialized during VM creation +and used to cache most of the CPUID entries for each VM. During guest +CPUID emulation, ACRN will read the cached value from this array, except +some APIC ID-related CPUID data emulated at runtime. + +This table describes details for CPUID emulation: + +.. list-table:: + :widths: 20 80 + :header-rows: 1 + + + * - **CPUID** + - **Emulation Description** + + * - 01H + - - Get original value from physical CPUID + - Fill APIC ID from vLAPIC + - Disable x2APIC + - Disable PCID + - Disable VMX + - Disable XSAVE if host not enabled + + * - 0BH + - - Fill according to X2APIC feature support (default is disabled) + - If not supported, fill all registers with 0 + - If supported, get from physical CPUID + + * - 0DH + - - Fill according to XSAVE feature support + - If not supported, fill all registers with 0 + - If supported, get from physical CPUID + + * - 07H + - - Get from per-vm CPUID entries cache + - For subleaf 0, disabled INVPCID, Intel RDT + + * - 16H + - - Get from per-vm CPUID entries cache + - If physical CPU support CPUID.16H, read from physical CPUID + - If physical CPU does not support it, emulate with tsc freq + + * - 40000000H + - - Get from per-vm CPUID entries cache + - EAX: the maximum input value for CPUID supported by ACRN (40000010) + - EBX, ECX, EDX: hypervisor vendor ID signature - 
"ACRNACRNACRN" + + * - 40000010H + - - Get from per-vm CPUID entries cache + - EAX: virtual TSC frequency in KHz + - EBX, ECX, EDX: reserved to 0 + + * - 0AH + - - PMU Currently disabled + + * - 0FH, 10H + - - Intel RDT Currently disabled + + * - 12H + - - Intel SGX Currently disabled + + * - 14H + - - Intel Processor Trace Currently disabled + + * - Others + - - Get from per-vm CPUID entries cache + +.. note:: ACRN needs to take care of + some CPUID values that can change at runtime, for example, XD feature in + CPUID.80000001H may be cleared by MISC_ENABLE MSR. + + +MSR Virtualization +****************** + +ACRN always enables MSR bitmap in *VMX_PROC_VM_EXEC_CONTROLS* VMX +execution control field. This bitmap marks the MSRs to cause a VM +exit upon guest access for both read and write. The VM +exit reason for reading or writing these MSRs is respectively +*VMX_EXIT_REASON_RDMSR* or *VMX_EXIT_REASON_WRMSR* and the vm exit +handler is *rdmsr_vmexit_handler* or *wrmsr_vmexit_handler*. + +This table shows the predefined MSRs ACRN will trap +for all the guests. For the MSRs whose bitmap are not set in the +MSR bitmap, guest access will be pass-through directly: + +.. list-table:: + :widths: 33 33 33 + :header-rows: 1 + + * - **MSR** + - **Description** + - **Handler** + + * - MSR_IA32_TSC_DEADLINE + - TSC target of local APIC's TSC deadline mode + - emulates with vlapic + + * - MSR_IA32_BIOS_UPDT_TRIG + - BIOS update trigger + - work for update microcode from SOS, the signature ID read is from + physical MSR, and a BIOS update trigger from SOS will trigger a + physical microcode update. 
+
+   * - MSR_IA32_BIOS_SIGN_ID
+     - BIOS update signature ID
+     - "
+
+   * - MSR_IA32_TIME_STAMP_COUNTER
+     - Time-stamp counter
+     - Works with VMX_TSC_OFFSET_FULL to emulate the virtual TSC
+
+   * - MSR_IA32_PAT
+     - Page-attribute table
+     - Save/restore in vCPU; write to VMX_GUEST_IA32_PAT_FULL if cr0.cd is 0
+
+   * - MSR_IA32_PERF_CTL
+     - Performance control
+     - Triggers a real p-state change if the p-state is valid when writing;
+       fetches the physical MSR when reading
+
+   * - MSR_IA32_MTRR_CAP
+     - Memory type range register related
+     - Handled by MTRR emulation.
+
+   * - MSR_IA32_MTRR_DEF_TYPE
+     - "
+     - "
+
+   * - MSR_IA32_MTRR_PHYSBASE_0~9
+     - "
+     - "
+
+   * - MSR_IA32_MTRR_FIX64K_00000
+     - "
+     - "
+
+   * - MSR_IA32_MTRR_FIX16K_80000/A0000
+     - "
+     - "
+
+   * - MSR_IA32_MTRR_FIX4K_C0000~F8000
+     - "
+     - "
+
+   * - MSR_IA32_VMX_BASIC~VMX_TRUE_ENTRY_CTLS
+     - VMX related MSRs
+     - Not supported; access injects #GP
+
+
+CR Virtualization
+*****************
+
+ACRN emulates ``mov to cr0``, ``mov to cr4``, ``mov to cr8``, and ``mov
+from cr8`` through *cr_access_vmexit_handler* based on
+*VMX_EXIT_REASON_CR_ACCESS*.
+
+.. note:: Currently ``mov to cr8`` and ``mov from cr8`` are actually
+   not valid, as the ``CR8-load/store exiting`` bits are set to 0 in
+   *VMX_PROC_VM_EXEC_CONTROLS*.
+
+A VM can execute ``mov from cr0`` and ``mov from
+cr4`` without triggering a VM exit. The values read are the read shadows
+of the corresponding registers in the VMCS. The shadows are updated by the
+hypervisor on CR writes.
+
+.. 
list-table:: + :widths: 30 70 + :header-rows: 1 + + * - **Operation** + - **Handler** + + * - mov to cr0 + - Based on vCPU set context API: vcpu_set_cr0 -> vmx_write_cr0 + + * - mov to cr4 + - Based on vCPU set context API: vcpu_set_cr4 ->vmx_write_cr4 + + * - mov to cr8 + - Based on vlapic tpr API: vlapic_set_cr8->vlapic_set_tpr + + * - mov from cr8 + - Based on vlapic tpr API: vlapic_get_cr8->vlapic_get_tpr + + +For ``mov to cr0`` and ``mov to cr4``, ACRN sets +*cr0_host_mask/cr4_host_mask* into *VMX_CR0_MASK/VMX_CR4_MASK* +for the bitmask causing vm exit. + +As ACRN always enables ``unrestricted guest`` in +*VMX_PROC_VM_EXEC_CONTROLS2*, *CR0.PE* and *CR0.PG* can be +controlled by guest. + +.. list-table:: + :widths: 20 40 40 + :header-rows: 1 + + * - **CR0 MASK** + - **Value** + - **Comments** + + * - cr0_always_on_mask + - fixed0 & (~(CR0_PE | CR0_PG)) + - where fixed0 is gotten from MSR_IA32_VMX_CR0_FIXED0, means these bits + are fixed to be 1 under VMX operation + + * - cr0_always_off_mask + - ~fixed1 + - where ~fixed1 is gotten from MSR_IA32_VMX_CR0_FIXED1, means these bits + are fixed to be 0 under VMX operation + + * - CR0_TRAP_MASK + - CR0_PE | CR0_PG | CR0_WP | CR0_CD | CR0_NW + - ACRN will also trap PE, PG, WP, CD, and NW bits + + * - cr0_host_mask + - ~(fixed0 ^ fixed1) | CR0_TRAP_MASK + - ACRN will finally trap bits under VMX root mode control plus + additionally added bits + + +For ``mov to cr0`` emulation, ACRN will handle a paging mode change based on +PG bit change, and a cache mode change based on CD and NW bits changes. +ACRN also takes care of illegal writing from guest to invalid +CR0 bits (for example, set PG while CR4.PAE = 0 and IA32_EFER.LME = 1), +which will finally inject a #GP to guest. Finally, +*VMX_CR0_READ_SHADOW* will be updated for guest reading of host +controlled bits, and *VMX_GUEST_CR0* will be updated for real vmx cr0 +setting. + +.. 
list-table::
+   :widths: 20 40 40
+   :header-rows: 1
+
+   * - **CR4 MASK**
+     - **Value**
+     - **Comments**
+
+   * - cr4_always_on_mask
+     - fixed0
+     - where fixed0 is obtained from MSR_IA32_VMX_CR4_FIXED0; these bits
+       are fixed to 1 under VMX operation
+
+   * - cr4_always_off_mask
+     - ~fixed1
+     - where fixed1 is obtained from MSR_IA32_VMX_CR4_FIXED1; these bits
+       are fixed to 0 under VMX operation
+
+   * - CR4_TRAP_MASK
+     - CR4_PSE | CR4_PAE | CR4_VMXE | CR4_PCIDE
+     - ACRN also traps the PSE, PAE, VMXE, and PCIDE bits
+
+   * - cr4_host_mask
+     - ~(fixed0 ^ fixed1) | CR4_TRAP_MASK
+     - ACRN finally traps the bits under VMX root mode control plus the
+       additionally added bits
+
+
+The ``mov to cr4`` emulation is similar to the cr0 emulation noted above.
+
+IO/MMIO Emulation
+*****************
+
+ACRN always enables the I/O bitmap in *VMX_PROC_VM_EXEC_CONTROLS* and EPT
+in *VMX_PROC_VM_EXEC_CONTROLS2*. Based on them,
+*pio_instr_vmexit_handler* and *ept_violation_vmexit_handler* are
+used for IO/MMIO emulation of an emulated device. The emulated device
+can be located in the hypervisor or in the DM in SOS. Please refer to the
+"I/O Emulation" section for more details.
+
+For an emulated device implemented in the hypervisor, ACRN provides some
+basic APIs to register its IO/MMIO range:
+
+- For SOS, the default I/O bitmap is all set to 0, which means SOS passes
+  through all I/O port accesses by default. Adding an I/O handler
+  for a hypervisor emulated device requires first setting its corresponding
+  I/O bitmap bit to 1.
+
+- For UOS, the default I/O bitmap is all set to 1, which means UOS traps
+  all I/O port accesses by default. Adding an I/O handler for a
+  hypervisor emulated device does not require changing its I/O bitmap.
+  If the trapped I/O port access does not fall into a hypervisor
+  emulated device, it creates an I/O request and passes it to the SOS
+  DM.
+
+- For SOS, EPT maps all memory to the SOS except for the ACRN hypervisor
+  area. 
This means SOS will pass through all MMIO access by + default. Adding a MMIO handler for a hypervisor emulated + device needs to first remove its MMIO range from EPT mapping. + +- For UOS, EPT only maps its system RAM to the UOS, which means UOS will + trap all MMIO access by default. Adding a MMIO handler for a + hypervisor emulated device does not need to change its EPT mapping. + If the trapped MMIO access does not fall into a hypervisor + emulated device, it will create an I/O request and pass it to SOS + DM. + +.. list-table:: + :widths: 30 70 + :header-rows: 1 + + * - **API** + - **Description** + + * - register_io_emulation_handler + - register an I/O emulation handler for a hypervisor emulated device + by specific I/O range + + * - free_io_emulation_resource + - free all I/O emulation resources for a VM + + * - register_mmio_emulation_handler + - register a MMIO emulation handler for a hypervisor emulated device + by specific MMIO range + + * - unregister_mmio_emulation_handler + - unregister a MMIO emulation handler for a hypervisor emulated device + by specific MMIO range + + +Instruction Emulation +********************* + +ACRN implements a simple instruction emulation infrastructure for +MMIO (EPT) and APIC access emulation. When such a VM exit is triggered, the +hypervisor needs to decode the instruction from RIP then attempt the +corresponding emulation based on its instruction and read/write direction. + +ACRN currently supports emulating instructions for ``mov``, ``movx``, +``movs``, ``stos``, ``test``, ``and``, ``or``, ``cmp``, ``sub`` and +``bittest`` without support for lock prefix. Real mode emulation is not +supported. + +.. figure:: images/hld-image82.png + :align: center + + Instruction Emulation Work Flow + +In the handlers for EPT violation or APIC access VM exit, ACRN will: + +1. Fetch the MMIO access request's address and size + +2. Do *decode_instruction* for the instruction in current RIP + with the following check: + + a. 
Is the instruction supported? If not, inject #UD to the guest.
+   b. Are the GVAs of RIP, dest, and src valid? If not, inject #PF to the guest.
+   c. Is the stack valid? If not, inject #SS to the guest.
+
+3. If step 2 succeeds, check the access direction. If it's a write,
+   do *emulate_instruction* to fetch the MMIO request's value from the
+   instruction operands.
+
+4. Execute the MMIO request handler: for an EPT violation it is *emulate_io*,
+   while for an APIC access it is *vlapic_write/read*, based on the access
+   direction. It finally completes this MMIO request emulation
+   by:
+
+   a. putting req.val to req.addr for a write operation
+   b. getting req.val from req.addr for a read operation
+
+5. If the access direction is read, do *emulate_instruction* to
+   put the MMIO request's value into the instruction operands.
+
+6. Return to the guest.
+
+TSC Emulation
+*************
+
+Guest vCPU execution of *RDTSC/RDTSCP* and access to
+*MSR_IA32_TSC_AUX* does not cause a VM exit to the hypervisor.
+The hypervisor uses *MSR_IA32_TSC_AUX* to record the CPU ID, so
+the CPU ID provided by *MSR_IA32_TSC_AUX* might be changed by the guest.
+
+*RDTSCP* is widely used by the hypervisor to identify the current CPU ID.
+Since there is no VM exit for the *MSR_IA32_TSC_AUX* MSR, the ACRN
+hypervisor saves/restores the *MSR_IA32_TSC_AUX* value on every VM
+exit/entry. Before the hypervisor restores the host CPU ID, *rdtscp*
+should not be called, as it could return the vCPU ID instead of the
+host CPU ID.
+
+The *MSR_IA32_TIME_STAMP_COUNTER* is emulated by the ACRN hypervisor, with
+a simple implementation based on *TSC_OFFSET* (enabled
+in *VMX_PROC_VM_EXEC_CONTROLS*):
+
+- For read: ``val = rdtsc() + exec_vmread64(VMX_TSC_OFFSET_FULL)``
+- For write: ``exec_vmwrite64(VMX_TSC_OFFSET_FULL, val - rdtsc())``
+
+XSAVE Emulation
+***************
+
+The XSAVE feature set comprises eight instructions:
+
+- *XGETBV* and *XSETBV* allow software to read and write the extended
+  control register *XCR0*, which controls the operation of the
+  XSAVE feature set. 
+
+- *XSAVE*, *XSAVEOPT*, *XSAVEC*, and *XSAVES* are four instructions
+  that save processor state to memory.
+
+- *XRSTOR* and *XRSTORS* are corresponding instructions that load
+  processor state from memory.
+
+- *XGETBV*, *XSAVE*, *XSAVEOPT*, *XSAVEC*, and *XRSTOR* can be executed
+  at any privilege level.
+
+- *XSETBV*, *XSAVES*, and *XRSTORS* can be executed only if CPL = 0.
+
+Enabling the XSAVE feature set is controlled by XCR0 (through XSETBV)
+and the IA32_XSS MSR. Refer to the `Intel SDM Volume 1`_ chapter 13 for
+more details.
+
+
+.. _Intel SDM Volume 1:
+   https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-1-manual.html
+
+.. figure:: images/hld-image38.png
+   :align: center
+
+   ACRN Hypervisor XSAVE emulation
+
+By default, ACRN enables XSAVES/XRSTORS in
+*VMX_PROC_VM_EXEC_CONTROLS2*, so it allows the guest to use the XSAVE
+feature. Because guest execution of *XSETBV* always triggers an XSETBV VM
+exit, ACRN needs to take care of XCR0 access.
+
+ACRN emulates XSAVE features through the following rules:
+
+1. Enumerate CPUID.01H for native XSAVE feature support.
+2. If supported, enable XSAVE in the hypervisor via CR4.OSXSAVE.
+3. Emulate the XSAVE-related CPUID.01H & CPUID.0DH for the guest.
+4. Emulate XCR0 access through *xsetbv_vmexit_handler*.
+5. Pass through guest access to the IA32_XSS MSR.
+6. The ACRN hypervisor itself does NOT use any XSAVE feature.
+7. Because ACRN emulates vCPUs in partition mode, based on rules 5
+   and 6 above, a guest vCPU fully controls the XSAVE feature in
+   non-root mode.
diff --git a/doc/developer-guides/hld/hv-startup.rst b/doc/developer-guides/hld/hv-startup.rst
new file mode 100644
index 000000000..d943a585c
--- /dev/null
+++ b/doc/developer-guides/hld/hv-startup.rst
@@ -0,0 +1,205 @@
+.. _hv-startup:
+
+Hypervisor Startup
+##################
+
+This section is an overview of the ACRN hypervisor startup. 
+The ACRN hypervisor
+compiles to a 32-bit multiboot-compliant ELF file.
+The bootloader (ABL or SBL) loads the hypervisor according to the
+addresses specified in the ELF header. The BSP starts the hypervisor
+with an initial state compliant with the Multiboot 1 specification,
+after the bootloader prepares the full configuration, including ACPI,
+E820, etc.
+
+The HV startup has two parts: the native startup, followed by
+VM startup.
+
+Native Startup
+**************
+
+.. figure:: images/hld-image107.png
+   :align: center
+   :name: hvstart-nativeflow
+
+   Hypervisor Native Startup Flow
+
+Native startup sets up a baseline environment for the HV, including
+basic memory and interrupt initialization, as shown in
+:numref:`hvstart-nativeflow`. Here is a short description of the flow:
+
+- **BSP Startup:** The starting point for the bootstrap processor.
+
+- **Relocation:** Relocate the hypervisor image if it is not placed at
+  the assumed base address.
+
+- **UART Init:** Initialize a pre-configured UART device used
+  as the base physical console for the HV and Service OS.
+
+- **Shell Init:** Start a command shell for the HV, accessible via the
+  UART.
+
+- **Memory Init:** Initialize memory type and cache policy, and create
+  the MMU page-table mapping for the HV.
+
+- **Interrupt Init:** Initialize interrupt and exception handling for
+  the native HV, including the IDT and the ``do_IRQ`` infrastructure;
+  a timer interrupt framework is then built. Native/physical
+  interrupts go through this ``do_IRQ`` infrastructure and are then
+  distributed to their targets (HV or VMs).
+
+- **Start AP:** The BSP sends the ``INIT-SIPI-SIPI`` IPI sequence to
+  start the other native APs (application processors). Each AP
+  initializes its own memory and interrupts, notifies the BSP on
+  completion, and enters the default idle loop.
+
+Symbols in the hypervisor are placed with an assumed base address, but
+the bootloader may not place the hypervisor at that specified base.
In
+that case, the hypervisor relocates itself to where the bootloader
+loads it.
+
+Here is a summary of the CPU and memory initial states that are set up
+after native startup.
+
+CPU
+   The ACRN hypervisor brings all physical processors to 64-bit IA32e
+   mode, with the assumption that the BSP starts in protected mode
+   where segmentation and paging set an identity mapping of the first
+   4G of addresses without permission restrictions. The control
+   registers and some MSRs are set as follows:
+
+   -  cr0: The following features are enabled: paging, write
+      protection, protected mode, numeric error, and co-processor
+      monitoring.
+
+   -  cr3: Refer to the initial state of memory.
+
+   -  cr4: The following features are enabled: physical address
+      extension, machine check, FXSAVE/FXRSTOR, SMEP, VMX operation,
+      and unmasked SIMD FP exceptions. The other features are disabled.
+
+   -  MSR_IA32_EFER: Only IA32e mode is enabled.
+
+   -  MSR_IA32_FS_BASE: The address of the stack canary, used for
+      detecting stack smashing.
+
+   -  MSR_IA32_TSC_AUX: A unique logical ID is set for each physical
+      processor.
+
+   -  stack: Each physical processor has a separate stack.
+
+Memory
+   All physical processors are in 64-bit IA32e mode after
+   startup. The GDT holds four entries: one unused, one for code and
+   one for data (both with a base of all 0's and a limit of all 1's),
+   and one for the 64-bit TSS. The TSS holds only three stack
+   pointers (for machine check, double fault, and stack fault) in the
+   interrupt stack table (IST), which differ across physical
+   processors. The LDT is disabled.
+
+Refer to section 3.5.2 for a detailed description of interrupt-related
+initial states, including the IDT and physical PICs.
+
+After the BSP detects that all APs are up, it creates the first
+VM, i.e. the SOS, as explained in the next section.
+
+VM Startup
+**********
+
+SOS is created and launched on the physical BSP after the hypervisor
+initializes itself.
Meanwhile, the APs enter the default idle loop
+(refer to :ref:`VCPU_lifecycle` for details), waiting for any vCPU to
+be scheduled to them.
+
+:numref:`hvstart-vmflow` illustrates the high-level execution flow of
+creating and launching a VM, applicable to both SOS and UOS. One major
+difference between the creation of SOS and UOS is that SOS is created
+by the hypervisor, while the creation of a UOS is triggered by the DM
+in SOS. The main steps include:
+
+- **Create VM:** A VM structure is allocated and initialized. A unique
+  VM ID is picked, the EPT is created, the I/O bitmap is set up, I/O
+  emulation handlers are initialized and registered, and virtual CPUID
+  entries are filled. For SOS, an additional E820 table is prepared.
+
+- **Create vCPUs:** Create the vCPUs, assigning each the physical
+  processor it is pinned to, a unique-per-VM vCPU ID, and a globally
+  unique VPID, and initialize its virtual LAPIC and MTRR. For SOS, one
+  vCPU is created for each physical CPU on the platform. For UOS, the
+  DM determines the number of vCPUs to be created.
+
+- **SW Load:** The BSP of a VM also prepares each VM's SW
+  configuration, including the kernel entry address, ramdisk address,
+  bootargs, zero page, etc. This is done by the hypervisor for SOS and
+  by the DM for UOS.
+
+- **Schedule vCPUs:** The vCPUs are scheduled to the corresponding
+  physical processors for execution.
+
+- **Init VMCS:** Initialize each vCPU's VMCS for its host state, guest
+  state, execution control, entry control, and exit control. This is
+  the last configuration step before the vCPU runs.
+
+- **vCPU thread:** The vCPU kicks off to run. The "Primary CPU" starts
+  running the kernel image configured by SW Load; a "Non-Primary CPU"
+  waits for the INIT-SIPI-SIPI IPI sequence triggered by its
+  "Primary CPU".
+
+.. figure:: images/hld-image104.png
+   :align: center
+   :name: hvstart-vmflow
+
+   Hypervisor VM Startup Flow
+
+SW configuration for Service OS (VM0):
+
+- **ACPI:** The HV passes the entire ACPI table from the bootloader to
+  the Service OS directly. Legacy mode is currently supported, as the
+  ACPI table is loaded at the F-segment.
+
+- **E820:** The HV passes the E820 table from the bootloader through
+  the multiboot information, after the HV-reserved memory (32M, for
+  example) is filtered out.
+
+- **Zero Page:** The HV prepares the zero page at the high end of
+  Service OS memory, which is determined by the VM0 guest FIT binary
+  build. The zero page includes the configuration for the ramdisk,
+  bootargs, and E820 entries. The zero page address is set in the
+  "Primary CPU" RSI register before the vCPU runs.
+
+- **Entry address:** The HV copies the Service OS kernel image to
+  0x1000000 as the entry address for VM0's "Primary CPU". This entry
+  address is set in the "Primary CPU" RIP register before the vCPU
+  runs.
+
+SW configuration for User OS (VMx):
+
+- **ACPI:** The virtual ACPI table is built by the DM and put at VMx's
+  F-segment. Refer to :ref:`hld-emulated-devices` for details.
+
+- **E820:** The virtual E820 table is built by the DM, then passed to
+  the zero page. Refer to :ref:`hld-emulated-devices` for details.
+
+- **Zero Page:** The DM prepares the zero page at "lowmem_top - 4K" in
+  VMx. This location is set in VMx's "Primary CPU" RSI register during
+  **SW Load**.
+
+- **Entry address:** The DM copies the User OS kernel image to
+  0x1000000 as the entry address for VMx's "Primary CPU". This entry
+  address is set in the "Primary CPU" RIP register before the vCPU
+  runs.
+
+Here is the initial mode of the vCPUs:
+
++------------------------------+-------------------------------+
+| VM and Processor Type        | Initial Mode                  |
++=============+================+===============================+
+| SOS         | BSP            | Same as physical BSP          |
+|             +----------------+-------------------------------+
+|             | AP             | Real Mode                     |
++-------------+----------------+-------------------------------+
+| UOS         | BSP            | Real Mode                     |
+|             +----------------+-------------------------------+
+|             | AP             | Real Mode                     |
++-------------+----------------+-------------------------------+
+
+Note that SOS is started with the same number of vCPUs as the physical
+CPUs to speed up booting. SOS will offline its APs right before it
+starts any UOS.
diff --git a/doc/developer-guides/hld/images/hld-image104.png b/doc/developer-guides/hld/images/hld-image104.png
new file mode 100644
index 000000000..db9a858b2
Binary files /dev/null and b/doc/developer-guides/hld/images/hld-image104.png differ
diff --git a/doc/developer-guides/hld/images/hld-image107.png b/doc/developer-guides/hld/images/hld-image107.png
new file mode 100644
index 000000000..392d0122f
Binary files /dev/null and b/doc/developer-guides/hld/images/hld-image107.png differ
diff --git a/doc/developer-guides/hld/images/hld-image17.png b/doc/developer-guides/hld/images/hld-image17.png
new file mode 100644
index 000000000..33da6827b
Binary files /dev/null and b/doc/developer-guides/hld/images/hld-image17.png differ
diff --git a/doc/developer-guides/hld/images/hld-image35.png b/doc/developer-guides/hld/images/hld-image35.png
new file mode 100644
index 000000000..e9421aedd
Binary files /dev/null and b/doc/developer-guides/hld/images/hld-image35.png differ
diff --git a/doc/developer-guides/hld/images/hld-image38.png b/doc/developer-guides/hld/images/hld-image38.png
new file mode 100644
index 000000000..ad6012563
Binary files /dev/null and b/doc/developer-guides/hld/images/hld-image38.png differ
diff --git
a/doc/developer-guides/hld/images/hld-image47.png b/doc/developer-guides/hld/images/hld-image47.png new file mode 100644 index 000000000..1aba845b0 Binary files /dev/null and b/doc/developer-guides/hld/images/hld-image47.png differ diff --git a/doc/developer-guides/hld/images/hld-image68.png b/doc/developer-guides/hld/images/hld-image68.png new file mode 100644 index 000000000..bec649a0a Binary files /dev/null and b/doc/developer-guides/hld/images/hld-image68.png differ diff --git a/doc/developer-guides/hld/images/hld-image7.png b/doc/developer-guides/hld/images/hld-image7.png new file mode 100644 index 000000000..fc4147bc0 Binary files /dev/null and b/doc/developer-guides/hld/images/hld-image7.png differ diff --git a/doc/developer-guides/hld/images/hld-image82.png b/doc/developer-guides/hld/images/hld-image82.png new file mode 100644 index 000000000..3c5810055 Binary files /dev/null and b/doc/developer-guides/hld/images/hld-image82.png differ