Add assign/deassign PCI device hypercall APIs to assign a PCI device from SOS to
post-launched VM or deassign a PCI device from post-launched VM to SOS. This patch
is prepared for spliting passthrough PCI device from DM to HV.
The old assign/deassign ptdev APIs will be discarded.
Tracked-On: #4371
Signed-off-by: Li Fei1 <fei1.li@intel.com>
On UEFI UP2 board, APs might execute HLT before SOS kernel INIT them.
After SOS kernel take over and will re-init the APs directly. The flows
from HV perspective is like:
HLT trap:
wait_event(VCPU_EVENT_VIRTUAL_INTERRUPT) -> sleep_thread
SOS kernel INIT, SIPI APs:
pause_vcpu(ZOMBIE) -> sleep_thread
-> reset_vcpu
-> launch_vcpu -> wake_vcpu
However, the last wake_vcpu will fail because the cpu event
VCPU_EVENT_VIRTUAL_INTERRUPT had not got signaled.
This patch will reset all vcpu events in reset_vcpu. If the thread was
previously waiting for a event, its waiting status will be cleared and
launch_vcpu will wake it to running.
Tracked-On: #4402
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
1. Align the coding style for these MACROs
2. Align the values of fixed VECTORs
Tracked-On: #4348
Signed-off-by: Yonghua Huang <yonghua.huang@intel.com>
In lapic passthrough mode, it should passthrough HLT/PAUSE execution
too. This patch disable their emulation when switch to lapic passthrough mode.
Tracked-On: #4329
Tested-by: Dongsheng Zhang <dongsheng.x.zhang@intel.com>
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
is_polling_ioreq is more straightforward. Rename it.
Tracked-On: #4329
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
SOS will use PCIe ECAM access PCIe external configuration space. HV should trap this
access for security(Now pre-launched VM doesn't want to support PCI ECAM; post-launched
VM trap PCIe ECAM access in DM).
Besides, update PCIe MMCONFIG region to be owned by hypervisor and expose and pass through
platform hide PCI devices by BIOS to SOS.
Tracked-On: #3475
Signed-off-by: Li Fei1 <fei1.li@intel.com>
HLT emulation is import to CPU resource maximum utilization. vcpu
doing HLT means it is idle and can give up CPU proactively. Thus, we
pause the vcpu thread in HLT emulation and resume it while event happens.
When vcpu enter HLT, its vcpu thread will sleep, but the vcpu state is
still 'Running'.
VM ID PCPU ID VCPU ID VCPU ROLE VCPU STATE
===== ======= ======= ========= ==========
0 0 0 PRIMARY Running
0 1 1 SECONDARY Running
Tracked-On: #4329
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Sometimes HV wants to know if there are pending interrupts of one vcpu.
Add .has_pending_intr interface in acrn_apicv_ops and return the pending
interrupts status by check IRRs of apicv.
Tracked-On: #4329
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Introduce two kinds of events for each vcpu,
VCPU_EVENT_IOREQ: for vcpu waiting for IO request completion
VCPU_EVENT_VIRTUAL_INTERRUPT: for vcpu waiting for virtual interrupts events
vcpu can wait for such events, and resume to run when the
event get signalled.
This patch also change IO request waiting/notifying to this way.
Tracked-On: #4329
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
As we enabled cpu sharing, PAUSE-loop exiting can help vcpu
to release its pcpu proactively. It's good for performance.
VMX_PLE_GAP: upper bound on the amount of time between two successive
executions of PAUSE in a loop.
VMX_PLE_WINDOW: upper bound on the amount of time a guest is allowed to
execute in a PAUSE loop
Tracked-On: #4329
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
In current code, wait_pcpus_offline() and make_pcpu_offline() are called by
both shutdown_vm() and reset_vm(), but this is not needed when lapic_pt is
not enabled for the vcpus of the VM.
The patch merged offline pcpus part code into a common
offline_lapic_pt_enabled_pcpus() api for shutdown_vm() and reset_vm() use and
called only when lapic_pt is enabled.
Tracked-On: #4325
Signed-off-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
1. This patch passes-through CR4.PCIDE to guest VM.
2. This patch handles the invlidation of TLB and the paging-structure caches.
According to SDM Vol.3 4.10.4.1, the following instructions invalidate
entries in the TLBs and the paging-structure caches:
- INVLPG: this instruction is passed-through to guest, no extra handling needed.
- INVPCID: this instruction is passed-trhough to guest, no extra handling needed.
- CR0.PG from 1 to 0: already handled by current code, change of CR0.PG will do
EPT flush.
- MOV to CR3: hypervisor doesn't trap this instrcution, no extra handling needed.
- CR4.PGE changed: already handled by current code, change of CR4.PGE will no EPT
flush.
- CR4.PCIDE from 1 to 0: this patch handles this case, will do EPT flush.
- CR4.PAE changed: already handled by current code, change of CR4.PAE will do EPT
flush.
- CR4.SEMP from 1 to 0, already handled by current code, change of CR4.SEMP will
do EPT flush.
- Task switch: Task switch is not supported in VMX non-root mode.
- VMX transitions: already handled by current code with the support of VPID.
3. This patch checks the validatiy of CR0, CR4 related to PCID feature.
According to SDM Vol.3 4.10.1, CR.PCIDE can be 1 only in IA-32e mode.
- MOV to CR4 causes a general-protection exception (#GP) if it would change CR4.PCIDE
from 0 to 1 and either IA32_EFER.LMA = 0 or CR3[11:0] ≠ 000H
- MOV to CR0 causes a general-protection exception if it would clear CR0.PG to 0
while CR4.PCIDE = 1
Tracked-On: #4296
Signed-off-by: Binbin Wu <binbin.wu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
According to SDM Vol.3 Section 25.3, behavior of the INVPCID
instruction is determined first by the setting of the “enable
INVPCID” VM-execution control:
- If the “enable INVPCID” VM-execution control is 0, INVPCID
causes an invalid-opcode exception (#UD).
- If the “enable INVPCID” VM-execution control is 1, treatment
is based on the setting of the “INVLPG exiting” VM-execution
control:
* If the “INVLPG exiting” VM-execution control is 0, INVPCID
operates normally.
* If the “INVLPG exiting” VM-execution control is 1, INVPCID
causes a VM exit.
In current implementation, hypervisor doesn't set “INVLPG exiting”
VM-execution control, this patch sets “enable INVPCID” VM-execution
control to 1 when the instruction is supported by physical cpu.
If INVPCID is supported by physical cpu, INVPCID will not cause VM
exit in VM.
If INVPCID is not supported by physical cpu, INVPCID causes an #UD
in VM.
When INVPCID is passed-through to VM, According to SDM Vol.3 28.3.3.1,
INVPCID instruction invalidates linear mappings and combined mappings.
They are required to do so only for the current VPID.
HV assigned a unique vpid for each vCPU, if guest uses wrong PCID,
it would not affect other vCPUs.
Tracked-On: #4296
Signed-off-by: Binbin Wu <binbin.wu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Pass-through PCID related capabilities to VMs:
- The support of PCID (CPUID.01H.ECX[17])
- The support of instruction INVPCID (CPUID.07H.EBX[10])
Tracked-On: #4296
Signed-off-by: Binbin Wu <binbin.wu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
ACRN relies on the capability of VPID to avoid EPT flushes during VMX transitions.
This capability is checked as a must have hardware capability, otherwise, ACRN will
refuse to boot.
Also, the current code has already made sure each vpid for a virtual cpu is valid.
So, no need to check the validity of vpid for vcpu and enable VPID for vCPU by default.
Tracked-On: #4296
Signed-off-by: Binbin Wu <binbin.wu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Per SDM 10.12.5.1 vol.3, local APIC should keep LAPIC state after receiving
INIT. The local APIC ID register should also be preserved.
Tracked-On: #4267
Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
The patch abstract a vcpu_reset_internal() api for internal usage, the
function would not touch any vcpu state transition and just do vcpu reset
processing. It will be called by create_vcpu() and reset_vcpu().
The reset_vcpu() will act as a public api and should be called
only when vcpu receive INIT or vm reset/resume from S3. It should not be
called when do shutdown_vm() or hcall_sos_offline_cpu(), so the patch remove
reset_vcpu() in shutdown_vm() and hcall_sos_offline_cpu().
The patch also introduced reset_mode enum so that vcpu and vlapic could do
different context operation according to different reset mode;
Tracked-On: #4267
Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Rename vlapic_xxx_write_handler() to vlapic_write_xxx() to make code more
readable;
Tracked-On: #4268
Signed-off-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Per SDM 10.4.7.1 vol3, the LVT register should be reset to 0s except for the
mask bits are set to 1s.
In current code, the lvt_last[] has been set to correct value(i.e. 0x10000) in
vlapic_reset() before enforce setting vlapic->lvt_last[i] to 0U, add the loop
that set vlapic->lvt_last[i] to 0 would lead to get zero when read LVT regs
after reset, which is incompiant with SDM;
Tracked-On: #4266
Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Fei Li <fei1.li@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
For guest reset, if the highest severity guest reset will reset
system. There is vm flag to call out the highest severity guest
in specific scenario which is a static guest severity assignment.
There is case that the static highest severity guest is shutdown
and the highest severity guest should be transfer to other guest.
For example, in ISD scenario, if RTVM (static highest severity
guest) is shutdown, SOS should be highest severity guest instead.
The is_highest_severity_vm() is updated to detect highest severity
guest dynamically. And promote the highest severity guest reset
to system reset.
Also remove the GUEST_FLAG_HIGHEST_SEVERITY definition.
Tracked-On: #4270
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
For system S5, ACRN had assumption that SOS shutdown will trigger
system shutdown. So the system shutdown logical is:
1. Trap SOS shutdown
2. Wait for all other guest shutdown
3. Shutdown system
The new logical is refined as:
If all guest is shutdown, shutdown whole system
Tracked-On: #4270
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
We don't use INIT signal notification method now. This patch
removes them.
Tracked-On: #3886
Acked-by: Eddie Dong <eddie.dong@intel.com>
Signed-off-by: Kaige Fu <kaige.fu@intel.com>
We have implemented a new notification method using NMI.
So replace the INIT notification method with the NMI one.
Then we can remove INIT notification related code later.
Tracked-On: #3886
Signed-off-by: Kaige Fu <kaige.fu@intel.com>
There is a window where we may miss the current request in the
notification period when the work flow is as the following:
CPUx + + CPUr
| |
| +--+
| | | Handle pending req
| <--+
+--+ |
| | Set req flag |
<--+ |
+------------------>---+
| Send NMI | | Handle NMI
| <--+
| |
| |
| +--> vCPU enter
| |
+ +
So, this patch enables the NMI-window exiting to trigger the next vmexit
once there is no "virtual-NMI blocking" after vCPU enter into VMX non-root
mode. Then we can process the pending request on time.
Tracked-On: #3886
Acked-by: Eddie Dong <eddie.dong@intel.com>
Signed-off-by: Kaige Fu <kaige.fu@intel.com>
The NMI for notification should not be inject to guest. So,
this patch drops NMI injection request when we use NMI
to notify vCPUs. Meanwhile, ACRN doesn't support vNMI well
and there is no well-designed way to check if the NMI is
for notification or for guest now. So, we take all the NMIs as
notificaton NMI for hard rtvm temporarily. It means that the
hard rtvm will never receive NMI with this patch applied.
TODO: vNMI support is not ready yet. we will add it later.
Tracked-On: #3886
Signed-off-by: Kaige Fu <kaige.fu@intel.com>
ACRN hypervisor needs to kick vCPU off VMX non-root mode to do some
operations in hypervisor, such as interrupt/exception injection, EPT
flush etc. For non lapic-pt vCPUs, we can use IPI to do so. But, it
doesn't work for lapic-pt vCPUs as the IPI will be injected to VMs
directly without vmexit.
Without the way to kick the vCPU off VMX non-root mode to handle pending
request on time, there may be fatal errors triggered.
1). Certain operation may not be carried out on time which may further
lead to fatal errors. Taking the EPT flush request as an example, once we
don't flush the EPT on time and the guest access the out-of-date EPT,
fatal error happens.
2). ACRN now will send an IPI with vector 0xF0 to target vCPU to kick the vCPU
off VMX non-root mode if it wants to do some operations on target vCPU.
However, this way doesn't work for lapic-pt vCPUs. The IPI will be delivered
to the guest directly without vmexit and the guest will receive a unexpected
interrupt. Consequently, if the guest can't handle this interrupt properly,
fatal error may happen.
The NMI can be used as the notification signal to kick the vCPU off VMX
non-root mode for lapic-pt vCPUs. So, this patch uses NMI as notification signal
to address the above issues for lapic-pt vCPUs.
Tracked-On: #3886
Acked-by: Eddie Dong <eddie.dong@intel.com>
Signed-off-by: Kaige Fu <kaige.fu@intel.com>
Reserved bits in a 8-bit PAT field has been checked in pat_mem_type_invalid.
Remove this redundant check "(PAT_FIELD_RSV_BITS & field) != 0UL" in
write_pat_msr.
Tracked-On: #1842
Signed-off-by: Shiqing Gao <shiqing.gao@intel.com>
The port 0x64 is the status register of i8042 keyboard controller. When
i8042 is defined as ACPI PnP device in BIOS, enforce returning 0xff in
read handler would cause infinite loop when booting SOS VM, so expose
the physical port read in this case;
Tracked-On: #4228
Signed-off-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
In current architecutre, the maximum vCPUs number per VM could not
exceed the pCPUs number. Given the MAX_PCPU_NUM macro is provided
in board configurations, so remove the MAX_VCPUS_PER_VM from Kconfig
and add a macro of MAX_VCPUS_PER_VM to reference MAX_PCPU_NUM directly.
Tracked-On: #4230
Signed-off-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
rename the macro since MAX_PCPU_NUM could be parsed from board file and
it is not a configurable item anymore.
Tracked-On: #4230
Signed-off-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
In APICv advanced mode, an targeted vCPU, running in non-root mode, may get outdated
TMR and EOI exit bitmap if another vCPU sends an interrupt to it if the trigger mode
of this interrupt has changed.
This patch try to kick vCPU off to let it get the latest TMR and EOI exit bitmap when
it enters non-root mode again if new coming interrupt trigger mode has changed. Then
fill the interrupt to PIR.
Tracked-On: #4200
Signed-off-by: Li Fei1 <fei1.li@intel.com>
On some platforms, HPA regions for Virtual Machine can not be
contiguous because of E820 reserved type or PCI hole. In such
cases, pre-launched VMs need to be assigned non-contiguous memory
regions and this patch addresses it.
To keep things simple, current design has the following assumptions,
1. HPA2 always will be placed after HPA1
2. HPA1 and HPA2 don’t share a single ve820 entry.
(Create multiple entries if needed but not shared)
3. Only support 2 non-contiguous HPA regions (can extend
at a later point for multiple non-contiguous HPA)
Signed-off-by: Vijay Dhanraj <vijay.dhanraj@intel.com>
Tracked-On: #4195
Acked-by: Anthony Xu <anthony.xu@intel.com>
To handle reboot requests from pre-launched VMs that don't have
GUEST_FLAG_HIGHEST_SEVERITY, we shutdown the target VM explicitly
other than ignoring them.
Tracked-On: #2700
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Anthony Xu <anthony.xu@intel.com>
ptirq_prepare_msix_remap was called no matter whether MSI/MSI-X was enabled or not
and it passed zero to input parameter virtual MSI/MSI-X data field to indicate
MSI/MSI-X was disabled. However, it barely did nothing on this case.
Now ptirq_prepare_msix_remap is called only when MSI/MSI-X is enabled. It doesn't
need to check whether MSI/MSI-X is enabled or not by checking virtual MSI/MSI-X
data field.
Tracked-On: #3475
Signed-off-by: Li Fei1 <fei1.li@intel.com>
It's meaningless to sleep a non-running vcpu. Add a state check before
sleep the thread object of the vcpu.
Tracked-On: #4178
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
With cpu-sharing enabled, there are more than 1 vcpu on 1 pcpu, so the
smp_call handler should switch the vmcs to the target vcpu's vmcs. Then
get the info.
dump_vcpu_reg and dump_guest_mem should run on certain vmcs, otherwise,
there will be #GP error.
Renaming:
vcpu_dumpreg -> dump_vcpu_reg
switch_vmcs -> load_vmcs
Tracked-On: #4178
Signed-off-by: Conghui Chen <conghui.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
We care more about leaf and subleaf of cpuid than vcpu_id.
So, this patch changes the cpuid trace-entry to trace the leaf
and subleaf of this cpuid vmexit.
Tracked-On: #4175
Signed-off-by: Kaige Fu <kaige.fu@intel.com>
PMU is hidden from any guest, UD is expected when guest
try to execute 'rdpmc' instruction.
this patch sets 'RDPMC exiting' in Processorbased
VM-execution control.
Tracked-On: #3453
Signed-off-by: Yonghua Huang <yonghua.huang@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
After changing init_vmcs to smp call approach and do it before
launch_vcpu, it could work with noop scheduler. On real sharing
scheudler, it has problem.
pcpu0 pcpu1 pcpu1
vmBvcpu0 vmAvcpu1 vmBvcpu1
vmentry
init_vmcs(vmBvcpu1) vmexit->do_init_vmcs
corrupt current vmcs
vmentry fail
launch_vcpu(vmBvcpu1)
This patch mark a event flag when request vmcs init for specific vcpu. When
it is running and checking pending events, will do init_vmcs firstly.
Tracked-On: #4178
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Starting with TSC_DEADLINE msr interception disabled, the virtual TSC_DEADLINE msr is always 0.
When the interception is enabled, need to sync the physical TSC_DEADLINE value to virtual TSC_DEADLINE.
When the interception is disabled, there are 2 cases:
- if the timer hasn't expired, sync virtual TSC_DEADLINE to physical TSC_DEADLINE, to make the guest read the same tsc_deadline
as it writes. This may change when the timer actually trigger.
- if the timer has expired, write 0 to the virtual TSC_DEADLINE.
Tracked-On: #4162
Signed-off-by: Yan, Like <like.yan@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
When write to virtual TSC_DEADLINE, if virtual TSC_ADJUST is not zero:
- when guest intends to disarm the tsc_deadline timer, should not arm the timer falsely;
- when guest intends to arm the tsc_deadline timer, should not disarm the timer falsely.
When read from virtual TSC_DEADLINE, if virtual TSC_ADJUST is not zero:
- if physical TSC_DEADLINE is not zero, return the virtual TSC_DEADLINE value;
- if physical TSC_DEADLINE is zero which means it's not armed (automatically disarmed after
timer triggered), return 0 and reset the virtual TSC_DEADLINE.
Tracked-On: #4162
Signed-off-by: Yan, Like <like.yan@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
xsave area:
legacy region: 512 bytes
xsave header: 64 bytes
extended region: < 3k bytes
So, pre-allocate 4k area for xsave. Use certain instruction to save or
restore the area according to hardware xsave feature set.
Tracked-On: #4166
Signed-off-by: Conghui Chen <conghui.chen@intel.com>
Reviewed-by: Anthony Xu <anthony.xu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
ptirq_msix_remap doesn't do the real remap, that's the vmsi_remap and vmsix_remap_entry
does. ptirq_msix_remap only did the preparation.
Tracked-On: #3475
Signed-off-by: Li Fei1 <fei1.li@intel.com>
Issue description:
-----------------
Machine Check Error on Page Size Change
Instruction fetch may cause machine check error if page size
and memory type was changed without invalidation on some
processors[1][2]. Malicious guest kernel could trigger this issue.
This issue applies to both primary page table and extended page
tables (EPT), however the primary page table is controlled by
hypervisor only. This patch mitigates the situation in EPT.
Mitigation details:
------------------
Implement non-execute huge pages in EPT.
This patch series clears the execute permission (bit 2) in the
EPT entries for large pages. When EPT violation is triggered by
guest instruction fetch, hypervisor converts the large page to
smaller 4 KB pages and restore the execute permission, and then
re-execute the guest instruction.
The current patch turns on the mitigation by default.
The follow-up patches will conditionally turn on/off the feature
per processor model.
[1] Refer to erratum KBL002 in "7th Generation Intel Processor
Family and 8th Generation Intel Processor Family for U Quad Core
Platforms Specification Update"
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/7th-gen-core-family-spec-update.pdf
[2] Refer to erratum SKL002 in "6th Generation Intel Processor
Family Specification Update"
https://www.intel.com/content/www/us/en/products/docs/processors/core/desktop-6th-gen-core-family-spec-update.html
Tracked-On: #4101
Signed-off-by: Binbin Wu <binbin.wu@intel.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
In non-64-bit mode, CS segment base address should be considered when
determining the linear address of the vcpu's instruction pointer. Use
vie_calculate_gla() for instruction address translation which also takes
care of 64-bit mode.
Tracked-On: #4064
Signed-off-by: Peter Fang <peter.fang@intel.com>