Since we restore BAR values when writing Command Register if necessary. We don't
need to trap FLR and do the BAR restore then.
Tracked-On: #3475
Signed-off-by: Li Fei1 <fei1.li@intel.com>
When PCIe does Conventinal Reset or FLR, almost PCIe configurations and states will
lost. So we should save the configurations and states before do the reset and restore
them after the reset. This was done well by BIOS or Guest now. However, ACRN will trap
these access and handle them properly for security. Almost of these configurations and
states will be written to physical configuration space at last except for BAR values
for now. So we should do the restore for BAR values. One way is to do restore after
one type reset is detected. This will be too complex. Another way is to do the restore
when BIOS or guest tries to write the Command Register. This could work because:
1. The I/O Space Enable bit and Memory Space Enable bits in Command Register will reset
to zero.
2. Before BIOS or guest wants to enable these bits, the BAR couldn't be accessed.
3. So we could restore the BAR values before enable these bits if reset is detected.
Tracked-On: #3475
Signed-off-by: Li Fei1 <fei1.li@intel.com>
- target vm_id of vuart can't be un-defined VM, nor the VM itself.
- fix potential NULL pointer dereference in find_active_target_vuart()
Tracked-On: #3854
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Per SDM 10.12.5.1 vol.3, local APIC should keep LAPIC state after receiving
INIT. The local APIC ID register should also be preserved.
Tracked-On: #4267
Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
The patch abstract a vcpu_reset_internal() api for internal usage, the
function would not touch any vcpu state transition and just do vcpu reset
processing. It will be called by create_vcpu() and reset_vcpu().
The reset_vcpu() will act as a public api and should be called
only when vcpu receive INIT or vm reset/resume from S3. It should not be
called when do shutdown_vm() or hcall_sos_offline_cpu(), so the patch remove
reset_vcpu() in shutdown_vm() and hcall_sos_offline_cpu().
The patch also introduced reset_mode enum so that vcpu and vlapic could do
different context operation according to different reset mode;
Tracked-On: #4267
Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Rename vlapic_xxx_write_handler() to vlapic_write_xxx() to make code more
readable;
Tracked-On: #4268
Signed-off-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Some MACROs in lapic.h are duplicated with apicreg.h, and some MACROs are
never referenced, remove them.
Tracked-On: #4268
Signed-off-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Per SDM 10.4.7.1 vol3, the LVT register should be reset to 0s except for the
mask bits are set to 1s.
In current code, the lvt_last[] has been set to correct value(i.e. 0x10000) in
vlapic_reset() before enforce setting vlapic->lvt_last[i] to 0U, add the loop
that set vlapic->lvt_last[i] to 0 would lead to get zero when read LVT regs
after reset, which is incompiant with SDM;
Tracked-On: #4266
Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Fei Li <fei1.li@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Structure pci_vbar is used to define the virtual BAR rather than physical BAR.
It's better to name as pci_vbar.
Tracked-On: #3475
Signed-off-by: Li Fei1 <fei1.li@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
There's no need to check which capability we care at the very beginning. We could
do it later step by step.
Tracked-On: #3475
Signed-off-by: Li Fei1 <fei1.li@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Per ACPI 6.2 spec, chapter 5.2.5.2 "Finding the RSDP on UEFI Enabled Systems":
In Unified Extensible Firmware Interface (UEFI) enabled systems, a pointer to
the RSDP structure exists within the EFI System Table. The OS loader is provided
a pointer to the EFI System Table at invocation. The OS loader must retrieve the
pointer to the RSDP structure from the EFI System Table and convey the pointer
to OSPM, using an OS dependent data structure, as part of the hand off of
control from the OS loader to the OS.
So when ACRN boot from direct mode on a UEFI enabled system, hypervisor might
be failed to get rsdp by seaching rsdp in legacy EBDA or 0xe0000~0xfffff region,
but it still have chance to get rsdp by seaching it in e820 ACPI reclaimable
region with some edk2 based BIOS.
The patch will search rsdp from e820 ACPI reclaim region When failed to get
rsdp from legacy region.
Tracked-On: #4301
Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Fei Li <fei1.li@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
- commit 69152647 ("hv: Use virtual APIC IDs for Pre-launched VMs")
enables virtual APIC IDs for pre-launched VMs thus xapic_phys is no
longer needed to force guest xAPIC to work in physical destination mode.
- HVC is not available in logical partition mode and "console=hvc0" should
be removed from guest Linux bootargs.
Tracked-On: #3854
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Victor Sun <victor.sun@intel.com>
Add severity definitions for different scenarios. The static
guest severity is defined according to guest configurations.
Also add sanity check to make sure the severity for all guests
are correct.
Tracked-On: #4270
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
For guest reset, if the highest severity guest reset will reset
system. There is vm flag to call out the highest severity guest
in specific scenario which is a static guest severity assignment.
There is case that the static highest severity guest is shutdown
and the highest severity guest should be transfer to other guest.
For example, in ISD scenario, if RTVM (static highest severity
guest) is shutdown, SOS should be highest severity guest instead.
The is_highest_severity_vm() is updated to detect highest severity
guest dynamically. And promote the highest severity guest reset
to system reset.
Also remove the GUEST_FLAG_HIGHEST_SEVERITY definition.
Tracked-On: #4270
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
For system S5, ACRN had assumption that SOS shutdown will trigger
system shutdown. So the system shutdown logical is:
1. Trap SOS shutdown
2. Wait for all other guest shutdown
3. Shutdown system
The new logical is refined as:
If all guest is shutdown, shutdown whole system
Tracked-On: #4270
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
ACRN hypervisor should trap guest doing PCI AF FLR. Besides, it should save some status
before doing the FLR and restore them later, only BARs values for now.
This patch will trap guest Conventional PCI Advanced Features Control Register write
operation if the device supports Conventional PCI Advanced Features Capability and
check whether it wants to do device AF FLR. If it does, call pdev_do_flr to do the job.
Tracked-On: #3465
Signed-off-by: Li Fei1 <fei1.li@intel.com>
ACRN hypervisor should trap guest doing PCIe FLR. Besides, it should save some status
before doing the FLR and restore them later, only BARs values for now.
This patch will trap guest Device Capabilities Register write operation if the device
supports PCI Express Capability and check whether it wants to do device FLR. If it does,
call pdev_do_flr to do the job.
Tracked-On: #3465
Signed-off-by: Li Fei1 <fei1.li@intel.com>
The pointer variable 'start' should be checked against NULL
right after detected it is not pointer to a space character,
otherwise the pointer variable 'end' must hold the wrong
address right after NULL if the cmdline containing trailing
whitespaces and deference the wrong address out of cmdline
string. this parsing code also been optimized and simplified.
Tracked-On: projectacrn#4250
Signed-off-by: Gary <gordon.king@intel.com>
We don't use INIT signal notification method now. This patch
removes them.
Tracked-On: #3886
Acked-by: Eddie Dong <eddie.dong@intel.com>
Signed-off-by: Kaige Fu <kaige.fu@intel.com>
We have implemented a new notification method using NMI.
So replace the INIT notification method with the NMI one.
Then we can remove INIT notification related code later.
Tracked-On: #3886
Signed-off-by: Kaige Fu <kaige.fu@intel.com>
There is a window where we may miss the current request in the
notification period when the work flow is as the following:
CPUx + + CPUr
| |
| +--+
| | | Handle pending req
| <--+
+--+ |
| | Set req flag |
<--+ |
+------------------>---+
| Send NMI | | Handle NMI
| <--+
| |
| |
| +--> vCPU enter
| |
+ +
So, this patch enables the NMI-window exiting to trigger the next vmexit
once there is no "virtual-NMI blocking" after vCPU enter into VMX non-root
mode. Then we can process the pending request on time.
Tracked-On: #3886
Acked-by: Eddie Dong <eddie.dong@intel.com>
Signed-off-by: Kaige Fu <kaige.fu@intel.com>
The NMI for notification should not be inject to guest. So,
this patch drops NMI injection request when we use NMI
to notify vCPUs. Meanwhile, ACRN doesn't support vNMI well
and there is no well-designed way to check if the NMI is
for notification or for guest now. So, we take all the NMIs as
notificaton NMI for hard rtvm temporarily. It means that the
hard rtvm will never receive NMI with this patch applied.
TODO: vNMI support is not ready yet. we will add it later.
Tracked-On: #3886
Signed-off-by: Kaige Fu <kaige.fu@intel.com>
ACRN hypervisor needs to kick vCPU off VMX non-root mode to do some
operations in hypervisor, such as interrupt/exception injection, EPT
flush etc. For non lapic-pt vCPUs, we can use IPI to do so. But, it
doesn't work for lapic-pt vCPUs as the IPI will be injected to VMs
directly without vmexit.
Without the way to kick the vCPU off VMX non-root mode to handle pending
request on time, there may be fatal errors triggered.
1). Certain operation may not be carried out on time which may further
lead to fatal errors. Taking the EPT flush request as an example, once we
don't flush the EPT on time and the guest access the out-of-date EPT,
fatal error happens.
2). ACRN now will send an IPI with vector 0xF0 to target vCPU to kick the vCPU
off VMX non-root mode if it wants to do some operations on target vCPU.
However, this way doesn't work for lapic-pt vCPUs. The IPI will be delivered
to the guest directly without vmexit and the guest will receive a unexpected
interrupt. Consequently, if the guest can't handle this interrupt properly,
fatal error may happen.
The NMI can be used as the notification signal to kick the vCPU off VMX
non-root mode for lapic-pt vCPUs. So, this patch uses NMI as notification signal
to address the above issues for lapic-pt vCPUs.
Tracked-On: #3886
Acked-by: Eddie Dong <eddie.dong@intel.com>
Signed-off-by: Kaige Fu <kaige.fu@intel.com>
Reserved bits in a 8-bit PAT field has been checked in pat_mem_type_invalid.
Remove this redundant check "(PAT_FIELD_RSV_BITS & field) != 0UL" in
write_pat_msr.
Tracked-On: #1842
Signed-off-by: Shiqing Gao <shiqing.gao@intel.com>
Input 'vcpu_id' and the state of target vCPU should be validated
properly:
- 'vcpu_id' shall be less than 'vm->hw.created_vcpus' instead
of 'MAX_VCPUS_PER_VM'.
- The state of target vCPU should be "VCPU_PAUSED", and reject
all other states.
Tracked-On: #4245
Signed-off-by: Yonghua Huang <yonghua.huang@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
When user use make menuconfig to configure memory related kconfig items,
we need add range check to avoid compile error or other potential issues:
CONFIG_LOW_RAM_SIZE:(0 ~ 0x10000)
the value should be less than 64KB;
CONFIG_HV_RAM_SIZE: (0x1000000 ~ 0x10000000)
the hypervisor RAM size should be supposed between
16MB to 256MB;
CONFIG_PLATFORM_RAM_SIZE: (0x100000000 ~ 0x4000000000)
the platform RAM size should be larger than 4GB
and less than 256GB;
CONFIG_SOS_RAM_SIZE: (0x100000000 ~ 0x4000000000)
the SOS RAM size should be larger than 4GB
and less than 256GB;
CONFIG_UOS_RAM_SIZE: (0 ~ 0x2000000000)
the UOS RAM size should be less than 128GB;
Tracked-On: #4229
Signed-off-by: Victor Sun <victor.sun@intel.com>
Set default CONFIG_KATA_VM_NUM to 1 in SDC scenario so that user could
have a try on Kata container without rebuilding hypervisor.
Please be aware that vcpu affinity of VM1 in CPU partition mode
would be impacted by this patch.
Tracked-On: #4232
Signed-off-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
This patch fixes potential hypervisor crash when
calling hcall_write_protect_page() with a crafted
GPA in 'struct wp_data' instance, e.g. an invalid
GPA that is not in the scope of the target VM's
EPT address space.
To check the validity for this GPA before updating
the 'write protect' page.
Tracked-On: #4240
Signed-off-by: Yonghua Huang <yonghua.huang@intel.com>
Reviewed-by: Fei Li <fei1.li@intel.com>
This patch adds a helper function send_single_nmi. The fisrt caller
will soon come with the following patch.
Tracked-On: #3886
Acked-by: Eddie Dong <eddie.dong@intel.com>
Signed-off-by: Kaige Fu <kaige.fu@intel.com>
This patch installs a NMI handler in acrn IDT to handle
NMIs out of dispatch_exception.
Tracked-On: #3886
Acked-by: Eddie Dong <eddie.dong@intel.com>
Signed-off-by: Kaige Fu <kaige.fu@intel.com>
There are lines of repeated codes in excp/external_interrupt_save_frame
and excp_rsvd. So, this patch defines two .macro, save_frame and restore_frame,
to reduce the repeated codes.
No functional change.
Tracked-On: #3886
Acked-by: Eddie Dong <eddie.dong@intel.com>
Signed-off-by: Kaige Fu <kaige.fu@intel.com>
The port 0x64 is the status register of i8042 keyboard controller. When
i8042 is defined as ACPI PnP device in BIOS, enforce returning 0xff in
read handler would cause infinite loop when booting SOS VM, so expose
the physical port read in this case;
Tracked-On: #4228
Signed-off-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
In current architecutre, the maximum vCPUs number per VM could not
exceed the pCPUs number. Given the MAX_PCPU_NUM macro is provided
in board configurations, so remove the MAX_VCPUS_PER_VM from Kconfig
and add a macro of MAX_VCPUS_PER_VM to reference MAX_PCPU_NUM directly.
Tracked-On: #4230
Signed-off-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
rename the macro since MAX_PCPU_NUM could be parsed from board file and
it is not a configurable item anymore.
Tracked-On: #4230
Signed-off-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
remove 'guest_init_pml4' and 'tmp_pg_array' in vm_arch
since they are not used.
Tracked-On: #1842
Signed-off-by: Mingqiang Chi <mingqiang.chi@intel.com>
The initialization of "dmar_unit->gcmd" shall be done via reading from
Global Status Register rather than Global Command Register.
Rationale:
According to Chapter 10.4.4 Global Command Register in VT-d spec, Global Command
Register is a write-only register to control remapping hardware.
Global Status Register is the corresponding read-only register to report remapping
hardware status.
Tracked-On: #1842
Signed-off-by: Shiqing Gao <shiqing.gao@intel.com>
For now, we set NOOP scheduler as default. User can choose IORR scheduler as needed.
Tracked-On: #4178
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Add yield support for schedule, which can give up pcpu proactively.
Tracked-On: #4178
Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
Signed-off-by: Yu Wang <yu1.wang@intel.com>
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Implement .sleep/.wake/.pick_next of sched_iorr.
In .pick_next, we count current object's timeslice and pick the next
avaiable one. The policy is
1) get the first item in runqueue firstly
2) if object picked has no time_cycles, replenish it pick this one
3) At least take one idle sched object if we have no runnable object
after step 1) and 2)
In .wake, we start the tick if we have more than one active
thread_object in runqueue. In .sleep, stop the tick timer if necessary.
Tracked-On: #4178
Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
Signed-off-by: Yu Wang <yu1.wang@intel.com>
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
sched_control is per-pcpu, each sched_control has a tick timer running
periodically. Every period called a tick. In tick handler, we do
1) compute left timeslice of current thread_object if it's not the idle
2) make a schedule request if current thread_object run out of timeslice
For runqueue maintaining, we will keep objects which has timeslice in
the front of runqueue and the ones get new replenished in tail.
Tracked-On: #4178
Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
Signed-off-by: Yu Wang <yu1.wang@intel.com>
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
We set timeslice to 10ms as default, and set tick interval to 1ms.
When init sched_iorr scheduler, we init a periodic timer as the tick and
init the runqueue to maintain objects in the sched_control. Destroy
the timer in deinit.
Tracked-On: #4178
Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
Signed-off-by: Yu Wang <yu1.wang@intel.com>
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
IO sensitive Round-robin scheduler aim to schedule threads with
round-robin policy. Meanwhile, we also enhance it with some fairness
configuration, such as thread will be scheduled out without properly
timeslice. IO request on thread will be handled in high priority.
This patch only add a skeleton for the sched_iorr scheduler.
Tracked-On: #4178
Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
Signed-off-by: Yu Wang <yu1.wang@intel.com>
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
The calculation of the end boundry address is corrected by
adding the size extracted from _DYNAMIC to start address
in type of uint8_t while improving the code by calulating
the end boundry address after scanning, also reducing type
casts accordingly.
Tracked-On: projectacrn#4191
Signed-off-by: Gary <gordon.king@intel.com>
In APICv advanced mode, an targeted vCPU, running in non-root mode, may get outdated
TMR and EOI exit bitmap if another vCPU sends an interrupt to it if the trigger mode
of this interrupt has changed.
This patch try to kick vCPU off to let it get the latest TMR and EOI exit bitmap when
it enters non-root mode again if new coming interrupt trigger mode has changed. Then
fill the interrupt to PIR.
Tracked-On: #4200
Signed-off-by: Li Fei1 <fei1.li@intel.com>
This patch updates kconfig to support server platforms
for increased number of VCPUs per VM and PT IRQ number.
Signed-off-by: Vijay Dhanraj <vijay.dhanraj@intel.com>
Tracked-On: #4196
On some platforms, HPA regions for Virtual Machine can not be
contiguous because of E820 reserved type or PCI hole. In such
cases, pre-launched VMs need to be assigned non-contiguous memory
regions and this patch addresses it.
To keep things simple, current design has the following assumptions,
1. HPA2 always will be placed after HPA1
2. HPA1 and HPA2 don’t share a single ve820 entry.
(Create multiple entries if needed but not shared)
3. Only support 2 non-contiguous HPA regions (can extend
at a later point for multiple non-contiguous HPA)
Signed-off-by: Vijay Dhanraj <vijay.dhanraj@intel.com>
Tracked-On: #4195
Acked-by: Anthony Xu <anthony.xu@intel.com>
To handle reboot requests from pre-launched VMs that don't have
GUEST_FLAG_HIGHEST_SEVERITY, we shutdown the target VM explicitly
other than ignoring them.
Tracked-On: #2700
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Anthony Xu <anthony.xu@intel.com>
ptirq_prepare_msix_remap was called no matter whether MSI/MSI-X was enabled or not
and it passed zero to input parameter virtual MSI/MSI-X data field to indicate
MSI/MSI-X was disabled. However, it barely did nothing on this case.
Now ptirq_prepare_msix_remap is called only when MSI/MSI-X is enabled. It doesn't
need to check whether MSI/MSI-X is enabled or not by checking virtual MSI/MSI-X
data field.
Tracked-On: #3475
Signed-off-by: Li Fei1 <fei1.li@intel.com>
Do vMSI-X remap only when Mask Bit in Vector Control Register for MSI-X Table Entry
is unmask.
The previous implementation also has two issues:
1. It only check whether Message Control Register for MSI-X has been modified when
guest writes MSI-X CFG space at Message Control Register offset.
2. It doesn't really disable MSI-X when guest wants to disable MSI-X.
Tracked-On: #3475
Signed-off-by: Li Fei1 <fei1.li@intel.com>
1. disable physical MSI before writing the virtual MSI CFG space
2. do the remap_vmsi if the guest wants to enable MSI or update MSI address or data
3. disable INTx and enable MSI after step 2.
The previous Message Control check depends on the guest write MSI Message Control
Register at the offset of Message Control Register. However, the guest could access
this register at the offset of MSI Capability ID register. This patch remove this
constraint. Also, The previous implementation didn't really disable MSI when guest
wanted to disable MSI.
Tracked-On: #3475
Signed-off-by: Li Fei1 <fei1.li@intel.com>
It's meaningless to sleep a non-running vcpu. Add a state check before
sleep the thread object of the vcpu.
Tracked-On: #4178
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Now, we support schedule inplace. And with cpu sharing, there might be
multi vcpu running on same pcpu. Reschedule request will happen when
switch the running vcpu. If the current vcpu is polling on the IO
completion, it need to be scheduled back to the polling point.
In the polling path, construct a loop for polling, and do schedule in the
loop if needed.
Tracked-On: #4178
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
With cpu-sharing enabled, there are more than 1 vcpu on 1 pcpu, so the
smp_call handler should switch the vmcs to the target vcpu's vmcs. Then
get the info.
dump_vcpu_reg and dump_guest_mem should run on certain vmcs, otherwise,
there will be #GP error.
Renaming:
vcpu_dumpreg -> dump_vcpu_reg
switch_vmcs -> load_vmcs
Tracked-On: #4178
Signed-off-by: Conghui Chen <conghui.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Add a header field in acrnlog message to indicate the current
running thread.
Tracked-On: #4178
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
We care more about leaf and subleaf of cpuid than vcpu_id.
So, this patch changes the cpuid trace-entry to trace the leaf
and subleaf of this cpuid vmexit.
Tracked-On: #4175
Signed-off-by: Kaige Fu <kaige.fu@intel.com>
PMU is hidden from any guest, UD is expected when guest
try to execute 'rdpmc' instruction.
this patch sets 'RDPMC exiting' in Processorbased
VM-execution control.
Tracked-On: #3453
Signed-off-by: Yonghua Huang <yonghua.huang@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Deterministic is important for RTVM. The mitigation for MCE on
Page Size Change converts a large page to 4KB pages runtimely during
the vmexit triggered by the instruction fetch in the large page.
These vmexits increase nondeterminacy, which should be avoided for RTVM.
This patch builds 4KB page mapping in EPT for RTVM to avoid these vmexits.
Tracked-On: #4101
Signed-off-by: Binbin Wu <binbin.wu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Add a option MCE_ON_PSC_WORKAROUND_DISABLED to disable the software
workaround for the issue Machine Check Error on Page Size Change.
Tracked-On: #4101
Signed-off-by: Binbin Wu <binbin.wu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Only apply the software workaround on the models that might be
affected by MCE on page size change. For these models that are
known immune to the issue, the mitigation is turned off.
Atom processors are not afftected by the issue.
Also check the CPUID & MSR to check whether the model is immune to the issue:
CPU is not vulnerable when both CPUID.(EAX=07H,ECX=0H).EDX[29] and
IA32_ARCH_CAPABILITIES[IF_PSCHANGE_MC_NO] are 1.
Other cases not listed above, CPU may be vulnerable.
This patch also changes MACROs for MSR IA32_ARCH_CAPABILITIES bits to UL instead of U
since the MSR is 64bit.
Tracked-On: #4101
Signed-off-by: Binbin Wu <binbin.wu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
After changing init_vmcs to smp call approach and do it before
launch_vcpu, it could work with noop scheduler. On real sharing
scheudler, it has problem.
pcpu0 pcpu1 pcpu1
vmBvcpu0 vmAvcpu1 vmBvcpu1
vmentry
init_vmcs(vmBvcpu1) vmexit->do_init_vmcs
corrupt current vmcs
vmentry fail
launch_vcpu(vmBvcpu1)
This patch mark a event flag when request vmcs init for specific vcpu. When
it is running and checking pending events, will do init_vmcs firstly.
Tracked-On: #4178
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
The default PCI mmcfg base is stored in ACPI MCFG table, when
CONFIG_ACPI_PARSE_ENABLED is set, acpi_fixup() function will
parse and fix up the platform mmcfg base in ACRN boot stage;
when it is not set, platform mmcfg base will be initialized to
DEFAULT_PCI_MMCFG_BASE which generated by acrn-config tool;
Please note we will not support platform which has multiple PCI
segment groups.
Tracked-On: #4157
Signed-off-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Starting with TSC_DEADLINE msr interception disabled, the virtual TSC_DEADLINE msr is always 0.
When the interception is enabled, need to sync the physical TSC_DEADLINE value to virtual TSC_DEADLINE.
When the interception is disabled, there are 2 cases:
- if the timer hasn't expired, sync virtual TSC_DEADLINE to physical TSC_DEADLINE, to make the guest read the same tsc_deadline
as it writes. This may change when the timer actually trigger.
- if the timer has expired, write 0 to the virtual TSC_DEADLINE.
Tracked-On: #4162
Signed-off-by: Yan, Like <like.yan@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
When write to virtual TSC_DEADLINE, if virtual TSC_ADJUST is not zero:
- when guest intends to disarm the tsc_deadline timer, should not arm the timer falsely;
- when guest intends to arm the tsc_deadline timer, should not disarm the timer falsely.
When read from virtual TSC_DEADLINE, if virtual TSC_ADJUST is not zero:
- if physical TSC_DEADLINE is not zero, return the virtual TSC_DEADLINE value;
- if physical TSC_DEADLINE is zero which means it's not armed (automatically disarmed after
timer triggered), return 0 and reset the virtual TSC_DEADLINE.
Tracked-On: #4162
Signed-off-by: Yan, Like <like.yan@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
xsave area:
legacy region: 512 bytes
xsave header: 64 bytes
extended region: < 3k bytes
So, pre-allocate 4k area for xsave. Use certain instruction to save or
restore the area according to hardware xsave feature set.
Tracked-On: #4166
Signed-off-by: Conghui Chen <conghui.chen@intel.com>
Reviewed-by: Anthony Xu <anthony.xu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Name vmsi and vmsix function with verb-object style:
For external APIs, using MODULE_NAME_verb-object style;
For internal APIs, using verb-object style.
Tracked-On: #3475
Signed-off-by: Li Fei1 <fei1.li@intel.com>
ptirq_msix_remap doesn't do the real remap, that's the vmsi_remap and vmsix_remap_entry
does. ptirq_msix_remap only did the preparation.
Tracked-On: #3475
Signed-off-by: Li Fei1 <fei1.li@intel.com>
Add a Kconfig parameter called UEFI_OS_LOADER_NAME to hold the Service VM EFI
bootloader to be run by the ACRN hypervisor. A new string manipulation function
to convert from (char *) to (CHAR16 *) has been added to facilitate the
implementation.
The default value is set to systemd-boot (bootloaderx64.efi)
Tracked-On: #2793
Signed-off-by: Geoffroy Van Cutsem <geoffroy.vancutsem@intel.com>
Major changes:
1. Correct handling of device multi-function capability
We only check function zero for this feature. If it has it, we continue
looking at all remaining functions, ignoring those with invalid vendors.
The PCI spec says we are not to probe beyond function zero if it does
not exist or indicates it is not a multi-function device.
2a. Walk *ALL* buses in the PCI space, however,
Before walking the PCI hierarchy, post-processed ACPI DMAR info is parsed
and a map is created between all device-scopes across all DRHDs and the
corresponding IOMMU index.
This map is used at the time of walking the PCI hierarchy. If a BDF that
ACRN is currently working on, is found in the above-mentioned map, the
BDF device is mapped to the corresponding DRHD in the map.
If the BDF were a bridge type, realized with "Header Type" in config space,
the BDF device along with all its downstream devices are mapped to the
corresponding DRHD in the map.
To avoid walking previously visited buses, we maintain a bitmap that
stores which bus is walked when we handle Bridge type devices.
Once ACPI information is included into ACRN about the PCI-Express Root
Complexes / PCI Host Bridges, we can avoid the final loop which probes
all remainder buses, and instead jump to the next Host Bridge bus.
From prior patches, init_pdev returns the pdev structure it created to
the caller. This allows us to complete initialization by updating its
drhd_idx to the correct DRHD.
Tracked-On: #4134
Signed-off-by: Alexander Merritt <alex.merritt@intel.com>
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
On server platforms, DMAR DRHD device scope entries may contain PCI
bridges.
Bridges in the DRHD device scope indicate this IOMMU translates for all
devices on the hierarchy below that bridge.
ACRN is unaware of bridge types in the device scope, and adds these
directly to its internal representation of a DRHD. When looking up a BDF
within these DRHD entries, device_to_dmaru assumes all entries are
Endpoints, comparing BDF to BDF. Thus device to DMAR unit fails, because
it treats a bridge as an Endpoint type.
This change leverages prior patches by converting a BDF to the
associated device DRHD index, and uses that index to obtain the correct
DRHD state.
Handling a bridge in other ways may require maintaining a bus list for
each, or replacing each bridge in the dev scope with a set of all device
BDFs underneath it. Server platforms can have hundreds of PCI devices,
thus making the device scope artificially large is unwieldy.
Tracked-On: #4134
Signed-off-by: Alexander Merritt <alex.merritt@intel.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
ACRN does not support multiple PCI segments in its current form.
But VT-d module uses segment info in its interfaces and
hardcodes it to 0.
This patch cleans up everything related to segment to avoid
ambiguity.
Tracked-On: #4134
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
In later patches we use information from DMAR tables to guide discovery
and initialization of PCI devices.
Tracked-On: #4134
Signed-off-by: Alexander Merritt <alex.merritt@intel.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
We add new member pci_pdev.drhd_idx associating the DRHD
(IOMMU) with this pdev, and a method to convert a pbdf of a device to
this index by searching the pdev list.
Partial patch: drhd_index initialization handled in subsequent patch.
Tracked-On: #4134
Signed-off-by: Alexander Merritt <alex.merritt@intel.com>
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Add some encapsulation of utilities which read PCI header space using
wrapper functions. Also contain verification of PCI vendor to its own
function, rather than having hard-coded integrals exposed among other
code.
Tracked-On: #4134
Signed-off-by: Alexander Merritt <alex.merritt@intel.com>
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
add shell command to support dump dump guest memory
e.g.
dump_guest_mem vm_id, gva, length
Tracked-On: #4144
Signed-off-by: Mingqiang Chi <mingqiang.chi@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
rename shell_dumpmem to shell_dump_host_mem
and refine this api.
Tracked-On: #4144
Signed-off-by: Mingqiang Chi <mingqiang.chi@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
now this api assumes the guest OS is 64 bits,
this patch remove this api and will replace it
with dumping guest memory.
Tracked-On: #4144
Signed-off-by: Mingqiang Chi <mingqiang.chi@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
The $(BOARD)_acpi_info.h is generated by acrn-config tool, remove this
header in make clean would cause failure when user finish configuring
in webUI and start to make acrn-hypervisor by the command
"make hypervisor BOARD=xxx SCENARIO=yyy" because we mandatory do make
clean before making hypervisor.
The patch replace the file removal with a warning string to hint user
to check the file validity.
Tracked-On: #3779
Signed-off-by: Victor Sun <victor.sun@intel.com>
call vuart_register_io_handler function, when the parameter vuart_idx is greater
than or equal to 2, print the vuart index value which will not register the vuart.
Tracked-On: #4072
Signed-off-by: Jidong Xia <xiajidong@cmss.chinamobile.com>
After reshuffle pci_bar structrue we could write ~0U not BAR size mask to BAR
configuration space directly when do BAR sizing. In this case, we could know whether
the value in BAR configuration space is a valid base address. As a result, we could
do BAR re-programming whenever we want.
Tracked-On: #3475
Signed-off-by: Li Fei1 <fei1.li@intel.com>
The current code declare pci_bar structure following the PCI bar spec. However,
we could not tell whether the value in virtual BAR configuration space is valid
base address base on current pci_bar structure. We need to add more fields which
are duplicated instances of the vBAR information. Basides these fields which will
added, bar_base_mapped is another duplicated instance of the vBAR information.
This patch try to reshuffle the pci_bar structure to declare pci_bar structure
following the software implement benefit not the PCI bar spec.
Tracked-On: #3475
Signed-off-by: Li Fei1 <fei1.li@intel.com>
The current do PCI IO BAR remap in vdev_pt_allow_io_vbar. This patch split this
function into vdev_pt_deny_io_vbar and vdev_pt_allow_io_vbar. vdev_pt_deny_io_vbar
removes the old IO port mapping, vdev_pt_allow_io_vbar add the new IO port mapping.
Tracked-On: #3475
Signed-off-by: Li Fei1 <fei1.li@intel.com>
In 64-bit mode, processor pushes SS and RSP onto stack unconditionally.
Also when dumping the exception info, it makes more sense to dump
the RSP at the point of interrupt, rather than the RSP after pushing
context (including GPRs)
Tracked-On: #4102
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
The *.mk files under misc/acrn-config/library are all rules for hypervisor
makefiles only, so move these files to hypervisor/scripts/makefile/ folder.
The folder of acrn-config/library/ will be used to store python script lib only.
Tracked-On: #3779
Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Terry Zou <terry.zou@intel.com>
The board specific $(BOARD)_acpi_info.h is generated by acrn-config tool,
we should clean it up before build hypervisor, otherwise the file could be
referenced by next build process if no config XMLs is specified.
Tracked-On: #3779
Signed-off-by: Victor Sun <victor.sun@intel.com>
Issue description:
-----------------
Machine Check Error on Page Size Change
Instruction fetch may cause machine check error if page size
and memory type was changed without invalidation on some
processors[1][2]. Malicious guest kernel could trigger this issue.
This issue applies to both primary page table and extended page
tables (EPT), however the primary page table is controlled by
hypervisor only. This patch mitigates the situation in EPT.
Mitigation details:
------------------
Implement non-execute huge pages in EPT.
This patch series clears the execute permission (bit 2) in the
EPT entries for large pages. When EPT violation is triggered by
guest instruction fetch, hypervisor converts the large page to
smaller 4 KB pages and restore the execute permission, and then
re-execute the guest instruction.
The current patch turns on the mitigation by default.
The follow-up patches will conditionally turn on/off the feature
per processor model.
[1] Refer to erratum KBL002 in "7th Generation Intel Processor
Family and 8th Generation Intel Processor Family for U Quad Core
Platforms Specification Update"
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/7th-gen-core-family-spec-update.pdf
[2] Refer to erratum SKL002 in "6th Generation Intel Processor
Family Specification Update"
https://www.intel.com/content/www/us/en/products/docs/processors/core/desktop-6th-gen-core-family-spec-update.html
Tracked-On: #4101
Signed-off-by: Binbin Wu <binbin.wu@intel.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Currently make hypervisor will depend on a $(BOARD).config file to load
board defconfig which triggered by oldconfig process, this will block
make from XMLs for a new board because $(BOARD).config never exist.
This requires us to patch configuration for new board earlier than make
oldconfig.
Tracked-On: #4067
Signed-off-by: Victor Sun <victor.sun@intel.com>
In non-64-bit mode, CS segment base address should be considered when
determining the linear address of the vcpu's instruction pointer. Use
vie_calculate_gla() for instruction address translation which also takes
care of 64-bit mode.
Tracked-On: #4064
Signed-off-by: Peter Fang <peter.fang@intel.com>
MISRA C requires specified bounds for arrays declaration, previous declaration
of platform_clos_array in board.h does not meet the requirement.
Tracked-On: #3987
Signed-off-by: Victor Sun <victor.sun@intel.com>