diff --git a/doc/developer-guides/images/work_flow_of_error_detection_and_error_handling.png b/doc/developer-guides/images/work_flow_of_error_detection_and_error_handling.png new file mode 100644 index 000000000..76f4903f0 Binary files /dev/null and b/doc/developer-guides/images/work_flow_of_error_detection_and_error_handling.png differ diff --git a/doc/developer-guides/index.rst b/doc/developer-guides/index.rst index 4750b3627..51e3cdf35 100644 --- a/doc/developer-guides/index.rst +++ b/doc/developer-guides/index.rst @@ -32,3 +32,4 @@ project. coding_guidelines doc_guidelines graphviz + sw_design_guidelines diff --git a/doc/developer-guides/sw_design_guidelines.rst b/doc/developer-guides/sw_design_guidelines.rst new file mode 100644 index 000000000..79f21be29 --- /dev/null +++ b/doc/developer-guides/sw_design_guidelines.rst @@ -0,0 +1,489 @@ +.. _sw_design_guidelines: + +Software Design Guidelines +########################## + +Error Detection and Error Handling +********************************** + +Workflow +======== + +Error detection and error handling workflow in the ACRN hypervisor is shown in +:numref:`work_flow_of_error_detection_and_error_handling`. + +.. figure:: images/work_flow_of_error_detection_and_error_handling.png + :align: center + :name: work_flow_of_error_detection_and_error_handling + + Error Detection and Error Handling Workflow + + +Design Assumption +================= + +There are three types of design assumptions in the ACRN hypervisor, as shown +below: + +**Pre-condition** + Pre-conditions shall be defined right before the definition/declaration of + the corresponding function in the C source file or header file. + All pre-conditions shall be guaranteed by the caller of the function. + Error checking of the pre-conditions are not needed in release version of the + function. Developers could use ASSERT to catch design errors in a debug + version for some cases. Verification of the hypervisor shall check whether + each caller guarantees all pre-conditions of the callee (or not). + + This design assumption applies to the following cases: + + - Input parameters of the function. + - Global state, such as hypervisor operation mode. + +**Post-condition** + Post-conditions shall be defined right before the definition/declaration of + the corresponding function in the C source file or header file. + All post-conditions shall be guaranteed by the function. All callers of the + function should trust these post-conditions are met. + Error checking of the post-conditions are not needed in release version of + each caller. Developers could use ASSERT to catch design errors in a debug + version for some cases. Verification of the hypervisor shall check whether the + function guarantees all post-conditions (or not). + + This design assumption applies to the following case: + + - Return value of the function + + It is used to guarantee that the return value is valid, such as the return + pointer is not NULL, the return value is within a valid range, or the + members of the return structure are valid. + + +**Application Constraints** + Application constraints of the hypervisor shall be defined in design document + and safety manual. + All application constraints shall be guaranteed by external safety + applications, such as Board Support Package, firmware, safety VM, or Hardware. + The verification of application integration shall check whether the safety + application meets all application constraints. + + This design assumption applies to the following cases: + + - Configuration data defined by external safety application, such as physical + PCI device information specific for each board design. + + - Input data which is only specified by external safety application. + + +Architecture Level +================== + +Fulfillment to FuSa Standards +----------------------------- + +The fulfillment matrix of error detection and error handling on architecture +level to FuSa Standards [IEC_61508-7]_ in ACRN hypervisor is shown in +:numref:`fulfillment_matrix_arch_level` below. + +.. table:: Fulfillment to FuSa Standards on Architecture Level + :align: center + :widths: auto + :name: fulfillment_matrix_arch_level + + +-----------------------+---------------+----------+--------------------------------------+ + | Technique/Measure | Ref in | SIL3 | Company Solution | + | | IEC 61508-7 | | | + +=======================+===============+==========+======================================+ + | Failure assertion | C.3.3 | R | This technique will be used. | + | programming | | | Plan to do range check in hypercalls | + | | | | and HW capability checks. | + +-----------------------+---------------+----------+--------------------------------------+ + + +Error Handling Methods +---------------------- + +The error handling methods used in the ACRN hypervisor on an architecture level +are shown below. + +**Invoke default fatal error handler** + The hypervisor shall invoke the default fatal error handler when the below + cases occur. Customers can define platform-specific handlers, allowing them to + implement additional error reporting (mostly to hardware) if required. The + default fatal error handler will invoke platform-specific handlers defined by + users at first, then it will panic the system. + + This method applies to the following cases: + + - Related hardware resources are unavailable. + - Boot information is invalid during platform initialization. + - Unexpected exception occurs in root mode due to hardware failures. + - Failures occur in the VM dedicated for error handling. + +**Return error code** + The hypervisor shall return an error code to the VM when the below cases + occur. The error code shall indicate the error type detected (e.g. invalid + parameter, device not found, device busy, resource unavailable, etc). + + This method applies to the following case: + + - The hypercall parameter from the VM is invalid. + +**Inform the safety VM through specific register or memory area** + The hypervisor shall inform the safety VM through a specific register or + memory area when the below cases occur. The VM will decide how to handle the + related error. This shall only be done after the VM (Safety OS or Service OS) + dedicated to error handling has started. + + This method applies to the following cases: + + - Machine check errors occur due to hardware failures. + + - Unexpected VM entry failures occur, where the VM is not the one dedicated + for error handling. + +**Panic the system via ASSERT** + The hypervisor can panic the system when the below cases occur. It shall + only be used for debug and used to check pre-conditions and post-conditions + to catch design errors. + + This method applies to the following case: + + - Software design errors occur. + + +Rules of Error Detection and Error Handling +------------------------------------------- + +The rules of error detection and error handling on an architecture level are +shown in :numref:`rules_arch_level` below. + +.. table:: Rules of Error Detection and Error Handling on Architecture Level + :align: center + :widths: auto + :name: rules_arch_level + + +--------------------+-------------------------+--------------+---------------------------+-------------------------+ + | Resource Class | Failure Mode | Error | Error Handling Policy | Example | + | | | Detection | | | + | | | via | | | + | | | Hypervisor | | | + +====================+=========================+==============+===========================+=========================+ + | External resource | Invalid register/memory | Yes | Follow SDM strictly, or | Unsupported MSR | + | provided by VM | state on VM exit | | state any deviation to the| or invalid CPU ID | + | | | | document explicitly | | + | +-------------------------+--------------+---------------------------+-------------------------+ + | | Invalid hypercall | Yes | The hypervisor shall | Invalid hypercall | + | | parameter | | return related error code | parameter provided by | + | | | | to the VM | any VM | + | +-------------------------+--------------+---------------------------+-------------------------+ + | | Invalid data in the | Yes | Case by case depending | Invalid data in memory | + | | sharing memory area | | on the data | shared with all VMs, | + | | | | | such as IO request | + | | | | | buffers and sbuf for | + | | | | | debug | + +--------------------+-------------------------+--------------+---------------------------+-------------------------+ + | External resource | Invalid E820 table or | Yes | The hypervisor shall | Invalid E820 table or | + | provided by | invalid boot information| | panic during platform | invalid boot information| + | bootloader | | | initialization | | + | (UEFI or SBL) | | | | | + +--------------------+-------------------------+--------------+---------------------------+-------------------------+ + | Physical resource | 1GB page is not | Yes | The hypervisor shall | 1GB page is not | + | used by the | available on the | | panic during platform | available on the | + | hypervisor | platform or invalid | | initialization | platform or invalid | + | | physical CPU ID | | | physical CPU ID | + +--------------------+-------------------------+--------------+---------------------------+-------------------------+ + + +Examples +-------- + +Here is an example to illustrate when error handling codes are required on +an architecture level. + +There are two pre-condition statements of ``vcpu_from_vid``. It indicates that +it's the caller's responsibility to guarantee these pre-conditions. + +.. code-block:: c + + /** + * @pre vcpu_id < CONFIG_MAX_VCPUS_PER_VM + * @pre &(vm->hw.vcpu_array[vcpu_id])->state != VCPU_OFFLINE + */ + static inline struct acrn_vcpu *vcpu_from_vid(struct acrn_vm *vm, uint16_t vcpu_id) + { + return &(vm->hw.vcpu_array[vcpu_id]); + } + +``vcpu_from_vid`` is called by ``hcall_set_vcpu_regs``, which is a hypercall. +``hcall_set_vcpu_regs`` is an external interface and ``vcpu_id`` is provided by +VM. In this case, we shall add the error checking codes before calling +``vcpu_from_vid`` to make sure that the passed parameters are valid and the +pre-conditions are guaranteed. + +Here is the sample codes for error checking before calling ``vcpu_from_vid``: + +.. code-block:: c + + status = 0; + + if (vcpu_id >= CONFIG_MAX_VCPUS_PER_VM) { + pr_err("vcpu id is out of range \r\n"); + status = -EINVAL; + } else if ((&(vm->hw.vcpu_array[vcpu_id]))->state == VCPU_OFFLINE) { + pr_err("vcpu is offline \r\n"); + status = -EINVAL; + } + + if (status == 0) { + vcpu = vcpu_from_vid(vm, vcpu_id); + ... + } + + +Module Level +============ + +Fulfillment to FuSa Standards +----------------------------- + +The fulfillment matrix of error detection and error handling on a module level +to FuSa Standards [IEC_61508-7]_ in ACRN hypervisor is shown in +:numref:`fulfillment_matrix_module_level` below. + +.. table:: Fulfillment to FuSa Standards on Module Level + :align: center + :widths: auto + :name: fulfillment_matrix_module_level + + +-----------------------+---------------+----------+---------------------------------------------------+ + | Technique/Measure | Ref in | SIL3 | Company Solution | + | | IEC 61508-7 | | | + +=======================+===============+==========+===================================================+ + | Design and coding | C.2.6 | HR | This technique will be used for internal functions| + | standards | Table B.1 | | of the hypervisor. Plan to use data verification, | + | | | | with explicit specification of pre-conditions and | + | | | | post-conditions for functions. | + +-----------------------+---------------+----------+---------------------------------------------------+ + + +Error Handling Methods +---------------------- + +The error handling methods used in the ACRN hypervisor on a module level are +shown below. + +**Panic the system via ASSERT** + The hypervisor can panic the system when the below cases occur. It shall + only be used for debugging, used to check pre-conditions and post-conditions + to catch design errors. + + This method applies to the following case: + + - Software design errors occur. + + +Rules of Error Detection and Error Handling +------------------------------------------- + +The rules of error detection and error handling on a module level are shown in +:numref:`rules_module_level` below. + +.. table:: Rules of Error Detection and Error Handling on Module Level + :align: center + :widths: auto + :name: rules_module_level + + +--------------------+-----------+----------------------------+---------------------------+-------------------------+ + | Resource Class | Failure | Error Detection via | Error Handling Policy | Example | + | | Mode | Hypervisor | | | + +====================+===========+============================+===========================+=========================+ + | Internal data of | N/A | Partial. | The hypervisor shall use | virtual PCI device | + | the hypervisor | | The related pre-conditions | the internal resource/data| information, defined | + | | | are required. | directly. | with array 'pci_vdevs[]'| + | | | The design will guarantee | | through static | + | | | the correctness and the | | allocation. | + | | | test cases will verify the | | | + | | | related pre-conditions. | | | + | | | If the design can not | | | + | | | guarantee the correctness, | | | + | | | the related error handling | | | + | | | codes need to be added. | | | + | | | Note: Some examples of | | | + | | | pre-conditions are listed, | | | + | | | like non-empty array, valid| | | + | | | array size and non-null | | | + | | | pointer. | | | + +--------------------+-----------+----------------------------+---------------------------+-------------------------+ + | Configuration data | Corrupted | No. | The bootloader initializes| 'vm_config->pci_ptdevs' | + | of the VM | VM config | The related pre-conditions | hypervisor (including | is configured | + | | | are required. | code, data, and bss) and | statically. | + | | | Note: VM configuration data| verifies the integrity of | | + | | | are auto generated based on| hypervisor image in which | | + | | | different board configs, | VM configurations are. | | + | | | they are defined | Thus hypervisor does not | | + | | | as static structure. | need any additional | | + | | | | mechanism. | | + +--------------------+-----------+----------------------------+---------------------------+-------------------------+ + | Configuration data | N/A | No. | The hypervisor shall use | The maximum number of | + | of the hypervisor | | The related pre-conditions | the internal resource/data| PCI devices in the VM, | + | | | are required. | directly. | defined with | + | | | The design will guarantee | | CONFIG_MAX_PCI_DEV_NUM | + | | | the correctness and this | | through configuration. | + | | | shall be verified manually.| | | + +--------------------+-----------+----------------------------+---------------------------+-------------------------+ + + +Examples +-------- + +Here are some examples to illustrate when error handling codes are required on +a module level. + +**Example_1: Analyze the function 'partition_mode_vpci_init'** + +.. code-block:: c + + /** + * @pre vm != NULL + * @pre vm->vpci->pci_vdev_cnt <= CONFIG_MAX_PCI_DEV_NUM + */ + static int32_t partition_mode_vpci_init(const struct acrn_vm *vm) + { + struct acrn_vpci *vpci = (struct acrn_vpci *)&(vm->vpci); + struct pci_vdev *vdev; + struct acrn_vm_config *vm_config = get_vm_config(vm->vm_id); + struct acrn_vm_pci_ptdev_config *ptdev_config; + uint32_t i; + + vpci->pci_vdev_cnt = vm_config->pci_ptdev_num; + + for (i = 0U; i < vpci->pci_vdev_cnt; i++) { + vdev = &vpci->pci_vdevs[i]; + vdev->vpci = vpci; + ptdev_config = &vm_config->pci_ptdevs[i]; + vdev->vbdf.value = ptdev_config->vbdf.value; + + if (vdev->vbdf.value != 0U) { + partition_mode_pdev_init(vdev, ptdev_config->pbdf); + vdev->ops = &pci_ops_vdev_pt; + } else { + vdev->ops = &pci_ops_vdev_hostbridge; + } + + if (vdev->ops->init != NULL) { + if (vdev->ops->init(vdev) != 0) { + pr_err("%s() failed at PCI device (vbdf %x)!", + __func__, vdev->vbdf); + } + } + } + + return 0; + } + +``get_vm_config`` is called by ``partition_mode_vpci_init``. +There are one pre-condition and two post-conditions of ``get_vm_config``. +It indicates that the caller of ``get_vm_config`` shall guarantee these +pre-conditions and ``get_vm_config`` itself shall guarantee the post-condition. + +.. code-block:: c + + /** + * @pre vm_id < CONFIG_MAX_VM_NUM + * @post retval != NULL + * @post retval->pci_ptdev_num <= MAX_PCI_DEV_NUM + */ + struct acrn_vm_config *get_vm_config(uint16_t vm_id) + { + return &vm_configs[vm_id]; + } + +**Question_1: Is error checking required for 'vm_config'?** + +No. Because 'vm_config' is getting data from ``get_vm_config`` and the +post-condition of ``get_vm_config`` guarantees that the return value is not NULL. + + +**Question_2: Is error checking required for 'vdev'?** + +No. Here are the reasons: + +a) The pre-condition of ``partition_mode_vpci_init`` guarantees that 'vm' is not + NULL. It indicates that 'vpci' is not NULL. Since 'vdev' is getting data from + the array 'pci_vdevs[]' via indexing, 'vdev' is not NULL as long as the index + is valid. + +b) The post-condition of ``get_vm_config`` guarantees that 'vpci->pci_vdev_cnt' + is less than or equal to 'CONFIG_MAX_PCI_DEV_NUM', which is the array size of + 'pci_vdevs[]'. It indicates that the index used to get 'vdev' is always + valid. + +Given the two reasons above, 'vdev' is always not NULL. So, the error checking +codes are not required for 'vdev'. + + +**Question_3: Is error checking required for 'ptdev_config'?** + +No. 'ptdev_config' is getting data from the array 'pci_vdevs[]', which is the +physical PCI device information coming from Board Support Package and firmware. +For physical PCI device information, the related application constraints +shall be defined in the design document or safety manual. For debug purpose, +developers could use ASSERT here to catch the Board Support Package or firmware +failures, which does not guarantee these application constraints. + + +**Question_4: Is error checking required for 'vdev->ops->init'?** + +No. Here are the reasons: + +a) Question_2 proves that 'vdev' is always not NULL. + +b) 'vdev->ops' is fully initialized before 'vdev->ops->init' is called. + +Given the two reasons above, 'vdev->ops->init' is always not NULL. So, the error +checking codes are not required for 'vdev->ops->init'. + + +**Question_5: How to handle the case when 'vdev->ops->init(vdev)' returns non-zero?** + +This case indicates that the initialization of specific virtual device fails. +Investigation has to be done to figure out the root-cause. Default fatal error +handler shall be invoked here if it is caused by a hardware failure or invalid +boot information. + + +**Example_2: Analyze the function 'partition_mode_vpci_deinit'** + +.. code-block:: c + + /** + * @pre vdev != NULL + * @pre vm->vpci->pci_vdev_cnt <= CONFIG_MAX_PCI_DEV_NUM + */ + static void partition_mode_vpci_deinit(const struct acrn_vm *vm) + { + struct pci_vdev *vdev; + uint32_t i; + + for (i = 0U; i < vm->vpci.pci_vdev_cnt; i++) { + vdev = (struct pci_vdev *) &(vm->vpci.pci_vdevs[i]); + if ((vdev->ops != NULL) && (vdev->ops->deinit != NULL)) { + if (vdev->ops->deinit(vdev) != 0) { + pr_err("vdev->ops->deinit failed!"); + } + } + /* TODO: implement the deinit of 'vdev->ops' */ + } + } + + +**Question_6: Is error checking required for 'vdev->ops' and 'vdev->ops->init'?** + +Yes. Because 'vdev->ops' and 'vdev->ops->init' can not be guaranteed to be +not NULL. If the VM called ``partition_mode_vpci_deinit`` twice, it may be NULL. + +References +********** + +.. [IEC_61508-7] IEC 61508-7:2010, Functional safety of electrical/electronic/programmable electronic safety-related systems - Part 7: Overview of techniques and measures +