Works around https://github.com/moby/moby/issues/33176 and fixes #1807. Updated all users of linuxkit/runc:2649198589ef0020d99f613adaeda45ce0093a38 to this new build. Signed-off-by: Ian Campbell <ian.campbell@docker.com>
Authors: Chris Dalton cid@hpi.com, Nigel Edwards nigel.edwards@hpe.com, Theo Koulouris theo.koulouris@hpe.com
Split Kernel
Project links:
- okernel sources on GitHub: https://github.com/linux-okernel/linux-okernel
- Userspace components and supporting material: https://github.com/linux-okernel/linux-okernel-components
Similar to the nested-kernel work for BSD by Dautenhahn et al. [1], the aim of the split kernel (okernel) is to introduce a level of intra-kernel protection into the kernel so that, amongst other things, we can offer lifetime guarantees over kernel code and data integrity. Unlike the BSD-based nested kernel work, we focus on the Linux kernel and do make use of HW virtualization features such as Extended Page Tables (EPT) or equivalent to provide protection from malicious kernel changes. (Our initial prototype is based on Intel x86, but the intention is to be architecture neutral so we can apply it to other architectures, including AMD and ARM.)
The split kernel provides a (protected) virtualized view of the kernel for processes entering the kernel through exceptions, syscalls and interrupts. Though we make use of hardware features designed to support virtualization, we do not virtualize at the full virtual machine level (as KVM or VMware do, for example). Instead, our model is conceptually closer to the approach prototyped by the DUNE [2] project, which virtualizes much higher up, at the user space process level. DUNE uses the hardware virtualization features to support virtualization within the user space context of a Linux process, safely exposing privileged hardware features to user programs. We instead take a cut-line lower down in the OS stack and include the virtualization of the kernel space context of a process. This kernel virtualization allows us to introduce a level of intra-kernel protection into the Linux kernel.
Our initial prototype consists of a combination of fairly extensive modifications to the existing DUNE Linux kernel module (which itself derives from KVM) and a relatively small number of select modifications to the core Linux kernel code to support the virtualized kernel cut-line.
In terms of operation, a process can be switched into 'outer-kernel' mode which includes creating an EPT 'container' (lower level set of page tables) for it. After switching, the process resumes running in a non-root (NR) mode VMCS context even when in kernel context.
(In the remainder of this README we use root-mode or R-mode to describe a process which has full visibility of the page tables: upper and lower. NR-mode or non-root mode describes a process which only has visibility of the upper level page tables.)
With this model, the majority of kernel code can be run within the EPT 'container', offering an enhanced memory protection mechanism whilst maintaining a single shared kernel image. A small handler loop within the kernel for each process (thread) handles transitions from NR-mode to R-mode where necessary to support VMEXITS and provide a privileged operations interface.
Once a process is in NR-mode, the ability to make changes to kernel memory is controlled by permissions on both the upper and lower level page tables. Our security goal is to use the lower level page tables to prevent an NR-mode process from making malicious changes to the kernel. For example, as far as possible it should not be able to write to code or data pages while in NR-mode, or, if changes are made, they should be isolated to the NR-mode context.
If a process in NR-mode attempts to change the kernel memory in conflict with permissions in the lower-level page tables, a VMEXIT (in the current prototype which uses Intel VMX) is triggered. R-mode is then entered where the permission violation can be handled.
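The two-level check described above can be illustrated with a toy model. This is a sketch only and not okernel code: the function names has_perm and classify_access and the permission strings ("r", "rw", "rx") are hypothetical, and the "page-fault" outcome for a denial by the upper-level page tables is an assumption about ordinary x86 behaviour rather than something this README states.

```shell
#!/bin/sh
# Toy model of the upper/lower page-table permission check (not okernel code).

# has_perm PERMS NEEDED: succeed if the permission string contains NEEDED.
has_perm() {
    case "$1" in
        *"$2"*) return 0 ;;
        *) return 1 ;;
    esac
}

# classify_access UPPER LOWER NEEDED: print the outcome of an NR-mode access.
classify_access() {
    if ! has_perm "$1" "$3"; then
        # Denied by the upper-level page tables (assumed ordinary fault path).
        echo "page-fault"
    elif ! has_perm "$2" "$3"; then
        # Denied by the lower-level (EPT) tables: control passes to R-mode.
        echo "vmexit"
    else
        echo "allowed"
    fi
}
```

For instance, classify_access rw r w models a write to a page that is writable in the upper-level tables but read-only in the lower-level tables, which is the case that triggers a VMEXIT.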
Integration with LinuxKit
Custom Linux distributions utilizing the split kernel can be readily built
using LinuxKit by simply specifying an okernel Docker image in the kernel
section of the OS image YAML specification. See the sample YAML files provided
in examples.
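As a rough illustration of what the kernel section might look like, the sketch below prints one from a small shell function. The function name okernel_kernel_section, the default image name okernel:latest and the cmdline value are placeholders chosen for this example; the sample YAML files in examples remain the authoritative reference.

```shell
#!/bin/sh
# Sketch: emit the kernel section of a LinuxKit YAML file.
# The default image name is a placeholder, not a published image.
okernel_kernel_section() {
    image="${1:-okernel:latest}"
    cat <<EOF
kernel:
  image: ${image}
  cmdline: "console=tty0"
EOF
}
```

Calling okernel_kernel_section with an image name and redirecting the output to a file gives a starting point to which the usual init, onboot and services sections can then be added.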
Building the split kernel image for LinuxKit
- make will build and package the latest version of the split kernel, by pulling sources from the top-of-tree of the okernel project GitHub (https://github.com/linux-okernel/linux-okernel).
- Additionally, a specific version of the kernel can be built
by setting the 'KERNEL' environment variable to the appropriate
value, e.g.: make KERNEL=ok-4.11-rc2. The value MUST correspond to a legitimate okernel tag present in the project GitHub (https://github.com/linux-okernel/linux-okernel/tags) beginning with "ok-".
- make KERNEL=latest will build the top-of-tree release, equivalent to make.
- make kvmod or make KERNEL=NNNNNNNN kvmod, where "NNNNNNNN" is the release
string corresponding to a kernel version, will build the kernel
vulnerability emulation kernel module for that kernel, useful for testing.
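The rule on KERNEL values can be checked before invoking make. The following is a minimal sketch, assuming it is run from the directory containing the Makefile; is_valid_okernel_tag and build_okernel are hypothetical helper names, and the check only tests the "ok-" prefix, not whether the tag actually exists on GitHub.

```shell
#!/bin/sh
# Sketch only: wrapper helpers around the make invocations described above.

# Accept "latest" or any value that starts with "ok-" followed by at least
# one more character.
is_valid_okernel_tag() {
    case "$1" in
        latest|ok-?*) return 0 ;;
        *) return 1 ;;
    esac
}

# Build the requested kernel, defaulting to the top-of-tree release.
build_okernel() {
    tag="${1:-latest}"
    if ! is_valid_okernel_tag "$tag"; then
        echo "KERNEL must be 'latest' or an okernel tag beginning with 'ok-': $tag" >&2
        return 1
    fi
    make KERNEL="$tag"
}
```

With these defined, build_okernel ok-4.11-rc2 runs make KERNEL=ok-4.11-rc2, while build_okernel 4.11 is rejected before make is started.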
Limitations and Caveats
The current implementation does not have any protection of the kernel in place yet. It is a demonstration that you can create processes and run them in NR-mode using EPTs with a shared kernel. As a further demonstration of the concept, it implements protected memory pages, whereby a process may request a protected memory page which will not be mapped into the EPTs for other processes.
Roadmap
The next step, and the subject of our ongoing research, is to design the memory protection architecture for the kernel. Examples of the things that we are considering protecting from NR-mode processes are:
- Protection of the page tables (no NR-mode process can modify a page table)
- Protection of kernel executable code (RX only)
- Protection of kernel data structures (RO)
References
- [1] Nested Kernel: An Operating System Architecture for Intra-Kernel Privilege Separation, Nathan Dautenhahn, Theodoros Kasampalis, Will Dietz, John Criswell, Vikram Adve, ASPLOS '15, Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, March 2015.
- [2] Dune: Safe user-level access to privileged CPU features, Adam Belay, Andrea Bittau, Ali Mashtizadeh, David Terei, David Mazières, and Christos Kozyrakis, OSDI '12, Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation, October 2012.