addResources is just a special case of updateResources. Combine the shared codes
so that we do not maintain the two pieces of identical code.
Signed-off-by: Clare Chen <clare.chenhui@huawei.com>
When create sandbox, we setup a sandbox of 2048M base memory, and
then hotplug memory that is needed for every new container. And
we change the unit of c.config.Resources.Mem from MiB to Byte in
order to prevent the 4095B < memory < 1MiB from being lost.
Depends-on:github.com/kata-containers/tests#813
Fixes#400
Signed-off-by: Clare Chen <clare.chenhui@huawei.com>
Signed-off-by: Zichang Lin <linzichang@huawei.com>
If the memory is reduced , its cgroup in the VM was updated properly. But the
runtime assumed that the memory was also removed from the VM.
Then when it is added more memory again, more is added (but not needed).
Fixes: #801
Signed-off-by: Jose Carlos Venegas Munoz <jose.carlos.venegas.munoz@intel.com>
This reverts commit 8a6d383715.
Don't remove all directories in the shared directory because
`docker cp` re-mounts all the mount points specified in the
config.json causing serious problems in the host.
fixes#777
Signed-off-by: Julio Montes <julio.montes@intel.com>
Get and store guest details after sandbox is completely created.
And get memory block size from sandbox state file when check
hotplug memory valid.
Signed-off-by: Clare Chen <clare.chenhui@huawei.com>
Signed-off-by: Zichang Lin <linzichang@huawei.com>
Add support for using update command to hotplug memory to vm.
Connect kata-runtime update interface with hypervisor memory hotplug
feature.
Fixes#625
Signed-off-by: Clare Chen <clare.chenhui@huawei.com>
RemoveContainer is called right after SignalProcess(SIGKILL), the container
process might be still running and container Destroy() will fail, thus it's better
to wait on this process exited before to issue RemoveContainer.
Fixes: #690
Signed-off-by: fupan <lifupan@gmail.com>
Fixes#635
When container rootfs is block based in devicemapper use case, we can re-use
sandbox device manager to manage rootfs block plug/unplug, we don't detailed
description of block in container state file, instead we only need a Block index
referencing sandbox device.
Remove `HotpluggedDrive` and `RootfsPCIAddr` from state file because it's not
necessary any more.
Signed-off-by: Wei Zhang <zhangwei555@huawei.com>
Fixes#635
Remove `Hotplugged bool` field from device and add two new fields
instead:
* `RefCount`: how many references to this device. One device can be
referenced(`NewDevice()`) many times by same/different container(s),
two devices are regarded identical if they have same hostPath
* `AttachCount`: how many times this device has been attached. A device
can only be hotplugged once to the qemu, every new Attach command will
add the AttachCount, and real `Detach` will be done only when
`AttachCount == 0`
Signed-off-by: Wei Zhang <zhangwei555@huawei.com>
Add additional `context.Context` parameters and `struct` fields to allow
trace spans to be created by the `virtcontainers` internal functions,
objects and sub-packages.
Note that not every function is traced; we can add more traces as
desired.
Fixes#566.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Fixes#50 .
Add new interface sandbox.AddDevice, then for Frakti use case, a device
can be attached to sandbox before container is created.
Signed-off-by: Wei Zhang <zhangwei555@huawei.com>
Fixes#50
This commit imports a big logic change:
* host device to be attached or appended now is sandbox level resources,
one device should bind to sandbox/hypervisor first, then container could
reference it via device's unique ID.
* attach or detach device should go through the device manager interface
instead of the device interface.
* allocate device ID in global device mapper to guarantee every device
has a uniq device ID and there won't be any ID collision.
With this change, there will some changes on data format on disk for sandbox
and container, these changes also make a breakage of backward compatibility.
New persist data format:
* every sandbox will get a new "devices.json" file under "/run/vc/sbs/<sid>/"
which saves detailed device information, this also conforms to the concept that
device should be sandbox level resource.
* every container uses a "devices.json" file but with new data format:
```
[
{
"ID": "b80d4736e70a471f",
"ContainerPath": "/dev/zero"
},
{
"ID": "6765a06e0aa0897d",
"ContainerPath": "/dev/null"
}
]
```
`ID` should reference to a device in a sandbox, `ContainerPath` indicates device
path inside a container.
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
Fixes#50
Previously the devices are created with device manager and laterly
attached to hypervisor with "device.Attach()", this could work, but
there's no way to remember the reference count for every device, which
means if we plug one device to hypervisor twice, it's truly inserted
twice, but actually we only need to insert once but use it in many
places.
Use device manager as a consolidated entrypoint of device management can
give us a way to handle many "references" to single device, because it
can save all devices and remember it's use count.
Signed-off-by: Wei Zhang <zhangwei555@huawei.com>
Ensure the entire codebase uses `"sandbox"` and `"container"` log
fields for the sandboxID and containerID respectively.
Simplify code where fields can be dropped.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
This PR got merged while it had some issues with some shim processes
being left behind after k8s testing. And because those issues were
real issues introduced by this PR (not some random failures), now
the master branch is broken and new pull requests cannot get the
CI passing. That's the reason why this commit revert the changes
introduced by this PR so that we can fix the master branch.
Fixes#529
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Fixes#50
This commit imports a big logic change:
* host device to be attached or appended now is sandbox level resources,
one device should bind to sandbox/hypervisor first, then container could
reference it via device's unique ID.
* attach or detach device should go through the device manager interface
instead of the device interface.
* allocate device ID in global device mapper to guarantee every device
has a uniq device ID and there won't be any ID collision.
With this change, there will some changes on data format on disk for sandbox
and container, these changes also make a breakage of backward compatibility.
New persist data format:
* every sandbox will get a new "devices.json" file under "/run/vc/sbs/<sid>/"
which saves detailed device information, this also conforms to the concept that
device should be sandbox level resource.
* every container uses a "devices.json" file but with new data format:
```
[
{
"ID": "b80d4736e70a471f",
"ContainerPath": "/dev/zero"
},
{
"ID": "6765a06e0aa0897d",
"ContainerPath": "/dev/null"
}
]
```
`ID` should reference to a device in a sandbox, `ContainerPath` indicates device
path inside a container.
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
Fixes#50
Previously the devices are created with device manager and laterly
attached to hypervisor with "device.Attach()", this could work, but
there's no way to remember the reference count for every device, which
means if we plug one device to hypervisor twice, it's truly inserted
twice, but actually we only need to insert once but use it in many
places.
Use device manager as a consolidated entrypoint of device management can
give us a way to handle many "references" to single device, because it
can save all devices and remember it's use count.
Signed-off-by: Wei Zhang <zhangwei555@huawei.com>
Before this patch shared dir will reamin when sandox
has already removed, espacilly for kata-agent mod.
Do clean up shared dirs after all mounts are umounted.
Fixes: #291
Signed-off-by: Haomin <caihaomin@huawei.com>
Pause and resume container functions allow us to just pause/resume a
specific container not all the sanbox, in that way different containers
can be paused or running in the same sanbox.
Signed-off-by: Julio Montes <julio.montes@intel.com>
k8s provides a configuration for sharing PID namespace
among containers. In case of crio and cri plugin, an infra
container is started first. All following containers are
supposed to share the pid namespace of this container.
In case a non-empty pid namespace path is provided for a container,
we check for the above condition while creating a container
and pass this out to the kata agent in the CreatContainer
request as SandboxPidNs flag. We clear out the PID namespaces
in the configuration passed to the kata agent.
Fixes#343
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Events cli display container events such as cpu,
memory, and IO usage statistics.
By now OOM notifications and intel RDT are not fully supproted.
Fixes: #186
Signed-off-by: Haomin <caihaomin@huawei.com>
Don't fail if a new container with a CPU constraint was added to
a POD and no more vCPUs are available, instead apply the constraint
and let kernel balance the resources.
Signed-off-by: Julio Montes <julio.montes@intel.com>
Update command is used to update container's resources at run time.
All constraints are applied inside the VM to each container cgroup.
By now only CPU constraints are fully supported, vCPU are hot added
or removed depending of the new constraint.
fixes#189
Signed-off-by: Julio Montes <julio.montes@intel.com>
* Move makeNameID() func to virtcontainers/utils file as it's a generic
function for making name and ID.
* Move bindDevicetoVFIO() and bindDevicetoHost() to vfio driver package.
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
Fixes#50
This is done for decoupling device management part from other parts.
It seperate device.go to several dirs and files:
```
virtcontainers/device
├── api
│ └── interface.go
├── config
│ └── config.go
├── drivers
│ ├── block.go
│ ├── generic.go
│ ├── utils.go
│ ├── vfio.go
│ ├── vhost_user_blk.go
│ ├── vhost_user.go
│ ├── vhost_user_net.go
│ └── vhost_user_scsi.go
└── manager
├── manager.go
└── utils.go
```
* `api` contains interface definition of device management, so upper level caller
should import and use the interface, and lower level should implement the interface.
it's bridge to device drivers and callers.
* `config` contains structed exported data.
* `drivers` contains specific device drivers including block, vfio and vhost user
devices.
* `manager` exposes an external management package with a `DeviceManager`.
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
Store the PCI address of rootfs in case the rootfs is block
based and passed using virtio-block.
This helps up get rid of prdicting the device name inside the
container for the block device. The agent will determine the device
node name using the PCI address.
Fixes#266
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Introduce a new field in Drive to store the PCI address if the drive is
attached using virtio-blk.
Assign PCI address in the format bridge-addr/device-addr.
Since we need to assign the address while hotplugging, pass Drive
by address.
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Currently we sometimes pass it as a pointer and other times not. As
a result, the view of sandbox across virtcontainers may not be the same
and it costs extra memory copy each time we pass it by value. Fix it
by ensuring sandbox is always passed by pointers.
Fixes: #262
Signed-off-by: Peng Tao <bergwolf@gmail.com>
Here is an interesting case I have been debugging. I was trying to
understand why a "kubeadm reset" was not working for kata-runtime
compared to runc. In this case, the only pod started with Kata is
the kube-dns pod. For some reasons, when this pod is stopped and
removed, its containers receive some signals, 2 of them being SIGTERM
signals, which seems the way to properly stop them, but the third
container receives a SIGCONT. Obviously, nothing happens in this
case, but apparently CRI-O considers this should be the end of the
container and after a few seconds, it kills the container process
(being the shim in Kata case). Because it is using a SIGKILL, the
signal does not get forwarded to the agent because the shim itself
is killed right away. After this happened, CRI-O calls into
"kata-runtime state", we detect the shim is not running anymore
and we try to stop the container. The code will eventually call
into agent.RemoveContainer(), but this will fail and return an
error because inside the agent, the container is still running.
The approach to solve this issue here is to send a SIGKILL signal
to the container after the shim has been waited for. This call does
not check for the error returned because most of the cases, regular
use cases, will end up returning an error because the shim itself
not being there actually represents the container inside the VM has
already terminated.
And in case the shim has been killed without the possibility to
forward the signal (like described in first paragraph), the SIGKILL
will work and will allow the following call to agent.stopContainer()
to proceed to the removal of the container inside the agent.
Fixes#274
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The k8s test creates a log file in /dev under
/dev/termination-log, which is not the right place to create
logs, but we need to handle this. With this commit, we handle
regular files under /dev by passing them as 9p shares. All other
special files including device files and directories
are not passed as 9p shares as these are specific to the host.
Any operations on these in the guest would fail anyways.
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
This reverts commit 08909b2213.
We should not be passing any bind-mounts from /dev, /sys and /proc.
Mounting these from the host inside the container does not make
sense as these files are relevant to the host OS.
Fixes#219
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
When imported, the vc files carried in the 'full style' apache
license text, but the standard for kata is to use SPDX style.
Update the relevant files to SPDX.
Fixes: #227
Signed-off-by: Graham whaley <graham.whaley@intel.com>
Communicate to the agent the number of vCPUs that were hot added,
allowing to the agent wait for the creation of all vCPUs.
fixes#90
Signed-off-by: Julio Montes <julio.montes@intel.com>
As agreed in [the kata containers API
design](https://github.com/kata-containers/documentation/blob/master/design/kata-api-design.md),
we need to rename pod notion to sandbox. The patch is a bit big but the
actual change is done through the script:
```
sed -i -e 's/pod/sandbox/g' -e 's/Pod/Sandbox/g' -e 's/POD/SB/g'
```
The only expections are `pod_sandbox` and `pod_container` annotations,
since we already pushed them to cri shims, we have to use them unchanged.
Fixes: #199
Signed-off-by: Peng Tao <bergwolf@gmail.com>
Check if a volume passed to the container with -v is a block device
file, and if so pass the block device by hotplugging it to the VM
instead of passing this as a 9pfs volume. This would give us
better performance.
Add block device associated with a volume to the list of
container devices, so that it is detached with all other devices
when the container is stopped with detachDevices()
Fixes#137
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
All bind mounts are now passed to the guest with 9p.
We need to exclude /dev/shm, as this is passed as a bind mount
in the spec. We handle /dev/shm in the guest by allocating
memory for it on the guest side. Passing /dev/shm as a 9p mount
was causing it to be mounted twice.
Fixes#190
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Factorize configuration and hardware support for hotplugging block
devices into a single function and use that.
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
The rollback does not work as expected because the error has to be
checked from the defer itself.
Fixes#178
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In case the container creation fails, we need a proper rollback
regarding the mounts and hotplugs previously performed.
This patch also rework the hotplugDrive() function in order to
prevent createContainer() function complexity to exceed 15.
Fixes#135
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
If a container is not running, but created/ready instead, this means
a container process exists and that we can actually exec another
process inside this container. The container does not have to be
in running state.
Fixes#120
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>