Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support Edit Container Mount for kata/runtime-rs using direct volume in K8S/CSI scenario #162

Open
Apokleos opened this issue Sep 15, 2023 · 18 comments · May be fixed by #169
Open

Add support Edit Container Mount for kata/runtime-rs using direct volume in K8S/CSI scenario #162

Apokleos opened this issue Sep 15, 2023 · 18 comments · May be fixed by #169

Comments

@Apokleos
Copy link

Apokleos commented Sep 15, 2023

In the Kubernetes/CSI scenario, users need to use CSI to provide storage volumes to regular containers. After CSI completes the relevant operations, kubelet provides a HostPath and ContainerPath, but does not include a Mount Type. In other words, the Container Mount information that Containerd can retrieve through CRI does not include a type. When converting to the OCI Spec, the Mount type is set to "bind" by default.

Unfortunately,this solution does not work properly in the Kata Containers 3.0/runtime-rs with Direct Volume scenario.
When using a direct volume, Kata Pod needs to determine the type of DirectVolume based on Mount.Type after obtaining mount information. Then, it needs to choose the appropriate way to insert the backend device into the Guest, such as directvol, vfiovol, or spdkvol. Finally, the mount operation is completed in the Guest.

Kata3.0/runtime-rs using direct volume with containerd ctr, more details can be seen how-to-run-kata-containers-with-kinds-of-Block-Volumes
An example as below:

$ cat ./kubelet/kata-direct-vol-002/directvol002/mountInfo.json
{
  "device": "/tmp/stor/rawdisk01.20g", 
  "volume_type": "directvol", 
  "fs_type": "ext4", 
  "metadata":"{}", 
  "options": []
}
$ sudo ctr run -t --rm --runtime io.containerd.kata.v2 --mount type=directvol,src=/kubelet/kata-direct-vol-002/directvol002,dst=/disk002,options=rbind:rw "$image" kata-direct-vol-xx05302045 /bin/bash

However, there are certain differences between ordinary containers and Kata Containers when using CSI to access storage. The Mount Type in the OCI Spec provided by K8S/Containerd is by default the bind type. Ordinary containers will use the bind type by default, but Kata/DirectVolume requires a specific type, such as "directvol" (or other types like vfiovol or spdkvol). Therefore, in the K8S/CSI scenario, Kata Containers wants the Mount in its OCI Spec to contain a specific volume type instead of the bind type when creating a container that uses a Direct Volume. In other words, it is necessary to change the default bind type to the specific volume type at the right time, and then pass it to the Kata runtime-rs.

Although CDI can edit the OCI Spec of Container third-party devices in K8S/DevicePlugin using Container Annotation/CDIDevice information, it can also be extended to CSI to increase the editing ability of Container Mount.

In the K8S/CSI/Kata-runtime-rs scenario using Direct Volume, we hope that CDI can automatically edit a Mount in the OCI Spec according to the Direct Volume type, that is, to change the original "bind" to the "directvol" of the corresponding volume_type in the mountInfo.json instance.

According to the idea, I have already made the initial design and implementation of it, the scheme is as follows.

  • CSI
    (1) Do kata-ctl direct-volume add
    DoAddDirectVolume(mount.HostPath)
    (2) Do cdi config generation in /var/run/cdi/xxx.json
    DoGenerateCDIConfig(mount.HostPath, mount.ContainerPath)
  • containerd
    Set the special vend/class for direct volume to direct.volume/direct-volume.
    (1) Add the parameter CDIMounts []*runtime.Mount to the WithCDI function.
    (2) Use the base64-encoded string of Mount.HostPath as the deviceName to build a mount device that conforms to the CDI specification.
encodedMntpath := DoEncodeBase64(mount.HostPath)
mountDevice := "direct.volume/direct-volume=" + encodedMntpath

The cdi config for Mount as below:

# cat /var/run/cdi/kata-cdi-L3Zhci9saWIv.json
{
  "cdiVersion": "0.6.0",
  "kind": "direct.volume/direct-volume",
  "devices": [
    {
      "name": "L3Zhci9saWIva3ViZWxldC9wb2RzLzBhNmY5OGQ4LWM4YjgtNDgwOS05ZTZlLTEwMWJlNTA0MTM2MS92b2x1bWVzL2t1YmVybmV0ZXMuaW9+ZW1wdHktZGlyL2NkaS1kaXJlY3QteHZvbHg5MDAxMA==",
      "containerEdits": {
        "mounts": [
          {
            "hostPath": "/var/lib/kubelet/pods/0a6f98d8-c8b8-4809-9e6e-101be5041361/volumes/kubernetes.io~csi/cdi-direct-xvolx90010",
            "containerPath": "/disk01",
            "options": [
              "rbind",
              "rw"
            ],
            "type": "directvol"
          }
        ]
      }
    }
  ]
}

And the draft code as below:

// WithCDI updates OCI spec with CDI content
- func WithCDI(annotations map[string]string, CDIDevices []*runtime.CDIDevice) oci.SpecOpts {
+ func WithCDI(annotations map[string]string, CDIDevices []*runtime.CDIDevice, CDIMounts []*runtime.Mount) oci.SpecOpts {
	return func(ctx context.Context, client oci.Client, c *containers.Container, s *oci.Spec) error {
		seen := make(map[string]bool)
		// Add devices from CDIDevices CRI field
		var devices []string
		var err error
		...
		+for _, mount := range CDIMounts {
		+	// encoded mount path
		+	encodedMntpath := DoEncodeBase64(mount.HostPath)
		+	mntDevice := "direct.volume/direct-volume=" + encodedMntpath
		+	if seen[mntDevice] {
		+		log.G(ctx).Debugf("Skipping duplicated CDI device %s", mntDevice)
		+		continue
		+	}

		+	devices = append(devices, mntDevice)
		+	seen[mntDevice] = true
		+}
		...

		return oci.WithCDIDevices(devices...)(ctx, client, c, s)
	}
}
  • CDI

(1) To better support using the base64-URLencoded string of the Mount HostPath as the DeviceName, it is necessary to modify the CDI DeviceName naming check rules to support special symbol =.

(2) As the runtime.Mount object in containerd does not have any additional information to help accurately filter out which HostPath is associated with the direct volume, all Mount.Hostpath objects are allowed to be used to construct DeviceNames and passed to CDI. Then it uses DeviceNames to match cdi Cache database devices. However, the final result is that there is only one Mount that needs to be modified for the direct volume. Therefore, the CDI InjectDevice function needs to be modified to not return an error if there is an unresolved device name.

$git diff container-device-interface/pkg/cdi/cache.go
diff --git a/container-device-interface/pkg/cdi/cache.go b/container-device-interface/pkg/cdi/cache.go
index cb495ebb3..01af15deb 100644
--- a/container-device-interface/pkg/cdi/cache.go
+++ b/container-device-interface/pkg/cdi/cache.go
@@ -243,15 +243,14 @@ func (c *Cache) InjectDevices(ociSpec *oci.Spec, devices ...string) ([]string, e
        }
 
        if unresolved != nil {
-               return unresolved, fmt.Errorf("unresolvable CDI devices %s",
-                       strings.Join(devices, ", "))
+               fmt.Printf("unresolvable CDI devices %s", strings.Join(devices, ", "))
        }
 
        if err := edits.Apply(ociSpec); err != nil {
                return nil, fmt.Errorf("failed to inject devices: %w", err)
        }
 
-       return nil, nil
+       return unresolved, nil
 }
@Apokleos
Copy link
Author

@elezar @zvonkok @bart0sh

@zvonkok
Copy link
Contributor

zvonkok commented Sep 18, 2023

I need to read this carefully and check if this would also have some implications for the KEP: kubernetes/enhancements#4113

@Apokleos
Copy link
Author

@zvonkok Thx for your reply. I am looking forward to discussing this with you.

@elezar
Copy link
Contributor

elezar commented Sep 18, 2023

cc @pohly

@pohly
Copy link
Contributor

pohly commented Sep 18, 2023

There's also kubernetes/enhancements#2893 which didn't get merged.

@Apokleos
Copy link
Author

There's also kubernetes/enhancements#2893 which didn't get merged.

Thx @pohly
I would like to confirm the relationship between KEP-2893 and the issue I raised.
And I am concerned that the KEP-2893 does not seem to include support for kata direct-volume. Is my understanding correct?

@pohly
Copy link
Contributor

pohly commented Sep 20, 2023

I don't know (or remember).

@Apokleos
Copy link
Author

@elezar @pohly @zvonkok
I would appreciate your feedback on supporting the modification of kata direct volume types through CDI.
Do you have any suggestions for optimizing and improving this proposal?

@elezar
Copy link
Contributor

elezar commented Sep 22, 2023

For clarification @Apokleos. Is the use of the device name to encode data a requirement, or is this just how you are implementing this in our current setup?

One issue with the code implemented above is that an unresolvable device is not treated as an error. This breaks assumptions around such devices. How would you propose missing / incorrect device specifications be passed to the user if this is not considered an error?

@Apokleos
Copy link
Author

Thx @elezar for your feedback.

For clarification @Apokleos. Is the use of the device name to encode data a requirement, or is this just how you are implementing this in our current setup?

In this scenario, the Mount HostPath will be used as the device name. However, the HostPath may contain special characters, which are not allowed in the CDI naming specification. To comply with the CDI device name specification, the HostPath was encrypted using base64, even though this may result in special characters after encryption. This was done to maintain consistency with the encryption method used for volume paths in kata direct volumes.

One issue with the code implemented above is that an unresolvable device is not treated as an error. This breaks assumptions around such devices. How would you propose missing / incorrect device specifications be passed to the user if this is not considered an error?

Yes, this is indeed a problem. However, I have a different understanding of unresolved devices after deep diving the code. Do devices not found in the CDI cache belong to unresolved devices? (As the code of InjectDevices indicates that devices not found in the CDI cache belong to unresolved devices. )
IMO, devices not found in the CDI cache are filtered devices, while devices hit cache but failed to be applied are the unresolved devices.

I propose to split InjectDevices into two steps: filtering and injecting.

  • filtering
    The filtering step will be handled by a separate method called FilterDevices . In this step, if devices hit the CDI Cache, they will be the ones we want.
  • injecting
    The devices from filtering step will be passed to the injecting method InjectDevices. If a device fails to be applied, it will be considered as an unresolved device.

And the FilterDevices can be placed in RegistryResolver as below:

type RegistryResolver interface {
	InjectDevices(spec *oci.Spec, device ...string) (unresolved []string, err error)
+	FilterDevices(device ...string) (mached []string, err error)
}

@Apokleos
Copy link
Author

Apokleos commented Oct 4, 2023

Hi @elezar I'm interested in hearing your thoughts on the updated proposal.
I'd like to discuss it further with you and look forward to hearing from you.

Thx.

@Apokleos
Copy link
Author

Apokleos commented Oct 8, 2023

Hi @klihub, Could I hear your suggestions for this proposal? Thx

@klihub
Copy link
Contributor

klihub commented Oct 8, 2023

Hi @klihub, Could I hear your suggestions for this proposal? Thx

My biggest problem with this, in its current form, is that we end up treating unresolvable CDI devices during injection as something to be merely logged instead of flagged as a fatal error. I don't think that is something we'd find an acceptable compromise.

Looking at your (very nice and detailed) description, I gather that the basic problem is roughly

  • there is a CSI driver that tries to create/serve/handle storage volumes for Kata containers
  • naturally we can't inject storage into a Kata container by simply bind-mounting a directory
  • yet, with the current infrastructure (K8s, CRI, runtime) there is no way to convey enough information, so we end up with a simple bind mount

And this proposal (and I guess your earlier hack attempt to the NRI device injector sample plugin to modify mount options) tries to come up with a way to 'patch up' the incorrectly/insufficiently created mount for the container after the fact.

@Apokleos Is this summary correct ?

@Apokleos
Copy link
Author

Apokleos commented Oct 8, 2023

Thank you @klihub for your valuable time. I appreciate your understanding of my proposal.
I would like to discuss the unresolvable CDI devices with you further.

Hi @klihub, Could I hear your suggestions for this proposal? Thx

My biggest problem with this, in its current form, is that we end up treating unresolvable CDI devices during injection as something to be merely logged instead of flagged as a fatal error. I don't think that is something we'd find an acceptable compromise.

@elezar has the same concern about the unresolvable CDI devices in this proposal, and I updated the proposal
What do you think of the new updates?

Looking at your (very nice and detailed) description, I gather that the basic problem is roughly

  • there is a CSI driver that tries to create/serve/handle storage volumes for Kata containers
  • naturally we can't inject storage into a Kata container by simply bind-mounting a directory
  • yet, with the current infrastructure (K8s, CRI, runtime) there is no way to convey enough information, so we end up with a simple bind mount

And this proposal (and I guess your earlier hack attempt to the NRI device injector sample plugin to modify mount options)

Yes, you're right. I try to do "patch up" the Mount's type with NRI annotation when I do some tests for kata/direct-volume with NRI.
And

tries to come up with a way to 'patch up' the incorrectly/insufficiently created mount for the container after the fact.

@Apokleos Is this summary correct ?

Apokleos added a commit to Apokleos/container-device-interface that referenced this issue Oct 9, 2023
Add support Edit Container Mount for kata/runtime-rs using
direct volume in K8S/CSI scenario.

Fixes: cncf-tags#162

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Apokleos added a commit to Apokleos/container-device-interface that referenced this issue Oct 9, 2023
Add support Edit Container Mount for kata/runtime-rs using
direct volume in K8S/CSI scenario.

Fixes: cncf-tags#162

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Apokleos added a commit to Apokleos/containerd that referenced this issue Oct 9, 2023
…t volume

Try to use CDI to 'patch up' the Container Mount for kata/runtime-rs
using direct volume in K8S/CSI scenario.

related issue and PR as below:
CDI issue: cncf-tags/container-device-interface#162
CDI Mount PR: cncf-tags/container-device-interface#169

Fixes: containerd#9203

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
@Apokleos
Copy link
Author

Apokleos commented Oct 9, 2023

Thx @elezar @klihub I appreciate your time to discuss this proposal with me. I have revised it based on your concerns and submitted a PR to clarify it further.
To avoid violating the original design while supporting Mount, I added a dedicated method UpdateMounts to handle Mounts. This allows for unresolved behavior to be allowed in Mount scenarios without errors.

@zvonkok
Copy link
Contributor

zvonkok commented Oct 9, 2023

@Apokleos, @marquiz and I are working on this KEP: kubernetes/enhancements#4113, especially for the Kata use-case.
To summarize your issue, we need additional meta information passed down the CRI. Rather then "hacking" something into CDI I suggest we tackle the issue at the source if possible.
@Apokleos Can you please review the KEP and add a comment about your use-case and what is missing?

Apokleos added a commit to Apokleos/containerd that referenced this issue Oct 16, 2023
…t volume

Try to use CDI to 'patch up' the Container Mount for kata/runtime-rs
using direct volume in K8S/CSI scenario.

related issue and PR as below:
CDI issue: cncf-tags/container-device-interface#162
CDI Mount PR: cncf-tags/container-device-interface#169

Fixes: containerd#9203

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Apokleos added a commit to Apokleos/container-device-interface that referenced this issue Oct 16, 2023
Add support Edit Container Mount for kata/runtime-rs using
direct volume in K8S/CSI scenario.

The added vendor/class: "kata.direct.volume/direct-volume"

Fixes: cncf-tags#162

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Apokleos added a commit to Apokleos/containerd that referenced this issue Oct 17, 2023
…t volume

Try to use CDI to 'patch up' the Container Mount for kata/runtime-rs
using direct volume in K8S/CSI scenario.

related issue and PR as below:
CDI issue: cncf-tags/container-device-interface#162
CDI Mount PR: cncf-tags/container-device-interface#169

Fixes: containerd#9203

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Apokleos added a commit to Apokleos/containerd that referenced this issue Oct 18, 2023
…t volume

Try to use CDI to 'patch up' the Container Mount for kata/runtime-rs
using direct volume in K8S/CSI scenario.

related issue and PR as below:
CDI issue: cncf-tags/container-device-interface#162
CDI Mount PR: cncf-tags/container-device-interface#169

Fixes: containerd#9203

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Apokleos added a commit to Apokleos/container-device-interface that referenced this issue Oct 19, 2023
Add support Edit Container Mount for kata/runtime-rs using
direct volume in K8S/CSI scenario.

The added vendor/class: "kata.direct.volume/direct-volume"

Fixes: cncf-tags#162

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Apokleos added a commit to Apokleos/containerd that referenced this issue Oct 19, 2023
…t volume

Try to use CDI to 'patch up' the Container Mount for kata/runtime-rs
using direct volume in K8S/CSI scenario.

related issue and PR as below:
CDI issue: cncf-tags/container-device-interface#162
CDI Mount PR: cncf-tags/container-device-interface#169

Fixes: containerd#9203

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
@Apokleos
Copy link
Author

@Apokleos, @marquiz and I are working on this KEP: kubernetes/enhancements#4113, especially for the Kata use-case. To summarize your issue, we need additional meta information passed down the CRI. Rather then "hacking" something into CDI I suggest we tackle the issue at the source if possible. @Apokleos Can you please review the KEP and add a comment about your use-case and what is missing?

Hi @zvonkok Thx for your suggestion for the KEP-4113. I have added some comments about my concern.
IMO, the best way to fix this is to provide a way to pass down the Mount Type corresponding to the direct volume, used to build the oci Spec Mount, instead of the default specified type bind.

Copy link

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants