Crashes during CreateVolume: Pods failing with "unexpected encryption status" #3402
Comments
@Rakshith-R PTAL |
@irq0 can you confirm again that this same issue does not occur with just the canary image? And can you manually check the created image metadata? |
@irq0 yeah, please give it a try with the canary or 3.7.1 image |
Ran the steps again on the current devel branch (71e5b3f), with the csi-rbdplugin container limited to 256MiB of memory. With 42 pods starting, csi-rbdplugin is crashlooping without making much progress. Same result as above: pods failing with unexpected encryption status. For a random pod:
Image metadata:
|
This commit adds code to set up encryption on an rbdVol being repaired in a follow-up CreateVolume request. This fixes a bug wherein encryption metadata may not have been set in a previous request due to a container restart. Fixes: ceph#3402 Signed-off-by: Rakshith R <rar@redhat.com>
With ghcr.io/rakshith-r/cephcsi:enc-fix, I get a whole different error. Any ideas?
Image metadata:
Didn't check the image in depth, but it does seem to at least have the LUKS magic set:
|
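A magic check like the one mentioned above could be scripted roughly as follows. This is only a sketch under stated assumptions: the first bytes of the image are assumed to have been read into a buffer already (how they get exported from the RBD image is left out), `luks_version` is a name made up for illustration, and the sample buffers are synthetic stand-ins rather than data from this issue. The 6-byte magic and the big-endian 16-bit version field follow the LUKS on-disk header layout. A valid magic only shows that a LUKS header was written; it says nothing about the encryption state recorded in the image metadata.

```python
import struct

# 6-byte magic at offset 0 of a LUKS1 / LUKS2 primary header
LUKS_MAGIC = b"LUKS\xba\xbe"


def luks_version(header: bytes):
    """Return the LUKS version if the buffer starts with the LUKS magic, else None."""
    if len(header) < 8 or header[:6] != LUKS_MAGIC:
        return None
    # The version is a big-endian unsigned 16-bit integer right after the magic.
    (version,) = struct.unpack(">H", header[6:8])
    return version


# Synthetic buffers standing in for the first 512 bytes of an image (assumption)
sample_luks2 = LUKS_MAGIC + struct.pack(">H", 2) + b"\x00" * 504
sample_blank = b"\x00" * 512

print(luks_version(sample_luks2))
print(luks_version(sample_blank))
```

With the synthetic buffers this prints `2` for the LUKS-formatted sample and `None` for the blank one.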
@irq0 The encrypted PVC seems to be provisioned properly with everything in place. Is the issue occurring on all PVCs or only a few? |
With csi-rbdplugin having a 256M limit and crashing, 5/5 PVCs / pods created concurrently fail.
Logs
Provisioner
CreateVolume request:
Nodeplugin
Failing NodeStageVolume
|
@irq0 |
The last run had a 1GiB memory limit on the csi-rbdplugin nodeserver. I did not see any OOMs of that container. |
A 256MiB limit for the provisioner should be more than enough for this workload. We need to do profiling and check whether there is a memory leak in cephcsi when provisioning the volume. |
I tested secret-based encryption with 3.7.1 and I don't see any crash with the limits below
But when I tested with metadata-type encryption I can see the crash. Does this confirm we have a memory leak? |
Note, the current issue is not about the memory leak. The memory leak is an issue, but I believe it is out of scope for this ticket. It may be reported as a separate ticket if needed. The memory leak is the thing that triggers the crash. And I think it is a bug that if the crash happens in the middle of a "create encrypted volume" operation, after restart the volume is reported as properly prepared while it actually is not. |
Yes, agreed. These are two different issues, but the memory leak triggered the other one. It is good to track it as a separate issue. |
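The failure mode described in this comment can be illustrated with a toy model. This is a sketch, not ceph-csi code: the `images` dict stands in for the RBD pool, the state string `"prepared"` and every function name are hypothetical, and the `repair_encryption` switch only mimics the idea of finishing the encryption setup when a follow-up CreateVolume finds an existing image.

```python
# Toy model only; all names and state strings are hypothetical.
class Crash(Exception):
    """Stands in for the provisioner container being killed mid-request."""


images = {}  # image name -> metadata dict; stands in for the RBD pool


def create_volume(name, repair_encryption, crash_after_create=False):
    image = images.get(name)
    if image is None:
        images[name] = image = {}         # step 1: create the image
        if crash_after_create:
            raise Crash(name)             # killed before step 2 runs
        image["encryption"] = "prepared"  # step 2: record encryption state
    elif repair_encryption and "encryption" not in image:
        # Follow-up request found a half-created image: finish step 2.
        image["encryption"] = "prepared"
    return "ok"


def node_stage(name):
    state = images[name].get("encryption", "")
    if state != "prepared":
        raise RuntimeError(f"unexpected encryption status: {state!r}")
    return "staged"


def run(repair_encryption):
    images.clear()
    try:
        create_volume("img", repair_encryption, crash_after_create=True)
    except Crash:
        pass                                  # container restarts
    create_volume("img", repair_encryption)   # retried request reports success
    try:
        return node_stage("img")
    except RuntimeError as err:
        return str(err)


print(run(repair_encryption=False))
print(run(repair_encryption=True))
```

Without the repair step the retried request reports success but staging fails with an unexpected encryption status; with it, the second request completes the missing step and staging returns `staged`.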
@irq0 , |
Describe the bug
With RBD block encrypted volumes, interrupting CreateVolume is highly likely to leave broken volumes / RBD images.
Broken in the sense that encryption metadata is in an invalid state and Ceph CSI is unable to attach the volume.
Pods will get stuck in ContainerCreating as a result.
Environment details
Steps to reproduce
Tools and configs I used are in https://github.com/irq0/dev-ceph-csi-fscrypt-config
In my minikube setup I can reproduce with close to 100% probability.
Actual results
Pods in state ContainerCreating with warning:
Expected behavior
Successful volume creation, even when the provider crashes.
Logs
This follows image b60b8ccc-776e-4343-993f-1e2a3669595a along:
csi-rbdplugin-provisioner first CreateVolume request
csi-rbdplugin-provisioner crashes
Crashes / OOM Events
csi-rbdplugin-provisioner restart and second CreateVolume
rbdplugin NodeStageVolume unexpected encryption status