
Volumes are suddenly empty #151

Open
woehrl01 opened this issue Feb 22, 2024 · 12 comments
Labels
work-in-progress Someone is working on this already

Comments

@woehrl01

woehrl01 commented Feb 22, 2024

Hi,

In our test environment we noticed that the contents of volumes suddenly become empty. Recreating the pod makes the files reappear. This often happens after the pod has been running for multiple hours. We are using containerd with its default configuration. I assume this has something to do with the garbage collection on the underlying node. The volumes are mounted as ephemeral CSI volumes in readOnly mode, so accidental deletion of the files can be ruled out as a root cause.

Has anyone had a similar experience? Any ideas on how to troubleshoot this further?

@woehrl01
Author

I'm not yet deep into the internals of the individual technologies, but could it be that we are missing lease handling in the containerd case? Without a lease, wouldn't the resources be automatically garbage collected after 24 hours?

See: https://github.com/containerd/containerd/blob/main/docs/garbage-collection.md#L19-L93
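
For context, here is a minimal sketch of what lease handling with the containerd Go client could look like (illustrative only, not this driver's actual code; the socket path, namespace, and lease ID are assumptions). Resources created under a lease are protected from garbage collection until the lease is explicitly deleted:

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/leases"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	// Connect to containerd and work in the kubelet's namespace (assumed).
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")

	// Create a lease with no expiration; anything it references is kept
	// alive by the garbage collector until the lease is deleted.
	lm := client.LeasesService()
	lease, err := lm.Create(ctx, leases.WithID("csi-volume-example"))
	if err != nil {
		log.Fatal(err)
	}

	// Content and snapshots created with this context are attached to the lease,
	// e.g. client.Pull(leasedCtx, "docker.io/library/busybox:latest").
	leasedCtx := leases.WithLease(ctx, lease.ID)
	_ = leasedCtx

	// When the volume is unpublished, deleting the lease lets GC reclaim the resources:
	// _ = lm.Delete(ctx, lease)
}
```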

@mugdha-adhav
Collaborator

We heavily use this driver in our production environment and haven't noticed this issue before. Our environment also uses containerd version v1.7.2 on EKS 1.26 nodes.

@woehrl01
Author

@mugdha-adhav I see, maybe our usage pattern is different. We spin up and tear down pods very frequently; often the churn is less than 5 seconds.

My assumption is that there is a problem with the metadata update in that scenario, combined with sudden kills caused by a failing livenessProbe. That results in the snapshot being deleted even though it is still in use by the last pod, which leaves an empty directory.

On top of that, the kubelet garbage collection kicks in regularly on our nodes.

That's why I created the linked PR to try using leases instead of metadata. I'm still trying to find a reliably reproducible example.

@woehrl01
Author

woehrl01 commented Feb 28, 2024

@mugdha-adhav I have now reproduced this error multiple times with the latest version 1.1.0 under very high load (starting/stopping hundreds of pods with a mix of shared/non-shared volumes). I'm currently preparing a PR to make the driver handle high-load scenarios more gracefully. The PR is meant as a basis for discussing what you want to include in this project.

mugdha-adhav added the work-in-progress label on Feb 29, 2024
@woehrl01
Author

@mugdha-adhav When testing our service at large scale with bigger and more varied images, I can see containerd becoming exhausted, bringing all the pods on a node to a halt. While my changes fix the exhaustion of containerd when all images are already available, a modification is also needed on the pulling side (to cover the case of scheduling many pods on a fresh node). PR #137 looks quite promising for fixing the exhaustion of containerd, so I'm waiting for that to get merged first.

@mugdha-adhav
Collaborator

@woehrl01 we mostly use read-write volumes on our clusters, so this issue might be limited to read-only volumes.

I tried reproducing the issue by deleting the image and snapshot from containerd, but I could still see the files mounted in the pod. Hence I doubt that the issue is related to garbage collection in containerd.

@woehrl01
Author

@mugdha-adhav Yes, I think this only affects read-only volumes. During my migration to leases as a GC mechanism, I could verify that DeleteSnapshot was being called even though there were still leases attached to the snapshot.
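
To make that concrete, a guard like the following could be placed in front of snapshot removal. This is a sketch assuming the containerd Go client; `snapshotStillLeased` is a hypothetical helper, not part of this driver:

```go
package main

import (
	"context"
	"strings"

	"github.com/containerd/containerd"
)

// snapshotStillLeased reports whether any containerd lease still references
// the given snapshot key, so a DeleteSnapshot call could be skipped instead
// of emptying a volume that is still mounted by a pod.
func snapshotStillLeased(ctx context.Context, client *containerd.Client, snapshotKey string) (bool, error) {
	lm := client.LeasesService()
	all, err := lm.List(ctx)
	if err != nil {
		return false, err
	}
	for _, l := range all {
		resources, err := lm.ListResources(ctx, l)
		if err != nil {
			return false, err
		}
		for _, r := range resources {
			// Snapshot resources carry a type such as "snapshots/overlayfs".
			if strings.HasPrefix(r.Type, "snapshots/") && r.ID == snapshotKey {
				return true, nil
			}
		}
	}
	return false, nil
}
```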

@imuni4fun
Contributor

Another thought occurred to me: we should verify that mounting the pulled image as writable does indeed create a new writable layer (union-filesystem semantics, the way Dockerfile steps represent changes as add/modify/delete on top of the previous filesystem layer), so that two pods mounting in read/write mode cannot change each other's version of that image.

#137 (which should merge tomorrow, by the way) reduces parallel pulls to a single request, so the resulting on-disk representation comes from a single pull-and-unpack. I'm not sure whether the mounting activity (this CSI driver calling into containerd/CRI-O) starts from this unpacked content and creates that new writable layer to maintain immutability.
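
For reference, this is roughly how a per-volume writable layer is obtained through the containerd Go client: `Prepare` creates a new active snapshot on top of the image's unpacked layer chain, so writes land in that layer and never modify the shared read-only image layers. The image reference, snapshotter name, and snapshot key below are placeholders, and this is not necessarily how this driver implements it:

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
	"github.com/opencontainers/image-spec/identity"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")

	image, err := client.GetImage(ctx, "docker.io/library/busybox:latest")
	if err != nil {
		log.Fatal(err)
	}

	// The parent of the writable snapshot is the chain ID of the image's
	// unpacked rootfs (the top of the committed, read-only layer stack).
	diffIDs, err := image.RootFS(ctx)
	if err != nil {
		log.Fatal(err)
	}
	parent := identity.ChainID(diffIDs).String()

	// Prepare creates a NEW active (writable) snapshot keyed per volume/pod,
	// so two pods mounting the same image cannot see each other's writes.
	snapshotter := client.SnapshotService("overlayfs")
	mounts, err := snapshotter.Prepare(ctx, "volume-abc-writable", parent)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("mounts for the writable layer: %+v", mounts)
}
```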

@mugdha-adhav
Collaborator

we should verify that mounting the pulled image as writable does indeed create a new writable layer

I have already verified this manually by checking the snapshots created by containerd for every new writable volume.
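
For anyone who wants to repeat that check, here is a small sketch that walks the snapshotter via the containerd Go client and prints each snapshot (the socket path and snapshotter name are assumptions): every writable volume should show up as an active snapshot whose parent is the image's committed layer chain.

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
	"github.com/containerd/containerd/snapshots"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")

	// Walk every snapshot known to the overlayfs snapshotter and print its
	// name, kind (active vs. committed) and parent.
	snapshotter := client.SnapshotService("overlayfs")
	err = snapshotter.Walk(ctx, func(ctx context.Context, info snapshots.Info) error {
		log.Printf("name=%s kind=%v parent=%s", info.Name, info.Kind, info.Parent)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```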

@woehrl01
Author

@mugdha-adhav I could re-verify that. The problem happens when timeouts cause some operations to fail; during the rollback of those changes, the snapshots are deleted.

@mugdha-adhav
Collaborator

@woehrl01 are we good to close this issue based on the reasoning mentioned in this comment?

@woehrl01
Author

@mugdha-adhav No, this issue can still happen. I haven't had time yet to push my changes upstream and create a PR to eventually fix it. It's still on my to-do list.
