Hi,
We are deploying stargz on a cluster hosting GitLab runners. The cluster runs user workloads, which means we have short-lived pods (from a couple of minutes to a couple of hours), and the images used are always different.
The images can be quite large, ranging from 70 GB to 120 GB compressed, with 20-50 layers.
The behaviour we are experiencing:
- When we run the first pod on a fresh node, stargz memory utilization grows to around 0.6 GB for one of the user images.
- Once the pod finishes running, memory consumption stays the same; I suspect this is because the mounts are preserved.
- When a new workload is scheduled on the same node, memory consumption grows again, usually to around 1-1.3 GB depending on the new image.
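For reference, this is roughly how I sampled the snapshotter's resident memory between workloads (a minimal sketch; `containerd-stargz-grpc` is the daemon name on our nodes, adjust if yours differs):

```shell
#!/bin/sh
# Print a process's resident set size (VmRSS, in KiB) from /proc.
rss_kib() {
  awk '/^VmRSS:/ {print $2}' "/proc/$1/status"
}

# The snapshotter daemon on our nodes; adjust the name if yours differs.
pid=$(pidof containerd-stargz-grpc 2>/dev/null | awk '{print $1}')
if [ -n "$pid" ]; then
  rss_kib "$pid"
else
  echo "containerd-stargz-grpc not running"
fi
```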
Some of the memory does seem to be released eventually, but when we leave the cluster running for a couple of days with many back-to-back workloads, stargz consumes more and more memory until it exhausts the node.
Is there a way to let all the user pods run without the node being exhausted after a while? I would like mounts to be preserved only while pods are running, not for pods that have already completed, because otherwise we lose the nodes. I tried using fuse-manager, but the behaviour is the same; the memory is just consumed by the fuse-manager process instead of stargz.
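To check whether mounts from completed pods are what is being kept around, I count the mounts whose target sits under the snapshotter's state directory (a sketch; `/var/lib/containerd-stargz-grpc` is the default root path here, adjust if you configured a different one):

```shell
#!/bin/sh
# Count mounts whose mount point lives under a given directory, via /proc/mounts.
count_mounts_under() {
  awk -v root="$1" '$2 ~ "^"root {n++} END {print n+0}' /proc/mounts
}

# Default state directory of the stargz snapshotter; adjust if configured differently.
count_mounts_under /var/lib/containerd-stargz-grpc
```

On our nodes this count keeps growing as workloads come and go, even after their pods complete.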
Restarting the stargz process doesn't release the memory either.
Can you advise?
Best regards,
Diana