
Switch to StatefulSet #54

Open
nijel opened this issue Dec 4, 2020 · 18 comments
Labels
enhancement Adding or requesting a new feature.

Comments

@nijel
Member

nijel commented Dec 4, 2020

I saw the official Helm chart the other day and one thing stood out: it models Weblate as a stateless Deployment rather than a StatefulSet, which is better suited for stateful services. As far as I know, Weblate is currently a stateful service and can't be scaled horizontally. We started using Weblate on Kubernetes well before the official Helm chart was released, and we initially modelled it as a Deployment too, but upgrades were problematic: they kept failing when trying to re-attach the persistent disk to the newly spun-up container. We used the "Recreate" rollout strategy, but it would still fail; after we switched over to a StatefulSet, this issue has been gone ever since.

Anyway, the idea is: should we remodel Weblate as a StatefulSet? Is there any specific reason why we're using the Deployment object? I'm assuming that you've already considered it and that there are reasons I might not have thought of.

Originally posted by @mareksuscak in WeblateOrg/weblate#4806
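
For readers less familiar with the distinction, here is a minimal sketch of the StatefulSet approach described above; resource names and sizes are illustrative and not taken from this chart's templates. With `volumeClaimTemplates`, each pod gets its own PersistentVolumeClaim bound to a stable identity (e.g. `weblate-0`), so an upgrade replaces the pod in place instead of a new ReplicaSet pod racing the old one to attach the same disk, which is the failure mode described above with Deployment + Recreate.

```yaml
# Minimal sketch of a StatefulSet for Weblate (illustrative, not the chart's
# actual templates): each replica gets its own PVC via volumeClaimTemplates.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: weblate
spec:
  serviceName: weblate            # headless Service required by StatefulSets
  replicas: 1
  selector:
    matchLabels:
      app: weblate
  template:
    metadata:
      labels:
        app: weblate
    spec:
      containers:
        - name: weblate
          image: weblate/weblate
          volumeMounts:
            - name: weblate-data
              mountPath: /app/data
  volumeClaimTemplates:           # one PVC per pod, reused across upgrades
    - metadata:
        name: weblate-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```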

@nijel nijel added the enhancement Adding or requesting a new feature. label Dec 21, 2020
@Yann-J
Contributor

Yann-J commented Mar 15, 2021

Good point... I can see the main pod has a volume attached, so this indeed wouldn't scale if we increased the replicas unless the volume is mounted ReadWriteMany. So the right way would indeed be to use a StatefulSet, with each pod having its own volume.

However, when I look at what's persisted in this volume, it seems to be essentially the static result of the compilation step at startup... In that case, this volume doesn't really need to be persisted; it could very well be an emptyDir (just so it survives container crashes)?

I think the only files that really need persistence are the secret file and the ssh directory... but even then, they seem to be essentially read-only after an initial setup. That means they could be set up once with a pre-install Helm hook and then mounted as a single ReadOnlyMany volume by all pods of the same Deployment, as sketched below.

In general I think we should prefer Deployments over StatefulSets as much as possible, since every new volume has a cost...
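
To make the proposal concrete, a pod-spec fragment along these lines (volume names and mount paths are illustrative, not the chart's actual layout; whether Weblate's data directory really only holds rebuildable content is exactly the open question):

```yaml
# Sketch of the idea above: rebuildable assets in an emptyDir, write-once
# secrets in a shared read-only volume. Names and paths are illustrative.
containers:
  - name: weblate
    image: weblate/weblate
    volumeMounts:
      - name: static-cache         # compiled static assets, rebuilt at startup
        mountPath: /app/cache
      - name: shared-secrets       # e.g. secret key and ssh keys, written once
        mountPath: /app/secrets
        readOnly: true
volumes:
  - name: static-cache
    emptyDir: {}                   # survives container crashes, not pod deletion
  - name: shared-secrets
    persistentVolumeClaim:
      claimName: weblate-secrets   # a ReadOnlyMany claim shared by all pods
```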

@nijel
Member Author

nijel commented Mar 15, 2021

It really contains data that is supposed to persist: user-uploaded content (screenshots, fonts) and VCS data, see https://docs.weblate.org/en/latest/admin/config.html#data-dir for documentation. Some more insight into what is stored there is also available in WeblateOrg/weblate#2984 (comment).

@mareksuscak

Like @nijel pointed out above, the data directory does hold user-generated data, so a StatefulSet would be more than appropriate. However, I'm not sure whether Weblate can currently run multiple instances while keeping the user-generated data consistent. In other words, would it correctly replicate all screenshots? Would each instance synchronize all commits in a timely manner? I don't think we're quite there yet, but please correct me if I'm wrong, @nijel. That's the main reason why this transition is on hold, I'd say.

@nijel
Member Author

nijel commented Mar 15, 2021

Yes, the filesystem has to be kept in sync across Weblate instances.

@Yann-J
Contributor

Yann-J commented Mar 15, 2021

Ah yes, of course... indeed, if replication is expected and the application doesn't manage it, switching to a StatefulSet will not be enough.
I would suggest simply mentioning that scaling (setting more than 1 replica) is only possible with ReadWriteMany volumes. I see the accessMode and the storage class are already configurable in values.yaml. I'm not sure that running more than 1 replica is a very common use case anyway, as one instance should already be able to sustain a fair workload...
RWX volumes tend to be more expensive, so we might want to limit them to the strict minimum of files that actually have to be shared across replicas. Auto-generated statics probably don't belong there (?).
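
As a concrete illustration of that suggestion (the key names below are assumptions based on common chart conventions, so check the chart's actual values.yaml for the exact keys):

```yaml
# Hypothetical values.yaml excerpt: more than one replica only works when the
# shared data volume is RWX-capable. Key names are illustrative.
replicaCount: 2

persistence:
  enabled: true
  accessMode: ReadWriteMany   # required when replicaCount > 1
  storageClass: efs-sc        # example: an RWX-capable class (EFS, NFS, CephFS, ...)
  size: 10Gi
```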

@bartusz01

Hi, we tested running multiple replicas but ran into an issue: a CSS file is not always found, depending on which container the traffic is directed to. I suppose it could be worked around with session affinity, but it seems to be caused by these CSS files being located in /app/cache/static/CACHE/css, which, unlike the /app/data directory, is not synced between the containers.

@nijel
Member Author

nijel commented Jun 29, 2023

You should run the same version in all replicas; otherwise things will break. /app/cache/static/CACHE/css is filled during container startup and does not need to be synced.

@bartusz01

What version do you mean? AFAIK all containers run the same version.

FYI, the CSS file names differ between containers. When I run ls /app/cache/static/CACHE/css, I get e.g. an output like output.82205c8x9f76.css output.79f6539f66c2.css, and the names are different in the two containers; restarting a container generates different names again. Meanwhile, in the browser I get a 404 on "/static/CACHE/css/output.79f6539f66c2.css" if traffic is directed to the other container.

@nijel
Member Author

nijel commented Jun 29, 2023

Hmm, I thought that django-compressor generates stable names. This should be fixed...

nijel added a commit to WeblateOrg/weblate that referenced this issue Jun 29, 2023
This makes it safe to deploy on multiple servers.

See WeblateOrg/helm#54
@nijel
Member Author

nijel commented Jun 29, 2023

This particular issue should be addressed by WeblateOrg/weblate@90fbea8.

@bartusz01

Thanks for the quick fix!
Just to confirm, are you sure it is safe to run multiple replicas (with an RWX PV)? Nothing bad can happen with concurrent writes or file locks, for instance?

@nijel
Member Author

nijel commented Jun 30, 2023

Yes, it's safe. All filesystem accesses are protected by locks held in Redis; no file locks are used for that.

nijel added a commit to WeblateOrg/weblate that referenced this issue Jun 30, 2023
This makes it safe to deploy on multiple servers.

See WeblateOrg/helm#54
@zisuu

zisuu commented Jul 6, 2023

Hi @nijel

Thanks for the fix.

Do you have an ETA for when this commit will be part of a new release? Currently we cannot run Weblate in HA on EKS because of this issue. With an active ChaosKube that randomly kills pods, this is a nightmare. 😅

nijel added a commit to WeblateOrg/docker that referenced this issue Jul 7, 2023
@nijel
Member Author

nijel commented Jul 7, 2023

I've backported the patch to the Docker image in WeblateOrg/docker@ec90869; it will be available later today in the bleeding and edge tags.
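
For anyone who wants to pick this up before a tagged release, overriding the image tag in the chart values should work; this is a sketch assuming the usual image override keys, which may be named differently in this chart:

```yaml
# Hypothetical values.yaml override to run the patched image ahead of a
# tagged release; key names follow common chart conventions and may differ.
image:
  repository: weblate/weblate
  tag: edge            # or "bleeding"
  pullPolicy: Always   # edge/bleeding are moving tags, so always re-pull
```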

@zisuu

zisuu commented Jul 27, 2023

Is there any chance that this patch will also be released in a build with a version tag?

@nijel
Member Author

nijel commented Jul 28, 2023

It was released in Weblate 4.18.2, so it's already there.

@zisuu

zisuu commented Jul 29, 2023

Sorry, I missed that. Awesome, thanks a lot!

@zisuu

zisuu commented Aug 9, 2023

FYI: we switched to the most recent Helm chart and Weblate version, and it seems to work now. We can now run multiple replicas without hitting this CSS bug anymore.
