
Switch to StatefulSet #54

Open
nijel opened this issue Dec 4, 2020 · 18 comments
Labels
enhancement Adding or requesting a new feature.

Comments

@nijel
Member

nijel commented Dec 4, 2020

I saw the official Helm chart the other day and one thing stood out: it models Weblate as a stateless Deployment rather than a StatefulSet, which is better suited for stateful services. As far as I know, Weblate is currently a stateful service and can't be scaled horizontally. We started using Weblate on Kubernetes well before the official Helm chart was released, and we initially modelled it as a Deployment too, but upgrades were problematic: they kept failing when trying to re-attach the persistent disk to the newly spun-up container. We used the "Recreate" rollout strategy, but it would still fail; after we switched over to a StatefulSet, this issue has been gone ever since.

Anyway, the idea is: should we remodel Weblate as a StatefulSet? Is there any specific reason why we're using the Deployment object? I'm assuming that you've already considered it and that there are reasons I might not have thought of.

Originally posted by @mareksuscak in WeblateOrg/weblate#4806
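
For readers less familiar with the distinction, here is a minimal sketch of the StatefulSet approach described above; resource names and sizes are illustrative and not taken from this chart's templates. With `volumeClaimTemplates`, each pod gets its own PersistentVolumeClaim bound to a stable identity (e.g. `weblate-0`), so an upgrade replaces the pod in place instead of a new ReplicaSet pod racing the old one to attach the same disk, which is the failure mode described above with Deployment + Recreate.

```yaml
# Minimal sketch of a StatefulSet for Weblate (illustrative, not the chart's
# actual templates): each replica gets its own PVC via volumeClaimTemplates.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: weblate
spec:
  serviceName: weblate            # headless Service required by StatefulSets
  replicas: 1
  selector:
    matchLabels:
      app: weblate
  template:
    metadata:
      labels:
        app: weblate
    spec:
      containers:
        - name: weblate
          image: weblate/weblate
          volumeMounts:
            - name: weblate-data
              mountPath: /app/data
  volumeClaimTemplates:           # one PVC per pod, reused across upgrades
    - metadata:
        name: weblate-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```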

@nijel nijel added the enhancement Adding or requesting a new feature. label Dec 21, 2020
@Yann-J
Contributor

Yann-J commented Mar 15, 2021

Good point... I can see the main pod has a volume attached, so this indeed wouldn't scale if we increased the replicas unless the volume is mounted ReadWriteMany. So the right way would indeed be to use a StatefulSet, with each pod having its own volume.

However, when I look at what's persisted in this volume, it seems to be essentially the static result of the compilation step at startup... In that case, this volume doesn't really need to be persisted; it could very well be an emptyDir (just so it survives container crashes)?

I think the only files that really need persistence are the secret file and the ssh directory... but even then, they seem to be essentially read-only after an initial setup. That means they could be set up once with a pre-install Helm hook and then mounted as a single ReadOnlyMany volume by all pods of the same Deployment, as sketched below.

In general I think we should prefer Deployments over StatefulSets as much as possible, since every new volume has a cost...
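
To make the proposal concrete, a pod-spec fragment along these lines (volume names and mount paths are illustrative, not the chart's actual layout; whether Weblate's data directory really only holds rebuildable content is exactly the open question):

```yaml
# Sketch of the idea above: rebuildable assets in an emptyDir, write-once
# secrets in a shared read-only volume. Names and paths are illustrative.
containers:
  - name: weblate
    image: weblate/weblate
    volumeMounts:
      - name: static-cache         # compiled static assets, rebuilt at startup
        mountPath: /app/cache
      - name: shared-secrets       # e.g. secret key and ssh keys, written once
        mountPath: /app/secrets
        readOnly: true
volumes:
  - name: static-cache
    emptyDir: {}                   # survives container crashes, not pod deletion
  - name: shared-secrets
    persistentVolumeClaim:
      claimName: weblate-secrets   # a ReadOnlyMany claim shared by all pods
```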

@nijel
Member Author

nijel commented Mar 15, 2021

It really contains data that is supposed to persist: user-uploaded content (screenshots, fonts) and VCS data, see https://docs.weblate.org/en/latest/admin/config.html#data-dir for documentation. Some more insight into what is stored there is also available in WeblateOrg/weblate#2984 (comment).

@mareksuscak

Like @nijel pointed out above, the data directory does hold user-generated data, so a StatefulSet would be more than appropriate. However, I'm not sure whether Weblate can currently run multiple instances while keeping the user-generated data consistent. In other words, would it correctly replicate all screenshots? Would each instance synchronize all commits in a timely manner? I don't think we're quite there yet, but please correct me if I'm wrong, @nijel. That's the main reason why this transition is on hold, I'd say.

@nijel
Member Author

nijel commented Mar 15, 2021

Yes, the filesystem has to be kept in sync across Weblate instances.

@Yann-J
Contributor

Yann-J commented Mar 15, 2021

Ah yes, of course... indeed, if replication is expected and the application doesn't manage it, switching to a StatefulSet will not be enough.
I would suggest simply mentioning that scaling (setting more than 1 replica) is only possible with ReadWriteMany volumes. I see the accessMode and the storage class are already configurable in values.yaml. I'm not sure that running more than 1 replica is a very common use case anyway, as one instance should already be able to sustain a fair workload...
RWX volumes tend to be more expensive, so we might want to limit them to the strict minimum of files that actually have to be shared across replicas. Auto-generated statics probably don't belong there (?).
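
As a concrete illustration of that suggestion (the key names below are assumptions based on common chart conventions, so check the chart's actual values.yaml for the exact keys):

```yaml
# Hypothetical values.yaml excerpt: more than one replica only works when the
# shared data volume is RWX-capable. Key names are illustrative.
replicaCount: 2

persistence:
  enabled: true
  accessMode: ReadWriteMany   # required when replicaCount > 1
  storageClass: efs-sc        # example: an RWX-capable class (EFS, NFS, CephFS, ...)
  size: 10Gi
```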

@bartusz01

Hi, we tested running multiple replicas but ran into an issue: a CSS file is not always found, depending on which container the traffic is directed to. I suppose it could be worked around with session affinity, but it seems to be caused by these CSS files being located in /app/cache/static/CACHE/css, which, unlike the /app/data directory, is not synced between the containers.

@nijel
Member Author

nijel commented Jun 29, 2023

You should run the same version in all replicas; otherwise things will break. /app/cache/static/CACHE/css is filled during container startup and does not need to be synced.

@bartusz01

What version do you mean? AFAIK all containers run the same version.

FYI, the CSS file names differ between containers. When I run ls /app/cache/static/CACHE/css, I get e.g. an output like output.82205c8x9f76.css output.79f6539f66c2.css, and the names are different in the two containers; restarting a container generates different names again. Meanwhile, in the browser I get a 404 on "/static/CACHE/css/output.79f6539f66c2.css" if traffic is directed to the other container.

@nijel
Member Author

nijel commented Jun 29, 2023

Hmm, I thought that django-compressor generates stable names. This should be fixed...

nijel added a commit to WeblateOrg/weblate that referenced this issue Jun 29, 2023
This makes it safe to deploy on multiple servers.

See WeblateOrg/helm#54
@nijel
Member Author

nijel commented Jun 29, 2023

This particular issue should be addressed by WeblateOrg/weblate@90fbea8.

@bartusz01

Thanks for the quick fix!
Just to confirm, are you sure it is safe to run multiple replicas (with an RWX PV)? Nothing bad can happen with concurrent writes or file locks, for instance?

@nijel
Member Author

nijel commented Jun 30, 2023

Yes, it's safe. All filesystem accesses are protected by locks held in Redis; no file locks are used for that.

nijel added a commit to WeblateOrg/weblate that referenced this issue Jun 30, 2023
This makes it safe to deploy on multiple servers.

See WeblateOrg/helm#54
@zisuu

zisuu commented Jul 6, 2023

Hi @nijel

Thanks for the fix.

Do you have an ETA for when this commit will be part of a new release? Currently we cannot run Weblate in HA on EKS because of this issue. With an active ChaosKube that randomly kills pods, this is a nightmare. 😅

nijel added a commit to WeblateOrg/docker that referenced this issue Jul 7, 2023
@nijel
Member Author

nijel commented Jul 7, 2023

I've backported the patch to the Docker image in WeblateOrg/docker@ec90869; it will be available later today in the bleeding and edge tags.
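
For anyone who wants to pick this up before a tagged release, overriding the image tag in the chart values should work; this is a sketch assuming the usual image override keys, which may be named differently in this chart:

```yaml
# Hypothetical values.yaml override to run the patched image ahead of a
# tagged release; key names follow common chart conventions and may differ.
image:
  repository: weblate/weblate
  tag: edge            # or "bleeding"
  pullPolicy: Always   # edge/bleeding are moving tags, so always re-pull
```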

@zisuu

zisuu commented Jul 27, 2023

Is there any chance that this patch will also be released in a build with a version tag?

@nijel
Member Author

nijel commented Jul 28, 2023

It was released in Weblate 4.18.2, so it's already there.

@zisuu

zisuu commented Jul 29, 2023

Sorry, I missed that. Awesome, thanks a lot!

@zisuu

zisuu commented Aug 9, 2023

FYI: we switched to the most recent Helm chart and Weblate version, and it seems to work now. We can now run multiple replicas without hitting this CSS bug anymore.
