
Add feature to wait on ready replicas on scaling up #91

Merged

Conversation


@matej-g matej-g commented Aug 26, 2022

Based on work done in #89

This change adds a flag --allow-only-ready-replicas that changes the behavior of the controller on a scale-up: if enabled, the controller will first wait for all replicas to be ready before adding them to the hashring. The feature is documented as well.
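To illustrate the idea (this is a simplified stand-alone sketch, not the controller's actual code; `podStatus` and `readyEndpoints` are hypothetical names, and the real controller reads pod conditions via the Kubernetes API):

```go
package main

import "fmt"

// podStatus is a hypothetical, simplified stand-in for the pod state the
// controller would read from the Kubernetes API.
type podStatus struct {
	Name  string
	Ready bool // true once the pod's Ready condition is True
}

// readyEndpoints returns the pods that should be published to the hashring.
// With onlyReady enabled (modeling the --allow-only-ready-replicas behavior),
// pods that are not yet ready are withheld from the generated configuration.
func readyEndpoints(pods []podStatus, onlyReady bool) []string {
	var out []string
	for _, p := range pods {
		if onlyReady && !p.Ready {
			continue // wait: do not publish this replica yet
		}
		out = append(out, p.Name)
	}
	return out
}

func main() {
	pods := []podStatus{
		{Name: "receiver-0", Ready: true},
		{Name: "receiver-1", Ready: false},
	}
	fmt.Println(readyEndpoints(pods, true))  // only the ready replica
	fmt.Println(readyEndpoints(pods, false)) // all replicas, ready or not
}
```

With the flag disabled, scaled-up replicas are added to the hashring immediately, which is what can lead to writes being routed to endpoints that are not yet serving.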

Signed-off-by: Matej Gera <matejgera@gmail.com>

lud97x commented Sep 1, 2022

Hi,

I have built an image from this PR, and unfortunately the controller doesn't seem to behave as expected.
What I expected:
While scaling up my Receive StatefulSet, the controller updates the ConfigMaps only when replicas are ready.

My architecture looks like this:
thanos distributor -> n × thanos receive ingesters.
The hashring file is present only on the distributor and is managed by the thanos-receive-controller; my replication.factor is 3.

My controller config:

  containers:
  - args:
    - --configmap-name=thanos-receive-base
    - --configmap-generated-name=thanos-receive-generated
    - --file-name=hashrings.json
    - --namespace=$(NAMESPACE)
    - --allow-only-ready-replicas=true

I did a scale-up from 15 to 18 replicas, so 3 replicas in a single update, and my service went down with the errors below:

level=debug ts=2022-09-01T12:15:31.639283401Z caller=handler.go:555 component=receive component=receive-handler tenant=lgcy msg="failed to handle request" err="2 errors: replicate write request for endpoint thanos-test-receiver-16.thanos-test-receiver-headless.meta-monitoring-test:10901: quorum not reached: target not available; replicate write request for endpoint thanos-test-receiver-15.thanos-test-receiver-headless.meta-monitoring-test:10901: quorum not reached: target not available"

The ConfigMap generated by the controller was updated with all the pods, including the not-ready ones and also the not-yet-created ones.

Did I miss something in my test?
Thanks in advance.


lud97x commented Sep 1, 2022

My controller's ServiceAccount didn't have read permission on pod resources.
After I edited the controller's Role, it works fine:

  - apiGroups: 
    - ""
    resources: 
    - "pods"
    verbs:
    - "get"
    - "watch"
    - "list"


lud97x commented Sep 2, 2022

So I have tried again and this PR doesn't work.
I have built an image based on #89 and it works as expected.
I noticed that in your PR some parts are missing, for example https://github.com/michael-burt/thanos-receive-controller/blob/allow-only-ready-replicas/main.go#L595


matej-g commented Sep 8, 2022

So I have tried again and this PR doesn't work. I have built an image based on #89 and it works as expected. I noticed that in your PR some parts are missing, for example https://github.com/michael-burt/thanos-receive-controller/blob/allow-only-ready-replicas/main.go#L595

Hey @lud97x thanks for taking the time to try this out! So after you adjusted the service account permissions, did it work?

The reason for the difference is stated in the README update I added as part of this PR. I removed that part because reacting to every pod's (un)readiness could potentially lead to frequent hashring changes; see the explanation in the README in my PR.


@philipgough philipgough left a comment

lgtm


@douglascamata douglascamata left a comment


LGTM


4 participants