Fix hashring for all possible failure scenarios #80

Closed - wants to merge 3 commits

Conversation

@spaparaju commented Dec 23, 2021

This PR adds a new flag, 'allow-only-ready-replicas', so that the hashring contains only Thanos Receive replicas that are in the 'Ready' status.

Under this flag, this PR includes fixes for:

  1. The hashring containing non-Ready replicas when a replica transitions from 'Ready' to a non-Ready status.
  2. The hashring containing non-Ready replicas when a scale-up does not run to completion (spec.Replicas != status.ReadyReplicas); see the sketch below.
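For illustration, the second fix amounts to comparing the StatefulSet's desired replica count with its ready replica count before regenerating the hashring. A minimal sketch of that check against the apps/v1 API - the helper name allReplicasReady and the surrounding controller wiring are assumptions, not code from this PR:

    import appsv1 "k8s.io/api/apps/v1"

    // allReplicasReady reports whether every desired replica of the
    // StatefulSet is ready, i.e. spec.Replicas == status.ReadyReplicas.
    func allReplicasReady(sts *appsv1.StatefulSet) bool {
        desired := int32(1) // Kubernetes defaults spec.Replicas to 1 when unset.
        if sts.Spec.Replicas != nil {
            desired = *sts.Spec.Replicas
        }
        return sts.Status.ReadyReplicas == desired
    }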

Signed-off-by: SriKrishna Paparaju paparaju@gmail.com


@bill3tt left a comment

Reading the code, it appears that we have always conflated a pod in the Running phase with a container that is Ready to serve requests - it would be good to clean that up in this PR.

Also - I'm wondering if there is a smarter way we can be informed of changes to the hashrings we are watching?

The newFilteredStatefulSetInformer function accepts a filtering function - perhaps we could parse out the pods that are suitable there?

edit: I think we could use ListOptions.FieldSelector to do this.
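For reference, a field selector such as status.phase=Running applies to Pod objects, so a sketch of this idea would go through client-go's filtered pod informer constructor; the function name newRunningPodInformer and the wiring around it are illustrative, not code from this repo:

    import (
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        corev1informers "k8s.io/client-go/informers/core/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/cache"
    )

    // newRunningPodInformer builds a pod informer whose list/watch calls are
    // filtered server-side to pods in the Running phase.
    func newRunningPodInformer(client kubernetes.Interface, namespace string) cache.SharedIndexInformer {
        tweak := func(options *metav1.ListOptions) {
            options.FieldSelector = "status.phase=Running"
        }
        return corev1informers.NewFilteredPodInformer(
            client,
            namespace,
            30*time.Second, // resync period, illustrative
            cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc},
            tweak,
        )
    }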

Resolved review comment on main.go (now outdated).
@spaparaju (author) commented Jan 5, 2022

> Reading the code, it appears that we have always conflated a pod in the Running phase with a container that is Ready to serve requests - it would be good to clean that up in this PR.
>
> Also - I'm wondering if there is a smarter way we can be informed of changes to the hashrings we are watching?
>
> The newFilteredStatefulSetInformer function accepts a filtering function - perhaps we could parse out the pods that are suitable there?
>
> edit: I think we could use ListOptions.FieldSelector to do this.

Pods get to Running status once the readiness probe is successful.
The current sync loop leverages status changes notified through Kubernetes informers.
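For context, the pattern referred to here - reacting to status changes through informer notifications rather than polling - typically looks like the following client-go sketch; the helper name registerSyncTrigger and the queue wiring are illustrative, not code from this controller:

    import (
        "k8s.io/client-go/tools/cache"
        "k8s.io/client-go/util/workqueue"
    )

    // registerSyncTrigger enqueues a sync key whenever the informer reports
    // an update, so the sync loop runs on status changes instead of a timer.
    func registerSyncTrigger(informer cache.SharedIndexInformer, queue workqueue.Interface) {
        informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
            UpdateFunc: func(oldObj, newObj interface{}) {
                // Any status change on a watched object re-triggers the sync.
                queue.Add("sync")
            },
        })
    }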

@bill3tt commented Jan 18, 2022

@spaparaju and I chatted about this PR last Thursday; there are a couple of things to do before merging:

  • Test out this change in a local cluster, verify that we see the behaviour we desire, and report back the results in this PR.
  • Either update the naming of the flag to running rather than ready, to better represent the work currently being done, or update the mechanism for waiting to poll for readiness of the pods.

@spaparaju (author)

> @spaparaju and I chatted about this PR last Thursday; there are a couple of things to do before merging:
>
>   • Test out this change in a local cluster, verify that we see the behaviour we desire, and report back the results in this PR.

Tested on Minikube. Here are the recordings of the testing: reproduce the hashring update problem and how this PR solves the update problem.

>   • Either update the naming of the flag to running rather than ready, to better represent the work currently being done, or update the mechanism for waiting to poll for readiness of the pods.

Renamed the flag to include 'running' instead of 'ready'.

@bill3tt commented Jan 18, 2022

Really nice demos there @spaparaju - what an effective way to do code review :) LGTM

@matej-g (contributor) left a comment

Looking good!

@matej-g commented Jan 19, 2022

The CI seems to be stuck, probably for some time already 😞 - we should take a look.

@bill3tt commented Feb 2, 2022

@bwplotka do you have perms to nuke this and start it again?

@matej-g commented Feb 2, 2022

Hm, I also don't seem to have permission to change settings, since I don't see the Settings tab for this repo. We might need to escalate to someone who is able to check whether there is any related CI setting.


@bwplotka (member) left a comment

Not sure about the change, but it does not look significant, LGTM.

}
time.Sleep(c.options.scaleTimeout) // Give all replicas some time before they receive hundreds of req/s
Inline review comment:

Why was this moved?

@bwplotka (member)

Unfortunately I don't have perms for this - also, it looks like there is no link to Drone CI, so the connection did not even start. Maybe @squat has some ideas?

If not, we should spend 30 minutes to move to GitHub Actions quickly. Otherwise we would merge this PR without any CI.

@squat closed this Feb 24, 2022
@squat reopened this Feb 24, 2022
@squat commented Feb 24, 2022

I think the hosted version of Drone CI hasn't been working so well ever since it was acquired by Harness, or it has been entirely decommissioned :/ You can't even find https://cloud.drone.io on Google anymore. I just disabled and re-enabled the project in Drone and now webhooks are working again, but CI still doesn't run. I think it's time to finally move this project onto GitHub Actions. And also rename master -> main. This repo needs a bit of TLC.

@squat closed this Feb 24, 2022
@squat reopened this Feb 24, 2022
@squat closed this Feb 24, 2022
@squat reopened this Feb 24, 2022
@michael-burt mentioned this pull request May 11, 2022
@michael-burt

I opened #89 to address the issue where non-Ready replicas were being populated in the hashring configuration. My approach is similar to the one taken in this PR, however I filter only for Ready replicas, since Running is not sufficient to ensure that the replica is able to accept remote write requests.
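For reference, the Ready check being described is usually read from the pod's Ready condition rather than its phase; a minimal sketch, where the helper name isPodReady is illustrative and not necessarily the code in #89:

    import corev1 "k8s.io/api/core/v1"

    // isPodReady reports whether the pod's Ready condition is True - the pod
    // passes its readiness probes and can accept traffic, which is stricter
    // than merely being in the Running phase.
    func isPodReady(pod *corev1.Pod) bool {
        for _, cond := range pod.Status.Conditions {
            if cond.Type == corev1.PodReady {
                return cond.Status == corev1.ConditionTrue
            }
        }
        return false
    }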

@matej-g commented Aug 22, 2022

Unless there are objections, I think we should close this in favor of #89; I believe the approach of changing the hashring only when scaling the StatefulSet is preferable here.

@matej-g closed this Aug 22, 2022