Fix hashring for all possible failure scenarios #80

Closed - wants to merge 3 commits

Conversation

@spaparaju commented Dec 23, 2021

This PR adds a new flag, 'allow-only-ready-replicas', so that the hashring contains only Thanos Receive replicas that are in the 'Ready' status.

Under this flag, this PR includes fixes for:

  1. The hashring containing non-Ready replicas when a replica transitions from 'Ready' to a non-Ready status.
  2. The hashring containing non-Ready replicas when a scale-up does not run to completion (spec.Replicas != status.ReadyReplicas); see the sketch below.
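For illustration, the second fix amounts to comparing the StatefulSet's desired replica count with its ready replica count before regenerating the hashring. A minimal sketch of that check against the apps/v1 API - the helper name allReplicasReady and the surrounding controller wiring are assumptions, not code from this PR:

    import appsv1 "k8s.io/api/apps/v1"

    // allReplicasReady reports whether every desired replica of the
    // StatefulSet is ready, i.e. spec.Replicas == status.ReadyReplicas.
    func allReplicasReady(sts *appsv1.StatefulSet) bool {
        desired := int32(1) // Kubernetes defaults spec.Replicas to 1 when unset.
        if sts.Spec.Replicas != nil {
            desired = *sts.Spec.Replicas
        }
        return sts.Status.ReadyReplicas == desired
    }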

Signed-off-by: SriKrishna Paparaju paparaju@gmail.com


@bill3tt left a comment

Reading the code, it appears that we have always conflated a pod in the Running phase with a container that is Ready to serve requests - it would be good to clean that up in this PR.

Also - I'm wondering if there is a smarter way we can be informed of changes to the hashrings we are watching?

The newFilteredStatefulSetInformer function accepts a filtering function - perhaps we could parse out the pods that are suitable there?

edit: I think we could use ListOptions.FieldSelector to do this.
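For reference, a field selector such as status.phase=Running applies to Pod objects, so a sketch of this idea would go through client-go's filtered pod informer constructor; the function name newRunningPodInformer and the wiring around it are illustrative, not code from this repo:

    import (
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        corev1informers "k8s.io/client-go/informers/core/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/cache"
    )

    // newRunningPodInformer builds a pod informer whose list/watch calls are
    // filtered server-side to pods in the Running phase.
    func newRunningPodInformer(client kubernetes.Interface, namespace string) cache.SharedIndexInformer {
        tweak := func(options *metav1.ListOptions) {
            options.FieldSelector = "status.phase=Running"
        }
        return corev1informers.NewFilteredPodInformer(
            client,
            namespace,
            30*time.Second, // resync period, illustrative
            cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc},
            tweak,
        )
    }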

Resolved review comment on main.go (now outdated).
@spaparaju (author) commented Jan 5, 2022

> Reading the code, it appears that we have always conflated a pod in the Running phase with a container that is Ready to serve requests - it would be good to clean that up in this PR.
>
> Also - I'm wondering if there is a smarter way we can be informed of changes to the hashrings we are watching?
>
> The newFilteredStatefulSetInformer function accepts a filtering function - perhaps we could parse out the pods that are suitable there?
>
> edit: I think we could use ListOptions.FieldSelector to do this.

Pods get to Running status once the readiness probe is successful.
The current sync loop leverages status changes notified through Kubernetes informers.
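For context, the pattern referred to here - reacting to status changes through informer notifications rather than polling - typically looks like the following client-go sketch; the helper name registerSyncTrigger and the queue wiring are illustrative, not code from this controller:

    import (
        "k8s.io/client-go/tools/cache"
        "k8s.io/client-go/util/workqueue"
    )

    // registerSyncTrigger enqueues a sync key whenever the informer reports
    // an update, so the sync loop runs on status changes instead of a timer.
    func registerSyncTrigger(informer cache.SharedIndexInformer, queue workqueue.Interface) {
        informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
            UpdateFunc: func(oldObj, newObj interface{}) {
                // Any status change on a watched object re-triggers the sync.
                queue.Add("sync")
            },
        })
    }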

@bill3tt commented Jan 18, 2022

@spaparaju and I chatted about this PR last Thursday; there are a couple of things to do before merging:

  • Test out this change in a local cluster, verify that we see the behaviour we desire, and report back the results in this PR.
  • Either update the naming of the flag to running rather than ready, to better represent the work currently being done, or update the mechanism for waiting to poll for readiness of the pods.

@spaparaju (author)

> @spaparaju and I chatted about this PR last Thursday; there are a couple of things to do before merging:
>
>   • Test out this change in a local cluster, verify that we see the behaviour we desire, and report back the results in this PR.

Tested on Minikube. Here are the recordings of the testing: reproduce the hashring update problem and how this PR solves the update problem.

>   • Either update the naming of the flag to running rather than ready, to better represent the work currently being done, or update the mechanism for waiting to poll for readiness of the pods.

Renamed the flag to include 'running' instead of 'ready'.

@bill3tt commented Jan 18, 2022

Really nice demos there @spaparaju - what an effective way to do code review :) LGTM

@matej-g (contributor) left a comment

Looking good!

@matej-g commented Jan 19, 2022

The CI seems to be stuck, probably for some time already 😞 - we should take a look.

@bill3tt commented Feb 2, 2022

@bwplotka do you have perms to nuke this and start it again?

@matej-g commented Feb 2, 2022

Hm, I also don't seem to have permission to change settings, since I don't see the Settings tab for this repo. We might need to escalate to someone who is able to check whether there is any related CI setting.


@bwplotka (member) left a comment

Not sure about the change, but it does not look significant, LGTM.

}
time.Sleep(c.options.scaleTimeout) // Give all replicas some time before they receive hundreds of req/s
Inline review comment:

Why was this moved?

@bwplotka (member)

Unfortunately I don't have perms for this - also, it looks like there is no link to Drone CI, so the connection did not even start. Maybe @squat has some ideas?

If not, we should spend 30 minutes to move to GitHub Actions quickly. Otherwise we would merge this PR without any CI.

@squat closed this Feb 24, 2022
@squat reopened this Feb 24, 2022
@squat commented Feb 24, 2022

I think the hosted version of Drone CI hasn't been working so well ever since it was acquired by Harness, or it has been entirely decommissioned :/ You can't even find https://cloud.drone.io on Google anymore. I just disabled and re-enabled the project in Drone and now webhooks are working again, but CI still doesn't run. I think it's time to finally move this project onto GitHub Actions. And also rename master -> main. This repo needs a bit of TLC.

@squat closed this Feb 24, 2022
@squat reopened this Feb 24, 2022
@squat closed this Feb 24, 2022
@squat reopened this Feb 24, 2022
@michael-burt mentioned this pull request May 11, 2022
@michael-burt

I opened #89 to address the issue where non-Ready replicas were being populated in the hashring configuration. My approach is similar to the one taken in this PR, however I filter only for Ready replicas, since Running is not sufficient to ensure that the replica is able to accept remote write requests.
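For reference, the Ready check being described is usually read from the pod's Ready condition rather than its phase; a minimal sketch, where the helper name isPodReady is illustrative and not necessarily the code in #89:

    import corev1 "k8s.io/api/core/v1"

    // isPodReady reports whether the pod's Ready condition is True - the pod
    // passes its readiness probes and can accept traffic, which is stricter
    // than merely being in the Running phase.
    func isPodReady(pod *corev1.Pod) bool {
        for _, cond := range pod.Status.Conditions {
            if cond.Type == corev1.PodReady {
                return cond.Status == corev1.ConditionTrue
            }
        }
        return false
    }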

@matej-g commented Aug 22, 2022

Unless there are objections, I think we should close this in favor of #89; I believe the approach of changing the hashring only when scaling the StatefulSet is preferable here.

@matej-g closed this Aug 22, 2022