High availability while rolling out pods doesn't seem to work #10141

Open
janario opened this issue May 12, 2024 · 7 comments
Labels
bug Something isn't working

Comments


janario commented May 12, 2024

Describe the bug

We noticed that while we roll out new changes to our collector, the applications report warnings/errors about refused connections.

Our collector is managed by the Operator, and we already noticed that it doesn't set a readinessProbe; we will share a fix for that in open-telemetry/opentelemetry-operator#2943

But even with the readiness probe in place, some requests get dropped when we simulate a rollout while the collector is receiving many requests.

Steps to reproduce

Our scenario to reproduce it:

  • collector running with at least 2 replicas, with readiness and liveness probes in place ✅

We use siege to make lots of requests to the collector while we roll it out:

kubectl -n monitoring run --rm -it siege --image=yokogawa/siege --command -- bash 

root@siege:/# siege -d1 -c 500 -t1m http://otel-default-collector.monitoring:4318
# we know / returns 404, but we just want to validate that the server is responding

While siege is making requests, we start a rollout in parallel:

kubectl -n monitoring rollout restart statefulsets/otel-default-collector

The pods get replaced, but after siege finishes we see that some requests were dropped:

** SIEGE 3.0.5
** Preparing 500 concurrent users for battle.
The server is now under siege...[error] socket: -1278044416 connection refused.: Connection refused
[error] socket: -1152166144 connection refused.: Connection refused
[error] socket: 2041366272 connection refused.: Connection refused
[error] socket: 1050703616 connection refused.: Connection refused
[error] socket: -795511040 connection refused.: Connection refused
[error] socket: 190535424 connection refused.: Connection refused
[error] socket: -1550780672 connection refused.: Connection refused
[error] socket: -1152166144 connection refused.: Connection refused
# lots of
[error] socket: 2047239936 unknown network error.: No route to host
[error] socket: -1634699520 unknown network error.: No route to host
[error] socket: 1407358720 unknown network error.: No route to host
[error] socket: -1183635712 unknown network error.: No route to host
[error] socket: -2012334336 unknown network error.: No route to host
[error] socket: 1632261888 unknown network error.: No route to host
[error] socket: 1197561600 unknown network error.: No route to host
[error] socket: 1359525632 unknown network error.: No route to host
[error] socket: 1470297856 unknown network error.: No route to host
# and more lots of
Lifting the server siege...      done.

Transactions:		       19330 hits
Availability:		       96.58 %
Elapsed time:		       59.17 secs
Data transferred:	        0.35 MB
Response time:		        0.01 secs
Transaction rate:	      326.69 trans/sec
Throughput:		        0.01 MB/sec
Concurrency:		        4.66
Successful transactions:           0
Failed transactions:	         685
Longest transaction:	        0.31
Shortest transaction:	        0.00

To work around that, we added a preStop lifecycle hook:

        lifecycle:
          preStop:
            exec:
              command: # this works only with custom collector image, since default is scratch
              - sleep
              - "10"
#            sleep: # not available yet :-/ https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/3960-pod-lifecycle-sleep-action/README.md#alternatives
#              seconds: 10

With a preStop hook that simply sleeps for 10s, the results are much better, with 100% availability. 🕺 🙌

** SIEGE 3.0.5
** Preparing 500 concurrent users for battle.
The server is now under siege...
Lifting the server siege...      done.

Transactions:		       57838 hits
Availability:		      100.00 %
Elapsed time:		       59.67 secs
Data transferred:	        1.05 MB
Response time:		        0.01 secs
Transaction rate:	      969.30 trans/sec
Throughput:		        0.02 MB/sec
Concurrency:		       12.39
Successful transactions:           0
Failed transactions:	           0
Longest transaction:	        0.19
Shortest transaction:	        0.00

But I don't want to use a custom image just to have the sleep command, and I wonder if something could be done in otel-collector to make it work as expected.

It seems that something goes wrong during the graceful shutdown and causes some requests to go unanswered.

What did you expect to see?

While rolling out Pods with replicas>=2, no request should be lost.

What did you see instead?

Requests are lost when rolling out new collector pods, and it is not possible to work around this with the sleep command on the default image.

What version did you use?

0.98.0

What config did you use?

Environment

EKS 1.26 and kind 1.29

Additional context

janario added the bug label May 12, 2024
Author

janario commented May 12, 2024

Don't get too attached to the 96.58 % availability in the failing case.

Over a 1m run with 500 users, the scenario without errors reaches 57838 successful hits.

During the rollout, the total is only 19330 hits, meaning the client spends more time handling errors and ends up making far fewer requests overall.
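
To make the comparison concrete, here is a minimal sketch of the arithmetic. It assumes siege derives availability as hits / (hits + failed), which is consistent with the numbers reported above; the function names `availability` and `relative_throughput` are only illustrative.

```shell
# Availability as siege appears to report it (assumption): hits / (hits + failed) * 100.
availability() {
  awk -v hits="$1" -v failed="$2" 'BEGIN { printf "%.2f\n", 100 * hits / (hits + failed) }'
}

# Completed requests of one run as a percentage of a baseline run.
relative_throughput() {
  awk -v degraded="$1" -v baseline="$2" 'BEGIN { printf "%.2f\n", 100 * degraded / baseline }'
}

availability 19330 685           # hits and failures from the run without preStop sleep
relative_throughput 19330 57838  # run without sleep compared to the run with sleep
```

The second number is the more telling one: the run without the sleep completed only about a third of the requests that the clean run did, which the availability percentage alone hides.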

Author

janario commented May 13, 2024

I created some more reproducible scenarios:

https://github.com/open-telemetry/opentelemetry-operator/compare/main...janario:opentelemetry-operator:test/ha-collector?expand=1#diff-aec75932cc6d805a7b7412c6fa3579be2fa444f7d48e0f6ec105517e09317c77R26

(I used the operator just because it was easier to integrate the tests.)
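
The linked test is not reproduced here. Judging only from the log output in this comment, it restarts a Deployment named ha-collector several times while siege is running. A rough sketch of such a restart loop follows; the function name `restart_repeatedly` and the `KUBECTL` override are hypothetical and not taken from the linked branch.

```shell
# Command used to talk to the cluster; override with KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Restart the given Deployment a number of times, waiting for each rollout to finish.
restart_repeatedly() {
  deployment="$1"
  count="$2"
  i=0
  while [ "$i" -lt "$count" ]; do
    "$KUBECTL" rollout restart "deployment/$deployment"
    "$KUBECTL" rollout status "deployment/$deployment"
    i=$((i + 1))
  done
}
```

Calling `restart_repeatedly ha-collector 3` in one terminal while siege runs in another should produce a sequence of "restarted" and "successfully rolled out" messages similar to the ones shown in the logs.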

Logs in the good scenario with custom image and sleep: ✅

Unpacking siege (4.0.7-1+b1) ...
Setting up siege (4.0.7-1+b1) ...
Starting siege
New configuration template added to //.siege
Run siege -C to view the current settings in that file
deployment.apps/ha-collector restarted
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 2 old replicas are pending termination...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
deployment "ha-collector" successfully rolled out
deployment.apps/ha-collector restarted
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
deployment "ha-collector" successfully rolled out
deployment.apps/ha-collector restarted
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
deployment "ha-collector" successfully rolled out

{	"transactions":			        3665,
	"availability":			      100.00,
	"elapsed_time":			       59.91,
	"data_transferred":		        0.07,
	"response_time":		        0.00,
	"transaction_rate":		       61.18,
	"throughput":			        0.00,
	"concurrency":			        0.12,
	"successful_transactions":	           0,
	"failed_transactions":		           0,
	"longest_transaction":		        0.06,
	"shortest_transaction":		        0.00
} 

Logs when using the default image without sleep:

Setting up siege (4.0.7-1+b1) ...
Starting siege
New configuration template added to //.siege
Run siege -C to view the current settings in that file
deployment.apps/ha-collector restarted
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
deployment "ha-collector" successfully rolled out
deployment.apps/ha-collector restarted
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
deployment "ha-collector" successfully rolled out
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
deployment.apps/ha-collector restarted
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
[error] socket: unable to connect sock.c:282: Connection refused
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
deployment "ha-collector" successfully rolled out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
deployment.apps/ha-collector restarted
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
deployment "ha-collector" successfully rolled out
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
deployment.apps/ha-collector restarted
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
deployment "ha-collector" successfully rolled out
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused

{	"transactions":			        1785,
	"availability":			       94.34,
	"elapsed_time":			       59.13,
	"data_transferred":		        0.03,
	"response_time":		        0.00,
	"transaction_rate":		       30.19,
	"throughput":			        0.00,
	"concurrency":			        0.03,
	"successful_transactions":	           0,
	"failed_transactions":		         107,
	"longest_transaction":		        0.04,
	"shortest_transaction":		        0.00
} 
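
To turn a summary like the ones above into a pass/fail check, something along these lines could work. It is only a sketch: it assumes the JSON summary has been captured as text in the layout shown above (one key per line), and the function names are illustrative.

```shell
# Read a siege JSON summary on stdin and print the failed_transactions count.
failed_transactions() {
  awk -F: '/"failed_transactions"/ { gsub(/[^0-9]/, "", $2); print $2 }'
}

# Succeed only when the summary on stdin reports zero failed transactions.
assert_no_failures() {
  failed="$(failed_transactions)"
  [ "$failed" = "0" ]
}
```

For example, with the summary saved to a hypothetical `summary.json`, `assert_no_failures < summary.json` exits non-zero for the run without sleep (107 failures) and zero for the run with sleep.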

In both cases logs from the collector are fine:

2024-05-13T14:35:57.696Z	info	service@v0.99.0/service.go:192	Everything is ready. Begin running and processing data.
2024-05-13T14:35:57.696Z	warn	localhostgate/featuregate.go:63	The default endpoints for all servers in components will change to use localhost instead of 0.0.0.0 in a future version. Use the feature gate to preview the new default.	{"feature gate ID": "component.UseLocalHostAsDefaultHost"}
2024-05-13T14:36:10.198Z	info	otelcol@v0.99.0/collector.go:281	Received signal from OS	{"signal": "terminated"}
2024-05-13T14:36:10.198Z	info	service@v0.99.0/service.go:229	Starting shutdown...
2024-05-13T14:36:10.199Z	info	extensions/extensions.go:59	Stopping extensions...
2024-05-13T14:36:10.199Z	info	service@v0.99.0/service.go:243	Shutdown complete.

@TylerHelmuth
Member

It is likely that this is happening because our healthcheck extension needs to be improved: open-telemetry/opentelemetry-collector-contrib#26661

Author

janario commented May 13, 2024

@TylerHelmuth should I try with v2, or do you mean that even v2 still needs improvements?

@TylerHelmuth
Member

I mean the current version of extension/healthcheck has some issues that result in readiness and liveness not being 100% perfect. There is ongoing work to fix the issues, but it is slow going. open-telemetry/opentelemetry-collector-contrib#30673 is a new implementation (and you might be able to check out that branch and use the code in a custom build of the collector).

Author

janario commented May 13, 2024

Got it, thanks for the details.

Later I can try testing with it and add the results here.

So for now the only workaround would be a custom image with sleep? :-/

Author

janario commented May 14, 2024

I gave healthcheckv2 a try

open-telemetry/opentelemetry-operator@ca88a7e

But no luck :-/

I know it is still in progress, but I'm not sure whether this is an issue in the healthcheck at all 🤔
