
feat(k8s): add error handling tests in K8s #3736


Merged
merged 12 commits into master on Oct 28, 2021

Conversation

jacobowitz
Contributor

@jacobowitz jacobowitz commented Oct 21, 2021

This PR mostly adds tests around graceful termination of Runtimes and K8s Pods. It also fixes a bug in the ConnectionPool implementation related to removing connections.

In detail, this PR does the following:

  • Update the CI script to execute all tests in the tests/k8s/ folder (before, it was hardcoded to tests/k8s/test_k8s.py)
  • Fix a bug in the ConnectionPool happening on removal of connections
  • Fix the graceful GrpcDataRuntime test. It now fails because the desired behaviour does not actually work; before, the test was not checking the right thing. The test is deactivated for now to let CI pass. Eventually the behaviour needs to be fixed and the test re-activated
  • Added a test verifying that throughput in K8s scales linearly with the number of replicas. That test works.
  • Added a test in K8s demonstrating that scaling down pods does not work gracefully. Test is deactivated for now.
  • Added a test in K8s demonstrating that killing pods does not work gracefully. Test is deactivated for now.

What is not added here, but should be done in the future:

  • Add a test that kills containers in a K8s pod (mimicking OOM errors and the like) and verify that no messages are lost
  • Check that no messages are lost between client and gateway

Closes #3604
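The PR does not spell out the ConnectionPool bug, but a common pitfall in round-robin pools is that removing a connection leaves the rotation index pointing past the end of the shrunken pool. The sketch below is a hypothetical minimal illustration of that pattern (class and method names are assumptions, not jina's actual API):

```python
import threading


class ConnectionPool:
    """Hypothetical sketch: a round-robin pool keyed by target address."""

    def __init__(self):
        self._connections = {}  # address -> connection object
        self._rr_index = 0
        self._lock = threading.Lock()

    def add_connection(self, address, connection):
        with self._lock:
            self._connections[address] = connection

    def remove_connection(self, address):
        with self._lock:
            conn = self._connections.pop(address, None)
            # the fix: clamp the round-robin index so it never points
            # past the end of the (now shorter) connection list
            if self._connections:
                self._rr_index %= len(self._connections)
            else:
                self._rr_index = 0
            return conn

    def next_connection(self):
        with self._lock:
            if not self._connections:
                return None
            addresses = sorted(self._connections)
            address = addresses[self._rr_index % len(addresses)]
            self._rr_index = (self._rr_index + 1) % len(addresses)
            return self._connections[address]


pool = ConnectionPool()
pool.add_connection('executor-0', 'conn-0')
pool.add_connection('executor-1', 'conn-1')
pool.next_connection()                     # advances the rotation index
removed = pool.remove_connection('executor-0')
survivor = pool.next_connection()          # must not raise or return None
```

Without the clamping step in `remove_connection`, the stale index can select the wrong entry or overflow the list after a removal.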

@github-actions github-actions bot added size/XS area/cicd This issue/PR affects the cicd pipeline area/housekeeping This issue/PR is housekeeping labels Oct 21, 2021
github-actions bot commented Oct 21, 2021

Latency summary

Current PR yields:

  • 😶 index QPS at 1248, delta to last 2 avg.: +1%
  • 😶 query QPS at 59, delta to last 2 avg.: +2%
  • 😶 dam extend QPS at 44439, delta to last 2 avg.: -3%
  • 😶 avg flow time within 1.1865 seconds, delta to last 2 avg.: -12%
  • 😶 import jina within 0.4147 seconds, delta to last 2 avg.: -5%

Breakdown

Version Index QPS Query QPS DAM Extend QPS Avg Flow Time (s) Import Time (s)
current 1248 59 44439 1.1865 0.4147
2.1.12 1376 63 54873 1.1685 0.3981
2.1.11 1078 51 37291 1.5541 0.4776

Backed by latency-tracking. Further commits will update this comment.

codecov bot commented Oct 21, 2021

Codecov Report

Merging #3736 (8e39b56) into master (6d8db50) will increase coverage by 3.98%.
The diff coverage is 50.00%.


@@            Coverage Diff             @@
##           master    #3736      +/-   ##
==========================================
+ Coverage   86.03%   90.01%   +3.98%     
==========================================
  Files         156      156              
  Lines       11984    11989       +5     
==========================================
+ Hits        10310    10792     +482     
+ Misses       1674     1197     -477     
Flag Coverage Δ
daemon 44.74% <0.00%> (+21.57%) ⬆️
jina 88.43% <50.00%> (+2.44%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
jina/peapods/networking.py 86.76% <50.00%> (+27.97%) ⬆️
jina/peapods/peas/__init__.py 84.57% <0.00%> (-2.13%) ⬇️
jina/peapods/runtimes/zmq/zed.py 91.32% <0.00%> (ø)
jina/types/document/__init__.py 96.73% <0.00%> (+0.21%) ⬆️
jina/peapods/pods/__init__.py 85.26% <0.00%> (+0.24%) ⬆️
jina/helper.py 83.12% <0.00%> (+0.35%) ⬆️
jina/jaml/__init__.py 95.51% <0.00%> (+0.40%) ⬆️
jina/peapods/zmq/__init__.py 89.06% <0.00%> (+0.69%) ⬆️
jina/peapods/runtimes/jinad/__init__.py 83.33% <0.00%> (+0.87%) ⬆️
jina/peapods/stream/base.py 90.32% <0.00%> (+1.07%) ⬆️
... and 19 more

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6d8db50...8e39b56.

@github-actions github-actions bot added size/M area/core This issue/PR affects the core codebase area/network This issue/PR affects network functionality area/testing This issue/PR affects testing component/peapod component/resource labels Oct 22, 2021
@jacobowitz jacobowitz changed the title fix(k8s): run all k8s tests in ci feat(k8s): add error handling tests in K8s Oct 22, 2021
@jacobowitz jacobowitz marked this pull request as ready for review October 22, 2021 20:00
logger.debug(f'stop sending new requests after {i} requests')
# allow some requests to complete
await asyncio.sleep(10.0)
os.kill(os.getpid(), signal.SIGKILL)
Contributor
Why is this needed?

Contributor Author
Because client.post() will block forever waiting for responses that never come (those messages are lost, which is exactly the problem the test showcases).
Killing the request process is the simplest way to stop the test here.

Contributor Author
I will try to get rid of this though, as it seems to break the tests occasionally.

Contributor Author
I moved the kill to the parent process. That's still not the most elegant solution, but I think it should be safe now.

@@ -55,13 +56,16 @@ def start_runtime(args, handle_mock, cancel_event):
@pytest.mark.slow
@pytest.mark.timeout(10)
@pytest.mark.parametrize('close_method', ['TERMINATE', 'CANCEL'])
def test_grpc_data_runtime_graceful_shutdown(close_method):
@pytest.mark.asyncio
@pytest.mark.skip('Graceful shutdown is not working at the moment')
Contributor
Same here. Why is this not working now?

Contributor Author
This is the unit test that fails for the same reason as the K8s one, just without K8s. I am canceling and joining the runtime here, demonstrating that not all messages are received.
Before this change the test was not sufficient to catch this case; the improved version now fails as expected.

@jacobowitz jacobowitz requested a review from JoanFM October 26, 2021 08:50
@jacobowitz
Contributor Author

  • rebased against master
  • changed a test setting for the streaming client tests to make them less flaky
  • re-added the previous graceful gRPC test
  • made one of the new tests sync, as async is not needed there

@jacobowitz jacobowitz requested a review from JoanFM October 28, 2021 13:45
@JoanFM JoanFM merged commit a5853c0 into master Oct 28, 2021
@JoanFM JoanFM deleted the feat-k8s-testing branch October 28, 2021 14:38
Successfully merging this pull request may close these issues.

stress test K8s deployments