
feat(k8s): add error handling tests in K8s #3736


Merged
merged 12 commits into master on Oct 28, 2021

Conversation

jacobowitz
Contributor

@jacobowitz jacobowitz commented Oct 21, 2021

This PR mostly adds tests around graceful termination of Runtimes and K8s Pods. It also fixes a bug in the ConnectionPool implementation related to removing connections.

In detail, this PR does the following:

  • Update the CI script to execute all tests in the tests/k8s/ folder (before, it was hardcoded to tests/k8s/test_k8s.py)
  • Fix a bug in the ConnectionPool happening on removal of connections
  • Fix the graceful GrpcDataRuntime test. It now fails because the desired behaviour does not actually work; before, the test was not checking the right thing. The test is deactivated for now to let CI pass. Eventually the behaviour needs to be fixed and the test re-activated
  • Added a test verifying that throughput in K8s scales linearly with the number of replicas. That test works.
  • Added a test in K8s demonstrating that scaling down pods does not work gracefully. Test is deactivated for now.
  • Added a test in K8s demonstrating that killing pods does not work gracefully. Test is deactivated for now.

What is not added here, but should be done in the future:

  • Add a test that kills containers in a K8s pod (mimicking OOM errors and the like) and verify that no messages are lost
  • Check that no messages are lost between client and gateway

Closes #3604
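The PR does not spell out the ConnectionPool bug, but a common pitfall in round-robin pools is that removing a connection leaves the rotation index pointing past the end of the shrunken pool. The sketch below is a hypothetical minimal illustration of that pattern (class and method names are assumptions, not jina's actual API):

```python
import threading


class ConnectionPool:
    """Hypothetical sketch: a round-robin pool keyed by target address."""

    def __init__(self):
        self._connections = {}  # address -> connection object
        self._rr_index = 0
        self._lock = threading.Lock()

    def add_connection(self, address, connection):
        with self._lock:
            self._connections[address] = connection

    def remove_connection(self, address):
        with self._lock:
            conn = self._connections.pop(address, None)
            # the fix: clamp the round-robin index so it never points
            # past the end of the (now shorter) connection list
            if self._connections:
                self._rr_index %= len(self._connections)
            else:
                self._rr_index = 0
            return conn

    def next_connection(self):
        with self._lock:
            if not self._connections:
                return None
            addresses = sorted(self._connections)
            address = addresses[self._rr_index % len(addresses)]
            self._rr_index = (self._rr_index + 1) % len(addresses)
            return self._connections[address]


pool = ConnectionPool()
pool.add_connection('executor-0', 'conn-0')
pool.add_connection('executor-1', 'conn-1')
pool.next_connection()                     # advances the rotation index
removed = pool.remove_connection('executor-0')
survivor = pool.next_connection()          # must not raise or return None
```

Without the clamping step in `remove_connection`, the stale index can select the wrong entry or overflow the list after a removal.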

@github-actions github-actions bot added size/XS area/cicd This issue/PR affects the cicd pipeline area/housekeeping This issue/PR is housekeeping labels Oct 21, 2021
github-actions bot commented Oct 21, 2021

Latency summary

Current PR yields:

  • 😶 index QPS at 1248, delta to last 2 avg.: +1%
  • 😶 query QPS at 59, delta to last 2 avg.: +2%
  • 😶 dam extend QPS at 44439, delta to last 2 avg.: -3%
  • 😶 avg flow time within 1.1865 seconds, delta to last 2 avg.: -12%
  • 😶 import jina within 0.4147 seconds, delta to last 2 avg.: -5%

Breakdown

Version Index QPS Query QPS DAM Extend QPS Avg Flow Time (s) Import Time (s)
current 1248 59 44439 1.1865 0.4147
2.1.12 1376 63 54873 1.1685 0.3981
2.1.11 1078 51 37291 1.5541 0.4776

Backed by latency-tracking. Further commits will update this comment.

codecov bot commented Oct 21, 2021

Codecov Report

Merging #3736 (8e39b56) into master (6d8db50) will increase coverage by 3.98%.
The diff coverage is 50.00%.


@@            Coverage Diff             @@
##           master    #3736      +/-   ##
==========================================
+ Coverage   86.03%   90.01%   +3.98%     
==========================================
  Files         156      156              
  Lines       11984    11989       +5     
==========================================
+ Hits        10310    10792     +482     
+ Misses       1674     1197     -477     
Flag Coverage Δ
daemon 44.74% <0.00%> (+21.57%) ⬆️
jina 88.43% <50.00%> (+2.44%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
jina/peapods/networking.py 86.76% <50.00%> (+27.97%) ⬆️
jina/peapods/peas/__init__.py 84.57% <0.00%> (-2.13%) ⬇️
jina/peapods/runtimes/zmq/zed.py 91.32% <0.00%> (ø)
jina/types/document/__init__.py 96.73% <0.00%> (+0.21%) ⬆️
jina/peapods/pods/__init__.py 85.26% <0.00%> (+0.24%) ⬆️
jina/helper.py 83.12% <0.00%> (+0.35%) ⬆️
jina/jaml/__init__.py 95.51% <0.00%> (+0.40%) ⬆️
jina/peapods/zmq/__init__.py 89.06% <0.00%> (+0.69%) ⬆️
jina/peapods/runtimes/jinad/__init__.py 83.33% <0.00%> (+0.87%) ⬆️
jina/peapods/stream/base.py 90.32% <0.00%> (+1.07%) ⬆️
... and 19 more

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6d8db50...8e39b56.

@github-actions github-actions bot added size/M area/core This issue/PR affects the core codebase area/network This issue/PR affects network functionality area/testing This issue/PR affects testing component/peapod component/resource labels Oct 22, 2021
@jacobowitz jacobowitz changed the title fix(k8s): run all k8s tests in ci feat(k8s): add error handling tests in K8s Oct 22, 2021
@jacobowitz jacobowitz marked this pull request as ready for review October 22, 2021 20:00
logger.debug(f'stop sending new requests after {i} requests')
# allow some requests to complete
await asyncio.sleep(10.0)
os.kill(os.getpid(), signal.SIGKILL)
Contributor
Why is this needed?

Contributor Author
Because client.post() will block forever waiting for responses that never come (those messages are lost, which is exactly the problem the test showcases).
Killing the request process is the simplest way to stop the test here.

Contributor Author
I will try to get rid of this though, as it seems to break the tests occasionally.

Contributor Author
I moved the kill to the parent process. That's still not the most elegant solution, but I think it should be safe now.

@@ -55,13 +56,16 @@ def start_runtime(args, handle_mock, cancel_event):
@pytest.mark.slow
@pytest.mark.timeout(10)
@pytest.mark.parametrize('close_method', ['TERMINATE', 'CANCEL'])
def test_grpc_data_runtime_graceful_shutdown(close_method):
@pytest.mark.asyncio
@pytest.mark.skip('Graceful shutdown is not working at the moment')
Contributor
Same here. Why is this not working now?

Contributor Author
This is the unit test that fails for the same reason as the K8s one, just without K8s. I am canceling and joining the runtime here, demonstrating that not all messages are received.
Before this change the test was not sufficient to catch this case; the improved version now fails as expected.

@jacobowitz jacobowitz requested a review from JoanFM October 26, 2021 08:50
@jacobowitz
Contributor Author

  • rebased against master
  • changed a test setting for the streaming client tests to make them less flaky
  • re-added the previous graceful gRPC test
  • made one of the new tests sync, as async is not needed there

@jacobowitz jacobowitz requested a review from JoanFM October 28, 2021 13:45
@JoanFM JoanFM merged commit a5853c0 into master Oct 28, 2021
@JoanFM JoanFM deleted the feat-k8s-testing branch October 28, 2021 14:38
Successfully merging this pull request may close these issues.

stress test K8s deployments