Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternator tests fails when the connection is closed before the reply is fully read #12143

Closed
nyh opened this issue Nov 30, 2022 · 6 comments
Closed
Labels
area/alternator Alternator related Issues area/test Issues related to the testing system code and environment
Milestone

Comments

@nyh
Copy link
Contributor

nyh commented Nov 30, 2022

@avikivity and @raphaelsc that the full Alternator test suite has been flaky recently, sporadically failing even on fast machines and release builds (so it's not the usual not-long-enough timeout issue).

@avikivity gave an example:

$ ninja build/release/scylla && ./test.py --mode release alternator/run --repeat 1000
[1/2] NINJA build/release/seastar/libseastar.a
ninja: Entering directory `build/release/seastar'
ninja: no work to do.
Found 1000 tests.
================================================================================
[N/TOTAL] SUITE MODE RESULT TEST
------------------------------------------------------------------------------
[111/1000] alternator release [ PASS ] run.114
[112/1000] alternator release [ FAIL ] run.113
[113/1000] alternator release [ PASS ] run.119
[114/1000] alternator release [ FAIL ] run.117
[118/1000] alternator release [ PASS ] run.121
[119/1000] alternator release [ FAIL ] run.111
[120/1000] alternator release [ FAIL ] run.115
[122/1000] alternator release [ PASS ] run.122
[123/1000] alternator release [ FAIL ] run.112
[130/1000] alternator release [ PASS ] run.134
[131/1000] alternator release [ FAIL ] run.140
[132/1000] alternator release [ FAIL ] run.137
@nyh nyh added area/test Issues related to the testing system code and environment area/alternator Alternator related Issues labels Nov 30, 2022
@nyh nyh self-assigned this Nov 30, 2022
@nyh
Copy link
Contributor Author

nyh commented Nov 30, 2022

@avikivity @raphaelsc I just ran the test using Avi's command line ./test.py -v --mode dev alternator/run --repeat 1000, 1000 repeats in dev build-mode and 1000 additional times for release mode, on my laptop with 12 cores (so test.py ran 12 tests in parallel). But all 2000 test runs passed successfully, with not a single failure...

Can one of you for whom this error reproduces please attach here a snippet of the test output which shows what failed, or just send me the output of the failed test run? Thanks.

@mykaul mykaul added the high label Dec 1, 2022
@mykaul mykaul added this to the 5.2 milestone Dec 1, 2022
@avikivity
Copy link
Member

run.34.log

@nyh
Copy link
Contributor Author

nyh commented Dec 1, 2022

Thanks @avikivity. The failure is in test_manual_requests.py::test_too_large_request_content_length. The test code has:

    if Version(urllib3.__version__) < Version('1.26'):
        pytest.skip("urllib3 before 1.26.0 threw broken pipe and did not read response and cause issue #8195. Fixed by pull request urllib3/urllib3#1524")
   ...
   response = requests.post(req.url, headers=req.headers, data=req.body, verify=False)
    # In issue #8195, Alternator closed the connection early, causing the
    # library to incorrectly throw an exception (Broken Pipe) instead noticing
    # the error code 413 which the server did send.
    assert response.status_code == 413

As the comment explains, old versions of urllib3 had a bug where it used to throw a Broken Pipe when the server failed to read the entire request, even if the server already sent a response. This was fixed in urllib3, and indeed this test is usually passing, but it appears that there is a race condition which can cause urllib3 to throw:

E           urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

And this fails the test.
This can be considered a bug in urllib3, but it seems that for now, I'll need to allow this test to return a ProtocolError (and reduce the regression-testing power of this test, unfortunately)

@avikivity, @raphaelsc reported that he'd been seeing sporadic failures of this suite for months, so because this specific test was only un-xfailed a week ago (757d2a4), there may be additional rare failures. If one of you has additional test/alternator failure logs you can send me (or just check if it's again test_too_large_request_content_length or something else), it will be great. Thanks.

@nyh
Copy link
Contributor Author

nyh commented Dec 1, 2022

After writing most of a bug report to urll3lib, I realized that this is NOT a urll3lib bug but a Seastar HTTP bug, which I anticipated in #8195 (comment) bug forgot about. Note that the "Connection Reset By Peer" message above. After the client receives a RST it cannot read the responses, even if it wants to. So it is the responsibility of the server to do what needs to be done to avoid sending this RST, and

I'll open an issue and send a patch to xfail/skip/change this test until we fix that, but so that I know whether this patch will Fixes or just Refs this issue, @avikivity / @raphaelsc please help me understand if there are additional test failures, or just this one.

@avikivity avikivity changed the title Alternator tests are flaky Alternator tests fails when the connection is closed before the reply is fully read Dec 1, 2022
@avikivity
Copy link
Member

The title now refers to just one bug, so your fix will fix it.

nyh added a commit to nyh/scylla that referenced this issue Dec 1, 2022
In a recent commit 757d2a4, we removed the "xfail" mark from the test
test_manual_requests.py::test_too_large_request_content_length
because it started to pass on more modern versions of Python, with a
urllib3 bug fixed.

Unfortunately, the celebration was premature: It turns out that although
the test now *usually* passes, it sometimes fails. This is caused by
a Seastar bug scylladb/seastar#1325, which I opened scylladb#12166 to track
in this project. So unfortunately we need to add the "xfail" mark back
to this test.

Note that although the test will now be marked "xfail", it will actually
pass most of the time, so will appear as "xpass" to people run it.
I put a note in the xfail reason string as a reminder why this is
happening.

Fixes scylladb#12143
Refs scylladb#12166
Refs scylladb/seastar#1325

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
@nyh
Copy link
Contributor Author

nyh commented Dec 4, 2022

This is a test patch, fixing an earlier test patch that wasn't backported, so no need to backport this one either.

@nyh nyh removed their assignment Dec 4, 2022
MeepoYang pushed a commit to storage-scsg/scylladb that referenced this issue Apr 28, 2024
In a recent commit 757d2a4, we removed the "xfail" mark from the test
test_manual_requests.py::test_too_large_request_content_length
because it started to pass on more modern versions of Python, with a
urllib3 bug fixed.

Unfortunately, the celebration was premature: It turns out that although
the test now *usually* passes, it sometimes fails. This is caused by
a Seastar bug scylladb/seastar#1325, which I opened scylladb#12166 to track
in this project. So unfortunately we need to add the "xfail" mark back
to this test.

Note that although the test will now be marked "xfail", it will actually
pass most of the time, so will appear as "xpass" to people run it.
I put a note in the xfail reason string as a reminder why this is
happening.

Fixes scylladb#12143
Refs scylladb#12166
Refs scylladb/seastar#1325

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb#12169
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/alternator Alternator related Issues area/test Issues related to the testing system code and environment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants