Alternator tests fails when the connection is closed before the reply is fully read #12143

nyh · 2022-11-30T09:24:26Z

@avikivity and @raphaelsc that the full Alternator test suite has been flaky recently, sporadically failing even on fast machines and release builds (so it's not the usual not-long-enough timeout issue).

@avikivity gave an example:

$ ninja build/release/scylla && ./test.py --mode release alternator/run --repeat 1000
[1/2] NINJA build/release/seastar/libseastar.a
ninja: Entering directory `build/release/seastar'
ninja: no work to do.
Found 1000 tests.
================================================================================
[N/TOTAL] SUITE MODE RESULT TEST
------------------------------------------------------------------------------
[111/1000] alternator release [ PASS ] run.114
[112/1000] alternator release [ FAIL ] run.113
[113/1000] alternator release [ PASS ] run.119
[114/1000] alternator release [ FAIL ] run.117
[118/1000] alternator release [ PASS ] run.121
[119/1000] alternator release [ FAIL ] run.111
[120/1000] alternator release [ FAIL ] run.115
[122/1000] alternator release [ PASS ] run.122
[123/1000] alternator release [ FAIL ] run.112
[130/1000] alternator release [ PASS ] run.134
[131/1000] alternator release [ FAIL ] run.140
[132/1000] alternator release [ FAIL ] run.137

The text was updated successfully, but these errors were encountered:

nyh · 2022-11-30T14:57:10Z

@avikivity @raphaelsc I just ran the test using Avi's command line ./test.py -v --mode dev alternator/run --repeat 1000, 1000 repeats in dev build-mode and 1000 additional times for release mode, on my laptop with 12 cores (so test.py ran 12 tests in parallel). But all 2000 test runs passed successfully, with not a single failure...

Can one of you for whom this error reproduces please attach here a snippet of the test output which shows what failed, or just send me the output of the failed test run? Thanks.

avikivity · 2022-12-01T11:57:43Z

run.34.log

nyh · 2022-12-01T14:22:01Z

Thanks @avikivity. The failure is in test_manual_requests.py::test_too_large_request_content_length. The test code has:

    if Version(urllib3.__version__) < Version('1.26'):
        pytest.skip("urllib3 before 1.26.0 threw broken pipe and did not read response and cause issue #8195. Fixed by pull request urllib3/urllib3#1524")
   ...
   response = requests.post(req.url, headers=req.headers, data=req.body, verify=False)
    # In issue #8195, Alternator closed the connection early, causing the
    # library to incorrectly throw an exception (Broken Pipe) instead noticing
    # the error code 413 which the server did send.
    assert response.status_code == 413

As the comment explains, old versions of urllib3 had a bug where it used to throw a Broken Pipe when the server failed to read the entire request, even if the server already sent a response. This was fixed in urllib3, and indeed this test is usually passing, but it appears that there is a race condition which can cause urllib3 to throw:

E           urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

And this fails the test.
This can be considered a bug in urllib3, but it seems that for now, I'll need to allow this test to return a ProtocolError (and reduce the regression-testing power of this test, unfortunately)

@avikivity, @raphaelsc reported that he'd been seeing sporadic failures of this suite for months, so because this specific test was only un-xfailed a week ago (757d2a4), there may be additional rare failures. If one of you has additional test/alternator failure logs you can send me (or just check if it's again test_too_large_request_content_length or something else), it will be great. Thanks.

nyh · 2022-12-01T14:55:55Z

After writing most of a bug report to urll3lib, I realized that this is NOT a urll3lib bug but a Seastar HTTP bug, which I anticipated in #8195 (comment) bug forgot about. Note that the "Connection Reset By Peer" message above. After the client receives a RST it cannot read the responses, even if it wants to. So it is the responsibility of the server to do what needs to be done to avoid sending this RST, and

I'll open an issue and send a patch to xfail/skip/change this test until we fix that, but so that I know whether this patch will Fixes or just Refs this issue, @avikivity / @raphaelsc please help me understand if there are additional test failures, or just this one.

avikivity · 2022-12-01T15:46:52Z

The title now refers to just one bug, so your fix will fix it.

In a recent commit 757d2a4, we removed the "xfail" mark from the test test_manual_requests.py::test_too_large_request_content_length because it started to pass on more modern versions of Python, with a urllib3 bug fixed. Unfortunately, the celebration was premature: It turns out that although the test now *usually* passes, it sometimes fails. This is caused by a Seastar bug scylladb/seastar#1325, which I opened scylladb#12166 to track in this project. So unfortunately we need to add the "xfail" mark back to this test. Note that although the test will now be marked "xfail", it will actually pass most of the time, so will appear as "xpass" to people run it. I put a note in the xfail reason string as a reminder why this is happening. Fixes scylladb#12143 Refs scylladb#12166 Refs scylladb/seastar#1325 Signed-off-by: Nadav Har'El <nyh@scylladb.com>

nyh · 2022-12-04T22:19:51Z

This is a test patch, fixing an earlier test patch that wasn't backported, so no need to backport this one either.

In a recent commit 757d2a4, we removed the "xfail" mark from the test test_manual_requests.py::test_too_large_request_content_length because it started to pass on more modern versions of Python, with a urllib3 bug fixed. Unfortunately, the celebration was premature: It turns out that although the test now *usually* passes, it sometimes fails. This is caused by a Seastar bug scylladb/seastar#1325, which I opened scylladb#12166 to track in this project. So unfortunately we need to add the "xfail" mark back to this test. Note that although the test will now be marked "xfail", it will actually pass most of the time, so will appear as "xpass" to people run it. I put a note in the xfail reason string as a reminder why this is happening. Fixes scylladb#12143 Refs scylladb#12166 Refs scylladb/seastar#1325 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb#12169

nyh added area/test Issues related to the testing system code and environment area/alternator Alternator related Issues labels Nov 30, 2022

nyh self-assigned this Nov 30, 2022

mykaul added the high label Dec 1, 2022

mykaul added this to the 5.2 milestone Dec 1, 2022

avikivity changed the title ~~Alternator tests are flaky~~ Alternator tests fails when the connection is closed before the reply is fully read Dec 1, 2022

This was referenced Dec 1, 2022

When HTTP server sends an error reply, don't close the connection before client TCP stack acks the response scylladb/seastar#1325

Open

Alternator's response to oversized request may not reach the client because of an RST #12166

Open

nyh mentioned this issue Dec 1, 2022

alternator-test: xfail a flaky test exposing a known bug #12169

Closed

scylladb-promoter closed this as completed in 6fcb530 Dec 2, 2022

scylladb-promoter added the Backport candidate label Dec 2, 2022

nyh removed the Backport candidate label Dec 4, 2022

nyh removed their assignment Dec 4, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alternator tests fails when the connection is closed before the reply is fully read #12143

Alternator tests fails when the connection is closed before the reply is fully read #12143

nyh commented Nov 30, 2022

nyh commented Nov 30, 2022

avikivity commented Dec 1, 2022

nyh commented Dec 1, 2022

nyh commented Dec 1, 2022 •

edited

avikivity commented Dec 1, 2022

nyh commented Dec 4, 2022

Alternator tests fails when the connection is closed before the reply is fully read #12143

Alternator tests fails when the connection is closed before the reply is fully read #12143

Comments

nyh commented Nov 30, 2022

nyh commented Nov 30, 2022

avikivity commented Dec 1, 2022

nyh commented Dec 1, 2022

nyh commented Dec 1, 2022 • edited

avikivity commented Dec 1, 2022

nyh commented Dec 4, 2022

nyh commented Dec 1, 2022 •

edited