Fix flaky test_preserve_chunked_on_retry_after test #1815

pquentin · 2020-03-13T14:31:51Z

Our test that fails the most in CI is TestChunkedTransfer.test_preserve_chunked_on_retry. I'm opening this issue to figure out a fix.

This test ensures the chunked parameter is preserved even when the server replies with a Retry-After HTTP header, which causes a recursive urlopen call. This was introduced a few months ago in #1734. I actually wrote the test during a urllib3 grant. I then saw errors, and tried to make the test less flaky with https://github.com/urllib3/urllib3/pull/1743/files, but we still get failures on Windows.

Examples:

So, what is this test doing? Two things:

Send a chunked request with one retry to an endpoint who will answer with a Retry-After header
Make sure that the Transfer-Encoding: chunked header was sent in both requests

We already have tests for Retry-After, but I didn't find a way to test this with Tornado, which is a black box. However, now that I'm writing this, I realize I can modify the retry_after endpoint to tell me if Transfer-Encoding: chunked was sent. (Getting clarity is exactly why I'm writing this up, by the way. Thanks for being my rubber duck.)

But still, my understanding is that we're trying to move away from Tornado to write more socket level tests: they're more reliable when written correctly. And a weird failure like that is an opportunity to learn more about HTTP and networking. So how is this test working?

Here's the current version:

def test_preserve_chunked_on_retry(self):
    self.chunked_requests = 0

    def socket_handler(listener):
        for _ in range(2):
            sock = listener.accept()[0]
            request = consume_socket(sock)
            if b"Transfer-Encoding: chunked" in request.split(b"\r\n"):
                self.chunked_requests += 1

            sock.send(
                b"HTTP/1.1 429 Too Many Requests\r\n"
                b"Content-Type: text/plain\r\n"
                b"Retry-After: 1\r\n"
                b"\r\n"
            )
            sock.close()

    self._start_server(socket_handler)
    with HTTPConnectionPool(self.host, self.port) as pool:
        retries = Retry(total=1)
        pool.urlopen(
            "GET", "/", chunked=True, preload_content=False, retries=retries
        )
    assert self.chunked_requests == 2

We know we're going to get two requests, which is why we have for _ in range(2). We update self.chunked_requests with each correctly chunked request. In the failures I've seen, usually the first request works fine, but the second one fails with ConnectionAbortedError: [WinError 10053] An established connection. It's WSAECONNRESET, also known as "Connection reset by peer", which means the server closed the socket before the client read its contents.

(A race condition between those two requests can be safely excluded since we're unfortunately waiting one full second between the two requests, as urllib3 does not support fractional seconds in Retry-After values.)

The good news is that I was able to reproduce "Connection reset by peer" on my macOS computer by removing preload_content=False, which made little sense here anyway. I then googled "socket close http" since the error is due to the closed socket and found https://www.oreilly.com/library/view/http-the-definitive/1565925092/ch04s07.html which mentions Content-Length can be an issue. And then I was able to fix the error on macOS by adding Content-Length: 0!

I tried this on CI, and ran that test 100 times using pytest-flakefinder. I had 0 issues on macOS + Linux but still got 10-20% of failures on Windows: see https://ci.appveyor.com/project/pquentin/urllib3/build/job/xe7084idp7rg7gue for an exemple.

When adding more debugging information, I realized that there's indeed a race where on Windows we read the response after the socket gets closed. And preload_content=False does not help, as the socket can be closed before the client even tries to read the status line. Indeed, other tests wait for the client close before closing in the server side using threading.Event synchronization. But that's not possible here as we want the client to do two requests, so I can't ask urllib3 to tell the test when the first request has been read.

And I found that I do need to close the socket to get the client to proceed. But no, there's a workaround, I can use Connection: close in my server response! So that works fine, and I can always close the two sockets after the tests.

But how is Tornado doing it? When is it closing connections for this to work? I was unable to figure out by reading the code, there are too many knobs and indirections to follow. I also tried reading the code for http.server in Python 3, and got lost too.

The text was updated successfully, but these errors were encountered:

pquentin mentioned this issue Mar 13, 2020

Fix test_preserve_chunked_on_retry_after flakiness #1816

Merged

pquentin closed this as completed in #1816 Mar 16, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix flaky test_preserve_chunked_on_retry_after test #1815

Fix flaky test_preserve_chunked_on_retry_after test #1815

pquentin commented Mar 13, 2020 •

edited

Fix flaky test_preserve_chunked_on_retry_after test #1815

Fix flaky test_preserve_chunked_on_retry_after test #1815

Comments

pquentin commented Mar 13, 2020 • edited

pquentin commented Mar 13, 2020 •

edited