You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Our test that fails the most in CI is TestChunkedTransfer.test_preserve_chunked_on_retry. I'm opening this issue to figure out a fix.
This test ensures the chunked parameter is preserved even when the server replies with a Retry-After HTTP header, which causes a recursive urlopen call. This was introduced a few months ago in #1734. I actually wrote the test during a urllib3 grant. I then saw errors, and tried to make the test less flaky with https://github.com/urllib3/urllib3/pull/1743/files, but we still get failures on Windows.
Send a chunked request with one retry to an endpoint who will answer with a Retry-After header
Make sure that the Transfer-Encoding: chunked header was sent in both requests
We already have tests for Retry-After, but I didn't find a way to test this with Tornado, which is a black box. However, now that I'm writing this, I realize I can modify the retry_after endpoint to tell me if Transfer-Encoding: chunked was sent. (Getting clarity is exactly why I'm writing this up, by the way. Thanks for being my rubber duck.)
But still, my understanding is that we're trying to move away from Tornado to write more socket level tests: they're more reliable when written correctly. And a weird failure like that is an opportunity to learn more about HTTP and networking. So how is this test working?
We know we're going to get two requests, which is why we have for _ in range(2). We update self.chunked_requests with each correctly chunked request. In the failures I've seen, usually the first request works fine, but the second one fails with ConnectionAbortedError: [WinError 10053] An established connection. It's WSAECONNRESET, also known as "Connection reset by peer", which means the server closed the socket before the client read its contents.
(A race condition between those two requests can be safely excluded since we're unfortunately waiting one full second between the two requests, as urllib3 does not support fractional seconds in Retry-After values.)
The good news is that I was able to reproduce "Connection reset by peer" on my macOS computer by removing preload_content=False, which made little sense here anyway. I then googled "socket close http" since the error is due to the closed socket and found https://www.oreilly.com/library/view/http-the-definitive/1565925092/ch04s07.html which mentions Content-Length can be an issue. And then I was able to fix the error on macOS by adding Content-Length: 0!
When adding more debugging information, I realized that there's indeed a race where on Windows we read the response after the socket gets closed. And preload_content=False does not help, as the socket can be closed before the client even tries to read the status line. Indeed, other tests wait for the client close before closing in the server side using threading.Event synchronization. But that's not possible here as we want the client to do two requests, so I can't ask urllib3 to tell the test when the first request has been read.
And I found that I do need to close the socket to get the client to proceed. But no, there's a workaround, I can use Connection: close in my server response! So that works fine, and I can always close the two sockets after the tests.
But how is Tornado doing it? When is it closing connections for this to work? I was unable to figure out by reading the code, there are too many knobs and indirections to follow. I also tried reading the code for http.server in Python 3, and got lost too.
The text was updated successfully, but these errors were encountered:
Our test that fails the most in CI is
TestChunkedTransfer.test_preserve_chunked_on_retry
. I'm opening this issue to figure out a fix.This test ensures the
chunked
parameter is preserved even when the server replies with a Retry-After HTTP header, which causes a recursiveurlopen
call. This was introduced a few months ago in #1734. I actually wrote the test during a urllib3 grant. I then saw errors, and tried to make the test less flaky with https://github.com/urllib3/urllib3/pull/1743/files, but we still get failures on Windows.Examples:
So, what is this test doing? Two things:
Retry-After
headerTransfer-Encoding: chunked
header was sent in both requestsWe already have tests for Retry-After, but I didn't find a way to test this with Tornado, which is a black box. However, now that I'm writing this, I realize I can modify the
retry_after
endpoint to tell me ifTransfer-Encoding: chunked
was sent. (Getting clarity is exactly why I'm writing this up, by the way. Thanks for being my rubber duck.)But still, my understanding is that we're trying to move away from Tornado to write more socket level tests: they're more reliable when written correctly. And a weird failure like that is an opportunity to learn more about HTTP and networking. So how is this test working?
Here's the current version:
We know we're going to get two requests, which is why we have
for _ in range(2)
. We updateself.chunked_requests
with each correctly chunked request. In the failures I've seen, usually the first request works fine, but the second one fails withConnectionAbortedError: [WinError 10053] An established connection
. It's WSAECONNRESET, also known as "Connection reset by peer", which means the server closed the socket before the client read its contents.(A race condition between those two requests can be safely excluded since we're unfortunately waiting one full second between the two requests, as urllib3 does not support fractional seconds in Retry-After values.)
The good news is that I was able to reproduce "Connection reset by peer" on my macOS computer by removing
preload_content=False
, which made little sense here anyway. I then googled "socket close http" since the error is due to the closed socket and found https://www.oreilly.com/library/view/http-the-definitive/1565925092/ch04s07.html which mentions Content-Length can be an issue. And then I was able to fix the error on macOS by addingContent-Length: 0
!I tried this on CI, and ran that test 100 times using pytest-flakefinder. I had 0 issues on macOS + Linux but still got 10-20% of failures on Windows: see https://ci.appveyor.com/project/pquentin/urllib3/build/job/xe7084idp7rg7gue for an exemple.
When adding more debugging information, I realized that there's indeed a race where on Windows we read the response after the socket gets closed. And
preload_content=False
does not help, as the socket can be closed before the client even tries to read the status line. Indeed, other tests wait for the client close before closing in the server side usingthreading.Event
synchronization. But that's not possible here as we want the client to do two requests, so I can't ask urllib3 to tell the test when the first request has been read.And I found that I do need to close the socket to get the client to proceed. But no, there's a workaround, I can use
Connection: close
in my server response! So that works fine, and I can always close the two sockets after the tests.But how is Tornado doing it? When is it closing connections for this to work? I was unable to figure out by reading the code, there are too many knobs and indirections to follow. I also tried reading the code for http.server in Python 3, and got lost too.
The text was updated successfully, but these errors were encountered: