cypress:server should really retry requests on network failures #24716

Open
ckoncz-hwx opened this issue Nov 17, 2022 · 9 comments
Labels
E2E (Issue related to end-to-end testing), Triaged (Issue has been routed to backlog. This is not a commitment to have it prioritized by the team.)


@ckoncz-hwx

Current behavior

I am performing a Cypress run that issues around 30,000 requests to a CherryPy-powered application.
(I am counting the cypress:server:request "successful response received" messages in the Cypress debug output.)

Out of these requests, about 1-10 fail with ECONNRESET errors. Here is one:

2022-11-16T22:30:06.042Z cypress:server:request received an error making http request { timeout: undefined, retryIntervals: [], url: 'http://viz:8100/arc/apps/snapshot/268?v=2022.11.16.21.37.35', requestId: 'request19259', retryOnNetworkFailure: true, retryOnStatusCodeFailure: false, delaysRemaining: [], err: Error: socket hang up at connResetException (node:internal/errors:692:14) at Socket.socketOnEnd (node:_http_client:478:23) at Socket.emit (node:events:539:35) at Socket.emit (node:domain:475:12) at endReadableNT (node:internal/streams/readable:1345:12) at processTicksAndRejections (node:internal/process/task_queues:83:21) { code: 'ECONNRESET' } }

(The same error, formatted manually:)

{ 
    timeout: undefined, 
    retryIntervals: [], 
    url: 'http://viz:8100/arc/apps/snapshot/268?v=2022.11.16.21.37.35', 
    requestId: 'request19259', 
    retryOnNetworkFailure: true,
    retryOnStatusCodeFailure: false,
    delaysRemaining: [],
    err: 
        Error: socket hang up 
            at connResetException (node:internal/errors:692:14) 
            at Socket.socketOnEnd (node:_http_client:478:23) 
            at Socket.emit (node:events:539:35) 
            at Socket.emit (node:domain:475:12) 
            at endReadableNT (node:internal/streams/readable:1345:12) 
            at processTicksAndRejections (node:internal/process/task_queues:83:21) 
        { code: 'ECONNRESET' } 
}

Despite retryOnNetworkFailure: true in the message above, I see no retry attempts and the browser request is aborted. This results in random test failures (status null for API requests and a window.load event timeout for resources included in the HTML page).

I was able to track down the above reset event, and it seems to be caused by unfortunate timing:
the CherryPy server closes the idle connection after a 10-second timeout, but Cypress issues the request just before the FIN arrives.

See the attached screenshot: the server's FIN arrived just 0.2 milliseconds after the second GET was issued by Cypress.
The RST sent by the server is totally justified.
[screenshot: packet capture showing the second GET, followed immediately by the server's FIN and RST]

Desired behavior

In this situation Cypress should resubmit the failed request and provide the browser with the real upstream response.
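
For illustration, here is the retry idea in plain Node.js terms (a generic sketch, not Cypress internals; getWithRetry is a hypothetical helper):

const http = require('http')

// Generic sketch: retry a GET once when the pooled keep-alive connection
// turns out to have been closed on the server side.
function getWithRetry (options, retriesLeft = 1) {
  return new Promise((resolve, reject) => {
    http.get(options, resolve).on('error', (err) => {
      // ECONNRESET here usually means the server closed the idle socket
      // while we were reusing it, so the request is safe to resend
      if (retriesLeft > 0 && err.code === 'ECONNRESET') {
        resolve(getWithRetry(options, retriesLeft - 1))
      } else {
        reject(err)
      }
    })
  })
}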

Test code to reproduce

This issue cannot be reproduced reliably. However, it occurs with high probability during long Cypress runs.

Cypress Version

10.11.0

Node version

16.14.2

Operating System

Debian GNU/Linux 11 (bullseye) as Docker image cypress/included:10.11.0

Debug Logs

$ grep -e request19259 test.log
2022-11-16T22:30:06.042Z cypress:server:request received an error making http request { timeout: undefined, retryIntervals: [], url: 'http://viz:8100/arc/apps/snapshot/268?v=2022.11.16.21.37.35', requestId: 'request19259', retryOnNetworkFailure: true, retryOnStatusCodeFailure: false, delaysRemaining: [], err: Error: socket hang up at connResetException (node:internal/errors:692:14) at Socket.socketOnEnd (node:_http_client:478:23) at Socket.emit (node:events:539:35) at Socket.emit (node:domain:475:12) at endReadableNT (node:internal/streams/readable:1345:12) at processTicksAndRejections (node:internal/process/task_queues:83:21) { code: 'ECONNRESET' } }
2022-11-16T22:30:06.042Z cypress:server:request exhausted all attempts retrying request { timeout: undefined, retryIntervals: [], url: 'http://viz:8100/arc/apps/snapshot/268?v=2022.11.16.21.37.35', requestId: 'request19259', retryOnNetworkFailure: true, retryOnStatusCodeFailure: false, delaysRemaining: [], err: Error: socket hang up at connResetException (node:internal/errors:692:14) at Socket.socketOnEnd (node:_http_client:478:23) at Socket.emit (node:events:539:35) at Socket.emit (node:domain:475:12) at endReadableNT (node:internal/streams/readable:1345:12) at processTicksAndRejections (node:internal/process/task_queues:83:21) { code: 'ECONNRESET' } }
2022-11-16T22:30:06.044Z cypress:server:request aborting { requestId: 'request19259' }

Other

No response

@ryanthemanuel
Collaborator

Hi @ckoncz-hwx. Thank you for logging this issue. What browser are you using for these tests? We made a change about a year ago that prevents requests going through the proxy from retrying, because Chrome itself changed to also retry failures. Having both layers retry caused a large number of retries, which in turn caused various issues.

@ckoncz-hwx
Author

Hi @ryanthemanuel, thank you for looking into this!
I only use Chrome for running the tests.
The version is the one included in the "cypress/included:10.11.0" Docker image: Chrome 100 (headless).

@mjhenkes mjhenkes assigned mjhenkes and unassigned ryanthemanuel Nov 23, 2022
@nagash77 nagash77 assigned BlueWinds and unassigned mjhenkes Dec 6, 2022
@BlueWinds
Contributor

So a bit of context here, courtesy of @flotwig (thanks!).

Cypress used to have logic to say "retry this request if it's a 'transient network error' (which includes ECONNRESET)", following the logic that Chromium used at the time, because something we were passing via CLI to Chromium caused it to NOT retry.

I can't remember the exact CLI option, but it was right there in the code, something like if (kTestingMode) { don't retry }. So, in order to meet use cases like this user's and have their requests retry like they would in real-world Chromium, we copied the retry behavior in the proxy layer. But at some point we noticed that Chrome would retry OUR retries if they failed... or at least, that's what I thought was happening... hence the "oops, all retries!" behavior.

Beyond that I'm not sure what exactly is happening here. It's possible it was a mistake to completely remove the retries; possibly Chromium is still retrying, but only in certain cases, not all cases like we thought.
 
We have these skipped system tests that run Chromium, not Cypress, to test out the behavior of different network errors with and without a proxy:

// NOTE: We can just skip these tests, they are really only useful for learning
// about how Chrome does it.
context.skip('Google Chrome', () => {
  const testRetries = (path) => {
    return launchBrowser(`http://127.0.0.1:${PORT}${path}`)
    .then((proc) => {
      return Promise.fromCallback((cb) => {
        return onVisit = function () {
          if (counts[path] >= 3) {
            return cb()
          }
        }
      }).then(() => {
        proc.kill(9)
        expect(counts[path]).to.be.at.least(3)
      })
    })
  }

  const testNoRetries = (path) => {
    return launchBrowser(`http://localhost:${PORT}${path}`)
    .delay(6000)
    .then((proc) => {
      proc.kill(9)
      expect(counts[path]).to.eq(1)
    })
  }

  it('retries 3+ times when receiving immediate reset', () => {
    return testRetries('/immediate-reset')
  })

  it('retries 3+ times when receiving reset after headers', () => {
    return testRetries('/after-headers-reset')
  })

  it('does not retry if reset during body', () => {
    return testNoRetries('/during-body-reset')
  })

  context('behind a proxy server', () => {
    const testProxiedRetries = (url) => {
      return launchBrowser(url, { withProxy: true })
      .then((proc) => {
        return Promise.fromCallback((cb) => {
          return onVisit = function () {
            if (counts[url] >= 3) {
              return cb()
            }
          }
        }).then(() => {
          proc.kill(9)
          expect(counts[url]).to.be.at.least(3)
        })
      })
    }

    const testProxiedNoRetries = (url) => {
      return launchBrowser('http://during-body-reset.invalid/', { withProxy: true })
      .delay(6000)
      .then((proc) => {
        proc.kill(9)
        expect(counts[url]).to.eq(1)
      })
    }

    it('retries 3+ times when receiving immediate reset', () => {
      return testProxiedRetries('http://immediate-reset.invalid/')
    })

    it('retries 3+ times when receiving reset after headers', () => {
      return testProxiedRetries('http://after-headers-reset.invalid/')
    })

    it('does not retry if reset during body', () => {
      return testProxiedNoRetries('http://during-body-reset.invalid/')
    })

    it('does not retry on \'500 Internal Server Error\'', () => {
      return testProxiedNoRetries('http://proxy-internal-server-error.invalid/')
    })

    it('does not retry on \'502 Bad Gateway\'', () => {
      return testProxiedNoRetries('http://proxy-bad-gateway.invalid/')
    })

    it('does not retry on \'503 Service Unavailable\'', () => {
      return testProxiedNoRetries('http://proxy-service-unavailable.invalid/')
    })
  })
})
so we can validate whether the retry behavior still matches what it did 3 years ago... in those cases we should probably still be retrying.

This file USED to switch on kEnableAutomation to decide to retry or not but it doesn't any longer: https://source.chromium.org/chromium/chromium/src/+/main:chrome/renderer/net/net_error_helper_core.cc

It's possible that Chromium changed behavior again in a recent version; I'd be curious whether you see the same behavior with failed network requests when using an older version of Chrome, say 97-99.

That, or a reduced reproduction that we can run locally (without spawning 60k requests / waiting 10 min), would be very helpful in tracking this down.

@ckoncz-hwx
Author

Thank you, @BlueWinds, for the update.

The way to reproduce this is to issue two consecutive HTTP requests where the second one is delayed, relative to the first, by roughly the target server's keep-alive connection timeout. If we repeat this sequence enough times, we eventually run into a situation where the client thinks the connection is still open but the server has already sent the FIN packet (see the sketch below).
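
For illustration, a minimal Node.js sketch of that timing (the port and the 10-second timeout mirror my CherryPy setup; this reproduces the reset only occasionally, never deterministically):

const http = require('http')

// Server that closes idle keep-alive connections after ~10 seconds (like CherryPy)
const server = http.createServer((req, res) => res.end('ok'))
server.keepAliveTimeout = 10000
server.listen(8100)

// Client that reuses a single keep-alive connection
const agent = new http.Agent({ keepAlive: true, maxSockets: 1 })
const get = () => new Promise((resolve, reject) => {
  http.get({ port: 8100, agent }, (res) => {
    res.resume()
    res.on('end', resolve)
  }).on('error', reject)
})

// Issue the second request right around the server's keep-alive timeout.
// When the server's FIN loses the race, the second request fails with ECONNRESET.
get()
  .then(() => new Promise((resolve) => setTimeout(resolve, 10000)))
  .then(get)
  .then(() => console.log('no reset this time'))
  .catch((err) => console.error(err.code))
  .finally(() => {
    agent.destroy()
    server.close()
  })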

My production server runs CherryPy, which has a keep-alive timeout of 10 seconds. While searching the net for how to achieve the same timeout in Node.js (so that I can provide a JavaScript-based repro environment), I ran across this article, which describes a situation quite similar to mine: https://shuheikagawa.com/blog/2019/04/25/keep-alive-timeout/

The recommended solution there is to have a shorter timeout on the load balancer/proxy than on the upstream server.
So, is there a way to specify the Cypress server's keep-alive timeout? If yes, my original problem might be solved.

@BlueWinds
Contributor

That article seems to be describing the same situation - good find. Cypress uses the default Node keep-alive timeout of 5s, and I don't think we have functionality to modify this directly on the server. The proxy does copy most headers from the original request intact - I'm not sure how it would handle a keep-alive timeout set on the request from the browser. It might very well be overridden (and therefore a dead end).

One alternative would be to set retries on your tests. I'm assuming this isn't a single test that issues 30k requests. Given how rare the failure is, setting {retries: 2} might not be the worst solution (see the example below). Not ideal, I know.
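
For reference, the standard way to configure this (the test name and config values below are just examples):

// Per test:
it('loads the snapshot', { retries: 2 }, () => {
  cy.visit('/arc/apps/snapshot/268')
})

// Or globally, in cypress.config.js:
const { defineConfig } = require('cypress')

module.exports = defineConfig({
  retries: {
    runMode: 2, // retries during `cypress run`
    openMode: 0, // retries during `cypress open`
  },
})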

I think we'd probably accept a PR if you wanted to expose the server's keepAliveTimeout alongside the other config options (https://docs.cypress.io/guides/references/configuration#Timeouts), but adding configuration is unfortunately a bit of a task (our config system is not the cleanest).

@ckoncz-hwx
Author

Yes, configuring {retries: 1} already helps with getting a green run for the whole test collection. Issues only remain with tests that are not prepared to be retried, or when the error occurs in a setup step. The first issue has to be solved on my side (making the tests retryable), and the second one is much less frequent.

However, regarding the 5-second timeout: the TCP flow included in the description seems to contradict that. The connection between the Cypress server and the server under test was open for more than 10 seconds.

My first idea was that there is a one-to-one mapping between client and upstream connections, so if Cypress closes the client connection after the 5s timeout then it will close the matching server connection, too. This does not seem to be the case.

Is there, then, a connection pool for the Cypress -> upstream server connections that is managed independently of the client connections? And does that pool have a different timeout (larger than 10 seconds)?

@BlueWinds
Contributor

Yes, there's definitely a separate connection pool.

All requests from the browser run through the cypress server - a node process which in turn makes calls to the backend server. That's how cy.intercept() is implemented - requests come from the browser, the server runs logic around them before deciding whether to make a request to the 'real' URL or to respond directly itself.
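
For illustration, a standard cy.intercept() handler showing both options (the route pattern is just an example):

cy.intercept('GET', '/arc/apps/snapshot/*', (req) => {
  // Option 1: respond directly from the Cypress proxy, never hitting the backend
  // req.reply({ statusCode: 200, body: { stubbed: true } })

  // Option 2: forward the request to the real backend and observe the response
  req.continue((res) => {
    expect(res.statusCode).to.eq(200)
  })
})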

So it's probably the server pool that has a different timeout. We don't explicitly set a timeout or create a pool anywhere I was able to locate, so it's probably a Node default somewhere that's getting used.

@BlueWinds
Contributor

One thing to note is that we internally use the request library - old, deprecated, pretty clunky. 😬 So this could be an ancient request issue that's surfacing here.

@flotwig
Contributor

flotwig commented Dec 14, 2022

We don't explicitly set a timeout or create a pool anywhere I was able to locate, so it's probably a node default somewhere that's getting used.

@BlueWinds Yeah, we extended Node.js's default Agent to improve HTTP/S keep-alive support (mainly, making it work with requests going through an HTTP_PROXY), but we still pass the connection to the base Agent so Node can handle pooling. So we don't do the pooling ourselves. This all happens here: https://github.com/cypress-io/cypress/blob/ba81e8c458283cb67bf043f8b29a4a8697a93bc0/packages/network/lib/agent.ts
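
Roughly the shape of that approach (a simplified sketch, not the actual Cypress agent; ProxyAwareAgent is a made-up name):

const http = require('http')
const net = require('net')

// Override connection creation to deal with an upstream HTTP_PROXY, but hand
// the socket back to the base Agent so Node still does the keep-alive pooling.
class ProxyAwareAgent extends http.Agent {
  createConnection (options, callback) {
    if (!process.env.HTTP_PROXY) {
      // no upstream proxy: defer entirely to Node's built-in behavior
      return super.createConnection(options, callback)
    }

    // with an upstream proxy the real agent establishes the tunnel itself;
    // the key point is that the resulting socket is still pooled by http.Agent
    return net.connect(options)
  }
}

// Used like any other agent:
// http.get({ host: 'viz', port: 8100, path: '/', agent: new ProxyAwareAgent({ keepAlive: true }) })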

Node.js has made a lot of progress on its own keep-alive support over the past few years, so maybe we can get rid of some of the logic we have in there to support keep-alive for upstream proxied requests.

@BlueWinds BlueWinds removed their assignment Jan 11, 2023
@BlueWinds BlueWinds added routed-to-e2e E2E Issue related to end-to-end testing labels Jan 11, 2023
@nagash77 nagash77 added Triaged Issue has been routed to backlog. This is not a commitment to have it prioritized by the team. and removed routed-to-e2e labels Apr 19, 2023