
CI sometimes fails due to timeout on query to service #28

Closed
BenjaminPelletier opened this issue Feb 7, 2023 · 14 comments
Labels
automated-testing (Related to automated testing tools), bug (Something isn't working), P1 (High priority), tooling

Comments

@BenjaminPelletier
Member

Example

Sometimes, the CI checks fail due to a timeout when uss_qualifier queries a service at port 8074; the error message is:

urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='host.docker.internal', port=8074): Read timed out. (read timeout=60)

When the CI is immediately rerun, this error usually disappears.

The service at port 8074 in the standard local deployment (used by CI) is mock_uss providing scdsc capabilities. Unfortunately, the details of this query attempt are not currently captured.
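
For future occurrences, one low-effort way to capture at least the basics of these queries could be a response hook on the outgoing requests session. This is only a sketch under the assumption that the timing-out queries go through a requests.Session; the logger name and endpoint path are illustrative, not the actual uss_qualifier code:

import logging

import requests

logger = logging.getLogger("query_capture")


def log_query(response, *args, **kwargs):
    # Response hook recording the basics of every completed outgoing query.
    req = response.request
    logger.info(
        "%s %s -> %s in %.3fs",
        req.method,
        req.url,
        response.status_code,
        response.elapsed.total_seconds(),
    )


session = requests.Session()
session.hooks["response"].append(log_query)

# Timeouts raise before any response exists, so they must be logged separately:
try:
    session.get("http://host.docker.internal:8074/scdsc/v1/status", timeout=60)  # illustrative endpoint
except requests.exceptions.Timeout as e:
    logger.error("Query timed out: %s", e)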

@BenjaminPelletier added the bug, automated-testing, P1, and tooling labels on Feb 7, 2023
@BenjaminPelletier changed the title from "CI sometimes fails due to timeout on query to service at port 8074" to "CI sometimes fails due to timeout on query to service" on Feb 9, 2023
@BenjaminPelletier
Member Author

Additional logging and further investigation seem to suggest that this problem arises when the background daemon of mock_uss's atproxy_client functionality stops working. When this happens, a query to atproxy hangs for the timeout limit (currently 1 minute) and then atproxy responds with a 500 indicating that the handler client didn't handle the request. Looking at recent logs, I see the following lines from mock_uss with atproxy_client when a getCapabilities request ends up timing out:

2023-02-10 18:42:51.349 | INFO     | monitoring.mock_uss.atproxy_client.daemon:_poll_atproxy:252 - Handling scd.deleteFlight request, id 60523084-6a32-4a5e-8222-d2254479ac2b
2023-02-10 18:42:51.421 | INFO     | monitoring.mock_uss.atproxy_client.daemon:_poll_atproxy:261 - Request 60523084-6a32-4a5e-8222-d2254479ac2b fulfillment has code 200
2023-02-10 18:42:51.627 | INFO     | monitoring.mock_uss.atproxy_client.daemon:_poll_atproxy:288 - Delivered response to request 60523084-6a32-4a5e-8222-d2254479ac2b to atproxy
2023-02-10 18:42:51.844 | INFO     | monitoring.mock_uss.atproxy_client.daemon:_poll_atproxy:252 - Handling scd.getStatus request, id 188ea33f-fe87-4d01-894a-384a270b52ae
2023-02-10 18:42:51.844 | INFO     | monitoring.mock_uss.atproxy_client.daemon:_poll_atproxy:261 - Request 188ea33f-fe87-4d01-894a-384a270b52ae fulfillment has code 200
2023-02-10 18:42:52.037 | INFO     | monitoring.mock_uss.atproxy_client.daemon:_poll_atproxy:288 - Delivered response to request 188ea33f-fe87-4d01-894a-384a270b52ae to atproxy
2023-02-10 18:42:52.245 | INFO     | monitoring.mock_uss.atproxy_client.daemon:_poll_atproxy:252 - Handling scd.getCapabilities request, id 3b1ae055-2f53-4e0a-8a04-822318676e3d
2023-02-10 18:42:52.245 | INFO     | monitoring.mock_uss.atproxy_client.daemon:_poll_atproxy:261 - Request 3b1ae055-2f53-4e0a-8a04-822318676e3d fulfillment has code 200

The last log line is printed from here, and a successful fulfillment would then have printed one of the lines in this if block, indicating that this query is responsible for the daemon failure. And indeed, the container logs later contain a stack trace indicating a connection timeout to port 8075 of host.docker.internal. Port 8075 is the canonical local port for atproxy, confirming that the mock_uss atproxy_client daemon is timing out while trying to respond back to atproxy with a result.
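
For illustration, the failure mode described above corresponds to a polling loop roughly like the following, where an unhandled timeout while delivering a response back to atproxy terminates the daemon. This is only a sketch with assumed names and endpoint paths, not the actual atproxy_client code; the point is that the delivery step needs to tolerate timeouts (log and retry) rather than letting the exception kill the loop:

import logging

import requests

logger = logging.getLogger("atproxy_client_sketch")

ATPROXY_BASE = "http://host.docker.internal:8075"  # canonical local atproxy port
REQUEST_TIMEOUT_S = 5  # illustrative value


def poll_forever(handle_request):
    # handle_request is assumed to map a pending-request dict to (code, body).
    # Endpoint paths here are illustrative, not atproxy's actual API.
    session = requests.Session()
    while True:
        try:
            pending = session.get(
                f"{ATPROXY_BASE}/requests", timeout=REQUEST_TIMEOUT_S
            ).json()
        except requests.RequestException as e:
            logger.warning("Polling atproxy failed; retrying: %s", e)
            continue
        for req in pending:
            code, body = handle_request(req)
            try:
                # The critical step: a timeout here must be caught and logged,
                # not allowed to terminate the daemon thread.
                session.put(
                    f"{ATPROXY_BASE}/requests/{req['id']}",
                    json={"code": code, "response": body},
                    timeout=REQUEST_TIMEOUT_S,
                )
            except requests.RequestException as e:
                logger.error("Failed to deliver response for %s: %s", req["id"], e)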

@BenjaminPelletier
Member Author

BenjaminPelletier commented Feb 13, 2023

This failure appears not to be due to atproxy or mock_uss's atproxy_client functionality. Instead, a call to the scdsc mock_uss (port 8074) to inject a flight intent timed out.

uss_qualifier indicates that the request to inject flight 8b4a497a-825b-47dc-a98a-581320dcd562 was initiated at 16:32:01.04, and the failed (timed-out) result reported at 16:33:01.10. The only other mention of 8b4a497a-825b-47dc-a98a-581320dcd562 in uss_qualifier's logs is port-8074 mock_uss's response to a later request to clear the area where it indicates flight 8b4a497a-825b-47dc-a98a-581320dcd562 was deleted.

The logs of port-8074 mock_uss with scdsc functionality pause at 16:31:58.12 ([inject_flight] Releasing placeholder for flight_id 9b4eb91a-9355-4554-bca6-3a293816eda6), and then a log message at 16:33:01.15 indicates that the inject flight handler started running for flight 8b4a497a-825b-47dc-a98a-581320dcd562.

@BenjaminPelletier
Member Author

This is apparently a very difficult problem to diagnose; see my comment here for some additional diagnostic work I've done.

After having this problem occur so often on my local machine that I was essentially unable to complete the local_test uss_qualifier configuration, I was able to reproduce it with run_containers_host_docker_internal.sh here, but the problem then vanished and I currently can't reproduce the issue on my local machine.

@barroco
Contributor

barroco commented Mar 31, 2023

In addition to @BenjaminPelletier's investigation, we have been running the CI job periodically for the past week to gather data here. Please note that when the job fails, a tcpdump file (which can be opened with Wireshark) is available in the job's artifacts.
In parallel, a monitoring tool has been set up to gather metrics on the node running the CI job, to make sure the problem is not due to limited resources. Below are some screenshots of the resource usage for this job. Note that the error is reported by the test driver at 12:32:26 (UTC-4).

(Four screenshots of node resource usage metrics around the time of the failure.)

@BenjaminPelletier
Member Author

After disabling atproxy in #126, I observed this failure today:

2023-07-19 16:22:19.718 | WARNING  | monitoring.uss_qualifier.suites.suite:_print_failed_check:50 - New failed check:
  details: "Inject flight query to [http://host.docker.internal:8094/scdsc/v1/flights/dcbb1785-d748-44cd-a8f7-7b8fdd99a264\](http://host.docker.internal:8094/scdsc/v1/flights/dcbb1785-d748-44cd-a8f7-7b8fdd99a264/)
    \ returned 999\n\nStack trace:\nTraceback (most recent call last):\n  File \"/app/monitoring/uss_qualifier/scenarios/flight_planning/test_steps.py\"\
    , line 376, in submit_flight_intent\n    resp, query, flight_id = flight_planner.request_flight(\n\
    \  File \"/app/monitoring/uss_qualifier/resources/flight_planning/flight_planner.py\"\
    , line 96, in request_flight\n    raise QueryError(\nmonitoring.monitorlib.fetch.QueryError:\
    \ Inject flight query to http://host.docker.internal:8094/scdsc/v1/flights/dcbb1785-d748-44cd-a8f7-7b8fdd99a264\
    \ returned 999\n"
  documentation_url: null
  name: Successful modification
  participants:
  - uss2
  query_report_timestamps:
  - '2023-07-19T16:21:54.690023Z'
  requirements:
  - interuss.automated_testing.flight_planning.ExpectedBehavior
  severity: High
  summary: 'Error from uss2 when attempting to submit a flight intent (flight ID: dcbb1785-d748-44cd-a8f7-7b8fdd99a264)'
  timestamp: '2023-07-19T16:22:19.716135Z'

@Shastick
Contributor

For the record, we had the following happen very consistently today:

2023-08-16 19:10:27.093 | DEBUG    | monitoring.uss_qualifier.scenarios.scenario:record_query:328 - Queried PUT http://host.docker.internal:8094/scdsc/v1/flights/5df29b58-b2cf-4e7e-a5aa-749f3ca23908 -> 999
2023-08-16 19:10:27.094 | INFO     | monitoring.uss_qualifier.scenarios.scenario:record_failed:107 - Severity High upgraded to Critical because USS_QUALIFIER_STOP_FAST environment variable indicates true
2023-08-16 19:10:27.097 | WARNING  | monitoring.uss_qualifier.suites.suite:_print_failed_check:54 - New failed check:
  details: "Inject flight query to [http://host.docker.internal:8094/scdsc/v1/flights/5df29b58-b2cf-4e7e-a5aa-749f3ca23908\](http://host.docker.internal:8094/scdsc/v1/flights/5df29b58-b2cf-4e7e-a5aa-749f3ca23908/)
    \ returned 999\n\nStack trace:\nTraceback (most recent call last):\n  File \"/app/monitoring/uss_qualifier/scenarios/flight_planning/test_steps.py\"\
    , line 376, in submit_flight_intent\n    resp, query, flight_id = flight_planner.request_flight(\n\
    \  File \"/app/monitoring/uss_qualifier/resources/flight_planning/flight_planner.py\"\
    , line 96, in request_flight\n    raise QueryError(\nmonitoring.monitorlib.fetch.QueryError:\
    \ Inject flight query to http://host.docker.internal:8094/scdsc/v1/flights/5df29b58-b2cf-4e7e-a5aa-749f3ca23908\
    \ returned 999\n\nSeverity High upgraded to Critical because USS_QUALIFIER_STOP_FAST\
    \ environment variable indicates true"
  documentation_url: https://github.com/interuss/monitoring/blob/60c7399e2609960685fa404a857e8ff7f4c61426/monitoring/uss_qualifier/scenarios/astm/utm/nominal_planning/conflict_equal_priority_not_permitted/conflict_equal_priority_not_permitted.md#successful-transition-to-non-conforming-state-check
  name: Successful transition to non-conforming state
  participants:
  - uss2
  query_report_timestamps:
  - '2023-08-16T19:10:02.079477Z'
  requirements:
  - interuss.automated_testing.flight_planning.ExpectedBehavior
  severity: Critical
  summary: 'Error from uss2 when attempting to submit a flight intent (flight ID: 5df29b58-b2cf-4e7e-a5aa-749f3ca23908)'
  timestamp: '2023-08-16T19:10:27.094503Z'
  
2023-08-16 19:10:52.125 | WARNING  | monitoring.monitorlib.fetch:query_and_describe:216 - query_and_describe attempt 1 to DELETE http://host.docker.internal:8094/scdsc/v1/flights/5df29b58-b2cf-4e7e-a5aa-749f3ca23908 failed with timeout ReadTimeout: HTTPConnectionPool(host='host.docker.internal', port=8094): Read timed out. (read timeout=25)

@BenjaminPelletier
Member Author

After a huge number of overall failures of the F3411-22a CI task on #170, I took a closer look at the logs and think this may be valuable:

2023-08-17 04:50:13.009 | DEBUG    | __config__:pre_request:26 - gunicorn pre_request from worker 18 (OS PID 18): POST /mock/scd/uss/v1/operational_intents
2023-08-17 04:50:13.011 | DEBUG    | __config__:post_request:37 - gunicorn post_request from worker 18 (OS PID 18): POST /mock/scd/uss/v1/operational_intents -> 204
2023-08-17 04:50:13.735 | DEBUG    | __config__:pre_request:26 - gunicorn pre_request from worker 18 (OS PID 18): POST /mock/scd/uss/v1/operational_intents
2023-08-17 04:50:13.737 | DEBUG    | __config__:post_request:37 - gunicorn post_request from worker 18 (OS PID 18): POST /mock/scd/uss/v1/operational_intents -> 204
2023-08-17 04:50:14.489 | DEBUG    | __config__:pre_request:26 - gunicorn pre_request from worker 18 (OS PID 18): POST /mock/scd/uss/v1/operational_intents
2023-08-17 04:50:14.491 | DEBUG    | __config__:post_request:37 - gunicorn post_request from worker 18 (OS PID 18): POST /mock/scd/uss/v1/operational_intents -> 204
2023-08-17 04:50:14.679 | DEBUG    | __config__:pre_request:26 - gunicorn pre_request from worker 16 (OS PID 16): PUT /scdsc/v1/flights/ab74309d-a25e-4d01-a1c1-8d3599b8add7
2023-08-17 04:50:14.681 | DEBUG    | monitoring.mock_uss.scdsc.routes_injection:scdsc_inject_flight:130 - [inject_flight:ab74309d-a25e-4d01-a1c1-8d3599b8add7] Starting handler
2023-08-17 04:50:14.684 | DEBUG    | monitoring.mock_uss.scdsc.routes_injection:inject_flight:148 - [inject_flight:ab74309d-a25e-4d01-a1c1-8d3599b8add7] Validating flight authorisation
2023-08-17 04:50:14.723 | DEBUG    | monitoring.mock_uss.scdsc.routes_injection:inject_flight:214 - [inject_flight:ab74309d-a25e-4d01-a1c1-8d3599b8add7] Existing flight locked for update
2023-08-17 04:50:14.732 | DEBUG    | monitoring.mock_uss.scdsc.routes_injection:inject_flight:253 - [inject_flight:ab74309d-a25e-4d01-a1c1-8d3599b8add7] Checking for operational intents in the DSS
2023-08-17 04:50:19.784 | WARNING  | monitoring.monitorlib.fetch:query_and_describe:216 - query_and_describe attempt 1 to GET http://scdsc.uss2.localutm/mock/scd/uss/v1/operational_intents/bdd6e1a6-a39d-4170-a8b7-f9cf6db5cae3 failed with timeout ReadTimeout: HTTPConnectionPool(host='scdsc.uss2.localutm', port=80): Read timed out. (read timeout=5)
2023-08-17 04:50:24.692 | DEBUG    | __config__:pre_request:26 - gunicorn pre_request from worker 15 (OS PID 15): PUT /scdsc/v1/flights/ab74309d-a25e-4d01-a1c1-8d3599b8add7
2023-08-17 04:50:24.694 | DEBUG    | monitoring.mock_uss.scdsc.routes_injection:scdsc_inject_flight:130 - [inject_flight:ab74309d-a25e-4d01-a1c1-8d3599b8add7] Starting handler
2023-08-17 04:50:24.697 | DEBUG    | monitoring.mock_uss.scdsc.routes_injection:inject_flight:148 - [inject_flight:ab74309d-a25e-4d01-a1c1-8d3599b8add7] Validating flight authorisation
2023-08-17 04:50:24.788 | WARNING  | monitoring.monitorlib.fetch:query_and_describe:216 - query_and_describe attempt 2 to GET http://scdsc.uss2.localutm/mock/scd/uss/v1/operational_intents/bdd6e1a6-a39d-4170-a8b7-f9cf6db5cae3 failed with timeout ReadTimeout: HTTPConnectionPool(host='scdsc.uss2.localutm', port=80): Read timed out. (read timeout=5)
2023-08-17 04:50:24.791 | DEBUG    | __config__:pre_request:26 - gunicorn pre_request from worker 17 (OS PID 17): GET /mock/scd/uss/v1/operational_intents/bdd6e1a6-a39d-4170-a8b7-f9cf6db5cae3
2023-08-17 04:50:24.833 | DEBUG    | __config__:post_request:37 - gunicorn post_request from worker 17 (OS PID 17): GET /mock/scd/uss/v1/operational_intents/bdd6e1a6-a39d-4170-a8b7-f9cf6db5cae3 -> 200
2023-08-17 04:50:24.883 | DEBUG    | monitoring.mock_uss.scdsc.routes_injection:inject_flight:273 - [inject_flight:ab74309d-a25e-4d01-a1c1-8d3599b8add7] Checking for intersections with bdd6e1a6-a39d-4170-a8b7-f9cf6db5cae3
2023-08-17 04:50:24.883 | DEBUG    | monitoring.mock_uss.scdsc.routes_injection:inject_flight:282 - [inject_flight:ab74309d-a25e-4d01-a1c1-8d3599b8add7] intersection with bdd6e1a6-a39d-4170-a8b7-f9cf6db5cae3 not considered: intersection with a past version of this flight
2023-08-17 04:50:24.883 | DEBUG    | monitoring.mock_uss.scdsc.routes_injection:inject_flight:330 - [inject_flight:ab74309d-a25e-4d01-a1c1-8d3599b8add7] Sharing operational intent with DSS
2023-08-17 04:50:24.908 | DEBUG    | monitoring.mock_uss.scdsc.routes_injection:inject_flight:357 - [inject_flight:ab74309d-a25e-4d01-a1c1-8d3599b8add7] Notifying subscribers http://scdsc.uss1.localutm/mock/scd, http://scdsc.uss2.localutm/mock/scd
2023-08-17 04:50:24.916 | DEBUG    | __config__:pre_request:26 - gunicorn pre_request from worker 17 (OS PID 17): POST /mock/scd/uss/v1/operational_intents
2023-08-17 04:50:24.918 | DEBUG    | __config__:post_request:37 - gunicorn post_request from worker 17 (OS PID 17): POST /mock/scd/uss/v1/operational_intents -> 204
2023-08-17 04:50:24.919 | DEBUG    | monitoring.mock_uss.scdsc.routes_injection:inject_flight:372 - [inject_flight:ab74309d-a25e-4d01-a1c1-8d3599b8add7] Storing flight in database
2023-08-17 04:50:24.967 | DEBUG    | monitoring.mock_uss.scdsc.routes_injection:inject_flight:382 - [inject_flight:ab74309d-a25e-4d01-a1c1-8d3599b8add7] Complete.
2023-08-17 04:50:25.007 | DEBUG    | monitoring.mock_uss.scdsc.routes_injection:inject_flight:418 - [inject_flight] Releasing placeholder for flight_id ab74309d-a25e-4d01-a1c1-8d3599b8add7
2023-08-17 04:50:25.016 | DEBUG    | __config__:post_request:37 - gunicorn post_request from worker 16 (OS PID 16): PUT /scdsc/v1/flights/ab74309d-a25e-4d01-a1c1-8d3599b8add7 -> 200
2023-08-17 04:50:25.284 | DEBUG    | monitoring.mock_uss.scdsc.routes_injection:inject_flight:214 - [inject_flight:ab74309d-a25e-4d01-a1c1-8d3599b8add7] Existing flight locked for update
2023-08-17 04:50:25.293 | DEBUG    | monitoring.mock_uss.scdsc.routes_injection:inject_flight:253 - [inject_flight:ab74309d-a25e-4d01-a1c1-8d3599b8add7] Checking for operational intents in the DSS
2023-08-17 04:50:25.348 | DEBUG    | __config__:pre_request:26 - gunicorn pre_request from worker 16 (OS PID 16): GET /mock/scd/uss/v1/operational_intents/bdd6e1a6-a39d-4170-a8b7-f9cf6db5cae3
2023-08-17 04:50:25.391 | DEBUG    | __config__:post_request:37 - gunicorn post_request from worker 16 (OS PID 16): GET /mock/scd/uss/v1/operational_intents/bdd6e1a6-a39d-4170-a8b7-f9cf6db5cae3 -> 200
2023-08-17 04:50:25.444 | DEBUG    | monitoring.mock_uss.scdsc.routes_injection:inject_flight:273 - [inject_flight:ab74309d-a25e-4d01-a1c1-8d3599b8add7] Checking for intersections with bdd6e1a6-a39d-4170-a8b7-f9cf6db5cae3
2023-08-17 04:50:25.444 | DEBUG    | monitoring.mock_uss.scdsc.routes_injection:inject_flight:282 - [inject_flight:ab74309d-a25e-4d01-a1c1-8d3599b8add7] intersection with bdd6e1a6-a39d-4170-a8b7-f9cf6db5cae3 not considered: intersection with a past version of this flight
2023-08-17 04:50:25.444 | DEBUG    | monitoring.mock_uss.scdsc.routes_injection:inject_flight:330 - [inject_flight:ab74309d-a25e-4d01-a1c1-8d3599b8add7] Sharing operational intent with DSS
2023-08-17 04:50:25.468 | DEBUG    | monitoring.mock_uss.scdsc.routes_injection:inject_flight:357 - [inject_flight:ab74309d-a25e-4d01-a1c1-8d3599b8add7] Notifying subscribers http://scdsc.uss1.localutm/mock/scd, http://scdsc.uss2.localutm/mock/scd
2023-08-17 04:50:25.479 | DEBUG    | __config__:pre_request:26 - gunicorn pre_request from worker 16 (OS PID 16): POST /mock/scd/uss/v1/operational_intents
2023-08-17 04:50:25.481 | DEBUG    | __config__:post_request:37 - gunicorn post_request from worker 16 (OS PID 16): POST /mock/scd/uss/v1/operational_intents -> 204
2023-08-17 04:50:25.482 | DEBUG    | monitoring.mock_uss.scdsc.routes_injection:inject_flight:372 - [inject_flight:ab74309d-a25e-4d01-a1c1-8d3599b8add7] Storing flight in database
2023-08-17 04:50:25.532 | DEBUG    | monitoring.mock_uss.scdsc.routes_injection:inject_flight:382 - [inject_flight:ab74309d-a25e-4d01-a1c1-8d3599b8add7] Complete.
2023-08-17 04:50:25.572 | DEBUG    | monitoring.mock_uss.scdsc.routes_injection:inject_flight:418 - [inject_flight] Releasing placeholder for flight_id ab74309d-a25e-4d01-a1c1-8d3599b8add7
2023-08-17 04:50:25.581 | DEBUG    | __config__:post_request:37 - gunicorn post_request from worker 15 (OS PID 15): PUT /scdsc/v1/flights/ab74309d-a25e-4d01-a1c1-8d3599b8add7 -> 200

These are the logs of mock_uss_scdsc_b, located (in this PR) at scdsc.uss2.localutm. Here's my narration:

  • 14.679: handler 16 to inject flight ab74 starts
  • 14.732: handler 16 begins trying to acquire data about other relevant operational intents
  • 19.784: handler 16 times out (5 second timeout in this PR) trying to query itself at GET http://scdsc.uss2.localutm/mock/scd/uss/v1/operational_intents/bdd6e1a6-a39d-4170-a8b7-f9cf6db5cae3; there is no indication in the logs that such a GET request was received
  • 24.692: a second handler 15 to retry injecting flight ab74 starts
  • 24.788: handler 16(?) times out again
  • 24.791: we finally see the GET request; presumably this is from the third attempt of handler 16
  • 25.016: handler 16 finally completes

Thoughts:

  1. It seems really, really weird that queries from the webserver to itself are sometimes not showing up as being received at any level. I can't think of a reason why this might happen; I don't think we're near our concurrency limit.
  2. mock_uss's scdsc implementation is querying itself for operational intent details but it doesn't need to do that -- it could just read from its internal state cache rather than attempting to get its operational intent details via the external interface
  3. I don't think scdsc even needs to look up the particular operational intent it is querying (bdd6e1a6-a39d-4170-a8b7-f9cf6db5cae3) because that is the operational intent associated with the flight -- we're not going to check it for intersections anyway

So, it seems like there are some low-hanging improvements (sketched after the list below):

  1. Do not let mock_uss's scdsc query the operational intent details of an operational intent it owns
  2. Do not even retrieve the operational intent details of the operational intent that will be modified for this flight
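
A rough sketch of what those two improvements could look like in the injection handler (illustrative names only, not the actual routes_injection code):

def gather_relevant_op_intent_details(
    op_intent_refs, own_op_intent_id, locally_managed, fetch_details
):
    # op_intent_refs: operational intent references discovered in the DSS
    # own_op_intent_id: the op intent associated with the flight being injected
    # locally_managed: dict of op intent details this mock_uss already manages
    # fetch_details: callable performing the external GET .../operational_intents/{id}
    details = []
    for ref in op_intent_refs:
        if ref.id == own_op_intent_id:
            # Improvement 2: this op intent will be replaced by the injection
            # and is never checked for intersections, so don't retrieve it.
            continue
        if ref.id in locally_managed:
            # Improvement 1: we own this op intent; read our own state instead
            # of querying ourselves over the external interface.
            details.append(locally_managed[ref.id])
            continue
        details.append(fetch_details(ref))
    return details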

@mickmis
Contributor

mickmis commented Aug 17, 2023

Some additional investigations on my end. My gut feeling is that we are hitting some kind of network limit at a deeper level.

Analyzing a tcpdump capture of a failed run:

  • a request and its first retry (sent 25 seconds later) both hit the destination host at almost the same time (some kind of queue got emptied at that point?)
  • a timed out request seems to be using the same TCP stream used by previously successful requests (from the requester's perspective though, might be different from the receiver's perspective since there is some NAT going on in the middle)

Then, monitoring some network parameters with ss (see also the sketch after this list for another way to watch these counters):

  • there definitely is some funny stuff going on at the TCP level, with lost packets and more things I don't fully understand; notably, the following counters appear and increase progressively as network intensity rises, until timeouts at the application level are hit: TCPSackRecovery, TCPLossProbes, TCPSpuriousRtxHostQueues, and more
  • looking a bit online, TCPSpuriousRtxHostQueues could potentially hint at an overloaded TCP queue on the host?
  • also, there seems to be a correlation between network "intensity" and the timeouts, which I believe may explain why not using atproxy alleviates the issue, and why there have recently been more failures on the v22a test (since the recent changes in the CI actually run some tests multiple times); so any change that reduces the amount of network traffic could be beneficial
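
For reference, the same TcpExt counters can also be watched without ss by polling /proc/net/netstat on the host (Linux-specific); a minimal sketch:

def read_tcpext_counters(path="/proc/net/netstat"):
    # Parse the TcpExt counters (TCPSackRecovery, TCPLossProbes,
    # TCPSpuriousRtxHostQueues, ...) from the kernel's netstat file, which
    # alternates a header line of counter names with a line of values.
    with open(path) as f:
        lines = f.read().splitlines()
    counters = {}
    for header, values in zip(lines[::2], lines[1::2]):
        if not header.startswith("TcpExt:"):
            continue
        names = header.split()[1:]
        nums = [int(v) for v in values.split()[1:]]
        counters.update(zip(names, nums))
    return counters


if __name__ == "__main__":
    c = read_tcpext_counters()
    for name in ("TCPSackRecovery", "TCPLossProbes", "TCPSpuriousRtxHostQueues"):
        print(name, c.get(name, 0))

Sampling this periodically alongside a test run would show whether these counters climb together with the timeouts.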

@Shastick
Contributor

a timed out request seems to be using the same TCP stream used by previously successful requests (from the requester's perspective though, might be different from the receiver's perspective since there is some NAT going on in the middle)

  • We could try to configure the HTTP client to open a new TCP socket each time and not reuse them? A sketch of one way to do this is below.
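
A minimal sketch of what that could look like, assuming the relevant queries go through a requests session (illustrative, not the actual monitorlib code): sending Connection: close asks the server to close the connection after each response, so urllib3 does not return the socket to its pool and every request opens a new connection.

import requests

session = requests.Session()
# Disable keep-alive: the server closes the connection after each response,
# so no TCP stream is reused across requests.
session.headers["Connection"] = "close"

resp = session.put(
    "http://host.docker.internal:8094/scdsc/v1/flights/<flight_id>",  # illustrative URL
    json={},
    timeout=25,
)
print(resp.status_code)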

BenjaminPelletier added a commit to BenjaminPelletier/monitoring that referenced this issue Aug 18, 2023
BenjaminPelletier added a commit to BenjaminPelletier/monitoring that referenced this issue Aug 18, 2023
barroco pushed a commit that referenced this issue Aug 18, 2023
@BenjaminPelletier
Member Author

This continues to be a serious issue causing perhaps 50% of CI runs to fail recently. I am considering trying podman rather than docker to see if that might make things better.

Other debugging attempts welcome as well, for instance configuring no socket reuse.

@mickmis
Contributor

mickmis commented Oct 16, 2023

After digging into several leads, I've become convinced that the issue is caused (at least partially) by Gunicorn, very probably triggered by the high-volume traffic.

Now that there is no NAT involved anymore, the network dumps are much easier to analyze. That was not the case during my previous debugging attempt; since then, @BenjaminPelletier has reworked the deployed docker containers to talk directly to each other on the docker virtual network rather than going through the host network stack + NAT.

One thing I've been able to pinpoint by analyzing the dumps is that all timeouts are due to Gunicorn essentially freezing while handling a request. Sometimes we see other timeouts or errors, but whenever that happened, the root cause could be traced back to such a Gunicorn freeze. The dumps show that the client's request is sent properly to the container running Gunicorn, and the container does ACK that it received the request (so this is NOT a lost TCP packet), but then absolutely nothing happens inside the container / on the Gunicorn side: no log, no request processing, nothing.
Looking online I could find many similar issues reported against Gunicorn, so this does not appear to be an isolated problem:

Additional relevant links:

With all this, all that might be needed is a Gunicorn configuration change. I have an open draft PR (Orbitalize#8) where I've been trying out a number of things. All those changes are stacked on top of each other, so the Gunicorn config change might not be the only change needed. With the changes made there, so far all CI executions have succeeded without any timeout, currently on attempt no. 8. I will run it a few more times to be more certain that this fixes the issue. Fingers crossed...

The config change in question is using a different worker class (gevent), as the implementation of the current one (gthread) seems to be somehow the source of the problem; a rough sketch of such a change is below. I will make PRs with those changes soon.
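
For context, the kind of change described would look roughly like this in a gunicorn.conf.py-style configuration file. This is only a sketch; the actual file and values in the repository may differ, and the gevent worker class requires the gevent package to be installed:

# gunicorn.conf.py (sketch)

# Previously something along the lines of:
# worker_class = "gthread"
# threads = 4

# Switch to the asynchronous gevent worker, which appeared to avoid the
# request-handling freezes described above:
worker_class = "gevent"
workers = 4                # number of worker processes (illustrative)
worker_connections = 1000  # max simultaneous connections per gevent worker
timeout = 60               # seconds before an unresponsive worker is killed and restarted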

Other leads followed

Network issues in GitHub Actions VMs

A lead I followed was issues with the networking in the VMs used by GitHub Actions. I had found several reported issues that could look similar to ours, but in the end I do not believe this was it:

Docker virtual networks issues

Another credible lead was these system logs in the VM, which appeared many times over the execution of the tests:

Oct 13 16:46:51 fv-az617-980 systemd[1]: networkd-dispatcher.service: Got notification message from PID 15038, but reception only permitted for main PID 575
Oct 13 16:46:51 fv-az617-980 networkd-dispatcher[575]: ERROR:Failed to get interface "veth79e2f8d" status: Command '['/usr/bin/networkctl', 'status', '--no-pager', '--no-legend', '--', 'veth79e2f8d']' returned non-zero exit status 1.
Oct 13 16:46:52 fv-az617-980 dockerd[835]: time="2023-10-13T16:46:52.368985350Z" level=info msg="ignoring event" container=6dae1a5b908bd11318e3419522e15f20339f7f5bdc507c44626e44f6869dd132 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Oct 13 16:46:52 fv-az617-980 containerd[595]: time="2023-10-13T16:46:52.369782958Z" level=info msg="shim disconnected" id=6dae1a5b908bd11318e3419522e15f20339f7f5bdc507c44626e44f6869dd132
Oct 13 16:46:52 fv-az617-980 containerd[595]: time="2023-10-13T16:46:52.369872859Z" level=warning msg="cleaning up after shim disconnected" id=6dae1a5b908bd11318e3419522e15f20339f7f5bdc507c44626e44f6869dd132 namespace=moby
Oct 13 16:46:52 fv-az617-980 containerd[595]: time="2023-10-13T16:46:52.369886259Z" level=info msg="cleaning up dead shim"
Oct 13 16:46:52 fv-az617-980 containerd[595]: time="2023-10-13T16:46:52.387467337Z" level=warning msg="cleanup warnings time=\"2023-10-13T16:46:52Z\" level=info msg=\"starting signal loop\" namespace=moby pid=15051 runtime=io.containerd.runc.v2\ntime=\"2023-10-13T16:46:52Z\" level=warning msg=\"failed to remove runc container\" error=\"runc did not terminate successfully: exit status 255: \" runtime=io.containerd.runc.v2\n"
Oct 13 16:46:52 fv-az617-980 systemd-networkd[445]: veth0dfec17: Lost carrier
Oct 13 16:46:52 fv-az617-980 kernel: br-85869bbc1851: port 2(veth0dfec17) entered disabled state
Oct 13 16:46:52 fv-az617-980 kernel: veth222da0f: renamed from eth0
Oct 13 16:46:52 fv-az617-980 networkd-dispatcher[575]: WARNING:Unknown index 38 seen, reloading interface list
Oct 13 16:46:52 fv-az617-980 systemd-networkd[445]: veth0dfec17: Link DOWN
Oct 13 16:46:52 fv-az617-980 kernel: br-85869bbc1851: port 2(veth0dfec17) entered disabled state
Oct 13 16:46:52 fv-az617-980 kernel: device veth0dfec17 left promiscuous mode
Oct 13 16:46:52 fv-az617-980 kernel: br-85869bbc1851: port 2(veth0dfec17) entered disabled state
Oct 13 16:46:52 fv-az617-980 networkctl[15076]: Interface "veth0dfec17" not found.
Oct 13 16:46:52 fv-az617-980 systemd[1]: networkd-dispatcher.service: Got notification message from PID 15076, but reception only permitted for main PID 575
Oct 13 16:46:52 fv-az617-980 networkd-dispatcher[575]: ERROR:Failed to get interface "veth0dfec17" status: Command '['/usr/bin/networkctl', 'status', '--no-pager', '--no-legend', '--', 'veth0dfec17']' returned non-zero exit status 1.

Those were correlated with the numerous creations/deletions of docker containers. Reports found online hinted at issues with interactions between the VM network stack and the Docker virtual networks.
Following this I modified the execution of the test scenarios to run in the same container rather than creating/deleting a new container every time. At this stage I'm not sure this did anything to resolve the issue, but if desired I can in any case make a PR with this change.

Some links relevant to that:

@BenjaminPelletier
Member Author

This is awesome and we should definitely implement any changes that have been shown to reduce the problem (or any problem, really). It seems like perhaps this problem may have multiple different root causes manifesting with the same observation (lost requests). When I was able to reproduce very reliably at one point (though I am no longer able to reproduce like this), I was able to use an unconfigured nginx container to produce the symptoms, so I don't think that instance was due to gunicorn (rather, something Docker).

@mickmis
Contributor

mickmis commented Oct 18, 2023

It seems like perhaps this problem may have multiple different root causes manifesting with the same observation (lost requests). When I was able to reproduce very reliably at one point (though I am no longer able to reproduce like this), I was able to use an unconfigured nginx container to produce the symptoms, so I don't think that instance was due to gunicorn (rather, docker/for-win#8861 (comment)).

That is interesting! That is a lead I had pursued initially, and as such in the test PR in question I did try out several things, which were still active when I concluded it was a Gunicorn issue:

In addition, the change to create the test-running container only once was also still active. So indeed there could be different causes? I will start with the Gunicorn change, and if issues still appear, will proceed with the other changes.
So far the PR has 13 successful attempts: https://github.com/Orbitalize/monitoring/actions/runs/6532137382/job/17749339895

@BenjaminPelletier
Member Author

With zero observed instances of this issue since @mickmis's fix, I'm going to declare this issue resolved. Huge thanks for the fix!

BenjaminPelletier added a commit to BenjaminPelletier/monitoring that referenced this issue Dec 20, 2023
BenjaminPelletier added a commit that referenced this issue Dec 21, 2023
* Improve CI documentation

* Remove reference to #28

* Fix link
github-actions bot added a commit that referenced this issue Dec 21, 2023 (9885986)