Request.Send() doesn't close open connections, leading to 'too many open files' error #4695

Open
colmsnowplow opened this issue Jan 20, 2023 · 0 comments
Labels
bug This issue is a bug. p3 This is a minor priority issue

Describe the bug

My application uses a fork of the kinsumer library, which uses this package to get data from Kinesis. We have run into 'too many open files' errors in production when reading from Kinesis. (In case it's relevant, we are running on ECS with the default setting of 1024 max network ports.)

From digging into the code, I believe this happens when we process events from the stream very quickly, leading to many pulls from kinesis in a short period of time. It seems that this library doesn't close connections once a shard iterator is done pulling records from the shard, and when many of these calls are made before the connection expires, we run into this issue.

We are pulling data from multiple shards concurrently, but I believe that the issue is not caused by concurrent requests being made, but many sequential requests.

I believe this occurs because:

  • We create a shard iterator
  • We call GetRecords and process much of the data very quickly
  • We then call GetRecords again on the next shard iterator, ad infinitum.
  • GetRecords calls Request.Send() under the hood. This opens a connection on the network, but does not seem to close the connection. (Also note that the comment on this method says "Send will not close the request.Request's body." - I am yet to ascertain if this is relevant to my problem).
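
One way to test the hypothesis above without touching Kinesis is to count how many connections a local server accepts during a run of sequential requests. The sketch below uses only the Go standard library, not the AWS SDK, so it shows how `net/http` behaves rather than what `Request.Send()` actually does; `countConns` is a hypothetical helper written for this illustration. It compares a loop that reads each response body to EOF and closes it with a loop that leaves the bodies open.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// countConns issues n sequential GET requests against a local test server
// and returns how many TCP connections that server accepted.
// When drain is true, each response body is read to EOF and closed before
// the next request is sent; when false, bodies stay open until the end.
func countConns(n int, drain bool) int {
	var accepted int64

	srv := httptest.NewUnstartedServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			io.WriteString(w, "ok")
		}))
	// Count every newly accepted connection on the server side.
	srv.Config.ConnState = func(c net.Conn, s http.ConnState) {
		if s == http.StateNew {
			atomic.AddInt64(&accepted, 1)
		}
	}
	srv.Start()
	defer srv.Close()

	// A dedicated transport, so no other client influences the count.
	tr := &http.Transport{}
	defer tr.CloseIdleConnections()
	client := &http.Client{Transport: tr}

	var open []io.Closer
	for i := 0; i < n; i++ {
		resp, err := client.Get(srv.URL)
		if err != nil {
			panic(err)
		}
		if drain {
			io.Copy(io.Discard, resp.Body) // read to EOF so the connection can be reused
			resp.Body.Close()
		} else {
			open = append(open, resp.Body) // keeps the connection busy
		}
	}
	for _, c := range open {
		c.Close()
	}
	return int(atomic.LoadInt64(&accepted))
}

func main() {
	fmt.Println("drained and closed:", countConns(5, true))
	fmt.Println("left open:", countConns(5, false))
}
```

If the drained loop reuses a single connection while the other loop opens one per request, then sequential calls alone should not exhaust descriptors as long as every response body is consumed and closed. In that case it would be worth checking whether something in the call path creates a new client or transport per request.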

At this point, it would be very useful to me if someone more familiar with the codebase could tell me whether the above explanation of the scenario seems valid, or suggest another explanation I could investigate.

I haven't yet found the time to produce a reproduction, but I can do so if I'm not barking up the wrong tree.

I would also be interested to hear opinions on how to solve or get around the issue in a sustainable way (beyond increasing the number of connections available on the box, which we are doing), assuming that it is valid and reproducible.

Expected Behavior

I expect not to run into "too many open files" errors when making many successive GetRecords requests.

Current Behavior

Error:

level=error msg="Failed to pull next Kinesis record from Kinsumer client: error performing initial leader actions: 
error loading shard IDs from kinesis: RequestError: send request failed\ncaused by: Post 
\"https://kinesis.eu-central-1.amazonaws.com/\": dial tcp: lookup kinesis.eu-central-1.amazonaws.com on 
10.100.0.2:53: dial udp 10.100.0.2:53: socket: too many open files" error="Failed to pull next Kinesis record 
from Kinsumer client: error performing initial leader actions: error loading shard IDs from kinesis: RequestError: 
send request failed\ncaused by: Post \"https://kinesis.eu-central-1.amazonaws.com/\": dial tcp: lookup 
kinesis.eu-central-1.amazonaws.com on 10.100.0.2:53: dial udp 10.100.0.2:53: socket: too many open files

Reproduction Steps

I'm yet to find a full reproduction but I believe it can be done as follows:

  • Create a client
  • Create a shard iterator for each shard of a kinesis stream which is populated with ~100k records (estimated)
  • Call GetRecords on each, immediately ack, and call GetRecords again immediately - do this in a loop
  • Run the above on a box with a cap on available ports
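
While running a reproduction along these lines, it may help to log how many file descriptors the process holds against its limit, since "too many open files" is the message for hitting the per-process descriptor limit. The following is a minimal sketch that assumes Linux (it reads `/proc/self/fd`); `fdUsage` is a hypothetical helper, not part of the SDK or kinsumer.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// fdUsage returns the number of file descriptors this process currently has
// open and its soft RLIMIT_NOFILE limit.
// Assumption: a Linux system where /proc/self/fd is available.
func fdUsage() (int, uint64, error) {
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return 0, 0, err
	}
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, 0, err
	}
	// The listing includes the descriptor opened to read the directory itself.
	return len(entries), uint64(rl.Cur), nil
}

func main() {
	open, limit, err := fdUsage()
	if err != nil {
		fmt.Println("could not read descriptor usage:", err)
		return
	}
	fmt.Printf("open file descriptors: %d of %d\n", open, limit)
}
```

Calling `fdUsage` once per GetRecords iteration would show whether the count climbs steadily (suggesting a leak) or stays flat until a sudden burst.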

Possible Solution

I think this can be solved if the connection we open when making the request in GetRecords() is closed, either when the last record is acked, or when we call a request on the next shard iterator.

Alternatively, if GetRecords() returned something which allows me to close the request manually, I could handle it in code.
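
As a possible workaround on the caller's side, the number of connections can be capped at the HTTP transport level instead of being closed per request. The sketch below is standard library only; `newBoundedClient` and `acceptedConns` are hypothetical names, and the limits and timeouts are illustrative values rather than recommendations. It builds a client whose transport holds at most `maxConns` connections per host, then checks against a local test server that many concurrent requests stay within that cap.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync"
	"sync/atomic"
	"time"
)

// newBoundedClient returns a client whose transport holds at most maxConns
// connections (dialing, active, or idle) to any single host.
func newBoundedClient(maxConns int) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			MaxConnsPerHost:     maxConns,
			MaxIdleConnsPerHost: maxConns, // keep returned connections for reuse
			IdleConnTimeout:     30 * time.Second,
		},
		Timeout: 10 * time.Second,
	}
}

// acceptedConns sends the given number of concurrent GET requests through
// client to a local test server and returns how many connections it accepted.
func acceptedConns(client *http.Client, requests int) int {
	var accepted int64
	srv := httptest.NewUnstartedServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			time.Sleep(5 * time.Millisecond) // make the requests overlap
			io.WriteString(w, "ok")
		}))
	srv.Config.ConnState = func(c net.Conn, s http.ConnState) {
		if s == http.StateNew {
			atomic.AddInt64(&accepted, 1)
		}
	}
	srv.Start()
	defer srv.Close()
	// Explicitly release pooled connections once the work is done.
	defer client.CloseIdleConnections()

	var wg sync.WaitGroup
	for i := 0; i < requests; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := client.Get(srv.URL)
			if err != nil {
				panic(err)
			}
			io.Copy(io.Discard, resp.Body)
			resp.Body.Close()
		}()
	}
	wg.Wait()
	return int(atomic.LoadInt64(&accepted))
}

func main() {
	n := acceptedConns(newBoundedClient(2), 20)
	fmt.Println("connections accepted with a cap of 2:", n)
}
```

In SDK v1 a client like this can be supplied through the `HTTPClient` field of `aws.Config`. Whether that resolves the error reported here is untested; it only bounds how many sockets the HTTP layer opens to the Kinesis endpoint.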

Additional Information/Context

No response

SDK version used

v1.40.22 - relevant code seems equivalent in latest

Environment details (Version of Go (go version)? OS name and version, etc.)

1.17

@colmsnowplow colmsnowplow added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jan 20, 2023
@aajtodd aajtodd self-assigned this Feb 13, 2023
@RanVaknin RanVaknin removed the needs-triage This issue or PR still needs to be triaged. label Mar 13, 2023
@RanVaknin RanVaknin added the p3 This is a minor priority issue label Mar 30, 2023