SNS client hanging indefinitely sending PublishCommand #6025

Open
3 tasks done
alesk20 opened this issue Apr 25, 2024 · 6 comments

Comments

@alesk20

alesk20 commented Apr 25, 2024

Checkboxes for prior research

Describe the bug

Hello, I have a problem with the SNS client that I can't solve. I have a server that receives a large number of messages from an SQS queue (using the SQS client), performs some internal operations, and then sends a notification with a JSON message body to an SNS topic by calling the SDK method "sns.send" with a PublishCommand instance as the argument.

After the server has been running for a few hours, depending on the amount of data flowing through the SQS consumer, the "sns.send" method begins to hang indefinitely and never responds, and the notification is not published.
I implemented a timeout of 180 seconds to stop the current execution and retry publishing to the SNS topic; sometimes it works on the 2nd retry, sometimes on the 3rd, and so on.

The problem is that as more messages keep arriving through the SQS queue, more and more of them hit the same problem, until my server is completely blocked and needs to be restarted. After the restart, the messages are successfully processed and notifications are correctly published to the topic.

I have this problem only with aws-sdk v3; with aws-sdk v2 I never had it, and the operations and logic of my server have remained the same. I tried different versions of @aws-sdk/client-sns, including the latest one, and the problem always occurs.

SDK version number

@aws-sdk/client-sns, @aws-sdk/sqs-consumer

Which JavaScript Runtime is this issue in?

Node.js

Details of the browser/Node.js/ReactNative version

Node.js 18

Reproduction Steps

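const { SNS, PublishCommand } = require("@aws-sdk/client-sns");

// options.endpointUrl, MessageData and topic come from the surrounding application code.
const sns = new SNS({apiVersion: "2010-03-31", endpoint: options.endpointUrl});
const publishCommand = new PublishCommand({
  ...MessageData,
  TopicArn: topic
});
// Called from inside an async function; this is the call that hangs.
await sns.send(publishCommand);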

Observed Behavior

The command "await sns.send(publishCommand)" hangs indefinitely.

Expected Behavior

The "sns.send" command should respond immediately, or at least within a reasonable time.

Possible Solution

No response

Additional Information/Context

No response

@alesk20 alesk20 added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Apr 25, 2024
@RanVaknin RanVaknin self-assigned this Apr 25, 2024
@RanVaknin
Contributor

Hi @alesk20,

Thanks for reaching out. The behavior is indeed odd. Since the await on .send() never returns, it might be because the server did not close the connection and the SDK is still awaiting a response.

Without seeing more detailed logs it would be very difficult to diagnose. This could be due to different httphandler defaults with regard to connection management that you might need to change.

For example, in the v2 SDK the default timeout was 60 seconds, whereas in v3 we use the default provided by Node's http client, which is 0:

requestTimeout: The number of milliseconds a request can take before automatically being terminated. Defaults to 0, which disables the timeout.

My guess is that this issue where the server hangs is also happening on v2, but the default behavior of the older version makes it more transparent. You might want to dial down the timeout to be more aggressive, perhaps to 60 seconds to align it with v2's behavior, and see if this solves your issue.
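To make the effect of a disabled timeout concrete, here is a minimal sketch that uses only Node's built-in `http` module rather than the SDK. The helper names `getWithTimeout`, `startSilentServer` and `demo`, the local server that never replies, and the 200 ms value are all hypothetical stand-ins chosen for illustration. With the v3 SDK the equivalent setting is assumed to be the `requestTimeout` option of the client's request handler (for example a `NodeHttpHandler`), as in the quoted description above.

```javascript
const http = require("node:http");

// Hypothetical helper: issue a GET and reject if the socket stays idle for timeoutMs.
// Passing 0 mirrors the "timeout disabled" default: the promise may never settle.
function getWithTimeout(url, timeoutMs) {
  return new Promise((resolve, reject) => {
    const req = http.get(url, (res) => {
      res.resume(); // drain the body so the socket can be released
      res.on("end", () => resolve(res.statusCode));
    });
    if (timeoutMs > 0) {
      // Node only signals the timeout; the request has to be destroyed explicitly.
      req.setTimeout(timeoutMs, () => {
        const err = new Error(`no response within ${timeoutMs} ms`);
        req.destroy(Object.assign(err, { name: "TimeoutError" }));
      });
    }
    req.on("error", reject);
  });
}

// Stand-in for an endpoint that accepts the request but never answers.
function startSilentServer() {
  return new Promise((resolve) => {
    const server = http.createServer(() => { /* intentionally never respond */ });
    server.listen(0, "127.0.0.1", () => resolve(server));
  });
}

// Runs one request against the silent server and reports how it ended.
async function demo() {
  const server = await startSilentServer();
  const url = `http://127.0.0.1:${server.address().port}/`;
  try {
    await getWithTimeout(url, 200);
    return "resolved";
  } catch (err) {
    return err.name;
  } finally {
    server.closeAllConnections?.(); // available in newer Node.js 18 releases
    server.close();
  }
}
```

Calling `demo()` should settle with the timeout error's name after roughly 200 ms, whereas the same call with a timeout of 0 would wait for as long as the server stays silent.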

Thanks,
Ran~

@RanVaknin RanVaknin added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. p2 This is a standard priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Apr 30, 2024
@alesk20
Author

alesk20 commented Apr 30, 2024

Hi @RanVaknin,

Thank you for the response. I'll try setting the timeout explicitly to 60 seconds, but it's still strange that all the messages get published with the V2 SDK, while with the V3 SDK they don't get published when the SNS client hangs.
Shouldn't the messages handled with the V2 SDK also fail to publish if they reach the default 60-second timeout?
What I observe is that I don't lose any messages with the V2 SDK, but with the V3 SDK I lose them when the SNS client hangs and I forcefully throw a timeout.

Thanks

@alesk20
Author

alesk20 commented Apr 30, 2024

Hi @RanVaknin,

I want to add another question after reading your response: in the V2 SDK, what happens when the default requestTimeout is reached? Is an error thrown, or is the promise just resolved?

The 180-second timeout I mentioned in my first message was not set on client-sns; it is an external timeout that drops the process and retries. So in my current implementation, given what you said, I think the connection to the SNS topic still hangs even after I drop the process.
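The difference between an external timeout that merely stops waiting and one that also cancels the in-flight work can be sketched as follows. This is an illustration under assumptions, not the code from this report: `withAbortTimeout` and `retryWithTimeout` are hypothetical helpers, and `operation` is any function that accepts an `AbortSignal`. With the v3 SDK, the signal could presumably be forwarded as `sns.send(publishCommand, { abortSignal: signal })`.

```javascript
// Hypothetical helper: run operation(signal); after timeoutMs, abort the signal
// (so the operation can release its connection) and reject the caller's promise.
function withAbortTimeout(operation, timeoutMs) {
  const controller = new AbortController();
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      controller.abort(); // ask the operation to cancel its in-flight work
      const err = new Error(`timed out after ${timeoutMs} ms`);
      reject(Object.assign(err, { name: "TimeoutError" }));
    }, timeoutMs);

    Promise.resolve()
      .then(() => operation(controller.signal))
      .then(resolve, reject) // ignored if the timeout has already rejected
      .finally(() => clearTimeout(timer));
  });
}

// Hypothetical retry loop: each attempt gets its own timeout and abort signal.
async function retryWithTimeout(operation, { timeoutMs, attempts }) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await withAbortTimeout(operation, timeoutMs);
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}

// Possible usage with the SDK (assumption, not taken from the report):
// await retryWithTimeout(
//   (signal) => sns.send(publishCommand, { abortSignal: signal }),
//   { timeoutMs: 60_000, attempts: 3 }
// );
```

A plain race against a timer would unblock the caller but leave the request and its socket alive, which matches the concern above. Note that a publish that timed out may still have reached the service, so a retry can produce a duplicate notification.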

It still doesn't explain why the V3 SDK has these slowdowns when publishing messages to the SNS topic, while the V2 SDK delivers them immediately, even under heavy load, without missing any delivery.

Thanks

@RanVaknin
Contributor

Hi @alesk20, requestTimeout means that the connection will be terminated from the client side. It does not mean a retry.

> Shouldn't the messages handled with the V2 SDK also fail to publish if they reach the default 60-second timeout?

Not necessarily. The server might receive and process your request without ever responding with a status that tells the client whether the message was processed.

It's hard to say why you are only experiencing this with v3. It might be because of differences in connection management, or something you did differently in your code.
Without seeing an end-to-end example, it will be very difficult to root-cause this.
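One way connection management can produce this kind of symptom, offered purely as a hypothesis and not as a diagnosis of this issue, is a bounded socket pool: if every socket is held by a request that never completes, later requests wait in the agent's queue. The sketch below shows the mechanism with Node's built-in `http.Agent` only. The function name `demoPoolExhaustion`, the silent local server and `maxSockets: 1` are hypothetical choices made to keep the example small.

```javascript
const http = require("node:http");

// Count entries across the per-host arrays that http.Agent keeps.
const count = (byHost) =>
  Object.values(byHost).reduce((total, list) => total + list.length, 0);

async function demoPoolExhaustion() {
  // Stand-in for an endpoint that accepts requests but never answers.
  const server = http.createServer(() => { /* intentionally never respond */ });
  await new Promise((resolve) => server.listen(0, "127.0.0.1", resolve));
  const { port } = server.address();

  // A pool with a single socket: the first request occupies it indefinitely.
  const agent = new http.Agent({ maxSockets: 1 });
  const inFlight = [1, 2].map(() => {
    const req = http.get({ host: "127.0.0.1", port, agent });
    req.on("error", () => {}); // ignore errors caused by the cleanup below
    return req;
  });

  // Give the first request time to connect; no timeout is configured anywhere.
  await new Promise((resolve) => setTimeout(resolve, 100));
  const snapshot = {
    active: count(agent.sockets),  // requests that own a socket
    queued: count(agent.requests), // requests still waiting for a socket
  };

  // Cleanup so the process can exit.
  inFlight.forEach((req) => req.destroy());
  agent.destroy();
  server.closeAllConnections?.();
  server.close();
  return snapshot;
}
```

In this setup the second request never gets a socket while the first one is stuck, so inspecting `agent.sockets` and `agent.requests` on the agent used by the SDK's handler (if one is configured explicitly) may help confirm or rule out this hypothesis.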

Can you set up a minimal GitHub repository that can reliably (intermittently reliably is also ok) reproduce this behavior?
Ideally this reproduction would have the working v2 code and the non-working v3 code so we can compare them as well.

Thanks,
Ran~

@RanVaknin RanVaknin added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Apr 30, 2024
@alesk20
Author

alesk20 commented May 2, 2024

Hi @RanVaknin,

Unfortunately, it's very difficult to replicate this case: it only happens after 1-2 hours and only in the production environment, where I have a lot of traffic on the SQS queue. I also tried to replicate it in a test environment myself, but couldn't manage to do it.

As I said in the first message, I didn't change anything in the code; I just migrated from the V2 SDK to the V3 SDK and upgraded from Node.js 16 to Node.js 18. These two are the only things I changed. I don't think the problem is the Node.js 18 version.

Can you tell me what happens in the V2 SDK when the default requestTimeout is reached? Does the promise get resolved, or is an error thrown?

Thanks.

@RanVaknin
Contributor

Hi @alesk20 ,

> Can you tell me what happens in the V2 SDK when the default requestTimeout is reached? Does the promise get resolved, or is an error thrown?

When v2 requestTimeout (or in its v2 name timeout) is reached, the client will kill the connection, and an error would be thrown as shown here: https://github.com/aws/aws-sdk-js/blob/36e3f6d5c27adf522b7517f095f060f4581d9b03/lib/http/node.js#L86. You might be handling it in v2 and not doing so in v3?
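To illustrate what handling such a timeout error might look like, here is a minimal sketch. It assumes that v2 reports timeouts with `code: "TimeoutError"` (as in the linked source) and that v3's Node.js handler uses `name: "TimeoutError"`; both should be verified against the SDK version in use. `isTimeoutError`, `publishOrReport` and the `onTimeout` callback are hypothetical names.

```javascript
// Hypothetical classifier covering both assumed error shapes.
function isTimeoutError(err) {
  return Boolean(err) && (err.name === "TimeoutError" || err.code === "TimeoutError");
}

// Hypothetical wrapper: `publish` is any function that returns a promise,
// for example () => sns.send(publishCommand).
async function publishOrReport(publish, onTimeout) {
  try {
    return await publish();
  } catch (err) {
    if (isTimeoutError(err)) {
      // The request may or may not have reached the service; let the caller decide
      // whether to retry or leave the SQS message for redelivery.
      onTimeout(err);
      return undefined;
    }
    throw err; // anything else is not a timeout and should surface normally
  }
}
```

If the v2 code path caught and retried this error while the v3 code path only relies on an external timer, that difference alone could explain why no messages appeared to be lost with v2.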

> As I said in the first message, I didn't change anything in the code; I just migrated from the V2 SDK to the V3 SDK and upgraded from Node.js 16 to Node.js 18. These two are the only things I changed. I don't think the problem is the Node.js 18 version.

I understand your concern; however, I cannot point to a single place in the SDK and say "this is why your code is not working like it did in v2". There are about 8 years of development between when v2 was first introduced and when v3 was released. The architecture of the two is very different and evolved with the JS language itself and the ecosystem's best practices.

I tried to strip down all of the http configurations used by the v2 SDK and found that the only http option we explicitly override is indeed timeout. However, I was wrong initially: we actually set it to 120000 ms (2 minutes) by default:

console.log(sns.config.httpOptions)
// prints: { timeout: 120000 }

I don't think it will be helpful for us to keep comparing the two; instead, we should focus on how to help with your current setup.

Are you running your application from something like a Docker container? I'm asking because Docker has decent support for tcpdump, which allows you to inspect TCP-level networking events. You could use that, or any other network diagnostic tool, to find what closes those connections.

I understand that your current repro code does not raise the reported behavior, but can you please share it anyway? Right now we are doing a lot of theorizing which is not helpful. By you sharing your code we can better visualize the architecture and do a simple visual check of certain things you might be missing to get this to work correctly (this is not to suggest that your code is wrong). If you have the v2 code handy, feel free to share that too.

Thanks again for your cooperation.

All the best,
Ran~

@RanVaknin RanVaknin added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels May 2, 2024