Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SMTPClient stuck (race condition?) #315

Open
SystemParadox opened this issue Apr 26, 2022 · 15 comments
Open

SMTPClient stuck (race condition?) #315

SystemParadox opened this issue Apr 26, 2022 · 15 comments

Comments

@SystemParadox
Copy link

We're connecting to an ISP mail server over a flakey mobile connection. We've implemented a retry system but it relies on emailjs resolving.

We recently got the following sequence of error responses:

Error: timedout while connecting to smtp server
Error: timedout while connecting to smtp server
Error: bad response on command 'RCPT': -relay not permitted. Please authenticate using your AAISP mailbox, or send
Error: bad response on command 'RCPT': -relay not permitted. Please authenticate using your AAISP mailbox, or send
Error: bad response on command 'RCPT': -relay not permitted. Please authenticate using your AAISP mailbox, or send
Error: bad response on command 'RCPT': -relay not permitted. Please authenticate using your AAISP mailbox, or send
Error: bad response on command 'RCPT': -relay not permitted. Please authenticate using your AAISP mailbox, or send
Error: bad response on command 'RCPT': -relay not permitted. Please authenticate using your AAISP mailbox, or send
Error: bad response on command 'RCPT': -relay not permitted. Please authenticate using your AAISP mailbox, or send
Error: bad response on command 'RCPT': -relay not permitted. Please authenticate using your AAISP mailbox, or send
Error: bad response on command 'RCPT': -relay not permitted. Please authenticate using your AAISP mailbox, or send
Error: bad response on command 'RCPT': -relay not permitted. Please authenticate using your AAISP mailbox, or send
Error: bad response on command 'RCPT': -relay not permitted. Please authenticate using your AAISP mailbox, or send
Error: bad response on command 'RCPT': -relay not permitted. Please authenticate using your AAISP mailbox, or send

On the next retry it seems that emailjs never resolved the promise so our retry queue is now blocked and no email is being sent. This was 4 days ago.

  • client.sending is true
  • client.ready is true
  • client.smtp._state is 2 (CONNECTED)

I have checked netstat and there is definitely no active connection.

@zackschuster
Copy link
Collaborator

can you provide a few more details? specifically:

  • which version of node you're using
  • which version of emailjs you're using
  • a code snippet demonstrating the usecase
  • a call stack for the timedout error (if possible)

@SystemParadox
Copy link
Author

  • node v16.14.0
  • emailjs 3.6.0

This is what our email manager object looks like:

import type { SMTPConnectionOptions } from 'emailjs';
import { Message, SMTPClient } from 'emailjs';
import PQueue from 'p-queue';

export type AddressList = {
    from: string;
    to: string[];
};

/** Internal configuration used by emailjs to send mail, see https://www.npmjs.com/package/emailjs for all options. */
export type EmailConnectionConfig = Partial<SMTPConnectionOptions>;

export default class Emailer {
    /** stop sending mail after this many items are in the queue */
    private readonly mailQueueMaxSize = 512;

    /** The minimum amount of time between sending each email. */
    private readonly globalMailThrottle = 750;

    // throttle mail sending using a promise queue
    private readonly mailQueue = new PQueue({ concurrency: 1, interval: this.globalMailThrottle, intervalCap: 1 });

    private client: SMTPClient;

    constructor(options: EmailConnectionConfig) {
        this.client = new SMTPClient(options);
    }

    public async send(addresses: AddressList, subject: string, text: string) {
        if (!addresses) {
            console.warn(`Not sending email. Address list looks falsey.`);
            return;
        }

        const from = addresses.from;

        if (!from) {
            console.warn(`Not sending email. No from address is configured.`);
            return;
        }

        if (addresses.to.length === 0) {
            console.warn(`Not sending email. No addresses are configured to send.`);
            return;
        }

        const to = addresses.to.join(', ');
        if (this.mailQueue.size >= this.mailQueueMaxSize) {
            console.error(
                `Dropping outgoing email. There are currently too many items in the email queue. (Queue size= ${this.mailQueue.size}, maximum=${this.mailQueueMaxSize})`
            );
            return;
        }

        await new Promise<void>(resolve => {
            const mailTask = async () => {
                try {
                    console.log(`Sending email to ${to}...`);
                    await this.client.sendAsync(
                        new Message({
                            from,
                            to,
                            subject,
                            text,
                            'content-type': text.startsWith('<!DOCTYPE html') ? 'text/html; charset="UTF-8"' : undefined,
                        })
                    );
                    console.log(`Sending email to ${to}  [SENT]`);
                    console.log(`There are`, this.mailQueue.size, 'items left in the mail queue');
                    resolve();
                } catch (err) {
                    console.error('Error sending email!', `(Subject: "${subject}")`);
                    console.error(err);

                    console.log('Putting this email to the back of the send queue...');
                    void this.mailQueue.add(mailTask);
                }
            };
            void this.mailQueue.add(mailTask);
        });
    }
}

Stack trace for the two errors:

12:32:59 [ERROR] Error: timedout while connecting to smtp server
    at Function.create (/app/node_modules/emailjs/rollup/email.cjs:1077:21)
    at Socket.timedout (/app/node_modules/emailjs/rollup/email.cjs:1114:47)
    at Object.onceWrapper (node:events:639:28)
    at Socket.emit (node:events:520:28)
    at Socket._events.default.emit (/app/lib/server/patchEmitter.js:14:18)
    at Socket._onTimeout (node:net:501:8)
    at listOnTimeout (node:internal/timers:559:17)
    at processTimers (node:internal/timers:502:7)
12:32:59 [ERROR] Error: bad response on command 'RCPT': -relay not permitted. Please authenticate using your AAISP mailbox, or send
    at Function.create (/app/node_modules/emailjs/rollup/email.cjs:1077:21)
    at response (/app/node_modules/emailjs/rollup/email.cjs:1426:48)
    at caller (/app/node_modules/emailjs/rollup/email.cjs:1192:9)
    at TLSSocket.<anonymous> (/app/node_modules/emailjs/rollup/email.cjs:1381:21)
    at Object.onceWrapper (node:events:640:26)
    at TLSSocket.emit (node:events:520:28)
    at TLSSocket._events.default.emit (/app/lib/server/patchEmitter.js:14:18)
    at notify (/app/node_modules/emailjs/rollup/email.cjs:1105:24)
    at TLSSocket.watch (/app/node_modules/emailjs/rollup/email.cjs:1119:17)
    at TLSSocket.emit (node:events:520:28)

Note that these errors happen several times as described above and our retry system works fine. However, on the last time it logged the Sending email to message, but it never logged the [SENT] message nor any error. sendAsync simply never resolved.

Apparently the SMTP server is accessed through a VPN tunnel - there is no authentication so it relies on having a known static IP through the tunnel. It has been suggested when the flakey mobile connection goes down the relay not permitted errors occur because it tries to access the server directly over the internet using a dynamic IP instead of using the tunnel. My suspicion is that it's this IP change that causes emailjs to get confused and never resolve/reject the promise.

@zackschuster
Copy link
Collaborator

zackschuster commented Apr 26, 2022

thanks for the details. they help a lot 😄

My suspicion is that it's this IP change that causes emailjs to get confused and never resolve/reject the promise.

our connections use node's built-in socket classes, so if that were the case i doubt we could do much. we're kind of at the mercy of the stack in that arena. at any rate, it's still hitting timeout, so the internal state should reset properly. it's very odd to me that it isn't, and suggests to me the internal state is getting corrupted. except...

Note that these errors happen several times as described above and our retry system works fine. However, on the last time it logged the Sending email to message, but it never logged the [SENT] message nor any error. sendAsync simply never resolved.

...if i understand this correctly, it means that emails are still being sent even after a timeout, and it's only the final request that gets snagged (please correct me if i'm wrong). what's more, you have your queue's concurrency set to 1, so you shouldn't be having a race condition between multiple invocations of sendAsync. this all strikes me as deeply odd, and i'm not sure where to start digging.

one test i would suggest is creating a new SMTPClient on every request. obviously that wouldn't fly as an actual solution, but it might tell us something.

@SystemParadox
Copy link
Author

Correct, you'll see that if sendAsync rejects we push back onto the queue so it'll just keep retrying. This all seems to work fine for a while until sendAsync suddenly doesn't resolve/reject for some reason and so we are left hanging.

We dug around in the emailjs source for a while and it's clear that _senddone was never called. We are suspicious of the _sendsmtp callback wrapper which tries to send an smtp RSET before calling _senddone. It's hard to follow the trail but we were not able to satisfy ourselves of exactly what happens if there are errors or the connection is closed at this point.

We have enabled the logging in emailjs now so we'll see if that sheds any more light on the situation. However, there doesn't appear to be any logging around the timeout and error handling paths - it would be really helpful if this could be added so we can see what's happening!

Thanks.

@zackschuster
Copy link
Collaborator

zackschuster commented Apr 27, 2022

Correct, you'll see that if sendAsync rejects we push back onto the queue so it'll just keep retrying. This all seems to work fine for a while until sendAsync suddenly doesn't resolve/reject for some reason and so we are left hanging.

very odd behavior.

We dug around in the emailjs source for a while and it's clear that _senddone was never called. We are suspicious of the _sendsmtp callback wrapper which tries to send an smtp RSET before calling _senddone. It's hard to follow the trail but we were not able to satisfy ourselves of exactly what happens if there are errors or the connection is closed at this point.

the RSET call is old behavior that predates my involvement with the library, so i'm very wary of altering it 😅 it does, however, sound like there could be an error in that code path being swallowed, so i think adding some error handling & such would be fruitful. which ties in with...

We have enabled the logging in emailjs now so we'll see if that sheds any more light on the situation. However, there doesn't appear to be any logging around the timeout and error handling paths - it would be really helpful if this could be added so we can see what's happening!

i agree; the instrumentation is very bare bones; i'll see what i can do. do you need this on the 3.x line, or can we move to 4.x? the 4.x line does not have a CJS bundle, but the code is otherwise virtually identical.

Thanks.

thanks for the report!

@SystemParadox
Copy link
Author

We haven't managed to migrate over to ESM yet so we'll need this on 3.x.

Thanks for the help :)

@zackschuster
Copy link
Collaborator

zackschuster commented Apr 28, 2022

i've managed to repro something that looks like the error you're describing using our test suite. to reproduce, checkout the v3.6.0 tag and apply this diff (it also repros on 4.x with a slightly different diff). running the test suite should consistently fail on client allows message with only 'cc' recipient header; skipping the test (via test.skip) should move the failure to client allows message with only 'bcc' recipient header, and so forth.

the error is triggered specifically by the client's configuration:

const client = new SMTPClient({ port, tls: true });
const server = new SMTPServer({ secure: true });

can you confirm that your configuration isn't similar?

@SystemParadox
Copy link
Author

That looks very similar to our configuration!

We're only using { host: 'x', port: 587, tls: true} - there is no username/password.

@zackschuster
Copy link
Collaborator

zackschuster commented Apr 28, 2022

cool, i think we have a target now 😃 i don't know what the fix would be, unfortunately, so it may take some time to figure out. in the meantime, using ssl/user/password as a workaround may be necessary (if possible).

@zackschuster
Copy link
Collaborator

i've been digging through the code on a separate branch & i think this might be a configuration issue. specifically, i set up connection tests (see can connect with tls and can connect with tls and ssl) and they both run fine... until i force the server to be in secure mode, at which point can connect with tls blows up in the familiar way. this, combined with the Please authenticate using your AAISP mailbox error message, makes me think you're supposed to be connecting with both tls and ssl enabled. the snag then would be if you don't have credentials but the server sends you a "too many unauthenticated requests"-type error.

@SystemParadox
Copy link
Author

Thanks for looking into this. Unfortunately I think we may have drifted away from the real issue, which is that the promise sometimes never resolves/rejects - it just gets left hanging. This should never happen, no matter what the client or server configuration is. It should either be resolved succesfully, rejected with an error, or rejected due to timeout.

@zackschuster
Copy link
Collaborator

Thanks for looking into this. Unfortunately I think we may have drifted away from the real issue, which is that the promise sometimes never resolves/rejects - it just gets left hanging. This should never happen, no matter what the client or server configuration is. It should either be resolved succesfully, rejected with an error, or rejected due to timeout.

correct, that's still an issue, but i think it may be a sympton of the configuration issue. i'm still looking into why the promise is stuck pending.

@zackschuster
Copy link
Collaborator

i managed to create a simplified version of your queue code at https://github.com/eleith/emailjs/blob/%C3%B8/test/queue.ts to test for this exact scenario and... it works as it should, rejecting after five tries with a properly closed-out client. the timeouts even came from the client. i'm at a bit of a loss as to how to continue.

@SystemParadox
Copy link
Author

The latest update is that this ran fine for just over a week, including handling periods of bad connections and errors as above, but eventually something triggered it again (during a period of poor connectivity). So it seems that it only happens when there are connectivity issues, and even then it's pretty intermittent - most of the time it handles it fine.

With emailjs logging enabled these are the last messages we received from emailjs before it just went silent on us:

mail FROM:<[REDACTED]>
rset

If we're ever going to get a test to reproduce this I think you're going to need some sort of test socket that randomly drops packets, hangs or disconnects. Then have a test that retries continuosly forever and just leave it running until it hangs.

I don't think you need any of our queuing system at all, it's not really relevant.

@zackschuster
Copy link
Collaborator

The latest update is that this ran fine for just over a week, including handling periods of bad connections and errors as above, but eventually something triggered it again (during a period of poor connectivity). So it seems that it only happens when there are connectivity issues, and even then it's pretty intermittent - most of the time it handles it fine.

the best kind of bug!

If we're ever going to get a test to reproduce this I think you're going to need some sort of test socket that randomly drops packets, hangs or disconnects. Then have a test that retries continuosly forever and just leave it running until it hangs.

i'll have to do some research on how to set that up. given the randomness, though, i'm worrried the issue is at the os level. i think looking into logging would be a good next step now to try and clear up what exactly is happening when you get your hang.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants