
poc: transaction executor #5273

Draft · wants to merge 2 commits into main
Conversation

@andreas-unleash (Contributor) commented Nov 6, 2023

Thinking about the deadlock pain we have been experiencing, I think this will help.

Creates a transaction executor that can handle transient errors (like deadlocks) with a configurable number of retries and exponential backoff.

Signed-off-by: andreas-unleash <andreas@getunleash.ai>
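
For discussion, a minimal sketch of the idea, assuming a Knex-style db.transaction(fn) API; the helper names, retry count, and base delay are illustrative and not the PR's actual implementation:

    // Sketch: run a unit of work in a transaction and retry it on transient
    // errors (deadlocks, serialization failures) with exponential backoff.
    import { Knex } from 'knex';

    const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

    // Minimal transient check; a fuller version is discussed further down.
    const isTransientError = (error: any): boolean =>
        error?.code === '40001' || error?.code === '40P01';

    async function executeTransaction<T>(
        db: Knex,
        work: (trx: Knex.Transaction) => Promise<T>,
        maxRetries = 3,
        baseDelayMs = 100,
    ): Promise<T> {
        for (let attempt = 0; ; attempt++) {
            try {
                return await db.transaction(work);
            } catch (error) {
                if (attempt >= maxRetries || !isTransientError(error)) {
                    throw error;
                }
                // Exponential backoff: 100 ms, 200 ms, 400 ms, ...
                await delay(baseDelayMs * 2 ** attempt);
            }
        }
    }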

Inline review thread on the diff:

     */
    private isTransientError(error: any): boolean {
        const transientErrors: { [key: string]: string } = {
            '40001': 'serialization_failure',
Contributor:
Why is this transient?

@andreas-unleash (author):
Serialization failure is an error condition that arises in scenarios involving concurrent database transactions.
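
For reference, a check along these lines typically keys off the PostgreSQL SQLSTATE codes surfaced on error.code by node-postgres; this is a sketch rather than the exact code in the diff:

    // Sketch of a transient-error check based on PostgreSQL SQLSTATE codes.
    // 40001 (serialization_failure) and 40P01 (deadlock_detected) are the
    // classic "safe to retry from the start" conditions; 55P03
    // (lock_not_available) is sometimes treated the same way.
    function isTransientError(error: any): boolean {
        const transientErrors: { [key: string]: string } = {
            '40001': 'serialization_failure',
            '40P01': 'deadlock_detected',
            '55P03': 'lock_not_available',
        };
        return Boolean(error?.code && transientErrors[error.code]);
    }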

@gastonfournier (Contributor) left a comment:

The code looks great and elegant, but I'm not fully convinced this will actually solve our problem (are we going to hit a deadlock 3 times, increasing the load on our system?), and it may also negatively impact our ability to scale. I'd prefer keeping things as they are right now and trying to deal with the deadlocks in a different way (we should be able to fix the deadlock itself).

If we were to merge this (which I'd say we should not), I'd like to see some tests.

    console.error(
        `Transaction failed: ${error.message}. Retrying in ${delayMillis} ms.`,
    );
    await delay(delayMillis);
Contributor:
This will hold the connection open for longer, which reduces our ability to answer more requests. It's not only about the increased latency but more about the server's ability to handle requests.

@andreas-unleash (author):
Makes sense, also adding the exponential backoff to the clients instead.
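
If retries move out to the callers, the usual shape is exponential backoff with jitter so that concurrent clients don't retry in lockstep; a generic sketch (names and limits are assumptions, not existing code):

    // Full-jitter backoff: wait a random amount up to an exponentially
    // growing cap, so competing clients spread their retries out.
    function backoffWithJitter(attempt: number, baseMs = 100, capMs = 5000): number {
        const cap = Math.min(capMs, baseMs * 2 ** attempt);
        return Math.floor(Math.random() * cap);
    }

    // Usage: await new Promise((r) => setTimeout(r, backoffWithJitter(attempt)));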

@andreas-unleash (author):

It's not my intention to merge, just wanted to start a discussion on better handling some db errors (since we are keeping away from orm).

@andreas-unleash (author) commented Nov 15, 2023

I wonder if our problems would go away if we added ISOLATION LEVEL SERIALIZABLE to (some of) our transactions (e.g. the ones triggered by a scheduler).
Basically it forces them to behave as if they ran serially even in concurrent situations; conflicting transactions are rolled back with a serialization failure (error code 40001) and can be retried.

    await trx.raw('SET TRANSACTION ISOLATION LEVEL SERIALIZABLE');

There are things to consider, however:

  • Performance Impact: While providing the highest level of data integrity, the Serializable Isolation level can have a performance impact, especially in a highly concurrent environment, because it can lead to increased transaction rollbacks and retries.

  • Handling Rollbacks: Since Serializable transactions are more prone to rollbacks due to serialization failures (error code 40001), applications should be designed to gracefully handle and potentially retry transactions that are safe to retry.
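
A sketch of what that could look like around a knex transaction, using the raw statement above and retrying only on serialization failures; the retry limit and helper name are assumptions for illustration, not part of this PR:

    // Sketch: run work under SERIALIZABLE isolation, retrying on SQLSTATE 40001.
    import { Knex } from 'knex';

    async function runSerializable<T>(
        db: Knex,
        work: (trx: Knex.Transaction) => Promise<T>,
        retries = 3,
    ): Promise<T> {
        for (let attempt = 0; ; attempt++) {
            try {
                return await db.transaction(async (trx) => {
                    // Must run before the transaction's first query.
                    await trx.raw('SET TRANSACTION ISOLATION LEVEL SERIALIZABLE');
                    return work(trx);
                });
            } catch (error: any) {
                // 40001 = serialization_failure; safe to retry from the start.
                if (error?.code !== '40001' || attempt >= retries) throw error;
            }
        }
    }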

@gastonfournier (Contributor):

> It's not my intention to merge, just wanted to start a discussion on better handling some db errors (since we are keeping away from orm).

:) 👍 great then! We can keep it going, but I'm also wondering how big of a problem is it right now, and whether we identified where this happens

> I wonder if our problems would go away if we added ISOLATION LEVEL SERIALIZABLE to (some of) our transactions (e.g. the ones triggered by a scheduler). [...]

Maybe it is worth trying something like this... Every time I work with databases I have to do some memory refreshing, because it's not my area of expertise. Something like this can be tested through load tests to validate there's no performance regression, but that won't be as good as real production testing. I know in some languages like Java it could be hard to put this behind a feature flag, but in Node it's maybe not that difficult. In that case we could test it by targeting specifically the customer hitting the issue most frequently...

@andreas-unleash (author):

So, here are my 2 cents about what I think is happening:


So the only place we get the deadlocks as of now is when trying to commit lastSeen metrics. My hypothesis is that this is happening when 2 pods are sending the same request at the same time, ending up with 2 transactions in deadlock. Sorting the entries helped (which tells me even a 1 sec delay/retry would get them out of the deadlock).
This is happening for big-ish users (who are also the most likely to have some sort of clock sync going on in their infrastructure).

Theoretically any scheduled task could end up in this state given enough data to persist without (some) transaction management/error handling.

I will try to simulate the conditions and give it a go. :)
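
For reference, the sorting mentioned above helps because two transactions that touch the same rows in the same order can't each hold a lock the other one needs. A sketch of that idea; the entry shape and helper name are illustrative, not the actual lastSeen store:

    // Sketch: sort last-seen entries by a stable key before writing, so that
    // concurrent transactions acquire row locks in the same order and cannot
    // deadlock each other.
    interface LastSeenEntry {
        featureName: string;
        environment: string;
        lastSeenAt: Date;
    }

    function sortForStableLockOrder(entries: LastSeenEntry[]): LastSeenEntry[] {
        return [...entries].sort(
            (a, b) =>
                a.featureName.localeCompare(b.featureName) ||
                a.environment.localeCompare(b.environment),
        );
    }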

@gastonfournier (Contributor):

> So the only place we get the deadlocks as of now is when trying to commit lastSeen metrics. [...] Theoretically any scheduled task could end up in this state given enough data to persist without (some) transaction management/error handling.

If this is the only case where this happens, I wouldn't make all operations serializable. I think it's fine to lose some "lastSeen" events; it just means last seen will be eventually consistent. I believe the main issue is the errors triggering alerts. But since last seen is eventually consistent anyway, I wonder if we can batch the updates in memory and then have a single process persist them on a regular cadence. When returning the last-seen metric we can just read from memory and go to the DB on a miss (using memory as a read-through cache). I can still see 2 of our pods trying to write to the DB at the same time, but that can be handled with jitter, and maybe here it makes sense to do retries...

> I will try to simulate the conditions and give it a go. :)

Sure, but consider the other option I just shared. If you want we can talk about this problem and do some brainstorming
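
A rough sketch of the batching idea above, with hypothetical names (recordLastSeen, flushLastSeen, and writeBatchToDb are not existing Unleash APIs); reads would hit the in-memory map first, and a single interval flushes the buffer with a small random jitter so pods don't all write at once:

    // Sketch: buffer last-seen updates in memory and flush them on a cadence.
    const buffer = new Map<string, Date>();

    function recordLastSeen(key: string, seenAt: Date): void {
        const existing = buffer.get(key);
        if (!existing || seenAt > existing) {
            buffer.set(key, seenAt);
        }
    }

    async function flushLastSeen(
        persist: (batch: Array<[string, Date]>) => Promise<void>,
    ): Promise<void> {
        if (buffer.size === 0) return;
        const batch = [...buffer.entries()];
        buffer.clear();
        // Jitter the flush so multiple pods don't write at the same moment.
        await new Promise((resolve) => setTimeout(resolve, Math.random() * 2000));
        await persist(batch);
    }

    // e.g. setInterval(() => flushLastSeen(writeBatchToDb), 30_000);  // writeBatchToDb is hypothetical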

@ivarconr marked this pull request as draft May 7, 2024 21:57