Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only update litestream_seq if size is below WAL header size #573

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

benbjohnson
Copy link
Owner

Previously, the litestream_seq could increment on a PASSIVE checkpoint even if the WAL existed and did not need to be created. This caused a loop where the increment would trigger replication. This PR fixes this by only executing the increment if the WAL is below a size threshold.

Fixes #422

Copy link
Collaborator

@hifi hifi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than one question I don't see there being any problems with this, other than needing testing with real data. I could throw this in our dev environment to see it doesn't do anything bad.

@@ -629,7 +629,7 @@ func (db *DB) acquireReadLock() error {
}

// Execute read query to obtain read lock.
if _, err := tx.ExecContext(db.ctx, `SELECT COUNT(1) FROM _litestream_seq;`); err != nil {
if _, err := tx.ExecContext(context.Background(), `SELECT COUNT(1) FROM _litestream_seq;`); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this changed to background context? If the db context gets canceled we'd want the lock to fail fast as well, right?

Copy link
Collaborator

@hifi hifi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With enough data, we've discovered that this commit causes a slow stream of wal overwritten by another process errors which forces new generations.

Here's a Loki grep of that string over a period of 6 hours:

image

The quiet gap is where I reverted this commit from our downstream fork and then the errors came back when I put the commit back. The final quiet 10 minutes is when I removed the commit again.

I don't yet know why this happens but do you have ideas @benbjohnson?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replica gets flooded with WAL segments without writes
2 participants