
Postgres deadlocking when multiple processes run CREATE INDEX CONCURRENTLY #960

Open
AkuSilvenius opened this issue Jul 25, 2023 · 9 comments



AkuSilvenius commented Jul 25, 2023

Describe the Bug
When multiple processes run migrations in parallel (e.g. multiple application replicas running m.Up()), having CREATE INDEX CONCURRENTLY as part of the migration script results in a deadlock, leaving behind a dirty database version as well as an INVALID index (the INVALID index is expected, as mentioned in the docs).

Steps to Reproduce
I have added a failing test in my fork to mimic the behavior of multiple processes running the migrations in parallel here - it's basically the same test as the one on the master branch here (which includes a migration file with a CREATE INDEX CONCURRENTLY), just run multiple times in parallel. Setting concurrency = 1 makes the test pass, but any larger concurrency value results in a deadlock and a failing test.
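
For illustration, here is a minimal sketch of the pattern the test mimics - several replicas calling m.Up() against the same database, where the migrations directory contains a CREATE INDEX CONCURRENTLY step. The connection string, migrations path and concurrency value below are just placeholders, not the actual test code:

```go
package main

import (
	"log"
	"sync"

	"github.com/golang-migrate/migrate/v4"
	_ "github.com/golang-migrate/migrate/v4/database/postgres"
	_ "github.com/golang-migrate/migrate/v4/source/file"
)

func main() {
	// Placeholder DSN; the files in ./migrations include a
	// "CREATE INDEX CONCURRENTLY ..." step.
	const dsn = "postgres://postgres:postgres@localhost:5432/app?sslmode=disable"

	const concurrency = 3 // 1 passes; anything larger deadlocks
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			m, err := migrate.New("file://migrations", dsn)
			if err != nil {
				log.Println(err)
				return
			}
			defer m.Close()
			// Every replica races for the advisory lock and runs the same migrations.
			if err := m.Up(); err != nil && err != migrate.ErrNoChange {
				// This is where "pq: deadlock detected" surfaces.
				log.Println(err)
			}
		}()
	}
	wg.Wait()
}
```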

Expected Behavior
Run migrations successfully

Migrate Version
v4.15.2

Loaded Source Drivers
s3, github, gitlab, go-bindata, file, bitbucket, github-ee, godoc-vfs, gcs

Loaded Database Drivers
cockroachdb, neo4j, postgresql, redshift, clickhouse, mysql, pgx, postgres, sqlserver, crdb-postgres, mongodb, spanner, cassandra, cockroach, firebird, firebirdsql, mongodb+srv, stub

Go Version
go version go1.20.3 darwin/arm64

Stacktrace

Failing test output
=== CONT  TestPostgres_ConcurrentMigrations/postgres:12
    dktest.go:35: Pulling image: postgres:12
    dktest.go:53: Error parsing image pull response: context deadline exceeded
    dktest.go:82: Created container: dktest.ContainerInfo{ID:"d3535640fb6417f67c459b688313a75d753bc0b145d921c8bc9a4a1be853c140", Name:"dktest_ZEWaTjfqBG", ImageName:"postgres:12", Ports:[]}
    dktest.go:87: Started container: dktest.ContainerInfo{ID:"d3535640fb6417f67c459b688313a75d753bc0b145d921c8bc9a4a1be853c140", Name:"dktest_ZEWaTjfqBG", ImageName:"postgres:12", Ports:[]}
    dktest.go:97: Inspected container: dktest.ContainerInfo{ID:"d3535640fb6417f67c459b688313a75d753bc0b145d921c8bc9a4a1be853c140", Name:"dktest_ZEWaTjfqBG", ImageName:"postgres:12", Ports:[]}
    migrate_testing.go:30: UP
    migrate_testing.go:30: UP
    migrate_testing.go:30: UP
    migrate_testing.go:32: try lock failed in line 0: SELECT pg_advisory_lock($1) (details: pq: deadlock detected)
    migrate_testing.go:32: try lock failed in line 0: SELECT pg_advisory_lock($1) (details: pq: deadlock detected)
    dktest.go:132: Stopped container: dktest.ContainerInfo{ID:"d3535640fb6417f67c459b688313a75d753bc0b145d921c8bc9a4a1be853c140", Name:"dktest_ZEWaTjfqBG", ImageName:"postgres:12", Ports:[5432/tcp -> :46663]}
    dktest.go:138: Removed container: dktest.ContainerInfo{ID:"d3535640fb6417f67c459b688313a75d753bc0b145d921c8bc9a4a1be853c140", Name:"dktest_ZEWaTjfqBG", ImageName:"postgres:12", Ports:[5432/tcp -> :46663]}
    --- FAIL: TestPostgres_ConcurrentMigrations/postgres:12 (62.80s)

Additional context
This looks to be exactly the same problem that was reported and fixed in the flyway library here.


AkuSilvenius commented Jul 25, 2023

As mentioned in the description, this issue has been observed before in flyway, and the underlying cause seems to be the usage of pg_advisory_lock (source). The comment there mistakenly says // This will wait indefinitely until the lock can be acquired., but the actual outcome is that an error is returned here immediately if the lock is not available, causing the migration process to fail.

The fix here would be to change the locking strategy (for postgres) from pg_advisory_lock to pg_try_advisory_lock in order to implement the "indefinite wait" - this seems to be what flyway did as well.
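
For illustration only (this is not the code from my fork), a rough sketch of what such a polling loop around pg_try_advisory_lock could look like - the poll interval, lock key handling and the way the connection is passed in are arbitrary choices here:

```go
package postgres

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// tryLockWithRetry polls pg_try_advisory_lock until the lock is acquired or
// ctx expires, instead of blocking inside pg_advisory_lock where Postgres'
// deadlock detector can abort the waiter.
func tryLockWithRetry(ctx context.Context, conn *sql.Conn, key int64) error {
	ticker := time.NewTicker(500 * time.Millisecond) // arbitrary poll interval
	defer ticker.Stop()
	for {
		var acquired bool
		err := conn.QueryRowContext(ctx, "SELECT pg_try_advisory_lock($1)", key).Scan(&acquired)
		if err != nil {
			return err
		}
		if acquired {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("gave up waiting for advisory lock %d: %w", key, ctx.Err())
		case <-ticker.C:
			// lock still held by another migrator; try again
		}
	}
}
```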

I'm happy to work on this fix as I already have a fork and a failing test to work with, but wanted to open for discussion here first :)

@jackh-ncl

Thanks for the investigation, @AkuSilvenius 🏆.

I've just been bitten by this as well. Any chance of @AkuSilvenius's fix being merged in?

@AkuSilvenius

Thanks @jackh-ncl, I have a draft PR in #962 with a rather naive approach, but I also haven't been able to run the pipeline successfully (it looks like it's running out of memory), so I've kept the PR as a draft for now.

If someone knows how to fix the problems with the pipeline, I'm happy to continue working on it.

@vmercierfr

Hello,
We are hitting the same issue. Any update on this subject?


dhui commented Jan 25, 2024

> As mentioned in the description, this issue has been observed before in flyway, and the underlying cause seems to be the usage of pg_advisory_lock (source). The comment there mistakenly says // This will wait indefinitely until the lock can be acquired., but the actual outcome is that an error is returned here immediately if the lock is not available, causing the migration process to fail.
>
> The fix here would be to change the locking strategy (for postgres) from pg_advisory_lock to pg_try_advisory_lock in order to implement the "indefinite wait" - this seems to be what flyway did as well.
>
> I'm happy to work on this fix as I already have a fork and a failing test to work with, but wanted to open for discussion here first :)

Do you know why pg_advisory_lock is returning an error instead of waiting indefinitely? What's the error?

@Gibstick

I've been evaluating different migration tools and I looked into this issue, because it seems like every tool was affected by it. As mentioned earlier, the flyway issue is a good explanation and that issue is referenced by many other tools. Sourcegraph has also written a blog post about this exact issue: https://sourcegraph.com/blog/introducing-migrator-service#a-deeper-contributing-factor. I'd love for this to be resolved because other than this little issue, this tool is my top pick.

> Do you know why pg_advisory_lock is returning an error instead of waiting indefinitely? What's the error?

Here is a summary of everything I've read. The error is coming from a deadlock detection mechanism in Postgres. You can see an example of the error in flyway/flyway#1654.

  1. Process A acquires advisory lock
  2. Process B tries to acquire the same advisory lock and waits on process A
  3. Process A does CREATE INDEX CONCURRENTLY...

The implementation of CREATE INDEX CONCURRENTLY has a pessimistic view of which transactions it needs to wait on, and it ends up waiting for process B to finish because

> it must wait for all existing transactions that could potentially modify or use the index to terminate

(https://www.postgresql.org/docs/16/sql-createindex.html#id-1.9.3.69.6.4.2)

So now process A is waiting for process B, process B is waiting for process A, and we have a deadlock. Postgres' deadlock detector steps in and aborts one of the waiters, which is how the error ends up being returned from pg_advisory_lock.
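
If it helps, the same interleaving can be reproduced outside of migrate with two plain connections. This is only a rough sketch - the table, lock key, DSN and sleeps are made up, and which session actually receives the "deadlock detected" error depends on whose deadlock check runs after the cycle has formed:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	ctx := context.Background()
	// Placeholder DSN.
	db, err := sql.Open("postgres", "postgres://postgres:postgres@localhost:5432/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Two dedicated connections stand in for process A and process B.
	connA, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	connB, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}

	connA.ExecContext(ctx, "CREATE TABLE IF NOT EXISTS t (x int)")

	// 1. Process A acquires the advisory lock.
	connA.ExecContext(ctx, "SELECT pg_advisory_lock(42)")

	// 2. Process B blocks waiting on the same lock.
	go func() {
		_, err := connB.ExecContext(ctx, "SELECT pg_advisory_lock(42)")
		log.Println("B:", err) // typically: pq: deadlock detected, after ~deadlock_timeout
	}()
	time.Sleep(200 * time.Millisecond) // give B time to start waiting on the lock

	// 3. Process A runs CREATE INDEX CONCURRENTLY, which waits for B's session,
	//    while B is waiting for A's advisory lock -> deadlock.
	_, err = connA.ExecContext(ctx, "CREATE INDEX CONCURRENTLY IF NOT EXISTS t_x_idx ON t (x)")
	log.Println("A:", err)

	time.Sleep(2 * time.Second) // let B's error get logged before exiting
}
```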

@AkuSilvenius

> Do you know why pg_advisory_lock is returning an error instead of waiting indefinitely? What's the error?

@dhui I had attached the error to the issue description; the error returned from pg_advisory_lock is: migrate_testing.go:32: try lock failed in line 0: SELECT pg_advisory_lock($1) (details: pq: deadlock detected)

And good question - why is an error returned instead of waiting indefinitely, as the comment claims? Even though the docs for pg_advisory_lock say it waits if necessary, that wait is subject to deadlock detection: after deadlock_timeout (default one second, 1s) Postgres checks whether the waiting sessions form a cycle and aborts one of them if they do. There is a nice Stack Overflow answer about this problem here.

@AkuSilvenius

Thanks for the interest in this topic, @vmercierfr @Gibstick :) #962 has been updated based on the earlier feedback


mfridman commented May 9, 2024

It's mostly a drive-by as I occasionally work on teams that use golang-migrate, and this issue is a bit of a nasty case when you hit it.

In terms of the implementation, I'd echo moving towards pg_try_advisory_lock with a polling solution. It appears you're headed in that direction, which is nice!

We ended up doing something similar in goose when we were adding database locking. I'm mainly dropping the issue below as it contains some helpful context ✌️.

pressly/goose#335 (comment)
