
Postgres deadlocking when multiple processes run CREATE INDEX CONCURRENTLY #960

Open
AkuSilvenius opened this issue Jul 25, 2023 · 9 comments



AkuSilvenius commented Jul 25, 2023

Describe the Bug
When multiple processes run migrations in parallel (e.g. multiple application replicas running m.Up()), having CREATE INDEX CONCURRENTLY as part of the migration script results in a deadlock, leaving behind a dirty database version as well as an INVALID index (the INVALID index is expected, as mentioned in the docs).

Steps to Reproduce
I have added a failing test in my fork to mimic the behavior of multiple processes running the migrations in parallel here - it's basically the same test as the one on the master branch here (which includes a migration file with a CREATE INDEX CONCURRENTLY), just run multiple times in parallel. Setting concurrency = 1 makes the test pass, but any larger concurrency value results in a deadlock and a failing test.
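
For illustration, here is a minimal sketch of the pattern the test mimics - several replicas calling m.Up() against the same database, where the migrations directory contains a CREATE INDEX CONCURRENTLY step. The connection string, migrations path and concurrency value below are just placeholders, not the actual test code:

```go
package main

import (
	"log"
	"sync"

	"github.com/golang-migrate/migrate/v4"
	_ "github.com/golang-migrate/migrate/v4/database/postgres"
	_ "github.com/golang-migrate/migrate/v4/source/file"
)

func main() {
	// Placeholder DSN; the files in ./migrations include a
	// "CREATE INDEX CONCURRENTLY ..." step.
	const dsn = "postgres://postgres:postgres@localhost:5432/app?sslmode=disable"

	const concurrency = 3 // 1 passes; anything larger deadlocks
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			m, err := migrate.New("file://migrations", dsn)
			if err != nil {
				log.Println(err)
				return
			}
			defer m.Close()
			// Every replica races for the advisory lock and runs the same migrations.
			if err := m.Up(); err != nil && err != migrate.ErrNoChange {
				// This is where "pq: deadlock detected" surfaces.
				log.Println(err)
			}
		}()
	}
	wg.Wait()
}
```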

Expected Behavior
Run migrations successfully

Migrate Version
v4.15.2

Loaded Source Drivers
s3, github, gitlab, go-bindata, file, bitbucket, github-ee, godoc-vfs, gcs

Loaded Database Drivers
cockroachdb, neo4j, postgresql, redshift, clickhouse, mysql, pgx, postgres, sqlserver, crdb-postgres, mongodb, spanner, cassandra, cockroach, firebird, firebirdsql, mongodb+srv, stub

Go Version
go version go1.20.3 darwin/arm64

Stacktrace

Failing test output
=== CONT  TestPostgres_ConcurrentMigrations/postgres:12
    dktest.go:35: Pulling image: postgres:12
    dktest.go:53: Error parsing image pull response: context deadline exceeded
    dktest.go:82: Created container: dktest.ContainerInfo{ID:"d3535640fb6417f67c459b688313a75d753bc0b145d921c8bc9a4a1be853c140", Name:"dktest_ZEWaTjfqBG", ImageName:"postgres:12", Ports:[]}
    dktest.go:87: Started container: dktest.ContainerInfo{ID:"d3535640fb6417f67c459b688313a75d753bc0b145d921c8bc9a4a1be853c140", Name:"dktest_ZEWaTjfqBG", ImageName:"postgres:12", Ports:[]}
    dktest.go:97: Inspected container: dktest.ContainerInfo{ID:"d3535640fb6417f67c459b688313a75d753bc0b145d921c8bc9a4a1be853c140", Name:"dktest_ZEWaTjfqBG", ImageName:"postgres:12", Ports:[]}
    migrate_testing.go:30: UP
    migrate_testing.go:30: UP
    migrate_testing.go:30: UP
    migrate_testing.go:32: try lock failed in line 0: SELECT pg_advisory_lock($1) (details: pq: deadlock detected)
    migrate_testing.go:32: try lock failed in line 0: SELECT pg_advisory_lock($1) (details: pq: deadlock detected)
    dktest.go:132: Stopped container: dktest.ContainerInfo{ID:"d3535640fb6417f67c459b688313a75d753bc0b145d921c8bc9a4a1be853c140", Name:"dktest_ZEWaTjfqBG", ImageName:"postgres:12", Ports:[5432/tcp -> :46663]}
    dktest.go:138: Removed container: dktest.ContainerInfo{ID:"d3535640fb6417f67c459b688313a75d753bc0b145d921c8bc9a4a1be853c140", Name:"dktest_ZEWaTjfqBG", ImageName:"postgres:12", Ports:[5432/tcp -> :46663]}
    --- FAIL: TestPostgres_ConcurrentMigrations/postgres:12 (62.80s)

Additional context
This looks to be exactly the same problem that was reported and fixed in the flyway library here.


AkuSilvenius commented Jul 25, 2023

As mentioned in the description, this issue has been observed before in flyway, and the underlying cause seems to be the usage of pg_advisory_lock (source). The comment there mistakenly says // This will wait indefinitely until the lock can be acquired., but the actual outcome is that an error is returned here immediately if the lock is not available, causing the migration process to fail.

The fix here would be to change the locking strategy (for postgres) from pg_advisory_lock to pg_try_advisory_lock in order to implement the "indefinite wait" - this seems to be what flyway did as well.
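
For illustration only (this is not the code from my fork), a rough sketch of what such a polling loop around pg_try_advisory_lock could look like - the poll interval, lock key handling and the way the connection is passed in are arbitrary choices here:

```go
package postgres

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// tryLockWithRetry polls pg_try_advisory_lock until the lock is acquired or
// ctx expires, instead of blocking inside pg_advisory_lock where Postgres'
// deadlock detector can abort the waiter.
func tryLockWithRetry(ctx context.Context, conn *sql.Conn, key int64) error {
	ticker := time.NewTicker(500 * time.Millisecond) // arbitrary poll interval
	defer ticker.Stop()
	for {
		var acquired bool
		err := conn.QueryRowContext(ctx, "SELECT pg_try_advisory_lock($1)", key).Scan(&acquired)
		if err != nil {
			return err
		}
		if acquired {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("gave up waiting for advisory lock %d: %w", key, ctx.Err())
		case <-ticker.C:
			// lock still held by another migrator; try again
		}
	}
}
```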

I'm happy to work on this fix as I already have a fork and a failing test to work with, but wanted to open for discussion here first :)

@jackh-ncl

Thanks for the investigation, @AkuSilvenius 🏆.

I've just been bitten by this as well. Any chance of @AkuSilvenius's fix being merged in?

@AkuSilvenius

Thanks @jackh-ncl, I have a draft PR in #962 with a rather naive approach, but I also haven't been able to run the pipeline successfully (it looks like it's running out of memory), so I've kept the PR as a draft for now.

If someone knows how to fix the problems with the pipeline, I'm happy to continue working on it.

@vmercierfr

Hello,
We are hitting the same issue. Any update on this subject?


dhui commented Jan 25, 2024

> As mentioned in the description, this issue has been observed before in flyway, and the underlying cause seems to be the usage of pg_advisory_lock (source). The comment there mistakenly says // This will wait indefinitely until the lock can be acquired., but the actual outcome is that an error is returned here immediately if the lock is not available, causing the migration process to fail.
>
> The fix here would be to change the locking strategy (for postgres) from pg_advisory_lock to pg_try_advisory_lock in order to implement the "indefinite wait" - this seems to be what flyway did as well.
>
> I'm happy to work on this fix as I already have a fork and a failing test to work with, but wanted to open for discussion here first :)

Do you know why pg_advisory_lock is returning an error instead of waiting indefinitely? What's the error?

@Gibstick

I've been evaluating different migration tools and I looked into this issue, because it seems like every tool was affected by it. As mentioned earlier, the flyway issue is a good explanation and that issue is referenced by many other tools. Sourcegraph has also written a blog post about this exact issue: https://sourcegraph.com/blog/introducing-migrator-service#a-deeper-contributing-factor. I'd love for this to be resolved because other than this little issue, this tool is my top pick.

> Do you know why pg_advisory_lock is returning an error instead of waiting indefinitely? What's the error?

Here is a summary of everything I've read. The error is coming from a deadlock detection mechanism in Postgres. You can see an example of the error in flyway/flyway#1654.

  1. Process A acquires advisory lock
  2. Process B tries to acquire the same advisory lock and waits on process A
  3. Process A does CREATE INDEX CONCURRENTLY...

The implementation of CREATE INDEX CONCURRENTLY has a pessimistic view of which transactions it needs to wait on, and it ends up waiting for process B to finish because

> it must wait for all existing transactions that could potentially modify or use the index to terminate

(https://www.postgresql.org/docs/16/sql-createindex.html#id-1.9.3.69.6.4.2)

So now process A is waiting for process B, process B is waiting for process A, and we have a deadlock. Postgres' deadlock detector steps in and aborts one of the waiters, which is how the error ends up being returned from pg_advisory_lock.
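
If it helps, the same interleaving can be reproduced outside of migrate with two plain connections. This is only a rough sketch - the table, lock key, DSN and sleeps are made up, and which session actually receives the "deadlock detected" error depends on whose deadlock check runs after the cycle has formed:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	ctx := context.Background()
	// Placeholder DSN.
	db, err := sql.Open("postgres", "postgres://postgres:postgres@localhost:5432/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Two dedicated connections stand in for process A and process B.
	connA, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	connB, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}

	connA.ExecContext(ctx, "CREATE TABLE IF NOT EXISTS t (x int)")

	// 1. Process A acquires the advisory lock.
	connA.ExecContext(ctx, "SELECT pg_advisory_lock(42)")

	// 2. Process B blocks waiting on the same lock.
	go func() {
		_, err := connB.ExecContext(ctx, "SELECT pg_advisory_lock(42)")
		log.Println("B:", err) // typically: pq: deadlock detected, after ~deadlock_timeout
	}()
	time.Sleep(200 * time.Millisecond) // give B time to start waiting on the lock

	// 3. Process A runs CREATE INDEX CONCURRENTLY, which waits for B's session,
	//    while B is waiting for A's advisory lock -> deadlock.
	_, err = connA.ExecContext(ctx, "CREATE INDEX CONCURRENTLY IF NOT EXISTS t_x_idx ON t (x)")
	log.Println("A:", err)

	time.Sleep(2 * time.Second) // let B's error get logged before exiting
}
```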

@AkuSilvenius

> Do you know why pg_advisory_lock is returning an error instead of waiting indefinitely? What's the error?

@dhui I had attached the error to the issue description; the error returned from pg_advisory_lock is: migrate_testing.go:32: try lock failed in line 0: SELECT pg_advisory_lock($1) (details: pq: deadlock detected)

And good question - why is an error returned instead of waiting indefinitely, as the comment claims? Even though the docs for pg_advisory_lock say it waits if necessary, that wait is subject to deadlock detection: after deadlock_timeout (default one second, 1s) Postgres checks whether the waiting sessions form a cycle and aborts one of them if they do. There is a nice Stack Overflow answer about this problem here.

@AkuSilvenius

Thanks for the interest in this topic, @vmercierfr @Gibstick :) #962 has been updated based on the earlier feedback


mfridman commented May 9, 2024

It's mostly a drive-by as I occasionally work on teams that use golang-migrate, and this issue is a bit of a nasty case when you hit it.

In terms of the implementation, I'd echo moving towards pg_try_advisory_lock with a polling solution. It appears you're headed in that direction, which is nice!

We ended up doing something similar in goose when we were adding database locking. I'm mainly dropping the issue below as it contains some helpful context ✌️.

pressly/goose#335 (comment)
