Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

routing/localchans: fix nested db tx #5643

Merged
merged 2 commits into from
Aug 24, 2021

Conversation

joostjager
Copy link
Collaborator

This PR fixes a bug where a db tx is opened within another db tx.

@joostjager joostjager mentioned this pull request Aug 19, 2021
@@ -76,20 +76,34 @@ func (r *Manager) UpdatePolicy(newSchema routing.ChannelPolicy,
return nil
}

// Add edge to a list. We can't yet apply the new policy,
// because the updateEdge method will open a new db transaction.
edges = append(edges, discovery.EdgeWithInfo{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think edges should be initialized on first access. We can't be sure if ForAllOutgoingChannels is ever retried. Even though all of this is going to be ram cached soon.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yes interesting. But it can't be fixed here I think? It seems that the deeper graph method needs a reset function passed in as well. In fact every graph method that accepts a callback needs a reset function?

There may be non-graph db methods too with a callback. We are accidentally stumbling into this, but perhaps there is more unsafe usage of external variables. The pattern to store items in an external list is quite common.

I do want to re-emphasize my earlier comments btcsuite/btcwallet#757 (comment) and #5484 (comment). I still think that switching to the serializable locking model is a risk that doesn't need to be taken at this time.

Copy link
Collaborator

@bhandras bhandras Aug 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just add a flag and check it? If false => create the array, set flag true.

Edit: yeah, that won't help, need a closure, it's pretty simple to add.

Copy link
Collaborator

@bhandras bhandras Aug 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be non-graph db methods too with a callback. We are accidentally stumbling into this, but perhaps there is more unsafe usage of external variables. The pattern to store items in an external list is quite common.

Also re: accidentally stumbling: I think during the fully remote PR we discussed that we need to make sure all DB transactions are in their closed forms and retries are handled. It may have been forgotten somehow as there was too many other threads of dicussions. Maybe worth merging this PR with #5642 and doing these fixes there.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@guggero what do you think? I am not sure if it is a good idea to leak details of the database locking model into the graph interface.

Is there an issue in the repo to check whether all transactions are in their closed forms? I had an idea that could help with that: modify the code to artificially always run the code inside each transaction twice. It is then as if everything needs a retry. Itests, bench, etc, should still work.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a tx kvdb.RTx parameter to the FetchChannel method of the channel DB.

That is a great suggestion. The whole thing will be retried then with a fresh slice.

What I don't understand is: Why did this never lead to a deadlock with bbolt?

Good question. I will look into it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That the deadlock doesn't occur with bbolt almost makes me think that somehow two different locks are being used. As if there are two instances of something.

This doesn't seem to be the case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am now looking at indeed the type of lock that postgres uses. Maybe you are right @guggero that it isn't exactly the same.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explanation why getting an RLock twice can lead to a deadlock: https://github.com/sasha-s/go-deadlock

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same problem exists for bbolt. It just doesn't show up in the itest because timing is different. For a recursive read lock to go wrong, there needs to be a write that comes in between. By adding a sleep(5) inside the ForAllOutgoingChannels callback, the issue reproduces with bbolt on master.

@bhandras
Copy link
Collaborator

bhandras commented Aug 19, 2021

Nice fix, I remember seeing this bug once before during the initial etcd port but totally forgot about it since we were using mixed local/remote mode.

@joostjager joostjager force-pushed the updatepolicy-nested-tx branch 3 times, most recently from 46de99b to ffa7b04 Compare August 23, 2021 10:19
Copy link
Collaborator

@guggero guggero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice investigation and fix! We'll need to think of a way to detect and fix other such cases, going to track this in #5557.

This particular case looks good to me 💯

chanbackup/backup.go Show resolved Hide resolved
@joostjager joostjager force-pushed the updatepolicy-nested-tx branch 3 times, most recently from bb4008c to a5446f0 Compare August 23, 2021 11:43
Copy link
Collaborator

@bhandras bhandras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice to have this fixed! 👍

One minor thing I'd like to suggest is to squash the two commits and separate the release note update as it's a bit easier to follow the changes that way.

@joostjager
Copy link
Collaborator Author

joostjager commented Aug 23, 2021

squash the two commits

I just separated them this morning :)

I thought it would be easier for you to read if the non-functional refactor is on its own.

@bhandras
Copy link
Collaborator

bhandras commented Aug 24, 2021

Needs rebase.

Adds an optional tx parameter to ForAllOutgoingChannels and FetchChannel
so that data can be queried within the context of an existing database
transaction.
This commit fixes a bug where a db tx is opened within another db tx.
@joostjager
Copy link
Collaborator Author

Rebased

@guggero
Copy link
Collaborator

guggero commented Aug 24, 2021

Going to merge this as both itest failures are known flakes.

@guggero guggero merged commit 93d12cd into lightningnetwork:master Aug 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants