-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
routing/localchans: fix nested db tx #5643
routing/localchans: fix nested db tx #5643
Conversation
routing/localchans/manager.go
Outdated
@@ -76,20 +76,34 @@ func (r *Manager) UpdatePolicy(newSchema routing.ChannelPolicy, | |||
return nil | |||
} | |||
|
|||
// Add edge to a list. We can't yet apply the new policy, | |||
// because the updateEdge method will open a new db transaction. | |||
edges = append(edges, discovery.EdgeWithInfo{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think edges
should be initialized on first access. We can't be sure if ForAllOutgoingChannels
is ever retried. Even though all of this is going to be ram cached soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm yes interesting. But it can't be fixed here I think? It seems that the deeper graph method needs a reset
function passed in as well. In fact every graph method that accepts a callback needs a reset
function?
There may be non-graph db methods too with a callback. We are accidentally stumbling into this, but perhaps there is more unsafe usage of external variables. The pattern to store items in an external list is quite common.
I do want to re-emphasize my earlier comments btcsuite/btcwallet#757 (comment) and #5484 (comment). I still think that switching to the serializable locking model is a risk that doesn't need to be taken at this time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd just add a flag and check it? If false => create the array, set flag true.
Edit: yeah, that won't help, need a closure, it's pretty simple to add.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may be non-graph db methods too with a callback. We are accidentally stumbling into this, but perhaps there is more unsafe usage of external variables. The pattern to store items in an external list is quite common.
Also re: accidentally stumbling: I think during the fully remote PR we discussed that we need to make sure all DB transactions are in their closed forms and retries are handled. It may have been forgotten somehow as there was too many other threads of dicussions. Maybe worth merging this PR with #5642 and doing these fixes there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@guggero what do you think? I am not sure if it is a good idea to leak details of the database locking model into the graph interface.
Is there an issue in the repo to check whether all transactions are in their closed forms? I had an idea that could help with that: modify the code to artificially always run the code inside each transaction twice. It is then as if everything needs a retry. Itests, bench, etc, should still work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a tx kvdb.RTx parameter to the FetchChannel method of the channel DB.
That is a great suggestion. The whole thing will be retried then with a fresh slice.
What I don't understand is: Why did this never lead to a deadlock with bbolt?
Good question. I will look into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That the deadlock doesn't occur with bbolt almost makes me think that somehow two different locks are being used. As if there are two instances of something.
This doesn't seem to be the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am now looking at indeed the type of lock that postgres uses. Maybe you are right @guggero that it isn't exactly the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explanation why getting an RLock
twice can lead to a deadlock: https://github.com/sasha-s/go-deadlock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same problem exists for bbolt. It just doesn't show up in the itest because timing is different. For a recursive read lock to go wrong, there needs to be a write that comes in between. By adding a sleep(5)
inside the ForAllOutgoingChannels
callback, the issue reproduces with bbolt on master.
Nice fix, I remember seeing this bug once before during the initial |
358befd
to
241c32f
Compare
46de99b
to
ffa7b04
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice investigation and fix! We'll need to think of a way to detect and fix other such cases, going to track this in #5557.
This particular case looks good to me 💯
bb4008c
to
a5446f0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, nice to have this fixed! 👍
One minor thing I'd like to suggest is to squash the two commits and separate the release note update as it's a bit easier to follow the changes that way.
I just separated them this morning :) I thought it would be easier for you to read if the non-functional refactor is on its own. |
Needs rebase. |
Adds an optional tx parameter to ForAllOutgoingChannels and FetchChannel so that data can be queried within the context of an existing database transaction.
This commit fixes a bug where a db tx is opened within another db tx.
a5446f0
to
bd07b5f
Compare
Rebased |
Going to merge this as both itest failures are known flakes. |
This PR fixes a bug where a db tx is opened within another db tx.