Ensure raft storage lock file is updated atomically #10683
This is required to prevent empty lock files during restart if the system crashed before the lock content was written to the file.
Thanks @deepthidevaki as always for the quick fix! Please consider my comments before merging :)
    StandardOpenOption.SYNC);

// If two nodes tries to acquire lock, move will fail with FileAlreadyExistsException
FileUtil.moveDurably(
❓ I remember a discussion that this was only supported in some environments, wasn't there something like that? Only on Linux, maybe? Is this an issue?
Do you know what exactly was the problem?
Flushing the parent directory does not work on Windows. But according to what is documented in FileUtil, this is ok.
We are using FileUtil#moveDurably in other places as well. So far no problems have been reported, so I guess it should be ok. I will merge this PR. If we see or learn of any problems later, let's tackle them then.
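For readers unfamiliar with the pattern discussed in this thread, here is a minimal sketch of a move followed by a best-effort flush of the parent directory. It is not the actual `FileUtil#moveDurably` implementation; the class and method names are hypothetical, and treating a failed directory flush as non-fatal is an assumption based on the Windows remark above.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch, NOT the project's FileUtil.
final class DurableMoveSketch {

  // Moves source to target atomically, then tries to flush the parent directory
  // so that the new directory entry itself is persisted.
  static void moveDurablySketch(final Path source, final Path target) throws IOException {
    Files.move(source, target, StandardCopyOption.ATOMIC_MOVE);
    final Path parent = target.toAbsolutePath().getParent();
    if (parent != null) {
      flushDirectory(parent);
    }
  }

  // Returns true if the directory could be opened and flushed. Some platforms
  // (Windows, per the discussion above) do not allow this, so a failure is
  // reported to the caller instead of being thrown (assumption).
  static boolean flushDirectory(final Path directory) {
    try (FileChannel channel = FileChannel.open(directory, StandardOpenOption.READ)) {
      channel.force(true);
      return true;
    } catch (final IOException e) {
      return false;
    }
  }
}
```

Returning a boolean from `flushDirectory` keeps the sketch usable on platforms where a directory cannot be opened as a channel, while still letting a caller log or react to the missing flush.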
}

@Test
public void canAcquireLockOnDirectoryLockedBySameNode() {
💭 Just an idea for another test: if we made the file writing injectable (via dependency injection), we could make the write fail and test that a failed write no longer locks the storage.
🤔 Yeah, maybe. I will pass on it for now. It would also be good if we could inject a mock filesystem in which we can simulate all kinds of failures.
Agreed, but then we should no longer use the Files class directly :) or at least use some wrapper around it :)
I think @npepinpe had looked into similar ideas for testing the journal.
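To make the injectable-writer idea from this thread concrete, here is a small sketch. Everything in it is hypothetical (the `LockWriter` interface, the `tryLock` method, and the `.lock` / `.lock.tmp` file names); it only illustrates how a test could simulate a failed write and check that no lock file is left behind.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of an injectable write step; not code from this PR.
final class InjectableLockWriterSketch {

  // The write step is abstracted so a test can replace it with a failing one.
  @FunctionalInterface
  interface LockWriter {
    void write(Path file, byte[] content) throws IOException;
  }

  // Default writer: create the file and write through to storage (SYNC).
  static final LockWriter SYNC_WRITER =
      (file, content) ->
          Files.write(
              file,
              content,
              StandardOpenOption.CREATE,
              StandardOpenOption.TRUNCATE_EXISTING,
              StandardOpenOption.WRITE,
              StandardOpenOption.SYNC);

  // Returns true only if the content was written and moved into place.
  static boolean tryLock(final Path directory, final String nodeId, final LockWriter writer) {
    final Path tmpFile = directory.resolve(".lock.tmp"); // assumed name
    final Path lockFile = directory.resolve(".lock"); // assumed name
    try {
      writer.write(tmpFile, nodeId.getBytes(StandardCharsets.UTF_8));
      Files.move(tmpFile, lockFile, StandardCopyOption.ATOMIC_MOVE);
      return true;
    } catch (final IOException e) {
      try {
        Files.deleteIfExists(tmpFile); // clean up a partially written temp file
      } catch (final IOException ignored) {
        // nothing more to do in this sketch
      }
      return false;
    }
  }
}
```

A test could then pass a writer such as `(file, content) -> { throw new IOException("simulated"); }` and assert that `tryLock` returns `false` and that no `.lock` file exists in the directory.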
bors merge
Build succeeded:
Successfully created backport PR #10703 for
Successfully created backport PR #10704 for
Description
Previously, creating the lock file and writing its contents were not done atomically. Moreover, the contents of the file were not flushed immediately. Because of this, if the pod restarts, there is a chance that the lock file exists but is empty. As a result, a new lock cannot be acquired and the partition startup fails.
To fix this, we first write to a temporary file with the "SYNC" option and then move the file atomically to the actual lock file.
Existing tests are refactored. No new test is added to verify this, as it is difficult to simulate crashes while acquiring the lock.
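The write-then-move approach described above can be sketched roughly as follows. This is a simplified illustration using only `java.nio.file.Files`, not the code changed in this PR; the class name, method name, and the `.lock` / `.lock.tmp` file names are assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of "write to a temp file with SYNC, then move atomically".
final class AtomicLockFileSketch {

  static Path writeLockAtomically(final Path directory, final String lockContent)
      throws IOException {
    // The temp file lives in the same directory as the lock file, because an
    // atomic move generally requires source and target on the same file system.
    final Path tmpFile = directory.resolve(".lock.tmp"); // assumed name
    final Path lockFile = directory.resolve(".lock"); // assumed name

    // SYNC asks that content and metadata updates be written synchronously to
    // the underlying storage device.
    Files.write(
        tmpFile,
        lockContent.getBytes(StandardCharsets.UTF_8),
        StandardOpenOption.CREATE,
        StandardOpenOption.TRUNCATE_EXISTING,
        StandardOpenOption.WRITE,
        StandardOpenOption.SYNC);

    // The lock file appears only with its full content, never empty. Note that
    // with ATOMIC_MOVE, whether an existing target is replaced or the move
    // fails is implementation specific according to the Files.move Javadoc.
    return Files.move(tmpFile, lockFile, StandardCopyOption.ATOMIC_MOVE);
  }
}
```

If a crash happens before the move, only the temporary file may be left behind, so a reader of the lock file should see either no lock file or a complete one.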
Related issues
closes #10681
Definition of Done
Not all items need to be done depending on the issue and the pull request.
Code changes:
backport stable/1.3
) to the PR, in case that fails you need to create backports manually.
Testing:
Documentation:
Please refer to our review guidelines.