Ensure raft storage lock file is update atomically #10683

Merged (2 commits) on Oct 13, 2022

deepthidevaki
Contributor

Description

Previously, creating the lock file and updating its contents were not done atomically. Moreover, the contents of the file were not flushed immediately. Because of this, if the pod restarts, there is a chance that the lock file exists but is empty. As a result, a new lock cannot be acquired and the partition startup fails.

To fix this, we first write to a temporary file with the "SYNC" option and then move the file atomically to the actual lock file.
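The write-then-move approach can be sketched with plain `java.nio` as below. This is a minimal illustration, not the code in this PR: the PR uses `FileUtil.moveDurably`, while the sketch uses `Files.move` with `ATOMIC_MOVE`, and the class name, method name, and the `.lock` / `.tmp` file names are assumptions made up for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

final class AtomicLockFileSketch {

  // Hypothetical helper: publishes lockContent at <directory>/<lockFileName>
  // only after the content has been fully written to a temporary file.
  static Path writeLockFile(Path directory, String lockFileName, String lockContent)
      throws IOException {
    final Path lockFile = directory.resolve(lockFileName);
    // Keep the temporary file in the same directory so the move stays on one file system.
    final Path tmpFile = directory.resolve(lockFileName + ".tmp");

    // SYNC requests that every update to content and metadata is written
    // synchronously to the underlying storage device.
    Files.writeString(
        tmpFile,
        lockContent,
        StandardCharsets.UTF_8,
        StandardOpenOption.CREATE,
        StandardOpenOption.WRITE,
        StandardOpenOption.TRUNCATE_EXISTING,
        StandardOpenOption.SYNC);

    // ATOMIC_MOVE fails with AtomicMoveNotSupportedException if the file system
    // cannot move atomically. Whether an existing target is replaced or the
    // move fails is implementation specific.
    Files.move(tmpFile, lockFile, StandardCopyOption.ATOMIC_MOVE);
    return lockFile;
  }

  public static void main(String[] args) throws IOException {
    final Path dir = Files.createTempDirectory("lock-sketch");
    final Path lock = writeLockFile(dir, ".lock", "node-1");
    System.out.println("lock content: " + Files.readString(lock));
  }
}
```

With this ordering, a crash before the move leaves at most a stale temporary file behind, so a lock file that exists should not be empty.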

Existing tests are refactored. No new test is added to verify this, as it is difficult to simulate crashes while acquiring the lock.

Related issues

closes #10681

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

  • The changes are backwards compatible with previous versions
  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR. In case that fails, you need to create backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criteria of the issue
  • New tests are written to ensure backwards compatibility with further versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • New content is added to the release announcement
  • If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Please refer to our review guidelines.

This is required to prevent an empty lock file during restart if the system crashed before the lock content was written to the file.
@github-actions
Contributor

Test Results

   939 files  ±    0     939 suites  ±0   1h 44m 3s ⏱️ - 5m 35s
7 459 tests  - 329  7 453 ✔️  - 329  6 💤 ±0  0 ±0 
7 649 runs   - 329  7 641 ✔️  - 329  8 💤 ±0  0 ±0 

Results for commit 8694c15. ± Comparison against base commit fb62d65.

Member

@Zelldon left a comment

Thanks @deepthidevaki as always for the quick fix! Please consider my comments before merging :)

StandardOpenOption.SYNC);

// If two nodes tries to acquire lock, move will fail with FileAlreadyExistsException
FileUtil.moveDurably(
Member

❓ I remember a discussion that this was only supported in some environments, wasn't there something like that? Like only on Linux? Is this an issue?

Contributor Author

Do you know what exactly was the problem?

Flushing the parent directory does not work on Windows. But according to what is documented in FileUtil, this is ok.
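For context, a directory flush on the JVM is commonly done by opening the directory as a channel and forcing it. The sketch below is not Zeebe's FileUtil; `DirectoryFlushSketch` and `tryFlushDirectory` are hypothetical names, and it assumes that a failure to open or force the directory (as is typically seen on Windows) can be tolerated rather than treated as fatal.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class DirectoryFlushSketch {

  // Hypothetical helper: returns true if the directory entry updates were
  // forced to storage, false if the platform refused the operation.
  static boolean tryFlushDirectory(Path directory) {
    try (FileChannel channel = FileChannel.open(directory, StandardOpenOption.READ)) {
      // force(true) asks the OS to persist both content and metadata.
      channel.force(true);
      return true;
    } catch (IOException e) {
      // Assumption: best effort only; the caller decides whether this matters.
      return false;
    }
  }

  public static void main(String[] args) throws IOException {
    final Path dir = Files.createTempDirectory("flush-sketch");
    System.out.println("flushed: " + tryFlushDirectory(dir));
  }
}
```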

Contributor Author

We are using FileUtil#moveDurably in other places as well. So far no problems have been reported, so I guess it should be ok. I will merge this PR. If we see or learn of any problems later, let's tackle them then.

}

@Test
public void canAcquireLockOnDirectoryLockedBySameNode() {
Member

💭 Just an idea for another test: if we made the file writing injectable (via dependency injection), we could make the write fail and test that a failed write does not leave the storage locked.

Contributor Author

🤔 Yeah, maybe. I will pass on it for now. It would also be good if we could inject a mock filesystem in which we can simulate all kinds of failures.

Member

Agreed, but then we should no longer use the Files class directly :) or at least use some wrapper :)

Contributor Author

I think @npepinpe had looked into similar ideas for testing the journal.
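The injectable-writer idea from this thread could look roughly like the sketch below. Nothing here exists in the PR: `InjectableLockWriterSketch`, `ContentWriter`, `tryLock`, and the `.lock` file name are all hypothetical, and the sketch only shows how a test could force the write to fail and then check that no lock file is left behind.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

final class InjectableLockWriterSketch {

  // Hypothetical seam: tests can pass an implementation that throws.
  @FunctionalInterface
  interface ContentWriter {
    void write(Path file, String content) throws IOException;
  }

  private final ContentWriter writer;

  InjectableLockWriterSketch(ContentWriter writer) {
    this.writer = writer;
  }

  // Default writer for production-like use: synchronous write to the given file.
  static ContentWriter syncWriter() {
    return (file, content) ->
        Files.writeString(
            file,
            content,
            StandardOpenOption.CREATE,
            StandardOpenOption.WRITE,
            StandardOpenOption.TRUNCATE_EXISTING,
            StandardOpenOption.SYNC);
  }

  // Returns true if the lock file was published, false if writing or moving failed.
  boolean tryLock(Path directory, String nodeId) {
    final Path lockFile = directory.resolve(".lock");
    final Path tmpFile = directory.resolve(".lock.tmp");
    try {
      writer.write(tmpFile, nodeId);
      Files.move(tmpFile, lockFile, StandardCopyOption.ATOMIC_MOVE);
      return true;
    } catch (IOException e) {
      try {
        // Best-effort cleanup so a failed attempt leaves nothing behind.
        Files.deleteIfExists(tmpFile);
      } catch (IOException ignored) {
        // Nothing more to do in this sketch.
      }
      return false;
    }
  }

  public static void main(String[] args) throws IOException {
    final Path dir = Files.createTempDirectory("injectable-lock-sketch");
    final ContentWriter failing = (file, content) -> {
      throw new IOException("simulated write failure");
    };
    System.out.println("failing writer locked: "
        + new InjectableLockWriterSketch(failing).tryLock(dir, "node-1"));
    System.out.println("sync writer locked: "
        + new InjectableLockWriterSketch(syncWriter()).tryLock(dir, "node-1"));
  }
}
```

A mock filesystem, as suggested above, would cover more failure modes than this single seam, at the cost of wrapping every `Files` call.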

@deepthidevaki
Contributor Author

bors merge

@zeebe-bors-camunda
Contributor

Build succeeded:

@backport-action
Collaborator

Successfully created backport PR #10703 for stable/8.0.

@backport-action
Collaborator

Successfully created backport PR #10704 for stable/8.1.

zeebe-bors-camunda bot added a commit that referenced this pull request Oct 13, 2022
10704: [Backport stable/8.1] Ensure raft storage lock file is update atomically r=deepthidevaki a=backport-action

# Description
Backport of #10683 to `stable/8.1`.

closes #10681

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
zeebe-bors-camunda bot added a commit that referenced this pull request Oct 13, 2022
10703: [Backport stable/8.0] Ensure raft storage lock file is update atomically r=deepthidevaki a=backport-action

# Description
Backport of #10683 to `stable/8.0`.

closes #10681

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
@korthout korthout added the version:8.1.1 Marks an issue as being completely or in parts released in 8.1.1 label Oct 13, 2022
Development

Successfully merging this pull request may close these issues.

Ensure RaftStore lock files are created and updated atomically
4 participants