Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preallocate segment files #9731

Merged
merged 11 commits into from
Jul 12, 2022
Merged

Preallocate segment files #9731

merged 11 commits into from
Jul 12, 2022

Conversation

npepinpe
Copy link
Member

@npepinpe npepinpe commented Jul 7, 2022

Description

This PR introduces segment file pre-allocation in the journal. This is on by default, but can be disabled via an experimental configuration option.

At the moment, the pre-allocation is done in a "dumb" fashion - we allocate a 4Kb blocks of zeroes, and write this until we've reached the expected file length. Note that this means there may be one extra block allocated on disk.

One thing to note, to verify this, we used jnr-posix. The reason behind this is we want to know the actual number of blocks on disk reserved for this file. Files#size, or File#length, return the reported file size, which is part of the file's metadata (on UNIX systems anyway). If you mmap a file with a size of 1Mb, write one byte, then flush it, the reported size will be 1Mb, but the actual size on disk will be a single block (on most modern UNIX systems anyway). By using stat, we can get the actual file size in terms of 512-bytes allocated blocks, so we get a pretty accurate measurement of the actual disk space used by the file.

I would've like to capture this in a test utility, but since test-util depends on util, there wasn't an easy way to do this, so I just copied the method in two places. One possibility I thought of is moving the whole pre-allocation stuff in journal, since we only use it there. The only downside I can see there is about discovery and cohesion, but I'd like to hear your thoughts on this.

A follow-up PR will come which will optimize the pre-allocation by using the posix_fallocate on POSIX systems.

Finally, I opted for an experimental configuration option instead of a feature flag. My reasoning is that it isn't a "new" feature, but instead we want to option of disabling this (for performance reasons potentially). So it's more of an advanced option. But I'd also like to hear your thoughts here.

Related issues

closes #6504
closes #8099
related to #7607

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

  • The changes are backwards compatibility with previous versions
  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR, in case that fails you need to create backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criterias of the issue
  • New tests are written to ensure backwards compatibility with further versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • New content is added to the release announcement
  • If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Please refer to our review guidelines.

@npepinpe
Copy link
Member Author

npepinpe commented Jul 7, 2022

NOTE: we might want to wait for #9714 to be merged before reviewing this, as I expect some major conflicts 😄

@github-actions
Copy link
Contributor

github-actions bot commented Jul 7, 2022

Unit Test Results

   791 files  +    1     791 suites  +1   1h 47m 38s ⏱️ + 11m 2s
6 114 tests +187  6 107 ✔️ +187  7 💤 ±0  0 ±0 
6 283 runs  +187  6 276 ✔️ +187  7 💤 ±0  0 ±0 

Results for commit 5167445. ± Comparison against base commit f0d3913.

♻️ This comment has been updated with latest results.

@deepthidevaki
Copy link
Contributor

deepthidevaki commented Jul 8, 2022

The pr #9714 is merged. Sorry for the conflicts 😄

@npepinpe npepinpe force-pushed the 7607-prealloc-segment branch 3 times, most recently from b288fd7 to a045906 Compare July 8, 2022 14:45
@npepinpe
Copy link
Member Author

npepinpe commented Jul 8, 2022

Happy to split off the last 2 commits to a separate PR 👍

@npepinpe
Copy link
Member Author

ping @oleschoenburg let me know if you're too busy to have a look at this

Copy link
Member

@oleschoenburg oleschoenburg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, only minor adjustments, nothing that would require a re-review 👍

🔧 I wouldn't mind moving preallocate to the journal module. It'd save the duplicated getRealSize test helper and I general I think we don't expect to use this for anything else other than the journal any time soon.

💭 Our example config files don't document this flag (or any other flag in ExperimentalRaftCfg). Not sure if we really want to add them since they are experimental after all.

🔧 There are no tests for the new config, we could add them to ExperimentalCfgTest.

@npepinpe
Copy link
Member Author

Re-review only for the tests comment, and to make sure we're aligned on #9731 (comment) (not sure I understood what you meant).

Copy link
Member

@oleschoenburg oleschoenburg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still looks good ;)

Just to be sure, you have seen my comments here #9731 (review) but ignored them, right? That's totally fine with me, but just wanted to make sure you didn't miss them.

@npepinpe
Copy link
Member Author

No, sorry, I hadn't 🙈

@npepinpe
Copy link
Member Author

npepinpe commented Jul 11, 2022

I thought about moving the preallocate to the journal, and I might do that in the next PR as we pull in more native stuff. I'm not super keen on splitting file system utilities though, mostly since it's easy to miss what exists for new devs, but I guess it is what it is.

I'll add tests, and document the new config.

Adds a method to preallocate a new file of the expected length,
effectively reserving the disk space for later use. The current
implementation is a "dumb" one, which simply writes 4KB blocks of 0s
until the expected length is reached (meaning the size may be up to 4KB
- 1 byte greater than the desired length, which is acceptable for our
  use case).

While potentially slow, this is not in a hot path, and will be later
optimized with a native syscall to fallocate.
Adds basic tests for `FileUtil#preallocate`. A new dependency was added,
`jnr-posix`, which allows us to check the actual size on disk of the
file in UNIX systems. This gives us the accurate number of blocks
reserved for the file, on disk, even on file systems with compression or
sparse files (i.e. most modern Linux systems), and also ensures that we
don't only read the metadata (e.g. what `Files.size` returns) but really
guarantee we reserved the disk space.
Allows preallocating segment files if configured to do so. On segment
creation, we now preallocate the segment file by default before using
it. If it already existed (but was unused by the log so far), we take
the easy way out and simply delete/recreate it instead of attempting to
grow/shrink it.
Add a new dependency on `jnr-posix` to accurately get the size of the
file on disk. This is necessary to avoid tests passing if we only change
the file's metadata - for example, when mmap-ing a file with a mapping
of length X, then the file will report it has a length of X, even though
it has possibly only one block allocated on disk.
Cleans up the `SegmentedJournal`, `SegmentLoader`, and `SegmentsManager`
a little bit. Instead of constructing dependencies classes, inject them
when building the dependent. This improves testing and maintainability
in the long term.

After this, there were some unused fields which could be removed.
@npepinpe
Copy link
Member Author

bors merge

zeebe-bors-camunda bot added a commit that referenced this pull request Jul 11, 2022
9731: Preallocate segment files r=npepinpe a=npepinpe

## Description

This PR introduces segment file pre-allocation in the journal. This is on by default, but can be disabled via an experimental configuration option.

At the moment, the pre-allocation is done in a "dumb" fashion - we allocate a 4Kb blocks of zeroes, and write this until we've reached the expected file length. Note that this means there may be one extra block allocated on disk.

One thing to note, to verify this, we used [jnr-posix](https://github.com/jnr/jnr-posix). The reason behind this is we want to know the actual number of blocks on disk reserved for this file. `Files#size`, or `File#length`, return the reported file size, which is part of the file's metadata (on UNIX systems anyway). If you mmap a file with a size of 1Mb, write one byte, then flush it, the reported size will be 1Mb, but the actual size on disk will be a single block (on most modern UNIX systems anyway). By using [stat](https://linux.die.net/man/2/stat), we can get the actual file size in terms of 512-bytes allocated blocks, so we get a pretty accurate measurement of the actual disk space used by the file.

I would've like to capture this in a test utility, but since `test-util` depends on `util`, there wasn't an easy way to do this, so I just copied the method in two places. One possibility I thought of is moving the whole pre-allocation stuff in `journal`, since we only use it there. The only downside I can see there is about discovery and cohesion, but I'd like to hear your thoughts on this.

A follow-up PR will come which will optimize the pre-allocation by using the [posix_fallocate](https://man7.org/linux/man-pages/man3/posix_fallocate.3.html) on POSIX systems.

Finally, I opted for an experimental configuration option instead of a feature flag. My reasoning is that it isn't a "new" feature, but instead we want to option of disabling this (for performance reasons potentially). So it's more of an advanced option. But I'd also like to hear your thoughts here.

## Related issues

closes #6504
closes #8099
related to #7607  



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
@zeebe-bors-camunda
Copy link
Contributor

Build failed:

@npepinpe
Copy link
Member Author

Known flaky test

bors r+

zeebe-bors-camunda bot added a commit that referenced this pull request Jul 11, 2022
9731: Preallocate segment files r=npepinpe a=npepinpe

## Description

This PR introduces segment file pre-allocation in the journal. This is on by default, but can be disabled via an experimental configuration option.

At the moment, the pre-allocation is done in a "dumb" fashion - we allocate a 4Kb blocks of zeroes, and write this until we've reached the expected file length. Note that this means there may be one extra block allocated on disk.

One thing to note, to verify this, we used [jnr-posix](https://github.com/jnr/jnr-posix). The reason behind this is we want to know the actual number of blocks on disk reserved for this file. `Files#size`, or `File#length`, return the reported file size, which is part of the file's metadata (on UNIX systems anyway). If you mmap a file with a size of 1Mb, write one byte, then flush it, the reported size will be 1Mb, but the actual size on disk will be a single block (on most modern UNIX systems anyway). By using [stat](https://linux.die.net/man/2/stat), we can get the actual file size in terms of 512-bytes allocated blocks, so we get a pretty accurate measurement of the actual disk space used by the file.

I would've like to capture this in a test utility, but since `test-util` depends on `util`, there wasn't an easy way to do this, so I just copied the method in two places. One possibility I thought of is moving the whole pre-allocation stuff in `journal`, since we only use it there. The only downside I can see there is about discovery and cohesion, but I'd like to hear your thoughts on this.

A follow-up PR will come which will optimize the pre-allocation by using the [posix_fallocate](https://man7.org/linux/man-pages/man3/posix_fallocate.3.html) on POSIX systems.

Finally, I opted for an experimental configuration option instead of a feature flag. My reasoning is that it isn't a "new" feature, but instead we want to option of disabling this (for performance reasons potentially). So it's more of an advanced option. But I'd also like to hear your thoughts here.

## Related issues

closes #6504
closes #8099
related to #7607  



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
@zeebe-bors-camunda
Copy link
Contributor

Build failed:

@npepinpe
Copy link
Member Author

bors r+

@zeebe-bors-camunda
Copy link
Contributor

Build succeeded:

@zeebe-bors-camunda zeebe-bors-camunda bot merged commit 6592e82 into main Jul 12, 2022
@zeebe-bors-camunda zeebe-bors-camunda bot deleted the 7607-prealloc-segment branch July 12, 2022 10:21
@npepinpe
Copy link
Member Author

This will be fun to backport 😄

/backport

@backport-action
Copy link
Collaborator

Backport failed for stable/1.3, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/1.3
git worktree add -d .worktree/backport-9731-to-stable/1.3 origin/stable/1.3
cd .worktree/backport-9731-to-stable/1.3
git checkout -b backport-9731-to-stable/1.3
ancref=$(git merge-base f0d391377f69d94106188ab0870e40f2b9dbdf40 5167445437bb17a27e29632a7dd6fc368e6e151f)
git cherry-pick -x $ancref..5167445437bb17a27e29632a7dd6fc368e6e151f

@backport-action
Copy link
Collaborator

Backport failed for stable/8.0, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/8.0
git worktree add -d .worktree/backport-9731-to-stable/8.0 origin/stable/8.0
cd .worktree/backport-9731-to-stable/8.0
git checkout -b backport-9731-to-stable/8.0
ancref=$(git merge-base f0d391377f69d94106188ab0870e40f2b9dbdf40 5167445437bb17a27e29632a7dd6fc368e6e151f)
git cherry-pick -x $ancref..5167445437bb17a27e29632a7dd6fc368e6e151f

zeebe-bors-camunda bot added a commit that referenced this pull request Jul 13, 2022
9777: [Backports stable/8.0] Preallocate segment files r=npepinpe a=npepinpe

## Description

Backports #9731 to 8.0.x.

## Related issues

closes #6504
closes #8099



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
zeebe-bors-camunda bot added a commit that referenced this pull request Jul 13, 2022
9778: [Backports stable/1.3] Preallocate segment files r=npepinpe a=npepinpe

## Description

Backports #9731 to 1.3.x.

## Related issues

closes #6504
closes #8099



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
zeebe-bors-camunda bot added a commit that referenced this pull request Jul 21, 2022
9842: [Backport stable/8.0] Backport journal structural updates r=npepinpe a=npepinpe

## Description

This PR backports the structural updates made to the journal in #9714, #9731, #9833, and #9834. This is to ease backporting further fixes to the journal, as the structure deviating so much caused major issues when backporting new fixes.

## Related issues

backports #9714
backports #9731
backports #9833
backports #9834



Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
zeebe-bors-camunda bot added a commit that referenced this pull request Jul 21, 2022
9842: [Backport stable/8.0] Backport journal structural updates r=npepinpe a=npepinpe

## Description

This PR backports the structural updates made to the journal in #9714, #9731, #9833, and #9834. This is to ease backporting further fixes to the journal, as the structure deviating so much caused major issues when backporting new fixes.

## Related issues

backports #9714
backports #9731
backports #9833
backports #9834



Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
zeebe-bors-camunda bot added a commit that referenced this pull request Jul 21, 2022
9857: [Backport stable/1.3] Backport journal structural updates r=npepinpe a=npepinpe

## Description

This PR backports the structural updates made to the journal in #9714, #9731, #9833, and #9834. This is to ease backporting further fixes to the journal, as the structure deviating so much caused major issues when backporting new fixes.

## Related issues

backports #9714
backports #9731
backports #9833
backports #9834



Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
zeebe-bors-camunda bot added a commit that referenced this pull request Jul 25, 2022
9857: [Backport stable/1.3] Backport journal structural updates r=npepinpe a=npepinpe

## Description

This PR backports the structural updates made to the journal in #9714, #9731, #9833, and #9834. This is to ease backporting further fixes to the journal, as the structure deviating so much caused major issues when backporting new fixes.

## Related issues

backports #9714
backports #9731
backports #9833
backports #9834



Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants