Preallocate segment files #9731
NOTE: we might want to wait for #9714 to be merged before reviewing this, as I expect some major conflicts 😄
PR #9714 is merged. Sorry for the conflicts 😄
Happy to split off the last 2 commits to a separate PR 👍
ping @oleschoenburg let me know if you're too busy to have a look at this
Very nice, only minor adjustments, nothing that would require a re-review 👍
🔧 I wouldn't mind moving `preallocate` to the journal module. It'd save the duplicated `getRealSize` test helper, and in general I think we don't expect to use this for anything other than the journal any time soon.
💭 Our example config files don't document this flag (or any other flag in `ExperimentalRaftCfg`). Not sure if we really want to add them since they are experimental after all.
🔧 There are no tests for the new config; we could add them to `ExperimentalCfgTest`.
Re-review only for the tests comment, and to make sure we're aligned on #9731 (comment) (not sure I understood what you meant).
Still looks good ;)
Just to be sure, you have seen my comments here #9731 (review) but ignored them, right? That's totally fine with me, but just wanted to make sure you didn't miss them.
No, sorry, I hadn't 🙈
I thought about moving the I'll add tests, and document the new config.
Adds a method to preallocate a new file of the expected length, effectively reserving the disk space for later use. The current implementation is a "dumb" one, which simply writes 4KB blocks of 0s until the expected length is reached (meaning the size may be up to 4KB - 1 byte greater than the desired length, which is acceptable for our use case). While potentially slow, this is not in a hot path, and will later be optimized with a native call to fallocate.
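For illustration, the block-wise approach described above could look roughly like the following. This is a minimal sketch and not the code from this PR: the class name `PreallocateSketch` and the method signature are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, named for this example only.
final class PreallocateSketch {
  private static final int BLOCK_SIZE = 4 * 1024;

  // Writes whole 4KB blocks of zeros until at least `length` bytes exist,
  // so the final size may exceed `length` by up to BLOCK_SIZE - 1 bytes.
  static void preallocate(final Path path, final long length) throws IOException {
    // A freshly allocated heap buffer is zero-filled.
    final ByteBuffer zeros = ByteBuffer.allocate(BLOCK_SIZE);
    try (FileChannel channel =
        FileChannel.open(path, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
      long written = 0;
      while (written < length) {
        zeros.clear(); // reset position and limit, the content stays all zeros
        while (zeros.hasRemaining()) {
          written += channel.write(zeros);
        }
      }
      // Flush data and metadata so the blocks are actually handed to the disk.
      channel.force(true);
    }
  }
}
```

Because real data is written for every block, the file system has to reserve the space, unlike a call that only changes the recorded length.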
Adds basic tests for `FileUtil#preallocate`. A new dependency was added, `jnr-posix`, which allows us to check the actual size on disk of the file in UNIX systems. This gives us the accurate number of blocks reserved for the file, on disk, even on file systems with compression or sparse files (i.e. most modern Linux systems), and also ensures that we don't only read the metadata (e.g. what `Files.size` returns) but really guarantee we reserved the disk space.
Allows preallocating segment files if configured to do so. On segment creation, we now preallocate the segment file by default before using it. If it already existed (but was unused by the log so far), we take the easy way out and simply delete/recreate it instead of attempting to grow/shrink it.
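The creation flow described above can be sketched as follows. This is only an approximation under stated assumptions: `SegmentFileSketch`, `createSegmentFile`, and the `preallocate` flag are hypothetical names, and what happens when preallocation is disabled (here, an empty file is left behind) is a simplification rather than the journal's actual behavior.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch, not the SegmentLoader from this PR.
final class SegmentFileSketch {
  private static final int BLOCK_SIZE = 4 * 1024;

  static void createSegmentFile(
      final Path file, final long segmentSize, final boolean preallocate) throws IOException {
    // A leftover file that the log never used is dropped and recreated,
    // which is simpler than trying to grow or shrink it.
    Files.deleteIfExists(file);

    try (FileChannel channel =
        FileChannel.open(file, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
      if (!preallocate) {
        return; // assumption: nothing is reserved when the option is off
      }
      final ByteBuffer zeros = ByteBuffer.allocate(BLOCK_SIZE);
      long written = 0;
      while (written < segmentSize) {
        zeros.clear();
        while (zeros.hasRemaining()) {
          written += channel.write(zeros);
        }
      }
      channel.force(true);
    }
  }
}
```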
Add a new dependency on `jnr-posix` to accurately get the size of the file on disk. This is necessary to avoid tests passing if we only change the file's metadata - for example, when mmap-ing a file with a mapping of length X, then the file will report it has a length of X, even though it has possibly only one block allocated on disk.
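To make the metadata problem concrete, here is a small sketch using only the JDK. It maps a file with a 1MB mapping, writes a single byte, and returns what `Files.size` reports. The class and method names are made up for this example, and it relies on common JDK implementations extending the file to the mapping length. It does not measure allocated blocks; that part needs `stat`, which is what `jnr-posix` is used for in the tests.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, named for this example only.
final class ReportedSizeSketch {

  // Maps `length` bytes of the file, touches one byte, and returns the
  // size reported by the file's metadata afterwards.
  static long mapAndReportSize(final Path path, final int length) throws IOException {
    try (FileChannel channel =
        FileChannel.open(
            path,
            StandardOpenOption.CREATE,
            StandardOpenOption.READ,
            StandardOpenOption.WRITE)) {
      final MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, length);
      buffer.put(0, (byte) 1); // only the first page is ever written
      buffer.force();
    }
    // Reports the full mapping length, regardless of how many blocks
    // the file system actually allocated.
    return Files.size(path);
  }
}
```

A test that only asserted on this reported size would pass even if almost no disk space had been reserved, which is why the tests check the block count instead.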
Cleans up the `SegmentedJournal`, `SegmentLoader`, and `SegmentsManager` a little bit. Instead of constructing their dependencies internally, these classes now have them injected when they are built. This improves testability and maintainability in the long term. After this, some fields were left unused and could be removed.
bors merge
9731: Preallocate segment files r=npepinpe a=npepinpe
## Description
This PR introduces segment file pre-allocation in the journal. This is on by default, but can be disabled via an experimental configuration option.
At the moment, the pre-allocation is done in a "dumb" fashion - we allocate a 4KB block of zeroes, and write this until we've reached the expected file length. Note that this means there may be one extra block allocated on disk.
One thing to note: to verify this, we used [jnr-posix](https://github.com/jnr/jnr-posix). The reason behind this is we want to know the actual number of blocks on disk reserved for this file. `Files#size`, or `File#length`, return the reported file size, which is part of the file's metadata (on UNIX systems anyway). If you mmap a file with a size of 1MB, write one byte, then flush it, the reported size will be 1MB, but the actual size on disk will be a single block (on most modern UNIX systems anyway). By using [stat](https://linux.die.net/man/2/stat), we can get the actual file size in terms of allocated 512-byte blocks, so we get a pretty accurate measurement of the actual disk space used by the file.
I would've liked to capture this in a test utility, but since `test-util` depends on `util`, there wasn't an easy way to do this, so I just copied the method in two places. One possibility I thought of is moving the whole pre-allocation stuff into `journal`, since we only use it there. The only downside I can see there is about discovery and cohesion, but I'd like to hear your thoughts on this.
A follow-up PR will come which will optimize the pre-allocation by using [posix_fallocate](https://man7.org/linux/man-pages/man3/posix_fallocate.3.html) on POSIX systems.
Finally, I opted for an experimental configuration option instead of a feature flag. My reasoning is that it isn't a "new" feature, but instead we want the option of disabling this (for performance reasons potentially). So it's more of an advanced option. But I'd also like to hear your thoughts here.
## Related issues
closes #6504
closes #8099
related to #7607
Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
Build failed:
Known flaky test.
bors r+
Build failed:
bors r+
Build succeeded:
This will be fun to backport 😄
/backport
Backport failed. Please cherry-pick the changes locally.
git fetch origin stable/1.3
git worktree add -d .worktree/backport-9731-to-stable/1.3 origin/stable/1.3
cd .worktree/backport-9731-to-stable/1.3
git checkout -b backport-9731-to-stable/1.3
ancref=$(git merge-base f0d391377f69d94106188ab0870e40f2b9dbdf40 5167445437bb17a27e29632a7dd6fc368e6e151f)
git cherry-pick -x $ancref..5167445437bb17a27e29632a7dd6fc368e6e151f
Backport failed. Please cherry-pick the changes locally.
git fetch origin stable/8.0
git worktree add -d .worktree/backport-9731-to-stable/8.0 origin/stable/8.0
cd .worktree/backport-9731-to-stable/8.0
git checkout -b backport-9731-to-stable/8.0
ancref=$(git merge-base f0d391377f69d94106188ab0870e40f2b9dbdf40 5167445437bb17a27e29632a7dd6fc368e6e151f)
git cherry-pick -x $ancref..5167445437bb17a27e29632a7dd6fc368e6e151f
9842: [Backport stable/8.0] Backport journal structural updates r=npepinpe a=npepinpe
## Description
This PR backports the structural updates made to the journal in #9714, #9731, #9833, and #9834. This is to ease backporting further fixes to the journal, as the structure deviating so much caused major issues when backporting new fixes.
## Related issues
backports #9714
backports #9731
backports #9833
backports #9834
Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
9857: [Backport stable/1.3] Backport journal structural updates r=npepinpe a=npepinpe
## Description
This PR backports the structural updates made to the journal in #9714, #9731, #9833, and #9834. This is to ease backporting further fixes to the journal, as the structure deviating so much caused major issues when backporting new fixes.
## Related issues
backports #9714
backports #9731
backports #9833
backports #9834
Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
Description
This PR introduces segment file pre-allocation in the journal. This is on by default, but can be disabled via an experimental configuration option.
At the moment, the pre-allocation is done in a "dumb" fashion - we allocate a 4KB block of zeroes, and write this until we've reached the expected file length. Note that this means there may be one extra block allocated on disk.
One thing to note: to verify this, we used jnr-posix. The reason behind this is we want to know the actual number of blocks on disk reserved for this file. `Files#size`, or `File#length`, return the reported file size, which is part of the file's metadata (on UNIX systems anyway). If you mmap a file with a size of 1MB, write one byte, then flush it, the reported size will be 1MB, but the actual size on disk will be a single block (on most modern UNIX systems anyway). By using stat, we can get the actual file size in terms of allocated 512-byte blocks, so we get a pretty accurate measurement of the actual disk space used by the file.
I would've liked to capture this in a test utility, but since `test-util` depends on `util`, there wasn't an easy way to do this, so I just copied the method in two places. One possibility I thought of is moving the whole pre-allocation stuff into `journal`, since we only use it there. The only downside I can see there is about discovery and cohesion, but I'd like to hear your thoughts on this.
A follow-up PR will come which will optimize the pre-allocation by using posix_fallocate on POSIX systems.
Finally, I opted for an experimental configuration option instead of a feature flag. My reasoning is that it isn't a "new" feature, but instead we want the option of disabling this (for performance reasons potentially). So it's more of an advanced option. But I'd also like to hear your thoughts here.
Related issues
closes #6504
closes #8099
related to #7607