SIGBUS error with 1.2.2 #8099

Closed
Zelldon opened this issue Oct 29, 2021 · 19 comments · Fixed by #9731
Labels
kind/bug (Categorizes an issue or PR as a bug) · scope/broker (Marks an issue or PR to appear in the broker section of the changelog) · support (Marks an issue as related to a customer support request) · version:1.3.13 · version:8.1.0-alpha4 · version:8.1.0 (Marks an issue as being completely or in parts released in 8.1.0)

Comments

@Zelldon
Member

Zelldon commented Oct 29, 2021

Describe the bug

The customer reported that they got a SIGBUS error with the latest patch release while running their benchmark; it looks like it is happening in the journal.

Impact: Since the customer doesn't run on Kubernetes, the system had to be restarted.

https://www.geeksforgeeks.org/segmentation-fault-sigsegv-vs-bus-error-sigbus/
2) Bus Error (also known as SIGBUS, usually signal 10) occurs when a process is trying to access memory that the CPU cannot physically address. In other words, the memory the program tried to access is not a valid memory address. It is caused by alignment issues with the CPU (e.g. trying to read a long from an address which isn't a multiple of 4). SIGBUS is an abbreviation for “Bus Error”.

The SIGBUS signal occurs in the following cases:
-> The program instructs the CPU to read or write a specific physical memory address which is not valid / the requested physical address is unrecognized by the whole computer system.
-> Unaligned access of memory (for example, if multi-byte accesses must be 16-bit-aligned, addresses (given in bytes) at 0, 2, 4, 6, and so on would be considered aligned and therefore accessible, while addresses 1, 3, 5, and so on would be considered unaligned).

To Reproduce

I think we can ask them what exactly they are doing; what I know is that they have 24 partitions and 4 brokers with replication factor 4.

Expected behavior

No SIGBUS, I assume?

Log/Stacktrace

zeebe_broker2 | 2021-10-27 14:14:33.928 [] [raft-server-2-raft-partition-partition-6] DEBUG
zeebe_broker2 |       io.camunda.zeebe.journal.file.SegmentedJournal - No segments can be deleted with index < 33400058 (first log index: 33322496)
zeebe_broker2 | 2021-10-27 14:14:33.971 [] [raft-server-2-raft-partition-partition-22] DEBUG
zeebe_broker2 |       io.camunda.zeebe.journal.file.SegmentedJournal - No segments can be deleted with index < 33226084 (first log index: 33091065)
zeebe_broker2 | 2021-10-27 14:14:33.977 [] [raft-server-2-raft-partition-partition-24] DEBUG
zeebe_broker2 |       io.camunda.zeebe.journal.file.SegmentedJournal - No segments can be deleted with index < 51869258 (first log index: 51728804)
zeebe_broker2 | 2021-10-27 14:14:33.984 [] [raft-server-2-raft-partition-partition-10] DEBUG
zeebe_broker2 |       io.camunda.zeebe.journal.file.SegmentedJournal - No segments can be deleted with index < 34657647 (first log index: 34533361)
zeebe_broker2 | 2021-10-27 14:14:33.999 [] [raft-server-2-raft-partition-partition-6] DEBUG
zeebe_broker2 |       io.camunda.zeebe.journal.file.SegmentedJournal - No segments can be deleted with index < 33400058 (first log index: 33322496)
zeebe_broker2 | 2021-10-27 14:14:34.045 [] [raft-server-2-raft-partition-partition-22] DEBUG
zeebe_broker2 |       io.camunda.zeebe.journal.file.SegmentedJournal - No segments can be deleted with index < 33226084 (first log index: 33091065)
zeebe_broker2 | 2021-10-27 14:14:34.048 [] [raft-server-2-raft-partition-partition-24] DEBUG
zeebe_broker2 |       io.camunda.zeebe.journal.file.SegmentedJournal - No segments can be deleted with index < 51869258 (first log index: 51728804)
zeebe_broker2 | 2021-10-27 14:14:34.061 [] [raft-server-2-raft-partition-partition-10] DEBUG
zeebe_broker2 |       io.camunda.zeebe.journal.file.SegmentedJournal - No segments can be deleted with index < 34657647 (first log index: 34533361)
zeebe_broker2 | 2021-10-27 14:14:34.070 [] [raft-server-2-raft-partition-partition-6] DEBUG
zeebe_broker2 |       io.camunda.zeebe.journal.file.SegmentedJournal - No segments can be deleted with index < 33400058 (first log index: 33322496)
zeebe_broker2 | [thread 82 also had an error]
zeebe_broker2 | [thread 104 also had an error][thread 91 also had an error]
zeebe_broker2 |
zeebe_broker2 | [thread 99 also had an error]
zeebe_broker2 | #
zeebe_broker2 | # A fatal error has been detected by the Java Runtime Environment:
zeebe_broker2 | #
zeebe_broker2 | #  [thread 83 also had an error]
zeebe_broker2 | [thread 93 also had an error]
zeebe_broker2 | [thread 49 also had an error][thread 48 also had an error]
zeebe_broker2 |
zeebe_broker2 | SIGBUS (0x7) at pc=0x00007f91eb54d169, pid=8, tid=92
zeebe_broker2 | #
zeebe_broker2 | # JRE version: OpenJDK Runtime Environment 18.9 (11.0.13+8) (build 11.0.13+8)
zeebe_broker2 | # Java VM: OpenJDK 64-Bit Server VM 18.9 (11.0.13+8, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
zeebe_broker2 | # Problematic frame:
zeebe_broker2 | # v  ~StubRoutines::jbyte_disjoint_arraycopy
zeebe_broker2 | #
zeebe_broker2 | # Core dump will be written. Default location: /var/swiftcore/core.%e.8.%h.%t
zeebe_broker2 | #
zeebe_broker2 | # An error report file with more information is saved as:
zeebe_broker2 | # /usr/local/zeebe/hs_err_pid8.log
zeebe_broker2 | Compiled method (c2) 103585123 19667   !   4       io.camunda.zeebe.journal.file.SegmentedJournalWriter::append (54 bytes)
zeebe_broker2 |  total in heap  [0x00007f91f4170a90,0x00007f91f4174fe8] = 17752
zeebe_broker2 |  relocation     [0x00007f91f4170c08,0x00007f91f4170da0] = 408
zeebe_broker2 |  main code      [0x00007f91f4170da0,0x00007f91f41730a0] = 8960
zeebe_broker2 |  stub code      [0x00007f91f41730a0,0x00007f91f4173120] = 128
zeebe_broker2 |  oops           [0x00007f91f4173120,0x00007f91f4173148] = 40
zeebe_broker2 |  metadata       [0x00007f91f4173148,0x00007f91f4173340] = 504
zeebe_broker2 |  scopes data    [0x00007f91f4173340,0x00007f91f4174738] = 5112
zeebe_broker2 |  scopes pcs     [0x00007f91f4174738,0x00007f91f4174dd8] = 1696
zeebe_broker2 |  dependencies   [0x00007f91f4174dd8,0x00007f91f4174e00] = 40
zeebe_broker2 |  handler table  [0x00007f91f4174e00,0x00007f91f4174ef0] = 240
zeebe_broker2 |  nul chk table  [0x00007f91f4174ef0,0x00007f91f4174fe8] = 248
zeebe_broker2 | Could not load hsdis-amd64.so; library not loadable; PrintAssembly is disabled
zeebe_broker2 | #
zeebe_broker2 | # If you would like to submit a bug report, please visit:
zeebe_broker2 | #   https://bugreport.java.com/bugreport/crash.jsp
zeebe_broker2 | #
zeebe_broker2 exited with code 134

hs_err_pid8.log

Environment:

  • OS:
  • Zeebe Version: 1.2.2
  • Configuration: 4 broker, 24 partitions, bare metal

Support Case: SUPPORT-11966

@Zelldon Zelldon added the kind/bug and scope/broker labels Oct 29, 2021
@npepinpe
Member

npepinpe commented Oct 29, 2021

They're not on K8S but are they using containers at all, or running bare metal?

This is the same customer which had too many segments memory mapped, right? Could it be they have so many that they ran out of memory to address? 😄 Are they running on a 32-bit system or a 64-bit system? Never mind, I can see they're running a 64-bit JDK.

Otherwise, regarding potential alignment issues, it would be interesting to know what CPU architecture they run on. That said, it's not unreasonable for us to enforce page alignment in general (it would probably even be beneficial), so we could do that regardless of this issue.

There are other cases where this can occur which might point to a bug. If a file is truncated while mapped, further reads may cause a SIGBUS in some JVMs (I believe this was patched, but I don't know in which versions or whether it was backported).
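For illustration, here is a minimal sketch of that truncation case (not Zeebe code; the path and sizes are made up, and running it may crash the JVM):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class TruncateWhileMapped {
  public static void main(String[] args) throws Exception {
    try (RandomAccessFile file = new RandomAccessFile("/tmp/mapped-demo", "rw")) {
      // Grow the file to one page and map it read-write.
      file.setLength(4096);
      final MappedByteBuffer buffer =
          file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, 4096);
      buffer.put(0, (byte) 1); // fine, the page is backed by the file

      // Shrink the file while the mapping is still alive.
      file.setLength(0);

      // Touching the now-unbacked page raises SIGBUS; depending on the JVM and the
      // access path, the process either dies or an InternalError is reported.
      buffer.get(0);
    }
  }
}
```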

@Zelldon
Member Author

Zelldon commented Oct 29, 2021

So we have no alignment? 🤔 You can see the CPU in the HotSpot error report file.

@npepinpe
Member

No, but I'm not sure the alignment is the problem in this case (though I could be wrong). With mmap'd files, a SIGBUS more likely means we are writing/reading outside of the buffer. This can mean there was not enough virtual address space (though then the SIGBUS would happen when we map it, not on append), the file was truncated concurrently (unlikely), we're writing to a buffer that was since unmapped (though I would expect a SIGSEGV here), or there was some weird I/O error that wasn't well handled by the JVM (e.g. writing to a network storage which was detached concurrently).

See https://www.sublimetext.com/blog/articles/use-mmap-with-care#:~:text=SIGBUS%20(bus%20error)%20is%20a,we%20failed%20to%20read%2Fwrite for example

@Zelldon
Member Author

Zelldon commented Nov 1, 2021

What I can see from the customer data is that the broker which received the SIGBUS ran out of disk space and was not able to delete data on several partitions; on other partitions it was able to delete data. No exporters are configured.

We see endless log statements like:

2021-10-27 14:14:17.730 [] [raft-server-2-raft-partition-partition-5] DEBUG
zeebe_broker2 |       io.camunda.zeebe.journal.file.SegmentedJournal - No segments can be deleted with index < 49499482 (first log index: 49463401)
zeebe_broker2 | 2021-10-27 14:14:17.730 [] [raft-server-2-raft-partition-partition-5] DEBUG
zeebe_broker2 |       io.camunda.zeebe.journal.file.SegmentedJournal - No segments can be deleted with index < 49499482 (first log index: 49463401)
zeebe_broker2 | 2021-10-27 14:14:17.731 [] [raft-server-2-raft-partition-partition-5] DEBUG
zeebe_broker2 |       io.camunda.zeebe.journal.file.SegmentedJournal - No segments can be deleted with index < 49499482 (first log index: 49463401)
zeebe_broker2 | 2021-10-27 14:14:17.731 [] [raft-server-2-raft-partition-partition-5] DEBUG
zeebe_broker2 |       io.camunda.zeebe.journal.file.SegmentedJournal - No segments can be deleted with index < 49499482 (first log index: 49463401)

SIGBUS message, which includes the CPU and JDK details:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x00007f91eb54d169, pid=8, tid=92
#
# JRE version: OpenJDK Runtime Environment 18.9 (11.0.13+8) (build 11.0.13+8)
# Java VM: OpenJDK 64-Bit Server VM 18.9 (11.0.13+8, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# v  ~StubRoutines::jbyte_disjoint_arraycopy
#
# Core dump will be written. Default location: /var/swiftcore/core.%e.8.%h.%t
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#
---------------  S U M M A R Y ------------

Command Line: -Xmx6G -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Xlog:gc*=debug:file=/usr/local/zeebe/logs/gc.log -XX:+HeapDumpOnOutOfMemoryError -XX:+ExitOnOutOfMemoryError -Xms128m --illegal-access=deny -XX:+ExitOnOutOfMemoryError -Dapp.name=broker -Dapp.pid=8 -Dapp.repo=/usr/local/zeebe/lib -Dapp.home=/usr/local/zeebe -Dbasedir=/usr/local/zeebe io.camunda.zeebe.broker.StandaloneBroker

Host: Intel(R) Xeon(R) Platinum 8270 CPU @ 2.70GHz, 12 cores, 31G, Debian GNU/Linux 11 (bullseye)
Time: Wed Oct 27 14:14:34 2021 UTC elapsed time: 103585.121013 seconds (1d 4h 46m 25s)

@Zelldon
Member Author

Zelldon commented Nov 1, 2021

@deepthidevaki @npepinpe can we have an endless loop in deleting, or why else would these logs be printed for the same partition so often within milliseconds?

zeebe_broker2 | 2021-10-27 14:14:18.668 [] [raft-server-2-raft-partition-partition-5] DEBUG
zeebe_broker2 |       io.camunda.zeebe.journal.file.SegmentedJournal - No segments can be deleted with index < 49499482 (first log index: 49463401)
zeebe_broker2 | 2021-10-27 14:14:18.669 [] [raft-server-2-raft-partition-partition-5] DEBUG
zeebe_broker2 |       io.camunda.zeebe.journal.file.SegmentedJournal - No segments can be deleted with index < 49499482 (first log index: 49463401)
zeebe_broker2 | 2021-10-27 14:14:18.670 [] [raft-server-2-raft-partition-partition-5] DEBUG
zeebe_broker2 |       io.camunda.zeebe.journal.file.SegmentedJournal - No segments can be deleted with index < 49499482 (first log index: 49463401)
zeebe_broker2 | 2021-10-27 14:14:18.671 [] [raft-server-2-raft-partition-partition-5] DEBUG
zeebe_broker2 |       io.camunda.zeebe.journal.file.SegmentedJournal - No segments can be deleted with index < 49499482 (first log index: 49463401)
zeebe_broker2 | 2021-10-27 14:14:18.672 [] [raft-server-2-raft-partition-partition-5] DEBUG
zeebe_broker2 |       io.camunda.zeebe.journal.file.SegmentedJournal - No segments can be deleted with index < 49499482 (first log index: 49463401)
zeebe_broker2 | 2021-10-27 14:14:18.672 [] [raft-server-2-raft-partition-partition-5] DEBUG
zeebe_broker2 |       io.camunda.zeebe.journal.file.SegmentedJournal - No segments can be deleted with index < 49499482 (first log index: 49463401)
zeebe_broker2 | 2021-10-27 14:14:18.673 [] [raft-server-2-raft-partition-partition-5] DEBUG
zeebe_broker2 |       io.camunda.zeebe.journal.file.SegmentedJournal - No segments can be deleted with index < 49499482 (first log index: 49463401)
zeebe_broker2 | 2021-10-27 14:14:18.675 [] [raft-server-2-raft-partition-partition-5] DEBUG
zeebe_broker2 |       io.camunda.zeebe.journal.file.SegmentedJournal - No segments can be deleted with index < 49499482 (first log index: 49463401)

@deepthidevaki
Contributor

Log deletion is not retried, so I expect this log only once, when a new snapshot is taken or received.

@deepthidevaki
Contributor

deepthidevaki commented Nov 1, 2021

Oh... wait! Maybe we try to compact when Raft detects it is out of disk (OOD)? Maybe in LeaderRole or FollowerRole.

@deepthidevaki
Contributor

@npepinpe npepinpe added the support label Nov 1, 2021
@deepthidevaki
Contributor

@Zelldon FYI, the OOD may be due to this issue: #7767

@npepinpe npepinpe added the 1.2.1 label Nov 1, 2021
@npepinpe npepinpe added this to Planned in Zeebe Nov 1, 2021
@npepinpe npepinpe moved this from Planned to Ready in Zeebe Nov 1, 2021
@npepinpe
Member

npepinpe commented Nov 1, 2021

Please timebox some effort to root cause it, then put it back in triage when we have more information on this.

@Zelldon Zelldon moved this from Ready to In progress in Zeebe Nov 1, 2021
@npepinpe
Member

npepinpe commented Nov 1, 2021

From the looks of it, it's usually because it's memory-mapping the files in /tmp - which succeeds, but then fails when it tries to write. I'm not sure why it's a SIGBUS and not the InternalError we've previously observed - could it be because in this case the JVM is not using its own MappedByteBuffer API but something else? I would be curious to try a small test which fills a disk, maps a file and then tries to write, and see if I get a SIGBUS, or if this only happens for JVM reports written to /tmp 🤔
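A hedged sketch of such a small test (assuming a nearly full filesystem mounted at the hypothetical path /mnt/full-disk, e.g. previously filled with dd):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class FullDiskMmapTest {
  public static void main(String[] args) throws Exception {
    final long size = 128 * 1024 * 1024; // 128 MiB segment-like file
    try (RandomAccessFile file = new RandomAccessFile("/mnt/full-disk/segment", "rw")) {
      // setLength only updates the file's metadata; no disk blocks are reserved yet,
      // so this succeeds even when the filesystem is already full (sparse file).
      file.setLength(size);
      final MappedByteBuffer buffer =
          file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, size);

      // Every write dirties a page the kernel must eventually back with a real block.
      // With no free blocks left, the page fault fails and the process gets a SIGBUS
      // (or, on some code paths, the JVM reports an InternalError instead).
      for (int i = 0; i < buffer.capacity(); i++) {
        buffer.put(i, (byte) 1);
      }
      buffer.force();
    }
  }
}
```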

@Zelldon
Member Author

Zelldon commented Nov 1, 2021

Could it be that if the disk is full and we end up in an endless loop like #8103, we print so many logs that we fill up /tmp and end up with a SIGBUS because the JVM fails with the full /tmp?

@npepinpe
Member

npepinpe commented Nov 1, 2021

Could be! I would feel a bit better about this than knowing we have to handle SIGBUS with our own memory mapped files 😄

@Zelldon
Member Author

Zelldon commented Nov 1, 2021

Was able to reproduce it.

{"severity":"DEBUG","logging.googleapis.com/sourceLocation":{"function":"storeExporterPositions","file":"ExporterPositionsDistributionService.java","line":44},"message":"Received new exporter state {MetricsExporter=674628}","serviceContext":{"service":"zeebe-broker","version":"zell-disk-test"},"context":{"threadId":27,"partitionId":"3","threadPriority":5,"loggerName":"io.camunda.zeebe.broker.exporter","threadName":"Broker-1-zb-fs-workers-2","actor-name":"Broker-1-Exporter-3"},"timestampSeconds":1635775939,"timestampNanos":140792000}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x00007fa2e754d162, pid=7, tid=62
#
# JRE version: OpenJDK Runtime Environment 18.9 (11.0.13+8) (build 11.0.13+8)
# Java VM: OpenJDK 64-Bit Server VM 18.9 (11.0.13+8, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# v  ~StubRoutines::jbyte_disjoint_arraycopy
#
# Core dump will be written. Default location: /core.%e.7.%t
#
# An error report file with more information is saved as:
# /usr/local/zeebe/data/zeebe_error7.log
[thread 60 also had an error]
[thread 61 also had an error]
Compiled method (c2)  660342 9894       4       io.camunda.zeebe.journal.file.record.SBESerializer::writeData (144 bytes)
 total in heap  [0x00007fa2ef710d90,0x00007fa2ef712d20] = 8080
 relocation     [0x00007fa2ef710f08,0x00007fa2ef710fd8] = 208
 main code      [0x00007fa2ef710fe0,0x00007fa2ef711fa0] = 4032
 stub code      [0x00007fa2ef711fa0,0x00007fa2ef711fc8] = 40
 oops           [0x00007fa2ef711fc8,0x00007fa2ef711fe0] = 24
 metadata       [0x00007fa2ef711fe0,0x00007fa2ef7120e0] = 256
 scopes data    [0x00007fa2ef7120e0,0x00007fa2ef712848] = 1896
 scopes pcs     [0x00007fa2ef712848,0x00007fa2ef712c48] = 1024
 dependencies   [0x00007fa2ef712c48,0x00007fa2ef712c58] = 16
 handler table  [0x00007fa2ef712c58,0x00007fa2ef712c88] = 48
 nul chk table  [0x00007fa2ef712c88,0x00007fa2ef712d20] = 152
Could not load hsdis-amd64.so; library not loadable; PrintAssembly is disabled
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
# 

After failing once, it seems to fail in a loop; I can't access the pod anymore:

  State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    134
      Started:      Mon, 01 Nov 2021 15:15:02 +0100
      Finished:     Mon, 01 Nov 2021 15:15:30 +0100
    Ready:          False

HOW

zeebe-values.yaml.txt

I set up the cluster (without Elastic and without the Elastic exporter), ran a benchmark, checked who is a follower, and created several big files to fill the disk:

dd if=/dev/zero of=filename bs=1G count=1
root@zell-disk-test-zeebe-1:/usr/local/zeebe/data# df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay          95G  5.5G   89G   6% /
tmpfs            64M     0   64M   0% /dev
tmpfs            30G     0   30G   0% /sys/fs/cgroup
/dev/sda1        95G  5.5G   89G   6% /exporters
shm              64M     0   64M   0% /dev/shm
/dev/sdc        4.9G  4.9G     0 100% /usr/local/zeebe/data
tmpfs            30G   12K   30G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs            30G     0   30G   0% /proc/acpi
tmpfs            30G     0   30G   0% /proc/scsi
tmpfs            30G     0   30G   0% /sys/firmware

Unfortunately the error log is empty after several restarts 🤷

root@zell-disk-test-zeebe-1:/usr/local/zeebe/data# ls -la
total 3758356
drwxr-xr-x 4 root  root       4096 Nov  1 14:17 .
drwxrwxr-x 1 zeebe root       4096 Nov  1 14:18 ..
-rw-r--r-- 1 root  root 1073741824 Nov  1 14:10 filename
-rw-r--r-- 1 root  root 1073741824 Nov  1 14:10 filename1
-rw-r--r-- 1 root  root 1073741824 Nov  1 14:10 filename2
-rw-r--r-- 1 root  root  626978816 Nov  1 14:10 filename3
-rw-r--r-- 1 root  root          0 Nov  1 14:24 gc.log
-rw-r--r-- 1 root  root     302538 Nov  1 14:11 gc.log.0
-rw-r--r-- 1 root  root        928 Nov  1 14:13 gc.log.1
-rw-r--r-- 1 root  root          0 Nov  1 14:14 gc.log.2
-rw-r--r-- 1 root  root          0 Nov  1 14:15 gc.log.3
-rw-r--r-- 1 root  root          0 Nov  1 14:16 gc.log.4
drwx------ 2 root  root      16384 Nov  1 14:01 lost+found
drwxr-xr-x 3 root  root       4096 Nov  1 14:01 raft-partition
-rw-r--r-- 1 root  root          0 Nov  1 14:12 zeebe_error7.log

@npepinpe
Member

npepinpe commented Nov 1, 2021

Are you filling the data volume or /tmp?

@Zelldon
Member Author

Zelldon commented Nov 1, 2021

Please see above: the /usr/local/zeebe/data volume.

@npepinpe
Member

npepinpe commented Nov 1, 2021

So writing to a memory-mapped file when there is no space left doesn't produce an InternalError but a SIGBUS? How come last time we observed it, it was an InternalError?

At any rate, the fix is then to pre-allocate files. We've already discussed this: pre-allocate the files before memory-mapping them to ensure we have enough disk space and don't need to deal with weird errors. See #7607.

EDIT: also, nice job on the quick reproduction 🎉

@Zelldon Zelldon removed this from In progress in Zeebe Nov 1, 2021
@npepinpe
Member

npepinpe commented Nov 2, 2021

I'll leave this in the backlog and instead we'll focus on #7607 and #7767 which should greatly mitigate this.

@Zelldon Zelldon removed their assignment Nov 2, 2021
@npepinpe npepinpe removed the 1.2.1 label Apr 11, 2022
@npepinpe npepinpe mentioned this issue Jul 8, 2022
zeebe-bors-camunda bot added a commit that referenced this issue Jul 11, 2022
9731: Preallocate segment files r=npepinpe a=npepinpe

## Description

This PR introduces segment file pre-allocation in the journal. This is on by default, but can be disabled via an experimental configuration option.

At the moment, the pre-allocation is done in a "dumb" fashion - we allocate a 4KiB block of zeroes and write it repeatedly until we've reached the expected file length. Note that this means there may be one extra block allocated on disk.
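Conceptually, the "dumb" pre-allocation looks roughly like the following (a simplified sketch, not the actual PR code; names are illustrative):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

final class SegmentPreallocator {
  private static final int BLOCK_SIZE = 4 * 1024;

  /** Writes zeroed 4KiB blocks until at least segmentSize bytes have been written. */
  static void preallocate(final FileChannel channel, final long segmentSize) throws IOException {
    final ByteBuffer zeros = ByteBuffer.allocateDirect(BLOCK_SIZE);
    long position = 0;
    while (position < segmentSize) {
      zeros.clear();
      position += channel.write(zeros, position);
    }
    // Since only whole blocks are written, the file may end up one block larger than
    // segmentSize, which matches the note above about one extra block on disk.
  }
}
```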

One thing to note: to verify this, we used [jnr-posix](https://github.com/jnr/jnr-posix). The reason is that we want to know the actual number of blocks reserved on disk for the file. `Files#size`, or `File#length`, returns the reported file size, which is part of the file's metadata (on UNIX systems anyway). If you mmap a file with a size of 1MB, write one byte, then flush it, the reported size will be 1MB, but the actual size on disk will be a single block (on most modern UNIX systems anyway). By using [stat](https://linux.die.net/man/2/stat), we can get the actual file size in terms of 512-byte allocated blocks, so we get a pretty accurate measurement of the actual disk space used by the file.
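To illustrate the reported-size vs. blocks-on-disk distinction, a sketch along these lines (assuming jnr-posix's `POSIXFactory` and `FileStat#blocks()`; the actual test helper in the PR may look different):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import jnr.posix.FileStat;
import jnr.posix.POSIX;
import jnr.posix.POSIXFactory;

final class DiskUsage {
  private static final POSIX POSIX_API = POSIXFactory.getPOSIX();

  /** Reported size from the file metadata; can be much larger than what is on disk. */
  static long reportedSize(final Path file) throws IOException {
    return Files.size(file);
  }

  /** Actual space reserved on disk: st_blocks counts 512-byte blocks. */
  static long actualSizeOnDisk(final Path file) {
    final FileStat stat = POSIX_API.stat(file.toString());
    return stat.blocks() * 512L;
  }
}
```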

I would've liked to capture this in a test utility, but since `test-util` depends on `util`, there wasn't an easy way to do this, so I just copied the method in two places. One possibility I thought of is moving the whole pre-allocation code into `journal`, since we only use it there. The only downside I can see is about discoverability and cohesion, but I'd like to hear your thoughts on this.

A follow-up PR will come which will optimize the pre-allocation by using [posix_fallocate](https://man7.org/linux/man-pages/man3/posix_fallocate.3.html) on POSIX systems.

Finally, I opted for an experimental configuration option instead of a feature flag. My reasoning is that it isn't a "new" feature; instead, we want the option of disabling this (potentially for performance reasons). So it's more of an advanced option. But I'd also like to hear your thoughts here.

## Related issues

closes #6504
closes #8099
related to #7607  



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
zeebe-bors-camunda bot added a commit that referenced this issue Jul 13, 2022
9777: [Backports stable/8.0] Preallocate segment files r=npepinpe a=npepinpe

## Description

Backports #9731 to 8.0.x.

## Related issues

closes #6504
closes #8099



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
zeebe-bors-camunda bot added a commit that referenced this issue Jul 13, 2022
9778: [Backports stable/1.3] Preallocate segment files r=npepinpe a=npepinpe

## Description

Backports #9731 to 1.3.x.

## Related issues

closes #6504
closes #8099



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
@Zelldon Zelldon added the version:8.1.0 label Oct 4, 2022