db: flush/compact memtables to in-memory sstables #3233

Open
jbowens opened this issue Jan 18, 2024 · 0 comments

jbowens commented Jan 18, 2024

For various reasons, we may accumulate many memtables within the flushable queue:

  • With the current design for WAL failover (db: WAL failover to deal with transient unavailability #3230), a temporary disk stall may result in unbounded growth of queued memtables: the stalled disk prevents flushes while writes continue to arrive.
  • In stores with high write throughput and high flush utilization, multiple memtables may queue while memtables lower in the flushable queue are still being flushed.
  • In db: consider batching memtable flushes when heavy write pressure #1421, we consider deliberately postponing flushes to allow more memtables to queue, improving the shape of L0 sublevels and reducing write amplification by allowing additional data to be elided before being written to L0 (e.g., raft log truncation, intent resolution, and overwritten expiration leases can all significantly reduce the volume of data that makes it to L0, and the more data batched within a flush, the larger the benefit).

However, memtables use significant memory. Keys and values are stored verbatim and uncompressed, even though we expect ~half of writes (the raft log) to never be read. Each KV pair also incurs at least 32 bytes of arenaskl.Node overhead, plus additional overhead for nodes whose skiplist towers are taller than the minimum height.
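
As a rough illustration of how this adds up, here is a back-of-envelope estimate in Go. The ~32-byte minimum per-entry overhead comes from the description above; the key/value sizes and entry counts are illustrative assumptions, not measurements of arenaskl:

```go
package main

import "fmt"

const (
	entryOverheadBytes = 32 // minimum per-entry arenaskl.Node overhead (taller towers cost more)
	avgKeyBytes        = 32 // assumed average key size
	avgValueBytes      = 64 // assumed average value size
)

// memtableBytes approximates the arena space consumed by n entries:
// keys and values stored verbatim plus the fixed node overhead.
func memtableBytes(n int) int {
	return n * (entryOverheadBytes + avgKeyBytes + avgValueBytes)
}

func main() {
	for _, n := range []int{100_000, 1_000_000, 10_000_000} {
		fmt.Printf("%9d entries ~= %.0f MiB resident in memtables\n",
			n, float64(memtableBytes(n))/(1<<20))
	}
}
```

With these assumptions, the fixed node overhead alone is a quarter of the memtable's footprint, before accounting for taller towers or arena fragmentation.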

@petermattis recently suggested flushing memtables to in-memory sstables, which would represent the same data more compactly: the sstable representation reduces the fixed per-key overhead and takes advantage of block prefix key compression. If the data blocks are also compressed and the data is compressible, the data is stored even more compactly, with the tradeoff that reads that must go through the table may need to duplicate some of the data, uncompressed, within the block cache. When it comes time to durably flush the state to L0, the sstable(s) may also be copied verbatim to storage. We could additionally consider "compactions" of memtables/in-memory sstables, further delaying the eventual write out to L0.
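
Much of that compactness comes from block prefix key compression. Below is a minimal sketch of the idea using a LevelDB-style shared/unshared varint encoding; it is not Pebble's actual block format, only an illustration of why sorted, prefix-heavy keys shrink once they leave the skiplist:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// appendEntry encodes one key/value pair relative to the previous key in the
// block: the length of the prefix shared with prevKey, the unshared suffix
// length, and the value length, followed by the suffix bytes and the value.
func appendEntry(dst, prevKey, key, value []byte) []byte {
	shared := 0
	for shared < len(prevKey) && shared < len(key) && prevKey[shared] == key[shared] {
		shared++
	}
	var buf [binary.MaxVarintLen64]byte
	for _, v := range []int{shared, len(key) - shared, len(value)} {
		n := binary.PutUvarint(buf[:], uint64(v))
		dst = append(dst, buf[:n]...)
	}
	dst = append(dst, key[shared:]...)
	dst = append(dst, value...)
	return dst
}

func main() {
	keys := [][]byte{
		[]byte("/Table/53/1/0001"),
		[]byte("/Table/53/1/0002"),
		[]byte("/Table/53/1/0003"),
	}
	value := []byte("v")
	var block, prev []byte
	raw := 0
	for _, k := range keys {
		block = appendEntry(block, prev, k, value)
		raw += len(k) + len(value)
		prev = k
	}
	fmt.Printf("verbatim keys+values: %d bytes, prefix-compressed block: %d bytes\n",
		raw, len(block))
}
```

In the memtable each of these keys is stored verbatim plus node overhead; in a block, the second and third entries carry only the one differing byte plus a few varints. Real blocks also periodically store full keys at restart points so they remain seekable.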

@blathers-crl blathers-crl bot added this to Incoming in Storage Jan 18, 2024
@nicktrav nicktrav moved this from Incoming to Backlog in Storage Jan 23, 2024
jbowens added a commit to jbowens/pebble that referenced this issue Feb 12, 2024
Pull out logic to transition EFOSes to file-only after a flush into a helper
function. The flush1 function is growing large and will grow larger with cockroachdb#3233.
jbowens added a commit that referenced this issue Feb 13, 2024
Pull out logic to transition EFOSes to file-only after a flush into a helper
function. The flush1 function is growing large and will grow larger with #3233.