use separate thread to compress block store #1389

Merged: 12 commits, Jun 23, 2022
Conversation

@PSeitz (Contributor) commented Jun 16, 2022

Use a separate thread to compress the block store for increased indexing performance. This allows using slower compressors with a higher compression ratio, with little or no performance impact (given enough cores).

A separate thread is spawned to compress the docstore; it handles both single blocks and stacking from other docstores.
The spawned compressor thread does not write; instead, it sends back the compressed data. This avoids multithreaded writes to the same file.
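A minimal sketch of the pattern described above (the channel layout, names, and the toy compress function are illustrative assumptions, not the actual tantivy code): uncompressed blocks travel to a dedicated compressor thread, and the compressed bytes are sent back so that only the original thread touches the write target.

```rust
use std::io::{self, Write};
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for the real block compressor (tantivy dispatches to
// the configured codec, e.g. zstd).
fn compress(block: &[u8]) -> Vec<u8> {
    block.to_vec()
}

fn main() -> io::Result<()> {
    // Blocks to compress flow one way, compressed bytes flow back.
    let (block_tx, block_rx) = mpsc::channel::<Vec<u8>>();
    let (compressed_tx, compressed_rx) = mpsc::channel::<Vec<u8>>();

    // Compressor thread: it never touches the file, it only compresses.
    let compressor = thread::Builder::new()
        .name("docstore compressor thread".to_string())
        .spawn(move || {
            for block in block_rx {
                if compressed_tx.send(compress(&block)).is_err() {
                    break; // receiver dropped, nothing left to do
                }
            }
        })?;

    // The current thread keeps exclusive ownership of the write target.
    let mut wrt: Vec<u8> = Vec::new(); // stand-in for the docstore writer
    block_tx.send(b"first block".to_vec()).expect("compressor alive");
    drop(block_tx); // closing the channel lets the compressor thread exit
    for compressed in compressed_rx {
        wrt.write_all(&compressed)?;
    }
    compressor
        .join()
        .map_err(|_| io::Error::new(io::ErrorKind::Other, "compressor thread panicked"))?;
    Ok(())
}
```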

Small benchmark: 1 GB hdfs, zstd level 8

Pre:
Total Nowait Merge: 43.30 Mb/s
Total Wait Merge: 43.28 Mb/s

Post:
Total Nowait Merge: 67.69 Mb/s
Total Wait Merge: 67.69 Mb/s

@codecov-commenter commented Jun 16, 2022

Codecov Report

Merging #1389 (4b6db03) into main (83d0c13) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1389      +/-   ##
==========================================
+ Coverage   94.29%   94.30%   +0.01%     
==========================================
  Files         236      236              
  Lines       43418    43471      +53     
==========================================
+ Hits        40942    40997      +55     
+ Misses       2476     2474       -2     
| Impacted Files | Coverage Δ |
| --- | --- |
| common/src/writer.rs | 94.11% <ø> (ø) |
| src/indexer/merger.rs | 98.97% <100.00%> (ø) |
| src/indexer/segment_serializer.rs | 98.07% <100.00%> (-0.04%) ⬇️ |
| src/indexer/segment_writer.rs | 96.40% <100.00%> (-0.01%) ⬇️ |
| src/store/index/mod.rs | 97.83% <100.00%> (ø) |
| src/store/index/skip_index_builder.rs | 100.00% <100.00%> (ø) |
| src/store/mod.rs | 99.17% <100.00%> (ø) |
| src/store/writer.rs | 100.00% <100.00%> (+1.08%) ⬆️ |
| src/schema/facet.rs | 89.88% <0.00%> (-0.06%) ⬇️ |
| ... and 11 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

src/store/writer.rs:
) -> io::Result<StoreWriter> {
    let thread_builder = thread::Builder::new().name("docstore compressor thread".to_string());

    // Data channel to send fs writes, to write only from current thread
Collaborator:

Interesting... why do we want to write only from the current thread?

PSeitz (Contributor, Author):

Even though in tantivy we create a separate file, which would be fine for a separate thread to write into, the Directory trait itself doesn't explicitly require thread-safe writes. TerminatingWrite is not Send, so the current contract is that the writers stay on the same thread.

It's unlikely to be an issue, but two threads could end up writing next to each other on the same page or cache line. So depending on the write target and the buffer synchronization between the threads, they could overwrite each other's data. Since Rust doesn't cover race conditions outside its memory model, e.g. files, I'm extra careful there.
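A small illustration of that constraint, using a plain std::io::Write in place of tantivy's TerminatingWrite (the function here is hypothetical): moving a writer into a spawned thread only compiles when the writer type is Send.

```rust
use std::io::{self, Write};
use std::thread;

// std::thread::spawn requires everything moved into the closure to be
// Send + 'static. A generic `Write` is used here instead of tantivy's
// TerminatingWrite; since TerminatingWrite is not Send, the real writer
// cannot be moved like this and therefore stays on the calling thread.
fn write_on_other_thread<W>(mut wrt: W) -> thread::JoinHandle<io::Result<()>>
where
    W: Write + Send + 'static, // drop the `Send` bound and this fails to compile
{
    thread::spawn(move || {
        wrt.write_all(b"compressed block")?;
        wrt.flush()
    })
}
```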

Collaborator:

That makes sense. Can we make TerminatingWrite: Send and simplify the code though?

Collaborator:

Ah, so you just offload compression to a different thread. Writing is still done in the same place. Maybe it is clearer that way; let me keep on reading.

@PSeitz requested a review from fulmicoton on June 16, 2022
@fulmicoton (Collaborator) commented Jun 17, 2022

A bit of analysis...

We write the docstore:
1 - when building a new segment
2a - when merging N segments. If we do not have any deletes in the segments being appended, we have an operation called stacking that makes it possible to more or less concatenate the docstores (it is a tiny bit more complicated than that, but whatever)
2b - if we have deletes, then we need to rebuild the docstore. The operation is then close to what happens in 1.

Threading can help with two things:
A - docstore compression is heavy and does not need to happen on the same thread as inverted index building. We love offloading stuff like this because it increases the indexing throughput without producing smaller segments.
B - we do not need to have the CPU wait on IO.

The current implementation will not help with B, because the IO still happens on the same thread as before.
2a is all about B. This is probably more important when using the SSTable implementation.
1 and 2b are probably more about A. (We still IO-wait, but more time is spent on CPU than on IO when building a segment. The gain is probably not negligible though.)

2a and 2b are not as important because they do not starve us of resources, and they do not impact time to search. (Finishing a merge quickly by using more cores is not very important.) They could help resource management in Quickwit, however. (If all tasks clearly take one full thread, we can more easily rely on our own scheduling and size our task thread pool by the number of cores.)

I think we want to at least do the IO on the thread that does the compression.
(That means moving the File into the thread.)

@fulmicoton (Collaborator):

Please test on a dataset that triggers merges (wikipedia is fine), and rely on the sstable dict.

@fulmicoton (Collaborator) left a review comment:

See comments on the Conversation tab.
#1389

@fulmicoton changed the title from "use seperate thread to compress block store" to "use separate thread to compress block store" on Jun 17, 2022
@PSeitz (Contributor, Author) commented Jun 17, 2022

I tested the merge operation on hdfs 14GB (2.4GB index size, 4 segments) and wikipedia 8GB (5.7GB index size, 5 segments) with sstable. It seems to be marginally faster. Merge throughput is ~50MB/s for hdfs and 75MB/s for wikipedia (measured on index size, not input size), which is slower than most disks. On indexing, no noticeable speedup was observed. I think the impact is too small to justify changing the API (adding Send to TerminatingWrite), although another upside would be that the code would be simpler.

➜  tantivy-cli git:(main) ✗ du -sh hdfs/
2,4G    hdfs/

➜  tantivy-cli git:(main) ✗ du -sh wikipedia
5,7G    wikipedia
| Run | Write on separate thread | Write on one thread |
| --- | --- | --- |
| hdfs 1. run | 50.43 secs | 52.63 secs |
| hdfs 2. run | 48.38 secs | 50.57 secs |
| wikipedia 1. run | 71.74 secs | 77.27 secs |
| wikipedia 2. run | 79.46 secs | 77.25 secs |
| wikipedia 3. run | 81.64 secs | |

@fulmicoton (Collaborator):

@PSeitz Thanks for investigating! That makes sense. We don't flush or anything, so the writes just push the data to the OS buffer, and the actual write to disk is done asynchronously by the OS, provided our throughput does not beat the hardware.
The docstore actually writes a lot of data, but after compression it is fine.
I suspect you would see different results with a lesser hard drive like EBS (gp2 is 250 MiB/s, to be split between writing a split and merging at the same time, etc.).

Anyway, can you move the write to a different thread, if only to simplify the code?
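A sketch of what that simplification could look like, assuming the writer were made Send as discussed above (the function, names, and signatures are illustrative, not the actual patch): the compressor thread owns the writer, so compressed blocks are written where they are produced and nothing is sent back except the final result.

```rust
use std::io::{self, Write};
use std::sync::mpsc::{channel, Sender};
use std::thread::{Builder, JoinHandle};

// Sketch of the simplification: the compressor thread owns the writer, so
// compressed blocks are written where they are produced, and the writer is
// handed back through the JoinHandle when the channel closes.
fn spawn_compressor<W: Write + Send + 'static>(
    mut wrt: W,
) -> io::Result<(Sender<Vec<u8>>, JoinHandle<io::Result<W>>)> {
    let (block_tx, block_rx) = channel::<Vec<u8>>();
    let handle = Builder::new()
        .name("docstore compressor thread".to_string())
        .spawn(move || {
            for block in block_rx {
                let compressed = block; // placeholder for the actual codec
                wrt.write_all(&compressed)?;
            }
            wrt.flush()?;
            Ok(wrt)
        })?;
    Ok((block_tx, handle))
}
```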

src/store/writer.rs:
}

/// Flushes current uncompressed block and sends to compressor.
fn send_current_block_to_compressor(&mut self) -> io::Result<()> {
Collaborator:

I think we can return the SendError directly here.
See discussion on call site.
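A hypothetical sketch of that suggestion (names are illustrative): returning the SendError directly only tells the caller that the compressor thread has already hung up, which is why the following discussion prefers harvesting the real error from the thread itself.

```rust
use std::sync::mpsc::{SendError, Sender};

// Sketch only: a send failure means the compressor thread has terminated
// (normally, with an error, or via panic), so the SendError itself carries
// little information about what went wrong.
fn send_block_sketch(
    sender: &Sender<Vec<u8>>,
    block: Vec<u8>,
) -> Result<(), SendError<Vec<u8>>> {
    sender.send(block)
}
```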

let start_shift = self.writer.written_bytes() as usize;
pub fn stack(&mut self, store_reader: StoreReader) -> io::Result<()> {
// We flush the current block first before stacking
self.send_current_block_to_compressor()?;
Collaborator:

Both errors are actually errors on the compressing thread.

If there is an error, could we join the companion thread and return:

  • its io::Error if it has one
  • a custom io::Error if it panicked

The code that joins/harvests the error could be factored out into an independent method.

PSeitz (Contributor, Author):

That's a good idea. I don't like the current error handling here; it's not deterministic which error is returned. But to join the thread we need to consume self. We could swap it with another handle or put it in an Option; I don't really like either of those.
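One possible shape for the Option-based variant mentioned here (a sketch with illustrative names, not the actual code): keeping the JoinHandle in an Option lets it be taken from &mut self and joined without consuming self.

```rust
use std::io;
use std::thread::JoinHandle;

// Illustrative only: the Option allows joining the compressor thread from
// &mut self (via take()) instead of consuming self.
struct StoreWriterSketch {
    compressor_thread_handle: Option<JoinHandle<io::Result<()>>>,
}

impl StoreWriterSketch {
    /// Join the compressor thread and surface its io::Error, or a custom
    /// io::Error if the thread panicked. Subsequent calls are no-ops.
    fn harvest_compressor_result(&mut self) -> io::Result<()> {
        match self.compressor_thread_handle.take() {
            Some(handle) => handle
                .join()
                .map_err(|_| io::Error::new(io::ErrorKind::Other, "compressor thread panicked"))?,
            None => Ok(()),
        }
    }
}
```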

self.send_current_block_to_compressor()?;
drop(self.compressor_sender);

self.compressor_thread_handle
Collaborator:

Same as above.

@fulmicoton (Collaborator) left a review comment:

Approved, but please have a look at the change suggestion and the error handling suggestion. The latter is very optional or can be done later.

PSeitz and others added 12 commits June 23, 2022 15:34
Co-authored-by: Paul Masurel <paul@quickwit.io>
3 participants