
WIP: S2++ #846

Draft · wants to merge 13 commits into master
Conversation

@klauspost (Owner) commented Aug 7, 2023

Aim

Improve the encoding method of S2, keeping read compatibility as follows:

  • Output from previous versions can be decompressed.
  • Output from new versions requires a new version to decompress.
  • Blocks from the new version will always produce an error when decoded with an incompatible version.
| Version | Snappy Decoder | S2 Decoder | S2++ Decoder |
|----------------|----------------|------------|--------------|
| Snappy Encoder | ✔️ | ✔️ | ✔️ |
| S2 Encoder | | ✔️ | ✔️ |
| S2++ Encoder | | | ✔️ |

Only changes that provide significant improvements with no decompression speed penalty will be considered.
No reduction in seek functionality is accepted.

Method

Fixes the biggest mistake in Snappy (though the affected feature is extremely rarely used in Snappy), and also implements more efficient repeat codes.

If the first bytes of a block are `0x80, 0x00, 0x00` (copy, 2-byte offset = 0),
this indicates that all Copy with 2-byte offset (10)
and Copy with 4-byte offset (11) tags change meaning for the remainder of the block.

There can be no literals before this tag and no repeats before a match as specified above.
This will only trigger on this exact tag.
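
As a rough sketch, a decoder could detect the mode switch with a plain prefix check before entering its main loop. This is illustrative only; the helper name is not from the actual implementation:

```go
package main

import (
	"bytes"
	"fmt"
)

// s2ppIndicator is the 3-byte sequence described above: a copy with
// 2-byte offset 0, which is never valid as the first tag of a block
// in current decoders.
var s2ppIndicator = []byte{0x80, 0x00, 0x00}

// hasModeSwitch reports whether the Copy2/Copy4 tags change meaning
// for the remainder of this block.
func hasModeSwitch(block []byte) bool {
	return bytes.HasPrefix(block, s2ppIndicator)
}

func main() {
	fmt.Println(hasModeSwitch([]byte{0x80, 0x00, 0x00, 0x0C})) // true: skip 3 bytes, decode in new mode
	fmt.Println(hasModeSwitch([]byte{0x0C, 'a', 'b', 'c'}))    // false: decode as plain S2
}
```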

Discussion

Blocks below 64K do not need to add this; for them it would just be 3 wasted bytes.
65536 could be added to the base offset value, but having a 16MB maximum backreference seems neater.

Using a 3-byte indicator, since a block can start with an initial repeat. Having this as the first tag of a block will always be invalid in current decoders.

It seems the encoder can unconditionally enable this when a block is >64K. The sizes below are with it enabled for all blocks, using 4MB blocks. It is pretty much always better unless the content is merely stored.

Consider whether the old repeat codes should be disabled in this mode (probably). Yes.

Sizes

Percentages are calculated as the reduction in output size, and as that reduction expressed as a percentage of the input size.
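
For example, gob-stream at level 1: (347633082 − 297164561) / 347633082 ≈ 14.52% smaller output, and (347633082 − 297164561) / 1911399616 ≈ 2.64% of the input size.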

| File | Level | Input | Output (before) | Output (after) | Reduction | vs. input size |
|------|-------|-------|-----------------|----------------|-----------|----------------|
| gob-stream | 1 | 1911399616 | 347633082 | 297164561 | -14.52% | -2.64% |
| gob-stream | 2 | 1911399616 | 303776251 | 269233350 | -11.37% | -1.81% |
| gob-stream | 3 | 1911399616 | 258013815 | 224782856 | -12.88% | -1.74% |
| silesia.tar | 1 | 211947520 | 96899588 | 91660668 | -5.41% | -2.47% |
| silesia.tar | 2 | 211947520 | 87166102 | 82145738 | -5.76% | -2.37% |
| silesia.tar | 3 | 211947520 | 79612333 | 74518937 | -6.40% | -2.40% |
| enwik9 | 1 | 1000000000 | 487526653 | 460036037 | -5.64% | -2.75% |
| enwik9 | 2 | 1000000000 | 416581621 | 392514719 | -5.78% | -2.41% |
| enwik9 | 3 | 1000000000 | 370860824 | 341953796 | -7.79% | -2.89% |
| github-june-2days-2019.json | 1 | 6273951764 | 1041705230 | 940405663 | -9.72% | -1.61% |
| github-june-2days-2019.json | 2 | 6273951764 | 944873043 | 881830595 | -6.67% | -1.00% |
| github-june-2days-2019.json | 3 | 6273951764 | 826384742 | 764962673 | -7.43% | -0.98% |
| github-ranks-backup.bin | 1 | 1862623243 | 623833007 | 598949133 | -3.99% | -1.34% |
| github-ranks-backup.bin | 2 | 1862623243 | 568441528 | 536791344 | -5.57% | -1.70% |
| github-ranks-backup.bin | 3 | 1862623243 | 553965705 | 508220735 | -8.26% | -2.46% |
| nyc-taxi-data-10M.csv | 1 | 3325605752 | 1093518508 | 937134605 | -14.30% | -4.70% |
| nyc-taxi-data-10M.csv | 2 | 3325605752 | 884711223 | 776582738 | -12.22% | -3.25% |
| nyc-taxi-data-10M.csv | 3 | 3325605752 | 773678211 | 663806572 | -14.20% | -3.30% |
| apache.log | 1 | 2622574440 | 230523580 | 188006334 | -18.44% | -1.62% |
| apache.log | 2 | 2622574440 | 217884490 | 173645540 | -20.30% | -1.69% |
| apache.log | 3 | 2622574440 | 185357903 | 146077255 | -21.19% | -1.50% |
| consensus.db.10gb | 1 | 10737418240 | 4549768015 | 4332822720 | -4.77% | -2.02% |
| consensus.db.10gb | 2 | 10737418240 | 4416692817 | 4299355082 | -2.66% | -1.09% |
| consensus.db.10gb | 3 | 10737418240 | 4210593068 | 4095105829 | -2.74% | -1.08% |
| rawstudio-mint14.tar | 1 | 8558382592 | 4413947468 | 4241234066 | -3.91% | -2.02% |
| rawstudio-mint14.tar | 2 | 8558382592 | 4101956347 | 3962581837 | -3.40% | -1.63% |
| rawstudio-mint14.tar | 3 | 8558382592 | 3905189070 | 3781979945 | -3.16% | -1.44% |
| 10gb.tar | 1 | 10065157632 | 5915543454 | 5733844651 | -3.07% | -1.81% |
| 10gb.tar | 2 | 10065157632 | 5486469704 | 5271029444 | -3.93% | -2.14% |
| 10gb.tar | 3 | 10065157632 | 5192490218 | 4979564326 | -4.10% | -2.12% |
| sofia-air-quality-dataset.tar | 1 | 15464463872 | 4991766468 | 4665176521 | -6.54% | -2.11% |
| sofia-air-quality-dataset.tar | 2 | 15464463872 | 4432998200 | 4032382208 | -9.04% | -2.59% |
| sofia-air-quality-dataset.tar | 3 | 15464463872 | 4017422246 | 3657874273 | -8.95% | -2.32% |

If the first bytes of a block are `0x40, 0x00` (repeat, length 4), this indicates that all [Copy with 4-byte offset (11)](https://github.com/google/snappy/blob/main/format_description.txt#L106) offsets are 3 bytes instead for the remainder of the block.

There can be no literals before this tag and no repeats before a match as specified above.
This will only trigger on this exact tag.

> These are like the copies with 2-byte offsets (see previous subsection),
> except that the offset is stored as a 24-bit integer instead of a
> 16-bit integer (and thus will occupy three bytes).

When in this mode the maximum backreference offset is 16777215.

This *cannot* be combined with dictionaries.
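
A minimal decoding sketch for this mode, assuming the 24-bit offset is stored little-endian like Snappy's existing 2-byte offsets (the helper name is illustrative, not from this PR):

```go
// readOffset24 reads the 3-byte offset that replaces the usual 4-byte
// Copy4 offset while this mode is active. Little-endian byte order is
// assumed here, matching Snappy's 2-byte offsets.
func readOffset24(src []byte) (offset uint32, ok bool) {
	if len(src) < 3 {
		return 0, false // truncated input; a real decoder would error out
	}
	offset = uint32(src[0]) | uint32(src[1])<<8 | uint32(src[2])<<16
	return offset, true // at most 1<<24-1 = 16777215
}
```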

@klauspost (Owner, Author) commented:
Attempted offset delta encoding (-16 to +16, lengths 1-16). Extremely small hit rate; not worth the complexity.

@klauspost (Owner, Author) commented Sep 18, 2023

Experiment with using 1 bit from the long-offset copy tag to indicate repeats.

This limits long-offset copies to length 32, down from 64, before forcing a repeat.

Repeat lengths are encoded as:

// 0-28: Length 1 -> 29
// 29: Length (Read 1) + 1
// 30: Length (Read 2) + 1
// 31: Length (Read 3) + 1

Copy lengths are encoded as:

// 0-28: Length 4 -> 32
// 29: Length (Read 1) + 4
// 30: Length (Read 2) + 4
// 31: Length (Read 3) + 4
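
Taking the two code tables above literally, a decoder sketch could look like the following (the shared base is 1 for repeats and 4 for copies; little-endian extra bytes and the helper name are assumptions, not from this PR):

```go
// decodeLength decodes the 5-bit length code above plus any extra
// bytes it consumes. base is 1 for repeats and 4 for copies; bounds
// checks on extra are omitted for brevity.
func decodeLength(code uint8, extra []byte, base int) (length, consumed int) {
	switch code {
	case 29:
		return int(extra[0]) + base, 1 // read 1 extra byte
	case 30:
		return (int(extra[0]) | int(extra[1])<<8) + base, 2 // read 2
	case 31:
		return (int(extra[0]) | int(extra[1])<<8 | int(extra[2])<<16) + base, 3 // read 3
	default: // 0-28: inline lengths base .. base+28
		return int(code) + base, 0
	}
}
```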
| Input | Level | Improvement |
|-------|-------|-------------|
| gob-stream | 1 | 1.94% |
| gob-stream | 2 | -0.28% |
| gob-stream | 3 | 0.87% |
| silesia.tar | 1 | 1.10% |
| silesia.tar | 2 | 0.54% |
| silesia.tar | 3 | 0.89% |
| enwik9 | 1 | 0.24% |
| enwik9 | 2 | 0.02% |
| enwik9 | 3 | 0.06% |
| github-june-2days-2019.json | 1 | 0.63% |
| github-june-2days-2019.json | 2 | 0.19% |
| github-june-2days-2019.json | 3 | 0.33% |
| github-ranks-backup.bin | 1 | 0.35% |
| github-ranks-backup.bin | 2 | -0.45% |
| github-ranks-backup.bin | 3 | 0.02% |
| nyc-taxi-data-10M.csv | 1 | 1.27% |
| nyc-taxi-data-10M.csv | 2 | 0.53% |
| nyc-taxi-data-10M.csv | 3 | 0.72% |
| apache.log | 1 | 1.11% |
| apache.log | 2 | 1.21% |
| apache.log | 3 | 1.39% |
| consensus.db.10gb | 1 | 0.85% |
| consensus.db.10gb | 2 | -0.10% |
| consensus.db.10gb | 3 | 0.01% |
| rawstudio-mint14.tar | 1 | 1.23% |
| rawstudio-mint14.tar | 2 | 0.57% |
| rawstudio-mint14.tar | 3 | 0.99% |
| 10gb.tar | 1 | 0.35% |
| 10gb.tar | 2 | 0.20% |
| 10gb.tar | 3 | 0.32% |
| sofia-air-quality-dataset.tar | 1 | -0.31% |
| sofia-air-quality-dataset.tar | 2 | -1.79% |
| sofia-air-quality-dataset.tar | 3 | -3.04% |

So the gains mainly depend on how many repeats there are compared to long offsets (with long lengths). Only sofia has a notable regression; only inputs with many long offsets of length 32-64 should expect one.

This can remove the literals=63 change, which makes the overall change cleaner.

OP updated.

@klauspost (Owner, Author) commented Nov 17, 2023

Added variable length encoding to TagCopy2 as well. Good improvement and simplifies encoding decisions.

@klauspost (Owner, Author) commented Nov 24, 2023

Using 1 more bit for the length in TagCopy4 gives a very reasonable improvement.

Hard to make simple, though.

@klauspost (Owner, Author) commented Dec 6, 2023

Experimenting with using Copy1 length 11 as indicator for extra (length-64):

Negative means percentage smaller output:

(combined table below)

Undecided...

@klauspost (Owner, Author) commented Dec 6, 2023

Using 10 bits (max 1024) for offset:

Negative means percentage smaller output:

(table below)

Again, inconclusive....

@klauspost (Owner, Author) commented Dec 6, 2023

Using 10-bit (max 1024) offsets + 4-bit lengths, with the last length value reading 1 additional byte (base 16). Results for all variants:

| File | Level | Extra-len64 | 1024 offset | Both |
|------|-------|-------------|-------------|------|
| gob-stream | 1 | -0.06% | 0.04% | -0.02% |
| gob-stream | 2 | 0.01% | 0.02% | -0.09% |
| gob-stream | 3 | -0.04% | 0.09% | 0.05% |
| silesia.tar | 1 | 0.15% | 0.03% | 0.01% |
| silesia.tar | 2 | 0.19% | 0.09% | 0.05% |
| silesia.tar | 3 | 0.13% | 0.11% | 0.03% |
| enwik9 | 1 | 0.25% | 0.20% | 0.18% |
| enwik9 | 2 | 0.30% | 0.39% | 0.37% |
| enwik9 | 3 | 0.29% | 0.23% | 0.16% |
| github-june-2days-2019.json | 1 | 0.09% | -1.03% | -1.19% |
| github-june-2days-2019.json | 2 | 0.07% | -1.42% | -1.67% |
| github-june-2days-2019.json | 3 | -0.01% | -1.05% | -1.28% |
| github-ranks-backup.bin | 1 | 1.27% | -1.36% | -1.37% |
| github-ranks-backup.bin | 2 | 0.88% | -1.34% | -1.36% |
| github-ranks-backup.bin | 3 | 0.61% | -1.30% | -1.33% |
| nyc-taxi-data-10M.csv | 1 | -0.26% | 0.45% | 0.14% |
| nyc-taxi-data-10M.csv | 2 | -0.27% | 0.45% | -0.02% |
| nyc-taxi-data-10M.csv | 3 | -0.56% | 0.04% | -0.51% |
| apache.log | 1 | -4.21% | -0.38% | -3.90% |
| apache.log | 2 | -5.29% | -0.79% | -5.61% |
| apache.log | 3 | -5.36% | -0.32% | -4.59% |
| consensus.db.10gb | 1 | -0.38% | -0.02% | -0.44% |
| consensus.db.10gb | 2 | -0.40% | -0.06% | -0.51% |
| consensus.db.10gb | 3 | -0.36% | -0.21% | -0.58% |
| rawstudio-mint14.tar | 1 | 0.08% | -0.12% | -0.16% |
| rawstudio-mint14.tar | 2 | 0.13% | -0.13% | -0.18% |
| rawstudio-mint14.tar | 3 | 0.07% | -0.03% | -0.10% |
| 10gb.tar | 1 | 0.06% | 0.50% | 0.48% |
| 10gb.tar | 2 | 0.07% | 0.33% | 0.30% |
| 10gb.tar | 3 | 0.06% | 0.22% | 0.18% |
| sofia-air-quality-dataset.tar | 1 | 0.05% | 0.39% | 0.39% |
| sofia-air-quality-dataset.tar | 2 | 0.08% | 0.44% | 0.44% |
| sofia-air-quality-dataset.tar | 3 | 0.13% | 0.67% | 0.64% |

@klauspost (Owner, Author) commented Dec 6, 2023

Offset before length extension? Probably faster for decoding - but maybe more tedious for encoding?

@klauspost (Owner, Author) commented:

Minimum offset 1 will eliminate a lot of 0 checks.
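
That is, the wire format could store offset−1, so every decodable value maps to a valid offset ≥ 1 and the decoder never needs an offset==0 validity check. A sketch of the idea under that reading (function names are hypothetical):

```go
// Store offsets biased by 1: a stored value of 0 means offset 1, so a
// zero offset can never appear on the wire and the decoder's 0-check
// disappears.
func storeOffset(offset uint32) uint32 { return offset - 1 } // caller guarantees offset >= 1
func loadOffset(stored uint32) uint32  { return stored + 1 } // always >= 1 by construction
```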

@klauspost (Owner, Author) commented:

Using one more bit for extra Copy4 length (+28) is just too good to leave out.

@klauspost (Owner, Author) commented:

Length after offset.

@klauspost (Owner, Author) commented:

Allow 0-3 literals in copy depending on uncompressed position. Table updated.

@klauspost (Owner, Author) commented:

(project will be published under minio)
