flate: Improve level 1-3 compression #678

Merged
merged 1 commit into from Sep 25, 2022

@klauspost klauspost commented Sep 25, 2022

Use 5 byte hash instead of 4 byte hash.

This improves compression in most cases and will also yield faster decompression. Little to no performance impact.

Before/after:

file	out	level	insize	outsize	millis	mb/s
nyc-taxi-data-10M.csv	gzkp	1	3325605752	922273214	14065	225.49
nyc-taxi-data-10M.csv	gzkp	1	3325605752	846471964	14342	221.12

nyc-taxi-data-10M.csv	gzkp	2	3325605752	883782053	15683	202.22
nyc-taxi-data-10M.csv	gzkp	2	3325605752	815766227	14865	213.35

nyc-taxi-data-10M.csv	gzkp	3	3325605752	878726683	17308	183.24
nyc-taxi-data-10M.csv	gzkp	3	3325605752	808448239	16882	187.86

nyc-taxi-data-10M.csv	gzkp	4	3325605752	789447233	20651	153.57
nyc-taxi-data-10M.csv	gzkp	4	3325605752	789447233	20657	153.53

file	out	level	insize	outsize	millis	mb/s
enwik9	gzkp	1	1000000000	382781160	5713	166.90
enwik9	gzkp	1	1000000000	374131553	5826	163.69

enwik9	gzkp	2	1000000000	371351753	6131	155.55
enwik9	gzkp	2	1000000000	361881529	5910	161.36

enwik9	gzkp	3	1000000000	364881746	6891	138.39
enwik9	gzkp	3	1000000000	355065173	6960	137.02

enwik9	gzkp	4	1000000000	342732211	8339	114.36
enwik9	gzkp	4	1000000000	342732211	8252	115.57

file	reset	out	level	files	insize	outsize	millis	mb/s
objectfiles	true	gzkp	1	708	300491980	56114777	1008	284.27
objectfiles	true	gzkp	1	708	300491980	55300071	998	286.90

objectfiles	true	gzkp	2	708	300491980	53946448	1147	249.71
objectfiles	true	gzkp	2	708	300491980	52750260	1109	258.36

objectfiles	true	gzkp	3	708	300491980	53110452	1220	234.82
objectfiles	true	gzkp	3	708	300491980	51947585	1211	236.46


One of the few regressions:

file	out	level	insize	outsize	millis	mb/s
rawstudio-mint14.tar	gzkp	1	8558382592	3960117298	36682	222.50
rawstudio-mint14.tar	gzkp	1	8558382592	3985295228	36619	222.88

rawstudio-mint14.tar	gzkp	2	8558382592	3899597850	38683	210.99
rawstudio-mint14.tar	gzkp	2	8558382592	3921716642	36754	222.06

rawstudio-mint14.tar	gzkp	3	8558382592	3848762302	46588	175.19
rawstudio-mint14.tar	gzkp	3	8558382592	3846475496	45611	178.94

Use 5 byte hash instead of 4 byte hash.

This improves compression in most cases and will also yield faster decompression. Little to no performance impact.

Before/after:
```
file	out	level	insize	outsize	millis	mb/s
nyc-taxi-data-10M.csv	gzkp	1	3325605752	922273214	14065	225.49
nyc-taxi-data-10M.csv	gzkp	1	3325605752	846471964	14564	217.76

nyc-taxi-data-10M.csv	gzkp	2	3325605752	883782053	15683	202.22
nyc-taxi-data-10M.csv	gzkp	2	3325605752	815766227	15057	210.63

nyc-taxi-data-10M.csv	gzkp	3	3325605752	878726683	17308	183.24
nyc-taxi-data-10M.csv	gzkp	3	3325605752	807241782	17184	184.56

nyc-taxi-data-10M.csv	gzkp	4	3325605752	789447233	20651	153.57
nyc-taxi-data-10M.csv	gzkp	4	3325605752	789447233	20862	152.02

file	out	level	insize	outsize	millis	mb/s
enwik9	gzkp	1	1000000000	382781160	5713	166.90
enwik9	gzkp	1	1000000000	374131553	5926	160.90

enwik9	gzkp	2	1000000000	371351753	6131	155.55
enwik9	gzkp	2	1000000000	361881529	6007	158.74

enwik9	gzkp	3	1000000000	364881746	6891	138.39
enwik9	gzkp	3	1000000000	355065173	7043	135.39

enwik9	gzkp	4	1000000000	342732211	8339	114.36
enwik9	gzkp	4	1000000000	342732211	8327	114.52
```
@klauspost klauspost merged commit b8a3c61 into master Sep 25, 2022
@klauspost klauspost deleted the improve-l1-3-flate-compression branch September 25, 2022 16:41