New Feature Add! Delta compression ratio can reach up to 77.88x! #245

Open
apple-ouyang opened this issue May 10, 2022 · 5 comments

@apple-ouyang
Contributor

Description

Titan can now use delta compression.
Here is my code repository.
According to the test results, the compression ratio for delta-compressed records can reach 77.88x.
However, the reduction in the on-disk database size is much smaller.
See the test results below.

Delta Compression procedure

  1. Every call to Put generates a feature of the record using the Odess similarity detection method.
  2. The feature of the record is stored in a feature index table. Each column family has its own table.
  3. During GC, every valid record is looked up in the table by feature to find similar records.
  4. Once similar records are found in the table, they are compressed into one base record plus multiple deltas.
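
The four steps above can be sketched roughly as follows. This is only an illustration, not the code from the repository: the feature function is a simplified min-hash stand-in rather than Odess, and `WINDOW`, `NUM_FEATURES`, `FeatureIndex`, and `plan_gc` are hypothetical names. The delta encoding itself (kGDelta, kXDelta, or kEDelta in the results) is not shown.

```python
import hashlib
from collections import defaultdict

WINDOW = 8        # bytes per sliding window (illustrative value)
NUM_FEATURES = 3  # features kept per record (illustrative value)


def features(value: bytes) -> tuple:
    """Simplified stand-in for a similarity feature (NOT Odess):
    for each salt, keep the smallest hash over all sliding windows."""
    if len(value) < WINDOW:
        windows = [value]
    else:
        windows = [value[i:i + WINDOW] for i in range(len(value) - WINDOW + 1)]
    return tuple(
        min(hashlib.blake2b(w, digest_size=8, salt=bytes([salt])).digest()
            for w in windows)
        for salt in range(NUM_FEATURES)
    )


class FeatureIndex:
    """Hypothetical per-column-family feature index table."""

    def __init__(self):
        self._by_feature = defaultdict(set)  # (slot, feature) -> keys
        self._by_key = {}                    # key -> features

    def put(self, key, value: bytes):
        # Steps 1 and 2: compute the features on Put and store them.
        feats = features(value)
        self._by_key[key] = feats
        for slot, feat in enumerate(feats):
            self._by_feature[(slot, feat)].add(key)

    def similar(self, key) -> set:
        # Step 3: records sharing at least one feature count as similar.
        found = set()
        for slot, feat in enumerate(self._by_key[key]):
            found |= self._by_feature[(slot, feat)]
        found.discard(key)
        return found


def plan_gc(index: FeatureIndex, live_keys):
    """Step 4: group live records into (base key, [keys stored as deltas])."""
    live = set(live_keys)
    assigned = set()
    plan = []
    for key in live_keys:
        if key in assigned:
            continue
        group = [k for k in sorted(index.similar(key))
                 if k in live and k not in assigned]
        assigned.add(key)
        assigned.update(group)
        plan.append((key, group))
    return plan
```

In this sketch, a record with no similar neighbours ends up as a base with an empty delta list, which corresponds to a record that is written back unchanged during GC.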

Question

I want to test the impact of delta compression on Titan.
But I see two tools for testing:

  1. the scripts in /tools
  2. go-ycsb, used in this article

Here are my questions:

  1. If I use the scripts in /tools, there are a lot of workloads. Which should I choose?
  2. If I use go-ycsb, are there parameters I can use so that my results are comparable with those in this article?

Test result

Here is the summary result of titan_delta_compression_test:

Enron Email

517401 records have been put into the Titan database!

1.40GB(1420666341) are the size of keys and values

59113 (11.42%) is the number of similar records that can be delta compressed

| method | compress fail | compress success | delta size | delta after size | delta compress ratio | compress time |
|---|---|---|---|---|---|---|
| kGDelta | 0 | 97799 | 978.90MB | 17.10MB | 57.48 | 1.05s |
| kXDelta | 0 | 97799 | 978.90MB | 12.60MB | 77.88 | 11.60s |
| kEDelta | 0 | 97799 | 978.90MB | 25.30MB | 38.83 | 1.40s |

| method | database size | database after size | database compress ratio | blob files size | blob files after size | blob file compress ratio |
|---|---|---|---|---|---|---|
| kGDelta | 1.20GB | 974.50MB | 1.17 | 386.90MB | 149.00MB | 2.60 |
| kXDelta | 1.20GB | 974.50MB | 1.17 | 386.90MB | 149.00MB | 2.60 |
| kEDelta | 1.20GB | 974.50MB | 1.17 | 386.90MB | 149.00MB | 2.60 |
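
The ratio columns in these results are the size before compression divided by the size after. A minimal sketch for recomputing them from the printed sizes is below. `parse_size` and `ratio` are hypothetical helpers, and the 1024-based units are an assumption. The printed sizes look rounded, so the recomputed values only approximate the reported ratios.

```python
# Assumption: units are 1024-based; the test tool may format sizes differently.
UNITS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3, "B": 1}


def parse_size(text: str) -> float:
    """Convert a string such as '978.90MB' into a number of bytes."""
    text = text.strip()
    # Check the two-letter units first so 'MB' is not mistaken for 'B'.
    for unit in ("KB", "MB", "GB", "B"):
        if text.endswith(unit):
            return float(text[:-len(unit)]) * UNITS[unit]
    raise ValueError(f"unknown size unit in {text!r}")


def ratio(before: str, after: str) -> float:
    """Compression ratio: size before divided by size after."""
    return parse_size(before) / parse_size(after)


# kXDelta on Enron Email, using the rounded sizes as printed.
enron_kxdelta = ratio("978.90MB", "12.60MB")
# Blob files on Enron Email.
enron_blob = ratio("386.90MB", "149.00MB")
```

With the printed Enron Email sizes, `enron_kxdelta` comes out close to the reported 77.88 and `enron_blob` close to the reported 2.60.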

Wikipedia

1367732 records have been put into the Titan database!

19.10GB(20402694776) are the size of keys and values

731224 (53.46%) is the number of similar records that can be delta compressed

| method | compress fail | compress success | delta size | delta after size | delta compress ratio | compress time |
|---|---|---|---|---|---|---|
| kGDelta | 16 | 729411 | 9.70GB | 1.50GB | 6.61 | 69.51s |
| kXDelta | 0 | 729427 | 9.70GB | 741.70MB | 13.34 | 360.59s |
| kEDelta | 19 | 729408 | 9.70GB | 2.20GB | 4.58 | 81.94s |

| method | database size | database after size | database compress ratio | blob files size | blob files after size | blob file compress ratio |
|---|---|---|---|---|---|---|
| kGDelta | 7.80GB | 7.30GB | 1.08 | 7.50GB | 6.80GB | 1.10 |
| kXDelta | 7.80GB | 7.30GB | 1.08 | 7.50GB | 6.80GB | 1.10 |
| kEDelta | 7.80GB | 7.30GB | 1.08 | 7.50GB | 6.80GB | 1.10 |

@Connor1996
Member

Good job. What does each method mean?

@Connor1996
Member

So you developed a new delta compression algorithm, right? I'm curious about the overhead.

@Connor1996
Member

As for your question, you can just use db_bench in /tools and choose the workload according to your needs.

@apple-ouyang
Contributor Author

apple-ouyang commented Jun 3, 2022 via email

@cscetbon

cscetbon commented Mar 1, 2023

@apple-ouyang Do you have any other measurements, such as the impact on CPU load, disk I/O, etc.?

Also, how does it compare to zstd?
