
Add gzip compression #196

Merged
merged 1 commit into main from add-gzip-compression on Oct 11, 2022

Conversation

@mdedetrich (Collaborator) commented Apr 14, 2022

About this change - What it does

This PR adds GZip compression along with a suite of tests to make sure that the feature works.

Resolves: #195

Why this way

The first thing to note about GZip compression is that, because Guardian is stateless, the only way it knows whether something is compressed is by the key/object/file extension (i.e. whether it ends with .gz). This constraint influences how the implementation is done:

  • If an upload is currently in progress and was configured without compression, and the backup client is then restarted with compression enabled, that in-progress upload will not have its compression settings changed. The change only applies to the next upload.
    • This is because, with object storages such as S3, there is no way to download a chunk from an in-progress multipart upload, which means it's not possible to download the data and compress or uncompress it to apply the change immediately.
    • Even if you could do this (e.g. with a hypothetical file-based backup), it still wouldn't be appropriate, because you would have to go back and compress/uncompress the entire current upload. If you specify a large upload period (say 1 day) that can be a LOT of data.
    • The ConfigurationChangeRestartSpec tests these changes, i.e. it verifies that when you restart the BackupClient with different settings (in the context of this PR, compression) the current upload is unchanged and the next upload has the new settings applied.
  • Since gzip-compressed files end in .gz, the RestoreClient can easily figure out which files need to be uncompressed and which ones don't.
  • To avoid a huge amount of boilerplate test code, the tests have been abstracted so that you can extend a test and specify which compression you want.
    • As an example, all of the test logic in BackupClientInterfaceSpec has been moved into a trait called BackupClientInterfaceTest. That trait has a compression field, so the intention is that you have test classes that extend it (in this case BackupClientInterfaceSpec, which is the original test, and GzipCompressionBackupClientInterfaceSpec, which is the original test with gzip compression enabled). This allows you to easily run the entire test suite with or without compression and can easily be extended for future features.
      • The same applies to RestoreClientInterfaceTest, RealS3BackupClientTest and RealS3RestoreClientTest.
      • Due to the files being renamed, the GitHub diff doesn't show a rename diff; instead it shows a brand new file plus a massive diff of the original file with all of its contents removed. Because of this, an easier way to see the changes is to put the contents of the new BackupClientInterfaceTest and the current BackupClientInterfaceSpec into a diff tool (the same applies to the other tests).
  • mergeBackedUpData in MockedBackupClientInterface has been updated to deal with the fact that the mocks now have to decompress the ByteString data for certain tests. A compression parameter has been added to the method that lets us indicate that we want to decompress the data after merging it.
  • New models have been added to deal with these changes (a rough sketch follows this list):
    • BackupObjectMetadata, which contains metadata about a backup. Currently it only records whether the backup is compressed, but it's intended that this data structure will be extended for future features. There is also a BackupObjectMetadata.fromKey method that creates a BackupObjectMetadata from a key.
    • StateDetails, which replaces the previous State in order to combine it with a BackupObjectMetadata.
    • Compression, a data structure that contains a CompressionType along with any details for that compression (such as a compression level).
    • CompressionType, which is the type of compression (currently only Gzip).
  • The CLI tool has been updated with a subcommand for compression. To run the Guardian CLI with compression enabled you just add gzip to toggle it on; if gzip is enabled you can also add the optional --compression-level to specify the level. See the CliSpec changes for more info.
  • Documentation has also been added for the BackupConfig.
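
Below is a rough sketch of how the new models described above could fit together. The names Compression, CompressionType, Gzip, BackupObjectMetadata and BackupObjectMetadata.fromKey come from this description; the exact fields, packages and signatures are assumptions and may differ from the actual code.

```scala
// Hypothetical sketch only; the field names and shapes are assumptions.
sealed trait CompressionType
case object Gzip extends CompressionType

// A compression algorithm plus any algorithm-specific details (e.g. the level).
final case class Compression(compressionType: CompressionType, level: Option[Int] = None)

// Metadata that can be derived purely from a key, since Guardian is stateless.
final case class BackupObjectMetadata(compression: Option[CompressionType])

object BackupObjectMetadata {
  // Detect compression solely from the key's extension (e.g. "<ISO_DATE_TIME>.json.gz").
  def fromKey(key: String): BackupObjectMetadata =
    if (key.endsWith(".gz")) BackupObjectMetadata(Some(Gzip))
    else BackupObjectMetadata(None)
}
```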

@mdedetrich marked this pull request as draft on April 14, 2022 13:23
@coveralls commented Apr 14, 2022

Pull Request Test Coverage Report for Build 3142309373

  • 51 of 63 (80.95%) changed or added relevant lines in 5 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+1.1%) to 76.959%

Changes Missing Coverage (Covered Lines / Changed or Added Lines / %):
  • core-backup/src/main/scala/io/aiven/guardian/kafka/backup/BackupClientInterface.scala: 39 / 51 (76.47%)

Files with Coverage Reduction (New Missed Lines / %):
  • core-backup/src/main/scala/io/aiven/guardian/kafka/backup/BackupClientInterface.scala: 1 (85.53%)

Totals (Coverage Status):
  • Change from base Build 3138762884: 1.1%
  • Covered Lines: 334
  • Relevant Lines: 434

💛 - Coveralls

@@ -113,8 +113,8 @@ lazy val core = project
     librarySettings,
     name := s"$baseName-core",
     libraryDependencies ++= Seq(
-      "com.typesafe.akka" %% "akka-actor" % akkaVersion % Provided,
-      "com.typesafe.akka" %% "akka-stream" % akkaVersion % Provided,
+      "com.typesafe.akka" %% "akka-actor" % akkaVersion,
@mdedetrich (Collaborator, author) commented:

Provided is removed here because otherwise we can't enforce the correct version of akka/akka-streams, which we need since it's only akka 2.6.19 and above that has unsafeDataVia, which is required for this feature (see akka/akka#31123).

@mdedetrich force-pushed the add-gzip-compression branch 3 times, most recently from 29809cd to 433c7e0 on April 21, 2022 14:05
@mdedetrich force-pushed the add-gzip-compression branch 2 times, most recently from a235794 to 2d6b852 on May 24, 2022 18:06
@mdedetrich force-pushed the add-gzip-compression branch 2 times, most recently from 1f17b9d to d7cc9db on July 27, 2022 14:07
@mdedetrich force-pushed the add-gzip-compression branch 4 times, most recently from ad6cced to 919bc05 on August 30, 2022 13:05
@mdedetrich (Collaborator, author) commented Aug 30, 2022

So I have just updated the PR with a lot more work done: there is now a full implementation of GZip compression as stated in the PR description, as well as added tests to verify that it works.

Unfortunately I encountered an issue in upstream akka: the unsafeDataVia function I added to akka-streams appears to have issues with flows that cannot tolerate truncation, and gzip compression happens to be one of those flows. The upstream issue is at akka/akka#31522.

@mdedetrich added the "upstream" label (issues that require changes/co-operation with upstream libraries/projects) on Aug 30, 2022
@mdedetrich force-pushed the add-gzip-compression branch 12 times, most recently from 090378e to 571c29a on September 1, 2022 12:29
@mdedetrich force-pushed the add-gzip-compression branch 8 times, most recently from b348667 to 170d764 on September 28, 2022 09:17
@mdedetrich (Collaborator, author) commented:

So the gzip compression PR is finally complete; the original PR post has been updated with the details.

* @return
* A flow that will always produce exactly one output element for a given input element
*/
private[backup] def flattenFlow[T](flow: Flow[T, T, NotUsed], zero: T, foldFunc: (T, T) => T): Flow[T, T, NotUsed] =
@mdedetrich (Collaborator, author) commented:

The context behind this function is that unsafeDataVia only works if you provide a Flow that strictly never adds or removes elements from the stream (i.e. it can only modify each incoming element); this is the reason behind the "unsafe" in the method name. GZip compression doesn't satisfy this constraint, i.e. it can arbitrarily add or remove elements.

So one way to solve this problem is to forcefully materialize the GZip compression stream per element (i.e. execute the entire sub-stream at that point using runFold), which ensures that each incoming element is gzip-compressed as a whole rather than streaming the ByteString contents through the gzip compression Flow.
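
A minimal sketch of this idea follows. The comment above mentions runFold (eagerly materializing each sub-stream); this sketch instead uses flatMapConcat plus fold to get the same one-output-per-input guarantee without needing a materializer, so it is an illustration of the technique rather than the PR's exact implementation. The gzipPerElement usage is also an assumed example.

```scala
import akka.NotUsed
import akka.stream.scaladsl.{Compression, Flow, Source}
import akka.util.ByteString

object FlattenFlowSketch {
  // Run `flow` over each element in isolation and fold the resulting sub-stream
  // back into exactly one output element, so the outer stream stays strictly
  // one-output-per-input (the invariant unsafeDataVia relies on).
  def flattenFlow[T](flow: Flow[T, T, NotUsed], zero: T, foldFunc: (T, T) => T): Flow[T, T, NotUsed] =
    Flow[T].flatMapConcat { element =>
      Source.single(element).via(flow).fold(zero)(foldFunc)
    }

  // Example usage (assumed wiring, not necessarily the PR's): each incoming
  // ByteString is compressed as its own self-contained gzip frame.
  val gzipPerElement: Flow[ByteString, ByteString, NotUsed] =
    flattenFlow[ByteString](Compression.gzip, ByteString.empty, _ ++ _)
}
```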

Contributor commented:

This part I didn't get: Compression.gzip is ByteString -> ByteString, so why would we need to add or remove elements?

@mdedetrich (Collaborator, author) commented Oct 8, 2022:

So the first thing is: even though we are going from ByteString -> ByteString, we are dealing with a Flow/Source, which means you have multiple elements of ByteString (in other words, it's easier to think of it as a stream of many ByteStrings -> a stream of many ByteStrings). This is what the purely streaming nature of akka-streams implies: you are never dealing with a single strict element but with a flow of many elements over time.

Now, since we are dealing with a FlowWithContext, we have a pairing of offset (cursor) to ByteString, i.e.

Cursor Data
C1 ByteString 1
C2 ByteString 2

The problem with GZip compression is that it adds elements by design, i.e. it needs to add a header/footer at the start/end of the data stream. That means that if we naively just use GZip compression the following (simplified) happens: first, with .unsafeDataVia you have an intermediate stage that produces this:

Cursor Data
N/A Gzipped ByteString Header
C1 Gzipped ByteString 1
C2 Gzipped ByteString 2
N/A Gzipped ByteString Footer

After that, .unsafeDataVia will zip up the C1/C2 elements, ignoring the ones that don't have a cursor/commit, causing this:

Cursor Data
C1 Gzipped ByteString 1
C2 Gzipped ByteString 2

Now this is a broken/incorrect GZip stream since it's missing the header/footer, and decompressing it throws a "Truncated GZIP stream" error, which is exactly what is tested within akka-streams here: https://github.com/akka/akka/blob/7abc41cf4e7e8827393b181cd06c5f8ea684e696/akka-stream-tests/src/test/scala/akka/stream/io/compression/GzipWithCustomCompressionLevelSpec.scala#L27-L35.

You can also have a look at the implementation of the GZip compressor at https://github.com/akka/akka/blob/7abc41cf4e7e8827393b181cd06c5f8ea684e696/akka-stream/src/main/scala/akka/stream/impl/io/compression/GzipCompressor.scala#L24-L26; as you can see, it adds a header and/or a footer. This is the exact reason why unsafeDataVia has "unsafe" in its name: it can cause a corrupted stream if you naively put in a Flow that adds and/or removes elements. akka/akka#31522 has more details.

Hope that explains everything.

Contributor commented:

👍 I see, got it now, thanks a lot for the explanation.

None
}

@tailrec
def keyToOffsetDateTime(key: String): OffsetDateTime = {
@mdedetrich (Collaborator, author) commented:

This had to be changed to deal with the fact that we now have the extension .json.gz, whereas before we only had .json. The previous implementation would only drop the final .gz from the key, which means we would still be left with <ISO_DATE_TIME>.json (which isn't a valid timestamp).

The nicest way I could find to solve this is to recursively drop the trailing extension (everything after the last period) until the key successfully parses as a date.
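
A minimal sketch of that recursive approach, assuming java.time.OffsetDateTime.parse as the parser; the actual method in the PR may handle key prefixes and failure cases differently.

```scala
import java.time.OffsetDateTime
import scala.annotation.tailrec
import scala.util.{Success, Try}

object KeyParsingSketch {
  // Strip one trailing extension at a time (".gz", then ".json", ...) until the
  // remaining key parses as an ISO-8601 offset date-time.
  @tailrec
  def keyToOffsetDateTime(key: String): OffsetDateTime =
    Try(OffsetDateTime.parse(key)) match {
      case Success(dateTime) => dateTime
      case _ if key.contains('.') =>
        keyToOffsetDateTime(key.substring(0, key.lastIndexOf('.')))
      case _ =>
        throw new IllegalArgumentException(s"Key $key does not contain a parsable date-time")
    }
}
```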

backupMock.clear()
val calculatedFuture = for {
_ <- backupMock.backup.run()
_ <- akka.pattern.after(10 seconds)(Future.successful(()))
@mdedetrich (Collaborator, author) commented:

This used to be AkkaStreamInitializationConstant (which is 1 second), however that duration was too short and caused flaky tests, specifically for the Gzip version of this spec. I suspect that gzip-compressing the large amount of generated data adds a fair bit of time to the backup, likely because the GitHub Actions CI machines are not that powerful (and GZip compression is CPU intensive) and because this is a mock rather than a real test, which means the entire Source contents are passed in at once rather than slowly over time from a Kafka cluster.

val calculatedFuture = for {
_ <- backupMock.backup.run()
_ <- akka.pattern.after(10 seconds)(Future.successful(()))
processedRecords = backupMock.mergeBackedUpData()
@mdedetrich (Collaborator, author) commented:

We are not passing in the compression argument here because it's the job of the actual RestoreClient to decompress the backed-up data later on.

val calculatedFuture = for {
_ <- backupMock.backup.run()
_ <- akka.pattern.after(AkkaStreamInitializationConstant)(Future.successful(()))
processedRecords = backupMock.mergeBackedUpData()
@mdedetrich (Collaborator, author) commented:

We are not passing in the compression argument here because it's the job of the actual RestoreClient to decompress the backed-up data later on.

@mdedetrich marked this pull request as ready for review on September 28, 2022 10:12
@mdedetrich removed the "upstream" label (issues that require changes/co-operation with upstream libraries/projects) on Sep 29, 2022
@mdedetrich (Collaborator, author) commented Sep 30, 2022

Note that the tests have passed previously; the tests currently running are due to a rebase on #265 (see https://github.com/aiven/guardian-for-apache-kafka/actions/runs/3142309373).

EDIT: Tests have passed

@reta (Contributor) commented Oct 7, 2022

This definitely pushes the limits of my Akka Streams knowledge, so a second pair of eyes would be helpful, but I think it looks good overall. My personal observation: compression seems to be creeping into every place, so it may be a bit challenging to add different types (e.g. zstd, ...). In any case - good one @mdedetrich

@mdedetrich (Collaborator, author) commented Oct 8, 2022

> My personal observation: compression seems to be creeping into every place, so it may be a bit challenging to add different types (e.g. zstd, ...). In any case - good one @mdedetrich

@reta Thanks! The abstractions I added in this PR (one of the reasons it happens to be a bit bigger than normal) should make adding more compression types easier later on; however, there definitely needs to be some code reorganization, i.e. abstracting the compression out into its own classes/traits rather than putting it into BackupClientInterface.

@mdedetrich merged commit 07bd7bf into main on Oct 11, 2022
@mdedetrich deleted the add-gzip-compression branch on October 11, 2022 15:09