
Add gzip compression #196

Merged
merged 1 commit into main from add-gzip-compression on Oct 11, 2022

Conversation

@mdedetrich (Collaborator) commented Apr 14, 2022

About this change - What it does

This PR adds GZip compression along with a suite of tests to make sure that the feature works.

Resolves: #195

Why this way

The first thing to note about GZip compression is that, because Guardian is stateless, the only way it knows whether something is compressed is by the key/object/file extension (i.e. whether it ends with .gz). This constraint influences how the implementation is done:

  • If an upload is currently in progress and was configured without compression, and the backup client is then restarted with compression enabled, that in-progress upload will not have its compression settings changed. The change only applies to the next upload.
    • This is because, with object storages such as S3, there is no way to download a chunk from an in-progress multipart upload, which means it's not possible to download the data and compress or uncompress it to apply the change immediately.
    • Even if you could do this (e.g. with a hypothetical file-based backup), it still wouldn't be appropriate, because you would have to go back and compress/uncompress the entire current upload. If you specify a large upload period (say 1 day) that can be a LOT of data.
    • The ConfigurationChangeRestartSpec tests these changes, i.e. it verifies that when you restart the BackupClient with different settings (in the context of this PR, compression) the current upload is unchanged and the next upload has the new settings applied.
  • Since gzip-compressed files end in .gz, the RestoreClient can easily figure out which files need to be uncompressed and which ones don't.
  • To avoid a huge amount of boilerplate test code, the tests have been abstracted so that you can extend a test and specify which compression you want.
    • As an example, all of the test logic in BackupClientInterfaceSpec has been moved into a trait called BackupClientInterfaceTest. That trait has a compression field, so the intention is that you have test classes that extend it (in this case BackupClientInterfaceSpec, which is the original test, and GzipCompressionBackupClientInterfaceSpec, which is the original test with gzip compression enabled). This allows you to easily run the entire test suite with or without compression and can easily be extended for future features.
      • The same applies to RestoreClientInterfaceTest, RealS3BackupClientTest and RealS3RestoreClientTest.
      • Due to the files being renamed, the GitHub diff doesn't show a rename diff; instead it shows a brand new file plus a massive diff of the original file with all of its contents removed. Because of this, an easier way to see the changes is to put the contents of the new BackupClientInterfaceTest and the current BackupClientInterfaceSpec into a diff tool (the same applies to the other tests).
  • mergeBackedUpData in MockedBackupClientInterface has been updated to deal with the fact that the mocks now have to decompress the ByteString data for certain tests. A compression parameter has been added to the method that lets us indicate that we want to decompress the data after merging it.
  • New models have been added to deal with these changes (a rough sketch follows this list):
    • BackupObjectMetadata, which contains metadata about a backup. Currently it only records whether the backup is compressed, but it's intended that this data structure will be extended for future features. There is also a BackupObjectMetadata.fromKey method that creates a BackupObjectMetadata from a key.
    • StateDetails, which replaces the previous State in order to combine it with a BackupObjectMetadata.
    • Compression, a data structure that contains a CompressionType along with any details for that compression (such as a compression level).
    • CompressionType, which is the type of compression (currently only Gzip).
  • The CLI tool has been updated with a subcommand for compression. To run the Guardian CLI with compression enabled you just add gzip to toggle it on; if gzip is enabled you can also add the optional --compression-level to specify the level. See the CliSpec changes for more info.
  • Documentation has also been added for the BackupConfig.
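
Below is a rough sketch of how the new models described above could fit together. The names Compression, CompressionType, Gzip, BackupObjectMetadata and BackupObjectMetadata.fromKey come from this description; the exact fields, packages and signatures are assumptions and may differ from the actual code.

```scala
// Hypothetical sketch only; the field names and shapes are assumptions.
sealed trait CompressionType
case object Gzip extends CompressionType

// A compression algorithm plus any algorithm-specific details (e.g. the level).
final case class Compression(compressionType: CompressionType, level: Option[Int] = None)

// Metadata that can be derived purely from a key, since Guardian is stateless.
final case class BackupObjectMetadata(compression: Option[CompressionType])

object BackupObjectMetadata {
  // Detect compression solely from the key's extension (e.g. "<ISO_DATE_TIME>.json.gz").
  def fromKey(key: String): BackupObjectMetadata =
    if (key.endsWith(".gz")) BackupObjectMetadata(Some(Gzip))
    else BackupObjectMetadata(None)
}
```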

@mdedetrich marked this pull request as draft on April 14, 2022 13:23
@coveralls commented Apr 14, 2022

Pull Request Test Coverage Report for Build 3142309373

  • 51 of 63 (80.95%) changed or added relevant lines in 5 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+1.1%) to 76.959%

Changes Missing Coverage (Covered Lines / Changed or Added Lines / %):
  • core-backup/src/main/scala/io/aiven/guardian/kafka/backup/BackupClientInterface.scala: 39 / 51 (76.47%)

Files with Coverage Reduction (New Missed Lines / %):
  • core-backup/src/main/scala/io/aiven/guardian/kafka/backup/BackupClientInterface.scala: 1 (85.53%)

Totals (Coverage Status):
  • Change from base Build 3138762884: 1.1%
  • Covered Lines: 334
  • Relevant Lines: 434

💛 - Coveralls

@@ -113,8 +113,8 @@ lazy val core = project
     librarySettings,
     name := s"$baseName-core",
     libraryDependencies ++= Seq(
-      "com.typesafe.akka" %% "akka-actor" % akkaVersion % Provided,
-      "com.typesafe.akka" %% "akka-stream" % akkaVersion % Provided,
+      "com.typesafe.akka" %% "akka-actor" % akkaVersion,
@mdedetrich (Collaborator, author) commented:

Provided is removed here because otherwise we can't enforce the correct version of akka/akka-streams, which we need since it's only akka 2.6.19 and above that has unsafeDataVia, which is required for this feature (see akka/akka#31123).

@mdedetrich force-pushed the add-gzip-compression branch 3 times, most recently from 29809cd to 433c7e0 on April 21, 2022 14:05
@mdedetrich force-pushed the add-gzip-compression branch 2 times, most recently from a235794 to 2d6b852 on May 24, 2022 18:06
@mdedetrich force-pushed the add-gzip-compression branch 2 times, most recently from 1f17b9d to d7cc9db on July 27, 2022 14:07
@mdedetrich force-pushed the add-gzip-compression branch 4 times, most recently from ad6cced to 919bc05 on August 30, 2022 13:05
@mdedetrich (Collaborator, author) commented Aug 30, 2022

So I have just updated the PR with a lot more work done: there is now a full implementation of GZip compression as stated in the PR description, as well as added tests to verify that it works.

Unfortunately I encountered an issue in upstream akka: the unsafeDataVia function I added to akka-streams appears to have issues with flows that cannot tolerate truncation, and gzip compression happens to be one of those flows. The upstream issue is at akka/akka#31522.

@mdedetrich added the "upstream" label (issues that require changes/co-operation with upstream libraries/projects) on Aug 30, 2022
@mdedetrich force-pushed the add-gzip-compression branch 12 times, most recently from 090378e to 571c29a on September 1, 2022 12:29
@mdedetrich force-pushed the add-gzip-compression branch 8 times, most recently from b348667 to 170d764 on September 28, 2022 09:17
@mdedetrich (Collaborator, author) commented:

So the gzip compression PR is finally complete; the original PR post has been updated with the details.

* @return
* A flow that will always produce exactly one output element for a given input element
*/
private[backup] def flattenFlow[T](flow: Flow[T, T, NotUsed], zero: T, foldFunc: (T, T) => T): Flow[T, T, NotUsed] =
@mdedetrich (Collaborator, author) commented:

The context behind this function is that unsafeDataVia only works if you provide a Flow that strictly never adds or removes elements from the stream (i.e. it can only modify each incoming element); this is the reason behind the "unsafe" in the method name. GZip compression doesn't satisfy this constraint, i.e. it can arbitrarily add or remove elements.

So one way to solve this problem is to forcefully materialize the GZip compression stream per element (i.e. execute the entire sub-stream at that point using runFold), which ensures that each incoming element is gzip-compressed as a whole rather than streaming the ByteString contents through the gzip compression Flow.
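
A minimal sketch of this idea follows. The comment above mentions runFold (eagerly materializing each sub-stream); this sketch instead uses flatMapConcat plus fold to get the same one-output-per-input guarantee without needing a materializer, so it is an illustration of the technique rather than the PR's exact implementation. The gzipPerElement usage is also an assumed example.

```scala
import akka.NotUsed
import akka.stream.scaladsl.{Compression, Flow, Source}
import akka.util.ByteString

object FlattenFlowSketch {
  // Run `flow` over each element in isolation and fold the resulting sub-stream
  // back into exactly one output element, so the outer stream stays strictly
  // one-output-per-input (the invariant unsafeDataVia relies on).
  def flattenFlow[T](flow: Flow[T, T, NotUsed], zero: T, foldFunc: (T, T) => T): Flow[T, T, NotUsed] =
    Flow[T].flatMapConcat { element =>
      Source.single(element).via(flow).fold(zero)(foldFunc)
    }

  // Example usage (assumed wiring, not necessarily the PR's): each incoming
  // ByteString is compressed as its own self-contained gzip frame.
  val gzipPerElement: Flow[ByteString, ByteString, NotUsed] =
    flattenFlow[ByteString](Compression.gzip, ByteString.empty, _ ++ _)
}
```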

Contributor commented:

This part I didn't get: Compression.gzip is ByteString -> ByteString, so why would we need to add or remove elements?

@mdedetrich (Collaborator, author) commented Oct 8, 2022:

So the first thing is: even though we are going from ByteString -> ByteString, we are dealing with a Flow/Source, which means you have multiple elements of ByteString (in other words, it's easier to think of it as a stream of many ByteStrings -> a stream of many ByteStrings). This is what the purely streaming nature of akka-streams implies: you are never dealing with a single strict element but with a flow of many elements over time.

Now, since we are dealing with a FlowWithContext, we have a pairing of offset (cursor) to ByteString, i.e.

Cursor Data
C1 ByteString 1
C2 ByteString 2

The problem with GZip compression is that it adds elements by design, i.e. it needs to add a header/footer at the start/end of the data stream. That means that if we naively just use GZip compression the following (simplified) happens: first, with .unsafeDataVia you have an intermediate stage that produces this:

Cursor Data
N/A Gzipped ByteString Header
C1 Gzipped ByteString 1
C2 Gzipped ByteString 2
N/A Gzipped ByteString Footer

After that, .unsafeDataVia will zip up the C1/C2 elements, ignoring the ones that don't have a cursor/commit, causing this:

Cursor Data
C1 Gzipped ByteString 1
C2 Gzipped ByteString 2

Now this is a broken/incorrect GZip stream since it's missing the header/footer, and decompressing it throws a "Truncated GZIP stream" error, which is exactly what is tested within akka-streams here: https://github.com/akka/akka/blob/7abc41cf4e7e8827393b181cd06c5f8ea684e696/akka-stream-tests/src/test/scala/akka/stream/io/compression/GzipWithCustomCompressionLevelSpec.scala#L27-L35.

You can also have a look at the implementation of the GZip compressor at https://github.com/akka/akka/blob/7abc41cf4e7e8827393b181cd06c5f8ea684e696/akka-stream/src/main/scala/akka/stream/impl/io/compression/GzipCompressor.scala#L24-L26; as you can see, it adds a header and/or a footer. This is the exact reason why unsafeDataVia has "unsafe" in its name: it can cause a corrupted stream if you naively put in a Flow that adds and/or removes elements. akka/akka#31522 has more details.

Hope that explains everything.

Contributor commented:

👍 I see, got it now, thanks a lot for the explanation.

None
}

@tailrec
def keyToOffsetDateTime(key: String): OffsetDateTime = {
@mdedetrich (Collaborator, author) commented:

This had to be changed to deal with the fact that we now have the extension .json.gz, whereas before we only had .json. The previous implementation would only drop the final .gz from the key, which means we would still be left with <ISO_DATE_TIME>.json (which isn't a valid timestamp).

The nicest way I could find to solve this is to recursively drop the trailing extension (everything after the last period) until the key successfully parses as a date.
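
A minimal sketch of that recursive approach, assuming java.time.OffsetDateTime.parse as the parser; the actual method in the PR may handle key prefixes and failure cases differently.

```scala
import java.time.OffsetDateTime
import scala.annotation.tailrec
import scala.util.{Success, Try}

object KeyParsingSketch {
  // Strip one trailing extension at a time (".gz", then ".json", ...) until the
  // remaining key parses as an ISO-8601 offset date-time.
  @tailrec
  def keyToOffsetDateTime(key: String): OffsetDateTime =
    Try(OffsetDateTime.parse(key)) match {
      case Success(dateTime) => dateTime
      case _ if key.contains('.') =>
        keyToOffsetDateTime(key.substring(0, key.lastIndexOf('.')))
      case _ =>
        throw new IllegalArgumentException(s"Key $key does not contain a parsable date-time")
    }
}
```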

backupMock.clear()
val calculatedFuture = for {
_ <- backupMock.backup.run()
_ <- akka.pattern.after(10 seconds)(Future.successful(()))
@mdedetrich (Collaborator, author) commented:

This used to be AkkaStreamInitializationConstant (which is 1 second), however that duration was too short and caused flaky tests, specifically for the Gzip version of this spec. I suspect that gzip-compressing the large amount of generated data adds a fair bit of time to the backup, likely because the GitHub Actions CI machines are not that powerful (and GZip compression is CPU intensive) and because this is a mock rather than a real test, which means the entire Source contents are passed in at once rather than slowly over time from a Kafka cluster.

val calculatedFuture = for {
_ <- backupMock.backup.run()
_ <- akka.pattern.after(10 seconds)(Future.successful(()))
processedRecords = backupMock.mergeBackedUpData()
@mdedetrich (Collaborator, author) commented:

We are not passing in the compression argument here because it's the job of the actual RestoreClient to decompress the backed-up data later on.

val calculatedFuture = for {
_ <- backupMock.backup.run()
_ <- akka.pattern.after(AkkaStreamInitializationConstant)(Future.successful(()))
processedRecords = backupMock.mergeBackedUpData()
@mdedetrich (Collaborator, author) commented:

We are not passing in the compression argument here because it's the job of the actual RestoreClient to decompress the backed-up data later on.

@mdedetrich marked this pull request as ready for review on September 28, 2022 10:12
@mdedetrich removed the "upstream" label (issues that require changes/co-operation with upstream libraries/projects) on Sep 29, 2022
@mdedetrich (Collaborator, author) commented Sep 30, 2022

Note that the tests have passed previously; the tests currently running are due to a rebase on #265 (see https://github.com/aiven/guardian-for-apache-kafka/actions/runs/3142309373).

EDIT: Tests have passed

@reta (Contributor) commented Oct 7, 2022

This definitely pushes the limits of my Akka Streams knowledge, so a second pair of eyes would be helpful, but I think it looks good overall. My personal observation: compression seems to be creeping into every place, so it may be a bit challenging to add different types (e.g. zstd, ...). In any case - good one @mdedetrich

@mdedetrich (Collaborator, author) commented Oct 8, 2022

> My personal observation: compression seems to be creeping into every place, so it may be a bit challenging to add different types (e.g. zstd, ...). In any case - good one @mdedetrich

@reta Thanks! The abstractions I added in this PR (one of the reasons it happens to be a bit bigger than normal) should make adding more compression types easier later on; however, there definitely needs to be some code reorganization, i.e. abstracting the compression out into its own classes/traits rather than putting it into BackupClientInterface.

@mdedetrich merged commit 07bd7bf into main on Oct 11, 2022
@mdedetrich deleted the add-gzip-compression branch on October 11, 2022 15:09