GZip Header / Footer info #558

jzabroski · 2021-01-08T16:53:40Z

Small world.

I have about 75GB of gzip data I need to decompress and then load into SQL tables. Since the data vendor could in theory update any historical data at any time, ideally I want to:

Get list of gz files in both history and daily folders
Read the entry list from each gz file and get filenames / sizes (not sure if size is needed)
Compare filename / sizes to what we have in the database
Anything not in the database -> extract to temp folder
Import data in temp folder into database
Clean up temp folder
Repeat
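For illustration, the "compare to what we have in the database" step in the list above could be sketched roughly as follows. This is only a sketch under assumptions: `PendingFilesSketch`, `FindPending`, and the `known` set are hypothetical names, the database lookup is stood in for by an in-memory set, and the size compared here is the compressed size on disk rather than the uncompressed size.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class PendingFilesSketch
{
    // "known" is a hypothetical stand-in for the rows already loaded into the database:
    // (file name, compressed size in bytes). Name comparison is case-sensitive here.
    public static IEnumerable<FileInfo> FindPending(
        IEnumerable<string> roots,
        ISet<(string Name, long Length)> known)
    {
        return roots
            // Walk the history and daily folders recursively for *.gz files.
            .SelectMany(root => Directory.EnumerateFiles(root, "*.gz", SearchOption.AllDirectories))
            .Select(path => new FileInfo(path))
            // Keep only files whose (name, size) pair has not been imported yet.
            .Where(file => !known.Contains((file.Name, file.Length)));
    }
}
```

Anything returned by `FindPending` would then be extracted to the temp folder and imported.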

But, to do this, ideally I would only read the GZip header and footer, so I know how big of a file I am extracting, but I don't see any clean .NET APIs that let you do something like the following pseudo-code:

await using var fileStream = File.OpenAsync("myfile.gz");
await using var gzipStream = new GZipStream(fileStream, ZipMode.Read);
var fileSize = gzipStream.Header.UncompressedFileSize;
var fileName = gzipStream.Header.FileName;

But... this is likely slightly incorrect. Even so, I see that is approximately how Go models it: https://golang.org/src/compress/gzip/gunzip.go?s=1297:1500#L42

I'm a little surprised there doesn't seem to be a .NET library with an API for such an obvious use case, but Go has one.

StackOverflow seems to suggest that an older version of this library supported a concrete FilePath value. https://stackoverflow.com/a/39081983/1040437

This is the code I wrote so far, but reader.Entry.Key is blank and reader.Entry.LinkTarget is also blank, and I don't see a FilePath option anywhere.

var files = Directory.EnumerateFiles(historicalDataLocalPath, "*.gz", SearchOption.AllDirectories);
foreach (var file in files)
{
    var readerOptions = new ReaderOptions();
    readerOptions.LookForHeader = true; // It looks like this only applies to RarArchive for some reason.
    using Stream stream = File.OpenRead(file);
    using var reader = GZipReader.Open(stream, readerOptions);
    while (reader.MoveToNextEntry())
    {
        if (reader.Entry.IsDirectory)
        {
            continue;
        }

        using var entryStream = reader.OpenEntryStream();
        var outputPath = Path.Combine(configuration.WorkingPath, reader.Entry.Key ?? reader.Entry.LinkTarget);
        using Stream writeStream = File.OpenWrite(outputPath);
        entryStream.CopyTo(writeStream);
    }
}


adamhathcock · 2021-01-08T17:02:37Z

The header stuff that Go references is actually there:
https://github.com/adamhathcock/sharpcompress/blob/master/src/SharpCompress/Compressors/Deflate/ZlibBaseStream.cs#L63

It just needs exposing or some other minor updates as it looks like the whole header is gathered. This implementation was based on another and I haven't really touched it since.

jzabroski · 2021-01-08T19:11:54Z

Is that why my GZipEntry records don't match PeaZip? I am not super familiar with the GZip standard and am learning as I go, so I don't fully understand why these don't match.

jzabroski · 2021-01-08T19:17:48Z

For what it's worth, this .gz file is from a RedShift table export. https://docs.aws.amazon.com/redshift/latest/dg/t_Unloading_tables.html

So, I think this would be a fairly common task to want to use GZip decompression for, as more and more people use data lakes.

adamhathcock · 2021-01-08T19:42:00Z

There might be a bug in that the GZipEntry isn't picking up the file name correctly.

It's also possible that there is no file name embedded and PeaZip just gives it a default.

I'm guessing I have a bug but would need a sample (and time) to validate.

Any chance you could get the source of sharpcompress to debug? I'd put a breakpoint on the gzip header read spot to see what comes out. If that's correct then I need to fix it. Otherwise there's just no info.

jzabroski · 2021-01-08T20:08:25Z

I think it's not a bug. I did the following to try to analyze further:

Installed the GNU gzip library via chocolatey: choco install -y gzip
Ran refreshenv to add gzip command path to $env:PATH
Ran gzip --list "\\fileshare\path\to\file.gz"

Got the following:

     compressed        uncompressed  ratio uncompressed_name
       26393242            78119087  66.2% \\fileshare\path\to\file

From reading online, the 4th bit (the FNAME flag, mask 0x08) of the 4th header byte (FLG) determines whether the original filename is kept. When it is not set, the "correct" behavior used by various tools is to use the gz filename without the .gz extension. Unfortunately, the gzip command line program doesn't directly display that header info either, which sucks. Adding -v only adds three new columns:

method  crc     date  time           compressed        uncompressed  ratio uncompressed_name
defla 72c113c2 Jun  8 07:59            26393242            78119087  66.2% \\fileshare\path\to\file
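Since gzip itself does not print the flag, a minimal sketch of checking it directly with plain .NET stream reads might look like the following. It follows the header layout in RFC 1952 (10 fixed bytes, optional extra field, optional zero-terminated name) and does not use sharpcompress; `GZipHeaderSketch`, `ReadOriginalFileName`, and `ResolveFileName` are hypothetical names, and only the first member's header is inspected.

```csharp
#nullable enable
using System;
using System.IO;
using System.Text;

public static class GZipHeaderSketch
{
    private const byte FExtra = 0x04; // FLG bit 2: an "extra" field follows the fixed header
    private const byte FName = 0x08;  // FLG bit 3: a zero-terminated original file name follows

    // Returns the file name stored in the first member's header, or null when FNAME is not set.
    public static string? ReadOriginalFileName(Stream stream)
    {
        byte[] header = ReadBytes(stream, 10); // ID1 ID2 CM FLG MTIME(4) XFL OS
        if (header[0] != 0x1F || header[1] != 0x8B)
            throw new InvalidDataException("Not a gzip stream.");

        byte flags = header[3]; // the 4th byte
        if ((flags & FExtra) != 0)
        {
            byte[] len = ReadBytes(stream, 2);         // XLEN, little-endian
            ReadBytes(stream, len[0] | (len[1] << 8)); // skip the extra field
        }
        if ((flags & FName) == 0)
            return null;

        var name = new StringBuilder();
        int b;
        while ((b = stream.ReadByte()) > 0)
            name.Append((char)b); // ISO-8859-1 bytes map 1:1 to the first 256 code points
        if (b < 0)
            throw new EndOfStreamException("Unterminated file name.");
        return name.ToString();
    }

    // Fallback described above: when no name is stored, drop the ".gz" extension.
    public static string ResolveFileName(string gzPath)
    {
        using var stream = File.OpenRead(gzPath);
        return ReadOriginalFileName(stream) ?? Path.GetFileNameWithoutExtension(gzPath);
    }

    // Reads exactly "count" bytes or throws if the stream ends early.
    private static byte[] ReadBytes(Stream stream, int count)
    {
        var buffer = new byte[count];
        int read = 0;
        while (read < count)
        {
            int n = stream.Read(buffer, read, count - read);
            if (n == 0) throw new EndOfStreamException();
            read += n;
        }
        return buffer;
    }
}
```

A null result from `ReadOriginalFileName` would mean the file simply carries no embedded name, which is one way to tell "no info" apart from a parsing bug.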

> Any chance you could get the source of sharpcompress to debug? I'd put a breakpoint on the gzip header read spot to see what comes out. If that's correct then I need to fix it. Otherwise there's just no info.

This is a good idea. Will fork and see.

jzabroski · 2021-01-08T20:21:23Z

I was able to figure it out with gzip

gzip -dkv "\\fileshare\path\to\file.gz"

outputs:

\\fileshare\path\to\file.gz:
 66.2% -- replaced with \\fileshare\path\to\file

From what I read online, - is the default "file name" if none is given.

I'll still fork the repo and look at contributing a patch sometime soon. Seems like fun.

I also think the API could use some changes to make it more friendly to generic programming and async/await all the way.

adamhathcock · 2021-01-08T20:43:24Z

Showing the filename when that byte is present isn't what this library does. The API doesn't know the name. It knows streams.

That said, all the info should be exposed on GZipStream and/or GZipEntry.

Exposing the info should be easy.

I'm happy to rework the API. I haven't given it critical thought for 10 years! Any thoughts on issues or PRs are welcome.

Async/await has been on the TODO list but I've just never made a start as it seems like a lot of grunt work. Again, PRs welcome! Even partial ones where I could help with this big task.

I'm with a startup and have 3 young kids in lockdown. Not much free time.

adamhathcock · 2021-01-09T12:58:03Z

I got some time and got curious so I dug into the code and spec and then reread your use case.

It looks like all you want is the name and uncompressed size. The name could be the "default" one, in which case GZipStream doesn't know the file name.

As for the size, it's basically the last 4 bytes of the file, assuming there's only one "member" in the file, as is the case for most GZ files. I don't think there's any value this library can add to that use case other than a static method on GZipArchive or something. I guess I could make the entries on the archive read the footers to load size and CRC data.
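To make the "last 4 bytes" point concrete, here is a rough sketch of reading the trailer with plain .NET APIs, outside of sharpcompress. Per RFC 1952 the last 8 bytes of a member are CRC32 followed by ISIZE, both little-endian. `GZipTrailerSketch` and `ReadTrailer` are hypothetical names, and the sketch assumes a seekable stream holding a single-member file.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

public static class GZipTrailerSketch
{
    // Reads the CRC32 and ISIZE fields from the last 8 bytes of a gzip file.
    public static (uint Crc32, uint UncompressedSize) ReadTrailer(Stream stream)
    {
        if (!stream.CanSeek)
            throw new NotSupportedException("A seekable stream is required.");
        if (stream.Length < 18) // 10-byte header + 8-byte trailer
            throw new InvalidDataException("Too short to be a gzip file.");

        stream.Seek(-8, SeekOrigin.End);
        var trailer = new byte[8];
        int read = 0;
        while (read < trailer.Length)
        {
            int n = stream.Read(trailer, read, trailer.Length - read);
            if (n == 0) throw new EndOfStreamException();
            read += n;
        }

        uint crc = BinaryPrimitives.ReadUInt32LittleEndian(trailer.AsSpan(0, 4));
        uint size = BinaryPrimitives.ReadUInt32LittleEndian(trailer.AsSpan(4, 4));
        return (crc, size);
    }
}
```

Two caveats apply: ISIZE is the uncompressed length modulo 2^32, so it wraps for inputs of 4 GiB or more, and for a multi-member file it only describes the last member.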

I started a branch here that just exposes LastModified but I don't see anything obvious to do: #560

adamhathcock · 2021-01-09T13:23:14Z

Never mind, I take that back: if you use GZipArchive it will read the trailer for CRC and size info now.

I really need to refactor this lib for nullables and async.

jzabroski · 2021-01-09T14:35:01Z

Do you use Resharper? Feel like I can blaze through refactoring it.

The thing I don't understand is, reading online, is there really such a thing as a GZipArchive? Isn't that just tar.gz? I didn't know what GZipFilePart was either.

I think the ReaderFactory would be nicer if it supported generic types. That would clean up ReaderOptions too since you could have options per format.

adamhathcock · 2021-01-09T16:19:10Z

It's really that in sharpcompress:
Archive = random access
Reader = forward only streaming

I've kludged a common API over the different formats as best I could, for fun.

FilePart was a way to have the same file in an archive across multiple physical files. For example, Rar and Zip can divide an archive into multi-file archives. You might have a compressed file split over 2 or more physical archive files because of it. FilePart made sense at the time.

I use Rider so basically I use Resharper :)

I'll merge in my gzip changes soon and release then prepare for breaking changes. I want more nullables anyway.
