GZip Header / Footer info #558

Open
jzabroski opened this issue Jan 8, 2021 · 11 comments


@jzabroski

Hi @adamhathcock ,

Small world.

I have about 75GB of gzip data I need to decompress and then load into SQL tables. Since the data vendor could in theory update any historical data at any time, ideally I want to:

  1. Get list of gz files in both history and daily folders
  2. Read the entries from each gz file and get filenames / sizes (not sure if size is needed)
  3. Compare filename / sizes to what we have in the database
  4. Anything not in the database -> extract to temp folder
  5. Import data in temp folder into database
  6. Clean up temp folder
  7. Repeat
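
Steps 1, 3 and 4 of the list above could be sketched roughly as follows. This is only an illustration: `GzSyncPlanner` and `FindFilesToImport` are hypothetical names, the `known` dictionary stands in for whatever the database lookup returns, and the comparison uses the compressed size on disk (an assumption) rather than anything read from the gzip data itself.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of SharpCompress or the .NET base library.
public static class GzSyncPlanner
{
    // Returns every .gz file under the given root folders whose full path is
    // missing from `known`, or whose on-disk length differs from the recorded one.
    // `known` is an assumed stand-in for "what we have in the database".
    public static List<string> FindFilesToImport(
        IEnumerable<string> roots,
        IReadOnlyDictionary<string, long> known)
    {
        var pending = new List<string>();
        foreach (var root in roots)
        {
            foreach (var path in Directory.EnumerateFiles(root, "*.gz", SearchOption.AllDirectories))
            {
                // Compressed size on disk; cheap to obtain without opening the file.
                long length = new FileInfo(path).Length;
                if (!known.TryGetValue(path, out long recorded) || recorded != length)
                {
                    pending.Add(path);
                }
            }
        }
        return pending;
    }
}
```

Only the files returned here would then be extracted to the temp folder and imported. A vendor could in principle rewrite a file to the same compressed length, so a checksum would be a stricter comparison than size alone.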

But, to do this, ideally I would only read the GZip header and footer, so I know how big of a file I am extracting, but I don't see any clean .NET APIs that let you do something like the following pseudo-code:

await using var fileStream = File.OpenAsync("myfile.gz");
await using var gzipStream = new GZipStream(fileStream, ZipMode.Read);
var fileSize = gzipStream.Header.UncompressedFileSize;
var fileName = gzipStream.Header.FileName;

But... this is likely slightly incorrect. Even so, I see that is approximately how Go models it: https://golang.org/src/compress/gzip/gunzip.go?s=1297:1500#L42

I'm a little surprised there doesn't seem to be a .NET library with an API for such an obvious use case, when Go has one.

StackOverflow seems to suggest that an older version of this library supported a concrete FilePath value. https://stackoverflow.com/a/39081983/1040437

This is the code I wrote so far, but reader.Entry.Key is blank and reader.Entry.LinkTarget is also blank, and I don't see a FilePath option anywhere.

var files = Directory.EnumerateFiles(historicalDataLocalPath, "*.gz", SearchOption.AllDirectories);
foreach (var file in files)
{
    var readerOptions = new ReaderOptions();
    readerOptions.LookForHeader = true; // It looks like this only applies to RarArchive for some reason.
    using Stream stream = File.OpenRead(file);
    using var reader = GZipReader.Open(stream, readerOptions);
    while (reader.MoveToNextEntry())
    {
        if (reader.Entry.IsDirectory)
        {
            continue;
        }

        using var entryStream = reader.OpenEntryStream();
        var outputPath = Path.Combine(configuration.WorkingPath, reader.Entry.Key ?? reader.Entry.LinkTarget);
        using Stream writeStream = File.OpenWrite(outputPath);
        entryStream.CopyTo(writeStream);
    }
}

@adamhathcock
Owner

The header stuff that Go references is actually there:
https://github.com/adamhathcock/sharpcompress/blob/master/src/SharpCompress/Compressors/Deflate/ZlibBaseStream.cs#L63

It just needs exposing or some other minor updates as it looks like the whole header is gathered. This implementation was based on another and I haven't really touched it since.

@jzabroski
Author

Is that why my GZipEntry records don't match PeaZip? Since I am not super familiar with the GZip standard and am learning as I go, I don't fully understand why these don't match.

[Two images attached in the original comment]

@jzabroski
Author

For what it's worth, this .gz file is from a Redshift table export. https://docs.aws.amazon.com/redshift/latest/dg/t_Unloading_tables.html

So, I think this would be a fairly common task to want to use GZip decompression for, as more and more people use data lakes.

@adamhathcock
Owner

There might be a bug in that the GZipEntry isn't picking up the file name correctly.

It's also possible that there is no file name embedded and PeaZip just gives it a default.

I'm guessing I have a bug but would need a sample (and time) to validate.

Any chance you can get the source of sharpcompress to debug? I'd put a breakpoint on the gzip header read spot to see what comes out. If that's correct then I need to fix it. Otherwise there's just no info.

@jzabroski
Author

I think it's not a bug. I did the following to try to analyze further:

  1. Installed the GNU gzip library via chocolatey: choco install -y gzip
  2. Ran refreshenv to add gzip command path to $env:PATH
  3. Ran gzip --list "\\fileshare\path\to\file.gz"
  4. Got the following:
         compressed        uncompressed  ratio uncompressed_name
           26393242            78119087  66.2% \\fileshare\path\to\file
    

From what I've read online, the 4th bit of the 4th byte (the FNAME flag, 0x08, in the FLG byte) determines whether the original filename is stored. When it is not set, the "correct" behavior used by various tools is to use the .gz filename without the .gz extension. Unfortunately, the gzip command-line program doesn't directly display that header info either, which sucks. Adding -v only adds the method, crc, date, and time columns:

method  crc     date  time           compressed        uncompressed  ratio uncompressed_name
defla 72c113c2 Jun  8 07:59            26393242            78119087  66.2% \\fileshare\path\to\file
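
For anyone who wants to check that flag directly rather than through the gzip CLI, here is a minimal sketch that parses the start of the first gzip member by hand, following the header layout in RFC 1952. `GzipHeader` and `ReadOriginalName` are hypothetical names for illustration; this is not the SharpCompress API, and it only looks at the first member.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration; not part of SharpCompress.
public static class GzipHeader
{
    private const int FlagExtra = 0x04; // FEXTRA: an extra field precedes the name
    private const int FlagName = 0x08;  // FNAME: original file name is present

    // Returns the original file name stored in the first member's header,
    // or null when the FNAME flag is not set.
    public static string ReadOriginalName(Stream stream)
    {
        // Fixed 10-byte part: ID1, ID2, CM, FLG, MTIME (4 bytes), XFL, OS.
        var fixedPart = new byte[10];
        ReadFully(stream, fixedPart, fixedPart.Length);
        if (fixedPart[0] != 0x1f || fixedPart[1] != 0x8b)
        {
            throw new InvalidDataException("Not a gzip stream.");
        }

        int flags = fixedPart[3];
        if ((flags & FlagExtra) != 0)
        {
            // Skip the extra field: 2-byte little-endian length, then the payload.
            var lengthBytes = new byte[2];
            ReadFully(stream, lengthBytes, 2);
            int extraLength = lengthBytes[0] | (lengthBytes[1] << 8);
            ReadFully(stream, new byte[extraLength], extraLength);
        }

        if ((flags & FlagName) == 0)
        {
            return null;
        }

        // The name is a zero-terminated ISO 8859-1 string, so each byte
        // maps directly to the character with the same code point.
        var name = new StringBuilder();
        int b;
        while ((b = stream.ReadByte()) > 0)
        {
            name.Append((char)b);
        }
        if (b < 0)
        {
            throw new EndOfStreamException("Header ended inside the file name.");
        }
        return name.ToString();
    }

    private static void ReadFully(Stream stream, byte[] buffer, int count)
    {
        int read = 0;
        while (read < count)
        {
            int n = stream.Read(buffer, read, count - read);
            if (n == 0) throw new EndOfStreamException("Truncated gzip header.");
            read += n;
        }
    }
}
```

A null result means the header carries no name, so any name a tool displays for that file has to come from somewhere else, such as the .gz file name with the extension stripped.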

> Any chance you can get the source of sharpcompress to debug? I'd put a breakpoint on the gzip header read spot to see what comes out. If that's correct then I need to fix it. Otherwise there's just no info.

This is a good idea. Will fork and see.

@jzabroski
Author

jzabroski commented Jan 8, 2021

I was able to figure it out with gzip

gzip -dkv "\\fileshare\path\to\file.gz"

outputs:

\\fileshare\path\to\file.gz:
 66.2% -- replaced with \\fileshare\path\to\file

From what I've read online, - is the default "file name" if none is given.

I'll still fork the repo and look at contributing a patch sometime soon. Seems like fun.

I also think the API could use some changes to make it more friendly to generic programming and async/await all the way.

@adamhathcock
Owner

Showing the filename when that byte is present isn't what this library does. The API doesn't know the name. It knows streams.

That said, all the info should be exposed on GZipStream and/or GZipEntry.

Exposing the info should be easy.

I'm happy to rework the API. I haven't given it critical thought for 10 years! Any thoughts on issues or PRs are welcome.

Async/await has been on the TODO list but I've just never made a start as it seems like a lot of grunt work. Again, PRs welcome! Even partial ones where I could help with this big task.

I'm with a startup and have 3 young kids in lockdown. Not much free time.

@adamhathcock
Owner

I got some time and got curious so I dug into the code and spec and then reread your use case.

It looks like all you want is the name and uncompressed size. The name could be the "default", in which case GZipStream doesn't know the file name.

As for the size, it's basically the last 4 bytes of the file, assuming there's only one "member" in the file, as is the case for most GZ files. I don't think there's any value this library can add to that use case other than a static method on GZipArchive or something. I guess I could make the entries on the archive read the footers to load size and CRC data.
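
A minimal sketch of that "last 4 bytes" read, assuming a seekable stream and a file containing a single gzip member. Per RFC 1952 the trailer field (ISIZE) is little-endian and holds the uncompressed size modulo 2^32, so it is only a lower bound for data of 4 GiB or more. `GzipTrailer` and `ReadUncompressedSize32` are hypothetical names, not an existing SharpCompress or .NET API.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of SharpCompress.
public static class GzipTrailer
{
    // Reads ISIZE, the last 4 bytes of a gzip member: the uncompressed
    // length modulo 2^32, stored little-endian.
    public static uint ReadUncompressedSize32(Stream stream)
    {
        if (!stream.CanSeek)
        {
            throw new NotSupportedException("A seekable stream is required.");
        }
        // 10-byte header plus 8-byte trailer (CRC32 + ISIZE) is the bare minimum.
        if (stream.Length < 18)
        {
            throw new InvalidDataException("Too short to be a gzip file.");
        }

        stream.Seek(-4, SeekOrigin.End);
        var buffer = new byte[4];
        int read = 0;
        while (read < buffer.Length)
        {
            int n = stream.Read(buffer, read, buffer.Length - read);
            if (n == 0) throw new EndOfStreamException("Truncated gzip trailer.");
            read += n;
        }

        return (uint)buffer[0]
             | ((uint)buffer[1] << 8)
             | ((uint)buffer[2] << 16)
             | ((uint)buffer[3] << 24);
    }
}
```

For example, `GzipTrailer.ReadUncompressedSize32(File.OpenRead(path))` returns the size without decompressing anything. With several concatenated members the value describes only the last one.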

I started a branch here that just exposes LastModified but I don't see anything obvious to do: #560

@adamhathcock
Owner

Never mind, I take that back: if you use GZipArchive it will read the trailer for CRC and size info now.

I really need to refactor this lib for nullables and async.

@jzabroski
Author

Do you use Resharper? Feel like I can blaze through refactoring it.

The thing I don't understand is, reading online, is there really such a thing as a GZipArchive? Isn't that just tar.gz? I didn't know what GZipFilePart was either.

I think the ReaderFactory would be nicer if it supported generic types. That would clean up ReaderOptions too since you could have options per format.

@adamhathcock
Owner

It's really that in sharpcompress:
Archive = random access
Reader = forward only streaming

I've kludged a common API over different formats as best I could, for fun.

FilePart was a way to have the same file in an archive across multiple physical files. For example, Rar and Zip can divide an archive into multi-file archives. You might have a compressed file split over 2 or more physical archive files because of it. FilePart made sense at the time.

I use Rider so basically I use Resharper :)

I'll merge in my gzip changes soon and release, then prepare for breaking changes. I want more nullables anyway.
