Reading deflate file throw "Unexpected EOF" #837

lutz · 2023-07-12T08:38:42Z

Describe the bug

Hello community,

I am not very familiar with deflate-compressed files, but with the attached file (data.zip) I get an "Unexpected EOF" exception from the InflaterInputStream class. It is reproducible with the following code, at the line var size = _inflater.Read(data, 0, data.Length);

using System.IO;
using ICSharpCode.SharpZipLib.Zip.Compression.Streams;

using (var input = new FileStream(@"data4.bin", FileMode.Open))
{
  using (var output = new MemoryStream(65536))
  {
    using (var _inflater = new InflaterInputStream(input))
    {
      // One buffer is enough; it is reused for every read.
      var data = new byte[4096];

      while (true)
      {
        // This call throws "Unexpected EOF" for the attached file.
        var size = _inflater.Read(data, 0, data.Length);

        if (size > 0)
        {
          output.Write(data, 0, size);
        }
        else
        {
          break;
        }
      }
    }
  }
}

Best regards
Daniel

Reproduction Code

No response

Steps to reproduce

Create a console app, add the latest SharpZipLib release as a NuGet package, and run the code above.

Expected behavior

It should not throw an exception.

Operating System

Windows

Framework Version

No response

Additional context

No response


piksel · 2023-07-12T12:10:37Z

A zip file is not simply a deflate-compressed file, but an archiving format with file tables etc.
The individual files inside the archive may be deflated, but you need to read the file meta data to find out. I think what you are looking for is ZipInputStream instead of InflaterInputStream.

lutz · 2023-07-12T12:37:06Z

The code is not meant to be read as "load a zip archive"; the example is simplified. data4.bin was extracted from the attachment and contains deflate-compressed data. The rest (the archive and so on) works; only the data in this file does not.

piksel · 2023-07-12T13:50:30Z

Okay, I see. How was the file created? If there are no headers or meta data about the deflate stream, it can be hard to debug why the file cannot be read, and it might be related to some unsupported feature in our deflate implementation.

piksel · 2023-07-12T14:43:16Z

I tried reading from your data file, a single byte per read, and it seems like the deflate stream just ends after reading 208671 byte(s):

❯ dotnet run
Read 208671 byte(s) before exception: ICSharpCode.SharpZipLib.SharpZipBaseException: Unexpected EOF
   at ICSharpCode.SharpZipLib.Zip.Compression.Streams.InflaterInputStream.Fill()
   at ICSharpCode.SharpZipLib.Zip.Compression.Streams.InflaterInputStream.Read(Byte[] buffer, Int32 offset, Int32 count)
   at System.IO.Stream.ReadByte()
   at Program.<Main>$(String[] args) in /tmp/szl-deflate/Program.cs:line 21

I also tried running it through zlib's example program zpipe and it gives the same result:

./zpipe -d < data.bin | wc -c
zpipe: invalid or incomplete deflate data
208671

lutz · 2023-07-12T19:45:11Z

Thank you for testing. The exact same exception is thrown here. I don't know which byte is the problem. The file is the deflate data of a PDF page content stream.

If I take the uncompressed data and convert the byte[] to a UTF-8 string, the correct data comes back. So it seems that the single byte that triggers the error is the problem.

lutz · 2023-07-17T07:56:26Z

Is there a way to fix it here?

piksel · 2023-07-17T08:47:34Z

If zlib gives the exact same result, it's the data (input file) that is the problem.
The end looks a bit suspicious, perhaps you can just try removing the end of the file, one byte at a time?

lutz · 2023-07-17T10:59:31Z

Removing bytes one at a time seems like performance overkill. Is it possible to get a more specific exception that says at which position the problem occurs? A naive approach:

int length = 4096;

using (var input = new FileStream(@"data.bin", FileMode.Open))
using (var output = new MemoryStream(65536))
{
    using (var _inflater = new InflaterInputStream(input))
    {
        while (true)
        {
            var data = new byte[length];

            try
            {
                var size = _inflater.Read(data, 0, length);

                if (size > 0)
                {
                    output.Write(data, 0, size);
                }
                else
                {
                    break;
                }
            }
            // On "Unexpected EOF", request one byte less and try again.
            catch (ICSharpCode.SharpZipLib.SharpZipBaseException e) when (e.Message.Equals("Unexpected EOF", StringComparison.OrdinalIgnoreCase))
            {
                length -= 1;
            }
        }
    }

    var strg = System.Text.Encoding.UTF8.GetString(output.ToArray());
}

piksel · 2023-07-17T12:29:11Z

Yes, something like that is what I meant, but not for the final solution, just to find out what parts of the file shouldn't be passed to INFLATE. I assume it would be the same for all files in this format. Perhaps there is an additional CRC or something appended to the end? Or perhaps multiple streams are appended together in the original file and so the last deflate-record has its "isLastRecord" bit set to false?

lutz · 2023-07-18T09:01:30Z

The PDF specification allows the content to be a single deflated stream or an array of streams. But in my understanding, the concatenation into one single content file happens after deflating. So in this case the data is produced as a closed container that is deflated. What is a CRC?

lutz · 2023-07-27T17:59:46Z

Is there any idea why other libraries in the PDF world, with their own implementations of deflate, can work with this data? I don't know why it ends at this point, because there is more data after it.

piksel · 2023-07-28T09:19:40Z

This project is focused on zip and tar.gz/bz2, so I have no insight into PDF, sorry. Plain DEFLATE is not that common in files, I would probably take a look at the producer of those files to see if it either includes too much or too little data. You could also try debugging your program and stepping back in the stack trace to see why more data is required (you would need to have a basic understanding of how the DEFLATE format works though).

asyncritus · 2023-11-07T17:25:19Z

I am getting the same error when trying to inflate the data contained in the file flatedata.zip (unzip the attachment first). It is also a portion of the content stream of a PDF. The data is definitely valid because the PDF from which it was extracted opens fine in Acrobat Reader, and I can also get it to decompress correctly using System.IO.Compression.DeflateStream (after skipping over the first 2 bytes, since DeflateStream expects RFC 1951 data whereas InflaterInputStream expects RFC 1950 data).
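For readers who want to try the DeflateStream route described above, here is a minimal sketch. The class and method names (RawDeflate, InflateIgnoringTrailer) are made up for illustration, and it assumes the zlib header is exactly 2 bytes, i.e. no preset dictionary is used. It is only a way to check whether the DEFLATE payload itself is intact, not a statement about how any particular PDF reader works.

```csharp
using System.IO;
using System.IO.Compression;

public static class RawDeflate
{
    // Inflates a zlib-wrapped (RFC 1950) buffer while ignoring its trailer:
    // skip the 2-byte header, then decode the raw DEFLATE (RFC 1951) payload.
    // Assumption: the FDICT flag is clear, so the header is exactly 2 bytes.
    public static byte[] InflateIgnoringTrailer(byte[] zlibData)
    {
        if (zlibData.Length < 2)
            throw new InvalidDataException("Too short to contain a zlib header.");

        using var input = new MemoryStream(zlibData, 2, zlibData.Length - 2);
        using var inflater = new DeflateStream(input, CompressionMode.Decompress);
        using var output = new MemoryStream();

        // DeflateStream knows nothing about the Adler-32 trailer,
        // so a missing or truncated checksum is never inspected.
        inflater.CopyTo(output);
        return output.ToArray();
    }
}
```

Calling RawDeflate.InflateIgnoringTrailer(File.ReadAllBytes("data.bin")) would be the equivalent of the manual experiment. Because the checksum is not verified, corrupted data can go unnoticed.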

asyncritus · 2023-11-08T15:56:31Z

It looks like this is actually a bug in Adobe's PDF generation engine. It is leaving off the last byte of the Adler-32 checksum if the last byte is 0x00. In the case of the file I provided, the computed checksum is 0x60F7D300, but the last 4 bytes of the data in the encoded stream are 0x00, 0x60, 0xF7, and 0xD3. In the case of the file @lutz provided, the computed checksum is 0x79DFAE00, but the last 4 bytes of data in the encoded stream are 0x00, 0x79, 0xDF, and 0xAE. I have confirmed that adding a byte with value 0x00 to the end of each of these files causes them to process correctly.

It would seem that Acrobat Reader must be ignoring the header and checksum fields and is just processing the raw DEFLATE data.
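To make the checksum comparison above reproducible, here is a small sketch of Adler-32 as defined in RFC 1950, together with the big-endian byte order in which zlib stores it after the DEFLATE data. The class name Adler32Sketch and its methods are illustrative only and are not part of SharpZipLib.

```csharp
using System;

public static class Adler32Sketch
{
    private const uint Modulus = 65521; // largest prime below 2^16

    // Computes the Adler-32 checksum of the uncompressed data.
    public static uint Compute(ReadOnlySpan<byte> data)
    {
        uint a = 1, b = 0;
        foreach (byte value in data)
        {
            a = (a + value) % Modulus;
            b = (b + a) % Modulus;
        }
        return (b << 16) | a;
    }

    // Returns the 4 trailer bytes as zlib writes them: most significant byte first.
    public static byte[] ToTrailer(uint checksum) => new[]
    {
        (byte)(checksum >> 24),
        (byte)(checksum >> 16),
        (byte)(checksum >> 8),
        (byte)checksum
    };
}
```

For the checksum 0x60F7D300 quoted above, ToTrailer yields 0x60, 0xF7, 0xD3, 0x00, so a file ending in 0x60, 0xF7, 0xD3 is missing exactly the final 0x00.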

piksel · 2023-11-08T16:34:35Z

@asyncritus great detective work!

It could also be the case that the way they are reading/writing the checksum allows for truncating trailing null bytes. In the case of SharpZipLib it should be fairly easy to try to fill any missing bytes in the CRC with 0 bytes if it reaches EOF...

piksel · 2023-11-08T16:39:48Z

...or perhaps it's the tool that extracts out the PDF streams that strips the trailing null bytes? How did you produce the file?

asyncritus · 2023-11-08T17:14:03Z

I opened the PDF file in a hex editor and stripped out everything directly before and directly after the binary stream data. Here you can see where the 0x00 at the end is missing:

Of course this is done programmatically by our PDF parsing software where the problem first manifested itself.

lutz · 2023-11-09T09:29:22Z

@asyncritus Great work. Your result matches what I suspected about the Adobe PDF engine.

lutz · 2023-11-09T09:33:48Z

@piksel We don't trim this information when we read. It seems that the Adobe PDF engine has been doing this since a specific update. We were able to identify that the behaviour changed with Adobe's InDesign 18.5 update (Windows and Mac). Before that update it works, and after it does not.

asyncritus · 2023-11-09T12:53:51Z

After some further investigation with more examples, I've found that it is not just leaving off trailing 0x00 bytes, but as soon as it encounters a 0x00 byte in the checksum, it stops writing data. For example, in one situation the checksum is 0x001E9C82, and none of those bytes are present. In another case, the checksum is 0x6C00878A, and only 0x6C was present.

Our customer that is having these issues is using InDesign 19.0. We are trying to obtain the original InDesign documents so that we can test with an earlier version.

@lutz Have you contacted Adobe about this issue?
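The truncation rule described in this comment can be modelled in a few lines. This is only a sketch of the observed behaviour, not Adobe's actual code; the names TruncatedTrailer, Emit and IsAffected are invented for the example.

```csharp
using System;
using System.Collections.Generic;

public static class TruncatedTrailer
{
    // Models the observation: the checksum bytes are written most significant
    // first, and writing stops as soon as a 0x00 byte is encountered.
    public static byte[] Emit(uint checksum)
    {
        var written = new List<byte>();
        for (int shift = 24; shift >= 0; shift -= 8)
        {
            byte value = (byte)(checksum >> shift);
            if (value == 0x00)
                break; // nothing after the first zero byte is written
            written.Add(value);
        }
        return written.ToArray();
    }

    // True when at least one checksum byte would be missing from the file.
    public static bool IsAffected(uint checksum) => Emit(checksum).Length < 4;
}
```

With the values from this thread, Emit(0x60F7D300) gives 0x60, 0xF7, 0xD3; Emit(0x6C00878A) gives only 0x6C; and Emit(0x001E9C82) gives no bytes at all. Appending zero bytes therefore only repairs the first kind of file.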

lutz · 2023-11-09T21:09:42Z

We could reproduce the behaviour back to version 18.5. One of our customers was able to check multiple InDesign versions, and v18.5 seems to be the first affected one. v17 should definitely work.

We have no contact with Adobe. The problem is that most PDF viewers we checked work with the files (Adobe Acrobat/Reader, PDF-XChange, SumatraPDF, browsers and so on). It could be that most of them behave identically, ignoring the checksum and interpreting the raw data.

So we do not have a strong enough argument.

The PDF specification is clear that deflate should be used, and the deflate spec is strict about its format (checksum and so on).

It is not the first time that Adobe, as the inventor of the PDF format, interprets PDF files loosely rather than strictly.

piksel · 2023-11-11T12:54:09Z

It seems like the only thing we can do is to add a way to ignore the CRC (in the library, that is). It should be a useful option to have in any case...
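Until such an option exists, one possible workaround on the caller's side is sketched below. It assumes the existing Inflater(bool noHeader) constructor and the InflaterInputStream(Stream, Inflater) overload in SharpZipLib; the wrapper class LenientInflate and its method name are hypothetical. The idea is to consume the 2-byte zlib header by hand and let the inflater treat the rest as raw DEFLATE, so the Adler-32 trailer is never requested.

```csharp
using System.IO;
using ICSharpCode.SharpZipLib.Zip.Compression;
using ICSharpCode.SharpZipLib.Zip.Compression.Streams;

public static class LenientInflate
{
    public static byte[] InflateWithoutChecksum(Stream zlibStream)
    {
        // Read the zlib header (CMF, FLG) manually.
        int cmf = zlibStream.ReadByte();
        int flg = zlibStream.ReadByte();

        // Compression method must be 8 (deflate) and no preset dictionary may be used.
        if (cmf < 0 || flg < 0 || (cmf & 0x0F) != 8 || (flg & 0x20) != 0)
            throw new InvalidDataException("Not a plain zlib stream.");

        // Passing true (noHeader) makes the Inflater expect raw DEFLATE data,
        // i.e. neither a zlib header nor an Adler-32 trailer.
        using var inflater = new InflaterInputStream(zlibStream, new Inflater(true));
        using var output = new MemoryStream();
        inflater.CopyTo(output);
        return output.ToArray();
    }
}
```

Note that disposing the InflaterInputStream also closes the underlying stream by default, and that skipping the checksum means corrupted data is no longer detected.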

lutz added the bug label Jul 12, 2023

github-actions bot added the *no response* label Jul 12, 2023

icsharpcode deleted a comment from SourceproStudio Jul 12, 2023

piksel removed bug *no response* labels Jul 22, 2023