Reading deflate file throw "Unexpected EOF" #837

Open
lutz opened this issue Jul 12, 2023 · 22 comments
Comments

@lutz

lutz commented Jul 12, 2023

Describe the bug

Hello community,

I am not very familiar with deflate-compressed files, but with the attached file (data.zip) I get an "Unexpected EOF" exception from the InflaterInputStream class. It is reproducible with the following code, at the line var size = _inflater.Read(data, 0, data.Length);

using (var input = new FileStream(@"data4.bin", FileMode.Open))
{
  using (var output = new MemoryStream(65536))
  {
    using (var _inflater = new InflaterInputStream(input))
    {
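      // One buffer is enough; it is reused for every read.
      var data = new byte[4096];

      while (true)
      {
        // Read returns 0 at the end of the stream; the exception is thrown here.
        var size = _inflater.Read(data, 0, data.Length);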

        if (size > 0)
        {
          output.Write(data, 0, size);
        }
        else
        {
          break;
        }
      }
    }
  }
}

Best regards
Daniel

Reproduction Code

No response

Steps to reproduce

Create a console app, add the latest SharpZipLib release as a NuGet package, and run the above code.

Expected behavior

It should not throw an exception.

Operating System

Windows

Framework Version

No response

Tags

No response

Additional context

No response

@piksel
Member

piksel commented Jul 12, 2023

A zip file is not simply a deflate-compressed file, but an archiving format with file tables etc.
The individual files inside the archive may be deflated, but you need to read the file meta data to find out. I think what you are looking for is ZipInputStream instead of InflaterInputStream.

@lutz
Author

lutz commented Jul 12, 2023

The code should not be interpreted as "load a zip archive". The example is simplified. data4.bin was extracted and contains deflate-compressed data. The rest (the archive and so on) works, but not the data in this file.

@piksel
Member

piksel commented Jul 12, 2023

Okay, I see. How was the file created? If there are no headers or meta data about the deflate stream, it can be hard to debug why the file cannot be read, and it might be related to some unsupported feature in our deflate implementation.

@piksel
Member

piksel commented Jul 12, 2023

I tried reading from your data file, a single byte per read, and it seems like the deflate stream just ends after reading 208671 byte(s):

❯ dotnet run
Read 208671 byte(s) before exception: ICSharpCode.SharpZipLib.SharpZipBaseException: Unexpected EOF
   at ICSharpCode.SharpZipLib.Zip.Compression.Streams.InflaterInputStream.Fill()
   at ICSharpCode.SharpZipLib.Zip.Compression.Streams.InflaterInputStream.Read(Byte[] buffer, Int32 offset, Int32 count)
   at System.IO.Stream.ReadByte()
   at Program.<Main>$(String[] args) in /tmp/szl-deflate/Program.cs:line 21

I also tried running it through zlib's example program zpipe and it gives the same result:

./zpipe -d < data.bin | wc -c
zpipe: invalid or incomplete deflate data
208671
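The single-byte-per-read experiment above can be sketched roughly as follows. This is a minimal illustration, not the program that was actually run: `StreamProbe` and `CountUntilFailure` are hypothetical names, and the helper works on any `Stream`, so it could wrap the `InflaterInputStream` from the original report.

```csharp
#nullable enable
using System;
using System.IO;

public static class StreamProbe
{
    // Reads one byte at a time and reports how many bytes were produced
    // before the stream ended normally or threw an exception.
    public static (long BytesRead, Exception? Error) CountUntilFailure(Stream stream)
    {
        long count = 0;
        try
        {
            // ReadByte returns -1 at a regular end of stream.
            while (stream.ReadByte() != -1)
            {
                count++;
            }
            return (count, null);
        }
        catch (Exception ex)
        {
            // The count tells us how much output was decoded before the error.
            return (count, ex);
        }
    }
}
```

With SharpZipLib referenced, it could be called as `StreamProbe.CountUntilFailure(new InflaterInputStream(File.OpenRead("data.bin")))`; the file name is taken from the thread.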

@lutz
Author

lutz commented Jul 12, 2023

Thank you for testing. The exact same exception is thrown here. I don't know which byte is the problem. The file is the deflate data of a PDF page content stream.

If I take the decompressed data and convert the byte[] to a UTF-8 string, the correct data comes back. So it seems that the single byte that triggers the error is the problem.

@lutz
Author

lutz commented Jul 17, 2023

Is there a way to fix it here?

@piksel
Member

piksel commented Jul 17, 2023

If zlib gives the exact same result, it's the data (input file) that is the problem.
The end looks a bit suspicious, perhaps you can just try removing the end of the file, one byte at a time?

@lutz
Author

lutz commented Jul 17, 2023

Removing bytes one at a time seems like performance overkill. Is it possible to get a more specific exception that says at which position the problem occurs? A naive way:

int length = 4096;

using (var input = new FileStream(@"data.bin", FileMode.Open))
{
    using (var output = new MemoryStream(65536))
    {
        using (var _inflater = new InflaterInputStream(input))
        {
            byte[] data;

            while (true)
            {
                data = new byte[length];

                try
                {
                    var size = _inflater.Read(data, 0, length);

                    if (size > 0)
                    {
                        output.Write(data, 0, size);
                    }
                    else
                    {
                        break;
                    }
                }
                catch (ICSharpCode.SharpZipLib.SharpZipBaseException e) when (e.Message.Equals("Unexpected EOF", StringComparison.OrdinalIgnoreCase))
                {
                    // Retry with a smaller read size; any other exception propagates.
                    length -= 1;
                }
            }
        }

        var strg = System.Text.Encoding.UTF8.GetString(output.ToArray());
    }
}

@piksel
Member

piksel commented Jul 17, 2023

Yes, something like that is what I meant, but not for the final solution, just to find out what parts of the file shouldn't be passed to INFLATE. I assume it would be the same for all files in this format. Perhaps there is an additional CRC or something appended to the end? Or perhaps multiple streams are appended together in the original file and so the last deflate-record has its "isLastRecord" bit set to false?

@lutz
Author

lutz commented Jul 18, 2023

The PDF specification allows the content to be a single deflated stream or an array of streams. But in my understanding, the concatenation into one single content file happens after deflating. So in this case the data is produced as a closed container which is deflated. What is a CRC?

@lutz
Author

lutz commented Jul 27, 2023

Is there any idea why other libraries in the PDF world, with their own deflate implementations, can work with the data? I don't know why it ends at this point, because there is more data after it.

@piksel
Member

piksel commented Jul 28, 2023

This project is focused on zip and tar.gz/bz2, so I have no insight into PDF, sorry. Plain DEFLATE is not that common in files, I would probably take a look at the producer of those files to see if it either includes too much or too little data. You could also try debugging your program and stepping back in the stack trace to see why more data is required (you would need to have a basic understanding of how the DEFLATE format works though).

@asyncritus

I am getting the same error trying to inflate the data contained in this file: flatedata.zip (unzip the attachment first). It is also a portion of the content stream of a PDF. The data is definitely valid because the PDF from which it was extracted opens fine in Acrobat Reader, and I can also get it to decompress correctly using System.IO.Compression.DeflateStream (after skipping over the first 2 bytes, since DeflateStream expects RFC 1951 data, whereas InflaterInputStream expects RFC 1950 data).
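The DeflateStream approach mentioned here could look roughly like the sketch below. It assumes a 2-byte zlib header with no preset dictionary, and it relies on DeflateStream stopping at the final DEFLATE block without looking at the Adler-32 trailer. `ZlibSkipHeader` and `InflateIgnoringTrailer` are hypothetical names chosen for illustration.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class ZlibSkipHeader
{
    // Inflates the raw DEFLATE (RFC 1951) payload inside a zlib (RFC 1950)
    // stream. The header is skipped and the checksum trailer is never verified.
    public static byte[] InflateIgnoringTrailer(byte[] zlibData)
    {
        const int headerLength = 2; // CMF + FLG, assuming the FDICT flag is not set

        if (zlibData.Length < headerLength || (zlibData[0] & 0x0F) != 8)
        {
            // Compression method 8 (deflate) is the only one RFC 1950 defines.
            throw new ArgumentException("Not a zlib stream", nameof(zlibData));
        }

        using var input = new MemoryStream(zlibData, headerLength, zlibData.Length - headerLength);
        using var deflate = new DeflateStream(input, CompressionMode.Decompress);
        using var output = new MemoryStream();
        deflate.CopyTo(output);
        return output.ToArray();
    }
}
```

A call such as `ZlibSkipHeader.InflateIgnoringTrailer(File.ReadAllBytes("data.bin"))` would then return the decompressed content stream. Note that this gives up the integrity check the trailer is meant to provide.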

@asyncritus

It looks like this is actually a bug in Adobe's PDF generation engine. It is leaving off the last byte of the Adler-32 checksum if the last byte is 0x00. In the case of the file I provided, the computed checksum is 0x60F7D300, but the last 4 bytes of the data in the encoded stream are 0x00, 0x60, 0xF7, and 0xD3. In the case of the file @lutz provided, the computed checksum is 0x79DFAE00, but the last 4 bytes of data in the encoded stream are 0x00, 0x79, 0xDF, and 0xAE. I have confirmed that adding a byte with value 0x00 to the end of each of these files causes them to process correctly.
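For readers who want to repeat this comparison, here is a minimal sketch of the Adler-32 computation from RFC 1950 and of the 4-byte big-endian trailer a zlib stream is expected to end with. `Adler32Check`, `Compute`, and `ToTrailer` are hypothetical names; the checksum is taken over the decompressed data.

```csharp
using System;

public static class Adler32Check
{
    // Adler-32 as defined in RFC 1950: two running sums modulo 65521.
    public static uint Compute(ReadOnlySpan<byte> data)
    {
        const uint Mod = 65521;
        uint a = 1, b = 0;
        foreach (byte value in data)
        {
            a = (a + value) % Mod;
            b = (b + a) % Mod;
        }
        return (b << 16) | a;
    }

    // The zlib trailer stores the checksum most significant byte first.
    public static byte[] ToTrailer(uint checksum) => new[]
    {
        (byte)(checksum >> 24),
        (byte)(checksum >> 16),
        (byte)(checksum >> 8),
        (byte)checksum,
    };
}
```

For the checksum 0x60F7D300 quoted above, `ToTrailer` yields 0x60, 0xF7, 0xD3, 0x00, which matches the observed bytes except for the missing final 0x00.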

It would seem that Acrobat Reader must be ignoring the header and checksum fields and is just processing the raw DEFLATE data.

@piksel
Member

piksel commented Nov 8, 2023

@asyncritus great detective work!

It could also be the case that the way they are reading/writing the checksum allows for truncating trailing null bytes. In the case of SharpZipLib it should be fairly easy to try to fill any missing bytes in the CRC with 0 bytes if it reaches EOF...

@piksel
Member

piksel commented Nov 8, 2023

...or perhaps it's the tool that extracts out the PDF streams that strips the trailing null bytes? How did you produce the file?

@asyncritus

I opened the PDF file in a hex editor and stripped out everything directly before and directly after the binary stream data. Here you can see where the 0x00 at the end is missing:

[Screenshot: end of data]

Of course this is done programmatically by our PDF parsing software where the problem first manifested itself.

@lutz
Author

lutz commented Nov 9, 2023

@asyncritus Great work. Your result matches what I suspected about the Adobe PDF engine.

@lutz
Author

lutz commented Nov 9, 2023

@piksel We don't strip this information when we read. It seems that the Adobe PDF engine started doing this with a specific update. We were able to determine that the behaviour changed with Adobe's InDesign 18.5 update (Windows and Mac). Before that update it works; after it, it does not.

@asyncritus

After some further investigation with more examples, I've found that it is not just leaving off trailing 0x00 bytes, but as soon as it encounters a 0x00 byte in the checksum, it stops writing data. For example, in one situation the checksum is 0x001E9C82, and none of those bytes are present. In another case, the checksum is 0x6C00878A, and only 0x6C was present.
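The observed pattern behaves as if the checksum were written like a null-terminated string. The sketch below only models that observation, it is not Adobe's code, and `TruncatedTrailer` and `BytesBeforeFirstZero` are hypothetical names. It returns the trailer bytes one would expect to find in an affected file for a given Adler-32 value.

```csharp
using System;
using System.Collections.Generic;

public static class TruncatedTrailer
{
    // Walks the checksum most significant byte first (zlib trailer order)
    // and stops at the first 0x00 byte, as the affected files appear to do.
    public static byte[] BytesBeforeFirstZero(uint checksum)
    {
        var written = new List<byte>();
        for (int shift = 24; shift >= 0; shift -= 8)
        {
            byte current = (byte)(checksum >> shift);
            if (current == 0)
            {
                break; // nothing after the first zero byte is written
            }
            written.Add(current);
        }
        return written.ToArray();
    }
}
```

For the values reported in this thread, 0x60F7D300 gives three bytes (0x60, 0xF7, 0xD3), 0x6C00878A gives only 0x6C, and 0x001E9C82 gives no bytes at all. This is why padding the end of the stream with zero bytes would repair only the first kind of file.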

Our customer that is having these issues is using InDesign 19.0. We are trying to obtain the original InDesign documents so that we can test with an earlier version.

@lutz Have you contacted Adobe about this issue?

@lutz
Author

lutz commented Nov 9, 2023

We could reproduce the behaviour back to version 18.5. One of our customers was able to check multiple InDesign versions, and v18.5 seems to be the first affected one. v17 should definitely work.

We have no contact with Adobe. The problem is that most PDF viewers we checked work with the files (Adobe Acrobat/Reader, PDF-XChange, Sumatra, browsers, and so on). It could be that most of them behave identically: they ignore the checksum and interpret the raw data.

So we do not have a strong enough argument.

The PDF specification is clear that deflate should be used, and the deflate spec is strict about its format (checksum and so on).

It is not the first time that Adobe, as the inventor of the PDF format, has interpreted PDF files loosely rather than strictly.

@piksel
Member

piksel commented Nov 11, 2023

It seems like the only thing we can do is to add a way to ignore the CRC (in the library, that is). It should be a useful option to have in any case...

3 participants