Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zlib format incorrectly matches on ASCII files starting with the letter x #386

Merged
merged 3 commits into from Sep 12, 2023

Conversation

dpgarrick
Copy link
Contributor

@dpgarrick dpgarrick commented Sep 11, 2023

Added a test to demonstrate a bug in the Identify routine where the zlib format incorrectly matches on ASCII files starting with the letter x (because ASCII x is the same as the first expected byte for ZlibHeader).

Unsure of how best to address.

The ZlibHeader match could be expanded to look at the second byte and verify it is one of the accepted values associated with the different compression levels.

Or could try and decompress the first part of the file using zlib? Not sure how robust that is, or how much you would need to read...maybe something like the following?

func (zz Zlib) Match(filename string, stream io.Reader) (MatchResult, error) {
	var mr MatchResult

	// match filename
	if strings.Contains(strings.ToLower(filename), zz.Name()) {
		mr.ByName = true
	}

	// match file header
	buf, err := readAtMost(stream, len(ZlibHeader))
	if err != nil {
		return mr, err
	}
	if bytes.Equal(buf, ZlibHeader) {
		// ZlibHeader can be accidentally matched by an ASCII file starting with the letter x
		// Try to decompress the first part of the stream to double check
		buf, err = readAtMost(stream, 4096)
		if err != nil {
			return mr, err
		}

		r, err := zlib.NewReader(bytes.NewReader(buf))
		if err != nil {
			return mr, nil // Not a zlib file
		}
		defer r.Close()

		_, err = io.ReadAll(r)
		if err == nil {
			mr.ByStream = true // Successfully decompressed, likely a zlib file
		}
	}

	return mr, nil
}

@mholt
Copy link
Owner

mholt commented Sep 11, 2023

Lol, in hindsight checking a single byte that happens to be in the ASCII range was not very bright of me 😂 ... I was likely in a hurry.

I'm headed to sleep now but I'm looking forward to checking this out! Thanks for the test.

@mholt
Copy link
Owner

mholt commented Sep 11, 2023

Apparently the first digit might not always be 7, but it is "usually" 7. (It could be 0-7)... and then there are other combinations for the next few bytes: https://stackoverflow.com/a/54915442/1048862

That SO answer gives a table of the 32 possible headers:

      FLEVEL: 0       1       2       3
CINFO:
     0      08 1D   08 5B   08 99   08 D7
     1      18 19   18 57   18 95   18 D3
     2      28 15   28 53   28 91   28 CF
     3      38 11   38 4F   38 8D   38 CB
     4      48 0D   48 4B   48 89   48 C7
     5      58 09   58 47   58 85   58 C3
     6      68 05   68 43   68 81   68 DE
     7      78 01   78 5E   78 9C   78 DA

and so I wonder if it'd be easy enough to just compare the first 4 bytes against those in this table and if there's any matches, it's likely zlib.

I like your approach for decompressing some of it to see if that works -- but it feels like that'd be slower (I'm guessing, though I'm not sure, that it makes allocations). It might not actually matter in practice, which approach we take.

Part of me likes the elegant idea of just comparing a fixed number of bytes, versus setting up a decompressor. What do you think?

@dpgarrick
Copy link
Contributor Author

Yeah I prefer to compare a set number of bytes as well. I pushed an implementation of that and added a test - not extensive in the sense that it doesn't test all the possible values so we are assuming the table provided on SO is accurate. But seems to work for the default compression case and fixes the original issue (matching on ASCII x).

Copy link
Owner

@mholt mholt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful! I like this a lot more. Thanks for your thoroughness :)

@mholt mholt merged commit 24fa33e into mholt:master Sep 12, 2023
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants