Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible regression from 1.2.0 to 1.2.1 when ZIP entry's filename contains backslash #324

Closed
Capncavedan opened this issue Apr 10, 2017 · 12 comments · Fixed by #487
Closed

Comments

@Capncavedan
Copy link

Capncavedan commented Apr 10, 2017

I experienced an issue with rubyzip last Fall where a customer-supplied ZIP had backslashes instead of forward-slashes for all the directory separators; I resolved it at the time by updating from rubyzip 0.9.9 to 1.2.0. It translated the backslashes to forward slashes e.g. directory separators on OS X and Linux. (0.9.9 behaved "correctly" on OS X but not on Linux).

The recent update to 1.2.1 has changed behavior on Linux such that the backslash is considered part of the filename again. OS X continues to exhibit what I would consider to be the desired behavior, treating backslashes as directory separators. See below.

➜  zipinfo file-with-backslashes-in-path.zip
Archive:  file-with-backslashes-in-path.zip
Zip file size: 200 bytes, number of entries: 1
-rw-r--r--  3.0 unx        6 tx stor 16-Nov-30 19:44 sample\sample_file.txt
1 file, 6 bytes uncompressed, 6 bytes compressed:  0.0%

# with rubyzip 1.2.1:
> src = ::Zip::File.new "file-with-backslashes-in-path.zip"
> src.entries.first.name
=> "sample\\sample_file.txt"

# with rubyzip 1.2.0:
> src = ::Zip::File.new "file-with-backslashes-in-path.zip"
> src.entries.first.name
=> "sample/sample_file.txt"

Is this change in behavior from 1.2.0 to 1.2.1 intentional? Would you consider it a bug that it behavior is different from Linux to OS X?

@KierranM
Copy link

It would seem that creating a zip in Windows via Right Click > Send To > Compressed Folder, or using PowerShell always creates the folders with backslashes as the separator. We've had failures because of this when attempting to access the files using a forward slash and getting an File does not exist error.

@rlaveycal
Copy link

There seems to be a missing translation of \ to / in the read_c_dir_entry method of Entry.rb. This means that when you iterate over the zip entries they have the "wrong" name.

@name.tr!('\\', '/')

Just needs adding after

@name = io.read(@name_length)

@simonista
Copy link

simonista commented Mar 15, 2018

This is hitting us as well, definitely a regression between 1.2.0 and 1.2.1.

You can see the commit that caused it here: #308

I'm not an expert in character encodings, but it seems like the issue it was trying to address may only be an issue in older ruby versions? I have both ruby 2.2 and 2.4 installed locally (on macOS), and in both cases '表'.encode('windows-31j') == '表'.encode('windows-31j').tr('\\', '/').

@dogatana can you comment at all?

@dogatana
Copy link
Contributor

Hi, guys.

The following irb session log explains the issue caused by tr.

Zip file doesn't have any encoding information about the file names in it.
So the name "表1.txt" is extracted as a string "\x95\1.txt" encoded as ASCII-8BIT.
If we apply tr('\', '/') on this string, the second byte of '表' will be removed.
We cannot use it as a valid file name.

Rubyzip provides us the way to handle file names in zip file, right?
If original byte sequence of the name is maintained, we can avoid trobules.
In this case, original sequences are altered, so I made PR for it.

Regards,

irb
>> file = '表1.txt'
=> "表1.txt"
>> p file
"表1.txt"
=> "表1.txt"
>> file.tr('\\', '/')
=> "表1.txt"
>> file.force_encoding('ascii-8bit')
=> "\x95\\1.txt"
>> new_name = file.tr('\\', '/')
=> "\x95/1.txt"
>> new_name.encode('windows-31j')
Encoding::UndefinedConversionError: "\x95" to UTF-8 in conversion from ASCII-8BIT to UTF-8 to
Windows-31J
        from (irb):17:in `encode'
        from (irb):17
        from C:/Ruby24/bin/irb.cmd:19:in `<main>'
>>

@bbuchalter
Copy link
Contributor

bbuchalter commented Jan 26, 2021

@sofie-c and I were recently bit by this issue and were hoping to make an attempt at a fix. @dogatana, above you mentioned:

Rubyzip provides us the way to handle file names in zip file, right? If original byte sequence of the name is maintained, we can avoid troubles.

Can you clarify what you meant here? Are you suggesting we apply the .tr('\\', '/') change to the file name some where else, or restore the .tr('\\', '/'), but use it conditionally based on charactered encoding?

Either way, I think we need some failing tests to demonstrate both the issue #308 was meant to solve, alongside this issue. Perhaps that's where I'll start.

If anyone solution ideas, please share; we're new to this codebase.

@bbuchalter
Copy link
Contributor

I modified the rubyzip test suite so that when it generates test zip files, it includes a file named p932_encoding_in_filename_表.txt. This causes many tests to fail.

The first failure I'm examining is against test/central_directory_test.rb:13. The purpose of this test is to assert that we get the expected entries in the zip. The test fails because we've told it to expect a file named cp932_encoding_in_filename_表.txt but the entry's name is cp932_encoding_in_filename_\xE8\xA1\xA8.txt.

is represented as \xE8\xA1\xA8 because:

irb(main):012:0> '表'.force_encoding('ascii-8bit')
=> "\xE8\xA1\xA8"
irb(main):006:0> '表'.bytes
=> [232, 161, 168]
irb(main):004:0> "\xE8\xA1\xA8".bytes
=> [232, 161, 168]

I'm trying to decide how to interpret this. Is this failure telling me the application code is wrong or if my test assertion is wrong?

@hainesr
Copy link
Member

hainesr commented Jan 26, 2021

There are tests with Unicode filenames that work though, aren't there?

Do you have the Unicode option on when doing this?

@bbuchalter
Copy link
Contributor

Thanks for the reply @hainesr! Are you talking about #340 - if so, no I don't. I'll give it a try.

@hainesr
Copy link
Member

hainesr commented May 29, 2021

@bbuchalter any joy? One thing I have learned about character encoding over the years is that sometimes you just need to know what character encoding you have, and there's no way to guess/infer it.

@hainesr
Copy link
Member

hainesr commented May 29, 2021

Maybe we could reinstate the .tr('\\', '/') after the encoding has been enforced?

@name.force_encoding(::Zip.force_entry_names_encoding)

@hainesr
Copy link
Member

hainesr commented Jun 1, 2021

Having played around with just about every combination of characters and encodings that I can think of, I think the solution to this to reinstate the .tr('\\', '/') after the encoding of the filename has been enforced. The various encodings of '表' will survive the tr intact if the encoding is correct.

If you have non-ascii characters in your filenames I think the solution is that you need to set ::Zip.force_entry_names_encoding appropriately. There is just no way of inferring what encoding a particular string of bytes is intended to be interpreted as.

@bbuchalter you are seeing '表'.force_encoding('ascii-8bit') => "\xE8\xA1\xA8" because your '表' is encoded in UTF-8! This is because irb and ruby source files are read as UTF-8 (I think). And force-encoding only changes the encoding, not the bytes, so you're seeing the raw bytes of the UTF-8 encoding of '表'. See this for an explanation.

Some working from irb:

> '表'.tr('\\', '/')
 => "表"

> '表\表'.tr('\\', '/')
 => "表/表"

> '表\表'.encode('windows-31j')
 => "\x{955C}\\\x{955C}"

> '表\表'.encode('windows-31j').tr('\\', '/')
 => "\x{955C}/\x{955C}"

> '表\表'.encode('windows-31j').tr('\\', '/').encode('utf-8')
 => "/"

@hainesr hainesr added the bug label Jun 1, 2021
hainesr added a commit to hainesr/rubyzip that referenced this issue Jun 2, 2021
But only do it after we have set filename encoding appropriately to
avoid breaking multibyte characters with `\`s in them.

Fixes rubyzip#324.
@hainesr
Copy link
Member

hainesr commented Jun 2, 2021

Another thought on this: the zip spec says nothing about handling any other encodings other than IBM Code Page 437 and UTF-8. See APPENDIX D in the spec.

Obviously people are using other encodings out there in the wild, which is presumably why we have ::Zip.force_entry_names_encoding available.

hainesr added a commit that referenced this issue Jun 4, 2021
But only do it after we have set filename encoding appropriately to
avoid breaking multibyte characters with `\`s in them.

Fixes #324.
jspanjers pushed a commit to jspanjers/rubyzip that referenced this issue Jun 5, 2021
But only do it after we have set filename encoding appropriately to
avoid breaking multibyte characters with `\`s in them.

Fixes rubyzip#324.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants