Incorrect decoding of file names with some zip files #403

Closed
jimfcarroll opened this issue Jan 29, 2022 · 4 comments

jimfcarroll commented Jan 29, 2022

I have a zip file whose file names zip4j doesn't decode correctly, while all of the command-line utilities and Apache's VFS2 do. From the contents of the zip file, it looks like it was made on a Mac. Here is the output of test code I wrote. The first two lines come from Apache VFS2; the second two come from Zip4j's ZipInputStream:

Aquinas, St. Thomas/Primary/Français/Aquin - De l'éternite du monde.doc
9:Français
Aquinas, St. Thomas/Primary/Français/Aquin - De l'éternite du monde.doc
10:Français

The file is 700MB, so I can't really attach it here. I tried to create a smaller file by unzipping the archive and rezipping only that file, but Zip4j worked fine on the rezipped file.

I saw issue #304, which I'm not sure is related. In any case, I don't have control over the zip files I'm unzipping, so I don't know the encoding beforehand.

For completeness, the following is my test code. It prints out the name of the entry at index 184 (counting from zero) in the zip file:

import java.io.File;
import java.io.FileInputStream;

import org.apache.commons.compress.archivers.ArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipArchiveInputStream;

import net.lingala.zip4j.io.inputstream.ZipInputStream;
import net.lingala.zip4j.model.LocalFileHeader;

public class TestEncoding {
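    // Returns the path segment that starts at "Fran" (the "Français" directory),
    // or the whole name if that marker is not present.
    public static String pickOutDirName(final String name) {
        final int start = name.indexOf("Fran");
        if(start < 0)
            return name;
        final int end = name.indexOf('/', start);
        return end < 0 ? name.substring(start) : name.substring(start, end);
    }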

    public static void main(final String[] args) throws Exception {
        final String fullPath = "/path/to/zip/file.zip";
        final File file = new File(fullPath);

        int entryCount = 0;
        try(ZipArchiveInputStream isa = new ZipArchiveInputStream(new FileInputStream(file));) {
            for(ArchiveEntry entry = isa.getNextEntry(); entry != null; entry = isa.getNextEntry()) {
                if(entryCount == 184) {
                    final String name = entry.getName();
                    final String dirName = pickOutDirName(name);
                    System.out.println(name);
                    System.out.println(dirName.length() + ":" + dirName);
                    break;
                }
                entryCount++;
            }
        }

        entryCount = 0;
        try(ZipInputStream isz = new ZipInputStream(new FileInputStream(file));) {
            for(LocalFileHeader entry = isz.getNextEntry(); entry != null; entry = isz.getNextEntry()) {
                if(entryCount == 184) {
                    final String name = entry.getFileName();
                    final String dirName = pickOutDirName(name);
                    System.out.println(name);
                    System.out.println(dirName.length() + ":" + dirName);
                    break;
                }
                entryCount++;
            }
        }
    }
}
@srikanth-lingala
Owner

Each entry in a zip file carries a flag in its header (bit 11 of the general purpose bit flag) that says whether the entry's file name is UTF-8 encoded. Any tool that creates a zip file has to set this flag if it encodes file names as UTF-8. In this case, I think the tool that created the zip did not set the flag even though it used UTF-8 for the file names. The difference you see between zip4j and Apache Commons Compress is that, when the flag is not set, zip4j falls back to the zip specification's standard charset (Cp437), whereas Commons Compress uses UTF-8 even if the flag is not set. Technically speaking, this is not per the specification, but I tend to agree with the Commons Compress approach of using UTF-8 even when the flag is not set. I will change this in zip4j and include it in the next release.
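
To make the effect of that flag concrete, here is a minimal sketch using only the JDK. It assumes (this is not confirmed in the thread) that the directory name was stored in decomposed form, as macOS tools often do, so the "ç" is a plain "c" followed by the combining cedilla U+0327. The class and method names are hypothetical, and the sketch does not reproduce either library's actual code; it only shows how the same ten bytes yield a 10-character name under Cp437 and a 9-character name under UTF-8, matching the lengths in the report. It also assumes the JDK ships the Cp437 charset, which standard distributions do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of zip4j or Commons Compress.
final class NameDecodingDemo {
    // Bit 11 of the general purpose bit flag marks UTF-8 encoded names.
    static final int UTF8_FLAG = 1 << 11;

    static boolean isUtf8FlagSet(final int generalPurposeFlags) {
        return (generalPurposeFlags & UTF8_FLAG) != 0;
    }

    // Decodes a raw entry name: UTF-8 if the flag is set, otherwise the
    // caller-supplied fallback charset.
    static String decodeName(final byte[] rawName, final int generalPurposeFlags,
                             final Charset fallback) {
        final Charset cs = isUtf8FlagSet(generalPurposeFlags)
                ? StandardCharsets.UTF_8 : fallback;
        return new String(rawName, cs);
    }

    public static void main(final String[] args) {
        // Assumption: decomposed "Français" ("c" + U+0327), written as UTF-8.
        final byte[] raw = "Franc\u0327ais".getBytes(StandardCharsets.UTF_8);
        final Charset cp437 = Charset.forName("Cp437");

        // Flag not set (0): the fallback charset decides the result.
        final String specDefault = decodeName(raw, 0, cp437);
        final String utf8Default = decodeName(raw, 0, StandardCharsets.UTF_8);

        System.out.println(specDefault.length() + ":" + specDefault); // length 10
        System.out.println(utf8Default.length() + ":" + utf8Default); // length 9
    }
}
```

The two bytes of the combining cedilla each become a separate Cp437 character, which accounts for the extra character in the zip4j output.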

@jimfcarroll
Author

Thanks. Once this hits Maven Central I'll give it another try.

@srikanth-lingala
Owner

Fixed in v2.10.0, released today.

@srikanth-lingala
Owner

srikanth-lingala commented Jun 3, 2022

@jimfcarroll I am reverting the change I made here because it had side effects on zip files that use the zip standard charset, as reported in this issue. The change I made as part of this issue was to use UTF-8 by default in zip4j. However, and here I am contradicting my earlier comment in this issue, that is not per the zip specification. The specification states that the zip standard charset is to be used if the UTF-8 flag is not set.

In your case, if you are sure your zip files use UTF-8 encoding, you can force zip4j to use UTF-8 with ZipFile.setCharset(StandardCharsets.UTF_8).
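
When the encoding is not known beforehand, as in the original report, one possible heuristic is to take a name that was decoded as Cp437, recover its raw bytes, and re-decode them as UTF-8 only if they form a strictly valid UTF-8 sequence. The sketch below uses only the JDK; the class and method names are hypothetical and nothing here is part of zip4j. It is a heuristic, not a guarantee: a genuine Cp437 name can, in rare cases, also happen to be valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of zip4j.
final class NameRepair {
    static final Charset CP437 = Charset.forName("Cp437");

    // Takes a name that is assumed to have been decoded as Cp437 and returns
    // its UTF-8 interpretation when the underlying bytes are valid UTF-8.
    // Otherwise the name is returned unchanged.
    static String repairName(final String cp437DecodedName) {
        final byte[] raw = cp437DecodedName.getBytes(CP437);

        // If the name contains characters Cp437 cannot encode, it was not
        // produced by a Cp437 decode, so leave it alone.
        if (!new String(raw, CP437).equals(cp437DecodedName)) {
            return cp437DecodedName;
        }

        try {
            // REPORT makes the decoder throw on invalid input instead of
            // silently substituting replacement characters.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: keep the Cp437 interpretation.
            return cp437DecodedName;
        }
    }
}
```

A caller could apply `NameRepair.repairName(entry.getFileName())` to names read with the default charset, or run the same validity check on a sample of names to decide whether passing StandardCharsets.UTF_8 to ZipFile.setCharset is appropriate for a given archive.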
