Incorrect decoding of file names with some zip files #403

jimfcarroll · 2022-01-29T12:05:51Z

I have a zip file whose file names zip4j does not decode correctly, while all of the command-line utilities and Apache's VFS2 do. Judging from the contents of the zip file, it was made on a Mac. Here is the output of test code I wrote. The first two lines come from Apache's ZipArchiveInputStream (see the test code below). The last two lines come from zip4j's ZipInputStream.

Aquinas, St. Thomas/Primary/Français/Aquin - De l'éternite du monde.doc
9:Français
Aquinas, St. Thomas/Primary/Franc╠ºais/Aquin - De l'e╠üternite du monde.doc
10:Franc╠ºais
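
The two outputs above are consistent with the name being stored as decomposed (NFD) UTF-8 and then decoded as CP437. The sketch below reproduces the 9 versus 10 character counts under that assumption. The class and method names are made up for illustration, and it assumes the JDK in use provides the IBM437 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class MojibakeDemo {
    // Encodes the name as decomposed (NFD) UTF-8, then decodes those
    // bytes with a different charset, as a mismatched reader would.
    static String misdecode(String name, Charset wrong) {
        String nfd = Normalizer.normalize(name, Normalizer.Form.NFD);
        byte[] raw = nfd.getBytes(StandardCharsets.UTF_8);
        return new String(raw, wrong);
    }

    public static void main(String[] args) {
        // Assumption: CP437 is the charset used by the failing decode.
        Charset cp437 = Charset.forName("IBM437");
        String original = "Fran\u00e7ais"; // "Français", written with an escape
        String nfd = Normalizer.normalize(original, Normalizer.Form.NFD);
        String garbled = misdecode(original, cp437);

        // NFD splits the c-cedilla into 'c' plus a combining cedilla: 9 chars.
        System.out.println(nfd.length() + ":" + nfd);
        // The combining cedilla is two UTF-8 bytes, hence two CP437 chars: 10 chars.
        System.out.println(garbled.length() + ":" + garbled);
    }
}
```

The extra character in the second count comes from the combining cedilla, which occupies two bytes in UTF-8 and so becomes two separate characters in a single-byte charset.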

The file is 700MB, so I can't really attach it here. I tried to create a smaller file by unzipping the archive and rezipping only that file, but zip4j worked fine on the rezipped file.

I saw issue #304, but I'm not sure it is related. In any case, I don't have control over the zip files I'm unzipping, so I don't know the encoding beforehand.

For completeness, the following is my test code. It prints out the name of the entry at index 184 (counting from zero) in the zip file:

import java.io.File;
import java.io.FileInputStream;

import org.apache.commons.compress.archivers.ArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipArchiveInputStream;

import net.lingala.zip4j.io.inputstream.ZipInputStream;
import net.lingala.zip4j.model.LocalFileHeader;

public class TestEncoding {
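    // Returns the path segment that starts at "Fran" (the "Français" directory),
    // so that the decoded length of just that segment can be compared.
    public static String pickOutDirName(final String name) {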
        final int start = name.indexOf("Fran");
        final int end = name.substring(start).indexOf('/') + start;
        return name.substring(start, end);
    }

    public static void main(final String[] args) throws Exception {
        final String fullPath = "/path/to/zip/file.zip";
        final File file = new File(fullPath);

        // Pass 1: read the entry names with Apache Commons Compress.
        int entryCount = 0;
        try(ZipArchiveInputStream isa = new ZipArchiveInputStream(new FileInputStream(file))) {
            for(ArchiveEntry entry = isa.getNextEntry(); entry != null; entry = isa.getNextEntry()) {
                if(entryCount == 184) {
                    final String name = entry.getName();
                    final String dirName = pickOutDirName(name);
                    System.out.println(name);
                    System.out.println(dirName.length() + ":" + dirName);
                    break;
                }
                entryCount++;
            }
        }

        // Pass 2: read the same entry with zip4j's ZipInputStream.
        entryCount = 0;
        try(ZipInputStream isz = new ZipInputStream(new FileInputStream(file))) {
            for(LocalFileHeader entry = isz.getNextEntry(); entry != null; entry = isz.getNextEntry()) {
                if(entryCount == 184) {
                    final String name = entry.getFileName();
                    final String dirName = pickOutDirName(name);
                    System.out.println(name);
                    System.out.println(dirName.length() + ":" + dirName);
                    break;
                }
                entryCount++;
            }
        }
    }
}

srikanth-lingala · 2022-03-08T10:30:13Z

An entry in a zip file has a flag in its header data that indicates whether the entry's file name is UTF-8 encoded. Any tool that creates a zip file has to set this flag if it uses UTF-8 to encode the file name. In this case, I think the tool that created the zip did not set the flag even though it used UTF-8 to encode the file name. The difference you see between zip4j and Apache Commons Compress is that, if the flag is not set, zip4j uses the zip specification's standard charset by default, whereas Commons Compress uses UTF-8 even if the flag is not set. Technically speaking, that is not as per the specification, but I tend to agree with the Commons Compress approach of using UTF-8 even if the flag is not set. I will change this in zip4j and include it in the next release.
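
As a rough illustration of the two behaviours described above, the sketch below picks a charset from the header flag and falls back to a configurable default. In the zip specification the flag is bit 11 of the general purpose bit flag, and the fallback charset is IBM Code Page 437. The class, the method names, and the sample bytes are hypothetical. This is not code from zip4j or Commons Compress.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class NameCharsetPolicy {
    // Bit 11 of the general purpose bit flag marks UTF-8 encoded names.
    static final int UTF8_FLAG = 1 << 11;

    // Uses UTF-8 when the flag is set, otherwise the caller's fallback.
    static Charset pick(int generalPurposeFlags, Charset fallback) {
        return (generalPurposeFlags & UTF8_FLAG) != 0
                ? StandardCharsets.UTF_8
                : fallback;
    }

    static String decode(byte[] rawName, int generalPurposeFlags, Charset fallback) {
        return new String(rawName, pick(generalPurposeFlags, fallback));
    }

    public static void main(String[] args) {
        Charset cp437 = Charset.forName("IBM437");
        // Hypothetical entry: a UTF-8 (NFD) name written WITHOUT the flag.
        byte[] raw = "Franc\u0327ais".getBytes(StandardCharsets.UTF_8);
        int flags = 0;

        // Spec-style fallback (CP437): the name comes out garbled.
        System.out.println(decode(raw, flags, cp437));
        // UTF-8 fallback: the name comes out as intended.
        System.out.println(decode(raw, flags, StandardCharsets.UTF_8));
    }
}
```

With the flag set, both fallbacks give the same result, so the choice of default only matters for archives whose writer left the flag clear.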

jimfcarroll · 2022-03-08T12:20:58Z

Thanks. Once this hits Maven Central I'll give it another try.

srikanth-lingala · 2022-03-28T15:14:47Z

Fixed in v2.10.0, released today.

srikanth-lingala · 2022-06-03T10:16:55Z

@jimfcarroll I am reverting the change I made here because it had side effects on zip files that use the zip standard charset, as reported in this issue. The change I made as part of this issue was to use UTF-8 by default in zip4j. However, and here I am contradicting my statement from an earlier comment in this issue, this is not as per the zip specification. The zip specification states that the zip standard charset is to be used if the UTF-8 flag is not set.

In your case, if you are sure your zip files use UTF-8 encoding, you can force zip4j to use UTF-8 with ZipFile.setCharset(StandardCharsets.UTF_8).
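
When the encoding is not known beforehand, one possible heuristic is to take a name that was already decoded with the zip standard charset, turn it back into its raw bytes, and keep a UTF-8 reading only if those bytes are strictly valid UTF-8. The sketch below assumes the original decode used CP437. The class and method names are hypothetical and not part of zip4j. Since some genuine CP437 names are also valid UTF-8 byte sequences, it can misfire.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class NameRecovery {
    // Assumption: the name was originally decoded with CP437.
    static final Charset CP437 = Charset.forName("IBM437");

    // Strict UTF-8 decode: returns null instead of substituting characters.
    static String strictUtf8(byte[] raw) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // Re-reads a CP437-decoded name as UTF-8 when that is safely possible.
    static String repair(String decodedName) {
        byte[] raw = decodedName.getBytes(CP437);
        // If the name does not survive a CP437 round trip, it was not
        // produced by a CP437 decode, so leave it untouched.
        if (!new String(raw, CP437).equals(decodedName)) {
            return decodedName;
        }
        String utf8 = strictUtf8(raw);
        return utf8 != null ? utf8 : decodedName;
    }

    public static void main(String[] args) {
        // The garbled directory name from the report, written with escapes.
        String garbled = "Franc\u2560\u00bais";
        String repaired = repair(garbled);
        System.out.println(garbled.length() + " -> " + repaired.length());
    }
}
```

Names that are plain ASCII or that are not valid UTF-8 once re-encoded pass through unchanged, so applying the helper to every entry name only alters the ones that look like mis-decoded UTF-8.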

srikanth-lingala self-assigned this Mar 8, 2022

srikanth-lingala added improvement in-progress labels Mar 8, 2022

srikanth-lingala added a commit that referenced this issue Mar 8, 2022

#403 Use utf-8 by default when reading zip file names

d80df16

srikanth-lingala added a commit that referenced this issue Mar 8, 2022

#403 Fix test

70b44a9

srikanth-lingala added resolved and removed in-progress labels Mar 8, 2022

srikanth-lingala closed this as completed Mar 28, 2022

KenobiTom mentioned this issue Jun 3, 2022

Incorrect decoding of file names with some zip files #432

Closed
