addoption "charset" to decode filename #831

nenge123 · 2022-06-08T18:29:55Z

Add a "charset" option to support double-byte encodings.
This fixes Chinese file names being displayed as garbled text.

    options = utils.extend(options || {}, {
        base64: false,
        checkCRC32: false,
        optimizedBinaryString: false,
        createFolders: false,
        decodeFileName: utf8.utf8decode,
        charset:'gbk'
    });
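For comparison, a similar result is already reachable through the existing `decodeFileName` load option, which receives the raw file name bytes and returns a string. The following is a minimal sketch, assuming the runtime's `TextDecoder` supports the requested label (some minimal builds ship fewer encodings than browsers do). `makeFileNameDecoder` is a hypothetical helper name, not part of JSZip.

```javascript
// Hypothetical helper: builds a function that could be passed as JSZip's
// `decodeFileName` load option for a given legacy charset.
function makeFileNameDecoder(charset) {
  // TextDecoder throws a RangeError if the label is unknown to the runtime.
  var decoder = new TextDecoder(charset);
  return function (bytes) {
    // Accept a plain array of byte values as well as a Uint8Array.
    var u8 = bytes instanceof Uint8Array ? bytes : Uint8Array.from(bytes);
    return decoder.decode(u8);
  };
}

// Usage with JSZip (not executed here):
//   JSZip.loadAsync(data, { decodeFileName: makeFileNameDecoder("gbk") });
var decodeGbk = makeFileNameDecoder("gbk");
// Bytes of a GBK-encoded file name: two double-byte characters plus ".txt".
console.log(decodeGbk([0xd6, 0xd0, 0xce, 0xc4, 0x2e, 0x74, 0x78, 0x74]));
```

Unlike the proposed `charset` option, this applies one fixed encoding and does not try to detect UTF-8 names.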

nenge123 · 2022-06-08T18:31:12Z

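      decode:function(u8,i){
        // u8: Uint8Array of raw bytes; i: index to start scanning from (defaults to 0).
        i = i||0;
        var charset = this.loadOptions.charset;
        if(charset&&charset!='utf8'){
          // Find the first non-ASCII byte and check whether it starts a well-formed UTF-8 sequence.
          for(;i<u8.byteLength;i++){
            if(u8[i]>127){
              // not ASCII
              var isUtf8 = false;
              // Number of leading 1 bits = sequence length announced by the lead byte.
              var len = u8[i].toString(2).split('0')[0].length;
              // Count the continuation bytes (10xxxxxx) that follow the lead byte.
              var k=0;
              for(var j=1;j<len;j++){
                if(u8[i+j]>>6==2){
                  k+=1;
                }
              }
              // Well formed: every expected byte is a continuation byte and the next byte is not.
              if(k>0&&k==j-1&&u8[i+j]>>6!=2){
                if(k==1&&charset=='gbk'){
                  // A two-byte sequence is ambiguous with GBK, so it is treated as GBK.
                  // Continuing the scan is disabled because some GBK names then fail:
                  //return this.decode(u8,j);
                }else{
                  isUtf8 = true;
                }
              }
              if(isUtf8===false)return new TextDecoder(charset).decode(u8);
              break;
            }
          }
        }
        // No charset given, pure ASCII, or the first multi-byte sequence looks like UTF-8.
        return new TextDecoder().decode(u8);
      },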
      handleUTF8: function() {
          var charset = this.loadOptions.charset,
              utf8decode = utf8.utf8decode,
              decode = this.loadOptions.decodeFileName||utf8decode;
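          // typeof checks avoid a ReferenceError where `window` is not defined (e.g. Node.js).
          if(charset&&typeof TextDecoder!=='undefined'&&typeof Uint8Array!=='undefined'){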
            this.fileNameStr = this.decode(this.fileName);
            this.fileCommentStr = this.decode(this.fileComment);
          }else if(this.useUTF8()){
            this.fileNameStr = utf8decode(this.fileName);
            this.fileCommentStr = utf8decode(this.fileComment);
          }else{
            var decodeParamType = support.uint8array ? "uint8array" : "array";
            var upath = this.findExtraFieldUnicodePath();
            if (upath !== null) {
                this.fileNameStr = upath;
            } else {
                // ASCII text or unsupported code page
                var fileNameByteArray =  utils.transformTo(decodeParamType, this.fileName);
                this.fileNameStr = decode(fileNameByteArray);
            }
            var ucomment = this.findExtraFieldUnicodeComment();
            if (ucomment !== null) {
                this.fileCommentStr = ucomment;
            } else {
                // ASCII text or unsupported code page
                var commentByteArray =  utils.transformTo(decodeParamType, this.fileComment);
                this.fileCommentStr = decode(commentByteArray);
            }
          }
      },
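The expression `u8[i].toString(2).split('0')[0].length` in `decode` counts the leading 1 bits of a byte, which in UTF-8 is the total length of the sequence that the lead byte announces. The same classification can be sketched with bit masks. `utf8SequenceLength` and `isContinuationByte` are hypothetical names used only for illustration, and the sketch does not reject overlong lead bytes such as 0xC0.

```javascript
// Returns the UTF-8 sequence length announced by a lead byte:
// 1 for ASCII, 2 to 4 for multi-byte lead bytes, 0 if the byte cannot start a sequence.
function utf8SequenceLength(byte) {
  if (byte < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((byte & 0xe0) === 0xc0) return 2; // 110xxxxx
  if ((byte & 0xf0) === 0xe0) return 3; // 1110xxxx
  if ((byte & 0xf8) === 0xf0) return 4; // 11110xxx
  return 0;                             // 10xxxxxx continuation byte, or 11111xxx
}

// True when a byte has the continuation form 10xxxxxx,
// the same test as `byte >> 6 == 2` in the code above.
function isContinuationByte(byte) {
  return (byte & 0xc0) === 0x80;
}

console.log([0x41, 0xc3, 0xe4, 0xf0, 0x80].map(utf8SequenceLength));
```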

let the "Options" support "charset" decode filename

Stuk · 2022-06-23T22:05:01Z

Thanks for the PR! Does this only work for gbk encoding, or also others?

And a few thoughts/comments:

- Could you merge main and fix the linting errors?
- Could you add comments to the new code to explain what is going on?
- I think it might be best if utf8.js was renamed to maybe charset.js, and decode was moved there.
- Could you add some tests for this new code?

nenge123 · 2022-06-24T08:58:34Z

By coincidence, I zipped a list of files whose names were encoded in GBK and in UTF-8, and both were decoded successfully!
I have not tested other ANSI code pages, such as Japanese or Korean. Still, it is at least more convenient than the original UTF-8-only decoding, for example with emoji.
There may be one problem: handling ANSI double-byte sequences, because UTF-8 also has two-byte sequences. See `k==1&&charset=='gbk'` in `decode`: if gbk is specified, every two-byte sequence is decoded as GBK. The alternative is to keep looping until it is certain that the name is not UTF-8.
I think `async("text")` decodes faster this way, and UTF-8 files will not be garbled.
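One way to reduce the ambiguity described above is to validate the whole name as UTF-8 rather than only its first multi-byte sequence. The sketch below uses the `fatal` option of `TextDecoder`, which makes invalid UTF-8 throw a `TypeError` instead of producing replacement characters. `decodeWithFallback` is a hypothetical name, this is not what the PR implements, and it assumes the runtime supports the fallback charset label. A GBK name whose bytes all happen to form valid UTF-8 would still be misread, so the heuristic narrows the problem without removing it.

```javascript
// Try strict UTF-8 first; fall back to the given legacy charset if the bytes are not valid UTF-8.
function decodeWithFallback(u8, charset) {
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(u8);
  } catch (e) {
    // Invalid UTF-8: assume the name uses the caller-supplied charset.
    return new TextDecoder(charset).decode(u8);
  }
}

var utf8Bytes = new Uint8Array([0xe4, 0xb8, 0xad, 0xe6, 0x96, 0x87]); // U+4E2D U+6587 in UTF-8
var gbkBytes = new Uint8Array([0xd6, 0xd0, 0xce, 0xc4]);             // the same two characters in GBK

console.log(decodeWithFallback(utf8Bytes, "gbk") === decodeWithFallback(gbkBytes, "gbk"));
```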

let the "Options" support "charset"

ab34dfc

let the "Options" support "charset" decode filename

nenge123 changed the title ~~add option "charset" to support other language~~ addoption "charset" to decode filename Jun 9, 2022

nenge123 added 2 commits June 9, 2022 19:27

add TextDecoder decode

8092d27

example

47b1f91
