Problems when dealing with invalidly-encoded filenames #575

rossj · 2018-05-02T23:26:03Z

Operating System: Debian 9
Node.js version: 8.9.3
fs-extra version: 5.0.0

Hi there. I ran into some cases where remove() was unable to remove a directory due to filename encoding issues. I believe there are similar issues using empty, copy, and move operations (and their sync counterparts - basically anything that relies on fs.readdir / fs.readdirSync).

My issue arose when trying to fs.remove() some directories that were created from an unzip operation. During remove's / rimraf's tree walk, some of the returned directory entries seemed not to exist (although they did), causing the final unlink operation to fail (since the directory had not actually been emptied).

It seems that, in general, names on a file system are just byte sequences, which are not guaranteed to decode to valid strings. As a result, the bytes -> string -> bytes round trip that happens when listing and then operating on items in a directory using Node does not always produce the same file name that was read.
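
To illustrate the round trip described above, here is a minimal sketch using only Node's built-in Buffer. The byte sequence is an invented example name (not one from the affected unzip), chosen because a lone 0xE9 byte is valid Latin-1 but not valid UTF-8:

```javascript
// A file name as raw bytes: "fo" followed by 0xE9 ("é" in Latin-1),
// which is not a valid UTF-8 sequence on its own.
const original = Buffer.from([0x66, 0x6f, 0xe9]);

// bytes -> string: the invalid byte is decoded as U+FFFD.
const asString = original.toString('utf8');

// string -> bytes: U+FFFD is re-encoded as EF BF BD, not E9.
const roundTripped = Buffer.from(asString, 'utf8');

console.log(original);                      // <Buffer 66 6f e9>
console.log(roundTripped);                  // <Buffer 66 6f ef bf bd>
console.log(original.equals(roundTripped)); // false
```

The re-encoded name refers to a different (nonexistent) directory entry, which matches the "seemed not to exist" symptom.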

This encoding problem has been a known Node issue for a while, which is why an option was added to return Buffers from fs.readdir. My suggestion is to update the affected methods to use this Buffer option. I'm happy to work on a PR, but I wanted to at least get some feedback and discuss the issue before diving in.
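
As a rough sketch of what using the Buffer option could look like (this is not fs-extra's actual implementation), the walk below lists entries with `{ encoding: 'buffer' }` and passes Buffer paths straight back to fs, so names are never decoded. `joinBuffers` and `removeTreeSync` are hypothetical helper names, and the byte-level join assumes a "/" separator:

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: join a directory and an entry name as raw bytes,
// so the entry name is never decoded. Assumes a "/" path separator.
function joinBuffers(dir, name) {
  return Buffer.concat([Buffer.from(dir), Buffer.from('/'), name]);
}

// Hypothetical helper: recursively remove `dir` (string or Buffer),
// reading entry names as Buffers instead of strings.
function removeTreeSync(dir) {
  const names = fs.readdirSync(dir, { encoding: 'buffer' });
  for (const name of names) {
    const child = joinBuffers(dir, name);
    if (fs.lstatSync(child).isDirectory()) {
      removeTreeSync(child);
    } else {
      fs.unlinkSync(child);
    }
  }
  fs.rmdirSync(dir);
}

// Usage: build a small tree in a temp directory, then remove it.
const root = fs.mkdtempSync(path.join(os.tmpdir(), 'buffer-remove-'));
fs.mkdirSync(path.join(root, 'sub'));
fs.writeFileSync(path.join(root, 'sub', 'file.txt'), 'data');
removeTreeSync(root);
console.log(fs.existsSync(root)); // false
```

The usage example only uses ordinary names so that it runs anywhere; the point is that the same code path would work unchanged for names that are not valid UTF-8.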

Here are a couple of Node issues relating to the file name encoding problem:

nodejs/node-v0.x-archive#2387
nodejs/node#5616

Thanks!

RyanZim · 2018-05-02T23:57:28Z

My first impulse is PR welcome. I'm assuming this wouldn't affect the external API, but I'm wondering how we'd handle the filter for non-UTF8 filenames. Thoughts?

rossj · 2018-05-03T01:11:45Z

Yes, I was thinking it wouldn't affect the API, but I didn't know about the filter option, so thank you for the heads up.

I think it's possible for copy to read names with the Buffer option, and to convert these to strings before passing them to the filter callback, to keep the API the same, potentially with the new Buffer-variety forms of the names as additional arguments. This, of course, depends on us being able to reliably decode the Buffers into string names the same way that fs is doing it currently.
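
A minimal sketch of that idea, assuming the filter keeps receiving `src` and `dest` as strings and that the Buffer forms are simply appended as extra arguments. `filteredEntries` is a hypothetical helper, not part of fs-extra, and the extra filter arguments are only a proposal:

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: list the entries of srcDir that pass `filter`.
// Names are read as Buffers; decoded strings go to the filter first,
// with the raw Buffer paths as additional (proposed) arguments.
function filteredEntries(srcDir, destDir, filter) {
  const sep = Buffer.from('/'); // assumes a "/" separator
  const kept = [];
  for (const name of fs.readdirSync(srcDir, { encoding: 'buffer' })) {
    const srcBuf = Buffer.concat([Buffer.from(srcDir), sep, name]);
    const destBuf = Buffer.concat([Buffer.from(destDir), sep, name]);
    const src = srcBuf.toString('utf8');   // may contain U+FFFD
    const dest = destBuf.toString('utf8');
    if (filter(src, dest, srcBuf, destBuf)) {
      kept.push(srcBuf); // keep the raw bytes for the actual copy
    }
  }
  return kept;
}

// Usage: an existing string-based filter keeps working unchanged.
const srcDir = fs.mkdtempSync(path.join(os.tmpdir(), 'filter-src-'));
fs.writeFileSync(path.join(srcDir, 'keep.txt'), '');
fs.writeFileSync(path.join(srcDir, 'skip.log'), '');
const kept = filteredEntries(srcDir, '/tmp/dest', (src) => src.endsWith('.txt'));
console.log(kept.map((b) => b.toString('utf8')));
```

Existing filters that only look at the first two arguments would be unaffected, while new filters could inspect the raw bytes if they need to.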

From this it looks like Node simply uses utf-8 encoding by default. This seems strange since I think Windows stores file names in UTF-16 / UCS-2 encoding, but I just checked on Windows and the Buffers are indeed utf-8 encoded.

RyanZim · 2018-05-03T11:59:09Z

> I think it's possible for copy to read names with the Buffer option, and to convert these to strings before passing them to the filter callback, to keep the API the same.

That's my first thought too, but how do we actually filter non-UTF8 files?

rossj · 2018-05-03T14:05:31Z

Ah, I was thinking of not treating non-UTF-8 names specially and just sending whatever string we get from the UTF-8 conversion to the filter function. I'm pretty sure that Buffer.toString() will insert U+FFFD � for invalid UTF-8 sequences instead of failing. Continuing to send these potentially incorrect strings to the filter function is no worse than the current situation, and it allows for string-based filtering of all files (regardless of whether they are valid UTF-8) if the user only cares about the ASCII parts, e.g. return src.indexOf('thing') >= 0.
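
A quick check of that behavior, using an invented file name with one byte (0xFF) that can never appear in valid UTF-8:

```javascript
// Raw name bytes: "thing-" + 0xFF + ".txt"
const rawName = Buffer.concat([
  Buffer.from('thing-'),
  Buffer.from([0xff]),
  Buffer.from('.txt'),
]);

// Decoding does not throw; the invalid byte becomes U+FFFD.
const decoded = rawName.toString('utf8');
console.log(decoded === 'thing-\ufffd.txt'); // true

// An ASCII-only filter still matches the lossy string.
const filter = (src) => src.indexOf('thing') >= 0;
console.log(filter(decoded)); // true
```

The decoded string cannot be used to reopen the file, but it is good enough for filters that only inspect the ASCII portion of the name.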

bcoe · 2021-08-22T23:34:31Z

@rossj @RyanZim, bringing this issue back up, because we face the same problem with fs.cp() in Node.js.

I've been working on a port of Node.js' path methods that work on Buffers:

https://github.com/bcoe/path-buffer

I've made an effort to detect UTF-8 vs. UTF-16, so that the appropriate separator is added or removed by methods like join and dirname, but I'm not an expert at string encodings, so it would be good to have someone who's bumped into the issue confirm the logic is sound.

RyanZim added bug platform-linux labels May 2, 2018

RyanZim mentioned this issue Jul 20, 2018

Error copying files with special characters (accent). #605

Closed

This was referenced Aug 5, 2018

remove*() and empty*(): remove contents as Buffers to handle non-UTF-8 filenames on Linux #612

Closed

copy*() name-based isSrcSubdir check is insufficient, perhaps needs ino #613

Closed

rakeshtembhurne mentioned this issue Aug 13, 2019

Renovate cleanRepository fails when repo contains files with special characters renovatebot/renovate#3933

Closed
