Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Diff parsing fails when filenames contain unicode characters #418

Closed
cyclotron3k opened this issue Sep 11, 2019 · 19 comments
Closed

Diff parsing fails when filenames contain unicode characters #418

cyclotron3k opened this issue Sep 11, 2019 · 19 comments

Comments

@cyclotron3k
Copy link
Contributor

cyclotron3k commented Sep 11, 2019

Subject of the issue

Unicode characters in filenames breaks diffing

Your environment

  • git version 2.20.1
  • git (1.5.0)
  • ruby 2.6.4p104 (2019-08-28 revision 67798) [x86_64-linux]

Steps to reproduce

$ mkdir -p /tmp/test && cd /tmp/test
$ git init
$ printf '\xE2\x98\xA0' # print a skull and crossbones symbol
$ touch 'some_file'
$ git add .
$ git commit -m"init"
$ touch 'my_other_file_\xE2\x98\xA0'
$ git add .
$ git commit -m"commit"
git = Git.init '/tmp/test'
git.diff('@^').each { |x| ... }

Expected behaviour

The diff to be produced as usual

Actual behaviour

NoMethodError: undefined method `[]' for nil:NilClass

The first line of a diff is usually something like git --diff a/somefile b/somefile, but when the filename contains a unicode character, the filenames are escaped, (e.g. diff --git "a/my_other_file_\\xE2\\x98\\xA0" "b/my_other_file_\\xE2\\x98\\xA0") and this regex fails to detect the line.

@stale
Copy link

stale bot commented Nov 10, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Nov 10, 2019
@stale stale bot closed this as completed Nov 17, 2019
@jcouball
Copy link
Member

@cyclotron3k if you are still interested, can you see if this issue was fixed with the 1.6.0.pre1 version? I think this issue might be fixed by #405.

@jcouball jcouball reopened this Jan 21, 2020
@cyclotron3k
Copy link
Contributor Author

Unfortunately it doesn't seem to have helped:

irb(main):005:0> git.diff('@^').each { |x| x }
Traceback (most recent call last):
       10: from /usr/local/bin/irb:23:in `<main>'
        9: from /usr/local/bin/irb:23:in `load'
        8: from /usr/local/lib/ruby/gems/2.6.0/gems/irb-1.0.0/exe/irb:11:in `<top (required)>'
        7: from (irb):5
        6: from (irb):5:in `rescue in irb_binding'
        5: from /usr/local/bundle/gems/git-1.6.0.pre1/lib/git/diff.rb:68:in `each'
        4: from /usr/local/bundle/gems/git-1.6.0.pre1/lib/git/diff.rb:109:in `process_full'
        3: from /usr/local/bundle/gems/git-1.6.0.pre1/lib/git/diff.rb:130:in `process_full_diff'
        2: from /usr/local/bundle/gems/git-1.6.0.pre1/lib/git/diff.rb:130:in `each'
        1: from /usr/local/bundle/gems/git-1.6.0.pre1/lib/git/diff.rb:147:in `block in process_full_diff'
NoMethodError (undefined method `[]' for nil:NilClass)

@jcouball
Copy link
Member

jcouball commented Jan 27, 2020

@cyclotron3k, over the weekend I was able to take a look at this and found the problem. It doesn't look like a problem with encoding, but a bug in how the diff output is parsed.

I believe this patch will make it work:

diff --git a/lib/git/diff.rb b/lib/git/diff.rb
index fac0495..d31a79a 100644
--- a/lib/git/diff.rb
+++ b/lib/git/diff.rb
@@ -128,7 +128,7 @@ module Git
         final = {}
         current_file = nil
         @full_diff.split("\n").each do |line|
-          if m = /^diff --git a\/(.*?) b\/(.*?)/.match(line)
+          if m = %r{^diff --git a/(.+?) b/(.+?)$}.match(line)
             current_file = m[1]
             final[current_file] = defaults.merge({:patch => line, :path => current_file})
           else

However, there are a couple of things to figure out before a PR can be submitted:
(1) how will this be tested (I know how, just haven't turned my attention to it yet)
(2) I notice that the setting of core.quotePath changes how Git outputs the file names with non-ASCII characters. In my case, that git config was set to true which output the filenames in ASCII with Unicode escapes. This is not idea, but better than throwing an error like you are seeing now. I need to look a little deeper into this setting to figure out what is the best way to handle it.

@jcouball jcouball removed the wontfix label Jan 27, 2020
@cyclotron3k
Copy link
Contributor Author

cyclotron3k commented Jan 28, 2020

Take this with a pinch of salt (because I many not have applied the patch correctly), but I'm not sure if that regex alteration is going to work.

I tried in a ruby:2.7 docker container, and I still had the same problem. I think because when unicode characters are present, git formats the diff as: diff --git "a/some_file" "b/some_file", which the updated regex still chokes on.

I tried with core.quotePath set to true and false and it didn't seem to make a difference for me.

This may work a bit better: /\Adiff --git ("?)a\/(.+?)\1 \1b\/(.+?)\1\z/

@jcouball
Copy link
Member

Taking a step back, I have a problem with the reproduction recipe that you gave in the original post. Maybe this is a clue to why we are seeing different output.

I am on a Mac running MacOS 10.15.2, using zsh (also tried in bash).

The printf works as intended:

$ printf '\xE2\x98\xA0'
☠%
$ 

but the touch does not:

$ ls -al
total 0
drwxr-xr-x   4 couballj  wheel  128 Jan 28 08:27 .
drwxrwxrwt  16 root      wheel  512 Jan 28 08:24 ..
drwxr-xr-x  12 couballj  wheel  384 Jan 28 08:30 .git
-rw-r--r--   1 couballj  wheel    0 Jan 28 08:24 some_file
$ touch 'my_other_file_\xE2\x98\xA0'
$ ls -al
total 0
drwxr-xr-x   5 couballj  wheel  160 Jan 28 08:31 .
drwxrwxrwt  16 root      wheel  512 Jan 28 08:24 ..
drwxr-xr-x  12 couballj  wheel  384 Jan 28 08:31 .git
-rw-r--r--   1 couballj  wheel    0 Jan 28 08:31 my_other_file_\xE2\x98\xA0
-rw-r--r--   1 couballj  wheel    0 Jan 28 08:24 some_file
$

This works:

$ touch 'my_other_file_☠'
$ ls -al
total 0
drwxr-xr-x   5 couballj  wheel  160 Jan 28 08:31 .
drwxrwxrwt  16 root      wheel  512 Jan 28 08:24 ..
drwxr-xr-x  12 couballj  wheel  384 Jan 28 08:31 .git
-rw-r--r--   1 couballj  wheel    0 Jan 28 08:31 my_other_file_\xE2\x98\xA0
-rw-r--r--   1 couballj  wheel    0 Jan 28 08:32 my_other_file_☠
-rw-r--r--   1 couballj  wheel    0 Jan 28 08:24 some_file
$

@jcouball
Copy link
Member

On my computer, I get different output depending on the value of core.quotePath:

$ git config core.quotePath
false
$ git '--git-dir=/tmp/test/.git' '--work-tree=/tmp/test' diff '-p' '@^'
diff --git a/my_other_file_☠ b/my_other_file_☠
new file mode 100644
index 0000000..e69de29
$ git config core.quotePath true
$ git config core.quotePath
true
$ git '--git-dir=/tmp/test/.git' '--work-tree=/tmp/test' diff '-p' '@^'
diff --git "a/my_other_file_\342\230\240" "b/my_other_file_\342\230\240"
new file mode 100644
index 0000000..e69de29
$ 

@jcouball
Copy link
Member

jcouball commented Jan 28, 2020

All that said, I think your proposed regex works for both cases (see the irb transcript below). I'll prepare a patch based on this regex.

$ irb
2.6.5 :001 > regex = %r{\Adiff --git ("?)a/(.+?)\1 \1b/(.+?)\1\z}
 => /\Adiff --git ("?)a\/(.+?)\1 \1b\/(.+?)\1\z/
2.6.5 :002 > example1 = 'diff --git a/my_other_file_☠ b/my_other_file_☠'
 => "diff --git a/my_other_file_☠ b/my_other_file_☠"
2.6.5 :003 > pp regex.match(example1)
#<MatchData
 "diff --git a/my_other_file_☠ b/my_other_file_☠"
 1:""
 2:"my_other_file_☠"
 3:"my_other_file_☠">
 => #<MatchData "diff --git a/my_other_file_☠ b/my_other_file_☠" 1:"" 2:"my_other_file_☠" 3:"my_other_file_☠">
2.6.5 :004 > example2 = 'diff --git "a/my_other_file_\\342\\230\\240" "b/my_other_file_\\342\\230\\240"'
 => "diff --git \"a/my_other_file_\\342\\230\\240\" \"b/my_other_file_\\342\\230\\240\""
2.6.5 :005 > pp regex.match(example2)
#<MatchData
 "diff --git \"a/my_other_file_\\342\\230\\240\" \"b/my_other_file_\\342\\230\\240\""
 1:"\""
 2:"my_other_file_\\342\\230\\240"
 3:"my_other_file_\\342\\230\\240">
 => #<MatchData "diff --git \"a/my_other_file_\\342\\230\\240\" \"b/my_other_file_\\342\\230\\240\"" 1:"\"" 2:"my_other_file_\\342\\230\\240" 3:"my_other_file_\\342\\230\\240">
2.6.5 :006 >

@cyclotron3k
Copy link
Contributor Author

Interesting. So your testing has revealed a few things...

My test-case was a bit faulty. That touch command wasn't working on my computer either. It was just creating a file named my_other_file_\xE2\x98\xA0, similar to your testing.

But it didn't matter because the problem is not git escaping unicode characters, it's git automatically escaping filenames (and paths) that include pretty much any non-alphanumeric character, including those backslashes.

I tried the core.quotePath setting again, but it seems to have no effect in my version of git: git version 2.20.1 on Debian GNU/Linux 10

@jcouball
Copy link
Member

My git version is: git version 2.21.0 (Apple Git-122.2)

One of the things I have come to realize is that this gem is at the mercy of each individual user's git configuration. It would be nice to normalize the configuration used each time this gem calls the git command. PR #427 is another issue where a normalized git configuration (in this case, -c color.ui=never) could have avoided the issue.

@jcouball
Copy link
Member

jcouball commented Jan 29, 2020

@cyclotron3k what is are your values for the LANG and LC_* environment variables when git is run?

https://pubs.opengroup.org/onlinepubs/7908799/xbd/envvar.html

Here is what I have:

$ set | grep 'LANG\|LC_'
LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LC_TERMINAL=iTerm2
LC_TERMINAL_VERSION=3.3.7
$ 

@cyclotron3k
Copy link
Contributor Author

cyclotron3k commented Jan 30, 2020

LANG=C.UTF-8
LC_ALL=C.UTF-8
LC_CTYPE=C.UTF-8

Edit: I'm doing all my testing using the official ruby:2.7 docker image, which does not have the above environment variables set by default. I add them manually each time.

@jcouball
Copy link
Member

I suspect that our testing of the effect of core.quotePath was wrong.

Do you see a difference when you run these commands:

git -c core.quotePath=true '--git-dir=/tmp/test/.git' '--work-tree=/tmp/test' diff '-p' '@^'

vs.

git -c core.quotePath=false '--git-dir=/tmp/test/.git' '--work-tree=/tmp/test' diff '-p' '@^'

@cyclotron3k
Copy link
Contributor Author

No difference 🤷‍♂️

root@6f4629917782:/# export LANG=C.UTF-8
root@6f4629917782:/# export LC_ALL=C.UTF-8
root@6f4629917782:/# export LC_CTYPE=C.UTF-8
root@6f4629917782:/# 
root@6f4629917782:/# mkdir -p /tmp/test && cd /tmp/test
root@6f4629917782:/tmp/test# git init
Initialized empty Git repository in /tmp/test/.git/
root@6f4629917782:/tmp/test# printf '\xE2\x98\xA0' # print a skull and crossbones symbol
☠root@6f4629917782:/tmp/test# touch 'some_file'
root@6f4629917782:/tmp/test# git add .
root@6f4629917782:/tmp/test# git config --global user.email "you@example.com"
root@6f4629917782:/tmp/test# git config --global user.name "Your Name"
root@6f4629917782:/tmp/test# git commit -m"init"
[master (root-commit) a985439] init
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 some_file
root@6f4629917782:/tmp/test# touch 'my_other_file_\xE2\x98\xA0'
root@6f4629917782:/tmp/test# git add .
root@6f4629917782:/tmp/test# git commit -m"commit"
[master 3d2d86b] commit
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 "my_other_file_\\xE2\\x98\\xA0"
root@6f4629917782:/tmp/test# 
root@6f4629917782:/tmp/test# git -c core.quotePath=true '--git-dir=/tmp/test/.git' '--work-tree=/tmp/test' diff '-p' '@^'
diff --git "a/my_other_file_\\xE2\\x98\\xA0" "b/my_other_file_\\xE2\\x98\\xA0"
new file mode 100644
index 0000000..e69de29
root@6f4629917782:/tmp/test# git -c core.quotePath=false '--git-dir=/tmp/test/.git' '--work-tree=/tmp/test' diff '-p' '@^'
diff --git "a/my_other_file_\\xE2\\x98\\xA0" "b/my_other_file_\\xE2\\x98\\xA0"
new file mode 100644
index 0000000..e69de29
root@6f4629917782:/tmp/test# git --version
git version 2.20.1
root@6f4629917782:/tmp/test# cat /etc/issue
Debian GNU/Linux 10 \n \l

@jcouball
Copy link
Member

@cyclotron3k I created a PR to fix this issue in #440. Can you please checkout this code and run bundle install and then bundle exec rake test in your environment?

I suspect that, although the test works on my machine and on the Travis CI build, it might not work in your environment. If it does, then we are golden.

@cyclotron3k
Copy link
Contributor Author

When LC_* and LANG envars are set to UTF-8, all tests pass (albeit with a bunch of 2.7 related warnings), and when UTF-8 is not set, I get one failure:

================================================================================================================================================================
     31:       diff.each { |diff_file| diff_paths << diff_file.path }
     32:     end
     33:     assert_include(diff_paths, 'asdf\"asdf')
  => 34:     assert_include(diff_paths, 'my_other_file_☠')
     35:   end
     36: end
/srv/ruby-git/tests/units/test_diff_with_quoted_path.rb:34:in `test_diff_with_quoted_path'
Failure: test_diff_with_quoted_path(TestDiffWithQuotedPath):
  <["asdf\\\"asdf", "my_other_file_\u00E2\u0098\u00A0"]> was expected to include
  <"my_other_file_\u2620">.
================================================================================================================================================================

@cyclotron3k
Copy link
Contributor Author

I also tested in 2.6, and I got a failure the first time it ran (I think because user.email and user.email were not set). But then there were no failures on subsequent runs.

I noticed the following had been set globally

user.email=git@example.com
user.name=GitExample

@stale
Copy link

stale bot commented Mar 31, 2020

A friendly reminder that this issue had no activity for 60 days.

@jcouball
Copy link
Member

I believe this issue was addressed with #504

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants