
gh-96268: Fix loading invalid UTF-8 #96270

Merged (11 commits) on Sep 7, 2022
13 changes: 10 additions & 3 deletions Lib/test/test_source_encoding.py
@@ -236,16 +236,23 @@ def test_invalid_utf8(self):
         # test it is to write actual files to disk.

         # Each example is put inside a string at the top of the file so
-        # it's an otherwise valid Python source file.
-        template = b'"%s"\n'
+        # it's an otherwise valid Python source file. Put some newlines
+        # beforehand so we can assert that the error is reported on the
+        # correct line.
+        template = b'\n\n\n"%s"\n'

         fn = TESTFN
         self.addCleanup(unlink, fn)

         def check(content):
             with open(fn, 'wb') as fp:
                 fp.write(template % content)
-            script_helper.assert_python_failure(fn)
+            rc, stdout, stderr = script_helper.assert_python_failure(fn)
+            # We want to assert that the python subprocess failed gracefully,
+            # not via a signal.
+            self.assertGreaterEqual(rc, 1)
+            self.assertIn(b"Non-UTF-8 code starting with", stderr)
+            self.assertIn(b"on line 5", stderr)
Member:

Am I miscounting here? The string in the template appears to me to be on the 4th line.

Contributor Author:

Good catch. Indeed you are correct.

The generation of the error message adds 1 to tok->lineno. I don't know if that's correct or not, but it seems like other error messages that report tok->lineno don't do that.
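To make the line counting concrete (a hypothetical sanity check, not part of the PR), reconstructing the file the test writes shows the string literal does land on the 4th line:

```python
# Reconstruct the file the test writes: three newlines, then the quoted payload.
template = b'\n\n\n"%s"\n'
content = template % b'payload'

lines = content.split(b'\n')
# Lines 1-3 are empty; the string literal is the 4th line.
assert lines[3] == b'"payload"'
```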

Member:

Hm. There's a comment in tokenizer.c right above the PyErr_Format() call explaining why 1 has to be added. But I wonder if your change disturbed this logic? I don't understand how, though. It could also be that the comment was wrong. Maybe @pablogsal understands this logic?

Member:

IIRC this is because the parser (or at least some parts of it) emits line numbers that start at 0, but the rest of the VM needs line numbers starting at 1 to display exceptions. But it has been some time since I had to deal with this, so some details could be missing.
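For reference (an illustrative snippet, not from the PR), the line numbers that reach users through exceptions are 1-based, which is why a 0-based internal counter would need the +1:

```python
# SyntaxError.lineno is 1-based: an error on the very first source line
# is reported with lineno == 1, not 0.
try:
    compile('def broken(:\n', '<test>', 'exec')
except SyntaxError as exc:
    assert exc.lineno == 1
```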

Member:

The mystery is that in the updated test, an error in a string on line 4 is reported at line 5. Unless I misread the test.

pablogsal (Member), Aug 31, 2022:

Hummmm, that may be pointing to something breaking. I bet that this is pointing past the file. Without looking in detail I don't know exactly what could be going on with this specific test. Unfortunately it may be that there was some implicit contract on the reporting that these changes are breaking.

Member:

Ah, I think there is some kind of bug here. These are the errors in different versions:

❯ python3.8 lel.py
  File "lel.py", line 4
SyntaxError: Non-UTF-8 code starting with '\xc0' in file lel.py on line 4, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

❯ python3.9 lel.py
SyntaxError: Non-UTF-8 code starting with '\xc0' in file /Users/pgalindo3/lel.py on line 4, but no encoding declared; see https://python.org/dev/peps/pep-0263/ for details

❯ python3.10 lel.py
SyntaxError: Non-UTF-8 code starting with '\xc0' in file /Users/pgalindo3/lel.py on line 5, but no encoding declared; see https://python.org/dev/peps/pep-0263/ for details

So something changed in 3.10 around this, it seems.
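The comparison above can be reproduced without a hand-made file; this is a hypothetical standalone sketch (the temporary-file name replaces lel.py), deliberately not asserting the reported line number since that is exactly what differs across versions:

```python
import os
import subprocess
import sys
import tempfile

# Write a script whose 4th line contains 0xC0, a byte that can never
# start a valid UTF-8 sequence.
with tempfile.NamedTemporaryFile('wb', suffix='.py', delete=False) as fp:
    fp.write(b'\n\n\n"\xc0"\n')
    path = fp.name

try:
    proc = subprocess.run([sys.executable, path], capture_output=True)
    # The interpreter should fail gracefully with the decoding error,
    # not crash via a signal.
    assert proc.returncode != 0
    assert b'Non-UTF-8 code starting with' in proc.stderr
finally:
    os.unlink(path)
```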

pablogsal (Member), Aug 31, 2022:

I think that line is just wrong: tok->lineno is already the correct line for the exception. I made this change:

diff --git a/Parser/tokenizer.c b/Parser/tokenizer.c
index f2606f17d1..924c97ba8a 100644
--- a/Parser/tokenizer.c
+++ b/Parser/tokenizer.c
@@ -535,7 +535,7 @@ ensure_utf8(char *line, struct tok_state *tok)
                      "in file %U on line %i, "
                      "but no encoding declared; "
                      "see https://peps.python.org/pep-0263/ for details",
-                     badchar, tok->filename, tok->lineno + 1);
+                     badchar, tok->filename, tok->lineno);
         return 0;
     }
     return 1;

And the full (current) test suite passes without errors:

== Tests result: SUCCESS ==

407 tests OK.

29 tests skipped:
    test_curses test_dbm_gnu test_devpoll test_epoll test_gdb
    test_idle test_ioctl test_launcher test_msilib
    test_multiprocessing_fork test_ossaudiodev test_perf_profiler
    test_smtpnet test_socketserver test_spwd test_startfile test_tcl
    test_tix test_tkinter test_ttk test_ttk_textonly test_turtle
    test_urllib2net test_urllibnet test_winconsoleio test_winreg
    test_winsound test_xmlrpc_net test_zipfile64

Total duration: 6 min 1 sec

Member:

@mdboom do you want to include the fix in this PR?

Contributor Author:

@pablogsal: Yes, it makes sense to just fix this in this PR.


             # continuation bytes in a sequence of 2, 3, or 4 bytes
             continuation_bytes = [bytes([x]) for x in range(0x80, 0xC0)]
@@ -0,0 +1,2 @@
+Loading a file with invalid UTF-8 will now report the broken character at
+the correct location.
30 changes: 21 additions & 9 deletions Parser/tokenizer.c
@@ -489,25 +489,37 @@ static void fp_ungetc(int c, struct tok_state *tok) {

 /* Check whether the characters at s start a valid
    UTF-8 sequence. Return the number of characters forming
-   the sequence if yes, 0 if not. */
+   the sequence if yes, 0 if not. The special cases match
+   those in stringlib/codecs.h:decode_utf8.
+*/
 static int valid_utf8(const unsigned char* s)
 {
     int expected = 0;
     int length;
-    if (*s < 0x80)
+    if (*s < 0x80) {
         /* single-byte code */
         return 1;
-    if (*s < 0xc0)
-        /* following byte */
-        return 0;
-    if (*s < 0xE0)
+    } else if (*s < 0xE0) {
+        if (*s < 0xC2) {
+            return 0;
+        }
         expected = 1;
-    else if (*s < 0xF0)
+    } else if (*s < 0xF0) {
+        if (*s == 0xE0 && *(s + 1) < 0xA0) {
+            return 0;
+        } else if (*s == 0xED && *(s + 1) >= 0xA0) {
+            return 0;
+        }
         expected = 2;
-    else if (*s < 0xF8)
+    } else if (*s < 0xF5) {
+        if (*(s + 1) < 0x90 ? *s == 0xF0 : *s == 0xF4) {
+            return 0;
+        }
         expected = 3;
-    else
+    } else {
+        /* invalid start byte */
         return 0;
+    }
     length = expected + 1;
     for (; expected; expected--)
         if (s[expected] < 0x80 || s[expected] >= 0xC0)
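The special cases added above reject byte sequences whose start byte looks plausible but which a strict UTF-8 decoder refuses: overlong encodings, UTF-16 surrogates, and code points beyond U+10FFFF. A hypothetical Python illustration (not part of the PR) of the sequences each new check targets:

```python
# Each sequence below is rejected by Python's strict UTF-8 decoder,
# mirroring the corresponding check in valid_utf8:
bad = [
    b'\xc0\xaf',          # *s < 0xC2: overlong 2-byte encoding
    b'\xe0\x80\xaf',      # 0xE0 followed by < 0xA0: overlong 3-byte encoding
    b'\xed\xa0\x80',      # 0xED followed by >= 0xA0: UTF-16 surrogate
    b'\xf0\x80\x80\x80',  # 0xF0 followed by < 0x90: overlong 4-byte encoding
    b'\xf4\x90\x80\x80',  # 0xF4 followed by >= 0x90: beyond U+10FFFF
]
for seq in bad:
    try:
        seq.decode('utf-8')
    except UnicodeDecodeError:
        pass
    else:
        raise AssertionError(f'{seq!r} unexpectedly decoded')
```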