
gh-96268: Fix loading invalid UTF-8 #96270

Merged: 11 commits into python:main, Sep 7, 2022

Conversation

@mdboom (Contributor) commented Aug 25, 2022:

This makes tokenizer.c:valid_utf8 match stringlib/codecs.h:decode_utf8.

This also fixes the related test so it will always detect the expected failure
and error message.

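For reference, the contract being matched can be sketched in Python. This is a hypothetical sketch, not CPython's actual C code (the function name is mine), and it omits the overlong- and surrogate-sequence checks the real decoder performs beyond rejecting the always-invalid lead bytes 0xC0/0xC1: each lead byte fixes the length of its sequence, and every following byte must be a continuation byte in 0x80-0xBF.

```python
def valid_utf8_sketch(data: bytes) -> bool:
    """Rough approximation of the checks a UTF-8 validator must share
    with the decoder: lead byte determines sequence length, and all
    trailing bytes must be continuation bytes (0x80-0xBF)."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                 # ASCII
            length = 1
        elif 0xC2 <= b <= 0xDF:      # 2-byte sequence; 0xC0/0xC1 are
            length = 2               # always-overlong leads, rejected below
        elif 0xE0 <= b <= 0xEF:      # 3-byte sequence
            length = 3
        elif 0xF0 <= b <= 0xF4:      # 4-byte sequence
            length = 4
        else:                        # stray continuation byte or invalid lead
            return False
        if i + length > n:           # sequence truncated at end of input
            return False
        if any(not 0x80 <= c <= 0xBF for c in data[i + 1:i + length]):
            return False
        i += length
    return True
```

A validator written this way agrees with the decoder on the byte from this PR's error message: a lone `0xC0` is rejected outright rather than being accepted and then failing later in decoding.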
@mdboom mdboom added 🔨 test-with-buildbots Test PR w/ buildbots; report in status section topic-unicode needs backport to 3.11 only security fixes and removed awaiting review labels Aug 25, 2022
@bedevere-bot:

🤖 New build scheduled with the buildbot fleet by @mdboom for commit 407eef7 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Aug 25, 2022
Resolved review threads: Lib/test/test_source_encoding.py (two, both outdated); Parser/tokenizer.c (one)
@@ -14,11 +14,11 @@ class MiscSourceEncodingTest(unittest.TestCase):

     def test_pep263(self):
         self.assertEqual(
-            "�����".encode("utf-8"),
+            "ðÉÔÏÎ".encode("utf-8"),
@mdboom (Contributor, Author):

Sorry, these were changed unintentionally by my editor. Going to revert...

@mdboom (Contributor, Author):

Ah, actually these were added by GitHub's generated commit 6d43cc, which accepted @ezio-melotti's suggestion. Seems like a bug in GitHub, which isn't surprising, given that this file is not valid UTF-8. I'll clean this up by hand.

@gvanrossum (Member) left a comment:

Got me nerd-sniped. :-)

# not via a signal.
self.assertGreaterEqual(rc, 1)
self.assertIn(b"Non-UTF-8 code starting with", stderr)
self.assertIn(b"on line 5", stderr)
Member:

Am I miscounting here? The string in the template appears to me to be on the 4th line.

@mdboom (Contributor, Author):

Good catch. Indeed you are correct.

The generation of the error message adds 1 to tok->lineno. I don't know if that's correct or not, but it seems like other error messages that report tok->lineno don't do that.

Member:

Hm. There's a comment in tokenizer.c right above the PyErr_Format() call explaining why 1 has to be added. But I wonder if your change disturbed this logic? I don't understand how, though. It could also be that the comment was wrong. Maybe @pablogsal understands this logic?

Member:

IIRC this is because the parser (or at least some parts of it) emits line numbers that start with 0 but the rest of the VM needs line numbers starting at 1 to display exceptions. But there has been some time since I had to deal with this so some details could be missing.

Member:

The mystery is that in the updated test, an error in a string on line 4 is reported at line 5. Unless I misread the test.

@pablogsal (Member) commented Aug 31, 2022:

Hummmm, that may be pointing to something breaking. I bet that this is pointing past the file. Without looking in detail I don't know exactly what could be going on with this specific test. Unfortunately it may be that there was some implicit contract on the reporting that these changes are breaking.

Member:

Ah, I think there is some kind of bug here. These are the errors in different versions:

❯ python3.8 lel.py
  File "lel.py", line 4
SyntaxError: Non-UTF-8 code starting with '\xc0' in file lel.py on line 4, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

❯ python3.9 lel.py
SyntaxError: Non-UTF-8 code starting with '\xc0' in file /Users/pgalindo3/lel.py on line 4, but no encoding declared; see https://python.org/dev/peps/pep-0263/ for details

❯ python3.10 lel.py
SyntaxError: Non-UTF-8 code starting with '\xc0' in file /Users/pgalindo3/lel.py on line 5, but no encoding declared; see https://python.org/dev/peps/pep-0263/ for details

So something changed in 3.10 around this, it seems.
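The regression above is easy to reproduce without touching the test suite. The following is a sketch (the helper name is mine): write raw bytes containing 0xC0, which can never appear in well-formed UTF-8, to a temporary .py file with no coding declaration, and run the interpreter on it. The line number in the message depends on whether the interpreter carries this fix (3.10 reports one line too high), so only the shape of the message is checked.

```python
import os
import subprocess
import sys
import tempfile

def report_bad_utf8(source_bytes):
    """Run the current interpreter on raw source bytes and return
    (returncode, stderr) so the SyntaxError message can be inspected."""
    fd, path = tempfile.mkstemp(suffix=".py")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(source_bytes)
        proc = subprocess.run([sys.executable, path], capture_output=True)
        return proc.returncode, proc.stderr
    finally:
        os.unlink(path)

# The bad byte sits on line 2; nothing is executed because the file
# fails to decode before compilation.
rc, err = report_bad_utf8(b"print('ok')\nbad = '\xc0'\n")
```

With the fix from this PR applied, the "on line N" in `err` matches the physical line of the bad byte; on 3.10 it is off by one.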

@pablogsal (Member) commented Aug 31, 2022:

I think that +1 is just wrong: the line number the tokenizer produces is already correct for the exception. I made this change:

diff --git a/Parser/tokenizer.c b/Parser/tokenizer.c
index f2606f17d1..924c97ba8a 100644
--- a/Parser/tokenizer.c
+++ b/Parser/tokenizer.c
@@ -535,7 +535,7 @@ ensure_utf8(char *line, struct tok_state *tok)
                      "in file %U on line %i, "
                      "but no encoding declared; "
                      "see https://peps.python.org/pep-0263/ for details",
-                     badchar, tok->filename, tok->lineno + 1);
+                     badchar, tok->filename, tok->lineno);
         return 0;
     }
     return 1;

And the full (current) test suite passes without errors:

== Tests result: SUCCESS ==

407 tests OK.

29 tests skipped:
    test_curses test_dbm_gnu test_devpoll test_epoll test_gdb
    test_idle test_ioctl test_launcher test_msilib
    test_multiprocessing_fork test_ossaudiodev test_perf_profiler
    test_smtpnet test_socketserver test_spwd test_startfile test_tcl
    test_tix test_tkinter test_ttk test_ttk_textonly test_turtle
    test_urllib2net test_urllibnet test_winconsoleio test_winreg
    test_winsound test_xmlrpc_net test_zipfile64

Total duration: 6 min 1 sec

Member:

@mdboom do you want to include the fix in this PR?

@mdboom (Contributor, Author):

@pablogsal: Yes, it makes sense to just fix this in this PR.

Resolved review threads: Parser/tokenizer.c (three; two outdated)
@mdboom mdboom requested a review from gvanrossum August 31, 2022 14:55
@mdboom (Contributor, Author) commented Aug 31, 2022:

@pablogsal: I leave it to you to decide whether this is backported to 3.11. If we don't backport, I'll file a separate PR for 3.11 to make the tests pass on buildbots with pydebug and saving coredump files (where they are currently failing).

@gvanrossum (Member) left a comment:

I'll let @pablogsal decide about the 3.11 and 3.10 backports. (It would be less risky to backport just the lineno fix perhaps?)

@gvanrossum gvanrossum added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Aug 31, 2022
@bedevere-bot:

🤖 New build scheduled with the buildbot fleet by @gvanrossum for commit f8e9e6e 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

@gvanrossum (Member) left a comment:

Thanks. I think it's time to merge this.

@gvanrossum gvanrossum merged commit 8bc356a into python:main Sep 7, 2022
@miss-islington (Contributor):

Thanks @mdboom for the PR, and @gvanrossum for merging it 🌮🎉.. I'm working now to backport this PR to: 3.11.
🐍🍒⛏🤖

@bedevere-bot:

GH-96668 is a backport of this pull request to the 3.11 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Sep 7, 2022
This makes tokenizer.c:valid_utf8 match stringlib/codecs.h:decode_utf8.

It also fixes an off-by-one error introduced in 3.10 for the line number when the tokenizer reports bad UTF8.
(cherry picked from commit 8bc356a)

Co-authored-by: Michael Droettboom <mdboom@gmail.com>
miss-islington added a commit that referenced this pull request Sep 7, 2022
This makes tokenizer.c:valid_utf8 match stringlib/codecs.h:decode_utf8.

It also fixes an off-by-one error introduced in 3.10 for the line number when the tokenizer reports bad UTF8.
(cherry picked from commit 8bc356a)

Co-authored-by: Michael Droettboom <mdboom@gmail.com>
6 participants