Extended ASCII characters in multiline strings cause "SystemError: Negative size passed to PyUnicode_New" when the encoding is not specified #96611

Closed
polprog opened this issue Sep 6, 2022 · 5 comments
Labels: interpreter-core (Objects, Python, Grammar, and Parser dirs), topic-unicode, type-bug (An unexpected behavior, bug, or error)


polprog commented Sep 6, 2022

Bug report

In some cases, when a source file that is not UTF-8 encoded contains a multi-line string, Python raises SystemError: Negative size passed to PyUnicode_New and does not execute any code.

Minimal test case:

print("""
ą""")

This only happens when the non-UTF-8 character is on a later line of the multi-line string than the opening quotes; its position within that line does not matter.

A similar single-line test case behaves correctly:

print("""ą""")

It reports an encoding error, which is the expected behavior:

SyntaxError: Non-UTF-8 code starting with '\xb1' in file C:\Users\xxxxx\test.py on line 2, but no encoding declared; see https://python.org/dev/peps/pep-0263/ for details
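
The '\xb1' in the message is consistent with the file having been saved as ISO-8859-2 (Latin-2), where byte 0xb1 is 'ą'. That encoding is an assumption here, since the report does not name it. The following is a minimal sketch of why the byte is rejected as UTF-8, and of how a PEP 263 declaration (the remedy the message points to) lets the same bytes compile. The file name declared.py is made up for illustration.

```python
import contextlib
import io

STRAY = b"\xb1"  # the byte named in the SyntaxError message

# Assumption: the file was saved as ISO-8859-2, where 0xb1 is U+0105.
latin2_char = STRAY.decode("iso-8859-2")

# As UTF-8 the byte is invalid: 0xb1 is a continuation byte with no lead byte.
try:
    STRAY.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The multi-line test case again, this time with a PEP 263 encoding
# declaration on the first line. compile() honors the declaration when
# it is given bytes.
source = b'# -*- coding: iso-8859-2 -*-\nprint("""\n\xb1""")\n'
out = io.StringIO()
with contextlib.redirect_stdout(out):
    exec(compile(source, "declared.py", "exec"))

# ascii() keeps the report printable on consoles that cannot encode U+0105.
print(ascii(latin2_char), utf8_ok, ascii(out.getvalue()))
```

Declaring the encoding only sidesteps the problem for a given file. It does not change how the tokenizer handles an undeclared, undecodable byte inside a multi-line string.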

Since this is an encoding-related error, both files are attached (as .txt, because GitHub does not allow .py attachments).
test.txt - single line (correct behavior)
test_ml.txt - multi line (bug)

My environment

  • CPython versions tested on: Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)] on win32
  • Operating system and architecture: Windows 10 Pro 21H2 (19044.1826)
polprog added the type-bug (An unexpected behavior, bug, or error) label Sep 6, 2022

mdboom commented Sep 6, 2022

#96270 may fix this. Let me confirm.

mdboom commented Sep 6, 2022

> #96270 may fix this. Let me confirm.

It does not fix this.

mdboom self-assigned this Sep 6, 2022

mdboom commented Sep 6, 2022

Since copy-and-paste doesn't usually preserve broken encodings, this is a convenient way to reproduce the bug:

# Write the file in binary mode so the stray 0xb1 byte is preserved.
with open("x.py", "wb") as f:
    f.write(b'print("""\n\xb1""")')
$ python x.py
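
To compare the single-line and multi-line cases side by side, the reproducer can be extended into a small script. This is only a sketch: the file names and the run_cases helper are made up for illustration, and the message printed depends on the interpreter version (SystemError on affected builds, an encoding-related SyntaxError otherwise).

```python
import pathlib
import subprocess
import sys
import tempfile

# The two test cases from the report, written as raw bytes so the
# stray 0xb1 byte is preserved exactly.
CASES = {
    "single_line.py": b'print("""\xb1""")',
    "multi_line.py": b'print("""\n\xb1""")',
}


def run_cases():
    """Run each case in a fresh interpreter and collect the outcome."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name, data in CASES.items():
            path = pathlib.Path(tmp) / name
            path.write_bytes(data)
            proc = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True,
                text=True,
                errors="replace",
            )
            # The last stderr line carries the exception type and message.
            lines = proc.stderr.strip().splitlines()
            results[name] = (proc.returncode, lines[-1] if lines else "")
    return results


if __name__ == "__main__":
    for name, (code, message) in run_cases().items():
        print(f"{name}: exit={code} {message}")
```

The files are run as scripts in a child process rather than passed to compile(), because the report concerns reading source from a file.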

eryksun commented Sep 6, 2022

In tok_get() in "Parser/tokenizer.c", the following code blindly handles EOF returned from tok_nextc() as if it's the end of the file.

cpython/Parser/tokenizer.c

Lines 1936 to 1948 in 6744490

/* Get rest of string */
while (end_quote_size != quote_size) {
    c = tok_nextc(tok);
    if (c == EOF || (quote_size == 1 && c == '\n')) {
        assert(tok->multi_line_start != NULL);
        // shift the tok_state's location into
        // the start of string, and report the error
        // from the initial quote character
        tok->cur = (char *)tok->start;
        tok->cur++;
        tok->line_start = tok->multi_line_start;
        int start = tok->lineno;
        tok->lineno = tok->first_lineno;

In this case, however, tok->done is E_DECODE instead of E_EOF. This gets set by error_ret(), which also clears tok->start and tok->cur to NULL. The above code then copies the NULL tok->start into tok->cur and increments it, leaving tok->cur equal to 1. Subsequently, _syntaxerror_range() tries to decode the text for the syntax error using the negative size 1 - tok->line_start.
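
The arithmetic can be illustrated with a toy model in which pointers are plain integers. This is a sketch of the reasoning only, not CPython code: the address, the offset, and the error_text_size helper are all made up, and the only thing taken from the analysis is that the size comes out as tok->cur minus tok->line_start.

```python
NULL = 0


def error_text_size(cur: int, line_start: int) -> int:
    """Number of bytes between line_start and cur, as in the analysis."""
    return cur - line_start


# Hypothetical address of the line on which the string literal starts.
multi_line_start = 0x7F00_0000_1000

# Ordinary EOF inside a string: tok->start still points at the opening
# quote, here assumed to sit 6 bytes into the line (after 'print(').
start = multi_line_start + 6
cur = start + 1  # tok->cur = tok->start; tok->cur++
normal_size = error_text_size(cur, multi_line_start)

# Decode error: error_ret() has already reset tok->start to NULL.
start = NULL
cur = start + 1  # the "pointer" is now just the integer 1
broken_size = error_text_size(cur, multi_line_start)

print(normal_size, broken_size < 0)
```

In the ordinary case the size is a small positive offset into the line. After the reset, it is 1 minus a real address, which is negative and matches the "Negative size passed to PyUnicode_New" message.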

eryksun added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Sep 6, 2022
mdboom added a commit to mdboom/cpython that referenced this issue Sep 6, 2022
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Sep 6, 2022
…string (pythonGH-96623)

(cherry picked from commit 05692c6)

Co-authored-by: Michael Droettboom <mdboom@gmail.com>
miss-islington added a commit that referenced this issue Sep 6, 2022
…GH-96623)

(cherry picked from commit 05692c6)

Co-authored-by: Michael Droettboom <mdboom@gmail.com>

mdboom commented Sep 7, 2022

Thanks for the report, @polprog, and the diagnostics, @eryksun.

mdboom closed this as completed Sep 7, 2022