SystemError: <built-in function Page_get_texttrace> returned a result with an error set #2045

Closed
shiu886 opened this issue Nov 14, 2022 · 3 comments

shiu886 commented Nov 14, 2022

Running this script

import fitz
print(fitz.__doc__)
doc = fitz.open('2-p1.pdf')
for page in doc:
    allSpans = page.get_texttrace()
    print(f"{page.number}, # of spans={len(allSpans)}")

on some PDF files, for example 2-p1.pdf, causes

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcb in position 0: invalid continuation byte

The above exception was the direct cause of the following exception:

SystemError: <class 'UnicodeDecodeError'> returned a result with an error set

<the same messages repeat many times>

  File "D:\Program Files\Python\Python37\lib\site-packages\fitz\fitz.py", line 6278, in get_texttrace
    val = _fitz.Page_get_texttrace(self)
SystemError: <built-in function Page_get_texttrace> returned a result with an error set

My system

PyMuPDF 1.21.0: Python bindings for the MuPDF 1.21.0 library.
Version date: 2022-11-08 00:00:01.
Built for Python 3.7 on win32 (64-bit).

I also tried 1.19.6 and 1.20.2. Both give the same error.
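
Until a fixed version is installed, one possible stopgap is to catch the SystemError per page so that one bad page does not abort the whole loop. This is only a sketch: the helper name count_spans is hypothetical, and it assumes nothing about page objects beyond the number attribute and get_texttrace() method already used in the script above.

```python
def count_spans(pages):
    """Return {page number: span count}, or None where get_texttrace() fails."""
    results = {}
    for page in pages:
        try:
            spans = page.get_texttrace()
        except SystemError:
            # The affected versions raise SystemError from the C extension;
            # record the failure and keep going with the next page.
            results[page.number] = None
            continue
        results[page.number] = len(spans)
    return results
```

With a real document this would be called as count_spans(fitz.open('2-p1.pdf')). Pages mapped to None are the ones that hit the error.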

@julian-smith-artifex-com
Collaborator

Thanks for reporting this. I've reproduced it, will investigate some more later today.

@JorjMcKie
Collaborator

This is caused by a font name in the file that cannot be interpreted as UTF-8. A fallback to escape decoding must therefore be used, which already happens in (hopefully) all other places where font names are extracted.
This case was previously undetected, but it is an easy change.

pprint(doc.get_page_fonts(0))
[(6578, 'ttf', 'TrueType', 'ABCDEE+ËÎÌå', 'F1', 'WinAnsiEncoding'),  # this one!
 (6580, 'ttf', 'Type0', 'ABCDEE+ËÎÌå', 'F2', 'Identity-H'),  # this one!
 (4, 'ttf', 'TrueType', 'ABCDEE+Calibri', 'F6', 'WinAnsiEncoding')]
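
The failure and the fallback can be illustrated in pure Python. This is a minimal sketch, not PyMuPDF's actual code: the helper decode_font_name is hypothetical, and the sample bytes are an assumption (the Latin-1 encoding of the four non-ASCII characters shown in the font names above). The first byte, 0xcb, announces a two-byte UTF-8 sequence, but the following byte is not a valid continuation byte, which matches the reported error.

```python
def decode_font_name(raw: bytes) -> str:
    """Decode a font name as UTF-8, falling back to raw escape decoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # raw_unicode_escape maps every byte to the code point of the same
        # value (Latin-1 style), apart from literal \uXXXX escape sequences,
        # so ordinary non-UTF-8 bytes decode without raising.
        return raw.decode("raw_unicode_escape")


# Assumed sample: Latin-1 bytes of the non-ASCII part of the font name.
sample = b"\xcb\xce\xcc\xe5"
name = decode_font_name(sample)
```

Here name comes out as the same four characters that get_page_fonts() prints, while a plain sample.decode("utf-8") raises UnicodeDecodeError.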

julian-smith-artifex-com pushed a commit that referenced this issue Nov 14, 2022
Python C function `Py_BuildValue("s", fontname)` will fail if fontname is not UTF8-encoded.
Use PyUnicodeRawEscape function for fontnames instead - like everywhere else in PyMuPDF.
@julian-smith-artifex-com
Collaborator

Fixed in PyMuPDF-1.21.1.
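
To check whether an installation already includes the fix, the version string can be compared against 1.21.1. This is a rough sketch: the helper name has_texttrace_fix is hypothetical, and it assumes a plain dotted numeric version string such as the "1.21.0" shown in the report.

```python
def has_texttrace_fix(version: str) -> bool:
    """Return True if a dotted version string is 1.21.1 or newer."""
    # Assumes purely numeric components, e.g. "1.21.0"; suffixes such as
    # "rc1" are not handled in this sketch.
    parts = tuple(int(p) for p in version.split(".")[:3])
    return parts >= (1, 21, 1)
```

In a script this could be called as has_texttrace_fix(fitz.VersionBind), assuming the installed bindings expose the version under that name.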
