extract_text produces hexadecimal output #2413

Open
staff0rd opened this issue Jan 16, 2024 · 2 comments
Labels
Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow

Comments

@staff0rd

The code below produces output that looks like a run of hexadecimal codes instead of readable text. The first page of the PDF is shown below; I can copy and paste text from it normally when it is open in Google Chrome.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.4, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

pdfreader = PdfReader('kia-stonic-owners-manual-my23.pdf')

# read text from pdf, skipping pages that yield no text
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:
        raw_text += content

# write text to file (explicit encoding so the result does not depend on the locale)
with open('text.txt', 'w', encoding='utf-8') as f:
    f.write(raw_text)

Share here the PDF file(s) that cause the issue:
kia-stonic-owners-manual-my23.pdf

First page of pdf

image

top of text.txt

image

@stefan6419846 added the workflow-text-extraction and Has MCVE labels on Feb 15, 2024
@IshmamR

IshmamR commented Mar 18, 2024

Did you find a workaround for this?

@pubpub-zz
Collaborator

The fonts in this PDF have no ToUnicode mapping, which is the standard way to translate character codes into Unicode for text extraction. Without that information, pypdf falls back to outputting the raw character codes. Personally, I have not yet been able to find a way to recover the Unicode values from the font itself.
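
For anyone who wants to check whether their own file is in the same situation, here is a rough diagnostic sketch. It is not part of pypdf; the helper names `fonts_missing_tounicode` and `report` are made up for this example. It only looks at the fonts listed directly in each page's /Resources, so fonts used inside form XObjects are not covered. A missing /ToUnicode entry is also not proof that extraction will fail, since some fonts can still be decoded through their /Encoding.

```python
from typing import Any, Mapping


def _resolve(obj: Any) -> Any:
    """Follow a pypdf indirect reference if present; plain objects pass through."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def fonts_missing_tounicode(fonts: Mapping[str, Any]) -> list[str]:
    """Return the resource names of fonts that have no /ToUnicode entry."""
    missing = []
    for name, font in fonts.items():
        font = _resolve(font)
        if "/ToUnicode" not in font:
            missing.append(name)
    return sorted(missing)


def report(pdf_path: str) -> dict[int, list[str]]:
    """Map page index -> fonts without /ToUnicode, for pages that have any."""
    # Imported lazily so the helper above can be used without pypdf installed.
    from pypdf import PdfReader

    result = {}
    for index, page in enumerate(PdfReader(pdf_path).pages):
        resources = _resolve(page.get("/Resources", {}))
        fonts = _resolve(resources.get("/Font", {}))
        missing = fonts_missing_tounicode(fonts)
        if missing:
            result[index] = missing
    return result
```

Calling `report('kia-stonic-owners-manual-my23.pdf')` should then list, per page, the fonts for which pypdf has no ToUnicode table to work with.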
