Spaces (that do not exist in the original PDF) appear in the output of extract_text() #2336

Open
renanbirck opened this issue Dec 9, 2023 · 4 comments
Labels
Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
help wanted: We appreciate help everywhere - this one might be an easy start!
is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
whitespace: While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.
workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow

Comments

@renanbirck

I am trying to parse this PDF. However, the output of extract_text() contains a bunch of spaces that are not in the original PDF.

See the screenshot, which shows the original PDF on the left and the output of extract_text() next to it, for what I mean (e.g. "Av. Beir a Rio" should be "Av. Beira Rio", "Cen tro" should be "Centro"):

image

If I copy/paste from Okular or other PDF reader to a text document, it is copied correctly, so I know the PDF file is not broken.

Environment

I am using Python 3.12 on Fedora 39.

$ python -m platform
Linux-6.6.4-200.fc39.x86_64-x86_64-with-glibc2.38

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.1, crypt_provider=('pycryptodome', '3.19.0'), PIL=10.1.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader
reader = PdfReader('Pesquisa-de-Precos-Combustiveis-novembro-2023.pdf')
text = reader.pages[0].extract_text()
@stefan6419846
Collaborator

This is a known limitation with multiple similar issues already being reported and is explained inside the docs as well: https://pypdf.readthedocs.io/en/latest/user/extract-text.html#whitespaces

TL;DR: How a text layer is retrieved depends on the actual library implementation; each one tends to have its own advantages and limits. In this specific case, the pdftotext layout mode (based on poppler, one of the standard PDF libraries on Linux systems) seems to provide "correct" results, as does mutool convert.
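
For anyone who wants to try the pdftotext route from Python, here is a minimal sketch. It assumes the `pdftotext` binary from poppler-utils is on the PATH; the helper names `pdftotext_command` and `extract_with_pdftotext` are made up for this example and are not part of pypdf or poppler.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path: str, first_page: int = 1, last_page: int = 1) -> list[str]:
    """Build a poppler `pdftotext` call that prints layout-mode text to stdout."""
    return [
        "pdftotext",
        "-layout",               # keep the physical layout of the page
        "-f", str(first_page),   # first page to convert (1-based)
        "-l", str(last_page),    # last page to convert (1-based)
        str(pdf_path),
        "-",                     # "-" as the output file means stdout
    ]


def extract_with_pdftotext(pdf_path: str, first_page: int = 1, last_page: int = 1) -> str:
    """Run pdftotext and return its output (hypothetical helper, not a pypdf API)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler-utils) was not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, first_page, last_page),
        check=True,
        capture_output=True,
        text=True,
        encoding="utf-8",
    )
    return result.stdout
```

Calling `extract_with_pdftotext('Pesquisa-de-Precos-Combustiveis-novembro-2023.pdf')` would then return the first page as a string, which can be compared with the pypdf output.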

@renanbirck
Author

> This is a known limitation with multiple similar issues already being reported and is explained inside the docs as well: https://pypdf.readthedocs.io/en/latest/user/extract-text.html#whitespaces

I understand. Is there any way I can work around it in pypdf? Other PDF libraries (like pymupdf, based on mupdf) don't have that problem.
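
To make the comparison concrete, here is a rough sketch that extracts the same page with both libraries and lists the word pairs that pypdf appears to have split. It assumes PyMuPDF is installed (imported as `fitz`); the function names are hypothetical, and `find_split_words` is only a simple heuristic: it flags two neighbouring tokens whose concatenation occurs as a single token in the reference text.

```python
def pypdf_page_text(pdf_path: str, page_index: int = 0) -> str:
    """Text of one page as extracted by pypdf."""
    from pypdf import PdfReader  # lazy import: only needed when this function runs

    return PdfReader(pdf_path).pages[page_index].extract_text()


def pymupdf_page_text(pdf_path: str, page_index: int = 0) -> str:
    """Text of one page as extracted by PyMuPDF, used here as the reference."""
    import fitz  # PyMuPDF; lazy import for the same reason as above

    with fitz.open(pdf_path) as doc:
        return doc[page_index].get_text()


def find_split_words(suspect_text: str, reference_text: str) -> list[tuple[str, str]]:
    """Return adjacent token pairs that look like one reference word split in two."""
    reference_tokens = set(reference_text.split())
    tokens = suspect_text.split()
    splits = []
    for left, right in zip(tokens, tokens[1:]):
        joined_is_known = (left + right) in reference_tokens
        both_are_known = left in reference_tokens and right in reference_tokens
        # Skip pairs where both halves are legitimate words on their own.
        if joined_is_known and not both_are_known:
            splits.append((left, right))
    return splits
```

Passing the results of `pypdf_page_text(...)` and `pymupdf_page_text(...)` for the same file to `find_split_words` gives a quick list of the affected words. It does not fix the extraction; it only shows where the two outputs disagree.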

@stefan6419846
Collaborator

You might want to have a look at the code from #2038 (comment).

@MartinThoma MartinThoma added workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow whitespace While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard. labels Dec 24, 2023
MartinThoma added a commit that referenced this issue Dec 24, 2023
MartinThoma added a commit that referenced this issue Dec 24, 2023
@MartinThoma MartinThoma added Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF labels Dec 24, 2023
MartinThoma added a commit that referenced this issue Dec 24, 2023
## What's new

### Bug Fixes (BUG)
-  Handle IndirectObject as image filter (#2355) by @stefan6419846

### Documentation (DOC)
-  Quote specs in generate_file_identifiers (#2363) by @exiledkingcc
-  Notes about form fields and annotations (#1945) by @dmjohnsson23
-  Notes about update_page_form_field_values(auto_regenerate) (#2359) by @dmjohnsson23
-  Fix stamping example (#2358) by @dmjohnsson23
-  Stamp images directly on a PDF (#2357) by @dmjohnsson23
-  Correct the example of adding highlight annotation (#2341) by @Tobeabellwether

### Maintenance (MAINT)
-  Update upload-artifact and download-artifact actions from v3 to v4 (#2352) by @stefan6419846

### Testing (TST)
-  Add xfail test for #2336 (#2365) by @MartinThoma
-  Increase test coverage for flate handling of image mode 1 (#2339) by @stefan6419846

### Code Style (STY)
-  File identifier generation restructuring (#2362) by @exiledkingcc
-  Add PdfWriter._ID attribute (#2361) by @exiledkingcc
-  Variable naming convention (#2360) by @MartinThoma

[Full Changelog](3.17.3...3.17.4)
@pubpub-zz
Collaborator

@renanbirck
The extra spaces are the output of the "tt" special character conversion. I don't know how to get the correct output: the translation is not part of the ToUnicode field. I also don't know how other programs do the translation.
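
For anyone who wants to look at this themselves, the sketch below dumps the /ToUnicode CMap of each font on a page and parses its `bfchar` entries. It is a simplified illustration under a few assumptions: the page carries its own /Resources entry (not inherited from a parent node), only `beginbfchar` sections are parsed (`bfrange` is ignored), and all function names are invented for this example.

```python
import re

BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
HEX_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def tounicode_cmaps(pdf_path: str, page_index: int = 0) -> dict[str, str]:
    """Map each font name on the page to the text of its /ToUnicode CMap, if any."""
    from pypdf import PdfReader  # lazy import: the parser below works without pypdf

    page = PdfReader(pdf_path).pages[page_index]
    cmaps = {}
    for name, ref in page["/Resources"]["/Font"].items():
        font = ref.get_object()  # resolve indirect references
        if "/ToUnicode" in font:
            cmaps[name] = font["/ToUnicode"].get_data().decode("latin-1")
    return cmaps


def parse_bfchar(cmap_text: str) -> dict[int, str]:
    """Parse `<code> <utf-16-be hex>` pairs from the bfchar sections of a CMap."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in HEX_PAIR.findall(block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be", errors="replace")
    return mapping


def codes_mapped_to_whitespace(mapping: dict[int, str]) -> list[int]:
    """Character codes whose Unicode translation is only whitespace."""
    return sorted(code for code, text in mapping.items() if text.isspace())
```

Running `parse_bfchar` over each value returned by `tounicode_cmaps(...)` shows which character codes have a translation at all, which may help to confirm whether the problematic glyph is missing from the ToUnicode data.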

@pubpub-zz pubpub-zz added the help wanted We appreciate help everywhere - this one might be an easy start! label Apr 2, 2024