Spaces (that do not exist in the original PDF) appear in the output of extract_text() #2336

renanbirck · 2023-12-09T22:04:06Z

I am trying to parse this PDF. However, the output of extract_text() contains a bunch of spaces that are not in the original PDF.

See the screenshot (the original PDF on the left, the extracted output beside it) for what I mean, e.g. "Av. Beir a Rio" should be "Av. Beira Rio" and "Cen tro" should be "Centro":

If I copy/paste from Okular or other PDF reader to a text document, it is copied correctly, so I know the PDF file is not broken.

Environment

I am using Python 3.12 in Fedora 39.

$ python -m platform
Linux-6.6.4-200.fc39.x86_64-x86_64-with-glibc2.38

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.1, crypt_provider=('pycryptodome', '3.19.0'), PIL=10.1.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader
reader = PdfReader('Pesquisa-de-Precos-Combustiveis-novembro-2023.pdf')
text = reader.pages[0].extract_text()
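
To make the report easier to verify, the spurious spaces can be listed by comparing the extracted text against a trusted reference, such as text copied from a PDF viewer. The following is only a rough diagnostic sketch: the helper name `spurious_splits` and the two-token heuristic are assumptions made for illustration, not part of pypdf.

```python
def spurious_splits(extracted: str, reference: str) -> list[tuple[str, str]]:
    """Return (broken, expected) pairs where two adjacent extracted tokens
    join into a word that exists in the reference text.

    Hypothetical helper for illustration; it only detects words that were
    split into exactly two pieces.
    """
    reference_words = set(reference.split())
    tokens = extracted.split()
    found = []
    i = 0
    while i < len(tokens) - 1:
        left, right = tokens[i], tokens[i + 1]
        joined = left + right
        # Flag the pair only if the joined form is a reference word and the
        # two halves are not both legitimate words on their own.
        both_are_words = left in reference_words and right in reference_words
        if joined in reference_words and not both_are_words:
            found.append((f"{left} {right}", joined))
            i += 2
        else:
            i += 1
    return found


# Example using the strings quoted in this report.
pairs = spurious_splits("Av. Beir a Rio Cen tro", "Av. Beira Rio Centro")
print(pairs)
```

Here `extracted` would be the `text` variable from the snippet above, and `reference` the text pasted from Okular.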

stefan6419846 · 2023-12-10T08:55:36Z

This is a known limitation with multiple similar issues already being reported and is explained inside the docs as well: https://pypdf.readthedocs.io/en/latest/user/extract-text.html#whitespaces

TL;DR: How a text layer is being retrieved depends on the actual library implementation - each tends to have its own advantages and limits. In this specific case, the pdftotext layout mode (based upon poppler, one of the standard PDF libraries for Linux systems) seems to provide "correct" results, as well as mutool convert.
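
As a rough sketch of the pdftotext route mentioned above, the command-line tool can be called from Python through `subprocess`. This assumes poppler's `pdftotext` binary is installed and on `PATH`; the function names `build_pdftotext_command` and `extract_with_pdftotext` are made up for this example.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path: str, page: int = 1) -> list[str]:
    """Build a pdftotext call for a single page in layout mode.

    "-f"/"-l" select the first and last page (1-based); the trailing "-"
    tells pdftotext to write the text to stdout instead of a file.
    """
    return ["pdftotext", "-layout", "-f", str(page), "-l", str(page), pdf_path, "-"]


def extract_with_pdftotext(pdf_path: str, page: int = 1) -> str:
    """Return the text of one page as produced by pdftotext (assumed installed)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler) is not installed or not on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, page),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```

For the file from this report, the call would be `extract_with_pdftotext('Pesquisa-de-Precos-Combustiveis-novembro-2023.pdf')`.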

renanbirck · 2023-12-13T15:38:36Z

> This is a known limitation with multiple similar issues already being reported and is explained inside the docs as well: https://pypdf.readthedocs.io/en/latest/user/extract-text.html#whitespaces

I understand. Is there any way I can work around it in pypdf? Other PDF libraries (like pymupdf, based on mupdf) don't have that problem.
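
For comparison, a minimal sketch of the same first-page extraction with pymupdf might look like the following. It assumes the PyMuPDF package is installed and importable as `fitz`; the wrapper name `extract_with_pymupdf` is invented for this example, and the import is kept inside the function so the snippet loads even without the package.

```python
def extract_with_pymupdf(pdf_path: str, page_index: int = 0) -> str:
    """Return the plain text of one page using PyMuPDF (0-based page index)."""
    import fitz  # PyMuPDF; assumed to be installed separately

    with fitz.open(pdf_path) as doc:
        return doc[page_index].get_text()
```

Called as `extract_with_pymupdf('Pesquisa-de-Precos-Combustiveis-novembro-2023.pdf')`, this corresponds to `reader.pages[0].extract_text()` in the pypdf snippet.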

stefan6419846 · 2023-12-13T15:59:54Z

You might want to have a look at the code from #2038 (comment).

@stefan6419846

## What's new

### Bug Fixes (BUG)
- Handle IndirectObject as image filter (#2355) by @stefan6419846

### Documentation (DOC)
- Quote specs in generate_file_identifiers (#2363) by @exiledkingcc
- Notes about form fields and annotations (#1945) by @dmjohnsson23
- Notes about update_page_form_field_values(auto_regenerate) (#2359) by @dmjohnsson23
- Fix stamping example (#2358) by @dmjohnsson23
- Stamp images directly on a PDF (#2357) by @dmjohnsson23
- Correct the example of adding highlight annotation (#2341) by @Tobeabellwether

### Maintenance (MAINT)
- Update upload-artifact and download-artifact actions from v3 to v4 (#2352) by @stefan6419846

### Testing (TST)
- Add xfail test for #2336 (#2365) by @MartinThoma
- Increase test coverage for flate handling of image mode 1 (#2339) by @stefan6419846

### Code Style (STY)
- File identifier generation restructuring (#2362) by @exiledkingcc
- Add PdfWriter._ID attribute (#2361) by @exiledkingcc
- Variable naming convention (#2360) by @MartinThoma

[Full Changelog](3.17.3...3.17.4)

pubpub-zz · 2024-04-02T19:48:53Z

@renanbirck
The extra spaces are the output of the "tt" special character conversion. I don't know how to get the correct output: the translation is not part of the ToUnicode field. I also don't know how other programs are doing the translation.

MartinThoma added the workflow-text-extraction and whitespace labels Dec 24, 2023

MartinThoma added a commit that referenced this issue Dec 24, 2023

TST: Add xfail test for #2336

3d26c24

MartinThoma mentioned this issue Dec 24, 2023

TST: Add xfail test for #2336 #2365

Merged

MartinThoma added a commit that referenced this issue Dec 24, 2023

TST: Add xfail test for #2336 (#2365)

ba36031

MartinThoma added the Has MCVE and is-bug labels Dec 24, 2023

pubpub-zz added the help wanted label Apr 2, 2024
