Spaces (that do not exist in the original PDF) appear in the output of extract_text() #2336
Labels
Has MCVE
A minimal, complete and verifiable example helps a lot to debug / understand feature requests
help wanted
We appreciate help everywhere - this one might be an easy start!
is-bug
From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
whitespace
While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.
workflow-text-extraction
From a user's perspective, text extraction is the affected feature/workflow
I am trying to parse this PDF. However, the output of extract_text() contains a bunch of spaces that are not in the original PDF.
See the screenshot - the original PDF on the left, the output of extract_text() on the right - for what I mean (e.g. "Av. Beir a Rio" should be "Av. Beira Rio", "Cen tro" should be "Centro"):
If I copy/paste from Okular or another PDF reader into a text document, the text is copied correctly, so I know the PDF file is not broken.
Environment
I am using Python 3.12 on Fedora 39.
Code + PDF
This is a minimal, complete example that shows the issue:
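The snippet itself is not reproduced here, so what follows is only a rough sketch of what such a reproduction might look like, not the original code. It assumes the library is pypdf (using its `PdfReader` and `page.extract_text()`), and the helper names `extract_pages` and `find_split_words` are made up for illustration. The second helper flags phrases that show up in the output only with extra whitespace inside them, like the "Av. Beir a Rio" case above.

```python
import re


def extract_pages(pdf_path):
    """Return the extract_text() output for each page of the PDF."""
    # Assumption: the library in question is pypdf. It is imported lazily so
    # that find_split_words() below can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() for page in reader.pages]


def find_split_words(text, expected_phrases):
    """Report expected phrases that appear in text only with extra whitespace.

    Returns a list of (expected, found) tuples. Hypothetical helper, written
    only to make the symptom easy to check.
    """
    hits = []
    for phrase in expected_phrases:
        if phrase in text:
            continue  # extracted correctly, nothing to report
        # Allow optional whitespace between every pair of characters.
        chars = [c for c in phrase if not c.isspace()]
        pattern = r"\s*".join(re.escape(c) for c in chars)
        match = re.search(pattern, text)
        if match:
            hits.append((phrase, match.group(0)))
    return hits
```

To use it, pass the path of the PDF to `extract_pages()`, join the pages, and call `find_split_words()` with a few phrases copied from a PDF viewer; every tuple returned is a phrase that came back with spurious spaces.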