DOC: Example code doesn't give the right output (fix + proven, didn't want to create a pull request for that) #2431

etern4l-white · 2024-02-01T08:40:18Z

I tried the exact example from the documentation, but it produces blank output even though I used the same code and the same PDF file. (The fix is at the bottom of this issue report.)

Environment

Debian

$ python -m platform
Linux-6.1.0-12-amd64-x86_64-with-glibc2.36

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.0.1, crypt_provider=('cryptography', '41.0.7'), PIL=10.2.0

Code + PDF

This is a minimal, complete example that shows the issue (same example from documentation):

from pypdf import PdfReader

reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
page = reader.pages[3]

parts = []


def visitor_body(text, cm, tm, font_dict, font_size):
    y = cm[5]
    if y > 50 and y < 720:
        parts.append(text)


page.extract_text(visitor_text=visitor_body)
text_body = "".join(parts)

print(text_body)

Fix

Just change cm to tm in the visitor function. The height (y position) used for filtering must be read from the text matrix (tm[5]), not from the current transformation matrix (cm[5]).
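For reference, here is a minimal sketch of the example with that change applied. The helper names make_body_visitor and extract_body, and the y_min/y_max parameters, are made up for illustration and are not part of pypdf. The only pypdf calls used are PdfReader, reader.pages and page.extract_text(visitor_text=...), as in the documentation example. The 50/720 bounds come from that example and assume its page layout.

```python
def make_body_visitor(parts, y_min=50, y_max=720):
    """Build a visitor_text callback that keeps text between y_min and y_max.

    Hypothetical helper: the bounds are the ones from the documentation
    example and only make sense for a page with that layout.
    """

    def visitor_body(text, cm, tm, font_dict, font_size):
        # Read the vertical position from the text matrix (tm),
        # not from the current transformation matrix (cm).
        y = tm[5]
        if y_min < y < y_max:
            parts.append(text)

    return visitor_body


def extract_body(pdf_path, page_index=3):
    """Return the body text of one page, skipping header and footer."""
    # Imported here so the visitor above can be tried without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    parts = []
    page = reader.pages[page_index]
    page.extract_text(visitor_text=make_body_visitor(parts))
    return "".join(parts)
```

Usage would look like print(extract_body("GeoBase_NHNC1_Data_Model_UML_EN.pdf")), assuming the PDF from the documentation is in the working directory.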

Here's the link to the PDF file.


stefan6419846 · 2024-02-01T08:49:23Z

Do you want to submit a corresponding PR?

etern4l-white · 2024-02-01T09:26:27Z

Do you want to submit a corresponding PR?

Is it worth a PR? I mean it's only in the documentation, not in the code. If I'm allowed to, then I'm more than open.

stefan6419846 · 2024-02-01T09:30:22Z

The docs should ideally be correct, thus fixing it makes sense to avoid confusion for future readers.

etern4l-white · 2024-02-01T09:35:13Z

The docs should ideally be correct, thus fixing it makes sense to avoid confusion for future readers.

Ok, I'll open a PR. Thanks.

etern4l-white · 2024-02-01T10:52:44Z

Hey @stefan6419846, this is my first pull request. What's the next step? I assume contributors will review the pull request and accept it if it meets the requirements? I'm completely new 😅

stefan6419846 · 2024-02-01T11:03:53Z

No worries. Everyone did their first PR/contribution at some point in time. And apparently you already found our contribution docs which have told you about the desired PR prefixes ;)

The current maintainer (Martin) will approve the CI run for your commit in the near future to check whether anything about your change needs further attention. As soon as this has been completed, your PR will ideally be approved (maybe after some further manual checks) and then merged into our code base, which will trigger a rebuild of the hosted docs no later than the next release.

etern4l-white · 2024-02-01T11:21:49Z

Alright, that was very informative. Thanks!

etern4l-white linked a pull request Feb 1, 2024 that will close this issue

DOC: Change extract-text.md example codes from using cm to tm #2432

Open

stefan6419846 added the workflow-text-extraction label (From a user's perspective, text extraction is the affected feature/workflow) Feb 15, 2024
