DOC: Example code doesn't give the right output (fix + proven, didn't want to create a pull request for that) #2431

Open
etern4l-white opened this issue Feb 1, 2024 · 7 comments · May be fixed by #2432
Labels
workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow

Comments

@etern4l-white

I tried the exact example from the documentation, but it prints blank output, even though I used the same code and the same PDF file. (The fix is at the bottom of this issue report.)

Environment

Debian

$ python -m platform
Linux-6.1.0-12-amd64-x86_64-with-glibc2.36

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.0.1, crypt_provider=('cryptography', '41.0.7'), PIL=10.2.0

Code + PDF

This is a minimal, complete example that shows the issue (same example from documentation):

from pypdf import PdfReader

reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
page = reader.pages[3]

parts = []


def visitor_body(text, cm, tm, font_dict, font_size):
    y = cm[5]
    if y > 50 and y < 720:
        parts.append(text)


page.extract_text(visitor_text=visitor_body)
text_body = "".join(parts)

print(text_body)

Fix

Change cm to tm. The y coordinate used to select text by height must be read from the text matrix (tm[5]), not from the current transformation matrix (cm[5]).

Here's the link to the PDF file.
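For reference, a minimal sketch of the example with the fix applied. The helper names `make_body_visitor` and `extract_body` are hypothetical, introduced only for illustration and not part of pypdf; the 50 and 720 bounds are taken from the documentation example, and the rest assumes the `visitor_text` callback signature shown above.

```python
def make_body_visitor(parts, y_min=50, y_max=720):
    """Build a visitor_text callback that keeps text between y_min and y_max.

    Hypothetical helper: `parts` is a list that collects the kept fragments.
    """

    def visitor_body(text, cm, tm, font_dict, font_size):
        # Read the vertical position from the text matrix (tm),
        # not from the current transformation matrix (cm).
        y = tm[5]
        if y_min < y < y_max:
            parts.append(text)

    return visitor_body


def extract_body(pdf_path, page_index):
    """Return the text of one page, skipping header and footer regions.

    Hypothetical helper; pypdf is imported lazily so the visitor itself
    can be exercised without pypdf installed.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    parts = []
    page = reader.pages[page_index]
    page.extract_text(visitor_text=make_body_visitor(parts))
    return "".join(parts)
```

With the PDF from this report in the working directory, `print(extract_body("GeoBase_NHNC1_Data_Model_UML_EN.pdf", 3))` corresponds to the original example with `cm[5]` replaced by `tm[5]`.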

@stefan6419846
Collaborator

Do you want to submit a corresponding PR?

@etern4l-white
Author

> Do you want to submit a corresponding PR?

Is it worth a PR? I mean, it's only in the documentation, not in the code. If I'm allowed to, then I'm more than open to it.

@stefan6419846
Collaborator

The docs should ideally be correct, thus fixing it makes sense to avoid confusion for future readers.

@etern4l-white
Author

> The docs should ideally be correct, thus fixing it makes sense to avoid confusion for future readers.

Ok, I'll open a PR. Thanks.

@etern4l-white
Author

Hey @stefan6419846, this is my first pull request. What's the next step? I assume the maintainers will review the pull request and accept it if it meets the requirements? I'm completely new 😅

@stefan6419846
Collaborator

No worries. Everyone made their first PR/contribution at some point. And apparently you already found our contribution docs, which told you about the desired PR prefixes ;)

The current maintainer (Martin) will approve the CI run for your commit in the near future to check whether anything about your change needs further attention. Once that has completed, your PR will ideally be approved (maybe after some further manual checks) and then merged into our code base. The hosted docs will be rebuilt no later than the next release.

@etern4l-white
Author

Alright, that was very informative. Thanks!

@stefan6419846 stefan6419846 added the workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow label Feb 15, 2024