Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: Change extract-text.md example codes from using cm to tm #2432

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

etern4l-white
Copy link

TL;DR

Fixes #2431
Changed cm (current_matrix) to tm (text matrix).

Problem

In the extract-text documentation here, the example codes that are used won't produce the correct output.

For example, the first code snippet should output the text of table of contents, but it outputs nothing. The second code snippet is supposed to convert page 4 from the PDF to a SVG, including the text, but it only outputs empty fields.

Reason

The coordination process should be using the text matrix instead of current matrix.

Solution/update

Just changed the cm to tm in the code snippets.

@etern4l-white
Copy link
Author

This is my first pull request ever. Any comments?

Added by @stefan6419846

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
@pubpub-zz
Copy link
Collaborator

I personally do not recommend to extract text using cm : We've observed many case where the actual text and ordering is not valid.
Maybe you should try to use the new text extraction with the layout.

@stefan6419846
Copy link
Collaborator

@pubpub-zz This is an example inside our docs which somehow stopped working correctly, maybe because due to internal changes, maybe because it has not worked at all in the past. If ever, we should explain the new layout mode besides the "classic" mode in the docs. But this can be part of another PR - our examples should work and yield correct results to avoid frustrations.

@etern4l-white
Copy link
Author

@pubpub-zz This is an example inside our docs which somehow stopped working correctly, maybe because due to internal changes, maybe because it has not worked at all in the past. If ever, we should explain the new layout mode besides the "classic" mode in the docs. But this can be part of another PR - our examples should work and yield correct results to avoid frustrations.

By the way, in pypdf version 3 (i guess) documentation it uses tm intead of cm , too. See here. I don't know the exact underlying reason.

@stefan6419846
Copy link
Collaborator

This has been a change by @pubpub-zz: bcd85c4

@etern4l-white
Copy link
Author

Hm, so tm was making issues, and that change fixed the issue? That makes sense, but I believe the example in the docs should be working, too. I'm kind of confused to be honest, since I'm not that deep in the internals of how it works.

Maybe the example PDF was the problem? Because if that change was to solve a problem, then maybe tm was producing an issue in most of PDFs, but not in the example PDF, so switching to cm might have fixed the issue for alot of PDFs, but not the one in the example.

I think for me to validate this assumption I should test it on many PDFs. I will do later today because I got an assignment to do 😅.

@stefan6419846
Copy link
Collaborator

Did you have a chance to further check this already?

@stefan6419846 stefan6419846 added the on-hold PR requests that need clarification before they can be merged.A comment must give details label Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
on-hold PR requests that need clarification before they can be merged.A comment must give details
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DOC: Example code doesn't give the right output (fix + proven, didn't want to create a pull request for that)
3 participants