Bug - can not extract data from file in the newest version 1.21.1 #2238

Closed
Matmaus opened this issue Feb 17, 2023 · 4 comments

Matmaus commented Feb 17, 2023

Bug description

Since version 1.21.1, I have a problem extracting data from files that have some content before the header or after EOF. In version 1.21.0 (and every older version) there was no problem. Firefox, for example, has no issue opening the file.

To Reproduce

test.py

#!/usr/bin/env python3

import sys

import fitz as pymupdf


def main(filepath):
    doc = pymupdf.open(filepath)

    first_page = doc.load_page(0).get_text('text')
    last_page = doc.load_page(-1).get_text('text')

    print(f'{first_page=}')
    print(f'{last_page=}')


if __name__ == '__main__':
    main(sys.argv[1])

PDF to download

Example file: test.pdf

How to run:

To reproduce this problem, run the program above on the example file test.pdf.

$ pip install PyMuPDF==1.21.0
$ python test.py test.pdf
first_page='Hello World\n'
last_page='Hello World\n'
$
$ pip install PyMuPDF==1.21.1
$ python test.py test.pdf
first_page=''
last_page=''

My configuration

  • 5.15.0-60-generic (Ubuntu)
  • GCC 11.3.0
  • Python 3.10.6
  • PyMuPDF 1.21.1, PyPI.
@Matmaus Matmaus changed the title Bug - can not extract data from file in the newest version Bug - can not extract data from file in the newest version 1.21.1 Feb 17, 2023
@julian-smith-artifex-com
Collaborator

Thanks for the clear report, I have reproduced the issue.

I suspect it's a change in MuPDF rather than PyMuPDF itself; will see what the MuPDF people think next week.

@julian-smith-artifex-com
Collaborator

It looks like this is not caused by a change in MuPDF after all.

Instead it's caused by PyMuPDF's fix for #2048, where it defaults to clipping to the page mediabox.

Unfortunately PyMuPDF's text clipping only includes glyphs whose bounding boxes are entirely included in the clip rect. Even though the Hello World text in your PDF looks to be entirely visible, the font's bounding boxes have the same y0 and y1 regardless of the actual glyph, and they extend slightly below the baseline to contain lower case chars with descenders. The page's mediabox seems to be exactly on the baseline, so the glyphs' bounding boxes are not entirely contained in the mediabox and are excluded.
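To make the difference between 'contained' and 'overlap' concrete, here is a minimal sketch in plain Python. It does not use PyMuPDF; the helper names `contains` and `overlaps` and all coordinates are made up for illustration, and rectangles are `(x0, y0, x1, y1)` tuples with y growing downwards.

```python
# Illustrative only: plain tuples (x0, y0, x1, y1), y grows downwards.
# The helper names and all numbers are hypothetical, not PyMuPDF code.

def contains(clip, box):
    """'Contained' semantics: box must lie entirely inside clip."""
    return (clip[0] <= box[0] and clip[1] <= box[1]
            and box[2] <= clip[2] and box[3] <= clip[3])


def overlaps(clip, box):
    """'Overlap' semantics: box and clip only need to share some area."""
    return (box[0] < clip[2] and clip[0] < box[2]
            and box[1] < clip[3] and clip[1] < box[3])


# A page whose bottom edge sits at y = 100 ...
mediabox = (0.0, 0.0, 200.0, 100.0)
# ... and a glyph box whose descender space reaches slightly past that edge.
glyph_box = (10.0, 88.0, 18.0, 102.5)

contained = contains(mediabox, glyph_box)
overlapping = overlaps(mediabox, glyph_box)
print(f"{contained=} {overlapping=}")
```

With these made-up numbers the glyph fails the 'contained' test because its bottom edge pokes past the page edge, yet it passes the 'overlap' test, which matches the symptom of visible text being dropped.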

A workaround is to specify an infinite cliprect when calling get_text(), and this fixes your test with PyMuPDF-1.21.1:

    first_page = doc.load_page(0).get_text('text', pymupdf.INFINITE_RECT())
    last_page = doc.load_page(-1).get_text('text', pymupdf.INFINITE_RECT())

[In the next release we might look into supporting 'overlap' semantics as well as, or instead of, the current 'contained' semantics.]

@Matmaus
Author

Matmaus commented Feb 20, 2023

Your workaround works, thanks 👍.

[In the next release we might look into supporting 'overlap' semantics as well as, or instead of, the current 'contained' semantics.]

It would be nice if I would not have to use the workaround, but at least it works now. Also, I would be happy if you would use the provided file in your test suite.

julian-smith-artifex-com added a commit to ArtifexSoftware/PyMuPDF-julian that referenced this issue Mar 7, 2023
…xtracting text.

Also fixed Story.draw() to handle exceptions e.g. from fz_draw_story().
julian-smith-artifex-com added a commit to ArtifexSoftware/PyMuPDF-julian that referenced this issue Mar 8, 2023
…xtracting text.

We now include chars that overlap with the clipbox, instead of only those that
are entirely contained within the clipbox.
@julian-smith-artifex-com
Collaborator

My tree now uses 'overlap' semantics rather than 'contains', which fixes the problem. [But I haven't yet pushed to GitHub.]

Thanks for the offer to use your file in the test suite, I've done so in my tree.

julian-smith-artifex-com added a commit to ArtifexSoftware/PyMuPDF-julian that referenced this issue Mar 13, 2023
…xtracting text.

We now include chars that overlap with the clipbox, instead of only those that
are entirely contained within the clipbox.
julian-smith-artifex-com added a commit to ArtifexSoftware/PyMuPDF-julian that referenced this issue Mar 13, 2023
…xtracting text.

We now include chars that overlap with the clipbox, instead of only those that
are entirely contained within the clipbox.

Note that new fn JM_rects_overlap() still returns true if one of the rects is
empty. This allows things to work with ligatures, where component glyphs can
have zero width.
julian-smith-artifex-com added a commit that referenced this issue Mar 14, 2023
…ng text.

We now include chars that overlap with the clipbox, instead of only those that
are entirely contained within the clipbox.

Note that new fn JM_rects_overlap() still returns true if one of the rects is
empty. This allows things to work with ligatures, where component glyphs can
have zero width.
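The note about empty rects can be illustrated with a small Python sketch. This is not the actual JM_rects_overlap() code, which is not shown in this thread; the function names and numbers below are hypothetical. It only shows why a check based on intersection area would drop a zero-width ligature component, while a plain edge comparison keeps it when it lies inside the clip.

```python
# Hypothetical illustration, not PyMuPDF source.
# Rectangles are (x0, y0, x1, y1) tuples.

def rects_overlap(a, b):
    """Edge comparison with no special rejection of empty rects."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def intersection_area(a, b):
    """Area shared by a and b; zero for disjoint or degenerate rects."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)


clip = (0.0, 0.0, 200.0, 100.0)
zero_width_glyph = (50.0, 80.0, 50.0, 95.0)   # ligature component, x0 == x1
outside_glyph = (300.0, 80.0, 310.0, 95.0)    # entirely right of the clip

kept_by_edges = rects_overlap(clip, zero_width_glyph)
kept_by_area = intersection_area(clip, zero_width_glyph) > 0
outside_kept = rects_overlap(clip, outside_glyph)
print(f"{kept_by_edges=} {kept_by_area=} {outside_kept=}")
```

An area-based test would treat the zero-width glyph as not overlapping, so part of a ligature could go missing from the extracted text; the edge comparison avoids that while still rejecting glyphs that are clearly outside the clip.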