Bug - can not extract data from file in the newest version 1.21.1 #2238

Matmaus · 2023-02-17T10:31:08Z

Bug description

Since version 1.21.1, I have had a problem extracting data from files that have some content before the header or after EOF. In version 1.21.0 (or any older version) there was no problem. Firefox, for example, has no issue opening the file.

To Reproduce

test.py

#!/usr/bin/env python3

import sys

import fitz as pymupdf


def main(filepath):
    doc = pymupdf.open(filepath)

    first_page = doc.load_page(0).get_text('text')
    last_page = doc.load_page(-1).get_text('text')

    print(f'{first_page=}')
    print(f'{last_page=}')


if __name__ == '__main__':
    main(sys.argv[1])

PDF to download

Example file: test.pdf

How to run:

To reproduce this problem you can run the program above with the following file test.pdf.

$ pip install PyMuPDF==1.21.0
$ python test.py test.pdf
first_page='Hello World\n'
last_page='Hello World\n'
$
$ pip install PyMuPDF==1.21.1
$ python test.py test.pdf
first_page=''
last_page=''

My configuration

5.15.0-60-generic (Ubuntu)
GCC 11.3.0
Python 3.10.6
PyMuPDF 1.21.1, PyPI.

julian-smith-artifex-com · 2023-02-17T22:06:30Z

Thanks for the clear report, I have reproduced the issue.

I suspect it's a change in MuPDF rather than PyMuPDF itself; will see what the MuPDF people think next week.

julian-smith-artifex-com · 2023-02-20T13:38:46Z

It looks like this is not caused by a change in MuPDF after all.

Instead it's caused by PyMuPDF's fix for #2048, where it defaults to clipping to the page mediabox.

Unfortunately, PyMuPDF's text clipping only includes glyphs whose bounding boxes are entirely inside the clip rect. Even though the Hello World text in your PDF looks entirely visible, the font's glyph bounding boxes have the same y0 and y1 regardless of the actual glyph, and they extend slightly below the baseline to accommodate lower-case characters with descenders. The page's mediabox seems to end exactly on the baseline, so the glyphs' bounding boxes are not entirely contained in the mediabox and the glyphs are excluded.

A workaround is to specify an infinite cliprect when calling get_text(), and this fixes your test with PyMuPDF-1.21.1:

    first_page = doc.load_page(0).get_text('text', pymupdf.INFINITE_RECT())
    last_page = doc.load_page(-1).get_text('text', pymupdf.INFINITE_RECT())

[In the next release we might look into supporting 'overlap' semantics as well as, or instead of, the current 'contained' semantics.]
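
For readers following along, the difference between 'contained' and 'overlap' semantics can be sketched in plain Python. This is only an illustration, not PyMuPDF's code: the `Rect` type, the two helper functions, and all coordinates are hypothetical, chosen so that the page box ends on the baseline while the glyph box reaches slightly below it (y grows downward here).

```python
from typing import NamedTuple


class Rect(NamedTuple):
    # Hypothetical stand-in for a rectangle; not the PyMuPDF Rect class.
    x0: float
    y0: float
    x1: float
    y1: float


def contains(clip: Rect, r: Rect) -> bool:
    """'Contained' semantics: r must lie entirely inside clip."""
    return (clip.x0 <= r.x0 and clip.y0 <= r.y0
            and r.x1 <= clip.x1 and r.y1 <= clip.y1)


def overlaps(clip: Rect, r: Rect) -> bool:
    """'Overlap' semantics: r only needs to share some area with clip."""
    return (r.x0 < clip.x1 and r.y0 < clip.y1
            and r.x1 > clip.x0 and r.y1 > clip.y0)


# Assumed numbers: the page box's bottom edge (y1=100) sits on the baseline,
# while the glyph box extends 3 units below it to make room for descenders.
mediabox = Rect(0, 0, 200, 100)
glyph = Rect(10, 88, 18, 103)

print(contains(mediabox, glyph))  # False: the glyph is dropped
print(overlaps(mediabox, glyph))  # True: the glyph is kept
```

With these assumed values the glyph is visibly on the page, yet the 'contained' test rejects it because its box pokes out below the bottom edge.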

Matmaus · 2023-02-20T17:15:23Z

Your workaround works, thanks 👍.

> [In the next release we might look into supporting 'overlap' semantics as well as, or instead of, the current 'contained' semantics.]

It would be nice not to need the workaround, but at least it works now. Also, I would be happy if you used the provided file in your test suite.

…xtracting text. Also fixed Story.draw() to handle exceptions e.g. from fz_draw_story().

…xtracting text. We now include chars that overlap with the clipbox, instead of only those that are entirely contained within the clipbox.

julian-smith-artifex-com · 2023-03-12T12:53:08Z

My tree now uses 'overlap' semantics rather than 'contains', which fixes the problem. [But I haven't yet pushed to GitHub.]

Thanks for the offer to use your file in the test suite, I've done so in my tree.

…xtracting text. We now include chars that overlap with the clipbox, instead of only those that are entirely contained within the clipbox. Note that new fn JM_rects_overlap() still returns true if one of the rects is empty. This allows things to work with ligatures, where component glyphs can have zero width.
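
The note about empty rects can be illustrated with a small sketch. This is not the actual implementation of JM_rects_overlap(); it is one hypothetical way to write an overlap test that only rejects rects lying completely outside each other, so a zero-width rect sitting inside the clip still counts as overlapping. The function names, tuple layout `(x0, y0, x1, y1)`, and coordinates are assumptions made for the example.

```python
def rects_overlap(a, b):
    """Return True unless a and b are disjoint (or merely touch at an edge).

    Rects are (x0, y0, x1, y1) tuples. Because the test only looks for a
    separating edge, a zero-width rect strictly inside b still passes.
    """
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    if ax0 >= bx1 or ay0 >= by1 or ax1 <= bx0 or ay1 <= by0:
        return False
    return True


def intersection_area(a, b):
    """Area shared by a and b; zero for any empty rect."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)


clip = (0, 0, 100, 100)
# Assumed example: a ligature component whose glyph box has zero width.
ligature_part = (40, 10, 40, 22)

print(rects_overlap(ligature_part, clip))      # True
print(intersection_area(ligature_part, clip))  # 0
```

A test based on shared area would discard the zero-width component, which is why the edge-based form is convenient when ligature parts are involved.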

Matmaus changed the title ~~Bug - can not extract data from file in the newest version~~ Bug - can not extract data from file in the newest version 1.21.1 Feb 17, 2023

julian-smith-artifex-com added a commit to ArtifexSoftware/PyMuPDF-julian that referenced this issue Mar 7, 2023

Added test for pymupdf#2238.

0b517cc

julian-smith-artifex-com added a commit to ArtifexSoftware/PyMuPDF-julian that referenced this issue Mar 7, 2023

fitz/: Fix pymupdf#2238 - use 'overlap' rather than 'contains' when e…

d605482

…xtracting text. Also fixed Story.draw() to handle exceptions e.g. from fz_draw_story().

julian-smith-artifex-com added a commit to ArtifexSoftware/PyMuPDF-julian that referenced this issue Mar 13, 2023

Added test for pymupdf#2238.

b945148

julian-smith-artifex-com added a commit to ArtifexSoftware/PyMuPDF-julian that referenced this issue Mar 13, 2023

Added test for pymupdf#2238.

bbc8572

julian-smith-artifex-com added a commit that referenced this issue Mar 14, 2023

Added test for #2238.

cdc2358

julian-smith-artifex-com added the Fixed in next release label Mar 14, 2023

julian-smith-artifex-com removed the Fixed in next release label Apr 14, 2023

julian-smith-artifex-com closed this as completed Apr 14, 2023
