Invalid size of TextPage and bbox with newest version 1.21.0 #2048

Closed
jn-chrn opened this issue Nov 15, 2022 · 11 comments
Labels
upstream bug bug outside this package

Comments

@jn-chrn

jn-chrn commented Nov 15, 2022

Describe the bug

Reading some text from PDF files using textpage.extractDICT() returns invalid dimensions with version 1.21.0

To Reproduce

To reproduce, please use this piece of code which:

  • opens the attached PDF
  • gets a TextPage from the only page of the document
  • computes the size of the page for comparison
  • gets the width and height of the TextPage
    • the size of the TextPage is clearly invalid
  • gets the bbox of the first span inside the first line of the first block
    • the bbox dimensions are clearly invalid
import fitz

document: fitz.Document = fitz.open("crop.pdf")
page = list(document.pages())[0]

page_rect: fitz.Rect = page.rect
text_page = page.get_textpage()
texts_as_dict = text_page.extractDICT()

# The file's size is about 47.4 x 14.0
assert abs(page_rect.width - 47.4) < 0.1
assert abs(page_rect.height - 14.0) < 0.1

# WRONG HERE ALREADY:
# The returned size of the page is '4294967168.0 x 4294967168.0'
assert abs(texts_as_dict["width"] - 47.4) < 0.1
assert abs(texts_as_dict["height"] - 14.0) < 0.1

first_span = texts_as_dict["blocks"][0]["lines"][0]["spans"][0]
bbox = first_span["bbox"]

# The bbox returned with version 1.19.6 is:
# '(29.58..., 2.87..., 35.07..., 10.60...)'
assert bbox[2] < 50  # ERROR: returned value '1044369984.0'
assert bbox[3] < 50  # ERROR: returned value '13269935104.0'

Attached PDF: crop.pdf

Expected behavior

With PyMuPDF version 1.19.6, the extracted bbox was small, as expected for a page of this size. With the newest version (1.21.0), it is far too large (by a factor of about 1e8).

Your configuration

print(sys.version, "\n", sys.platform, "\n", fitz.__doc__)
3.10.6 (main, Oct  7 2022, 20:19:58) [GCC 11.2.0] 
 linux 
 
PyMuPDF 1.21.0: Python bindings for the MuPDF 1.21.0 library.
Version date: 2022-11-08 00:00:01.
Built for Python 3.10 on linux (64-bit).

PyMuPDF was installed using pip install pymupdf.

julian-smith-artifex-com added a commit to ArtifexSoftware/PyMuPDF-julian that referenced this issue Nov 15, 2022
@julian-smith-artifex-com
Collaborator

Thanks for this report and the reproducer.

I've just pushed a change so that get_textpage() (and therefore extractDICT()) defaults to setting the rect to the page's rect, unless a clip rect is explicitly passed in.

This fixes the failure of your test programme, and will be in the next release.

(Note that your test programme fails later on because texts_as_dict["blocks"][0]["lines"][0]["spans"] is empty.)
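
For readers wondering where the reported size of 4294967168.0 comes from: it matches the width of an "infinite" rectangle, which is consistent with a TextPage that was created without any clip or page rect. The following is only a sketch of that arithmetic. It assumes the infinite-rectangle coordinates given in the PyMuPDF documentation; the constant and function names are made up for illustration and are not part of the library.

```python
# Corner values of the "infinite" rectangle as documented for PyMuPDF
# (assumption: these are the values in effect for version 1.21.0).
INF_MIN = -2147483648.0  # -2**31
INF_MAX = 2147483520.0   # 0x7FFFFF80, the largest 32-bit float below 2**31


def rect_size(x0, y0, x1, y1):
    """Return (width, height) of a rectangle given by its corner coordinates."""
    return x1 - x0, y1 - y0


width, height = rect_size(INF_MIN, INF_MIN, INF_MAX, INF_MAX)
print(width, height)  # prints 4294967168.0 4294967168.0
```

Both numbers equal the "width" and "height" values quoted in the reproducer, which fits the explanation that the rect was never limited to the page.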

@jn-chrn
Author

jn-chrn commented Nov 15, 2022

Thank you for the fast fix!

Note that your test programme fails later on because texts_as_dict["blocks"][0]["lines"][0]["spans"] is empty.

I tried again and did not get this issue: there are 3 elements in the list of spans when I run it locally. However, the bboxes of all these spans are also very large.

@jn-chrn jn-chrn closed this as completed Nov 15, 2022
@jn-chrn jn-chrn reopened this Nov 15, 2022
@jn-chrn
Author

jn-chrn commented Nov 15, 2022

Just to make it clear again, there are two issues:

  • at the top level of the dictionary of extracted text (with text_page.extractDICT()), the width and height are invalid
  • at the level of "span" elements, the bbox is invalid on some PDF files we have, and is invalid on the first span in the attached file

@JorjMcKie
Collaborator

@jn-chrn admittedly, this PDF has some very, very unusual specifications and fonts:

  1. the MediaBox does not start at (0,0) but at (1063.9544, 1001.37216). The CropBox is identical to the MediaBox.
  2. the relevant fonts are Type3 with invalid font bboxes, fitz.Rect(0,0,0,0). The critical values for character geometry computations, font.ascender / font.descender, are unusable as well: they equal the max. C float value, which is the direct reason for computing infinite bboxes.

PyMuPDF's get_text("dict",...) method computes span / line / block boundary boxes as the rectangle unions of the single characters contained therein (which is inevitable for technical reasons). So this explains those infinite rectangles.
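
To illustrate why a single bad character is enough: a rectangle union keeps the extreme coordinate of every member, so one oversized glyph box determines the whole span box. This is a minimal pure-Python sketch, not PyMuPDF's actual code. The helper names, the sample coordinates, and the use of the largest finite C float as the bad value are assumptions for illustration.

```python
FLT_MAX = 3.4028234663852886e38  # largest finite C float (assumed bad value)


def union(a, b):
    """Smallest rectangle (x0, y0, x1, y1) that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def span_bbox(char_bboxes):
    """Union of all character boxes in a span (list must not be empty)."""
    result = char_bboxes[0]
    for box in char_bboxes[1:]:
        result = union(result, box)
    return result


# Two well-behaved character boxes (made-up values)
good = [(29.6, 2.9, 31.0, 10.6), (31.0, 2.9, 33.1, 10.6)]
# Hypothetical glyph whose bottom edge was derived from an unusable descender
bad = (33.1, 2.9, 35.1, FLT_MAX)

print(span_bbox(good))          # stays within the sample coordinates
print(span_bbox(good + [bad]))  # bottom edge is now FLT_MAX
```

The same effect propagates upwards, because line and block boxes are unions of their spans and lines.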

The PyMuPDF-specific logic to validate character bboxes can be switched off via fitz.TOOLS.unset_quad_corrections(True) in which case the original MuPDF computations will prevail.
In this case, this remedy won't work either: The bboxes are no longer infinite, but still crazy enough.

Anyway, doing get_text(<any-option>, clip=page.rect) will deliver no text at all.

@JorjMcKie
Collaborator

@jn-chrn - just encountered a spot in the code, where character bbox calculation will go wrong if font ascender / descender take on max C float values - which is the case here.
I am making progress and will be right back once the situation is clarified.

@JorjMcKie
Collaborator

As mentioned before, it's the fault of those peculiar Type3 fonts. Because they deliver nonsense values for data that are required for bbox computation, some ersatz assumptions must be made. The best result I have achieved so far looks like this for your case:
[image]
The block/line/span bbox (black border) has these values (the blue boxes are single characters):

'bbox': (22.474653244018555,
           3.4806418418884277,
           34.903072357177734,
           8.929698944091797),

To achieve this, the script must use fitz.Tools().set_small_glyph_heights(True) to enforce corrective bbox / character quad computations ...

@JorjMcKie JorjMcKie added the upstream bug bug outside this package label Nov 20, 2022
@jn-chrn
Copy link
Author

jn-chrn commented Nov 21, 2022

Regarding the PDF file itself being unusual, it was created from a much larger file using mutool poster, so it may have some remains of the original file.


An important note: no large bbox was there with 1.19.6! But with the latest version (1.21.0), we got many of them.

The following code returns, for the bboxes with a width higher than 10^6:

  • a count of 309 bboxes with version 1.21.0
  • a count of 0 bboxes with version 1.19.6
import fitz

document: fitz.Document = fitz.open(
    "crop.pdf"
)
page = list(document.pages())[0]

page_rect: fitz.Rect = page.rect
text_page = page.get_textpage()
texts_as_dict = text_page.extractDICT()

counter = 0
for block in texts_as_dict["blocks"]:
    for line in block["lines"]:
        direction = line["dir"]
        for span in line["spans"]:
            quad: fitz.Quad = fitz.recover_quad(line_dir=direction, span=span)
            if quad.width > 1e6:
                counter += 1

print(counter)

So this small PDF has many very large bboxes with the latest version, but none with the older version. This issue only started to occur after 1.19.6.

@JorjMcKie
Collaborator

Regarding the PDF file itself being unusual, it was created from a much larger file using mutool poster, so it may have some remains of the original file.

Don't take my comment personally 😉.
You are right, that page obviously is being "cut out" from a much larger one.
There is some problem within the code creating the TextPage (in MuPDF). In the most current version, the Type3 font is no longer interpreted correctly.
This leads to those crazy large bboxes and character widths. I have developed corrective code in PyMuPDF, which delivers reasonable results, when following this coding pattern:

import fitz
import sys

vsn = f"-{sys.version_info[0]}-{sys.version_info[1]}"

# following ensures using PyMuPDF corrections:
fitz.TOOLS.set_small_glyph_heights(True)

doc = fitz.open("crop.pdf")
page = doc[0]
page.clean_contents()  # make sure page.draw_rect() lands in right place

blocks = page.get_text(
    "dict",
    clip=page.rect,  # only look at visible page
    flags=fitz.TEXTFLAGS_TEXT,  # only look at text
)["blocks"]
for b in blocks:
    page.draw_rect(b["bbox"], width=0.2, color=fitz.pdfcolor["green"])
    for l in b["lines"]:
        for s in l["spans"]:
            print(s["text"])
doc.ez_save(f"zdict{vsn}.pdf")

Output:

py testdict.py
km

1.6

And:
[image]
Internally, I also had to change the decision whether a character should be regarded as inside the "clip" from "bbox is completely inside clip" to "character origin is inside clip".
Here, "origin" is the bottom left point of a character (glyph), i.e. where drawing of it starts.
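
The difference between the two clip criteria can be sketched in plain Python. This is only an illustration under assumptions, not the code used inside PyMuPDF: the function names, the inclusive boundary handling, and the sample glyph values are made up, and only the page size is taken from the report.

```python
def bbox_inside(clip, bbox):
    """Old criterion: the character bbox lies completely inside the clip."""
    return (
        clip[0] <= bbox[0]
        and clip[1] <= bbox[1]
        and bbox[2] <= clip[2]
        and bbox[3] <= clip[3]
    )


def origin_inside(clip, origin):
    """New criterion: only the character origin must lie inside the clip."""
    x, y = origin
    return clip[0] <= x <= clip[2] and clip[1] <= y <= clip[3]


clip = (0.0, 0.0, 47.4, 14.0)         # page rect, about 47.4 x 14.0
char_bbox = (29.6, 2.9, 35.1, 1.0e9)  # hypothetical oversized glyph box
char_origin = (29.6, 10.6)            # hypothetical bottom-left origin

print(bbox_inside(clip, char_bbox))      # False: the character would be dropped
print(origin_inside(clip, char_origin))  # True: the character is kept
```

With the bbox criterion, a glyph with a nonsensical box is excluded even though it is visibly on the page, which matches the earlier observation that clip=page.rect delivered no text.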

@JorjMcKie
Collaborator

I have submitted a related bug in MuPDF's issue system.

@jn-chrn
Author

jn-chrn commented Nov 24, 2022

Thanks for the insight, and the fast answer (as always)!


Don't take my comment personally

(I had to defend my poor little stupidly made PDF 😄 )

@julian-smith-artifex-com
Collaborator

Fixed in PyMuPDF-1.21.1.
