`PageObject.extract_text`'s `text_visitor` reports a wrong matrix for some text nodes #2513

LukeSerne · 2024-03-10T17:12:29Z

While trying to extract lemmas from a PDF page (attached below), I found that some text "nodes" (I'm not sure what the technical term is, so I'll call them nodes in this issue) are passed to visitor_text with seemingly wrong matrix values.

Environment

$ python -m platform
Linux-6.5.0-21-generic-x86_64-with-glibc2.35
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.1.0, crypt_provider=('cryptography', '3.4.8'), PIL=9.0.1

Code + PDF

This is a minimal, complete example that shows the issue. Observe (using a PDF reader) that the nodes 'ZURRA˓A, KHIRBE' and 'T EL' appear next to each other. Then save the script below (for example as example.py) and run it, passing the path to the attached PDF as the first parameter.

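import sys

import pypdf


def main():
    reader = pypdf.PdfReader(sys.argv[1], strict=True)
    page = reader.pages[0]

    # visitor_text callback: pypdf calls it with the text, the current
    # transformation matrix, the text matrix, the font dictionary and the font size.
    def text_visitor(text, transform, matrix, font_dict, font_size):
        # Only report the two neighbouring nodes of interest.
        if "T EL" in text or "ZURRA˓A, KHIRBE" in text:
            print(f"{text!r} has matrix {matrix}")

    page.extract_text(visitor_text=text_visitor)


if __name__ == "__main__":
    main()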

Observe that the output is:

$ python example.py ./zurra_page.pdf 
'ZURRA˓A, KHIRBE' has matrix [1.0, 0.0, 0.0, 1.0, 50.4, 687.12]
' T EL' has matrix [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]

I expected the last two elements of the matrix for the 'T EL' node to be the x and y position of the node (which pdfbox shows to be 177.92 and 687.12 respectively).
I also noticed that pdfbox seems to indicate that the text in the node is 'T EL', but pypdf reports ' T EL' (note the leading space). Is pypdf mistakenly adding a leading space?
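
As a quick illustration of why the last two elements are expected to hold the position: a PDF matrix [a, b, c, d, e, f] maps a point (x, y) to (a*x + c*y + e, b*x + d*y + f), so the text-space origin lands at (e, f). The sketch below is plain Python and does not use pypdf; the helper name apply_matrix is made up for this example, and it ignores the current transformation matrix, which is the identity in the output above.

```python
def apply_matrix(matrix, x=0.0, y=0.0):
    """Map (x, y) through a PDF-style matrix [a, b, c, d, e, f]."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


# Matrix reported for ' T EL' versus the one expected from the pdfbox coordinates.
reported = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
expected = [1.0, 0.0, 0.0, 1.0, 177.92, 687.12]

print(apply_matrix(reported))  # origin stays at (0.0, 0.0)
print(apply_matrix(expected))  # origin moves to (177.92, 687.12)
```

With the reported identity matrix the node would sit at the page origin, which does not match where the text is drawn.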

Files

The sample PDF used with this is a page from a PDF version of the Anchor Bible Dictionary: zurra_page.pdf

This page in pdfbox's debugger, which clearly shows the coordinates of the T EL node:

Traceback

There is no exception raised, so there also is no traceback.


LukeSerne · 2024-03-16T11:37:56Z

After doing some debugging, I found that the visitor_text function is called from _page.py:1654. Printing tm_matrix just before that call shows [1.0, 0.0, 0.0, 1.0, 177.92, 687.12], which is exactly the value I expected for the matrix argument passed to visitor_text. Logging the values of both tm_matrix and memo_tm at every call to process_operation gives the following output:

tm_matrix=[1.0, 0.0, 0.0, 1.0, 169.34, 702.96]   memo_tm=[1.0, 0.0, 0.0, 1.0, 169.34, 702.96]   at process_operation(b'BDC', ['/P', {'/MCID': 2}])
tm_matrix=[1.0, 0.0, 0.0, 1.0, 169.34, 702.96]   memo_tm=[1.0, 0.0, 0.0, 1.0, 169.34, 702.96]   at process_operation(b'BT', [])
tm_matrix=[1.0, 0.0, 0.0, 1.0, 0.0, 0.0]         memo_tm=[1.0, 0.0, 0.0, 1.0, 0.0, 0.0]         at process_operation(b'Tf', ['/F8', 13.98])
tm_matrix=[1.0, 0.0, 0.0, 1.0, 0.0, 0.0]         memo_tm=[1.0, 0.0, 0.0, 1.0, 0.0, 0.0]         at process_operation(b'Tm', [1, 0, 0, 1, 50.4, 687.12])
tm_matrix=[1.0, 0.0, 0.0, 1.0, 50.4, 687.12]     memo_tm=[1.0, 0.0, 0.0, 1.0, 50.4, 687.12]     at process_operation(b'Tj', [b'\x00=\x008\x005\x005'])
tm_matrix=[1.0, 0.0, 0.0, 1.0, 50.4, 687.12]     memo_tm=[1.0, 0.0, 0.0, 1.0, 50.4, 687.12]     at process_operation(b'Tj', [b'\x00$\x07\xa0\x00$\x00\x0f\x00\x03\x00.\x00+\x00,\x005'])
tm_matrix=[1.0, 0.0, 0.0, 1.0, 50.4, 687.12]     memo_tm=[1.0, 0.0, 0.0, 1.0, 50.4, 687.12]     at process_operation(b'Tj', [b'\x00%\x00('])
tm_matrix=[1.0, 0.0, 0.0, 1.0, 50.4, 687.12]     memo_tm=[1.0, 0.0, 0.0, 1.0, 50.4, 687.12]     at process_operation(b'ET', [])
'ZURRA˓A, KHIRBE' has matrix [1.0, 0.0, 0.0, 1.0, 50.4, 687.12] and transform [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
tm_matrix=[1.0, 0.0, 0.0, 1.0, 50.4, 687.12]     memo_tm=[1.0, 0.0, 0.0, 1.0, 50.4, 687.12]     at process_operation(b'BT', [])
tm_matrix=[1.0, 0.0, 0.0, 1.0, 0.0, 0.0]         memo_tm=[1.0, 0.0, 0.0, 1.0, 0.0, 0.0]         at process_operation(b'Tf', ['/F1', 13.98])
tm_matrix=[1.0, 0.0, 0.0, 1.0, 0.0, 0.0]         memo_tm=[1.0, 0.0, 0.0, 1.0, 0.0, 0.0]         at process_operation(b'Tm', [1, 0, 0, 1, 177.92, 687.12])
tm_matrix=[1.0, 0.0, 0.0, 1.0, 177.92, 687.12]   memo_tm=[1.0, 0.0, 0.0, 1.0, 0.0, 0.0]         at process_operation(b'Tj', [b'T EL'])
tm_matrix=[1.0, 0.0, 0.0, 1.0, 177.92, 687.12]   memo_tm=[1.0, 0.0, 0.0, 1.0, 0.0, 0.0]         at process_operation(b'ET', [])
' T EL' has matrix [1.0, 0.0, 0.0, 1.0, 0.0, 0.0] and transform [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]

We see that for the text node that has the correct text matrix, both tm_matrix and memo_tm are set during the Tm operation. For the node with the incorrect matrix, however, only tm_matrix is set during its Tm operation. This syncing of tm_matrix and memo_tm happens at lines 1798 to 1800 of _page.py:

pypdf/pypdf/_page.py, lines 1798 to 1800 in 6cf47c5:

            if text == "":
                memo_cm = cm_matrix.copy()
                memo_tm = tm_matrix.copy()

The text variable that this condition depends on is the first output of a call to crlf_space_check (imported from _text_extraction/__init__.py), which is unfortunately undocumented. It seems this function uses the difference in position between consecutive text nodes to decide whether to append a space or a newline to the text.

It seems to me that the condition text == "" should be removed, so that the matrix is always copied. Removing that condition does not change the text returned by extract_text. There's probably a good reason why this check is there, but I haven't discovered it.
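
To make the suspected mechanism concrete, here is a toy replay of the operator sequence from the log. It is a simplified model, not pypdf's code: the function replay, the always_sync flag and the "same baseline, further right, so prepend a space" rule are assumptions standing in for process_operation and crlf_space_check.

```python
IDENTITY = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]

# Operator sequence condensed from the log (strings already decoded).
OPS = [
    ("BT", None),
    ("Tm", [1, 0, 0, 1, 50.4, 687.12]),
    ("Tj", "ZURRA˓A, KHIRBE"),
    ("ET", None),
    ("BT", None),
    ("Tm", [1, 0, 0, 1, 177.92, 687.12]),
    ("Tj", "T EL"),
    ("ET", None),
]


def replay(ops, always_sync):
    """Return the (text, memo_tm) pairs a visitor would see in this toy model."""
    tm = IDENTITY.copy()
    memo_tm = IDENTITY.copy()
    prev_tm = None  # text matrix of the most recently shown text
    text = ""
    reported = []
    for operator, operand in ops:
        if operator in ("BT", "ET"):
            if text:
                reported.append((text, memo_tm.copy()))
            text = ""
            if operator == "BT":
                tm = IDENTITY.copy()
            memo_tm = tm.copy()
        elif operator == "Tm":
            tm = [float(v) for v in operand]
            # Assumed stand-in for the spacing check: same baseline and
            # further right than the previous text, so prepend a space.
            if prev_tm is not None and tm[5] == prev_tm[5] and tm[4] > prev_tm[4]:
                text += " "
            # The condition under discussion: sync only while text is empty.
            if always_sync or text == "":
                memo_tm = tm.copy()
        elif operator == "Tj":
            text += operand
            prev_tm = tm.copy()
    return reported


for flag in (False, True):
    print(f"always_sync={flag}")
    for node_text, matrix in replay(OPS, always_sync=flag):
        print(f"  {node_text!r} has matrix {matrix}")
```

In this model, the space added before the second Tj makes text non-empty, so the conditional sync is skipped and ' T EL' is reported with the identity matrix; with always_sync=True it is reported at 177.92, 687.12. Whether unconditional syncing has side effects in the real code is not something this sketch can show.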

Git blame shows that this line was last modified in commit bcd85c4. Reverting to 3.16.2 (the last release before this change) gives the correct output for the example, but it's broken in 3.16.3. Since this commit is the only commit that touched text extraction between 3.16.2 and 3.16.3, I think it's safe to say that this issue is a regression caused by commit bcd85c4.
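
A possible workaround sketch, untested against the attached PDF: record the text matrix in a visitor_operand_before callback and use that instead of the matrix passed to visitor_text. It assumes visitor_operand_before is called with the up-to-date text matrix right before each operator is processed, as the process_operation log above suggests. The class TextPositionTracker and the function collect_nodes are hypothetical names made up for this example.

```python
class TextPositionTracker:
    """Pairs each reported text chunk with the text matrix of its first show operator."""

    SHOW_OPERATORS = {b"Tj", b"TJ"}

    def __init__(self):
        self._pending_tm = None  # text matrix seen at the first Tj/TJ since the last flush
        self.nodes = []          # collected (text, matrix) pairs

    def operand_before(self, operator, operands, cm_matrix, tm_matrix):
        # Remember the text matrix only for the first text-showing operator of a chunk.
        if operator in self.SHOW_OPERATORS and self._pending_tm is None:
            self._pending_tm = list(tm_matrix)

    def visitor_text(self, text, cm_matrix, tm_matrix, font_dict, font_size):
        if text.strip():
            # Fall back to the matrix pypdf passed in if nothing was recorded.
            matrix = self._pending_tm if self._pending_tm is not None else list(tm_matrix)
            self.nodes.append((text, matrix))
        self._pending_tm = None


def collect_nodes(page):
    """Run extract_text on a pypdf page object with both visitors attached."""
    tracker = TextPositionTracker()
    page.extract_text(
        visitor_text=tracker.visitor_text,
        visitor_operand_before=tracker.operand_before,
    )
    return tracker.nodes
```

Calling collect_nodes(reader.pages[0]) would return the collected pairs. The sketch assumes each visitor_text call flushes exactly the text shown since the previous call, which may not hold for every PDF.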

stefan6419846 · 2024-03-16T11:42:15Z

Thanks for the analysis. This appears to be a duplicate of #2353 in this case.

stefan6419846 added the workflow-text-extraction label (from a user's perspective, text extraction is the affected feature/workflow) · Mar 10, 2024
