BUG: Ambiguous translated references #2558

biredel · 2024-03-29T20:33:36Z

pypdf uses id(obj) to keep track of objects in _id_translated. This identifier is not unique for different objects, only for objects existing at the same time. e.g. In CPython id(obj), is just a memory address. This cannot be used to tell whether we already have referenced a page or not. It might point to a page of a since-deleted PdfReader, silently corrupting out output.

The bug manifests rarely, and the corrupted output document usually looks OK at first glance (but has missing/duplicated content), the following is my best reproducer so far, adapted from https://stackoverflow.com/questions/78186323/

# python3 repro.py < resources/multilang.pdf

import gc, io, sys
from pypdf import PdfReader, PdfWriter

def pdfsize(fin):
    pdfout = PdfWriter()
    pageout = pdfout.add_blank_page(width=10, height=10)
    reader = None
    for i in range(80):
        fin.seek(0)
        del reader
        gc.collect(0)
        reader = PdfReader(fin, strict=True)
        for pagein in reader.pages:
            pageout.merge_page(pagein)
    with io.BytesIO() as fout:
        pdfout.write(fout)
        return fout.tell()

if __name__ == "__main__":
    with io.BytesIO() as fin:
        fin.write(sys.stdin.buffer.read())
        for i in range(50):
            if i == 0:
                first_size = pdfsize(fin)
                first_size = pdfsize(fin)
            current_size = pdfsize(fin)
            if not first_size == current_size:
                raise RuntimeError(
                    f"generated PDF file size changed after repeating {i} times {first_size=} != {current_size=}"
                )

The attached draft replaces dict[id(obj)] with dict[weakref(obj)] - For the simple insert/lookup/delete cases, that behaves the same. Yet no-longer-existing objects are cleared automatically, removing the possible collision.

There is another instance of id(obj) outside of __repr__() methods in PageObject.hash_value_data - I have not determined whether that can cause related problems.

Essentially resubmitting #1995 as a replacement for #1841 which has memory usage side-effects and does not fully avoid #1788 anyway.

id(obj) can and will hit the same value for different objects, if they do not exist at the same time. In CPython, its just a memory address. When that collission occurs, PdfWriter when asked to insert a page from one PdfReader object may silently confuse it with one from a since-deleted PdfReader object, thus corrupting output in not-easily-noticed ways. Replacing dict[id(obj)] with dict[weakref(obj)] should suffice, as we do not replace entries - we only ever insert or delete.

pubpub-zz · 2024-03-29T21:51:40Z

@biredel
can you have a review at #1788.
I'm expecting PdfReader used in merging into PdfWriters should not be deleted from memory through the PR attached.

biredel · 2024-03-29T23:03:02Z

@pubpub-zz Thanks for pointing me there. I only now noticed the almost identical PR #1995 .. suggested before I even ran into the bug.

What was applied instead, #1841 did not resolve the problem because it only touches one out of two (maybe three?) opportunities to get it wrong: PageObject.clone()->PdfObject._reference_clone() and IndirectObject.clone() duplicate that code.

I expect that (costly) workaround can be removed if the weakref fix works.

pubpub-zz · 2024-03-30T07:53:31Z

Also. Have you try first to clone the pages before calling merge_page ?

This reverts commit 8188553.

codecov · 2024-03-30T11:43:30Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.73%. Comparing base (e35df5a) to head (03e8b81).
Report is 3 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2558      +/-   ##
==========================================
+ Coverage   94.71%   94.73%   +0.01%     
==========================================
  Files          50       50              
  Lines        8237     8257      +20     
  Branches     1646     1651       +5     
==========================================
+ Hits         7802     7822      +20     
  Misses        267      267              
  Partials      168      168

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

biredel · 2024-03-30T12:08:32Z

Also. Have you try first to clone the pages before calling merge_page ?

That is how I know both code paths are exposed. Have the page clone() first, then PdfObject._reference_clone with the workaround catches the problem. Call merge_page() right away, then IndirectObject.clone bites you.

pubpub-zz · 2024-03-31T18:37:39Z

@biredel
I've tried your code unsuccessfully : I met no issue.

also there is a trick in the code : _id_translated values are not only Dict[int, int]] but there is also in the dict an element with "PreventGC" that keeps a link to the reader that should prevent the GC to delete it before the writer. This being assigned during the cloning process, it is currently not called when using merge_page

Therefore, looking at your test code I recommend a test calling pageout.merge_page(pagein.clone(pdfout)) to possibly prevent error.

Can you give feedback ?

biredel · 2024-04-01T09:50:58Z

I've tried your code unsuccessfully : I met no issue.

What worked best for me was a low-core-count amd64 virtual machine, on a 3.10 series CPython, high iteration count and reading resources/multilang.pdf.

also there is a trick in the code

I saw that and recommend this is discarded once the reason for keeping that reference is resolved.

Therefore, looking at your test code I recommend a test calling pageout.merge_page(pagein.clone(pdfout)) to possibly prevent error.
Can you give feedback ?

Yes; positive; confirming. I did do that. That is how I know both code paths are exposed. You can trigger the bug via pagein.clone() too, once there is at least one id of since-deleted objects in the dictionary keys. It is much easier to trigger the bug using merge_page only[1] so that is what my sample code does, but one such call suffices to potentially corrupt the output.

[1] I suspect that is because CPython is much more likely to recycle id() if the new object fits into the memory slot of a no longer referenced object.

biredel added 2 commits March 30, 2024 11:34

Python 3.7 (EoL) syntax

1484610

Revert "ROB: Cope with garbage collector during cloning (py-pdf#1841)"

03e8b81

This reverts commit 8188553.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Ambiguous translated references #2558

BUG: Ambiguous translated references #2558

biredel commented Mar 29, 2024 •

edited

pubpub-zz commented Mar 29, 2024

biredel commented Mar 29, 2024 •

edited

pubpub-zz commented Mar 30, 2024

codecov bot commented Mar 30, 2024

biredel commented Mar 30, 2024

pubpub-zz commented Mar 31, 2024

biredel commented Apr 1, 2024

BUG: Ambiguous translated references #2558

Are you sure you want to change the base?

BUG: Ambiguous translated references #2558

Conversation

biredel commented Mar 29, 2024 • edited

pubpub-zz commented Mar 29, 2024

biredel commented Mar 29, 2024 • edited

pubpub-zz commented Mar 30, 2024

codecov bot commented Mar 30, 2024

Codecov Report

biredel commented Mar 30, 2024

pubpub-zz commented Mar 31, 2024

biredel commented Apr 1, 2024

biredel commented Mar 29, 2024 •

edited

biredel commented Mar 29, 2024 •

edited