Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Each page merge will add another graphics state isolation call #2219

Closed
stefan6419846 opened this issue Sep 26, 2023 · 4 comments
Closed

Each page merge will add another graphics state isolation call #2219

stefan6419846 opened this issue Sep 26, 2023 · 4 comments

Comments

@stefan6419846
Copy link
Collaborator

Merging a page multiple times will isolate the graphics state multiple times instead of re-using the existing wrapping. This is unecessary and probably should be avoided by checking for an isolated graphics state before doing any wrapping. (This has been reported as a possible enhancement in #2203 as well, but extracted to a dedicated issue to allow easier tracking.)

Caching might be useful to avoid doing "expensive" checks each time.

See https://github.com/pymupdf/PyMuPDF/blob/1c2f1da3eb2541f1c8dbd50acb3a916939c99d3e/src/__init__.py#L9141-L9146 and https://github.com/pymupdf/PyMuPDF/blob/1c2f1da3eb2541f1c8dbd50acb3a916939c99d3e/src/__init__.py#L8864-L8876 for an examle implementation in PyMuPDF.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.2.0-10022-tuxedo-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.16.2, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader, PdfWriter


watermark = PdfReader('watermark.pdf').pages[0]

writer = PdfWriter(clone_from='file.pdf')
for page in writer.pages:
    page.merge_page(watermark)
    page.merge_page(watermark)

This happens for all PDF files.

Please note that the above example code is minimal, although using the same watermark twice is rather useless in production, but shows the issue anyway. Imagine you want to use two different backgrounds and/or overlays on the same page, which would trigger exactly the same.

Traceback

Suppose we start with self._data of the page being q\n 841.680[...]0 Do\nQ\n\n \n. Running the above example leads to self._data being q\nq\nq\n 841[...]ET\nQ\nQ\n\nQ\n, thus we have isolated the graphics state three times, although once should be enough.

@pubpub-zz
Copy link
Collaborator

pubpub-zz commented Sep 28, 2023

@stefan6419846
can you indicate what is the drawback of having add these isolations? Currently we are doing no checks, so nothing to optimize.
From my analysis an improvement would be to not convert the text into operations list. I've started (still in progress) to propose a PR #2226

@stefan6419846
Copy link
Collaborator Author

stefan6419846 commented Sep 28, 2023

can you indicate what is the drawback of having add these isolations?

It just bloats the PDF file unnecessarily when doing lots of page merges, for example by adding multiple overlays, without any real usage.

Currently we are doing no checks, so nothing to optimize.

I do not understand this. Where am I talking about optimizing checks? Because I propose to cache the isolation state?

From my analysis an improvement would be to not convert the text into operations list. I've started (still in progress) to propose a PR #2226

At least for the examples I have seen, we already have a string in isolate_graphics_state most of the time. #2226 seems to be an exact copy of the original PR, so there is nothing I am able to evaluate.

@pubpub-zz
Copy link
Collaborator

Oups forgot some stash

@MartinThoma
Copy link
Member

This issue was closed as no user stumbled over it so far.

There was a PR solving it ( #2224 ), but we had concerns regarding special cases (split streams). For this reason, we decided to close the PR and the issue until somebody actually stumbles over it.

@MartinThoma MartinThoma closed this as not planned Won't fix, can't repro, duplicate, stale Dec 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants