PdfReadError: Unexpected end of stream #1090

Closed
MartinThoma opened this issue Jul 10, 2022 · 3 comments · Fixed by #1327
Labels
Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
is-robustness-issue: From a user's perspective, this is about robustness
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow

Comments

@MartinThoma
Member

I wanted to extract text from a PDF, but extract_text() raised PdfReadError: Unexpected end of stream.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-121-generic-x86_64-with-glibc2.31

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.4.2

Code + PDF

The pdf: pdf/5c7a7f24459bcb9700d650062e0ab8bb.pdf

>>> from PyPDF2 import PdfReader
>>> reader = PdfReader('pdf/5c7a7f24459bcb9700d650062e0ab8bb.pdf')
>>> reader.metadata
{'/ModDate': "D:20051220065746-05'00'", '/CreationDate': "D:20051220065728-05'00'", '/Producer': 'Creo Normalizer JTP'}
>>> for page in reader.pages: print(page)
... 
{'/Annots': IndirectObject(8, 0, 139985924129648), '/Contents': [IndirectObject(20, 0, 139985924129648), IndirectObject(21, 0, 139985924129648), IndirectObject(22, 0, 139985924129648), IndirectObject(23, 0, 139985924129648), IndirectObject(24, 0, 139985924129648), IndirectObject(25, 0, 139985924129648), IndirectObject(30, 0, 139985924129648), IndirectObject(31, 0, 139985924129648)], '/Type': '/Page', '/Parent': IndirectObject(1, 0, 139985924129648), '/Rotate': 0, '/MediaBox': [72, 72, 684, 864], '/CropBox': [72, 72, 684, 864], '/BleedBox': [72, 72, 684, 864], '/TrimBox': [72, 72, 684, 864], '/ArtBox': [0, 0, 756, 936], '/Resources': IndirectObject(12, 0, 139985924129648), '/HDAG_Tools': IndirectObject(67, 0, 139985924129648), '/CREO_Tools': IndirectObject(68, 0, 139985924129648), '/CREO_Orientation': 0, '/CREO_ScaleFactor': [1, 1]}
>>> for page in reader.pages: print(page.extract_text())
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1316, in extract_text
    return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1138, in _extract_text
    content = ContentStream(content, pdf, "bytes")
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 1196, in __init__
    self.__parse_content_stream(stream_bytes)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 1212, in __parse_content_stream
    ii = self._read_inline_image(stream)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 1253, in _read_inline_image
    raise PdfReadError("Unexpected end of stream")
PyPDF2.errors.PdfReadError: Unexpected end of stream
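
To see which pages trigger the error rather than aborting on the first one, text can be extracted page by page with the exception caught per page. The following is only a sketch: `extract_text_per_page` is a hypothetical helper, not part of PyPDF2, and it assumes nothing about the pages beyond an `extract_text()` method.

```python
from typing import Iterable, List, Optional, Tuple, Type


def extract_text_per_page(
    pages: Iterable,
    error_type: Type[BaseException] = Exception,
) -> List[Tuple[int, Optional[str], Optional[str]]]:
    """Hypothetical helper: return (page_index, text, error_message) per page.

    Exactly one of text / error_message is None for each entry.
    """
    results = []
    for index, page in enumerate(pages):
        try:
            # Any object exposing extract_text() works here.
            results.append((index, page.extract_text(), None))
        except error_type as exc:
            # Record the failure and keep going with the next page.
            results.append((index, None, str(exc)))
    return results
```

With PyPDF2 this would be called as `extract_text_per_page(reader.pages, PdfReadError)`, with `PdfReadError` imported from `PyPDF2.errors`. It only isolates the failing pages; it does not recover their text.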
@MartinThoma added the workflow-text-extraction, is-robustness-issue, and Has MCVE labels Jul 10, 2022
@pubpub-zz
Collaborator

The PDF has an inline image with an EMC operator between the EI and the Q. PyPDF2 used to detect the end of the image by requiring a Q to follow the EI. The standard does not require this, although the sequence is very common. I've opened a PR that uses [whitespace]EI[whitespace] to detect the end of the image. This stays compatible with EI appearing inside the image data (a test case with such a file exists in test_generic.py).

@MartinThoma
Member Author

Potentially related PR: #332

MartinThoma pushed a commit that referenced this issue Sep 6, 2022
Fix reading of some inline images when operators are inserted between EI and Q
The end of an image is now detected by [whitespace]EI[whitespace] (4 characters should be sufficient)

Fixes #1090
@pubpub-zz
Collaborator

Agreed, @MartinThoma.
PR #1327 only revises the criterion from #740: the test now "replaces" the check for Q with a check for whitespace before EI. The number of bytes checked remains the same.
