PdfReadError: Unexpected end of stream #1090

MartinThoma · 2022-07-10T09:38:12Z

I wanted to extract text from a PDF, but extract_text() raised PdfReadError: Unexpected end of stream.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-121-generic-x86_64-with-glibc2.31

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.4.2

Code + PDF

The pdf: pdf/5c7a7f24459bcb9700d650062e0ab8bb.pdf

>>> from PyPDF2 import PdfReader
>>> reader = PdfReader('pdf/5c7a7f24459bcb9700d650062e0ab8bb.pdf')
>>> reader.metadata
{'/ModDate': "D:20051220065746-05'00'", '/CreationDate': "D:20051220065728-05'00'", '/Producer': 'Creo Normalizer JTP'}
>>> for page in reader.pages: print(page)
... 
{'/Annots': IndirectObject(8, 0, 139985924129648), '/Contents': [IndirectObject(20, 0, 139985924129648), IndirectObject(21, 0, 139985924129648), IndirectObject(22, 0, 139985924129648), IndirectObject(23, 0, 139985924129648), IndirectObject(24, 0, 139985924129648), IndirectObject(25, 0, 139985924129648), IndirectObject(30, 0, 139985924129648), IndirectObject(31, 0, 139985924129648)], '/Type': '/Page', '/Parent': IndirectObject(1, 0, 139985924129648), '/Rotate': 0, '/MediaBox': [72, 72, 684, 864], '/CropBox': [72, 72, 684, 864], '/BleedBox': [72, 72, 684, 864], '/TrimBox': [72, 72, 684, 864], '/ArtBox': [0, 0, 756, 936], '/Resources': IndirectObject(12, 0, 139985924129648), '/HDAG_Tools': IndirectObject(67, 0, 139985924129648), '/CREO_Tools': IndirectObject(68, 0, 139985924129648), '/CREO_Orientation': 0, '/CREO_ScaleFactor': [1, 1]}
>>> for page in reader.pages: print(page.extract_text())
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1316, in extract_text
    return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1138, in _extract_text
    content = ContentStream(content, pdf, "bytes")
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 1196, in __init__
    self.__parse_content_stream(stream_bytes)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 1212, in __parse_content_stream
    ii = self._read_inline_image(stream)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 1253, in _read_inline_image
    raise PdfReadError("Unexpected end of stream")
PyPDF2.errors.PdfReadError: Unexpected end of stream


pubpub-zz · 2022-09-05T20:27:07Z

The PDF has an inline image with an EMC operator between the EI and the Q. PyPDF2 used to detect the end of an inline image by requiring a Q to follow the EI. That check does not conform to the standard, although the EI/Q sequence is very common. I've opened a PR that uses [whitespace]EI[whitespace] to detect the end of the image instead. This is compatible with EI occurring inside the image data (a test case with such a file exists in test_generic.py).
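To illustrate the difference between the two end-of-image criteria described in this comment, here is a minimal sketch. It is not the PyPDF2 implementation: the function names `end_by_q` and `end_by_whitespace`, the helper `_is_ws`, and the sample bytes are hypothetical, and the sketch assumes the input is the raw data that follows the ID operator of an inline image.

```python
# Whitespace bytes as defined for PDF syntax: NUL, tab, LF, FF, CR, space.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def _is_ws(data: bytes, i: int) -> bool:
    """True if index i is inside data and holds a PDF whitespace byte."""
    return 0 <= i < len(data) and data[i] in PDF_WHITESPACE


def end_by_q(data: bytes) -> int:
    """Older criterion (as described in the comment): EI must be followed by Q.

    Returns the index of the terminating EI, or -1 if none qualifies.
    """
    pos = data.find(b"EI")
    while pos != -1:
        if _is_ws(data, pos + 2):
            rest = data[pos + 2:].lstrip(PDF_WHITESPACE)
            # Accept only if the very next token is exactly "Q".
            if rest[:1] == b"Q" and (len(rest) == 1 or rest[1] in PDF_WHITESPACE):
                return pos
        pos = data.find(b"EI", pos + 1)
    return -1


def end_by_whitespace(data: bytes) -> int:
    """Newer criterion: EI delimited by whitespace on both sides.

    Assumption of this sketch: end of data right after EI also counts.
    Returns the index of the terminating EI, or -1 if none qualifies.
    """
    pos = data.find(b"EI")
    while pos != -1:
        after = pos + 2
        if _is_ws(data, pos - 1) and (after == len(data) or _is_ws(data, after)):
            return pos
        pos = data.find(b"EI", pos + 1)
    return -1


# Hypothetical image data: "EI" occurs inside the binary payload, and an
# EMC operator sits between the real EI and the Q, as in the reported PDF.
sample = b"\x12EI\x34\xff EI EMC Q"
common = b"\xff EI Q"
```

On `sample`, the Q-based check finds no terminator, which corresponds to the "Unexpected end of stream" failure, while the whitespace-based check stops at the real EI. On `common`, both agree. The whitespace rule remains a heuristic, since binary image data could in principle contain a whitespace-delimited EI.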

MartinThoma · 2022-09-06T16:57:21Z

Potentially related PR: #332

Fix reading of some images when operations are inserted between EI and Q. The end of an image is now detected with [whitespace]EI[whitespace] (4 characters should be sufficient). Fixes #1090

pubpub-zz · 2022-09-06T19:14:13Z

Agree with you, @MartinThoma.
PR #1327 only revises the criterion from #740: the test now replaces the check for Q with a check for whitespace before EI. The number of bytes checked remains the same.

pubpub-zz mentioned this issue Sep 5, 2022

ROB : fix image extraction #1327

Merged

MartinThoma closed this as completed in #1327 Sep 6, 2022
