ValueError: invalid literal for int() with base 16: b'F:' #2598

sureshkvl · 2024-04-15T17:24:36Z

pypdf version: 4.2.0
platform: Linux-6.5.0-1018-oem-x86_64-with-glibc2.35
Python: 3.10.12

Traceback:

  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/_page.py", line 2083, in extract_text
    return self._extract_text(
  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/_page.py", line 1804, in _extract_text
    for operands, operator in content.operations:
  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1245, in operations
    self._parse_content_stream(BytesIO(b_(self._data)))
  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1135, in _parse_content_stream
    operands.append(read_object(stream, None, self.forced_encoding))
  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1286, in read_object
    return read_hex_string_from_stream(stream, forced_encoding)
  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/generic/_utils.py", line 29, in read_hex_string_from_stream
    txt += chr(int(x, base=16))
ValueError: invalid literal for int() with base 16: b'F:'

Below is the Python script:

from pypdf import PdfReader

reader = PdfReader("biology/lebo102.pdf")

# Extract and print the text of the first three pages.
for index in range(3):
    page = reader.pages[index]
    print(page.extract_text())
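
To narrow down which page triggers the error, the script above can be wrapped in a small diagnostic loop. This is only a sketch: `find_failing_pages` is a hypothetical helper name, not part of pypdf, and it assumes nothing about the pages beyond an `extract_text()` method.

```python
from typing import Iterable, List, Tuple


def find_failing_pages(pages: Iterable) -> List[Tuple[int, Exception]]:
    """Call extract_text() on every page and collect (index, exception) for failures.

    Hypothetical helper for diagnosis only; it is not a pypdf API.
    """
    failures = []
    for index, page in enumerate(pages):
        try:
            page.extract_text()
        except Exception as exc:  # deliberately broad: we only want to locate the page
            failures.append((index, exc))
    return failures
```

With pypdf installed, calling `find_failing_pages(PdfReader("biology/lebo102.pdf").pages)` should return the zero-based indices of the pages that raise, together with the exception objects.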

The PDF file is attached:
lebo102.pdf


stefan6419846 · 2024-04-16T08:50:19Z

The issue is on page 2. Because the parser peeks at <F, it treats the corresponding part of the stream as a hexadecimal string, but the data actually starts with <F\x00\x00:, and : is not a valid hexadecimal character.

I am not sure where this actually originates from, so further analysis is required.
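
The failure mode can be reproduced in isolation with a simplified hex-string reader. This is a sketch, not pypdf's actual code: `read_hex_string` and `WHITESPACE` are illustrative names, and it assumes the reader skips PDF whitespace (which includes the NUL byte) between hex digits, which would explain why the reported literal is b'F:' rather than b'F\x00'.

```python
from io import BytesIO

# PDF whitespace bytes: space, LF, CR, HT, FF and NUL.
WHITESPACE = b" \n\r\t\x0c\x00"


def read_hex_string(stream: BytesIO) -> str:
    """Simplified stand-in for a PDF hex-string reader (illustrative only)."""
    if stream.read(1) != b"<":
        raise ValueError("hex string must start with '<'")
    text = ""
    pair = b""
    while True:
        tok = stream.read(1)
        if not tok:
            raise ValueError("stream ended before '>'")
        if tok in WHITESPACE:
            continue  # whitespace between hex digits is ignored
        if tok == b">":
            break
        pair += tok
        if len(pair) == 2:
            # int() raises ValueError here if either byte is not a hex digit
            text += chr(int(pair, base=16))
            pair = b""
    if len(pair) == 1:
        # an odd trailing digit is padded with 0
        text += chr(int(pair + b"0", base=16))
    return text
```

Feeding it `BytesIO(b"<48 65>")` yields `"He"`, while `BytesIO(b"<F\x00\x00:abc>")` skips the two NUL bytes, pairs `F` with `:` and raises the same `ValueError` as in the traceback.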

pubpub-zz · 2024-04-16T11:15:13Z

I've started the analysis: the issue comes from the handling of EI during inline image extraction.
I found an approach in pdf.js for isolating the image data.
Work in progress.
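
To illustrate why EI detection matters: binary inline image data can itself contain the bytes EI, so stopping at the first match may leave image bytes (such as <F\x00\x00:) to be parsed as operands. The heuristic below is a rough sketch only. It is neither the pdf.js algorithm nor what pypdf implements; `find_inline_image_end` and its `lookahead` parameter are hypothetical. It accepts an EI only if it is surrounded by whitespace and the next few bytes look like plain-text operators.

```python
# PDF whitespace bytes: space, LF, CR, HT, FF and NUL.
WHITESPACE = b" \n\r\t\x0c\x00"


def find_inline_image_end(data: bytes, start: int = 0, lookahead: int = 10) -> int:
    """Return the index of the 'EI' that plausibly ends the image data, or -1.

    Illustrative heuristic only; real parsers apply more careful checks.
    """
    pos = start
    while True:
        pos = data.find(b"EI", pos)
        if pos == -1:
            return -1
        before_ok = pos > 0 and data[pos - 1] in WHITESPACE
        after = data[pos + 2 : pos + 2 + lookahead]
        after_ok = after == b"" or after[0] in WHITESPACE
        # Content following a real EI should be printable operators, not binary.
        looks_textual = all(b in WHITESPACE or 0x20 <= b < 0x7F for b in after)
        if before_ok and after_ok and looks_textual:
            return pos
        pos += 1  # false positive inside the image data; keep searching
```

A heuristic like this can still misfire on unusual image data, which is presumably why a more robust approach was being investigated.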

stefan6419846 added the labels generic (The generic submodule is affected) and is-robustness-issue (From a user's perspective, this is about robustness) on Apr 16, 2024

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue May 3, 2024

ROB: improve inline image extraction

b449664

closes py-pdf#2598

pubpub-zz linked a pull request May 3, 2024 that will close this issue

ROB: improve inline image extraction #2622

Draft
