ValueError: invalid literal for int() with base 16: b'F:' #2598

Open
sureshkvl opened this issue Apr 15, 2024 · 2 comments · May be fixed by #2622
Labels
generic (The generic submodule is affected)
is-robustness-issue (From a user's perspective, this is about robustness)

Comments


sureshkvl commented Apr 15, 2024

pypdf version: 4.2.0
platform: Linux-6.5.0-1018-oem-x86_64-with-glibc2.35
Python: 3.10.12

Traceback:

File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/_page.py", line 2083, in extract_text
    return self._extract_text(
  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/_page.py", line 1804, in _extract_text
    for operands, operator in content.operations:
  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1245, in operations
    self._parse_content_stream(BytesIO(b_(self._data)))
  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1135, in _parse_content_stream
    operands.append(read_object(stream, None, self.forced_encoding))
  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1286, in read_object
    return read_hex_string_from_stream(stream, forced_encoding)
  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/generic/_utils.py", line 29, in read_hex_string_from_stream
    txt += chr(int(x, base=16))
ValueError: invalid literal for int() with base 16: b'F:'


Below is the Python script:

from pypdf import PdfReader

reader = PdfReader("biology/lebo102.pdf")

# Extract and print the text of the first three pages.
for index in range(3):
    page = reader.pages[index]
    print(page.extract_text())

The PDF file is attached: lebo102.pdf
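Until the parser is fixed, one possible stopgap is to catch the error per page so that a single bad page does not abort the whole run. The sketch below is only an illustration: `extract_text_safely` is a hypothetical helper, not part of pypdf, and it assumes nothing about the page objects beyond an `extract_text()` method that may raise `ValueError`, as in the traceback above.

```python
from typing import Iterable, List, Optional


def extract_text_safely(pages: Iterable) -> List[Optional[str]]:
    """Return the text of each page, or None where extraction raised ValueError.

    Hypothetical helper: works with any objects exposing extract_text().
    """
    results: List[Optional[str]] = []
    for number, page in enumerate(pages, start=1):
        try:
            results.append(page.extract_text())
        except ValueError as exc:
            # Report the failing page and keep going with the next one.
            print(f"page {number}: extraction failed: {exc}")
            results.append(None)
    return results


# Possible usage with the script above (requires pypdf and the attached file):
#   from pypdf import PdfReader
#   reader = PdfReader("biology/lebo102.pdf")
#   texts = extract_text_safely(reader.pages)
```

This hides the symptom only; the affected page still yields no text.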

@stefan6419846
Collaborator

The issue is on page 2. Because the parser peeks <F, it treats the corresponding part of the stream as a hexadecimal string. That part actually starts with <F\x00\x00:, and : is not a valid hexadecimal character.

I am not sure where this originates, so further analysis is required.
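To make the failure mode concrete, here is a minimal sketch of pairwise hex-string decoding that reproduces the same error message. It is a simplified stand-in, not pypdf's actual code: `decode_hex_pairs` and `PDF_WHITESPACE` are hypothetical names, and skipping NUL bytes as whitespace is an assumption that would explain why `<F\x00\x00:` ends up as the pair `b'F:'`.

```python
# Bytes treated as whitespace in this sketch (assumption for illustration).
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def decode_hex_pairs(body: bytes) -> str:
    """Decode the body of a hex string (the part between < and >).

    Hypothetical helper: drops whitespace, then converts two hex digits at a time.
    """
    digits = bytes(b for b in body if b not in PDF_WHITESPACE)
    chars = []
    for i in range(0, len(digits), 2):
        pair = digits[i:i + 2]
        # int() accepts bytes; a non-hex character raises ValueError.
        chars.append(chr(int(pair, base=16)))
    return "".join(chars)


print(decode_hex_pairs(b"48656C6C6F"))  # a well-formed hex string body

try:
    decode_hex_pairs(b"F\x00\x00:")  # the bytes reported after the < on page 2
except ValueError as exc:
    print(exc)
```

The second call fails because, once the NUL bytes are skipped, the first pair is `F:`; the data was never a hex string in the first place.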

@pubpub-zz
Collaborator

I've started the analysis: the issue comes from EI and inline image extraction.
I've found an approach in pdf.js to isolate the data.
Work in progress.
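As an illustration of why EI detection is delicate, the sketch below shows one plausible way the symptom could arise: inline image data is raw binary, so the bytes EI can occur inside it, and a naive search stops too early and leaves binary data to be parsed as operands. This is a simplified heuristic written for this thread, not the pdf.js approach and not the fix in the linked pull request; `split_inline_image` and the sample bytes are hypothetical.

```python
import re
from typing import Tuple

# Heuristic: EI preceded by whitespace and followed by whitespace or end of data.
_EI_PATTERN = re.compile(rb"[\x00\t\n\x0c\r ]EI(?=[\x00\t\n\x0c\r ]|$)")


def split_inline_image(after_id: bytes) -> Tuple[bytes, bytes]:
    """Split the bytes following the ID operator into (image_data, remainder).

    Hypothetical helper; binary data can still defeat this heuristic.
    """
    match = _EI_PATTERN.search(after_id)
    if match is None:
        raise ValueError("no EI marker found")
    return after_id[:match.start()], after_id[match.end():]


# Made-up sample: image bytes that happen to contain "EI<F", then the real EI.
sample = b"\x8fEI<F\x00\x00:\xff\nEI\nQ"

# A naive search stops at the first "EI", so parsing would resume at "<F...".
naive_rest = sample[sample.index(b"EI") + 2:]
print(naive_rest[:2])

image, rest = split_inline_image(sample)
print(image, rest)
```

Even the stricter pattern can match inside image data that happens to contain whitespace around EI, which is presumably why a more careful isolation of the data is needed.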

stefan6419846 added the generic and is-robustness-issue labels on Apr 16, 2024
pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue on May 3, 2024
pubpub-zz linked a pull request on May 3, 2024 that will close this issue