Wrong characters during extract_text with /Differences for font /TJQCZS+FzBookMaker2DlFont #2605

Open
zailushang2006 opened this issue Apr 22, 2024 · 2 comments
Labels
is-cjk-issue Issue related to CJK (Chinese-Japanese-Korean) workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow

zailushang2006 commented Apr 22, 2024

I need to extract text from a PDF document with page.extract_text(), but all of the extracted Chinese characters are garbled. I suspect the cause is that the PDF uses special Chinese fonts such as /TJQCZS+FzBookMaker2DlFont. Stepping through the pypdf source in a debugger, I found that the font's /Font -> /Encoding -> /Differences table maps character codes to non-standard glyph names:

{'/Differences': [35, '/G23', 36, '/G24', 37, '/G25', 38, '/G26', 39, '/G27', 40, '/G28', 41, '/G29', 42, '/G2A', 43, '/G2B', 44, '/G2C', 45, '/G2D', 46, '/G2E', 47, '/G2F', 48, '/G30', 49, '/G31'], '/Type': '/Encoding'}
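To make the structure of that array concrete, here is a minimal sketch in plain Python (no pypdf required) that expands a /Differences list into a table from character code to glyph name. The helper name `parse_differences` is hypothetical. In the array quoted above, every glyph name is just "G" followed by the character code in hexadecimal (35 is 0x23, hence /G23), so the names themselves give no hint about which Chinese character each glyph represents.

```python
def parse_differences(differences):
    """Expand a PDF /Differences array into {character code: glyph name}.

    An integer entry sets the current code; each following name is assigned
    to the current code, which then increases by one.
    """
    mapping = {}
    code = None
    for entry in differences:
        if isinstance(entry, int):
            code = entry
        else:
            if code is None:
                raise ValueError("glyph name appears before any character code")
            mapping[code] = str(entry)
            code += 1
    return mapping


# The array quoted in the issue.
differences = [35, '/G23', 36, '/G24', 37, '/G25', 38, '/G26', 39, '/G27',
               40, '/G28', 41, '/G29', 42, '/G2A', 43, '/G2B', 44, '/G2C',
               45, '/G2D', 46, '/G2E', 47, '/G2F', 48, '/G30', 49, '/G31']

table = parse_differences(differences)

# True when every name is "/G" + the code in uppercase hex, i.e. purely positional.
names_are_positional = all(name == f"/G{code:02X}" for code, name in table.items())
print(table[35], names_are_positional)
```

Because the names are positional rather than standard glyph names, a name-based lookup has nothing to work with for this font.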

I also decoded the embedded font file at /Font -> /FontDescriptor -> /FontFile3 using its declared /Filter (/FlateDecode), but the decoded font data still looks garbled.
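One way to see what a font offers for text extraction is to check which keys its dictionary carries. The sketch below is only an illustration: `summarize_font` is a hypothetical helper, not part of pypdf, and it works on any dict-like font dictionary. Whether the font in this PDF has a /ToUnicode entry is not stated in the issue. Note also that a decompressed /FontFile3 stream is a binary font program, so it is expected to look unreadable when viewed as text.

```python
def summarize_font(font):
    """Report which text-related keys a dict-like PDF font dictionary carries."""
    descriptor = font["/FontDescriptor"] if "/FontDescriptor" in font else {}
    encoding = font["/Encoding"] if "/Encoding" in font else None
    return {
        "base_font": str(font["/BaseFont"]) if "/BaseFont" in font else None,
        # A /ToUnicode CMap is what maps character codes to Unicode text.
        "has_to_unicode": "/ToUnicode" in font,
        # /Encoding may be a plain name (e.g. /WinAnsiEncoding) or a dictionary.
        "has_differences": isinstance(encoding, dict) and "/Differences" in encoding,
        "embedded_font_keys": [
            key for key in ("/FontFile", "/FontFile2", "/FontFile3") if key in descriptor
        ],
    }


# Hand-written stand-in for the font described in the issue (not read from the PDF).
example_font = {
    "/BaseFont": "/TJQCZS+FzBookMaker2DlFont",
    "/Encoding": {"/Type": "/Encoding", "/Differences": [35, "/G23", 36, "/G24"]},
    "/FontDescriptor": {"/FontFile3": b"...binary font program..."},
}
print(summarize_font(example_font))

# With pypdf, the same helper could be applied to real fonts, assuming the
# page carries its own /Resources entry:
#   fonts = PdfReader("GB+15322.2-2019.pdf").pages[3]["/Resources"]["/Font"]
#   for name in fonts:
#       print(name, summarize_font(fonts[name]))
```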

Since Adobe Acrobat can display the text correctly, there must be another way to handle this. I am not very familiar with the structure and protocols of PDF documents, so I am unsure how to resolve this issue.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-10-10.0.19044-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.2.0, crypt_provider=('cryptography', '42.0.2'), PIL=10.2.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

# Path to the attached sample file; adjust to your local copy.
pdf_path = "GB+15322.2-2019.pdf"
reader = PdfReader(pdf_path)

number_of_pages = len(reader.pages)
print(f"Number of pages: {number_of_pages}")

# Only the page at index 3 (the fourth page) is inspected.
page = reader.pages[3]
text = page.extract_text()
print(text[:5000])

Share here the PDF file(s) that cause the issue.
GB+15322.2-2019.pdf

Traceback

This is the complete traceback I see:

page 3 (start 0):

84971221-CBF2-46dc-B435-6ADF2271A1D4

print result:

686E886A-E4B7-4bb5-9BAC-05A609334090

stefan6419846 added the workflow-text-extraction and is-cjk-issue labels Apr 23, 2024
pubpub-zz (Collaborator) commented

The fact that Adobe can display the glyphs (images or drawings) does not mean it can associate them with characters. Copying and pasting from Acrobat Reader, pdf.js (Firefox), or PDFium (Chrome) does not yield usable text either. I strongly doubt there is an easy way to extract the data. The only approach I can see is to render the pages to images and then run OCR on them, which is outside pypdf's capabilities.
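The render-then-OCR route could be sketched as follows. This is an illustration only and sits outside pypdf: it assumes the third-party packages pdf2image (which needs Poppler installed) and pytesseract (which needs Tesseract with the `chi_sim` language data) are available, and both function names are hypothetical.

```python
def ocr_images(images, recognize):
    """Run an OCR callable over rendered page images and join the results."""
    return "\n".join(recognize(image).strip() for image in images)


def ocr_pdf_page(pdf_path, page_number, lang="chi_sim", dpi=300):
    """Render one page (1-based) to an image and run OCR on it.

    The imports are local so this module still loads when the OCR stack
    is not installed.
    """
    from pdf2image import convert_from_path  # third-party, assumed installed
    import pytesseract  # third-party, assumed installed

    images = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
    )
    return ocr_images(
        images, lambda image: pytesseract.image_to_string(image, lang=lang)
    )
```

A call such as `ocr_pdf_page("GB+15322.2-2019.pdf", 4)` would target the page discussed in this issue; OCR quality depends on the rendering resolution and the installed language data.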

stefan6419846 (Collaborator) commented

From what I saw yesterday, pdftotext (Poppler) does produce reasonably valid results for page 4.
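For anyone who wants to reproduce that comparison, a minimal sketch of calling pdftotext from Python is shown below. It assumes the Poppler command-line tools are installed and on the PATH; the helper names are hypothetical, and the exact invocation used in the comment above is not known.

```python
import subprocess


def build_pdftotext_command(pdf_path, page_number):
    """Build a pdftotext call for a single page (1-based), writing to stdout."""
    return [
        "pdftotext",
        "-f", str(page_number),  # first page to convert
        "-l", str(page_number),  # last page to convert
        "-enc", "UTF-8",         # output text encoding
        pdf_path,
        "-",                     # "-" sends the text to stdout
    ]


def pdftotext_page(pdf_path, page_number):
    """Run pdftotext and return the extracted text of one page."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, page_number),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

The output of `pdftotext_page("GB+15322.2-2019.pdf", 4)` can then be compared against `reader.pages[3].extract_text()` from pypdf, since pdftotext counts pages from 1 and pypdf indexes them from 0.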

stefan6419846 changed the title from "extract_text extract text Error. /BaseFont is /TJQCZS+FzBookMaker2DlFont20536874081." to "Wrong characters during extract_text with /Differences for font /TJQCZS+FzBookMaker2DlFont" Apr 24, 2024