Cannot extract text correctly for some CJK fonts #2295

wwaguai · 2023-11-13T07:57:39Z

Hi there, we're trying to use this cool library to extract text for some processing, but extraction fails on the attached PDF. The PDF contains Traditional Chinese characters, but the output looks like random characters.

It looks like this PDF uses a CFF-based font with subtype CIDFontType0C. Is that currently unsupported by pypdf? Let us know if there's anything we can help with. We're not very familiar with the internals but are happy to help out.
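
To check which font subtypes a page actually uses, and whether each font carries a /ToUnicode map, something like the sketch below may help. It is an assumption-laden diagnostic, not part of pypdf: `summarize_font` and `report_fonts` are hypothetical helper names, the sketch only reads standard PDF font dictionary keys (/Subtype, /ToUnicode, /DescendantFonts, /FontDescriptor, /FontFile3), and it does not follow resources inherited from parent page nodes. Whether a missing /ToUnicode map is the cause here is not established in this thread.

```python
def _resolve(obj):
    """Follow an indirect reference if the object has get_object(); pass other values through."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def _name(value):
    """Return a PDF name as a plain string, or None if the key was absent."""
    value = _resolve(value)
    return None if value is None else str(value)


def summarize_font(font):
    """Collect the font dictionary fields that are relevant to text extraction."""
    font = _resolve(font)
    summary = {
        "base_font": _name(font.get("/BaseFont")),
        "subtype": _name(font.get("/Subtype")),
        "has_to_unicode": "/ToUnicode" in font,
        "descendants": [],
    }
    # Composite (Type0) fonts keep the real CIDFont in /DescendantFonts.
    for descendant in _resolve(font.get("/DescendantFonts")) or []:
        descendant = _resolve(descendant)
        descriptor = _resolve(descendant.get("/FontDescriptor")) or {}
        font_file = _resolve(descriptor.get("/FontFile3")) or {}
        summary["descendants"].append({
            "subtype": _name(descendant.get("/Subtype")),
            # For CFF-based CID fonts this is expected to be /CIDFontType0C.
            "font_file3_subtype": _name(font_file.get("/Subtype")),
        })
    return summary


def report_fonts(path):
    """Print a summary for every font listed directly in each page's /Resources."""
    from pypdf import PdfReader  # imported here so summarize_font stays stdlib-only

    reader = PdfReader(path)
    for number, page in enumerate(reader.pages, start=1):
        resources = _resolve(page.get("/Resources")) or {}
        fonts = _resolve(resources.get("/Font")) or {}
        for name, font in fonts.items():
            print(number, name, summarize_font(font))
```

Calling `report_fonts("caibao.pdf")` would print one line per font. Fonts whose summary shows `has_to_unicode` as False, or a `font_file3_subtype` of /CIDFontType0C, are the ones worth looking at first.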

Environment

Which environment were you using when you encountered the problem?

$ python3 -m platform
macOS-13.5-x86_64-i386-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.16.2, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

# Open the attached sample and print the extracted text of every page.
reader = PdfReader("caibao.pdf")
for page in reader.pages:
    print(page.extract_text())

Output:
2ࣨ
˜
ɚཧɚɧϋ
ʬ˜ɧɤ˚ɚཧɚɚϋ
ʬ˜ɧɤ˚ Νˢᜊਗ
ৰ̮�
ϗɝ 299,194 269,505 11%
ˣл 139,022 114,941 21%
л 80,729 67,284 20%
л 53,417 42,963 24%
л 52,009 42,032 24%
ɛ͏࿆ʩ�
 Ñਿ͉ 5.486 4.407 24%
 Ñᛅᑛ 5.334 4.320 23%
л 98,511 73,205 35%
л 70,086 53,684 31%
ɛ͏࿆ʩ�
 Ñਿ͉ 7.393 5.628 31%
 Ñᛅᑛ 7.236 5.516 31%

PDF:
caibao.pdf

MartinThoma · 2023-11-13T09:25:38Z

Interesting. The first check passed: I can copy the text without issues with the Chrome PDF viewer.

pdfium2 gives:

2
未經審核
截至下列日期止六個月
二零二三年
六月三十日
二零二二年
六月三十日 同比變動
（人民幣百萬元，另有指明者除外）
收入 299,194 269,505 11%
毛利 139,022 114,941 21%
經營盈利 80,729 67,284 20%
期內盈利 53,417 42,963 24%
本公司權益持有人應佔盈利 52,009 42,032 24%
每股盈利（每股人民幣元）
－基本 5.486 4.407 24%
－攤薄 5.334 4.320 23%
非國際財務報告準則經營盈利 98,511 73,205 35%
非國際財務報告準則本公司權益持有人應佔盈利 70,086 53,684 31%
非國際財務報告準則每股盈利（每股人民幣元）
－基本 7.393 5.628 31%
－攤薄 7.236 5.516 31%

So it definitely is a shortcoming of pypdf. Thanks for sharing!
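
For a regression test, the difference between the two outputs above could be detected automatically with a rough heuristic: among the non-ASCII characters of the extracted text, measure how many fall into CJK-related Unicode blocks. This is only a sketch. `cjk_ratio` is a hypothetical helper, the block list is a deliberately small selection, and any pass/fail threshold would be an assumption to tune on real files.

```python
# Unicode ranges treated as "expected" in Chinese text (a small, non-exhaustive selection).
_CJK_RANGES = (
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
    (0xFF00, 0xFFEF),  # Halfwidth and Fullwidth Forms
)


def _is_cjk(ch):
    """True if the character lies in one of the ranges above."""
    code = ord(ch)
    return any(low <= code <= high for low, high in _CJK_RANGES)


def cjk_ratio(text):
    """Fraction of non-ASCII, non-whitespace characters that are CJK.

    Returns None when the text has no such characters, so that pure
    ASCII lines (numbers, percentages) are not counted as failures.
    """
    candidates = [ch for ch in text if ord(ch) > 0x7F and not ch.isspace()]
    if not candidates:
        return None
    return sum(_is_cjk(ch) for ch in candidates) / len(candidates)


if __name__ == "__main__":
    # One line from the pdfium output and the matching line from the pypdf output.
    print(cjk_ratio("收入 299,194 269,505 11%"))
    print(cjk_ratio("ϗɝ 299,194 269,505 11%"))
```

A test could then assert that the ratio for text extracted from a known Chinese document stays above some chosen threshold.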

MartinThoma · 2023-11-13T09:28:12Z

@wwaguai Do you own the copyright on caibao.pdf? May I add it to https://github.com/py-pdf/sample-files so that we can use it for testing?

wwaguai · 2023-11-13T09:34:12Z

It's publicly available and can be downloaded from the internet, so I think it can be used there for testing.

stefan6419846 · 2023-11-14T11:21:56Z

There is a difference between publicly available files, which we already use for regular testing, and the files in the sample-files repository, which are subject to a Creative Commons license that you can usually only grant if you are the owner/creator of the file.

wwaguai · 2023-11-15T02:41:07Z

https://static.www.tencent.com/uploads/2023/08/29/1d726a2226130c610975c21480cf1890.PDF
You can probably reproduce it using this file (it's Tencent's financial report, the same source we took the sample from). That said, I don't think it is under a Creative Commons license, and sorry, apparently I'm not its creator.
It can be reproduced with the font MHeiHK-Bold. However, I do not hold the copyright for that font, so I'm not sure whether it can be used for this case. That said, here's a very simple example using it:
caibao2.pdf

wwaguai assigned MartinThoma Nov 13, 2023

MartinThoma removed their assignment Nov 13, 2023

MartinThoma added the workflow-text-extraction (from a user's perspective, text extraction is the affected feature/workflow) and is-cjk-issue (issue related to CJK, Chinese-Japanese-Korean) labels Nov 13, 2023

wwaguai closed this as completed Nov 13, 2023

wwaguai reopened this Nov 13, 2023

stefan6419846 mentioned this issue Dec 12, 2023

extract_text() return garbled characters #2330

Open
