Cannot extract text correctly for some CJK fonts #2295

Open
wwaguai opened this issue Nov 13, 2023 · 5 comments
Labels
is-cjk-issue: Issue related to CJK (Chinese-Japanese-Korean)
workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow

Comments

wwaguai commented Nov 13, 2023

Hi there, we're trying to use this cool library to extract text for further processing, but extraction fails on the attached PDF. It contains Traditional Chinese characters, but the output looks like random characters.

The PDF appears to use a CFF-based font with the CIDFontType0C subtype. Is that not currently supported by pypdf? Let us know if there's anything we can do to help. We're not very familiar with the internals, but happy to help out.
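To check which font subtypes a page actually uses, the font dictionaries can be inspected directly. The following is a minimal sketch, not pypdf's own diagnostic tooling: `resolve`, `describe_font`, and `list_page_fonts` are hypothetical helper names, and the code assumes the usual PDF layout in which a Type0 font points to one descendant CIDFont whose font descriptor may carry a `/FontFile3` stream with a `/Subtype` such as `/CIDFontType0C`. It also reports whether a `/ToUnicode` entry is present, since text extraction generally depends on `/Encoding` and `/ToUnicode`.

```python
from typing import Any, Dict, Mapping, Optional


def resolve(obj: Any) -> Any:
    """Follow a pypdf indirect reference if present; plain values pass through."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def _name(mapping: Mapping[str, Any], key: str) -> Optional[str]:
    """Return the entry as a string, or None if the key is missing."""
    return str(resolve(mapping[key])) if key in mapping else None


def describe_font(font: Mapping[str, Any]) -> Dict[str, Any]:
    """Summarize the entries of one font dictionary (hypothetical helper)."""
    font = resolve(font)
    info: Dict[str, Any] = {
        "base_font": _name(font, "/BaseFont"),
        "subtype": _name(font, "/Subtype"),
        "encoding": None,
        "has_to_unicode": "/ToUnicode" in font,
        "descendant_subtype": None,
        "font_file_subtype": None,
    }
    if "/Encoding" in font:
        encoding = resolve(font["/Encoding"])
        # /Encoding is either a name (e.g. /Identity-H) or an embedded CMap stream.
        info["encoding"] = str(encoding) if isinstance(encoding, str) else "<embedded CMap>"
    if "/DescendantFonts" in font:
        descendant = resolve(resolve(font["/DescendantFonts"])[0])
        info["descendant_subtype"] = _name(descendant, "/Subtype")
        if "/FontDescriptor" in descendant:
            descriptor = resolve(descendant["/FontDescriptor"])
            if "/FontFile3" in descriptor:
                info["font_file_subtype"] = _name(resolve(descriptor["/FontFile3"]), "/Subtype")
    return info


def list_page_fonts(path: str, page_index: int = 0) -> Dict[str, Dict[str, Any]]:
    """Describe every font listed directly in one page's /Resources."""
    from pypdf import PdfReader  # lazy import: describe_font itself needs no pypdf

    page = PdfReader(path).pages[page_index]
    if "/Resources" not in page:  # assumption: resources are not inherited
        return {}
    resources = resolve(page["/Resources"])
    if "/Font" not in resources:
        return {}
    fonts = resolve(resources["/Font"])
    return {str(name): describe_font(fonts[name]) for name in fonts}
```

Calling `list_page_fonts("caibao.pdf")` would then show, per font resource, the base font name, the encoding, and whether the embedded font program is reported as `/CIDFontType0C`.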

Environment

Which environment were you using when you encountered the problem?

$ python3 -m platform
macOS-13.5-x86_64-i386-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.16.2, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader("caibao.pdf")
# Print the extracted text of every page
for page in reader.pages:
    print(page.extract_text())

Output:
2ࣨ
˜
ɚཧɚɧϋ
ʬ˜ɧɤ˚ɚཧɚɚϋ
ʬ˜ɧɤ˚ Νˢᜊਗ
ৰ̮ϗɝ 299,194 269,505 11%
ˣл 139,022 114,941 21%
л 80,729 67,284 20%
л 53,417 42,963 24%
л 52,009 42,032 24%
ɛ͏࿆ʩÑਿ͉ 5.486 4.407 24%
 Ñᛅᑛ 5.334 4.320 23%
л 98,511 73,205 35%
л 70,086 53,684 31%
ɛ͏࿆ʩÑਿ͉ 7.393 5.628 31%
 Ñᛅᑛ 7.236 5.516 31%

PDF:
caibao.pdf

@MartinThoma MartinThoma removed their assignment Nov 13, 2023
@MartinThoma (Member)

Interesting. The first check passed: I can copy the text without issues with the Chrome PDF viewer.

pdfium2 gives:

2
未經審核
截至下列日期止六個月
二零二三年
六月三十日
二零二二年
六月三十日 同比變動
(人民幣百萬元,另有指明者除外)
收入 299,194 269,505 11%
毛利 139,022 114,941 21%
經營盈利 80,729 67,284 20%
期內盈利 53,417 42,963 24%
本公司權益持有人應佔盈利 52,009 42,032 24%
每股盈利(每股人民幣元)
-基本 5.486 4.407 24%
-攤薄 5.334 4.320 23%
非國際財務報告準則經營盈利 98,511 73,205 35%
非國際財務報告準則本公司權益持有人應佔盈利 70,086 53,684 31%
非國際財務報告準則每股盈利(每股人民幣元)
-基本 7.393 5.628 31%
-攤薄 7.236 5.516 31%

So it definitely is a shortcoming of pypdf. Thanks for sharing!
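The difference between the two outputs can also be flagged programmatically. The sketch below is a rough heuristic, not part of pypdf: `looks_garbled` is a hypothetical helper, and the 0.5 threshold is an arbitrary assumption. It counts how many of the non-ASCII characters in the extracted text are CJK ideographs. For a document known to be Chinese, a low share suggests the character codes were mapped to the wrong Unicode values.

```python
def is_cjk_ideograph(ch: str) -> bool:
    """True for CJK Unified Ideographs and Extension A (covers common Chinese text)."""
    code = ord(ch)
    return 0x4E00 <= code <= 0x9FFF or 0x3400 <= code <= 0x4DBF


def looks_garbled(text: str, min_cjk_share: float = 0.5) -> bool:
    """Heuristic: in text expected to be Chinese, most non-ASCII characters
    should be ideographs. The threshold is an assumption, not a tuned value."""
    non_ascii = [ch for ch in text if ord(ch) > 127 and not ch.isspace()]
    if not non_ascii:
        return False  # nothing but ASCII, so there is nothing to judge
    cjk = sum(1 for ch in non_ascii if is_cjk_ideograph(ch))
    return cjk / len(non_ascii) < min_cjk_share
```

Applied to the outputs above, a line from the pypdf result such as `ʬ˜ɧɤ˚ɚཧɚɚϋ` contains no ideographs at all, while the corresponding pdfium2 line consists entirely of them. A check like this could serve as a regression test once a freely licensed sample file is available.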

@MartinThoma MartinThoma added the workflow-text-extraction and is-cjk-issue labels Nov 13, 2023
@MartinThoma (Member)

@wwaguai Do you own the copyright on caibao.pdf? May I add it to https://github.com/py-pdf/sample-files so that we can use it for testing?

@wwaguai wwaguai closed this as completed Nov 13, 2023

wwaguai commented Nov 13, 2023

> @wwaguai Do you own the copyright on caibao.pdf? May I add it to https://github.com/py-pdf/sample-files so that we can use it for testing?

It's publicly available and can be downloaded from the internet, so I think it can be used there for testing.

@wwaguai wwaguai reopened this Nov 13, 2023
@stefan6419846 (Collaborator)

> > @wwaguai Do you own the copyright on caibao.pdf? May I add it to https://github.com/py-pdf/sample-files so that we can use it for testing?
>
> it's publicly available that you can download from internet, I think it can be used there for testing

There is a difference between publicly available files, which we already use for regular testing, and the files in the sample-files repository. The latter are subject to a Creative Commons license, which you can usually only grant if you are the owner/creator of the file.


wwaguai commented Nov 15, 2023

> > > @wwaguai Do you own the copyright on caibao.pdf? May I add it to https://github.com/py-pdf/sample-files so that we can use it for testing?
> >
> > it's publicly available that you can download from internet, I think it can be used there for testing
>
> There is a difference between publicly available files which we are already using for regular testing and the files from the sample-files repository, which are subject to a Creative Commons license you usually can provide if you are the owner/creator of the file only.

https://static.www.tencent.com/uploads/2023/08/29/1d726a2226130c610975c21480cf1890.PDF
You can probably reproduce it with this file (it's Tencent's financial report, the same source as our sample). That said, I don't think it's under a Creative Commons license, and sorry, apparently I'm not its creator.
It can be reproduced with the font MHeiHK-Bold. However, I don't hold the copyright for that font, so I'm not sure whether it can be used for this case. That said, here's a very simple example using it:
caibao2.pdf
