Cannot extract text correctly for some CJK fonts #2295
Comments
Interesting. The first check passed: I can copy the text without issues with the Chrome PDF viewer.
So it definitely is a shortcoming of pypdf. Thanks for sharing!
@wwaguai Do you own the copyright on caibao.pdf? May I add it to https://github.com/py-pdf/sample-files so that we can use it for testing?
It's publicly available and can be downloaded from the internet, so I think it can be used there for testing.
There is a difference between publicly available files which we are already using for regular testing and the files from the
https://static.www.tencent.com/uploads/2023/08/29/1d726a2226130c610975c21480cf1890.PDF
Hi there, we're trying to use this cool library to extract text for some processing, but it fails on the attached PDF. The PDF contains Traditional Chinese characters, but the extracted output looks like random characters.
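For anyone who wants to check the symptom on their own files, here is a rough sketch, not the reporter's original script. It extracts text with pypdf's `PdfReader` and `extract_text()`, then measures how much of the result is actually CJK. The helper names `cjk_ratio` and `extract_and_check` are made up for illustration, and the check only covers the main CJK Unified Ideographs blocks, which is an assumption that may not fit every document.

```python
def cjk_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(
        1
        for c in chars
        # CJK Unified Ideographs and Extension A only (an illustrative choice).
        if 0x4E00 <= ord(c) <= 0x9FFF or 0x3400 <= ord(c) <= 0x4DBF
    )
    return cjk / len(chars)


def extract_and_check(path: str) -> list[float]:
    """Extract text from every page with pypdf and return each page's CJK ratio."""
    # Imported lazily so cjk_ratio can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return [cjk_ratio(page.extract_text() or "") for page in reader.pages]
```

A ratio near zero on a page that visibly contains Chinese text would suggest the character codes were not mapped back to Unicode correctly, for example `extract_and_check("caibao.pdf")`.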
It looks like this PDF uses CFF-based fonts with CIDFontType0C as the subtype. Is that not currently supported by pypdf? Let us know if there's anything we can do to help. We're not very familiar with the internals, but happy to help out.
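One way to confirm the font type is to walk the page's font resources. The sketch below is an assumption about how one might do that, not something taken from pypdf's own code. For a composite (Type0) font it reads the first entry of `/DescendantFonts`, then the `/FontDescriptor`, then the `/FontFile3` stream, whose `/Subtype` is `/CIDFontType0C` for an embedded CFF-based CID font. `describe_fonts` and `_resolve` are hypothetical helper names, and the function accepts any dict-like font dictionary so it can be tried without a PDF.

```python
def _resolve(obj):
    """Follow a pypdf indirect reference if present; plain objects pass through."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def describe_fonts(font_dict) -> dict:
    """Map each font name to (font subtype, descendant subtype, font program subtype)."""
    result = {}
    for name, ref in font_dict.items():
        font = _resolve(ref)
        descendant_subtype = None
        program_subtype = None
        # Only composite (Type0) fonts carry /DescendantFonts.
        descendants = _resolve(font.get("/DescendantFonts")) or []
        if descendants:
            desc = _resolve(descendants[0])
            descendant_subtype = desc.get("/Subtype")
            descriptor = _resolve(desc.get("/FontDescriptor")) or {}
            # /FontFile3 holds the embedded font program for CFF-based fonts.
            font_file = _resolve(descriptor.get("/FontFile3"))
            if font_file is not None:
                program_subtype = font_file.get("/Subtype")
        result[name] = (font.get("/Subtype"), descendant_subtype, program_subtype)
    return result


# Possible usage with pypdf (assumes the page defines /Resources and /Font directly):
#   from pypdf import PdfReader
#   page = PdfReader("caibao.pdf").pages[0]
#   print(describe_fonts(page["/Resources"]["/Font"]))
```

Resolving every value through `_resolve` keeps the walk working whether an entry is stored inline or as an indirect reference.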
Environment
Which environment were you using when you encountered the problem?
Code + PDF
This is a minimal, complete example that shows the issue:
PDF:
caibao.pdf