/uXXXXX instead of a single character in extracted text for some pdfs #2273

Open
equaeghe opened this issue Oct 27, 2023 · 3 comments
Labels
workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow

Comments

@equaeghe

I am trying to get a somewhat reliable estimate of the number of visual (non-whitespace, non-metadata) characters in pdf files. For this, I use the extract_text function.

I stumbled across a situation where visually identical text gives rise to different character counts. Namely, I have an original LaTeX-produced PDF and a derived version of it that was processed by some Adobe software. After investigating, it turns out that in the derived version, some characters from the original are replaced by /uXXXXX strings. This occurs mainly for math symbols. For example, where the original has 𝛼, the derived version has the string /u1D6FC (and indeed U+1D6FC is the mathematical italic small alpha in Unicode).
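To make the size of the discrepancy concrete, here is a minimal sketch of the kind of count described above. The helper name `count_visible_chars` is hypothetical, and "visual character" is assumed to mean any code point that is not whitespace. The two sample strings are tokens taken from the diff output further down in this report.

```python
def count_visible_chars(text: str) -> int:
    """Count code points that are not whitespace (hypothetical helper)."""
    return sum(1 for ch in text if not ch.isspace())


# Token as extracted from original.pdf: contains U+1D6FC directly.
original_token = "with\U0001D6FC=1,"
# Same token as extracted from derived.pdf: contains the literal escape string.
derived_token = "with/u1D6FC=1,"

print(count_visible_chars(original_token))  # 8
print(count_visible_chars(derived_token))   # 14
```

Each affected symbol therefore inflates the count by six characters, which is why the two files disagree even though they look the same.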

I assume the difference is due to some underlying difference in how the Unicode character is encoded in the two files. I would like to use pypdf to get a somewhat reliable estimate of the number of visual characters, and I think that in this case the correct thing for pypdf to do would be to interpret /u1D6FC as 𝛼 at the appropriate point in its text extraction pipeline, and similarly for all other such Unicode characters.
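Until something like that exists in the library, the strings can be decoded after extraction. The following is a user-side workaround sketch, not pypdf behaviour. It assumes every escape is "/u" followed by exactly five hexadecimal digits, as in the samples in this report; the function name `decode_u_escapes` is hypothetical. A fixed length is needed because a digit may follow the escape directly (for example /u1D4341 is 𝐴 followed by 1).

```python
import re

# Assumption: an escape is "/u" plus exactly five hex digits (e.g. /u1D6FC).
# Five hex digits are always at most 0xFFFFF, so chr() cannot raise here.
_U_ESCAPE = re.compile(r"/u([0-9A-Fa-f]{5})")


def decode_u_escapes(text: str) -> str:
    """Replace /uXXXXX strings with the code point they name (hypothetical helper)."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# Tokens taken from the derived.pdf output shown in this report.
print(decode_u_escapes("/u1D4341/u1D4342Y"))
print(decode_u_escapes("with/u1D6FC=1,"))
```

This is lossy in one direction: text that genuinely contains a slash, a "u", and five hex digits would also be rewritten, so it is only suitable as a rough post-processing step for counting.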

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.1.57-gentoo-a-x86_64-AMD_Ryzen_7_PRO_4750U_with_Radeon_Graphics-with-glibc2.37

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.15.5, crypt_provider=('cryptography', '41.0.4'), PIL=10.0.1

Code + PDF

This is a minimal, complete example that shows the issue:

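import difflib

import pypdf

# Extract the text of the first page from both versions of the document.
original = pypdf.PdfReader("original.pdf").pages[0].extract_text()
derived = pypdf.PdfReader("derived.pdf").pages[0].extract_text()

# Diff the whitespace-separated tokens without context lines.
# lineterm="" stops the header lines from carrying their own newline,
# so joining with "\n" produces no blank lines.
diff = difflib.unified_diff(
    original.split(), derived.split(),
    fromfile="original", tofile="derived", n=0, lineterm="",
)
print("\n".join(diff))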

Output:

--- original
+++ derived
@@ -52 +52 @@
-𝐴1𝐴2Y
+/u1D4341/u1D4342Y
@@ -92,3 +92,3 @@
-for𝐴2,
-with𝛼=1,
-𝛽=0.5,𝑞=25
+for/u1D4342,
+with/u1D6FC=1,
+/u1D6FD=0.5,/u1D45E=25

Test pdfs:

@stefan6419846
Collaborator

This seems to be slightly related to #2038 as well.

@pubpub-zz
Collaborator

@equaeghe
If you open "derived.pdf", copy the sentence with the alpha and beta characters, and paste it, the characters look wrong.
This is not the case with "original.pdf", so the issue lies in the program that did the conversion. Sorry.

@equaeghe
Author

> @equaeghe If you open "derived.pdf", copy the sentence with the alpha and beta characters, and paste it, the characters look wrong. This is not the case with "original.pdf", so the issue lies in the program that did the conversion. Sorry.

Sorry, but I do not understand how copy-pasting using some specific application can be an argument. It just means that the application you are using (which?) deals with this similarly as pypdf. (They both may be doing things correctly or both may have a bug.) I'm assuming it displays the pdf correctly? (I still think it is some decoding issue.)

If I use Okular to view the PDFs and copy-paste a fragment including the alpha and beta, I get:

  • Original:
    with α = 1,
    β = 0.5, q = 25
    
  • Derived:
    with α = 1,
    β = 0.5, q = 25
    

So Okular does what one would expect based on the visual representation of the PDF.

If I use Firefox:

  • Original:
    with 𝛼 = 1,
    𝛽 = 0.5, 𝑞 = 25
    
  • Derived:
    with 𝛼 = 1,
    𝛽 = 0.5, 𝑞 = 25
    

So Firefox does what one would expect based on the visual representation of the PDF, even keeping the italics.

If Okular and Firefox can get the right characters out, so should pypdf.
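The two viewers return different code points for the same glyphs (plain Greek letters versus mathematical italic ones), but the results agree once compatibility normalization is applied. A small sketch using the standard library's `unicodedata.normalize`; the two strings are the pasted fragments above joined onto one line, and the variable names are only illustrative.

```python
import unicodedata

# Fragment as pasted from Firefox: mathematical italic alpha, beta and q.
firefox_text = "with \U0001D6FC = 1, \U0001D6FD = 0.5, \U0001D45E = 25"
# Fragment as pasted from Okular: plain Greek alpha and beta, ASCII q.
okular_text = "with \u03b1 = 1, \u03b2 = 0.5, q = 25"

# NFKC folds the mathematical italic letters onto their plain counterparts.
normalized = unicodedata.normalize("NFKC", firefox_text)

print(normalized == okular_text)             # True
print(len(firefox_text) == len(okular_text))  # True
```

Either form yields the same character count, since each symbol is a single code point in both cases; only the /uXXXXX strings change the total.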

@MartinThoma MartinThoma added the workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow label Oct 28, 2023