/uXXXXX instead of a single character in extracted text for some pdfs #2273

equaeghe · 2023-10-27T09:58:50Z

I am trying to get a somewhat reliable estimate of the number of visual (non-whitespace, non-metadata) characters in pdf files. For this, I use the extract_text function.

I stumbled across a situation where visually identical text gives rise to different character counts. I have an original LaTeX-produced pdf and a derived version of it that was processed by some Adobe software. It turns out that in the text extracted from the derived version, some characters from the original are replaced by /uXXXXX strings. This occurs mainly for math symbols. For example, where the original yields 𝛼, the derived version yields the string /u1D6FC (and U+1D6FC is indeed the italic math alpha in Unicode).

I assume this difference is due to how the Unicode character is encoded in the derived file. Since I want to use pypdf to get a somewhat reliable estimate of the number of visual characters, I think the correct thing for pypdf to do here is to interpret /u1D6FC, at the appropriate point in its text extraction pipeline, as 𝛼, and similarly for all other such Unicode characters.
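Until extraction handles this, the strings can be decoded after the fact. The following is a minimal post-processing sketch, not part of pypdf: the names `decode_u_escapes` and `count_visual_chars` are hypothetical, and the pattern assumes every escape is `/u` followed by exactly five hexadecimal digits, as in the samples in this report. Text that legitimately contains such a sequence would be altered too.

```python
import re

# Assumption: escapes look exactly like the samples in this report,
# i.e. "/u" followed by five hexadecimal digits (e.g. /u1D6FC).
_U_ESCAPE = re.compile(r"/u([0-9A-Fa-f]{5})")


def decode_u_escapes(text: str) -> str:
    """Replace each /uXXXXX string with the code point it names."""

    def _substitute(match: re.Match) -> str:
        # Interpret the five hex digits as a Unicode code point.
        return chr(int(match.group(1), 16))

    return _U_ESCAPE.sub(_substitute, text)


def count_visual_chars(text: str) -> int:
    """Count the non-whitespace characters in extracted text."""
    return sum(1 for ch in text if not ch.isspace())
```

Applied to the output of `extract_text()` for the derived file, `count_visual_chars(decode_u_escapes(text))` should then count each decoded symbol once instead of counting the seven characters of its `/uXXXXX` string.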

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.1.57-gentoo-a-x86_64-AMD_Ryzen_7_PRO_4750U_with_Radeon_Graphics-with-glibc2.37

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.15.5, crypt_provider=('cryptography', '41.0.4'), PIL=10.0.1

Code + PDF

This is a minimal, complete example that shows the issue:

import pypdf
import difflib

original = pypdf.PdfReader("original.pdf").pages[0].extract_text()
derived = pypdf.PdfReader("derived.pdf").pages[0].extract_text()

print(
    "\n".join(
        list(
            difflib.unified_diff(
                original.split(), derived.split(),
                fromfile="original", tofile="derived", n=0
            )
        )
    ).replace("\n\n", "\n")
)

Output:

--- original
+++ derived
@@ -52 +52 @@
-𝐴1𝐴2Y
+/u1D4341/u1D4342Y
@@ -92,3 +92,3 @@
-for𝐴2,
-with𝛼=1,
-𝛽=0.5,𝑞=25
+for/u1D4342,
+with/u1D6FC=1,
+/u1D6FD=0.5,/u1D45E=25

Test pdfs:


stefan6419846 · 2023-10-27T11:54:03Z

This seems to be slightly related to #2038 as well.

pubpub-zz · 2023-10-27T17:26:37Z

@equaeghe
If you open "derived.pdf", copy the sentence containing the alpha and beta characters, and paste it elsewhere, the characters look wrong.
This is not the case with "original.pdf", so the issue lies in the program that does the conversion. Sorry.

equaeghe · 2023-10-27T18:18:53Z

> @equaeghe If you open "derived.pdf" and try to copy the sentence with the alpha,beta characters and paste the characters, they look wrong. this is not true with "original.pdf" the issue is within the program which is doing the conversion. sorry

Sorry, but I do not understand how copy-pasting from one specific application can be an argument. It just means that the application you are using (which one?) handles this the same way pypdf does. (Both may be doing things correctly, or both may have a bug.) I'm assuming it displays the pdf correctly? (I still think it is some decoding issue.)

If I use okular to view the pdfs and copy-paste a fragment including the alpha and beta, I get:

Original:
```
with α = 1,
β = 0.5, q = 25
```
Derived:
```
with α = 1,
β = 0.5, q = 25
```

So okular does what one would expect based on the visual representation of the pdf.

If I use Firefox:

Original:
```
with 𝛼 = 1,
𝛽 = 0.5, 𝑞 = 25
```
Derived:
```
with 𝛼 = 1,
𝛽 = 0.5, 𝑞 = 25
```

So Firefox does what one would expect based on the visual representation of the pdf, even keeping the italics.

If okular and Firefox can get the right characters out, so should pypdf.
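Note that the two viewers return different code points for the same glyphs: okular gives the plain Greek α, while Firefox gives the mathematical italic 𝛼. For counting visual characters this does not matter, and for comparing extractions the difference can be folded away. A small sketch using only the standard library follows; the helper name `visual_key` is hypothetical, and NFKC folding is one possible choice here, not something pypdf does.

```python
import unicodedata


def visual_key(text: str) -> str:
    """Fold compatibility variants (e.g. math italics) and drop whitespace."""
    folded = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in folded if not ch.isspace())


# The fragments pasted above, written with escapes to make the code points explicit.
okular = "with \u03b1 = 1,\n\u03b2 = 0.5, q = 25"
firefox = "with \U0001D6FC = 1,\n\U0001D6FD = 0.5, \U0001D45E = 25"

# Both fragments reduce to the same key, hence the same character count.
print(visual_key(okular) == visual_key(firefox))
print(len(visual_key(okular)), len(visual_key(firefox)))
```

One caveat: NFKC can change the count for other characters, for example by expanding a ligature such as ﬁ into two letters, so it is better suited to comparing extractions than to producing the count itself.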

MartinThoma added the workflow-text-extraction label on Oct 28, 2023
