I am trying to get a somewhat reliable estimate of the number of visual (non-whitespace, non-metadata) characters in PDF files. For this, I use the extract_text function.
I stumbled across a situation where visually identical text gives rise to different character counts. Namely, I have an original LaTeX-produced PDF and a derived version of it that was processed by some Adobe software. After investigating, it turns out that in the text extracted from the derived version, some characters from the original are replaced by /uXXXXX strings. This occurs mainly for math symbols. For example, where the original has 𝛼, the derived version has the string /u1D6FC (where indeed U+1D6FC is the mathematical italic small alpha in Unicode).
I assume the above difference is due to some underlying difference in how the Unicode character is encoded in the two files. Since I want to use pypdf to get a somewhat reliable estimate of the number of visual characters, I think the correct thing for pypdf to do in this case would be to interpret /u1D6FC as 𝛼 at the appropriate point in its text extraction pipeline, and similarly for all other such Unicode characters.
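Until the extraction itself handles this, one workaround is to post-process the extracted text. The sketch below is only an illustration: it assumes the tokens appear in the output exactly as "/u" followed by 4 to 6 uppercase hex digits, and the helper names decode_u_tokens and count_visual_chars, as well as the sample strings, are hypothetical and not part of pypdf. The match is greedy, so a token directly followed by a hex digit belonging to the real text would be misread.

```python
import re

# Assumption: tokens look like "/u1D6FC" (slash, "u", 4-6 uppercase hex digits).
_U_TOKEN = re.compile(r"/u([0-9A-F]{4,6})")


def decode_u_tokens(text: str) -> str:
    """Replace /uXXXX../uXXXXXX tokens by the Unicode character they name."""

    def _replace(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        # Leave the token untouched if it is not a valid Unicode scalar value.
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)

    return _U_TOKEN.sub(_replace, text)


def count_visual_chars(text: str) -> int:
    """Count non-whitespace characters after decoding /u tokens."""
    return sum(1 for ch in decode_u_tokens(text) if not ch.isspace())


# Hypothetical stand-ins for the extract_text() output of the two files.
original = "with \U0001D6FC = 1,\n\U0001D6FD = 0.5, \U0001D45E = 25"
derived = "with /u1D6FC = 1,\n/u1D6FD = 0.5, /u1D45E = 25"

print(count_visual_chars(original), count_visual_chars(derived))
```

With these sample strings, both counts agree once the tokens are decoded. In real use, the input would be the string returned by extract_text for each page.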
Environment
Which environment were you using when you encountered the problem?
@equaeghe
If you open "derived.pdf", copy the sentence with the alpha and beta characters, and paste it elsewhere, the characters look wrong.
This is not the case with "original.pdf", so the issue lies in the program that did the conversion. Sorry.
Sorry, but I do not understand how copy-pasting in some specific application can be an argument. It just means that the application you are using (which one?) handles this the same way pypdf does. (Both may be doing things correctly, or both may have a bug.) I'm assuming it displays the PDF correctly? (I still think it is some decoding issue.)
If I use Okular to view the PDFs and copy-paste a fragment including the alpha and beta, I get:
Original:
with α = 1,
β = 0.5, q = 25
Derived:
with α = 1,
β = 0.5, q = 25
So Okular does what one would expect based on the visual representation of the PDF.
If I use Firefox:
Original:
with 𝛼 = 1,
𝛽 = 0.5, 𝑞 = 25
Derived:
with 𝛼 = 1,
𝛽 = 0.5, 𝑞 = 25
So Firefox does what one would expect based on the visual representation of the PDF, even keeping the italics.
If Okular and Firefox can get the right characters out, so should pypdf.
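The two viewers return different code points for the same glyphs: Okular gives the plain Greek letters (α is U+03B1), while Firefox gives the mathematical italic ones (𝛼 is U+1D6FC). A minimal sketch of how to check that the two results differ only by this styling, using Unicode NFKC normalization from the Python standard library (the sample strings are transcribed from the pasted fragments above; the helper name fold_math_styles is hypothetical):

```python
import unicodedata

# Fragments as pasted from the two viewers, written with escapes for clarity.
okular = "with \u03b1 = 1,\n\u03b2 = 0.5, q = 25"
firefox = "with \U0001D6FC = 1,\n\U0001D6FD = 0.5, \U0001D45E = 25"


def fold_math_styles(text: str) -> str:
    """Map styled math letters (e.g. U+1D6FC) to their plain counterparts."""
    return unicodedata.normalize("NFKC", text)


print(fold_math_styles(firefox) == okular)
```

For counting purposes this step is optional, since each styled letter is already a single code point. Note that NFKC also expands other compatibility characters (for example, the "ﬁ" ligature becomes the two letters "fi"), so applying it before counting can change the result.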
Code + PDF
This is a minimal, complete example that shows the issue:
Output:
Test pdfs: