extract_text() return garbled characters #2330

Open
ChanghaoLau opened this issue Dec 7, 2023 · 6 comments
Labels
Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow

Comments

@ChanghaoLau

ChanghaoLau commented Dec 7, 2023

I get garbled characters when extracting text from a PDF file. The file I use is linked below. Could this be an encoding issue?

Environment

$ python -m platform
Linux-4.18.0-147.5.1.6.h841.eulerosv2r9.x86_64-x86_64-with-glibc2.17

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.1, crypt_provider=('pycryptodome', '3.19.0'), PIL=10.0.1

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

file_path = '20120812.pdf'
page_idx = 0

reader = PdfReader(file_path)
page = reader.pages[page_idx]
text = page.extract_text()
print(text)

The PDF file can be obtained from this URL.

The output is:

2012୍8ᄅ ACTA AUTOMATICA SINICA August, 2012
م
ᇛ ਟ1ࡹ1ྷ ೦2ᅦ ม1
ᅋေم, ྛऊো ,ۋ, ০Ⴈ
......
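
One way to flag output like this automatically is to tally which scripts the extracted characters belong to. The sketch below is a crude heuristic and not part of pypdf: the helper name `script_histogram` is made up for illustration, and it uses the first word of each character's Unicode name as a rough stand-in for its script.

```python
import unicodedata
from collections import Counter


def script_histogram(text: str) -> Counter:
    """Count characters by the first word of their Unicode name (rough script proxy)."""
    counts = Counter()
    for ch in text:
        if ch.isspace():
            continue
        # Characters without a Unicode name (e.g. control codes) yield "".
        name = unicodedata.name(ch, "")
        counts[name.split(" ")[0] if name else "UNKNOWN"] += 1
    return counts


sample = "AB\u0645"  # two Latin letters and one Arabic letter
print(script_histogram(sample))
```

For a Chinese-language paper one would expect the counts to be dominated by a few buckets such as `CJK`, `LATIN` and `DIGIT`; a spread over several unrelated scripts, as in the output above, suggests the bytes were decoded with the wrong mapping.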
@unique-Li-yuankun

I have the same problem.

@MartinThoma added the workflow-text-extraction and Has MCVE labels on Dec 12, 2023
@MartinThoma
Member

Thank you for the good error report.

I can confirm:

  1. The PDF contains text that can be copy-pasted (it's not an image)
  2. The copy-pasted text looks fine (it's not intentionally garbled within the file / via the font to avoid copy-pasting)
  3. pypdf was used

I'll make some more checks after work.

@stefan6419846
Collaborator

This might be related to #2295 as well. In

t = tt.decode(cmap[0], "surrogatepass") # apply str encoding
we decode the operands [b'\x05\xbb'] using utf-16-be, because that is cmap[0] of the following cmap tuple:

('utf-16-be', {}, '/F1', {'/Subtype': '/Type0', '/DescendantFonts': [IndirectObject(7, 0, 140296872535904)], '/Name': '/F1', '/BaseFont': '/KSZZAC+SimSun', '/Encoding': '/Identity-H', '/Type': '/Font'})

The Latin text seems to use actual charmaps instead:

('charmap', {'®': 'ff', '¯': 'fi', '±': 'ffi', 'Ä': '¨', '%': '%', '(': '(', ')': ')', ',': ',', '-': '-', '.': '.', '/': '/', '0': '0', '1': '1', '2': '2', '3': '3', '4': '4', '5': '5', '6': '6', '7': '7', '8': '8', '9': '9', ':': ':', ';': ';', '=': '=', '@': '@', 'A': 'A', 'C': 'C', 'D': 'D', 'E': 'E', 'F': 'F', 'G': 'G', 'H': 'H', 'I': 'I', 'J': 'J', 'K': 'K', 'L': 'L', 'M': 'M', 'N': 'N', 'O': 'O', 'P': 'P', 'R': 'R', 'S': 'S', 'T': 'T', 'U': 'U', 'V': 'V', 'X': 'X', 'Y': 'Y', 'Z': 'Z', '[': '[', ']': ']', 'a': 'a', 'b': 'b', 'c': 'c', 'd': 'd', 'e': 'e', 'f': 'f', 'g': 'g', 'h': 'h', 'i': 'i', 'j': 'j', 'k': 'k', 'l': 'l', 'm': 'm', 'n': 'n', 'o': 'o', 'p': 'p', 'q': 'q', 'r': 'r', 's': 's', 't': 't', 'u': 'u', 'v': 'v', 'w': 'w', 'x': 'x', 'y': 'y', 'z': 'z'}, '/F2', {'/Subtype': '/Type1', '/FontDescriptor': IndirectObject(14, 0, 140296872535904), '/LastChar': 196, '/Widths': [285, 514, 856, 514, 856, 799, 285, 400, 400, 514, 799, 285, 343, 285, 514, 514, 514, 514, 514, 514, 514, 514, 514, 514, 514, 285, 285, 285, 799, 485, 485, 799, 771, 728, 742, 785, 699, 671, 806, 771, 371, 528, 799, 642, 942, 771, 799, 699, 799, 756, 571, 742, 771, 771, 1056, 771, 771, 628, 285, 514, 285, 514, 285, 285, 514, 571, 457, 571, 457, 314, 514, 571, 285, 314, 542, 285, 856, 571, 514, 571, 542, 402, 405, 400, 571, 542, 742, 542, 542, 457, 514, 1028, 514, 514, 514, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 642, 856, 799, 714, 685, 771, 742, 799, 742, 799, 0, 0, 742, 600, 571, 571, 856, 856, 285, 314, 514, 514, 514, 514, 514, 771, 457, 514, 742, 799, 514, 928, 1042, 799, 285, 514], '/Name': '/F2', '/BaseFont': '/KSZZAC+CMR9', '/FirstChar': 33, '/Type': '/Font'})
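
The effect of the utf-16-be decode on the quoted operand can be reproduced in isolation. This is a minimal sketch, assuming the two bytes are a font-internal character code under /Identity-H rather than UTF-16 text; that reading is an assumption and is not confirmed in this thread.

```python
raw = b"\x05\xbb"  # operand bytes quoted above

# Same call shape as the quoted pypdf line: interpret the two bytes as UTF-16-BE.
decoded = raw.decode("utf-16-be", "surrogatepass")
code_point = ord(decoded)
print(f"U+{code_point:04X}")  # U+05BB

# The basic CJK Unified Ideographs block spans U+4E00..U+9FFF.
is_cjk = 0x4E00 <= code_point <= 0x9FFF
print(is_cjk)  # False
```

The bytes decode without error, but the resulting code point lies far outside the CJK range one would expect for text set in SimSun, which is consistent with the garbled output in the report.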

For reference: the file from #2295 has the following for the wrong characters as well:

('utf-16-be', {}, '/R11', {'/BaseFont': '/GSWDKI+MHeiHK-Bold', '/Type': '/Font', '/Encoding': '/Identity-H', '/DescendantFonts': [IndirectObject(12, 0, 139916737754976)], '/Subtype': '/Type0'})

The cmap for the Arabic numerals in that file looks fine again (dict encoding in this case):

({0: '\x00', 1: '\x01', 2: '\x02', 3: '\x03', 4: '\x04', 5: '\x05', 6: '\x06', 7: '\x07', 8: '\x08', 9: '\t', 10: '\n', 11: '\x0b', 12: '\x0c', 13: '\r', 14: '\x0e', 15: '\x0f', 16: '\x10', 17: '\x11', 18: '\x12', 19: '\x13', 20: '\x14', 21: '\x15', 22: '\x16', 23: '\x17', 24: '\x18', 25: '\x19', 26: '\x1a', 27: '\x1b', 28: '\x1c', 29: '\x1d', 30: '\x1e', 31: '\x1f', 32: ' ', 33: '!', 34: '"', 35: '#', 36: '$', 37: '%', 38: '&', 39: "'", 40: '(', 41: ')', 42: '*', 43: '+', 44: ',', 45: '-', 46: '.', 47: '/', 48: '0', 49: '1', 50: '2', 51: '3', 52: '4', 53: '5', 54: '6', 55: '7', 56: '8', 57: '9', 58: ':', 59: ';', 60: '<', 61: '=', 62: '>', 63: '?', 64: '@', 65: 'A', 66: 'B', 67: 'C', 68: 'D', 69: 'E', 70: 'F', 71: 'G', 72: 'H', 73: 'I', 74: 'J', 75: 'K', 76: 'L', 77: 'M', 78: 'N', 79: 'O', 80: 'P', 81: 'Q', 82: 'R', 83: 'S', 84: 'T', 85: 'U', 86: 'V', 87: 'W', 88: 'X', 89: 'Y', 90: 'Z', 91: '[', 92: '\\', 93: ']', 94: '^', 95: '_', 96: '`', 97: 'a', 98: 'b', 99: 'c', 100: 'd', 101: 'e', 102: 'f', 103: 'g', 104: 'h', 105: 'i', 106: 'j', 107: 'k', 108: 'l', 109: 'm', 110: 'n', 111: 'o', 112: 'p', 113: 'q', 114: 'r', 115: 's', 116: 't', 117: 'u', 118: 'v', 119: 'w', 120: 'x', 121: 'y', 122: 'z', 123: '{', 124: '|', 125: '}', 126: '~', 127: '\x7f', 128: '€', 129: '\x81', 130: '‚', 131: 'ƒ', 132: '„', 133: '…', 134: '†', 135: '‡', 136: 'ˆ', 137: '‰', 138: 'Š', 139: '‹', 140: 'Œ', 141: '\x8d', 142: 'Ž', 143: '\x8f', 144: '\x90', 145: '‘', 146: '’', 147: '“', 148: '”', 149: '•', 150: '–', 151: '—', 152: '˜', 153: '™', 154: 'š', 155: '›', 156: 'œ', 157: '\x9d', 158: 'ž', 159: 'Ÿ', 160: '\xa0', 161: '¡', 162: '¢', 163: '£', 164: '¤', 165: '¥', 166: '¦', 167: '§', 168: '¨', 169: '©', 170: 'ª', 171: '«', 172: '¬', 173: '\xad', 174: '®', 175: '¯', 176: '°', 177: '±', 178: '²', 179: '³', 180: '´', 181: 'µ', 182: '¶', 183: '·', 184: '¸', 185: '¹', 186: 'º', 187: '»', 188: '¼', 189: '½', 190: '¾', 191: '¿', 192: 'À', 193: 'Á', 194: 'Â', 195: 'Ã', 196: 'Ä', 197: 'Å', 198: 'Æ', 
199: 'Ç', 200: 'È', 201: 'É', 202: 'Ê', 203: 'Ë', 204: 'Ì', 205: 'Í', 206: 'Î', 207: 'Ï', 208: 'Ð', 209: 'Ñ', 210: 'Ò', 211: 'Ó', 212: 'Ô', 213: 'Õ', 214: 'Ö', 215: '×', 216: 'Ø', 217: 'Ù', 218: 'Ú', 219: 'Û', 220: 'Ü', 221: 'Ý', 222: 'Þ', 223: 'ß', 224: 'à', 225: 'á', 226: 'â', 227: 'ã', 228: 'ä', 229: 'å', 230: 'æ', 231: 'ç', 232: 'è', 233: 'é', 234: 'ê', 235: 'ë', 236: 'ì', 237: 'í', 238: 'î', 239: 'ï', 240: 'ð', 241: 'ñ', 242: 'ò', 243: 'ó', 244: 'ô', 245: 'õ', 246: 'ö', 247: '÷', 248: 'ø', 249: 'ù', 250: 'ú', 251: 'û', 252: 'ü', 253: 'ý', 254: 'þ', 255: 'ÿ'}, {}, '/R18', {'/BaseFont': '/ZHXRWX+TimesLTStd-Bold', '/FontDescriptor': IndirectObject(19, 0, 140377741535072), '/Type': '/Font', '/FirstChar': 44, '/LastChar': 57, '/Widths': [250, 0, 250, 0, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], '/Encoding': '/WinAnsiEncoding', '/Subtype': '/Type1'})
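
Both problematic fonts printed above are /Type0 fonts with /Encoding /Identity-H, and neither printed dictionary contains a /ToUnicode key. A small helper can list fonts matching that pattern. This is only a sketch: the function name is hypothetical, it runs on plain dictionaries here, and whether the missing /ToUnicode entry is the actual cause of the garbling is an assumption.

```python
from typing import Any, List, Mapping


def identity_h_fonts_without_tounicode(fonts: Mapping[str, Any]) -> List[str]:
    """Return names of /Type0 fonts using /Identity-H that have no /ToUnicode entry."""
    flagged = []
    for name, font in fonts.items():
        # pypdf may hand back indirect references; plain dicts have no get_object().
        font = font.get_object() if hasattr(font, "get_object") else font
        if (
            font.get("/Subtype") == "/Type0"
            and font.get("/Encoding") == "/Identity-H"
            and "/ToUnicode" not in font
        ):
            flagged.append(name)
    return flagged


# Plain-dict stand-ins shaped like the two fonts printed above.
fonts = {
    "/F1": {"/Subtype": "/Type0", "/Encoding": "/Identity-H", "/BaseFont": "/KSZZAC+SimSun"},
    "/F2": {"/Subtype": "/Type1", "/BaseFont": "/KSZZAC+CMR9"},
}
print(identity_h_fonts_without_tounicode(fonts))  # ['/F1']
```

With pypdf, the mapping would typically come from `page["/Resources"]["/Font"]`; whether the values need additional dereferencing may depend on the pypdf version.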

@pubpub-zz
Collaborator

@MartinThoma has written:
2. The copy-pasted text looks fine (it's not intentionally garbled within the file / via the font to avoid copy-pasting)

Can you please indicate which program you used? I tried the same test with Acrobat, without success.

@stefan6419846
Collaborator

pdftotext/poppler seems to work fine, for example:

pdftotext -f 1 -l 1 20120812.pdf -

@MartinThoma
Member

I used the PDF viewer built into Google Chrome.
