Replies: 1 comment
The PDF text is written in the Base-14 font Helvetica with an array of explicitly given character widths, in which the width of the space character is not given ... and is therefore 0!
In [8]: page.get_text("words")
Out[8]:
[(100.0,
270.20001220703125,
154.6719970703125, # x1 of "Hello"
303.1759948730469,
'Hello',
0,
0,
0),
(154.6719970703125, # x0 of "World"
270.20001220703125,
217.33599853515625,
303.1759948730469,
'World',
0,
1,
0)]
... we see that the end coordinate of "Hello" equals the start coordinate of "World", which is correct.
In [4]: blocks=page.get_text("dict")["blocks"]
In [5]: [s for b in blocks for l in b["lines"] for s in l["spans"]]
Out[5]:
[{'size': 24.0,
'flags': 0,
'font': 'Helvetica',
'color': 0,
'ascender': 1.0750000476837158,
'descender': -0.29899999499320984,
'text': 'Hello World',
'origin': (100.0, 296.0),
'bbox': (100.0, 270.20001220703125, 217.33599853515625, 303.1759948730469)}]
Whereas version 1.24 gives us 2 spans:
In [9]: blocks=page.get_text("dict")["blocks"]
In [10]: [s for b in blocks for l in b["lines"] for s in l["spans"]]
Out[10]:
[{'size': 24.0,
'flags': 0,
'font': 'Helvetica',
'color': 0,
'ascender': 1.0750000476837158,
'descender': -0.29899999499320984,
'text': 'Hello ',
'origin': (100.0, 296.0),
'bbox': (100.0, 270.20001220703125, 161.343994140625, 303.1759948730469)},
{'size': 24.0,
'flags': 0,
'font': 'Helvetica',
'color': 0,
'ascender': 1.0750000476837158,
'descender': -0.29899999499320984,
'text': 'World',
'origin': (154.6719970703125, 296.0),
'bbox': (154.6719970703125,
270.20001220703125,
217.33599853515625,
303.1759948730469)}]
But however you view it, this is based on a design decision taken in MuPDF, not in PyMuPDF. MuPDF's CLI tool also produces the following when executing
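The coordinates in the reply above can be checked by hand. The following is a minimal sketch, assuming the PDF's width array matches the standard Helvetica metrics for every glyph except the space (that assumption is not confirmed by the thread, but the numbers agree with the output shown). The names `HELVETICA_WIDTHS`, `advance`, and `pdf_widths` are illustrative only and are not part of PyMuPDF.

```python
# Standard Helvetica glyph widths in 1/1000 of the font size (Adobe AFM metrics).
HELVETICA_WIDTHS = {
    "H": 722, "e": 556, "l": 222, "o": 556,
    "W": 944, "r": 333, "d": 556, " ": 278,
}


def advance(text, font_size, widths):
    """Horizontal advance of `text` in points for the given width table."""
    return sum(widths[ch] for ch in text) * font_size / 1000.0


FONT_SIZE = 24.0   # span 'size' from the output above
X_START = 100.0    # x of the span origin from the output above

# Assumption: the PDF's own width array omits the space, so it counts as 0.
pdf_widths = dict(HELVETICA_WIDTHS)
pdf_widths[" "] = 0

hello_x1 = X_START + advance("Hello", FONT_SIZE, pdf_widths)
world_x0 = hello_x1 + advance(" ", FONT_SIZE, pdf_widths)
world_x1 = world_x0 + advance("World", FONT_SIZE, pdf_widths)

# For comparison: end of "Hello " if the space had Helvetica's usual width.
hello_space_x1 = X_START + advance("Hello ", FONT_SIZE, HELVETICA_WIDTHS)

print(f"{hello_x1:.3f} {world_x0:.3f} {world_x1:.3f} {hello_space_x1:.3f}")
# prints: 154.672 154.672 217.336 161.344
```

With a zero-width space, "World" starts exactly where "Hello" ends (154.672). The value 161.344 coincides with the right edge of the 'Hello ' span reported by version 1.24, which suggests, without proving it, that this bbox is computed with the font's built-in space width while the origin of 'World' follows the widths in the PDF.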
Description of the bug
File: Simple PDF 2.0 file.pdf (taken from the PDF Association's GitHub page with example PDFs)
Since v1.24.0, I see an unexpected newline in the parsed text. Here is a text object of the PDF above:
How to reproduce the bug
To reproduce
Version 1.23.26:
Version 1.24.0:
Expected behaviour
I would say that the additional newline should not be there.
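As a possible workaround on the caller's side, spans that share a baseline can be joined back into one line after extraction. This is only a sketch: `join_spans` and its `y_tol` parameter are hypothetical names, not a PyMuPDF API, and it assumes span dictionaries shaped like those returned by page.get_text("dict"), with an 'origin' and a 'text' key.

```python
def join_spans(spans, y_tol=0.5):
    """Group span dicts by baseline (origin y) and join their text left to right.

    `y_tol` is an assumed tolerance in points for treating two baselines as equal.
    """
    groups = []  # list of [baseline_y, [spans on that baseline]]
    for span in sorted(spans, key=lambda s: s["origin"][1]):
        y = span["origin"][1]
        if groups and abs(groups[-1][0] - y) <= y_tol:
            groups[-1][1].append(span)
        else:
            groups.append([y, [span]])
    lines = []
    for _, members in groups:
        # Order the spans of one baseline by their x origin before joining.
        members.sort(key=lambda s: s["origin"][0])
        lines.append("".join(s["text"] for s in members))
    return lines


# The two spans reported by version 1.24, reduced to the keys used here.
spans_1_24 = [
    {"text": "Hello ", "origin": (100.0, 296.0)},
    {"text": "World", "origin": (154.6719970703125, 296.0)},
]
print(join_spans(spans_1_24))  # prints: ['Hello World']
```

Joining by baseline ignores the overlapping bounding boxes entirely, so it does not depend on which space width the extractor assumed.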
PyMuPDF version
1.24.1
Operating system
Linux
Python version
3.10