Identify different sections in a PDF #3312
-
Description of the bug: I want to identify different sections in a PDF for later usage. Example headings: 1.1 Uitgangspunten, 1.2 Controles. 1.1 Uitgangspunten and the 2 lines after it are the first section. 1.2 Controles and the 4 lines after it are the second section. I'm using the following code:
@JorjMcKie Please help with this.
How to reproduce the bug: NA
PyMuPDF version: 1.23.25
Operating system: Windows
Python version: 3.8
Replies: 3 comments 1 reply
-
This is not a bug report but a Discussions item -> transferring.
-
There is no such thing as what you are asking for. Therefore, if there are … In reality, these permutations are rare (but they do exist!). Executing a by-character extraction and recomposing the intended text every single time would therefore most probably be a big waste of resources and performance. PyMuPDF (or rather, MuPDF) works under the hypothesis that things are in a decent, reasonable state and tries to segment text chunks into blocks and lines, as we have in the "dict" / "blocks" extractions. This was a long speech to argue that a unique identification like the one you suggest makes no sense. You do have the page position, which comes close to that.
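To illustrate the point about block segmentation and page position, here is a minimal sketch, not an official recipe. It assumes the tuples come from page.get_text("blocks"), where each item has the form (x0, y0, x1, y1, text, block_no, block_type) and block_type 0 means text. The helper name blocks_in_reading_order is hypothetical, and a simple top-to-bottom sort is an assumption that will not hold for multi-column pages.

```python
def blocks_in_reading_order(blocks):
    """Sort get_text("blocks") tuples top-to-bottom, then left-to-right.

    Each tuple is assumed to be (x0, y0, x1, y1, text, block_no, block_type).
    Image blocks (block_type != 0) are dropped.
    """
    text_blocks = [b for b in blocks if b[6] == 0]
    # Round y0 so blocks on (almost) the same baseline are ordered by x0.
    return sorted(text_blocks, key=lambda b: (round(b[1]), b[0]))
```

With PyMuPDF installed, the input would typically be obtained as page.get_text("blocks") for a page of an opened document.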
-
@JorjMcKie Thanks for your reply. What do you mean by "You do have the page position - which comes close to that."? What position are you referring to? Or is there any other workaround from your side to meet my requirement? I need to use PyMuPDF and meet that requirement as well! Thanks.
Well, you have to download these PDFs and open them locally.
Otherwise, please look at the description of the get_text("dict", ...) output here: all items returned contain a "bbox", which is a rect-like tuple of coordinates telling you where to find the item on the page.
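As a rough sketch of how the "bbox" values could be used for the original question, the code below walks a get_text("dict") result, treats every line that starts with a dotted number (such as "1.1") as a section heading, and attaches the following lines to it together with their positions. The regular expression and the helper names iter_lines, split_sections and sections_of_page are assumptions made for illustration; real documents may need a different heading rule (for example font size or boldness taken from the spans).

```python
import re

# Assumed heading rule: a dotted number such as "1.1" followed by a title.
HEADING = re.compile(r"^\d+(\.\d+)+\s+\S")


def iter_lines(page_dict):
    """Yield (text, bbox) for every non-empty text line of a "dict" extraction."""
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # skip image blocks, which have no "lines"
            continue
        for line in block.get("lines", []):
            text = "".join(span["text"] for span in line["spans"]).strip()
            if text:
                yield text, tuple(line["bbox"])


def split_sections(page_dict):
    """Group lines under the most recent heading line, keeping each bbox."""
    sections = []
    for text, bbox in iter_lines(page_dict):
        if HEADING.match(text):
            sections.append({"heading": text, "bbox": bbox, "lines": []})
        elif sections:  # lines before the first heading are ignored
            sections[-1]["lines"].append((text, bbox))
    return sections


def sections_of_page(path, page_number=0):
    """Convenience wrapper; needs PyMuPDF, imported lazily as fitz."""
    import fitz

    with fitz.open(path) as doc:
        return split_sections(doc[page_number].get_text("dict"))
```

Each returned section carries the heading's bbox, so its position on the page can serve as the identifier for later use, for instance to clip or highlight that region.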