Identify different sections in a PDF #3312

santanuOUP · 2024-03-27T11:53:07Z

santanuOUP
Mar 27, 2024

Description of the bug

I want to identify the different sections in a PDF for later use.
Consider the following content with 2 sections:

**1.1 Uitgangspunten
Deze service, EftOphfCrOvkORCA mag enkel vanuit UPF aangeroepen worden. Er is daarom
ook geen publieke service omschrijving beschikbaar.

1.2 Controles
Alle beschreven meldingen in dit document worden in de monitor-tabel gezet.
De volgende (standaard) controles worden altijd uitgevoerd.
• Elk gegeven moet het juiste formaat hebben.
Melding: ‘Ongeldige invoer: heeft niet het juiste formaat.’**

"1.1 Uitgangspunten" and the 2 lines after it form the first section. "1.2 Controles" and the 4 lines after it form the second section.
Does PyMuPDF set any unique value for each section when reading the page? I can't find anything like that among the properties.

I'm using the following code:

# `page` is a PyMuPDF page object obtained from an opened document
page_blocks = page.get_text("dict")["blocks"]
for block in page_blocks:
    if "lines" in block:  # image blocks have no "lines" key
        for line in block["lines"]:
            for span in line["spans"]:
                text = span["text"]

@JorjMcKie Please help with this.

How to reproduce the bug

NA

PyMuPDF version

1.23.25

Operating system

Windows

Python version

3.8

JorjMcKie · 2024-03-27T12:51:56Z

JorjMcKie
Mar 27, 2024
Maintainer

This is not a bug report but a Discussions item -> transferring.

JorjMcKie · 2024-03-27T13:11:09Z

JorjMcKie
Mar 27, 2024
Maintainer

There is no such thing as what you are asking for.
Please consider that in a PDF each single character may be written to its location on the page independently of all other characters, and in any sequence.

Therefore, if there are N characters visible on some page, there exist (up to) N! (N factorial) different ways to produce exactly the same page appearance, only one of which is actually readable when text-extracted. To underpin this, compare these two apparently identical files and extract their text: file1, file2.
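
As a rough illustration of this point, here is a toy model (not PyMuPDF code, and not how a PDF is actually encoded). Each glyph is a hypothetical (x position, character) pair, `render` paints glyphs regardless of their order, and `extract` reads them back in writing order. All names here are made up for the sketch.

```python
import math
import random

# Hypothetical one-line "page": every glyph carries its own x position.
glyphs = [(x, ch) for x, ch in enumerate("1.2 Controles")]

def render(stream):
    """Paint glyphs at their positions; the writing order does not matter."""
    line = [" "] * len(stream)
    for x, ch in stream:
        line[x] = ch
    return "".join(line)

def extract(stream):
    """Naive extraction: emit characters in the order they were written."""
    return "".join(ch for _, ch in stream)

shuffled = glyphs[:]
random.Random(0).shuffle(shuffled)  # same glyphs, different writing order

print(render(glyphs) == render(shuffled))       # appearance is identical
print(repr(extract(glyphs)), repr(extract(shuffled)))
print(math.factorial(len(glyphs)))              # possible writing orders
```

Both streams look the same when painted, but only one of them extracts as readable text without extra work.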

In reality, these permutations are rare (but they do exist!). Extracting character by character and recomposing the intended text every single time would therefore most probably be a big waste of resources and performance.

PyMuPDF (or rather, MuPDF) works under the hypothesis that things are in a decent, reasonable state and tries to segment text chunks into blocks and lines, as we have them in the "dict" / "blocks" extractions.
More often, however, the sequence of blocks does not follow the "natural" reading sequence and needs to be treated appropriately.

This was a long speech to argue that a unique identification like the one you have in mind makes no sense. You do have the page position - which comes close to that.
The standard behavior of all text extraction is to follow the sequence as encoded in the page's appearance source, which does not necessarily follow the reading sequence, as explained.
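
One simple way to treat block order "appropriately" is to sort blocks by their page position. The sketch below is only an assumption about what that could look like: it works on plain dictionaries shaped like the "bbox" entries of the "dict" output, and the `label` key is invented for the example (real blocks carry "lines" instead).

```python
def reading_order(blocks):
    """Sort blocks top-to-bottom, then left-to-right, by bbox = (x0, y0, x1, y1)."""
    return sorted(blocks, key=lambda b: (b["bbox"][1], b["bbox"][0]))

# Hypothetical blocks in the order they were stored in the page source.
# The "label" key is made up for this sketch.
blocks = [
    {"bbox": (72.0, 150.0, 500.0, 210.0), "label": "1.2 Controles"},
    {"bbox": (72.0, 100.0, 500.0, 140.0), "label": "1.1 Uitgangspunten"},
]

for block in reading_order(blocks):
    print(block["bbox"], block["label"])
```

With a real page, `blocks` would come from `page.get_text("dict")["blocks"]`. `get_text` also accepts `sort=True`, which sorts the output by vertical and then horizontal coordinates. Multi-column layouts generally need more than a plain coordinate sort.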

santanuOUP · 2024-03-27T13:44:00Z

santanuOUP
Mar 27, 2024
Author

@JorjMcKie Thanks for your reply.
The embedded files show "Invalid PDF".
I can't open those files.

What do you mean by "You do have the page position - which comes close to that."? What position are you referring to?
How can I get that position from the get_text("dict")["blocks"] result?

Or is there any other workaround from your side to meet my requirement? I need to use PyMuPDF and meet that requirement as well!

Thanks.

1 reply

JorjMcKie Mar 27, 2024
Maintainer

Well, you have to download these PDFs and open them locally.

Otherwise, please look at the description of the get_text("dict", ...) output here: every item returned contains a "bbox", a rect-like tuple of coordinates that tells you where to find the item on the page.
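
A minimal sketch of how the "bbox" values could be combined with a heading pattern to split a page into sections. This is not a PyMuPDF feature. `HEADING_RE`, `union`, `split_sections`, `make_line` and the sample data are all hypothetical, and the regular expression assumes headings are numbered like "1.2 Title". The sample mimics only the "blocks" / "lines" / "spans" / "bbox" / "text" keys of the "dict" output.

```python
import re

# Assumption: section headings look like "1.1 Title" or "2.3.4 Title".
HEADING_RE = re.compile(r"^\d+(\.\d+)+\s+\S")

def union(a, b):
    """Smallest rectangle (x0, y0, x1, y1) covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def split_sections(page_dict):
    """Group text lines into sections; a heading line starts a new section."""
    sections = []
    for block in page_dict["blocks"]:
        for line in block.get("lines", []):  # image blocks have no "lines"
            text = "".join(span["text"] for span in line["spans"]).strip()
            if not text:
                continue
            # Text before the first heading becomes a section of its own.
            if HEADING_RE.match(text) or not sections:
                sections.append({"title": text, "body": [], "bbox": tuple(line["bbox"])})
            else:
                sections[-1]["body"].append(text)
                sections[-1]["bbox"] = union(sections[-1]["bbox"], line["bbox"])
    return sections

def make_line(text, y):
    """Build a hypothetical line dict with a single span (sample data only)."""
    bbox = (72.0, y, 500.0, y + 12.0)
    return {"bbox": bbox, "spans": [{"text": text, "bbox": bbox}]}

sample = {"blocks": [
    {"type": 0, "lines": [make_line("1.1 Uitgangspunten", 100.0),
                          make_line("Deze service, EftOphfCrOvkORCA mag enkel", 114.0)]},
    {"type": 1, "bbox": (72.0, 130.0, 200.0, 145.0)},  # image block, skipped
    {"type": 0, "lines": [make_line("1.2 Controles", 150.0),
                          make_line("Alle beschreven meldingen in dit document", 164.0)]},
]}

for section in split_sections(sample):
    print(section["title"], section["bbox"], len(section["body"]))
```

With a real page, `sample` would be replaced by `page.get_text("dict")`. If headings are not numbered, the span keys "size", "font" and "flags" may be a better signal than a regular expression, and blocks may need sorting by position first if the page source is not in reading order.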

Answer selected by santanuOUP
