Updated extractText() #397

Tom-Evers · 2018-03-04T10:04:57Z

Added changes proposed in issue #17

Tom-Evers · 2018-03-04T19:00:02Z

Some lines contain multiple draw operations. For example, underlined text may be drawn as the text first and the underlining ("________") second, both at the same vertical coordinate.

The toggle 'skip_intertwining_text' will by default skip the next line if intertwining text is detected.
When set to false, it will simply insert text after the previous line.

Indentation is now also properly handled.
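To make the toggle concrete, here is a minimal sketch of the behaviour described above. It is not the code from this PR: the function name `join_draw_ops`, the `(y, text)` input shape, and the `tolerance` parameter are all hypothetical. It also assumes that "intertwining" means a draw operation at the same vertical coordinate as the previous line, and it reads "insert text after the previous line" as appending to that line's text.

```python
from typing import List, Optional, Tuple


def join_draw_ops(
    ops: List[Tuple[float, str]],
    skip_intertwining_text: bool = True,
    tolerance: float = 0.5,
) -> str:
    """Hypothetical sketch: join (y, text) draw operations into lines.

    `ops` is given in content-stream order. A draw operation whose
    vertical coordinate matches the previous line (within `tolerance`)
    is treated as intertwining text.
    """
    lines: List[str] = []
    last_y: Optional[float] = None
    for y, text in ops:
        same_line = last_y is not None and abs(y - last_y) <= tolerance
        if same_line:
            if not skip_intertwining_text:
                # Assumed reading: append directly after the previous line's text.
                lines[-1] += text
            # Otherwise the intertwining draw operation is dropped.
        else:
            lines.append(text)
            last_y = y
    return "\n".join(lines)


ops = [(700.0, "Title"), (700.0, "________"), (680.0, "Body")]
skipped = join_draw_ops(ops)
kept = join_draw_ops(ops, skip_intertwining_text=False)
```

With the default, the underlining drawn at the same height as "Title" is dropped; with the toggle set to false it ends up on the same line as "Title".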

deven96

Does this output the text in the correct order?

Tom-Evers · 2018-07-12T20:10:46Z

It should, yeah, but it has been some time since I worked on this...

The problem is: if the PDF itself has the text in the wrong order but relocated with weird offsets, there's a good chance it'll still mess up the order. Then again, the method that was used before my commit would then still be worse.
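As an illustration of that failure mode (not of what this PR does), the sketch below contrasts content-stream order with visual order. The `(x, y, text)` fragment shape and both helper names are hypothetical, and sorting by position is a swapped-in technique shown only to make the difference visible.

```python
from typing import List, Tuple

# Hypothetical fragment shape: (x, y, text), with the PDF origin at the bottom left.
Fragment = Tuple[float, float, str]


def stream_order(fragments: List[Fragment]) -> str:
    """Join fragments in the order they appear in the content stream."""
    return " ".join(text for _, _, text in fragments)


def visual_order(fragments: List[Fragment]) -> str:
    """Join fragments top to bottom, then left to right."""
    ordered = sorted(fragments, key=lambda f: (-f[1], f[0]))
    return " ".join(text for _, _, text in ordered)


# The stream lists the lower line first, then positions the upper line above it.
fragments = [(72.0, 680.0, "second line"), (72.0, 700.0, "first line")]
as_stored = stream_order(fragments)
as_seen = visual_order(fragments)
```

An extractor that follows the stream returns the lines in stored order, which is the case where the output can still come out wrong.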

deven96 · 2018-07-13T14:23:40Z

Good work though mate


Tom-Evers · 2018-10-01T08:46:18Z

Can this be tested/pulled?

joegrange · 2018-11-07T19:08:28Z

Looks like a good improvement. I'm also having white space issues that this should improve.

Tom-Evers · 2018-11-08T08:48:28Z

It should only improve things, and never break anything that isn't broken already.

Could this be pulled please?

TZanke · 2018-11-08T11:24:28Z

I would also like to have the newest changes, but it doesn't look like anyone will build a new package. Is there any PyPDF2 fork out there with newer packages than PyPDF2 itself? Even pdfrw is not maintained very well, so am I missing some brand-new Python PDF engine on GitHub where all the effort is going?


MartinThoma · 2022-04-16T09:17:03Z

@TZanke I just became the maintainer this month - and PyPDF2 is moving again 🚀

MartinThoma · 2022-04-16T10:14:28Z

It seems like this PR breaks a couple of things. Could you please have a look?

MartinThoma · 2022-06-06T12:18:59Z

This PR addressed #17, but #924 fixed it (along with many other things), so I'm closing this.

Thank you for the PR! I hope I can respond quicker in future to such improvements :-)

Tom-Evers added 2 commits March 4, 2018 11:03

Updated extractText() according to changes proposed in issue py-pdf#17

9217428

Handling intertwining text properly.

08699cb

deven96 reviewed Jul 12, 2018


Tom-Evers mentioned this pull request Nov 8, 2018

Advanced text extraction #464

Closed

MartinThoma added the Tiny (pull requests that make a tiny change and thus should be easy to merge) and PdfReader (the PdfReader component is affected) labels Apr 6, 2022

MartinThoma changed the title ~~Updated extractText() according to changes proposed in issue #17~~ Updated extractText() Apr 16, 2022

Merge branch 'main' into extractText

dd1a529

MartinThoma reviewed Apr 16, 2022


PyPDF2/pdf.py (outdated)

Update PyPDF2/pdf.py

4d352a6

Merge branch 'main' into extractText

fa2fb12

MartinThoma added the needs-change (the PR/issue cannot be handled as is and needs to be improved) and workflow-text-extraction (from a user's perspective, text extraction is the affected feature/workflow) labels Apr 16, 2022

Merge branch 'main' into extractText

ab64198

MartinThoma closed this Jun 6, 2022
