BUG: Improve spacing for text extraction #806

MartinThoma · 2022-04-23T11:11:51Z

No description provided.

codecov-commenter · 2022-04-23T20:46:53Z

Codecov Report

Merging #806 (fb4a895) into main (d4c8cab) will increase coverage by 0.01%.
The diff coverage is 75.00%.

@@            Coverage Diff             @@
##             main     #806      +/-   ##
==========================================
+ Coverage   75.22%   75.24%   +0.01%     
==========================================
  Files          11       11              
  Lines        3516     3522       +6     
  Branches      810      814       +4     
==========================================
+ Hits         2645     2650       +5     
  Misses        658      658              
- Partials      213      214       +1

| Impacted Files | Coverage Δ |
|---|---|
| PyPDF2/pdf.py | `81.85% <75.00%> (+<0.01%)` ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

A change I would like to highlight is the performance improvement for large PDF files (#808) 🎉

New Features (ENH):
- Add papersizes (#800)
- Allow setting permission flags when encrypting (#803)
- Allow setting form field flags (#802)

Bug Fixes (BUG):
- TypeError in xmp._converter_date (#813)
- Improve spacing for text extraction (#806)
- Fix PDFDocEncoding Character Set (#809)

Robustness (ROB):
- Use null ID when encrypted but no ID given (#812)
- Handle recursion error (#804)

Documentation (DOC):
- CMaps (#811)
- The PDF Format + commit prefixes (#810)
- Add compression example (#792)

Developer Experience (DEV):
- Add Benchmark for Performance Testing (#781)

Maintenance (MAINT):
- Validate PDF magic byte in strict mode (#814)
- Make PdfFileMerger.addBookmark() behave like PdfFileWriter's (#339)
- Quadratic runtime while parsing reduced to linear (#808)

Testing (TST):
- Newlines in text extraction (#807)

Full Changelog: 1.27.8...1.27.9

PyPDF2 now takes positive / negative spaces between text blocks into account. Not very elegant, but the result looks way better than before.
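
To illustrate the idea in the commit message above, here is a minimal sketch of how spacing adjustments can be turned into spaces during text extraction. It is not the code from this PR. It relies on the PDF convention that numbers inside a `TJ` array are given in thousandths of a text-space unit and are subtracted from the current position, so a negative number widens the gap before the next glyph and a positive number narrows it. The function name `join_tj_array` and the `space_threshold` default of 200 are assumptions made for this example only.

```python
from typing import List, Union

# One element of a TJ array: either a text fragment or a numeric adjustment.
TJItem = Union[str, int, float]


def join_tj_array(items: List[TJItem], space_threshold: float = 200.0) -> str:
    """Join the operands of a TJ array into plain text (illustrative only).

    A negative adjustment whose magnitude reaches `space_threshold`
    (a hypothetical cutoff, not a value taken from PyPDF2) is treated
    as a word gap. Smaller or positive adjustments are treated as kerning.
    """
    parts: List[str] = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)
        elif -item >= space_threshold:
            # Wide gap: emit one space, but never a leading or doubled space.
            if parts and not parts[-1].endswith(" "):
                parts.append(" ")
        # Otherwise the number is ordinary kerning and contributes no text.
    return "".join(parts)


if __name__ == "__main__":
    print(join_tj_array(["Hello", -300, "W", 80, "orld"]))
```

A real extractor would likely derive the cutoff from the width of the space glyph in the active font rather than use a fixed constant.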


MartinThoma added 6 commits April 23, 2022 13:11

BUG: Improve spacing for text extraction

f23dd38

Merge branch 'main' into spacing

9162127

Adjust logic

ec37cfa

Merge branch 'main' into spacing

dab0831

Adjust test

39ef2b2

Flake8 fixes

4bdddda

MartinThoma added the workflow-text-extraction label (From a user's perspective, text extraction is the affected feature/workflow) Apr 23, 2022

MartinThoma added 5 commits April 23, 2022 22:40

Whitespace improvements

69ba8d0

Better defaults for text extraction

1c3ac0a

Remove re import

84a8be2

Fix lorem ipsum

5a728e0

Remove todo

fb4a895

MartinThoma merged commit d1be80d into main Apr 23, 2022

MartinThoma deleted the spacing branch April 23, 2022 20:49
