BUG: Improve spacing for text extraction #806

Merged: MartinThoma merged 11 commits into main from spacing on Apr 23, 2022

MartinThoma (Member)

No description provided.

MartinThoma added the workflow-text-extraction label ("From a user's perspective, text extraction is the affected feature/workflow") on Apr 23, 2022
codecov-commenter commented Apr 23, 2022

Codecov Report

Merging #806 (fb4a895) into main (d4c8cab) will increase coverage by 0.01%.
The diff coverage is 75.00%.

@@            Coverage Diff             @@
##             main     #806      +/-   ##
==========================================
+ Coverage   75.22%   75.24%   +0.01%     
==========================================
  Files          11       11              
  Lines        3516     3522       +6     
  Branches      810      814       +4     
==========================================
+ Hits         2645     2650       +5     
  Misses        658      658              
- Partials      213      214       +1     
Impacted Files Coverage Δ
PyPDF2/pdf.py 81.85% <75.00%> (+<0.01%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

MartinThoma merged commit d1be80d into main on Apr 23, 2022
MartinThoma deleted the spacing branch on April 23, 2022 at 20:49
MartinThoma added a commit that referenced this pull request Apr 24, 2022
A change I would like to highlight is the performance improvement for
large PDF files (#808) 🎉

New Features (ENH):
-  Add papersizes (#800)
-  Allow setting permission flags when encrypting (#803)
-  Allow setting form field flags (#802)

Bug Fixes (BUG):
-  TypeError in xmp._converter_date (#813)
-  Improve spacing for text extraction (#806)
-  Fix PDFDocEncoding Character Set (#809)

Robustness (ROB):
-  Use null ID when encrypted but no ID given (#812)
-  Handle recursion error (#804)

Documentation (DOC):
-  CMaps (#811)
-  The PDF Format + commit prefixes (#810)
-  Add compression example (#792)

Developer Experience (DEV):
-  Add Benchmark for Performance Testing (#781)

Maintenance (MAINT):
-  Validate PDF magic byte in strict mode (#814)
-  Make PdfFileMerger.addBookmark() behave like PdfFileWriter's (#339)
-  Quadratic runtime while parsing reduced to linear (#808)

Testing (TST):
-  Newlines in text extraction (#807)

Full Changelog: 1.27.8...1.27.9
VictorCarlquist pushed a commit to VictorCarlquist/PyPDF2 that referenced this pull request Apr 29, 2022
PyPDF2 now takes positive / negative spaces between text blocks into account. Not very elegant, but the result looks way better than before.
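
To illustrate what taking positive / negative spaces into account can mean, here is a minimal sketch. It is not the code merged in this PR. It assumes the text arrives as the operand array of a PDF `TJ` operator, where strings are interleaved with numeric position adjustments in thousandths of a text space unit, and a strongly negative adjustment opens a visible gap between glyphs. The function name `join_tj_items` and the threshold of -125 are hypothetical choices for illustration only.

```python
from typing import List, Union

# One operand of a TJ array: either a piece of text or a numeric adjustment.
TJItem = Union[str, int, float]


def join_tj_items(items: List[TJItem], space_threshold: float = -125.0) -> str:
    """Join TJ operands into plain text, turning wide gaps into spaces.

    The threshold is an assumed heuristic value, not taken from PyPDF2.
    """
    parts: List[str] = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)
        elif item < space_threshold:
            # A large negative adjustment moves the next glyph far to the
            # right, which usually corresponds to a word gap.
            parts.append(" ")
        # Smaller adjustments (positive or negative) are treated as kerning
        # and produce no output.
    return "".join(parts)


print(join_tj_items(["Hello", -250, "World"]))
print(join_tj_items(["A", 40, "V", -20, "A"]))
```

A fixed threshold is a rough heuristic: the right cut-off depends on the font and font size, which is one reason a simple approach like this can be described as "not very elegant".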