Update pdf.py PageObject.extractText() #334

jusdino · 2017-03-19T16:42:31Z

These changes allow for an optional text separator for TJ and Tj operators.

These source alterations were originally suggested by DSM on Stack Overflow:
http://stackoverflow.com/questions/11017379/pypdf-ignores-newlines-in-pdf-file

I'm just passing along the good suggestion in hopes that the change may become standard in some future version.

MartinThoma · 2022-04-06T19:14:12Z

Do you have an example where something other than a single space would be desired?

MartinThoma · 2022-04-06T19:15:03Z

By the way: Sorry that it took so long to respond! I realize that you probably don't even remember this PR.

Also: Don't worry about the failing tests; that is expected for this PR.

jusdino · 2022-04-07T01:52:48Z

Yeah, this was a while ago... Ok, I resurrected the project I was working on.

I was trying to extract text from some form-formatted PDF pages in which the text I was interested in was separated by newlines, so I called `page.extractText(Tj_sep='\n')` to get the output organized the way I needed.
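To illustrate the use case, here is a minimal, self-contained sketch of how a separator changes the joined output of text-showing operations. It is not the code in PyPDF2/pdf.py: `join_text_ops`, `TextOp`, and the sample data are hypothetical names made up for this example, and the empty-string defaults are an assumption. Only the `Tj_sep` keyword appears in the comment above; `TJ_sep` is assumed by analogy with the PR description.

```python
from typing import List, Sequence, Tuple, Union

# One text-showing operation: ("Tj", "text") or ("TJ", ["frag", -20, "ment"]).
TextOp = Tuple[str, Union[str, Sequence[Union[str, float]]]]


def join_text_ops(ops: List[TextOp], Tj_sep: str = "", TJ_sep: str = "") -> str:
    """Concatenate the text of Tj/TJ operations, prefixing each with its separator.

    Simplified model for illustration only (hypothetical helper, not library code).
    """
    parts: List[str] = []
    for operator, operand in ops:
        if operator == "Tj":
            # Tj shows a single string.
            parts.append(Tj_sep + operand)
        elif operator == "TJ":
            # TJ shows an array mixing strings with numeric spacing adjustments;
            # only the strings contribute text.
            fragments = "".join(item for item in operand if isinstance(item, str))
            parts.append(TJ_sep + fragments)
    return "".join(parts)


# Hypothetical form page: each field is drawn by its own operation.
form_ops: List[TextOp] = [
    ("Tj", "Name:"),
    ("Tj", "Jane Doe"),
    ("TJ", ["Da", -20, "te:"]),
]

# Without separators the fields run together; with "\n" each lands on its own line.
print(repr(join_text_ops(form_ops)))
print(repr(join_text_ops(form_ops, Tj_sep="\n", TJ_sep="\n")))
```

With a newline separator, the result can be split with `str.splitlines()` to recover one entry per field, which is roughly what the form-extraction use case needs.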

PyPDF2/pdf.py

Features:
- Add alpha channel support for png files in Script (#614)

Bug fixes (BUG):
- Fix formatWarning for filename without slash (#612)
- Add whitespace between words for extractText() (#569, #334)
- "invalid escape sequence" SyntaxError (#522)
- Avoid error when printing warning in pythonw (#486)
- Stream operations can be List or Dict (#665)

Documentation (DOC):
- Added Scripts/pdf-image-extractor.py
- Documentation improvements (#550, #538, #324, #426, #394)

Tests and Test setup (TST):
- Add GitHub Action which automatically runs unit tests via pytest and static code analysis with Flake8 (#660)
- Add several unit tests (#661, #663)
- Add .coveragerc to create coverage reports

Developer Experience Improvements (DEV):
- Pre commit: Developers can now `pre-commit install` to avoid tiny issues like trailing whitespaces

Miscellaneous:
- Add the LICENSE file to the distributed packages (#288)
- Use setuptools instead of distutils (#599)
- Improvements for the PyPI page (#644)
- Python 3 changes (#504, #366)

You can see the full changelog at: 1.26.0...1.27.0

jusdino and others added 2 commits March 19, 2017 10:41

Merge branch 'master' into patch-1

5a4fec3

MartinThoma added the PdfReader (the PdfReader component is affected) and Feature labels Apr 6, 2022

MartinThoma reviewed Apr 7, 2022

Update PyPDF2/pdf.py

c39572b

MartinThoma merged commit 12c7047 into py-pdf:master Apr 7, 2022

jusdino deleted the patch-1 branch April 8, 2022 02:32

MartinThoma mentioned this pull request Apr 22, 2022

extractText number problem #466

Closed

MartinThoma added the is-feature (a feature request) label and removed the Feature label Jun 26, 2022
