Update pdf.py PageObject.extractText() #334

Merged
merged 3 commits into from Apr 7, 2022
5 changes: 3 additions & 2 deletions PyPDF2/pdf.py
@@ -2648,7 +2648,7 @@ def compressContentStreams(self):
content = ContentStream(content, self.pdf)
self[NameObject("/Contents")] = content.flateEncode()

- def extractText(self):
+ def extractText(self, Tj_sep="", TJ_sep=" "):
"""
Locate all text drawing commands, in the order they are provided in the
content stream, and extract the text. This works well for some PDF
@@ -2670,6 +2670,7 @@ def extractText(self):
if operator == b_("Tj"):
_text = operands[0]
if isinstance(_text, TextStringObject):
+ text += Tj_sep
text += _text
text += "\n"
elif operator == b_("T*"):
@@ -2687,7 +2688,7 @@ def extractText(self):
elif operator == b_("TJ"):
for i in operands[0]:
if isinstance(i, TextStringObject):
- text += " "
+ text += TJ_sep
text += i
text += "\n"
return text
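To illustrate what the two new keyword arguments change, here is a minimal sketch that mimics only the `Tj` and `TJ` branches touched by this diff. It is not PyPDF2 code: `extract_text_sketch` is a hypothetical stand-in, plain `str` values replace `TextStringObject`, the operator names are plain strings rather than `b_(...)` bytes, and the nesting of the `"\n"` lines is an assumption because the diff view above lost its indentation.

```python
# Simplified stand-in for the loop in PageObject.extractText(); not PyPDF2 code.
# Operations are (operands, operator) pairs. Plain str replaces TextStringObject,
# and numbers inside a TJ array stand in for kerning adjustments.

def extract_text_sketch(operations, Tj_sep="", TJ_sep=" "):
    text = ""
    for operands, operator in operations:
        if operator == "Tj":
            _text = operands[0]
            if isinstance(_text, str):
                text += Tj_sep          # new: separator before each Tj string
                text += _text
                text += "\n"            # assumed to sit inside the isinstance check
        elif operator == "TJ":
            for item in operands[0]:
                if isinstance(item, str):
                    text += TJ_sep      # previously a hard-coded " "
                    text += item
            text += "\n"                # assumed to follow the whole TJ array
    return text


# One Tj string, then one TJ array that splits a word around a kerning value.
ops = [
    (["Hello"], "Tj"),
    ([["Wor", -20, "ld"]], "TJ"),
]

default_output = extract_text_sketch(ops)            # keeps the old behaviour
joined_output = extract_text_sketch(ops, TJ_sep="")  # no space between TJ fragments
```

With the defaults, the fragments of the `TJ` array are separated by spaces as before the change; passing `TJ_sep=""` joins them, which may help for PDFs that split single words across `TJ` array elements.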