Fixed extractText()-Not returning text with spaces #569

Merged
merged 1 commit into from Apr 6, 2022
Conversation

inboxsgk
Contributor

Previously, .extractText() read the text in the PDF and returned it without any spaces.
This fix modifies pdf.py to add a space (" ") between words.

Here is an example:
Original Sentence : "The quick brown fox jumps over the lazy dog"

Previous Output : "Thequickbrownfoxjumpsoverthelazydog"

Output After fix : "The quick brown fox jumps over the lazy dog"
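
As a toy illustration of the behavior change described above (not the actual change made to pdf.py), suppose the text of a page arrives as a list of word fragments. The function name `join_fragments` and the fragment list are hypothetical and only mirror the example sentence.

```python
def join_fragments(fragments, separator=""):
    """Concatenate text fragments, optionally placing a separator between them."""
    return separator.join(fragments)


# Hypothetical fragments, matching the example sentence in this pull request.
fragments = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

before_fix = join_fragments(fragments)       # fragments run together
after_fix = join_fragments(fragments, " ")   # a space between two words

print(before_fix)
print(after_fix)
```

The real extraction code works on PDF content-stream operators rather than a ready-made word list, so this only shows the difference in output, not how the library detects word boundaries.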

@MartinThoma MartinThoma added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF PdfReader The PdfReader component is affected labels Apr 6, 2022
@MartinThoma
Member

Thank you for the contribution! I'm sorry that it took so long. I'll try to be quicker in the future 🤞

@MartinThoma MartinThoma merged commit 02cc54b into py-pdf:master Apr 6, 2022
MartinThoma added a commit that referenced this pull request Apr 7, 2022
Features:

 - Add alpha channel support for png files in Script (#614)

Bug fixes (BUG):

 - Fix formatWarning for filename without slash (#612)
 - Add whitespace between words for extractText() (#569, #334)
 - "invalid escape sequence" SyntaxError (#522)
 - Avoid error when printing warning in pythonw (#486)
 - Stream operations can be List or Dict (#665)

Documentation (DOC):

 - Added Scripts/pdf-image-extractor.py
 - Documentation improvements (#550, #538, #324, #426, #394)

Tests and Test setup (TST):

 - Add GitHub Action which automatically runs unit tests via pytest and
   static code analysis with Flake8 (#660)
 - Add several unit tests (#661, #663)
 - Add .coveragerc to create coverage reports

Developer Experience Improvements (DEV):

 - Pre commit: Developers can now `pre-commit install` to avoid tiny issues
               like trailing whitespaces

Miscellaneous:

 - Add the LICENSE file to the distributed packages (#288)
 - Use setuptools instead of distutils (#599)
 - Improvements for the PyPI page (#644)
 - Python 3 changes (#504, #366)

You can see the full changelog at: 1.26.0...1.27.0
@Viennoiserie

Could you please tell me which directory contains the PyPDF2 source file with the extractText() method?

@MartinThoma
Member

It's in _page.py
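
To find the directory of the installed copy on your own machine, one option is to ask Python where it would import the package from. This is a minimal sketch using only the standard library (Python 3.8+); the helper names `module_source_path` and `installed_version` are hypothetical, and it assumes the package is importable as `PyPDF2`.

```python
import importlib.util
from importlib import metadata


def module_source_path(module_name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec is not None else None


def installed_version(distribution_name):
    """Return the installed version string, or None if the distribution is not installed."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the path of the package's __init__.py; _page.py would sit in the same directory.
    print(module_source_path("PyPDF2"))
    # Prints the installed version, which helps when comparing against the file on GitHub.
    print(installed_version("PyPDF2"))
```

Comparing the printed version with the release you are reading on GitHub can explain why a local file and the repository file differ in length.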

@Viennoiserie

I just found the -page.py file in the "PyPDF2" files (outside of the pycache folder). The problem is that it only has about 1000 lines, whereas the one modified on git has about 3000. Maybe I don't have the right file or version (yet I installed the package yesterday :/)

@Viennoiserie

_page.py *

@Viennoiserie

I just replaced my "_page.py" file with a copy of the one on git, and it is still not working. If you don't mind, could you tell me where the problem might be?

@pubpub-zz
Collaborator

@Viennoiserie / @inboxsgk ,
can you provide an example PDF file that shows the issue, so we can analyze it?

@Viennoiserie

TEST.pdf

I am trying to write a function (for my web app) that appends all the words contained in the PDF to an array. The app then finds the words the user asks for. So I went into Word and wrote text that would be "hard" for Python to work with, and the results aren't the ones I wanted:

I expect: ['Thomas', 'Vienot', 'CACA', 'Partie']

but I get: ['Thomas', 'VienotCACA', 'Partie']

@Viennoiserie

If you want, I can also provide my code (nothing too complex, but I think it should work):

from PyPDF2 import PdfFileReader


def pdf_to_words(file_name):
    pdf_array = []
    word_array = []

    # Collect the extracted text of every page while the file is open.
    with open(file_name + '.pdf', 'rb') as pdf_obj:
        pdf_reader = PdfFileReader(pdf_obj)
        nb_page = pdf_reader.numPages

        for i in range(nb_page):
            pdf_array.append(pdf_reader.getPage(i).extractText())

    for page_text in pdf_array:
        word = ""

        for char in page_text:
            code = ord(char)

            # Letters (A-Z, a-z) and non-ASCII characters extend the current word.
            if code not in range(0, 65) and code not in range(91, 97) and code not in range(123, 128):
                word += char
            else:
                # Any other character ends the current word.
                if word != "":
                    word_array.append(word)
                word = ""

        # Keep the last word when the page text does not end with a separator.
        if word != "":
            word_array.append(word)

    return word_array


def main():
    file_name = "TEST"
    # file_name = "VIENOT_Thomas"

    word_array = pdf_to_words(file_name)

    print(word_array)


if __name__ == "__main__":
    main()
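
The character-by-character loop above can also be approximated with a regular expression. This is a hedged sketch, not part of the original post: `text_to_words` is a hypothetical helper, and the pattern keeps runs of Unicode letters only, which is close to but not identical to the `ord()` ranges used above (those also accept every non-ASCII character, letter or not).

```python
import re

# [^\W\d_] matches a "word" character that is neither a digit nor an underscore,
# which for str patterns means a Unicode letter.
WORD_PATTERN = re.compile(r"[^\W\d_]+")


def text_to_words(text):
    """Return every run of letters in the text, in order of appearance."""
    return WORD_PATTERN.findall(text)


print(text_to_words("Thomas \n Vienot CACA\nPartie 2"))
```

Because the function works on a plain string, it can be applied to the text of each page after extraction, independent of the PDF library in use.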

@pubpub-zz
Collaborator

pubpub-zz commented Jun 2, 2022

@Viennoiserie,
The issue is not within PyPDF2.
If you just run extract_text on your PDF you get :
' Thomas \n \n Vienot CACA\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n '
There seem to be lots of empty lines, but I've checked and they are part of your file.
I would propose this solution:
[x for x in "".join([x if x.isalnum() else " " for x in PyPDF2.PdfReader("TEST(3).pdf").pages[0].extractText().replace("\n", "")]).split(" ") if x != ""]
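
The same idea can be written as a small function that works on an already extracted string, which makes it easier to test without a PDF at hand. This is a sketch under that assumption; the name `words_from_text` is hypothetical, and the sample string is a shortened version of the output shown above.

```python
def words_from_text(text):
    """Split text into words, treating every non-alphanumeric character as a separator."""
    # Replace anything that is not a letter or digit with a space.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text)
    # split() without arguments collapses runs of whitespace and drops empty strings.
    return cleaned.split()


sample = " Thomas \n \n Vienot CACA\n \n \n "
print(words_from_text(sample))
```

Newlines are turned into spaces here rather than removed, so words that end one line and start the next are not merged.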

@Viennoiserie

Thank you. Indeed, I have tried my program on other PDFs and there was no problem :/
Sorry for bothering you!

@MartinThoma MartinThoma added the whitespace While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard. label Jan 14, 2023