Warning and similarity based on empty vector #3745

Closed
CartierPierre opened this issue May 15, 2019 · 7 comments
Labels
third-party Third-party packages and services

Comments

@CartierPierre

Hello, I have an issue using spaCy with the French model.

How to reproduce the behaviour

self.nlp = spacy.load('fr_core_news_md')
self.nlp(string1).similarity(self.nlp(string2))

First, I got this error:

BATTERIE SECHE 12V/7AH (this is the string2)
Traceback (most recent call last):
[......] in similarity
return self.nlp(t1).similarity(self.nlp(t2))
File "doc.pyx", line 381, in spacy.tokens.doc.Doc.similarity
File "/home/pierre/.local/lib/python3.6/site-packages/spacy/errors.py", line 437, in user_warning
_warn(message, "user")
File "/home/pierre/.local/lib/python3.6/site-packages/spacy/errors.py", line 464, in _warn
warnings.warn_explicit(message, category, stack[1], stack[2])
File "/usr/lib/python3.6/warnings.py", line 99, in _showwarnmsg
msg.file, msg.line)
File "/home/pierre/.local/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1069, in _showwarning
file.write(formatWarning(message, category, filename, lineno, line))
File "/home/pierre/.local/lib/python3.6/site-packages/PyPDF2/utils.py", line 69, in formatWarning
file = filename.replace("/", "\\").rsplit("\\", 1)[1] # find the file name
IndexError: list index out of range

So I took the string out of my algorithm (it was in a DataFrame, never mind) and ran it on its own, which gives:

nlp.py:61: UserWarning: [W008] Evaluating Doc.similarity based on empty vectors.
test_nlp()

I don't know why the warning fails the first time instead of simply being displayed.
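
For readers who want to avoid triggering W008 in the first place, here is a minimal sketch. It assumes, as in spaCy 2.x, that the warning is raised when either Doc has a zero `vector_norm`. The helper name `safe_similarity` is hypothetical and not part of spaCy.

```python
def safe_similarity(doc1, doc2):
    """Return doc1.similarity(doc2), or None if either doc has an empty vector.

    Hypothetical helper: it relies only on the `vector_norm` attribute and the
    `similarity` method that spaCy Doc objects expose.
    """
    if doc1.vector_norm == 0 or doc2.vector_norm == 0:
        # Skip the call, so no similarity warning is emitted for empty vectors.
        return None
    return doc1.similarity(doc2)


# Usage with spaCy (assumes the model is installed):
#   nlp = spacy.load('fr_core_news_md')
#   score = safe_similarity(nlp(string1), nlp(string2))
```

Returning None instead of 0.0 lets the caller tell "no vector available" apart from "genuinely dissimilar".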

My Environment

  • Operating System: Win10
  • Python Version Used: 3.6
  • spaCy Version Used: 2.1.3
  • Environment Information: Ubuntu subsystem for win10
@CartierPierre
Author

If someone knows how to suppress warnings, maybe it will prevent the error?

@ines
Member

ines commented May 15, 2019

This is very strange. I'm confused that the error occurs in PyPDF2, since that package shouldn't have anything to do with this at all 🤔 Are you doing anything custom with it?

If someone knows how to suppress warnings, maybe it will prevent the error?

You should be able to set the SPACY_WARNING_IGNORE environment variable with one or more warning IDs. For example, SPACY_WARNING_IGNORE=W008. This should suppress the message.

@CartierPierre
Author

CartierPierre commented May 16, 2019

I used PyPDF2 in the project, but only to read documents. It's strange to me too. Does the warnings library use PyPDF2?

from PyPDF2 import PdfFileReader
PdfFileReader(open(filename,'rb')).getNumPages()

Maybe it's not a spaCy issue?

You should be able to set the SPACY_WARNING_IGNORE environment variable

In Python or in Ubuntu?

@CartierPierre
Author

Another lead: earlier in the script I used Camelot, which uses PyPDF2 to extract some data. I don't know whether it changes any PyPDF2 settings.

@ines ines added the third-party Third-party packages and services label May 16, 2019
@ines
Member

ines commented May 16, 2019

Yeah, it's weird that there's an interaction between the two. I think I may have found the cause – check out these lines in PyPDF2/pdf.py:

# have to dynamically override the default showwarning since there are no
# public methods that specify the 'file' parameter
def _showwarning(message, category, filename, lineno, file=warndest, line=None):
    if file is None:
        file = sys.stderr
    try:
        file.write(formatWarning(message, category, filename, lineno, line))
    except IOError:
        pass
warnings.showwarning = _showwarning

So basically, in the PdfFileReader, PyPDF2 reaches into the built-in warnings module and overwrites warnings.showwarning 🤔 So when spaCy calls into warnings to show you the warning, it's actually executing PyPDF2's monkeypatched version of that function, and that call fails, for whatever reason.

I'm not really sure what we should do about this, since we can't so easily work around other third-party packages monkeypatching Python built-ins. I would argue that monkeypatching Python builtins isn't really a good practice, for this exact reason.
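
To illustrate the interaction, here is a minimal, standard-library-only sketch. The `short_name` expression is copied from the traceback (PyPDF2/utils.py, line 69). Everything else, including `patched_showwarning` and its output format, is a hypothetical stand-in rather than PyPDF2's actual code. The sketch also shows one plausible cause of the IndexError, assuming the filename passed to the warning was a bare name with no path separator (such as the `doc.pyx` seen in the traceback).

```python
import warnings


def short_name(filename):
    # Same expression as in the traceback: it takes the part after the last
    # separator and raises IndexError if the filename contains none.
    return filename.replace("/", "\\").rsplit("\\", 1)[1]


def patched_showwarning(message, category, filename, lineno, file=None, line=None):
    # Hypothetical stand-in for a third-party replacement of warnings.showwarning.
    print("%s: %s [%s:%s]" % (category.__name__, message, short_name(filename), lineno))


def emit(filename):
    """Emit a warning while warnings.showwarning is globally replaced."""
    with warnings.catch_warnings():  # restores filters and showwarning on exit
        warnings.simplefilter("always")
        warnings.showwarning = patched_showwarning
        try:
            # Unrelated code emitting a warning now goes through the patched hook.
            warnings.warn_explicit("[W008] demo", UserWarning, filename, 381)
        except IndexError:
            return "IndexError"
    return "shown"
```

Under these assumptions, `emit("/path/to/nlp.py")` prints the warning, while `emit("doc.pyx")` hits the IndexError inside the patched hook. That would match the observation that the warning displays normally when raised from a plain script.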

Edit: Just found this thread discussing the problem and there seems to be an open PR about this. So it should probably be fixed in PyPDF2 soon 🎉 py-pdf/pypdf#67

In Python or in Ubuntu?

On the command line in your terminal etc., so when you execute the Python file. For example:

SPACY_WARNING_IGNORE=W008 python nlp.py

You can also export it once for your session by running the following command before you execute your Python file:

export SPACY_WARNING_IGNORE=W008

If you're using an IDE like PyCharm etc. to execute your code, there's usually an option to set environment variables from within the editor as well.
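
As an alternative from inside Python, the standard `warnings` filters can ignore the message by its prefix. This is a sketch, not spaCy's own mechanism: it assumes the warning is a `UserWarning` whose text starts with `[W008]`, as in the output above. `noisy_similarity` and `call_quietly` are hypothetical names used only for illustration.

```python
import warnings


def noisy_similarity():
    # Hypothetical stand-in for a Doc.similarity call on empty vectors.
    warnings.warn("[W008] Evaluating Doc.similarity based on empty vectors.", UserWarning)
    return 0.0


def call_quietly(func):
    """Call func with W008 ignored; return (result, number of other warnings)."""
    # record=True is used only so this example can count what got through.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # The message argument is a regex matched against the start of the text.
        warnings.filterwarnings("ignore", message=r"\[W008\]", category=UserWarning)
        result = func()
    return result, len(caught)
```

In a real script, calling `warnings.filterwarnings("ignore", message=r"\[W008\]")` once near the top should be enough. Ignored warnings are dropped before `warnings.showwarning` is ever called, so a patched hook would not run for them.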

@CartierPierre
Author

Thank you @ines, you rock!

@lock

lock bot commented Jun 15, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 15, 2019