
Setting idf_ is impossible #7102

Closed
Sbelletier opened this issue Jul 28, 2016 · 12 comments · Fixed by #10899

Comments

@Sbelletier

Context

Rather than a bug, I guess this would count as a sort of "enhancement proposal".

I'm currently trying to persist a TfidfTransformer by saving its parameters in a MongoDB database and then rebuilding an identical copy from them. This technique works for CountVectorizer but simply blocks for TfidfTransformer, as there is no way to set idf_.
Is there an actual architectural reason why setting this attribute raises an error? If so, do you have an idea for a workaround? I obviously want to avoid keeping the matrix the transformer was fitted on, as that would completely mess up the architecture (I believe the saving/loading process should be separated from the whole treatment/learning process, and trying to keep both would mean having to propagate a dirty return value).

Steps/Code to Reproduce

A functioning example with CountVectorizer:

#let us say that CountV is the previously built CountVectorizer that we want to recreate identically
from sklearn.feature_extraction.text import CountVectorizer

doc = ['some fake text that is fake to test the vectorizer']

c = CountVectorizer()  # instantiate; bare CountVectorizer refers to the class itself
c.set_params(**CountV.get_params())
c.set_params(**{'vocabulary': CountV.vocabulary_})
# Now let us test if they do the same conversion
m1 = CountV.transform(doc)
m2 = c.transform(doc)
print(m1.todense().tolist())  # just for visibility's sake
print(m2.todense().tolist())
# Note: this code does what is expected

This might not seem very impressive, but dictionaries can be stored in MongoDB, which means you can restore the CountVectorizer (or at least an identical copy of it) by storing only vocabulary_ and the output of get_params().
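A minimal sketch of that round trip (the helper names here are hypothetical; note that the dtype entry returned by get_params() is a NumPy type, so it is dropped below rather than stored):

from sklearn.feature_extraction.text import CountVectorizer

def save_vectorizer(cv):
    # Everything needed to rebuild the vectorizer: its constructor
    # parameters plus the fitted vocabulary_.
    params = cv.get_params()
    params.pop('dtype', None)  # NumPy dtypes are not Mongo-serializable
    return {'params': params, 'vocabulary': cv.vocabulary_}

def load_vectorizer(stored):
    cv = CountVectorizer()
    cv.set_params(**stored['params'])
    cv.set_params(vocabulary=stored['vocabulary'])
    return cv

CountV = CountVectorizer().fit(['some fake text that is fake'])
stored = save_vectorizer(CountV)    # this dict is what would go into MongoDB
restored = load_vectorizer(stored)  # ...and this rebuilds it on reload
assert (restored.transform(['fake text']).todense()
        == CountV.transform(['fake text']).todense()).all()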

Now the offending piece of code:

#let us say that TFtransformer is the previously computed transformer
from sklearn.feature_extraction.text import TfidfTransformer

t = TfidfTransformer()
t.set_params(**TFtransformer.get_params())
# Now here comes the problem. Two potential attempts:
t.set_params(**{'idf': TFtransformer.idf_})  # fails: there is no 'idf' parameter
t.idf_ = TFtransformer.idf_                  # fails: the attribute cannot be set

I would expect at least one of these to work. However, both raise an error.

  • In the first case it seems logical, as there is no idf/idf_ parameter.
  • In the second case, I suppose encapsulation forbids setting the attribute directly.

I think that being able to reproduce a fitted object (even if only for non-classifier objects) without having to recompute it at each launch would benefit a lot of applications.
I'm currently developing a REST API that has to do heavy computation on data before feeding it to the vectorizer. Having to relearn the whole model on every run is very slow, and it means I currently have to wait up to half an hour to test modifications that are sometimes about one line of code.

Versions

Windows-10-10.0.10586
('Python', '2.7.11 |Continuum Analytics, Inc.| (default, Feb 16 2016, 09:58:36) [MSC v.1500 64 bit (AMD64)]')
('NumPy', '1.11.0')
('SciPy', '0.17.1')
('Scikit-Learn', '0.17.1')

@jnothman
Member

jnothman commented Jul 28, 2016

I'd be happy to see a setter for idf_.

@GaelVaroquaux
Member

GaelVaroquaux commented Jul 28, 2016 via email

> I'd be happy to see a setter for idf_.

But that's not what the TfidfTransformer uses internally. Shouldn't the user be setting the _idf_diag matrix instead? I agree that setting a private attribute is ugly, so the question is: shouldn't we make it public?

@jnothman
Member

jnothman commented Jul 28, 2016

What's wrong with providing a setter function that transforms to _idf_diag
appropriately?
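For illustration, such a setter might look something like this (a sketch only, with a hypothetical class name, not actual scikit-learn code):

import numpy as np
import scipy.sparse as sp

class TfidfTransformerSketch(object):
    # Illustrative stand-in for the real TfidfTransformer.

    @property
    def idf_(self):
        # idf_ is derived from the stored diagonal matrix.
        return np.ravel(self._idf_diag.sum(axis=0))

    @idf_.setter
    def idf_(self, value):
        value = np.asarray(value, dtype=np.float64)
        n_features = value.shape[0]
        # Rebuild the private diagonal matrix; n_features is just the
        # length of the idf vector, so no extra state is required.
        self._idf_diag = sp.spdiags(value, diags=0, m=n_features,
                                    n=n_features, format='csr')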

@Sbelletier
Author

As _idf_diag can be set, this does give me a workaround for my problem (see the sketch below). This leads me to two new questions:

  • Would you be interested in a snippet/IPython notebook as an example of how to store the correct attributes of a CountVectorizer/TfidfTransformer so they can be restored?
  • I am currently unsure whether I would be able to provide a correct setter for idf_; should I keep the issue open for more skilled developers to see?

Thank you for the quick response.
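A minimal sketch of that workaround (reusing the TFtransformer placeholder from the snippets above; _idf_diag is a private attribute, so this may break across scikit-learn versions):

import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer

# Rebuild a fitted TfidfTransformer from a stored idf array by
# setting the private diagonal matrix directly.
stored_idf = np.asarray(TFtransformer.idf_)  # this is what would be persisted
t = TfidfTransformer()
t.set_params(**TFtransformer.get_params())
n = len(stored_idf)
t._idf_diag = sp.spdiags(stored_idf, diags=0, m=n, n=n, format='csr')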

@amueller
Member

+1 on what @jnothman said

@manu-chroma
Contributor

Hey, I'd be interested in working on this one.

@nelson-liu
Contributor

sure go ahead @manu-chroma :)

@manu-chroma
Contributor

Sorry for not getting back earlier.
In the following code example given by @Sbelletier:

#let us say that CountV is the previously built CountVectorizer that we want to recreate identically
from sklearn.feature_extraction.text import CountVectorizer

doc = ['some fake text that is fake to test the vectorizer']

c = CountVectorizer()
c.set_params(**CountV.get_params())
c.set_params(**{'vocabulary': CountV.vocabulary_})

Can someone elaborate on how the set_params function works in both lines, and what they are actually taking as params? I had a look at the docs but still don't get it. Thanks.

@nelson-liu
Contributor

hi @manu-chroma,
The get_params() function returns a dictionary with the parameters of the previously fit CountVectorizer CountV (the ** is kwargs syntax and unpacks a dictionary into keyword arguments). set_params() does the opposite: given a dictionary, it sets the parameters of the current object to the values in that dictionary. Here's an IPython example of the code above:

In [1]: from sklearn.feature_extraction.text import CountVectorizer
   ...:
   ...: doc = ['some fake text that is fake to test the vectorizer']
   ...:

In [2]: CountV = CountVectorizer(analyzer="char")

In [3]: CountV.fit(doc)
Out[3]:
CountVectorizer(analyzer='char', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [4]: CountV.get_params()
Out[4]:
{'analyzer': 'char',
 'binary': False,
 'decode_error': u'strict',
 'dtype': numpy.int64,
 'encoding': u'utf-8',
 'input': u'content',
 'lowercase': True,
 'max_df': 1.0,
 'max_features': None,
 'min_df': 1,
 'ngram_range': (1, 1),
 'preprocessor': None,
 'stop_words': None,
 'strip_accents': None,
 'token_pattern': u'(?u)\\b\\w\\w+\\b',
 'tokenizer': None,
 'vocabulary': None}

In [5]: c = CountVectorizer()

In [6]: c.get_params()
Out[6]:
{'analyzer': u'word', #note that the analyzer is "word"
 'binary': False,
 'decode_error': u'strict',
 'dtype': numpy.int64,
 'encoding': u'utf-8',
 'input': u'content',
 'lowercase': True,
 'max_df': 1.0,
 'max_features': None,
 'min_df': 1,
 'ngram_range': (1, 1),
 'preprocessor': None,
 'stop_words': None,
 'strip_accents': None,
 'token_pattern': u'(?u)\\b\\w\\w+\\b',
 'tokenizer': None,
 'vocabulary': None} #note that vocabulary is none

In [7]: c.set_params(**CountV.get_params())
Out[7]:
CountVectorizer(analyzer='char', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None) #note that vocabulary is none

In [8]: c.set_params(**{'vocabulary':CountV.vocabulary_})
Out[8]:
CountVectorizer(analyzer='char', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None,
        vocabulary={u'a': 1, u' ': 0, u'c': 2, u'e': 3, u'f': 4, u'i': 6, u'h': 5, u'k': 7, u'm': 8, u'o': 9, u's': 11, u'r': 10, u't': 12, u'v': 13, u'x': 14, u'z': 15})

Notice how in CountV the value of analyzer is char, while in c the value of analyzer is word. c.set_params(**CountV.get_params()) sets the parameters of c to those of CountV, so the analyzer is accordingly changed to char.

Doing the same thing for vocabulary, you can see that running c.set_params(**{'vocabulary': CountV.vocabulary_}) gives c the vocabulary of the previously fit CountVectorizer (CountV), despite c never having been fit (and thus previously having a vocabulary of None).

@manu-chroma
Contributor

hey @nelson-liu, thanks so much for explaining. I didn't know about the kwargs syntax before.

So essentially, a new setter function would be added to the TfidfTransformer class to set the idf_ attribute? I want to make things clearer for myself before I code. Thanks.

@charltonaustin
Contributor

charltonaustin commented Sep 26, 2016

I looked through this, and it isn't really possible to add a setter for idf_ that updates _idf_diag, @manu-chroma @nelson-liu @amueller @jnothman. The reason is that when we set _idf_diag we use n_features, which comes from X in the fit function (text.py:1018). So we would either have to keep track of n_features and add it to the setter, or move over to setting _idf_diag directly, as @GaelVaroquaux suggested. I would suggest we set _idf_diag directly, since we aren't really keeping the state necessary for idf, but that's just my humble opinion. Note: when we get idf_, we are actually deriving it from _idf_diag (text.py:1069).

Below is the test that I used to start out:

from sklearn.feature_extraction.text import TfidfTransformer

def test_copy_idf__tf():
    counts = [[3, 0, 1],
              [2, 0, 0],
              [3, 0, 0],
              [4, 0, 0],
              [3, 2, 0],
              [3, 0, 2]]
    t1 = TfidfTransformer()
    t2 = TfidfTransformer()
    t1.fit_transform(counts)
    t2.set_params(**t1.get_params())
    # t2.set_params(**{'idf': t1.idf_})
    t2.idf_ = t1.idf_  # currently raises AttributeError: can't set attribute

@jnothman
Member

jnothman commented Sep 26, 2016

Surely the number of features is already present in the value of _idf_diag?
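A toy illustration of that point (made-up values): the shape of the diagonal matrix already encodes n_features, and the idf vector itself can be recovered from it.

import numpy as np
import scipy.sparse as sp

# The diagonal idf matrix carries n_features in its shape, so a
# setter can rebuild state from the idf vector alone.
idf = np.array([1.0, 1.5, 2.0])
idf_diag = sp.spdiags(idf, diags=0, m=len(idf), n=len(idf), format='csr')
print(idf_diag.shape[0])               # 3 == n_features
print(np.ravel(idf_diag.sum(axis=0)))  # recovers the idf vector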
