
Setting idf_ is impossible #7102

Closed
Sbelletier opened this issue Jul 28, 2016 · 12 comments · Fixed by #10899

Comments

@Sbelletier

Context

Rather than a bug, I guess this would count as a sort of "enhancement proposal".

I'm currently trying to persist a TfidfTransformer by saving its parameters in a MongoDB database and then rebuilding an identical copy from them. This technique works for CountVectorizer but simply blocks for TfidfTransformer, as there is no way to set idf_.
Is there an actual architectural reason why setting this attribute raises an error? If so, do you have an idea for a workaround? I obviously want to avoid keeping the matrix the transformer was fitted on, as that would completely mess up the architecture (I believe the saving/loading process should be separated from the whole treatment/learning process, and trying to keep both would mean having to propagate a dirty return value).

Steps/Code to Reproduce

A functioning example with CountVectorizer:

#let us say that CountV is the previously built CountVectorizer that we want to recreate identically
from sklearn.feature_extraction.text import CountVectorizer

doc = ['some fake text that is fake to test the vectorizer']

c = CountVectorizer()  # instantiate; bare CountVectorizer refers to the class itself
c.set_params(**CountV.get_params())
c.set_params(**{'vocabulary': CountV.vocabulary_})
# Now let us test if they do the same conversion
m1 = CountV.transform(doc)
m2 = c.transform(doc)
print(m1.todense().tolist())  # just for visibility's sake
print(m2.todense().tolist())
# Note: this code does what is expected

This might not seem very impressive, but dictionaries can be stored in MongoDB, which means you can restore the CountVectorizer (or at least an identical copy of it) by storing only vocabulary_ and the output of get_params().
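A minimal sketch of that round trip (the helper names here are hypothetical; note that the dtype entry returned by get_params() is a NumPy type, so it is dropped below rather than stored):

from sklearn.feature_extraction.text import CountVectorizer

def save_vectorizer(cv):
    # Everything needed to rebuild the vectorizer: its constructor
    # parameters plus the fitted vocabulary_.
    params = cv.get_params()
    params.pop('dtype', None)  # NumPy dtypes are not Mongo-serializable
    return {'params': params, 'vocabulary': cv.vocabulary_}

def load_vectorizer(stored):
    cv = CountVectorizer()
    cv.set_params(**stored['params'])
    cv.set_params(vocabulary=stored['vocabulary'])
    return cv

CountV = CountVectorizer().fit(['some fake text that is fake'])
stored = save_vectorizer(CountV)    # this dict is what would go into MongoDB
restored = load_vectorizer(stored)  # ...and this rebuilds it on reload
assert (restored.transform(['fake text']).todense()
        == CountV.transform(['fake text']).todense()).all()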

Now the offending piece of code:

#let us say that TFtransformer is the previously computed transformer
from sklearn.feature_extraction.text import TfidfTransformer

t = TfidfTransformer()
t.set_params(**TFtransformer.get_params())
# Now here comes the problem. Two potential attempts:
t.set_params(**{'idf': TFtransformer.idf_})  # fails: there is no 'idf' parameter
t.idf_ = TFtransformer.idf_                  # fails: the attribute cannot be set

I would expect at least one of these to work. However, both raise an error.

  • In the first case it seems logical, as there is no idf/idf_ parameter.
  • In the second case, I suppose encapsulation forbids setting the attribute directly.

I think that being able to reproduce a fitted object (even if only for non-classifier objects) without having to recompute it at each launch would benefit a lot of applications.
I'm currently developing a REST API that has to do heavy computation on data before feeding it to the vectorizer. Having to relearn the whole model on every run is very slow, and it means I currently have to wait up to half an hour to test modifications that are sometimes about one line of code.

Versions

Windows-10-10.0.10586
('Python', '2.7.11 |Continuum Analytics, Inc.| (default, Feb 16 2016, 09:58:36) [MSC v.1500 64 bit (AMD64)]')
('NumPy', '1.11.0')
('SciPy', '0.17.1')
('Scikit-Learn', '0.17.1')

@jnothman
Member

jnothman commented Jul 28, 2016

I'd be happy to see a setter for idf_.

@GaelVaroquaux
Member

GaelVaroquaux commented Jul 28, 2016 via email

> I'd be happy to see a setter for idf_.

But that's not what the TfidfTransformer uses internally. Shouldn't the user be setting the _idf_diag matrix instead? I agree that setting a private attribute is ugly, so the question is: shouldn't we make it public?

@jnothman
Member

jnothman commented Jul 28, 2016

What's wrong with providing a setter function that transforms to _idf_diag
appropriately?
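For illustration, such a setter might look something like this (a sketch only, with a hypothetical class name, not actual scikit-learn code):

import numpy as np
import scipy.sparse as sp

class TfidfTransformerSketch(object):
    # Illustrative stand-in for the real TfidfTransformer.

    @property
    def idf_(self):
        # idf_ is derived from the stored diagonal matrix.
        return np.ravel(self._idf_diag.sum(axis=0))

    @idf_.setter
    def idf_(self, value):
        value = np.asarray(value, dtype=np.float64)
        n_features = value.shape[0]
        # Rebuild the private diagonal matrix; n_features is just the
        # length of the idf vector, so no extra state is required.
        self._idf_diag = sp.spdiags(value, diags=0, m=n_features,
                                    n=n_features, format='csr')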

@Sbelletier
Author

As _idf_diag can be set, this does give me a workaround for my problem (see the sketch below). This leads me to two new questions:

  • Would you be interested in a snippet/IPython notebook as an example of how to store the correct attributes of a CountVectorizer/TfidfTransformer so they can be restored?
  • I am currently unsure whether I would be able to provide a correct setter for idf_; should I keep the issue open for more skilled developers to see?

Thank you for the quick response.
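A minimal sketch of that workaround (reusing the TFtransformer placeholder from the snippets above; _idf_diag is a private attribute, so this may break across scikit-learn versions):

import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer

# Rebuild a fitted TfidfTransformer from a stored idf array by
# setting the private diagonal matrix directly.
stored_idf = np.asarray(TFtransformer.idf_)  # this is what would be persisted
t = TfidfTransformer()
t.set_params(**TFtransformer.get_params())
n = len(stored_idf)
t._idf_diag = sp.spdiags(stored_idf, diags=0, m=n, n=n, format='csr')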

@amueller
Member

+1 on what @jnothman said

@manu-chroma
Contributor

Hey, I'd be interested in working on this one.

@nelson-liu
Contributor

sure go ahead @manu-chroma :)

@manu-chroma
Contributor

Sorry for not getting back earlier.
In the following code example given by @Sbelletier:

#let us say that CountV is the previously built CountVectorizer that we want to recreate identically
from sklearn.feature_extraction.text import CountVectorizer

doc = ['some fake text that is fake to test the vectorizer']

c = CountVectorizer()
c.set_params(**CountV.get_params())
c.set_params(**{'vocabulary': CountV.vocabulary_})

Can someone elaborate on how the set_params function works in both lines, and what they are actually taking as params? I had a look at the docs but still don't get it. Thanks.

@nelson-liu
Contributor

hi @manu-chroma,
The get_params() function returns a dictionary with the parameters of the previously fit CountVectorizer CountV (the ** is kwargs syntax and unpacks a dictionary into keyword arguments). set_params() does the opposite: given a dictionary, it sets the parameters of the current object to the values in that dictionary. Here's an IPython example of the code above:

In [1]: from sklearn.feature_extraction.text import CountVectorizer
   ...:
   ...: doc = ['some fake text that is fake to test the vectorizer']
   ...:

In [2]: CountV = CountVectorizer(analyzer="char")

In [3]: CountV.fit(doc)
Out[3]:
CountVectorizer(analyzer='char', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [4]: CountV.get_params()
Out[4]:
{'analyzer': 'char',
 'binary': False,
 'decode_error': u'strict',
 'dtype': numpy.int64,
 'encoding': u'utf-8',
 'input': u'content',
 'lowercase': True,
 'max_df': 1.0,
 'max_features': None,
 'min_df': 1,
 'ngram_range': (1, 1),
 'preprocessor': None,
 'stop_words': None,
 'strip_accents': None,
 'token_pattern': u'(?u)\\b\\w\\w+\\b',
 'tokenizer': None,
 'vocabulary': None}

In [5]: c = CountVectorizer()

In [6]: c.get_params()
Out[6]:
{'analyzer': u'word', #note that the analyzer is "word"
 'binary': False,
 'decode_error': u'strict',
 'dtype': numpy.int64,
 'encoding': u'utf-8',
 'input': u'content',
 'lowercase': True,
 'max_df': 1.0,
 'max_features': None,
 'min_df': 1,
 'ngram_range': (1, 1),
 'preprocessor': None,
 'stop_words': None,
 'strip_accents': None,
 'token_pattern': u'(?u)\\b\\w\\w+\\b',
 'tokenizer': None,
 'vocabulary': None} #note that vocabulary is none

In [7]: c.set_params(**CountV.get_params())
Out[7]:
CountVectorizer(analyzer='char', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None) #note that vocabulary is none

In [8]: c.set_params(**{'vocabulary':CountV.vocabulary_})
Out[8]:
CountVectorizer(analyzer='char', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None,
        vocabulary={u'a': 1, u' ': 0, u'c': 2, u'e': 3, u'f': 4, u'i': 6, u'h': 5, u'k': 7, u'm': 8, u'o': 9, u's': 11, u'r': 10, u't': 12, u'v': 13, u'x': 14, u'z': 15})

Notice how in CountV the value of analyzer is char, while in c the value of analyzer is word. c.set_params(**CountV.get_params()) sets the parameters of c to those of CountV, so the analyzer is accordingly changed to char.

Doing the same thing for vocabulary, you can see that running c.set_params(**{'vocabulary': CountV.vocabulary_}) gives c the vocabulary of the previously fit CountVectorizer (CountV), despite c never having been fit (and thus previously having a vocabulary of None).

@manu-chroma
Contributor

hey @nelson-liu, thanks so much for explaining. I didn't know about the kwargs syntax before.

So essentially, a new setter function would be added to the TfidfTransformer class to set the idf_ attribute? I want to make things clearer for myself before I code. Thanks.

@charltonaustin
Contributor

charltonaustin commented Sep 26, 2016

I looked through this, and it isn't really possible to add a setter for idf_ that updates _idf_diag, @manu-chroma @nelson-liu @amueller @jnothman. The reason is that when we set _idf_diag we use n_features, which comes from X in the fit function (text.py:1018). So we would either have to keep track of n_features and add it to the setter, or move over to setting _idf_diag directly, as @GaelVaroquaux suggested. I would suggest we set _idf_diag directly, since we aren't really keeping the state necessary for idf, but that's just my humble opinion. Note: when we get idf_, we are actually deriving it from _idf_diag (text.py:1069).

Below is the test that I used to start out:

from sklearn.feature_extraction.text import TfidfTransformer

def test_copy_idf__tf():
    counts = [[3, 0, 1],
              [2, 0, 0],
              [3, 0, 0],
              [4, 0, 0],
              [3, 2, 0],
              [3, 0, 2]]
    t1 = TfidfTransformer()
    t2 = TfidfTransformer()
    t1.fit_transform(counts)
    t2.set_params(**t1.get_params())
    # t2.set_params(**{'idf': t1.idf_})
    t2.idf_ = t1.idf_  # currently raises AttributeError: can't set attribute

@jnothman
Member

jnothman commented Sep 26, 2016

Surely the number of features is already present in the value of _idf_diag?
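A toy illustration of that point (made-up values): the shape of the diagonal matrix already encodes n_features, and the idf vector itself can be recovered from it.

import numpy as np
import scipy.sparse as sp

# The diagonal idf matrix carries n_features in its shape, so a
# setter can rebuild state from the idf vector alone.
idf = np.array([1.0, 1.5, 2.0])
idf_diag = sp.spdiags(idf, diags=0, m=len(idf), n=len(idf), format='csr')
print(idf_diag.shape[0])               # 3 == n_features
print(np.ravel(idf_diag.sum(axis=0)))  # recovers the idf vector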
