
Is TfIdfVectorizer safe to use? #6078

Open
geofforigin8 opened this issue Apr 11, 2024 · 6 comments
Labels
operator Issues related to ONNX operators question Questions about ONNX

Comments

@geofforigin8

Ask a Question

Question

Is TfIdfVectorizer safe to use?

I thought I was going crazy after models I had trained with a TfIdfVectorizer in the pipeline started having (sometimes large) discrepancies in predictions after converting to ONNX.

After some googling, I found this post here: http://www.xavierdupre.fr/app/onnxcustom/helpsphinx/gyexamples/plot_transformer_discrepancy.html

It appears that the issue the author raised with sklearn is still open:
scikit-learn/scikit-learn#13733

I would use one of the two workarounds from the first article, but they both depend on the mlprodict package, which appears to be archived, so I am hesitant to depend on it.

So I am asking, is there a recommended workaround? Also, why is there no mention of this issue in the official docs?

@geofforigin8 geofforigin8 added the question Questions about ONNX label Apr 11, 2024
@justinchuby justinchuby added the operator Issues related to ONNX operators label Apr 11, 2024
@justinchuby
Contributor

cc @xadupre @gramalingam

@xadupre
Contributor

xadupre commented Apr 11, 2024

mlprodict is archived because its main functionality, a Python runtime for ONNX called OnnxInference, was moved into the onnx package as ReferenceEvaluator. I did not write an example about the trick I put in place to avoid ambiguities, but it is maintained in the unit tests: https://github.com/onnx/sklearn-onnx/blob/main/tests/test_sklearn_text.py#L11. This issue is not mentioned in the ONNX documentation because the ambiguity comes from scikit-learn, not ONNX. This case usually doesn't happen unless the n-grams contain spaces; in that case, the converter gets confused.
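The ambiguity Xavier describes can be illustrated without any ONNX code at all (a hypothetical sketch, not taken from skl2onnx): scikit-learn stores n-grams in vocabulary_ as whitespace-joined strings, so once a single token itself contains a space, two different token sequences can collide on the same key.

```python
# Hypothetical illustration of the ambiguity: scikit-learn's vocabulary_
# keys are whitespace-joined token sequences, so different n-grams collide.
bigram = ("new york", "city")      # a 2-gram whose first token contains a space
trigram = ("new", "york", "city")  # an ordinary 3-gram

key_a = " ".join(bigram)
key_b = " ".join(trigram)

# Both map to the same vocabulary key, so a converter reading vocabulary_
# alone cannot recover the original token boundaries.
assert key_a == key_b == "new york city"
```

As I understand it, this is the motivation for the Traceable* vectorizers in skl2onnx: they keep the n-grams as token tuples so the converter does not have to guess the boundaries.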

@geofforigin8
Author

geofforigin8 commented Apr 11, 2024

Thank you for the quick reply! Using the test cases as a model, I wrote a sample program, but I was not having success with any of the min_df, max_df, or max_features parameters.

It looks like convert_sklearn_text_vectorizer expects stop_words_ to be a set of strings, but when min_df et al. are set, it becomes a set of tuples; as a result, running to_onnx fails with an error like this:
TypeError: One stop word is not a string ('is', 'this') in stop_words={('is', 'this'), ...
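One possible workaround (a sketch of my own, not an official skl2onnx API) is to normalize stop_words_ before conversion, joining tuple entries back into whitespace-separated strings so the converter's type check passes:

```python
# Hedged workaround sketch: when min_df/max_df/max_features are set, the
# traceable vectorizers record pruned n-grams in stop_words_ as tuples of
# tokens, while the skl2onnx converter expects plain strings.
def sanitize_stop_words(stop_words):
    """Join tuple n-grams back into whitespace-separated strings."""
    return {w if isinstance(w, str) else " ".join(w) for w in stop_words}

# e.g. the tuple from the TypeError above becomes a single string
assert sanitize_stop_words({("is", "this"), "the"}) == {"is this", "the"}
```

Note that joining with a space is subject to the same ambiguity discussed above, so this only patches the type error, not the underlying discrepancy.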

@geofforigin8
Author

geofforigin8 commented Apr 11, 2024

I've sort of fixed my problem by setting stop_words_ to an empty set and moving on. However, I'm still seeing some discrepancies, and I tracked them down to what looks like single-character tokens not being skipped by ONNX.

To reproduce:

import numpy as np
import skl2onnx
from skl2onnx.sklapi import TraceableTfidfVectorizer, TraceableCountVectorizer
from skl2onnx import update_registered_converter
from skl2onnx.shape_calculators.text_vectorizer import (
    calculate_sklearn_text_vectorizer_output_shapes,
)
from skl2onnx.operator_converters.text_vectoriser import convert_sklearn_text_vectorizer
from skl2onnx.operator_converters.tfidf_vectoriser import convert_sklearn_tfidf_vectoriser
from skl2onnx.common.data_types import StringTensorType
from numpy.testing import assert_almost_equal
from onnxruntime import InferenceSession


update_registered_converter(
    TraceableCountVectorizer,
    "Skl2onnxTraceableCountVectorizer",
    calculate_sklearn_text_vectorizer_output_shapes,
    convert_sklearn_text_vectorizer,
    options={
        "tokenexp": None,
        "separators": None,
        "nan": [True, False],
        "keep_empty_string": [True, False],
        "locale": None,
    },
)

update_registered_converter(
    TraceableTfidfVectorizer,
    "Skl2onnxTraceableTfidfVectorizer",
    calculate_sklearn_text_vectorizer_output_shapes,
    convert_sklearn_tfidf_vectoriser,
    options={
        "tokenexp": None,
        "separators": None,
        "nan": [True, False],
        "keep_empty_string": [True, False],
        "locale": None,
    },
)

corpus = np.array([
        "This is the first document.",
        "This document is the second document.",
        "And this is the third one.",
        "Is this the first document?",
        "first Z document", # with this line present, the last assertion fails because ONNX still sees the "Z" token when it shouldn't?
        "",
    ]
)

FEATURES = 12

vect = TraceableTfidfVectorizer(ngram_range=(1,2), max_features=FEATURES)
vect.fit(corpus)

print('vocabulary:', vect.vocabulary_)
print('stop_words:', vect.stop_words_)

A = vect.transform(corpus).todense()

vect.stop_words_ = set() # should be safe right?

B = vect.transform(corpus).todense()
assert_almost_equal(A,B)  # make sure vectorizer still does what we expect

model_onnx = skl2onnx.to_onnx(
    vect,
    "TfidfVectorizer",
    initial_types=[("input", StringTensorType([None, 1]))],
    target_opset=12,
)

sess = InferenceSession(model_onnx.SerializeToString(), providers=['CPUExecutionProvider'])

inputs = {"input": corpus.reshape((-1, 1))}

out = sess.run(None, inputs)[0]
assert out.shape == (len(corpus), FEATURES)
assert_almost_equal(out.ravel(), A.A1)

sklearn gives the same vector for both strings, as expected:
vect.transform(["first document", "first Z document"]).todense()

matrix([[0.        , 0.51822427, 0.60474937, 0.60474937, 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        ],
        [0.        , 0.51822427, 0.60474937, 0.60474937, 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        ]])

And the model from ONNX only matches the result for the first string:
sess.run(None, {'input': [["first document"], ["first Z document"]]})[0]

array([[0.        , 0.5182243 , 0.6047494 , 0.6047494 , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ],
       [0.        , 0.65069556, 0.7593387 , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ]], dtype=float32)

@gramalingam
Contributor

Hi, I haven't understood the details here. ONNX is a specification; the issue could be either on the conversion side or a bug in the onnxruntime implementation. If it is the latter, I would recommend filing an issue in the onnxruntime repo. Do you know which it is?

Xavier might have a better idea (he is off this week).

@geofforigin8
Author

geofforigin8 commented Apr 19, 2024

Nope, I did not know. I don't have a full picture of all the related projects, their ownership, and the relations between them. Thanks for the link though; I thought I had found a related issue reported there, but I don't think that's it.

I was able to resolve my issue by explicitly passing "tokenexp": r"\b\w\w+\b" in the options. I don't think this substitution is correct:
https://github.com/onnx/sklearn-onnx/blob/d2029c1a9752f62a63fc5c4447b4d9fe75e8fe39/skl2onnx/operator_converters/text_vectoriser.py#L238

It should be something like [a-zA-Z0-9_]{2,} or \b\w\w+\b. Is the reason for this replacement that the ONNX runtime chokes on the (?u) that sklearn has in its default token_pattern?
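For what it's worth, the two candidate patterns behave the same on ASCII input, and in Python 3 the (?u) flag is a no-op because str patterns are Unicode by default, so only the ONNX runtime's regex engine should care about it. A quick check with Python's re module (illustrative only; the runtime may use a different engine):

```python
import re

# sklearn's default token_pattern is r"(?u)\b\w\w+\b": tokens of 2+ word
# characters. In Python 3 the (?u) flag is redundant for str patterns, so
# the pattern below is equivalent to the default.
sk_pattern = re.compile(r"\b\w\w+\b")
ascii_pattern = re.compile(r"[a-zA-Z0-9_]{2,}")

text = "first Z document"
# Both patterns correctly drop the single-character token "Z".
assert sk_pattern.findall(text) == ["first", "document"]
assert ascii_pattern.findall(text) == ["first", "document"]
```

The two patterns diverge on non-ASCII word characters (\w matches them, the explicit ASCII class does not), which is worth keeping in mind for non-English corpora.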
