scikit-learn changelog
May 2022
- Avoid timeouts in datasets.fetch_openml by not passing a timeout argument. #23358 by Loïc Estève <lesteve>.
- Avoid a spurious warning in decomposition.IncrementalPCA when n_samples == n_components. #23264 by Lucy Liu <lucyleeow>.
- The partial_fit method of feature_selection.SelectFromModel now validates the max_features and feature_names_in parameters. #23299 by Long Bao <lorentzbao>.
- Fixes metrics.precision_recall_curve to compute precision-recall at 100% recall. The precision-recall curve now displays the last point, corresponding to a classifier that always predicts the positive class: recall=100% and precision=class balance. #23214 by Stéphane Collot <stephanecollot> and Max Baak <mbaak>.
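A quick way to see the endpoint conventions of the returned curve; a minimal sketch with made-up scores, assuming a scikit-learn version that includes this fix:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy binary problem with hypothetical scores, just to inspect the
# endpoints of the returned curve.
y_true = np.array([0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# The curve starts at full recall (everything predicted positive) and,
# by convention, ends at recall=0.0 with precision=1.0.
```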
- preprocessing.PolynomialFeatures with degree equal to 0 now raises an error when include_bias is set to False, and outputs a single constant array when include_bias is set to True. #23370 by Zhehao Liu <MaxwellLZH>.
- Fixes a performance regression with low-cardinality features for tree.DecisionTreeClassifier, tree.DecisionTreeRegressor, ensemble.RandomForestClassifier, ensemble.RandomForestRegressor, ensemble.GradientBoostingClassifier, and ensemble.GradientBoostingRegressor. #23410 by Loïc Estève <lesteve>.
- utils.class_weight.compute_sample_weight now works with sparse y. #23115 by kernc <kernc>.
Version 1.1.0 (May 2022)

For a short description of the main highlights of the release, please refer to sphx_glr_auto_examples_release_highlights_plot_release_highlights_1_1_0.py.

Version 1.1.0 of scikit-learn requires Python 3.8+, NumPy 1.17.3+ and SciPy 1.3.2+. The optional minimal dependency is Matplotlib 3.1.2+.
Changed models

The following estimators and functions, when fit with the same data and parameters, may produce different models from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in random sampling procedures.

- cluster.KMeans now defaults to algorithm="lloyd" instead of algorithm="auto", which was equivalent to algorithm="elkan". Lloyd's algorithm and Elkan's algorithm converge to the same solution, up to numerical rounding errors, but in general Lloyd's algorithm uses much less memory, and it is often faster.
- Fitting tree.DecisionTreeClassifier, tree.DecisionTreeRegressor, ensemble.RandomForestClassifier, ensemble.RandomForestRegressor, ensemble.GradientBoostingClassifier, and ensemble.GradientBoostingRegressor is on average 15% faster than in previous versions thanks to a new sort algorithm to find the best split. Models might be different because of a different handling of splits with tied criterion values: both the old and the new sorting algorithm are unstable sorting algorithms. #22868 by Thomas Fan.
- The eigenvectors initialization for cluster.SpectralClustering and manifold.SpectralEmbedding now samples from a Gaussian when using the 'amg' or 'lobpcg' solver. This change improves the numerical stability of the solver, but may result in a different model.
- feature_selection.f_regression and feature_selection.r_regression now return a finite score by default instead of np.nan and np.inf in some corner cases. Pass force_finite=False if you really want non-finite values and the old behavior.
- Pandas DataFrames with all non-string columns, such as a MultiIndex, no longer warn when passed into an estimator. Estimators will continue to ignore the column names in DataFrames with non-string columns. For feature_names_in_ to be defined, columns must all be strings. #22410 by Thomas Fan.
- preprocessing.KBinsDiscretizer changed the handling of bin edges slightly, which might result in a different encoding with the same data.
- calibration.calibration_curve changed the handling of bin edges slightly, which might result in a different output curve given the same data.
- discriminant_analysis.LinearDiscriminantAnalysis now uses the correct variance-scaling coefficient, which may result in different model behavior.
- feature_selection.SelectFromModel.fit and feature_selection.SelectFromModel.partial_fit can now be called with prefit=True. estimators_ will be a deep copy of estimator when prefit=True. #23271 by Guillaume Lemaitre <glemaitre>.
Low-level routines for reductions on pairwise distances for dense float64 datasets have been refactored. The following functions and estimators now benefit from improved performance in terms of hardware scalability and speed-ups:
sklearn.metrics.pairwise_distances_argmin
sklearn.metrics.pairwise_distances_argmin_min
sklearn.cluster.AffinityPropagation
sklearn.cluster.Birch
sklearn.cluster.MeanShift
sklearn.cluster.OPTICS
sklearn.cluster.SpectralClustering
sklearn.feature_selection.mutual_info_regression
sklearn.neighbors.KNeighborsClassifier
sklearn.neighbors.KNeighborsRegressor
sklearn.neighbors.RadiusNeighborsClassifier
sklearn.neighbors.RadiusNeighborsRegressor
sklearn.neighbors.LocalOutlierFactor
sklearn.neighbors.NearestNeighbors
sklearn.manifold.Isomap
sklearn.manifold.LocallyLinearEmbedding
sklearn.manifold.TSNE
sklearn.manifold.trustworthiness
sklearn.semi_supervised.LabelPropagation
sklearn.semi_supervised.LabelSpreading
For instance sklearn.neighbors.NearestNeighbors.kneighbors and sklearn.neighbors.NearestNeighbors.radius_neighbors can respectively be up to ×20 and ×5 faster than previously. #21987, #22064, #22065, #22288 and #22320 by Julien Jerphanion <jjerphan>.

- All scikit-learn models now generate a more informative error message when some input contains unexpected NaN or infinite values. In particular, the message contains the input name ("X", "y" or "sample_weight") and, if an unexpected NaN value is found in X, the error message suggests potential solutions. #21219 by Olivier Grisel <ogrisel>.
- All scikit-learn models now generate a more informative error message when setting invalid hyper-parameters with set_params. #21542 by Olivier Grisel <ogrisel>.
- Removes random unique identifiers in the HTML representation. With this change, Jupyter notebooks are reproducible as long as the cells are run in the same order. #23098 by Thomas Fan.
- Estimators with the non_deterministic tag set to True will skip both the check_methods_sample_order_invariance and check_methods_subset_invariance tests. #22318 by Zhehao Liu <MaxwellLZH>.
- The option for using the log loss, aka binomial or multinomial deviance, via the loss parameter was made more consistent. The preferred way is by setting the value to "log_loss". Old option names are still valid and produce the same models, but are deprecated and will be removed in version 1.3.
  - For ensemble.GradientBoostingClassifier, the loss parameter name "deviance" is deprecated in favor of the new name "log_loss", which is now the default. #23036 by Christian Lorentzen <lorentzenchr>.
  - For ensemble.HistGradientBoostingClassifier, the loss parameter names "auto", "binary_crossentropy" and "categorical_crossentropy" are deprecated in favor of the new name "log_loss", which is now the default. #23040 by Christian Lorentzen <lorentzenchr>.
  - For linear_model.SGDClassifier, the loss parameter name "log" is deprecated in favor of the new name "log_loss". #23046 by Christian Lorentzen <lorentzenchr>.
- Rich HTML representation of estimators is now enabled by default in Jupyter notebooks. It can be deactivated by setting display='text' in sklearn.set_config. #22856 by Jérémie du Boisberranger <jeremiedbb>.
- The error message is improved when importing model_selection.HalvingGridSearchCV, model_selection.HalvingRandomSearchCV, or impute.IterativeImputer without importing the experimental flag. #23194 by Thomas Fan.
- Added an extension in doc/conf.py to automatically generate the list of estimators that handle NaN values. #23198 by Lise Kleiber, Zhehao Liu <MaxwellLZH> and Chiara Marmo <cmarmo>.
sklearn.calibration

- calibration.calibration_curve accepts a parameter pos_label to specify the positive class label. #21032 by Guillaume Lemaitre <glemaitre>.
- calibration.CalibratedClassifierCV.fit now supports passing fit_params, which are routed to the base_estimator. #18170 by Benjamin Bossan <BenjaminBossan>.
- calibration.CalibrationDisplay accepts a parameter pos_label to add this information to the plot. #21038 by Guillaume Lemaitre <glemaitre>.
- calibration.calibration_curve now handles bin edges more consistently. #14975 by Andreas Müller and #22526 by Meekail Zain <micky774>.
- calibration.calibration_curve's normalize parameter is now deprecated and will be removed in version 1.3. It is recommended that a proper probability (i.e. a classifier's predict_proba positive class) is used for y_prob. #23095 by Jordan Silke <jsilke>.
sklearn.cluster

- Added cluster.BisectingKMeans, introducing the Bisecting K-Means algorithm. #20031 by Michal Krawczyk <michalkrawczyk>, Tom Dupre la Tour <TomDLT> and Jérémie du Boisberranger <jeremiedbb>.
- cluster.SpectralClustering and cluster.spectral_clustering now include the new 'cluster_qr' method, which clusters samples in the embedding space as an alternative to the existing 'kmeans' and 'discrete' methods. See cluster.spectral_clustering for more details. #21148 by Andrew Knyazev <lobpcg>.
- Adds get_feature_names_out to cluster.Birch, cluster.FeatureAgglomeration, cluster.KMeans, cluster.MiniBatchKMeans. #22255 by Thomas Fan.
- cluster.SpectralClustering now raises consistent error messages when passed invalid values for n_clusters, n_init, gamma, n_neighbors, eigen_tol or degree. #21881 by Hugo Vassard <hvassard>.
- cluster.AffinityPropagation now returns cluster centers and labels if they exist, even if the model has not fully converged. When returning these potentially-degenerate cluster centers and labels, a new warning message is shown. If no cluster centers were constructed, then the cluster centers remain an empty list with labels set to -1 and the original warning message is shown. #22217 by Meekail Zain <micky774>.
- In cluster.KMeans, the default algorithm is now "lloyd", which is the full classical EM-style algorithm. Both "auto" and "full" are deprecated and will be removed in version 1.3. They are now aliases for "lloyd". The previous default was "auto", which relied on Elkan's algorithm. Lloyd's algorithm uses less memory than Elkan's, it is faster on many datasets, and its results are identical, hence the change. #21735 by Aurélien Geron <ageron>.
- cluster.KMeans's init parameter now properly supports array-like input and NumPy string scalars. #22154 by Thomas Fan.
sklearn.compose

- compose.ColumnTransformer now removes validation errors from the __init__ and set_params methods. #22537 by iofall <iofall> and Arisa Y. <arisayosh>.
- get_feature_names_out functionality in compose.ColumnTransformer was broken when columns were specified using slice. This is fixed in #22775 and #22913 by randomgeek78 <randomgeek78>.
sklearn.covariance

- covariance.GraphicalLassoCV now accepts NumPy arrays for the parameter alphas. #22493 by Guillaume Lemaitre <glemaitre>.
sklearn.cross_decomposition

- The inverse_transform method of cross_decomposition.PLSRegression, cross_decomposition.PLSCanonical and cross_decomposition.CCA now allows reconstruction of an X target when a Y parameter is given. #19680 by Robin Thibaut <robinthibaut>.
- Adds get_feature_names_out to all transformers in the sklearn.cross_decomposition module: cross_decomposition.CCA, cross_decomposition.PLSSVD, cross_decomposition.PLSRegression, and cross_decomposition.PLSCanonical. #22119 by Thomas Fan.
- The shape of the coef_ attribute of cross_decomposition.CCA, cross_decomposition.PLSCanonical and cross_decomposition.PLSRegression will change in version 1.3, from (n_features, n_targets) to (n_targets, n_features), to be consistent with other linear models and to make it work with interfaces expecting a specific shape for coef_ (e.g. feature_selection.RFE). #22016 by Guillaume Lemaitre <glemaitre>.
- Added the fitted attribute intercept_ to cross_decomposition.PLSCanonical, cross_decomposition.PLSRegression, and cross_decomposition.CCA. The method predict is indeed equivalent to Y = X @ coef_ + intercept_. #22015 by Guillaume Lemaitre <glemaitre>.
sklearn.datasets

- datasets.load_files now accepts an ignore list and an allow list based on file extensions. #19747 by Tony Attalla <tonyattalla> and #22498 by Meekail Zain <micky774>.
- datasets.make_swiss_roll now supports the optional argument hole; when set to True, it returns the swiss-hole dataset. #21482 by Sebastian Pujalte <pujaltes>.
- datasets.make_blobs no longer copies data during the generation process, and therefore uses less memory. #22412 by Zhehao Liu <MaxwellLZH>.
- datasets.load_diabetes now accepts the parameter scaled, to allow loading unscaled data. The scaled version of this dataset is now computed from the unscaled data, and can produce slightly different results than in previous versions (within a 1e-4 absolute tolerance). #16605 by Mandy Gu <happilyeverafter95>.
- datasets.fetch_openml now has two optional arguments, n_retries and delay. By default, datasets.fetch_openml will retry 3 times in case of a network failure, with a delay between each try. #21901 by Rileran <rileran>.
- datasets.fetch_covtype is now concurrent-safe: data is downloaded to a temporary directory before being moved to the data directory. #23113 by Ilion Beyst <iasoon>.
- datasets.make_sparse_coded_signal now accepts a parameter data_transposed to explicitly specify the shape of matrix X. The default behavior True is to return a transposed matrix X corresponding to a (n_features, n_samples) shape. The default value will change to False in version 1.3. #21425 by Gabriel Stefanini Vicente <g4brielvs>.
sklearn.decomposition

- Added a new estimator decomposition.MiniBatchNMF. It is a faster but less accurate version of non-negative matrix factorization, better suited for large datasets. #16948 by Chiara Marmo <cmarmo>, Patricio Cerda <pcerda> and Jérémie du Boisberranger <jeremiedbb>.
- decomposition.dict_learning, decomposition.dict_learning_online and decomposition.sparse_encode preserve dtype for numpy.float32. decomposition.DictionaryLearning, decomposition.MiniBatchDictionaryLearning and decomposition.SparseCoder preserve dtype for numpy.float32. #22002 by Takeshi Oura <takoika>.
- decomposition.PCA exposes a parameter n_oversamples to tune utils.randomized_svd and get accurate results when the number of features is large. #21109 by Smile <x-shadow-man>.
- decomposition.MiniBatchDictionaryLearning and decomposition.dict_learning_online have been refactored and now have a stopping criterion based on a small change of the dictionary or objective function, controlled by the new max_iter, tol and max_no_improvement parameters. In addition, some of their parameters and attributes are deprecated.
  - The n_iter parameter of both is deprecated. Use max_iter instead.
  - The iter_offset, return_inner_stats, inner_stats and return_n_iter parameters of decomposition.dict_learning_online serve internal purposes and are deprecated.
  - The inner_stats_, iter_offset_ and random_state_ attributes of decomposition.MiniBatchDictionaryLearning serve internal purposes and are deprecated.
  - The default value of the batch_size parameter of both will change from 3 to 256 in version 1.3.
  #18975 by Jérémie du Boisberranger <jeremiedbb>.
- decomposition.SparsePCA and decomposition.MiniBatchSparsePCA preserve dtype for numpy.float32. #22111 by Takeshi Oura <takoika>.
- decomposition.TruncatedSVD now allows n_components == n_features, if algorithm='randomized'. #22181 by Zach Deane-Mayer <zachmayer>.
- Adds get_feature_names_out to all transformers in the sklearn.decomposition module: decomposition.DictionaryLearning, decomposition.FactorAnalysis, decomposition.FastICA, decomposition.IncrementalPCA, decomposition.KernelPCA, decomposition.LatentDirichletAllocation, decomposition.MiniBatchDictionaryLearning, decomposition.MiniBatchSparsePCA, decomposition.NMF, decomposition.PCA, decomposition.SparsePCA, and decomposition.TruncatedSVD. #21334 by Thomas Fan.
- decomposition.TruncatedSVD exposes the parameters n_oversamples and power_iteration_normalizer to tune utils.randomized_svd and get accurate results when the number of features is large, the rank of the matrix is high, or other features of the matrix make low-rank approximation difficult. #21705 by Jay S. Stanley III <stanleyjs>.
- decomposition.PCA exposes the parameter power_iteration_normalizer to tune utils.randomized_svd and get more accurate results when low-rank approximation is difficult. #21705 by Jay S. Stanley III <stanleyjs>.
- decomposition.FastICA now validates input parameters in fit instead of __init__. #21432 by Hannah Bohle <hhnnhh> and Maren Westermann <marenwestermann>.
- decomposition.FastICA now accepts np.float32 data without silent upcasting. The dtype is preserved by fit and fit_transform, and the main fitted attributes use a dtype of the same precision as the training data. #22806 by Jihane Bennis <JihaneBennis> and Olivier Grisel <ogrisel>.
- decomposition.FactorAnalysis now validates input parameters in fit instead of __init__. #21713 by Haya <HayaAlmutairi> and Krum Arnaudov <krumeto>.
- decomposition.KernelPCA now validates input parameters in fit instead of __init__. #21567 by Maggie Chege <MaggieChege>.
- decomposition.PCA and decomposition.IncrementalPCA more safely calculate precision using the inverse of the covariance matrix if self.noise_variance_ is zero. #22300 by Meekail Zain <micky774> and #15948 by sysuresh.
- Greatly reduced peak memory usage in decomposition.PCA when calling fit or fit_transform. #22553 by Meekail Zain <micky774>.
- decomposition.FastICA now supports unit variance for whitening. The default value of its whiten argument will change from True (which behaves like 'arbitrary-variance') to 'unit-variance' in version 1.3. #19490 by Facundo Ferrin <fferrin> and Julien Jerphanion <jjerphan>.
sklearn.discriminant_analysis

- Adds get_feature_names_out to discriminant_analysis.LinearDiscriminantAnalysis. #22120 by Thomas Fan.
- discriminant_analysis.LinearDiscriminantAnalysis now uses the correct variance-scaling coefficient, which may result in different model behavior. #15984 by Okon Samuel <OkonSamuel> and #22696 by Meekail Zain <micky774>.
sklearn.dummy

- dummy.DummyRegressor no longer overrides the constant parameter during fit. #22486 by Thomas Fan.
sklearn.ensemble

- Added the additional option loss="quantile" to ensemble.HistGradientBoostingRegressor for modelling quantiles. The quantile level can be specified with the new parameter quantile. #21800 and #20567 by Christian Lorentzen <lorentzenchr>.
- fit of ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor now calls utils.check_array with parameter force_all_finite=False for non-initial warm-start runs, as the input has already been checked before. #22159 by Geoffrey Paris <Geoffrey-Paris>.
- ensemble.HistGradientBoostingClassifier is faster, for binary and in particular for multiclass problems, thanks to the new private loss function module. #20811, #20567 and #21814 by Christian Lorentzen <lorentzenchr>.
- Adds support for pre-fit models with cv="prefit" in ensemble.StackingClassifier and ensemble.StackingRegressor. #16748 by Siqi He <siqi-he> and #22215 by Meekail Zain <micky774>.
- ensemble.RandomForestClassifier and ensemble.ExtraTreesClassifier have the new criterion="log_loss", which is equivalent to criterion="entropy". #23047 by Christian Lorentzen <lorentzenchr>.
- Adds get_feature_names_out to ensemble.VotingClassifier, ensemble.VotingRegressor, ensemble.StackingClassifier, and ensemble.StackingRegressor. #22695 and #22697 by Thomas Fan.
- ensemble.RandomTreesEmbedding now has an informative get_feature_names_out function that includes both tree index and leaf index in the output feature names. #21762 by Zhehao Liu <MaxwellLZH> and Thomas Fan.
- Fitting an ensemble.RandomForestClassifier, ensemble.RandomForestRegressor, ensemble.ExtraTreesClassifier, ensemble.ExtraTreesRegressor, or ensemble.RandomTreesEmbedding is now faster in a multiprocessing setting, especially for subsequent fits with warm_start enabled. #22106 by Pieter Gijsbers <PGijsbers>.
- Changed the parameter validation_fraction in ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor so that an error is raised if anything other than a float is passed in as an argument. #21632 by Genesis Valencia <genvalen>.
- Removed a potential source of CPU oversubscription in ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor when CPU resource usage is limited, for instance using cgroups quota in a Docker container. #22566 by Jérémie du Boisberranger <jeremiedbb>.
- ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor no longer warn when fitting on a pandas DataFrame with a non-default scoring parameter and early_stopping enabled. #22908 by Thomas Fan.
- Fixes the HTML repr for ensemble.StackingClassifier and ensemble.StackingRegressor. #23097 by Thomas Fan.
- The attribute loss_ of ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor has been deprecated and will be removed in version 1.3. #23079 by Christian Lorentzen <lorentzenchr>.
- Changed the default of max_features to 1.0 for ensemble.RandomForestRegressor and to "sqrt" for ensemble.RandomForestClassifier. Note that these give the same fit results as before, but are much easier to understand. The old default value "auto" has been deprecated and will be removed in version 1.3. The same changes are also applied to ensemble.ExtraTreesRegressor and ensemble.ExtraTreesClassifier. #20803 by Brian Sun <bsun94>.
- Improved runtime performance of ensemble.IsolationForest by skipping repetitive input checks. #23149 by Zhehao Liu <MaxwellLZH>.
sklearn.feature_extraction

- feature_extraction.FeatureHasher now supports PyPy. #23023 by Thomas Fan.
- feature_extraction.FeatureHasher now validates input parameters in transform instead of __init__. #21573 by Hannah Bohle <hhnnhh> and Maren Westermann <marenwestermann>.
- feature_extraction.text.TfidfVectorizer now does not create a feature_extraction.text.TfidfTransformer at __init__, as required by our API. #21832 by Guillaume Lemaitre <glemaitre>.
sklearn.feature_selection

- Added an auto mode to feature_selection.SequentialFeatureSelector. If the argument n_features_to_select is 'auto', features are selected until the score improvement does not exceed the argument tol. The default value of n_features_to_select changed from None to 'warn' in 1.1 and will become 'auto' in 1.3. None and 'warn' will be removed in 1.3. #20145 by murata-yu <murata-yu>.
- Added the ability to pass callables to the max_features parameter of feature_selection.SelectFromModel. Also introduced the new attribute max_features_, which is inferred from max_features and the data during fit. If max_features is an integer, then max_features_ = max_features. If max_features is a callable, then max_features_ = max_features(X). #22356 by Meekail Zain <micky774>.
- feature_selection.GenericUnivariateSelect preserves float32 dtype. #18482 by Thierry Gameiro <titigmr> and Daniel Kharsa <aflatoune> and #22370 by Meekail Zain <micky774>.
- Added a parameter force_finite to feature_selection.f_regression and feature_selection.r_regression. This parameter allows forcing the output to be finite in the case where a feature or the target is constant, or where the feature and target are perfectly correlated (only for the F-statistic). #17819 by Juan Carlos Alfaro Jiménez <alfaro96>.
- Improved runtime performance of feature_selection.chi2 with boolean arrays. #22235 by Thomas Fan.
- Reduced memory usage of feature_selection.chi2. #21837 by Louis Wagner <lrwagner>.
sklearn.gaussian_process

- The predict and sample_y methods of gaussian_process.GaussianProcessRegressor now return arrays of the correct shape in single-target and multi-target cases, and for both normalize_y=False and normalize_y=True. #22199 by Guillaume Lemaitre <glemaitre>, Aidar Shakerimoff <AidarShakerimoff> and Tenavi Nakamura-Zimmerer <Tenavi>.
- gaussian_process.GaussianProcessClassifier raises a more informative error if a CompoundKernel is passed via kernel. #22223 by MarcoM <marcozzxx810>.
sklearn.impute

- impute.SimpleImputer now warns with feature names when features are skipped due to the lack of any observed values in the training set. #21617 by Christian Ritter <chritter>.
- Added support for pd.NA in impute.SimpleImputer. #21114 by Ying Xiong <yxiong>.
- Adds get_feature_names_out to impute.SimpleImputer, impute.KNNImputer, impute.IterativeImputer, and impute.MissingIndicator. #21078 by Thomas Fan.
- The verbose parameter was deprecated for impute.SimpleImputer. A warning will always be raised upon the removal of empty columns. #21448 by Oleh Kozynets <OlehKSS> and Christian Ritter <chritter>.
sklearn.inspection

- Added a display to plot the decision boundary of a classifier by using the method inspection.DecisionBoundaryDisplay.from_estimator. #16061 by Thomas Fan.
- In inspection.PartialDependenceDisplay.from_estimator, allow kind to accept a list of strings to specify which type of plot to draw for each feature interaction. #19438 by Guillaume Lemaitre <glemaitre>.
- inspection.PartialDependenceDisplay.from_estimator, inspection.PartialDependenceDisplay.plot, and inspection.plot_partial_dependence now support plotting centered Individual Conditional Expectation (cICE) and centered PDP curves, controlled by setting the parameter centered. #18310 by Johannes Elfner <JoElfner> and Guillaume Lemaitre <glemaitre>.
sklearn.isotonic

- Adds get_feature_names_out to isotonic.IsotonicRegression. #22249 by Thomas Fan.
sklearn.kernel_approximation

- Adds get_feature_names_out to kernel_approximation.AdditiveChi2Sampler, kernel_approximation.Nystroem, kernel_approximation.PolynomialCountSketch, kernel_approximation.RBFSampler, and kernel_approximation.SkewedChi2Sampler. #22137 and #22694 by Thomas Fan.
sklearn.linear_model

- linear_model.ElasticNet, linear_model.ElasticNetCV, linear_model.Lasso and linear_model.LassoCV support sample_weight for sparse input X. #22808 by Christian Lorentzen <lorentzenchr>.
- linear_model.Ridge with solver="lsqr" now supports fitting on sparse input with fit_intercept=True. #22950 by Christian Lorentzen <lorentzenchr>.
- linear_model.QuantileRegressor supports sparse input for the highs-based solvers. #21086 by Venkatachalam Natchiappan <venkyyuvy>. In addition, those solvers now use the CSC matrix right from the beginning, which speeds up fitting. #22206 by Christian Lorentzen <lorentzenchr>.
- linear_model.LogisticRegression is faster for solver="lbfgs" and solver="newton-cg", for binary and in particular for multiclass problems, thanks to the new private loss function module. In the multiclass case, the memory consumption has also been reduced for these solvers, as the target is now label encoded (mapped to integers) instead of label binarized (one-hot encoded). The more classes, the larger the benefit. #21808, #20567 and #21814 by Christian Lorentzen <lorentzenchr>.
- linear_model.GammaRegressor, linear_model.PoissonRegressor and linear_model.TweedieRegressor are faster for solver="lbfgs". #22548, #21808 and #20567 by Christian Lorentzen <lorentzenchr>.
- Renamed the parameter base_estimator to estimator in linear_model.RANSACRegressor to improve readability and consistency. base_estimator is deprecated and will be removed in 1.3. #22062 by Adrian Trujillo <trujillo9616>.
- linear_model.ElasticNet and other linear model classes using coordinate descent show error messages when non-finite parameter weights are produced. #22148 by Christian Ritter <chritter> and Norbert Preining <norbusan>.
- linear_model.ElasticNet and linear_model.Lasso now raise consistent error messages when passed invalid values for l1_ratio, alpha, max_iter and tol. #22240 by Arturo Amor <ArturoAmorQ>.
- linear_model.BayesianRidge and linear_model.ARDRegression now preserve float32 dtype. #9087 by Arthur Imbert <Henley13> and #22525 by Meekail Zain <micky774>.
- linear_model.RidgeClassifier now supports multilabel classification. #19689 by Guillaume Lemaitre <glemaitre>.
- linear_model.RidgeCV and linear_model.RidgeClassifierCV now raise a consistent error message when passed invalid values for alphas. #21606 by Arturo Amor <ArturoAmorQ>.
- linear_model.Ridge and linear_model.RidgeClassifier now raise a consistent error message when passed invalid values for alpha, max_iter and tol. #21341 by Arturo Amor <ArturoAmorQ>.
- linear_model.orthogonal_mp_gram preserves dtype for numpy.float32. #22002 by Takeshi Oura <takoika>.
- linear_model.LassoLarsIC now correctly computes AIC and BIC. An error is now raised when n_features > n_samples and the noise variance is not provided. #21481 by Guillaume Lemaitre <glemaitre> and Andrés Babino <ababino>.
- linear_model.TheilSenRegressor now validates the input parameter max_subpopulation in fit instead of __init__. #21767 by Maren Westermann <marenwestermann>.
- linear_model.ElasticNetCV now produces the correct warning when l1_ratio=0. #21724 by Yar Khine Phyo <yarkhinephyo>.
- linear_model.LogisticRegression and linear_model.LogisticRegressionCV now set the n_iter_ attribute with a shape that respects the docstring and that is consistent with the shape obtained when using the other solvers in the one-vs-rest setting. Previously, it would record only the maximum of the number of iterations for each binary sub-problem, while now all of them are recorded. #21998 by Olivier Grisel <ogrisel>.
- The property family of linear_model.TweedieRegressor is no longer validated in __init__. Instead, this (private) property is deprecated in linear_model.GammaRegressor, linear_model.PoissonRegressor and linear_model.TweedieRegressor, and will be removed in 1.3. #22548 by Christian Lorentzen <lorentzenchr>.
- The coef_ and intercept_ attributes of linear_model.LinearRegression are now correctly computed in the presence of sample weights when the input is sparse. #22891 by Jérémie du Boisberranger <jeremiedbb>.
- The coef_ and intercept_ attributes of linear_model.Ridge with solver="sparse_cg" and solver="lbfgs" are now correctly computed in the presence of sample weights when the input is sparse. #22899 by Jérémie du Boisberranger <jeremiedbb>.
- linear_model.SGDRegressor and linear_model.SGDClassifier now compute the validation error correctly when early stopping is enabled. #23256 by Zhehao Liu <MaxwellLZH>.
- linear_model.LassoLarsIC now exposes noise_variance as a parameter in order to provide an estimate of the noise variance. This is particularly relevant when n_features > n_samples and the estimator of the noise variance cannot be computed. #21481 by Guillaume Lemaitre <glemaitre>.
sklearn.manifold

- manifold.Isomap now supports radius-based neighbors via the radius argument. #19794 by Zhehao Liu <MaxwellLZH>.
- manifold.spectral_embedding and manifold.SpectralEmbedding support np.float32 dtype and will preserve this dtype. #21534 by Andrew Knyazev <lobpcg>.
- Adds get_feature_names_out to manifold.Isomap and manifold.LocallyLinearEmbedding. #22254 by Thomas Fan.
- Added metric_params to the manifold.TSNE constructor for additional parameters of the distance metric to use in optimization. #21805 by Jeanne Dionisi <jeannedionisi> and #22685 by Meekail Zain <micky774>.
- manifold.trustworthiness raises an error if n_neighbors >= n_samples / 2, to ensure correct support for the function. #18832 by Hong Shao Yang <hongshaoyang> and #23033 by Meekail Zain <micky774>.
- manifold.spectral_embedding now uses Gaussian instead of the previous uniform-on-[0, 1] random initial approximations to eigenvectors in the eigen_solvers lobpcg and amg, to improve their numerical stability. #21565 by Andrew Knyazev <lobpcg>.
sklearn.metrics

- metrics.r2_score and metrics.explained_variance_score have a new force_finite parameter. Setting this parameter to False will return the actual non-finite score in case of perfect predictions or constant y_true, instead of the finite approximation (1.0 and 0.0 respectively) currently returned by default. #17266 by Sylvain Marié <smarie>.
- metrics.d2_pinball_score and metrics.d2_absolute_error_score calculate the D2 regression score for the pinball loss and the absolute error respectively. metrics.d2_absolute_error_score is a special case of metrics.d2_pinball_score with a fixed quantile parameter alpha=0.5, for ease of use and discovery. The D2 scores are generalizations of the r2_score and can be interpreted as the fraction of deviance explained. #22118 by Ohad Michel <ohadmich>.
- metrics.top_k_accuracy_score raises an improved error message when y_true is binary and y_score is 2d. #22284 by Thomas Fan.
- metrics.roc_auc_score now supports average=None in the multiclass case when multi_class='ovr', which will return the score per class. #19158 by Nicki Skafte <SkafteNicki>.
- Adds an im_kw parameter to metrics.ConfusionMatrixDisplay.from_estimator, metrics.ConfusionMatrixDisplay.from_predictions, and metrics.ConfusionMatrixDisplay.plot. The im_kw parameter is passed to the matplotlib.pyplot.imshow call when plotting the confusion matrix. #20753 by Thomas Fan.
- metrics.silhouette_score now supports integer input for precomputed distances. #22108 by Thomas Fan.
- Fixed a bug in metrics.normalized_mutual_info_score which could return unbounded values. #22635 by Jérémie du Boisberranger <jeremiedbb>.
- Fixes metrics.precision_recall_curve and metrics.average_precision_score when true labels are all negative. #19085 by Varun Agrawal <varunagrawal>.
- metrics.SCORERS is now deprecated and will be removed in 1.3. Please use metrics.get_scorer_names to retrieve the names of all available scorers. #22866 by Adrin Jalali.
- The parameters sample_weight and multioutput of metrics.mean_absolute_percentage_error are now keyword-only, in accordance with SLEP009. A deprecation cycle was introduced. #21576 by Paul-Emile Dugnat <pedugnat>.
- The "wminkowski" metric of metrics.DistanceMetric is deprecated and will be removed in version 1.3. Instead, the existing "minkowski" metric now takes in an optional w parameter for weights. This deprecation aims at remaining consistent with SciPy 1.8 conventions. #21873 by Yar Khine Phyo <yarkhinephyo>.
- metrics.DistanceMetric has been moved from sklearn.neighbors to sklearn.metrics. Using neighbors.DistanceMetric for imports is still valid for backward compatibility, but this alias will be removed in 1.3. #21177 by Julien Jerphanion <jjerphan>.
- mixture.GaussianMixture and mixture.BayesianGaussianMixture can now be initialized using k-means++ and random data points. #20408 by Gordon Walsh <g-walsh>, Alberto Ceballos <alceballosa> and Andres Rios <ariosramirez>.
- Fixed a bug so that precisions_cholesky_ is correctly initialized in mixture.GaussianMixture when providing precisions_init, by taking its square root. #22058 by Guillaume Lemaitre <glemaitre>.
- mixture.GaussianMixture now normalizes weights_ more safely, preventing rounding errors when calling mixture.GaussianMixture.sample with n_components=1. #23034 by Meekail Zain <micky774>.
- It is now possible to pass scoring="matthews_corrcoef" to all model selection tools with a scoring argument to use the Matthews correlation coefficient (MCC). #22203 by Olivier Grisel <ogrisel>.
- Raise an error during cross-validation when the fits for all the splits failed. Similarly, raise an error during grid-search when the fits for all the models and all the splits failed. #21026 by Loïc Estève <lesteve>.
- model_selection.GridSearchCV and model_selection.HalvingGridSearchCV now validate input parameters in fit instead of __init__. #21880 by Mrinal Tyagi <MrinalTyagi>.
- model_selection.learning_curve now supports partial_fit with regressors. #22982 by Thomas Fan.
- multiclass.OneVsRestClassifier now supports a verbose parameter so progress on fitting can be seen. #22508 by Chris Combs <combscCode>.
- multiclass.OneVsOneClassifier.predict returns correct predictions when the inner classifier only has a predict_proba method. #22604 by Thomas Fan.
- Adds get_feature_names_out to neighbors.RadiusNeighborsTransformer, neighbors.KNeighborsTransformer and neighbors.NeighborhoodComponentsAnalysis. #22212 by Meekail Zain <micky774>.
- neighbors.KernelDensity now validates input parameters in fit instead of __init__. #21430 by Desislava Vasileva <DessyVV> and Lucy Jimenez <LucyJimenez>.
- neighbors.KNeighborsRegressor.predict now works properly when given an array-like input if the KNeighborsRegressor is first constructed with a callable passed to the weights parameter. #22687 by Meekail Zain <micky774>.
- neural_network.MLPClassifier and neural_network.MLPRegressor show error messages when optimizers produce non-finite parameter weights. #22150 by Christian Ritter <chritter> and Norbert Preining <norbusan>.
- Adds get_feature_names_out to neural_network.BernoulliRBM. #22248 by Thomas Fan.
- Added support for "passthrough" in pipeline.FeatureUnion. Setting a transformer to "passthrough" will pass the features unchanged. #20860 by Shubhraneel Pal <shubhraneel>.
- pipeline.Pipeline now does not validate hyper-parameters in __init__ but in .fit(). #21888 by iofall <iofall> and Arisa Y. <arisayosh>.
- pipeline.FeatureUnion does not validate hyper-parameters in __init__. Validation is now handled in .fit() and .fit_transform(). #21954 by iofall <iofall> and Arisa Y. <arisayosh>.
- Defines __sklearn_is_fitted__ in pipeline.FeatureUnion to return the correct result with utils.validation.check_is_fitted. #22953 by randomgeek78 <randomgeek78>.
- preprocessing.OneHotEncoder now supports grouping infrequent categories into a single feature. Grouping infrequent categories is enabled by specifying how to select infrequent categories with min_frequency or max_categories. #16018 by Thomas Fan.
- Adds a subsample parameter to preprocessing.KBinsDiscretizer. This allows specifying a maximum number of samples to be used while fitting the model. The option is only available when strategy is set to quantile. #21445 by Felipe Bidu <fbidu> and Amanda Dsouza <amy12xx>.
- Adds encoded_missing_value to preprocessing.OrdinalEncoder to configure the encoded value for missing data. #21988 by Thomas Fan.
- Added the get_feature_names_out method and a new parameter feature_names_out to preprocessing.FunctionTransformer. You can set feature_names_out to 'one-to-one' to use the input feature names as the output feature names, or you can set it to a callable that returns the output feature names. This is especially useful when the transformer changes the number of features. If feature_names_out is None (which is the default), then get_feature_names_out is not defined. #21569 by Aurélien Geron <ageron>.
- Adds get_feature_names_out to preprocessing.Normalizer, preprocessing.KernelCenterer, preprocessing.OrdinalEncoder, and preprocessing.Binarizer. #21079 by Thomas Fan.
- preprocessing.PowerTransformer with method='yeo-johnson' better supports significantly non-Gaussian data when searching for an optimal lambda. #20653 by Thomas Fan.
- preprocessing.LabelBinarizer now validates input parameters in fit instead of __init__. #21434 by Krum Arnaudov <krumeto>.
- preprocessing.FunctionTransformer with check_inverse=True now provides an informative error message when input has mixed dtypes. #19916 by Zhehao Liu <MaxwellLZH>.
- preprocessing.KBinsDiscretizer handles bin edges more consistently now. #14975 by Andreas Müller and #22526 by Meekail Zain <micky774>.
- Adds preprocessing.KBinsDiscretizer.get_feature_names_out support when encode="ordinal". #22735 by Thomas Fan.
- Adds an inverse_transform method and a compute_inverse_transform parameter to random_projection.GaussianRandomProjection and random_projection.SparseRandomProjection. When the parameter is set to True, the pseudo-inverse of the components is computed during fit and stored as inverse_components_. #21701 by Aurélien Geron <ageron>.
- random_projection.SparseRandomProjection and random_projection.GaussianRandomProjection preserve dtype for numpy.float32. #22114 by Takeshi Oura <takoika>.
- Adds get_feature_names_out to all transformers in the sklearn.random_projection module: random_projection.GaussianRandomProjection and random_projection.SparseRandomProjection. #21330 by Loïc Estève <lesteve>.
- svm.OneClassSVM, svm.NuSVC, svm.NuSVR, svm.SVC and svm.SVR now expose n_iter_, the number of iterations of the libsvm optimization routine. #21408 by Juan Martín Loyola <jmloyola>.
- svm.SVR, svm.SVC, svm.NuSVR, svm.OneClassSVM and svm.NuSVC now raise an error when the dual-gap estimation produces non-finite parameter weights. #22149 by Christian Ritter <chritter> and Norbert Preining <norbusan>.
- svm.NuSVC, svm.NuSVR, svm.SVC, svm.SVR and svm.OneClassSVM now validate input parameters in fit instead of __init__. #21436 by Haidar Almubarak <Haidar13>.
- tree.DecisionTreeClassifier and tree.ExtraTreeClassifier have the new criterion="log_loss", which is equivalent to criterion="entropy". #23047 by Christian Lorentzen <lorentzenchr>.
- Fixed a bug in the Poisson splitting criterion for tree.DecisionTreeRegressor. #22191 by Christian Lorentzen <lorentzenchr>.
- Changed the default value of max_features to 1.0 for tree.ExtraTreeRegressor and to "sqrt" for tree.ExtraTreeClassifier, which will not change the fit result. The original default value "auto" has been deprecated and will be removed in version 1.3. Setting max_features to "auto" is also deprecated for tree.DecisionTreeClassifier and tree.DecisionTreeRegressor. #22476 by Zhehao Liu <MaxwellLZH>.
- utils.check_array and utils.multiclass.type_of_target now accept an input_name parameter to make the error message more informative when passed invalid input data (e.g. with NaN or infinite values). #21219 by Olivier Grisel <ogrisel>.
- utils.check_array returns a float ndarray with np.nan when passed a Float32 or Float64 pandas extension array with pd.NA. #21278 by Thomas Fan.
- utils.estimator_html_repr shows a more helpful error message when running in a Jupyter notebook that is not trusted. #21316 by Thomas Fan.
- utils.estimator_html_repr displays an arrow on the top left corner of the HTML representation to show how the elements are clickable. #21298 by Thomas Fan.
- utils.check_array with dtype=None returns numeric arrays when passed a pandas DataFrame with mixed dtypes. dtype="numeric" will also better infer the dtype when the DataFrame has mixed dtypes. #22237 by Thomas Fan.
- utils.check_scalar now has better messages when displaying the type. #22218 by Thomas Fan.
- Changes the error message of the ValidationError raised by utils.check_X_y when y is None, so that it is compatible with the check_requires_y_none estimator check. #22578 by Claudio Salvatore Arcidiacono <ClaudioSalvatoreArcidiacono>.
- utils.class_weight.compute_class_weight now only requires that all classes in y have a weight in class_weight. An error is still raised when a class is present in y but not in class_weight. #22595 by Thomas Fan.
- utils.estimator_html_repr has an improved visualization for nested meta-estimators. #21310 by Thomas Fan.
- utils.check_scalar raises an error when include_boundaries={"left", "right"} and the boundaries are not set. #22027 by Marie Lanternier <mlant>.
- utils.metaestimators.available_if correctly returns a bound method that can be pickled. #23077 by Thomas Fan.
- The argument of utils.estimator_checks.check_estimator is now called estimator (its previous name was Estimator). #22188 by Mathurin Massias <mathurinm>.
- utils.metaestimators.if_delegate_has_method is deprecated and will be removed in version 1.3. Use utils.metaestimators.available_if instead. #22830 by Jérémie du Boisberranger <jeremiedbb>.
Thanks to everyone who has contributed to the maintenance and improvement of the project since version 1.0, including:
2357juan, Abhishek Gupta, adamgonzo, Adam Li, adijohar, Aditya Kumawat, Aditya Raghuwanshi, Aditya Singh, Adrian Trujillo Duron, Adrin Jalali, ahmadjubair33, AJ Druck, aj-white, Alan Peixinho, Alberto Mario Ceballos-Arroyo, Alek Lefebvre, Alex, Alexandre Gramfort, alexanmv, almeidayoel, Amanda Dsouza, Aman Sharma, Amar pratap singh, Amit, amrcode, András Simon, Andreas Mueller, Andrew Knyazev, Andriy, Angus L'Herrou, Ankit Sharma, Anne Ducout, Arisa, Arth, arthurmello, Arturo Amor, ArturoAmor, Atharva Patil, aufarkari, Aurélien Geron, avm19, Ayan Bag, baam, Behrouz B, Ben3940, Benjamin Bossan, Bharat Raghunathan, Bijil Subhash, bmreiniger, Brandon Truth, Brenden Kadota, Brian Sun, cdrig, Chalmer Lowe, Chiara Marmo, Chitteti Srinath Reddy, Chloe-Agathe Azencott, Christian Lorentzen, Christian Ritter, christopherlim98, Christoph T. Weidemann, Christos Aridas, Claudio Salvatore Arcidiacono, combscCode, Daniela Fernandes, Dave Eargle, David Poznik, Dea María Léon, Dennis Osei, DessyVV, Dev514, Dimitri Papadopoulos Orfanos, Diwakar Gupta, Dr. Felix M. 
Riese, drskd, Emiko Sano, Emmanouil Gionanidis, EricEllwanger, Erich Schubert, Eric Larson, Eric Ndirangu, Estefania Barreto-Ojeda, eyast, Fatima GASMI, Federico Luna, Felix Glushchenkov, fkaren27, Fortune Uwha, FPGAwesome, francoisgoupil, Frans Larsson, Gabor Berei, Gabor Kertesz, Gabriel Stefanini Vicente, Gabriel S Vicente, Gael Varoquaux, GAURAV CHOUDHARY, Gauthier I, genvalen, Geoffrey-Paris, Giancarlo Pablo, glennfrutiz, gpapadok, Guillaume Lemaitre, Guillermo Tomás Fernández Martín, Gustavo Oliveira, Haidar Almubarak, Hannah Bohle, Haoyin Xu, Haya, Helder Geovane Gomes de Lima, henrymooresc, Hideaki Imamura, Himanshu Kumar, Hind-M, hmasdev, hvassard, i-aki-y, iasoon, Inclusive Coding Bot, Ingela, iofall, Ishan Kumar, Jack Liu, Jake Cowton, jalexand3r, J Alexander, Jauhar, Jaya Surya Kommireddy, Jay Stanley, Jeff Hale, je-kr, JElfner, Jenny Vo, Jérémie du Boisberranger, Jihane, Jirka Borovec, Joel Nothman, Jon Haitz Legarreta Gorroño, Jordan Silke, Jorge Ciprián, Jorge Loayza, Joseph Chazalon, Joseph Schwartz-Messing, JSchuerz, Juan Carlos Alfaro Jiménez, Juan Martin Loyola, Julien Jerphanion, katotten, Kaushik Roy Chowdhury, Ken4git, kernc, Kevin Doucet, KimAYoung, Koushik Joshi, Kranthi Sedamaki, krumetoft, lesnee, Long Bao, Logan Thomas, Loic Esteve, Louis Wagner, LucieClair, Lucy Liu, Luiz Eduardo Amaral, Magali, MaggieChege, Mai, mandjevant, Mandy Gu, Manimaran, MarcoM, Maren Westermann, Maria Boerner, MarieS-WiMLDS, Martel Corentin, mathurinm, Matías, matjansen, Matteo Francia, Maxwell, Max Baak, Meekail Zain, Megabyte, Mehrdad Moradizadeh, melemo2, Michael I Chen, michalkrawczyk, Micky774, milana2, millawell, Ming-Yang Ho, Mitzi, miwojc, Mizuki, mlant, Mohamed Haseeb, Mohit Sharma, Moonkyung94, mpoemsl, MrinalTyagi, Mr. 
Leu, msabatier, murata-yu, N, Nadirhan Şahin, NartayXD, nastegiano, nathansquan, nat-salt, Nicki Skafte Detlefsen, Nicolas Hug, Niket Jain, Nikhil Suresh, Nikita Titov, Nikolay Kondratyev, Ohad Michel, Oleksandr Husak, Olivier Grisel, partev, Patrick Ferreira, Paul, pelennor, PierreAttard, Pieter Gijsbers, Pinky, poloso, Pramod Anantharam, puhuk, Purna Chandra Mansingh, QuadV, Rahil Parikh, Randall Boyes, randomgeek78, Raz Hoshia, Reshama Shaikh, Ricardo Ferreira, Richard Taylor, Rileran, Rishabh, Robin Thibaut, Roman Feldbauer, Roman Yurchak, Ross Barnowski, rsnegrin, Sachin Yadav, sakinaOuisrani, Sam Adam Day, Sanjay Marreddi, Sebastian Pujalte, SEELE, Seyedsaman (Sam) Emami, ShanDeng123, Shao Yang Hong, sharmadharmpal, shaymerNaturalint, Shubhraneel Pal, siavrez, slishak, Smile, spikebh, sply88, Stéphane Collot, Sultan Orazbayev, Sumit Saha, Sven Eschlbeck, Swapnil Jha, Sylvain Marié, Takeshi Oura, Tamires Santana, Tenavi, teunpe, Theis Ferré Hjortkjær, Thiruvenkadam, Thomas J. Fan, t-jakubek, Tom Dupré la Tour, TONY GEORGE, Tyler Martin, Tyler Reddy, Udit Gupta, Ugo Marchand, Varun Agrawal, Venkatachalam N, Vera Komeyer, victoirelouis, Vikas Vishwakarma, Vikrant khedkar, Vladimir Chernyy, Vladimir Kim, WeijiaDu, Xiao Yuan, Yar Khine Phyo, Ying Xiong, yiyangq, Yosshi999, Yuki Koyama, Zach Deane-Mayer, Zeel B Patel, zempleni, zhenfisher, 赵丰 (Zhao Feng)