skrub
Skrub is a very recent package. It is currently undergoing fast development and backward compatibility is not ensured.
- Added the MultiAggJoiner, which augments a main table with multiple auxiliary tables. #876 by Théo Jolivet <TheooJ>.
- AggJoiner now only accepts a single table as an input, and some of its parameters were renamed to be consistent with the MultiAggJoiner. It now has a key parameter that allows joining main and auxiliary tables that share the same column names. #876 by Théo Jolivet <TheooJ>.
- TargetEncoder has been removed in favor of sklearn.preprocessing.TargetEncoder, available since scikit-learn 1.3.
- Joiner and fuzzy_join support several ways of rescaling distances; match_score has been replaced by max_dist; bugs that prevented the Joiner from consistently vectorizing inputs and accepting or rejecting matches across calls to transform have been fixed. #821 by Jérôme Dockès <jeromedockes>.
- InterpolationJoiner was added to join two tables by using machine learning to infer the matching rows from the second table. #742 by Jérôme Dockès <jeromedockes>.
- Pipelines including TableVectorizer can now be grid-searched, since we can now call set_params on the default transformers of TableVectorizer. #814 by Vincent Maladiere <Vincent-Maladiere>.
- to_datetime is now available to support pandas.to_datetime over dataframes and 2d arrays. #784 by Vincent Maladiere <Vincent-Maladiere>.
- Some parameters of Joiner have changed. The goal is to harmonize parameters across all estimators that perform join(-like) operations, as discussed in #751. #757 by Jérôme Dockès <jeromedockes>.
- dataframe.pd_join, dataframe.pd_aggregate, dataframe.pl_join and dataframe.pl_aggregate are now available in the dataframe submodule. #733 by Vincent Maladiere <Vincent-Maladiere>.
- FeatureAugmenter is renamed to Joiner. #674 by Jovan Stojanovic <jovan-stojanovic>.
- fuzzy_join and FeatureAugmenter can now join on datetime columns. #552 by Jovan Stojanovic <jovan-stojanovic>.
- Joiner now supports joining on multiple column keys. #674 by Jovan Stojanovic <jovan-stojanovic>.
- The signatures of all encoders and functions have been revised to enforce cleaner calls. This means that some arguments that could previously be passed positionally now have to be passed as keywords. #514 by Lilian Boulard <LilianBoulard>.
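As a minimal sketch of what keyword-only enforcement looks like in Python: the function `encode` and its parameters below are hypothetical and are not part of skrub. Parameters declared after a bare `*` can no longer be passed positionally.

```python
def encode(values, *, n_components=30, handle_missing="error"):
    """Hypothetical function: everything after `*` must be passed by keyword."""
    return {
        "n_values": len(values),
        "n_components": n_components,
        "handle_missing": handle_missing,
    }


# Keyword call: accepted.
result = encode(["a", "b"], n_components=10)

# Positional call for a keyword-only parameter: rejected with a TypeError.
try:
    encode(["a", "b"], 10)
    positional_call_rejected = False
except TypeError:
    positional_call_rejected = True
```

Code that previously relied on argument order would need to name the affected arguments explicitly.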
- Parallelized the GapEncoder column-wise. Parameters n_jobs and verbose added to the signature. #582 by Lilian Boulard <LilianBoulard>.
- Introducing AggJoiner, a transformer performing aggregation on auxiliary tables followed by left-joining on a base table. #600 by Vincent Maladiere <Vincent-Maladiere>.
- Introducing AggTarget, a transformer performing aggregation on the target y, followed by left-joining on a base table. #600 by Vincent Maladiere <Vincent-Maladiere>.
- Added the SelectCols and DropCols transformers that allow selecting a subset of a dataframe's columns inside of a pipeline. #804 by Jérôme Dockès <jeromedockes>.
- DatetimeEncoder doesn't remove constant features anymore. It also supports an 'errors' argument to raise or coerce errors during transform, and an 'add_total_seconds' argument to include the number of seconds since Epoch. #784 by Vincent Maladiere <Vincent-Maladiere>.
- Scaling of matching_score in fuzzy_join is now between 0 and 1; it used to be between 0.5 and 1. Moreover, the division by 0 error that occurred when all rows had a perfect match has been fixed. #802 by Jérôme Dockès <jeromedockes>.
- TableVectorizer is now able to apply parallelism at the column level rather than the transformer level. This is the default for univariate transformers, like MinHashEncoder and GapEncoder. #592 by Leo Grinsztajn <LeoGrin>.
- inverse_transform in SimilarityEncoder now works as expected; it used to raise an exception. #801 by Jérôme Dockès <jeromedockes>.
- TableVectorizer propagates the n_jobs parameter to the underlying transformers, unless the underlying transformer already sets n_jobs explicitly. #761 by Leo Grinsztajn <LeoGrin>, Guillaume Lemaitre <glemaitre>, and Jerome Dockes <jeromedockes>.
- Parallelized the deduplicate function. Parameter n_jobs added to the signature. #618 by Jovan Stojanovic <jovan-stojanovic> and Lilian Boulard <LilianBoulard>.
- Functions datasets.fetch_ken_embeddings, datasets.fetch_ken_table_aliases and datasets.fetch_ken_types have been renamed. #602 by Jovan Stojanovic <jovan-stojanovic>.
- Make pyarrow an optional dependency to facilitate the integration with pyodide. #639 by Guillaume Lemaitre <glemaitre>.
- Bumped minimal required Python version to 3.10. #606 by Gael Varoquaux <GaelVaroquaux>.
- Bumped minimal required versions for the dependencies:
  - numpy >= 1.23.5
  - scipy >= 1.9.3
  - scikit-learn >= 1.2.1
  - pandas >= 1.5.3
  #613 by Lilian Boulard <LilianBoulard>.
- You can now pass column-specific transformers to TableVectorizer using the specific_transformers argument. #583 by Lilian Boulard <LilianBoulard>.
- Do not support 1-D array (and pandas Series) in TableVectorizer. Pass a 2-D array (or a pandas DataFrame) with a single column instead. This change is for compliance with the scikit-learn API. #647 by Guillaume Lemaitre <glemaitre>.
- Fixes a bug in TableVectorizer with `remainder`: it is now cloned if it's a transformer so that the same instance is not shared between different transformers. #678 by Guillaume Lemaitre <glemaitre>.
- GapEncoder speedup. #680 by Leo Grinsztajn <LeoGrin>.
  - Improved GapEncoder's early stopping logic. The parameters tol and min_iter have been removed. The parameter max_no_improvement can now be used to control the early stopping. #663 by Simona Maggio <simonamaggio>, #593 by Lilian Boulard <LilianBoulard>, #681 by Leo Grinsztajn <LeoGrin>.
  - Implementation improvement leading to a ~x5 speedup for each iteration.
  - Better default hyperparameters: batch_size now defaults to 1024, and max_iter_e_steps to 1.
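The sketch below illustrates early stopping of the kind that a parameter named max_no_improvement suggests. It is not the GapEncoder's code, and it assumes the parameter counts consecutive iterations without improvement, which the entry above does not spell out. The function name and the `scores` input are hypothetical.

```python
def run_with_early_stopping(scores, max_no_improvement=5):
    """Consume per-iteration scores (higher is better) until progress stalls.

    Returns the number of iterations run and the best score seen.
    """
    best = float("-inf")
    stale = 0  # consecutive iterations without a new best score
    n_iter = 0
    for score in scores:
        n_iter += 1
        if score > best:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= max_no_improvement:
                break  # stop: no improvement for too long
    return n_iter, best


n_iter, best = run_with_early_stopping(
    [1.0, 2.0, 2.0, 1.9, 2.0, 3.0], max_no_improvement=3
)
```

In this run the loop stops after the fifth score, so the later improvement to 3.0 is never reached; a larger max_no_improvement trades extra iterations for a lower risk of stopping too early.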
- Removed the most_frequent and k-means strategies from the SimilarityEncoder. These strategies were used for scalability reasons, but we recommend using the MinHashEncoder or the GapEncoder instead. #596 by Leo Grinsztajn <LeoGrin>.
- Removed the similarity argument from the SimilarityEncoder constructor, as we only support the ngram similarity. #596 by Leo Grinsztajn <LeoGrin>.
- Added the analyzer parameter to the SimilarityEncoder to allow word counts for similarity measures. #619 by Jovan Stojanovic <jovan-stojanovic>.
- skrub now uses modern type hints introduced in PEP 585. #609 by Lilian Boulard <LilianBoulard>.
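For readers unfamiliar with PEP 585, here is a small illustration of the style: built-in collections such as list, dict and tuple are subscripted directly instead of importing List, Dict and Tuple from typing. The two functions are hypothetical examples, not skrub code, and the syntax requires Python 3.9 or later.

```python
from collections.abc import Iterable


def count_categories(values: Iterable[str]) -> dict[str, int]:
    """Count occurrences of each category (hypothetical helper)."""
    counts: dict[str, int] = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    return counts


def top_categories(counts: dict[str, int], k: int) -> list[tuple[str, int]]:
    """Return the k most frequent categories, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:k]


counts = count_categories(["london", "paris", "london"])
top = top_categories(counts, 1)
```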
- Some bug fixes for TableVectorizer (#579):
  - check_is_fitted now looks at "transformers_" rather than "columns_"
  - the default of the remainder parameter in the docstring is now "passthrough" instead of "drop" to match the implementation.
  - uint8 and int8 dtypes are now considered as numerical columns.
- Removed the leading "<" and trailing ">" symbols from KEN entities and types. #601 by Jovan Stojanovic <jovan-stojanovic>.
- Add get_feature_names_out method to MinHashEncoder. #616 by Leo Grinsztajn <LeoGrin>.
- Removed requests from the requirements. #613 by Lilian Boulard <LilianBoulard>.
- TableVectorizer now handles mixed types columns without failing by converting them to string before type inference. #623 by Leo Grinsztajn <LeoGrin>.
- Moved the default storage location of data to the user's home folder. #652 by Felix Lefebvre <flefebv> and Gael Varoquaux <GaelVaroquaux>.
- Fixed bug when using TableVectorizer's transform method on categorical columns with missing values. #644 by Leo Grinsztajn <LeoGrin>.
- TableVectorizer never outputs a sparse matrix by default. This can be changed by increasing the sparse_threshold parameter. #646 by Leo Grinsztajn <LeoGrin>.
- TableVectorizer doesn't fail anymore if an inferred type doesn't work during transform. The new entries not matching the type are replaced by missing values. #666 by Leo Grinsztajn <LeoGrin>.
- Dataset fetcher datasets.fetch_employee_salaries now has a parameter overload_job_titles to allow overloading the job titles (employee_position_title) with the column underfilled_job_title, which provides some more information about the job title. #581 by Lilian Boulard <LilianBoulard>.
- Fixed bugs in DatetimeEncoder that were triggered when extract_until was "year", "month", "microseconds" or "nanoseconds", and added the option to set it to None to only extract total_time, the time from epoch. #743 by Leo Grinsztajn <LeoGrin>.
Skrub was born from the dirty_cat package.
- fuzzy_join and FeatureAugmenter can now join on numerical columns based on the euclidean distance. #530 by Jovan Stojanovic <jovan-stojanovic>.
- fuzzy_join and FeatureAugmenter can perform many-to-many joins on lists of numerical or string key columns. #530 by Jovan Stojanovic <jovan-stojanovic>.
- GapEncoder.transform will not continue fitting of the instance anymore. It makes functions that depend on it (GapEncoder.get_feature_names_out, GapEncoder.score, etc.) deterministic once fitted. #548 by Lilian Boulard <LilianBoulard>.
- fuzzy_join and FeatureAugmenter now perform joins on missing values as in pandas.merge, but raise a warning. #522 and #529 by Jovan Stojanovic <jovan-stojanovic>.
- Added get_ken_table_aliases and get_ken_types for exploring KEN embeddings. #539 by Lilian Boulard <LilianBoulard>.
- Improvement of date column detection and date format inference in TableVectorizer. The format inference now tries to find a format which works for all non-missing values of the column, and only tries pandas default inference if it fails. #543 and #587 by Leo Grinsztajn <LeoGrin>.
- SuperVectorizer is renamed as TableVectorizer, a warning is raised when using the old name. #484 by Jovan Stojanovic <jovan-stojanovic>.
- New experimental feature: joining tables using fuzzy_join by approximate key matching. Matches are based on string similarities and the nearest neighbors matches are found for each category. #291 by Jovan Stojanovic <jovan-stojanovic> and Leo Grinsztajn <LeoGrin>.
- New experimental feature: FeatureAugmenter, a transformer that augments with fuzzy_join the number of features in a main table by using information from auxiliary tables. #409 by Jovan Stojanovic <jovan-stojanovic>.
- Unnecessary API has been made private: everything (files, functions, classes) starting with an underscore shouldn't be imported in your code. #331 by Lilian Boulard <LilianBoulard>.
- The MinHashEncoder now supports a n_jobs parameter to parallelize the hashes computation. #267 by Leo Grinsztajn <LeoGrin> and Lilian Boulard <LilianBoulard>.
- New experimental feature: deduplicating misspelled categories using deduplicate by clustering string distances. This function works best when there are significantly more duplicates than underlying categories. #339 by Moritz Boos <mjboos>.
- Add example Wikipedia embeddings to enrich the data. #487 by Jovan Stojanovic <jovan-stojanovic>.
- datasets.fetching: contains a new function get_ken_embeddings that can be used to download Wikipedia embeddings and filter them by type.
- datasets.fetching: contains a new function fetch_world_bank_indicator that can be used to download indicators from the World Bank Open Data platform. #291 by Jovan Stojanovic <jovan-stojanovic>.
- Removed example Fitting scalable, non-linear models on data with dirty categories. #386 by Jovan Stojanovic <jovan-stojanovic>.
- MinHashEncoder's minhash method is no longer public. #379 by Jovan Stojanovic <jovan-stojanovic>.
- Fetching functions now have an additional argument directory, which can be used to specify where to save and load from datasets. #432 and #453 by Lilian Boulard <LilianBoulard>.
- The TableVectorizer's default OneHotEncoder for low cardinality categorical variables now defaults to handle_unknown="ignore" instead of handle_unknown="error" (for sklearn >= 1.0.0). This means that categories seen only at test time will be encoded by a vector of zeroes instead of raising an error. #473 by Leo Grinsztajn <LeoGrin>.
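The behavior described above can be illustrated with a pure-Python toy. This is not scikit-learn's OneHotEncoder; the `one_hot` helper is hypothetical and only mirrors the two handle_unknown modes named in the entry.

```python
def one_hot(value, categories, handle_unknown="error"):
    """Encode `value` against the categories learnt at fit time (toy helper)."""
    if value not in categories:
        if handle_unknown == "ignore":
            # Unseen category: all-zero vector instead of an error.
            return [0] * len(categories)
        raise ValueError(f"Unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


categories = ["blue", "green", "red"]  # categories seen during fit
seen = one_hot("green", categories, handle_unknown="ignore")
unseen = one_hot("purple", categories, handle_unknown="ignore")

try:
    one_hot("purple", categories, handle_unknown="error")
    error_raised = False
except ValueError:
    error_raised = True
```

With "ignore", a category that appears only at test time no longer interrupts prediction, at the cost of being indistinguishable from any other unseen category.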
- The MinHashEncoder now considers None and empty strings as missing values, rather than raising an error. #378 by Gael Varoquaux <GaelVaroquaux>.
- New encoder: DatetimeEncoder can transform a datetime column into several numerical columns (year, month, day, hour, minute, second, ...). It is now the default transformer used in the TableVectorizer for datetime columns. #239 by Leo Grinsztajn <LeoGrin>.
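To make the idea of splitting a datetime into numerical columns concrete, here is a standard-library sketch. It is not the DatetimeEncoder's implementation: the `encode_datetime` helper, its `extract_until` argument and the `total_seconds` key are illustrative assumptions that only loosely echo the names used elsewhere in this changelog.

```python
from datetime import datetime, timezone

FIELDS = ["year", "month", "day", "hour", "minute", "second"]


def encode_datetime(dt, extract_until="hour"):
    """Turn one datetime into a dict of numerical features (toy helper)."""
    kept = FIELDS[: FIELDS.index(extract_until) + 1]
    features = {name: getattr(dt, name) for name in kept}
    # Seconds elapsed since the Unix epoch, as one extra numerical feature.
    features["total_seconds"] = dt.timestamp()
    return features


row = encode_datetime(datetime(2020, 1, 1, 12, 30, tzinfo=timezone.utc))
```

Here `row` keeps year, month, day and hour, and drops the finer-grained fields, while the epoch offset preserves the full ordering information.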
- The TableVectorizer has seen some major improvements and bug fixes:
  - Fixes the automatic casting logic in transform.
  - To avoid dimensionality explosion when a feature has two unique values, the default encoder (sklearn.preprocessing.OneHotEncoder) now drops one of the two vectors (see parameter drop="if_binary").
  - fit_transform and transform can now return unencoded features, like the sklearn.compose.ColumnTransformer's behavior. Previously, a RuntimeError was raised.
  #300 by Lilian Boulard <LilianBoulard>.
- Backward-incompatible change in the TableVectorizer: To apply remainder to features (with the *_transformer parameters), the value 'remainder' must be passed, instead of None in previous versions. None now indicates that we want to use the default transformer. #303 by Lilian Boulard <LilianBoulard>.
- Support for Python 3.6 and 3.7 has been dropped. Python >= 3.8 is now required. #289 by Lilian Boulard <LilianBoulard>.
- Bumped minimum dependencies:
  - scikit-learn>=0.23
  - scipy>=1.4.0
  - numpy>=1.17.3
  - pandas>=1.2.0
  #299 and #300 by Lilian Boulard <LilianBoulard>.
- Dropped support for Jaro, Jaro-Winkler and Levenshtein distances.
  - The SimilarityEncoder now exclusively uses ngram for similarities, and the similarity parameter is deprecated. It will be removed in 0.5. #282 by Lilian Boulard <LilianBoulard>.
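As a rough illustration of an n-gram similarity between two strings, the sketch below divides the number of shared character 3-grams by the size of their multiset union. This is one common definition and is an assumption here: the exact formula used by the SimilarityEncoder may differ, and both helper functions are hypothetical.

```python
from collections import Counter


def ngram_counts(text, n=3):
    """Count character n-grams of `text`, padded with one space on each side."""
    padded = f" {text} "
    return Counter(padded[i : i + n] for i in range(len(padded) - n + 1))


def ngram_similarity(a, b, n=3):
    """Shared n-grams divided by the union of n-grams (multiset counts)."""
    counts_a, counts_b = ngram_counts(a, n), ngram_counts(b, n)
    shared = sum((counts_a & counts_b).values())  # element-wise minimum
    total = sum((counts_a | counts_b).values())  # element-wise maximum
    return shared / total if total else 1.0


same = ngram_similarity("london", "london")
close = ngram_similarity("london", "londres")
different = ngram_similarity("abc", "xyz")
```

Identical strings score 1, strings with no 3-gram in common score 0, and near-duplicates fall in between, which is what makes the measure usable for dirty categories.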
- The transformers_ attribute of the TableVectorizer now contains column names instead of column indices for the "remainder" columns. #266 by Leo Grinsztajn <LeoGrin>.
- Fixed a bug in the TableVectorizer causing a FutureWarning when using the get_feature_names_out method. #262 by Lilian Boulard <LilianBoulard>.
- Improvements to the TableVectorizer:
  - Type detection works better: handles dates, numeric columns encoded as strings, or numeric columns containing strings for missing values. #238 by Leo Grinsztajn <LeoGrin>.
- get_feature_names becomes get_feature_names_out, following changes in the scikit-learn API. get_feature_names is deprecated in scikit-learn > 1.0. #241 by Gael Varoquaux <GaelVaroquaux>.
- Improvements to the MinHashEncoder:
  - It is now possible to fit multiple columns simultaneously with the MinHashEncoder. Very useful when using for instance the sklearn.compose.make_column_transformer function, on multiple columns.
  #243 by Jovan Stojanovic <jovan-stojanovic>.
- Fixed a bug that resulted in the GapEncoder ignoring the analyzer argument. #242 by Jovan Stojanovic <jovan-stojanovic>.
- GapEncoder's get_feature_names_out now accepts all iterators, not just lists. #255 by Lilian Boulard <LilianBoulard>.
- Fixed DeprecationWarning raised by the usage of distutils.version.LooseVersion. #261 by Lilian Boulard <LilianBoulard>.
- Remove trailing imports in the MinHashEncoder.
- Fix typos and update links for website.
- Documentation of the TableVectorizer and the SimilarityEncoder improved.
Also see pre-release 0.2.0a1 below for additional changes.
- Bump minimum dependencies:
  - scikit-learn (>=0.21.0) #202 by Lilian Boulard <LilianBoulard>
  - pandas (>=1.1.5) ! NEW REQUIREMENT ! #155 by Lilian Boulard <LilianBoulard>
- datasets.fetching - backward-incompatible changes to the example datasets fetchers:
  - The backend has changed: we now exclusively fetch the datasets from OpenML. End users should not see any difference regarding this.
  - The frontend, however, changed a little: the fetching functions stay the same but their return values were modified in favor of a more Pythonic interface. Refer to the docstrings of functions dirty_cat.datasets.fetch_* for more information.
  - The example notebooks were updated to reflect these changes.
  #155 by Lilian Boulard <LilianBoulard>.
- Backward incompatible change to MinHashEncoder: The MinHashEncoder now only supports two dimensional inputs of shape (N_samples, 1). #185 by Lilian Boulard <LilianBoulard> and Alexis Cvetkov <alexis-cvetkov>.
- Update handle_missing parameters:
  - GapEncoder: the default value "zero_impute" becomes "empty_impute" (see doc).
  - MinHashEncoder: the default value "" becomes "zero_impute" (see doc).
  #210 by Alexis Cvetkov <alexis-cvetkov>.
- Add a method "get_feature_names_out" for the GapEncoder and the TableVectorizer, since get_feature_names will be deprecated in scikit-learn 1.2. #216 by Alexis Cvetkov <alexis-cvetkov>.
- Removed hard-coded CSV file dirty_cat/data/FiveThirtyEight_Midwest_Survey.csv.
- Improvements to the TableVectorizer:
  - Missing values are not systematically imputed anymore
  - Type casting and per-column imputation are now learnt during fitting
  - Several bugfixes
  #201 by Lilian Boulard <LilianBoulard>.
Version 0.2.0a1 is a pre-release. To try it, you have to install it manually using:
pip install --pre dirty_cat==0.2.0a1
or from the GitHub repository:
pip install git+https://github.com/dirty-cat/dirty_cat.git
- Bump minimum dependencies:
- Python (>= 3.6)
- NumPy (>= 1.16)
- SciPy (>= 1.2)
- scikit-learn (>= 0.20.0)
- TableVectorizer: Added automatic transform through the TableVectorizer class. It transforms columns automatically based on their type. It provides a replacement for scikit-learn's sklearn.compose.ColumnTransformer that is simpler to use on heterogeneous pandas DataFrames. #167 by Lilian Boulard <LilianBoulard>.
- Backward incompatible change to GapEncoder: The GapEncoder now only supports two-dimensional inputs of shape (n_samples, n_features). Internally, features are encoded by independent GapEncoder models, and are then concatenated into a single matrix. #185 by Lilian Boulard <LilianBoulard> and Alexis Cvetkov <alexis-cvetkov>.
- Fix get_feature_names for scikit-learn > 0.21. #216 by Alexis Cvetkov <alexis-cvetkov>.
- RuntimeWarnings due to overflow in GapEncoder. #161 by Alexis Cvetkov <alexis-cvetkov>.
- GapEncoder: Added online Gamma-Poisson factorization through the GapEncoder class. This method discovers latent categories formed via combinations of substrings, and encodes string data as combinations of these categories. To be used if interpretability is important. #153 by Alexis Cvetkov <alexis-cvetkov>.
- Multiprocessing exception in notebook. #154 by Lilian Boulard <LilianBoulard>.
- MinHashEncoder: Added minhash_encoder.py and fast_hash.py files that implement minhash encoding through the MinHashEncoder class. This method allows for fast and scalable encoding of string categorical variables.
- datasets.fetch_employee_salaries: change the origin of download for employee_salaries.
  - The function now returns a bunch with a dataframe under the field "data", and not the path to the csv file.
  - The field "description" has been renamed to "DESCR".
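The minhash encoding mentioned in the MinHashEncoder entry above can be sketched with the standard library. This is a generic illustration of the technique, not the code in minhash_encoder.py: the helper names, the use of salted BLAKE2 as the hash family, and the default sizes are all assumptions.

```python
import hashlib


def char_ngrams(text, n=3):
    """Set of character n-grams; falls back to the whole string if too short."""
    return {text[i : i + n] for i in range(len(text) - n + 1)} or {text}


def salted_hash(ngram, seed):
    """One member of a hash family, selected by `seed` (used as a salt)."""
    digest = hashlib.blake2b(
        ngram.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text, n_components=8, n=3):
    """For each hash function, keep the minimum hash over the string's n-grams."""
    grams = char_ngrams(text, n)
    return [min(salted_hash(g, seed) for g in grams) for seed in range(n_components)]


short = minhash("paris")
longer = minhash("paris france")
```

Because each component is a minimum over n-grams, strings that share many n-grams tend to agree on many components, and no vocabulary has to be stored, which is what makes the approach stateless and scalable. Every 3-gram of "paris" also occurs in "paris france", so each component of `longer` is at most the matching component of `short`.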
- SimilarityEncoder: Fixed a bug when using the Jaro-Winkler distance as a similarity metric. Our implementation now accurately reproduces the behaviour of the python-Levenshtein implementation.
- SimilarityEncoder: Added a handle_missing attribute to allow encoding with missing values.
- TargetEncoder: Added a handle_missing attribute to allow encoding with missing values.
- MinHashEncoder: Added a handle_missing attribute to allow encoding with missing values.
- SimilarityEncoder: Accelerate SimilarityEncoder.transform, by:
  - computing the vocabulary count vectors in fit instead of transform
  - computing the similarities in parallel using joblib. This option can be turned on/off via the n_jobs attribute of the SimilarityEncoder.
- SimilarityEncoder: Fix a bug that was preventing a SimilarityEncoder from being created when categories was a list.
- SimilarityEncoder: Set the dtype passed to the ngram similarity to float32, which reduces memory consumption during encoding.
- SimilarityEncoder: Change the default ngram range to (2, 4) which performs better empirically.
- SimilarityEncoder: Added a most_frequent strategy to define prototype categories for large-scale learning.
- SimilarityEncoder: Added a k-means strategy to define prototype categories for large-scale learning.
- SimilarityEncoder: Added the possibility to use hashing ngrams for stateless fitting with the ngram similarity.
- SimilarityEncoder: Performance improvements in the ngram similarity.
- SimilarityEncoder: Expose a get_feature_names method.