Changes

skrub

Ongoing development

Skrub is a very recent package. It is currently undergoing fast development and backward compatibility is not ensured.

Major changes

Added the MultiAggJoiner, which allows augmenting a main table with multiple auxiliary tables. 876 by Théo Jolivet <TheooJ>.
AggJoiner now only accepts a single table as an input, and some of its parameters were renamed to be consistent with the MultiAggJoiner. It now has a key parameter that allows joining main and auxiliary tables that share the same column names. 876 by Théo Jolivet <TheooJ>.

Minor changes

skrub release 0.1.0

Major changes

TargetEncoder has been removed in favor of sklearn.preprocessing.TargetEncoder, available since scikit-learn 1.3.
Joiner and fuzzy_join support several ways of rescaling distances; match_score has been replaced by max_dist; bugs which prevented the Joiner from consistently vectorizing inputs and accepting or rejecting matches across calls to transform have been fixed. 821 by Jérôme Dockès <jeromedockes>.
InterpolationJoiner was added to join two tables by using machine-learning to infer the matching rows from the second table. 742 by Jérôme Dockès <jeromedockes>.
Pipelines including TableVectorizer can now be grid-searched, since we can now call set_params on the default transformers of TableVectorizer. 814 by Vincent Maladiere <Vincent-Maladiere>
to_datetime is now available to support pandas.to_datetime over dataframes and 2d arrays. 784 by Vincent Maladiere <Vincent-Maladiere>
Some parameters of Joiner have changed. The goal is to harmonize parameters across all estimators that perform join(-like) operations, as discussed in #751. 757 by Jérôme Dockès <jeromedockes>.
dataframe.pd_join, dataframe.pd_aggregate, dataframe.pl_join and dataframe.pl_aggregate are now available in the dataframe submodule. 733 by Vincent Maladiere <Vincent-Maladiere>
FeatureAugmenter is renamed to Joiner. 674 by Jovan Stojanovic <jovan-stojanovic>
fuzzy_join and FeatureAugmenter can now join on datetime columns. 552 by Jovan Stojanovic <jovan-stojanovic>
Joiner now supports joining on multiple column keys. 674 by Jovan Stojanovic <jovan-stojanovic>
The signatures of all encoders and functions have been revised to enforce cleaner calls. This means that some arguments that could previously be passed positionally now have to be passed as keywords. 514 by Lilian Boulard <LilianBoulard>.
Parallelized the GapEncoder column-wise. Parameters n_jobs and verbose added to the signature. 582 by Lilian Boulard <LilianBoulard>
Introducing AggJoiner, a transformer performing aggregation on auxiliary tables followed by left-joining on a base table. 600 by Vincent Maladiere <Vincent-Maladiere>.
Introducing AggTarget, a transformer performing aggregation on the target y, followed by left-joining on a base table. 600 by Vincent Maladiere <Vincent-Maladiere>.
Added the SelectCols and DropCols transformers that allow selecting a subset of a dataframe's columns inside of a pipeline. 804 by Jérôme Dockès <jeromedockes>.
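
As an illustration of the keyword-only signature change (514) listed above, here is a minimal sketch in plain Python. The function encode_column and its parameters are hypothetical and are not part of skrub; the bare * in the signature is the standard Python mechanism for forcing the arguments after it to be passed by name.

```python
def encode_column(values, *, n_components=10, random_state=None):
    """Hypothetical function: arguments after ``*`` are keyword-only."""
    return {
        "n_values": len(values),
        "n_components": n_components,
        "random_state": random_state,
    }


# Passing the option by keyword works.
result = encode_column(["a", "b"], n_components=5)

# Passing the same option positionally raises a TypeError.
try:
    encode_column(["a", "b"], 5)
    positional_call_failed = False
except TypeError:
    positional_call_failed = True
```

Code that previously relied on argument order would need to name such arguments explicitly after this kind of change.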

Minor changes

DatetimeEncoder doesn't remove constant features anymore. It also supports an 'errors' argument to raise or coerce errors during transform, and an 'add_total_seconds' argument to include the number of seconds since Epoch. 784 by Vincent Maladiere <Vincent-Maladiere>
Scaling of matching_score in fuzzy_join is now between 0 and 1; it used to be between 0.5 and 1. Moreover, the division by 0 error that occurred when all rows had a perfect match has been fixed. 802 by Jérôme Dockès <jeromedockes>.
TableVectorizer is now able to apply parallelism at the column level rather than the transformer level. This is the default for univariate transformers, such as MinHashEncoder and GapEncoder. 592 by Leo Grinsztajn <LeoGrin>
inverse_transform in SimilarityEncoder now works as expected; it used to raise an exception. 801 by Jérôme Dockès <jeromedockes>.
TableVectorizer now propagates the n_jobs parameter to the underlying transformers, unless an underlying transformer already sets n_jobs explicitly. 761 by Leo Grinsztajn <LeoGrin>, Guillaume Lemaitre <glemaitre>, and Jerome Dockes <jeromedockes>.
Parallelized the deduplicate function. Parameter n_jobs added to the signature. 618 by Jovan Stojanovic <jovan-stojanovic> and Lilian Boulard <LilianBoulard>
Functions datasets.fetch_ken_embeddings, datasets.fetch_ken_table_aliases and datasets.fetch_ken_types have been renamed. 602 by Jovan Stojanovic <jovan-stojanovic>
Made pyarrow an optional dependency to facilitate the integration with pyodide. 639 by Guillaume Lemaitre <glemaitre>.
Bumped minimal required Python version to 3.10. 606 by Gael Varoquaux <GaelVaroquaux>
Bumped minimal required versions for the dependencies:
- numpy >= 1.23.5
- scipy >= 1.9.3
- scikit-learn >= 1.2.1
- pandas >= 1.5.3 613 by Lilian Boulard <LilianBoulard>
You can now pass column-specific transformers to TableVectorizer using the specific_transformers argument. 583 by Lilian Boulard <LilianBoulard>.
Do not support 1-D array (and pandas Series) in TableVectorizer. Pass a 2-D array (or a pandas DataFrame) with a single column instead. This change is for compliance with the scikit-learn API. 647 by Guillaume Lemaitre <glemaitre>
Fixes a bug in TableVectorizer with `remainder`: it is now cloned if it's a transformer so that the same instance is not shared between different transformers. 678 by Guillaume Lemaitre <glemaitre>
GapEncoder speedup 680 by Leo Grinsztajn <LeoGrin>
- Improved GapEncoder's early stopping logic. The parameters tol and min_iter have been removed. The parameter max_no_improvement can now be used to control the early stopping. 663 by Simona Maggio <simonamaggio> 593 by Lilian Boulard <LilianBoulard> 681 by Leo Grinsztajn <LeoGrin>
- Implementation improvement leading to a ~x5 speedup for each iteration.
- Better default hyperparameters: batch_size now defaults to 1024, and max_iter_e_steps to 1.
Removed the most_frequent and k-means strategies from the SimilarityEncoder. These strategies were used for scalability reasons, but we recommend using the MinHashEncoder or the GapEncoder instead. 596 by Leo Grinsztajn <LeoGrin>
Removed the similarity argument from the SimilarityEncoder constructor, as we only support the ngram similarity. 596 by Leo Grinsztajn <LeoGrin>
Added the analyzer parameter to the SimilarityEncoder to allow word counts for similarity measures. 619 by Jovan Stojanovic <jovan-stojanovic>
skrub now uses modern type hints introduced in PEP 585. 609 by Lilian Boulard <LilianBoulard>
Some bug fixes for TableVectorizer (579):
- check_is_fitted now looks at "transformers_" rather than "columns_"
- the default of the remainder parameter in the docstring is now "passthrough" instead of "drop" to match the implementation.
- uint8 and int8 dtypes are now considered as numerical columns.
Removed the leading "<" and trailing ">" symbols from KEN entities and types. 601 by Jovan Stojanovic <jovan-stojanovic>
Add get_feature_names_out method to MinHashEncoder. 616 by Leo Grinsztajn <LeoGrin>
Removed requests from the requirements. 613 by Lilian Boulard <LilianBoulard>
TableVectorizer now handles mixed types columns without failing by converting them to string before type inference. 623 by Leo Grinsztajn <LeoGrin>
Moved the default storage location of data to the user's home folder. 652 by Felix Lefebvre <flefebv> and Gael Varoquaux <GaelVaroquaux>
Fixed bug when using TableVectorizer's transform method on categorical columns with missing values. 644 by Leo Grinsztajn <LeoGrin>
TableVectorizer never outputs a sparse matrix by default. This can be changed by increasing the sparse_threshold parameter. 646 by Leo Grinsztajn <LeoGrin>
TableVectorizer doesn't fail anymore if an inferred type doesn't work during transform. The new entries not matching the type are replaced by missing values. 666 by Leo Grinsztajn <LeoGrin>
Dataset fetcher datasets.fetch_employee_salaries now has a parameter overload_job_titles to allow overloading the job titles (employee_position_title) with the column underfilled_job_title, which provides some more information about the job title. 581 by Lilian Boulard <LilianBoulard>
Fixed bugs in DatetimeEncoder which were triggered when extract_until was "year", "month", "microseconds" or "nanoseconds", and added the option to set it to None to only extract total_time, the time from epoch. 743 by Leo Grinsztajn <LeoGrin>

Before skrub: dirty_cat

Skrub was born from the dirty_cat package.

Dirty-cat release 0.4.1

Major changes

fuzzy_join and FeatureAugmenter can now join on numerical columns based on the euclidean distance. 530 by Jovan Stojanovic <jovan-stojanovic>
fuzzy_join and FeatureAugmenter can perform many-to-many joins on lists of numerical or string key columns. 530 by Jovan Stojanovic <jovan-stojanovic>
GapEncoder.transform no longer continues fitting the instance. This makes functions that depend on it (GapEncoder.get_feature_names_out, GapEncoder.score, etc.) deterministic once fitted. 548 by Lilian Boulard <LilianBoulard>
fuzzy_join and FeatureAugmenter now perform joins on missing values as in pandas.merge, but raise a warning. 522 and 529 by Jovan Stojanovic <jovan-stojanovic>
Added get_ken_table_aliases and get_ken_types for exploring KEN embeddings. 539 by Lilian Boulard <LilianBoulard>.
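
For reference, the pandas.merge behaviour that the missing-value change (522 and 529) mirrors can be seen in the short sketch below. It uses only pandas and NumPy, not fuzzy_join or FeatureAugmenter, and the table contents and column names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Two small tables whose key column contains a missing value.
left = pd.DataFrame({"key": ["a", np.nan], "x": [1, 2]})
right = pd.DataFrame({"key": ["a", np.nan], "y": [10, 20]})

# In pandas.merge, rows whose key is missing on both sides
# are matched against each other.
merged = left.merge(right, on="key", how="left")
```

This is the kind of match on missing keys for which the entry above says a warning is raised.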

Minor changes

Improvement of date column detection and date format inference in TableVectorizer. The format inference now tries to find a format which works for all non-missing values of the column, and only tries pandas default inference if it fails. 543 by Leo Grinsztajn <LeoGrin> 587 by Leo Grinsztajn <LeoGrin>
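
The format-inference strategy described in this entry can be sketched in plain Python as follows. This is not the TableVectorizer implementation: the helper infer_common_format and the list of candidate formats are assumptions made for illustration. The helper returns the first format that parses every non-missing value, and None when no single format fits, which corresponds to the point where the entry says pandas default inference is tried.

```python
from datetime import datetime

# Illustrative candidates only; the real list of formats is not given here.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y"]


def infer_common_format(values, formats=CANDIDATE_FORMATS):
    """Return the first format that parses all non-missing values, else None."""
    non_missing = [v for v in values if v is not None]
    for fmt in formats:
        try:
            for value in non_missing:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format fails on at least one value
        return fmt
    return None


day_first = infer_common_format(["13/01/2020", None, "02/03/2021"])
mixed = infer_common_format(["2020-01-13", "13/01/2020"])
```

Requiring one format for the whole column avoids parsing "02/03/2021" as month-first when another value in the same column can only be day-first.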

Dirty-cat Release 0.4.0

Major changes

SuperVectorizer is renamed as TableVectorizer, a warning is raised when using the old name. 484 by Jovan Stojanovic <jovan-stojanovic>
New experimental feature: joining tables using fuzzy_join by approximate key matching. Matches are based on string similarities and the nearest neighbors matches are found for each category. 291 by Jovan Stojanovic <jovan-stojanovic> and Leo Grinsztajn <LeoGrin>
New experimental feature: FeatureAugmenter, a transformer that augments with fuzzy_join the number of features in a main table by using information from auxiliary tables. 409 by Jovan Stojanovic <jovan-stojanovic>
Unnecessary API has been made private: everything (files, functions, classes) starting with an underscore shouldn't be imported in your code. 331 by Lilian Boulard <LilianBoulard>
The MinHashEncoder now supports a n_jobs parameter to parallelize the hashes computation. 267 by Leo Grinsztajn <LeoGrin> and Lilian Boulard <LilianBoulard>.
New experimental feature: deduplicating misspelled categories using deduplicate by clustering string distances. This function works best when there are significantly more duplicates than underlying categories. 339 by Moritz Boos <mjboos>.

Minor changes

Add example Wikipedia embeddings to enrich the data. 487 by Jovan Stojanovic <jovan-stojanovic>
datasets.fetching: contains a new function get_ken_embeddings that can be used to download Wikipedia embeddings and filter them by type.
datasets.fetching: contains a new function fetch_world_bank_indicator that can be used to download indicators from the World Bank Open Data platform. 291 by Jovan Stojanovic <jovan-stojanovic>
Removed example Fitting scalable, non-linear models on data with dirty categories. 386 by Jovan Stojanovic <jovan-stojanovic>
MinHashEncoder's minhash method is no longer public. 379 by Jovan Stojanovic <jovan-stojanovic>
Fetching functions now have an additional argument directory, which can be used to specify where to save and load from datasets. 432 and 453 by Lilian Boulard <LilianBoulard>
The TableVectorizer's default OneHotEncoder for low cardinality categorical variables now defaults to handle_unknown="ignore" instead of handle_unknown="error" (for sklearn >= 1.0.0). This means that categories seen only at test time will be encoded by a vector of zeroes instead of raising an error. 473 by Leo Grinsztajn <LeoGrin>
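
The effect of handle_unknown="ignore" can be checked directly on scikit-learn's OneHotEncoder. The sketch below uses the scikit-learn encoder on its own rather than the TableVectorizer, and the category values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on two known categories.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["red"], ["blue"]], dtype=object))

# "green" was never seen during fit: it is encoded as all zeros
# instead of raising an error. The default output is sparse,
# so it is converted to a dense array for inspection.
encoded = encoder.transform(np.array([["green"]], dtype=object)).toarray()
```

With handle_unknown="error", the same call to transform would raise a ValueError.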

Bug fixes

The MinHashEncoder now considers None and empty strings as missing values, rather than raising an error. 378 by Gael Varoquaux <GaelVaroquaux>

Dirty-cat Release 0.3.0

Major changes

New encoder: DatetimeEncoder can transform a datetime column into several numerical columns (year, month, day, hour, minute, second, ...). It is now the default transformer used in the TableVectorizer for datetime columns. 239 by Leo Grinsztajn <LeoGrin>
The TableVectorizer has seen some major improvements and bug fixes:
- Fixes the automatic casting logic in transform.
- To avoid dimensionality explosion when a feature has two unique values, the default encoder (sklearn.preprocessing.OneHotEncoder) now drops one of the two vectors (see parameter drop="if_binary").
- fit_transform and transform can now return unencoded features, matching the behavior of sklearn.compose.ColumnTransformer. Previously, a RuntimeError was raised.
300 by Lilian Boulard <LilianBoulard>
Backward-incompatible change in the TableVectorizer: To apply remainder to features (with the *_transformer parameters), the value 'remainder' must be passed, instead of None in previous versions. None now indicates that we want to use the default transformer. 303 by Lilian Boulard <LilianBoulard>
Support for Python 3.6 and 3.7 has been dropped. Python >= 3.8 is now required. 289 by Lilian Boulard <LilianBoulard>
Bumped minimum dependencies:
- scikit-learn>=0.23
- scipy>=1.4.0
- numpy>=1.17.3
- pandas>=1.2.0 299 and 300 by Lilian Boulard <LilianBoulard>
Dropped support for Jaro, Jaro-Winkler and Levenshtein distances.
- The SimilarityEncoder now exclusively uses ngram for similarities, and the similarity parameter is deprecated. It will be removed in 0.5. 282 by Lilian Boulard <LilianBoulard>
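
To give an idea of what an n-gram similarity measures, here is a simplified sketch in plain Python. It is an assumption-laden illustration, not the SimilarityEncoder's formula: it pads each string with spaces, collects the set of character 3-grams, and returns the Jaccard ratio of shared to total n-grams. The function names are hypothetical.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of ``text``, padded with one space per side."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


same = ngram_similarity("london", "london")
close = ngram_similarity("london", "londres")
unrelated = ngram_similarity("abc", "xyz")
```

Strings that share many substrings score close to 1, and strings with no substring in common score 0, which is why this family of measures copes with typos and spelling variants.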

Notes

The transformers_ attribute of the TableVectorizer now contains column names instead of column indices for the "remainder" columns. 266 by Leo Grinsztajn <LeoGrin>

Dirty-cat Release 0.2.2

Bug fixes

Fixed a bug in the TableVectorizer causing a FutureWarning when using the get_feature_names_out method. 262 by Lilian Boulard <LilianBoulard>

Dirty-cat Release 0.2.1

Major changes

Improvements to the TableVectorizer
- Type detection works better: handles dates, numeric columns encoded as strings, or numeric columns containing strings for missing values.
238 by Leo Grinsztajn <LeoGrin>
get_feature_names becomes get_feature_names_out, following changes in the scikit-learn API. get_feature_names is deprecated in scikit-learn > 1.0. 241 by Gael Varoquaux <GaelVaroquaux>
Improvements to the MinHashEncoder
- It is now possible to fit multiple columns simultaneously with the MinHashEncoder. Very useful when using, for instance, the sklearn.compose.make_column_transformer function on multiple columns.
243 by Jovan Stojanovic <jovan-stojanovic>

Bug-fixes

Fixed a bug that resulted in the GapEncoder ignoring the analyzer argument. 242 by Jovan Stojanovic <jovan-stojanovic>
GapEncoder's get_feature_names_out now accepts all iterators, not just lists. 255 by Lilian Boulard <LilianBoulard>
Fixed DeprecationWarning raised by the usage of distutils.version.LooseVersion. 261 by Lilian Boulard <LilianBoulard>

Notes

Remove trailing imports in the MinHashEncoder.
Fix typos and update links for website.
Documentation of the TableVectorizer and the SimilarityEncoder improved.

Dirty-cat Release 0.2.0

Also see pre-release 0.2.0a1 below for additional changes.

Major changes

Bump minimum dependencies:
- scikit-learn (>=0.21.0) 202 by Lilian Boulard <LilianBoulard>
- pandas (>=1.1.5) ! NEW REQUIREMENT ! 155 by Lilian Boulard <LilianBoulard>
datasets.fetching - backward-incompatible changes to the example datasets fetchers:
- The backend has changed: we now exclusively fetch the datasets from OpenML. End users should not see any difference regarding this.
- The frontend, however, changed a little: the fetching functions stay the same but their return values were modified in favor of a more Pythonic interface. Refer to the docstrings of functions dirty_cat.datasets.fetch_* for more information.
- The example notebooks were updated to reflect these changes. 155 by Lilian Boulard <LilianBoulard>
Backward incompatible change to MinHashEncoder: The MinHashEncoder now only supports two dimensional inputs of shape (N_samples, 1). 185 by Lilian Boulard <LilianBoulard> and Alexis Cvetkov <alexis-cvetkov>.
Update handle_missing parameters:
- GapEncoder: the default value "zero_impute" becomes "empty_impute" (see doc).
- MinHashEncoder: the default value "" becomes "zero_impute" (see doc).
210 by Alexis Cvetkov <alexis-cvetkov>.
Add a method "get_feature_names_out" for the GapEncoder and the TableVectorizer, since get_feature_names will be deprecated in scikit-learn 1.2. 216 by Alexis Cvetkov <alexis-cvetkov>

Notes

Removed hard-coded CSV file dirty_cat/data/FiveThirtyEight_Midwest_Survey.csv.
Improvements to the TableVectorizer
- Missing values are not systematically imputed anymore
- Type casting and per-column imputation are now learnt during fitting
- Several bugfixes
201 by Lilian Boulard <LilianBoulard>

Dirty-cat Release 0.2.0a1

Version 0.2.0a1 is a pre-release. To try it, you have to install it manually using:

pip install --pre dirty_cat==0.2.0a1

or from the GitHub repository:

pip install git+https://github.com/dirty-cat/dirty_cat.git

Major changes

Bump minimum dependencies:
- Python (>= 3.6)
- NumPy (>= 1.16)
- SciPy (>= 1.2)
- scikit-learn (>= 0.20.0)
TableVectorizer: Added automatic transform through the TableVectorizer class. It transforms columns automatically based on their type. It provides a replacement for scikit-learn's sklearn.compose.ColumnTransformer that is simpler to use on heterogeneous pandas DataFrames. 167 by Lilian Boulard <LilianBoulard>
Backward incompatible change to GapEncoder: The GapEncoder now only supports two-dimensional inputs of shape (n_samples, n_features). Internally, features are encoded by independent GapEncoder models, and are then concatenated into a single matrix. 185 by Lilian Boulard <LilianBoulard> and Alexis Cvetkov <alexis-cvetkov>.

Bug-fixes

Fix get_feature_names for scikit-learn > 0.21. 216 by Alexis Cvetkov <alexis-cvetkov>

Dirty-cat Release 0.1.1

Major changes

Bug-fixes

RuntimeWarnings due to overflow in GapEncoder. 161 by Alexis Cvetkov <alexis-cvetkov>

Dirty-cat Release 0.1.0

Major changes

GapEncoder: Added online Gamma-Poisson factorization through the GapEncoder class. This method discovers latent categories formed via combinations of substrings, and encodes string data as combinations of these categories. To be used if interpretability is important. 153 by Alexis Cvetkov <alexis-cvetkov>

Bug-fixes

Multiprocessing exception in notebook. 154 by Lilian Boulard <LilianBoulard>

Dirty-cat Release 0.0.7

MinHashEncoder: Added minhash_encoder.py and fast_hash.py files that implement minhash encoding through the MinHashEncoder class. This method allows for fast and scalable encoding of string categorical variables.
datasets.fetch_employee_salaries: change the origin of download for employee_salaries.
- The function now returns a bunch with a dataframe under the field "data", and not the path to the csv file.
- The field "description" has been renamed to "DESCR".
SimilarityEncoder: Fixed a bug when using the Jaro-Winkler distance as a similarity metric. Our implementation now accurately reproduces the behaviour of the python-Levenshtein implementation.
SimilarityEncoder: Added a handle_missing attribute to allow encoding with missing values.
TargetEncoder: Added a handle_missing attribute to allow encoding with missing values.
MinHashEncoder: Added a handle_missing attribute to allow encoding with missing values.
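
The general idea behind minhash encoding of strings, as introduced with the MinHashEncoder in this release, can be sketched with the standard library only. This is not the code from minhash_encoder.py: the function names, the use of blake2b as the hash function, and the default of 8 components are all assumptions chosen for illustration.

```python
import hashlib


def char_ngrams(text, n=3):
    """Set of character n-grams of ``text``, padded with one space per side."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def minhash_encode(text, n_components=8, n=3):
    """One component per salted hash function: the minimum hash over all n-grams."""
    grams = char_ngrams(text, n)
    if not grams:
        return [0] * n_components  # arbitrary choice for empty input
    encoding = []
    for seed in range(n_components):
        salt = seed.to_bytes(8, "little")
        hashes = [
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ]
        encoding.append(min(hashes))
    return encoding


first = minhash_encode("senior engineer")
second = minhash_encode("senior engineer")
```

Because each component depends only on the n-grams of the input string, no vocabulary has to be learned or stored, which is what makes this family of encoders scale to many distinct categories.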

Dirty-cat Release 0.0.6

SimilarityEncoder: Accelerate SimilarityEncoder.transform, by:
- computing the vocabulary count vectors in fit instead of transform
- computing the similarities in parallel using joblib. This option can be turned on/off via the n_jobs attribute of the SimilarityEncoder.
SimilarityEncoder: Fix a bug that was preventing a SimilarityEncoder from being created when categories was a list.
SimilarityEncoder: Set the dtype passed to the ngram similarity to float32, which reduces memory consumption during encoding.
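
The memory effect of switching to float32 can be checked with NumPy alone. The array shape below is arbitrary and stands in for a matrix of similarity values; it is not taken from the SimilarityEncoder.

```python
import numpy as np

# A stand-in matrix of similarities, first in NumPy's default float64.
similarities_64 = np.zeros((1000, 100), dtype=np.float64)

# The same values stored as float32 take half the memory.
similarities_32 = similarities_64.astype(np.float32)

bytes_64 = similarities_64.nbytes
bytes_32 = similarities_32.nbytes
```

The saving comes at the cost of reduced numerical precision.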

Dirty-cat Release 0.0.5

SimilarityEncoder: Change the default ngram range to (2, 4) which performs better empirically.
SimilarityEncoder: Added a most_frequent strategy to define prototype categories for large-scale learning.
SimilarityEncoder: Added a k-means strategy to define prototype categories for large-scale learning.
SimilarityEncoder: Added the possibility to use hashing ngrams for stateless fitting with the ngram similarity.
SimilarityEncoder: Performance improvements in the ngram similarity.
SimilarityEncoder: Expose a get_feature_names method.
