Skrub has not been released yet. It is currently undergoing fast development and backward compatibility is not ensured.
TargetEncoder has been removed in favor of sklearn.preprocessing.TargetEncoder, available since scikit-learn 1.3.
The signatures of all encoders and functions have been revised to enforce cleaner calls. This means that some arguments that could previously be passed positionally now have to be passed as keywords. #514 by Lilian Boulard.
DatetimeEncoder no longer removes constant features. It also supports an ‘errors’ argument to raise or coerce errors during transform, and an ‘add_total_seconds’ argument to include the number of seconds since the Epoch. #784 by Vincent Maladiere
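A minimal sketch of the new arguments, assuming they keep the names given above (check the DatetimeEncoder docstring for the exact signature in your version):

import pandas as pd
from skrub import DatetimeEncoder

X = pd.DataFrame({"login": pd.to_datetime(["2023-01-01 10:00", "2023-06-15 08:30"])})
# add_total_seconds adds a column with the number of seconds since the Epoch;
# errors="coerce" would turn unparseable values into NaN instead of raising.
encoder = DatetimeEncoder(add_total_seconds=True, errors="coerce")
X_out = encoder.fit_transform(X)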
The similarity score returned by fuzzy_join() is now between 0 and 1; it used to be between 0.5 and 1. Moreover, the division-by-zero error that occurred when all rows had a perfect match has been fixed. #802 by Jérôme Dockès.
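A short sketch of the rescaled score; the left_on, right_on and return_score parameter names are assumptions to check against the fuzzy_join() docstring:

import pandas as pd
from skrub import fuzzy_join

left = pd.DataFrame({"country": ["France", "Italia", "Germany"]})
right = pd.DataFrame({"country_name": ["France", "Italy", "Germany"],
                      "capital": ["Paris", "Rome", "Berlin"]})
# With return_score=True the joined table carries a matching score column,
# now scaled between 0 (no match) and 1 (perfect match).
joined = fuzzy_join(left, right, left_on="country", right_on="country_name",
                    return_score=True)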
TableVectorizer is now able to apply parallelism at the column level rather than the transformer level. This is the default for univariate transformers, like GapEncoder. #592 by Leo Grinsztajn
TableVectorizer propagates the n_jobs parameter to the underlying transformers, unless the underlying transformer already sets n_jobs explicitly. #761 by Leo Grinsztajn, Guillaume Lemaitre, and Jerome Dockes.
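For illustration, a minimal sketch of the propagation rule:

from skrub import TableVectorizer

# n_jobs set here is forwarded to the underlying transformers;
# a transformer on which n_jobs was already set explicitly keeps its own value.
vectorizer = TableVectorizer(n_jobs=4)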
1-D arrays (and pandas Series) are no longer supported in TableVectorizer. Pass a 2-D array (or a pandas DataFrame) with a single column instead. This change is for compliance with the scikit-learn API. #647 by Guillaume Lemaitre
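A minimal sketch of the difference:

import pandas as pd
from skrub import TableVectorizer

df = pd.DataFrame({"city": ["Paris", "London", "Madrid"]})
TableVectorizer().fit_transform(df[["city"]])   # OK: DataFrame with a single column
# TableVectorizer().fit_transform(df["city"])   # no longer supported: 1-D Series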
GapEncoder’s early-stopping logic has been reworked: the parameters tol and min_iter have been removed, and the parameter max_no_improvement can now be used to control early stopping. #663 by Simona Maggio, #593 by Lilian Boulard, #681 by Leo Grinsztajn
Implementation improvement leading to a ~5x speedup for each iteration.
Better default hyperparameters: batch_size now defaults to 1024, and max_iter_e_steps to 1.
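A sketch of the new defaults and early-stopping control described above (the max_no_improvement value here is arbitrary):

from skrub import GapEncoder

# tol and min_iter are gone; early stopping is now driven by max_no_improvement.
encoder = GapEncoder(n_components=10, batch_size=1024, max_no_improvement=5)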
Removed the most_frequent and k-means strategies from the SimilarityEncoder. These strategies were used for scalability reasons, but we recommend using the GapEncoder instead. #596 by Leo Grinsztajn
check_is_fitted now looks at “transformers_” rather than “columns_”
the default of the remainder parameter in the docstring is now “passthrough” instead of “drop” to match the implementation.
uint8 and int8 dtypes are now considered as numerical columns.
datasets.fetch_employee_salaries() now has a parameter overload_job_titles to allow overloading the job titles (employee_position_title) with the column underfilled_job_title, which provides more information about the job title. #581 by Lilian Boulard
Before skrub: dirty_cat
Skrub was born from the dirty_cat package.
Dirty-cat Release 0.4.1
GapEncoder.transform() will no longer continue fitting the instance. This makes functions that depend on it (score(), etc.) deterministic once fitted. #548 by Lilian Boulard
Dirty-cat Release 0.4.0
New experimental feature: joining tables using fuzzy_join() by approximate key matching. Matches are based on string similarities, and the nearest-neighbor matches are found for each category. #291 by Jovan Stojanovic and Leo Grinsztajn
New experimental feature: deduplicating misspelled categories using deduplicate() by clustering string distances. This function works best when there are significantly more duplicates than underlying categories. #339 by Moritz Boos.
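A minimal sketch, assuming deduplicate() takes a sequence of strings and returns the corrected spellings (check its docstring for the exact return type):

from dirty_cat import deduplicate

spellings = ["online course", "online courses", "onlin course",
             "grocery store", "grocery stores", "grocey store"]
# Works best when duplicates clearly outnumber the underlying categories.
corrected = deduplicate(spellings)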
datasets.fetching: contains a new function get_ken_embeddings() that can be used to download Wikipedia embeddings and filter them by type.
TableVectorizer’s default OneHotEncoder for low cardinality categorical variables now defaults to handle_unknown=”ignore” instead of handle_unknown=”error” (for sklearn >= 1.0.0). This means that categories seen only at test time will be encoded by a vector of zeroes instead of raising an error. #473 by Leo Grinsztajn
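In scikit-learn terms, the new default for low-cardinality columns corresponds to:

from sklearn.preprocessing import OneHotEncoder

# Categories unseen during fit become all-zero rows at transform time instead of raising.
low_cardinality_encoder = OneHotEncoder(handle_unknown="ignore")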
Dirty-cat Release 0.3.0
DatetimeEncoder can transform a datetime column into several numerical columns (year, month, day, hour, minute, second, …). It is now the default transformer used in the TableVectorizer for datetime columns. #239 by Leo Grinsztajn
TableVectorizer has seen some major improvements and bug fixes:
Fixes the automatic casting logic in transform.
To avoid dimensionality explosion when a feature has two unique values, the default encoder (OneHotEncoder) now drops one of the two vectors (see parameter drop=”if_binary”).
transform can now return unencoded features, like the ColumnTransformer’s behavior. Previously, a RuntimeError was raised.
Backward-incompatible change in the TableVectorizer: to apply remainder to features (with the *_transformer parameters), the value 'remainder' must be passed, instead of None as in previous versions. None now indicates that we want to use the default transformer. #303 by Lilian Boulard
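A sketch of the change, using a hypothetical datetime_transformer parameter to stand in for the *_transformer parameters (check the class docstring for the actual parameter names in your version):

from dirty_cat import TableVectorizer

# Apply the remainder transformer to these columns (previously done by passing None):
TableVectorizer(datetime_transformer="remainder")
# None now means "use the default transformer for these columns":
TableVectorizer(datetime_transformer=None)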
Bumped minimum dependencies:
Dropped support for Jaro, Jaro-Winkler and Levenshtein distances.
Dirty-cat Release 0.2.2
Dirty-cat Release 0.2.1
Improvements to the TableVectorizer:
Type detection works better: handles dates, numerics columns encoded as strings, or numeric columns containing strings for missing values.
Dirty-cat Release 0.2.0
Also see pre-release 0.2.0a1 below for additional changes.
Bump minimum dependencies:
datasets.fetching - backward-incompatible changes to the example datasets fetchers:
The backend has changed: we now exclusively fetch the datasets from OpenML. End users should not see any difference regarding this.
The frontend, however, changed a little: the fetching functions stay the same but their return values were modified in favor of a more Pythonic interface. Refer to the docstrings of functions dirty_cat.datasets.fetch_* for more information.
Update handle_missing parameters:
GapEncoder: the default value “zero_impute” becomes “empty_impute” (see doc).
MinHashEncoder: the default value “” becomes “zero_impute” (see doc).
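For illustration, the new defaults written out explicitly (values as given above; see each encoder's docstring for the allowed options in your version):

from dirty_cat import GapEncoder, MinHashEncoder

gap = GapEncoder(handle_missing="empty_impute")
minhash = MinHashEncoder(handle_missing="zero_impute")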
Dirty-cat Release 0.2.0a1
Version 0.2.0a1 is a pre-release. To try it, you have to install it manually using:
pip install --pre dirty_cat==0.2.0a1
or from the GitHub repository:
pip install git+https://github.com/dirty-cat/dirty_cat.git
Bump minimum dependencies:
Python (>= 3.6)
NumPy (>= 1.16)
SciPy (>= 1.2)
scikit-learn (>= 0.20.0)
TableVectorizer: Added automatic transform through the TableVectorizer class. It transforms columns automatically based on their type. It provides a replacement for scikit-learn’s ColumnTransformer that is simpler to use on heterogeneous pandas DataFrames. #167 by Lilian Boulard
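A minimal usage sketch of the automatic, per-type encoding described above:

import pandas as pd
from dirty_cat import TableVectorizer

df = pd.DataFrame({
    "city": ["Paris", "London", "Paris", "Madrid"],
    "temperature": [12.1, 9.5, 14.0, 18.3],
})
# Each column is dispatched to an appropriate transformer based on its type.
X = TableVectorizer().fit_transform(df)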
Backward-incompatible change to GapEncoder: it now only supports two-dimensional inputs of shape (n_samples, n_features). Internally, features are encoded by independent GapEncoder models, and are then concatenated into a single matrix. #185 by Lilian Boulard and Alexis Cvetkov.
Dirty-cat Release 0.1.1
Dirty-cat Release 0.1.0
Dirty-cat Release 0.0.7
Added the fast_hash.py file that implements minhash encoding through the MinHashEncoder class. This method allows for fast and scalable encoding of string categorical variables.
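A small sketch of the encoder; n_components controls the dimension of the hashed representation (the expected input shape may differ across versions):

from dirty_cat import MinHashEncoder

encoder = MinHashEncoder(n_components=30)
hashed = encoder.fit_transform([["London"], ["Paris"], ["New York"]])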
datasets.fetch_employee_salaries: changed the download source for employee_salaries.
The function now returns a Bunch with a dataframe under the field “data”, and not the path to the CSV file.
The field “description” has been renamed to “DESCR”.
SimilarityEncoder: Fixed a bug when using the Jaro-Winkler distance as a similarity metric. Our implementation now accurately reproduces the behaviour of the
SimilarityEncoder: Added a handle_missing attribute to allow encoding with missing values.
TargetEncoder: Added a handle_missing attribute to allow encoding with missing values.
MinHashEncoder: Added a handle_missing attribute to allow encoding with missing values.
Dirty-cat Release 0.0.6
SimilarityEncoder speed-ups:
- computing the vocabulary count vectors in fit rather than in transform
- computing the similarities in parallel using joblib. This option can be turned on/off via the n_jobs attribute of the SimilarityEncoder.
SimilarityEncoder: Fixed a bug that was preventing a SimilarityEncoder from being created when categories was a list.
SimilarityEncoder: Set the dtype passed to the ngram similarity to float32, which reduces memory consumption during encoding.
Dirty-cat Release 0.0.5
SimilarityEncoder: Change the default ngram range to (2, 4) which performs better empirically.
SimilarityEncoder: Added a most_frequent strategy to define prototype categories for large-scale learning.
SimilarityEncoder: Added a k-means strategy to define prototype categories for large-scale learning.
SimilarityEncoder: Added the possibility to use hashing ngrams for stateless fitting with the ngram similarity.
SimilarityEncoder: Performance improvements in the ngram similarity.
SimilarityEncoder: Expose a get_feature_names method.
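For instance, a sketch with the new default ngram range spelled out (note that the most_frequent and k-means strategies listed here were removed in a later release, as mentioned above):

from dirty_cat import SimilarityEncoder

# ngram_range=(2, 4) is now the default; shown explicitly for clarity.
encoder = SimilarityEncoder(ngram_range=(2, 4))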