Release history#

Ongoing development#

Release 0.6.0#

Highlights#

Major feature! Skrub DataOps are a powerful new way of combining dataframe transformations over multiple tables, and machine learning pipelines. DataOps can be combined to form compled data plans, that can be used to train and tune machine learning models. Then, the DataOps plans can be exported as Learners (skrub.SkrubLearner), standalone objects that can be used on new data. More detail about the DataOps can be found in the User guide and in the examples.
The TableReport has been improved with many new features. Series are now supported directly. It is now possible to skip computing column associations and generating plots when the number of columns in the dataframe exceeds a user-defined threshold. Columns with high cardinality and sorted columns are now highlighted in the report.
selectors, ApplyToCols and ApplyToFrame are now available, providing utilities for selecting columns to which a transformer should be applied in a flexible way. For more details, see the User guide and the example.
The SquashingScaler has been added: it robustly rescales and smoothly clips numerical columns, enabling more robust handling of numerical columns with neural networks. See the example

New features#

The Skrub DataOps are new mechanism for building machine-learning pipelines that handle multiple tables and easily describing their hyperparameter spaces. Main PR: #1233 by Jérôme Dockès. Additional work from other contributors can be found here: Vincent Maladiere provided very important help by trying the DataOps on many use-cases and datasets, providing feedback and suggesting improvements, improving the examples (including creating all the figures in the examples) and adding jitter to the parallel coordinate plots, Riccardo Cappuzzo experimented with the DataOps, suggested improvements and improved the examples, Gaël Varoquaux , Guillaume Lemaitre, Adrin Jalali, Olivier Grisel and others participated through many discussions in defining the requirements and the public API. See the examples for an introduction.
The selectors module provides utilities for selecting columns to which a transformer should be applied in a flexible way. The module was created in #895 by Jérôme Dockès and added to the public API in #1341 by Jérôme Dockès.
The DropUninformative transformer is now available. This transformer employs different heuristics to detect columns that are not likely to bring useful information for training a model. The current implementation includes detection of columns that contain only a single value (constant columns), only missing values, or all unique values (such as IDs). #1313 by Riccardo Cappuzzo.
get_config(), set_config() and config_context() are now available to configure settings for dataframes display and expressions. patch_display() and unpatch_display() are deprecated and will be removed in the next release of skrub. #1427 by Vincent Maladiere. The global configuration includes the parameter cardinality_threshold that controls the threshold value used to warn user if they have high cardinality columns in their dataset. #1498 by rouk1. Additionally, the parameter float_precision controls the number of significant digits displayed for floating-point values in reports. #1470 by George S.
Added the SquashingScaler, a transformer that robustly rescales and smoothly clips numerical columns, enabling more robust handling of numerical columns with neural networks. #1310 by Vincent Maladiere and David Holzmüller.
datasets.toy_order() is now available to create a toy dataframe and corresponding targets for examples. #1485 by Antoine Canaguier-Durand.
ApplyToCols and ApplyToFrame are now available to apply transformers on a set of columns independently and jointly respectively. #1478 by Vincent Maladiere.

Changes#

Warning

The default high cardinality encoder for both TableVectorizer and tabular_learner() (now tabular_pipeline()) has been changed from GapEncoder to StringEncoder. #1354 by Riccardo Cappuzzo.

The tabular_learner function has been deprecated in favor of tabular_pipeline() to honor its scikit-learn pipeline cultural heritage, and remove the ambiguity with the data ops Learner. #1493 by Vincent Maladiere.
StringEncoder now exposes the stop_words argument, which is passed to the underlying vectorizer (TfidfVectorizer, or HashingVectorizer). #1415 by Vincent Maladiere.
A new parameter max_association_columns has been added to the TableReport to skip association computation when the number of columns exceeds the specified value. #1304 by Victoria Shevchenko.
The packaging dependency was removed. #1307 by Jovan Stojanovic
TextEncoder, StringEncoder and GapEncoder now compute the total standard deviation norm during training, which is a global constant, and normalize the vector outputs by performing element-wise division on all entries. #1274 by Vincent Maladiere.
The DropIfTooManyNulls transformer has been replaced by the DropUninformative transformer and will be removed in a future release. #1313 by Riccardo Cappuzzo
The concat_horizontal() function was replaced with concat(). Horizontal or vertical concatenation is now controlled by the axis parameter. #1334 by Parasa V Prajwal.
The TableVectorizer and Cleaner now accept a datetime_format parameter for specifying the format to use when parsing datetime columns. #1358 by Riccardo Cappuzzo.
The SimpleCleaner has been removed. use Cleaner instead. #1370 by Riccardo Cappuzzo.
The periodic encoding for the day_in_year has been removed from the DatetimeEncoder as it was redundant. The feature itself is still added if the flag is set to True. #1396 by Riccardo Cappuzzo.
The naming scheme used for the features generated by TextEncoder, StringEncoder, MinHashEncoder, DatetimeEncoder has been standardized. Now features generated by all encoders have indices in the range [0, n_components-1], rather than [1, n_components]. Additionally, columns with empty name are assigned a default name that depends on the encoder used. #1405 by Riccardo Cappuzzo.
The optional dependencies ‘dev’, ‘doc’, ‘lint’ and ‘test’ have been coalesced into ‘dev’. #1404 by Vincent Maladiere.
The TableReport now supports Series in addition to Dataframes. #1420 by Vitor Pohlenz.
The Cleaner now exposes a parameter to convert numerical values to float32. #1440 by Riccardo Cappuzzo.
The TableReport now shows if columns are sorted. #1512 by Dea María Léon.

Bugfixes#

Fixed a bug that caused the StringEncoder and TextEncoder to raise an exception if the input column was a Categorical datatype. #1401 by Riccardo Cappuzzo.

Documentation#

A large number of improvements to the examples, docstrings, and the documentation website have been made. Contributors include Vincent Maladiere, Riccardo Cappuzzo, Jérôme Dockès, Gael Varoquaux, Gabriela Gómez Jiménez, Sylvain Combettes, Frits Hermans, Vitor Pohlenz, Arturo Amor Quiroz, Marie Sacksick, Emilien Battel, George El Haber, Antoine Canaguier-Durand, and Lionel Kusch.

Release 0.5.4#

Maintenance#

Make skrub compatible with scikit-learn 1.7. #1434 by Vincent Maladiere.

Release 0.5.3#

Changes#

The SimpleCleaner has been renamed to Cleaner. Use of the name SimpleCleaner is deprecated and will result in an error in some future release of skrub. #1275 by Riccardo Cappuzzo.
A new parameter max_plot_columns has been added to the TableReport and patch_display() to skip column plots when the number of columns exceeds the specified value. #1255 by Priscilla Baah.

Release 0.5.2#

New features#

The TableReport now switches its visual theme between light and dark according to the user preferences. #1201 by rouk1.
Adding a new way to control the location of the data directory, using envar SKRUB_DATA_DIRECTORY. #1215 by Thomas S.
The DatetimeEncoder now supports periodic encoding of datetime features with trigonometric functions and B-splines transformers. #1235 by Riccardo Cappuzzo.
The TableReport now also compute Pearson’s correlation for numeric values. #1203 by Reshama Shaikh and Vincent Maladiere.
The SimpleCleaner is now available (⚠️ it was renamed to Cleaner in skrub 0.5.3.). This transformer is a lightweight pre-processor that applies some of the transformations applied by the TableVectorizer, with a simpler interface. #1266 by Riccardo Cappuzzo and Jerome Dockes .

Changes#

The estimator returned by tabular_learner() now uses spline encoding of datetime features when the supervised learner is not a model based on decision trees such as random forests or gradient boosting. #1264 by Guillaume Lemaitre.
The “distribution” tab of the TableReport now stacks cards horizontally to avoid adding vertical space. #1259 by Gaël Varoquaux
Progress messages when generating a TableReport are now written to stderr instead of stdout. #1236 by Priscilla Baah
Optimize the StringEncoder: lower memory footprint and faster execution in some cases. #1248 by Gaël Varoquaux

Bug fixes#

StringEncoder now works correctly in presence of null values. #1224 by Jérôme Dockès.
The TableVectorizer.get_feature_names_out() method now works when used in a scikit-learn pipeline by exposing the input_features parameter. #1258 by Guillaume Lemaitre.

Release 0.5.1#

New features#

The StringEncoder encodes strings using tf-idf and truncated SVD decomposition and provides a cheaper alternative to GapEncoder. #1159 by Riccardo Cappuzzo.

Changes#

New dataset fetching methods have been added: fetch_videogame_sales(), fetch_bike_sharing(), fetch_flight_delays(), fetch_country_happiness(), and removed fetch_road_safety(). #1218 by Vincent Maladiere

Bug fixes#

Maintenance#

Release 0.4.1#

Changes#

class:

TableReport has write_html method

#1190 by Mojdeh Rastgoo.
A new parameter verbose has been added to the TableReport to toggle on or off the printing of progress information when a report is being generated. #1182 by Priscilla Baah.
A parameter verbose has been added to the patch_display() to toggle on or off the printing of progress information when a table report is being generated. #1188 by Priscilla Baah.
tabular_learner() accepts the alias "regression" for the option "regressor" and "classification" for "classifier". #1180 by Mojdeh Rastgoo.

Bug fixes#

Generating a TableReport could have an effect on the matplotib configuration which could cause plots not to display inline in jupyter notebooks any more. This has been fixed in skrub in #1172 by Jérôme Dockès and the matplotlib issue can be tracked here.
The labels on bar plots in the TableReport for columns of object dtypes that have a repr spanning multiple lines could be unreadable. This has been fixed in #1196 by Jérôme Dockès.
Improve the performance of deduplicate() by removing some unnecessary computations. #1193 by Jérôme Dockès.

Maintenance#

Make skrub compatible with scikit-learn 1.6. #1169 by Guillaume Lemaitre.

Release 0.4.0#

Highlights#

The TextEncoder can extract embeddings from a string column with a deep learning language model (possibly downloaded from the HuggingFace Hub).
Several improvements to the TableReport such as better support for other scripts than the latin alphabet in the bar plot labels, smaller report sizes, clipping the outliers to better see the details of distributions in histograms. See the full changelog for details.
The TableVectorizer can now drop columns that contain a fraction of null values above a user-chosen threshold.

New features#

The TextEncoder is now available to encode string columns with diverse entries. It allows the representation of table entries as embeddings computed by a deep learning language model. The weights of this model can be fetched locally or from the HuggingFace Hub. #1077 by Vincent Maladiere.
The column_associations() function has been added. It computes a pairwise measure of statistical dependence between all columns in a dataframe (the same as shown in the TableReport). #1109 by Jérôme Dockès.
The patch_display() function has been added. It changes the display of pandas and polars dataframes in jupyter notebooks to replace them with a TableReport. This can be undone with unpatch_display(). #1108 by Jérôme Dockès

Major changes#

AggJoiner, AggTarget and MultiAggJoiner now require the operations argument. They do not split columns by type anymore, but apply operations on all selected cols. “median” is now supported, “hist” and “value_counts” are no longer supported. #1116 by Théo Jolivet.
The AggTarget no longer supports y inputs of type list. #1116 by Théo Jolivet.

Minor changes#

The column filter selection dropdown in the tablereport is smaller and its label has been removed to save space. #1107 by Jérôme Dockès.
The TableReport now uses the font size of its parent element when inserted into another page. This makes it smaller in pages that use a smaller font size than the browser default such as VSCode in some configurations. It also makes it easier to control its size when inserting it in a web page by setting the font size of its parent element. A few other small adjustments have also been made to make it a bit more compact. #1098 by Jérôme Dockès.
Display of labels in the plots of the TableReport, especially for other scripts than the latin alphabet, has improved.
- before, some characters could be missing and replaced by empty boxes.
- before, when the text is truncated, the ellipsis “…” could appear on the wrong side for right-to-left scripts.
Moreover, when the text contains line breaks it now appears all on one line. Note this only affects the labels in the plots; the rest of the report did not have these problems. #1097 by Jérôme Dockès and #1138 by Jérôme Dockès.
In the TableReport it is now possible, before clicking any of the cells, to reach the dataframe sample table and activate a cell with tab key navigation. #1101 by Jérôme Dockès.
The “Column name” column of the “summary statistics” table in the TableReport is now always visible when scrolling the table. #1102 by Jérôme Dockès.
Added parameter drop_null_fraction to TableVectorizer to drop columns based on whether they contain a fraction of nulls larger than the given threshold. #1115 and #1149 by Riccardo Cappuzzo.
The TableReport now provides more helpful output for columns of dtype TimeDelta / Duration. #1152 by Jérôme Dockès.
The TableReport now also reports the number of unique values for numeric columns. #1154 by Jérôme Dockès.
The TableReport, when plotting histograms, now detects outliers and clips the range of data shown in the histogram. This allows seeing more detail in the shown distribution. #1157 by Jérôme Dockès.

Bug fixes#

The TableReport could raise an exception when one of the columns contained datetimes with time zones and missing values; this has been fixed in #1114 by Jérôme Dockès.
In scikit-learn versions older than 1.4 the TableVectorizer could fail on polars dataframes when used with the default parameters. This has been fixed in #1122 by Jérôme Dockès.
The TableReport would raise an exception when the input (pandas) dataframe contained several columns with the same name. This has been fixed in #1125 by Jérôme Dockès.
The TableReport would raise an exception when a column contained infinite values. This has been fixed in #1150 by Jérôme Dockès and #1151 by Jérôme Dockès.

Release 0.3.1#

Minor changes#

For tree-based models, tabular_learner() now adds handle_unknown=’use_encoded_value’ to the OrdinalEncoder, to avoid errors with new categories in the test set. This is consistent with the setting of OneHotEncoder used by default in the TableVectorizer. #1078 by Gaël Varoquaux
The reports created by TableReport, when inserted in an html page (or displayed in a notebook), now use the same font as the surrounding page. #1038 by Jérôme Dockès.
The content of the dataframe corresponding to the currently selected table cell in the TableReport can be copied without actually selecting the text (as in a spreadsheet). #1048 by Jérôme Dockès.
The selection of content displayed in the TableReport’s copy-paste boxes has been removed. Now they always display the value of the selected item. When copied, the repr of the selected item is copied to the clipboard. #1058 by Jérôme Dockès.
A “stats” panel has been added to the TableReport, showing summary statistics for all columns (number of missing values, mean, etc. – similar to pandas.info() ) in a table. It can be sorted by each column. #1056 and #1068 by Jérôme Dockès.
The credit fraud dataset is now available with the fetch_credit_fraud function(). #1053 by Vincent Maladiere.
Added zero padding for column names in MinHashEncoder to improve column ordering consistency. #1069 by Shreekant Nandiyawar.
The selection in the TableReport’s sample table can now be manipulated with the keyboard. #1065 by Jérôme Dockès.
The TableReport now displays the pandas (multi-)index, and has a better display & interaction of pandas columns when the columns are a MultiIndex. #1083 by Jérôme Dockès.
It is possible to control the number of rows displayed by the TableReport in the “sample” tab panel by specifying n_rows. #1083 by Jérôme Dockès.
the TableReport used to raise an exception when the dataframe contained unhashable types such as python lists. This has been fixed in #1087 by Jérôme Dockès.
Display’s columns name with the HTML representation of the fitted TableVectorizer. This has been fixed in #1093 by Shreekant Nandiyawar.
AggTarget will now work even when y is a Series and not raise any error. This has been fixed in #1094 by Shreekant Nandiyawar.

Release 0.3.0#

Highlights#

Polars dataframes are now supported across all skrub estimators.
TableReport generates an interactive report for a dataframe. This page regroups some precomputed examples.

Major changes#

The InterpolationJoiner now supports polars dataframes. #1016 by Théo Jolivet.
The TableReport provides an interactive report on a dataframe’s contents: an overview, summary statistics and plots, statistical associations between columns. It can be displayed in a jupyter notebook, a browser tab or saved as a static HTML page. #984 by Jérôme Dockès.

Minor changes#

Joiner and fuzzy_join() used to raise an error when columns with the same name appeared in the main and auxiliary table (after adding the suffix). This is now allowed and a random string is inserted in the duplicate column to ensure all names are unique. #1014 by Jérôme Dockès.
AggJoiner and AggTarget could produce outputs whose column names varied across calls to transform in some cases in the presence of duplicate column names, now the output names are always the same. #1013 by Jérôme Dockès.
In some cases AggJoiner and AggTarget inserted a column in the output named “index” containing the pandas index of the auxiliary table. This has been corrected. #1020 by Jérôme Dockès.

Release 0.2.0#

Major changes#

The Joiner has been adapted to support polars dataframes. #945 by Théo Jolivet.
The TableVectorizer now consistently applies the same transformation across different calls to transform. There also have been some breaking changes to its functionality: (i) all transformations are now applied independently to each column, i.e. it does not perform multivariate transformations (ii) in specific_transformers the same column may not be used twice (go through 2 different transformers). #902 by Jérôme Dockès.
Some parameters of TableVectorizer have been renamed: high_cardinality_transformer → high_cardinality, low_cardinality_transformer → low_cardinality, datetime_transformer → datetime, numeric_transformer → numeric. #947 by Jérôme Dockès.
The GapEncoder and MinHashEncoder are now a single-column transformers: their fit, fit_transform and transform methods accept a single column (a pandas or polars Series). Dataframes and numpy arrays are not accepted. #920 and #923 by Jérôme Dockès.
Added the MultiAggJoiner that allows to augment a main table with multiple auxiliary tables. #876 by Théo Jolivet.
AggJoiner now only accepts a single table as an input, and some of its parameters were renamed to be consistent with the MultiAggJoiner. It now has a key` parameter that allows to join main and auxiliary tables that share the same column names. #876 by Théo Jolivet.
tabular_learner() has been added to easily create a supervised learner that works well on tabular data. #926 by Jérôme Dockès.

Minor changes#

GapEncoder and MinHashEncoder used to modify their input in-place, replacing missing values with a string. They no longer do so. Their parameter handle_missing has been removed; now missing values are always treated as the empty string. #930 by Jérôme Dockès.
The minimum supported python version is now 3.9 #939 by Jérôme Dockès.
Skrub supports numpy 2. #946 by Jérôme Dockès.
fetch_ken_embeddings() now add suffix even with the default value for the parameter pca_components. #956 by Guillaume Lemaitre.
Joiner now performs some preprocessing (the same as done by the TableVectorizer, eg trying to parse dates, converting pandas object columns with mixed types to a single type) on the joining columns before vectorizing them. #972 by Jérôme Dockès.

skrub release 0.1.1#

This is a bugfix release to adapt to the most recent versions of pandas (2.2) and scikit-learn (1.5). There are no major changes to the functionality of skrub.

skrub release 0.1.0#

Major changes#

TargetEncoder has been removed in favor of sklearn.preprocessing.TargetEncoder, available since scikit-learn 1.3.
Joiner and fuzzy_join() support several ways of rescaling distances; match_score has been replaced by max_dist; bugs which prevented the Joiner to consistently vectorize inputs and accept or reject matches across calls to transform have been fixed. #821 by Jérôme Dockès.
InterpolationJoiner was added to join two tables by using machine-learning to infer the matching rows from the second table. #742 by Jérôme Dockès.
Pipelines including TableVectorizer can now be grid-searched, since we can now call set_params on the default transformers of TableVectorizer. #814 by Vincent Maladiere
to_datetime() is now available to support pandas.to_datetime over dataframes and 2d arrays. #784 by Vincent Maladiere
Some parameters of Joiner have changed. The goal is to harmonize parameters across all estimator that perform join(-like) operations, as discussed in #751. #757 by Jérôme Dockès.
dataframe.pd_join(), dataframe.pd_aggregate(), dataframe.pl_join() and dataframe.pl_aggregate() are now available in the dataframe submodule. #733 by Vincent Maladiere
FeatureAugmenter is renamed to Joiner. #674 by Jovan Stojanovic
fuzzy_join() and FeatureAugmenter can now join on datetime columns. #552 by Jovan Stojanovic
Joiner now supports joining on multiple column keys. #674 by Jovan Stojanovic
The signatures of all encoders and functions have been revised to enforce cleaner calls. This means that some arguments that could previously be passed positionally now have to be passed as keywords. #514 by Lilian Boulard.
Parallelized the GapEncoder column-wise. Parameters n_jobs and verbose added to the signature. #582 by Lilian Boulard
Introducing AggJoiner, a transformer performing aggregation on auxiliary tables followed by left-joining on a base table. #600 by Vincent Maladiere.
Introducing AggTarget, a transformer performing aggregation on the target y, followed by left-joining on a base table. #600 by Vincent Maladiere.
Added the SelectCols and DropCols transformers that allow selecting a subset of a dataframe’s columns inside of a pipeline. #804 by Jérôme Dockès.

Minor changes#

DatetimeEncoder doesn’t remove constant features anymore. It also supports an ‘errors’ argument to raise or coerce errors during transform, and a ‘add_total_seconds’ argument to include the number of seconds since Epoch. #784 by Vincent Maladiere
Scaling of matching_score in fuzzy_join() is now between 0 and 1; it used to be between 0.5 and 1. Moreover, the division by 0 error that occurred when all rows had a perfect match has been fixed. #802 by Jérôme Dockès.
TableVectorizer is now able to apply parallelism at the column level rather than the transformer level. This is the default for univariate transformers, like MinHashEncoder, and GapEncoder. #592 by Leo Grinsztajn
inverse_transform in SimilarityEncoder now works as expected; it used to raise an exception. #801 by Jérôme Dockès.
TableVectorizer propagate the n_jobs parameter to the underlying transformers except if the underlying transformer already set explicitly n_jobs. #761 by Leo Grinsztajn, Guillaume Lemaitre, and Jerome Dockes.
Parallelized the deduplicate() function. Parameter n_jobs added to the signature. #618 by Jovan Stojanovic and Lilian Boulard
Functions datasets.fetch_ken_embeddings(), datasets.fetch_ken_table_aliases() and datasets.fetch_ken_types() have been renamed. #602 by Jovan Stojanovic
Make pyarrow an optional dependencies to facilitate the integration with pyodide. #639 by Guillaume Lemaitre.
Bumped minimal required Python version to 3.10. #606 by Gael Varoquaux
Bumped minimal required versions for the dependencies: - numpy >= 1.23.5 - scipy >= 1.9.3 - scikit-learn >= 1.2.1 - pandas >= 1.5.3 #613 by Lilian Boulard
You can now pass column-specific transformers to TableVectorizer using the specific_transformers argument. #583 by Lilian Boulard.
Do not support 1-D array (and pandas Series) in TableVectorizer. Pass a 2-D array (or a pandas DataFrame) with a single column instead. This change is for compliance with the scikit-learn API. #647 by Guillaume Lemaitre
Fixes a bug in TableVectorizer with remainder: it is now cloned if it’s a transformer so that the same instance is not shared between different transformers. #678 by Guillaume Lemaitre
GapEncoder speedup #680 by Leo Grinsztajn
- Improved GapEncoder’s early stopping logic. The parameters tol and min_iter have been removed. The parameter max_no_improvement can now be used to control the early stopping. #663 by Simona Maggio #593 by Lilian Boulard #681 by Leo Grinsztajn
- Implementation improvement leading to a ~x5 speedup for each iteration.
- Better default hyperparameters: batch_size now defaults to 1024, and max_iter_e_steps to 1.
Removed the most_frequent and k-means strategies from the SimilarityEncoder. These strategy were used for scalability reasons, but we recommend using the MinHashEncoder or the GapEncoder instead. #596 by Leo Grinsztajn
Removed the similarity argument from the SimilarityEncoder constructor, as we only support the ngram similarity. #596 by Leo Grinsztajn
Added the analyzer parameter to the SimilarityEncoder to allow word counts for similarity measures. #619 by Jovan Stojanovic
skrub now uses modern type hints introduced in PEP 585. #609 by Lilian Boulard
Some bug fixes for TableVectorizer ( #579):
- check_is_fitted now looks at “transformers_” rather than “columns_”
- the default of the remainder parameter in the docstring is now “passthrough” instead of “drop” to match the implementation.
- uint8 and int8 dtypes are now considered as numerical columns.
Removed the leading “<” and trailing “>” symbols from KEN entities and types. #601 by Jovan Stojanovic
Add get_feature_names_out method to MinHashEncoder. #616 by Leo Grinsztajn
Removed requests from the requirements. #613 by Lilian Boulard
TableVectorizer now handles mixed types columns without failing by converting them to string before type inference. #623`by :user:`Leo Grinsztajn <LeoGrin>
Moved the default storage location of data to the user’s home folder. #652 by Felix Lefebvre and Gael Varoquaux
Fixed bug when using TableVectorizer’s transform method on categorical columns with missing values. #644 by Leo Grinsztajn
TableVectorizer never output a sparse matrix by default. This can be changed by increasing the sparse_threshold parameter. #646 by Leo Grinsztajn
TableVectorizer doesn’t fail anymore if an inferred type doesn’t work during transform. The new entries not matching the type are replaced by missing values. #666 by Leo Grinsztajn

Dataset fetcher datasets.fetch_employee_salaries() now has a parameter overload_job_titles to allow overloading the job titles (employee_position_title) with the column underfilled_job_title, which provides some more information about the job title. #581 by Lilian Boulard

Fix bugs which was triggered when extract_until was “year”, “month”, “microseconds” or “nanoseconds”, and add the option to set it to None to only extract total_time, the time from epoch. DatetimeEncoder. #743 by Leo Grinsztajn

Before skrub: dirty_cat#

Skrub was born from the dirty_cat package.

Dirty-cat release 0.4.1#

Major changes#

fuzzy_join() and FeatureAugmenter can now join on numerical columns based on the euclidean distance. #530 by Jovan Stojanovic
fuzzy_join() and FeatureAugmenter can perform many-to-many joins on lists of numerical or string key columns. #530 by Jovan Stojanovic
GapEncoder.transform() will not continue fitting of the instance anymore. It makes functions that depend on it (get_feature_names_out(), score(), etc.) deterministic once fitted. #548 by Lilian Boulard
fuzzy_join() and FeatureAugmenter now perform joins on missing values as in pandas.merge but raises a warning. #522 and #529 by Jovan Stojanovic
Added get_ken_table_aliases() and get_ken_types() for exploring KEN embeddings. #539 by Lilian Boulard.

Minor changes#

Improvement of date column detection and date format inference in TableVectorizer. The format inference now tries to find a format which works for all non-missing values of the column, and only tries pandas default inference if it fails. #543 by Leo Grinsztajn #587 by Leo Grinsztajn

Dirty-cat Release 0.4.0#

Major changes#

SuperVectorizer is renamed as TableVectorizer, a warning is raised when using the old name. #484 by Jovan Stojanovic
New experimental feature: joining tables using fuzzy_join() by approximate key matching. Matches are based on string similarities and the nearest neighbors matches are found for each category. #291 by Jovan Stojanovic and Leo Grinsztajn
New experimental feature: FeatureAugmenter, a transformer that augments with fuzzy_join() the number of features in a main table by using information from auxiliary tables. #409 by Jovan Stojanovic
Unnecessary API has been made private: everything (files, functions, classes) starting with an underscore shouldn’t be imported in your code. #331 by Lilian Boulard
The MinHashEncoder now supports a n_jobs parameter to parallelize the hashes computation. #267 by Leo Grinsztajn and Lilian Boulard.
New experimental feature: deduplicating misspelled categories using deduplicate() by clustering string distances. This function works best when there are significantly more duplicates than underlying categories. #339 by Moritz Boos.

Minor changes#

Add example Wikipedia embeddings to enrich the data. #487 by Jovan Stojanovic
datasets.fetching: contains a new function get_ken_embeddings() that can be used to download Wikipedia embeddings and filter them by type.
datasets.fetching: contains a new function fetch_world_bank_indicator() that can be used to download indicators from the World Bank Open Data platform. #291 by Jovan Stojanovic
Removed example Fitting scalable, non-linear models on data with dirty categories. #386 by Jovan Stojanovic
MinHashEncoder’s minhash() method is no longer public. #379 by Jovan Stojanovic
Fetching functions now have an additional argument directory, which can be used to specify where to save and load from datasets. #432 by Lilian Boulard
Fetching functions now have an additional argument directory, which can be used to specify where to save and load from datasets. #432 and #453 by Lilian Boulard
The TableVectorizer’s default OneHotEncoder for low cardinality categorical variables now defaults to handle_unknown=”ignore” instead of handle_unknown=”error” (for sklearn >= 1.0.0). This means that categories seen only at test time will be encoded by a vector of zeroes instead of raising an error. #473 by Leo Grinsztajn

Bug fixes#

The MinHashEncoder now considers None and empty strings as missing values, rather than raising an error. #378 by Gael Varoquaux

Dirty-cat Release 0.3.0#

Major changes#

New encoder: DatetimeEncoder can transform a datetime column into several numerical columns (year, month, day, hour, minute, second, …). It is now the default transformer used in the TableVectorizer for datetime columns. #239 by Leo Grinsztajn
The TableVectorizer has seen some major improvements and bug fixes:
- Fixes the automatic casting logic in transform.
- To avoid dimensionality explosion when a feature has two unique values, the default encoder (OneHotEncoder) now drops one of the two vectors (see parameter drop=”if_binary”).
- fit_transform and transform can now return unencoded features, like the ColumnTransformer’s behavior. Previously, a RuntimeError was raised.
#300 by Lilian Boulard
Backward-incompatible change in the TableVectorizer: To apply remainder to features (with the *_transformer parameters), the value 'remainder' must be passed, instead of None in previous versions. None now indicates that we want to use the default transformer. #303 by Lilian Boulard
Support for Python 3.6 and 3.7 has been dropped. Python >= 3.8 is now required. #289 by Lilian Boulard
Bumped minimum dependencies:
- scikit-learn>=0.23
- scipy>=1.4.0
- numpy>=1.17.3
- pandas>=1.2.0 #299 and #300 by Lilian Boulard
Dropped support for Jaro, Jaro-Winkler and Levenshtein distances.
- The SimilarityEncoder now exclusively uses ngram for similarities, and the similarity parameter is deprecated. It will be removed in 0.5. #282 by Lilian Boulard

Notes#

The transformers_ attribute of the TableVectorizer now contains column names instead of column indices for the “remainder” columns. #266 by Leo Grinsztajn

Dirty-cat Release 0.2.2#

Bug fixes#

Fixed a bug in the TableVectorizer causing a FutureWarning when using the get_feature_names_out() method. #262 by Lilian Boulard

Dirty-cat Release 0.2.1#

Major changes#

Improvements to the TableVectorizer
- Type detection works better: handles dates, numerics columns encoded as strings, or numeric columns containing strings for missing values.
#238 by Leo Grinsztajn
get_feature_names() becomes get_feature_names_out(), following changes in the scikit-learn API. get_feature_names() is deprecated in scikit-learn > 1.0. #241 by Gael Varoquaux
Improvements to the MinHashEncoder
- It is now possible to fit multiple columns simultaneously with the MinHashEncoder. Very useful when using for instance the make_column_transformer() function, on multiple columns.
#243 by Jovan Stojanovic

Bug-fixes#

Fixed a bug that resulted in the GapEncoder ignoring the analyzer argument. #242 by Jovan Stojanovic
GapEncoder’s get_feature_names_out now accepts all iterators, not just lists. #255 by Lilian Boulard
Fixed DeprecationWarning raised by the usage of distutils.version.LooseVersion. #261 by Lilian Boulard

Notes#

Remove trailing imports in the MinHashEncoder.
Fix typos and update links for website.
Documentation of the TableVectorizer and the SimilarityEncoder improved.

Dirty-cat Release 0.2.0#

Also see pre-release 0.2.0a1 below for additional changes.

Major changes#

Bump minimum dependencies:
- scikit-learn (>=0.21.0) #202 by Lilian Boulard
- pandas (>=1.1.5) ! NEW REQUIREMENT ! #155 by Lilian Boulard
datasets.fetching - backward-incompatible changes to the example datasets fetchers:
- The backend has changed: we now exclusively fetch the datasets from OpenML. End users should not see any difference regarding this.
- The frontend, however, changed a little: the fetching functions stay the same but their return values were modified in favor of a more Pythonic interface. Refer to the docstrings of functions dirty_cat.datasets.fetch_* for more information.
- The example notebooks were updated to reflect these changes. #155 by Lilian Boulard
Backward incompatible change to MinHashEncoder: The MinHashEncoder now only supports two dimensional inputs of shape (N_samples, 1). #185 by Lilian Boulard and Alexis Cvetkov.
Update handle_missing parameters:
- GapEncoder: the default value “zero_impute” becomes “empty_impute” (see doc).
- MinHashEncoder: the default value “” becomes “zero_impute” (see doc).
#210 by Alexis Cvetkov.
Add a method “get_feature_names_out” for the GapEncoder and the TableVectorizer, since get_feature_names will be depreciated in scikit-learn 1.2. #216 by Alexis Cvetkov

Notes#

Removed hard-coded CSV file dirty_cat/data/FiveThirtyEight_Midwest_Survey.csv.
Improvements to the TableVectorizer
- Missing values are not systematically imputed anymore
- Type casting and per-column imputation are now learnt during fitting
- Several bugfixes
#201 by Lilian Boulard

Dirty-cat Release 0.2.0a1#

Version 0.2.0a1 is a pre-release. To try it, you have to install it manually using:

pip install --pre dirty_cat==0.2.0a1

or from the GitHub repository:

pip install git+https://github.com/dirty-cat/dirty_cat.git

Major changes#

Bump minimum dependencies:
- Python (>= 3.6)
- NumPy (>= 1.16)
- SciPy (>= 1.2)
- scikit-learn (>= 0.20.0)
TableVectorizer: Added automatic transform through the TableVectorizer class. It transforms columns automatically based on their type. It provides a replacement for scikit-learn’s ColumnTransformer simpler to use on heterogeneous pandas DataFrame. #167 by Lilian Boulard
Backward incompatible change to GapEncoder: The GapEncoder now only supports two-dimensional inputs of shape (n_samples, n_features). Internally, features are encoded by independent GapEncoder models, and are then concatenated into a single matrix. #185 by Lilian Boulard and Alexis Cvetkov.

Bug-fixes#

Fix get_feature_names for scikit-learn > 0.21. #216 by Alexis Cvetkov

Dirty-cat Release 0.1.1#

Major changes#

Bug-fixes#

RuntimeWarnings due to overflow in GapEncoder. #161 by Alexis Cvetkov

Dirty-cat Release 0.1.0#

Major changes#

GapEncoder: Added online Gamma-Poisson factorization through the GapEncoder class. This method discovers latent categories formed via combinations of substrings, and encodes string data as combinations of these categories. To be used if interpretability is important. #153 by Alexis Cvetkov

Bug-fixes#

Multiprocessing exception in notebook. #154 by Lilian Boulard

Dirty-cat Release 0.0.7#

MinHashEncoder: Added minhash_encoder.py and fast_hast.py files that implement minhash encoding through the MinHashEncoder class. This method allows for fast and scalable encoding of string categorical variables.
datasets.fetch_employee_salaries: change the origin of download for employee_salaries.
- The function now return a bunch with a dataframe under the field “data”, and not the path to the csv file.
- The field “description” has been renamed to “DESCR”.
SimilarityEncoder: Fixed a bug when using the Jaro-Winkler distance as a similarity metric. Our implementation now accurately reproduces the behaviour of the python-Levenshtein implementation.
SimilarityEncoder: Added a handle_missing attribute to allow encoding with missing values.
TargetEncoder: Added a handle_missing attribute to allow encoding with missing values.
MinHashEncoder: Added a handle_missing attribute to allow encoding with missing values.

Dirty-cat Release 0.0.6#

SimilarityEncoder: Accelerate SimilarityEncoder.transform, by:
- computing the vocabulary count vectors in fit instead of transform
- computing the similarities in parallel using joblib. This option can be turned on/off via the n_jobs attribute of the SimilarityEncoder.
SimilarityEncoder: Fix a bug that was preventing a SimilarityEncoder to be created when categories was a list.
SimilarityEncoder: Set the dtype passed to the ngram similarity to float32, which reduces memory consumption during encoding.

Dirty-cat Release 0.0.5#

SimilarityEncoder: Change the default ngram range to (2, 4) which performs better empirically.
SimilarityEncoder: Added a most_frequent strategy to define prototype categories for large-scale learning.
SimilarityEncoder: Added a k-means strategy to define prototype categories for large-scale learning.
SimilarityEncoder: Added the possibility to use hashing ngrams for stateless fitting with the ngram similarity.
SimilarityEncoder: Performance improvements in the ngram similarity.
SimilarityEncoder: Expose a get_feature_names method.

Release history#

Ongoing development#

Release 0.6.0#

Highlights#

New features#

Changes#

Bugfixes#

Documentation#

Release 0.5.4#

Maintenance#

Release 0.5.3#

Changes#

Release 0.5.2#

New features#

Changes#

Bug fixes#

Release 0.5.1#

New features#

Changes#

Bug fixes#

Maintenance#

Release 0.4.1#

Changes#

Bug fixes#

Maintenance#

Release 0.4.0#

Highlights#

New features#

Major changes#

Minor changes#

Bug fixes#

Release 0.3.1#

Minor changes#

Release 0.3.0#

Highlights#

Major changes#

Minor changes#

Release 0.2.0#

Major changes#

Minor changes#

skrub release 0.1.1#

skrub release 0.1.0#

Major changes#

Minor changes#

Before skrub: dirty_cat#

Dirty-cat release 0.4.1#

Major changes#

Minor changes#

Dirty-cat Release 0.4.0#

Major changes#

Minor changes#

Bug fixes#

Dirty-cat Release 0.3.0#

Major changes#

Notes#

Dirty-cat Release 0.2.2#

Bug fixes#

Dirty-cat Release 0.2.1#

Major changes#

Bug-fixes#

Notes#

Dirty-cat Release 0.2.0#

Major changes#

Notes#

Dirty-cat Release 0.2.0a1#

Major changes#

Bug-fixes#

Dirty-cat Release 0.1.1#

Major changes#

Bug-fixes#

Dirty-cat Release 0.1.0#

Major changes#

Bug-fixes#

Dirty-cat Release 0.0.7#

Dirty-cat Release 0.0.6#

Dirty-cat Release 0.0.5#

This Page