.. _changes:
===============
Release history
===============
.. currentmodule:: skrub
Release 0.7.0
=============
New features
------------
- It is now possible to tune the choices in a :class:`DataOp` with `Optuna
`_. See
:ref:`example_optuna_choices` for an example.
:pr:`1661` by :user:`Jérôme Dockès `.
- :meth:`DataOp.skb.apply` now allows passing extra named arguments to the
estimator's methods through the parameters ``fit_kwargs``, ``predict_kwargs``
etc. :pr:`1642` by :user:`Jérôme Dockès `.
- :class:`TableReport` now displays the mean statistic for boolean columns.
:pr:`1647` by :user:`Abdelhakim Benechehab `.
- :meth:`DataOp.skb.get_vars` allows inspecting all the variables, or all the
named dataops, in a :class:`DataOp`. This lets us easily know what keys should
be present in the ``environment`` dictionary we pass to
:meth:`DataOp.skb.eval` or to :meth:`SkrubLearner.fit`,
:meth:`SkrubLearner.predict`, etc.
:pr:`1646` by :user:`Jérôme Dockès `.
- :meth:`DataOp.skb.iter_cv_splits` iterates over the training and testing
environments produced by a CV splitter -- similar to
:meth:`DataOp.skb.train_test_split` but for multiple cross-validation splits.
:pr:`1653` by :user:`Jérôme Dockès `.
- :class:`TableReport` now supports NumPy arrays. :pr:`1676` by :user:`Nisma Amjad `.
- :meth:`DataOp.skb.full_report` now accepts a new parameter, ``title``, that is displayed
in the HTML report.
:pr:`1654` by :user:`Marie Sacksick `.
- :class:`TableReport` now includes the ``open_tab`` parameter, which lets the
user select which tab should be opened when the ``TableReport`` is
rendered. :pr:`1737` by :user:`Riccardo Cappuzzo`.
Changes
-------
- The minimum supported version of Python has been increased to 3.10. Additionally,
the minimum supported versions of scikit-learn and requests are 1.4.2 and 2.27.1
respectively. Support for Python 3.14 has been added.
:pr:`1572` by :user:`Riccardo Cappuzzo`.
- The :meth:`DataOp.skb.full_report` method now deletes reports created with
``output_dir=None`` after 7 days. :pr:`1657` by :user:`Simon Dierickx `.
- The :func:`tabular_pipeline` function now uses a :class:`SquashingScaler` instead of a
:class:`StandardScaler` for centering and scaling numerical features
when linear models are used.
:pr:`1644` by :user:`Simon Dierickx `.
- The transformer :class:`ToFloat`, previously called ``ToFloat32``, is now public.
:pr:`1687` by :user:`Marie Sacksick `.
- Improved the error message raised when a Polars LazyFrame is passed to
:class:`TableReport`, clarifying that ``.collect()`` must be called first.
:pr:`1767` by :user:`Fatima Ben Kadour `.
- Computing the associations in :class:`TableReport` is now deterministic and can be controlled
by the new parameter ``subsampling_seed`` of the global configuration.
:pr:`1775` by :user:`Thomas S. `.
- Added ``cast_to_str`` parameter to :class:`Cleaner` to prevent unintended
conversion of list/object-like columns to strings unless explicitly enabled.
:pr:`1789` by :user:`PilliSiddharth`.
Bugfixes
--------
- The :func:`skrub.cross_validate` function now raises a specific exception if the wrong variable
type is passed.
:pr:`1799` by :user:`Eloi Massoulié`.
- Fixed various issues with some transformers by adding ``get_feature_names_out``
to all single column transformers.
:pr:`1666` by :user:`Riccardo Cappuzzo`.
- Issues occurring when :meth:`DataOp.skb.apply` was passed a DataOp as the
estimator have been fixed in :pr:`1671` by :user:`Jérôme Dockès
`.
- :class:`TableReport` could raise an error while trying to check if Polars
columns with some dtypes (lists, structs) are sorted. It would not indicate
Polars columns sorted in descending order. Fixed in :pr:`1673` by
:user:`Jérôme Dockès `.
- Fixed nightly checks and added support for upcoming library versions, including Pandas
v3.0. :pr:`1664` by :user:`Auguste Baum ` and
:user:`Riccardo Cappuzzo `.
- Fixed the use of :class:`TableReport` and :class:`Cleaner` with Polars dataframes
containing a column with empty string as name.
:pr:`1722` by :user:`Marie Sacksick `.
- Fixed an issue where :class:`TableReport` would fail when computing associations
for Polars dataframes if PyArrow was not installed.
:pr:`1742` by :user:`Riccardo Cappuzzo `.
- Fixed an issue in the Data Ops report generation in cases where the DataOp
contained escape characters or spanned multiple lines.
:pr:`1764` by :user:`Riccardo Cappuzzo `.
- Added :meth:`get_feature_names_out` to :class:`Cleaner` for consistency with the
:class:`TableVectorizer` and other transformers. :pr:`1762` by
:user:`Riccardo Cappuzzo `.
- Improved the error message raised when :class:`TextEncoder` is used without the optional
transformers dependencies. :pr:`1769` by :user:`Fangxuan Zhou `.
- Accessing ``.skb.applied_estimator`` on a :class:`DataOp` after calling
``.skb.set_name()``, ``.skb.set_description()``, ``.skb.mark_as_X()`` or
``.skb.mark_as_y()`` used to raise an error. This has been fixed in :pr:`1782`
by :user:`Jérôme Dockès `.
- Fixed potential issues that could arise in :meth:`ParamSearch.plot_results`
when NaN values were present in the cross-validation results.
:pr:`1800` by :user:`Riccardo Cappuzzo `.
Release 0.6.2
=============
New features
------------
- :meth:`DataOp.skb.full_report` now displays the time each node took to
evaluate. :pr:`1596` by :user:`Jérôme Dockès `.
Changes
-------
- Ken embeddings are now deprecated. The functions :func:`datasets.get_ken_embeddings`,
:func:`datasets.get_ken_table_aliases`, and :func:`datasets.get_ken_types` will be
removed in the next release of skrub.
:pr:`1546` by :user:`Vincent Maladiere `.
- Improved error messages when a DataOp is being sent to dispatched functions.
:pr:`1607` by :user:`Riccardo Cappuzzo`.
- The accepted values for the parameter ``how`` of :meth:`DataOp.skb.apply` have
changed. The new values are ``"auto"`` (unchanged), ``"cols"`` to wrap the
transformer in :class:`ApplyToCols`, ``"frame"`` to wrap the transformer in
:class:`ApplyToFrame`, or ``"no_wrap"`` for no wrapping. The old values are
deprecated and will result in an error in a future release.
:pr:`1628` by :user:`Jérôme Dockès `.
- The parameter ``splitter`` of :meth:`DataOp.skb.train_test_split` has been
renamed ``split_func``. :pr:`1630` by :user:`Jérôme Dockès `.
- KEN embeddings and all the relevant functions have been removed from skrub.
:pr:`1567` by :user:`Riccardo Cappuzzo`.
- The objects ``tabular_learner`` and ``DropIfTooManyNulls`` were removed. Use
:func:`tabular_pipeline` and :class:`DropUninformative` instead.
:pr:`1567` by :user:`Riccardo Cappuzzo`.
- The skrub global configuration now includes a parameter for setting the default
verbosity of the :class:`TableReport`.
:pr:`1567` by :user:`Riccardo Cappuzzo`.
Bugfixes
--------
- Fixed a compatibility bug with Polars 1.32.3 that may cause `ToFloat32` to fail
when applied to categorical columns. :pr:`1570` by :user:`Riccardo Cappuzzo`.
- Fixed the display of DataOp objects in Google Colab cell outputs (no output
was displayed). :pr:`1590` by :user:`Jérôme Dockès `.
- Fixed an error that occurred when using ``.skb.concat`` with a pandas dataframe
with column names that aren't strings. :pr:`1594` by :user:`Riccardo Cappuzzo`.
- Fixed the range from which :func:`choose_float` and :func:`choose_int` sample
values when ``log=False`` and ``n_steps`` is ``None``. It was between ``low``
and ``low + high``, now it is between ``low`` and ``high``. :pr:`1603` by
:user:`Jérôme Dockès `.
- DataOp hyperparameter search would raise an error when doing classification
and using the ``scoring`` parameter, when the dataop contained no variables.
Fixed in :pr:`1601` by :user:`Jérôme Dockès `.
- :class:`SkrubLearner` used to do a prediction on the train set during
``fit()``. This has been fixed.
:pr:`1610` by :user:`Jérôme Dockès `.
- :class:`DataOp` would raise errors when containing subclasses of list, tuple
or dict that cannot be initialized with an instance of the builtin type (such
as classes created by ``collections.namedtuple``). This has been fixed.
DataOps now only recurse into the builtin collections to evaluate their items
(not into their subclasses). If you need the items evaluated (i.e. if they
contain DataOps or Choices), store them in one of the builtin collections.
:pr:`1612` by :user:`Jérôme Dockès `.
- :meth:`SkrubLearner.report` with ``mode="fit"`` used to display the dataops
themselves, rather than their outputs, in the report. This has been fixed in
:pr:`1623` by :user:`Jérôme Dockès `.
- Fixed a bug that happened when ``get_feature_names_out`` was called on instances
of the :class:`DatetimeEncoder`. :pr:`1622` by :user:`Riccardo Cappuzzo`.
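The corrected sampling range of :func:`choose_float` mentioned above can be illustrated with plain Python. This is only a sketch of the arithmetic, not skrub's implementation: the helper names ``sample_uniform`` and ``sample_uniform_buggy`` are hypothetical, and the buggy formula is one assumed way of producing the ``low`` to ``low + high`` range described in the entry.

```python
import random


def sample_uniform(low, high, rng):
    """Draw a value uniformly between low and high (the fixed behavior)."""
    return low + rng.random() * (high - low)


def sample_uniform_buggy(low, high, rng):
    """Hypothetical reconstruction of the old range: low to low + high."""
    return low + rng.random() * high


rng = random.Random(0)
fixed = [sample_uniform(10.0, 20.0, rng) for _ in range(1000)]
buggy = [sample_uniform_buggy(10.0, 20.0, rng) for _ in range(1000)]

# The fixed sampler never leaves the interval between low and high.
print(min(fixed) >= 10.0 and max(fixed) <= 20.0)
# The assumed buggy sampler can exceed high, since it spans low to low + high.
print(max(buggy) > 20.0)
```

The difference only shows up when ``low`` is not zero, which may explain why such a bug can go unnoticed in simple tests.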
Release 0.6.1
===================
Bugfixes
--------
- ``get_feature_names_out`` now works correctly when used by :class:`GapEncoder`,
:class:`DropCols` and :class:`SelectCols` from within a scikit-learn ``Pipeline``. In
addition, :class:`DropCols`'s ``get_feature_names_out`` method now returns the
names of the columns that are not dropped, rather than the names of the columns
that are dropped. :pr:`1543` by :user:`Riccardo Cappuzzo`.
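The corrected semantics of ``get_feature_names_out`` for a column-dropping step can be sketched in plain Python. The function ``remaining_feature_names`` below is a hypothetical helper written for illustration; it is not part of skrub and only mirrors the behavior described in the entry above.

```python
def remaining_feature_names(input_features, dropped):
    """Return the input names that survive after removing `dropped`."""
    dropped = set(dropped)
    # Keep the original column order; only the dropped names disappear.
    return [name for name in input_features if name not in dropped]


columns = ["id", "city", "price"]
print(remaining_feature_names(columns, ["id"]))
```

Returning the surviving names is what downstream pipeline steps need, since they receive the output columns, not the removed ones.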
Release 0.6.0
=============
Highlights
----------
- Major feature! Skrub DataOps are a powerful new way of
combining dataframe transformations over multiple tables, and machine learning
pipelines. DataOps can be combined to form complex data plans that can be used
to train and tune machine learning models. Then, the DataOps plans can be exported
as ``Learners`` (:class:`skrub.SkrubLearner`), standalone objects that can be
used on new data. More detail about the DataOps can be found in the
:ref:`User guide ` and in the
:ref:`examples `.
- The :class:`TableReport` has been improved with many new features. Series are
now supported directly. It is now
possible to skip computing column associations and generating plots when the
number of columns in the dataframe exceeds a user-defined threshold. Columns with
high cardinality and sorted columns are now highlighted in the report.
- :mod:`selectors`, :class:`ApplyToCols` and :class:`ApplyToFrame` are now available,
providing utilities for selecting columns to which a transformer should be applied
in a flexible way. For more details, see the :ref:`User guide `
and the :ref:`example `.
- The :class:`SquashingScaler` has been added: it robustly rescales and smoothly
clips numeric columns, enabling more robust handling of numeric columns
with neural networks. See the :ref:`example `.
New features
------------
- The Skrub DataOps are a new mechanism for building machine-learning
pipelines that handle multiple tables and for easily describing their
hyperparameter spaces. Main PR: :pr:`1233` by :user:`Jérôme Dockès `.
Additional work from other contributors can be found
`here `_:
:user:`Vincent Maladiere ` provided very important help by
trying the DataOps on many use-cases and datasets, providing feedback and
suggesting improvements, improving the examples (including creating all the
figures in the examples) and adding jitter to the parallel coordinate plots,
:user:`Riccardo Cappuzzo` experimented with the DataOps,
suggested improvements and improved the examples, :user:`Gaël Varoquaux
` , :user:`Guillaume Lemaitre `, :user:`Adrin Jalali
`, :user:`Olivier Grisel ` and others participated
through many discussions in defining the requirements and the public API.
See :ref:`the examples ` for
an introduction.
- The :mod:`selectors` module provides utilities for selecting columns to which
a transformer should be applied in a flexible way. The module was created in
:pr:`895` by :user:`Jérôme Dockès ` and added to the public API
in :pr:`1341` by :user:`Jérôme Dockès `.
- The :class:`DropUninformative` transformer is now available. This transformer
employs different heuristics to detect columns that are not likely to bring
useful information for training a model.
The current implementation includes detection of columns that contain only a
single value (constant columns), only missing values, or all unique values (such
as IDs). :pr:`1313` by :user:`Riccardo Cappuzzo`.
- :func:`get_config`, :func:`set_config` and :func:`config_context` are now available
to configure settings for dataframes display and expressions. :func:`patch_display`
and :func:`unpatch_display` are deprecated and will be removed in the next release
of skrub. :pr:`1427` by :user:`Vincent Maladiere `.
The global configuration includes the parameter ``cardinality_threshold`` that
controls the threshold value used to warn users if they have high cardinality
columns in their dataset. :pr:`1498` by :user:`rouk1 `.
Additionally, the parameter ``float_precision``
controls the number of significant digits displayed for floating-point values
in reports. :pr:`1470` by :user:`George S `.
- Added the :class:`SquashingScaler`, a transformer that
robustly rescales and smoothly clips numeric columns,
enabling more robust handling of numeric columns
with neural networks. :pr:`1310` by :user:`Vincent Maladiere ` and
:user:`David Holzmüller `.
- :func:`datasets.toy_order` is now available to create a toy dataframe and
corresponding targets for examples.
:pr:`1485` by :user:`Antoine Canaguier-Durand `.
- :class:`ApplyToCols` and :class:`ApplyToFrame` are now available to apply transformers
on a set of columns independently and jointly, respectively.
:pr:`1478` by :user:`Vincent Maladiere`.
Changes
-------
.. warning::
The default high cardinality encoder for both :class:`TableVectorizer` and
:func:`tabular_learner` (now :func:`tabular_pipeline`) has been changed from
:class:`GapEncoder` to :class:`StringEncoder`. :pr:`1354` by
:user:`Riccardo Cappuzzo`.
- The ``tabular_learner`` function has been deprecated in favor of :func:`tabular_pipeline` to honor
its scikit-learn pipeline cultural heritage and to remove the ambiguity with the
DataOps Learner. :pr:`1493` by :user:`Vincent Maladiere `.
- :class:`StringEncoder` now exposes the ``stop_words`` argument, which is passed to the
underlying vectorizer (:class:`~sklearn.feature_extraction.text.TfidfVectorizer`,
or :class:`~sklearn.feature_extraction.text.HashingVectorizer`). :pr:`1415` by
:user:`Vincent Maladiere `.
- A new parameter ``max_association_columns`` has been added to the
:class:`TableReport` to skip association computation when the number of columns
exceeds the specified value. :pr:`1304` by :user:`Victoria Shevchenko `.
- The ``packaging`` dependency was removed.
:pr:`1307` by :user:`Jovan Stojanovic `.
- :class:`TextEncoder`, :class:`StringEncoder` and :class:`GapEncoder` now compute the
total standard deviation norm during training, which is a global constant, and
normalize the vector outputs by performing element-wise division on all entries.
:pr:`1274` by :user:`Vincent Maladiere `.
- The :class:`DropIfTooManyNulls` transformer has been replaced by the
:class:`DropUninformative` transformer and will be removed in a future release.
:pr:`1313` by :user:`Riccardo Cappuzzo`
- The :func:`concat_horizontal` function was replaced with :func:`concat`. Horizontal or vertical concatenation
is now controlled by the `axis` parameter. :pr:`1334` by :user:`Parasa V Prajwal `.
- The :class:`TableVectorizer` and :class:`Cleaner` now accept a `datetime_format`
parameter for specifying the format to use when parsing datetime columns.
:pr:`1358` by :user:`Riccardo Cappuzzo`.
- The :class:`SimpleCleaner` has been removed. Use :class:`Cleaner` instead. :pr:`1370` by :user:`Riccardo Cappuzzo`.
- The periodic encoding for the ``day_in_year`` has been removed from the :class:`DatetimeEncoder` as it was
redundant. The feature itself is still added if the flag is set to ``True``. :pr:`1396` by :user:`Riccardo Cappuzzo`.
- The naming scheme used for the features generated by :class:`TextEncoder`, :class:`StringEncoder`, :class:`MinHashEncoder`
and :class:`DatetimeEncoder` has been standardized. Now features generated by all encoders have indices in the range
``[0, n_components-1]``, rather than ``[1, n_components]``. Additionally, columns with empty name are assigned a default
name that depends on the encoder used. :pr:`1405` by :user:`Riccardo Cappuzzo`.
- The optional dependencies 'dev', 'doc', 'lint' and 'test' have been coalesced into
'dev'. :pr:`1404` by :user:`Vincent Maladiere `.
- The :class:`TableReport` now supports Series in addition to Dataframes. :pr:`1420` by :user:`Vitor Pohlenz`.
- The :class:`Cleaner` now exposes a parameter to convert numeric values to float32. :pr:`1440` by
:user:`Riccardo Cappuzzo`.
- The :class:`TableReport` now shows if columns are sorted. :pr:`1512` by :user:`Dea María Léon`.
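The normalization by the total standard deviation norm mentioned above for :class:`TextEncoder`, :class:`StringEncoder` and :class:`GapEncoder` can be sketched as follows. This is a minimal illustration in plain Python, assuming the norm is the square root of the summed per-column variances of the training outputs; the changelog does not spell out the formula, and the helpers ``total_std_norm`` and ``rescale`` are hypothetical names, not skrub functions.

```python
import math
import statistics


def total_std_norm(rows):
    """Square root of the summed per-column variances (assumed definition)."""
    columns = list(zip(*rows))
    return math.sqrt(sum(statistics.pvariance(col) for col in columns))


def rescale(rows, norm):
    """Divide every entry by the same constant computed at fit time."""
    return [[value / norm for value in row] for row in rows]


train = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
norm = total_std_norm(train)   # computed once, during training
scaled = rescale(train, norm)  # the same constant is reused at transform time
print(norm, total_std_norm(scaled))
```

Because a single constant divides all entries, the relative scale between output dimensions is preserved, unlike per-column standardization.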
Bugfixes
--------
- Fixed a bug that caused the :class:`StringEncoder` and :class:`TextEncoder` to raise an exception if the
input column was a Categorical datatype. :pr:`1401` by :user:`Riccardo Cappuzzo`.
Documentation
-------------
A large number of improvements to the examples, docstrings, and the documentation
website have been made. Contributors include :user:`Vincent Maladiere `,
:user:`Riccardo Cappuzzo`, :user:`Jérôme Dockès `,
:user:`Gael Varoquaux `, :user:`Gabriela Gómez Jiménez `,
:user:`Sylvain Combettes `, :user:`Frits Hermans `,
:user:`Vitor Pohlenz `, :user:`Arturo Amor Quiroz `,
:user:`Marie Sacksick `, :user:`Emilien Battel `,
:user:`George El Haber `, :user:`Antoine Canaguier-Durand `, and
:user:`Lionel Kusch `.
Release 0.5.4
=============
Maintenance
-----------
* Make ``skrub`` compatible with scikit-learn 1.7.
:pr:`1434` by :user:`Vincent Maladiere `.
Release 0.5.3
=============
Changes
-------
- The :class:`SimpleCleaner` has been renamed to :class:`Cleaner`. Use of the
name :class:`SimpleCleaner` is deprecated and will result in an error in some
future release of skrub. :pr:`1275` by :user:`Riccardo Cappuzzo`.
- A new parameter ``max_plot_columns`` has been added to the
:class:`TableReport` and :func:`patch_display` to skip column plots when the
number of columns exceeds the specified value. :pr:`1255` by :user:`Priscilla
Baah`.
Release 0.5.2
=============
New features
------------
- The :class:`TableReport` now switches its visual theme between light and dark according to the user preferences.
:pr:`1201` by :user:`rouk1 `.
- Added a new way to control the location of the data directory, using the environment variable ``SKRUB_DATA_DIRECTORY``.
:pr:`1215` by :user:`Thomas S. `.
- The :class:`DatetimeEncoder` now supports periodic encoding of datetime features
with trigonometric functions and B-splines transformers.
:pr:`1235` by :user:`Riccardo Cappuzzo`.
- The :class:`TableReport` now also computes Pearson's correlation for numeric values.
:pr:`1203` by :user:`Reshama Shaikh ` and
:user:`Vincent Maladiere `.
- The :class:`SimpleCleaner` is now available (⚠️ it was renamed to
:class:`Cleaner` in skrub ``0.5.3``). This transformer is a lightweight
pre-processor that applies some of the transformations applied by the
:class:`TableVectorizer`, with a simpler interface. :pr:`1266` by
:user:`Riccardo Cappuzzo` and :user:`Jerome Dockes `.
Changes
-------
- The estimator returned by :func:`tabular_learner` now uses spline encoding of
datetime features when the supervised learner is not a model based on decision
trees such as random forests or gradient boosting. :pr:`1264` by
:user:`Guillaume Lemaitre `.
- The "distribution" tab of the ``TableReport`` now stacks cards horizontally to avoid adding
vertical space.
:pr:`1259` by :user:`Gaël Varoquaux `
- Progress messages when generating a ``TableReport`` are now written to stderr instead of stdout.
:pr:`1236` by :user:`Priscilla Baah`
- Optimized the :class:`StringEncoder`: lower memory footprint and faster execution in some cases.
:pr:`1248` by :user:`Gaël Varoquaux `
Bug fixes
---------
- :class:`StringEncoder` now works correctly in presence of null values.
:pr:`1224` by :user:`Jérôme Dockès `.
- The :meth:`TableVectorizer.get_feature_names_out` method now works when used in a
scikit-learn pipeline by exposing the `input_features` parameter.
:pr:`1258` by :user:`Guillaume Lemaitre `.
Release 0.5.1
=============
New features
------------
* The :class:`StringEncoder` encodes strings using tf-idf and truncated SVD
decomposition and provides a cheaper alternative to :class:`GapEncoder`.
:pr:`1159` by :user:`Riccardo Cappuzzo`.
Changes
-------
* New dataset fetching functions have been added: :func:`fetch_videogame_sales`,
:func:`fetch_bike_sharing`, :func:`fetch_flight_delays` and
:func:`fetch_country_happiness`. :func:`fetch_road_safety` has been removed.
:pr:`1218` by :user:`Vincent Maladiere `.
Bug fixes
---------
Maintenance
-----------
Release 0.4.1
=============
Changes
-------
* :class:`TableReport` now has a ``write_html`` method. :pr:`1190` by :user:`Mojdeh Rastgoo`.
* A new parameter ``verbose`` has been added to the :class:`TableReport` to toggle on or off the
printing of progress information when a report is being generated.
:pr:`1182` by :user:`Priscilla Baah`.
* A parameter ``verbose`` has been added to the :func:`patch_display` to toggle on or off the
printing of progress information when a table report is being generated.
:pr:`1188` by :user:`Priscilla Baah`.
* :func:`tabular_learner` accepts the alias ``"regression"`` for the option
``"regressor"`` and ``"classification"`` for ``"classifier"``.
:pr:`1180` by :user:`Mojdeh Rastgoo `.
Bug fixes
---------
* Generating a ``TableReport`` could have an effect on the matplotlib
configuration which could cause plots not to display inline in Jupyter
notebooks any more. This has been fixed in skrub in :pr:`1172` by
:user:`Jérôme Dockès ` and the matplotlib issue can be tracked
`here `_.
* The labels on bar plots in the ``TableReport`` for columns of object dtypes
that have a repr spanning multiple lines could be unreadable. This has been
fixed in :pr:`1196` by :user:`Jérôme Dockès `.
* Improved the performance of :func:`deduplicate` by removing some unnecessary
computations. :pr:`1193` by :user:`Jérôme Dockès `.
Maintenance
-----------
* Make ``skrub`` compatible with scikit-learn 1.6.
:pr:`1169` by :user:`Guillaume Lemaitre `.
Release 0.4.0
=============
Highlights
----------
* The :class:`TextEncoder` can extract embeddings from a string column with a deep
learning language model (possibly downloaded from the HuggingFace Hub).
* Several improvements to the :class:`TableReport` such as better support for
other scripts than the latin alphabet in the bar plot labels, smaller report
sizes, clipping the outliers to better see the details of distributions in
histograms. See the full changelog for details.
* The :class:`TableVectorizer` can now drop columns that contain a fraction of
null values above a user-chosen threshold.
New features
------------
* The :class:`TextEncoder` is now available to encode string columns with
diverse entries.
It allows the representation of table entries as embeddings computed by a deep
learning language model. The weights of this model can be fetched locally
or from the HuggingFace Hub.
:pr:`1077` by :user:`Vincent Maladiere `.
* The :func:`column_associations` function has been added. It computes a
pairwise measure of statistical dependence between all columns in a dataframe
(the same as shown in the :class:`TableReport`). :pr:`1109` by :user:`Jérôme
Dockès `.
* The :func:`patch_display` function has been added. It changes the display of
pandas and polars dataframes in jupyter notebooks to replace them with a
:class:`TableReport`. This can be undone with :func:`unpatch_display`.
:pr:`1108` by :user:`Jérôme Dockès `
Major changes
-------------
* :class:`AggJoiner`, :class:`AggTarget` and :class:`MultiAggJoiner` now require
the `operations` argument. They do not split columns by type anymore, but
apply `operations` on all selected cols. "median" is now supported, "hist" and
"value_counts" are no longer supported. :pr:`1116` by :user:`Théo Jolivet `.
* The :class:`AggTarget` no longer supports `y` inputs of type list. :pr:`1116`
by :user:`Théo Jolivet `.
Minor changes
-------------
* The column filter selection dropdown in the TableReport is smaller and its
label has been removed to save space. :pr:`1107` by :user:`Jérôme Dockès
`.
* The TableReport now uses the font size of its parent element when inserted
into another page. This makes it smaller in pages that use a smaller font size
than the browser default such as VSCode in some configurations. It also makes
it easier to control its size when inserting it in a web page by setting the
font size of its parent element. A few other small adjustments have also been
made to make it a bit more compact. :pr:`1098` by :user:`Jérôme Dockès
`.
* Display of labels in the plots of the TableReport, especially for other
scripts than the latin alphabet, has improved.
- Before, some characters could be missing and replaced by empty boxes.
- Before, when the text was truncated, the ellipsis "..." could appear on the
wrong side for right-to-left scripts.
Moreover, when the text contains line breaks it now appears all on one line.
Note this only affects the labels in the plots; the rest of the report did not
have these problems.
:pr:`1097` by :user:`Jérôme Dockès `
and :pr:`1138` by :user:`Jérôme Dockès `.
* In the TableReport it is now possible, before clicking any of the cells, to
reach the dataframe sample table and activate a cell with tab key navigation.
:pr:`1101` by :user:`Jérôme Dockès `.
* The "Column name" column of the "summary statistics" table in the TableReport
is now always visible when scrolling the table. :pr:`1102` by :user:`Jérôme
Dockès `.
* Added parameter `drop_null_fraction` to `TableVectorizer` to drop columns based
on whether they contain a fraction of nulls larger than the given threshold.
:pr:`1115` and :pr:`1149` by :user:`Riccardo Cappuzzo `.
* The :class:`TableReport` now provides more helpful output for columns of dtype
TimeDelta / Duration. :pr:`1152` by :user:`Jérôme Dockès `.
* The :class:`TableReport` now also reports the number of unique values for
numeric columns. :pr:`1154` by :user:`Jérôme Dockès `.
* The :class:`TableReport`, when plotting histograms, now detects outliers and
clips the range of data shown in the histogram. This allows seeing more detail
in the shown distribution. :pr:`1157` by :user:`Jérôme Dockès `.
Bug fixes
---------
* The :class:`TableReport` could raise an exception when one of the columns
contained datetimes with time zones and missing values; this has been fixed in
:pr:`1114` by :user:`Jérôme Dockès `.
* In scikit-learn versions older than 1.4 the :class:`TableVectorizer` could
fail on polars dataframes when used with the default parameters. This has been
fixed in :pr:`1122` by :user:`Jérôme Dockès `.
* The :class:`TableReport` would raise an exception when the input (pandas)
dataframe contained several columns with the same name. This has been fixed in
:pr:`1125` by :user:`Jérôme Dockès `.
* The :class:`TableReport` would raise an exception when a column contained
infinite values. This has been fixed in :pr:`1150` and :pr:`1151` by
:user:`Jérôme Dockès `.
Release 0.3.1
=============
Minor changes
-------------
* For tree-based models, :func:`tabular_learner` now adds
`handle_unknown='use_encoded_value'` to the `OrdinalEncoder`, to avoid
errors with new categories in the test set. This is consistent with the
setting of `OneHotEncoder` used by default in the
:class:`TableVectorizer`. :pr:`1078` by :user:`Gaël Varoquaux `
* The reports created by :class:`TableReport`, when inserted in an html page (or
displayed in a notebook), now use the same font as the surrounding page.
:pr:`1038` by :user:`Jérôme Dockès `.
* The content of the dataframe corresponding to the currently selected table
cell in the TableReport can be copied without actually selecting the text (as
in a spreadsheet).
:pr:`1048` by :user:`Jérôme Dockès `.
* The selection of content displayed in the TableReport's copy-paste boxes has
been removed. Now they always display the value of the selected item. When
copied, the repr of the selected item is copied to the clipboard.
:pr:`1058` by :user:`Jérôme Dockès `.
* A "stats" panel has been added to the TableReport, showing summary statistics
for all columns (number of missing values, mean, etc. -- similar to
``pandas.DataFrame.info()``) in a table. It can be sorted by each column.
:pr:`1056` and :pr:`1068` by :user:`Jérôme Dockès `.
* The credit fraud dataset is now available with the
:func:`fetch_credit_fraud` function.
:pr:`1053` by :user:`Vincent Maladiere `.
* Added zero padding for column names in :class:`MinHashEncoder` to improve column ordering consistency.
:pr:`1069` by :user:`Shreekant Nandiyawar `.
* The selection in the TableReport's sample table can now be manipulated with
the keyboard. :pr:`1065` by :user:`Jérôme Dockès `.
* The ``TableReport`` now displays the pandas (multi-)index, and has a better
display & interaction of pandas columns when the columns are a MultiIndex.
:pr:`1083` by :user:`Jérôme Dockès `.
* It is possible to control the number of rows displayed by the TableReport in
the "sample" tab panel by specifying ``n_rows``.
:pr:`1083` by :user:`Jérôme Dockès `.
* The ``TableReport`` used to raise an exception when the dataframe contained
unhashable types such as python lists. This has been fixed in :pr:`1087` by
:user:`Jérôme Dockès `.
* The HTML representation of the fitted :class:`TableVectorizer` now displays the column names.
This has been fixed in :pr:`1093` by :user:`Shreekant Nandiyawar `.
* :class:`AggTarget` now works without raising an error when ``y`` is a Series.
This has been fixed in :pr:`1094` by :user:`Shreekant Nandiyawar `.
Release 0.3.0
=============
Highlights
----------
* Polars dataframes are now supported across all ``skrub`` estimators.
* :class:`TableReport` generates an interactive report for a dataframe. This
`page `_ gathers some
precomputed examples.
Major changes
-------------
* The :class:`InterpolationJoiner` now supports polars dataframes. :pr:`1016`
by :user:`Théo Jolivet `.
* The :class:`TableReport` provides an interactive report on a dataframe's
contents: an overview, summary statistics and plots, statistical associations
between columns. It can be displayed in a jupyter notebook, a browser tab or
saved as a static HTML page. :pr:`984` by :user:`Jérôme Dockès `.
Minor changes
-------------
* :class:`Joiner` and :func:`fuzzy_join` used to raise an error when columns
with the same name appeared in the main and auxiliary table (after adding the
suffix). This is now allowed and a random string is inserted in the duplicate
column to ensure all names are unique.
:pr:`1014` by :user:`Jérôme Dockès `.
* :class:`AggJoiner` and :class:`AggTarget` could produce outputs whose column
names varied across calls to `transform` in some cases in the presence of
duplicate column names, now the output names are always the same.
:pr:`1013` by :user:`Jérôme Dockès `.
* In some cases :class:`AggJoiner` and :class:`AggTarget` inserted a column in
the output named "index" containing the pandas index of the auxiliary table.
This has been corrected.
:pr:`1020` by :user:`Jérôme Dockès `.
Release 0.2.0
=============
Major changes
-------------
* The :class:`Joiner` has been adapted to support polars dataframes. :pr:`945` by :user:`Théo Jolivet `.
* The :class:`TableVectorizer` now consistently applies the same transformation
across different calls to `transform`. There also have been some breaking
changes to its functionality: (i) all transformations are now applied
independently to each column, i.e. it does not perform multivariate
transformations (ii) in ``specific_transformers`` the same column may not be
used twice (go through 2 different transformers).
:pr:`902` by :user:`Jérôme Dockès `.
* Some parameters of :class:`TableVectorizer` have been renamed:
`high_cardinality_transformer` → `high_cardinality`,
`low_cardinality_transformer` → `low_cardinality`,
`datetime_transformer` → `datetime`, `numeric_transformer` → `numeric`.
:pr:`947` by :user:`Jérôme Dockès `.
* The :class:`GapEncoder` and :class:`MinHashEncoder` are now single-column
transformers: their ``fit``, ``fit_transform`` and ``transform`` methods
accept a single column (a pandas or polars Series). Dataframes and numpy
arrays are not accepted.
:pr:`920` and :pr:`923` by :user:`Jérôme Dockès `.
* Added the :class:`MultiAggJoiner` that allows augmenting a main table with
multiple auxiliary tables. :pr:`876` by :user:`Théo Jolivet `.
* :class:`AggJoiner` now only accepts a single table as an input, and some of its
parameters were renamed to be consistent with the :class:`MultiAggJoiner`.
It now has a ``key`` parameter that allows joining main and auxiliary tables that share
the same column names. :pr:`876` by :user:`Théo Jolivet `.
* :func:`tabular_learner` has been added to easily create a supervised
learner that works well on tabular data. :pr:`926` by :user:`Jérôme Dockès `.
Minor changes
-------------
* :class:`GapEncoder` and :class:`MinHashEncoder` used to modify their input
in-place, replacing missing values with a string. They no longer do so. Their
parameter `handle_missing` has been removed; now missing values are always
treated as the empty string.
:pr:`930` by :user:`Jérôme Dockès `.
* The minimum supported Python version is now 3.9.
:pr:`939` by :user:`Jérôme Dockès `.
* Skrub supports numpy 2. :pr:`946` by :user:`Jérôme Dockès `.
* :func:`~datasets.fetch_ken_embeddings` now adds a suffix even with the default
value for the parameter `pca_components`.
:pr:`956` by :user:`Guillaume Lemaitre `.
* :class:`Joiner` now performs some preprocessing (the same as done by the
:class:`TableVectorizer`, e.g. trying to parse dates, converting pandas object
columns with mixed types to a single type) on the joining columns before
vectorizing them. :pr:`972` by :user:`Jérôme Dockès `.
skrub release 0.1.1
===================
This is a bugfix release to adapt to the most recent versions of pandas (2.2) and
scikit-learn (1.5). There are no major changes to the functionality of skrub.
skrub release 0.1.0
===================
Major changes
-------------
* :class:`TargetEncoder` has been removed in favor of
:class:`sklearn.preprocessing.TargetEncoder`, available since scikit-learn 1.3.
* :class:`Joiner` and :func:`fuzzy_join` support several ways of rescaling
distances; ``match_score`` has been replaced by ``max_dist``; bugs which
prevented the Joiner from consistently vectorizing inputs and accepting or rejecting
matches across calls to transform have been fixed. :pr:`821` by :user:`Jérôme
Dockès `.
* :class:`InterpolationJoiner` was added to join two tables by using
machine-learning to infer the matching rows from the second table.
:pr:`742` by :user:`Jérôme Dockès `.
* Pipelines including :class:`TableVectorizer` can now be grid-searched, since
we can now call `set_params` on the default transformers of :class:`TableVectorizer`.
:pr:`814` by :user:`Vincent Maladiere `
* :func:`to_datetime` is now available to support pandas.to_datetime
over dataframes and 2d arrays.
:pr:`784` by :user:`Vincent Maladiere `
* Some parameters of :class:`Joiner` have changed. The goal is to harmonize
parameters across all estimators that perform join(-like) operations, as
discussed in `#751 `_.
:pr:`757` by :user:`Jérôme Dockès `.
* :func:`dataframe.pd_join`, :func:`dataframe.pd_aggregate`,
:func:`dataframe.pl_join` and :func:`dataframe.pl_aggregate`
are now available in the dataframe submodule.
:pr:`733` by :user:`Vincent Maladiere `
* :class:`FeatureAugmenter` is renamed to :class:`Joiner`.
:pr:`674` by :user:`Jovan Stojanovic `
* :func:`fuzzy_join` and :class:`FeatureAugmenter` can now join on datetime columns.
:pr:`552` by :user:`Jovan Stojanovic `
* :class:`Joiner` now supports joining on multiple column keys.
:pr:`674` by :user:`Jovan Stojanovic `
* The signatures of all encoders and functions have been revised to enforce
cleaner calls. This means that some arguments that could previously be passed
positionally now have to be passed as keywords.
:pr:`514` by :user:`Lilian Boulard `.
* Parallelized the :class:`GapEncoder` column-wise. Parameters `n_jobs` and `verbose`
added to the signature. :pr:`582` by :user:`Lilian Boulard `
* Introducing :class:`AggJoiner`, a transformer performing
aggregation on auxiliary tables followed by left-joining on a base table.
:pr:`600` by :user:`Vincent Maladiere `.
* Introducing :class:`AggTarget`, a transformer performing
aggregation on the target y, followed by left-joining on a base table.
:pr:`600` by :user:`Vincent Maladiere `.
* Added the :class:`SelectCols` and :class:`DropCols` transformers that allow
selecting a subset of a dataframe's columns inside of a pipeline. :pr:`804` by
:user:`Jérôme Dockès `.
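
The move to keyword-only arguments mentioned above (:pr:`514`) relies on a
standard Python feature: parameters placed after a bare ``*`` in a signature
cannot be passed positionally. The sketch below uses a toy function,
``encode``, whose name and parameters are hypothetical and do not correspond
to any skrub signature.

```python
def encode(values, *, n_components=10, random_state=None):
    """Toy function: everything after ``*`` must be passed by keyword."""
    return {
        "n_values": len(values),
        "n_components": n_components,
        "random_state": random_state,
    }


# Keyword call: accepted.
ok = encode(["a", "b"], n_components=3)

# Positional call for a keyword-only parameter: rejected with a TypeError.
try:
    encode(["a", "b"], 3)
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

Code that previously passed such arguments positionally needs to name them
explicitly after this kind of change.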
Minor changes
-------------
* :class:`DatetimeEncoder` doesn't remove constant features anymore.
It also supports an 'errors' argument to raise or coerce errors during
transform, and an 'add_total_seconds' argument to include the number of
seconds since Epoch.
:pr:`784` by :user:`Vincent Maladiere `
* Scaling of ``matching_score`` in :func:`fuzzy_join` is now between 0 and 1; it used to be between 0.5 and 1. Moreover, the division by 0 error that occurred when all rows had a perfect match has been fixed. :pr:`802` by :user:`Jérôme Dockès `.
* :class:`TableVectorizer` is now able to apply parallelism at the column level rather than the transformer level. This is the default for univariate transformers, like :class:`MinHashEncoder`, and :class:`GapEncoder`.
:pr:`592` by :user:`Leo Grinsztajn `
* ``inverse_transform`` in :class:`SimilarityEncoder` now works as expected; it used to raise an exception. :pr:`801` by :user:`Jérôme Dockès `.
* :class:`TableVectorizer` propagates the `n_jobs` parameter to the underlying
transformers unless the underlying transformer already sets `n_jobs` explicitly.
:pr:`761` by :user:`Leo Grinsztajn `, :user:`Guillaume Lemaitre `,
and :user:`Jerome Dockes `.
* Parallelized the :func:`deduplicate` function. Parameter `n_jobs`
added to the signature. :pr:`618` by :user:`Jovan Stojanovic `
and :user:`Lilian Boulard `
* Functions :func:`datasets.fetch_ken_embeddings`, :func:`datasets.fetch_ken_table_aliases`
and :func:`datasets.fetch_ken_types` have been renamed.
:pr:`602` by :user:`Jovan Stojanovic `
* Make `pyarrow` an optional dependency to facilitate the integration
with `pyodide`.
:pr:`639` by :user:`Guillaume Lemaitre `.
* Bumped minimal required Python version to 3.10. :pr:`606` by
:user:`Gael Varoquaux `
* Bumped minimal required versions for the dependencies:
- numpy >= 1.23.5
- scipy >= 1.9.3
- scikit-learn >= 1.2.1
- pandas >= 1.5.3 :pr:`613` by :user:`Lilian Boulard `
* You can now pass column-specific transformers to :class:`TableVectorizer`
using the `specific_transformers` argument.
:pr:`583` by :user:`Lilian Boulard `.
* 1-D arrays (and pandas Series) are no longer supported in :class:`TableVectorizer`. Pass a
2-D array (or a pandas DataFrame) with a single column instead. This change is for
compliance with the scikit-learn API.
:pr:`647` by :user:`Guillaume Lemaitre `
* Fixes a bug in :class:`TableVectorizer` with `remainder`: it is now cloned if it's
a transformer so that the same instance is not shared between different
transformers.
:pr:`678` by :user:`Guillaume Lemaitre `
* :class:`GapEncoder` speedup :pr:`680` by :user:`Leo Grinsztajn `
- Improved :class:`GapEncoder`'s early stopping logic. The parameters `tol` and `min_iter`
have been removed. The parameter `max_no_improvement` can now be used to control the
early stopping.
:pr:`663` by :user:`Simona Maggio `
:pr:`593` by :user:`Lilian Boulard `
:pr:`681` by :user:`Leo Grinsztajn `
- Implementation improvement leading to a ~x5 speedup for each iteration.
- Better default hyperparameters: `batch_size` now defaults to 1024, and `max_iter_e_steps`
to 1.
* Removed the `most_frequent` and `k-means` strategies from the :class:`SimilarityEncoder`.
These strategies were used for scalability reasons, but we recommend using the :class:`MinHashEncoder`
or the :class:`GapEncoder` instead. :pr:`596` by :user:`Leo Grinsztajn `
* Removed the `similarity` argument from the :class:`SimilarityEncoder` constructor,
as we only support the ngram similarity. :pr:`596` by :user:`Leo Grinsztajn `
* Added the `analyzer` parameter to the :class:`SimilarityEncoder` to allow word counts
for similarity measures. :pr:`619` by :user:`Jovan Stojanovic `
* skrub now uses modern type hints introduced in PEP 585.
:pr:`609` by :user:`Lilian Boulard `
* Some bug fixes for :class:`TableVectorizer` (:pr:`579`):
- `check_is_fitted` now looks at `"transformers_"` rather than `"columns_"`
- the default of the `remainder` parameter in the docstring is now `"passthrough"`
instead of `"drop"` to match the implementation.
- uint8 and int8 dtypes are now considered as numeric columns.
* Removed the leading "<" and trailing ">" symbols from KEN entities
and types.
:pr:`601` by :user:`Jovan Stojanovic `
* Add `get_feature_names_out` method to :class:`MinHashEncoder`.
:pr:`616` by :user:`Leo Grinsztajn `
* Removed `requests` from the requirements. :pr:`613` by :user:`Lilian Boulard `
* :class:`TableVectorizer` now handles mixed types columns without failing
by converting them to string before type inference.
:pr:`623` by :user:`Leo Grinsztajn `
* Moved the default storage location of data to the user's home folder.
:pr:`652` by :user:`Felix Lefebvre ` and
:user:`Gael Varoquaux `
* Fixed bug when using :class:`TableVectorizer`'s `transform` method on
categorical columns with missing values.
:pr:`644` by :user:`Leo Grinsztajn `
* :class:`TableVectorizer` never outputs a sparse matrix by default. This can be changed by
increasing the `sparse_threshold` parameter. :pr:`646` by :user:`Leo Grinsztajn `
* :class:`TableVectorizer` doesn't fail anymore if an inferred type doesn't work during transform.
The new entries not matching the type are replaced by missing values. :pr:`666` by :user:`Leo Grinsztajn `
- Dataset fetcher :func:`datasets.fetch_employee_salaries` now has a parameter
`overload_job_titles` to allow overloading the job titles
(`employee_position_title`) with the column `underfilled_job_title`,
which provides some more information about the job title.
:pr:`581` by :user:`Lilian Boulard `
* Fixed bugs in :class:`DatetimeEncoder` that were triggered when `extract_until` was
"year", "month", "microseconds" or "nanoseconds", and added the option to set it to
`None` to only extract `total_time`, the time from epoch. :pr:`743` by :user:`Leo Grinsztajn `
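
The PEP 585 type hints mentioned in the list above (:pr:`609`) let built-in
collections such as ``list`` and ``dict`` be used directly as generic types,
without importing ``List`` or ``Dict`` from ``typing``. A minimal sketch,
assuming Python 3.9 or later; the function ``count_categories`` is a
hypothetical example and is not part of skrub.

```python
def count_categories(values: list[str]) -> dict[str, int]:
    """Count how often each string occurs, annotated with PEP 585 generics."""
    counts: dict[str, int] = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    return counts


counts = count_categories(["a", "b", "a"])
```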
Before skrub: dirty_cat
========================
Skrub was born from the `dirty_cat `__
package.
Dirty-cat release 0.4.1
==========================
Major changes
-------------
* :func:`fuzzy_join` and :class:`FeatureAugmenter` can now join on numeric columns based on the euclidean distance.
:pr:`530` by :user:`Jovan Stojanovic `
* :func:`fuzzy_join` and :class:`FeatureAugmenter` can perform many-to-many joins on lists of numeric or string key columns.
:pr:`530` by :user:`Jovan Stojanovic `
* :func:`GapEncoder.transform` no longer continues fitting the instance.
This makes functions that depend on it (:func:`~GapEncoder.get_feature_names_out`,
:func:`~GapEncoder.score`, etc.) deterministic once fitted.
:pr:`548` by :user:`Lilian Boulard `
* :func:`fuzzy_join` and :class:`FeatureAugmenter` now perform joins on missing values as in `pandas.merge`
but raise a warning. :pr:`522` and :pr:`529` by :user:`Jovan Stojanovic `
* Added :func:`get_ken_table_aliases` and :func:`get_ken_types` for exploring
KEN embeddings. :pr:`539` by :user:`Lilian Boulard `.
Minor changes
-------------
* Improvement of date column detection and date format inference in :class:`TableVectorizer`. The
format inference now tries to find a format which works for all non-missing values of the column, and only
tries pandas default inference if it fails.
:pr:`543` by :user:`Leo Grinsztajn `
:pr:`587` by :user:`Leo Grinsztajn `
Dirty-cat Release 0.4.0
=========================
Major changes
-------------
* `SuperVectorizer` is renamed as :class:`TableVectorizer`; a warning is raised when using the old name.
:pr:`484` by :user:`Jovan Stojanovic `
* New experimental feature: joining tables using :func:`fuzzy_join` by approximate key matching. Matches are based
on string similarities, and the nearest-neighbor matches are found for each category.
:pr:`291` by :user:`Jovan Stojanovic ` and :user:`Leo Grinsztajn `
* New experimental feature: :class:`FeatureAugmenter`, a transformer
that uses :func:`fuzzy_join` to augment the number of features in a main table with information from auxiliary tables.
:pr:`409` by :user:`Jovan Stojanovic `
* Unnecessary API has been made private: everything (files, functions, classes)
starting with an underscore shouldn't be imported in your code. :pr:`331` by :user:`Lilian Boulard `
* The :class:`MinHashEncoder` now supports a `n_jobs` parameter to parallelize
the hashes computation. :pr:`267` by :user:`Leo Grinsztajn ` and :user:`Lilian Boulard `.
* New experimental feature: deduplicating misspelled categories using :func:`deduplicate` by clustering string distances.
This function works best when there are significantly more duplicates than underlying categories.
:pr:`339` by :user:`Moritz Boos `.
Minor changes
-------------
* Add example `Wikipedia embeddings to enrich the data`. :pr:`487` by :user:`Jovan Stojanovic `
* **datasets.fetching**: contains a new function :func:`get_ken_embeddings` that can be used to download Wikipedia
embeddings and filter them by type.
* **datasets.fetching**: contains a new function :func:`fetch_world_bank_indicator` that can be used to download indicators
from the World Bank Open Data platform.
:pr:`291` by :user:`Jovan Stojanovic `
* Removed example `Fitting scalable, non-linear models on data with dirty categories`. :pr:`386` by :user:`Jovan Stojanovic `
* :class:`MinHashEncoder`'s :func:`minhash` method is no longer public. :pr:`379` by :user:`Jovan Stojanovic `
* Fetching functions now have an additional argument ``directory``,
which can be used to specify where to save and load from datasets.
:pr:`432` and :pr:`453` by :user:`Lilian Boulard `
* The :class:`TableVectorizer`'s default `OneHotEncoder` for low cardinality categorical variables now defaults
to `handle_unknown="ignore"` instead of `handle_unknown="error"` (for sklearn >= 1.0.0).
This means that categories seen only at test time will be encoded by a vector of zeroes instead of raising an error. :pr:`473` by :user:`Leo Grinsztajn `
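
The effect of ``handle_unknown="ignore"`` described in the last entry above
can be illustrated without any library. The function ``one_hot`` below is a
hypothetical, simplified stand-in for a one-hot encoder and is not
scikit-learn's implementation: a value that was not among the categories
learned at fit time matches none of them, so it is encoded as a vector of
zeros instead of raising an error.

```python
def one_hot(value, categories):
    """Encode ``value`` against the known ``categories``.

    An unseen value matches no category and therefore yields only zeros.
    """
    return [1 if value == category else 0 for category in categories]


categories = ["blue", "green", "red"]  # categories learned at fit time
seen = one_hot("green", categories)
unseen = one_hot("purple", categories)  # category only seen at test time
```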
Bug fixes
---------
* The :class:`MinHashEncoder` now considers `None` and empty strings as missing values, rather
than raising an error. :pr:`378` by :user:`Gael Varoquaux `
Dirty-cat Release 0.3.0
==========================
Major changes
-------------
* New encoder: :class:`DatetimeEncoder` can transform a datetime column into several numeric columns
(year, month, day, hour, minute, second, ...). It is now the default transformer used
in the :class:`TableVectorizer` for datetime columns. :pr:`239` by :user:`Leo Grinsztajn `
* The :class:`TableVectorizer` has seen some major improvements and bug fixes:
- Fixes the automatic casting logic in ``transform``.
- To avoid dimensionality explosion when a feature has two unique values, the default encoder (:class:`~sklearn.preprocessing.OneHotEncoder`) now drops one of the two vectors (see parameter `drop="if_binary"`).
- ``fit_transform`` and ``transform`` can now return unencoded features, like the :class:`~sklearn.compose.ColumnTransformer`'s behavior. Previously, a ``RuntimeError`` was raised.
:pr:`300` by :user:`Lilian Boulard `
* **Backward-incompatible change in the TableVectorizer**:
To apply ``remainder`` to features (with the ``*_transformer`` parameters),
the value ``'remainder'`` must be passed, instead of ``None`` in previous versions.
``None`` now indicates that we want to use the default transformer. :pr:`303` by :user:`Lilian Boulard `
* Support for Python 3.6 and 3.7 has been dropped. Python >= 3.8 is now required. :pr:`289` by :user:`Lilian Boulard `
* Bumped minimum dependencies:
- scikit-learn>=0.23
- scipy>=1.4.0
- numpy>=1.17.3
- pandas>=1.2.0 :pr:`299` and :pr:`300` by :user:`Lilian Boulard