.. _user_guide:

User guide
==========

Skrub eases preparing tables for machine learning.

Starting from rich, complex data stored in one or several dataframes, it helps
performing the data wrangling necessary to produce a numeric array that is fed
to a machine-learning model. This wrangling comprises joining tables (possibly
with inexact matches), parsing text into structured data such as dates,
extracting numeric features, etc.

For those tasks, skrub does not replace a dataframe library. Instead, it
leverages polars or pandas to provide more high-level building blocks that are
typically needed in a machine-learning pipeline.

Crucially, the transformations implemented by skrub are *stateful*: skrub
records the transformations that were applied to the training data and replays the
same operations when the pipeline is applied to make predictions on unseen
data. Implementing data-wrangling steps as transformers that can be fitted is
essential to prevent data leakage and ensure generalization.

.. topic:: Skrub highlights:

 - eases separating the train and test operations, allowing to tune
   preprocessing steps to the data and improving the generalization of tabular
   machine-learning models.

 - enables statistical and imperfect assembly, as machine-learning models
   can typically retrieve signals even in noisy data.

|

.. include:: includes/big_toc_css.rst

.. toctree::
   :maxdepth: 2

   end_to_end_pipeline
   encoding
   assembling
   cleaning