User guide

Skrub makes it easier to prepare tables for machine learning.

Starting from rich, complex data stored in one or several dataframes, it helps perform the data wrangling needed to produce a numeric array that can be fed to a machine-learning model. This wrangling includes joining tables (possibly with inexact matches), parsing text into structured data such as dates, and extracting numeric features.

For those tasks, skrub does not replace a dataframe library. Instead, it builds on polars or pandas to provide the higher-level building blocks that are typically needed in a machine-learning pipeline.

Crucially, the transformations implemented by skrub are stateful: skrub records the transformations that were applied to the training data and replays the same operations when the pipeline is applied to make predictions on unseen data. Implementing data-wrangling steps as transformers that can be fitted is essential to prevent data leakage and ensure generalization.
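
To make the idea of a stateful transformation concrete, here is a minimal sketch in plain Python. It does not use skrub and is not how skrub is implemented: the `MeanImputer` class and its `mean_` attribute are hypothetical names chosen for illustration, following the common fit/transform convention. The state is learned from the training data only and then replayed unchanged on unseen data.

```python
from statistics import mean


class MeanImputer:
    """Hypothetical stateful transformer (illustration only, not a skrub class).

    It fills missing values (None) with the mean observed during fit.
    """

    def fit(self, values):
        # Learn the state from the training data only.
        observed = [v for v in values if v is not None]
        self.mean_ = mean(observed)
        return self

    def transform(self, values):
        # Replay the learned operation; nothing is re-estimated here.
        return [self.mean_ if v is None else v for v in values]


train = [1.0, None, 3.0]
test = [None, 10.0]

imputer = MeanImputer().fit(train)
train_filled = imputer.transform(train)  # [1.0, 2.0, 3.0]
test_filled = imputer.transform(test)    # [2.0, 10.0]
```

The missing test value is filled with 2.0, the mean of the training data, and the test value 10.0 plays no part in that estimate. Recomputing the mean on the test set instead would let information from unseen data leak into the preprocessing.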