User Guide

Skrub is a library that eases machine learning with dataframes, from exploring dataframes to validating a machine-learning pipeline.

The TableReport is a data exploration tool. Once the data has been explored, the Cleaner and TableVectorizer provide data sanitization and feature engineering. The tabular_pipeline() function combines the two to build a strong baseline pipeline for dataframes.

The skrub column-level encoders can be configured for more specific needs. Various multi-column transformers and the selectors API provide a high degree of control over which columns are modified.

More complex, multi-table scenarios can make use of the skrub Data Ops, which enable constructing and validating pipelines that involve multiple dataframes and hyperparameter tuning.

Skrub does not replace pandas or polars. Instead, it builds on these dataframe libraries to provide higher-level building blocks for the data preprocessing steps typically needed in a machine learning pipeline.