User Guide
Skrub is a library that eases machine learning with dataframes.
Starting from rich, complex data stored in one or several dataframes, it helps perform the data wrangling necessary to produce a numeric array that is fed to a machine-learning model. This wrangling comprises joining tables (possibly with inexact matches), parsing structured data such as datetimes from text, and extracting numeric features from non-numeric data.
For those tasks, skrub does not replace pandas or polars. Instead, it leverages these dataframe libraries to provide higher-level building blocks that perform the data preprocessing steps typically needed in a machine learning pipeline.
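As a first taste of these building blocks, here is a minimal sketch that turns a heterogeneous table into numeric features with skrub's TableVectorizer; the dataframe, its column names, and its values are made up for illustration, and the sections linked below cover this and the other tools in more depth.

```python
# Minimal sketch: vectorizing a small, made-up dataframe with skrub.
import pandas as pd
from skrub import TableVectorizer

df = pd.DataFrame(
    {
        "employee": ["Alice", "Bob", "Carol"],
        "department": ["Sales", "Engineering", "Sales"],
        "hired": ["2021-03-01", "2019-07-15", "2022-11-30"],
        "salary": [52_000, 61_000, 58_000],
    }
)

# TableVectorizer picks an encoder per column (e.g. categorical encoding for
# strings, datetime feature extraction for date-like columns, passthrough for
# numbers) and outputs a fully numeric table ready for a scikit-learn model.
vectorizer = TableVectorizer()
features = vectorizer.fit_transform(df)
print(features.shape)
```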
This guide shows how to address common data-preparation tasks with skrub's features. See the examples section for complete code snippets.
- Exploring dataframes with the TableReport
- Feature engineering for categorical data
- Parsing and encoding datetimes
- Strong baseline pipelines
- Data Preparation with skrub Transformers
- Skrub Selectors: helpers for selecting columns in a dataframe
- Skrub DataOps: fit, tune, and validate arbitrary data wrangling
- Assembling Skrub DataOps into complex machine learning pipelines
- Applying machine-learning estimators
- Applying different transformers using Skrub selectors and DataOps
- Documenting the DataOps plan with node names and descriptions
- Evaluating and debugging the DataOps plan with .skb.full_report()
- Using only a part of a DataOps plan
- Subsampling data for easier development and debugging
- Tuning and validating Skrub Pipelines
- Splitting the data into train and test sets
- Improving the confidence in our score through cross-validation
- Using the Skrub choose_* functions to tune hyperparameters
- Validating hyperparameter search with nested cross-validation
- Going beyond estimator hyperparameters: nesting choices and choosing pipelines
- Linking choices depending on other choices
- Assembling: joining multiple tables
- Example datasets, utilities, and customization