User Guide
Skrub is a library that eases machine learning with dataframes.
Starting from rich, complex data stored in one or several dataframes, it helps perform the data wrangling necessary to produce a numeric array that is fed to a machine-learning model. This wrangling comprises joining tables (possibly with inexact matches), parsing structured data such as datetimes from text, and extracting numeric features from non-numeric data.
For those tasks, skrub does not replace pandas or polars. Instead, it leverages these dataframe libraries to provide higher-level building blocks that perform the data preprocessing steps typically needed in a machine learning pipeline.
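As a first taste of these building blocks, here is a minimal sketch that turns a heterogeneous table into numeric features with skrub's TableVectorizer; the dataframe, its column names, and its values are made up for illustration, and the sections linked below cover this and the other tools in more depth.

```python
# Minimal sketch: vectorizing a small, made-up dataframe with skrub.
import pandas as pd
from skrub import TableVectorizer

df = pd.DataFrame(
    {
        "employee": ["Alice", "Bob", "Carol"],
        "department": ["Sales", "Engineering", "Sales"],
        "hired": ["2021-03-01", "2019-07-15", "2022-11-30"],
        "salary": [52_000, 61_000, 58_000],
    }
)

# TableVectorizer picks an encoder per column (e.g. categorical encoding for
# strings, datetime feature extraction for date-like columns, passthrough for
# numbers) and outputs a fully numeric table ready for a scikit-learn model.
vectorizer = TableVectorizer()
features = vectorizer.fit_transform(df)
print(features.shape)
```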
This guide shows how to address common data-preparation tasks with skrub's features. See the examples section for complete code snippets.
- Exploring dataframes with the TableReport
- Feature engineering for categorical data
- Parsing and encoding datetimes
- Strong baseline pipelines
- Data Preparation with skrub Transformers
- Skrub Selectors: helpers for selecting columns in a dataframe
- Skrub DataOps: fit, tune, and validate arbitrary data wrangling
- Assembling Skrub DataOps into complex machine learning pipelines
- Applying machine-learning estimators
- Applying different transformers using Skrub selectors and DataOps
- Documenting the DataOps plan with node names and descriptions
- Evaluating and debugging the DataOps plan with .skb.full_report()
- Using only a part of a DataOps plan
- Subsampling data for easier development and debugging
- Tuning and validating Skrub Pipelines
- Splitting the data into train and test sets
- Improving the confidence in our score through cross-validation
- Using the Skrub choose_* functions to tune hyperparameters
- Validating hyperparameter search with nested cross-validation
- Going beyond estimator hyperparameters: nesting choices and choosing pipelines
- Linking choices depending on other choices
- Assembling: joining multiple tables
- Example datasets, utilities, and customization