User Guide
Skrub is a Python library that facilitates machine learning with tabular data (such as pandas and polars dataframes) through a scikit-learn-compatible API.
Use the sections below to navigate the guide. For a quickstart example, try Getting Started. For runnable code, see the Example gallery. For class and function details, see the API Reference. For common use cases and how to address them, see the How-to guides.
- Getting Started with skrub
- Exploring a Dataframe
- Wrangling data with good defaults
- Cleaner: sanitizing a dataframe
- Transforming a table into an array of numeric features: TableVectorizer
- Building robust ML baselines with tabular_pipeline()
- The logic used by the tabular pipeline is quite simple
- Extending the pipeline with the .steps attribute
- Using a pipeline as the estimator
- Transforming selected columns with ApplyToCols
- Column-level feature extraction
- Multi-column operations
- Removing unneeded columns with DropUninformative and Cleaner
- Dropping columns with many missing values
- Applying DropUninformative only to a subset of columns
- Skrub Selectors, for selecting columns in a dataframe
- Selecting based on dtype or data properties
- Categories of selectors
- filter() and filter_names() to select with user-defined criteria
- Select columns with null values
- Complex multi-table pipelines with Data Ops
- Data Ops basic concepts
- Basics of DataOps: the DataOps plan, variables, and learners
- Building a simple DataOps plan
- Tutorial: Using Data Ops to build a machine-learning pipeline
- Using previews for easier development and debugging
- DataOps allow direct access to methods of the underlying data
- Control flow in DataOps: eager and deferred evaluation
- How do skrub Data Ops differ from the alternatives?
- Building a complex pipeline with the skrub Data Ops
- Applying machine-learning estimators
- Applying different transformers using skrub selectors and DataOps
- Documenting the DataOps plan with node names and descriptions
- Evaluating and debugging the DataOps plan with .skb.full_report()
- Using only a part of a DataOps plan
- Subsampling data for easier development and debugging
- Tuning and validating skrub DataOps plans
- Improving the confidence in our score through cross-validation
- Splitting the data in train and test sets
- Passing additional arguments to the splitter
- Passing additional arguments to the scorer
- Using the skrub choose_* functions to tune hyperparameters
- Feature selection with skrub SelectCols and DropCols
- Validating hyperparameter search with nested cross-validation
- Going beyond estimator hyperparameters: nesting choices and choosing pipelines
- Linking choices depending on other choices
- Exporting the DataOps plan as a learner and reusing it
- Tuning DataOps with Optuna
- Joining Dataframes