User Guide
Skrub is a Python library that facilitates machine learning on tabular data (dataframes, such as those from pandas and polars) through a scikit-learn–compatible API.
Use the sections below to navigate the guide. For runnable code, see the Example gallery. For class and function details, see the API Reference.
- Exploring a Dataframe
- Wrangling data with good defaults
- Column-level feature extraction
- Multi-column operations
- Operating over multiple columns at once
- Removing unneeded columns with DropUninformative and Cleaner
- Applying DropUninformative only to a subset of columns
- Skrub Selectors, for selecting columns in a dataframe
- Selecting based on dtype or data properties
- Categories of selectors
- filter() and filter_names() to select with user-defined criteria
- Complex multi-table pipelines with Data Ops
- Data Ops basic concepts
- Basics of DataOps: the DataOps plan, variables, and learners
- Building a simple DataOps plan
- Using previews for easier development and debugging
- DataOps allow direct access to methods of the underlying data
- Control flow in DataOps: eager and deferred evaluation
- How do skrub Data Ops differ from the alternatives?
- Building a complex pipeline with the skrub Data Ops
- Applying machine-learning estimators
- Applying different transformers using skrub selectors and DataOps
- Documenting the DataOps plan with node names and descriptions
- Evaluating and debugging the DataOps plan with .skb.full_report()
- Using only a part of a DataOps plan
- Subsampling data for easier development and debugging
- Tuning and validating Skrub DataOps plans
- Tuning and validating skrub DataOps plans
- Improving the confidence in our score through cross-validation
- Splitting the data in train and test sets
- Using the skrub choose_* functions to tune hyperparameters
- Feature selection with skrub SelectCols and DropCols
- Validating hyperparameter search with nested cross-validation
- Going beyond estimator hyperparameters: nesting choices and choosing pipelines
- Linking choices depending on other choices
- Exporting the DataOps plan as a learner and reusing it
- Tuning DataOps with Optuna
- Configuration and dataset utilities
- Joining Dataframes