User Guide
Skrub is a library that eases machine learning with dataframes, from exploring dataframes to validating a machine-learning pipeline. The TableReport is a powerful data exploration tool, which can be followed by the data sanitization and feature engineering tools Cleaner and TableVectorizer. The tabular_pipeline() combines these tools into a strong baseline model for dataframes.
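For instance, a first pass over a single table can be sketched as follows; the employee-salaries dataset and the "regressor" shorthand are used here as plausible examples rather than prescriptions of the API:

```python
import skrub
from skrub.datasets import fetch_employee_salaries

dataset = fetch_employee_salaries()
X, y = dataset.X, dataset.y

# Interactive overview of the dataframe (renders in a notebook).
report = skrub.TableReport(X)

# Sanitize the columns, then turn them into numeric features.
clean = skrub.Cleaner().fit_transform(X)
features = skrub.TableVectorizer().fit_transform(clean)

# Or let tabular_pipeline() assemble a full baseline estimator directly.
baseline = skrub.tabular_pipeline("regressor").fit(X, y)
```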
The skrub column-level encoders can be tweaked by the user for more specific needs. Various multi-column transformers and the selectors API provide a high degree of control over which columns should be modified.
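As a rough illustration of selectors combined with multi-column transformers (the column patterns below are invented, and passing a selector through the cols argument of SelectCols and ApplyToCols is an assumption about the current API):

```python
import pandas as pd
import skrub
from skrub import selectors as s

df = pd.DataFrame(
    {"city": ["Paris", "Rome"], "temp": [21.0, 25.5], "date_start": ["2024-01", "2024-02"]}
)

# Keep only numeric columns plus any column whose name starts with "date_".
keep = skrub.SelectCols(cols=s.numeric() | s.glob("date_*"))
print(keep.fit_transform(df))  # -> columns "temp" and "date_start"

# Encode only the string columns, leaving the other columns untouched.
encode_strings = skrub.ApplyToCols(skrub.StringEncoder(), cols=s.string())
```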
More complex, multi-table scenarios can make use of the skrub Data Ops, which make it possible to build and validate pipelines that involve multiple dataframes, and to tune their hyperparameters.
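A tiny Data Ops plan might look like the sketch below; the toy tables are invented, and the mark_as_X()/mark_as_y() annotations and the make_learner() export are assumptions about the API covered in the sections listed further down:

```python
import pandas as pd
import skrub
from sklearn.ensemble import HistGradientBoostingClassifier

# Toy data standing in for the tables of a real project.
orders_df = pd.DataFrame(
    {"product": ["pen", "cup", "pen", "fork"], "price": [1.5, 3.0, 1.7, 4.0]}
)
labels = pd.Series([0, 1, 0, 1], name="fraud_flag")

# Declare the inputs of the plan; previews are computed on these example values.
orders = skrub.var("orders", orders_df).skb.mark_as_X()
y = skrub.var("fraud_flag", labels).skb.mark_as_y()

# Build the plan: vectorize the table, then fit a classifier on the result.
vectorized = orders.skb.apply(skrub.TableVectorizer())
predictions = vectorized.skb.apply(HistGradientBoostingClassifier(), y=y)

# Export the whole plan as a standalone learner that can be refitted and reused.
learner = predictions.skb.make_learner()
```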
Skrub does not replace pandas or polars. Instead, it builds on these dataframe libraries to provide higher-level building blocks that perform the data preprocessing steps typically needed in a machine-learning pipeline.
- Exploring a Dataframe
- Wrangling data with good defaults
- Column-level feature extraction
- Multi-column operations
- Operating over multiple columns at once
- Removing unneeded columns with DropUninformative and Cleaner
- Applying DropUninformative only to a subset of columns
- Skrub Selectors: helpers for selecting columns in a dataframe
- Selecting based on dtype or data properties
- Categories of selectors
- Advanced selectors: filter() and filter_names()
- Combining selectors with other skrub transformers
- Complex multi-table pipelines with Data Ops
- Data Ops basic concepts
- Basics of DataOps: the DataOps plan, variables, and learners
- Building a simple DataOps plan
- Using previews for easier development and debugging
- DataOps allow direct access to methods of the underlying data
- Control flow in DataOps: eager and deferred evaluation
- How do skrub Data Ops differ from the alternatives?
- Building a complex pipeline with the skrub Data Ops
- Applying machine-learning estimators
- Applying different transformers using skrub selectors and DataOps
- Documenting the DataOps plan with node names and descriptions
- Evaluating and debugging the DataOps plan with .skb.full_report()
- Using only a part of a DataOps plan
- Subsampling data for easier development and debugging
- Tuning and validating skrub DataOps plans
- Improving the confidence in our score through cross-validation
- Splitting the data in train and test sets
- Using the skrub choose_* functions to tune hyperparameters
- Feature selection with skrub SelectCols and DropCols
- Validating hyperparameter search with nested cross-validation
- Going beyond estimator hyperparameters: nesting choices and choosing pipelines
- Linking choices depending on other choices
- Exporting the DataOps plan as a learner and reusing it
- Configuration and dataset utilities
- Joining Dataframes