Complex multi-table pipelines with DataOps#
Skrub provides an easy way to build complex, flexible machine learning pipelines. There are several needs that are not easily addressed with standard scikit-learn tools such as Pipeline and ColumnTransformer, and for which the skrub DataOps offer a solution:
- Multiple tables: We often have several tables of different shapes (for example, “Customers”, “Orders”, and “Products” tables) that need to be processed and assembled into a design matrix X. The target y may also be the result of some data processing. Standard scikit-learn estimators do not support this: they expect a single design matrix X and a target array y from the start, with one row per observation.
- DataFrame wrangling: Typical DataFrame operations such as projections, joins, and aggregations should be possible, leveraging the powerful and familiar APIs of pandas or Polars.
- Hyperparameter tuning: Choices of estimators, hyperparameters, and even the pipeline architecture can be guided by validation scores. Specifying ranges of possible values outside of the pipeline itself (as in GridSearchCV) is difficult in complex pipelines.
- Iterative development: Building a pipeline step by step while inspecting intermediate results allows for a short feedback loop and early discovery of errors.
In this section we cover the skrub DataOps, starting with a simple example and moving on to more advanced concepts such as hyperparameter tuning and pipeline validation. The short sketch below gives a first taste.
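As a rough sketch, the snippet below builds a small plan from two tables using the DataOps API described in this guide (skrub.var, .skb.mark_as_X, .skb.apply); the tables and column names are made up for illustration.

```python
import pandas as pd
import skrub
from sklearn.ensemble import HistGradientBoostingRegressor

# Two related tables instead of a single, ready-made (X, y) pair
# (illustrative data).
orders_df = pd.DataFrame(
    {"customer_id": [1, 2, 1, 2], "amount": [20.0, 35.0, 12.5, 40.0]}
)
customers_df = pd.DataFrame({"customer_id": [1, 2], "age": [35, 52]})

# Variables are the named inputs of the plan; every operation on them
# is recorded, not just executed once and discarded.
orders = skrub.var("orders", orders_df)
customers = skrub.var("customers", customers_df)

# Ordinary pandas operations assemble the design matrix and the target.
X = orders.merge(customers, on="customer_id").drop(columns="amount")
X = X.skb.mark_as_X()
y = orders["amount"].skb.mark_as_y()

# A scikit-learn estimator becomes the final step of the plan.
pred = X.skb.apply(HistGradientBoostingRegressor(), y=y)
```

Because the variables carry example values, each step is previewed on them as the plan is written, which is what enables the iterative development style described above.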
DataOps basic concepts#
- Basics of DataOps: the DataOps plan, variables, and learners
- Building a simple DataOps plan
- Using previews for easier development and debugging
- DataOps allow direct access to methods of the underlying data
- Control flow in DataOps: eager and deferred evaluation
- How do skrub DataOps differ from the alternatives?
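A minimal sketch of these basic concepts, assuming the API named above (skrub.var, skrub.deferred, .skb.eval); the data and the helper function are illustrative.

```python
import pandas as pd
import skrub

# A variable with an example value attached: every derived step is
# previewed on this value, giving a short feedback loop.
data = skrub.var("data", pd.DataFrame({"text": ["a,b", "c,d"]}))

# DataOps forward attribute access, so the familiar pandas API is
# available directly on them.
upper = data["text"].str.upper()

# Eager Python control flow cannot be recorded into the plan; wrapping
# the logic in a deferred function makes it a plan step instead.
@skrub.deferred
def count_fields(column):
    # Inside a deferred function, `column` is the actual pandas Series.
    return column.str.split(",").str.len()

n_fields = count_fields(data["text"])

# .skb.eval() re-runs the recorded steps on a fresh set of inputs.
result = n_fields.skb.eval({"data": pd.DataFrame({"text": ["x,y,z"]})})
```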
Building a complex pipeline with the skrub DataOps#
- Applying machine-learning estimators
- Applying different transformers using skrub selectors and DataOps
- Documenting the DataOps plan with node names and descriptions
- Evaluating and debugging the DataOps plan with .skb.full_report()
- Using only a part of a DataOps plan
- Subsampling data for easier development and debugging
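The sketch below strings several of these steps together: selectors restrict a transformer to a subset of columns, set_name and set_description document a node, and full_report (left commented out, as it generates an HTML report) exposes every intermediate result. The API names follow this guide; the table is illustrative.

```python
import pandas as pd
import skrub
from skrub import selectors as s
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame(
    {
        "age": [35, 52, 41, 28],
        "city": ["Paris", "Lyon", "Paris", "Nice"],
        "spend": [20.0, 35.0, 12.5, 18.0],
    }
)
data = skrub.var("data", df)
X = data.drop(columns="spend").skb.mark_as_X()
y = data["spend"].skb.mark_as_y()

# Selectors pick the columns each transformer applies to; the other
# columns pass through unchanged.
scaled = X.skb.apply(StandardScaler(), cols=s.numeric())
encoded = scaled.skb.apply(skrub.StringEncoder(n_components=2), cols=s.string())

# Naming and describing a node documents the plan and makes the node
# easy to retrieve later.
features = encoded.skb.set_name("features").skb.set_description(
    "Scaled numeric columns, vectorized string columns."
)

pred = features.skb.apply(Ridge(), y=y)

# Render an HTML report of every node for debugging (commented out):
# pred.skb.full_report()
# On large data, subsampling keeps the previews fast:
# data = skrub.var("data", df).skb.subsample(n=1000)
```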
Tuning and validating skrub DataOps plans#
- Improving the confidence in our score through cross-validation
- Splitting the data into train and test sets
- Using the skrub choose_* functions to tune hyperparameters
- Feature selection with skrub SelectCols and DropCols
- Validating hyperparameter search with nested cross-validation
- Going beyond estimator hyperparameters: nesting choices and choosing pipelines
- Linking choices depending on other choices
- Exporting the DataOps plan as a learner and reusing it
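To close the loop, here is a sketch of tuning, validating, and exporting a plan, assuming the choose_* helpers and learner API referenced above; the data, the hyperparameter range, and the variable name are illustrative.

```python
import pandas as pd
import skrub
from sklearn.linear_model import Ridge

df = pd.DataFrame(
    {
        "age": [22, 35, 52, 41, 28, 60, 33, 47, 55, 39],
        "spend": [10.0, 20.0, 35.0, 12.5, 18.0, 40.0, 22.0, 30.0, 38.0, 21.0],
    }
)
data = skrub.var("data", df)
X = data[["age"]].skb.mark_as_X()
y = data["spend"].skb.mark_as_y()

# The hyperparameter range is declared inline, where it is used, rather
# than in a separate grid keyed by step names.
alpha = skrub.choose_float(0.01, 10.0, log=True, name="alpha")
pred = X.skb.apply(Ridge(alpha=alpha), y=y)

# Cross-validate the whole plan, or run a hyperparameter search over
# the declared choices.
cv_scores = pred.skb.cross_validate()
search = pred.skb.make_randomized_search(fitted=True)

# Export the plan as a standalone learner; it takes its inputs as a
# dictionary keyed by variable names and can be pickled and reused.
learner = pred.skb.make_learner()
learner.fit({"data": df})
predictions = learner.predict({"data": df})
```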