Complex multi-table pipelines with Data Ops#
Skrub provides an easy way to build complex, flexible machine learning pipelines. There are several needs that are not easily addressed with standard scikit-learn tools such as `Pipeline` and `ColumnTransformer`, and for which the skrub DataOps offer a solution:
- Multiple tables: We often have several tables of different shapes (for example, “Customers”, “Orders”, and “Products” tables) that need to be processed and assembled into a design matrix `X`. The target `y` may also be the result of some data processing. Standard scikit-learn estimators do not support this: they expect, from the start, a single design matrix `X` and a target array `y` with one row per observation. A minimal sketch of such a plan follows this list.
- DataFrame wrangling: Performing typical DataFrame operations such as projections, joins, and aggregations should be possible, leveraging the powerful and familiar APIs of pandas or Polars.
- Hyperparameter tuning: Choices of estimators, hyperparameters, and even of the pipeline architecture can be guided by validation scores. Specifying ranges of possible values outside of the pipeline itself (as in `GridSearchCV`) is difficult in complex pipelines.
- Iterative development: Building a pipeline step by step while inspecting intermediate results allows for a short feedback loop and early discovery of errors.
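For instance, here is a minimal sketch of a multi-table DataOps plan. The tables, column names, and estimator below are illustrative; the entry points used (`skrub.var`, `.skb.mark_as_X()`, `.skb.mark_as_y()`, `.skb.apply()`) are the DataOps API covered in the rest of this section.

```python
import pandas as pd
import skrub
from sklearn.ensemble import HistGradientBoostingClassifier

# Illustrative input tables with different shapes.
orders_df = pd.DataFrame(
    {"product_id": [1, 2, 1, 2], "quantity": [3, 1, 2, 5], "delayed": [0, 1, 0, 1]}
)
products_df = pd.DataFrame({"product_id": [1, 2], "price": [10.0, 15.0]})

# Each input table becomes a named variable in the plan.
orders = skrub.var("orders", orders_df)
products = skrub.var("products", products_df)

# Regular DataFrame operations (here a pandas merge) assemble the design matrix.
full = orders.merge(products, on="product_id")
X = full.drop(columns="delayed").skb.mark_as_X()
y = full["delayed"].skb.mark_as_y()  # the target is itself derived from the data

# A scikit-learn estimator is applied at the end of the plan.
pred = X.skb.apply(HistGradientBoostingClassifier(), y=y)
```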
In this section, we cover the skrub DataOps, from a first simple example to more advanced concepts such as hyperparameter tuning and pipeline validation.
Data Ops basic concepts#
- Basics of DataOps: the DataOps plan, variables, and learners
- Building a simple DataOps plan
- Using previews for easier development and debugging
- DataOps allow direct access to methods of the underlying data
- Control flow in DataOps: eager and deferred evaluation
- How do skrub Data Ops differ from the alternatives?
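As a taste of these concepts, here is a small sketch assuming the DataOps API introduced above (`skrub.var`, `.skb.eval()`, `skrub.deferred`); the data and the deferred function are made up for illustration.

```python
import pandas as pd
import skrub

df = pd.DataFrame({"city": ["Paris", "London", "Rome"], "temp_c": [12.5, 9.0, 18.0]})
data = skrub.var("data", df)

# DataOps expose the methods of the underlying object, so regular
# pandas operations simply add new nodes to the plan.
fahrenheit = data["temp_c"] * 9 / 5 + 32

# Each DataOp displays a preview computed on the example value;
# .skb.eval() evaluates the plan explicitly.
print(fahrenheit.skb.eval())

# Python control flow cannot branch on values that are only known at
# run time, so such logic goes into a deferred function.
@skrub.deferred
def clip_high(temps, upper):
    if temps.max() > upper:
        return temps.clip(upper=upper)
    return temps

clipped = clip_high(fahrenheit, upper=60.0)
print(clipped.skb.eval())
```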
Building a complex pipeline with the skrub Data Ops#
- Applying machine-learning estimators
- Applying different transformers using skrub selectors and DataOps
- Documenting the DataOps plan with node names and descriptions
- Evaluating and debugging the DataOps plan with `.skb.full_report()`
- Using only a part of a DataOps plan
- Subsampling data for easier development and debugging
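The sketch below illustrates some of these features together: applying estimators, restricting a transformer to the columns matched by a skrub selector, and naming a node to document the plan. The dataset and names are illustrative; `.skb.full_report()` is left commented out, as it generates a full HTML report (see also `.skb.subsample()` for working on a sample of the data during development).

```python
import pandas as pd
import skrub
from skrub import selectors as s
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame(
    {
        "age": [25, 32, 47, 51, 38, 29],
        "city": ["Paris", "London", "Paris", "Rome", "Rome", "London"],
        "churned": [0, 1, 0, 1, 0, 1],
    }
)
data = skrub.var("data", df)
X = data.drop(columns="churned").skb.mark_as_X()
y = data["churned"].skb.mark_as_y()

# Vectorize the table and name the node so it is easy to find in reports.
vectorized = X.skb.apply(skrub.TableVectorizer()).skb.set_name("vectorized")

# Apply a transformer only to the columns matched by a selector.
scaled = vectorized.skb.apply(StandardScaler(), cols=s.numeric())

pred = scaled.skb.apply(LogisticRegression(), y=y)

# Open a browsable report showing every node, its preview, and its code:
# pred.skb.full_report()
```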
Tuning and validating skrub DataOps plans#
- Improving the confidence in our score through cross-validation
- Splitting the data in train and test sets
- Using the skrub `choose_*` functions to tune hyperparameters
- Feature selection with skrub `SelectCols` and `DropCols`
- Validating hyperparameter search with nested cross-validation
- Going beyond estimator hyperparameters: nesting choices and choosing pipelines
- Linking choices depending on other choices
- Exporting the DataOps plan as a learner and reusing it
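The sketch below gives a flavor of these topics, assuming the `choose_*` helpers, cross-validation, and learner export described in the linked pages; the dataset and the hyperparameter range are illustrative.

```python
import pandas as pd
import skrub
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame(
    {"x1": range(40), "x2": [v % 5 for v in range(40)], "y": [v % 2 for v in range(40)]}
)
data = skrub.var("data", df)
X = data.drop(columns="y").skb.mark_as_X()
y = data["y"].skb.mark_as_y()

# Hyperparameter ranges are declared inline, where the value is used.
pred = X.skb.apply(
    LogisticRegression(C=skrub.choose_float(0.01, 10.0, log=True, name="C")),
    y=y,
)

# Cross-validate the whole plan (choices take their default values).
print(pred.skb.cross_validate())

# Tune the declared choices with a randomized search over the plan.
search = pred.skb.make_randomized_search(fitted=True)
print(search.results_)

# Export the plan as a standalone learner; it is fitted on an
# "environment" dict mapping variable names to values.
learner = pred.skb.make_learner()
learner.fit({"data": df})
```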