Complex multi-table pipelines with DataOps#

Skrub provides an easy way to build complex, flexible machine learning pipelines. There are several needs that standard scikit-learn tools such as Pipeline and ColumnTransformer do not easily address, and for which the skrub DataOps offer a solution:

  • Multiple tables: We often have several tables of different shapes (for example, “Customers”, “Orders”, and “Products” tables) that need to be processed and assembled into a design matrix X. The target y may also be the result of some data processing. Standard scikit-learn estimators do not support this: they expect a single design matrix X and a target array y from the start, with one row per observation (see the first sketch after this list).

  • DataFrame wrangling: Typical DataFrame operations such as projections, joins, and aggregations should be possible, leveraging the powerful and familiar APIs of pandas or Polars.

  • Hyperparameter tuning: Choices of estimators, hyperparameters, and even of the pipeline architecture can be guided by validation scores. In complex pipelines, specifying the ranges of possible values outside of the pipeline itself (as with GridSearchCV) becomes difficult; DataOps let us declare them inline, next to the step they configure (see the second sketch after this list).

  • Iterative development: Building a pipeline step by step while inspecting intermediate results allows for a short feedback loop and early discovery of errors.
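
The first two points are illustrated by the minimal sketch below: a DataOps plan that aggregates an “Orders” table with plain pandas calls, joins it to a “Customers” table, and marks the result as X and y. This is only a sketch, assuming skrub's DataOps API (skrub.var, .skb.mark_as_X, .skb.mark_as_y, .skb.apply); the toy tables and column names are made up for illustration:

```python
import pandas as pd
import skrub
from sklearn.ensemble import HistGradientBoostingClassifier

# Each input table becomes a named variable in the DataOps plan.
orders = skrub.var("orders", pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [20.0, 35.5, 12.0],
}))
customers = skrub.var("customers", pd.DataFrame({
    "customer_id": [1, 2],
    "churned": [0, 1],
}))

# Plain pandas operations work directly on DataOps and are recorded in the
# plan; results are previewed eagerly, which supports iterative development.
totals = orders.groupby("customer_id").agg(total=("amount", "sum")).reset_index()
joined = customers.merge(totals, on="customer_id")

# Mark which nodes of the plan play the role of the design matrix X and of
# the target y; both may be the result of arbitrary processing.
X = joined.drop(columns="churned").skb.mark_as_X()
y = joined["churned"].skb.mark_as_y()

# The final step applies a scikit-learn estimator within the plan.
pred = X.skb.apply(HistGradientBoostingClassifier(), y=y)
```

From pred, the entire plan can then be exported as a single object that replays every step on new inputs; the exact method for this (for example .skb.make_learner() in recent skrub versions) is covered in the subsections below.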
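
The hyperparameter tuning point deserves its own sketch: ranges of possible values are declared inline, inside the plan, rather than in a separate grid keyed by step names as with GridSearchCV. The sketch assumes skrub's choose_from and choose_float helpers and the skrub.X/skrub.y shortcuts; the data and the candidate estimators are purely illustrative:

```python
import pandas as pd
import skrub
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({"f1": [0.1, 0.7, 0.3, 0.9], "target": [0, 1, 0, 1]})
X = skrub.X(df[["f1"]])    # design matrix as a DataOps variable
y = skrub.y(df["target"])  # target as a DataOps variable

# Both the choice of estimator and the range of one of its hyperparameters
# are written next to the step they configure, inside the plan itself.
pred = X.skb.apply(
    skrub.choose_from(
        {
            "hgb": HistGradientBoostingClassifier(
                learning_rate=skrub.choose_float(0.01, 0.5, log=True, name="lr")
            ),
            "logreg": LogisticRegression(),
        },
        name="classifier",
    ),
    y=y,
)
```

A grid or randomized search over every choice declared in the plan can then be built from pred; since the exact method names vary across skrub versions, we defer those details to the tuning subsection linked below.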

In this section we cover the skrub DataOps in depth, starting from a simple example and moving on to more advanced concepts such as hyperparameter tuning and pipeline validation.

DataOps basic concepts#

Building a complex pipeline with the skrub DataOps#

Tuning and validating skrub DataOps plans#