Building a simple DataOps plan

Let’s build a simple DataOps plan that adds two variables together.

We start by declaring the variables:

>>> import skrub
>>> a = skrub.var("a")
>>> b = skrub.var("b")

We then apply transformations (in this case, an addition), composing more complex DataOps:

>>> c = a + b
>>> c
<BinOp: add>

We can then evaluate the plan by passing the environment in which it should be evaluated, that is, a dictionary mapping variable names to their values:

>>> c.skb.eval({"a": 10, "b": 6})
16

As shown above, the special .skb attribute lets us interact with the DataOp object itself, and .skb.eval() evaluates the DataOps plan. By default, .skb.eval() uses the values passed in the variable definitions, but it can also take an explicit environment dictionary as an argument.
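
For instance, when the variables are declared with initial values, the plan can be evaluated without passing an environment, and the declaration values are used. The snippet below is a minimal sketch of this behaviour; the variable names x and y are purely illustrative:

>>> x = skrub.var("x", 2)
>>> y = skrub.var("y", 3)
>>> z = x + y
>>> z.skb.eval()
5
>>> z.skb.eval({"x": 10, "y": 20})
30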

Finally, we can export the plan as a Learner that can be fitted and applied to new data:

>>> learner = c.skb.make_learner()
>>> learner.fit_transform({"a": 10, "b": 7})
17
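
Since the learner is a standalone object, it can be reused on new environments. As a small illustration, we call fit_transform again on the learner defined above, this time with different values:

>>> learner.fit_transform({"a": 2, "b": 5})
7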

When using DataOps, it is important to ensure that all operations are tracked by acting on the DataOps themselves, rather than (for example) on the starting dataframe. Consider the following example:

>>> import pandas as pd
>>> df = pd.DataFrame({"col": [1, 2, 3]})
>>> df
   col
0    1
1    2
2    3
>>> df_do = skrub.var("df", df)
>>> df_do
<Var 'df'>
Result:
―――――――
   col
0    1
1    2
2    3

df_do is a DataOp that wraps df, so its preview shows the content of df. If we now modify df_do by doubling the column, both steps (the creation of the variable and the doubling) are tracked by the resulting DataOp:

>>> df_doubled = df_do.assign(col=df_do["col"]*2)
>>> df_doubled
<CallMethod 'assign'>
Result:
―――――――
   col
0    2
1    4
2    6
>>> print(df_doubled.skb.describe_steps())
Var 'df'
( Var 'df' )*
GetItem 'col'
BinOp: mul
CallMethod 'assign'
* Cached, not recomputed

On the other hand, working directly on df would lead to the same result, but the individual operations would not be tracked. By working only on DataOps, we ensure that all operations performed on the data are correctly added to the computational graph, which then allows the resulting learner to execute all steps as intended.
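
To make the contrast concrete, here is a sketch of the untracked approach: the column is doubled with plain pandas before wrapping the result in a variable, so the plan contains a single step and the doubling is invisible to the learner. The variable name df_doubled is only illustrative, and the exact output of describe_steps() may differ slightly:

>>> doubled_outside = df.assign(col=df["col"] * 2)  # plain pandas, not tracked
>>> untracked = skrub.var("df_doubled", doubled_outside)
>>> print(untracked.skb.describe_steps())
Var 'df_doubled'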

See Introduction to machine-learning pipelines with skrub DataOps for an introductory example of how to use skrub DataOps on a single dataframe.