Using previews for easier development and debugging

To make interactive development easier without having to call .skb.eval() after each step, you can preview the result of a DataOp by passing a value along with the name when creating a variable.

>>> import skrub
>>> a = skrub.var("a", 10) # we pass the value 10 in addition to the name
>>> b = skrub.var("b", 6)
>>> c = a + b
>>> c  # now the display of c includes a preview of the result
<BinOp: add>
Result:
―――――――
16

Previews are computed eagerly on the current data, as soon as each DataOp is defined, so they surface errors early:

>>> import pandas as pd
>>> df = pd.DataFrame({"col": [1, 2, 3]})
>>> a = skrub.var("a", df)  # we pass the DataFrame as a value

Next, we call the pandas drop method and try to drop a column without specifying the axis:

>>> a.drop("col")
Traceback (most recent call last):
    ...
RuntimeError: Evaluation of '.drop()' failed.
You can see the full traceback above. The error message was:
KeyError: "['col'] not found in axis"

Note that seeing results for the values we provided does not change the fact that we are building a pipeline that we want to reuse, not just computing the result for a fixed input. The displayed result is only a preview of the output on one example dataset. The same DataOp can still be evaluated on other inputs:

>>> c.skb.eval({"a": 3, "b": 2})
5

It is not necessary to provide a value for every variable. It is advisable to do so when possible, however, because it allows errors to be caught early.

Disabling previews and eager checks

By default, as soon as a DataOp is defined, some validity checks are performed and the preview results are computed eagerly. In very complex DataOps plans (100+ nodes), running checks after adding each node can cause noticeable overhead. To avoid this, eager checks and previews can be disabled with the "eager_data_ops" configuration option.

>>> with skrub.config_context(eager_data_ops=False):
...     # no checks are performed when b is defined so no error in the line below:
...     b = skrub.var('a', 1) + skrub.var('a', 2)
...     # checks are still performed (once) before the DataOp is actually used so
...     # evaluating the DataOp, using .skb.make_learner() etc _would_ still raise:
...     # b.skb.eval() ## raises ValueError: Choice and node names must be unique.
>>> b # Note there is no preview, even though we provided values for the variables
<BinOp: add>