skrub.DataOp.skb.find_X_y

DataOp.skb.find_X_y()

Find the nodes that have been marked with mark_as_X() and mark_as_y().

Returns:

dict

A dictionary containing the following keys (all are optional):

"X", if a node has been marked with DataOp.skb.mark_as_X().
"y", if a node has been marked with DataOp.skb.mark_as_y().
Additionally, if a cv has been passed to DataOp.skb.mark_as_X(), the dictionary contains the parameters that were passed to mark_as_X():
- "cv"
- "split_kwargs"

See also

DataOp.skb.find: Find a node by name or by an arbitrary predicate.
SkrubLearner.truncated_after: Truncate the (possibly fitted) SkrubLearner after the specified node.

Notes

To evaluate the DataOps in the returned dictionary, it is recommended to evaluate the whole dict as a single DataOp:

Xy = my_data_op.skb.find_X_y()
Xy_values = skrub.as_data_op(Xy).skb.eval({...})
X_value = Xy_values['X']
y_value = Xy_values['y']

rather than:

Xy = my_data_op.skb.find_X_y()
X_value = Xy['X'].skb.eval({...})
y_value = Xy['y'].skb.eval({...})

Indeed, evaluating each value in the dict separately can run some computations twice. Worse, it can yield X and y that are not aligned if the data loading and processing that produces them returns rows in a nondeterministic order (e.g. due to aggregations, joins, or database queries).
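The alignment problem can be illustrated without skrub. The sketch below is a plain-Python analogy, not skrub code: `load_rows` is a hypothetical loader whose row order changes on every call, standing in for a query or join with no guaranteed ordering. Deriving X and y from two separate calls pairs features with the wrong targets, while deriving both from a single call keeps them aligned.

```python
import itertools

# Toy rows as (description, category) pairs.
ROWS = [
    ("screen", "electronics"),
    ("hammer", "tools"),
    ("keyboard", "electronics"),
]

_call_count = itertools.count()


def load_rows():
    """Hypothetical loader: returns the same rows, rotated differently on each call."""
    shift = next(_call_count) % len(ROWS)
    return ROWS[shift:] + ROWS[:shift]


# Separate evaluation: the loader runs once for X and once more for y.
X_separate = [description for description, _ in load_rows()]
y_separate = [category for _, category in load_rows()]

# Joint evaluation: the loader runs once and X and y share its output.
rows = load_rows()
X_joint = [description for description, _ in rows]
y_joint = [category for _, category in rows]

separate_ok = all(pair in ROWS for pair in zip(X_separate, y_separate))
joint_ok = all(pair in ROWS for pair in zip(X_joint, y_joint))
print("separate evaluation aligned:", separate_ok)
print("joint evaluation aligned:", joint_ok)
```

Evaluating the whole dictionary as a single DataOp plays the role of the joint evaluation here: the shared upstream steps run once for both X and y.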

Examples

>>> import skrub
>>> from sklearn.dummy import DummyClassifier

>>> df = skrub.datasets.toy_products()
>>> df
   description  price            seller     category
0       screen    100   supermarket.com  electronics
1       hammer     15  bestproducts.com        tools
2     keyboard     20   supermarket.com  electronics
3      usb key      9  bestproducts.com  electronics
4      charger     13  bestproducts.com  electronics
5  screwdriver     12   supermarket.com        tools

>>> data = skrub.var("df")
>>> groups = data["seller"]
>>> X = data[["description", "price"]].skb.mark_as_X()
>>> y = data["category"].skb.mark_as_y()
>>> pred = X.skb.apply(DummyClassifier(), y=y)
>>> X_y = pred.skb.find_X_y()
>>> X_y
{'X': <GetItem ['description', 'price']>, 'y': <GetItem 'category'>}
>>> X_y['X'] is X
True

To compute the values, evaluate the whole dictionary as a single DataOp:

>>> X_y_values = skrub.as_data_op(X_y).skb.eval({'df': df})
>>> X_y_values
{'X':    description  price
0       screen    100
1       hammer     15
2     keyboard     20
3      usb key      9
4      charger     13
5  screwdriver     12, 'y': 0    electronics
1          tools
2    electronics
3    electronics
4    electronics
5          tools
Name: category, dtype: str}

When a cv object was passed to DataOp.skb.mark_as_X(), the result will also contain the keys "cv" and "split_kwargs":

>>> from sklearn.model_selection import LeaveOneGroupOut

>>> X = data[["description", "price"]].skb.mark_as_X(
...     cv=LeaveOneGroupOut(), split_kwargs={"groups": groups}
... )
>>> pred = X.skb.apply(DummyClassifier(), y=y)
>>> pred.skb.find_X_y()
{'X': <X>, 'cv': LeaveOneGroupOut(), 'split_kwargs': {'groups': <GetItem 'seller'>}, 'y': <GetItem 'category'>}
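Once evaluated, the "cv" and "split_kwargs" entries can be combined with X and y to enumerate the cross-validation folds. The sketch below uses only scikit-learn and hand-built lists that copy the rows of the toy table above, standing in for the evaluated values. It assumes that split_kwargs is meant to be forwarded to the splitter's split() method, following the usual scikit-learn convention; how skrub consumes these values internally is not shown here.

```python
from sklearn.model_selection import LeaveOneGroupOut

# Hand-built stand-ins for the evaluated "X", "y" and groups (rows of the toy table).
X_value = [
    ["screen", 100],
    ["hammer", 15],
    ["keyboard", 20],
    ["usb key", 9],
    ["charger", 13],
    ["screwdriver", 12],
]
y_value = ["electronics", "tools", "electronics", "electronics", "electronics", "tools"]
seller = [
    "supermarket.com",
    "bestproducts.com",
    "supermarket.com",
    "bestproducts.com",
    "bestproducts.com",
    "supermarket.com",
]

cv = LeaveOneGroupOut()
split_kwargs = {"groups": seller}

# Assumption: split_kwargs is passed as keyword arguments to cv.split().
folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in cv.split(X_value, y_value, **split_kwargs)
]
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

With two sellers, LeaveOneGroupOut yields two folds, each holding out all rows of one seller.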