Applying different transformers using skrub selectors and DataOps#

It is possible to use skrub selectors to define which columns to apply transformers to, and then apply different transformers to different subsets of the data.

For example, this can be useful to apply TextEncoder to columns that contain free-flowing text, and StringEncoder to other string columns that contain categorical data such as country names.

Or, a string column may need to be encoded in an ordered way, like in the following example with grades.

>>> import skrub
>>> import pandas as pd
>>> data = {
...     "subject": ["Math", "English", "History", "Science", "Art"],
...     "grade": ["A", "B", "C", "A", "B"]
... }
>>> df = pd.DataFrame(data)
>>> grades = skrub.var("grades", df)
>>> grades
<Var 'grades'>
Result:
―――――――
   subject grade
0     Math     A
1  English     B
2  History     C
3  Science     A
4      Art     B

We encode the subjects with the StringEncoder:

>>> from skrub import StringEncoder
>>> enc_subject = grades.skb.select(cols="subject").skb.apply(StringEncoder(n_components=2))

For the grades, we define a deferred() function that maps the strings to the order we want. Remember that objects inside deferred functions are regular Python objects (more detail in Control flow in DataOps: eager and deferred evaluation).

>>> @skrub.deferred
... def encode_ordered(df):
...     grade_order = {"A": 3, "B": 2, "C": 1}
...     return df["grade"].map(grade_order)
>>> enc_grades = grades.skb.apply_func(encode_ordered)
>>> enc_grades
<Call 'encode_ordered'>
Result:
―――――――
0    3
1    2
2    1
3    3
4    2
Name: grade, dtype: int64

Finally, we combine the resulting dataframe and series using another deferred function.

>>> @skrub.deferred
... def combine(subjects, grades):
...     subjects["grade"] = grades
...     return subjects
>>> combine(enc_subject, enc_grades)
<Call 'combine'>
Result:
―――――――
      subject_0     subject_1  grade
0  1.800470e-07  1.704487e+00      3
1  1.675736e-07 -1.998386e-08      2
2  1.615310e+00  2.142048e-07      1
3 -4.709333e-08  5.155605e-08      3
4 -5.441046e-01  4.167525e-09      2

In the next example, we apply a StringEncoder to columns with high cardinality, a mathematical operation to columns with nulls, and a TableVectorizer to all other columns. We use the skrub selectors to select the columns based on our requirements.

>>> import pandas as pd
>>> import skrub
>>> orders_df = pd.DataFrame(
...     {
...         "item": ["pen", "cup", "pen", "fork"],
...         "price": [1.5, None, 1.5, 2.2],
...         "qty": [1, 1, 2, 4],
...     }
... )
>>> orders = skrub.var("orders", orders_df)
>>> orders
<Var 'orders'>
Result:
―――――――
   item  price  qty
0   pen    1.5    1
1   cup    NaN    1
2   pen    1.5    2
3  fork    2.2    4

We create some selectors with different conditions:

>>> from skrub import selectors as s
>>> high_cardinality = s.string() - s.cardinality_below(2)
>>> has_nulls = s.has_nulls()
>>> leftover = s.all() - high_cardinality - has_nulls

>>> vectorizer = skrub.StringEncoder(n_components=2)
>>> vectorized_items = orders.skb.select(high_cardinality).skb.apply(vectorizer)
>>> vectorized_items
<Apply StringEncoder>
Result:
―――――――
          item_0        item_1  price  qty
0  1.511858e+00  9.380015e-08    1.5    1
1 -1.704687e-07  1.511858e+00    NaN    1
2  1.511858e+00  9.380015e-08    1.5    2
3 -5.458670e-09 -6.917769e-08    2.2    4

>>> vectorized_has_nulls = orders.skb.select(cols=has_nulls) * 11
>>> vectorized_has_nulls
    <BinOp: mul>
    Result:
    ―――――――
       price
    0   16.5
    1    NaN
    2   16.5
    3   24.2
>>> everything_else = orders.skb.select(cols=leftover).skb.apply(skrub.TableVectorizer())

After encoding the columns, the resulting DataOps can be concatenated together to obtain the final result:

>>> encoded = (
...   everything_else.skb.concat([vectorized_items, vectorized_has_nulls], axis=1)
... )
>>> encoded
   qty        item_0        item_1  price
0  1.0  1.594282e+00 -1.224524e-07   16.5
1  1.0  9.228692e-08  1.473794e+00    NaN
2  2.0  1.594282e+00 -1.224524e-07   16.5
3  4.0  7.643604e-09  6.080018e-01   24.2

More info on advanced column selection and manipulation be found in Skrub Selectors, for selecting columns in a dataframe and example Hands-On with Column Selection and Transformers.