To vectorize the products table and build an X of the right shape:

product_vectorizer = skrub.TableVectorizer(
    high_cardinality=skrub.StringEncoder(n_components=5)
)
vectorized_products = product_vectorizer.fit_transform(data.products)
🤔 Problem: product_vectorizer is fit on the whole products table, including rows that belong to test baskets: data leakage. And this multi-table step cannot be expressed with a ColumnTransformer.
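The leakage can be seen in miniature on a toy statistic (hypothetical numbers, not the fraud data): anything fit on the full table absorbs information from the test rows.

```python
# Toy illustration of the leakage problem: a statistic computed on the
# full table (train + test rows) differs from one computed on train only.
train_prices = [10.0, 12.0, 11.0]
test_prices = [100.0]  # an outlier that should stay unseen at fit time

full_mean = sum(train_prices + test_prices) / 4   # fit on everything: leaks
train_mean = sum(train_prices) / 3                # fit on train only: correct

print(full_mean, train_mean)  # 33.25 11.0
```

The same applies to any fitted transformer: its learned parameters must come from training rows only.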
😟😰 We then aggregate the product features per basket and merge them into X:

aggregated_products = (
    vectorized_products.groupby("basket_ID").agg("mean").reset_index()
)
X = X.merge(aggregated_products, left_on="ID", right_on="basket_ID").drop(
    columns=["ID", "basket_ID"]
)
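The aggregate-then-merge step on a toy pair of tables (hypothetical values), to make the shapes concrete:

```python
import pandas as pd

# Hypothetical toy tables: one basket per row in X, several products per basket.
X_toy = pd.DataFrame({"ID": [1, 2]})
vectorized_toy = pd.DataFrame(
    {"basket_ID": [1, 1, 2], "feat_0": [0.25, 0.75, 1.0]}
)

# Average the product features within each basket: one row per basket_ID.
aggregated_toy = vectorized_toy.groupby("basket_ID").agg("mean").reset_index()

# Attach them to X, then drop the join keys.
X_toy = X_toy.merge(
    aggregated_toy, left_on="ID", right_on="basket_ID"
).drop(columns=["ID", "basket_ID"])
print(X_toy["feat_0"].tolist())  # [0.5, 1.0]
```

After the merge, X has exactly one row per basket, carrying the averaged product features.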
🤔 With skrub, we instead declare the inputs:

data = skrub.datasets.fetch_credit_fraud()
X = skrub.X(data.baskets[["ID"]])
y = skrub.y(data.baskets["fraud_flag"])
products = skrub.var("products", data.products)

X, y and products represent inputs to the model.

from skrub import selectors as s
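A toy sketch of the idea behind these declarations (NOT skrub's implementation): operations on a variable are recorded rather than executed, then replayed later on whatever data is bound to it.

```python
# Toy sketch of a deferred "variable": record operations, replay them later.
class Var:
    def __init__(self, name):
        self.name = name

    def evaluate(self, env):
        # Look up the concrete value bound to this variable.
        return env[self.name]

    def apply(self, func):
        # Record a step instead of running it now.
        return Applied(self, func)


class Applied(Var):
    def __init__(self, parent, func):
        self.parent, self.func = parent, func

    def evaluate(self, env):
        return self.func(self.parent.evaluate(env))


products = Var("products")
doubled = products.apply(lambda rows: [x * 2 for x in rows])

# The same recorded plan runs on training data now, on test data later.
print(doubled.evaluate({"products": [1, 2, 3]}))  # [2, 4, 6]
```

Because the plan is replayed separately on train and test inputs, the leakage from the earlier slide cannot happen by construction.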
products = products[products["basket_ID"].isin(X["ID"])]
product_vectorizer = skrub.TableVectorizer(
    high_cardinality=skrub.StringEncoder(n_components=5)
)
vectorized_products = products.skb.apply(
    product_vectorizer, cols=s.all() - "basket_ID"
)

products is filtered based on X.
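The selector expression s.all() - "basket_ID" means "every column except basket_ID"; a rough plain-pandas equivalent on a toy frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"basket_ID": [1, 2], "name": ["a", "b"], "price": [3.0, 4.0]}
)

# Rough equivalent of the selector s.all() - "basket_ID":
# keep every column except the join key.
selected = df[[c for c in df.columns if c != "basket_ID"]]
print(list(selected.columns))  # ['name', 'price']
```

The vectorizer is thus applied to the product features only, while basket_ID is kept aside for the later aggregation.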
The product_vectorizer is added to the model.
The full pipeline, with the number of components declared as a range to tune:

from skrub import selectors as s

products = products[products["basket_ID"].isin(X["ID"])]
product_vectorizer = skrub.TableVectorizer(
    high_cardinality=skrub.StringEncoder(
        n_components=skrub.choose_int(2, 20)
    )
)
vectorized_products = products.skb.apply(
    product_vectorizer, cols=s.all() - "basket_ID"
)
aggregated_products = (
    vectorized_products.groupby("basket_ID").agg("mean").reset_index()
)
X = X.merge(aggregated_products, left_on="ID", right_on="basket_ID").drop(
    columns=["ID", "basket_ID"]
)
Evaluation
Training & using a model
train.py
Without skrub: 😭😭😭
from scipy.stats import loguniform, randint
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([("dim_reduction", PCA()), ("regressor", Ridge())])
grid = [
    {
        "dim_reduction": [PCA()],
        "dim_reduction__n_components": [10, 20, 30],
        "regressor": [Ridge()],
        "regressor__alpha": loguniform(0.1, 10.0),
    },
    {
        "dim_reduction": [SelectKBest()],
        "dim_reduction__k": [10, 20, 30],
        "regressor": [Ridge()],
        "regressor__alpha": loguniform(0.1, 10.0),
    },
    {
        "dim_reduction": [PCA()],
        "dim_reduction__n_components": [10, 20, 30],
        "regressor": [RandomForestRegressor()],
        "regressor__n_estimators": randint(20, 200),
    },
    {
        "dim_reduction": [SelectKBest()],
        "dim_reduction__k": [10, 20, 30],
        "regressor": [RandomForestRegressor()],
        "regressor__n_estimators": randint(20, 200),
    },
]
model = RandomizedSearchCV(pipe, param_distributions=grid)
NO!
With skrub: replace any value with a range
product_vectorizer = skrub.TableVectorizer(
    high_cardinality=skrub.StringEncoder(
        n_components=skrub.choose_int(2, 20)
    )
)
# ...
search = pred.skb.get_randomized_search(scoring="roc_auc", fitted=True)
search.plot_parallel_coord()
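What the randomized search does with such a range can be sketched in a few lines of plain Python (the scoring function here is a hypothetical stand-in for a cross-validated ROC AUC; skrub handles all of this internally):

```python
import random

random.seed(0)

def score(n_components):
    # Stand-in for a cross-validated score; peaks at n_components = 10.
    return 1.0 - abs(n_components - 10) / 20

# choose_int(2, 20) describes a range; the search samples candidates
# from it and keeps the best-scoring one.
candidates = [random.randint(2, 20) for _ in range(8)]
best = max(candidates, key=score)
print(best, score(best))
```

The parallel-coordinates plot then shows how each sampled value relates to the score it achieved.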