X of the right shape 🤔
product_vectorizer?
products table: data leakage
ColumnTransformer 😟😰🤔
X, y, products represent inputs to the model
products based on X
product_vectorizer is added to the model
from skrub import selectors as s
products = products[products["basket_ID"].isin(X["ID"])]
product_vectorizer = skrub.TableVectorizer(
    high_cardinality=skrub.StringEncoder(
        n_components=skrub.choose_int(2, 20)
    )
)
vectorized_products = products.skb.apply(
    product_vectorizer, cols=s.all() - "basket_ID"
)
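A minimal sketch of what the filtering line does, using plain pandas on hypothetical toy tables (the values and the `item` column are invented for illustration; skrub itself is not used here). Restricting `products` to the baskets present in `X` is what keeps rows from other baskets, such as held-out ones, out of the fitted vectorizer.

```python
import pandas as pd

# Hypothetical toy data: X holds only the baskets of the current split.
X = pd.DataFrame({"ID": [1, 2]})

# products has one row per product, for every basket, including basket 3,
# which is not part of X.
products = pd.DataFrame(
    {
        "basket_ID": [1, 1, 2, 3, 3],
        "item": ["tv", "cable", "phone", "laptop", "mouse"],
    }
)

# Keep only the products whose basket appears in X.
products_in_X = products[products["basket_ID"].isin(X["ID"])]

print(products_in_X["basket_ID"].tolist())  # [1, 1, 2]
```

Basket 3 is dropped, so nothing computed from `products_in_X` can depend on it.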
Evaluation
Training & using a model
train.py
Without skrub: 😭😭😭
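from scipy.stats import loguniform, randint
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([("dim_reduction", PCA()), ("regressor", Ridge())])
# One dict per (dim_reduction, regressor) combination, because each
# estimator has its own hyperparameter names.
grid = [
    {
        "dim_reduction": [PCA()],
        "dim_reduction__n_components": [10, 20, 30],
        "regressor": [Ridge()],
        "regressor__alpha": loguniform(0.1, 10.0),
    },
    {
        "dim_reduction": [SelectKBest(f_regression)],
        "dim_reduction__k": [10, 20, 30],
        "regressor": [Ridge()],
        "regressor__alpha": loguniform(0.1, 10.0),
    },
    {
        "dim_reduction": [PCA()],
        "dim_reduction__n_components": [10, 20, 30],
        "regressor": [RandomForestRegressor()],
        # n_estimators must be an integer, so it is sampled with randint.
        "regressor__n_estimators": randint(20, 200),
    },
    {
        "dim_reduction": [SelectKBest(f_regression)],
        "dim_reduction__k": [10, 20, 30],
        "regressor": [RandomForestRegressor()],
        "regressor__n_estimators": randint(20, 200),
    },
]
model = RandomizedSearchCV(pipe, grid)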
NO!
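Why the scikit-learn grid gets so long: it needs one dict per combination of alternatives, so its size is the product of the number of options at each step. The sketch below makes that explicit by generating the dicts with `itertools.product`. The helper `build_grid`, the option lists, and the specific estimators and distributions are illustrative assumptions, not part of the original slides.

```python
from itertools import product

from scipy.stats import loguniform, randint
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge

# Each option pairs an estimator with the search space of its own
# hyperparameters (illustrative choices).
dim_reduction_options = [
    (PCA(), {"dim_reduction__n_components": [10, 20, 30]}),
    (SelectKBest(f_regression), {"dim_reduction__k": [10, 20, 30]}),
]
regressor_options = [
    (Ridge(), {"regressor__alpha": loguniform(0.1, 10.0)}),
    (RandomForestRegressor(), {"regressor__n_estimators": randint(20, 200)}),
]


def build_grid(dim_reduction_options, regressor_options):
    """Return one search-space dict per (dim_reduction, regressor) pair."""
    grid = []
    for (reducer, reducer_space), (regressor, regressor_space) in product(
        dim_reduction_options, regressor_options
    ):
        grid.append(
            {
                "dim_reduction": [reducer],
                **reducer_space,
                "regressor": [regressor],
                **regressor_space,
            }
        )
    return grid


grid = build_grid(dim_reduction_options, regressor_options)
print(len(grid))  # 2 x 2 = 4
```

Adding a third alternative to either step, or a third step with two alternatives, multiplies the number of dicts again, which is the bookkeeping the slide objects to.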
With skrub: replace any value with a range