Machine learning with dataframes

skrub is a Python library that eases preprocessing and feature engineering for tabular machine learning. It connects database tables directly to machine learning.

Effortless Pipelines

Create strong scikit-learn pipeline baselines effortlessly with TableVectorizer and tabular_learner.

Given a complex dataframe df:
# Load a dataset whose columns mix text, categories, and dates
from skrub.datasets import fetch_employee_salaries
dataset = fetch_employee_salaries()
df = dataset.X
y = dataset.y
df
     | gender | department | department_name | division | assignment_category | employee_position_title | date_first_hired | year_first_hired
0    | F | POL | Department of Police | MSB Information Mgmt and... | Fulltime-Regular | Office Services Coordinator | 09/22/1986 | 1986
1    | M | POL | Department of Police | ISB Major Crimes... | Fulltime-Regular | Master Police Officer | 09/12/1988 | 1988
...  | ... | ... | ... | ... | ... | ... | ... | ...
9226 | M | CCL | County Council | Council Central Staff | Fulltime-Regular | Manager II | 09/05/2006 | 2006
9227 | M | DLC | Department of Liquor Control | Licensure, Regulation... | Fulltime-Regular | Alcohol/Tobacco Enforcement Specialist II | 01/30/2012 | 2012
from sklearn.model_selection import cross_val_score
from skrub import tabular_learner
cross_val_score(tabular_learner('regressor'), df, y)
array([0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666])
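
TableVectorizer, mentioned above, can also be used on its own inside a standard scikit-learn pipeline. A minimal sketch; the HistGradientBoostingRegressor is just one reasonable choice of estimator, not part of the original example:

from sklearn.pipeline import make_pipeline
from sklearn.ensemble import HistGradientBoostingRegressor
from skrub import TableVectorizer

# TableVectorizer turns the heterogeneous dataframe (text, categories,
# dates) into numeric features that any scikit-learn estimator can use.
pipeline = make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())
pipeline.fit(df, y)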

Powerful Feature Engineering

Encode text and high cardinality categorical data with the GapEncoder and MinHashEncoder, and extract features from dates with the DatetimeEncoder.

import matplotlib.pyplot as plt
from skrub import GapEncoder

# Learn latent topics from the raw job titles and visualize the encoding
gap = GapEncoder().fit(df["employee_position_title"])
encoded_labels = gap.transform(
    df["employee_position_title"].head()
)
plt.imshow(encoded_labels)
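
The other encoders mentioned above follow the same scikit-learn fit/transform interface. A minimal sketch, assuming date_first_hired is first parsed to a datetime column; exact output columns depend on the skrub version:

import pandas as pd
from skrub import MinHashEncoder, DatetimeEncoder

# MinHashEncoder: fast, stateless hashing of high-cardinality strings
hashed_titles = MinHashEncoder(n_components=10).fit_transform(
    df["employee_position_title"]
)

# DatetimeEncoder: extract numeric features (year, month, day, ...)
# from a datetime column
hire_dates = pd.to_datetime(df["date_first_hired"])
date_features = DatetimeEncoder().fit_transform(hire_dates)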

Interactive Data Exploration

Explore your dataframes interactively with TableReport.

from skrub import TableReport
TableReport(df)
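
Beyond inline display, a report can be saved as a standalone HTML file, or skrub can be asked to render every dataframe as a TableReport. A short sketch using the write_html and patch_display helpers documented by skrub (the file name is chosen here only for illustration):

import skrub
from skrub import TableReport

report = TableReport(df)
report.write_html("employee_salaries_report.html")  # standalone HTML report

# Or display every dataframe in the notebook as a TableReport
skrub.patch_display()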

Try it on your dataset →

Tune arbitrary data wrangling

Inspect it, apply it to new data

Chain an arbitrary set of operations to prepare, transform, and assemble multiple tables for machine learning, then tune the full pipeline, inspect it, or apply it to new data.

Works with any computational or dataframe engine.

Discover the skrub expressions →
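
Before the full example below, here is the core idea in isolation, following the pattern of the skrub expressions tutorial: skrub.var declares named inputs, operations on them build a lazy computation graph, and eval() runs that graph on the example values or on new bindings.

import skrub

# Named inputs with example values
a = skrub.var("a", 10)
b = skrub.var("b", 5)

# Operations build a lazy expression instead of computing immediately
c = a + b

c.skb.eval()                  # evaluate with the example values -> 15
c.skb.eval({"a": 2, "b": 3})  # evaluate with new bindings -> 5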

Given two input dataframes products_df and baskets_df:
# A dataset with multiple tables
import skrub
from skrub.datasets import fetch_credit_fraud
# use the test set to have smaller data
dataset = fetch_credit_fraud(split="test")

# Extract simplified tables:
# keep only the columns central to the analysis
products_df = dataset.products[["basket_ID", "cash_price", "Nbr_of_prod_purchas"]]
# Rename the ID column to "basket_ID"
baskets_df = dataset.baskets.rename(columns={"ID": "basket_ID"})
products_df:
basket_ID | cash_price | Nbr_of_prod_purchas
85517 | 889 | 1
83008 | 1399 | 1
... | ... | ...
95939 | 1149 | 1
95939 | 7 | 1

baskets_df:
basket_ID | fraud_flag
85517 | 0
83008 | 0
... | ...
97639 | 0
95939 | 0
# Define the inputs of our skrub pipeline
products = skrub.var("products", products_df)
baskets = skrub.var("baskets", baskets_df)

# Specify our "X" and "y" variables for machine learning
basket_IDs = baskets[["basket_ID"]].skb.mark_as_X()
fraud_flags = baskets["fraud_flag"].skb.mark_as_y()

# A pandas-based data-preparation pipeline that merges the tables
aggregated_products = products.groupby("basket_ID").agg(
    skrub.choose_from(("mean", "max", "count"))).reset_index()
features = basket_IDs.merge(aggregated_products, on="basket_ID")
from sklearn.ensemble import ExtraTreesClassifier
predictions = features.skb.apply(ExtraTreesClassifier(), y=fraud_flags)

# Now use skrub to tune hyperparameters of the above pipeline
search = predictions.skb.get_grid_search(fitted=True, scoring="roc_auc")
search.plot_results()

Our Community

The skrub project is powered by the efforts of a worldwide community of contributors.

Try it yourself!

Ready to write less code and get more insights? Dive into skrub now and be part of an emerging community!