Machine learning with dataframes

skrub is a Python library that eases preprocessing and feature engineering for tabular machine learning. It connects database tables directly to machine learning.

Effortless Pipelines

Create strong scikit-learn pipeline baselines effortlessly with TableVectorizer and tabular_learner.

Given a complex dataframe df:
# Load a dataset whose columns mix text, categories, and dates
from skrub.datasets import fetch_employee_salaries
dataset = fetch_employee_salaries()
df = dataset.X
y = dataset.y
df
     | gender | department | department_name | division | assignment_category | employee_position_title | date_first_hired | year_first_hired
0    | F | POL | Department of Police | MSB Information Mgmt and... | Fulltime-Regular | Office Services Coordinator | 09/22/1986 | 1986
1    | M | POL | Department of Police | ISB Major Crimes... | Fulltime-Regular | Master Police Officer | 09/12/1988 | 1988
...  | ... | ... | ... | ... | ... | ... | ... | ...
9226 | M | CCL | County Council | Council Central Staff | Fulltime-Regular | Manager II | 09/05/2006 | 2006
9227 | M | DLC | Department of Liquor Control | Licensure, Regulation... | Fulltime-Regular | Alcohol/Tobacco Enforcement Specialist II | 01/30/2012 | 2012
from sklearn.model_selection import cross_val_score
from skrub import tabular_learner
cross_val_score(tabular_learner('regressor'), df, y)
array([0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666])
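
TableVectorizer, mentioned above, can also be used on its own inside a standard scikit-learn pipeline. A minimal sketch; the HistGradientBoostingRegressor is just one reasonable choice of estimator, not part of the original example:

from sklearn.pipeline import make_pipeline
from sklearn.ensemble import HistGradientBoostingRegressor
from skrub import TableVectorizer

# TableVectorizer turns the heterogeneous dataframe (text, categories,
# dates) into numeric features that any scikit-learn estimator can use.
pipeline = make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())
pipeline.fit(df, y)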

Powerful Feature Engineering

Encode text and high cardinality categorical data with the GapEncoder and MinHashEncoder, and extract features from dates with the DatetimeEncoder.

import matplotlib.pyplot as plt
from skrub import GapEncoder

# Learn latent topics from the raw job titles and visualize the encoding
gap = GapEncoder().fit(df["employee_position_title"])
encoded_labels = gap.transform(
    df["employee_position_title"].head()
)
plt.imshow(encoded_labels)
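
The other encoders mentioned above follow the same scikit-learn fit/transform interface. A minimal sketch, assuming date_first_hired is first parsed to a datetime column; exact output columns depend on the skrub version:

import pandas as pd
from skrub import MinHashEncoder, DatetimeEncoder

# MinHashEncoder: fast, stateless hashing of high-cardinality strings
hashed_titles = MinHashEncoder(n_components=10).fit_transform(
    df["employee_position_title"]
)

# DatetimeEncoder: extract numeric features (year, month, day, ...)
# from a datetime column
hire_dates = pd.to_datetime(df["date_first_hired"])
date_features = DatetimeEncoder().fit_transform(hire_dates)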

Interactive Data Exploration

Explore your dataframes interactively with TableReport.

from skrub import TableReport
TableReport(df)
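
Beyond inline display, a report can be saved as a standalone HTML file, or skrub can be asked to render every dataframe as a TableReport. A short sketch using the write_html and patch_display helpers documented by skrub (the file name is chosen here only for illustration):

import skrub
from skrub import TableReport

report = TableReport(df)
report.write_html("employee_salaries_report.html")  # standalone HTML report

# Or display every dataframe in the notebook as a TableReport
skrub.patch_display()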

Try it on your dataset →

Tune arbitrary data wrangling

Inspect it, apply it to new data

Chain an arbitrary set of operations to prepare, transform, and assemble multiple tables for machine learning, then tune the full pipeline, inspect it, or apply it to new data.

Works with any computational or dataframe engine.

Discover the skrub expressions →
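
Before the full example below, here is the core idea in isolation, following the pattern of the skrub expressions tutorial: skrub.var declares named inputs, operations on them build a lazy computation graph, and eval() runs that graph on the example values or on new bindings.

import skrub

# Named inputs with example values
a = skrub.var("a", 10)
b = skrub.var("b", 5)

# Operations build a lazy expression instead of computing immediately
c = a + b

c.skb.eval()                  # evaluate with the example values -> 15
c.skb.eval({"a": 2, "b": 3})  # evaluate with new bindings -> 5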

Given two input dataframes products_df and baskets_df:
# A dataset with multiple tables
import skrub
from skrub.datasets import fetch_credit_fraud
# use the test set to have smaller data
dataset = fetch_credit_fraud(split="test")

# Extract simplified tables:
# keep only the columns central to the analysis
products_df = dataset.products[["basket_ID", "cash_price", "Nbr_of_prod_purchas"]]
# Rename the ID column to "basket_ID"
baskets_df = dataset.baskets.rename(columns={"ID": "basket_ID"})
products_df:
basket_ID | cash_price | Nbr_of_prod_purchas
85517 | 889 | 1
83008 | 1399 | 1
... | ... | ...
95939 | 1149 | 1
95939 | 7 | 1

baskets_df:
basket_ID | fraud_flag
85517 | 0
83008 | 0
... | ...
97639 | 0
95939 | 0
# Define the inputs of our skrub pipeline
products = skrub.var("products", products_df)
baskets = skrub.var("baskets", baskets_df)

# Specify our "X" and "y" variables for machine learning
basket_IDs = baskets[["basket_ID"]].skb.mark_as_X()
fraud_flags = baskets["fraud_flag"].skb.mark_as_y()

# A pandas-based data-preparation pipeline that merges the tables
aggregated_products = products.groupby("basket_ID").agg(
    skrub.choose_from(("mean", "max", "count"))).reset_index()
features = basket_IDs.merge(aggregated_products, on="basket_ID")
from sklearn.ensemble import ExtraTreesClassifier
predictions = features.skb.apply(ExtraTreesClassifier(), y=fraud_flags)

# Now use skrub to tune hyperparameters of the above pipeline
search = predictions.skb.get_grid_search(fitted=True, scoring="roc_auc")
search.plot_results()

Our Community

The skrub project is powered by the efforts of a worldwide community of contributors.

Try it yourself!

Ready to write less code and get more insights? Dive into skrub now and be part of an emerging community!