skrub

Prepping tables for machine learning

  • Built for scikit-learn and Python
  • Robust to dirty data
  • Easy learning on pandas dataframes

Less data wrangling, more machine learning

tabular_learner: easily create tabular-learning pipelines that wrangle complex dataframes.

Given a complex dataframe df:
>>> from skrub.datasets import fetch_employee_salaries
>>> dataset = fetch_employee_salaries()
>>> df = dataset.X
>>> y = dataset.y
>>> df
      gender  department  department_name               division                     assignment_category  employee_position_title                    date_first_hired  year_first_hired
0     F       POL         Department of Police          MSB Information Mgmt and...  Fulltime-Regular     Office Services Coordinator                09/22/1986        1986
1     M       POL         Department of Police          ISB Major Crimes...          Fulltime-Regular     Master Police Officer                      09/12/1988        1988
...   ...     ...         ...                           ...                          ...                  ...                                        ...               ...
9226  M       CCL         County Council                Council Central Staff        Fulltime-Regular     Manager II                                 09/05/2006        2006
9227  M       DLC         Department of Liquor Control  Licensure, Regulation...     Fulltime-Regular     Alcohol/Tobacco Enforcement Specialist II  01/30/2012        2012
>>> from sklearn.model_selection import cross_val_score
>>> from skrub import tabular_learner
>>> cross_val_score(tabular_learner('regressor'), df, y)
array([0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666])
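
The array above holds one score per cross-validation fold. With cross_val_score left at its default scoring, a scikit-learn regressor pipeline is scored with R², so higher is better. As a minimal sketch of how these numbers might be summarised, the snippet below copies the five scores from the output above and reports their mean and spread using only the standard library. The variable names (fold_scores, mean_score, spread) are illustrative and not part of skrub.

```python
from statistics import mean, stdev

# Per-fold scores copied verbatim from the cross_val_score output above.
fold_scores = [0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666]

# Average score across the five folds.
mean_score = mean(fold_scores)

# Sample standard deviation, a rough indication of fold-to-fold variability.
spread = stdev(fold_scores)

print(f"mean score: {mean_score:.3f} (+/- {spread:.3f})")
```

Reporting the mean together with the spread gives a quick sense of whether the pipeline performs consistently across folds or depends heavily on the particular split.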