Assembling
fuzzy_join()
: Joining
tables on non-normalized categories with approximate matching. Example
Joiner
,
AggJoiner
: transformers for joining multiple tables together. Example
Encoding
TableVectorizer
:
turn a pandas dataframe into a numerical array for
machine learning. Example
GapEncoder
:
OneHotEncoder but robust to typos or non-normalized categories. Example
Cleaning
deduplicate()
: merge
categories of similar morphology (spelling). Example
Less data wrangling, more machine learning
tabular_learner
:
easily create tabular-learning pipelines that wrangle complex dataframes.
Given, a complex dataframe
df
: (expand for full code)
>>> from skrub.datasets import fetch_employee_salaries
>>> dataset = fetch_employee_salaries()
>>> df = dataset.X
>>> y = dataset.y
>>> df
>>> from sklearn.model_selection import cross_val_score
>>> from skrub import tabular_learner
>>> cross_val_score(tabular_learner('regressor'), df, y)
array([0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666])