API reference
This page lists all available functions and classes of skrub.
Joining tables
Join two tables based on approximate matching using the appropriate similarity metric.
Augment a main table by fuzzy joining an auxiliary table to it.
Aggregate auxiliary dataframes before joining them on a base dataframe.
Aggregate a target
Join with a table augmented by machine-learning predictions.
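To make the idea of joining on approximate matches concrete, here is a minimal sketch using only the Python standard library. It is not skrub's implementation: the helper names `similarity` and `fuzzy_left_join`, the `threshold` parameter, and the use of `difflib.SequenceMatcher` as the similarity metric are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a case-insensitive similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_left_join(main, aux, on, threshold=0.6):
    """Attach to each row of `main` the best-matching row of `aux` on column `on`.

    Tables are lists of dicts (a stand-in for dataframes). Rows of `main`
    whose best match scores below `threshold` are kept without auxiliary columns.
    """
    joined = []
    for row in main:
        best = max(aux, key=lambda cand: similarity(row[on], cand[on]), default=None)
        merged = dict(row)
        if best is not None and similarity(row[on], best[on]) >= threshold:
            # Prefix auxiliary columns so they never overwrite main-table columns.
            merged.update({f"aux_{k}": v for k, v in best.items()})
        joined.append(merged)
    return joined


main = [{"country": "Untied States"}, {"country": "Frnace"}, {"country": "Atlantis"}]
aux = [
    {"country": "United States", "capital": "Washington"},
    {"country": "France", "capital": "Paris"},
]
result = fuzzy_left_join(main, aux, on="country")
```

In this sketch the two misspelled countries pick up a capital, while "Atlantis" has no sufficiently close match and is left unchanged.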
Column selection in a pipeline
Select a subset of a DataFrame's columns.
Drop a subset of a DataFrame's columns.
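The two operations above can be sketched on a plain dict of columns, used here as a stand-in for a DataFrame. The function names `select_cols` and `drop_cols` are hypothetical helpers for this illustration, not the library's transformers.

```python
def select_cols(table, cols):
    """Return a new table holding only `cols`, in the requested order."""
    missing = [c for c in cols if c not in table]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return {c: table[c] for c in cols}


def drop_cols(table, cols):
    """Return a new table without `cols`, preserving the original column order."""
    to_drop = set(cols)
    return {c: v for c, v in table.items() if c not in to_drop}


table = {"name": ["ada", "bob"], "age": [36, 41], "city": ["Paris", "Lyon"]}
selected = select_cols(table, ["city", "name"])
dropped = drop_cols(table, ["age"])
```

Wrapping such steps as transformers is what lets column selection sit inside a pipeline alongside encoders and estimators.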
Vectorizing a dataframe
Automatically transform a heterogeneous dataframe to a numerical array.
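As a rough illustration of what turning a heterogeneous table into a numerical array involves, the sketch below passes numeric columns through and one-hot encodes everything else. This is a deliberately simplified assumption: the hypothetical `vectorize` helper ignores dates, high-cardinality strings, and missing values, all of which a real vectorizer has to handle.

```python
def vectorize(table):
    """Convert a dict of columns into (feature_names, rows of floats).

    Numeric columns are passed through as floats; any other column is
    one-hot encoded with one output feature per distinct value.
    """
    n_rows = len(next(iter(table.values()), []))
    names, columns = [], []
    for col, values in table.items():
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if is_numeric:
            names.append(col)
            columns.append([float(v) for v in values])
        else:
            # Sort the categories so the output feature order is deterministic.
            for cat in sorted(set(map(str, values))):
                names.append(f"{col}_{cat}")
                columns.append([1.0 if str(v) == cat else 0.0 for v in values])
    rows = [[column[i] for column in columns] for i in range(n_rows)]
    return names, rows


names, rows = vectorize({"age": [36, 41], "city": ["Paris", "Lyon"]})
```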
Dirty category encoders
Constructs latent topics with continuous encoding.
Encode string categorical features by applying the MinHash method to n-gram decompositions of strings.
Encode string categories to a similarity matrix, to capture fuzziness across a few categories.
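The MinHash idea mentioned above can be sketched with the standard library: split a string into character n-grams, hash every n-gram with several differently seeded hash functions, and keep the minimum per function. The helper names, the choice of SHA-256 with a seed prefix, and the space padding are assumptions of this sketch, not a description of the library's encoder.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the set of character n-grams of `text`, lowercased and space-padded."""
    padded = f" {text.lower()} "
    if len(padded) < n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def minhash_encode(text, n_components=8, n=3):
    """Encode `text` as `n_components` integers: one minimum hash per seeded hash."""
    grams = char_ngrams(text, n)
    signature = []
    for seed in range(n_components):
        hashes = (
            int.from_bytes(
                hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()[:8], "big"
            )
            for gram in grams
        )
        signature.append(min(hashes))
    return signature


def signature_overlap(sig_a, sig_b):
    """Count the components on which two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b))


sig_accountant = minhash_encode("accountant", n_components=128)
sig_accountants = minhash_encode("accountants", n_components=128)
sig_zebra = minhash_encode("zebra", n_components=128)
```

Strings that share many n-grams agree on many signature components, so "accountant" and "accountants" overlap far more than "accountant" and "zebra", which share no 3-gram at all.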
Dealing with dates
Transforms each datetime column into several numeric columns for temporal features (e.g. year, month, day).
Convert the columns of a dataframe or 2d array into a datetime representation.
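A minimal sketch of both steps, parsing strings into datetimes and then expanding each datetime into numeric features, using only the standard library. The helper names, the fixed input format, and the particular set of extracted features are assumptions chosen for illustration.

```python
from datetime import datetime


def parse_datetimes(values, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a list of strings into datetime objects using a single format."""
    return [datetime.strptime(value, fmt) for value in values]


def encode_datetimes(datetimes, name):
    """Expand a datetime column into several numeric feature columns."""
    extractors = {
        "year": lambda dt: dt.year,
        "month": lambda dt: dt.month,
        "day": lambda dt: dt.day,
        "hour": lambda dt: dt.hour,
        "weekday": lambda dt: dt.isoweekday(),  # Monday is 1, Sunday is 7
    }
    return {
        f"{name}_{part}": [extract(dt) for dt in datetimes]
        for part, extract in extractors.items()
    }


logins = parse_datetimes(["2024-02-29 08:30:00", "2023-12-31 23:59:00"])
features = encode_datetimes(logins, "login")
```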
Deduplication: merging variants of the same entry
Deduplicate categorical data by hierarchically clustering similar strings.
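One simple instance of this idea is single-linkage clustering cut at a fixed distance, which is equivalent to merging every pair of strings closer than that distance. The sketch below does this with a small union-find structure and then maps each string to the most frequent spelling in its cluster. The distance (one minus `difflib`'s similarity ratio), the `max_distance` parameter, and the linkage choice are assumptions; the library's own distance, linkage, and choice of cluster count may differ.

```python
from collections import Counter
from difflib import SequenceMatcher


def deduplicate_strings(values, max_distance=0.25):
    """Replace each string by the most frequent spelling in its cluster."""
    counts = Counter(values)
    unique = list(counts)
    parent = {u: u for u in unique}

    def find(x):
        # Follow parent links to the cluster root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Single linkage: merge any two strings whose distance is within the cutoff.
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            if 1 - SequenceMatcher(None, a, b).ratio() <= max_distance:
                parent[find(a)] = find(b)

    clusters = {}
    for u in unique:
        clusters.setdefault(find(u), []).append(u)

    canonical = {}
    for members in clusters.values():
        best = max(members, key=lambda m: counts[m])
        for m in members:
            canonical[m] = best
    return [canonical[v] for v in values]


cities = ["london", "london", "londn", "paris", "pariss", "paris", "london"]
cleaned = deduplicate_strings(cities)
```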
Data download and generation
Fetches the employee salaries dataset (regression), available at https://openml.org/d/42125
Fetches the medical charge dataset (regression), available at https://openml.org/d/42720
Fetches the midwest survey dataset (classification), available at https://openml.org/d/42805
Fetches the open payments dataset (classification), available at https://openml.org/d/42738
Fetches the road safety dataset (classification), available at https://openml.org/d/42803
Fetches the traffic violations dataset (classification), available at https://openml.org/d/42132
Fetches the drug directory dataset (classification), available at https://openml.org/d/43044
Fetches a dataset of an indicator from the World Bank open data platform.
Fetches a dataset from Movielens.
Get the supported aliases of embedded KEN entities tables.
Helper function to search for KEN entity types.
Download Wikipedia embeddings by type.
Duplicates examples with spelling mistakes.
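To illustrate the last entry, generating duplicated examples with spelling mistakes, here is a small sketch that copies each entry several times and substitutes random letters. The function names, parameters, and the substitution-only typo model are assumptions for illustration, not the library's generator.

```python
import random
import string


def add_typos(word, n_typos, rng):
    """Return `word` with `n_typos` random single-character substitutions.

    A substituted character may coincide with the original one, so the
    result differs from `word` in at most `n_typos` positions.
    """
    chars = list(word)
    if not chars:
        return word
    for _ in range(n_typos):
        position = rng.randrange(len(chars))
        chars[position] = rng.choice(string.ascii_lowercase)
    return "".join(chars)


def make_noisy_duplicates(entries, n_copies=3, n_typos=1, seed=0):
    """Return each entry followed by `n_copies` misspelled variants of it."""
    rng = random.Random(seed)  # seeded so the generated data is reproducible
    data = []
    for entry in entries:
        data.append(entry)
        data.extend(add_typos(entry, n_typos, rng) for _ in range(n_copies))
    return data


noisy = make_noisy_duplicates(["london", "paris"], n_copies=3, n_typos=1, seed=0)
```

Data generated this way is convenient for checking a deduplication routine, since the clean spelling behind every noisy variant is known.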