API reference

This page lists all available functions and classes of skrub.

Joining tables

fuzzy_join

Fuzzy (approximate) join.
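
As a rough illustration of what an approximate join does, the sketch below pairs rows whose keys are spelled slightly differently. It uses only Python's standard library (`difflib`) and is not skrub's implementation; the `fuzzy_match_rows` helper, the sample tables, and the `cutoff` value are all hypothetical.

```python
from difflib import get_close_matches


def fuzzy_match_rows(left, right, key, cutoff=0.6):
    """Pair each row of `left` with the closest row of `right` on `key`.

    Illustrative helper only: rows are plain dicts, and closeness is
    difflib's similarity ratio, not the matching that skrub uses.
    """
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        # Best candidate at or above the cutoff, or an empty list.
        match = get_close_matches(row[key], list(right_by_key), n=1, cutoff=cutoff)
        extra = right_by_key[match[0]] if match else {}
        # Columns from `right` get a suffix so they do not overwrite `left`.
        merged = dict(row)
        merged.update({f"{name}_right": value for name, value in extra.items()})
        joined.append(merged)
    return joined


left = [{"country": "Frnace", "gdp": 1}, {"country": "Germany", "gdp": 2}]
right = [
    {"country": "France", "capital": "Paris"},
    {"country": "Germany", "capital": "Berlin"},
]
result = fuzzy_match_rows(left, right, key="country")
```

Rows with no candidate above the cutoff are kept without any added columns, which mirrors the behaviour of a left join.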

Joiner

Augment features in a main table by fuzzy-joining an auxiliary table to it.

AggJoiner

Aggregate auxiliary dataframes before joining them on a base dataframe.
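
The aggregate-then-join idea can be sketched with plain Python dictionaries. This is a conceptual example under stated assumptions, not skrub's API: the `aggregate_then_join` helper, the column names, and the choice of the mean as the aggregation are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate_then_join(base, aux, key, value):
    """Average `value` per `key` in `aux`, then left-join onto `base`."""
    grouped = defaultdict(list)
    for row in aux:
        grouped[row[key]].append(row[value])
    # One aggregated value per key, so the join cannot multiply base rows.
    means = {k: mean(values) for k, values in grouped.items()}
    # Keys missing from `aux` get None, as in a left join.
    return [{**row, f"{value}_mean": means.get(row[key])} for row in base]


base = [{"user": "a"}, {"user": "b"}, {"user": "c"}]
aux = [
    {"user": "a", "rating": 4},
    {"user": "a", "rating": 2},
    {"user": "b", "rating": 5},
]
joined = aggregate_then_join(base, aux, key="user", value="rating")
```

Aggregating before joining keeps one row per entry of the base table even when the auxiliary table has several rows per key.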

AggTarget

Aggregate a target y before joining its aggregation on a base dataframe.

InterpolationJoiner

Join with a table augmented by machine-learning predictions.

Column selection in a pipeline

SelectCols

Select a subset of a DataFrame's columns.

DropCols

Drop a subset of a DataFrame's columns.

Vectorizing a dataframe

TableVectorizer

Automatically transform a heterogeneous dataframe to a numerical array.

Dirty category encoders

GapEncoder

Constructs latent topics with continuous encoding.

MinHashEncoder

Encode string categorical features by applying the MinHash method to n-gram decompositions of strings.
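
A minimal sketch of MinHash over character n-grams, using only the standard library. It is meant to show the technique, not to reproduce skrub's encoder: the helper names, the padding with spaces, the salted `blake2b` hash family, and the default sizes are assumptions made for illustration.

```python
import hashlib


def char_ngrams(text, n=3):
    """Set of character n-grams of `text`, padded with one space per side."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def salted_hash(gram, seed):
    """One member of a hash family, selected by `seed` (illustrative choice)."""
    digest = hashlib.blake2b(
        gram.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash(text, n_components=8, n=3):
    """For each hash function, keep the minimum hash over the n-grams."""
    grams = char_ngrams(text, n)
    return [
        min((salted_hash(g, seed) for g in grams), default=0)
        for seed in range(n_components)
    ]


sig_a = minhash("london")
sig_b = minhash("londres")
# The fraction of equal components estimates the Jaccard similarity
# of the two n-gram sets.
estimate = sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Strings that share many n-grams tend to share signature components, so similar spellings receive similar encodings without fitting a vocabulary.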

SimilarityEncoder

Encode string categories to a similarity matrix, to capture fuzziness across a few categories.
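
The sketch below shows one way to turn strings into a similarity matrix against a few reference categories. It assumes Jaccard similarity over character 3-grams, which may differ from the similarity skrub computes; `ngrams`, `jaccard`, and `similarity_encode` are hypothetical helpers.

```python
def ngrams(text, n=3):
    """Set of character n-grams of `text`, padded with one space per side."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Share of n-grams that two strings have in common."""
    grams_a, grams_b = ngrams(a), ngrams(b)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0


def similarity_encode(values, categories):
    """One row per value, one column per reference category."""
    return [[jaccard(value, category) for category in categories] for value in values]


categories = ["police officer", "firefighter"]
encoded = similarity_encode(["police oficer", "firefighter"], categories)
```

A misspelled value such as "police oficer" scores high in the "police officer" column and low elsewhere, instead of being treated as an unrelated category.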

Dealing with dates

DatetimeEncoder

Transforms each datetime column into several numeric columns for temporal features (e.g., year, month, day).
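
To make the idea concrete, here is a minimal standard-library sketch that expands one datetime value into numeric features. The `encode_datetime` helper and the exact set of feature names are assumptions for illustration and do not necessarily match the columns skrub produces.

```python
from datetime import datetime


def encode_datetime(value):
    """Split a datetime into numeric features (illustrative feature set)."""
    return {
        "year": value.year,
        "month": value.month,
        "day": value.day,
        "hour": value.hour,
        # ISO convention: Monday is 1, Sunday is 7.
        "weekday": value.isoweekday(),
    }


features = encode_datetime(datetime(2024, 3, 15, 9, 30))
```

Each resulting column is a plain number, so downstream estimators can pick up seasonal or weekly patterns directly.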

to_datetime

Convert the columns of a dataframe or 2D array into a datetime representation.

Deduplication: merging variants of the same entry

deduplicate

Deduplicate categorical data by hierarchically clustering similar strings.
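
The following sketch illustrates deduplication by hierarchical clustering on a tiny example. It is written under several assumptions that may differ from skrub's approach: distances come from `difflib.SequenceMatcher`, clusters are merged with single linkage, merging stops at a fixed `threshold`, and each cluster is replaced by its most frequent spelling. All helper names are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def distance(a, b):
    """Dissimilarity in [0, 1] derived from difflib's similarity ratio."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def cluster_strings(values, threshold=0.2):
    """Single-linkage agglomerative clustering, stopped at `threshold`."""
    clusters = [[value] for value in sorted(set(values))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # the closest clusters are too far apart to merge
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters


def deduplicate_sketch(values, threshold=0.2):
    """Replace each value by the most frequent spelling in its cluster."""
    counts = Counter(values)
    mapping = {}
    for cluster in cluster_strings(values, threshold):
        canonical = max(cluster, key=lambda value: counts[value])
        for value in cluster:
            mapping[value] = canonical
    return [mapping[value] for value in values]


raw = ["online", "online", "onlin", "offline", "offline", "oflline"]
clean = deduplicate_sketch(raw)
```

The threshold controls how aggressive the merging is: a larger value would eventually merge "online" and "offline" as well, so it needs to be chosen with the data in mind.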

Data download and generation

datasets.fetch_employee_salaries

Fetches the employee salaries dataset (regression), available at https://openml.org/d/42125

datasets.fetch_medical_charge

Fetches the medical charge dataset (regression), available at https://openml.org/d/42720

datasets.fetch_midwest_survey

Fetches the Midwest survey dataset (classification), available at https://openml.org/d/42805

datasets.fetch_open_payments

Fetches the open payments dataset (classification), available at https://openml.org/d/42738

datasets.fetch_road_safety

Fetches the road safety dataset (classification), available at https://openml.org/d/42803

datasets.fetch_traffic_violations

Fetches the traffic violations dataset (classification), available at https://openml.org/d/42132

datasets.fetch_drug_directory

Fetches the drug directory dataset (classification), available at https://openml.org/d/43044

datasets.fetch_world_bank_indicator

Fetches a dataset of an indicator from the World Bank open data platform.

datasets.fetch_movielens

Fetches a dataset from MovieLens.

datasets.fetch_ken_table_aliases

Get the supported aliases of embedded KEN entities tables.

datasets.fetch_ken_types

Helper function to search for KEN entity types.

datasets.fetch_ken_embeddings

Download Wikipedia embeddings by type.

datasets.make_deduplication_data

Generates duplicated examples with spelling mistakes.