API reference

This page lists all available functions and classes of skrub.

Joining tables

fuzzy_join

Fuzzy (approximate) join.

Joiner

Augment features in a main table by fuzzy-joining an auxiliary table to it.

AggJoiner

Aggregate an auxiliary dataframe before joining it on a base dataframe.

MultiAggJoiner

Extension of the AggJoiner to multiple auxiliary tables.

AggTarget

Aggregate a target y before joining its aggregation on a base dataframe.

InterpolationJoiner

Join with a table augmented by machine-learning predictions.
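
To make the idea of a fuzzy (approximate) join concrete, here is a minimal sketch that uses only the Python standard library. It is not skrub's implementation and does not use skrub's API: the helper name `fuzzy_left_join`, its parameters, and the sample rows are hypothetical, and the similarity measure (`difflib`'s ratio) is an assumption chosen for illustration.

```python
import difflib


def fuzzy_left_join(left_rows, right_rows, left_key, right_key, cutoff=0.6):
    """Left-join two lists of dicts on approximately matching string keys.

    Hypothetical helper for illustration only; not part of skrub.
    """
    right_by_key = {row[right_key]: row for row in right_rows}
    candidates = list(right_by_key)
    joined = []
    for row in left_rows:
        # Best right-hand key by difflib similarity, if any reaches the cutoff.
        matches = difflib.get_close_matches(
            row[left_key], candidates, n=1, cutoff=cutoff
        )
        extra = right_by_key[matches[0]] if matches else {}
        # Keep every left row; add right-hand columns only when a match exists.
        joined.append(
            {**row, **{k: v for k, v in extra.items() if k != right_key}}
        )
    return joined


countries = [{"country": "Frnace"}, {"country": "Germany"}, {"country": "Atlantis"}]
capitals = [
    {"name": "France", "capital": "Paris"},
    {"name": "Germany", "capital": "Berlin"},
]
result = fuzzy_left_join(countries, capitals, "country", "name")
```

In this sketch the misspelled "Frnace" is still matched to "France", while "Atlantis" has no sufficiently similar key and keeps only its original column, as in a left join.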

Column selection in a pipeline

SelectCols

Select a subset of a DataFrame's columns.

DropCols

Drop a subset of a DataFrame's columns.

Vectorizing a dataframe

TableVectorizer

Transform a dataframe to a numerical (vectorized) representation.

tabular_learner

Get a simple machine-learning pipeline for tabular data.

Dirty category encoders

GapEncoder

Constructs latent topics with continuous encoding.

MinHashEncoder

Encode string categorical features by applying the MinHash method to n-gram decompositions of strings.

SimilarityEncoder

Encode string categories to a similarity matrix, to capture fuzziness across a few categories.

ToCategorical

Convert a string column to Categorical dtype.
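
The MinHash method applied to n-gram decompositions can be sketched with the standard library as follows. This is an illustration of the general technique under stated assumptions, not the code behind MinHashEncoder: the function names, the choice of salted BLAKE2b as the hash family, the space padding, and the default sizes are all hypothetical.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the set of character n-grams of text, padded with one space on each side."""
    padded = f" {text} "
    if len(padded) < n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def minhash_encode(text, n_components=64, n=3):
    """Encode a string as the minimum hash of its n-grams under several salted hashes."""
    grams = char_ngrams(text, n)
    signature = []
    for seed in range(n_components):
        # A different salt per component acts as a different hash function.
        salt = seed.to_bytes(8, "little")
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(
                        g.encode("utf-8"), digest_size=8, salt=salt
                    ).digest(),
                    "big",
                )
                for g in grams
            )
        )
    return signature


def signature_agreement(a, b):
    """Fraction of components on which two signatures agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


london = minhash_encode("London")
londons = minhash_encode("Londons")
paris = minhash_encode("Paris")
```

The fraction of agreeing components estimates the Jaccard similarity of the two n-gram sets, so strings that share many n-grams ("London" and "Londons") tend to get close encodings, while strings with no n-gram in common ("London" and "Paris") agree on essentially no component.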

Dealing with dates

DatetimeEncoder

Extract temporal features, such as the month or the day of the week, from a datetime column.

ToDatetime

Parse datetimes represented as strings and return Datetime columns.

to_datetime

Convert DataFrame or column to Datetime dtype.
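
As a rough illustration of the two steps described in this group (parsing datetimes stored as strings, then extracting temporal features), here is a standard-library sketch. The function names, the list of candidate formats, and the feature names are assumptions made for this example and do not reflect the output columns or parsing rules of skrub's transformers.

```python
from datetime import datetime


def parse_datetime(text, formats=("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%d/%m/%Y")):
    """Try each candidate format in turn and return the first successful parse."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"no candidate format matches {text!r}")


def datetime_features(dt):
    """Extract simple numeric features from a datetime (hypothetical feature names)."""
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "hour": dt.hour,
        "weekday": dt.isoweekday(),  # Monday is 1, Sunday is 7
    }


features = datetime_features(parse_datetime("2024-03-15"))
same_day = datetime_features(parse_datetime("15/03/2024"))
```

Both strings denote the same date, so they yield the same features once parsed; the numeric features can then be fed to a learner that cannot consume datetime values directly.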

Deduplication: merging variants of the same entry

deduplicate

Deduplicate categorical data by hierarchically clustering similar strings.
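
The sketch below illustrates deduplication by clustering similar strings, using only the standard library. It groups values whose `difflib` similarity reaches a threshold, which corresponds to cutting a single-linkage hierarchy at that level, and then maps each value to the most frequent spelling in its group. The function name, the similarity measure, the linkage, and the threshold are assumptions for illustration; the distance and linkage that skrub uses may differ.

```python
from collections import Counter
from difflib import SequenceMatcher


def deduplicate_sketch(values, threshold=0.8):
    """Map each value to the most frequent spelling in its similarity cluster."""
    unique = sorted(set(values))
    parent = {u: u for u in unique}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair of spellings that is similar enough (single linkage).
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                parent[find(a)] = find(b)

    counts = Counter(values)
    clusters = {}
    for u in unique:
        clusters.setdefault(find(u), []).append(u)

    canonical = {}
    for members in clusters.values():
        # Most frequent spelling wins; ties are broken alphabetically.
        best = max(members, key=lambda m: (counts[m], m))
        for m in members:
            canonical[m] = best
    return [canonical[v] for v in values]


cities = ["London", "London", "Londn", "Paris", "Pariss", "Paris"]
cleaned = deduplicate_sketch(cities)
```

Here the rare misspellings are replaced by the dominant spelling of their cluster, which is the kind of cleanup a deduplication step is meant to provide before encoding a categorical column.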

Data download and generation

datasets.fetch_employee_salaries

Fetches the employee salaries dataset (regression), available at https://openml.org/d/42125

datasets.fetch_medical_charge

Fetches the medical charge dataset (regression), available at https://openml.org/d/42720

datasets.fetch_midwest_survey

Fetches the midwest survey dataset (classification), available at https://openml.org/d/42805

datasets.fetch_open_payments

Fetches the open payments dataset (classification), available at https://openml.org/d/42738

datasets.fetch_road_safety

Fetches the road safety dataset (classification), available at https://openml.org/d/42803

datasets.fetch_traffic_violations

Fetches the traffic violations dataset (classification), available at https://openml.org/d/42132

datasets.fetch_drug_directory

Fetches the drug directory dataset (classification), available at https://openml.org/d/43044

datasets.fetch_world_bank_indicator

Fetches a dataset of an indicator from the World Bank open data platform.

datasets.fetch_movielens

Fetches a dataset from Movielens.

datasets.fetch_ken_table_aliases

Get the supported aliases of the embedded KEN entity tables.

datasets.fetch_ken_types

Helper function to search for KEN entity types.

datasets.fetch_ken_embeddings

Download Wikipedia embeddings by type.

datasets.make_deduplication_data

Generate duplicated examples with spelling mistakes.