API

This page lists all available functions and classes of skrub.

Joining dataframes

Joiner

Augment features in a main table by fuzzy-joining an auxiliary table to it.

AggJoiner

Aggregate an auxiliary dataframe before joining it on a base dataframe.

MultiAggJoiner

Extension of the AggJoiner to multiple auxiliary tables.

AggTarget

Aggregate a target y before joining its aggregation on a base dataframe.

InterpolationJoiner

Join with a table augmented by machine-learning predictions.

fuzzy_join

Fuzzy (approximate) join.
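As a rough illustration of what a fuzzy (approximate) join does, the sketch below matches each row of a main table to the most similar row of an auxiliary table, using only the Python standard library. It is not skrub's implementation: the helper names (`similarity`, `fuzzy_left_join`), the use of `difflib.SequenceMatcher` as the similarity measure, and the `threshold` value are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity score in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_left_join(main_rows, aux_rows, main_key, aux_key, threshold=0.6):
    """Left-join each main row with its most similar auxiliary row.

    Hypothetical helper: rows whose best match scores below `threshold`
    keep only their own columns.
    """
    joined = []
    for row in main_rows:
        # Score every auxiliary row against the current main row.
        scored = [(similarity(row[main_key], aux[aux_key]), aux) for aux in aux_rows]
        best_score, best_aux = max(scored, key=lambda pair: pair[0])
        merged = dict(row)
        if best_score >= threshold:
            merged.update(best_aux)
        joined.append(merged)
    return joined


# Toy data with misspelled keys in the main table.
main = [{"country": "Untied States"}, {"country": "Frnace"}, {"country": "Atlantis"}]
aux = [
    {"name": "United States", "capital": "Washington"},
    {"name": "France", "capital": "Paris"},
]
result = fuzzy_left_join(main, aux, main_key="country", aux_key="name")
for row in result:
    print(row)
```

In this toy data the two misspelled countries pick up a capital, while "Atlantis" has no sufficiently similar counterpart and is left without auxiliary columns.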

Encoding a column

GapEncoder

Encode string columns by constructing latent topics.

MinHashEncoder

Encode string categorical features by applying the MinHash method to n-gram decompositions of strings.

SimilarityEncoder

Encode string categories to a similarity matrix, to capture fuzziness across a few categories.

DatetimeEncoder

Extract temporal features, such as the month or the day of the week, from a datetime column.

ToCategorical

Convert a string column to Categorical dtype.

ToDatetime

Parse datetimes represented as strings and return Datetime columns.

to_datetime

Convert DataFrame or column to Datetime dtype.
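The datetime-related entries above cover two steps: parsing strings into datetimes, then extracting temporal features from them. A minimal standard-library sketch of both steps might look like the following. The function names, the list of candidate formats, and the chosen features are assumptions for this example and do not reflect how skrub implements these transformers.

```python
from datetime import datetime


def parse_datetime(text, formats=("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%d/%m/%Y")):
    """Try each candidate format in turn; return None if none of them fits."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def temporal_features(moment):
    """Extract a few numeric features from a datetime object."""
    return {
        "year": moment.year,
        "month": moment.month,
        "day": moment.day,
        "hour": moment.hour,
        "weekday": moment.isoweekday(),  # 1 = Monday ... 7 = Sunday
    }


# A toy column mixing several string representations of dates.
raw_column = ["2024-03-15 08:30:00", "2023-12-31", "01/02/2022"]
features = [temporal_features(parse_datetime(value)) for value in raw_column]
for row in features:
    print(row)
```

Values without a time component get an hour of 0 here; a real encoder has to decide whether such a feature is worth keeping.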

Deep learning

These encoders require additional dependencies built around torch. See the “deep learning dependencies” section in the Install guide for more details.

TextEncoder

Encode string features by applying a pretrained language model downloaded from the HuggingFace Hub.

Building a pipeline

TableVectorizer

Transform a dataframe to a numeric (vectorized) representation.

SelectCols

Select a subset of a DataFrame's columns.

DropCols

Drop a subset of a DataFrame's columns.

tabular_learner

Get a simple machine-learning pipeline for tabular data.

Generating an HTML report

TableReport

Summarize the contents of a dataframe.

patch_display

Replace the default DataFrame HTML displays with skrub.TableReport.

unpatch_display

Undo the effect of skrub.patch_display().

column_associations

Get measures of statistical associations between all pairs of columns.

Cleaning a dataframe

deduplicate

Deduplicate categorical data by hierarchically clustering similar strings.
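To give a feel for deduplication by clustering similar strings, here is a simplified standard-library sketch. It groups strings whose pairwise similarity exceeds a fixed threshold (equivalent to cutting a single-linkage clustering at a fixed distance) and maps each group to its most frequent spelling. The function name, the `difflib`-based similarity, and the threshold are assumptions for illustration; skrub's own `deduplicate` may differ in its distance measure and in how it chooses clusters.

```python
from collections import Counter
from difflib import SequenceMatcher


def deduplicate_strings(values, threshold=0.8):
    """Replace each string with the most frequent spelling in its group."""
    counts = Counter(values)
    unique = sorted(counts)
    parent = {word: word for word in unique}  # union-find forest

    def find(word):
        # Follow parent links to the group root, shortening the path on the way.
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    # Link every pair of spellings that is similar enough.
    for i, first in enumerate(unique):
        for second in unique[i + 1:]:
            if SequenceMatcher(None, first, second).ratio() >= threshold:
                parent[find(first)] = find(second)

    # Collect the members of each group.
    groups = {}
    for word in unique:
        groups.setdefault(find(word), []).append(word)

    # Pick the most frequent spelling of each group as its canonical form.
    canonical = {}
    for members in groups.values():
        best = max(members, key=lambda word: counts[word])
        for word in members:
            canonical[word] = best
    return [canonical[value] for value in values]


raw = ["Paris", "Pariss", "Paris", "London", "Londno", "London"]
cleaned = deduplicate_strings(raw)
print(cleaned)
```

Single linkage can chain loosely related strings together when the threshold is too low, so the threshold needs to be checked against real data.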

Downloading a dataset

fetch_employee_salaries

Fetches the employee salaries dataset (regression), available at https://openml.org/d/42125

fetch_medical_charge

Fetches the medical charge dataset (regression), available at https://openml.org/d/42720

fetch_midwest_survey

Fetches the midwest survey dataset (classification), available at https://openml.org/d/42805

fetch_open_payments

Fetches the open payments dataset (classification), available at https://openml.org/d/42738

fetch_road_safety

Fetches the road safety dataset (classification), available at https://openml.org/d/42803

fetch_traffic_violations

Fetches the traffic violations dataset (classification), available at https://openml.org/d/42132

fetch_drug_directory

Fetches the drug directory dataset (classification), available at https://openml.org/d/43044

fetch_world_bank_indicator

Fetches a dataset of an indicator from the World Bank open data platform.

fetch_movielens

Fetches a dataset from Movielens.

fetch_credit_fraud

Fetches the credit fraud dataset from figshare.

fetch_ken_table_aliases

Get the supported aliases of embedded KEN entities tables.

fetch_ken_types

Helper function to search for KEN entity types.

fetch_ken_embeddings

Download Wikipedia embeddings by type.

make_deduplication_data

Generates duplicated examples that contain spelling mistakes.