API

This page lists all available functions and classes of skrub.

Joining dataframes

Joiner

Augment features in a main table by fuzzy-joining an auxiliary table to it.

AggJoiner

Aggregate an auxiliary dataframe before joining it on a base dataframe.
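The aggregate-then-join idea can be sketched with the standard library alone. This is a conceptual illustration under stated assumptions, not skrub's implementation or its API: the tables, the column names, and the helper `aggregate_then_join` are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical base table: one row per customer.
base = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
    {"customer_id": 3, "name": "Alan"},
]

# Hypothetical auxiliary table: zero or more orders per customer.
orders = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 5.0},
]


def aggregate_then_join(base_rows, aux_rows, key, value, agg=mean):
    """Aggregate `value` per `key` in aux_rows, then left-join onto base_rows."""
    groups = defaultdict(list)
    for row in aux_rows:
        groups[row[key]].append(row[value])
    # After aggregation there is one value per key, so the join
    # cannot duplicate rows of the base table.
    aggregated = {k: agg(v) for k, v in groups.items()}
    out_name = f"{value}_{agg.__name__}"
    # Keys missing from the auxiliary table get None (left-join semantics).
    return [{**row, out_name: aggregated.get(row[key])} for row in base_rows]


joined = aggregate_then_join(base, orders, key="customer_id", value="amount")
```

Aggregating first is what keeps the base table's row count unchanged, which matters when the result feeds an estimator expecting one row per sample.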

MultiAggJoiner

Extension of the AggJoiner to multiple auxiliary tables.

AggTarget

Aggregate a target y before joining its aggregation on a base dataframe.

InterpolationJoiner

Join with a table augmented by machine-learning predictions.

fuzzy_join

Fuzzy (approximate) join.
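As a rough illustration of what an approximate join does, the sketch below matches each left row to its most similar right row and keeps the match only above a similarity threshold. It uses `difflib.SequenceMatcher` from the standard library as the similarity measure, which is an assumption made for this example and not necessarily what skrub uses; the helper `fuzzy_left_join`, the threshold, and the data are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical tables whose join keys do not match exactly.
left = [{"country": "France"}, {"country": "Germany"}, {"country": "Atlantis"}]
right = [
    {"country_name": "france", "capital": "Paris"},
    {"country_name": "Germnay", "capital": "Berlin"},
]


def similarity(a, b):
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_left_join(left_rows, right_rows, left_on, right_on, threshold=0.8):
    """Left-join each left row to its closest right row, if close enough."""
    joined = []
    for lrow in left_rows:
        best = max(right_rows, key=lambda r: similarity(lrow[left_on], r[right_on]))
        if similarity(lrow[left_on], best[right_on]) >= threshold:
            joined.append({**lrow, **best})
        else:
            # No acceptable match: keep the left row, fill right columns with None.
            joined.append({**lrow, **{k: None for k in best}})
    return joined


result = fuzzy_left_join(left, right, left_on="country", right_on="country_name")
```

The threshold plays the role of a precision/recall trade-off: raising it rejects more spurious matches at the cost of leaving more rows unmatched.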

Encoding a column

GapEncoder

Encode string columns by constructing latent topics.

MinHashEncoder

Encode string categorical features by applying the MinHash method to n-gram decompositions of strings.
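The MinHash idea on character n-grams can be sketched as follows. This is a minimal standard-library illustration, not skrub's implementation: the padding scheme, the choice of `hashlib.blake2b` as the hash family, and the helper names are assumptions made for the example.

```python
import hashlib


def ngrams(text, n=3):
    """Set of character n-grams of `text`, padded with one space on each side."""
    padded = f" {text} "
    if len(padded) < n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def seeded_hash(gram, seed):
    """64-bit hash of an n-gram; each seed gives a different hash function."""
    digest = hashlib.blake2b(
        gram.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_encode(text, n_components=16, n=3):
    """One component per hash function: the minimum hash over all n-grams."""
    grams = ngrams(text, n)
    return [min(seeded_hash(g, seed) for g in grams) for seed in range(n_components)]


def shared_fraction(u, v):
    """Fraction of equal components, an estimate of n-gram Jaccard similarity."""
    return sum(a == b for a, b in zip(u, v)) / len(u)


london = minhash_encode("london")
londres = minhash_encode("londres")
overlap = shared_fraction(london, londres)
```

Two strings agree on a given component exactly when the n-gram with the smallest hash is shared by both, so strings with many n-grams in common tend to agree on many components.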

SimilarityEncoder

Encode string categories to a similarity matrix, to capture fuzziness across a few categories.

DatetimeEncoder

Extract temporal features, such as month and day of the week, from a datetime column.
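To make the notion of temporal features concrete, here is a small standard-library sketch that turns one datetime value into several numeric features. The feature names and the helper `datetime_features` are hypothetical and chosen for illustration; the set of features actually produced by skrub may differ.

```python
from datetime import datetime


def datetime_features(value):
    """Split a datetime into numeric features usable by an estimator."""
    return {
        "year": value.year,
        "month": value.month,
        "day": value.day,
        "hour": value.hour,
        # ISO convention: Monday is 1, Sunday is 7.
        "weekday": value.isoweekday(),
    }


# Parse a string first, then extract the features.
parsed = datetime.fromisoformat("2024-03-15T08:30:00")
features = datetime_features(parsed)
```

Splitting a timestamp this way lets a model pick up periodic effects (for instance, weekday versus weekend) that a single raw timestamp would hide.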

ToCategorical

Convert a string column to Categorical dtype.

ToDatetime

Parse datetimes represented as strings and return Datetime columns.

StringEncoder

Generate a lightweight string encoding of a given column using tf-idf vectorization and truncated singular value decomposition (SVD).

to_datetime

Convert DataFrame or column to Datetime dtype.

Deep Learning

These encoders require additional dependencies related to torch. See the “deep learning dependencies” section of the Install guide for details.

TextEncoder

Encode string features by applying a pretrained language model downloaded from the HuggingFace Hub.

Building a pipeline

TableVectorizer

Transform a dataframe to a numeric (vectorized) representation.

SelectCols

Select a subset of a DataFrame's columns.

DropCols

Drop a subset of a DataFrame's columns.

tabular_learner

Get a simple machine-learning pipeline for tabular data.

Generating an HTML report

TableReport

Summarize the contents of a dataframe.

patch_display

Replace the default DataFrame HTML displays with skrub.TableReport.

unpatch_display

Undo the effect of skrub.patch_display().

column_associations

Get measures of statistical associations between all pairs of columns.

Cleaning a dataframe

deduplicate

Deduplicate categorical data by hierarchically clustering similar strings.
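The sketch below illustrates deduplication by clustering similar strings. It is a simplified stand-in, not skrub's algorithm: it assumes single-linkage clustering cut at a fixed distance (equivalent to taking connected components of the "close enough" graph), uses `difflib.SequenceMatcher` as the string distance, and replaces each string with the most frequent member of its cluster. The helper `dedup_sketch` and the `max_dist` value are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def dedup_sketch(values, max_dist=0.2):
    """Map each string to the most frequent spelling in its cluster."""
    counts = Counter(values)
    unique = list(counts)
    parent = {u: u for u in unique}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair of strings closer than max_dist (single linkage).
    for a, b in combinations(unique, 2):
        dist = 1.0 - SequenceMatcher(None, a, b).ratio()
        if dist <= max_dist:
            parent[find(a)] = find(b)

    clusters = {}
    for u in unique:
        clusters.setdefault(find(u), []).append(u)

    # The most frequent spelling is taken as the canonical one.
    canonical = {}
    for members in clusters.values():
        best = max(members, key=lambda m: counts[m])
        for m in members:
            canonical[m] = best
    return [canonical[v] for v in values]


raw = ["amsterdam", "amsterdam", "amsterdm", "rotterdam", "roterdam", "rotterdam"]
clean = dedup_sketch(raw)
```

Single linkage can chain unrelated strings together through intermediate misspellings, so the distance cut has to be chosen conservatively for real data.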

Downloading a dataset

fetch_bike_sharing

Fetch the bike sharing dataset (regression) available at skrub-data/skrub-data-files

fetch_country_happiness

Fetch the happiness index dataset (regression) available at skrub-data/skrub-data-files

fetch_credit_fraud

Fetch the credit fraud dataset (classification) available at skrub-data/skrub-data-files

fetch_drug_directory

Fetch the drug directory dataset (classification), available at skrub-data/skrub-data-files

fetch_employee_salaries

Fetch the employee salaries dataset (regression), available at skrub-data/skrub-data-files

fetch_flight_delays

Fetch the flight delays dataset (regression) available at skrub-data/skrub-data-files

fetch_ken_embeddings

Download Wikipedia embeddings by type.

fetch_ken_table_aliases

Get the supported aliases of embedded KEN entities tables.

fetch_ken_types

Helper function to search for KEN entity types.

fetch_medical_charge

Fetch the medical charge dataset (regression), available at skrub-data/skrub-data-files

fetch_midwest_survey

Fetch the midwest survey dataset (classification), available at skrub-data/skrub-data-files

fetch_movielens

Fetch the movielens dataset (regression) available at skrub-data/skrub-data-files

fetch_open_payments

Fetch the open payments dataset (classification), available at skrub-data/skrub-data-files

fetch_toxicity

Fetch the toxicity dataset (classification) available at skrub-data/skrub-data-files

fetch_traffic_violations

Fetch the traffic violations dataset (classification), available at skrub-data/skrub-data-files

fetch_videogame_sales

Fetch the videogame sales dataset (regression) available at skrub-data/skrub-data-files

get_data_dir

Return the directory in which skrub looks for data.

make_deduplication_data

Generate duplicated examples containing spelling mistakes.