API
This page lists all available functions and classes of skrub.
Joining dataframes
Augment features in a main table by fuzzy-joining an auxiliary table to it.
Aggregate an auxiliary dataframe before joining it on a base dataframe.
Extension of the
Aggregate a target y before joining its aggregation on a base dataframe.
Join with a table augmented by machine-learning predictions.
Fuzzy (approximate) join.
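To make the idea of a fuzzy (approximate) join concrete, here is a minimal standard-library sketch. It is not skrub's implementation: the function name `fuzzy_left_join`, its parameters, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration. Each row of the main table is matched to its most similar auxiliary row, and the match is kept only if the similarity reaches a threshold.

```python
from difflib import SequenceMatcher


def fuzzy_left_join(main_rows, aux_rows, main_key, aux_key, threshold=0.6):
    """Left-join each main row to its most similar auxiliary row.

    Hypothetical helper for illustration only; rows are plain dicts.
    """
    joined = []
    for row in main_rows:
        best_match, best_score = None, 0.0
        for candidate in aux_rows:
            # Similarity in [0, 1] between the two key strings.
            score = SequenceMatcher(
                None, row[main_key].lower(), candidate[aux_key].lower()
            ).ratio()
            if score > best_score:
                best_match, best_score = candidate, score
        merged = dict(row)
        # Keep the match only when it is similar enough; otherwise the
        # main row is returned unchanged, as in a left join.
        if best_match is not None and best_score >= threshold:
            merged.update(best_match)
        joined.append(merged)
    return joined


main = [
    {"id": 1, "country": "Untied States"},  # misspelled on purpose
    {"id": 2, "country": "Frnace"},         # misspelled on purpose
    {"id": 3, "country": "Zzz"},            # no plausible match
]
aux = [
    {"name": "United States", "capital": "Washington"},
    {"name": "France", "capital": "Paris"},
]
result = fuzzy_left_join(main, aux, main_key="country", aux_key="name")
```

The threshold trades recall for precision: a lower value matches more misspellings but also produces more spurious joins.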
Encoding a column
Encode string columns by constructing latent topics.
Encode string categorical features by applying the MinHash method to n-gram decompositions of strings.
Encode string categories to a similarity matrix, to capture fuzziness across a few categories.
Extract temporal features such as month, day of the week, … from a datetime column.
Convert a string column to Categorical dtype.
Parse datetimes represented as strings and return
Convert DataFrame or column to Datetime dtype.
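The MinHash encoding mentioned above can be sketched with the standard library alone. This is an illustration of the general technique under stated assumptions, not skrub's encoder: the helper names (`char_ngrams`, `minhash_encode`, `shared_components`), the space padding, and the choice of salted `blake2b` as the hash family are all hypothetical. Each string is split into character n-grams, and each output component is the minimum hash value over those n-grams for one hash function, so strings that share many n-grams tend to share many components.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the set of character n-grams of `text`, padded with spaces."""
    padded = f" {text} "
    if len(padded) < n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def minhash_encode(text, n_components=8, n=3):
    """Encode a string as `n_components` min-hash values over its n-grams."""
    grams = char_ngrams(text, n)
    encoding = []
    for seed in range(n_components):
        # A different salt per component gives a different hash function.
        salt = seed.to_bytes(8, "little")
        hashes = (
            int.from_bytes(
                hashlib.blake2b(
                    g.encode("utf-8"), digest_size=8, salt=salt
                ).digest(),
                "big",
            )
            for g in grams
        )
        encoding.append(min(hashes))
    return encoding


def shared_components(a, b, n_components=64):
    """Count the components on which two strings' encodings agree."""
    enc_a = minhash_encode(a, n_components)
    enc_b = minhash_encode(b, n_components)
    return sum(x == y for x, y in zip(enc_a, enc_b))
```

The fraction of shared components estimates the Jaccard similarity of the two n-gram sets, which is why the encoding places strings with overlapping substrings close to each other.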
Deep Learning
These encoders require installing additional dependencies around torch. See the “deep learning dependencies” section in the Install guide for more details.
Encode string features by applying a pretrained language model downloaded from the HuggingFace Hub.
Building a pipeline
Transform a dataframe to a numeric (vectorized) representation.
Select a subset of a DataFrame's columns.
Drop a subset of a DataFrame's columns.
Get a simple machine-learning pipeline for tabular data.
Generating an HTML report
Summarize the contents of a dataframe.
Replace the default DataFrame HTML displays with
Undo the effect of
Get measures of statistical associations between all pairs of columns.
Cleaning a dataframe
Deduplicate categorical data by hierarchically clustering similar strings.
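As a rough illustration of deduplication by clustering similar strings, the sketch below groups values with single-linkage clustering cut at a fixed similarity threshold, then replaces each value with the most frequent member of its group. It is a simplified stand-in, not skrub's algorithm: the function name `deduplicate_sketch`, the `difflib` similarity, the threshold, and the choice of linkage are all assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher


def deduplicate_sketch(values, threshold=0.7):
    """Map each string to the most frequent member of its cluster.

    Hypothetical helper: clusters are the connected components of the
    "similarity >= threshold" graph, which is what cutting a
    single-linkage dendrogram at that level yields.
    """
    unique = sorted(set(values))
    parent = {v: v for v in unique}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Merge every pair of strings that is similar enough.
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                parent[find(a)] = find(b)

    counts = Counter(values)
    clusters = {}
    for v in unique:
        clusters.setdefault(find(v), []).append(v)

    # The most frequent spelling in each cluster becomes the canonical one.
    canonical = {}
    for members in clusters.values():
        best = max(members, key=lambda m: (counts[m], m))
        for m in members:
            canonical[m] = best
    return [canonical[v] for v in values]


raw = ["black", "black", "blakc", "white", "white", "whiet"]
cleaned = deduplicate_sketch(raw)
```

Picking the most frequent spelling as the representative assumes that typos are rarer than the correct form, which usually holds for categorical data entered by hand.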
Downloading a dataset
Fetches the employee salaries dataset (regression), available at https://openml.org/d/42125
Fetches the medical charge dataset (regression), available at https://openml.org/d/42720
Fetches the midwest survey dataset (classification), available at https://openml.org/d/42805
Fetches the open payments dataset (classification), available at https://openml.org/d/42738
Fetches the road safety dataset (classification), available at https://openml.org/d/42803
Fetches the traffic violations dataset (classification), available at https://openml.org/d/42132
Fetches the drug directory dataset (classification), available at https://openml.org/d/43044
Fetches a dataset of an indicator from the World Bank open data platform.
Fetches a dataset from Movielens.
Fetch the credit fraud dataset from figshare.
Get the supported aliases of embedded KEN entities tables.
Helper function to search for KEN entity types.
Download Wikipedia embeddings by type.
Duplicates examples with spelling mistakes.