API#
This page lists all available functions and classes of skrub.
Joining dataframes#
Augment features in a main table by fuzzy-joining an auxiliary table to it. |
|
Aggregate an auxiliary dataframe before joining it on a base dataframe. |
|
Extension of the |
|
Aggregate a target y before joining its aggregation on a base dataframe. |
|
Join with a table augmented by machine-learning predictions. |
Fuzzy (approximate) join. |
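To make the idea of a fuzzy (approximate) join concrete, here is a minimal standard-library sketch. It is not skrub's implementation: the helper name `fuzzy_join_sketch`, the similarity measure (`difflib.SequenceMatcher`) and the `threshold` parameter are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_join_sketch(left, right, left_key, right_key, threshold=0.7):
    """Left-join two lists of dicts on approximately matching string keys.

    Hypothetical helper for illustration only; not part of skrub.
    """
    joined = []
    for row in left:
        best_match, best_score = None, 0.0
        for candidate in right:
            # Similarity in [0, 1] between the two lower-cased keys.
            score = SequenceMatcher(
                None, row[left_key].lower(), candidate[right_key].lower()
            ).ratio()
            if score > best_score:
                best_match, best_score = candidate, score
        merged = dict(row)
        # Only attach the auxiliary columns when the best match is close enough.
        if best_match is not None and best_score >= threshold:
            merged.update(best_match)
        joined.append(merged)
    return joined


main = [{"country": "Untied States"}, {"country": "Frnce"}, {"country": "Atlantis"}]
aux = [
    {"name": "United States", "capital": "Washington"},
    {"name": "France", "capital": "Paris"},
]
result = fuzzy_join_sketch(main, aux, "country", "name")
```

Rows of the main table whose best candidate falls below the threshold keep only their original columns, which mirrors the behaviour of a left join with missing matches.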
Encoding a column#
Encode string columns by constructing latent topics. |
|
Encode string categorical features by applying the MinHash method to n-gram decompositions of strings. |
|
Encode string categories to a similarity matrix, to capture fuzziness across a few categories. |
|
Extract temporal features such as month, day of the week, … from a datetime column. |
|
Convert a string column to Categorical dtype. |
|
Parse datetimes represented as strings and return |
|
Generate a lightweight string encoding of a given column using tf-idf vectorization and truncated singular value decomposition (SVD). |
Convert DataFrame or column to Datetime dtype. |
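As an illustration of the MinHash-on-n-grams idea mentioned above, the following standard-library sketch builds a fixed-size signature from the character 3-grams of a string. It is a simplified illustration, not skrub's encoder: the function names, the padding with spaces, the use of `hashlib.blake2b` and the default of 16 components are all assumptions.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a space-padded string."""
    padded = f" {text} "
    # Strings shorter than n yield a single (short) n-gram.
    return {padded[i:i + n] for i in range(max(len(padded) - n + 1, 1))}


def minhash_signature(text, n_components=16, n=3):
    """For each salted hash function, keep the minimum hash over all n-grams."""
    grams = char_ngrams(text, n)
    signature = []
    for seed in range(n_components):
        salt = seed.to_bytes(8, "little")  # one salt per hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                    "big",
                )
                for g in grams
            )
        )
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching components, an estimate of n-gram set overlap."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


sig_london = minhash_signature("London")
sig_londres = minhash_signature("Londres")
```

Strings that share many n-grams tend to agree on more signature components, which is what makes the signature usable as a numeric encoding of string categories.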
Deep Learning#
These encoders require installing additional dependencies around torch. See the “deep learning dependencies” section in the Install guide for more details.
Encode string features by applying a pretrained language model downloaded from the HuggingFace Hub. |
Building a pipeline#
Transform a dataframe to a numeric (vectorized) representation. |
|
Column-wise consistency checks and sanitization, e.g. of null values or dates. |
|
Select a subset of a DataFrame's columns. |
|
Drop a subset of a DataFrame's columns. |
|
Drop column if it is found to be uninformative according to various criteria. |
Get a simple machine-learning pipeline for tabular data. |
Skrub Expressions#
The tabular_learner provides a pre-defined, default pipeline for datasets that contain a simple table.
For more control, or to build pipelines for a wider range of datasets, use the skrub expressions.
Create an expression |
|
Construct a choice between False and True. |
|
Construct a choice of floating-point numbers from a numeric range. |
|
Construct a choice among several possible outcomes. |
|
Construct a choice of integers from a numeric range. |
|
Cross-validate a pipeline built from an expression. |
|
Wrap function calls in an expression |
|
Return the mode in which the expression is currently being evaluated. |
|
Construct a choice between a value and |
|
Create a skrub variable. |
|
Create a skrub variable and mark it as being |
|
Create a skrub variable and mark it as being |
Representation of a computation that can be used to build ML pipelines. |
Apply a scikit-learn estimator to a dataframe or numpy array. |
|
Get an independent clone of the expression. |
|
Concatenate dataframes vertically or horizontally. |
|
Cross-validate the expression. |
|
Describe the hyper-parameters extracted from choices in the expression. |
|
Get a text representation of the computation graph. |
|
Get an SVG string representing the computation graph. |
|
Drop some columns. |
|
Evaluate the expression. |
|
Freeze the result during pipeline fitting. |
|
Generate a full report of the expression's evaluation. |
|
Collect the values of the variables contained in the expression. |
|
Get a skrub pipeline for this expression. |
|
Find the best parameters with grid search. |
|
Find the best parameters with randomized search. |
|
Create a conditional expression. |
|
Mark this expression as being the |
|
Mark this expression as being the |
|
Select based on the value of an expression. |
|
Select a subset of columns. |
|
Give a description to this expression. |
|
Give a name to this expression. |
|
Split an environment into training and testing environments. |
A user-defined description or comment about the expression. |
|
Whether this expression has been marked with |
|
Whether this expression has been marked with |
|
A user-chosen name for the expression. |
|
Retrieve the estimator applied in the previous step, as an expression. |
Pipeline that evaluates a skrub expression. |
|
Pipeline that evaluates a skrub expression with hyperparameter tuning. |
Selecting columns in a DataFrame#
The skrub selectors provide a flexible way to specify the columns on which a
transformation should be applied. They are meant to be used for the cols
argument of Expr.skb.apply(), Expr.skb.select(), Expr.skb.drop(), SelectCols or DropCols.
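The kinds of name-based selection listed below (glob patterns, regular expressions, inversion) can be sketched with the standard library alone. The helpers here are hypothetical and operate on plain lists of column names; they are not the skrub selector objects, whose exact matching rules may differ.

```python
import re
from fnmatch import fnmatchcase


def select_glob(columns, pattern):
    """Keep column names matching a Unix shell style glob pattern."""
    return [c for c in columns if fnmatchcase(c, pattern)]


def select_regex(columns, pattern):
    """Keep column names that the regular expression matches from the start."""
    return [c for c in columns if re.match(pattern, c)]


def invert(columns, selected):
    """Keep the columns that were NOT selected, preserving their order."""
    return [c for c in columns if c not in selected]


columns = ["height_mm", "width_mm", "kind", "ID"]
millimetre_cols = select_glob(columns, "*_mm")
regex_cols = select_regex(columns, r".*_mm|kind")
other_cols = invert(columns, millimetre_cols)
```

Expressing selection as small composable operations is what lets a single cols argument describe "every column except the measurements" without listing names by hand.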
Select all columns. |
|
Select columns that have a Date or Datetime data type. |
|
Select columns that have a Boolean data type. |
|
Select columns whose cardinality (number of unique values) is (strictly) below |
|
Select columns that have a Categorical (or polars Enum) data type. |
|
Select columns by name. |
|
Select columns for which |
|
Select columns based on their name. |
|
Select columns that have a floating-point data type. |
|
Select columns by name with Unix shell style 'glob' pattern. |
|
Select columns that contain at least one null value. |
|
Select columns that have an integer data type. |
|
Invert a selector. |
|
Transform a selector, column name or list of column names into a selector. |
|
Select columns that have a numeric data type. |
|
Select columns by name with a regular expression. |
|
Apply a selector to a dataframe and return the resulting dataframe. |
|
Select columns that have a String data type. |
Generating an HTML report#
Summarize the contents of a dataframe. |
Replace the default DataFrame HTML displays with |
|
Undo the effect of |
|
Get measures of statistical associations between all pairs of columns. |
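The summary above does not say which association measure is computed. As an assumption for illustration, the sketch below uses Cramér's V, a standard measure of association between two categorical columns, and evaluates it for every pair of columns in a small table. The function names are hypothetical and the code is not skrub's implementation.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two equally long sequences of categorical values."""
    n = len(xs)
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    pair_counts = Counter(zip(xs, ys))
    chi2 = 0.0
    for x, x_count in x_counts.items():
        for y, y_count in y_counts.items():
            # Expected count under independence vs. observed count.
            expected = x_count * y_count / n
            observed = pair_counts.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    min_dim = min(len(x_counts), len(y_counts)) - 1
    if min_dim == 0:
        return 0.0  # a constant column carries no association
    return sqrt(chi2 / (n * min_dim))


def column_associations_sketch(table):
    """Map each unordered pair of column names to its Cramér's V."""
    names = list(table)
    return {
        (a, b): cramers_v(table[a], table[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


table = {
    "city": ["Paris", "Paris", "Rome", "Rome"],
    "country": ["France", "France", "Italy", "Italy"],
    "coin": ["heads", "tails", "heads", "tails"],
}
associations = column_associations_sketch(table)
```

A value near 1 indicates that one column almost determines the other, while a value near 0 indicates no detectable association.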
Cleaning a dataframe#
Deduplicate categorical data by hierarchically clustering similar strings. |
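The idea of deduplicating categories by clustering similar strings can be sketched as follows. This is a simplification under stated assumptions, not skrub's algorithm: it links strings whose `difflib` similarity reaches a hypothetical `threshold` (equivalent to cutting a single-linkage hierarchy at that level) and replaces each value with the most frequent spelling in its cluster.

```python
from collections import Counter
from difflib import SequenceMatcher


def deduplicate_sketch(values, threshold=0.8):
    """Replace each string with the most frequent member of its cluster."""
    unique = sorted(set(values))
    parent = {v: v for v in unique}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Merge the clusters of every pair of sufficiently similar strings.
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                parent[find(a)] = find(b)

    counts = Counter(values)
    clusters = {}
    for v in unique:
        clusters.setdefault(find(v), []).append(v)

    # The canonical spelling of a cluster is its most frequent member.
    canonical = {}
    for members in clusters.values():
        best = max(members, key=lambda m: counts[m])
        for m in members:
            canonical[m] = best
    return [canonical[v] for v in values]


raw = ["london", "londn", "paris", "london", "pariss", "paris"]
cleaned = deduplicate_sketch(raw)
```

Choosing the most frequent spelling as the representative assumes that misspellings are rarer than the correct form, which is usually reasonable for manually entered categories.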
Downloading a dataset#
Fetch the bike sharing dataset (regression) available at skrub-data/skrub-data-files |
|
Fetch the happiness index dataset (regression) available at skrub-data/skrub-data-files |
|
Fetch the credit fraud dataset (classification) available at skrub-data/skrub-data-files |
|
Fetch the drug directory dataset (classification) available at skrub-data/skrub-data-files |
|
Fetch the employee salaries dataset (regression) available at skrub-data/skrub-data-files |
|
Fetch the flight delays dataset (regression) available at skrub-data/skrub-data-files |
|
Download Wikipedia embeddings by type. |
|
Get the supported aliases of embedded KEN entities tables. |
|
Helper function to search for KEN entity types. |
|
Fetch the medical charge dataset (regression) available at skrub-data/skrub-data-files |
|
Fetch the midwest survey dataset (classification) available at skrub-data/skrub-data-files |
|
Fetch the movielens dataset (regression) available at skrub-data/skrub-data-files |
|
Fetch the open payments dataset (classification) available at skrub-data/skrub-data-files |
|
Fetch the toxicity dataset (classification) available at skrub-data/skrub-data-files |
|
Fetch the traffic violations dataset (classification) available at skrub-data/skrub-data-files |
|
Fetch the videogame sales dataset (regression) available at skrub-data/skrub-data-files |
|
Returns the directory in which skrub looks for data. |
|
Duplicates examples with spelling mistakes. |