API Reference

This is the class and function reference for skrub. For guidance on how to use these objects, refer to the full user guide: the raw specifications of classes and functions may not be enough on their own.

Building a pipeline

`tabular_pipeline`	Get a simple machine-learning pipeline for tabular data.
`tabular_learner`	Get a simple machine-learning pipeline for tabular data.
`TableVectorizer`	Transform a dataframe to a numeric (vectorized) representation.
`SelectCols`	Select a subset of a DataFrame's columns.
`DropCols`	Drop a subset of a DataFrame's columns.
`ApplyToCols`	Map a transformer to columns in a dataframe.
`ApplyToFrame`	Apply a transformer to part of a dataframe.

Encoding a column

`StringEncoder`	Generate a lightweight string encoding of a given column using tf-idf vectorization and truncated singular value decomposition (SVD).
`TextEncoder`	Encode string features by applying a pretrained language model downloaded from the HuggingFace Hub.
`MinHashEncoder`	Encode string categorical features by applying the MinHash method to n-gram decompositions of strings.
`GapEncoder`	Encode string columns by constructing latent topics.
`SimilarityEncoder`	Encode string categories to a similarity matrix, to capture fuzziness across a few categories.
`ToCategorical`	Convert a string column to Categorical dtype.
`DatetimeEncoder`	Extract temporal features such as month, day of the week, … from a datetime column.
`ToDatetime`	Parse datetimes represented as strings and return `Datetime` columns.
`to_datetime`	Convert DataFrame or column to Datetime dtype.

Exploring a dataframe

`TableReport`	Summarize the contents of a dataframe.
`patch_display`	Replace the default DataFrame HTML displays with `skrub.TableReport`.
`unpatch_display`	Undo the effect of `skrub.patch_display()`.
`column_associations`	Get measures of statistical associations between all pairs of columns.

Cleaning a dataframe

`SquashingScaler`	Perform robust centering and scaling followed by soft clipping.
`deduplicate`	Deduplicate categorical data by hierarchically clustering similar strings.
`Cleaner`	Column-wise consistency checks and sanitization of dtypes, null values and dates.
`DropUninformative`	Drop column if it is found to be uninformative according to various criteria.

Joining dataframes

`Joiner`	Augment features in a main table by fuzzy-joining an auxiliary table to it.
`AggJoiner`	Aggregate an auxiliary dataframe before joining it on a base dataframe.
`MultiAggJoiner`	Extension of the `AggJoiner` to multiple auxiliary tables.
`AggTarget`	Aggregate a target y before joining its aggregation on a base dataframe.
`InterpolationJoiner`	Join with a table augmented by machine-learning predictions.
`fuzzy_join`	Fuzzy (approximate) join.

Selectors

`selectors.all`	Select all columns.
`selectors.any_date`	Select columns that have a Date or Datetime data type.
`selectors.boolean`	Select columns that have a Boolean data type.
`selectors.cardinality_below`	Select columns whose cardinality (number of unique values) is (strictly) below `threshold`.
`selectors.categorical`	Select columns that have a Categorical (or polars Enum) data type.
`selectors.cols`	Select columns by name.
`selectors.filter`	Select columns for which `predicate` returns True.
`selectors.filter_names`	Select columns based on their name.
`selectors.float`	Select columns that have a floating-point data type.
`selectors.glob`	Select columns by name with Unix shell style 'glob' pattern.
`selectors.has_nulls`	Select columns that contain at least one null value.
`selectors.integer`	Select columns that have an integer data type.
`selectors.inv`	Invert a selector.
`selectors.make_selector`	Transform a selector, column name or list of column names into a selector.
`selectors.numeric`	Select columns that have a numeric data type.
`selectors.regex`	Select columns by name with a regular expression.
`selectors.select`	Apply a selector to a dataframe and return the resulting dataframe.
`selectors.string`	Select columns that have a String data type.

DataOps

`var`	Create a skrub variable.
`X`	Create a skrub variable and mark it as being `X`.
`y`	Create a skrub variable and mark it as being `y`.
`as_data_op`	Create a `DataOp` that evaluates to the given value.
`deferred`	Wrap function calls in a `DataOp`.

`DataOp`	Representation of a computation that can be used to build DataOps plans and learners.

`choose_bool`	A choice between `True` and `False`.
`choose_float`	A choice of floating-point numbers from a numeric range.
`choose_int`	A choice of integers from a numeric range.
`choose_from`	A choice among several possible outcomes.
`optional`	A choice between `value` and `None`.

`cross_validate`	Cross-validate a learner built from a DataOp.
`eval_mode`	Return the mode in which the DataOp is currently being evaluated.

`DataOp.skb.apply`	Apply a scikit-learn estimator to a dataframe or numpy array.
`DataOp.skb.apply_func`	Apply the given function.
`DataOp.skb.clone`	Get an independent clone of the DataOp.
`DataOp.skb.concat`	Concatenate dataframes vertically or horizontally.
`DataOp.skb.cross_validate`	Cross-validate the DataOp plan.
`DataOp.skb.describe_defaults`	Describe the hyper-parameters used by the default learner.
`DataOp.skb.describe_param_grid`	Describe the hyper-parameters extracted from choices in the DataOp.
`DataOp.skb.describe_steps`	Get a text representation of the computation graph.
`DataOp.skb.draw_graph`	Get an SVG string representing the computation graph.
`DataOp.skb.drop`	Drop some columns.
`DataOp.skb.eval`	Evaluate the DataOp.
`DataOp.skb.freeze_after_fit`	Freeze the result during learner fitting.
`DataOp.skb.full_report`	Generate a full report of the DataOp's evaluation.
`DataOp.skb.get_data`	Collect the values of the variables contained in the DataOp.
`DataOp.skb.make_learner`	Get a skrub learner for this DataOp.
`DataOp.skb.make_grid_search`	Find the best parameters with grid search.
`DataOp.skb.make_randomized_search`	Find the best parameters with randomized search.
`DataOp.skb.if_else`	Create a conditional DataOp.
`DataOp.skb.iter_learners_grid`	Get learners with different parameter combinations.
`DataOp.skb.iter_learners_randomized`	Get learners with different parameter combinations.
`DataOp.skb.mark_as_X`	Mark this DataOp as being the `X` table.
`DataOp.skb.mark_as_y`	Mark this DataOp as being the `y` table.
`DataOp.skb.match`	Select based on the value of a DataOp.
`DataOp.skb.preview`	Get the value computed for previews (shown when printing the DataOp).
`DataOp.skb.select`	Select a subset of columns.
`DataOp.skb.set_description`	Give a description to this DataOp.
`DataOp.skb.set_name`	Give a name to this DataOp.
`DataOp.skb.subsample`	Configure subsampling of a dataframe or numpy array.
`DataOp.skb.train_test_split`	Split an environment into training and testing environments.

`DataOp.skb.description`	A user-defined description or comment about the DataOp.
`DataOp.skb.is_X`	Whether this DataOp has been marked with `skb.mark_as_X()`.
`DataOp.skb.is_y`	Whether this DataOp has been marked with `skb.mark_as_y()`.
`DataOp.skb.name`	A user-chosen name for the DataOp.
`DataOp.skb.applied_estimator`	Retrieve the estimator applied in the previous step, as a DataOp.

`SkrubLearner`	Learner that evaluates a skrub DataOp.
`ParamSearch`	Learner that evaluates a skrub DataOp with hyperparameter tuning.

Configuration

`get_config`	Retrieve current values for configuration set by `set_config()`.
`set_config`	Set global skrub configuration.
`config_context`	Context manager for global skrub configuration.

Datasets

`datasets.fetch_bike_sharing`	Fetch the bike sharing dataset (regression), available at skrub-data/skrub-data-files.
`datasets.fetch_country_happiness`	Fetch the happiness index dataset (regression), available at skrub-data/skrub-data-files.
`datasets.fetch_credit_fraud`	Fetch the credit fraud dataset (classification), available at skrub-data/skrub-data-files.
`datasets.fetch_drug_directory`	Fetch the drug directory dataset (classification), available at skrub-data/skrub-data-files.
`datasets.fetch_employee_salaries`	Fetch the employee salaries dataset (regression), available at skrub-data/skrub-data-files.
`datasets.fetch_flight_delays`	Fetch the flight delays dataset (regression), available at skrub-data/skrub-data-files.
`datasets.fetch_ken_embeddings`	Download Wikipedia embeddings by type.
`datasets.fetch_ken_table_aliases`	Get the supported aliases of embedded KEN entities tables.
`datasets.fetch_ken_types`	Helper function to search for KEN entity types.
`datasets.fetch_medical_charge`	Fetch the medical charge dataset (regression), available at skrub-data/skrub-data-files.
`datasets.fetch_midwest_survey`	Fetch the midwest survey dataset (classification), available at skrub-data/skrub-data-files.
`datasets.fetch_movielens`	Fetch the movielens dataset (regression), available at skrub-data/skrub-data-files.
`datasets.fetch_open_payments`	Fetch the open payments dataset (classification), available at skrub-data/skrub-data-files.
`datasets.fetch_toxicity`	Fetch the toxicity dataset (classification), available at skrub-data/skrub-data-files.
`datasets.fetch_traffic_violations`	Fetch the traffic violations dataset (classification), available at skrub-data/skrub-data-files.
`datasets.fetch_videogame_sales`	Fetch the videogame sales dataset (regression), available at skrub-data/skrub-data-files.
`datasets.get_data_dir`	Return the directory in which skrub looks for data.
`datasets.make_deduplication_data`	Duplicates examples with spelling mistakes.
`datasets.toy_orders`	Create a toy dataframe and corresponding targets for examples.