skrub.AggJoiner

Usage examples at the bottom of this page.

class skrub.AggJoiner(aux_table, *, aux_key, main_key, cols=None, operation=None, suffix=None)

Aggregate auxiliary dataframes before joining them on a base dataframe.

Apply numerical and categorical aggregation operations to the columns to aggregate, which are selected by dtype. See the list of supported operations under the operation parameter.

The grouping columns used during the aggregation are the columns used as keys for joining.

Accepts pandas.DataFrame and polars.DataFrame inputs.
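
The aggregate-then-join idea described above can be sketched in plain pandas. This is only an illustration of the concept, not skrub's implementation; the output column name total_passengers_mean is chosen here for readability.

```python
import pandas as pd

main = pd.DataFrame({
    "airportId": [1, 2],
    "airportName": ["Paris CDG", "NY JFK"],
})
aux = pd.DataFrame({
    "from_airport": [1, 1, 1, 2, 2, 2],
    "total_passengers": [90, 120, 100, 70, 80, 90],
})

# Step 1: aggregate the auxiliary table, grouping by its join key.
aux_agg = (
    aux.groupby("from_airport", as_index=False)["total_passengers"]
    .mean()
    .rename(columns={"total_passengers": "total_passengers_mean"})
)

# Step 2: left-join the aggregated table onto the main table,
# then drop the auxiliary key, which duplicates airportId.
joined = main.merge(
    aux_agg, how="left", left_on="airportId", right_on="from_airport"
).drop(columns="from_airport")
```

Because the aggregation groups by the join key, the aggregated table has at most one row per key, so the left join never duplicates rows of the main table.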

Parameters:
aux_table : DataFrameLike or str or iterable

Auxiliary dataframe to aggregate then join on the base table. The placeholder string “X” can be provided to perform self-aggregation on the input data.

aux_key : str, or iterable of str, or iterable of iterable of str

Select the columns from the auxiliary dataframe to use as keys during the join operation.

main_key : str or iterable of str

Select the columns from the main table to use as keys during the join operation. If main_key is a list, we will perform a multi-column join.
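
A multi-column join pairs the keys of both tables position by position. The pandas-only sketch below illustrates this under assumed column names (year, flight_year); it does not call skrub.

```python
import pandas as pd

main = pd.DataFrame({
    "airportId": [1, 1, 2],
    "year": [2022, 2023, 2023],
})
aux = pd.DataFrame({
    "from_airport": [1, 1, 1, 2],
    "flight_year": [2022, 2022, 2023, 2023],
    "total_passengers": [90, 110, 100, 70],
})

# Keys are matched by position: airportId <-> from_airport, year <-> flight_year.
main_key = ["airportId", "year"]
aux_key = ["from_airport", "flight_year"]

# Group by every auxiliary key column, then left-join on the key pairs.
aux_agg = aux.groupby(aux_key, as_index=False)["total_passengers"].mean()
joined = main.merge(
    aux_agg, how="left", left_on=main_key, right_on=aux_key
).drop(columns=aux_key)
```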

cols : str, or iterable of str, or iterable of iterable of str, default=None

Select the columns from the auxiliary dataframe to use as values during the aggregation operations. If None, cols are all columns from aux_table, except aux_key.

operation : str or iterable of str, default=None

Aggregation operations to perform on the auxiliary table.

numerical : {“sum”, “mean”, “std”, “min”, “max”, “hist”, “value_counts”}

‘hist’ and ‘value_counts’ accept an integer argument to parametrize the binning.

categorical : {“mode”, “count”, “value_counts”}

If set to None (the default), [‘mean’, ‘mode’] will be used.
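
The default pair of operations, a mean for numerical columns and a mode for categorical ones, can be approximated in plain pandas as follows. The helper first_mode is hypothetical, and its tie-breaking rule (first value in sorted order) is an assumption that may differ from skrub's behavior.

```python
import pandas as pd

aux = pd.DataFrame({
    "from_airport": [1, 1, 1, 2, 2, 2],
    "total_passengers": [90, 120, 100, 70, 80, 90],
    "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
})


def first_mode(values: pd.Series):
    """Return the most frequent value; ties resolve to the first in sorted order."""
    return values.mode().iloc[0]


# Named aggregation: one output column per (input column, operation) pair.
aux_agg = aux.groupby("from_airport").agg(
    total_passengers_mean=("total_passengers", "mean"),
    company_mode=("company", first_mode),
)
```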

suffix : str or iterable of str, default=None

The suffixes added to each table's columns in case of duplicate column names. If set to None, the index of each table in ‘aux_table’ is used, e.g. for a duplicate column: price (main table), price_1 (auxiliary table 1), price_2 (auxiliary table 2), etc.
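
The default naming scheme for duplicate columns can be imitated with pandas alone. The sketch below only shows how index-based suffixes keep a shared price column distinct; it skips the aggregation step, and the tables are made up for illustration.

```python
import pandas as pd

main = pd.DataFrame({"airportId": [1, 2], "price": [10.0, 20.0]})
aux_tables = [
    pd.DataFrame({"airportId": [1, 2], "price": [1.0, 2.0]}),
    pd.DataFrame({"airportId": [1, 2], "price": [3.0, 4.0]}),
]

joined = main
for index, aux in enumerate(aux_tables, start=1):
    # Suffix every non-key column with the 1-based position of its table.
    renamed = aux.rename(
        columns={col: f"{col}_{index}" for col in aux.columns if col != "airportId"}
    )
    joined = joined.merge(renamed, how="left", on="airportId")
```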

See also

AggTarget

Aggregates the target y before joining its aggregation on the base dataframe.

Joiner

Augments a main table by automatically joining multiple auxiliary tables on it.

Examples

>>> import pandas as pd
>>> from skrub import AggJoiner
>>> main = pd.DataFrame({
...     "airportId": [1, 2],
...     "airportName": ["Paris CDG", "NY JFK"],
... })
>>> aux = pd.DataFrame({
...     "flightId": range(1, 7),
...     "from_airport": [1, 1, 1, 2, 2, 2],
...     "total_passengers": [90, 120, 100, 70, 80, 90],
...     "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
... })
>>> join_agg = AggJoiner(
...     aux_table=aux,
...     aux_key="from_airport",
...     main_key="airportId",
...     cols=["total_passengers", "company"],
...     operation=["mean", "mode"],
... )
>>> join_agg.fit_transform(main)
   airportId airportName company_mode_1  total_passengers_mean_1
0          1   Paris CDG             AF               103.33...
1          2      NY JFK             DL                80.00...

Methods

check_input(X)

Check the data types of the column names and the suffixes.

fit(X[, y])

Aggregate auxiliary tables based on the main keys.

fit_transform(X[, y])

Fit to data, then transform it.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Left-join pre-aggregated tables on X.

check_input(X)

Check the data types of the column names and the suffixes.

Parameters:
X : DataFrameLike

The raw input to check.

fit(X, y=None)

Aggregate auxiliary tables based on the main keys.

Parameters:
X : DataFrameLike

Input data: the base table on which to left-join the auxiliary tables.

y : array_like of shape (n_samples,), default=None

Prediction target. Used to compute correlations between the generated covariates and the target for screening purposes.

Returns:
AggJoiner

Fitted AggJoiner instance (self).

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : array_like of shape (n_samples, n_features)

Input samples.

y : array_like of shape (n_samples,) or (n_samples, n_outputs), default=None

Target values (None for unsupervised transformations).

**fit_params : dict

Additional fit parameters.

Returns:
X_new : ndarray of shape (n_samples, n_features_new)

Transformed array.

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:
transform : {“default”, “pandas”}, default=None

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • None: Transform configuration is unchanged

Returns:
self : estimator instance

Estimator instance.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.
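
The <component>__<parameter> form can be seen on a small scikit-learn Pipeline. The step names scale and clf are arbitrary labels chosen for this sketch, and the estimators are stand-ins unrelated to AggJoiner.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Each keyword is <step name>__<parameter of that step>.
returned = pipe.set_params(clf__C=10.0, scale__with_mean=False)

# get_params(deep=True) exposes the same nested names.
params = pipe.get_params(deep=True)
```

set_params returns the estimator itself, so calls can be chained.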

transform(X)

Left-join pre-aggregated tables on X.

Parameters:
X : DataFrameLike

The input data to transform.

Returns:
X_transformed : DataFrameLike

The augmented input.
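
The division of work between fit (aggregate once) and transform (left-join on any input) can be sketched with a small pandas-only class. TinyAggJoiner is a hypothetical stand-in written for this page: it supports a single column and the mean only, and it is not how skrub implements AggJoiner.

```python
import pandas as pd


class TinyAggJoiner:
    """Hypothetical single-column, mean-only illustration of fit/transform."""

    def __init__(self, aux_table, aux_key, main_key, col):
        self.aux_table = aux_table
        self.aux_key = aux_key
        self.main_key = main_key
        self.col = col

    def fit(self, X, y=None):
        # Aggregate the auxiliary table once and store the result.
        self.aux_agg_ = (
            self.aux_table.groupby(self.aux_key, as_index=False)[self.col]
            .mean()
            .rename(columns={self.col: f"{self.col}_mean"})
        )
        return self

    def transform(self, X):
        # Left-join the stored aggregate on X; unmatched keys yield NaN.
        joined = X.merge(
            self.aux_agg_, how="left",
            left_on=self.main_key, right_on=self.aux_key,
        )
        if self.aux_key != self.main_key:
            joined = joined.drop(columns=self.aux_key)
        return joined


aux = pd.DataFrame({
    "from_airport": [1, 1, 1, 2, 2, 2],
    "total_passengers": [90, 120, 100, 70, 80, 90],
})
main = pd.DataFrame({"airportId": [1, 2]})
new_main = pd.DataFrame({"airportId": [2, 3]})

joiner = TinyAggJoiner(aux, "from_airport", "airportId", "total_passengers").fit(main)
out = joiner.transform(new_main)
```

Because the join is a left join, rows of the input whose key is absent from the auxiliary table (airport 3 here) are kept and receive missing values.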

Examples using skrub.AggJoiner

Self-aggregation on MovieLens
