`skrub`.AggJoiner

Usage examples at the bottom of this page.

class skrub.AggJoiner(aux_table, *, aux_key, main_key, cols=None, operation=None, suffix=None)

Aggregate auxiliary dataframes before joining them on a base dataframe.

Apply numerical and categorical aggregation operations to the columns to aggregate, which are selected by dtype. See the operation parameter for the list of supported operations.

The grouping columns used during the aggregation are the columns used as keys for joining.

Accepts pandas.DataFrame and polars.DataFrame inputs.
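
The aggregate-then-join behaviour described above can be sketched in plain pandas. This is only an illustration of the idea, not skrub's implementation; the output column names `total_passengers_mean` and `company_mode` are chosen for this sketch and are not the names AggJoiner generates.

```python
import pandas as pd

main = pd.DataFrame({
    "airportId": [1, 2],
    "airportName": ["Paris CDG", "NY JFK"],
})
aux = pd.DataFrame({
    "from_airport": [1, 1, 1, 2, 2, 2],
    "total_passengers": [90, 120, 100, 70, 80, 90],
    "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
})

# Step 1: aggregate the auxiliary table, grouping on its join key.
# A numerical operation (mean) is applied to the numeric column and a
# categorical one (mode) to the string column.
aggregated = aux.groupby("from_airport", as_index=False).agg(
    total_passengers_mean=("total_passengers", "mean"),
    company_mode=("company", lambda s: s.mode().iloc[0]),
)

# Step 2: left-join the aggregated table onto the main table, then drop
# the auxiliary key so only the new features are added.
joined = main.merge(
    aggregated, how="left", left_on="airportId", right_on="from_airport"
).drop(columns="from_airport")

print(joined)
```

Because the grouping column is also the join key, the aggregated table has at most one row per key, so the left join never duplicates rows of the main table.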

Parameters:

aux_table : DataFrameLike or str or iterable

Auxiliary dataframe to aggregate then join on the base table. The placeholder string “X” can be provided to perform self-aggregation on the input data.

aux_key : str, or iterable of str, or iterable of iterable of str

Select the columns from the auxiliary dataframe to use as keys during the join operation.

main_key : str or iterable of str

Select the columns from the main table to use as keys during the join operation. If main_key is a list, we will perform a multi-column join.

cols : str, or iterable of str, or iterable of iterable of str, default=None

Select the columns from the auxiliary dataframe to use as values during the aggregation operations. If None, all columns of aux_table except aux_key are used.

operation : str or iterable of str, default=None

Aggregation operations to perform on the auxiliary table.

numerical : {“sum”, “mean”, “std”, “min”, “max”, “hist”, “value_counts”}. ‘hist’ and ‘value_counts’ accept an integer argument to parametrize the binning.

categorical : {“mode”, “count”, “value_counts”}

If set to None (the default), [‘mean’, ‘mode’] will be used.

suffix : str or iterable of str, default=None

Suffixes added to each table's columns in case of duplicate column names. If set to None, the index of each table in ‘aux_table’ is used, e.g. for a duplicated column: price (main table), price_1 (auxiliary table 1), price_2 (auxiliary table 2), etc.
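
Self-aggregation, which the aux_table description enables through the “X” placeholder, means that the input table is aggregated and the result is joined back onto that same table. A minimal pandas sketch of the idea follows; the table, its column names and the `rating_mean` output name are hypothetical and chosen for illustration only.

```python
import pandas as pd

# Hypothetical ratings table; the column names are illustrative only.
X = pd.DataFrame({
    "userId": [1, 1, 2, 2, 2],
    "movieId": [10, 20, 10, 30, 40],
    "rating": [4.0, 2.0, 5.0, 3.0, 4.0],
})

# Aggregate X by user: one row per userId with that user's mean rating.
per_user = X.groupby("userId", as_index=False).agg(
    rating_mean=("rating", "mean"),
)

# Join the per-user statistic back onto X itself (same key on both sides).
X_augmented = X.merge(per_user, how="left", on="userId")

print(X_augmented)
```

Each row of the input keeps its own values and gains a feature summarising all rows that share its key.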

See also

AggTarget: Aggregates the target y before joining its aggregation on the base dataframe.
Joiner: Augments a main table by automatically joining multiple auxiliary tables on it.

Examples

>>> import pandas as pd
>>> from skrub import AggJoiner
>>> main = pd.DataFrame({
...     "airportId": [1, 2],
...     "airportName": ["Paris CDG", "NY JFK"],
... })
>>> aux = pd.DataFrame({
...     "flightId": range(1, 7),
...     "from_airport": [1, 1, 1, 2, 2, 2],
...     "total_passengers": [90, 120, 100, 70, 80, 90],
...     "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
... })
>>> join_agg = AggJoiner(
...     aux_table=aux,
...     aux_key="from_airport",
...     main_key="airportId",
...     cols=["total_passengers", "company"],
...     operation=["mean", "mode"],
... )
>>> join_agg.fit_transform(main)
   airportId airportName company_mode_1  total_passengers_mean_1
0          1   Paris CDG             AF               103.33...
1          2      NY JFK             DL                80.00...
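
The example above joins on a single key. When main_key and aux_key are lists, the join is performed on several columns at once. The pandas sketch below illustrates what such a multi-column aggregate-then-join looks like; it is not skrub's implementation, and the `year` and `flight_year` columns are hypothetical additions to the airport data.

```python
import pandas as pd

main = pd.DataFrame({
    "airportId": [1, 1, 2],
    "year": [2022, 2023, 2023],
})
aux = pd.DataFrame({
    "from_airport": [1, 1, 1, 2, 2],
    "flight_year": [2022, 2022, 2023, 2023, 2023],
    "total_passengers": [90, 110, 130, 70, 90],
})

# Keys are matched pairwise, in order: airportId with from_airport,
# year with flight_year.
main_key = ["airportId", "year"]
aux_key = ["from_airport", "flight_year"]

# Group on all auxiliary key columns, so there is one row per key combination.
aggregated = aux.groupby(aux_key, as_index=False).agg(
    total_passengers_mean=("total_passengers", "mean"),
)

# Left-join on the column pairs, then drop the auxiliary key columns.
joined = main.merge(
    aggregated, how="left", left_on=main_key, right_on=aux_key
).drop(columns=aux_key)

print(joined)
```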

Methods

`check_input`(X)	Perform a check on column names data type and suffixes.
`fit`(X[, y])	Aggregate auxiliary tables based on the main keys.
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_metadata_routing`()	Get metadata routing of this object.
`get_params`([deep])	Get parameters for this estimator.
`set_output`(*[, transform])	Set output container.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X)	Left-join pre-aggregated tables on X.

check_input(X)

Perform a check on column names data type and suffixes.

Parameters:

X : DataFrameLike

The raw input to check.

fit(X, y=None)

Aggregate auxiliary tables based on the main keys.

Parameters:

X : DataFrameLike

Input data, the base table on which to left-join the auxiliary tables.

y : array_like of shape (n_samples,), default=None

Prediction target. Used to compute correlations between the generated covariates and the target for screening purposes.

Returns:

AggJoiner: Fitted AggJoiner instance (self).

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X : array_like of shape (n_samples, n_features)

Input samples.

y : array_like of shape (n_samples,) or (n_samples, n_outputs), default=None

Target values (None for unsupervised transformations).

**fit_params : dict

Additional fit parameters.

Returns:

X_new : ndarray of shape (n_samples, n_features_new)

Transformed array.

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : dict

Parameter names mapped to their values.

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform : {“default”, “pandas”}, default=None

Configure output of transform and fit_transform.

“default”: Default output format of a transformer
“pandas”: DataFrame output
None: Transform configuration is unchanged

Returns:

self : estimator instance

Estimator instance.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params : dict

Estimator parameters.

Returns:

self : estimator instance

Estimator instance.
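
The `<component>__<parameter>` convention described for set_params is easiest to see on a small scikit-learn Pipeline. The sketch below uses standard scikit-learn estimators rather than AggJoiner, and the step names `scaler` and `clf` are arbitrary labels chosen for this example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Update parameters of nested steps using <component>__<parameter>.
pipe.set_params(clf__C=0.5, scaler__with_mean=False)

# get_params(deep=True) exposes the nested parameters under the same names.
params = pipe.get_params(deep=True)
print(params["clf__C"], params["scaler__with_mean"])
```

The same naming scheme applies when a transformer such as AggJoiner is one step of a pipeline: its parameters are addressed as the step name, two underscores, then the parameter name.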

transform(X)

Left-join pre-aggregated tables on X.

Parameters:

X : DataFrameLike

The input data to transform.

Returns:

X_transformed : DataFrameLike

The augmented input.

Examples using `skrub.AggJoiner`

Self-aggregation on MovieLens