AggJoiner

class skrub.AggJoiner(aux_table, operations, *, key=None, main_key=None, aux_key=None, cols=None, suffix='')

Aggregate an auxiliary dataframe before joining it on a base dataframe.

Apply numerical and categorical aggregation operations to the selected columns (cols). The supported operations are listed under the operations parameter.

If cols is not provided, all columns of aux_table except aux_key are aggregated.

Accepts pandas.DataFrame and polars.DataFrame inputs.

Parameters:

aux_table : DataFrameLike or “X”

Auxiliary dataframe to aggregate then join on the base table. The placeholder string “X” can be provided to perform self-aggregation on the input data.

operations : str or iterable of str

Aggregation operations to perform on the auxiliary table.

Supported operations are “count”, “mode”, “min”, “max”, “sum”, “median”, “mean”, “std”. The operations “sum”, “median”, “mean”, “std” are only available for numeric columns.

key : str or iterable of str, default=None

The column name to use for both main_key and aux_key when they are the same. Provide either key or both main_key and aux_key. If key is an iterable, a multi-column join is performed.

main_key : str or iterable of str, default=None

Columns of the main table to use as keys in the join. If main_key is an iterable, a multi-column join is performed.

aux_key : str or iterable of str, default=None

Columns of the auxiliary dataframe to use as keys in the join. If aux_key is an iterable, a multi-column join is performed.

cols : str or iterable of str, default=None

Columns of the auxiliary dataframe to aggregate. By default, all columns of aux_table except aux_key are used.

suffix : str, default=""

Suffix to append to the aux_table’s column names. You can use it to avoid duplicate column names in the join.
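As a rough illustration of how the key and operations parameters interact, the sketch below mimics a multi-column key with one numeric operation ("sum") and one type-agnostic operation ("mode") in plain pandas. It does not call skrub and is not its implementation; the column names `airport`, `carrier` and `aircraft`, and the output column names, are assumptions made up for this example.

```python
import pandas as pd

# Main table: one row per (airport, carrier) pair we want to describe.
main = pd.DataFrame({
    "airport": [1, 1, 2],
    "carrier": ["AF", "DL", "DL"],
})

# Auxiliary table: one row per flight (illustrative data).
aux = pd.DataFrame({
    "from_airport": [1, 1, 1, 2, 2, 2],
    "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
    "total_passengers": [90, 120, 100, 70, 80, 90],
    "aircraft": ["A320", "A320", "A320", "B737", "B737", "A320"],
})

# Lists of keys play the role of an iterable main_key / aux_key.
main_key = ["airport", "carrier"]
aux_key = ["from_airport", "company"]

# "sum" needs a numeric column; "mode" also works on a categorical one.
aggregated = aux.groupby(aux_key, as_index=False).agg(
    total_passengers_sum=("total_passengers", "sum"),
    aircraft_mode=("aircraft", lambda s: s.mode().iloc[0]),
)

# Left join on both key columns, then drop the auxiliary key columns.
joined = main.merge(
    aggregated, how="left", left_on=main_key, right_on=aux_key
).drop(columns=aux_key)
print(joined)
```

Each row of `main` receives the aggregates of the flights that match on both key columns at once.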

See also

AggTarget: Aggregates the target y before joining its aggregation on the base dataframe.
Joiner: Augments a main table by automatically joining an auxiliary table on it.
MultiAggJoiner: Extension of the AggJoiner to multiple auxiliary tables.

Examples

>>> import pandas as pd
>>> from skrub import AggJoiner
>>> main = pd.DataFrame({
...     "airportId": [1, 2],
...     "airportName": ["Paris CDG", "NY JFK"],
... })
>>> aux = pd.DataFrame({
...     "flightId": range(1, 7),
...     "from_airport": [1, 1, 1, 2, 2, 2],
...     "total_passengers": [90, 120, 100, 70, 80, 90],
...     "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
... })
>>> agg_joiner = AggJoiner(
...     aux_table=aux,
...     operations="mean",
...     main_key="airportId",
...     aux_key="from_airport",
...     cols="total_passengers",
... )
>>> agg_joiner.fit_transform(main)
   airportId  airportName  total_passengers_mean
0          1    Paris CDG              103.33...
1          2       NY JFK               80.00...
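For intuition, the same numbers can be reproduced with two plain-pandas steps: a group-by aggregation of the auxiliary table followed by a left join. This is only a sketch of the idea under that assumption, not how skrub implements AggJoiner.

```python
import pandas as pd

main = pd.DataFrame({
    "airportId": [1, 2],
    "airportName": ["Paris CDG", "NY JFK"],
})
aux = pd.DataFrame({
    "flightId": range(1, 7),
    "from_airport": [1, 1, 1, 2, 2, 2],
    "total_passengers": [90, 120, 100, 70, 80, 90],
    "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
})

# Step 1: aggregate the auxiliary table per key value.
aggregated = (
    aux.groupby("from_airport", as_index=False)["total_passengers"]
    .mean()
    .rename(columns={"total_passengers": "total_passengers_mean"})
)

# Step 2: left-join the aggregate onto the main table and drop the aux key.
result = main.merge(
    aggregated, how="left", left_on="airportId", right_on="from_airport"
).drop(columns="from_airport")
print(result)
```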

Methods

`fit`(X[, y])	Aggregate auxiliary table based on the main keys.
`fit_transform`(X[, y])	Aggregate auxiliary table based on the main keys.
`get_feature_names_out`()	Get output feature names for transformation.
`get_params`([deep])	Get parameters for this estimator.
`set_output`(*[, transform])	Set output container.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X)	Left-join pre-aggregated table on X.

fit(X, y=None)

Aggregate auxiliary table based on the main keys.

Parameters:

X : DataFrameLike

Input data, base table on which to left-join the auxiliary table.

y : None

Unused, only here for compatibility.

Returns:

AggJoiner: Fitted AggJoiner instance (self).

fit_transform(X, y=None)

Aggregate auxiliary table based on the main keys.

Parameters:

X : DataFrameLike

Input data, base table on which to left-join the auxiliary table.

y : None

Unused, only here for compatibility.

Returns:

DataFrame: The augmented input.

get_feature_names_out()

Get output feature names for transformation.

Returns:

List of str: Transformed feature names.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep : bool, default=True

If True, return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : dict

Parameter names mapped to their values.

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform : {“default”, “pandas”, “polars”}, default=None

Configure output of transform and fit_transform.

“default”: Default output format of a transformer
“pandas”: DataFrame output
“polars”: Polars output
None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:

self : estimator instance

Estimator instance.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params : dict

Estimator parameters.

Returns:

self : estimator instance

Estimator instance.
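The `<component>__<parameter>` convention is easiest to see on a small scikit-learn Pipeline. The sketch below uses StandardScaler and LogisticRegression purely as stand-in components; the step names `scaler` and `clf` are arbitrary choices for this example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Nested parameters are addressed as <step name>__<parameter name>.
pipe.set_params(clf__C=0.5, scaler__with_mean=False)

# deep=True (the default) exposes the nested names; deep=False does not.
deep_params = pipe.get_params()
shallow_params = pipe.get_params(deep=False)

print(deep_params["clf__C"], deep_params["scaler__with_mean"])
```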

transform(X)

Left-join pre-aggregated table on X.

Parameters:

X : DataFrameLike

The input data to transform.

Returns:

DataFrame: The augmented input.
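The split between fit (aggregate the auxiliary table) and transform (left-join the stored aggregate onto X) can be sketched with a small pandas-only class. `TinyAggJoiner` is a hypothetical stand-in written for this illustration; it supports a single key, a single column and only the mean, and it is not skrub's implementation.

```python
import pandas as pd


class TinyAggJoiner:
    """Hypothetical, simplified stand-in for illustration only."""

    def __init__(self, aux_table, main_key, aux_key, col):
        self.aux_table = aux_table
        self.main_key = main_key
        self.aux_key = aux_key
        self.col = col

    def fit(self, X, y=None):
        # The aggregation happens once, at fit time, and is stored.
        self.aggregated_ = (
            self.aux_table.groupby(self.aux_key, as_index=False)[self.col]
            .mean()
            .rename(columns={self.col: f"{self.col}_mean"})
        )
        return self

    def transform(self, X):
        # Only the left join runs at transform time.
        joined = X.merge(
            self.aggregated_,
            how="left",
            left_on=self.main_key,
            right_on=self.aux_key,
        )
        if self.aux_key != self.main_key:
            joined = joined.drop(columns=self.aux_key)
        return joined


aux = pd.DataFrame({
    "from_airport": [1, 1, 1, 2, 2, 2],
    "total_passengers": [90, 120, 100, 70, 80, 90],
})
train = pd.DataFrame({"airportId": [1, 2]})
new = pd.DataFrame({"airportId": [2, 3]})  # airport 3 has no flights in aux

joiner = TinyAggJoiner(aux, "airportId", "from_airport", "total_passengers")
out = joiner.fit(train).transform(new)
print(out)
```

Because the join is a left join, every row of the input is kept; a key with no match in the auxiliary table (airport 3 here) gets a missing value in the aggregated column.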

Gallery examples

Getting Started

AggJoiner on a credit fraud dataset