AggJoiner

class skrub.AggJoiner(aux_table, *, key=None, main_key=None, aux_key=None, cols=None, operations=None, suffix='')

Aggregate an auxiliary dataframe before joining it on a base dataframe.

Apply numerical and categorical aggregation operations to the columns to aggregate (i.e. cols). Which operations apply to a given column depends on its dtype. See the operations parameter for the list of supported operations.

If cols is not provided, cols are all columns from aux_table, except aux_key.

Accepts pandas.DataFrame and polars.DataFrame inputs.
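As a rough mental model, the aggregate-then-join behaviour described above can be sketched in plain pandas. This is an illustration under assumptions, not skrub's implementation: the data is made up, and the `_mean` column name simply mirrors the output shown in the Examples section.

```python
import pandas as pd

main = pd.DataFrame({
    "airportId": [1, 2],
    "airportName": ["Paris CDG", "NY JFK"],
})
aux = pd.DataFrame({
    "from_airport": [1, 1, 1, 2, 2, 2],
    "total_passengers": [90, 120, 100, 70, 80, 90],
})

# Step 1: aggregate the auxiliary table, one row per key value.
aggregated = (
    aux.groupby("from_airport", as_index=False)["total_passengers"]
    .mean()
    .rename(columns={"total_passengers": "total_passengers_mean"})
)

# Step 2: left-join the aggregated table on the main table,
# then drop the auxiliary key column, which is now redundant.
joined = main.merge(
    aggregated, how="left", left_on="airportId", right_on="from_airport"
).drop(columns="from_airport")
print(joined)
```

Aggregating first guarantees at most one auxiliary row per key, so the left join cannot duplicate rows of the main table.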

Parameters:
aux_table : DataFrameLike or “X”

Auxiliary dataframe to aggregate then join on the base table. The placeholder string “X” can be provided to perform self-aggregation on the input data.

key : str or iterable of str, default=None

The column name to use for both main_key and aux_key when they are the same. Provide either key or both main_key and aux_key. If key is an iterable, we will perform a multi-column join.

main_key : str or iterable of str, default=None

Select the columns from the main table to use as keys during the join operation. If main_key is an iterable, we will perform a multi-column join.

aux_key : str or iterable of str, default=None

Select the columns from the auxiliary dataframe to use as keys during the join operation. If aux_key is an iterable, we will perform a multi-column join.
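To make the multi-column case concrete, here is a pandas-only sketch of an aggregation followed by a join on two key columns. It assumes that the columns of main_key and aux_key are paired in the order given, which is how pandas treats `left_on` and `right_on`; the tables and column names are hypothetical.

```python
import pandas as pd

main = pd.DataFrame({"airport": [1, 1, 2], "year": [2022, 2023, 2023]})
aux = pd.DataFrame({
    "from_airport": [1, 1, 1, 2],
    "flight_year": [2022, 2022, 2023, 2023],
    "total_passengers": [90, 110, 100, 70],
})

# Keys are paired by position: "airport" with "from_airport",
# "year" with "flight_year".
main_key = ["airport", "year"]
aux_key = ["from_airport", "flight_year"]

# One aggregated row per (from_airport, flight_year) combination.
aggregated = aux.groupby(aux_key, as_index=False)["total_passengers"].sum()

joined = main.merge(
    aggregated, how="left", left_on=main_key, right_on=aux_key
).drop(columns=aux_key)
print(joined)
```

When the key columns share the same names in both tables, the single key parameter expresses the same thing more compactly.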

cols : str or iterable of str, default=None

Select the columns from the auxiliary dataframe to use as values during the aggregation operations. If set to None, cols are all columns from aux_table, except aux_key.

operations : str or iterable of str, default=None

Aggregation operations to perform on the auxiliary table.

  • numerical : {“sum”, “mean”, “std”, “min”, “max”, “hist”, “value_counts”}. “hist” and “value_counts” accept an integer argument to parametrize the binning.

  • categorical : {“mode”, “count”, “value_counts”}

  • If set to None (the default), [“mean”, “mode”] will be used.
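The dtype-based selection behind the default [“mean”, “mode”] can be pictured with the pandas sketch below: numeric columns receive the mean, the remaining columns receive the mode. This is an approximation for illustration. The dtype test and the tie-breaking rule for the mode are assumptions, not a description of skrub's internals.

```python
import pandas as pd

aux = pd.DataFrame({
    "from_airport": [1, 1, 1, 2, 2, 2],
    "total_passengers": [90, 120, 100, 70, 80, 90],
    "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
})

# Split the value columns (everything except the key) by dtype.
values = aux.drop(columns="from_airport")
numeric_cols = values.select_dtypes(include="number").columns.tolist()
categorical_cols = values.select_dtypes(exclude="number").columns.tolist()

grouped = aux.groupby("from_airport")

# "mean" applies to numeric columns only.
means = grouped[numeric_cols].mean().add_suffix("_mean")

# "mode" applies to categorical columns only; on ties this sketch keeps
# the first value returned by Series.mode (an arbitrary choice).
modes = grouped[categorical_cols].agg(lambda s: s.mode().iloc[0]).add_suffix("_mode")

aggregated = means.join(modes)
print(aggregated)
```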

suffix : str, default=""

Suffix to append to the aux_table’s column names. You can use it to avoid duplicate column names in the join.
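One situation where a suffix helps is aggregating two auxiliary tables that share column names. The pandas sketch below imitates that scenario. The tables are hypothetical, and where exactly the suffix lands in the generated column names is an assumption made for the illustration.

```python
import pandas as pd

main = pd.DataFrame({"airportId": [1, 2]})
departures = pd.DataFrame({"airportId": [1, 1, 2], "total_passengers": [90, 110, 70]})
arrivals = pd.DataFrame({"airportId": [1, 2, 2], "total_passengers": [50, 60, 80]})


def aggregate(aux, suffix):
    # Mean per airport, with the suffix appended to the aggregated column name.
    means = aux.groupby("airportId")["total_passengers"].mean()
    return means.rename(f"total_passengers_mean{suffix}").reset_index()


# Without distinct suffixes, both joins would produce a column named
# "total_passengers_mean" and the names would collide.
joined = main.merge(aggregate(departures, "_dep"), how="left", on="airportId")
joined = joined.merge(aggregate(arrivals, "_arr"), how="left", on="airportId")
print(joined)
```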

See also

AggTarget

Aggregates the target y before joining its aggregation on the base dataframe.

Joiner

Augments a main table by automatically joining an auxiliary table on it.

MultiAggJoiner

Extension of the AggJoiner to multiple auxiliary tables.

Examples

>>> import pandas as pd
>>> from skrub import AggJoiner
>>> main = pd.DataFrame({
...     "airportId": [1, 2],
...     "airportName": ["Paris CDG", "NY JFK"],
... })
>>> aux = pd.DataFrame({
...     "flightId": range(1, 7),
...     "from_airport": [1, 1, 1, 2, 2, 2],
...     "total_passengers": [90, 120, 100, 70, 80, 90],
...     "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
... })
>>> agg_joiner = AggJoiner(
...     aux_table=aux,
...     main_key="airportId",
...     aux_key="from_airport",
...     cols=["total_passengers", "company"],
...     operations=["mean", "mode"],
... )
>>> agg_joiner.fit_transform(main)
   airportId  airportName  company_mode  total_passengers_mean
0          1    Paris CDG            AF              103.33...
1          2       NY JFK            DL               80.00...
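The “X” placeholder for aux_table requests self-aggregation: the input table is aggregated and the result is joined back on itself. A pandas-only sketch of what such a self-aggregation computes is given below. It is an illustration with made-up data, not skrub's code path.

```python
import pandas as pd

flights = pd.DataFrame({
    "company": ["DL", "AF", "AF", "DL"],
    "total_passengers": [90, 120, 100, 70],
})

# Aggregate the table by one of its own columns...
per_company = (
    flights.groupby("company", as_index=False)["total_passengers"]
    .mean()
    .rename(columns={"total_passengers": "total_passengers_mean"})
)

# ...and join the aggregate back, so every flight carries its company's mean.
augmented = flights.merge(per_company, how="left", on="company")
print(augmented)
```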

Methods

fit(X[, y])

Aggregate auxiliary table based on the main keys.

fit_transform(X[, y])

Aggregate auxiliary table based on the main keys.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Left-join pre-aggregated table on X.

fit(X, y=None)

Aggregate auxiliary table based on the main keys.

Parameters:
X : DataFrameLike

Input data: the base table onto which the auxiliary table is left-joined.

y : None

Unused, only here for compatibility.

Returns:
AggJoiner

Fitted AggJoiner instance (self).

fit_transform(X, y=None)

Aggregate auxiliary table based on the main keys.

Parameters:
X : DataFrameLike

Input data: the base table onto which the auxiliary table is left-joined.

y : None

Unused, only here for compatibility.

Returns:
DataFrame

The augmented input.

get_metadata_routing()

Get metadata routing of this object.

See the User Guide for details on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:
transform : {“default”, “pandas”, “polars”}, default=None

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • “polars”: Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:
self : estimator instance

Estimator instance.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

transform(X)

Left-join pre-aggregated table on X.

Parameters:
X : DataFrameLike

The input data to transform.

Returns:
DataFrame

The augmented input.
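The division of labour between fit (aggregate the auxiliary table) and transform (left-join the pre-aggregated table on X) can be sketched with a tiny stand-in class. `TinyAggJoiner` is hypothetical and heavily simplified: it supports one key, one column and the mean only, and it is not how skrub implements AggJoiner.

```python
import pandas as pd


class TinyAggJoiner:
    """Hypothetical, simplified stand-in used only to illustrate fit/transform."""

    def __init__(self, aux_table, main_key, aux_key, col):
        self.aux_table = aux_table
        self.main_key = main_key
        self.aux_key = aux_key
        self.col = col

    def fit(self, X, y=None):
        # The aggregation is computed once, at fit time; y is ignored.
        self.aggregated_ = (
            self.aux_table.groupby(self.aux_key)[self.col]
            .mean()
            .rename(f"{self.col}_mean")
            .reset_index()
        )
        return self

    def transform(self, X):
        # Left join: every row of X is kept, matched or not.
        joined = X.merge(
            self.aggregated_,
            how="left",
            left_on=self.main_key,
            right_on=self.aux_key,
        )
        if self.aux_key != self.main_key:
            joined = joined.drop(columns=self.aux_key)
        return joined


aux = pd.DataFrame({
    "from_airport": [1, 1, 1, 2, 2, 2],
    "total_passengers": [90, 120, 100, 70, 80, 90],
})
train = pd.DataFrame({"airportId": [1, 2]})
new_rows = pd.DataFrame({"airportId": [2, 3]})

joiner = TinyAggJoiner(aux, "airportId", "from_airport", "total_passengers").fit(train)
augmented = joiner.transform(new_rows)
print(augmented)
```

In this sketch, airport 3 has no flights in the auxiliary table, so the left join keeps its row and fills the aggregated column with a missing value.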