MultiAggJoiner

class skrub.MultiAggJoiner(aux_tables, *, keys=None, main_keys=None, aux_keys=None, cols=None, operations=None, suffixes=None)

Extension of the AggJoiner to multiple auxiliary tables.

Apply numerical and categorical aggregation operations on the cols to aggregate, selected by dtypes. See the list of supported operations at the parameter operations.
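As a rough illustration of dtype-based selection, the pandas sketch below applies "mean" to numeric columns and "mode" to the remaining columns of a small table grouped by a key. It mimics the idea only: the table, the helper name aggregate_by_dtype and the output column naming are assumptions made for this example, and skrub's internals may differ.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def aggregate_by_dtype(table, key, num_op="mean"):
    """Group by `key`; numeric columns get `num_op`, other columns get the mode."""
    grouped = table.groupby(key)
    out = {}
    for col in table.columns:
        if col == key:
            continue
        if is_numeric_dtype(table[col]):
            out[f"{col}_{num_op}"] = grouped[col].agg(num_op)
        else:
            # Most frequent value per group (first one in case of ties).
            out[f"{col}_mode"] = grouped[col].agg(lambda s: s.mode().iloc[0])
    return pd.DataFrame(out).reset_index()


visits = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "days_of_stay": [2, 4, 3, 3, 12],
    "hospital": ["Cochin", "Bichat", "Cochin", "Bichat", "Bichat"],
})

# Patient 1: mean stay 3.0, most frequent hospital "Cochin".
# Patient 2: mean stay 7.5, most frequent hospital "Bichat".
aggregated = aggregate_by_dtype(visits, "patient_id")
print(aggregated)
```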

If cols is not provided, cols is set to a list of lists. For each table in aux_tables, the corresponding list will be all columns of that table, except the aux_keys associated with that table.
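The default described above can be sketched in plain Python. This illustrates the rule and is not skrub's implementation; the helper name default_cols and the use of column-name lists in place of dataframes are assumptions made for the example.

```python
def default_cols(aux_columns, aux_keys):
    """For each auxiliary table, keep every column that is not a join key.

    aux_columns: list of column-name lists, one per auxiliary table
                 (stands in for the tables themselves).
    aux_keys:    list of key-name lists, one per auxiliary table.
    """
    return [
        [col for col in columns if col not in keys]
        for columns, keys in zip(aux_columns, aux_keys)
    ]


hospitalizations_columns = ["visit_id", "patient_id", "days_of_stay", "hospital"]
glucose_columns = ["biology_id", "patientID", "value"]

cols = default_cols(
    [hospitalizations_columns, glucose_columns],
    [["patient_id"], ["patientID"]],
)
print(cols)
# [['visit_id', 'days_of_stay', 'hospital'], ['biology_id', 'value']]
```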

As opposed to the AggJoiner, here aux_tables is an iterable of tables, each of which will be joined on the main table. Therefore aux_keys is now an iterable of keys, of the same length as aux_tables, and each entry in aux_keys is used to join the corresponding auxiliary table. In the same way, each entry in cols is an iterable of columns to aggregate in the corresponding auxiliary table. If the keys are the same in the main table and the auxiliary tables, the keys parameter can be used instead of main_keys and aux_keys.
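The relationship between keys, main_keys and aux_keys can be sketched as follows. The function resolve_keys is hypothetical and only illustrates the rule stated above (either keys, or both main_keys and aux_keys, with one entry per auxiliary table); the actual checks and error messages in skrub may differ.

```python
def resolve_keys(n_tables, keys=None, main_keys=None, aux_keys=None):
    """Return (main_keys, aux_keys), one list of column names per auxiliary table."""
    if keys is not None:
        if main_keys is not None or aux_keys is not None:
            raise ValueError("Pass either `keys`, or `main_keys` and `aux_keys`.")
        main_keys, aux_keys = keys, keys
    elif main_keys is None or aux_keys is None:
        raise ValueError("`main_keys` and `aux_keys` must be given together.")
    if len(main_keys) != n_tables or len(aux_keys) != n_tables:
        raise ValueError("Expected one entry of keys per auxiliary table.")
    return main_keys, aux_keys


# Same key names in the main and auxiliary tables: `keys` is enough.
same = resolve_keys(2, keys=[["patient_id"], ["patient_id"]])

# The key is named differently in the second auxiliary table:
# spell out both sides of the join.
different = resolve_keys(
    2,
    main_keys=[["patient_id"], ["patient_id"]],
    aux_keys=[["patient_id"], ["patientID"]],
)
print(same)
print(different)
```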

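Therefore, with a single auxiliary table, we could either use the AggJoiner directly or pass the table to the MultiAggJoiner as a one-element list, aux_tables=[table].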

Note that for keys, main_keys, aux_keys, cols and operations, an input of the form [["a"], ["b"], ["c", "d"]] is valid while ["a", "b", ["c", "d"]] is not.
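The difference between the two forms can be checked with a small helper. The function below is a hypothetical illustration of the "iterable of iterable of str" requirement, not the validation code used by skrub.

```python
def is_iterable_of_iterables_of_str(value):
    """True if every entry is a list/tuple whose items are all strings."""
    return all(
        isinstance(entry, (list, tuple))
        and all(isinstance(item, str) for item in entry)
        for entry in value
    )


valid = [["a"], ["b"], ["c", "d"]]
invalid = ["a", "b", ["c", "d"]]  # bare strings mixed with a list

print(is_iterable_of_iterables_of_str(valid))    # True
print(is_iterable_of_iterables_of_str(invalid))  # False
```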

Using a column from the first auxiliary table to join the second auxiliary table is not (yet) supported.

Accepts pandas.DataFrame and polars.DataFrame inputs.

Parameters:
aux_tables : iterable of DataFrameLike or “X”

Auxiliary dataframes to aggregate then join on the base table. The placeholder string “X” can be provided to perform self-aggregation on the input data. To provide a single auxiliary table, aux_tables = [table] is supported, but not aux_tables = table. It’s possible to provide both the placeholder “X” and auxiliary tables, as in aux_tables = [table, "X"]. If that’s the case, the second table will be replaced by the input data.

keys : iterable of iterable of str, default=None

The column names to use for both main_keys and aux_keys when they are the same. Provide either keys or both main_keys and aux_keys. If an entry in keys contains multiple columns, a multi-column join is performed.

All keys must be present in the main and auxiliary tables before fit. It’s not (yet) possible to use columns from the first joined table to join the second.

If not None, there must be an iterable of keys for each table in aux_tables.

main_keys : iterable of iterable of str, default=None

Select the columns from the main table to use as keys during the join operation. If an entry in main_keys contains multiple columns, a multi-column join is performed.

If not None, there must be an iterable of main_keys for each table in aux_tables.

aux_keys : iterable of iterable of str, default=None

Select the columns from the auxiliary dataframes to use as keys during the join operation. If an entry in aux_keys contains multiple columns, a multi-column join is performed.

All aux_keys must be present in respective aux_tables before fit. It’s not (yet) possible to use columns from the first joined table to join the second.

If not None, there must be an iterable of aux_keys for each table in aux_tables.

cols : iterable of iterable of str, default=None

Select the columns from the auxiliary dataframes to use as values during the aggregation operations.

If not None, there must be an iterable of cols for each table in aux_tables.

If set to None, cols is set to a list of lists. For each table in aux_tables, the corresponding list will be all columns of that table, except the aux_keys associated with that table.

operations : iterable of iterable of str, default=None

Aggregation operations to perform on the auxiliary tables.

If not None, there must be an iterable of operations for each table in aux_tables.

  • numerical : {“sum”, “mean”, “std”, “min”, “max”, “hist”, “value_counts”}. “hist” and “value_counts” accept an integer argument to parametrize the binning.

  • categorical : {“mode”, “count”, “value_counts”}

  • If set to None (the default), [“mean”, “mode”] will be used for all auxiliary tables.

suffixes : iterable of str, default=None

Suffixes to append to the aux_tables’ column names. If set to None, the table indexes in aux_tables are used, e.g. for an aggregation of 2 aux_tables, “_0” and “_1” would be appended to column names.
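The default suffixes can be sketched as below. Both helpers are hypothetical. The "_0", "_1" default comes from the description above, while the column_operation pattern for output names is only inferred from names such as value_mean_glucose in the example on this page and should be treated as an assumption.

```python
def default_suffixes(n_tables):
    """One suffix per auxiliary table, based on its position in aux_tables."""
    return [f"_{i}" for i in range(n_tables)]


def output_name(col, operation, suffix):
    """Assumed naming pattern: <column>_<operation><suffix>."""
    return f"{col}_{operation}{suffix}"


print(default_suffixes(2))
# ['_0', '_1']
print(output_name("value", "mean", "_glucose"))
# value_mean_glucose
print(output_name("days_of_stay", "max", default_suffixes(2)[0]))
# days_of_stay_max_0
```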

See also

AggJoiner

Aggregate an auxiliary dataframe before joining it on a base dataframe.

Examples

>>> import pandas as pd
>>> from skrub import MultiAggJoiner
>>> patients = pd.DataFrame({
...    "patient_id": [1, 2],
...    "age": ["72", "45"],
... })
>>> hospitalizations = pd.DataFrame({
...    "visit_id": range(1, 7),
...    "patient_id": [1, 1, 1, 1, 2, 2],
...    "days_of_stay": [2, 4, 1, 1, 3, 12],
...    "hospital": ["Cochin", "Bichat", "Cochin", "Necker", "Bichat", "Bichat"],
... })
>>> medications = pd.DataFrame({
...    "medication_id": range(1, 6),
...    "patient_id": [1, 1, 1, 1, 2],
...    "medication": ["ozempic", "ozempic", "electrolytes", "ozempic", "morphine"],
... })
>>> glucose = pd.DataFrame({
...    "biology_id": range(1, 7),
...    "patientID": [1, 1, 1, 1, 2, 2],
...    "value": [1.4, 3.4, 1.0, 0.8, 3.1, 6.5],
... })
>>> multi_agg_joiner = MultiAggJoiner(
...    aux_tables=[hospitalizations, medications, glucose],
...    main_keys=[["patient_id"], ["patient_id"], ["patient_id"]],
...    aux_keys=[["patient_id"], ["patient_id"], ["patientID"]],
...    cols=[["days_of_stay"], ["medication"], ["value"]],
...    operations=[["max"], ["mode"], ["mean", "std"]],
...    suffixes=["", "", "_glucose"],
... )
>>> multi_agg_joiner.fit_transform(patients)
   patient_id  age  ...  value_mean_glucose  value_std_glucose
0           1   72  ...                1.65           1.193035
1           2   45  ...                4.80           2.404163

The MultiAggJoiner makes it convenient to aggregate multiple tables, but the same results can be obtained by chaining three separate AggJoiner instances:

>>> from skrub import AggJoiner
>>> from sklearn.pipeline import make_pipeline
>>> agg_joiner_1 = AggJoiner(
...    aux_table=hospitalizations,
...    key="patient_id",
...    cols="days_of_stay",
...    operations="max",
... )
>>> agg_joiner_2 = AggJoiner(
...    aux_table=medications,
...    key="patient_id",
...    cols="medication",
...    operations="mode",
... )
>>> agg_joiner_3 = AggJoiner(
...    aux_table=glucose,
...    main_key="patient_id",
...    aux_key="patientID",
...    cols="value",
...    operations=["mean", "std"],
...    suffix="_glucose",
... )
>>> pipeline = make_pipeline(agg_joiner_1, agg_joiner_2, agg_joiner_3)
>>> pipeline.fit_transform(patients)
   patient_id  age  ...  value_mean_glucose  value_std_glucose
0           1   72  ...                1.65           1.193035
1           2   45  ...                4.80           2.404163
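For intuition, the glucose part of the example can be reproduced with plain pandas as an aggregation followed by a left join. This is a sketch of the idea under the assumption that the operations map to the pandas aggregations of the same name; it is not how skrub implements the joiner.

```python
import pandas as pd

patients = pd.DataFrame({"patient_id": [1, 2], "age": ["72", "45"]})
glucose = pd.DataFrame({
    "biology_id": range(1, 7),
    "patientID": [1, 1, 1, 1, 2, 2],
    "value": [1.4, 3.4, 1.0, 0.8, 3.1, 6.5],
})

# Step 1: aggregate the auxiliary table per key value.
glucose_agg = (
    glucose.groupby("patientID")["value"]
    .agg(["mean", "std"])
    .add_prefix("value_")
    .add_suffix("_glucose")
    .reset_index()
)

# Step 2: left-join the aggregated table on the main table,
# then drop the duplicated key column coming from the auxiliary side.
result = patients.merge(
    glucose_agg, how="left", left_on="patient_id", right_on="patientID"
).drop(columns="patientID")
print(result)
```

The resulting value_mean_glucose and value_std_glucose columns match the values shown in the outputs above (1.65 and 1.193035 for patient 1, 4.80 and 2.404163 for patient 2).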

Methods

fit(X[, y])

Aggregate auxiliary tables based on the main keys.

fit_transform(X[, y])

Aggregate auxiliary tables based on the main keys.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Left-join pre-aggregated tables on X.

fit(X, y=None)

Aggregate auxiliary tables based on the main keys.

Parameters:
X : DataFrameLike

Input data: the base table on which the auxiliary tables are left-joined.

y : None

Unused, only here for compatibility.

Returns:
MultiAggJoiner

Fitted MultiAggJoiner instance (self).

fit_transform(X, y=None)

Aggregate auxiliary tables based on the main keys.

Parameters:
X : DataFrameLike

Input data: the base table on which the auxiliary tables are left-joined.

y : None

Unused, only here for compatibility.

Returns:
DataFrame

The augmented input.

get_metadata_routing()

Get metadata routing of this object.

See the User Guide for details on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:
transform : {“default”, “pandas”, “polars”}, default=None

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • “polars”: Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:
self : estimator instance

Estimator instance.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.
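The <component>__<parameter> form can be illustrated with a small scikit-learn pipeline. The step name "scaler" and the choice of StandardScaler are arbitrary and used only for this example; the same mechanism applies to any estimator nested in a Pipeline.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([("scaler", StandardScaler())])

# "scaler" is the component name, "with_mean" is one of its parameters.
pipeline.set_params(scaler__with_mean=False)

print(pipeline.get_params()["scaler__with_mean"])
# False
```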

transform(X)

Left-join pre-aggregated tables on X.

Parameters:
X : DataFrameLike

The input data to transform.

Returns:
DataFrame

The augmented input.
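Left-join semantics mean that every row of X is kept, and rows without a match in a pre-aggregated table receive missing values. The pandas sketch below shows this behaviour for a generic left join; the tables are made up for the example, and the exact handling of unmatched rows by skrub is assumed to follow the same convention.

```python
import pandas as pd

X = pd.DataFrame({"patient_id": [1, 2, 3]})

# Stand-in for an auxiliary table that was aggregated during fit.
pre_aggregated = pd.DataFrame({
    "patient_id": [1, 2],
    "days_of_stay_max": [4, 12],
})

# how="left" keeps all rows of X, including patient 3 who has no match.
augmented = X.merge(pre_aggregated, how="left", on="patient_id")
print(augmented)
```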