AggJoiner
Usage examples at the bottom of this page.
- class skrub.AggJoiner(aux_table, *, key=None, main_key=None, aux_key=None, cols=None, operations=None, suffix='')
Aggregate an auxiliary dataframe before joining it on a base dataframe.
Numerical and categorical aggregation operations are applied to the columns to aggregate (cols), with the columns for each operation selected by dtype. See the operations parameter for the list of supported operations.
If cols is not provided, cols are all columns from aux_table, except aux_key.
Accepts pandas.DataFrame and polars.DataFrame inputs.
- Parameters:
- aux_table : DataFrameLike or “X”
Auxiliary dataframe to aggregate then join on the base table. The placeholder string “X” can be provided to perform self-aggregation on the input data.
- key : str, default=None
The column name to use for both main_key and aux_key when they are the same. Provide either key or both main_key and aux_key. If key is an iterable, a multi-column join is performed.
- main_key : str or iterable of str, default=None
Select the columns from the main table to use as keys during the join operation. If main_key is an iterable, a multi-column join is performed.
- aux_key : str or iterable of str, default=None
Select the columns from the auxiliary dataframe to use as keys during the join operation. If aux_key is an iterable, a multi-column join is performed.
- cols : str or iterable of str, default=None
Select the columns from the auxiliary dataframe to use as values during the aggregation operations. If set to None, cols are all columns from aux_table, except aux_key.
- operations : str or iterable of str, default=None
Aggregation operations to perform on the auxiliary table.
numerical : {“sum”, “mean”, “std”, “min”, “max”, “hist”, “value_counts”}
“hist” and “value_counts” accept an integer argument to parametrize the binning.
categorical : {“mode”, “count”, “value_counts”}
If set to None (the default), [“mean”, “mode”] will be used.
- suffix : str, default=""
Suffix to append to the aux_table’s column names. You can use it to avoid duplicate column names in the join.
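The self-aggregation mode selected with the “X” placeholder can be pictured as aggregating the input table on its own key and joining the result back onto every row. The following is a minimal plain-Python sketch of that idea, not skrub's implementation; the flights table is hypothetical, and the output column name total_passengers_mean is assumed from the pattern shown in the Examples section.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical input table: one row per flight.
flights = [
    {"from_airport": 1, "total_passengers": 90},
    {"from_airport": 1, "total_passengers": 120},
    {"from_airport": 2, "total_passengers": 70},
    {"from_airport": 2, "total_passengers": 90},
]

# Aggregate the table on its own key ...
passengers_by_airport = defaultdict(list)
for row in flights:
    passengers_by_airport[row["from_airport"]].append(row["total_passengers"])
mean_by_airport = {key: mean(values) for key, values in passengers_by_airport.items()}

# ... then join the aggregate back onto every row of the same table.
augmented = [
    {**row, "total_passengers_mean": mean_by_airport[row["from_airport"]]}
    for row in flights
]
```

Each row keeps its original columns and gains the statistic of its group, so the number of rows is unchanged.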
See also
AggTarget
Aggregates the target y before joining its aggregation on the base dataframe.
Joiner
Augments a main table by automatically joining an auxiliary table on it.
MultiAggJoiner
Extension of the AggJoiner to multiple auxiliary tables.
Examples
>>> import pandas as pd
>>> from skrub import AggJoiner
>>> main = pd.DataFrame({
...     "airportId": [1, 2],
...     "airportName": ["Paris CDG", "NY JFK"],
... })
>>> aux = pd.DataFrame({
...     "flightId": range(1, 7),
...     "from_airport": [1, 1, 1, 2, 2, 2],
...     "total_passengers": [90, 120, 100, 70, 80, 90],
...     "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
... })
>>> agg_joiner = AggJoiner(
...     aux_table=aux,
...     main_key="airportId",
...     aux_key="from_airport",
...     cols=["total_passengers", "company"],
...     operations=["mean", "mode"],
... )
>>> agg_joiner.fit_transform(main)
   airportId airportName company_mode  total_passengers_mean
0          1   Paris CDG           AF              103.33...
1          2      NY JFK           DL               80.00...
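To make the aggregate-then-join logic concrete, the values in the example above can be recomputed by hand with the standard library only. This is a sketch of the semantics under simplifying assumptions (tables as lists of dicts, one key column, the flightId column left out because it is not in cols), not how skrub computes them.

```python
from collections import defaultdict
from statistics import mean, mode

main = [
    {"airportId": 1, "airportName": "Paris CDG"},
    {"airportId": 2, "airportName": "NY JFK"},
]
aux = [
    {"from_airport": 1, "total_passengers": 90, "company": "DL"},
    {"from_airport": 1, "total_passengers": 120, "company": "AF"},
    {"from_airport": 1, "total_passengers": 100, "company": "AF"},
    {"from_airport": 2, "total_passengers": 70, "company": "DL"},
    {"from_airport": 2, "total_passengers": 80, "company": "DL"},
    {"from_airport": 2, "total_passengers": 90, "company": "TR"},
]

# Step 1: group the auxiliary rows by the join key.
groups = defaultdict(list)
for row in aux:
    groups[row["from_airport"]].append(row)

# Step 2: aggregate each group (mean for the numerical column,
# mode for the categorical one).
aggregated = {
    key: {
        "company_mode": mode([r["company"] for r in rows]),
        "total_passengers_mean": mean([r["total_passengers"] for r in rows]),
    }
    for key, rows in groups.items()
}

# Step 3: left-join the aggregated values onto the main table.
# Every main row is kept; a key absent from aux adds no columns here.
joined = [{**row, **aggregated.get(row["airportId"], {})} for row in main]
```

The auxiliary table has several rows per airport, so joining it directly would duplicate main rows; aggregating first guarantees one row per key before the join.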
Methods
- fit(X[, y]): Aggregate auxiliary table based on the main keys.
- fit_transform(X[, y]): Aggregate auxiliary table based on the main keys.
- get_metadata_routing(): Get metadata routing of this object.
- get_params([deep]): Get parameters for this estimator.
- set_output(*[, transform]): Set output container.
- set_params(**params): Set the parameters of this estimator.
- transform(X): Left-join pre-aggregated table on X.
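The split between fit (aggregate the auxiliary table) and transform (left-join the pre-aggregated table on X) can be sketched with a toy class. TinyAggJoiner, its constructor arguments, and the aggregated_ attribute are hypothetical names used for illustration only; the sketch supports a single column and the mean operation, and marks unmatched keys with None.

```python
from collections import defaultdict
from statistics import mean


class TinyAggJoiner:
    """Hypothetical stand-in for illustration; not skrub's implementation."""

    def __init__(self, aux_rows, main_key, aux_key, col):
        self.aux_rows = aux_rows
        self.main_key = main_key
        self.aux_key = aux_key
        self.col = col

    def fit(self, X, y=None):
        # Aggregate the auxiliary rows once and store the result.
        groups = defaultdict(list)
        for row in self.aux_rows:
            groups[row[self.aux_key]].append(row[self.col])
        self.aggregated_ = {key: mean(values) for key, values in groups.items()}
        return self

    def transform(self, X):
        # Left join: every row of X is kept; unmatched keys get None.
        out_col = f"{self.col}_mean"
        return [
            {**row, out_col: self.aggregated_.get(row[self.main_key])}
            for row in X
        ]

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)


aux = [
    {"from_airport": 1, "total_passengers": 90},
    {"from_airport": 1, "total_passengers": 110},
    {"from_airport": 2, "total_passengers": 70},
]
train = [{"airportId": 1}, {"airportId": 2}]
new = [{"airportId": 2}, {"airportId": 3}]

joiner = TinyAggJoiner(
    aux, main_key="airportId", aux_key="from_airport", col="total_passengers"
)
train_out = joiner.fit_transform(train)
new_out = joiner.transform(new)  # reuses the aggregation stored by fit
```

Because the aggregation is stored at fit time, transform can be applied to new rows without recomputing it.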
- fit_transform(X, y=None)
Aggregate auxiliary table based on the main keys.
- Parameters:
- X : DataFrameLike
Input data, the base table on which to left-join the auxiliary table.
- y : None
Unused, only here for compatibility.
- Returns:
- DataFrame
The augmented input.
- get_metadata_routing()
Get metadata routing of this object.
See the User Guide for how the routing mechanism works.
- Returns:
- routing : MetadataRequest
A MetadataRequest encapsulating routing information.
- set_output(*, transform=None)
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
- transform : {“default”, “pandas”, “polars”}, default=None
Configure output of transform and fit_transform.
“default”: Default output format of a transformer
“pandas”: DataFrame output
“polars”: Polars output
None: Transform configuration is unchanged
Added in version 1.4: “polars” option was added.
- Returns:
- self : estimator instance
Estimator instance.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
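The <component>__<parameter> naming used for nested objects can be illustrated with a small sketch that separates an estimator's own parameters from those addressed to a component. This is a simplified illustration of the naming convention, not scikit-learn's implementation; the helper split_nested_params and the component name joiner are hypothetical.

```python
def split_nested_params(params):
    """Group ``<component>__<parameter>`` keys by component name."""
    own, nested = {}, {}
    for name, value in params.items():
        # Split on the first double underscore only.
        component, sep, sub_name = name.partition("__")
        if sep:
            nested.setdefault(component, {})[sub_name] = value
        else:
            own[name] = value
    return own, nested


own, nested = split_nested_params(
    {
        "suffix": "_agg",
        "joiner__operations": ["mean"],
        "joiner__key": "airportId",
    }
)
```

Here own holds the parameters set directly on the estimator, while nested maps each component name to the parameters that would be forwarded to it.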