skrub.AggJoiner
Usage examples at the bottom of this page.
- class skrub.AggJoiner(aux_table, *, aux_key, main_key, cols=None, operation=None, suffix=None)
Aggregate auxiliary dataframes before joining them on a base dataframe.
Apply numerical and categorical aggregation operations to the selected columns, with the operation chosen according to each column's dtype. See the operation parameter for the list of supported operations.
The grouping columns used during the aggregation are the columns used as keys for joining.
Accepts pandas.DataFrame and polars.DataFrame inputs.
- Parameters:
- aux_table : DataFrameLike or str or iterable
Auxiliary dataframe to aggregate then join on the base table. The placeholder string "X" can be provided to perform self-aggregation on the input data.
- aux_key : str, or iterable of str, or iterable of iterable of str
Select the columns from the auxiliary dataframe to use as keys during the join operation.
- main_key : str or iterable of str
Select the columns from the main table to use as keys during the join operation. If main_key is a list, a multi-column join is performed.
- cols : str, or iterable of str, or iterable of iterable of str, default=None
Select the columns from the auxiliary dataframe to use as values during the aggregation operations. If None, cols defaults to all columns of aux_table except aux_key.
- operation : str or iterable of str, default=None
Aggregation operations to perform on the auxiliary table.
- numerical : {"sum", "mean", "std", "min", "max", "hist", "value_counts"}
"hist" and "value_counts" accept an integer argument to parametrize the binning.
- categorical : {"mode", "count", "value_counts"}
If set to None (the default), ["mean", "mode"] will be used.
- suffix : str or iterable of str, default=None
The suffixes added to each table's columns in case of duplicate column names. If set to None, the index of each table in aux_table is used, e.g. for a duplicated column: price (main table), price_1 (auxiliary table 1), price_2 (auxiliary table 2), etc.
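The default naming described for suffix can be illustrated with a small sketch. Both helper functions are hypothetical and not part of skrub; the `<col>_<operation><suffix>` pattern is an assumption read off the column names shown in the Examples section of this page (company_mode_1, total_passengers_mean_1), so treat it as illustrative rather than as the library's implementation.

```python
def default_suffixes(n_aux_tables):
    """Assumed default: the 1-based position of each auxiliary table."""
    return [f"_{i}" for i in range(1, n_aux_tables + 1)]


def output_column_name(col, operation, suffix):
    """Assumed naming pattern: <col>_<operation><suffix>."""
    return f"{col}_{operation}{suffix}"


# Two auxiliary tables that both contribute a "total_passengers" mean.
suffixes = default_suffixes(2)
names = [output_column_name("total_passengers", "mean", s) for s in suffixes]
print(names)
```

Passing explicit suffixes would replace the positional defaults, which can make the origin of each generated column easier to read.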
Examples
>>> import pandas as pd
>>> from skrub import AggJoiner
>>> main = pd.DataFrame({
...     "airportId": [1, 2],
...     "airportName": ["Paris CDG", "NY JFK"],
... })
>>> aux = pd.DataFrame({
...     "flightId": range(1, 7),
...     "from_airport": [1, 1, 1, 2, 2, 2],
...     "total_passengers": [90, 120, 100, 70, 80, 90],
...     "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
... })
>>> join_agg = AggJoiner(
...     aux_table=aux,
...     aux_key="from_airport",
...     main_key="airportId",
...     cols=["total_passengers", "company"],
...     operation=["mean", "mode"],
... )
>>> join_agg.fit_transform(main)
   airportId airportName company_mode_1  total_passengers_mean_1
0          1   Paris CDG             AF               103.33...
1          2      NY JFK             DL                80.00...
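To make the aggregate-then-join semantics concrete, here is a minimal pure-Python sketch that reproduces the numbers of the example above without skrub or pandas. The helper aggregate_then_join is hypothetical, and the output column names are hard-coded to match the example; this is not how skrub implements the operation.

```python
from collections import defaultdict
from statistics import mean, mode

# Same toy data as the example above, as plain rows.
main = [
    {"airportId": 1, "airportName": "Paris CDG"},
    {"airportId": 2, "airportName": "NY JFK"},
]
aux = [
    {"from_airport": a, "total_passengers": p, "company": c}
    for a, p, c in zip(
        [1, 1, 1, 2, 2, 2],
        [90, 120, 100, 70, 80, 90],
        ["DL", "AF", "AF", "DL", "DL", "TR"],
    )
]


def aggregate_then_join(main, aux, main_key, aux_key):
    """Hypothetical helper: group aux by key, aggregate, left-join on main."""
    # Step 1: group the auxiliary rows by the join key.
    groups = defaultdict(list)
    for row in aux:
        groups[row[aux_key]].append(row)

    # Step 2: one aggregated row per key (mode for categorical, mean for numerical).
    aggregated = {
        key: {
            "company_mode_1": mode([r["company"] for r in rows]),
            "total_passengers_mean_1": mean([r["total_passengers"] for r in rows]),
        }
        for key, rows in groups.items()
    }

    # Step 3: left join, so main rows without a match keep empty values.
    empty = {"company_mode_1": None, "total_passengers_mean_1": None}
    return [{**row, **aggregated.get(row[main_key], empty)} for row in main]


result = aggregate_then_join(main, aux, "airportId", "from_airport")
for row in result:
    print(row)
```

Because the join is a left join, the result always has exactly one row per row of the main table, which is what keeps the transformer safe to use inside a pipeline.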
Methods
- check_input(X): Perform a check on column names data type and suffixes.
- fit(X[, y]): Aggregate auxiliary tables based on the main keys.
- fit_transform(X[, y]): Fit to data, then transform it.
- get_metadata_routing(): Get metadata routing of this object.
- get_params([deep]): Get parameters for this estimator.
- set_output(*[, transform]): Set output container.
- set_params(**params): Set the parameters of this estimator.
- transform(X): Left-join pre-aggregated tables on X.
- check_input(X)
Perform a check on column names data type and suffixes.
- Parameters:
- X : DataFrameLike
The raw input to check.
- fit(X, y=None)
Aggregate auxiliary tables based on the main keys.
- Parameters:
- X : DataFrameLike
Input data, the base table on which to left join the auxiliary tables.
- y : array_like of shape (n_samples,), default=None
Prediction target. Used to compute correlations between the generated covariates and the target for screening purposes.
- Returns:
- AggJoiner
Fitted AggJoiner instance (self).
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
- X : array_like of shape (n_samples, n_features)
Input samples.
- y : array_like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params : dict
Additional fit parameters.
- Returns:
- get_metadata_routing()
Get metadata routing of this object.
See the scikit-learn User Guide for how the routing mechanism works.
- Returns:
- routing : MetadataRequest
A MetadataRequest encapsulating routing information.
- set_output(*, transform=None)
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
- transform : {"default", "pandas"}, default=None
Configure output of transform and fit_transform.
- "default": Default output format of a transformer
- "pandas": DataFrame output
- None: Transform configuration is unchanged
- Returns:
- self : estimator instance
Estimator instance.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
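The <component>__<parameter> convention mentioned for set_params can be illustrated with a short sketch of how such a name splits into its two parts. The helper split_param_name and the step name "aggjoiner" are hypothetical, chosen only for illustration; the sketch shows the naming convention, not the estimator's actual code.

```python
def split_param_name(name):
    """Split 'component__parameter' at the first double underscore.

    Returns (None, name) for a plain, non-nested parameter name.
    """
    component, sep, parameter = name.partition("__")
    if not sep:
        return None, name
    return component, parameter


# A nested name addresses the `operation` parameter of a pipeline step
# assumed to be called "aggjoiner".
print(split_param_name("aggjoiner__operation"))
# A plain name addresses a parameter of the estimator itself.
print(split_param_name("suffix"))
```

Splitting at the first double underscore means deeper nesting keeps working: the remainder can itself contain another <component>__<parameter> pair to be resolved by the inner object.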