skrub.AggTarget
Usage examples at the bottom of this page.
- class skrub.AggTarget(main_key, operation=None, suffix=None)
Aggregate a target y before joining its aggregation on a base dataframe.
Accepts pandas.DataFrame or polars.DataFrame inputs.
- Parameters:
- main_key : str or iterable of str
Select the columns from the main table to use as keys during the aggregation of the target and during the join operation.
If main_key refers to a single column, a single aggregation for this key will be generated and a single join will be performed.
Otherwise, if main_key is a list of keys, the target will be aggregated using each key separately, then each aggregation of the target will be joined on the main table.
- operation : str or iterable of str, optional
Aggregation operations to perform on the auxiliary table.
numerical : {"sum", "mean", "std", "min", "max", "hist(3)", "value_counts"}
'hist' and 'value_counts' accept an integer argument to parametrize the binning.
categorical : {"mode", "count", "value_counts"}
If set to None (the default), ['mean', 'mode'] will be used.
- suffix : str, optional
The suffix to append to the columns of the target table if the join results in duplicate columns. If set to None, "_target" is used.
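The behaviour described for a list-valued main_key can be approximated in plain pandas. This is a minimal sketch of the idea (aggregate the target once per key, then left-join each aggregate onto the main table), not skrub's implementation; the output column names below are hypothetical and chosen only to keep the two keys distinguishable.

```python
import numpy as np
import pandas as pd

# Toy main table and target, mirroring the class example.
X = pd.DataFrame({
    "flightId": range(1, 7),
    "from_airport": [1, 1, 1, 2, 2, 2],
    "total_passengers": [90, 120, 100, 70, 80, 90],
    "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
})
y = np.array([1, 1, 0, 0, 1, 1])

main_key = ["company", "from_airport"]  # several keys, handled one at a time
suffix = "_target"

result = X.copy()
for key in main_key:
    # Aggregate the target separately for this key.
    agg = (
        pd.DataFrame({key: X[key], "y": y})
        .groupby(key)["y"]
        .mean()
        .rename(f"y_mean_by_{key}{suffix}")  # hypothetical column name
        .reset_index()
    )
    # Left-join the aggregate back onto the main table.
    result = result.merge(agg, on=key, how="left")

print(result.filter(like=suffix))
```

Each key contributes its own aggregated column, so the result has one extra column per key and per operation.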
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from skrub import AggTarget
>>> X = pd.DataFrame({
...     "flightId": range(1, 7),
...     "from_airport": [1, 1, 1, 2, 2, 2],
...     "total_passengers": [90, 120, 100, 70, 80, 90],
...     "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
... })
>>> y = np.array([1, 1, 0, 0, 1, 1])
>>> agg_target = AggTarget(
...     main_key="company",
...     operation=["mean", "max"],
... )
>>> agg_target.fit_transform(X, y)
   flightId  from_airport  ...  y_0_max_target  y_0_mean_target
0         1             1  ...               1         0.666667
1         2             1  ...               1         0.500000
2         3             1  ...               1         0.500000
3         4             2  ...               1         0.666667
4         5             2  ...               1         0.666667
5         6             2  ...               1         1.000000

[6 rows x 6 columns]
Methods
- check_input(X, y): Perform a check on column names data type and suffixes.
- fit(X, y): Aggregate the target y based on keys from X.
- fit_transform(X[, y]): Fit to data, then transform it.
- get_metadata_routing(): Get metadata routing of this object.
- get_params([deep]): Get parameters for this estimator.
- set_output(*[, transform]): Set output container.
- set_params(**params): Set the parameters of this estimator.
- transform(X): Left-join pre-aggregated tables on X.
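The split between fit (aggregate the target) and transform (left-join the stored aggregates) can be illustrated with a rough pandas sketch. It assumes a single key and a single "mean" operation, and the column name y_mean_target is hypothetical; it is not how skrub implements these methods.

```python
import numpy as np
import pandas as pd

X_train = pd.DataFrame({"company": ["DL", "AF", "AF", "DL"]})
y_train = np.array([1, 1, 0, 0])

# "fit" step: pre-aggregate the target by key and keep the result.
y_mean = (
    pd.DataFrame({"company": X_train["company"], "y": y_train})
    .groupby("company")["y"]
    .mean()
    .rename("y_mean_target")  # hypothetical column name
    .reset_index()
)

# "transform" step: left-join the stored aggregate onto new rows.
X_new = pd.DataFrame({"company": ["AF", "TR"]})
X_out = X_new.merge(y_mean, on="company", how="left")
print(X_out)
```

In this sketch, a key that was not seen during the aggregation step ("TR") keeps its row, because the join is a left join, and receives a missing value in the aggregated column.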
- check_input(X, y)
Perform a check on column names data type and suffixes.
- Parameters:
- X : DataFrameLike
The raw input to check.
- y : DataFrameLike or SeriesLike or ArrayLike
The raw target to check.
- Returns:
- y_ : DataFrameLike
Transformation of the target.
- fit(X, y)
Aggregate the target y based on keys from X.
- Parameters:
- X : DataFrameLike
Must contain the column names defined in main_key.
- y : DataFrameLike or SeriesLike or ArrayLike
The length of y must match the length of X, with matching indices. The target can be continuous or discrete, with multiple columns.
If the target is continuous, only numerical operations, listed in num_operations, can be applied.
If the target is discrete, only categorical operations, listed in categ_operations, can be applied.
Note that the target type is determined by sklearn.utils.multiclass.type_of_target().
- Returns:
- AggTarget
Fitted AggTarget instance (self).
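To see how a target would be classified before choosing operations, type_of_target can be called directly. The sketch below only shows what scikit-learn reports for two toy targets; how AggTarget maps each reported type to "continuous" or "discrete" is not covered here.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

y_discrete = np.array([1, 1, 0, 0, 1, 1])       # two integer classes
y_continuous = np.array([0.3, 1.7, 2.5, 0.9])   # non-integer floats

discrete_kind = type_of_target(y_discrete)
continuous_kind = type_of_target(y_continuous)

print(discrete_kind)    # binary
print(continuous_kind)  # continuous
```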
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
- X : array_like of shape (n_samples, n_features)
Input samples.
- y : array_like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params : dict
Additional fit parameters.
- Returns:
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
- routing : MetadataRequest
A MetadataRequest encapsulating routing information.
- set_output(*, transform=None)
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
- transform : {"default", "pandas"}, default=None
Configure output of transform and fit_transform.
"default": Default output format of a transformer
"pandas": DataFrame output
None: Transform configuration is unchanged
- Returns:
- self : estimator instance
Estimator instance.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
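The <component>__<parameter> convention can be sketched with a small scikit-learn Pipeline. The step names "scale" and "clf" are arbitrary labels chosen for this example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Address nested parameters as <step name>__<parameter name>.
pipe.set_params(clf__C=0.5, scale__with_mean=False)

params = pipe.get_params()
print(params["clf__C"], params["scale__with_mean"])
```

The same double-underscore names are reported by get_params, which makes it easy to check which parameters of a nested object can be set.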