.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/08_join_aggregation.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. or to run this example in your browser via JupyterLite or Binder .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_08_join_aggregation.py: AggJoiner on a credit fraud dataset =================================== Many problems involve tables whose entities have a one-to-many relationship. To simplify aggregate-then-join operations for machine learning, we can include the |AggJoiner| in our pipeline. In this example, we are tackling a fraudulent loan detection use case. Because fraud is rare, this dataset is extremely imbalanced, with a prevalence of around 1.4%. The data consists of two distinct entities: e-commerce "baskets", and "products". Baskets can be tagged fraudulent (1) or not (0), and are essentially a list of products of variable size. Each basket is linked to at least one products, e.g. basket 1 can have product 1 and 2. .. image:: ../../_static/08_example_data.png :width: 450 px | Our aim is to predict which baskets are fraudulent. The products dataframe can be joined on the baskets dataframe using the ``basket_ID`` column. Each product has several attributes: - a category (marked by the column ``"item"``), - a model (``"model"``), - a brand (``"make"``), - a merchant code (``"goods_code"``), - a price per unit (``"cash_price"``), - a quantity selected in the basket (``"Nbr_of_prod_purchas"``) .. |AggJoiner| replace:: :class:`~skrub.AggJoiner` .. |Joiner| replace:: :class:`~skrub.Joiner` .. |DropCols| replace:: :class:`~skrub.DropCols` .. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer` .. |TableReport| replace:: :class:`~skrub.TableReport` .. |MinHashEncoder| replace:: :class:`~skrub.MinHashEncoder` .. |TargetEncoder| replace:: :class:`~sklearn.preprocessing.TargetEncoder` .. |make_pipeline| replace:: :func:`~sklearn.pipeline.make_pipeline` .. |Pipeline| replace:: :class:`~sklearn.pipeline.Pipeline` .. |HGBC| replace:: :class:`~sklearn.ensemble.HistGradientBoostingClassifier` .. |OrdinalEncoder| replace:: :class:`~sklearn.preprocessing.OrdinalEncoder` .. |TunedThresholdClassifierCV| replace:: :class:`~sklearn.model_selection.TunedThresholdClassifierCV` .. |CalibrationDisplay| replace:: :class:`~sklearn.calibration.CalibrationDisplay` .. |pandas.melt| replace:: :func:`~pandas.melt` .. GENERATED FROM PYTHON SOURCE LINES 83-90 .. code-block:: Python from skrub import TableReport from skrub.datasets import fetch_credit_fraud bunch = fetch_credit_fraud() products, baskets = bunch.products, bunch.baskets TableReport(products) .. raw:: html

	basket_ID	item	cash_price	make	model	goods_code	Nbr_of_prod_purchas
1	51,113	COMPUTER PERIPHERALS ACCESSORIES	409	APPLE	APPLE WATCH SERIES 6 GPS 44MM SPACE GREY ALUMINIUM	239001518	1
9	41,798	COMPUTERS	1,187	APPLE	2020 APPLE MACBOOK PRO 13 TOUCH BAR M1 PROCESSOR 8	239246780	1
11	39,361	COMPUTERS	898	APPLE	2020 APPLE MACBOOK AIR 13 3 RETINA DISPLAY M1 PROC	239246776	1
15	38,615	COMPUTER PERIPHERALS ACCESSORIES	379	APPLE	APPLE WATCH SERIES 6 GPS 40MM BLUE ALUMINIUM CASE	239001540	1
16	70,262	COMPUTERS	1,899	APPLE	2021 APPLE MACBOOK PRO 14 M1 PRO PROCESSOR 16GB RA	240575990	1

163,352	42,613	BEDROOM FURNITURE	259	SILENTNIGHT	SILENTNIGHT SLEEP GENIUS FULL HEIGHT HEADBOARD DOU	236938439	1
163,353	42,613	OUTDOOR FURNITURE	949	LG OUTDOOR	LG OUTDOOR BERGEN 2-SEAT GARDEN SIDE TABLE RECLINI	239742814	1
163,354	43,567	COMPUTERS	1,099	APPLE	2021 APPLE IPAD PRO 12 9 M1 PROCESSOR IOS WI-FI 25	240040978	1
163,355	43,567	COMPUTERS	2,099	APPLE	2020 APPLE IMAC 27 ALL-IN-ONE INTEL CORE I7 8GB RA	238923518	1
163,356	68,268	TELEVISIONS HOME CINEMA	799	LG	LG OLED48A16LA 2021 OLED HDR 4K ULTRA HD SMART TV	239866717	1

Column	Column name	dtype	Null values	Unique values	Mean	Std	Min	Median	Max
0	basket_ID	Int64DType	0 (0.0%)	61241 (56.0%)	3.59e+04	2.24e+04	0	35,203	76,543
1	item	ObjectDType	0 (0.0%)	166 (0.2%)
2	cash_price	Int64DType	0 (0.0%)	1280 (1.2%)	672.	714.	0	499	18,349
3	make	ObjectDType	1273 (1.2%)	690 (0.6%)
4	model	ObjectDType	1273 (1.2%)	6477 (5.9%)
5	goods_code	ObjectDType	0 (0.0%)	10738 (9.8%)
6	Nbr_of_prod_purchas	Int64DType	0 (0.0%)	19 (< 0.1%)	1.05	0.426	1	1	40

Column 1	Column 2	Cramér's V	Pearson's Correlation
model	goods_code	0.637
item	model	0.467
item	make	0.466
item	goods_code	0.419
make	model	0.316
make	goods_code	0.256
item	cash_price	0.248
cash_price	model	0.242
cash_price	goods_code	0.220
cash_price	make	0.217
basket_ID	item	0.207
basket_ID	model	0.199
basket_ID	make	0.136
make	Nbr_of_prod_purchas	0.114
basket_ID	goods_code	0.111
item	Nbr_of_prod_purchas	0.108
basket_ID	cash_price	0.0764	0.159
cash_price	Nbr_of_prod_purchas	0.0531	0.0317
basket_ID	Nbr_of_prod_purchas	0.0485	-0.00555
goods_code	Nbr_of_prod_purchas	0.0430

Please enable javascript

The skrub table reports need javascript to display correctly. If you are displaying a report in a Jupyter notebook and you see this message, you may need to re-execute the cell or to trust the notebook (button on the top right or "File > Trust notebook").

.. GENERATED FROM PYTHON SOURCE LINES 91-93 .. code-block:: Python TableReport(baskets) .. raw:: html

	ID	fraud_flag
1	51,113	0
7	41,798	0
9	39,361	0
13	38,615	0
14	70,262	0

92,785	21,243	0
92,786	45,891	0
92,787	42,613	0
92,788	43,567	0
92,789	68,268	0

Column	Column name	dtype	Null values	Unique values	Mean	Std	Min	Median	Max
0	ID	Int64DType	0 (0.0%)	61241 (100.0%)	3.82e+04	2.21e+04	0	38,158	76,543
1	fraud_flag	Int64DType	0 (0.0%)	2 (< 0.1%)	0.0130	0.113	0	0	1

Column 1	Column 2	Cramér's V	Pearson's Correlation
ID	fraud_flag	0.0728	0.0444

Please enable javascript

.. GENERATED FROM PYTHON SOURCE LINES 94-107 Naive aggregation ----------------- Let's explore a naive solution first. .. note:: Click :ref:`here` to skip this section and see the AggJoiner in action! The first idea that comes to mind to merge these two tables is to aggregate the products attributes into lists, using their basket IDs. .. GENERATED FROM PYTHON SOURCE LINES 107-110 .. code-block:: Python products_grouped = products.groupby("basket_ID").agg(list) TableReport(products_grouped) .. raw:: html

basket_ID	item	cash_price	make	model	goods_code	Nbr_of_prod_purchas
0	['COMPUTERS', 'WARRANTY', 'FULFILMENT CHARGE']	[1249, 35, 11]	['APPLE', 'RETAILER', 'RETAILER']	['2021 APPLE IMAC 24 ALL-IN-ONE M1 PROCESSOR 8GB RAM', 'RETAILER', 'RETAILER']	['240040969', '236604727', 'FULFILMENT']	[1, 1, 1]
1	['OUTDOOR ACCESSORIES', 'OUTDOOR FURNITURE']	[679, 369]	['KETTLER', 'RETAILER']	['RETAILER', 'RETAILER']	['237874616', '238222170']	[1, 1]
2	['OUTDOOR FURNITURE', 'OUTDOOR FURNITURE']	[1879, 110]	['KETTLER', 'KETTLER']	['RETAILER', 'RETAILER']	['239482916', '235452317']	[1, 1]
4	['TELEPHONES, FAX MACHINES & TWO-WAY RADIOS', 'FULFILMENT CHARGE']	[999, 0]	['APPLE', 'RETAILER']	['APPLE IPHONE 12 PRO', 'RETAILER']	['239091969', 'FULFILMENT']	[1, 1]
5	['LIVING & DINING FURNITURE']	[749]	['RETAILER']	['RETAILER']	['238000174']	[1]

76,538	['HOT DRINK PREPARATION', 'BARWARE', 'KITCHEN SCALES MEASURES', 'KITCHEN SCALES MEASURES', 'WINDOW DRESSING', 'LIVING DINING FURNITURE', 'SERVICE', 'LIVING DINING FURNITURE']	[6, 5, 2, 1, 120, 1549, 0, 1349]	['RETAILER', 'RETAILER', 'ANYDAY RETAILER', 'ANYDAY RETAILER', 'RETAILER', 'RETAILER', 'RETAILER', 'RETAILER']	['RETAILER TEA STRAINER WITH STAND', 'RETAILER DOUBLE JIGGER', 'ANYDAY RETAILER PLASTIC MEASURING JUG 1', 'ANYDAY RETAILER PLASTIC MEASURING JUG 5', 'RETAILER RONA PAIR LINED EYELET CURTAIN', 'RETAILER BARBICAN LARGE 3 SEATER SOFA L', 'RETAILER', 'RETAILER BARBICAN MEDIUM 2 SEATER SOFA']	['231251059', '231034699', '236902782', '236902832', '237968549', '237013495', 'DMS4462', '237013514']	[1, 1, 1, 1, 1, 1, 1, 1]
76,539	['AUDIO ACCESSORIES', 'WARRANTY', 'HEALTH BEAUTY ELECTRICAL', 'WARRANTY']	[140, 20, 357, 15]	['APPLE', 'RETAILER', 'DYSON', 'RETAILER']	['2021 APPLE AIRPODS WITH MAGSAFE CHARGING CASE 3RD', 'RETAILER', 'DYSON CORRALE CORD-FREE HAIR STRAIGHTENERS', 'RETAILER']	['240575988', '236604738', '238602413', '237371145']	[1, 1, 1, 1]
76,540	['COMPUTER PERIPHERALS ACCESSORIES']	[399]	['APPLE']	['APPLE WATCH NIKE SERIES 7 GPS 45MM MIDNIGHT ALUMIN']	['240382077']	[1]
76,541	['BEDROOM FURNITURE', 'SERVICE', 'BEDROOM FURNITURE', 'BEDROOM FURNITURE', 'SERVICE', 'SERVICE', 'BED LINEN', 'BEDROOM FURNITURE']	[1519, 30, 279, 339, 30, 0, 26, 749]	['RETAILER', 'RETAILER', 'RETAILER', 'SILENTNIGHT', 'RETAILER', 'RETAILER', 'RETAILER', 'RETAILER']	['RETAILER LUXURY NATURAL COLLECTION BRIT', 'RETAILER', 'RETAILER CLASSIC ECO 800 POCKET SPRING', 'SILENTNIGHT NON SPRUNG 2 DRAWER DIVAN STORAGE BED', 'RETAILER', 'RETAILER', 'RETAILER NATURAL COTTON QUILTED MATTRES', 'RETAILER ROUEN OTTOMAN STORAGE UPHOLSTE']	['240361566', 'DMS22', '240108867', '236938413', 'DMS22', 'DMS4463', '231083318', '238088761']	[1, 1, 1, 1, 1, 1, 1, 1]
76,543	['COMPUTERS', 'FULFILMENT CHARGE']	[1649, 7]	['APPLE', 'RETAILER']	['2021 APPLE IMAC 24 ALL-IN-ONE M1 PROCESSOR 8GB RAM', 'RETAILER']	['240040968', 'FULFILMENT']	[1, 1]

Column	Column name	dtype	Unique values
0	item	ObjectDType	3181 (5.2%)
1	cash_price	ObjectDType	10884 (17.8%)
2	make	ObjectDType	2651 (4.3%)
3	model	ObjectDType	8182 (13.4%)
4	goods_code	ObjectDType	12554 (20.5%)
5	Nbr_of_prod_purchas	ObjectDType	596 (1.0%)

Column 1	Column 2	Cramér's V
item	make	0.566
model	goods_code	0.563
cash_price	goods_code	0.498
item	model	0.478
make	Nbr_of_prod_purchas	0.429
make	model	0.353
item	Nbr_of_prod_purchas	0.352
cash_price	model	0.334
item	cash_price	0.287
item	goods_code	0.285
make	goods_code	0.258
model	Nbr_of_prod_purchas	0.217
cash_price	make	0.195
goods_code	Nbr_of_prod_purchas	0.158
cash_price	Nbr_of_prod_purchas	0.155

Please enable javascript

.. GENERATED FROM PYTHON SOURCE LINES 111-114 Then, we can expand all lists into columns, as if we were "flattening" the dataframe. We end up with a products dataframe ready to be joined on the baskets dataframe, using ``"basket_ID"`` as the join key. .. GENERATED FROM PYTHON SOURCE LINES 114-124 .. code-block:: Python import pandas as pd products_flatten = [] for col in products_grouped.columns: cols = [f"{col}{idx}" for idx in range(24)] products_flatten.append(pd.DataFrame(products_grouped[col].to_list(), columns=cols)) products_flatten = pd.concat(products_flatten, axis=1) products_flatten.insert(0, "basket_ID", products_grouped.index) TableReport(products_flatten) .. raw:: html

Column	Column name	dtype	Null values	Unique values	Mean	Std	Min	Median	Max
0	basket_ID	Int64DType	0 (0.0%)	61241 (100.0%)	3.82e+04	2.21e+04	0	38,158	76,543
1	item0	ObjectDType	0 (0.0%)	133 (0.2%)
2	item1	ObjectDType	30267 (49.4%)	130 (0.2%)
3	item2	ObjectDType	52467 (85.7%)	121 (0.2%)
4	item3	ObjectDType	58223 (95.1%)	119 (0.2%)
5	item4	ObjectDType	59813 (97.7%)	98 (0.2%)
6	item5	ObjectDType	60366 (98.6%)	91 (0.1%)
7	item6	ObjectDType	60632 (99.0%)	81 (0.1%)
8	item7	ObjectDType	60778 (99.2%)	76 (0.1%)
9	item8	ObjectDType	60874 (99.4%)	71 (0.1%)
10	item9	ObjectDType	60939 (99.5%)	58 (< 0.1%)
11	item10	ObjectDType	60995 (99.6%)	61 (< 0.1%)
12	item11	ObjectDType	61034 (99.7%)	49 (< 0.1%)
13	item12	ObjectDType	61074 (99.7%)	47 (< 0.1%)
14	item13	ObjectDType	61105 (99.8%)	47 (< 0.1%)
15	item14	ObjectDType	61123 (99.8%)	44 (< 0.1%)
16	item15	ObjectDType	61143 (99.8%)	36 (< 0.1%)
17	item16	ObjectDType	61160 (99.9%)	35 (< 0.1%)
18	item17	ObjectDType	61174 (99.9%)	29 (< 0.1%)
19	item18	ObjectDType	61187 (99.9%)	24 (< 0.1%)
20	item19	ObjectDType	61194 (99.9%)	20 (< 0.1%)
21	item20	ObjectDType	61203 (99.9%)	22 (< 0.1%)
22	item21	ObjectDType	61210 (99.9%)	15 (< 0.1%)
23	item22	ObjectDType	61219 (100.0%)	15 (< 0.1%)
24	item23	ObjectDType	61224 (100.0%)	11 (< 0.1%)
25	cash_price0	Int64DType	0 (0.0%)	1102 (1.8%)	1.06e+03	679.	2	949	18,349
26	cash_price1	Float64DType	30267 (49.4%)	689 (1.1%)	180.	379.	0.00	30.0	6.50e+03
27	cash_price2	Float64DType	52467 (85.7%)	511 (0.8%)	184.	369.	0.00	30.0	6.00e+03
28	cash_price3	Float64DType	58223 (95.1%)	372 (0.6%)	161.	298.	0.00	40.0	2.85e+03
29	cash_price4	Float64DType	59813 (97.7%)	283 (0.5%)	187.	355.	0.00	52.0	3.60e+03
30	cash_price5	Float64DType	60366 (98.6%)	216 (0.4%)	161.	297.	0.00	50.0	3.00e+03
31	cash_price6	Float64DType	60632 (99.0%)	173 (0.3%)	131.	268.	0.00	50.0	4.20e+03
32	cash_price7	Float64DType	60778 (99.2%)	142 (0.2%)	127.	261.	0.00	45.0	3.00e+03
33	cash_price8	Float64DType	60874 (99.4%)	138 (0.2%)	128.	274.	0.00	44.0	2.40e+03
34	cash_price9	Float64DType	60939 (99.5%)	104 (0.2%)	93.5	178.	0.00	35.0	1.40e+03
35	cash_price10	Float64DType	60995 (99.6%)	97 (0.2%)	101.	205.	0.00	30.0	1.90e+03
36	cash_price11	Float64DType	61034 (99.7%)	96 (0.2%)	107.	249.	0.00	30.0	2.16e+03
37	cash_price12	Float64DType	61074 (99.7%)	77 (0.1%)	73.3	114.	0.00	25.0	799.
38	cash_price13	Float64DType	61105 (99.8%)	76 (0.1%)	93.8	168.	0.00	39.0	1.30e+03
39	cash_price14	Float64DType	61123 (99.8%)	57 (< 0.1%)	67.3	101.	0.00	30.0	599.
40	cash_price15	Float64DType	61143 (99.8%)	54 (< 0.1%)	88.3	187.	0.00	34.0	1.50e+03
41	cash_price16	Float64DType	61160 (99.9%)	54 (< 0.1%)	110.	213.	0.00	40.0	1.55e+03
42	cash_price17	Float64DType	61174 (99.9%)	48 (< 0.1%)	99.6	154.	0.00	32.0	799.
43	cash_price18	Float64DType	61187 (99.9%)	42 (< 0.1%)	84.3	152.	0.00	35.0	999.
44	cash_price19	Float64DType	61194 (99.9%)	37 (< 0.1%)	98.2	298.	0.00	30.0	2.01e+03
45	cash_price20	Float64DType	61203 (99.9%)	26 (< 0.1%)	53.8	88.1	4.00	18.0	419.
46	cash_price21	Float64DType	61210 (99.9%)	25 (< 0.1%)	129.	378.	0.00	25.0	2.09e+03
47	cash_price22	Float64DType	61219 (100.0%)	21 (< 0.1%)	129.	213.	0.00	25.0	720.
48	cash_price23	Float64DType	61224 (100.0%)	15 (< 0.1%)	136.	260.	4.00	15.0	898.
49	make0	ObjectDType	685 (1.1%)	346 (0.6%)
50	make1	ObjectDType	30594 (50.0%)	311 (0.5%)
51	make2	ObjectDType	52577 (85.9%)	289 (0.5%)
52	make3	ObjectDType	58257 (95.1%)	243 (0.4%)
53	make4	ObjectDType	59832 (97.7%)	202 (0.3%)
54	make5	ObjectDType	60379 (98.6%)	149 (0.2%)
55	make6	ObjectDType	60642 (99.0%)	131 (0.2%)
56	make7	ObjectDType	60788 (99.3%)	122 (0.2%)
57	make8	ObjectDType	60884 (99.4%)	102 (0.2%)
58	make9	ObjectDType	60949 (99.5%)	85 (0.1%)
59	make10	ObjectDType	61003 (99.6%)	79 (0.1%)
60	make11	ObjectDType	61042 (99.7%)	67 (0.1%)
61	make12	ObjectDType	61081 (99.7%)	52 (< 0.1%)
62	make13	ObjectDType	61110 (99.8%)	53 (< 0.1%)
63	make14	ObjectDType	61128 (99.8%)	46 (< 0.1%)
64	make15	ObjectDType	61146 (99.8%)	35 (< 0.1%)
65	make16	ObjectDType	61163 (99.9%)	31 (< 0.1%)
66	make17	ObjectDType	61175 (99.9%)	22 (< 0.1%)
67	make18	ObjectDType	61188 (99.9%)	25 (< 0.1%)
68	make19	ObjectDType	61195 (99.9%)	20 (< 0.1%)
69	make20	ObjectDType	61204 (99.9%)	17 (< 0.1%)
70	make21	ObjectDType	61211 (100.0%)	12 (< 0.1%)
71	make22	ObjectDType	61220 (100.0%)	12 (< 0.1%)
72	make23	ObjectDType	61224 (100.0%)	9 (< 0.1%)
73	model0	ObjectDType	685 (1.1%)	2624 (4.3%)
74	model1	ObjectDType	30594 (50.0%)	2101 (3.4%)
75	model2	ObjectDType	52577 (85.9%)	1507 (2.5%)
76	model3	ObjectDType	58257 (95.1%)	983 (1.6%)
77	model4	ObjectDType	59832 (97.7%)	658 (1.1%)
78	model5	ObjectDType	60379 (98.6%)	437 (0.7%)
79	model6	ObjectDType	60642 (99.0%)	333 (0.5%)
80	model7	ObjectDType	60788 (99.3%)	263 (0.4%)
81	model8	ObjectDType	60884 (99.4%)	212 (0.3%)
82	model9	ObjectDType	60949 (99.5%)	184 (0.3%)
83	model10	ObjectDType	61003 (99.6%)	147 (0.2%)
84	model11	ObjectDType	61042 (99.7%)	122 (0.2%)
85	model12	ObjectDType	61081 (99.7%)	99 (0.2%)
86	model13	ObjectDType	61110 (99.8%)	81 (0.1%)
87	model14	ObjectDType	61128 (99.8%)	73 (0.1%)
88	model15	ObjectDType	61146 (99.8%)	60 (< 0.1%)
89	model16	ObjectDType	61163 (99.9%)	53 (< 0.1%)
90	model17	ObjectDType	61175 (99.9%)	41 (< 0.1%)
91	model18	ObjectDType	61188 (99.9%)	36 (< 0.1%)
92	model19	ObjectDType	61195 (99.9%)	33 (< 0.1%)
93	model20	ObjectDType	61204 (99.9%)	25 (< 0.1%)
94	model21	ObjectDType	61211 (100.0%)	23 (< 0.1%)
95	model22	ObjectDType	61220 (100.0%)	18 (< 0.1%)
96	model23	ObjectDType	61224 (100.0%)	16 (< 0.1%)
97	goods_code0	ObjectDType	0 (0.0%)	4404 (7.2%)
98	goods_code1	ObjectDType	30267 (49.4%)	3351 (5.5%)
99	goods_code2	ObjectDType	52467 (85.7%)	2246 (3.7%)
100	goods_code3	ObjectDType	58223 (95.1%)	1436 (2.3%)
101	goods_code4	ObjectDType	59813 (97.7%)	989 (1.6%)
102	goods_code5	ObjectDType	60366 (98.6%)	661 (1.1%)
103	goods_code6	ObjectDType	60632 (99.0%)	516 (0.8%)
104	goods_code7	ObjectDType	60778 (99.2%)	396 (0.6%)
105	goods_code8	ObjectDType	60874 (99.4%)	332 (0.5%)
106	goods_code9	ObjectDType	60939 (99.5%)	269 (0.4%)
107	goods_code10	ObjectDType	60995 (99.6%)	222 (0.4%)
108	goods_code11	ObjectDType	61034 (99.7%)	183 (0.3%)
109	goods_code12	ObjectDType	61074 (99.7%)	153 (0.2%)
110	goods_code13	ObjectDType	61105 (99.8%)	126 (0.2%)
111	goods_code14	ObjectDType	61123 (99.8%)	106 (0.2%)
112	goods_code15	ObjectDType	61143 (99.8%)	91 (0.1%)
113	goods_code16	ObjectDType	61160 (99.9%)	76 (0.1%)
114	goods_code17	ObjectDType	61174 (99.9%)	60 (< 0.1%)
115	goods_code18	ObjectDType	61187 (99.9%)	50 (< 0.1%)
116	goods_code19	ObjectDType	61194 (99.9%)	43 (< 0.1%)
117	goods_code20	ObjectDType	61203 (99.9%)	33 (< 0.1%)
118	goods_code21	ObjectDType	61210 (99.9%)	27 (< 0.1%)
119	goods_code22	ObjectDType	61219 (100.0%)	19 (< 0.1%)
120	goods_code23	ObjectDType	61224 (100.0%)	16 (< 0.1%)
121	Nbr_of_prod_purchas0	Int64DType	0 (0.0%)	14 (< 0.1%)	1.03	0.365	1	1	40
122	Nbr_of_prod_purchas1	Float64DType	30267 (49.4%)	10 (< 0.1%)	1.03	0.265	1.00	1.00	10.0
123	Nbr_of_prod_purchas2	Float64DType	52467 (85.7%)	10 (< 0.1%)	1.07	0.427	1.00	1.00	15.0
124	Nbr_of_prod_purchas3	Float64DType	58223 (95.1%)	13 (< 0.1%)	1.16	0.876	1.00	1.00	28.0
125	Nbr_of_prod_purchas4	Float64DType	59813 (97.7%)	9 (< 0.1%)	1.24	0.827	1.00	1.00	15.0
126	Nbr_of_prod_purchas5	Float64DType	60366 (98.6%)	8 (< 0.1%)	1.25	1.09	1.00	1.00	24.0
127	Nbr_of_prod_purchas6	Float64DType	60632 (99.0%)	9 (< 0.1%)	1.28	0.812	1.00	1.00	9.00
128	Nbr_of_prod_purchas7	Float64DType	60778 (99.2%)	9 (< 0.1%)	1.34	1.09	1.00	1.00	14.0
129	Nbr_of_prod_purchas8	Float64DType	60874 (99.4%)	10 (< 0.1%)	1.42	1.16	1.00	1.00	10.0
130	Nbr_of_prod_purchas9	Float64DType	60939 (99.5%)	7 (< 0.1%)	1.35	0.920	1.00	1.00	8.00
131	Nbr_of_prod_purchas10	Float64DType	60995 (99.6%)	9 (< 0.1%)	1.39	1.19	1.00	1.00	12.0
132	Nbr_of_prod_purchas11	Float64DType	61034 (99.7%)	6 (< 0.1%)	1.30	0.853	1.00	1.00	6.00
133	Nbr_of_prod_purchas12	Float64DType	61074 (99.7%)	4 (< 0.1%)	1.19	0.465	1.00	1.00	4.00
134	Nbr_of_prod_purchas13	Float64DType	61105 (99.8%)	6 (< 0.1%)	1.37	1.18	1.00	1.00	12.0
135	Nbr_of_prod_purchas14	Float64DType	61123 (99.8%)	6 (< 0.1%)	1.32	0.866	1.00	1.00	6.00
136	Nbr_of_prod_purchas15	Float64DType	61143 (99.8%)	6 (< 0.1%)	1.32	0.807	1.00	1.00	6.00
137	Nbr_of_prod_purchas16	Float64DType	61160 (99.9%)	5 (< 0.1%)	1.47	1.38	1.00	1.00	12.0
138	Nbr_of_prod_purchas17	Float64DType	61174 (99.9%)	4 (< 0.1%)	1.39	1.87	1.00	1.00	16.0
139	Nbr_of_prod_purchas18	Float64DType	61187 (99.9%)	4 (< 0.1%)	1.20	0.810	1.00	1.00	6.00
140	Nbr_of_prod_purchas19	Float64DType	61194 (99.9%)	5 (< 0.1%)	1.40	1.06	1.00	1.00	7.00
141	Nbr_of_prod_purchas20	Float64DType	61203 (99.9%)	3 (< 0.1%)	1.18	0.563	1.00	1.00	4.00
142	Nbr_of_prod_purchas21	Float64DType	61210 (99.9%)	4 (< 0.1%)	1.48	1.21	1.00	1.00	7.00
143	Nbr_of_prod_purchas22	Float64DType	61219 (100.0%)	4 (< 0.1%)	1.27	0.767	1.00	1.00	4.00
144	Nbr_of_prod_purchas23	Float64DType	61224 (100.0%)	2 (< 0.1%)	1.29	0.470	1.00	1.00	2.00

Please enable javascript

.. GENERATED FROM PYTHON SOURCE LINES 125-145 Look at the "Stats" section of the |TableReport| above. Does anything strike you? Not only did we create 144 columns, but most of these columns are filled with NaN, which is very inefficient for learning! This is because each basket contains a variable number of products, up to 24, and we created one column for each product attribute, for each position (up to 24) in the dataframe. Moreover, if we wanted to replace text columns with encodings, we would create :math:`d \times 24 \times 2` columns (encoding of dimensionality :math:`d`, for 24 products, for the ``"item"`` and ``"make"`` columns), which would explode the memory usage. .. _agg-joiner-anchor: AggJoiner --------- Let's now see how the |AggJoiner| can help us solve this. We begin with splitting our basket dataset in a training and testing set. .. GENERATED FROM PYTHON SOURCE LINES 145-151 .. code-block:: Python from sklearn.model_selection import train_test_split X, y = baskets[["ID"]], baskets["fraud_flag"] X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.1) X_train.shape, y_train.shape .. rst-class:: sphx-glr-script-out .. code-block:: none ((55116, 1), (55116,)) .. GENERATED FROM PYTHON SOURCE LINES 152-164 Before aggregating our product dataframe, we need to vectorize our categorical columns. To do so, we use: - |MinHashEncoder| on "item" and "model" columns, because they both expose typos and text similarities. - |OrdinalEncoder| on "make" and "goods_code" columns, because they consist in orthogonal categories. We bring this logic into a |TableVectorizer| to vectorize these columns in a single step. See `this example `_ for more details about these encoding choices. .. GENERATED FROM PYTHON SOURCE LINES 164-177 .. code-block:: Python from sklearn.preprocessing import OrdinalEncoder from skrub import MinHashEncoder, TableVectorizer vectorizer = TableVectorizer( high_cardinality=MinHashEncoder(), # encode ["item", "model"] specific_transformers=[ (OrdinalEncoder(), ["make", "goods_code"]), ], ) products_transformed = vectorizer.fit_transform(products) TableReport(products_transformed) .. raw:: html

Column	Column name	dtype	Null values	Unique values	Mean	Std	Min	Median	Max
0	basket_ID	Float32DType	0 (0.0%)	61241 (56.0%)	3.59e+04	2.24e+04	0.00	3.52e+04	7.65e+04
1	item_00	Float32DType	0 (0.0%)	70 (< 0.1%)	-2.03e+09	1.54e+08	-2.15e+09	-2.12e+09	-1.04e+09
2	item_01	Float32DType	0 (0.0%)	46 (< 0.1%)	-1.92e+09	2.63e+08	-2.14e+09	-2.06e+09	-1.09e+09
3	item_02	Float32DType	0 (0.0%)	57 (< 0.1%)	-1.98e+09	1.85e+08	-2.14e+09	-2.09e+09	-3.95e+08
4	item_03	Float32DType	0 (0.0%)	52 (< 0.1%)	-2.06e+09	8.25e+07	-2.15e+09	-2.07e+09	-8.27e+08
5	item_04	Float32DType	0 (0.0%)	63 (< 0.1%)	-2.02e+09	6.26e+07	-2.15e+09	-2.05e+09	-1.08e+09
6	item_05	Float32DType	0 (0.0%)	61 (< 0.1%)	-1.99e+09	2.14e+08	-2.14e+09	-2.06e+09	-8.93e+08
7	item_06	Float32DType	0 (0.0%)	67 (< 0.1%)	-1.77e+09	3.82e+08	-2.14e+09	-1.99e+09	-9.01e+07
8	item_07	Float32DType	0 (0.0%)	60 (< 0.1%)	-1.97e+09	1.38e+08	-2.14e+09	-2.00e+09	-1.20e+09
9	item_08	Float32DType	0 (0.0%)	64 (< 0.1%)	-1.98e+09	1.06e+08	-2.15e+09	-2.02e+09	-9.20e+08
10	item_09	Float32DType	0 (0.0%)	67 (< 0.1%)	-2.01e+09	2.29e+08	-2.14e+09	-2.06e+09	7.04e+08
11	item_10	Float32DType	0 (0.0%)	60 (< 0.1%)	-2.02e+09	8.09e+07	-2.14e+09	-2.04e+09	2.67e+07
12	item_11	Float32DType	0 (0.0%)	56 (< 0.1%)	-2.10e+09	1.06e+08	-2.14e+09	-2.14e+09	-5.71e+08
13	item_12	Float32DType	0 (0.0%)	58 (< 0.1%)	-2.04e+09	8.67e+07	-2.14e+09	-2.06e+09	-1.05e+09
14	item_13	Float32DType	0 (0.0%)	57 (< 0.1%)	-2.04e+09	1.47e+08	-2.15e+09	-2.10e+09	-1.14e+09
15	item_14	Float32DType	0 (0.0%)	60 (< 0.1%)	-1.98e+09	1.50e+08	-2.15e+09	-1.99e+09	-5.50e+08
16	item_15	Float32DType	0 (0.0%)	62 (< 0.1%)	-1.99e+09	1.23e+08	-2.15e+09	-2.01e+09	-8.00e+08
17	item_16	Float32DType	0 (0.0%)	66 (< 0.1%)	-2.04e+09	1.01e+08	-2.15e+09	-2.06e+09	-6.04e+08
18	item_17	Float32DType	0 (0.0%)	55 (< 0.1%)	-2.04e+09	9.10e+07	-2.15e+09	-2.04e+09	-1.13e+09
19	item_18	Float32DType	0 (0.0%)	68 (< 0.1%)	-2.00e+09	9.04e+07	-2.14e+09	-1.96e+09	-9.75e+08
20	item_19	Float32DType	0 (0.0%)	51 (< 0.1%)	-2.06e+09	1.05e+08	-2.15e+09	-2.08e+09	-5.72e+08
21	item_20	Float32DType	0 (0.0%)	57 (< 0.1%)	-2.01e+09	2.98e+08	-2.14e+09	-2.10e+09	-5.10e+06
22	item_21	Float32DType	0 (0.0%)	62 (< 0.1%)	-2.04e+09	1.01e+08	-2.14e+09	-2.05e+09	-8.74e+08
23	item_22	Float32DType	0 (0.0%)	56 (< 0.1%)	-1.99e+09	1.44e+08	-2.14e+09	-2.06e+09	-7.91e+08
24	item_23	Float32DType	0 (0.0%)	60 (< 0.1%)	-2.05e+09	1.31e+08	-2.15e+09	-2.08e+09	-7.27e+08
25	item_24	Float32DType	0 (0.0%)	61 (< 0.1%)	-2.01e+09	1.55e+08	-2.14e+09	-2.09e+09	-5.03e+08
26	item_25	Float32DType	0 (0.0%)	66 (< 0.1%)	-1.90e+09	2.06e+08	-2.14e+09	-1.83e+09	-8.07e+08
27	item_26	Float32DType	0 (0.0%)	70 (< 0.1%)	-1.95e+09	1.30e+08	-2.15e+09	-1.84e+09	-1.36e+09
28	item_27	Float32DType	0 (0.0%)	64 (< 0.1%)	-2.00e+09	1.69e+08	-2.15e+09	-2.11e+09	-9.92e+08
29	item_28	Float32DType	0 (0.0%)	74 (< 0.1%)	-2.05e+09	1.21e+08	-2.15e+09	-2.12e+09	-9.07e+08
30	item_29	Float32DType	0 (0.0%)	58 (< 0.1%)	-1.99e+09	1.21e+08	-2.15e+09	-1.97e+09	-1.07e+09
31	cash_price	Float32DType	0 (0.0%)	1280 (1.2%)	672.	714.	0.00	499.	1.83e+04
32	make	Float64DType	1273 (1.2%)	690 (0.6%)	254.	234.	0.00	168.	689.
33	model_00	Float32DType	0 (0.0%)	429 (0.4%)	-1.92e+09	2.97e+08	-2.15e+09	-2.07e+09	0.00
34	model_01	Float32DType	0 (0.0%)	309 (0.3%)	-2.07e+09	2.47e+08	-2.15e+09	-2.14e+09	0.00
35	model_02	Float32DType	0 (0.0%)	309 (0.3%)	-2.06e+09	2.32e+08	-2.15e+09	-2.12e+09	4.22e+07
36	model_03	Float32DType	0 (0.0%)	311 (0.3%)	-2.03e+09	2.36e+08	-2.15e+09	-2.10e+09	0.00
37	model_04	Float32DType	0 (0.0%)	386 (0.4%)	-2.00e+09	2.32e+08	-2.15e+09	-1.99e+09	0.00
38	model_05	Float32DType	0 (0.0%)	301 (0.3%)	-2.08e+09	2.29e+08	-2.15e+09	-2.10e+09	8.95e+08
39	model_06	Float32DType	0 (0.0%)	296 (0.3%)	-1.85e+09	4.02e+08	-2.15e+09	-2.08e+09	0.00
40	model_07	Float32DType	0 (0.0%)	329 (0.3%)	-2.00e+09	2.39e+08	-2.15e+09	-2.09e+09	0.00
41	model_08	Float32DType	0 (0.0%)	385 (0.4%)	-1.96e+09	2.66e+08	-2.15e+09	-2.07e+09	0.00
42	model_09	Float32DType	0 (0.0%)	254 (0.2%)	-1.98e+09	2.67e+08	-2.15e+09	-2.09e+09	0.00
43	model_10	Float32DType	0 (0.0%)	345 (0.3%)	-2.06e+09	2.33e+08	-2.15e+09	-2.09e+09	1.05e+09
44	model_11	Float32DType	0 (0.0%)	318 (0.3%)	-2.07e+09	2.29e+08	-2.15e+09	-2.10e+09	2.73e+08
45	model_12	Float32DType	0 (0.0%)	428 (0.4%)	-1.94e+09	2.68e+08	-2.15e+09	-2.05e+09	0.00
46	model_13	Float32DType	0 (0.0%)	337 (0.3%)	-2.07e+09	2.27e+08	-2.15e+09	-2.10e+09	0.00
47	model_14	Float32DType	0 (0.0%)	290 (0.3%)	-2.05e+09	2.35e+08	-2.15e+09	-2.10e+09	9.05e+07
48	model_15	Float32DType	0 (0.0%)	321 (0.3%)	-2.02e+09	2.34e+08	-2.15e+09	-2.03e+09	3.49e+08
49	model_16	Float32DType	0 (0.0%)	403 (0.4%)	-1.89e+09	2.99e+08	-2.15e+09	-2.05e+09	0.00
50	model_17	Float32DType	0 (0.0%)	376 (0.3%)	-1.78e+09	4.85e+08	-2.15e+09	-2.10e+09	4.20e+08
51	model_18	Float32DType	0 (0.0%)	352 (0.3%)	-2.03e+09	2.26e+08	-2.15e+09	-2.05e+09	1.57e+09
52	model_19	Float32DType	0 (0.0%)	229 (0.2%)	-2.07e+09	2.33e+08	-2.15e+09	-2.14e+09	2.29e+08
53	model_20	Float32DType	0 (0.0%)	372 (0.3%)	-1.98e+09	2.71e+08	-2.15e+09	-2.10e+09	0.00
54	model_21	Float32DType	0 (0.0%)	280 (0.3%)	-2.08e+09	2.34e+08	-2.15e+09	-2.10e+09	1.15e+09
55	model_22	Float32DType	0 (0.0%)	362 (0.3%)	-1.88e+09	3.13e+08	-2.15e+09	-2.02e+09	6.53e+08
56	model_23	Float32DType	0 (0.0%)	400 (0.4%)	-1.91e+09	3.17e+08	-2.15e+09	-2.07e+09	9.93e+08
57	model_24	Float32DType	0 (0.0%)	336 (0.3%)	-2.04e+09	2.33e+08	-2.15e+09	-2.07e+09	0.00
58	model_25	Float32DType	0 (0.0%)	332 (0.3%)	-1.97e+09	2.77e+08	-2.15e+09	-2.10e+09	0.00
59	model_26	Float32DType	0 (0.0%)	354 (0.3%)	-2.05e+09	2.40e+08	-2.15e+09	-2.09e+09	1.14e+09
60	model_27	Float32DType	0 (0.0%)	293 (0.3%)	-2.07e+09	2.31e+08	-2.15e+09	-2.09e+09	1.20e+09
61	model_28	Float32DType	0 (0.0%)	364 (0.3%)	-2.07e+09	2.37e+08	-2.15e+09	-2.10e+09	4.59e+08
62	model_29	Float32DType	0 (0.0%)	398 (0.4%)	-1.98e+09	2.35e+08	-2.15e+09	-1.97e+09	1.17e+09
63	goods_code	Float64DType	0 (0.0%)	10738 (9.8%)	7.91e+03	3.01e+03	0.00	8.72e+03	1.07e+04
64	Nbr_of_prod_purchas	Float32DType	0 (0.0%)	19 (< 0.1%)	1.05	0.426	1.00	1.00	40.0

Please enable javascript

.. GENERATED FROM PYTHON SOURCE LINES 178-198 Our objective is now to aggregate this vectorized product dataframe by ``"basket_ID"``, then to merge it on the baskets dataframe, still on the ``"basket_ID"``. .. image:: ../../_static/08_example_aggjoiner.png :width: 900 | |AggJoiner| can help us achieve exactly this. We need to pass the product dataframe as an auxiliary table argument to |AggJoiner| in ``__init__``. The ``aux_key`` argument represent both the columns used to groupby on, and the columns used to join on. The basket dataframe is our main table, and we indicate the columns to join on with ``main_key``. Note that we pass the main table during ``fit``, and we discuss the limitations of this design in the conclusion at the bottom of this notebook. The minimum ("min") is the most appropriate operation to aggregate encodings from |MinHashEncoder|, for reasons that are out of the scope of this notebook. .. GENERATED FROM PYTHON SOURCE LINES 198-216 .. code-block:: Python from skrub import AggJoiner from skrub import selectors as s # Skrub selectors allow us to select columns using regexes, which reduces # the boilerplate. minhash_cols_query = s.glob("item*") | s.glob("model*") minhash_cols = s.select(products_transformed, minhash_cols_query).columns agg_joiner = AggJoiner( aux_table=products_transformed, aux_key="basket_ID", main_key="ID", cols=minhash_cols, operations=["min"], ) baskets_products = agg_joiner.fit_transform(baskets) TableReport(baskets_products) .. raw:: html

Column	Column name	dtype	Unique values	Mean	Std	Min	Median	Max
0	ID	Int64DType	61241 (100.0%)	3.82e+04	2.21e+04	0	38,158	76,543
1	fraud_flag	Int64DType	2 (< 0.1%)	0.0130	0.113	0	0	1
2	item_00_min	Float32DType	44 (< 0.1%)	-2.11e+09	4.86e+07	-2.15e+09	-2.12e+09	-1.04e+09
3	item_01_min	Float32DType	36 (< 0.1%)	-1.97e+09	2.12e+08	-2.14e+09	-2.09e+09	-1.12e+09
4	item_02_min	Float32DType	45 (< 0.1%)	-2.03e+09	1.50e+08	-2.14e+09	-2.09e+09	-6.79e+08
5	item_03_min	Float32DType	35 (< 0.1%)	-2.08e+09	4.49e+07	-2.15e+09	-2.10e+09	-8.27e+08
6	item_04_min	Float32DType	44 (< 0.1%)	-2.05e+09	3.07e+07	-2.15e+09	-2.05e+09	-1.08e+09
7	item_05_min	Float32DType	43 (< 0.1%)	-2.05e+09	7.14e+07	-2.14e+09	-2.07e+09	-8.93e+08
8	item_06_min	Float32DType	51 (< 0.1%)	-1.87e+09	3.05e+08	-2.14e+09	-2.05e+09	-5.96e+08
9	item_07_min	Float32DType	44 (< 0.1%)	-2.03e+09	5.26e+07	-2.14e+09	-2.00e+09	-1.20e+09
10	item_08_min	Float32DType	46 (< 0.1%)	-2.01e+09	9.94e+07	-2.15e+09	-2.02e+09	-1.46e+09
11	item_09_min	Float32DType	42 (< 0.1%)	-2.08e+09	4.86e+07	-2.14e+09	-2.06e+09	7.04e+08
12	item_10_min	Float32DType	42 (< 0.1%)	-2.04e+09	6.45e+07	-2.14e+09	-2.08e+09	-1.07e+09
13	item_11_min	Float32DType	31 (< 0.1%)	-2.14e+09	2.40e+07	-2.14e+09	-2.14e+09	-1.23e+09
14	item_12_min	Float32DType	36 (< 0.1%)	-2.07e+09	6.45e+07	-2.14e+09	-2.12e+09	-1.05e+09
15	item_13_min	Float32DType	40 (< 0.1%)	-2.09e+09	3.61e+07	-2.15e+09	-2.10e+09	-1.16e+09
16	item_14_min	Float32DType	47 (< 0.1%)	-2.02e+09	1.26e+08	-2.15e+09	-2.09e+09	-9.70e+08
17	item_15_min	Float32DType	43 (< 0.1%)	-2.02e+09	9.49e+07	-2.15e+09	-2.03e+09	-8.00e+08
18	item_16_min	Float32DType	47 (< 0.1%)	-2.07e+09	6.01e+07	-2.15e+09	-2.07e+09	-1.34e+09
19	item_17_min	Float32DType	40 (< 0.1%)	-2.08e+09	6.43e+07	-2.15e+09	-2.04e+09	-1.33e+09
20	item_18_min	Float32DType	49 (< 0.1%)	-2.04e+09	7.46e+07	-2.14e+09	-2.02e+09	-1.56e+09
21	item_19_min	Float32DType	33 (< 0.1%)	-2.09e+09	5.64e+07	-2.15e+09	-2.13e+09	-1.01e+09
22	item_20_min	Float32DType	35 (< 0.1%)	-2.10e+09	6.14e+07	-2.14e+09	-2.11e+09	-5.10e+06
23	item_21_min	Float32DType	40 (< 0.1%)	-2.09e+09	5.45e+07	-2.14e+09	-2.11e+09	-8.74e+08
24	item_22_min	Float32DType	36 (< 0.1%)	-2.07e+09	2.45e+07	-2.14e+09	-2.06e+09	-7.91e+08
25	item_23_min	Float32DType	39 (< 0.1%)	-2.09e+09	4.50e+07	-2.15e+09	-2.08e+09	-7.27e+08
26	item_24_min	Float32DType	37 (< 0.1%)	-2.10e+09	4.29e+07	-2.14e+09	-2.09e+09	-5.35e+08
27	item_25_min	Float32DType	50 (< 0.1%)	-1.95e+09	1.83e+08	-2.14e+09	-2.02e+09	-9.00e+08
28	item_26_min	Float32DType	53 (< 0.1%)	-1.98e+09	1.25e+08	-2.15e+09	-2.02e+09	-1.45e+09
29	item_27_min	Float32DType	41 (< 0.1%)	-2.06e+09	1.28e+08	-2.15e+09	-2.12e+09	-9.92e+08
30	item_28_min	Float32DType	51 (< 0.1%)	-2.12e+09	5.13e+07	-2.15e+09	-2.12e+09	-1.14e+09
31	item_29_min	Float32DType	37 (< 0.1%)	-2.03e+09	8.10e+07	-2.15e+09	-2.07e+09	-1.19e+09
32	model_00_min	Float32DType	222 (0.4%)	-2.04e+09	2.46e+08	-2.15e+09	-2.09e+09	0.00
33	model_01_min	Float32DType	122 (0.2%)	-2.09e+09	2.34e+08	-2.15e+09	-2.14e+09	0.00
34	model_02_min	Float32DType	145 (0.2%)	-2.09e+09	2.26e+08	-2.15e+09	-2.12e+09	4.22e+07
35	model_03_min	Float32DType	170 (0.3%)	-2.08e+09	2.27e+08	-2.15e+09	-2.11e+09	0.00
36	model_04_min	Float32DType	241 (0.4%)	-2.03e+09	2.29e+08	-2.15e+09	-2.09e+09	0.00
37	model_05_min	Float32DType	135 (0.2%)	-2.08e+09	2.23e+08	-2.15e+09	-2.10e+09	0.00
38	model_06_min	Float32DType	165 (0.3%)	-2.05e+09	2.69e+08	-2.15e+09	-2.08e+09	0.00
39	model_07_min	Float32DType	182 (0.3%)	-2.05e+09	2.30e+08	-2.15e+09	-2.09e+09	0.00
40	model_08_min	Float32DType	181 (0.3%)	-2.04e+09	2.43e+08	-2.15e+09	-2.07e+09	0.00
41	model_09_min	Float32DType	154 (0.3%)	-2.07e+09	2.33e+08	-2.15e+09	-2.11e+09	0.00
42	model_10_min	Float32DType	187 (0.3%)	-2.07e+09	2.28e+08	-2.15e+09	-2.09e+09	1.05e+09
43	model_11_min	Float32DType	141 (0.2%)	-2.08e+09	2.24e+08	-2.15e+09	-2.11e+09	0.00
44	model_12_min	Float32DType	254 (0.4%)	-2.02e+09	2.44e+08	-2.15e+09	-2.08e+09	0.00
45	model_13_min	Float32DType	135 (0.2%)	-2.08e+09	2.22e+08	-2.15e+09	-2.10e+09	0.00
46	model_14_min	Float32DType	133 (0.2%)	-2.09e+09	2.26e+08	-2.15e+09	-2.13e+09	0.00
47	model_15_min	Float32DType	150 (0.2%)	-2.05e+09	2.29e+08	-2.15e+09	-2.11e+09	0.00
48	model_16_min	Float32DType	235 (0.4%)	-2.01e+09	2.49e+08	-2.15e+09	-2.06e+09	0.00
49	model_17_min	Float32DType	177 (0.3%)	-2.04e+09	2.95e+08	-2.15e+09	-2.13e+09	0.00
50	model_18_min	Float32DType	189 (0.3%)	-2.05e+09	2.22e+08	-2.15e+09	-2.06e+09	0.00
51	model_19_min	Float32DType	86 (0.1%)	-2.11e+09	2.27e+08	-2.15e+09	-2.14e+09	0.00
52	model_20_min	Float32DType	203 (0.3%)	-2.07e+09	2.43e+08	-2.15e+09	-2.11e+09	0.00
53	model_21_min	Float32DType	130 (0.2%)	-2.09e+09	2.30e+08	-2.15e+09	-2.14e+09	1.15e+09
54	model_22_min	Float32DType	209 (0.3%)	-2.02e+09	2.48e+08	-2.15e+09	-2.08e+09	6.53e+08
55	model_23_min	Float32DType	217 (0.4%)	-2.05e+09	2.54e+08	-2.15e+09	-2.11e+09	9.93e+08
56	model_24_min	Float32DType	171 (0.3%)	-2.08e+09	2.27e+08	-2.15e+09	-2.14e+09	0.00
57	model_25_min	Float32DType	166 (0.3%)	-2.07e+09	2.35e+08	-2.15e+09	-2.11e+09	0.00
58	model_26_min	Float32DType	170 (0.3%)	-2.06e+09	2.31e+08	-2.15e+09	-2.09e+09	1.34e+08
59	model_27_min	Float32DType	123 (0.2%)	-2.07e+09	2.25e+08	-2.15e+09	-2.09e+09	0.00
60	model_28_min	Float32DType	169 (0.3%)	-2.09e+09	2.30e+08	-2.15e+09	-2.12e+09	2.58e+08
61	model_29_min	Float32DType	221 (0.4%)	-2.01e+09	2.31e+08	-2.15e+09	-2.07e+09	2.92e+08

Please enable javascript

.. GENERATED FROM PYTHON SOURCE LINES 217-228 Now that we understand how to use the |AggJoiner|, we can now assemble our pipeline by chaining two |AggJoiner| together: - the first one to deal with the |MinHashEncoder| vectors as we just saw - the second one to deal with the all the other columns For the second |AggJoiner|, we use the mean, standard deviation, minimum and maximum operations to extract a representative summary of each distribution. |DropCols| is another skrub transformer which removes the "ID" column, which doesn't bring any information after the joining operation. .. GENERATED FROM PYTHON SOURCE LINES 228-254 .. code-block:: Python from scipy.stats import loguniform, randint from sklearn.ensemble import HistGradientBoostingClassifier from sklearn.pipeline import make_pipeline from skrub import DropCols model = make_pipeline( AggJoiner( aux_table=products_transformed, aux_key="basket_ID", main_key="ID", cols=minhash_cols, operations=["min"], ), AggJoiner( aux_table=products_transformed, aux_key="basket_ID", main_key="ID", cols=["make", "goods_code", "cash_price", "Nbr_of_prod_purchas"], operations=["sum", "mean", "std", "min", "max"], ), DropCols(["ID"]), HistGradientBoostingClassifier(), ) model .. rst-class:: sphx-glr-script-out .. code-block:: none Pipeline(steps=[('aggjoiner-1', AggJoiner(aux_key='basket_ID', aux_table= basket_ID item_00 ... goods_code Nbr_of_prod_purchas 1 51113.0 -2.119082e+09 ... 8255.0 1.0 9 41798.0 -2.119082e+09 ... 8720.0 1.0 11 39361.0 -2.119082e+09 ... 8716.0 1.0 15 38615.0 -2.119082e+09 ... 8260.0 1.0 16 70262.0 -2.119082e+09 ... 10690.0 1.0 ... ... ... ... ... ... 163352 42613.0 -1.944861e+09 ... 2190.0 1.0 163353... 163354 43567.0 -2.119082e+09 ... 10020.0 1.0 163355 43567.0 -2.119082e+09 ... 7790.0 1.0 163356 68268.0 -2.128260e+09 ... 9377.0 1.0 [109380 rows x 65 columns], cols=['make', 'goods_code', 'cash_price', 'Nbr_of_prod_purchas'], main_key='ID', operations=['sum', 'mean', 'std', 'min', 'max'])), ('dropcols', DropCols(cols=['ID'])), ('histgradientboostingclassifier', HistGradientBoostingClassifier())]) .. GENERATED FROM PYTHON SOURCE LINES 255-260 We tune the hyper-parameters of the |HGBC| model using ``RandomizedSearchCV``. By default, the |HGBC| applies early stopping when there are at least 10_000 samples so we don't need to explicitly tune the number of trees (``max_iter``). Therefore we set this at a very high level of 1_000. We increase ``n_iter_no_change`` to make sure early stopping does not kick in too early. .. GENERATED FROM PYTHON SOURCE LINES 260-283 .. code-block:: Python from time import time from sklearn.model_selection import RandomizedSearchCV param_distributions = dict( histgradientboostingclassifier__learning_rate=loguniform(1e-2, 5e-1), histgradientboostingclassifier__min_samples_leaf=randint(2, 64), histgradientboostingclassifier__max_leaf_nodes=[None, 10, 30, 60, 90], histgradientboostingclassifier__n_iter_no_change=[50], histgradientboostingclassifier__max_iter=[1000], ) tic = time() search = RandomizedSearchCV( model, param_distributions, scoring="neg_log_loss", refit=False, n_iter=10, cv=3, verbose=1, ).fit(X_train, y_train) print(f"This operation took {time() - tic:.1f}s") .. rst-class:: sphx-glr-script-out .. code-block:: none Fitting 3 folds for each of 10 candidates, totalling 30 fits This operation took 186.4s .. GENERATED FROM PYTHON SOURCE LINES 284-285 The best hyper parameters are: .. GENERATED FROM PYTHON SOURCE LINES 285-288 .. code-block:: Python pd.Series(search.best_params_) .. rst-class:: sphx-glr-script-out .. code-block:: none histgradientboostingclassifier__learning_rate 0.013069 histgradientboostingclassifier__max_iter 1000.000000 histgradientboostingclassifier__max_leaf_nodes 60.000000 histgradientboostingclassifier__min_samples_leaf 15.000000 histgradientboostingclassifier__n_iter_no_change 50.000000 dtype: float64 .. GENERATED FROM PYTHON SOURCE LINES 289-297 To benchmark our performance, we plot the log loss of our model on the test set against the log loss of a dummy model that always output the observed probability of the two classes. As this dataset is extremely imbalanced, this dummy model should be a good baseline. The vertical bar represents one standard deviation around the mean of the cross validation log-loss. .. GENERATED FROM PYTHON SOURCE LINES 297-331 .. code-block:: Python import seaborn as sns from matplotlib import pyplot as plt from sklearn.dummy import DummyClassifier from sklearn.metrics import log_loss results = search.cv_results_ best_idx = search.best_index_ log_loss_model_mean = -results["mean_test_score"][best_idx] log_loss_model_std = results["std_test_score"][best_idx] dummy = DummyClassifier(strategy="prior").fit(X_train, y_train) y_proba_dummy = dummy.predict_proba(X_test) log_loss_dummy = log_loss(y_true=y_test, y_pred=y_proba_dummy) fig, ax = plt.subplots() ax.bar( height=[log_loss_model_mean, log_loss_dummy], x=["AggJoiner model", "Dummy"], color=["C0", "C4"], ) for container in ax.containers: ax.bar_label(container, padding=4) ax.vlines( x="AggJoiner model", ymin=log_loss_model_mean - log_loss_model_std, ymax=log_loss_model_mean + log_loss_model_std, linestyle="-", linewidth=1, color="k", ) sns.despine() ax.set_title("Log loss (lower is better)") .. image-sg:: /auto_examples/images/sphx_glr_08_join_aggregation_001.png :alt: Log loss (lower is better) :srcset: /auto_examples/images/sphx_glr_08_join_aggregation_001.png :class: sphx-glr-single-img .. rst-class:: sphx-glr-script-out .. code-block:: none Text(0.5, 1.0, 'Log loss (lower is better)') .. GENERATED FROM PYTHON SOURCE LINES 332-350 Conclusion ---------- With |AggJoiner|, you can bring the aggregation and joining operations within a sklearn pipeline, and train models more efficiently. One known limitation of both the |AggJoiner| and |Joiner| is that the auxiliary data to join is passed during the ``__init__`` method instead of the ``fit`` method, and is therefore fixed once the model has been trained. This limitation causes two main issues: 1. **Bigger model serialization:** Since the dataset has to be pickled along with the model, it can result in a massive file size on disk. 2. **Inflexibility with new, unseen data in a production environment:** To use new auxiliary data, you would need to replace the auxiliary table in the |AggJoiner| that was used during ``fit`` with the updated data, which is a rather hacky approach. These limitations will be addressed later in skrub. .. rst-class:: sphx-glr-timing **Total running time of the script:** (3 minutes 45.578 seconds) .. _sphx_glr_download_auto_examples_08_join_aggregation.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: binder-badge .. image:: images/binder_badge_logo.svg :target: https://mybinder.org/v2/gh/skrub-data/skrub/main?urlpath=lab/tree/notebooks/auto_examples/08_join_aggregation.ipynb :alt: Launch binder :width: 150 px .. container:: lite-badge .. image:: images/jupyterlite_badge_logo.svg :target: ../lite/lab/index.html?path=auto_examples/08_join_aggregation.ipynb :alt: Launch JupyterLite :width: 150 px .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: 08_join_aggregation.ipynb <08_join_aggregation.ipynb>` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: 08_join_aggregation.py <08_join_aggregation.py>` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: 08_join_aggregation.zip <08_join_aggregation.zip>` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_

basket_ID

item

cash_price

make

model

goods_code

Nbr_of_prod_purchas

basket_ID

item

cash_price

make

model

goods_code

Nbr_of_prod_purchas

Please enable javascript

ID

fraud_flag

ID

fraud_flag

Please enable javascript

item

cash_price

make

model

goods_code

Nbr_of_prod_purchas

item

cash_price

make

model

goods_code

Nbr_of_prod_purchas

Please enable javascript

basket_ID

item0

item1

item2

item3

item4

item5

item6

item7

item8

item9

item10

item11

item12

item13

item14

item15

item16

item17

item18

item19

item20

item21

item22

item23

cash_price0

cash_price1

cash_price2

cash_price3

cash_price4

cash_price5

cash_price6

cash_price7

cash_price8

cash_price9

cash_price10

cash_price11

cash_price12

cash_price13

cash_price14

cash_price15

cash_price16

cash_price17

cash_price18

cash_price19

cash_price20

cash_price21