.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/08_join_aggregation.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_08_join_aggregation.py>`
        to download the full example code, or to run this example in your browser
        via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_08_join_aggregation.py:


AggJoiner on a credit fraud dataset
===================================

Many problems involve tables whose entities have a one-to-many relationship.
To simplify aggregate-then-join operations for machine learning, we can include
the |AggJoiner| in our pipeline.

In this example, we are tackling a fraudulent loan detection use case.
Because fraud is rare, this dataset is extremely imbalanced, with a prevalence
of around 1.4%.

The data consists of two distinct entities: e-commerce "baskets" and "products".
Baskets can be tagged fraudulent (1) or not (0), and are essentially lists of
products of variable size. Each basket is linked to at least one product, e.g.
basket 1 can contain products 1 and 2.

.. image:: ../../_static/08_example_data.png
    :width: 450 px

|

Our aim is to predict which baskets are fraudulent.

The products dataframe can be joined on the baskets dataframe using the
``basket_ID`` column.

Each product has several attributes:

- a category (marked by the column ``"item"``),
- a model (``"model"``),
- a brand (``"make"``),
- a merchant code (``"goods_code"``),
- a price per unit (``"cash_price"``),
- a quantity selected in the basket (``"Nbr_of_prod_purchas"``)

.. |AggJoiner| replace:: :class:`~skrub.AggJoiner`

.. |Joiner| replace:: :class:`~skrub.Joiner`

.. |DropCols| replace:: :class:`~skrub.DropCols`

.. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer`

.. |TableReport| replace:: :class:`~skrub.TableReport`

.. |MinHashEncoder| replace:: :class:`~skrub.MinHashEncoder`

.. |TargetEncoder| replace:: :class:`~sklearn.preprocessing.TargetEncoder`

.. |make_pipeline| replace:: :func:`~sklearn.pipeline.make_pipeline`

.. |Pipeline| replace:: :class:`~sklearn.pipeline.Pipeline`

.. |HGBC| replace:: :class:`~sklearn.ensemble.HistGradientBoostingClassifier`

.. |OrdinalEncoder| replace:: :class:`~sklearn.preprocessing.OrdinalEncoder`

.. |TunedThresholdClassifierCV| replace:: :class:`~sklearn.model_selection.TunedThresholdClassifierCV`

.. |CalibrationDisplay| replace:: :class:`~sklearn.calibration.CalibrationDisplay`

.. |pandas.melt| replace:: :func:`~pandas.melt`

.. GENERATED FROM PYTHON SOURCE LINES 83-90

.. code-block:: Python

    from skrub import TableReport
    from skrub.datasets import fetch_credit_fraud

    bunch = fetch_credit_fraud()
    products, baskets = bunch.products, bunch.baskets
    TableReport(products)

.. GENERATED FROM PYTHON SOURCE LINES 91-93

.. code-block:: Python

    TableReport(baskets)
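
As a quick illustrative aside, two pandas queries let us verify the claims made
in the introduction: the strong class imbalance, and the variable number of
products per basket.

.. code-block:: Python

    # Fraud prevalence in the baskets table (around 1.4% according to the text).
    print(baskets["fraud_flag"].mean())

    # Number of products per basket: a variable-size, one-to-many relationship.
    print(products["basket_ID"].value_counts().describe())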

.. GENERATED FROM PYTHON SOURCE LINES 94-107

Naive aggregation
-----------------

Let's explore a naive solution first.

.. note::

    Click :ref:`here <agg-joiner-anchor>` to skip this section and see the
    AggJoiner in action!

The first idea that comes to mind to merge these two tables is to aggregate the
product attributes into lists, using their basket IDs.

.. GENERATED FROM PYTHON SOURCE LINES 107-110

.. code-block:: Python

    products_grouped = products.groupby("basket_ID").agg(list)
    TableReport(products_grouped)
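
Each cell of ``products_grouped`` now holds a Python list with one entry per
product in the basket. A quick illustrative peek:

.. code-block:: Python

    # Each cell contains a list of values, one per product in the basket.
    print(products_grouped["cash_price"].head(3))

    # The longest basket sets the width of the flattened table built below.
    print(products_grouped["cash_price"].map(len).max())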

.. GENERATED FROM PYTHON SOURCE LINES 111-114

Then, we can expand all lists into columns, as if we were "flattening" the
dataframe. We end up with a products dataframe ready to be joined on the
baskets dataframe, using ``"basket_ID"`` as the join key.

.. GENERATED FROM PYTHON SOURCE LINES 114-124

.. code-block:: Python

    import pandas as pd

    products_flatten = []
    for col in products_grouped.columns:
        cols = [f"{col}{idx}" for idx in range(24)]
        products_flatten.append(pd.DataFrame(products_grouped[col].to_list(), columns=cols))
    products_flatten = pd.concat(products_flatten, axis=1)
    products_flatten.insert(0, "basket_ID", products_grouped.index)
    TableReport(products_flatten)
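
Before reading the report, a quick sanity check (an illustrative aside) shows
how sparse this flattened table is:

.. code-block:: Python

    # Illustrative check: count the flattened columns (excluding the join key)
    # and the share of cells that are NaN, since most baskets hold far fewer
    # than 24 products.
    n_attribute_cols = products_flatten.shape[1] - 1  # excluding "basket_ID"
    nan_share = products_flatten.drop(columns="basket_ID").isna().to_numpy().mean()
    print(f"{n_attribute_cols} columns, {nan_share:.0%} of cells are missing")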

.. GENERATED FROM PYTHON SOURCE LINES 125-145

Look at the "Stats" section of the |TableReport| above. Does anything strike you?

Not only did we create 144 columns, but most of these columns are filled with
NaN, which is very inefficient for learning!

This is because each basket contains a variable number of products, up to 24,
and we created one column for each product attribute, for each position (up to
24) in the dataframe.

Moreover, if we wanted to replace text columns with encodings, we would create
:math:`d \times 24 \times 2` columns (encoding of dimensionality :math:`d`, for
24 products, for the ``"item"`` and ``"make"`` columns), which would explode
the memory usage.

.. _agg-joiner-anchor:

AggJoiner
---------

Let's now see how the |AggJoiner| can help us solve this. We begin by splitting
our basket dataset into a training and testing set.

.. GENERATED FROM PYTHON SOURCE LINES 145-151

.. code-block:: Python

    from sklearn.model_selection import train_test_split

    X, y = baskets[["ID"]], baskets["fraud_flag"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.1)
    X_train.shape, y_train.shape


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ((83511, 1), (83511,))


.. GENERATED FROM PYTHON SOURCE LINES 152-164

Before aggregating our product dataframe, we need to vectorize our categorical
columns. To do so, we use:

- |MinHashEncoder| on the "item" and "model" columns, because they both expose
  typos and text similarities.
- |OrdinalEncoder| on the "make" and "goods_code" columns, because they consist
  of orthogonal categories.

We bring this logic into a |TableVectorizer| to vectorize these columns in a
single step.
See `this example `_ for more details about these encoding choices.

.. GENERATED FROM PYTHON SOURCE LINES 164-177

.. code-block:: Python

    from sklearn.preprocessing import OrdinalEncoder

    from skrub import MinHashEncoder, TableVectorizer

    vectorizer = TableVectorizer(
        high_cardinality=MinHashEncoder(),  # encode ["item", "model"]
        specific_transformers=[
            (OrdinalEncoder(), ["make", "goods_code"]),
        ],
    )

    products_transformed = vectorizer.fit_transform(products)
    TableReport(products_transformed)
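
As a quick back-of-the-envelope aside, we can plug the actual MinHash
dimensionality into the :math:`d \times 24 \times 2` estimate above to see how
many columns the naive flattening approach would have produced:

.. code-block:: Python

    # Illustrative aside: read the MinHash dimensionality d off the encoded
    # "item_*" columns, then apply the d * 24 * 2 estimate for naive flattening.
    d = sum(col.startswith("item_") for col in products_transformed.columns)
    naive_n_cols = d * 24 * 2
    print(f"d = {d}; naive flattening would create about {naive_n_cols} encoded columns")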

.. GENERATED FROM PYTHON SOURCE LINES 178-198

Our objective is now to aggregate this vectorized product dataframe by
``"basket_ID"``, then to merge it on the baskets dataframe, still on the
``"basket_ID"`` column.

.. image:: ../../_static/08_example_aggjoiner.png
    :width: 900

|

|AggJoiner| can help us achieve exactly this. We need to pass the product
dataframe as an auxiliary table argument to |AggJoiner| in ``__init__``.
The ``aux_key`` argument represents both the columns used to group by on and
the columns used to join on.

The basket dataframe is our main table, and we indicate the columns to join on
with ``main_key``. Note that we pass the main table during ``fit``, and we
discuss the limitations of this design in the conclusion at the bottom of this
notebook.

The minimum ("min") is the most appropriate operation to aggregate encodings
from |MinHashEncoder|, for reasons that are beyond the scope of this notebook.

.. GENERATED FROM PYTHON SOURCE LINES 198-216

.. code-block:: Python

    from skrub import AggJoiner
    from skrub import _selectors as s

    # Skrub selectors allow us to select columns using glob patterns, which
    # reduces the boilerplate.
    minhash_cols_query = s.glob("item*") | s.glob("model*")
    minhash_cols = s.select(products_transformed, minhash_cols_query).columns

    agg_joiner = AggJoiner(
        aux_table=products_transformed,
        aux_key="basket_ID",
        main_key="ID",
        cols=minhash_cols,
        operations=["min"],
    )

    baskets_products = agg_joiner.fit_transform(baskets)
    TableReport(baskets_products)
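
To build intuition, here is a rough pandas equivalent of the aggregate-then-join
performed by the |AggJoiner| above (a sketch only; the column names in
|AggJoiner|'s actual output will differ):

.. code-block:: Python

    # Rough pandas equivalent of the AggJoiner step above (illustrative only):
    # 1) group the vectorized products by basket and take the minimum of the
    #    MinHash columns, 2) left-join the result onto the baskets table.
    aggregated = (
        products_transformed.groupby("basket_ID")[list(minhash_cols)].min().reset_index()
    )
    manual_join = baskets.merge(aggregated, left_on="ID", right_on="basket_ID", how="left")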

.. GENERATED FROM PYTHON SOURCE LINES 217-228

Now that we understand how to use the |AggJoiner|, we can assemble our pipeline
by chaining two |AggJoiner| transformers together:

- the first one to deal with the |MinHashEncoder| vectors, as we just saw
- the second one to deal with all the other columns

For the second |AggJoiner|, we use the mean, standard deviation, minimum and
maximum operations to extract a representative summary of each distribution.

|DropCols| is another skrub transformer which removes the "ID" column, which
doesn't bring any information after the joining operation.

.. GENERATED FROM PYTHON SOURCE LINES 228-254

.. code-block:: Python

    from scipy.stats import loguniform, randint
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.pipeline import make_pipeline

    from skrub import DropCols

    model = make_pipeline(
        AggJoiner(
            aux_table=products_transformed,
            aux_key="basket_ID",
            main_key="ID",
            cols=minhash_cols,
            operations=["min"],
        ),
        AggJoiner(
            aux_table=products_transformed,
            aux_key="basket_ID",
            main_key="ID",
            cols=["make", "goods_code", "cash_price", "Nbr_of_prod_purchas"],
            operations=["sum", "mean", "std", "min", "max"],
        ),
        DropCols(["ID"]),
        HistGradientBoostingClassifier(),
    )
    model


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Pipeline(steps=[('aggjoiner-1',
                     AggJoiner(aux_key='basket_ID',
                               aux_table=        basket_ID       item_00  ...  goods_code  Nbr_of_prod_purchas
    0         85517.0 -2.119082e+09  ...     11181.0                  1.0
    1         51113.0 -2.119082e+09  ...     10552.0                  1.0
    2         83008.0 -2.128260e+09  ...     12038.0                  1.0
    3         78712.0 -2.119082e+09  ...     10513.0                  1.0
    4         78712.0 -2.119082e+09  ...      4925.0                  1.0
    ...           ...           ...  ...         ...                  ...
    163352    42613.0 -1.944861e+09  ...      2807.0                  1.0
    163353...
    163354    43567.0 -2.119082e+09  ...     13080.0                  1.0
    163355    43567.0 -2.119082e+09  ...      9971.0                  1.0
    163356    68268.0 -2.128260e+09  ...     12106.0                  1.0

    [163357 rows x 65 columns],
                               cols=['make', 'goods_code', 'cash_price',
                                     'Nbr_of_prod_purchas'],
                               main_key='ID',
                               operations=['sum', 'mean', 'std', 'min', 'max'])),
                    ('dropcols', DropCols(cols=['ID'])),
                    ('histgradientboostingclassifier',
                     HistGradientBoostingClassifier())])


.. GENERATED FROM PYTHON SOURCE LINES 255-260

We tune the hyper-parameters of the |HGBC| model using ``RandomizedSearchCV``.
By default, the |HGBC| applies early stopping when there are at least 10_000
samples, so we don't need to explicitly tune the number of trees (``max_iter``):
we simply set it to a very high value of 1_000. We increase
``n_iter_no_change`` to make sure early stopping does not kick in too early.

.. GENERATED FROM PYTHON SOURCE LINES 260-283

.. code-block:: Python

    from time import time

    from sklearn.model_selection import RandomizedSearchCV

    param_distributions = dict(
        histgradientboostingclassifier__learning_rate=loguniform(1e-2, 5e-1),
        histgradientboostingclassifier__min_samples_leaf=randint(2, 64),
        histgradientboostingclassifier__max_leaf_nodes=[None, 10, 30, 60, 90],
        histgradientboostingclassifier__n_iter_no_change=[50],
        histgradientboostingclassifier__max_iter=[1000],
    )

    tic = time()
    search = RandomizedSearchCV(
        model,
        param_distributions,
        scoring="neg_log_loss",
        refit=False,
        n_iter=10,
        cv=3,
        verbose=1,
    ).fit(X_train, y_train)
    print(f"This operation took {time() - tic:.1f}s")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Fitting 3 folds for each of 10 candidates, totalling 30 fits
    This operation took 128.7s


.. GENERATED FROM PYTHON SOURCE LINES 284-285

The best hyper-parameters are:

.. GENERATED FROM PYTHON SOURCE LINES 285-288

.. code-block:: Python

    pd.Series(search.best_params_)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    histgradientboostingclassifier__learning_rate         0.092584
    histgradientboostingclassifier__max_iter            1000.000000
    histgradientboostingclassifier__max_leaf_nodes        30.000000
    histgradientboostingclassifier__min_samples_leaf      20.000000
    histgradientboostingclassifier__n_iter_no_change      50.000000
    dtype: float64


.. GENERATED FROM PYTHON SOURCE LINES 289-297

To benchmark our performance, we plot the log loss of our model on the test set
against the log loss of a dummy model that always outputs the observed
probability of the two classes. As this dataset is extremely imbalanced, this
dummy model should be a good baseline.

The vertical bar represents one standard deviation around the mean of the
cross-validation log loss.

.. GENERATED FROM PYTHON SOURCE LINES 297-331

.. code-block:: Python

    import seaborn as sns
    from matplotlib import pyplot as plt
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import log_loss

    results = search.cv_results_
    best_idx = search.best_index_
    log_loss_model_mean = -results["mean_test_score"][best_idx]
    log_loss_model_std = results["std_test_score"][best_idx]

    dummy = DummyClassifier(strategy="prior").fit(X_train, y_train)
    y_proba_dummy = dummy.predict_proba(X_test)
    log_loss_dummy = log_loss(y_true=y_test, y_pred=y_proba_dummy)

    fig, ax = plt.subplots()
    ax.bar(
        height=[log_loss_model_mean, log_loss_dummy],
        x=["AggJoiner model", "Dummy"],
        color=["C0", "C4"],
    )
    for container in ax.containers:
        ax.bar_label(container, padding=4)
    ax.vlines(
        x="AggJoiner model",
        ymin=log_loss_model_mean - log_loss_model_std,
        ymax=log_loss_model_mean + log_loss_model_std,
        linestyle="-",
        linewidth=1,
        color="k",
    )
    sns.despine()
    ax.set_title("Log loss (lower is better)")


.. image-sg:: /auto_examples/images/sphx_glr_08_join_aggregation_001.png
   :alt: Log loss (lower is better)
   :srcset: /auto_examples/images/sphx_glr_08_join_aggregation_001.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Text(0.5, 1.0, 'Log loss (lower is better)')


.. GENERATED FROM PYTHON SOURCE LINES 332-350

Conclusion
----------

With |AggJoiner|, you can bring the aggregation and joining operations within a
sklearn pipeline, and train models more efficiently.

One known limitation of both the |AggJoiner| and |Joiner| is that the auxiliary
data to join is passed during the ``__init__`` method instead of the ``fit``
method, and is therefore fixed once the model has been trained.
This limitation causes two main issues:

1. **Bigger model serialization:** since the dataset has to be pickled along
   with the model, it can result in a massive file size on disk.

2. **Inflexibility with new, unseen data in a production environment:** to use
   new auxiliary data, you would need to replace the auxiliary table in the
   |AggJoiner| that was used during ``fit`` with the updated data, which is a
   rather hacky approach (see the sketch below).

These limitations will be addressed later in skrub.
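
As a rough illustration of the workaround described in point 2, swapping in
fresh auxiliary data could look like the following sketch. Here
``new_products`` is a hypothetical dataframe with the same schema as
``products``, and we re-fit the pipeline so that the new table is actually
aggregated; this is exactly the kind of manual surgery that future skrub
versions aim to avoid.

.. code-block:: Python

    # Hypothetical sketch only: "new_products" is an assumed, refreshed
    # products table with the same schema as the original one.
    new_products_transformed = vectorizer.transform(new_products)

    # Replace the auxiliary table in both AggJoiner steps (step names follow
    # make_pipeline's defaults, as seen in the pipeline repr above), then
    # re-fit so the new auxiliary data is aggregated and joined during training.
    model.set_params(**{
        "aggjoiner-1__aux_table": new_products_transformed,
        "aggjoiner-2__aux_table": new_products_transformed,
    })
    model.fit(X_train, y_train)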

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (3 minutes 40.598 seconds)


.. _sphx_glr_download_auto_examples_08_join_aggregation.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/skrub-data/skrub/0.5.4?urlpath=lab/tree/notebooks/auto_examples/08_join_aggregation.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../lite/lab/index.html?path=auto_examples/08_join_aggregation.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 08_join_aggregation.ipynb <08_join_aggregation.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 08_join_aggregation.py <08_join_aggregation.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 08_join_aggregation.zip <08_join_aggregation.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_