.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/08_join_aggregation.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end <sphx_glr_download_auto_examples_08_join_aggregation.py>` to download the full example code, or to run this example in your browser via JupyterLite or Binder. .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_08_join_aggregation.py: AggJoiner on a credit fraud dataset =================================== Many problems involve tables whose entities have a one-to-many relationship. To simplify aggregate-then-join operations for machine learning, we can include the |AggJoiner| in our pipeline. In this example, we are tackling a fraudulent loan detection use case. Because fraud is rare, this dataset is extremely imbalanced, with a prevalence of around 1.4%. The data consists of two distinct entities: e-commerce "baskets", and "products". Baskets can be tagged fraudulent (1) or not (0), and are essentially a list of products of variable size. Each basket is linked to at least one product, e.g. basket 1 can have products 1 and 2. .. image:: ../../_static/08_example_data.png :width: 450 px | Our aim is to predict which baskets are fraudulent. The products dataframe can be joined on the baskets dataframe using the ``basket_ID`` column. Each product has several attributes: - a category (marked by the column ``"item"``), - a model (``"model"``), - a brand (``"make"``), - a merchant code (``"goods_code"``), - a price per unit (``"cash_price"``), - a quantity selected in the basket (``"Nbr_of_prod_purchas"``) .. |AggJoiner| replace:: :class:`~skrub.AggJoiner` .. |Joiner| replace:: :class:`~skrub.Joiner` .. |DropCols| replace:: :class:`~skrub.DropCols` .. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer` .. |TableReport| replace:: :class:`~skrub.TableReport` .. |MinHashEncoder| replace:: :class:`~skrub.MinHashEncoder` .. 
|TargetEncoder| replace:: :class:`~sklearn.preprocessing.TargetEncoder` .. |make_pipeline| replace:: :func:`~sklearn.pipeline.make_pipeline` .. |Pipeline| replace:: :class:`~sklearn.pipeline.Pipeline` .. |HGBC| replace:: :class:`~sklearn.ensemble.HistGradientBoostingClassifier` .. |OrdinalEncoder| replace:: :class:`~sklearn.preprocessing.OrdinalEncoder` .. |TunedThresholdClassifierCV| replace:: :class:`~sklearn.model_selection.TunedThresholdClassifierCV` .. |CalibrationDisplay| replace:: :class:`~sklearn.calibration.CalibrationDisplay` .. |pandas.melt| replace:: :func:`~pandas.melt` .. GENERATED FROM PYTHON SOURCE LINES 82-89 .. code-block:: Python from skrub import TableReport from skrub.datasets import fetch_credit_fraud bunch = fetch_credit_fraud() products, baskets = bunch.products, bunch.baskets TableReport(products) .. raw:: html

    <!-- skrub TableReport: interactive HTML output omitted -->



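The two reports above reflect the one-to-many relationship described in the introduction: each basket row is linked to one or more product rows. On toy data (hypothetical IDs and values, not the actual dataset), this relationship can be checked with plain pandas:

```python
import pandas as pd

# Toy stand-ins for the two tables (hypothetical IDs and values).
baskets_toy = pd.DataFrame({"ID": [1, 2, 3], "fraud_flag": [0, 1, 0]})
products_toy = pd.DataFrame(
    {"basket_ID": [1, 1, 2, 3, 3, 3], "cash_price": [10, 25, 5, 8, 8, 30]}
)

# Every basket should appear at least once in the products table.
assert baskets_toy["ID"].isin(products_toy["basket_ID"]).all()

# The number of products per basket varies: this is the "many" side.
n_products = products_toy["basket_ID"].value_counts().sort_index()
print(n_products.to_dict())  # {1: 2, 2: 1, 3: 3}
```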
.. GENERATED FROM PYTHON SOURCE LINES 90-92 .. code-block:: Python TableReport(baskets) .. raw:: html

    <!-- skrub TableReport: interactive HTML output omitted -->



.. GENERATED FROM PYTHON SOURCE LINES 93-106 Naive aggregation ----------------- Let's explore a naive solution first. .. note:: Click :ref:`here <agg-joiner-anchor>` to skip this section and see the AggJoiner in action! The first idea that comes to mind to merge these two tables is to aggregate the product attributes into lists, using their basket IDs. .. GENERATED FROM PYTHON SOURCE LINES 106-109 .. code-block:: Python products_grouped = products.groupby("basket_ID").agg(list) TableReport(products_grouped) .. raw:: html

    <!-- skrub TableReport: interactive HTML output omitted -->



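To see concretely what the ``groupby(...).agg(list)`` call above produces, here is the same aggregation on a toy products table (hypothetical values):

```python
import pandas as pd

# Toy products table: basket 1 holds two products, basket 2 holds one.
toy_products = pd.DataFrame(
    {
        "basket_ID": [1, 1, 2],
        "item": ["phone", "case", "tv"],
        "cash_price": [500, 20, 900],
    }
)

# After aggregation, each cell holds a list of variable length, one per basket.
toy_grouped = toy_products.groupby("basket_ID").agg(list)
print(toy_grouped.loc[1, "item"])  # ['phone', 'case']
print(toy_grouped.loc[2, "cash_price"])  # [900]
```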
.. GENERATED FROM PYTHON SOURCE LINES 110-113 Then, we can expand all lists into columns, as if we were "flattening" the dataframe. We end up with a products dataframe ready to be joined on the baskets dataframe, using ``"basket_ID"`` as the join key. .. GENERATED FROM PYTHON SOURCE LINES 113-123 .. code-block:: Python import pandas as pd products_flatten = [] for col in products_grouped.columns: cols = [f"{col}{idx}" for idx in range(24)] products_flatten.append(pd.DataFrame(products_grouped[col].to_list(), columns=cols)) products_flatten = pd.concat(products_flatten, axis=1) products_flatten.insert(0, "basket_ID", products_grouped.index) TableReport(products_flatten) .. raw:: html

    <!-- skrub TableReport: interactive HTML output omitted -->



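Expanding lists of unequal lengths, as done above, pads the shorter baskets with missing values. A toy sketch of that expansion (hypothetical values):

```python
import pandas as pd

# Lists of unequal lengths, as produced by the aggregation step.
item_lists = pd.Series([["phone", "case"], ["tv"]])

# Expanding into columns pads the shorter list with a missing value.
flat = pd.DataFrame(item_lists.to_list(), columns=["item0", "item1"])
assert flat.loc[0, "item1"] == "case"
assert pd.isna(flat.loc[1, "item1"])  # the single-product basket is padded
```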
.. GENERATED FROM PYTHON SOURCE LINES 124-144 Look at the "Stats" section of the |TableReport| above. Does anything strike you? Not only did we create 144 columns, but most of these columns are filled with NaN, which is very inefficient for learning! This is because each basket contains a variable number of products, up to 24, and we created one column for each product attribute, for each position (up to 24) in the dataframe. Moreover, if we wanted to replace text columns with encodings, we would create :math:`d \times 24 \times 2` columns (encoding of dimensionality :math:`d`, for 24 products, for the ``"item"`` and ``"make"`` columns), which would explode the memory usage. .. _agg-joiner-anchor: AggJoiner --------- Let's now see how the |AggJoiner| can help us solve this. We begin by splitting our baskets dataset into a training and a testing set. .. GENERATED FROM PYTHON SOURCE LINES 144-150 .. code-block:: Python from sklearn.model_selection import train_test_split X, y = baskets[["ID"]], baskets["fraud_flag"] X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.1) X_train.shape, y_train.shape .. rst-class:: sphx-glr-script-out .. code-block:: none ((83511, 1), (83511,)) .. GENERATED FROM PYTHON SOURCE LINES 151-163 Before aggregating our product dataframe, we need to vectorize our categorical columns. To do so, we use: - |MinHashEncoder| on the "item" and "model" columns, because they both exhibit typos and textual similarities. - |OrdinalEncoder| on the "make" and "goods_code" columns, because they consist of orthogonal categories. We bring this logic into a |TableVectorizer| to vectorize these columns in a single step. See `this example `_ for more details about these encoding choices. .. GENERATED FROM PYTHON SOURCE LINES 163-176 ..
code-block:: Python from sklearn.preprocessing import OrdinalEncoder from skrub import MinHashEncoder, TableVectorizer vectorizer = TableVectorizer( high_cardinality=MinHashEncoder(), # encode ["item", "model"] specific_transformers=[ (OrdinalEncoder(), ["make", "goods_code"]), ], ) products_transformed = vectorizer.fit_transform(products) TableReport(products_transformed) .. raw:: html

    <!-- skrub TableReport: interactive HTML output omitted -->



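A property of MinHash worth keeping in mind for what follows: the element-wise minimum of two signatures equals the signature of the union of their n-gram sets. The sketch below illustrates this with a simplified pure-Python minhash (salted CRC32 as stand-in hash functions; this is an illustration of the principle, not skrub's actual implementation):

```python
import zlib


def ngrams(s, n=3):
    """Set of character n-grams of a string."""
    return {s[i : i + n] for i in range(len(s) - n + 1)}


def minhash(grams, n_components=4):
    """One signature entry per simulated hash function (salted CRC32)."""
    return [
        min(zlib.crc32(f"{seed}{g}".encode()) for g in grams)
        for seed in range(n_components)
    ]


a, b = ngrams("computer"), ngrams("table")

# Signature of the union == element-wise min of the two signatures.
sig_union = minhash(a | b)
sig_min = [min(x, y) for x, y in zip(minhash(a), minhash(b))]
assert sig_union == sig_min
```

This is why min-aggregating the MinHash columns of a basket's products gives a meaningful basket representation: it approximates the signature of all the products' n-grams pooled together.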
.. GENERATED FROM PYTHON SOURCE LINES 177-197 Our objective is now to aggregate this vectorized product dataframe by ``"basket_ID"``, then to merge it on the baskets dataframe, still using ``"basket_ID"``. .. image:: ../../_static/08_example_aggjoiner.png :width: 900 | |AggJoiner| can help us achieve exactly this. We need to pass the product dataframe as an auxiliary table argument to |AggJoiner| in ``__init__``. The ``aux_key`` argument represents both the columns to group by and the columns to join on. The basket dataframe is our main table, and we indicate the columns to join on with ``main_key``. Note that we pass the main table during ``fit``, and we discuss the limitations of this design in the conclusion at the bottom of this notebook. The minimum ("min") is the most appropriate operation to aggregate encodings from |MinHashEncoder|, for reasons that are outside the scope of this notebook. .. GENERATED FROM PYTHON SOURCE LINES 197-215 .. code-block:: Python from skrub import AggJoiner from skrub import _selectors as s # Skrub selectors allow us to select columns using regexes, which reduces # the boilerplate. minhash_cols_query = s.glob("item*") | s.glob("model*") minhash_cols = s.select(products_transformed, minhash_cols_query).columns agg_joiner = AggJoiner( aux_table=products_transformed, aux_key="basket_ID", main_key="ID", cols=minhash_cols, operations=["min"], ) baskets_products = agg_joiner.fit_transform(baskets) TableReport(baskets_products) .. raw:: html

    <!-- skrub TableReport: interactive HTML output omitted -->



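Conceptually, the fit-transform above performs an aggregate-then-join that can be sketched in plain pandas on toy data (hypothetical values; the column naming here is simplified and not skrub's exact output):

```python
import pandas as pd

# Toy main (baskets) and auxiliary (products) tables.
main = pd.DataFrame({"ID": [1, 2]})
aux = pd.DataFrame({"basket_ID": [1, 1, 2], "item_0": [0.3, 0.1, 0.7]})

# 1) aggregate the auxiliary table on the key, 2) left-join on the main table.
agg = aux.groupby("basket_ID")[["item_0"]].min().add_suffix("_min")
joined = main.merge(agg, left_on="ID", right_index=True, how="left")
print(joined)
#    ID  item_0_min
# 0   1         0.1
# 1   2         0.7
```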
.. GENERATED FROM PYTHON SOURCE LINES 216-227 Now that we understand how to use the |AggJoiner|, we can assemble our pipeline by chaining two |AggJoiner| together: - the first one to deal with the |MinHashEncoder| vectors, as we just saw - the second one to deal with all the other columns For the second |AggJoiner|, we use the sum, mean, standard deviation, minimum and maximum operations to extract a representative summary of each distribution. |DropCols| is another skrub transformer which removes the "ID" column, which no longer carries any information after the join. .. GENERATED FROM PYTHON SOURCE LINES 227-253 .. code-block:: Python from scipy.stats import loguniform, randint from sklearn.ensemble import HistGradientBoostingClassifier from sklearn.pipeline import make_pipeline from skrub import DropCols model = make_pipeline( AggJoiner( aux_table=products_transformed, aux_key="basket_ID", main_key="ID", cols=minhash_cols, operations=["min"], ), AggJoiner( aux_table=products_transformed, aux_key="basket_ID", main_key="ID", cols=["make", "goods_code", "cash_price", "Nbr_of_prod_purchas"], operations=["sum", "mean", "std", "min", "max"], ), DropCols(["ID"]), HistGradientBoostingClassifier(), ) model .. raw:: html
Pipeline(steps=[('aggjoiner-1',
                     AggJoiner(aux_key='basket_ID',
                               aux_table=        basket_ID       item_00  ...  goods_code  Nbr_of_prod_purchas
    0         85517.0 -2.119082e+09  ...     11181.0                  1.0
    1         51113.0 -2.119082e+09  ...     10552.0                  1.0
    2         83008.0 -2.128260e+09  ...     12038.0                  1.0
    3         78712.0 -2.119082e+09  ...     10513.0                  1.0
    4         78712.0 -2.119082e+09  ...      4925.0                  1.0
    ...           ...           ...  ...         ...                  ...
    163352    42613.0 -1.944861e+09  ...      2807.0                  1.0
    163353...
    163354    43567.0 -2.119082e+09  ...     13080.0                  1.0
    163355    43567.0 -2.119082e+09  ...      9971.0                  1.0
    163356    68268.0 -2.128260e+09  ...     12106.0                  1.0

    [163357 rows x 65 columns],
                               cols=['make', 'goods_code', 'cash_price',
                                     'Nbr_of_prod_purchas'],
                               main_key='ID',
                               operations=['sum', 'mean', 'std', 'min', 'max'])),
                    ('dropcols', DropCols(cols=['ID'])),
                    ('histgradientboostingclassifier',
                     HistGradientBoostingClassifier())])


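The second |AggJoiner| in the pipeline computes several statistics per basket. In plain pandas, that step corresponds to something like the following (toy values, simplified naming):

```python
import pandas as pd

# Toy products table with one numeric column (hypothetical values).
aux = pd.DataFrame({"basket_ID": [1, 1, 2], "cash_price": [10.0, 30.0, 5.0]})

# One summary column per (source column, operation) pair.
summary = aux.groupby("basket_ID")["cash_price"].agg(
    ["sum", "mean", "std", "min", "max"]
)
print(summary.loc[1].to_dict())
# {'sum': 40.0, 'mean': 20.0, 'std': 14.142..., 'min': 10.0, 'max': 30.0}
```

Note that single-product baskets yield NaN for "std"; this is fine here because HistGradientBoostingClassifier supports missing values natively.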
.. GENERATED FROM PYTHON SOURCE LINES 254-255 We tune the hyper-parameters of the |HGBC| to get good performance. .. GENERATED FROM PYTHON SOURCE LINES 255-277 .. code-block:: Python from time import time from sklearn.model_selection import RandomizedSearchCV param_distributions = dict( histgradientboostingclassifier__learning_rate=loguniform(1e-3, 1), histgradientboostingclassifier__max_depth=randint(3, 9), histgradientboostingclassifier__max_leaf_nodes=[None, 10, 30, 60, 90], histgradientboostingclassifier__max_iter=randint(50, 500), ) tic = time() search = RandomizedSearchCV( model, param_distributions, scoring="neg_log_loss", refit=False, n_iter=10, cv=3, verbose=1, ).fit(X_train, y_train) print(f"This operation took {time() - tic:.1f}s") .. rst-class:: sphx-glr-script-out .. code-block:: none Fitting 3 folds for each of 10 candidates, totalling 30 fits This operation took 150.3s .. GENERATED FROM PYTHON SOURCE LINES 278-279 The best hyper-parameters are: .. GENERATED FROM PYTHON SOURCE LINES 279-282 .. code-block:: Python pd.Series(search.best_params_) .. rst-class:: sphx-glr-script-out .. code-block:: none histgradientboostingclassifier__learning_rate 0.022994 histgradientboostingclassifier__max_depth 7.000000 histgradientboostingclassifier__max_iter 398.000000 histgradientboostingclassifier__max_leaf_nodes 60.000000 dtype: float64 .. GENERATED FROM PYTHON SOURCE LINES 283-291 To benchmark our performance, we plot the log loss of our model on the test set against the log loss of a dummy model that always outputs the observed probability of the two classes. As this dataset is extremely imbalanced, this dummy model should be a good baseline. The vertical bar represents one standard deviation around the mean of the cross-validation log loss. .. GENERATED FROM PYTHON SOURCE LINES 291-325 ..
code-block:: Python import seaborn as sns from matplotlib import pyplot as plt from sklearn.dummy import DummyClassifier from sklearn.metrics import log_loss results = search.cv_results_ best_idx = search.best_index_ log_loss_model_mean = -results["mean_test_score"][best_idx] log_loss_model_std = results["std_test_score"][best_idx] dummy = DummyClassifier(strategy="prior").fit(X_train, y_train) y_proba_dummy = dummy.predict_proba(X_test) log_loss_dummy = log_loss(y_true=y_test, y_pred=y_proba_dummy) fig, ax = plt.subplots() ax.bar( height=[log_loss_model_mean, log_loss_dummy], x=["AggJoiner model", "Dummy"], color=["C0", "C4"], ) for container in ax.containers: ax.bar_label(container, padding=4) ax.vlines( x="AggJoiner model", ymin=log_loss_model_mean - log_loss_model_std, ymax=log_loss_model_mean + log_loss_model_std, linestyle="-", linewidth=1, color="k", ) sns.despine() ax.set_title("Log loss (lower is better)") .. image-sg:: /auto_examples/images/sphx_glr_08_join_aggregation_001.png :alt: Log loss (lower is better) :srcset: /auto_examples/images/sphx_glr_08_join_aggregation_001.png :class: sphx-glr-single-img .. rst-class:: sphx-glr-script-out .. code-block:: none Text(0.5, 1.0, 'Log loss (lower is better)') .. GENERATED FROM PYTHON SOURCE LINES 326-344 Conclusion ---------- With |AggJoiner|, you can bring the aggregation and joining operations into a scikit-learn pipeline, and train models more efficiently. One known limitation of both the |AggJoiner| and |Joiner| is that the auxiliary data to join is passed during the ``__init__`` method instead of the ``fit`` method, and is therefore fixed once the model has been trained. This limitation causes two main issues: 1. **Bigger model serialization:** Since the dataset has to be pickled along with the model, it can result in a massive file size on disk. 2. 
**Inflexibility with new, unseen data in a production environment:** To use new auxiliary data, you would need to replace the auxiliary table in the |AggJoiner| that was used during ``fit`` with the updated data, which is a rather hacky approach. These limitations will be addressed later in skrub. .. rst-class:: sphx-glr-timing **Total running time of the script:** (4 minutes 31.548 seconds) .. _sphx_glr_download_auto_examples_08_join_aggregation.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: binder-badge .. image:: images/binder_badge_logo.svg :target: https://mybinder.org/v2/gh/skrub-data/skrub/0.4.1?urlpath=lab/tree/notebooks/auto_examples/08_join_aggregation.ipynb :alt: Launch binder :width: 150 px .. container:: lite-badge .. image:: images/jupyterlite_badge_logo.svg :target: ../lite/lab/index.html?path=auto_examples/08_join_aggregation.ipynb :alt: Launch JupyterLite :width: 150 px .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: 08_join_aggregation.ipynb <08_join_aggregation.ipynb>` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: 08_join_aggregation.py <08_join_aggregation.py>` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: 08_join_aggregation.zip <08_join_aggregation.zip>` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_