Introduction to machine-learning pipelines with skrub DataOps#

In this example, we show how we can use Skrub’s DataOps to build a machine learning pipeline that records all the operations involved in pre-processing data and training a model. We will also show how to save the model, load it back, and then use it to make predictions on new, unseen data.

This example is meant to be an introduction to skrub DataOps, and as such it does not cover all of their features: further examples in the Skrub DataOps gallery go into more detail on how to use DataOps for more complex tasks.

The data#

We begin by loading the employee salaries dataset, which is a regression dataset that contains information about employees and their current annual salaries. By default, the datasets.fetch_employee_salaries() function returns the training set. We will load the test set later, to evaluate our model on unseen data.

from skrub.datasets import fetch_employee_salaries

training_data = fetch_employee_salaries(split="train").employee_salaries

We can take a look at the dataset using the TableReport. This dataset contains numerical, categorical, and datetime features. The column current_annual_salary is the target variable we want to predict.
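
For example, a report for the training data can be generated like this:

from skrub import TableReport

TableReport(training_data)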


Assembling our DataOps plan#

Our goal is to predict the current_annual_salary of employees based on their other features. We will use skrub’s DataOps to combine both skrub and scikit-learn objects into a single DataOps plan, which will allow us to preprocess the data, train a model, and tune hyperparameters.

We begin by defining a skrub var(), which is the entry point for our DataOps plan.
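
In code, this looks like the following; we name the variable "data", which is the key we will use later when passing new data to the learner:

import skrub

data_var = skrub.var("data", training_data)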

Next, we define the initial features X and the target variable y. We use the DataOp.skb.mark_as_X() and DataOp.skb.mark_as_y() methods to mark these variables in the DataOps plan. This allows skrub to properly split these objects into training and validation sets when executing cross-validation or hyperparameter tuning.

X = data_var.drop("current_annual_salary", axis=1).skb.mark_as_X()
y = data_var["current_annual_salary"].skb.mark_as_y()

Our first step is to vectorize the features in X. We will use the TableVectorizer to convert the categorical and numerical features into a numerical format that can be used by machine learning algorithms. We apply the vectorizer to X using the .skb.apply() method, which allows us to apply any scikit-learn compatible transformer to the skrub variable.

from skrub import TableVectorizer

vectorizer = TableVectorizer()

X_vec = X.skb.apply(vectorizer)
X_vec
<Apply TableVectorizer>
[Interactive report. DataOps graph: Var 'data' → X: CallMethod 'drop' → Apply TableVectorizer]

By clicking on Show graph, we can see the DataOps plan that has been created: the plan shows the steps that have been applied to the data so far.

Now that we have the vectorized features, we can proceed to train a model. We use a scikit-learn HistGradientBoostingRegressor to predict the target variable. We apply the model to the vectorized features using .skb.apply, and pass y as the target variable. Note that the resulting predictor will show the prediction results on the preview subsample, but the actual model has not been fitted yet.
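
In code, this step might look like the following sketch; we name the result predictions, the name used in the rest of this example:

from sklearn.ensemble import HistGradientBoostingRegressor

hgb = HistGradientBoostingRegressor()
# apply the regressor to the vectorized features, with y as the prediction target
predictions = X_vec.skb.apply(hgb, y=y)
predictions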

<Apply HistGradientBoostingRegressor>
[Interactive report. DataOps graph: Var 'data' → X: CallMethod 'drop', y: GetItem 'current_annual_salary' → Apply TableVectorizer → Apply HistGradientBoostingRegressor]

Now that we have built our entire plan, we can explore it in more detail with the .skb.full_report() method:

predictions.skb.full_report()

This produces a folder on disk rather than displaying inline in a notebook, so we do not run it here.

This method evaluates each step in the plan and shows detailed information about the operations that are being performed.

Turning the DataOps plan into a learner, for later reuse#

Now that we have defined the predictor, we can create a learner, a standalone object that contains all the steps in the DataOps plan. By passing fitted=True, we fit the learner right away, so that it can be used to make predictions on new data.

trained_learner = predictions.skb.make_learner(fitted=True)

A big advantage of the learner is that it can be pickled and saved to disk, allowing us to reuse the trained model later without needing to retrain it. The learner contains all steps in the DataOps plan, including the fitted vectorizer and the trained model. We can save it using Python’s pickle module: here we use pickle.dumps to serialize the learner object into a byte string.
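
For example (saved_model is a hypothetical name for the resulting byte string):

import pickle

# serialize the fitted learner, including the vectorizer and the trained model
saved_model = pickle.dumps(trained_learner)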

We can now load the saved model back into memory using pickle.loads.
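
Continuing the sketch above, loaded_model is the name used in the rest of this example:

loaded_model = pickle.loads(saved_model)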

Now, we can make predictions on new data using the loaded model, by passing a dictionary with the skrub variable names as keys. We don’t have to create a new variable, as this will be done internally by the learner. In fact, the learner is similar to a scikit-learn estimator, but rather than taking X and y as inputs, it takes a dictionary (the “environment”), where each key is the name of one of the skrub variables in the plan.

We can now get the test set of the employee salaries dataset:

unseen_data = fetch_employee_salaries(split="test").employee_salaries

Then, we can use the loaded model to make predictions on the unseen data by passing the environment as a dictionary:
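
# the "data" key matches the name given to the skrub variable at the start of the plan
loaded_model.predict({"data": unseen_data})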

array([116382.06417108,  45114.33938599,  46680.82086958, ...,
       105486.55018287, 146020.37131876,  73028.94144409], shape=(1228,))

We can also evaluate the model’s performance using the score method, which relies on the default scikit-learn scorer of the underlying predictor:

loaded_model.score({"data": unseen_data})
0.9407037991754476

Conclusion#

In this example, we have briefly introduced skrub DataOps and how they can be used to build powerful machine learning pipelines. We have seen how to preprocess data and train a model, how to save and load the trained model, and how to make predictions on new data with it.

However, skrub DataOps are significantly more powerful than what we have shown here: for more advanced examples, see Skrub DataOps.
