First and last rows of the dataset, as displayed in the skrub table report:

| | gender | department | department_name | division | assignment_category | employee_position_title | date_first_hired | year_first_hired |
|---|---|---|---|---|---|---|---|---|
| 0 | F | POL | Department of Police | MSB Information Mgmt and Tech Division Records Management Section | Fulltime-Regular | Office Services Coordinator | 09/22/1986 | 1986 |
| 1 | M | POL | Department of Police | ISB Major Crimes Division Fugitive Section | Fulltime-Regular | Master Police Officer | 09/12/1988 | 1988 |
| 2 | F | HHS | Department of Health and Human Services | Adult Protective and Case Management Services | Fulltime-Regular | Social Worker IV | 11/19/1989 | 1989 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9226 | M | CCL | County Council | Council Central Staff | Fulltime-Regular | Manager II | 09/05/2006 | 2006 |
| 9227 | M | DLC | Department of Liquor Control | Licensure, Regulation and Education | Fulltime-Regular | Alcohol/Tobacco Enforcement Specialist II | 01/30/2012 | 2012 |
Per-column summary statistics from the report (division, employee_position_title, and date_first_hired are flagged as high-cardinality columns, with more than 40 unique values):

Column | Column name | dtype | Is sorted | Null values | Unique values | Mean | Std | Min | Median | Max |
---|---|---|---|---|---|---|---|---|---|---|
0 | gender | ObjectDType | False | 17 (0.2%) | 2 (< 0.1%) | | | | | |
1 | department | ObjectDType | False | 0 (0.0%) | 37 (0.4%) | | | | | |
2 | department_name | ObjectDType | False | 0 (0.0%) | 37 (0.4%) | | | | | |
3 | division | ObjectDType | False | 0 (0.0%) | 694 (7.5%) | | | | | |
4 | assignment_category | ObjectDType | False | 0 (0.0%) | 2 (< 0.1%) | | | | | |
5 | employee_position_title | ObjectDType | False | 0 (0.0%) | 443 (4.8%) | | | | | |
6 | date_first_hired | ObjectDType | False | 0 (0.0%) | 2,264 (24.5%) | | | | | |
7 | year_first_hired | Int64DType | False | 0 (0.0%) | 51 (0.6%) | 2.00e+03 | 9.33 | 1,965 | 2,005 | 2,016 |
Strongest associations between pairs of columns:

Column 1 | Column 2 | Cramér's V | Pearson's Correlation |
---|---|---|---|
department | department_name | 1.00 | |
assignment_category | employee_position_title | 0.674 | |
division | assignment_category | 0.624 | |
division | employee_position_title | 0.522 | |
department_name | employee_position_title | 0.416 | |
department | employee_position_title | 0.416 | |
department | assignment_category | 0.404 | |
department_name | assignment_category | 0.404 | |
gender | department | 0.377 | |
gender | department_name | 0.377 | |
department | division | 0.364 | |
department_name | division | 0.364 | |
gender | employee_position_title | 0.285 | |
gender | assignment_category | 0.276 | |
gender | division | 0.264 | |
employee_position_title | date_first_hired | 0.246 | |
department | date_first_hired | 0.165 | |
department_name | date_first_hired | 0.165 | |
date_first_hired | year_first_hired | 0.164 | |
employee_position_title | year_first_hired | 0.138 |
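The report above can be regenerated in a couple of lines. A minimal sketch, assuming the data is skrub's bundled employee salaries example dataset (which matches the columns shown):

```python
from skrub import TableReport
from skrub.datasets import fetch_employee_salaries

# Build the interactive report: sample rows, per-column statistics,
# and the column associations shown above
df = fetch_employee_salaries().X
TableReport(df)
```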
Powerful Feature Engineering
Encode text and high-cardinality categorical data (StringEncoder, TextEncoder, GapEncoder, and MinHashEncoder), or extract features from dates with the DatetimeEncoder.
For example, topic-encoding the job titles from the report above:

```python
import matplotlib.pyplot as plt
from skrub import GapEncoder
from skrub.datasets import fetch_employee_salaries

# Load the employee salaries dataset explored above
df = fetch_employee_salaries().X

# Learn latent topics from the raw job titles, then visualize the
# topic activations of the first few titles
gap = GapEncoder().fit(df["employee_position_title"])
encoded_labels = gap.transform(df["employee_position_title"].head())
plt.imshow(encoded_labels)
plt.show()
```
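The DatetimeEncoder mentioned above works the same way on date columns. A minimal sketch, assuming the same df and using skrub's ToDatetime transformer to parse the date strings first:

```python
from skrub import DatetimeEncoder, ToDatetime

# Parse the "MM/DD/YYYY" strings into datetimes, then expand them into
# numeric features such as year, month, and day
dates = ToDatetime().fit_transform(df["date_first_hired"])
date_features = DatetimeEncoder().fit_transform(dates)
```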
Tune arbitrary data wrangling, inspect it, apply it to new data
Chain an arbitrary set of operations to prepare, transform, and assemble multiple tables for machine learning, then tune the full pipeline, inspect it, or apply it to new data. Works with any computational or dataframe engine.
Given two input dataframes, products_df and baskets_df:
```python
# A dataset with multiple tables
import skrub
from skrub.datasets import fetch_credit_fraud

# Use the test split to keep the data small
dataset = fetch_credit_fraud(split="test")

# Extract simplified tables: keep only the columns central to the analysis
products_df = dataset.products[["basket_ID", "cash_price", "Nbr_of_prod_purchas"]]
# Rename the baskets' ID column to "basket_ID" so the tables can be joined
baskets_df = dataset.baskets.rename(columns={"ID": "basket_ID"})
```
products_df:

basket_ID | cash_price | Nbr_of_prod_purchas |
---|---|---|
85517 | 889 | 1 |
83008 | 1399 | 1 |
... | ... | ... |
95939 | 1149 | 1 |
95939 | 7 | 1 |
baskets_df:

basket_ID | fraud_flag |
---|---|
85517 | 0 |
83008 | 0 |
... | ... |
97639 | 0 |
95939 | 0 |
```python
from sklearn.ensemble import ExtraTreesClassifier

# Define the inputs of our skrub pipeline
products = skrub.var("products", products_df)
baskets = skrub.var("baskets", baskets_df)

# Specify our "X" and "y" variables for machine learning
basket_IDs = baskets[["basket_ID"]].skb.mark_as_X()
fraud_flags = baskets["fraud_flag"].skb.mark_as_y()

# A pandas-based data-preparation pipeline that merges the tables,
# leaving the aggregation function as a tunable choice
aggregated_products = (
    products.groupby("basket_ID")
    .agg(skrub.choose_from(("mean", "max", "count")))
    .reset_index()
)
features = basket_IDs.merge(aggregated_products, on="basket_ID")
predictions = features.skb.apply(ExtraTreesClassifier(), y=fraud_flags)

# Now use skrub to tune the hyperparameters of the above pipeline
search = predictions.skb.make_grid_search(fitted=True, scoring="roc_auc")
search.plot_results()
```
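The tuned pipeline can then be reused on fresh tables. A minimal sketch, assuming the `.skb.make_learner()` API of recent skrub releases and hypothetical new_products / new_baskets dataframes with the same schemas as above:

```python
# Sketch: turn the pipeline into a fitted learner and apply it to new data.
# `make_learner` is assumed from recent skrub releases; new_products and
# new_baskets are hypothetical tables with the same schemas as above.
learner = predictions.skb.make_learner(fitted=True)
new_predictions = learner.predict(
    {"products": new_products, "baskets": new_baskets}
)
```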
Our Community
The skrub project is powered by the efforts of a world-wide community of contributors.
Try it yourself!
Ready to write less code and get more insights? Dive into skrub now and be part of an emerging community!