

Wrangling data with good defaults

This section covers how to build a predictive pipeline starting from a dataframe. The skrub objects described in this section can be used as strong defaults for building baseline pipelines, and can be customized for specific use cases.

  • Cleaner: sanitizing a dataframe
    • Parsing numeric-looking strings with the Cleaner
    • Downcasting float dtypes to float32 with the Cleaner
  • Transforming a table into an array of numeric features: TableVectorizer
    • Numeric strings and categorical encoding
  • Building robust ML baselines with tabular_pipeline()
    • The logic used by the tabular pipeline is quite simple
    • Extending the pipeline with the .steps attribute
    • Using a pipeline as the estimator
  • Transforming selected columns with ApplyToCols
    • Dealing with columns that cannot be handled by a transformer
    • Advanced usage of ApplyToCols

Financial support from inria and :probabl.