Classification models for Pump it Up project
In this submodule, we'll build a number of classification models for the ``pumpitup`` project. In particular, we will explore:
* using sklearn transformers and pipelines to streamline workflow,
* logistic regression with regularization,
* random forests, boosted trees, and other ensembles.
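As a rough sketch of how the first two ideas fit together, the example below chains a column transformer and a regularized logistic regression into a single sklearn pipeline. The toy data and its column names (``amount``, ``height``, ``source``) are made up for illustration and are not the real ``pumpitup`` schema; the choice of ``C=1.0`` is likewise just a placeholder.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; these columns are illustrative only.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "amount": rng.normal(size=n),
    "height": rng.normal(size=n),
    "source": rng.choice(["spring", "river", "well"], size=n),
})
signal = X["amount"] + (X["source"] == "well") + rng.normal(scale=0.5, size=n)
y = (signal > 0.5).astype(int)

# Scale the numeric columns and one-hot encode the categorical column.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["amount", "height"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["source"]),
])

# LogisticRegression applies L2 regularization by default;
# C is the inverse of the regularization strength (smaller C = stronger penalty).
clf = Pipeline([
    ("preprocess", preprocessor),
    ("model", LogisticRegression(C=1.0, max_iter=1000)),
])

clf.fit(X, y)
train_accuracy = clf.score(X, y)
print(f"Training accuracy: {train_accuracy:.3f}")
```

Because preprocessing and model live in one object, a single ``fit`` or ``predict`` call runs every step in order, which is what makes the workflow easier to repeat and automate.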
A good, short article on avoiding data leakage when building ML models
is `this one by Kevin Markham at Data School `_.
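One common way to avoid this kind of leakage in sklearn is to keep preprocessing inside the pipeline that gets cross-validated, so that each fold fits its scaler on the training rows only. The sketch below uses synthetic data and is an illustration of that general pattern, not a summary of the linked article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=3.0, size=(300, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(size=300) > 0).astype(int)

# Leaky pattern (avoid): calling StandardScaler().fit_transform(X) on all
# rows before splitting lets held-out rows influence the scaling statistics.

# Safer pattern: the scaler is a pipeline step, so cross_val_score refits it
# on the training portion of every fold and only transforms the held-out part.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
mean_cv_accuracy = scores.mean()
print(f"Mean 5-fold accuracy: {mean_cv_accuracy:.3f}")
```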
You'll be working in your newly created ``pumpitup`` project folder.
Start by opening the ``model_exploration.ipynb`` notebook in Jupyter Lab.
Here are screencasts to help guide you through the notebook:
* `SCREENCAST: Intro and data preprocessing `_ (5:46)
* `SCREENCAST: Logistic regression review and overview of regularization `_ (10:49)
* `SCREENCAST: Preprocessing with column transformers `_ (8:10)
* `SCREENCAST: Logistic regression model and solvers `_ (3:11)
* `SCREENCAST: Creating a preprocessing and model estimation pipeline `_ (3:17)
* `SCREENCAST: Data partitioning and model fitting `_ (17:39)
* `SCREENCAST: Cross validation and Predictions `_ (7:41)
* `SCREENCAST: Automation and Model persistence `_ (14:03)
* `SCREENCAST: Random forests `_ (6:10)
If you want to learn about one more popular machine learning technique, *gradient boosting machines*, take a stroll through the short intro in the ``gradient_boosting.ipynb`` notebook. It covers one of the newer classification techniques available in sklearn.
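For a quick taste before opening the notebook, here is a minimal sketch that fits a random forest and a gradient boosting classifier on the same synthetic data and compares test accuracy. It assumes sklearn's ``GradientBoostingClassifier``; the notebook itself may use a different estimator or different settings, and the hyperparameter values below are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem, for illustration only.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    # Random forest: many deep trees grown independently on bootstrap samples.
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Gradient boosting: shallow trees added one at a time, each fitted to
    # reduce the errors of the ensemble built so far.
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
    ),
}

test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)
    print(f"{name}: {test_accuracy[name]:.3f}")
```

Which model scores higher depends on the data and the tuning, so treat a single comparison like this as a starting point rather than a verdict.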
And we're done with Module 2
Next we'll be using Python to do a bunch of analytics work that we'd usually do in Excel.