*************************************************
Classification models for Pump it Up project
*************************************************

In this submodule, we'll build a number of classification models for the
``pumpitup`` project. In particular, we will explore:

* using sklearn transformers and pipelines to streamline the workflow,
* logistic regression with regularization,
* random forests, boosted trees, and other ensembles.

A good, short article on avoiding data leakage when building ML models is
`this one by Kevin Markham at Data School `_.

You'll be working in your newly created ``pumpitup`` project folder. Start by
opening the ``model_exploration.ipynb`` notebook in Jupyter Lab. Here are
screencasts to help guide you through the notebook:

* `SCREENCAST: Intro and data preprocessing `_ (5:46)
* `SCREENCAST: Logistic regression review and overview of regularization `_ (10:49)
* `SCREENCAST: Preprocessing with column transformers `_ (8:10)
* `SCREENCAST: Logistic regression model and solvers `_ (3:11)
* `SCREENCAST: Creating a preprocessing and model estimation pipeline `_ (3:17)
* `SCREENCAST: Data partitioning and model fitting `_ (17:39)
* `SCREENCAST: Cross-validation and predictions `_ (7:41)
* `SCREENCAST: Automation and model persistence `_ (14:03)
* `SCREENCAST: Random forests `_ (6:10)

OPTIONAL ADVANCED MATERIAL
--------------------------

If you want to learn a bit about one more popular machine learning technique,
*gradient boosting machines*, check out the short intro in the
``gradient_boosting.ipynb`` notebook - just take a stroll through it to learn
about one of the newer classification techniques available in sklearn.

And we're done with Module 2
----------------------------

Next we'll be using Python to do a bunch of analytics work that we'd usually
do in Excel.
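As a companion to the screencasts above, here is a minimal sketch of the core workflow they cover: a ``ColumnTransformer`` for preprocessing, chained with a regularized ``LogisticRegression`` in a ``Pipeline``, evaluated with cross-validation on the training partition so that no information leaks from held-out rows. The column names and the synthetic data below are illustrative stand-ins, not the actual pump dataset::

.. code-block:: python

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Synthetic stand-in for the pump dataset (column names are hypothetical).
    rng = np.random.default_rng(0)
    n = 500
    df = pd.DataFrame({
        "amount_tsh": rng.exponential(100.0, n),    # numeric
        "gps_height": rng.normal(1000.0, 300.0, n), # numeric
        "basin": rng.choice(["A", "B", "C"], n),    # categorical
    })
    y = rng.choice(["functional", "non functional"], n)

    num_cols = ["amount_tsh", "gps_height"]
    cat_cols = ["basin"]

    # Column transformer: scale numeric columns, one-hot encode categoricals.
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])

    # Pipeline: preprocessing + regularized logistic regression. C is the
    # inverse regularization strength (smaller C = stronger regularization).
    pipe = Pipeline([
        ("prep", preprocess),
        ("clf", LogisticRegression(C=1.0, solver="lbfgs", max_iter=1000)),
    ])

    # Partition the data, then cross-validate on the training set only.
    # The transformers are re-fit inside each fold, avoiding data leakage.
    X_train, X_test, y_train, y_test = train_test_split(
        df, y, test_size=0.2, random_state=0)
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="accuracy")
    print(scores.mean())

    # Fit on the full training set and score the held-out test set.
    pipe.fit(X_train, y_train)
    print(pipe.score(X_test, y_test))

Because the labels here are random noise, the accuracy hovers near chance; the point is the structure of the pipeline, which you can refit on real data unchanged.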
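The random forests and model persistence screencasts can be sketched along the same lines: fit a ``RandomForestClassifier``, save the fitted estimator with ``joblib``, and reload it for later predictions. The data and filename below are illustrative assumptions::

.. code-block:: python

    import numpy as np
    import joblib
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic data with real signal so the forest has something to learn
    # (purely illustrative; not the pump dataset).
    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 4))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # An ensemble of decision trees; n_estimators and max_depth are the
    # usual first tuning knobs.
    rf = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
    rf.fit(X, y)
    print(rf.score(X, y))

    # Persist the fitted model, then reload and reuse it. In practice you
    # would persist the whole preprocessing + model pipeline the same way.
    joblib.dump(rf, "rf_model.joblib")
    rf_loaded = joblib.load("rf_model.joblib")
    assert (rf_loaded.predict(X) == rf.predict(X)).all()

Persisting the entire pipeline (not just the final estimator) keeps the preprocessing and the model in sync, which is the leak-free pattern the screencasts build toward.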