Sklearn review and Ensemble models¶
Intro¶
The scikit-learn module is a full-featured Python library for data analysis and predictive modeling. In the pcda class, one session at the end of the semester introduced this library and some basic statistical/ML modeling. We’ll start by reviewing the basics of using sklearn for statistical and machine learning model building, and then learn about ensemble models.
Readings and review activities¶
As a review, first take a look through the following sections (and notebooks) in PDSH. We covered all of this back in the pcda class in our final session. Going through the notebooks will get you back up to speed with sklearn and ML basics.
PDSH - Ch 5: Scikit-Learn
05.00-Machine-Learning.ipynb
05.01-What-Is-Machine-Learning.ipynb
05.02-Introducing-Scikit-Learn.ipynb
05.03-Hyperparameters-and-Model-Validation.ipynb
Downloads and other resources¶
This downloads file will be used throughout all of the Module 2 activities.
Activities¶
We’ll start with a review of sklearn, focusing on the standard estimator API, which makes it easy to try out different types of predictive models quickly. In addition, we’ll explore a class of models known as ensemble models.
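As a rough illustration of the estimator API, here is a minimal sketch. It assumes a synthetic dataset from `make_classification` as a stand-in for the course data, and the two model types and their parameter values are illustrative choices, not taken from the course notebook. The point is that every estimator exposes the same `fit`, `predict`, and `score` methods, so swapping models only changes one line.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the course notebook uses the leaf data).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Two different model types, both driven through the same interface.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # learn from the training split
    predictions = model.predict(X_test)          # predicted labels for unseen rows
    scores[name] = model.score(X_test, y_test)   # mean accuracy on the test split
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

Because the loop never refers to a specific model class, adding another classifier only requires a new entry in the `models` dictionary.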
Ensemble models are just what they sound like: a collection of models that, hopefully, performs better as an aggregated whole than any of its individual models does. Modern weather forecasting relies on ensemble models, and you’ll see that most Kaggle winners use ensembles of models. Individual models can be combined by averaging their predictions (for regression) or by voting (for classification). Here’s an interesting blog post on using human regression ensembles vs various ML techniques.
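To make voting concrete, here is a minimal sketch that combines three classifiers with sklearn's `VotingClassifier` using hard (majority) voting. The synthetic three-class dataset, the choice of base models, and the names `logreg`, `knn`, and `forest` are all assumptions made for illustration and are not the models used in the course notebook. The sketch also recomputes the majority vote by hand to show what the ensemble does under the hood.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class stand-in data (an assumption for illustration).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hard voting: each base model casts one vote per sample; the majority wins.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)
ensemble_predictions = ensemble.predict(X_test)
ensemble_accuracy = ensemble.score(X_test, y_test)

# Accuracy of each fitted base model on its own, for comparison.
individual_accuracy = {
    name: est.score(X_test, y_test)
    for name, est in ensemble.named_estimators_.items()
}

# The same majority vote done by hand: one row of predictions per base model,
# then the most frequent label in each column (ties go to the lowest label).
stacked = np.array([est.predict(X_test) for est in ensemble.estimators_])
manual_vote = np.apply_along_axis(
    lambda votes: np.bincount(votes).argmax(), 0, stacked
)

print("individual:", {k: round(v, 3) for k, v in individual_accuracy.items()})
print("ensemble:  ", round(ensemble_accuracy, 3))
```

For regression, the analogous combination is to average the base models' numeric predictions instead of counting votes. An ensemble is not guaranteed to beat its best member, so it is worth comparing the individual and combined scores as above.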
We’ll use one of the Kaggle practice competitions, in which the goal is to classify leaves based on simple images of those leaves.
You can find the notebook sklearn_gettingstarted_leaf_classification_aap.ipynb in the sklearn_ensemble_leaf folder within the Downloads file. By working through it, we will:
review sklearn, numpy and a little pandas
build, train, test models in sklearn
combine different types of models into ensemble models
Here are screencasts to help guide you through the notebook:
SCREENCAST: Decision Trees (6:22)
SCREENCAST: More classification techniques and ensemble models (9:02)
When you are done with this, move on to the next submodule, Using cookiecutter templates for project structure.