You're reading from Building Data Science Solutions with Anaconda

Product type: Book
Published in: May 2022
Publisher: Packt
ISBN-13: 9781800568785
Edition: 1st Edition
Author: Dan Meador

Dan Meador is an Engineering Manager at Anaconda and is the creator of Conda as well as a champion of open source at Anaconda. With a history of engineering and client-facing roles, he has the ability to jump into any position. He has a track record of delivering as a leader and a follower in companies from the Fortune 10 to startups.

Chapter 11: Tuning Hyperparameters and Versioning Your Model

The journey of a data scientist is always an iterative one. Understanding how to create a process that is scalable and repeatable ensures that you can smoothly move through all the phases of data cleaning and model discovery.

In this chapter, we will cover how to create a pipeline that combines many of the small steps we have learned throughout the book into a simpler flow. We will then see how you can use a grid search to uncover the hyperparameters that give the best-scoring model among the candidates you try. Finally, we will show how to save and version models so that you can easily return to a previous model at any point in time. Together, these skills make it easier to reach your end goal of creating a maintainable process.

Specifically, we will cover the following in this chapter:

  • Creating a scikit-learn pipeline
  • Finding optimal hyperparameters with GridSearchCV...

Technical requirements 

To get the most out of this chapter, you will need the Anaconda distribution installed. This will include Python, conda, and Navigator. 

It is also necessary to have a conda environment set up with the following packages installed:

  • scikit-learn version 0.23.x
  • pandas
  • NumPy 
  • joblib

It is preferable to install all these at the beginning, but you can also do so at the necessary parts of the chapter.  
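
One way to confirm that the environment is ready is to ask Python which versions of the listed packages it can see. The following is a minimal sketch, not part of the book's own code; the helper name installed_versions is hypothetical, and it relies only on the standard library's importlib.metadata.

```python
from importlib import metadata

# Package names as listed in the technical requirements above
REQUIRED = ["scikit-learn", "pandas", "numpy", "joblib"]


def installed_versions(packages=REQUIRED):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print one line per package so that missing installs stand out
for name, version in installed_versions().items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can be added to the active conda environment before continuing.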

With these parts in place, it's now time to create a pipeline.

Creating a scikit-learn pipeline

By now, you may have noticed that many of the same steps recur in every problem we have looked at. We are now going to make it easier to iterate on the data and model creation steps by using the scikit-learn pipeline to put together a simple, repeatable process. In this section, we will take a previous workflow that would ordinarily need to be repeated many times and turn it into a single unit, which gives you much greater flexibility and saves time compared to the previous process. If you are starting with this chapter or jumping to it before going through the others, be aware that the underlying concepts covered previously are still essential to understand.

To visualize what the process is going to look like, you can refer to the following diagram. On the left, you will see normal data input being passed into the pipeline object. In that pipeline...
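
To make the idea of a single unit concrete, here is a minimal sketch of a scikit-learn Pipeline that chains an imputer, a scaler, and a model. The synthetic dataset, the step names, and the choice of LogisticRegression are assumptions made for illustration; they are not necessarily what the chapter itself uses.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; the chapter's dataset is not shown here)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Blank out roughly 5% of the values so the imputer has something to do
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each step is a (name, estimator) pair; the last step is the model itself
pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# One call fits the imputer, the scaler, and the model in order
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.3f}")
```

Because the preprocessing steps are fitted only on the training data inside fit, the same pipeline object can be reused on new data without repeating each step by hand.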

Finding optimal hyperparameters with GridSearchCV

As we have created new models and tried various data processing techniques, we have used many different parameters and function arguments to determine how we set up the problem. One example is the impute method. Mean, median, or some other advanced approach – how do we know which we should take? One naïve approach might be to simply create a for loop and try every technique. We can calculate the score for each and use the best one. We tried a similar approach before when looking at which algorithm would give us the best score in the previous section.
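
The for loop approach described above might look something like the following sketch. The synthetic data, the list of imputation strategies, and the random forest model are assumptions chosen for illustration rather than the chapter's own setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data with about 10% of the values missing (an assumption)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# Try each imputation strategy in turn and record its mean cross-validated score
scores = {}
for strategy in ["mean", "median", "most_frequent"]:
    candidate = Pipeline([
        ("imputer", SimpleImputer(strategy=strategy)),
        ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
    ])
    scores[strategy] = cross_val_score(candidate, X, y, cv=5).mean()

# Keep whichever strategy scored highest
best_strategy = max(scores, key=scores.get)
print(scores)
print("Best strategy:", best_strategy)
```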

This might be naïve, but never overlook the simple. It is such a useful approach that scikit-learn packages it into a single, easy-to-use class, GridSearchCV. It also performs k-fold cross-validation, so each candidate is scored across several splits of the data rather than a single one. There are a few different ways to tune hyperparameters, but we're going to focus on a grid search.

A grid...
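
As a rough illustration, a grid search over a pipeline could be set up as follows. This is a sketch under assumptions: the synthetic data, the step names, and the particular parameter values in param_grid are invented for the example. The step__parameter naming, with a double underscore, is how scikit-learn addresses a parameter of a named pipeline step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data with about 10% of the values missing (an assumption)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

pipeline = Pipeline([
    ("imputer", SimpleImputer()),
    ("model", RandomForestClassifier(n_estimators=25, random_state=0)),
])

# Every combination is tried: 2 strategies x 3 depths = 6 candidates
param_grid = {
    "imputer__strategy": ["mean", "median"],
    "model__max_depth": [3, 5, None],
}

# cv=5 scores each candidate with 5-fold cross-validation
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best mean CV score:", round(search.best_score_, 3))
```

The number of fits grows with the product of the list lengths in the grid, so adding values to the grid quickly increases the search time.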

Versioning and storing your model

As we have been working through this book, there has been one glaring issue that you might have noticed – when you closed your integrated development environment, terminal, or Jupyter notebook, your model and data were gone. We won't go into the more involved topics of working with and saving information to databases or other persistence layers, but there are some quite simple things you can do to create save points along the way.
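
One simple kind of save point is to write a fitted model to disk with Python's built-in pickle module and read it back later. The sketch below assumes a small synthetic dataset and a hypothetical filename, model_savepoint.pkl, placed in the system's temporary directory.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Fit a small model on synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Hypothetical location for the save point
save_path = Path(tempfile.gettempdir()) / "model_savepoint.pkl"

# Write the fitted model to disk
with open(save_path, "wb") as f:
    pickle.dump(model, f)

# Later, possibly in a new session, read it back
with open(save_path, "rb") as f:
    restored = pickle.load(f)

# The restored model makes the same predictions as the original
same = bool((restored.predict(X) == model.predict(X)).all())
print("Predictions match:", same)
```

Only unpickle files that come from a source you trust, because loading a pickle can execute arbitrary code.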

Understanding the value of versioning your model

As you've worked through everything from data engineering to building models in this book, you have probably realized that a lot of iteration happens. It's called data science, but there is also an art to guessing a path and trying to know where to go next. You've tried to make educated guesses with hyperparameters and model families, and kept the original dataset open to come back to as needed. This was all needed in case you were wrong....
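
A lightweight way to keep earlier attempts around is to save each model under an increasing version number with joblib. The following is a minimal sketch: the save_versioned helper and the model_v1.joblib naming scheme are hypothetical conventions made up for this example, not a joblib feature, and the sketch assumes the directory only contains files written by this helper.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def save_versioned(model, directory, name="model"):
    """Save model as <name>_v<N>.joblib using the next free N; return the path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    # Collect the version numbers already present for this model name
    existing = [
        int(p.stem.rsplit("_v", 1)[1])
        for p in directory.glob(f"{name}_v*.joblib")
    ]
    version = max(existing, default=0) + 1
    path = directory / f"{name}_v{version}.joblib"
    joblib.dump(model, path)
    return path


# Synthetic stand-in data and a fresh temporary directory (assumptions)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model_dir = Path(tempfile.mkdtemp())

# Two iterations of the model, each saved as its own version
first = save_versioned(RandomForestClassifier(max_depth=3, random_state=0).fit(X, y), model_dir)
second = save_versioned(RandomForestClassifier(max_depth=5, random_state=0).fit(X, y), model_dir)

# If the second attempt turns out worse, the first is still one load away
previous = joblib.load(first)
print(first.name, second.name, previous.get_params()["max_depth"])
```

Keeping one file per version means that a wrong guess costs nothing more than loading the previous file.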

Summary

In this last chapter, we covered the final batch of skills you will need to get up to speed as a data scientist using Anaconda as a base.

We started by seeing how scikit-learn pipelines let you take discrete parts of the data science workflow and create a cohesive unit in a much more elegant way by putting estimators together, like pieces of a puzzle. We also saw how these can include things such as your scalers and imputers, finally ending in an algorithm type.

We then learned that many of the arguments we have been using throughout this book, such as the depth of a random forest, are called hyperparameters and that they are a vital component to get right. Looking at GridSearchCV from scikit-learn, we put together a grid search over possible combinations, being careful to balance the speed of discovery with finding the best attributes.

Finally, we looked at the value of versioning our model with pickling and joblib. We packaged up our optimized model into...

Close

Whether you are a seasoned veteran looking to brush up on your skills or brand new to the field, by now, you understand that the landscape is vast, but the journey starts small. Throughout this book, we have covered a lot of ground, including types of algorithms, how to avoid bias, and even evaluating open source tools. This is but a taste of all there is to discover, and I think you'll agree that there isn't a better skill to know or place to explore in the modern age than the wonderful world of data science.
