You're reading from  Mastering Predictive Analytics with scikit-learn and TensorFlow

Product type: Book
Published in: Sep 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789617740
Edition: 1st Edition
Author: Alvaro Fuentes

Alvaro Fuentes is a senior data scientist with a background in applied mathematics and economics. He has more than 14 years of experience in various analytical roles and is an analytics consultant at one of the 'Big Three' global management consulting firms, leading advanced analytics projects in different industries like banking, technology, and consumer goods. Alvaro is also an author and trainer in analytics and data science and has published courses and books, such as 'Become a Python Data Analyst' and 'Hands-On Predictive Analytics with Python'. He has also taught data science and related topics to thousands of students both on-site and online through different platforms such as Springboard, Simplilearn, Udemy, and BSG Institute, among others.

Working with Features

In this chapter, we are going to take a close look at features and at the role that feature engineering plays in predictive analytics. We'll learn some techniques that allow us to improve our predictive analytics models in two ways: by improving the performance metrics of our models, and by helping us understand the relationship between the features and the target variables that we are trying to predict.

In this chapter, we are going to cover the following topics:

  • Feature selection methods
  • Dimensionality reduction and PCA
  • Creating new features
  • Improving models with feature engineering

Feature selection methods

Feature selection methods are used for selecting features that are likely to help with predictions. This section covers the following three methods for feature selection:

  • Removing dummy features with low variance
  • Identifying important features statistically
  • Recursive feature elimination

When building predictive analytics models, some features won't be related to the target, so they will be of little help in prediction. The problem is that including irrelevant features can introduce noise into the model and make it more prone to overfitting. Feature selection techniques are therefore used to select the most relevant and useful features, which helps either with prediction or with understanding our model.
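The three methods listed above can be sketched with scikit-learn's `feature_selection` module. This is a minimal illustration, not the chapter's own code: the synthetic dataset, the 95% cut-off for the dummy feature, the choice of `k=3`, and the use of logistic regression inside the recursive elimination are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data (illustrative only): 8 numeric features, 3 of them informative.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Append a dummy (0/1) feature that equals 1 in only 1% of the rows.
rare_dummy = np.zeros(500)
rare_dummy[:5] = 1.0
X = np.column_stack([X, rare_dummy])

# 1. Remove dummy features with low variance.
#    A 0/1 feature with P(1) = p has variance p * (1 - p), so this threshold
#    drops dummies that take the same value in more than 95% of the rows.
var_selector = VarianceThreshold(threshold=0.95 * (1 - 0.95))
X_var = var_selector.fit_transform(X)

# 2. Identify important features statistically:
#    keep the k features with the highest ANOVA F-statistic against the target.
kbest = SelectKBest(score_func=f_classif, k=3)
X_best = kbest.fit_transform(X_var, y)

# 3. Recursive feature elimination:
#    repeatedly fit the estimator and discard the weakest feature.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X_var, y)

print("Kept by variance filter:", var_selector.get_support())
print("Kept by SelectKBest:   ", kbest.get_support())
print("Kept by RFE:           ", rfe.support_)
```

The statistical filter and the recursive elimination do not have to agree on which features to keep; comparing the two masks is one simple way to see which features are consistently judged useful.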

Removing dummy features with low variance

...

Dimensionality reduction and PCA

Dimensionality reduction is the process of reducing the number of features under consideration by obtaining a set of principal variables. Principal Component Analysis (PCA) is one of the most widely used techniques for dimensionality reduction. Here, we will talk about why we need dimensionality reduction, and we will also see how to perform PCA in scikit-learn.

These are the reasons for reducing a high number of features when working on predictive analytics:

  • It simplifies models, making them easier to understand and interpret. There are also computational considerations: if you are dealing with thousands of features, reducing their number can save computational resources.
  • Another reason is to avoid the "curse of dimensionality...
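As a rough sketch of how PCA can be performed in scikit-learn, the example below reduces ten correlated features to two principal components. The synthetic data (ten observed features driven by two hidden factors) and the decision to standardize before applying PCA are assumptions made for illustration, not details taken from the chapter.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic data (illustrative only): 10 observed features that are noisy
# linear combinations of just 2 hidden factors.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 10))

# PCA is sensitive to the scale of the features, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Project the 10 features onto the 2 directions of greatest variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# Fraction of the total variance retained by the 2 components.
explained = pca.explained_variance_ratio_.sum()
print("Reduced shape:", X_reduced.shape)
print("Explained variance ratio per component:", pca.explained_variance_ratio_)
```

Here the two components retain almost all of the variance because the data was built from two factors; on real data, inspecting `explained_variance_ratio_` is a common way to decide how many components to keep.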

Feature engineering

Feature engineering plays a vital role in making machine learning algorithms work and, if carried out properly, it enhances their predictive ability. Feature engineering is the process of transforming existing features or creating new features from the raw data, using domain knowledge, the context of the problem, or specialized techniques, so that the resulting predictive models are more accurate. Domain knowledge and creativity play a very important role in this activity: the more context you have about a problem, the better your ability to create new and useful features. In short, the feature engineering process converts the raw data into input values that algorithms can make good use of.
There are various ways of implementing...
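To make the idea of creating new features from existing ones concrete, here is a small pandas sketch. The rows and the column names (`limit_bal`, `bill_amt`, `pay_amt`, `age`) are hypothetical stand-ins for the kind of columns found in a credit card dataset; they are not taken from the chapter's notebook, and the specific features built here are only examples of what domain knowledge might suggest.

```python
import pandas as pd

# Hypothetical rows and column names, for illustration only.
df = pd.DataFrame({
    "limit_bal": [20000, 120000, 90000, 50000],  # credit limit
    "bill_amt": [3913, 2682, 29239, 46990],      # latest bill amount
    "pay_amt": [0, 1000, 1518, 2000],            # latest payment amount
    "age": [24, 26, 34, 37],
})

# Ratio feature: how much of the credit limit is being used.
df["utilization"] = df["bill_amt"] / df["limit_bal"]

# Ratio feature: what share of the bill was actually paid.
df["pay_ratio"] = df["pay_amt"] / df["bill_amt"]

# Binned feature: turn a numeric column into categories.
# With the default right=True, the bins are (20, 30] and (30, 40].
df["age_group"] = pd.cut(df["age"], bins=[20, 30, 40], labels=["21-30", "31-40"])

print(df)
```

Ratios like these encode context (a bill of 30,000 means something different on a 90,000 limit than on a 30,000 limit) that a model would otherwise have to learn from the raw columns on its own.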

Improving models with feature engineering

Now that we have seen how feature engineering techniques help in building predictive models, let's try to improve the performance of these models and evaluate whether the newly built model works better than the previously built model. Then, we will talk about two very important concepts that you must always keep in mind when doing predictive analytics: the reducible and irreducible errors in your predictive models.

Let's first import the necessary modules, as shown in the following screenshot:

In the Jupyter Notebook, let's take a look at the imported credit card default dataset that we saw earlier in this chapter. As you can see, some modifications have been made to this dataset:

For this model, instead of transforming the sex and marriage features into two dummy features, the ones that we have...
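The kind of comparison this section describes, checking whether a model with an engineered feature works better than the previous one, can be sketched as follows. This example does not use the credit card dataset: it relies on synthetic data in which the target depends on the product of two features, and the logistic regression model and 5-fold cross-validation are assumptions chosen to keep the sketch short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)

# Synthetic data (illustrative only): the class depends on the sign of
# x0 * x1, which a linear model cannot capture from x0 and x1 alone.
X = rng.normal(size=(1000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Engineered feature: the interaction (product) of the two original features.
X_eng = np.column_stack([X, X[:, 0] * X[:, 1]])

model = LogisticRegression(max_iter=1000)

# Evaluate both feature sets with the same model and the same folds.
baseline = cross_val_score(model, X, y, cv=5).mean()
engineered = cross_val_score(model, X_eng, y, cv=5).mean()

print(f"Accuracy with original features:   {baseline:.3f}")
print(f"Accuracy with engineered feature:  {engineered:.3f}")
```

Keeping the model and the evaluation procedure fixed while changing only the features is what makes the two scores comparable.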

Reducible and irreducible error

Before moving on, there are two really important concepts to be covered for predictive analytics. Errors can be divided into the following two types:

  • Reducible errors: These errors can be reduced by making certain improvements to the model
  • Irreducible errors: These errors cannot be reduced, no matter how good the model is

Let's assume that, in machine learning, there is a relationship between features and target that is represented with a function, as shown in the following screenshot:

The underlying supposition of machine learning is that the target (y) is related to the features through some function. Since, in most cases, we consider that there is some randomness in the relationship between features and target, we add a noise term, which will always be present in reality. This is the underlying supposition...
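In symbols, the supposition described here is commonly written as shown below. The notation is an assumption, since the original screenshot is not reproduced: f is the unknown true function, \hat{f} is the model's estimate of it, and \varepsilon is a noise term with mean zero that is independent of the features X. The second line is the standard textbook decomposition of the expected squared error for a fixed \hat{f} and a fixed X, not necessarily the exact formula the author uses.

```latex
y = f(X) + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0

\mathbb{E}\big[(y - \hat{f}(X))^2\big]
  = \underbrace{\big(f(X) - \hat{f}(X)\big)^2}_{\text{reducible error}}
  + \underbrace{\operatorname{Var}(\varepsilon)}_{\text{irreducible error}}
```

The first term shrinks as \hat{f} gets closer to f, for example through better features or a better model. The second term comes from the noise itself, so it remains even if \hat{f} equals f exactly.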

Summary

In this chapter, we talked about feature selection methods and how to distinguish between useful features and features that are not likely to be helpful in prediction. We talked about dimensionality reduction and we learned how to perform PCA in scikit-learn. We also talked about feature engineering, and we tried to come up with new features in the datasets that we have been using so far. Finally, we tried to improve our credit card model by coming up with new features, and by working with all of the techniques that we learned in this chapter. I hope you have enjoyed this chapter.

In the next chapter, we will learn about artificial neural networks and how the TensorFlow library is used when working with neural networks and artificial intelligence.
