You're reading from Building Data Science Solutions with Anaconda

Product type: Book
Published in: May 2022
Publisher: Packt
ISBN-13: 9781800568785
Edition: 1st Edition
Author: Dan Meador

Dan Meador is an Engineering Manager at Anaconda and is the creator of Conda as well as a champion of open source at Anaconda. With a history of engineering and client-facing roles, he has the ability to jump into any position. He has a track record of delivering as a leader and a follower in companies from the Fortune 10 to startups.

Chapter 7: Choosing the Best AI Algorithm

If the field of artificial intelligence and machine learning (commonly referred to as AI/ML) is a car, then the model is the engine. While other parts are critical for its operation, no other aspect gets as much focus and attention, and for good reason. In the end, the model is the core object that determines whether your outcome is accurate, and it is the most important artifact of the entire data science workflow.

Which modeling approach is best? That's easy: it depends. Just as not all cars have the same engine, many different factors go into choosing the best approach.

Ask yourself, "What problem am I trying to solve?" In this chapter, we are going to start with that question, and from there lead you to the modeling approach that would best suit your situation. We'll take a look at the problem type with an example for each algorithm, and look at some of the most widely...

Technical requirements

All the required libraries can be installed easily with conda, which comes with the Anaconda distribution. The content in this chapter requires the following tools:

  • The Anaconda distribution (this includes conda and Navigator)
  • Python 3.8+ (this is included with the Anaconda distribution)
  • pandas 1.3+
  • Matplotlib 3.4+
  • Jupyter notebooks 6.4+
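
Before going further, it can help to confirm that the listed packages are actually present in your active environment. The following is a minimal sketch using only the Python standard library. The `installed_versions` helper and the `REQUIRED` mapping are names introduced here for illustration, and the sketch assumes the packages are published under the distribution names `pandas`, `matplotlib`, and `notebook`:

```python
from importlib import metadata

# Minimum versions taken from the list above. The keys are assumed
# distribution names; "notebook" is the package behind Jupyter notebooks.
REQUIRED = {"pandas": "1.3", "matplotlib": "3.4", "notebook": "6.4"}


def installed_versions(packages=REQUIRED):
    """Return {package: installed version string, or None if missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: installed={version or 'not found'}, "
              f"needs {REQUIRED[name]}+")
```

The script only reports what it finds; comparing the reported versions against the minimums is left to you.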

Now that the setup is ready, let's dive into the chapter!

Defining your problem

AI books and blogs often sort AI problems into the following categories:

  • Supervised
  • Unsupervised
  • Semi-supervised
  • Reinforcement

We did the same thing back in Chapter 1, Understanding the AI/ML Landscape, and you can find a flowchart of how to decide what category your situation falls into in Figure 7.1:

Figure 7.1 – Dataset heuristics for choosing your AI family

This is a useful starting point, but when you begin with a problem, you are usually thinking less about its category and more about the solution you are trying to reach.

We'll look at a few different and very common problem types in the following sections. They do not encompass every problem family that you will come across, but they cover many of them.

Model problem types

The following are the four core problem types that we'll focus...

Understanding regression problems with examples

Figuring out the price of a stock, what your house should be worth, and the future temperature of the Earth all have one thing in common: they can all be thought of as regression problems. A regression problem is simply the task of predicting a number, given a set of independent variables.

A few more examples that fall into this problem type are as follows:

  • Price of a car
  • Sales forecast for next year
  • Number of people who will sign up for a promotion
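
To make "predicting a number from independent variables" concrete, here is a minimal, dependency-free sketch of a least-squares line fit with a single input variable. It is not the chapter's own code: the `fit_line` and `predict` helpers are illustrative names, and the car ages and prices are made-up values chosen to lie exactly on a line.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Made-up data: car age in years -> price in thousands of dollars.
ages = [1, 2, 3, 4, 5]
prices = [28.0, 26.0, 24.0, 22.0, 20.0]

model = fit_line(ages, prices)
print(predict(model, 6))  # 18.0
```

Real data will not sit on a perfect line, so the fitted line then represents the best compromise across all points rather than an exact match.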

When you see a problem like this, you can try a few different models. There are many specific algorithms that you can use, each with its own pros and cons. Let's look at a few of these algorithms in the next section.

The following are a few of the most common regression algorithms you'll want to try. For each of these algorithms, we're going to take an example and create a regression model:

  • Linear regression
  • Random forest
  • Support...

Classification

Being able to put things into certain classes might be the most common type of ML application that you see in the world, and has been a staple of the industry for a long time.

There are two main types of classification: binary classification and multi-class classification. As the names indicate, binary classification is when the outcome only has two possible options. It's very common to have a true or false outcome in this setup.

Multi-class classification is when there are more than two possible classes. This covers a variety of scenarios, such as predicting a movie's genre. The approaches used are very similar to those for a binary classification problem.
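
One way to see how similar the two cases are is to run the same classifier on both. The sketch below uses a nearest-centroid rule, which is a stand-in chosen here for brevity rather than an algorithm taken from the chapter. The `fit_centroids` and `classify` helpers, the feature choices, and all data values are assumptions made for illustration.

```python
from collections import defaultdict
from math import dist


def fit_centroids(points, labels):
    """Average the feature vectors of each class into one centroid."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return {
        label: tuple(sum(axis) / len(members) for axis in zip(*members))
        for label, members in groups.items()
    }


def classify(centroids, point):
    """Predict the class whose centroid is closest to the point."""
    return min(centroids, key=lambda label: dist(centroids[label], point))


# Binary: made-up (link count, exclamation marks) features per email.
emails = [(0, 1), (1, 0), (8, 9), (9, 7)]
email_labels = ["ham", "ham", "spam", "spam"]
spam_model = fit_centroids(emails, email_labels)
print(classify(spam_model, (7, 8)))  # spam

# Multi-class: made-up (petal length, petal width) features per flower.
flowers = [(1.4, 0.2), (1.5, 0.3), (4.5, 1.5), (4.7, 1.4), (6.0, 2.5), (5.8, 2.2)]
flower_labels = ["setosa", "setosa", "versicolor", "versicolor",
                 "virginica", "virginica"]
flower_model = fit_centroids(flowers, flower_labels)
print(classify(flower_model, (4.4, 1.3)))  # versicolor
```

Nothing in the code changes between the two cases; only the number of distinct labels does.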

Let's check out some examples that might help you get a better grasp on problems that fall into the classification bucket:

  • Whether emails are spam or not (binary)
  • Whether you would survive the Titanic sinking (binary)
  • Identifying the type of flower (multi-class)
  • Labeling handwritten...

Anomaly detection

If you've ever gotten a text saying that your bank has noticed some suspicious activity, chances are they have put anomaly detection to use. Anomaly detection is the attempt to determine whether an event, item, or object doesn't fit in with the others. "One of these things is not like the others" is a good way to think about it. Another name you might see for this is outlier detection.
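
As a rough illustration of spotting the item that doesn't fit, the sketch below flags values that sit far from the median of a list. It uses a modified z-score based on the median absolute deviation, with the conventional 0.6745 scaling constant and a 3.5 cutoff. This is one simple unsupervised technique picked for the example, not necessarily one the chapter covers, and the `find_outliers` helper and the transaction amounts are made up.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    center = median(values)
    # Median absolute deviation: a spread measure that a single
    # extreme value cannot inflate the way it inflates the std. dev.
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread at all, so nothing can be ranked as unusual
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]


# Made-up card transaction amounts in dollars.
amounts = [12.5, 9.9, 14.2, 11.0, 13.3, 10.7, 950.0, 12.1]
print(find_outliers(amounts))  # [950.0]
```

The threshold is a tuning choice: lowering it flags more transactions for review, raising it flags fewer.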

You will find that unsupervised, supervised, and semi-supervised approaches can all work in these scenarios. A depiction of what this looks like can be found in Figure 1.4 of Chapter 1, Understanding the AI/ML Landscape.

Many of the examples in this space handle more serious issues around security and safety. You'll find some examples in the following list:

  • Credit card fraud
  • Attempts to hack your account via random logins
  • Unsafe operations at a power plant
  • Customer buying patterns
  • Illegal trading activity on a stock

There are...

Clustering problems

In addition to anomaly detection, there is another class of problem that takes an unsupervised approach: grouping entities together in order to understand more about the dataset. Clustering is the process of finding elements of a dataset that share enough attributes to form groups that are clearly distinct from one another.
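
To show what grouping by similarity can look like in code, here is a minimal, dependency-free sketch of the k-means procedure: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. The `assign` and `k_means` helpers, the seeded random start, and the customer data are all assumptions made for this illustration; a library implementation would be the practical choice for real work.

```python
import random
from math import dist


def assign(points, centroids):
    """Group points by the index of their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def k_means(points, k, iterations=20, seed=0):
    """Return (centroids, clusters) after at most `iterations` rounds."""
    # Start from k distinct data points, chosen reproducibly.
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # Move each centroid to the mean of its cluster; keep the old
        # position if a cluster ended up empty.
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break  # converged: assignments can no longer change
        centroids = updated
    return centroids, assign(points, centroids)


# Made-up customers: (visits per month, average basket in dollars).
customers = [(1, 20), (2, 22), (1, 25), (10, 200), (11, 210), (12, 190)]
centroids, clusters = k_means(customers, k=2)
for cluster in clusters:
    print(sorted(cluster))
```

Note that no labels are supplied anywhere: the two customer segments emerge purely from how close the points are to one another.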

There are many applications of this technique, and we'll go over the following few examples now:

  • Grouping segments of a customer base
  • Knowing which emails are promotions and which are more important

To achieve this, we can use a few different algorithms such as the following:

  • DBSCAN
  • K-Means clustering

While there are many more, you can be sure that these have shown promising results across various datasets and are a great place to start.

Let's look at DBSCAN first.

DBSCAN

Density-Based Spatial Clustering of Applications with Noise (or DBSCAN for...

Summary

In this chapter, we have discussed how starting from the problem itself is much more valuable than beginning from a technique to use. Depending on what we need to achieve, we can look at different modeling approaches that will help us solve the problem at hand.

We learned that classification problems are useful when we want to put elements into categories, and some approaches such as linear regression and random forest allow you to create models that achieve this. We also saw how scikit-learn lets you get to a solution with very few lines of code.

We also looked at regression for predicting values, clustering to group entities into similar buckets, and anomaly detection to find elements that don't belong with others. Similar to classification, we saw how with scikit-learn, you can get going quickly. Matplotlib also comes in handy to plot out the problem in order to give you a visual representation of what the predictions look like.

All of the models built in this...
