Chapter 6: Overcoming Bias in AI/ML

Bias in Artificial Intelligence (AI) is all around us. It can result in something as seemingly innocent as image search results for developers that show mostly men, or as serious as a system suggesting to a judge that a man of a certain race is at much greater risk of being a repeat offender than others. You might think that you won't have this problem, but bias can take many shapes that you may not already be equipped to handle.

The truth is that completely removing all bias from datasets is impossible. Much of it is entirely unintentional, often stemming simply from a lack of available data, but that doesn't matter: the damage can still be done. You'll see examples of bias in credit ratings, face detection, and other areas.

As AI becomes increasingly intertwined with the normal operations of society, its impact will continue to grow, with very real consequences felt by people. We can't claim ignorance...

Technical requirements

There are a few prerequisites to follow along with this chapter. They are as follows:

Defining bias versus discrimination

Let's start by making sure we have a clear understanding of two concepts in the context of AI: bias and discrimination. Each has different aspects, and it's important to understand the difference between them.

Bias in AI/ML

AI/ML bias is when a model shows favor toward certain groups or categories in a way that doesn't reflect the actual state of the world.

Bias is inevitable in any model and in itself can be harmless. Let's say you are going to author a paper about the most popular foods and do some analysis on them. To do so, you collect data from your friends and family about their preferences. Now think about the three foods you yourself would reply with. Are there any vegetables in there? Any Ethiopian dishes? Anything from Turkey? Perhaps not.

This is a form of bias; unless you take a perfectly even sample of people from across the world, you are...

Overcoming proxy bias

There are times when you can introduce bias even without any features or data points that directly link to a protected class (remember that a protected class is something such as age, sex, or religion). This bias is introduced by proxy: data is present that strongly correlates with membership in that group, because information about the protected class has, in some way, bled into the proxy data.

In the next diagram, you can see a representation of how proxy bias can leak into data. On the left, you have perfectly valid X and Y data, but there is also data B, which is in the form of protected class data. Even though the data from B isn't directly used in the training dataset, it is brought in via proxy through the X dataset:

Figure 6.1 – Proxy bias
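
One way to make this concrete before training is to measure how strongly each candidate feature correlates with the protected class. The following is a minimal sketch, assuming pandas is available; the column names (zip_code_income, commute_minutes, protected_group) and their values are hypothetical, not taken from a real dataset:

```python
# A minimal proxy-bias check, assuming pandas is installed. The column
# names and values below are hypothetical stand-ins, not a real dataset.
import pandas as pd

df = pd.DataFrame({
    "zip_code_income": [32000, 41000, 38500, 90000, 87000, 29000],
    "commute_minutes": [55, 48, 52, 15, 18, 60],
    "protected_group": [1, 1, 1, 0, 0, 1],  # 1 = member of the protected class
})

# Correlate each candidate feature with protected-class membership.
# A value near +1 or -1 flags the feature as a likely proxy, even though
# the protected class itself never enters the training data.
for feature in ["zip_code_income", "commute_minutes"]:
    r = df[feature].corr(df["protected_group"])
    print(f"{feature}: correlation with protected group = {r:.2f}")
```

A strongly correlated feature doesn't have to be dropped automatically, but it deserves scrutiny before it goes into a model.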

Let's look at some examples of what proxy bias could look like to make this a bit more concrete.

Examples of proxy bias

The following list contains some examples...

Overcoming sample bias

Sample bias is when the choice of data doesn't reflect what is present in the real world. This is also referred to as selection bias. As with many types of bias, this can be completely harmless or very impactful, depending on the application.

In the following diagram, you can see a visual representation of what this looks like. There is hypothetical real-world data on the left that would be helpful (represented as Input z), but for one reason or another, it did not make it into the data that is included in the training dataset:

Figure 6.2 – Sample bias

When we leave this valuable data out, it is detrimental to everyone involved. The previous diagram is more abstract, so let's look at some more concrete examples of what sample bias could look like.
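
A simple way to catch sample bias early is to compare the group make-up of your training data against a known reference distribution, such as census figures. The following is a minimal sketch, assuming pandas is available; the groups and percentages are invented for illustration:

```python
# A minimal sample-bias check, assuming pandas is installed. The groups
# and reference percentages are invented for illustration.
import pandas as pd

# Group labels of the records that made it into the training set.
train = pd.Series(["A"] * 70 + ["B"] * 25 + ["C"] * 5, name="group")

# Assumed real-world shares (for example, from census figures).
reference = {"A": 0.50, "B": 0.30, "C": 0.20}

observed = train.value_counts(normalize=True)
for group, expected in reference.items():
    got = observed.get(group, 0.0)
    flag = "  <-- under-represented" if got < 0.8 * expected else ""
    print(f"group {group}: sample {got:.0%} vs real world {expected:.0%}{flag}")
```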

Examples of sample bias

The following items are examples of where sample bias could exist. Of course, this isn't close to an exhaustive list but helps to give...

Overcoming exclusion bias

Exclusion bias is when you choose to delete information that isn't considered useful. One of the strengths of AI is that it can find patterns or relationships you didn't realize existed. Exclusion bias happens more often when an individual or team lacks solid domain knowledge around a subject and is therefore dismissive of items they don't realize would be valuable; the sketch below shows one way to guard against this.
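
Before deleting a column as useless, it's worth checking whether it carries any signal about the target. The following is a minimal sketch, assuming scikit-learn is available; the toy data is invented, with the target secretly depending on the column we are tempted to drop:

```python
# A minimal sketch, assuming scikit-learn is installed, of checking a
# column for signal before deleting it. The toy data is invented: the
# target secretly depends on the column we are tempted to drop.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(42)
n = 500
candidate = rng.integers(0, 2, size=n)   # the column that "looks useless"
noise = rng.integers(0, 2, size=n)
target = np.where(rng.random(n) < 0.8, candidate, noise)  # hidden dependence

# Mutual information near zero suggests the column is safe to drop;
# a score clearly above zero means it encodes a pattern we would lose.
mi = mutual_info_classif(candidate.reshape(-1, 1), target, discrete_features=True)
print(f"mutual information with target: {mi[0]:.3f}")
```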

An added danger arises when data scientists believe that they know an area well enough to create models around it. This can go hand in hand with the Dunning–Kruger effect, a cognitive bias in which people with low skill in a particular area overestimate their ability. You don't know what you don't know: when you are new to an area, there are many aspects of it you can't even recognize as gaps in your knowledge. Conversely, you can have people with high knowledge in an area perceiving their...

Overcoming measurement bias

Measurement bias is when the data collected for training differs from the way data is collected (or occurs) in the real world. This is an issue because the model can't understand nuance in the real world that it has never seen. And how could it? All it knows is what you tell it.

The following diagram shows what this might look like. At the top, you can see that the X, Y, and Z training data is used. Below that, the real-world data (A, B, and C) is fed into the model created from the training dataset. It is similar to the training data, but not quite the same as what was expected:

Figure 6.5 – Measurement bias
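
One way to detect this mismatch is to statistically compare a feature's distribution in the training data against what the deployed model actually receives. The following is a minimal sketch, assuming SciPy is available, using a two-sample Kolmogorov-Smirnov test on simulated sensor readings; the "field" sensor is assumed to read slightly higher than the one used to collect the training data:

```python
# A minimal sketch, assuming NumPy and SciPy are installed, of comparing
# a feature's training distribution against production readings. The
# values are simulated: the "field" sensor is assumed to read slightly
# higher than the one used to collect the training data.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
training_temps = rng.normal(loc=20.0, scale=2.0, size=1000)    # lab sensor
production_temps = rng.normal(loc=21.5, scale=2.0, size=1000)  # field sensor

stat, p_value = ks_2samp(training_temps, production_temps)
if p_value < 0.01:
    print(f"distributions differ (KS={stat:.3f}, p={p_value:.1e}): possible measurement bias")
else:
    print("no significant difference detected")
```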

Having data that is different in training versus the real world can be a big issue. It's something that you might never even consider is an issue until much later, after your model has been in production for a long time, and then the damage of inaccurate predictions might...

Overcoming societal AI bias

According to an article from Lexalytics (https://www.lexalytics.com/lexablog/bias-in-ai-machine-learning), societal AI bias is when an AI behaves in ways that reflect social intolerance or institutional discrimination. At first glance, algorithms and data themselves may appear unbiased, but their output reinforces societal biases.
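
Because this bias shows up in the output rather than in any single input, one practical check is to audit the model's decisions across groups. The following is a minimal sketch, assuming pandas is available; the decisions and group labels are invented for illustration:

```python
# A minimal output audit, assuming pandas is installed. The decisions
# and group labels are invented for illustration.
import pandas as pd

results = pd.DataFrame({
    "group":    ["A", "A", "A", "A", "B", "B", "B", "B"],
    "approved": [1,   1,   0,   1,   0,   0,   1,   0],  # model decisions
})

# Positive-outcome rate per group; a large gap is a red flag that the
# model is reproducing a societal pattern rather than anything the
# applicants themselves did differently.
rates = results.groupby("group")["approved"].mean()
print(rates)
print(f"demographic parity gap: {rates.max() - rates.min():.2f}")
```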

The following figure is a glimpse of what societal bias might look like in an abstract sense. You can see that there is some good data being brought in, but along with that, there is data that is in fragments. This misshapen data represents some flaw that doesn't allow us to get a correct sense of the state of the world we are trying to model:

Figure 6.6 – Societal bias

The fragmented and flawed data bits will be baked into any model trained on this data unless we do something about it. One unique thing about this bias is that it can be invisible once the data has already been gathered and...

Finding bias in an example

The following example shows the significant business impact that bias in data can have.

The housing data company Zillow recently backed out of the iBuying business. Zillow is a US-based company that lists housing information for everyday consumers. iBuying is the term for instant buying, in which Zillow bought properties directly and then sold them for a profit (in theory). Zillow found that its estimations (or Zestimates) were off by a wide margin, which led to the company pulling out of that area. Maybe we can find out why.
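
Before walking through the bias types, here is a glimpse of what such a hunt can look like in code: group the estimation error by a segment such as region and look for segments where the model is systematically off. This is only a minimal sketch with invented numbers, not Zillow's actual data or method:

```python
# A minimal sketch of hunting for systematic error, assuming pandas is
# installed. The numbers and the region column are invented; this is not
# Zillow's data or method.
import pandas as pd

df = pd.DataFrame({
    "region":     ["north", "north", "south", "south", "west", "west"],
    "estimate":   [310_000, 295_000, 180_000, 175_000, 520_000, 505_000],
    "sale_price": [300_000, 290_000, 205_000, 198_000, 515_000, 500_000],
})

df["pct_error"] = (df["estimate"] - df["sale_price"]) / df["sale_price"]

# A mean error far from zero in one region points to systematic bias in
# that segment rather than random noise.
print(df.groupby("region")["pct_error"].mean().round(3))
```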

In this scenario, we will try to find where bias could have entered a system such as the Zestimate. To give you a framework, we'll walk through each type of bias discussed earlier and think it through step by step. This is important, as you might not instantly jump to a certain type of bias unless you see it. This is an issue, as everyone has a bias toward looking for something...

Summary

You might have heard the saying garbage in, garbage out when it comes to data and AI. In this chapter, you saw that we should also take just as seriously the phrase bias in, bias out.

We looked at some of the primary areas where bias can creep into our data and saw that we must keep an eye out for it sooner rather than later; at certain points in the process, it's too late. Bias and discrimination can have real-world impacts, from hiring and vehicle safety to the perpetuation of unjust social practices.

You have a few options to make sure that you are doing all you can to avoid this bias, such as having the relevant domain knowledge (or consulting those who do) and getting people from different backgrounds to look at the data (or, better yet, having them on your team).

There are also many other types of bias out there and, admittedly, things that we don't even realize are areas of concern. It's also important to be aware that the drift talked about...
