Data Privacy and Responsible AI Best Practices

In the previous chapter, we talked about how to build a data governance program for our organization and how to identify types of sensitive data. Our work does not stop there. Although in some cases we can safely exclude sensitive information, in others we cannot, so the machine learning (ML) models we build to solve problems may need to be trained on personal data. Sometimes that data is relevant and useful; other times it creates unintended correlations that make the model biased. This is the issue we will tackle in this chapter.

We will talk about how to recognize sensitive information and how to mitigate its impact when it is not relevant to the model training process, using techniques such as differential privacy. We will explore how to protect individual information even from aggregated data or model results. To help us with that, we will see how to use the SmartNoise software development kit (SDK).

We will also discuss fairness...

Technical requirements

The code for this chapter is available in this repository under the ch5 folder:

https://github.com/PacktPublishing/Machine-Learning-Model-Security-in-Azure/

Working with Python

To use the libraries, you need to be familiar with Python. In this book, we will use notebooks from the Azure Machine Learning environment to run the examples, but if you prefer to use your own development environment and tools, that is fine.

Getting started with Python

New to Python and ML? Take a look at this learning path to learn the basics of Python: https://learn.microsoft.com/en-us/training/paths/beginner-python/.

Running a notebook in Azure Machine Learning

The process of running a notebook in Azure Machine Learning is very straightforward. All you need to do is import or create a notebook in the interface, attach a compute target, and then run the cells. Let us see the steps together:

  1. Go to the Notebooks section and upload or create your file:
...

Discovering and protecting sensitive data

Although good governance and the many tools that support sensitive data discovery, classification, and profiling can help us, more often than not the data used in our ML experiments comes from outside sources, or we are simply not developing for our own organization. In that case, we need to train ourselves on what sensitive data is and how to do a quick cleanup before using it in Azure Machine Learning.
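
As a quick, illustrative sketch of what such a cleanup can look like, the following snippet drops and pseudonymizes columns with pandas before a dataset is used in Azure Machine Learning. The column names and the salt value are hypothetical, and a salted hash is just one possible pseudonymization choice:

```python
# A minimal, hypothetical PII cleanup with pandas; column names are made up.
import hashlib

import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102],
    "email": ["alice@example.com", "bob@example.com"],
    "balance": [1200.0, 540.0],
})

# Drop direct identifiers that the model does not need...
df = df.drop(columns=["customer_id"])

# ...and pseudonymize identifiers we must keep, here with a salted hash
# so that the same email always maps to the same opaque token.
SALT = "replace-with-a-secret-salt"  # assumption: store this securely
df["email"] = df["email"].map(
    lambda value: hashlib.sha256((SALT + value).encode()).hexdigest()
)
print(df.head())
```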

Identifying sensitive data

Sensitive data refers to any information that, if exposed, could cause harm or privacy breaches, or lead to identity theft, monetary loss, or other adverse consequences for individuals or organizations. This data requires special protection due to its nature and the potential risks associated with its disclosure.

There are many categories of sensitive data; the main ones are outlined below, together with examples we need to be aware of:

  • Personally identifiable...

Introducing differential privacy

Differential privacy is a concept whose purpose is to protect the privacy of individual data contributors while still allowing useful statistical analysis. The basic idea is to add noise, or random perturbations, to the data in such a way that the statistical properties of the dataset remain approximately the same, but it becomes much more difficult to identify any individual's information within the dataset.

The level of privacy protection in differential privacy is controlled by a parameter called epsilon (ε). A smaller value of epsilon indicates a higher level of privacy, but it might also lead to a decrease in data utility (usefulness of the data for analysis). Striking a balance between privacy and utility is a key challenge in implementing differential privacy:

Figure 5.3 – Epsilon (ε) value relationship with privacy and accuracy
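
To make the role of epsilon concrete, here is a minimal, illustrative sketch of the Laplace mechanism, the classic way of adding noise calibrated to a privacy budget; the function and the example values are our own, not the book's:

```python
# A minimal sketch of the Laplace mechanism (illustrative, not from the book).
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    # sensitivity: how much one individual's record can change the result
    #              (1 for a counting query).
    # epsilon:     the privacy budget; a smaller epsilon means more noise,
    #              hence more privacy but less accuracy.
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale)

true_count = 1000  # the exact answer to a hypothetical counting query
for eps in (0.1, 1.0, 10.0):
    noisy = laplace_mechanism(true_count, sensitivity=1.0, epsilon=eps)
    print(f"epsilon={eps}: noisy count = {noisy:.1f}")
```

Running this a few times shows the trade-off from Figure 5.3: at ε = 0.1 the answers scatter widely around the true count, while at ε = 10 they stay close to it.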

A library that we can use to add noise to the data is the...
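
Since the chapter's noise-adding examples build on the SmartNoise SDK introduced earlier, here is a rough, hedged sketch of what querying a DataFrame through its smartnoise-sql package (snsql) can look like; the dataset, metadata file, and column names are hypothetical placeholders:

```python
# A hedged sketch of smartnoise-sql (snsql) from the OpenDP SmartNoise SDK.
# "PUMS.csv" and "PUMS.yaml" are placeholders for your own data and the
# metadata file that describes table and column bounds.
import pandas as pd
import snsql
from snsql import Privacy

df = pd.read_csv("PUMS.csv")
privacy = Privacy(epsilon=1.0, delta=0.01)  # privacy budget per query

reader = snsql.from_df(df, privacy=privacy, metadata="PUMS.yaml")

# Each query spends privacy budget and returns differentially private results.
result = reader.execute("SELECT COUNT(*) AS n, AVG(age) AS avg_age FROM PUMS.PUMS")
print(result)
```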

Mitigating fairness

Mitigating fairness issues in ML models is an essential step to ensure that a model does not exhibit bias or discrimination against certain groups of individuals. Even if we remove PII from our datasets, predictions might still favor some groups over others based on characteristics such as race, gender, age, or religion. If the training data is not diverse and representative of the population you aim to serve, bias can creep into the model.

Firstly, we need to learn to identify bias in our models, which we can do by analyzing the model's metrics. Suppose you suspect that your loan approval model favors applicants above a certain age when approving loan applications. You can start by looking at the metrics for the complete dataset as follows:

...
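
As a hedged sketch of this kind of analysis, one way to compute the same metrics overall and broken down by group is Fairlearn's MetricFrame; the synthetic arrays and the age_group feature below are hypothetical stand-ins for real loan data:

```python
# A minimal Fairlearn sketch with synthetic data; real loan data would
# replace the randomly generated arrays below.
import numpy as np
from fairlearn.metrics import MetricFrame, selection_rate
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                     # actual outcomes
y_pred = rng.integers(0, 2, size=200)                     # model decisions
age_group = rng.choice(["under 40", "40 and over"], 200)  # sensitive feature

mf = MetricFrame(
    metrics={
        "selection rate": selection_rate,  # share of approved applications
        "accuracy": accuracy_score,
        "recall": recall_score,
    },
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=age_group,
)
print(mf.overall)    # metrics for the complete dataset
print(mf.by_group)   # the same metrics broken down by age group
```

A large gap in selection rate or recall between the two age groups would support the suspicion that the model favors one of them.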

Working with model interpretability

Model interpretability in ML refers to the ability to understand and explain how a particular model makes predictions or decisions. Interpretable models provide clear insights into the features or variables that are most influential in the model’s decision-making process. This is particularly important in domains where the decision-making process needs to be transparent and understandable, such as healthcare, finance, and legal systems.

Although you can never explain 100% why a model makes a prediction, you can use explainers to understand which features affect the results. Explainers can provide global explanations (for example, which features affect the overall behavior of the model) or local explanations, which tell us what influenced an individual prediction.
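
As a quick illustration of a global explanation, the sketch below uses scikit-learn's permutation importance on a synthetic dataset; this is a stand-in technique of our own, not necessarily the explainer the book's examples use:

```python
# A small sketch of global feature importance via permutation importance.
# scikit-learn is used as a stand-in; Azure ML explainers and the
# Responsible AI dashboard surface similar information.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure the drop in score: a large drop
# means the model relies heavily on that feature (a global explanation).
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i, importance in enumerate(result.importances_mean):
    print(f"feature {i}: {importance:.3f}")
```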

Let us explore some methods we can use to achieve model interpretability:

  • Feature importance (FI) determines the influence of each feature...

Exploring FL and secure multi-party computation

FL is an ML approach that enables the training of models across multiple devices or servers without centrally aggregating the raw data. In traditional ML, data is usually collected and sent to a central compute server for training, which raises privacy and security concerns, especially when dealing with sensitive or personal information.

In FL, the training process happens locally on the devices or nodes (for example, smartphones, edge devices, or compute instances) that generate or store the data. These nodes collaborate by sharing only model updates (gradients) rather than the raw data itself. The central compute server aggregates these updates to create an improved global model. This process is repeated iteratively, with each node contributing to the model’s improvement while keeping its data private.
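
The following toy sketch simulates that loop as federated averaging (FedAvg) in plain NumPy; everything here (the node data, the stand-in "local training" step) is simplified and hypothetical, and real FL frameworks additionally handle communication, security, and scheduling:

```python
# An illustrative federated averaging (FedAvg) round in plain NumPy.
import numpy as np

def local_update(global_weights, local_data, lr=0.1):
    # Simulate one node: compute an update on local data only.
    # The "gradient" is a stand-in; only the update leaves the node.
    gradient = local_data.mean(axis=0) - global_weights
    return global_weights + lr * gradient

rng = np.random.default_rng(0)
global_weights = np.zeros(3)
nodes = [rng.normal(loc=i, size=(100, 3)) for i in range(3)]  # private datasets

for round_number in range(5):
    # Each node trains locally and shares only its updated weights.
    local_models = [local_update(global_weights, data) for data in nodes]
    # The central server aggregates updates (a simple average = FedAvg).
    global_weights = np.mean(local_models, axis=0)

print(global_weights)  # the raw data never left the nodes
```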

The main advantages of FL are as follows:

  • Privacy: As the raw data remains on the local nodes, there is no need...

Summary

Protecting sensitive data is a multi-faceted problem. There are techniques to mitigate fairness issues, protect privacy, and work ethically and responsibly with AI, but the balance between prediction accuracy and data protection is delicate. Add the complexity of choosing the right combination of techniques for your data and algorithms, and the task can seem daunting.

In this chapter, we learned to identify different types of sensitive data and common techniques to remove or mask it. However, it is not always possible to eliminate such data completely, as it can be useful for the model training process. In this case, there are several libraries available to help. We can use the SmartNoise SDK to introduce noise to our data and protect privacy, work with the Fairlearn SDK to mitigate fairness issues, and use the Responsible AI dashboard together with explainers to interpret our models. We ended this chapter by introducing the concept of FL and how to apply it using Azure Machine...

Further reading
