
Machine Learning with Qlik Sense

By Hannu Ranta
About this book
The ability to forecast future trends through data prediction, coupled with the integration of ML and AI, has become indispensable to global enterprises. Qlik, with its extensive machine learning capabilities, stands out as a leading analytics platform enabling businesses to achieve exhaustive comprehension of their data. This book helps you maximize these capabilities by using hands-on illustrations to improve your ability to make data-driven decisions. You’ll begin by cultivating an understanding of machine learning concepts and algorithms, and build a foundation that paves the way for subsequent chapters. The book then helps you navigate through the process of framing machine learning challenges and validating model performance. Through the lens of Qlik Sense, you'll explore data preprocessing and analysis techniques, as well as find out how to translate these techniques into pragmatic machine learning solutions. The concluding chapters will help you get to grips with advanced data visualization methods to facilitate a clearer presentation of findings, complemented by an array of real-world instances to bolster your skillset. By the end of this book, you’ll have mastered the art of machine learning using Qlik tools and be able to take your data analytics journey to new heights.
Publication date:
October 2023
Publisher
Packt
Pages
242
ISBN
9781805126157

 

Introduction to Machine Learning with Qlik

Machine learning and artificial intelligence are two of the most powerful technology trends in the 21st century. Usage of these technologies is rapidly growing since the need for faster insights and forecasts has become crucial for companies. Qlik is a leading vendor in the analytics space and has heavily invested in machine learning and AI tools.

This first chapter will introduce the different machine learning tools in the Qlik ecosystem and provide basic information about the statistical models and principles behind these tools. It will also cover the concepts of correct sample size and how to analyze model performance and reliability.

Here is what we will cover in this first chapter:

  • An overview of the Qlik tools and platform
  • The basic statistical concepts of machine learning
  • Proper sample size and the defining factors of a sample
  • How to evaluate model performance and reliability
 

Introduction to Qlik tools

Qlik Sense is a leading data analytics and business intelligence platform and contains many tools and features for data analytics relating to machine learning. In this chapter, we will take a closer look at the key features of the Qlik platform.

Machine learning and AI capabilities on the Qlik platform can be divided into three different components:

  • Insight Advisor
  • Qlik AutoML
  • Advanced Analytics Integration

Insight Advisor

Qlik Insight Advisor is a feature of Qlik Sense that uses natural language processing (NLP) and machine learning to help users explore and analyze data more effectively. It allows users to ask questions about their data in natural language and to receive insights and recommendations in real time. It also auto-generates advanced analytics and visualizations and assists with analytics creation and data preparation.

Insight Advisor utilizes a combination of Qlik’s associative engine and augmented intelligence engine and supports a wide range of use cases, as seen in the following figure:

Figure 1.1: Qlik Insight Advisor and different scenarios

Did you know?

The Qlik associative engine is the core technology that powers the Qlik data analytics and business intelligence platform. It is a powerful in-memory engine that uses an associative data model, which allows users to explore data in a way that is more intuitive and natural than traditional query-based tools.

Instead of pre-defined queries or data models, the engine automatically associates data across multiple tables and data sources based on common fields or attributes and uses a patented indexing technology that stores all the data in memory, enabling real-time analysis and exploration of even the largest datasets. It is a powerful and innovative technology that underpins the entire Qlik platform.

Insight Advisor has the following key features:

  • Advanced insight generation: Insight Advisor provides a way to surface new and hidden insights. It uses AI-generated analyses that are delivered in multiple forms. Users can select from a full range of analysis types, which are auto-generated. These types include visualizations, narrative insights, and entire dashboards. Advanced analytics is also supported, and Insight Advisor can generate comparison, ranking, trending, clustering, geographical analysis, time series forecasts, and more.
  • Search-based visual discovery: Insight Advisor auto-generates the most relevant and impactful visualizations for the users, based on natural language queries. It provides a set of charts that users can edit and fine-tune before adding to the dashboard. It is context-aware and reflects the selections with generated visualizations. It also suggests the most significant data relationships to explore further.
  • Conversational analytics: Conversational analytics in Insight Advisor allows users to interact using natural language. Insight Advisor Chat offers a fully conversational analytics experience for the entire Qlik platform. It understands user intent and delivers additional insights for deeper understanding.
  • Accelerated creation and data preparation: Accelerated creation and data preparation helps users to create analytics using a traditional build process. It gives recommendations about associations and relationships in data. It also gives chart suggestions and renders the best types of visualizations for each use case, which allows non-technical users to get the most out of the analyzed data. Data preparation also includes intelligent profiling, which provides descriptive statistics about the data.

Note

A hands-on example with Insight Advisor can be found in Chapter 9, where you will be given a practical example of the most important functionalities in action.

Qlik AutoML

Qlik AutoML is an automated machine learning tool that makes AI-generated machine learning models and predictive analytics available for all users. It allows users to easily generate machine learning models, make predictions, and plan decisions using an intuitive, code-free user interface.

AutoML connects and profiles data, identifies key drivers in the dataset, and generates and refines models. It allows users to create future predictions and test what-if scenarios. Results are returned with prediction-influencer data (Shapley values) at the record level, which allows users to understand why predictions were made. This is critical to take the correct actions based on the outcome.

Predictive data can be published in Qlik Sense for further analysis and models can be integrated using Advanced Analytics Integration for real-time exploratory analysis.

Using AutoML is simple and does not require comprehensive data science skills. Users must first select the target field and then AutoML will run through various steps, as seen in the following figure:

Figure 1.2: The AutoML process flow

With the model established and trained, AutoML lets users make predictions on current datasets. Deployed models can be used both from Qlik tools and other analytics tools. AutoML also provides a REST API to consume the deployed models.

Note

More information about AutoML, including hands-on examples, can be found in Chapter 8.

Advanced Analytics Integration

Advanced Analytics Integration is the ability to integrate advanced analytics and machine learning models directly into the Qlik data analytics platform. This integration allows users to combine the power of advanced analytics with the data exploration and visualization capabilities of Qlik to gain deeper insights from their data.

Advanced Analytics Integration is based on open APIs that provide direct, engine-level integration between Qlik’s Associative Engine and third-party data science tools. Data is exchanged and calculations are made in real time as the user interacts with the software. Only relevant data is passed from the Associative Engine to third-party tools, based on user selections and context. The workflow is explained in the following figure:

Figure 1.3: Advanced analytics integration dataflow

Advanced analytics integration can be used with any external calculation engine, but native connectivity is provided for Amazon SageMaker, Amazon Comprehend, Azure ML, DataRobot, and custom models made with R and Python. Qlik AutoML can also utilize advanced analytics integration.

Note

More information, including practical examples about advanced analytics integration, can be found in Chapter 7. Installing the needed components for the on-premises environment is described in Chapter 5.

 

Basic statistical concepts with Qlik solutions

Now that we have been introduced to Qlik tools, we will explore some of the statistical concepts that are used with them. Statistical principles play a crucial role in the development of machine-learning algorithms. These principles provide the mathematical framework for analyzing and modeling data, making predictions, and improving the accuracy of machine-learning models over time. In this section, we will become familiar with some of the key concepts that will be needed when building machine-learning solutions.

Types of data

Different data types are handled differently, and each requires different techniques. There are two major data types in typical machine-learning solutions: categorical and numerical.

Categorical data typically defines a group or category using a name or a label. Each data point is assigned to exactly one category, and the categories are mutually exclusive. Categorical data can be further divided into nominal data and ordinal data. Nominal data names or labels a category without implying any order. Ordinal data is constructed from elements with rankings, orders, or rating scales. Ordinal data can be ordered or counted but not measured. Some machine-learning algorithms can’t handle categorical variables unless these are converted (encoded) to numerical values.

Numerical data can be divided into discrete and continuous data. Discrete data is countable and is formed using natural numbers, for example, age in years or the number of employees in a company. Continuous data can take any value within a range, for example, a person’s height or a student’s score. One type of data that needs special attention is datetime information. Dates and times are typically useful in machine-learning models but require some work to turn them into numerical data.
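
As a minimal sketch of the conversions mentioned above, the following Python snippet encodes a nominal field with one 0/1 column per category (one-hot encoding), an ordinal field with an integer ranking, and a date with its numerical parts. The field names and values (color, size, order_date) are made up for illustration and do not come from a Qlik dataset.

```python
from datetime import date

# Hypothetical records; field names and values are illustrative only.
records = [
    {"color": "red", "size": "small", "order_date": date(2023, 1, 15)},
    {"color": "blue", "size": "large", "order_date": date(2023, 6, 3)},
    {"color": "red", "size": "medium", "order_date": date(2023, 12, 24)},
]

# Nominal data has no order, so each category becomes its own 0/1 column.
colors = sorted({r["color"] for r in records})

# Ordinal data has a ranking, so an integer code can preserve the order.
size_rank = {"small": 0, "medium": 1, "large": 2}


def encode(record):
    """Return a purely numerical version of one record."""
    encoded = {f"color_{c}": int(record["color"] == c) for c in colors}
    encoded["size"] = size_rank[record["size"]]
    # Datetime information is split into numerical parts.
    encoded["month"] = record["order_date"].month
    encoded["weekday"] = record["order_date"].weekday()  # Monday = 0
    return encoded


encoded_records = [encode(r) for r in records]
for row in encoded_records:
    print(row)
```

Which date parts are useful depends on the problem; month and weekday are only one possible choice.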

Mean, median, and mode

The mean is calculated by dividing the sum of all values in a dataset by the number of values. The simplified equation can be formed like this:

$$\text{mean} = \frac{\text{Sum of all data points}}{\text{Number of data points}}$$

The following is a simple example to calculate the mean of a set of data points:

X = [5,15,30,45,50]

X̅ = (5+15+30+45+50)/5

X̅ = 29

The mean is sensitive to outliers and these can significantly affect its value. The mean is typically written as X̅.

The median is the middle value of the sorted dataset. Using the dataset in the previous example, our median is 30. The main advantage of the median over the mean is that the median is less affected by outliers. If there is a high chance for outliers, it’s better to use the median instead of the mean. If we have an even number of data points in our dataset, the median is the average of two middle points.

The mode represents the most common value in a dataset. It is mostly used when there is a need to understand clustering or, for example, encoded categorical data. Calculating the mode is quite simple. First, we need to order all values and count how many times each value appears in a set. The value that appears the most is the mode. Here is a simple example:

X = [1,4,4,5,7,9]

The mode = 4 since it appears two times and all other values appear only one time. A dataset can also have multiple modes (multimodal dataset). In this case, two or more values occur with the highest frequency.
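
The three measures can be checked with Python's built-in statistics module. This is a small sketch that reuses the datasets from the examples above; the dataset with an outlier and the multimodal dataset are added here only for illustration.

```python
import statistics

x = [5, 15, 30, 45, 50]
mean_x = statistics.mean(x)      # sum of values divided by their count
median_x = statistics.median(x)  # middle value of the sorted data

# With an even number of points, the median averages the two middle values.
median_even = statistics.median([5, 15, 30, 45])

# Replacing 50 with an outlier moves the mean a lot but not the median.
x_outlier = [5, 15, 30, 45, 5000]
mean_outlier = statistics.mean(x_outlier)
median_outlier = statistics.median(x_outlier)

y = [1, 4, 4, 5, 7, 9]
mode_y = statistics.mode(y)  # most common value

# multimode returns every value that shares the highest frequency.
modes_multi = statistics.multimode([1, 1, 2, 2, 3])

print(mean_x, median_x, median_even)
print(mean_outlier, median_outlier)
print(mode_y, modes_multi)
```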

Variance

Variance (σ²) is a statistical measure that describes the degree of variability or spread in a set of data. It is the average of the squared differences from the mean of the dataset.

In other words, variance measures how much each data point deviates from the mean of the dataset. A low variance indicates that the data points are closely clustered around the mean, while a high variance indicates that the data points are more widely spread out from the mean.

The formula for variance is as follows:

$$\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

where σ² is the variance of the dataset, n is the number of data points in the set, and Σ(xᵢ − x̄)² is the sum of the squared differences between each data point (xᵢ) and the mean (x̄). Dividing by n − 1 rather than n gives the sample variance. The square root of the variance is the standard deviation.

Variance is an important concept in statistics and machine learning, as it is used in the calculation of many other statistical measures, including standard deviation and covariance. It is also commonly used to evaluate the performance of models and to compare different datasets.

Variance is used to see how individual values relate to each other within a dataset. The advantage is that variance treats all deviations from the mean as the same, regardless of direction.

Example

We have a stock that returns 15% in year 1, 25% in year 2, and -10% in year 3. The mean of the returns is 10%. The differences between each year’s return and the mean are 5%, 15%, and -20%. Squaring these gives 0.25%, 2.25%, and 4%. If we add these together, we get 6.5%. When divided by 2 (3 observations − 1), we get a variance of 3.25%.
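
The arithmetic of this example can be followed step by step in Python. The sketch below writes the returns as decimal fractions (15% as 0.15) and compares the hand calculation with the sample variance from the standard library.

```python
import statistics

returns = [0.15, 0.25, -0.10]  # yearly returns from the example

mean_return = sum(returns) / len(returns)
squared_diffs = [(r - mean_return) ** 2 for r in returns]

# Sample variance: divide the sum of squared differences by n - 1.
variance = sum(squared_diffs) / (len(returns) - 1)

# statistics.variance also divides by n - 1.
library_variance = statistics.variance(returns)

print(round(mean_return, 4), round(variance, 4), round(library_variance, 4))
```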

Standard deviation

Standard deviation is a statistical measure that quantifies the amount of variation or dispersion in a set of data. It measures how much the individual data points deviate from the mean of the dataset.

A low standard deviation indicates that the data points are close to the mean, while a high standard deviation indicates that the data points are more spread out from the mean.

The formula for standard deviation is as follows:

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

where σ is the standard deviation, Σ(xᵢ − x̄)² is the sum of the squared differences between each data point (xᵢ) and the mean (x̄), and n is the number of data points.

Example

Continuing from our previous example, we got a variance of 3.25% for our stock. Taking the square root of the variance yields a standard deviation of approximately 18%.
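
The same stock example can be checked in Python. This sketch again uses decimal fractions and the sample (n − 1) versions of both measures from the standard library.

```python
import math
import statistics

returns = [0.15, 0.25, -0.10]  # yearly returns from the example

variance = statistics.variance(returns)  # sample variance (n - 1)
std_dev = math.sqrt(variance)            # standard deviation

# statistics.stdev computes the sample standard deviation directly.
library_std_dev = statistics.stdev(returns)

print(f"variance = {variance:.4f}, standard deviation = {std_dev:.4f}")
```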

Standardization

Standardization or Z-score normalization is the concept of normalizing different variables to the same scale. This method allows comparison of scores between different types of variables. Z-score is a fractional representation of standard deviations from the mean value. We can calculate the z-score using the following formula:

$$z = \frac{x - \bar{x}}{\sigma}$$

In the formula, x is the observed value, x̅ is the mean, and σ is the standard deviation of the data.

Basically, the z-score describes how many standard deviations away a specific data point is from the mean. If the absolute value of a data point’s z-score is high, the data point is most likely an outlier. Z-score normalization is one of the most popular feature-scaling techniques in data science and is an important preprocessing step. Many machine-learning algorithms attempt to find trends in data and compare features of data points. It is problematic if features are on different scales, which is why we need standardization.

Note

Standardized datasets will have a mean of 0 and standard deviation of 1, but there are no specific boundaries for maximum and minimum values.
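
To illustrate why standardization puts features on the same scale, the following sketch standardizes two made-up features, heights in centimeters and yearly incomes. The values are hypothetical and the sample standard deviation (n − 1) is assumed.

```python
import statistics


def standardize(values):
    """Return the z-score of each value: (x - mean) / standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / std for v in values]


# Hypothetical features measured on very different scales.
heights_cm = [150, 160, 170, 180, 190]
incomes = [20000, 35000, 50000, 65000, 80000]

z_heights = standardize(heights_cm)
z_incomes = standardize(incomes)

print([round(z, 3) for z in z_heights])
print([round(z, 3) for z in z_incomes])
```

Both features follow the same evenly spaced pattern, so their z-scores come out the same even though the raw numbers differ by orders of magnitude.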

Correlation

Correlation is a statistical measure that describes the relationship between two variables. It measures the degree to which changes in one variable are associated with changes in another variable.

There are two types of correlation: positive and negative. Positive correlation means that the two variables move in the same direction, while negative correlation means that the two variables move in opposite directions. A correlation of 0 indicates that there is no linear relationship between the variables.

The most commonly used measure of correlation is the Pearson correlation coefficient, which ranges from -1 to 1. A value of -1 indicates a perfect negative correlation, a value of 0 indicates no correlation, and a value of 1 indicates a perfect positive correlation.

The Pearson correlation coefficient can be used when the relationship of variables is linear and both variables are quantitative and normally distributed. There should be no outliers in the dataset.

Correlation can be calculated using the cor() function in R or the scipy.stats or NumPy libraries in Python.
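
To show what the coefficient measures, here is a small pure-Python sketch that computes Pearson's r from its definition. The variable names and values are invented for illustration. In practice, numpy.corrcoef or scipy.stats.pearsonr would normally be used instead of a hand-written function.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (numerator of the coefficient).
    co_deviation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return co_deviation / (spread_x * spread_y)


# Hypothetical data, chosen to give perfectly linear relationships.
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 6, 8, 10]         # rises as ad_spend rises
stock_left = [10, 8, 6, 4, 2]    # falls as ad_spend rises

r_positive = pearson_r(ad_spend, sales)
r_negative = pearson_r(ad_spend, stock_left)
print(r_positive, r_negative)
```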

Probability

Probability is a fundamental concept in machine learning that is used to quantify the uncertainty associated with events or outcomes. Basic concepts of probability include the following:

  • Random variables: A variable whose value is determined by chance. Random variables can be discrete or continuous.
  • Probability distribution: A function that describes the likelihood of different values for a random variable. Common probability distributions include the normal distribution, the binomial distribution, and the Poisson distribution.
  • Bayes’ theorem: A fundamental theorem in probability theory that describes the relationship between conditional probabilities. Bayes’ theorem is used in many machine-learning algorithms, including naive Bayes classifiers and Bayesian networks.
  • Conditional probability: The probability of an event occurring given that another event has occurred. Conditional probability is used in many machine-learning algorithms, including decision trees and Markov models.
  • Expected value: The average value of a random variable, weighted by its probability distribution. Expected value is used in many machine-learning algorithms, including reinforcement learning.
  • Maximum likelihood estimation: A method of estimating the parameters of a probability distribution based on observed data. Maximum likelihood estimation is used in many machine-learning algorithms, including logistic regression and hidden Markov models.
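
Two of these concepts, Bayes’ theorem and expected value, are easy to illustrate with a few lines of Python. The probabilities below are hypothetical numbers chosen only to show the formulas at work and do not describe any real dataset.

```python
# Hypothetical probabilities for a customer churn scenario.
p_churn = 0.10                  # P(churn), the prior probability
p_complaint_given_churn = 0.60  # P(complaint | churn)
p_complaint_given_stay = 0.05   # P(complaint | no churn)

# Law of total probability: P(complaint).
p_complaint = (
    p_complaint_given_churn * p_churn
    + p_complaint_given_stay * (1 - p_churn)
)

# Bayes' theorem: P(churn | complaint).
p_churn_given_complaint = p_complaint_given_churn * p_churn / p_complaint

# Expected value of a discrete random variable: sum of value * probability.
die_faces = [1, 2, 3, 4, 5, 6]
expected_roll = sum(face * (1 / 6) for face in die_faces)

print(round(p_churn_given_complaint, 4), round(expected_roll, 4))
```

With these made-up inputs, observing a complaint raises the probability of churn well above the 10% prior, which is the kind of update that naive Bayes classifiers rely on.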

Note

Probability is a broad topic in its own right, and many books have been written about it. In this book, we will not go deeper into the details, but it is important to understand the terms at a high level.

We have now investigated the basic principles of the statistics that play a crucial role in Qlik tools. Next, we will focus on the concept of defining a proper sample size. This is an important step, since we are not always able to train our model with all the data and we want our training dataset to represent the full data as much as possible.

 

Defining a proper sample size and population

Defining a proper sample size for machine learning is crucial to get accurate results. It is also a common problem that we don’t know how much training data is needed. Having a correct sample size is important for several reasons:

  • Generalization: Machine-learning models are trained on a sample of data with the expectation that they will generalize to new, unseen data. If the sample size is too small, the model may not capture the full complexity of the problem, resulting in poor generalization performance.
  • Overfitting: Overfitting occurs when a model fits the training data too closely, resulting in poor generalization performance. Overfitting is more likely to occur when the sample size is small because the model has fewer examples to learn from and may be more likely to fit the noise in the data.
  • Statistical significance: In statistical inference, sample size is an important factor in determining the statistical significance of the results. A larger sample size provides more reliable estimates of model parameters and reduces the likelihood of errors due to random variation.
  • Resource efficiency: Machine-learning models can be computationally expensive to train, especially with large datasets. Having a correct sample size can help optimize the use of computing resources by reducing the time and computational resources required to train the model.
  • Decision-making: Machine-learning models are often used to make decisions that have real-world consequences. Having a correct sample size ensures that the model is reliable and trustworthy, reducing the risk of making incorrect or biased decisions based on faulty or inadequate data.

Defining a sample size

The sample size depends on several factors, including the complexity of the problem, the quality of the data, and the algorithm being used. “How much training data do I need?” is a common question at the beginning of a machine-learning project. Unfortunately, there is no correct answer to that question, since it depends on various factors. However, there are some guidelines.

Generally, the following factors should be addressed when defining a sample:

  • Have a representative sample: It is essential to have a representative sample of the population to train a machine-learning model. The sample size should be large enough to capture the variability in the data and ensure that the model is not biased toward a particular subset of the population.
  • Avoid overfitting: Overfitting occurs when a model is too complex and fits the training data too closely. To avoid overfitting, it is important to have a sufficient sample size to ensure that the model generalizes well to new data.
  • Consider the number of features: The number of features or variables in the dataset also affects the sample size. As the number of features increases, the sample size required to train the model also increases.
  • Use power analysis: Power analysis is a statistical technique used to determine the sample size required to detect a significant effect. It can be used to determine the sample size needed for a machine-learning model to achieve a certain level of accuracy or predictive power.
  • Cross-validation: Cross-validation is a technique used to evaluate the performance of a machine-learning model. It involves splitting the data into training and testing sets and using the testing set to evaluate the model’s performance. The sample size should be large enough to ensure that the testing set is representative of the population and provides a reliable estimate of the model’s performance.
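
The splitting step behind cross-validation can be sketched in a few lines. The example below assumes the common k-fold variant, in which every row is used for testing exactly once. The function name k_fold_indices and its parameters are made up for this illustration.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n_samples=10, k=5))
for train, test in splits:
    print(f"train on {len(train)} rows, test on {len(test)} rows")
```

With a small sample, each test fold contains very few rows, which is one reason the performance estimate becomes unreliable when the sample size is too small.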

There are several statistical heuristic methods available to estimate a sample size. Let’s take a closer look at some of these.

Power analysis

Power analysis is one of the key concepts in machine learning. Power analysis is mainly used to determine whether a statistical test has sufficient probability to find an effect and to estimate the sample size required for an experiment considering the significance level, effect size, and statistical power.

In this context, power is the probability that a statistical test will reject a false null hypothesis (H0), that is, the probability of detecting an effect when the effect is really there. A bigger sample size results in greater power. The main output of power analysis is an estimate of the appropriate sample size.

To understand the basics of power analysis, we need to get familiar with the following concepts:

  • A type I error (α) is rejecting a H0 or a null hypothesis in the data when it’s true (false positive).
  • A type II error (β) is the failure to reject a false H0 or, in other words, a probability of missing an effect that is in the data (false negative).
  • The power is the probability of detecting an effect that is in the data.
  • There is a direct relationship between the power and type II error:

    $$\text{Power} = 1 - \beta$$

    Generally, β should never be more than 20%, which gives us the minimum approved power level of 80%.

  • The significance level (α) is the maximum risk of rejecting a true null hypothesis (H0) that you are willing to take. This is typically set to 5% (p < 0.05).
  • The effect size is the measure of the strength of a phenomenon in the dataset (independent of sample size). The effect size is typically the hardest to determine. An example of an effect size would be the height difference between men and women. The greater the effect size, the greater the height difference will be. The effect size is typically marked with the letter d in formulas.

Now that we have defined the key concepts, let’s look at how to use power analysis in R and Python to calculate the sample size for an experiment with a simple example. In R, we will utilize a package called pwr, and in Python we will utilize the NumPy and statsmodels.stats.power libraries.

Let’s assume that we would like to create a model of customer behavior. We are interested to know whether there is a difference in the mean price of what our preferred customers and other customers pay at our online shop. How many transactions in each group should we investigate to get the power level of 80%?

R:

library(pwr)
ch <- cohen.ES(test = "t", size = "medium")
print(ch)
test <- pwr.t.test(d = 0.5, power = 0.80, sig.level = 0.05)
print(test)

The model will give us the following result:

     Two-sample t test power calculation
              n = 63.76561
              d = 0.5
      sig.level = 0.05
          power = 0.8
    alternative = two.sided
NOTE: n is number in *each* group

So, we will need a sample of 64 transactions in each group.

Python:

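import numpy as np
from statsmodels.stats.power import TTestIndPower
# Solve for the per-group sample size of a two-sample t-test
analysis = TTestIndPower()
sample_size = analysis.solve_power(effect_size = 0.5, alpha = 0.05, power = 0.8)
# Round up, since we cannot sample a fraction of a transaction
print(int(np.ceil(sample_size)))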

Our Python code will produce the same result as our earlier R code, giving us 64 transactions in each group.
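As a rough cross-check of the result above, the per-group sample size for a two-sided, two-sample comparison can be approximated with the normal distribution as n ≈ 2((z(1 − α/2) + z(power)) / d)². The following is a minimal sketch using only the Python standard library. The function name approx_n_per_group is our own, and because the sketch uses the normal distribution instead of the t distribution, it should be expected to return a slightly smaller value than the pwr and statsmodels calculations.

```python
from math import ceil
from statistics import NormalDist

def approx_n_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided two-sample test."""
    z = NormalDist()                    # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return 2 * ((z_alpha + z_power) / effect_size) ** 2

n = approx_n_per_group(0.5)
print(round(n, 2), ceil(n))
```

Lowering the effect size or raising the required power in this sketch quickly increases the sample size, which matches the relationships described earlier in this section.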

Note

Power analysis is a wide and complex topic, but it’s important to understand the basics, since it is widely utilized in many machine-learning tools. In this chapter, we have only scratched the surface of this topic.

Sampling

Sampling is a method that makes it possible to get information about a population (dataset) based on statistics from a subset of that population (a sample), without having to investigate every individual value. Sampling is particularly useful if a dataset is large and can’t be analyzed in full. In this case, identifying and analyzing a representative sample is important. In some cases, a small sample can be enough to reveal the most important information, but generally, using a larger sample increases the likelihood that the sample represents the data as a whole.

When performing sampling, there are some aspects to consider:

  • Sample goal: A property that you wish to estimate or predict
  • Population: A domain from which observations are made
  • Selection criteria: A method to determine whether an individual value will be accepted as a part of the sample
  • Sample size: The number of data points that will form the final sample data

Sampling methods can be divided into two main categories:

Probability sampling is a technique where every element of the dataset has a known, non-zero chance of being selected (an equal chance in the case of simple random sampling). These methods typically give the best chance of creating a sample that truly represents the population. Examples of probability sampling algorithms are simple random sampling, cluster sampling, systematic sampling, and stratified random sampling.

Non-probability sampling is a method where elements are selected based on non-random criteria, so not every element has a chance of being selected. With these methods, there is a significant risk that the sample is non-representative. Examples of non-probability sampling algorithms are convenience sampling, selective sampling, snowball sampling, and quota sampling.

When using sampling as a methodology for training set creation, it is recommended to utilize the sampling functions available in R or Python instead of selecting values by hand. This will automate the process and produce a sample based on selected algorithms and specifications. In R, we can utilize the built-in sample function, and in Python we can utilize the sample function of the standard random module. Here is a simple example of random sampling with both languages:

R:

dataset <- data.frame(id = 1:20, fact = letters[1:20])
set.seed(123)
sample <- dataset[sample(1:nrow(dataset), size=5), ]

The content of the sample frame will look like this:

   id fact
15 15    o
19 19    s
14 14    n
3   3    c
10 10    j

Python:

import random
random.seed(123)
dataset = [[1,'v'],[5,'b'],[7,'f'],[4,'h'],[0,'l']]
sample = random.sample(dataset, 3)
print(sample)

The resulting sample list will look like the following:

[[1, 'v'], [7, 'f'], [0, 'l']]
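Simple random sampling can under-represent small groups by chance. As an illustration of one of the other probability sampling methods mentioned earlier, here is a minimal sketch of stratified random sampling using only the standard library. The helper stratified_sample and the customer data are hypothetical and serve only to show the idea of drawing the same fraction from every group.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=123):
    """Draw the same fraction of rows from every stratum defined by key(row)."""
    rng = random.Random(seed)  # local generator so the global seed is untouched
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # keep at least one row per stratum
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical data: 80 "other" customers and 20 "preferred" customers
customers = [(i, "preferred" if i % 5 == 0 else "other") for i in range(100)]
picked = stratified_sample(customers, key=lambda row: row[1], fraction=0.1)
print(len(picked), sum(1 for _, segment in picked if segment == "preferred"))
```

With a fraction of 0.1, the sample keeps the 80:20 proportion of the two customer groups, which a purely random draw of 10 rows would not guarantee.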

Note

There is a lot of material covering different sampling techniques and how to use those with R and Python on the internet. Take some time to practice these techniques with simple datasets.

Sampling errors

In all sampling methods, errors are bound to occur. There are two types of sampling errors:

  • Selection bias is introduced when the values that become part of the sample are not selected at random. In this case, the sample is not representative of the dataset that we are looking to analyze.
  • Sampling error is a statistical error that occurs when the selected sample does not represent the entire population of data. In this case, the results of the prediction or model will not match the results that would be obtained from the entire dataset.

Training datasets will always contain a sampling error, since they cannot represent the entire dataset. Sample errors in the context of binary classification can be calculated using the following simplified formula:

$$\text{Sample error} = \frac{\text{False positive} + \text{False negative}}{\text{True positive} + \text{False positive} + \text{True negative} + \text{False negative}}$$

If we have, for example, a dataset containing 45 values and 12 of these are classified incorrectly (as false positives or false negatives), we will get a sample error of 12/45 = 26.67%.
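The 12/45 calculation can be reproduced with a few lines of Python. This is only a sketch: the function name sample_error is our own, and the way the 45 values are split into the four outcome counts is an assumption made for illustration.

```python
def sample_error(tp, fp, tn, fn):
    """Share of misclassified observations among all observations."""
    return (fp + fn) / (tp + fp + tn + fn)

# Hypothetical split of the 45 observations: 12 wrong (5 FP + 7 FN), 33 right
error = sample_error(tp=20, fp=5, tn=13, fn=7)
print(f"{error:.2%}")
```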

The above formula applies only to binary classification. When estimating the population mean (μ) from a sample mean (x̄), the standard error is calculated as follows:

$$SE = \frac{\sigma}{\sqrt{n}}$$

  • SE (Standard Error): The standard error is a measure of the variability or uncertainty in a sample statistic. It quantifies how much the sample statistic is expected to vary from the true population parameter. In other words, it gives you an idea of how reliable or precise your sample estimate is.
  • σ (population standard deviation): This is the standard deviation of the entire population you’re trying to make inferences about. It represents the amount of variability or spread in the population data. In practice, the population standard deviation is often unknown, so you may estimate it using the sample standard deviation (s) when working with sample data.
  • n (sample size): The number of observations or data points in your sample.

Example

We are conducting a survey to estimate the average age (mean) of residents in a small town. We collect a random sample of 50 residents and find the following statistics:

  • Sample mean (x̄): 35 years
  • Sample standard deviation (s): 10 years (an estimate of the population standard deviation)
  • Sample size (n): 50 residents

    $$SE = \frac{10}{\sqrt{50}} \approx 1.41 \text{ years}$$

So, the standard error of the sample mean is approximately 1.41 years. This means that if you were to take multiple random samples of the same size from the population and calculate the mean for each sample, you would expect those sample means to vary around the population mean (estimated here as 35 years) with a standard deviation of roughly 1.41 years.

Standard error is often used to construct confidence intervals. For instance, you might use this standard error to calculate a 95% confidence interval for the average age of residents in the town, which would allow you to estimate the range within which the true population mean age is likely to fall with 95% confidence.
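The standard error and the 95% confidence interval mentioned above can be sketched in Python as follows. The variable names are our own, and the interval uses the normal approximation (a z value of about 1.96); with a sample of 50 and an estimated standard deviation, a t-based interval would be slightly wider.

```python
from math import sqrt
from statistics import NormalDist

sample_mean = 35.0   # years
sample_sd = 10.0     # years, used as an estimate of sigma
n = 50               # sample size

se = sample_sd / sqrt(n)          # standard error of the mean
z = NormalDist().inv_cdf(0.975)   # about 1.96 for a 95% interval
ci_low, ci_high = sample_mean - z * se, sample_mean + z * se
print(round(se, 2), round(ci_low, 2), round(ci_high, 2))
```

Quadrupling the sample size halves the standard error, which is why larger samples give narrower confidence intervals.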

As we can see, sample error, often referred to as “sampling error,” is not represented by a single formula. Instead, it is a concept that reflects the variability or uncertainty in the estimates or measurements made on a sample of data when trying to infer information about a larger population. The specific formula for sampling error depends on the statistic or parameter you are estimating and the characteristics of your data. In practice, you would use statistical software or tools to calculate the standard error for the specific parameter or estimate you are interested in.

Training and test data in machine learning

The preceding methods for defining a sample size will work well if we need to define the amount of needed data without a large collection of historic data covering the phenomenon that we are investigating. In many cases, we have a large dataset and we would like to produce training and test datasets from that historical data. Training datasets are used to train our machine-learning model and test datasets are used to validate the accuracy of our model. Training and test datasets are the key concepts in machine learning.

We can utilize power analysis and sampling to create training and testing datasets, but sometimes there is no need to make a complex analysis if our sample is already created. The training dataset is the biggest subset of the original dataset and will be used to fit the machine-learning model. The test dataset is another subset of original data and is always independent of the training dataset.

Test data should be well organized and contain data for each type of scenario that the model would be facing in the production environment. Usually it is 20–25% of the total original dataset. An exact split can be adjusted based on the requirements of a problem or the dataset characteristics.

Generating a training and testing dataset from an original dataset can also be done using R or Python. Qlik functions can be used to perform this action in load script.
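A minimal sketch of such a split in Python, using only the standard library, could look like the following. The helper split_train_test and the 80/20 ratio are assumptions made for illustration; in practice, a library function or the Qlik load script would typically be used.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into training and test lists."""
    shuffled = list(rows)                  # copy, so the original order is kept
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(100))  # stand-in for 100 historical observations
train, test = split_train_test(records)
print(len(train), len(test))
```

Shuffling before splitting matters when the original data is ordered, for example by date or customer, since otherwise the test set would only cover one part of the data.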

Now that we have investigated some of the concepts to define a good sample, we can focus on the concepts used to analyze model performance and reliability. These concepts are important, since using these techniques allows us to develop our model further and make sure that it gives proper results.

 

Concepts to analyze model performance and reliability

Analyzing the performance and reliability of our machine-learning model is an important development step and should be done before implementing the model to production. There are several metrics that you can use to analyze the performance and reliability of a machine learning model, depending on the specific task and problem you are trying to solve. In this section, we will cover some of these techniques, focusing on ones that Qlik tools are using.

Regression model scoring

The following concepts can be used to score and verify regression models. Regression models predict outcomes as a number, indicating the model’s best estimate of the target variable. We will learn more about regression models in Chapter 2.

R² (R-squared)

R-squared is a statistical measure that represents the proportion of the variance in a dependent variable that is explained by an independent variable (or variables) in a regression model. In other words, it measures the goodness of fit of a regression model to the data.

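R-squared normally ranges from 0 to 1, where 0 indicates that the model does not explain any of the variability in the dependent variable, and 1 indicates that the model perfectly explains all the variability in the dependent variable. On data that was not used to fit the model, it can even be negative, which means that the model performs worse than simply predicting the mean.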

R-squared is an important measure of the quality of a regression model. A high R-squared value indicates that the model fits the data well and that the independent variable(s) have a strong relationship with the dependent variable. A low R-squared value indicates that the model does not fit the data well and that the independent variable(s) do not have a strong relationship with the dependent variable. However, it is important to note that a high R-squared value does not necessarily mean that the model is the best possible model, so other factors such as overfitting should also be taken into consideration when evaluating the performance of a model. R-squared is often used together with other metrics and it should be interpreted in the context of the problem. The formula for R-squared is the following:

$$R^2 = \frac{\text{Variance explained by the model}}{\text{Total variance}}$$

R-squared has some limitations. It cannot be used to check whether the predictions are biased, and on its own it doesn’t tell us whether the regression model has an adequate fit. Bias refers to systematic errors in predictions. To check for bias, you should analyze residuals (differences between predicted and observed values) or use a bias-specific metric such as Mean Bias Deviation (MBD); Mean Absolute Error (MAE) complements these by showing the typical size of the errors. R-squared measures explained variance, not bias.

Sometimes it is better to utilize adjusted R-squared. Adjusted R-squared is a modified version of the standard R-squared used in regression analysis. We can use adjusted R-squared when dealing with multiple predictors to assess model fit, control overfitting, compare models with different predictors, and aid in feature selection. It accounts for the number of predictors, penalizing unnecessary complexity. However, it should be used alongside other evaluation metrics and domain knowledge for a comprehensive model assessment.
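To make these two measures concrete, here is a minimal Python sketch. It computes R-squared in the common form 1 − (residual sum of squares / total sum of squares), which agrees with the explained-variance ratio for a least-squares linear model with an intercept, and adjusted R-squared as 1 − (1 − R²)(n − 1) / (n − p − 1), where n is the number of observations and p the number of predictors. The function names and the data are our own assumptions.

```python
def r_squared(actual, predicted):
    """R-squared as 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

actual = [3.0, 5.0, 7.0, 9.0, 11.0]
predicted = [2.8, 5.3, 6.9, 9.4, 10.6]  # hypothetical model output
r2 = r_squared(actual, predicted)
adj = adjusted_r_squared(r2, n_samples=len(actual), n_predictors=2)
print(round(r2, 4), round(adj, 4))
```

The adjusted value is lower than the plain R-squared, and the gap grows as more predictors are added without a matching improvement in fit.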

Root mean squared error (RMSE), mean absolute error (MAE), and mean squared error (MSE)

Root mean squared error is the typical difference that can be expected between the predicted and actual value. It is the standard deviation of the residuals (prediction errors) and tells us how concentrated the data is around the “line of best fit.” It is a standard way to measure the error of a model when predicting quantitative data. RMSE is always measured in the same unit as the target value.

As an example of RMSE, if we have a model that predicts house value in a certain area and we get an RMSE of 20,000, it means that the predicted value typically differs from the actual value by about 20,000 USD.

Mean absolute error is defined as the average of the absolute prediction errors over all data points. In MAE, different errors are not weighted, so the score increases linearly with the increase in error. MAE is never negative since we are using the absolute value of each error. MAE is useful when the errors are symmetrically distributed and there are no significant outliers.

Mean squared error is the average of the squared differences between the predicted and actual values. Squaring eliminates the negative values and ensures that MSE is always positive or 0. The smaller the MSE, the closer the predictions are to the actual values. RMSE can be calculated from MSE, since RMSE is the square root of MSE.

When to use the above metrics in practice

MAE is robust to outliers and provides a straightforward interpretation of the average error magnitude.

MSE penalizes large errors more heavily, which makes it sensitive to outliers. It is suitable when large errors are particularly undesirable and should dominate the error metric.

RMSE is similar to MSE but provides a more interpretable error metric in the same units as the target variable.

The choice between these metrics should align with your specific problem and objectives. It’s also good practice to consider the nature of your data and the impact of outliers when selecting an error metric. Additionally, you can use these metrics in conjunction with other evaluation techniques to get a comprehensive view of your model’s performance.
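The different behavior of the three metrics can be seen in a small Python sketch. The data is hypothetical: both sets of predictions have the same total absolute error, but in the second set the error is concentrated in a single data point.

```python
from math import sqrt

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    return sqrt(mse(actual, predicted))

actual = [100, 200, 300, 400]
close = [110, 190, 310, 390]     # every prediction is off by 10
outlier = [100, 200, 300, 440]   # one prediction is off by 40

for name, pred in (("close", close), ("outlier", outlier)):
    print(name, mae(actual, pred), mse(actual, pred), rmse(actual, pred))
```

MAE is the same for both sets of predictions, while MSE and RMSE are clearly higher for the set with the single large error.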

Multiclass classification scoring and binary classification scoring

The following concepts can be used to score and verify multiclass and binary models. Binary classification models distribute outcomes into two categories, typically denoted as Yes or No. Multiclass classification models are similar, but there are more than two categories as an outcome. We will learn more about both models in Chapter 2.

Recall

Recall measures the percentage of correctly classified positive instances over the total number of actual positive instances. In other words, recall represents the ability of a model to correctly capture all positive instances.

Recall is calculated as follows:

$$\text{Recall} = \frac{\text{True positive}}{\text{True positive} + \text{False negative}}$$

A high recall indicates that the model is able to accurately capture all positive instances and has a low rate of false negatives. On the other hand, a low recall indicates that the model is missing many positive instances, resulting in a high rate of false negatives.

Precision

Precision measures the percentage of correctly classified positive instances over the total number of predicted positive instances. In other words, precision represents the ability of the model to correctly identify positive instances.

Precision is calculated as follows:

$$\text{Precision} = \frac{\text{True positive}}{\text{True positive} + \text{False positive}}$$

A high precision indicates that the model is able to accurately identify positive instances and has a low rate of false positives. On the other hand, a low precision indicates that the model is incorrectly classifying many instances as positive, resulting in a high rate of false positives.

Precision is particularly useful in situations where false positives are costly or undesirable, such as in medical diagnosis or fraud detection. Precision should be used in conjunction with other metrics, such as recall and F1 score, to get a more complete picture of the model’s performance.

F1 score

The F1 score is defined as the harmonic mean of precision and recall, and it ranges from 0 to 1, with higher values indicating better performance. The formula for F1 score is as follows:

$$\text{F1 score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

The F1 score gives equal importance to both precision and recall, making it a useful metric for evaluating models when the distribution of positive and negative instances is uneven. A high F1 score indicates that the model has a good balance between precision and recall, meaning that it finds most of the positive instances without producing many false positives. Note that true negatives are not part of the calculation.

The F1 score is sensitive to class imbalance. When one class greatly outnumbers the other, correctly identifying the minority (positive) class becomes harder, and the F1 score is likely to be lower. Being aware of this connection can assist in interpreting F1 scores within the context of particular data distributions and problem domains. If the F1 value is high, both precision and recall are high as well, and if it is low, there is a need for further analysis to find out which of the two is causing the problem.
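The three metrics above can be computed directly from the counts of true positives, false positives, and false negatives. The following Python sketch uses hypothetical counts; returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical counts from a test set
tp, fp, fn = 40, 10, 20
print(precision(tp, fp), recall(tp, fn), f1_score(tp, fp, fn))
```

Because F1 is a harmonic mean, it lies between precision and recall but closer to the lower of the two, so a model cannot reach a high F1 score by excelling at only one of them.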

Accuracy

Accuracy measures the percentage of correctly classified instances over the total number of instances. In other words, accuracy represents the ability of the model to correctly classify both positive and negative instances.

Accuracy is calculated in the following way:

$$\text{Accuracy} = \frac{\text{True positive} + \text{True negative}}{\text{True positive} + \text{False positive} + \text{True negative} + \text{False negative}}$$

A high accuracy indicates that the model is able to accurately classify both positive and negative instances and has a low rate of false positives and false negatives. However, accuracy can be misleading in situations where the distribution of positive and negative instances is uneven. In such cases, other metrics such as precision, recall, and F1 score may provide a more accurate representation of the model’s performance.

Accuracy can mislead in imbalanced datasets where one class vastly outnumbers the others. This is because accuracy doesn’t consider the class distribution and can be high even if the model predicts the majority class exclusively. To address this, use metrics such as precision, recall, F1-score, AUC-ROC, and AUC-PR, which provide a more accurate evaluation of model performance by focusing on the correct identification of the minority class, which is often the class of interest in such datasets.

Example scenario

Suppose we are developing a machine-learning model to detect a rare disease that occurs in only 1% of the population. We collect a dataset of 10,000 patient records:

  • 100 patients have the rare disease (positive class)
  • 9,900 patients do not have the disease (negative class)

Now, let’s say our model predicts all 10,000 patients as not having the disease. Here’s what happens:

  • True Positives (patients with the disease correctly predicted as having it): 0
  • False Positives (patients without the disease incorrectly predicted as having it): 0
  • True Negatives (patients without the disease correctly predicted as not having it): 9,900
  • False Negatives (patients with the disease incorrectly predicted as not having it): 100

Using accuracy as our evaluation metric produces the following result:

$$\text{Accuracy} = \frac{\text{True positive} + \text{True negative}}{\text{Total}} = \frac{9900}{10000} = 99\%$$

Our model appears to have an impressive 99% accuracy, which might lead to the misleading conclusion that it’s performing exceptionally well. However, it has completely failed to detect any cases of the rare disease (True Positives = 0), which is the most critical aspect of the problem.

In this example, accuracy doesn’t provide an accurate picture of the model’s performance because it doesn’t account for the severe class imbalance and the importance of correctly identifying the minority class (patients with the disease).
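The numbers from this scenario can be checked with a short Python sketch. Reporting precision as None when no instance is predicted as positive is a choice made for this example, since the ratio is undefined (0/0) in that case.

```python
tp, fp, tn, fn = 0, 0, 9900, 100  # model predicts "no disease" for everyone

accuracy = (tp + tn) / (tp + fp + tn + fn)
recall = tp / (tp + fn)
# Precision is 0/0 here because nothing was predicted positive; report it as None
precision = tp / (tp + fp) if (tp + fp) else None

print(accuracy, recall, precision)
```

Accuracy is 99% while recall is 0, which shows in two numbers why accuracy alone is not enough for this dataset.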

Confusion matrix

A confusion matrix is a table used to evaluate the performance of a classification model. It displays the number of true positive, false positive, true negative, and false negative predictions made by the model for a set of test data.

The four elements in the confusion matrix represent the following:

  • True positives (TP) are actual true values that were correctly predicted as true
  • False positives (FP) are actual false values that were incorrectly predicted as true
  • False negatives (FN) are actual true values that were incorrectly predicted as false
  • True negatives (TN) are actual false values that were correctly predicted as false

Qlik AutoML presents a confusion matrix as part of the experiment view. Below the numbers in each quadrant, you can also see percentage values for the metrics recall (TP), fallout (FP), miss rate (FN), and specificity (TN).

An example of the confusion matrix of Qlik AutoML can be seen in the following figure:

Figure 1.4: Confusion matrix as seen in Qlik AutoML

By analyzing the confusion matrix, we can calculate various performance metrics such as accuracy, precision, recall, and F1 score, which can help us understand how well the model is performing on the test data. The confusion matrix can also help us identify any patterns or biases in the model’s predictions and adjust the model accordingly.

Matthews Correlation Coefficient (MCC)

The Matthews Correlation Coefficient metric can be used to evaluate the performance of a binary classification model, particularly when dealing with imbalanced data.

MCC takes into account all four elements of the confusion matrix (true positives, false positives, true negatives, and false negatives) to provide a measure of the quality of a binary classifier’s predictions. It ranges between -1 and +1, with a value of +1 indicating perfect classification performance, 0 indicating a random classification, and -1 indicating complete disagreement between predicted and actual values.

The formula for MCC is as follows:

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}}$$

MCC is particularly useful when dealing with imbalanced datasets where the number of positive and negative instances is not equal. It provides a better measure of classification performance than accuracy in such cases, since accuracy can be biased toward the majority class.
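The following Python sketch computes MCC from the four elements of the confusion matrix. The counts are hypothetical, and returning 0.0 when the denominator is zero is a common convention that we assume here, not part of the formula itself.

```python
from math import sqrt

def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; returns 0.0 when the denominator is zero."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

print(mcc(tp=40, fp=10, tn=930, fn=20))   # hypothetical imbalanced test set
print(mcc(tp=0, fp=0, tn=9900, fn=100))   # always predicting the majority class
```

The second call describes a model that predicts the majority class for every instance. Its accuracy would be 99%, but its MCC is 0, which reflects that the predictions carry no information about the actual class.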

AUC and ROC curve

The ROC (Receiver Operating Characteristic) curve is a graphical representation of the performance of a binary classification model that allows us to evaluate and compare different models based on their ability to discriminate between positive and negative classes. AUC describes the area under the curve.

An ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various classification thresholds. The TPR is the ratio of true positive predictions to the total number of actual positive instances, while the FPR is the ratio of false positive predictions to the total number of actual negative instances.

By varying the classification threshold, we can obtain different TPR and FPR pairs and plot them on the ROC curve. The area under the ROC curve (AUC-ROC) is used as a performance metric for binary classification models, with higher AUC-ROC indicating better performance.

A perfect classifier would have an AUC-ROC of 1.0, indicating that it has a high TPR and low FPR across all possible classification thresholds. A random classifier would have an AUC-ROC of 0.5, indicating that its TPR and FPR are equal and its performance is no better than chance.

The ROC curve and AUC-ROC are useful for evaluating and comparing binary classification models, especially when the positive and negative classes are imbalanced or when the cost of false positive and false negative errors is different.
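To illustrate how the points of an ROC curve and the AUC-ROC are obtained, here is a minimal Python sketch with hypothetical labels and scores. The function roc_point returns one (FPR, TPR) pair for a given threshold. The function auc_roc uses the fact that the area under the ROC curve equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one (ties count as one half). Both function names are our own.

```python
def roc_point(labels, scores, threshold):
    """Return (FPR, TPR) when scores >= threshold are classified as positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives

def auc_roc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 1, 0, 0, 0]                    # hypothetical true classes
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]   # hypothetical model scores
print([roc_point(labels, scores, t) for t in (0.75, 0.5, 0.25)])
print(auc_roc(labels, scores))
```

Lowering the threshold moves the point up and to the right along the curve: more positives are found, at the cost of more false positives.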

The following figure represents an ROC curve as seen in Qlik AutoML. The figure shows a pretty good ROC curve (the closer the curve is to the top-left corner, and therefore the closer the AUC is to 1, the better). The dotted line represents 50:50 random chance.

Figure 1.5: ROC curve for a good model in Qlik AutoML

Threshold

In binary classification, a threshold is a value that is used to decide whether an instance should be classified as positive or negative by a model.

When a model makes a prediction, it generates a probability score between 0 and 1 that represents the likelihood of an instance belonging to the positive class. If the score is above a certain threshold value, the instance is classified as positive, and if it is below the threshold, it is classified as negative.

The choice of threshold can significantly impact the performance of a classification model. If the threshold is set too high, the model may miss many positive instances, leading to a low recall and high precision. Conversely, if the threshold is set too low, the model may classify many negative instances as positive, leading to a high recall and low precision.

Therefore, selecting an appropriate threshold for a classification model is important in achieving the desired balance between precision and recall. The optimal threshold depends on the specific application and the cost of false positive and false negative errors.

Qlik AutoML computes the precision and recall for hundreds of possible threshold values from 0 to 1. A threshold achieving the highest F1 score is chosen. By selecting a threshold, the produced predictions are more robust for imbalanced datasets.
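The idea of scanning thresholds and keeping the one with the highest F1 score can be sketched in Python as follows. This is an illustration of the principle only, not Qlik AutoML’s actual implementation; the function names, the number of candidate thresholds, and the data are assumptions.

```python
def f1_at(labels, scores, threshold):
    """F1 score when scores >= threshold are classified as positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(labels, scores, steps=100):
    """Try evenly spaced thresholds in [0, 1] and keep the one with the highest F1."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda t: f1_at(labels, scores, t))

labels = [1, 1, 1, 0, 1, 0, 0, 0]                    # hypothetical true classes
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]   # hypothetical model scores
t = best_threshold(labels, scores)
print(t, f1_at(labels, scores, t))
```

For this data, the default threshold of 0.5 happens to fall in the best range, but with imbalanced data the F1-optimal threshold is often far from 0.5.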

Feature importance

Feature importance is a measure of the contribution of each input variable (feature) in a model to the output variable (prediction). It is a way to understand which features have the most impact on the model’s prediction, and which features can be ignored or removed without significantly affecting the model’s performance.

Feature importance can be computed using various methods, depending on the type of model used. Some common methods for calculating feature importance include the following:

  • Permutation importance: This method involves shuffling the values of each feature in the test data, one at a time, and measuring the impact on the model’s performance. The features that cause the largest drop in performance when shuffled are considered more important.
  • Feature importance from tree-based models: In decision tree-based models such as Random Forest or Gradient Boosting, feature importance can be calculated based on how much each feature decreases the impurity of the tree. The features that reduce impurity the most are considered more important.
  • Coefficient magnitude: In linear models such as Linear Regression or Logistic Regression, feature importance can be calculated based on the magnitude of the coefficients assigned to each feature. Provided that the features are on comparable scales (for example, after standardization), the features with larger coefficients are considered more important.

Feature importance can help in understanding the relationship between the input variables and the model’s prediction and can guide feature selection and engineering efforts to improve the model’s performance. It can also provide insights into the underlying problem and the data being used and can help in identifying potential biases or data quality issues.
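The permutation importance method described above can be sketched in a few lines of Python. The model here is a hypothetical, already fitted function that depends strongly on the first feature, weakly on the second, and not at all on the third. Importance is measured as the increase in mean squared error after shuffling one feature column; all names and data are assumptions made for illustration.

```python
import random

def model(row):
    """Hypothetical fitted model: depends strongly on x0, weakly on x1, not on x2."""
    return 3.0 * row[0] + 0.5 * row[1]

def mse(rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(rows, targets, n_features, seed=0):
    """Increase in MSE after shuffling each feature column, one at a time."""
    rng = random.Random(seed)
    baseline = mse(rows, targets)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mse(shuffled, targets) - baseline)
    return importances

rng = random.Random(1)
rows = [[rng.random(), rng.random(), rng.random()] for _ in range(200)]
targets = [model(r) for r in rows]  # noise-free targets keep the example simple
importances = permutation_importance(rows, targets, n_features=3)
print(importances)
```

Shuffling the unused third feature does not change the error at all, which is the signal that such a feature could be removed without affecting the model.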

In Qlik AutoML, the permutation importance of each feature is represented as a graph. This can be used to estimate feature importance. Another method that is visible in AutoML is SHAP importance values. The next section will cover the principles of SHAP importance values.

SHAP values

SHAP (SHapley Additive exPlanations) values are a technique for interpreting the output of machine-learning models by assigning an importance score to each input feature.

SHAP values are based on game theory and the concept of Shapley values, which provide a way to fairly distribute the value of a cooperative game among its players. In the context of machine learning, the game is the prediction task, and the players are the input features. The SHAP values represent the contribution of each feature to the difference between a specific prediction and the expected value of the output variable.

The SHAP values approach involves computing the contribution of each feature by evaluating the model’s output for all possible combinations of features, with and without the feature of interest. The contribution of the feature is the difference in the model’s output between the two cases averaged over all possible combinations.
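For a model with only a few features, this averaging over all combinations can be written out directly. The following Python sketch computes exact Shapley values for a hypothetical three-feature model. As a simplifying assumption, a feature that is "absent" from a combination is replaced by a fixed baseline value; SHAP implementations instead average over background data and use approximations to stay tractable.

```python
from itertools import combinations
from math import factorial

def model(x):
    """Hypothetical model with an interaction between x[1] and x[2]."""
    return 2.0 * x[0] + x[1] * x[2]

def shapley_values(instance, baseline):
    """Exact Shapley values; features outside a coalition take their baseline value."""
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return model(x)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                total += weight * (value(set(subset) | {i}) - value(set(subset)))
        phi.append(total)
    return phi

instance, baseline = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
phi = shapley_values(instance, baseline)
print(phi, sum(phi), model(instance) - model(baseline))
```

The contributions add up to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is shared equally between the two features involved. The number of combinations doubles with every added feature, which is why exact computation is only feasible for small models.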

SHAP values provide a more nuanced understanding of the relationship between the input features and the model’s output than other feature importance measures, as they account for interactions between features and the potential correlation between them.

SHAP values can be visualized using a SHAP plot, which shows the contribution of each feature to the model’s output for a specific prediction. This plot can help in understanding the relative importance of each feature and how they are influencing the model’s prediction.

Difference between SHAP and permutation importance

Permutation importance and SHAP are alternative ways of measuring feature importance. The main difference between the two is that permutation importance is based on the decrease in model performance. It is a simpler and more computationally efficient approach to compute feature importance but may not accurately reflect the true importance of features in complex models.

SHAP importance is based on the magnitude of feature attributions. SHAP values provide a more nuanced understanding of feature importance but can be computationally expensive and may not be feasible for very large datasets or complex models.

Permutation importance can be used to do the following:

  • Understand which features to keep and which to abandon
  • Understand the feature importance for model accuracy
  • Understand whether there is data leakage, meaning that information from outside the training dataset is used to create or evaluate a model, resulting in over-optimistic performance estimates or incorrect predictions

SHAP importance can be used to do the following:

  • Understand which features have the greatest influence on the predicted outcome
  • Understand how the different values of a feature affect the model prediction
  • Understand which rows in the dataset are the most influential

The following figure shows an example of a permutation importance graph and a SHAP importance graph, as seen in Qlik AutoML:

Figure 1.6: Permutation importance and SHAP importance graphs


Note

We will utilize both permutation importance and SHAP importance in our hands-on examples later in this book.

 

Summary

In this chapter, we were first introduced to the Qlik tools for machine learning. We discovered the key features of the platform and how its different components can be utilized. Understanding the key components is important since we will be utilizing Insight Advisor, AutoML, and Advanced Analytics Integration later in this book.

We also covered some of the key concepts of statistics. Understanding the basics of the underlying mathematics is crucial to understanding the models. We only scratched the surface of the mathematics, but this should be enough to familiarize you with the terminology. We also touched on the important topic of samples and sample size. When creating a model, we need to train it with training data. Determining a reasonable sample size will help us to get an accurate model without wasting resources.

At the end of this chapter, we became familiar with some of the techniques used to validate a model’s performance and reliability. These are important concepts, since Qlik tools use these methods to report the metrics of a model.

In the next chapter, we will augment our background knowledge by getting familiar with some of the most common machine-learning algorithms. These algorithms will be used in later parts of this book.
