Introduction to Machine Learning with Qlik
Machine learning and artificial intelligence are two of the most powerful technology trends of the 21st century. Adoption of these technologies is growing rapidly, as the need for faster insights and forecasts has become crucial for companies. Qlik is a leading vendor in the analytics space and has invested heavily in machine learning and AI tools.
This first chapter will introduce the different machine learning tools in the Qlik ecosystem and provide basic information about the statistical models and principles behind these tools. It will also cover the concepts of correct sample size and how to analyze model performance and reliability.
Here is what we will cover in this first chapter:
- An overview of the Qlik tools and platform
- The basic statistical concepts of machine learning
- Proper sample size and the defining factors of a sample
- How to evaluate model performance and reliability
Introduction to Qlik tools
Qlik Sense is a leading data analytics and business intelligence platform and contains many tools and features for data analytics relating to machine learning. In this chapter, we will take a closer look at the key features of the Qlik platform.
Machine learning and AI capabilities on the Qlik platform can be divided into three different components:
- Insight Advisor
- Qlik AutoML
- Advanced Analytics Integration
Qlik Insight Advisor is a feature of Qlik Sense that uses natural language processing (NLP) and machine learning to help users explore and analyze data more effectively. It allows users to ask questions about their data in natural language and to receive insights and recommendations in real time. It also auto-generates advanced analytics and visualizations and assists with analytics creation and data preparation.
Insight Advisor utilizes a combination of Qlik’s associative engine and augmented intelligence engine and supports a wide range of use cases, as seen in the following figure:
Figure 1.1: Qlik Insight Advisor and different scenarios
Did you know?
The Qlik associative engine is the core technology that powers the Qlik data analytics and business intelligence platform. It is a powerful in-memory engine that uses an associative data model, which allows users to explore data in a way that is more intuitive and natural than traditional query-based tools.
Instead of pre-defined queries or data models, the engine automatically associates data across multiple tables and data sources based on common fields or attributes and uses a patented indexing technology that stores all the data in memory, enabling real-time analysis and exploration of even the largest datasets. It is a powerful and innovative technology that underpins the entire Qlik platform.
Insight Advisor has the following key features:
- Advanced insight generation: Insight Advisor provides a way to surface new and hidden insights. It uses AI-generated analyses that are delivered in multiple forms. Users can select from a full range of analysis types, which are auto-generated. These types include visualizations, narrative insights, and entire dashboards. Advanced analytics is also supported, and Insight Advisor can generate comparison, ranking, trending, clustering, geographical analysis, time series forecasts, and more.
- Search-based visual discovery: Insight Advisor auto-generates the most relevant and impactful visualizations for the users, based on natural language queries. It provides a set of charts that users can edit and fine-tune before adding to the dashboard. It is context-aware and reflects the selections with generated visualizations. It also suggests the most significant data relationships to explore further.
- Conversational analytics: Conversational analytics in Insight Advisor allows users to interact using natural language. Insight Advisor Chat offers a fully conversational analytics experience for the entire Qlik platform. It understands user intent and delivers additional insights for deeper understanding.
- Accelerated creation and data preparation: Accelerated creation and data preparation helps users to create analytics using a traditional build process. It gives recommendations about associations and relationships in data. It also gives chart suggestions and renders the best types of visualizations for each use case, which allows non-technical users to get the most out of the analyzed data. Part of the data preparation also involves an intelligent profiling that provides descriptive statistics about the data.
A hands-on example with Insight Advisor can be found in Chapter 9, where you will be given a practical example of the most important functionalities in action.
Qlik AutoML is an automated machine learning tool that makes AI-generated machine learning models and predictive analytics available for all users. It allows users to easily generate machine learning models, make predictions, and plan decisions using an intuitive, code-free user interface.
AutoML connects and profiles data, identifies key drivers in the dataset, and generates and refines models. It allows users to create future predictions and test what-if scenarios. Results are returned with prediction-influencer data (Shapley values) at the record level, which allows users to understand why predictions were made. This is critical to take the correct actions based on the outcome.
Predictive data can be published in Qlik Sense for further analysis and models can be integrated using Advanced Analytics Integration for real-time exploratory analysis.
Using AutoML is simple and does not require comprehensive data science skills. Users must first select the target field and then AutoML will run through various steps, as seen in the following figure:
Figure 1.2: The AutoML process flow
With the model established and trained, AutoML lets users make predictions on current datasets. Deployed models can be used both from Qlik tools and other analytics tools. AutoML also provides a REST API to consume the deployed models.
More information about AutoML, including hands-on examples, can be found in Chapter 8.
Advanced Analytics Integration
Advanced Analytics Integration is the ability to integrate advanced analytics and machine learning models directly into the Qlik data analytics platform. This integration allows users to combine the power of advanced analytics with the data exploration and visualization capabilities of Qlik to gain deeper insights from their data.
Advanced Analytics Integration is based on open APIs that provide direct, engine-level integration between Qlik’s Associative Engine and third-party data science tools. Data is exchanged and calculations are performed in real time as the user interacts with the software. Only relevant data is passed from the Associative Engine to third-party tools, based on user selections and context. The workflow is explained in the following figure:
Figure 1.3: Advanced analytics integration dataflow
Advanced analytics integration can be used with any external calculation engine, but native connectivity is provided for Amazon SageMaker, Amazon Comprehend, Azure ML, DataRobot, and custom models made with R and Python. Qlik AutoML can also utilize advanced analytics integration.
More information, including practical examples about advanced analytics integration, can be found in Chapter 7. Installing the needed components for the on-premises environment is described in Chapter 5.
Basic statistical concepts with Qlik solutions
Now that we have been introduced to Qlik tools, we will explore some of the statistical concepts that are used with them. Statistical principles play a crucial role in the development of machine-learning algorithms. These principles provide the mathematical framework for analyzing and modeling data, making predictions, and improving the accuracy of machine-learning models over time. In this section, we will become familiar with some of the key concepts that will be needed when building machine-learning solutions.
Types of data
Categorical data typically defines a group or category using a name or a label. Each element of a categorical dataset is assigned to only one category, and the categories are mutually exclusive. Categorical data can be further divided into nominal data and ordinal data. Nominal data names or labels categories without any inherent order. Ordinal data is constructed from elements with rankings, orders, or rating scales; it can be ordered or counted but not measured. Some machine-learning algorithms can’t handle categorical variables unless these are converted (encoded) to numerical values.
Numerical data can be divided into discrete data and continuous data. Discrete data is countable and formed using natural numbers, for example, age or the number of employees in a company. Continuous data can take any value within a range; examples include a person’s height or a student’s score. One type of data to pay attention to is datetime information. Dates and times are typically useful in machine-learning models but require some work to turn them into numerical data.
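As noted, many algorithms require categorical values to be encoded as numbers first. Here is a minimal sketch of one-hot encoding in pure Python (the color values are illustrative; in practice a library such as pandas or scikit-learn would be used):

```python
def one_hot_encode(values):
    """One-hot encode a list of categorical values.

    Returns the sorted list of categories and, for each input value,
    a 0/1 vector with a single 1 marking its category.
    """
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

colors = ["red", "green", "blue", "green"]
cats, encoded = one_hot_encode(colors)
print(cats)     # ['blue', 'green', 'red']
print(encoded)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Each row now contains only numerical values, and the mutual exclusivity of categories is preserved: exactly one position per row is 1.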
Mean, median, and mode
The mean is the average value of a dataset and is calculated by dividing the sum of all data points by the number of data points:

mean = Sum of all data points / Number of data points
The following is a simple example to calculate the mean of a set of data points:
X = [5,15,30,45,50]
x̅ = (5 + 15 + 30 + 45 + 50) / 5 = 29
The median is the middle value of the sorted dataset. Using the dataset in the previous example, our median is 30. The main advantage of the median over the mean is that the median is less affected by outliers. If there is a high chance for outliers, it’s better to use the median instead of the mean. If we have an even number of data points in our dataset, the median is the average of two middle points.
The mode represents the most common value in a dataset. It is mostly used when there is a need to understand clustering or, for example, encoded categorical data. Calculating the mode is quite simple. First, we need to order all values and count how many times each value appears in a set. The value that appears the most is the mode. Here is a simple example:
X = [1,4,4,5,7,9]
The mode = 4 since it appears two times and all other values appear only one time. A dataset can also have multiple modes (multimodal dataset). In this case, two or more values occur with the highest frequency.
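All three measures can be computed directly with Python’s standard statistics module:

```python
import statistics

X = [5, 15, 30, 45, 50]
print(statistics.mean(X))    # 29
print(statistics.median(X))  # 30

# Median with an even number of data points: average of the two middle values
print(statistics.median([5, 15, 30, 45]))  # 22.5

# Mode: the most frequent value in the dataset
print(statistics.mode([1, 4, 4, 5, 7, 9]))  # 4
```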
Variance measures how much each data point deviates from the mean of the dataset. A low variance indicates that the data points are closely clustered around the mean, while a high variance indicates that the data points are more widely spread out from the mean.
The formula for variance is as follows:
σ² = Σ(xᵢ − x̅)² / (n − 1)
where σ² is the variance of the dataset, n is the number of data points in the set, xᵢ is each individual data point, x̅ is the mean, and Σ denotes the sum of the squared differences between each data point and the mean. The square root of the variance is the standard deviation.
Variance is an important concept in statistics and machine learning, as it is used in the calculation of many other statistical measures, including standard deviation and covariance. It is also commonly used to evaluate the performance of models and to compare different datasets.
As a worked example, suppose we have a stock that returns 15% in year 1, 25% in year 2, and -10% in year 3. The mean of the returns is 10%. The differences between each year’s return and the mean are 5%, 15%, and -20%. Squaring these gives 0.25%, 2.25%, and 4%. Adding these together gives 6.5%. Dividing by 2 (3 observations − 1) gives a variance of 3.25%.
Standard deviation is a statistical measure that quantifies the amount of variation or dispersion in a set of data. It measures how much the individual data points deviate from the mean of the dataset.
A low standard deviation indicates that the data points are close to the mean, while a high standard deviation indicates that the data points are more spread out from the mean.
The formula for standard deviation is as follows:
σ = √( Σ(xᵢ − x̅)² / (n − 1) )
Continuing from our previous example, we got the variance of 3.25% for our stock. Taking the square root of the variance yields a standard deviation of 18%.
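The stock-return example can be verified with Python’s statistics module (returns expressed as decimals):

```python
import statistics

# Yearly stock returns from the worked example: 15%, 25%, -10%
returns = [0.15, 0.25, -0.10]

mean = statistics.mean(returns)          # 0.10 -> 10%
variance = statistics.variance(returns)  # sample variance, divides by n - 1
stdev = statistics.stdev(returns)        # square root of the variance

print(round(mean, 4))      # 0.1
print(round(variance, 4))  # 0.0325 -> 3.25%
print(round(stdev, 4))     # 0.1803 -> roughly 18%
```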
Standardization, or z-score normalization, is the process of rescaling different variables to the same scale. This method allows comparison of scores between different types of variables. The z-score expresses a value as a number of standard deviations from the mean. We can calculate the z-score using the following formula:
z = (x − x̅) / σ
In the formula, x is the observed value, x̅ is the mean, and σ is the standard deviation of the data.
Basically, the z-score describes how many standard deviations away a specific data point is from the mean. If the z-score of a data point is high, it indicates that the data point is most likely an outlier. Z-score normalization is one of the most popular feature-scaling techniques in data science and is an important preprocessing step. Many machine-learning algorithms attempt to find trends in data and compare features of data points. It is problematic if features are on different scales, which is why we need standardization.
Standardized datasets will have a mean of 0 and standard deviation of 1, but there are no specific boundaries for maximum and minimum values.
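A minimal sketch of z-score standardization in Python, using the sample standard deviation (the score values are illustrative):

```python
import statistics

def standardize(data):
    """Scale data to zero mean and unit standard deviation (z-scores)."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation
    return [(x - mean) / sd for x in data]

scores = [10, 20, 30, 40, 50]
z = standardize(scores)
print([round(v, 3) for v in z])   # [-1.265, -0.632, 0.0, 0.632, 1.265]

# A standardized dataset has mean 0 and standard deviation 1
print(round(statistics.mean(z), 3))   # 0.0
print(round(statistics.stdev(z), 3))  # 1.0
```

Note that although the mean becomes 0 and the standard deviation 1, the minimum and maximum of the standardized values are not bounded.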
Correlation measures the strength and direction of the relationship between two variables. There are two types of correlation: positive and negative. Positive correlation means that the two variables move in the same direction, while negative correlation means that the two variables move in opposite directions. A correlation of 0 indicates that there is no relationship between the variables.
The most used measure of correlation is the Pearson correlation coefficient, which ranges from -1 to 1. A value of -1 indicates a perfect negative correlation, a value of 0 indicates no correlation, and a value of 1 indicates a perfect positive correlation.
The Pearson correlation coefficient can be used when the relationship of variables is linear and both variables are quantitative and normally distributed. There should be no outliers in the dataset.
Correlation can be calculated using the cor() function in R or the NumPy library (for example, the corrcoef() function) in Python.
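For illustration, here is a minimal pure-Python version of the Pearson coefficient (in practice, you would use cor() or NumPy as noted above; the data pairs are illustrative):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    # Sum of co-deviations, and the magnitudes of each variable's deviations
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(round(pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]), 6))   # 1.0  (perfect positive)
print(round(pearson([1, 2, 3, 4, 5], [10, 8, 6, 4, 2]), 6))   # -1.0 (perfect negative)
```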
Machine learning also relies on several concepts from probability theory:
- Random variables: A variable whose value is determined by chance. Random variables can be discrete or continuous.
- Probability distribution: A function that describes the likelihood of different values for a random variable. Common probability distributions include the normal distribution, the binomial distribution, and the Poisson distribution.
- Bayes’ theorem: A fundamental theorem in probability theory that describes the relationship between conditional probabilities. Bayes’ theorem is used in many machine-learning algorithms, including naive Bayes classifiers and Bayesian networks.
- Conditional probability: The probability of an event occurring given that another event has occurred. Conditional probability is used in many machine-learning algorithms, including decision trees and Markov models.
- Expected value: The average value of a random variable, weighted by its probability distribution. Expected value is used in many machine-learning algorithms, including reinforcement learning.
- Maximum likelihood estimation: A method of estimating the parameters of a probability distribution based on observed data. Maximum likelihood estimation is used in many machine-learning algorithms, including logistic regression and hidden Markov models.
Probability is a broad field in its own right, and many books have been written about it. In this book, we will not go deeper into the details, but it is important to understand these terms at a high level.
We have now investigated the basic principles of the statistics that play a crucial role in Qlik tools. Next, we will focus on the concept of defining a proper sample size. This is an important step, since we are not always able to train our model with all the data and we want our training dataset to represent the full data as much as possible.
Defining a proper sample size and population
Defining a proper sample size for machine learning is crucial to get accurate results. It is also a common problem that we don’t know how much training data is needed. Having a correct sample size is important for several reasons:
- Generalization: Machine-learning models are trained on a sample of data with the expectation that they will generalize to new, unseen data. If the sample size is too small, the model may not capture the full complexity of the problem, resulting in poor generalization performance.
- Overfitting: Overfitting occurs when a model fits the training data too closely, resulting in poor generalization performance. Overfitting is more likely to occur when the sample size is small because the model has fewer examples to learn from and may be more likely to fit the noise in the data.
- Statistical significance: In statistical inference, sample size is an important factor in determining the statistical significance of the results. A larger sample size provides more reliable estimates of model parameters and reduces the likelihood of errors due to random variation.
- Resource efficiency: Machine-learning models can be computationally expensive to train, especially with large datasets. Having a correct sample size can help optimize the use of computing resources by reducing the time and computational resources required to train the model.
- Decision-making: Machine-learning models are often used to make decisions that have real-world consequences. Having a correct sample size ensures that the model is reliable and trustworthy, reducing the risk of making incorrect or biased decisions based on faulty or inadequate data.
Defining a sample size
The sample size depends on several factors, including the complexity of the problem, the quality of the data, and the algorithm being used. “How much training data do I need?” is a common question at the beginning of a machine-learning project. Unfortunately, there is no correct answer to that question, since it depends on various factors. However, there are some guidelines.
Generally, the following factors should be addressed when defining a sample:
- Have a representative sample: It is essential to have a representative sample of the population to train a machine-learning model. The sample size should be large enough to capture the variability in the data and ensure that the model is not biased toward a particular subset of the population.
- Avoid overfitting: Overfitting occurs when a model is too complex and fits the training data too closely. To avoid overfitting, it is important to have a sufficient sample size to ensure that the model generalizes well to new data.
- Consider the number of features: The number of features or variables in the dataset also affects the sample size. As the number of features increases, the sample size required to train the model also increases.
- Use power analysis: Power analysis is a statistical technique used to determine the sample size required to detect a significant effect. It can be used to determine the sample size needed for a machine-learning model to achieve a certain level of accuracy or predictive power.
- Cross-validation: Cross-validation is a technique used to evaluate the performance of a machine-learning model. It involves splitting the data into training and testing sets and using the testing set to evaluate the model’s performance. The sample size should be large enough to ensure that the testing set is representative of the population and provides a reliable estimate of the model’s performance.
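The cross-validation idea above can be sketched as a simple k-fold index split (a minimal illustration, not a full library implementation):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size.

    Each fold serves once as the test set; the remaining folds form
    the training set for that split.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        splits.append((train, test))
    return splits

# 10 data points, 5 folds: each split trains on 8 points and tests on 2
for train, test in kfold_indices(10, 5):
    print(train, test)
```

In practice, the data would be shuffled first, and a library implementation (such as scikit-learn's KFold) would be preferred.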
There are several statistical heuristic methods available to estimate a sample size. Let’s take a closer look at some of these.
Power analysis is one of the key concepts in machine learning. Power analysis is mainly used to determine whether a statistical test has sufficient probability to find an effect and to estimate the sample size required for an experiment considering the significance level, effect size, and statistical power.
In this context, power is the probability that a statistical test will reject a false null hypothesis (H0); in other words, the probability of detecting an effect when one is present. A bigger sample size results in larger power. The main output of power analysis is an estimate of an appropriate sample size.
To understand the basics of power analysis, we need to get familiar with the following concepts:
- A type I error (α) is rejecting the null hypothesis (H0) when it is true (a false positive).
- A type II error (β) is the failure to reject a false H0 or, in other words, the probability of missing an effect that is present in the data (a false negative).
- The power is the probability of detecting an effect that is in the data.
- There is a direct relationship between the power and type II error:
Power = 1 – β
Generally, β should never be more than 20%, which gives us a minimum accepted power level of 80%.
- The significance level (α) is the maximum risk of rejecting a true null hypothesis (H0) that you are willing to take. This is typically set to 5% (p < 0.05).
- The effect size is the measure of the strength of a phenomenon in the dataset (independent of sample size). The effect size is typically the hardest to determine. An example of an effect size would be the height difference between men and women. The greater the effect size, the greater the height difference will be. The effect size is typically marked with the letter d in formulas.
Now that we have defined the key concepts, let’s look at how to use power analysis in R and Python to calculate the sample size for an experiment, using a simple example. In R, we will utilize a package called pwr, and with Python, we will utilize the statsmodels library.
Let’s assume that we would like to create a model of customer behavior. We are interested to know whether there is a difference in the mean price of what our preferred customers and other customers pay at our online shop. How many transactions in each group should we investigate to get the power level of 80%?
library(pwr)

# Cohen's conventional "medium" effect size for a t-test (d = 0.5)
ch <- cohen.ES(test = "t", size = "medium")
print(ch)

# Solve for the sample size per group at 80% power and a 5% significance level
test <- pwr.t.test(d = 0.5, power = 0.80, sig.level = 0.05)
print(test)
The model will give us the following result:
Two-sample t test power calculation

            n = 63.76561
            d = 0.5
    sig.level = 0.05
        power = 0.8
  alternative = two.sided

NOTE: n is number in *each* group
The same calculation with Python and the statsmodels library:

from statsmodels.stats.power import TTestIndPower

# Solve for the per-group sample size of an independent two-sample t-test
analysis = TTestIndPower()
sample_size = analysis.solve_power(effect_size = 0.5, alpha = 0.05, power = 0.8)
print(sample_size)  # approximately 63.77, so about 64 transactions per group
Power analysis is a wide and complex topic, but it’s important to understand the basics, since it is widely utilized in many machine-learning tools. In this chapter, we have only scratched the surface of this topic.
Sampling is a method that makes it possible to get information about the population (dataset) based on the statistics from a subset of population (sample), without having to investigate every individual value. Sampling is particularly useful if a dataset is large and can’t be analyzed in full. In this case, identifying and analyzing a representative sample is important. In some cases, a small sample can be enough to reveal the most important information, but generally, using a larger sample can increase the likelihood of representing the data as a whole.
When performing sampling, there are some aspects to consider:
- Sample goal: A property that you wish to estimate or predict
- Population: A domain from which observations are made
- Selection criteria: A method to determine whether an individual value will be accepted as a part of the sample
- Sample size: The number of data points that will form the final sample data
Sampling methods can be divided into two main categories:
Probability sampling is a technique where every element of the dataset has an equal chance of being selected. These methods typically give the best chance of creating a sample that truly represents the population. Examples of probability sampling algorithms are simple random sampling, cluster sampling, systematic sampling, and stratified random sampling.
Non-probability sampling is a method where all elements are not equally qualified for being selected. With these methods, there is a significant risk that the sample is non-representative. Examples of non-probability sampling algorithms are convenience sampling, selective sampling, snowball sampling, and quota sampling.
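As an illustration of one probability sampling method, here is a minimal sketch of stratified random sampling in pure Python (the strata labels and sampling fraction are illustrative):

```python
import random

def stratified_sample(data, key, fraction, seed=42):
    """Draw the same fraction of records from each stratum.

    data: list of records; key: function returning the stratum of a record;
    fraction: share of each stratum to sample.
    """
    random.seed(seed)
    strata = {}
    for record in data:
        strata.setdefault(key(record), []).append(record)
    sample = []
    for records in strata.values():
        k = max(1, round(len(records) * fraction))
        sample.extend(random.sample(records, k))
    return sample

# 80 'A' records and 20 'B' records; a 10% stratified sample keeps the ratio
data = [("A", i) for i in range(80)] + [("B", i) for i in range(20)]
sample = stratified_sample(data, key=lambda r: r[0], fraction=0.1)
print(len(sample))  # 10 -> 8 from 'A', 2 from 'B'
```

Because each stratum contributes proportionally, the class balance of the population is preserved in the sample.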
When using sampling to create a training set, it is recommended to use a sampling function or library in R or Python. This automates the process and produces a sample based on the selected algorithm and specifications. In R, we can utilize the built-in sample() function, and in Python, the standard library offers the random.sample() function. Here is a simple example of random sampling in both languages:
dataset <- data.frame(id = 1:20, fact = letters[1:20])
set.seed(123)
sample <- dataset[sample(1:nrow(dataset), size = 5), ]
The content of the sample frame will look like this:
   id fact
15 15    o
19 19    s
14 14    n
3   3    c
10 10    j
import random

random.seed(123)
dataset = [[1, 'v'], [5, 'b'], [7, 'f'], [4, 'h'], [0, 'l']]
sample = random.sample(dataset, 3)
print(sample)
[[1, 'v'], [7, 'f'], [0, 'l']]
There is a lot of material covering different sampling techniques and how to use those with R and Python on the internet. Take some time to practice these techniques with simple datasets.
When sampling, two sources of error should be kept in mind:
- Selection bias is introduced when the values selected for the sample are not chosen at random. In this case, the sample is not representative of the dataset that we want to analyze.
- Sampling error is a statistical error that occurs when the sample does not represent the entire population of data. In this case, the results of the prediction or model will not generalize to the entire dataset.
Training datasets will always contain some sampling error, since a sample cannot fully represent the entire dataset. Sample error in the context of binary classification can be calculated using the following simplified formula:
Sample error = (False positives + False negatives) / (True positives + False positives + True negatives + False negatives)
If we have, for example, a dataset containing 45 values, of which 12 are misclassified, we get a sample error of 12/45 = 26.67%.
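The simplified formula can be expressed as a small helper (the split of the 12 errors between false positives and false negatives below is illustrative):

```python
def sample_error(tp, fp, tn, fn):
    """Fraction of misclassified values in a binary classification sample."""
    return (fp + fn) / (tp + fp + tn + fn)

# 45 values in total, of which 12 are misclassified, as in the example above
error = sample_error(tp=20, fp=7, tn=13, fn=5)
print(f"{error:.2%}")  # 26.67%
```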
The above formula can only be used in the context of binary classification. When estimating the population mean (μ) from a sample mean (x̅), the standard error is calculated as follows:

SE = σ / √n
- SE (Standard Error): The standard error is a measure of the variability or uncertainty in a sample statistic. It quantifies how much the sample statistic is expected to vary from the true population parameter. In other words, it gives you an idea of how reliable or precise your sample estimate is.
- σ (population standard deviation): This is the standard deviation of the entire population you’re trying to make inferences about. It represents the amount of variability or spread in the population data. In practice, the population standard deviation is often unknown, so you may estimate it using the sample standard deviation (s) when working with sample data.
- n (sample size): The number of observations or data points in your sample.
As an example, suppose we survey the ages of residents in a town and obtain the following:
- Sample mean (x̅): 35 years
- Sample standard deviation (s): 10 years (an estimate of the population standard deviation)
- Sample size (n): 50 residents

SE = 10 / √50 ≈ 1.41 years
So, the standard error of the sample mean is approximately 1.41 years. This means that if you were to take multiple random samples of the same size from the population and calculate the mean for each sample, you would expect those sample means to vary around 35 years, with an average amount of variation of about 1.41 years.
Standard error is often used to construct confidence intervals. For instance, you might use this standard error to calculate a 95% confidence interval for the average age of residents in the town, which would allow you to estimate the range within which the true population mean age is likely to fall with 95% confidence.
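The example above, including an approximate 95% confidence interval for the mean, can be computed as follows:

```python
import math

def standard_error(s, n):
    """Standard error of the mean: sample standard deviation over sqrt(n)."""
    return s / math.sqrt(n)

mean, s, n = 35, 10, 50
se = standard_error(s, n)
print(round(se, 2))  # 1.41

# Approximate 95% confidence interval for the population mean (z = 1.96)
low, high = mean - 1.96 * se, mean + 1.96 * se
print(round(low, 1), round(high, 1))  # 32.2 37.8
```

So with 95% confidence, the true mean age of residents lies roughly between 32.2 and 37.8 years.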
As we can see, sample error, often referred to as “sampling error,” is not represented by a single formula. Instead, it is a concept that reflects the variability or uncertainty in the estimates or measurements made on a sample of data when trying to infer information about a larger population. The specific formula for sampling error depends on the statistic or parameter you are estimating and the characteristics of your data. In practice, you would use statistical software or tools to calculate the standard error for the specific parameter or estimate you are interested in.
Training and test data in machine learning
The preceding methods for defining a sample size will work well if we need to define the amount of needed data without a large collection of historic data covering the phenomenon that we are investigating. In many cases, we have a large dataset and we would like to produce training and test datasets from that historical data. Training datasets are used to train our machine-learning model and test datasets are used to validate the accuracy of our model. Training and test datasets are the key concepts in machine learning.
We can utilize power analysis and sampling to create training and testing datasets, but sometimes there is no need to make a complex analysis if our sample is already created. The training dataset is the biggest subset of the original dataset and will be used to fit the machine-learning model. The test dataset is another subset of original data and is always independent of the training dataset.
Test data should be well organized and contain data for each type of scenario that the model will face in the production environment. Usually, it is 20–25% of the original dataset. The exact split can be adjusted based on the requirements of the problem or the characteristics of the dataset.
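A minimal sketch of a random 80/20 train/test split in pure Python (a library such as scikit-learn would normally handle this):

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle the data and split it into training and test subsets."""
    random.seed(seed)
    shuffled = data[:]           # copy so the original order is preserved
    random.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, test = train_test_split(data, test_fraction=0.2)
print(len(train), len(test))  # 80 20
```

Shuffling before splitting helps ensure that both subsets are drawn from the same distribution, keeping the test set independent of the training set.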
Now that we have investigated some of the concepts to define a good sample, we can focus on the concepts used to analyze model performance and reliability. These concepts are important, since using these techniques allow us to develop our model further and make sure that it gives proper results.
Concepts to analyze model performance and reliability
Analyzing the performance and reliability of our machine-learning model is an important development step and should be done before implementing the model to production. There are several metrics that you can use to analyze the performance and reliability of a machine learning model, depending on the specific task and problem you are trying to solve. In this section, we will cover some of these techniques, focusing on ones that Qlik tools are using.
Regression model scoring
The following concepts can be used to score and verify regression models. Regression models predict outcomes as a number, indicating the model’s best estimate of the target variable. We will learn more about regression models in Chapter 2.
R-squared is a statistical measure that represents the proportion of the variance in a dependent variable that is explained by an independent variable (or variables) in a regression model. In other words, it measures the goodness of fit of a regression model to the data.
R-squared ranges from 0 to 1, where 0 indicates that the model does not explain any of the variability in the dependent variable, and 1 indicates that the model perfectly explains all the variability in the dependent variable.
R-squared is an important measure of the quality of a regression model. A high R-squared value indicates that the model fits the data well and that the independent variable(s) have a strong relationship with the dependent variable. A low R-squared value indicates that the model does not fit the data well and that the independent variable(s) do not have a strong relationship with the dependent variable. However, it is important to note that a high R-squared value does not necessarily mean that the model is the best possible model, so other factors such as overfitting should also be taken into consideration when evaluating the performance of a model. R-squared is often used together with other metrics and it should be interpreted in the context of the problem. The formula for R-squared is the following:
R-squared = Variance explained by the model / Total variance
There are some limitations to R-squared. It cannot tell us whether the predictions are biased, nor whether the regression model has an adequate fit. Bias refers to systematic errors in predictions. To check for bias, you should analyze residuals (the differences between predicted and observed values) or use bias-specific metrics such as Mean Absolute Error (MAE) and Mean Bias Deviation (MBD). R-squared primarily addresses model variance, not bias.
Sometimes it is better to utilize adjusted R-squared. Adjusted R-squared is a modified version of the standard R-squared used in regression analysis. We can use adjusted R-squared when dealing with multiple predictors to assess model fit, control overfitting, compare models with different predictors, and aid in feature selection. It accounts for the number of predictors, penalizing unnecessary complexity. However, it should be used alongside other evaluation metrics and domain knowledge for a comprehensive model assessment.
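To make these definitions concrete, here is a minimal Python sketch that computes R-squared and adjusted R-squared from scratch. The data points and variable names are made up for illustration:

```python
# Illustrative sketch: R-squared and adjusted R-squared from scratch.
# The data points below are made up for demonstration.

def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalizes R^2 for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.8, 5.3, 6.9, 9.1]

r2 = r_squared(actual, predicted)                             # approx. 0.9925
adj_r2 = adjusted_r_squared(r2, n_samples=4, n_predictors=1)
```

Note how adjusted R-squared comes out slightly lower than R-squared: that difference is the penalty for the number of predictors.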
Root mean squared error (RMSE), mean absolute error (MAE), and mean squared error (MSE)
Root mean squared error is the average difference that can be expected between the predicted and actual values. It is the standard deviation of the residuals (prediction errors) and tells us how concentrated the data is around the “line of best fit.” It is a standard way to measure the error of a model when predicting quantitative data. RMSE is always measured in the same unit as the target value.
As an example, if a model predicts house values in a certain area and we get an RMSE of 20,000, it means that, on average, the predicted value differs from the actual value by 20,000 USD.
Mean absolute error is defined as an average of all absolute prediction errors in all data points. In MAE, different errors are not weighted but the scores increase linearly with the increase in error. MAE is always a positive value since we are using an absolute value of error. MAE is useful when the errors are symmetrically distributed and there are no significant outliers.
Mean squared error is the average of the squared differences between the predicted and actual values. Squaring eliminates negative values and ensures that MSE is always positive or 0. The smaller the MSE, the closer our model is to the “line of best fit.” RMSE is simply the square root of MSE.
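The three error metrics can be sketched in a few lines of Python. The house-price figures below are hypothetical:

```python
import math

# Illustrative sketch of MSE, RMSE, and MAE; the house prices are hypothetical.

def mse(actual, predicted):
    """Mean of the squared prediction errors."""
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of MSE, in the same unit as the target."""
    return math.sqrt(mse(actual, predicted))

def mae(actual, predicted):
    """Mean of the absolute prediction errors."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)

# Hypothetical house prices in USD
actual = [200_000, 250_000, 300_000]
predicted = [210_000, 240_000, 330_000]

print(round(mae(actual, predicted)))    # 16667: the average absolute error
print(round(rmse(actual, predicted)))   # 19149: the 30,000 outlier weighs more
```

RMSE exceeds MAE here because squaring gives the single 30,000 USD error more weight than the two 10,000 USD errors.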
When to use the above metrics in practice
The choice between these metrics should align with your specific problem and objectives. It’s also good practice to consider the nature of your data and the impact of outliers when selecting an error metric. Additionally, you can use these metrics in conjunction with other evaluation techniques to get a comprehensive view of your model’s performance.
Multiclass classification scoring and binary classification scoring
The following concepts can be used to score and verify multiclass and binary models. Binary classification models distribute outcomes into two categories, typically denoted as Yes or No. Multiclass classification models are similar, but there are more than two categories as an outcome. We will learn more about both models in Chapter 2.
Recall measures the percentage of correctly classified positive instances over the total number of actual positive instances. In other words, recall represents the ability of a model to correctly capture all positive instances.
Recall is calculated as follows:
Recall = True positive / (True positive + False negative)
A high recall indicates that the model is able to accurately capture all positive instances and has a low rate of false negatives. On the other hand, a low recall indicates that the model is missing many positive instances, resulting in a high rate of false negatives.
Precision measures the percentage of correctly classified positive instances over the total number of predicted positive instances. In other words, precision represents the ability of the model to correctly identify positive instances.
Precision is calculated as follows:
Precision = True positive / (True positive + False positive)
A high precision indicates that the model is able to accurately identify positive instances and has a low rate of false positives. On the other hand, a low precision indicates that the model is incorrectly classifying many instances as positive, resulting in a high rate of false positives.
Precision is particularly useful in situations where false positives are costly or undesirable, such as in medical diagnosis or fraud detection. Precision should be used in conjunction with other metrics, such as recall and F1 score, to get a more complete picture of the model’s performance.
The F1 score is the harmonic mean of precision and recall, calculated as follows:
F1 score = 2 × (Precision × Recall) / (Precision + Recall)
The F1 score gives equal importance to both precision and recall, making it a useful metric for evaluating models when the distribution of positive and negative instances is uneven. A high F1 score indicates that the model has a good balance between precision and recall and can accurately classify both positive and negative instances.
In general, the more imbalanced the dataset, the lower the F1 score is likely to be. When one class greatly outnumbers the other, the F1 score is affected by that imbalance, so it should be interpreted in the context of the particular data distribution and problem domain. If the F1 score is high, the other metrics will generally be high as well; if it is low, there is a need for further analysis.
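As an illustration, the following Python sketch computes precision, recall, and the F1 score from hypothetical confusion-matrix counts:

```python
# Illustrative precision/recall/F1 computation from hypothetical counts.

def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical counts from a binary classifier
tp, fp, fn = 80, 20, 40

p = precision(tp, fp)   # 0.8
r = recall(tp, fn)      # 80 / 120, approx. 0.667
f1 = f1_score(p, r)     # approx. 0.727, pulled toward the weaker metric
```

Because the harmonic mean is dominated by the smaller value, the F1 score sits closer to the weaker of the two metrics (here, recall).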
Accuracy measures the percentage of correctly classified instances over the total number of instances. In other words, accuracy represents the ability of the model to correctly classify both positive and negative instances.
Accuracy is calculated in the following way:
Accuracy = (True positive + True negative) / (True positive + False positive + True negative + False negative)
A high accuracy indicates that the model is able to accurately classify both positive and negative instances and has a low rate of false positives and false negatives. However, accuracy can be misleading in situations where the distribution of positive and negative instances is uneven. In such cases, other metrics such as precision, recall, and F1 score may provide a more accurate representation of the model’s performance.
Accuracy can mislead in imbalanced datasets where one class vastly outnumbers the others. This is because accuracy doesn’t consider the class distribution and can be high even if the model predicts the majority class exclusively. To address this, use metrics such as precision, recall, F1 score, AUC-ROC, and AUC-PR, which focus on the correct identification of the minority class, which is often the class of interest in such datasets.
Consider an example where we screen 10,000 patients for a rare disease:
- 100 patients have the rare disease (positive class)
- 9,900 patients do not have the disease (negative class)
Now, let’s say our model predicts all 10,000 patients as not having the disease. Here’s what happens:
- True Positives (correctly predicted patients with the disease): 0
- False Positives (incorrectly predicted patients with the disease): 0
- True Negatives (correctly predicted patients without the disease): 9,900
- False Negatives (incorrectly predicted patients without the disease): 100
Using accuracy as our evaluation metric produces the following result:
Accuracy = (True positive + True negative) / Total = 9,900 / 10,000 = 99%
Our model appears to have an impressive 99% accuracy, which might lead to the misleading conclusion that it’s performing exceptionally well. However, it has completely failed to detect any cases of the rare disease (True Positives = 0), which is the most critical aspect of the problem.
In this example, accuracy doesn’t provide an accurate picture of the model’s performance because it doesn’t account for the severe class imbalance and the importance of correctly identifying the minority class (patients with the disease).
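We can reproduce the rare-disease example in a few lines of Python to confirm the numbers:

```python
# Reproducing the rare-disease example from the text: the model predicts
# "no disease" for every one of the 10,000 patients.
tp, fp, tn, fn = 0, 0, 9_900, 100

accuracy = (tp + tn) / (tp + fp + tn + fn)
disease_recall = tp / (tp + fn)

print(accuracy)         # 0.99: looks impressive...
print(disease_recall)   # 0.0: ...but not a single sick patient was found
```

The contrast between the two numbers is exactly why recall (and related metrics) must accompany accuracy on imbalanced data.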
A confusion matrix is a table used to evaluate the performance of a classification model. It displays the number of true positive, false positive, true negative, and false negative predictions made by the model for a set of test data.
The four elements in the confusion matrix represent the following:
- True positives (TP) are actual true values that were correctly predicted as true
- False positives (FP) are actual false values that were incorrectly predicted as true
- False negatives (FN) are actual true values that were incorrectly predicted as false
- True negatives (TN) are actual false values that were correctly predicted as false
Qlik AutoML presents a confusion matrix as part of the experiment view. Below the numbers in each quadrant, you can also see percentage values for the metrics recall (TP), fallout (FP), miss rate (FN), and specificity (TN).
Figure 1.4: Confusion matrix as seen in Qlik AutoML
By analyzing the confusion matrix, we can calculate various performance metrics such as accuracy, precision, recall, and F1 score, which can help us understand how well the model is performing on the test data. The confusion matrix can also help us identify any patterns or biases in the model’s predictions and adjust the model accordingly.
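As a sketch, the four cells of a binary confusion matrix can be computed directly from label lists; the labels below are made up for demonstration:

```python
# Sketch: building the four confusion-matrix cells from binary label lists.

def confusion_matrix(actual, predicted):
    """Return (tp, fp, fn, tn) for 0/1 labels, with 1 as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

# Made-up test labels and predictions
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_matrix(actual, predicted)
accuracy = (tp + tn) / len(actual)   # other metrics derive from the same cells
```

Once the four cells are available, accuracy, precision, recall, and F1 all follow from the formulas given earlier in this section.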
Matthews Correlation Coefficient (MCC)
MCC takes into account all four elements of the confusion matrix (true positives, false positives, true negatives, and false negatives) to provide a measure of the quality of a binary classifier’s predictions. It ranges between -1 and +1, with a value of +1 indicating perfect classification performance, 0 indicating a random classification, and -1 indicating complete disagreement between predicted and actual values.
The formula for MCC is as follows:
MCC = (TP × TN − FP × FN) / √((TP + FP) × (TP + FN) × (TN + FP) × (TN + FN))
MCC is particularly useful when dealing with imbalanced datasets where the number of positive and negative instances is not equal. It provides a better measure of classification performance than accuracy in such cases, since accuracy can be biased toward the majority class.
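The MCC formula translates directly into Python; the confusion-matrix counts below are hypothetical:

```python
import math

# Illustrative MCC computation; the confusion-matrix counts are hypothetical.

def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; 0.0 for a degenerate denominator."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

perfect = mcc(tp=50, fp=0, tn=50, fn=0)    # +1.0: perfect classification
inverted = mcc(tp=0, fp=50, tn=0, fn=50)   # -1.0: complete disagreement
```

The guard for a zero denominator matters in practice: if any row or column of the confusion matrix is empty, the formula is undefined and is commonly reported as 0.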
AUC and ROC curve
The ROC (Receiver Operating Characteristic) curve is a graphical representation of the performance of a binary classification model that allows us to evaluate and compare different models based on their ability to discriminate between positive and negative classes. AUC refers to the area under this curve.
An ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various classification thresholds. The TPR is the ratio of true positive predictions to the total number of actual positive instances, while the FPR is the ratio of false positive predictions to the total number of actual negative instances.
By varying the classification threshold, we can obtain different TPR and FPR pairs and plot them on the ROC curve. The area under the ROC curve (AUC-ROC) is used as a performance metric for binary classification models, with higher AUC-ROC indicating better performance.
A perfect classifier would have an AUC-ROC of 1.0, indicating that it has a high TPR and low FPR across all possible classification thresholds. A random classifier would have an AUC-ROC of 0.5, indicating that its TPR and FPR are equal and its performance is no better than chance.
The ROC curve and AUC-ROC are useful for evaluating and comparing binary classification models, especially when the positive and negative classes are imbalanced or when the cost of false positive and false negative errors is different.
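AUC-ROC also has an equivalent probabilistic interpretation: it is the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The following sketch uses that interpretation on made-up labels and scores:

```python
# AUC-ROC via its probabilistic interpretation: the chance that a random
# positive instance is scored above a random negative one (ties count 0.5).
# The labels and scores below are made up.

def auc_roc(labels, scores):
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 0, 0]
scores = [0.9, 0.6, 0.7, 0.2]

auc = auc_roc(labels, scores)   # 0.75: one positive is outscored by a negative
```

This pairwise formulation gives the same value as integrating the ROC curve, but avoids having to sweep thresholds explicitly.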
The following figure shows an ROC curve for a well-performing model, as seen in Qlik AutoML (a good curve arches toward the top-left corner, with an AUC close to 1). The dotted line represents 50:50 random chance.
Figure 1.5: ROC curve for a good model in Qlik AutoML
When a model makes a prediction, it generates a probability score between 0 and 1 that represents the likelihood of an instance belonging to the positive class. If the score is above a certain threshold value, the instance is classified as positive, and if it is below the threshold, it is classified as negative.
The choice of threshold can significantly impact the performance of a classification model. If the threshold is set too high, the model may miss many positive instances, leading to a low recall and high precision. Conversely, if the threshold is set too low, the model may classify many negative instances as positive, leading to a high recall and low precision.
Therefore, selecting an appropriate threshold for a classification model is important in achieving the desired balance between precision and recall. The optimal threshold depends on the specific application and the cost of false positive and false negative errors.
Qlik AutoML computes the precision and recall for hundreds of possible threshold values from 0 to 1 and chooses the threshold that achieves the highest F1 score. Selecting the threshold this way makes the resulting predictions more robust for imbalanced datasets.
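The threshold-selection idea can be illustrated with a simple sweep that evaluates the F1 score at many candidate thresholds and keeps the best one. This is only a sketch of the general technique on made-up data, not Qlik's actual implementation:

```python
# Sketch of threshold selection by maximizing F1 (the general idea described
# in the text, not Qlik's actual implementation). The data is made up.

def f1_at_threshold(labels, scores, threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 1)
    fp = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(labels, scores, steps=100):
    """Evaluate F1 on an evenly spaced grid of thresholds and keep the best."""
    candidates = [i / steps for i in range(1, steps)]
    return max(candidates, key=lambda t: f1_at_threshold(labels, scores, t))

labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.35, 0.1]

best = best_threshold(labels, scores)
```

In this toy dataset, a threshold between 0.35 and 0.4 separates the classes perfectly, so the sweep lands in that interval.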
Feature importance
Feature importance is a measure of the contribution of each input variable (feature) in a model to the output variable (prediction). It is a way to understand which features have the most impact on the model’s prediction, and which features can be ignored or removed without significantly affecting the model’s performance.
Feature importance can be computed using various methods, depending on the type of model used. Some common methods for calculating feature importance include the following:
- Permutation importance: This method involves shuffling the values of each feature in the test data, one at a time, and measuring the impact on the model’s performance. The features that cause the largest drop in performance when shuffled are considered more important.
- Feature importance from tree-based models: In decision tree-based models such as Random Forest or Gradient Boosting, feature importance can be calculated based on how much each feature decreases the impurity of the tree. The features that reduce impurity the most are considered more important.
- Coefficient magnitude: In linear models such as Linear Regression or Logistic Regression, feature importance can be calculated based on the magnitude of the coefficients assigned to each feature. The features with larger coefficients are considered more important.
Feature importance can help in understanding the relationship between the input variables and the model’s prediction and can guide feature selection and engineering efforts to improve the model’s performance. It can also provide insights into the underlying problem and the data being used and can help in identifying potential biases or data quality issues.
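A minimal sketch of permutation importance, using a toy rule-based "model" and made-up data, looks like this:

```python
import random

# Sketch of permutation importance on a toy rule-based "model" and made-up
# data. Feature 1 is deliberately ignored by the model, so its importance
# should come out as exactly 0.

def accuracy(model, X, y):
    return sum(1 for xi, yi in zip(X, y) if model(xi) == yi) / len(y)

def permutation_importance(model, X, y, n_features, seed=0):
    """Importance of feature j = drop in accuracy after shuffling column j."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(n_features):
        column = [row[j] for row in X]
        rng.shuffle(column)
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        importances.append(baseline - accuracy(model, X_perm, y))
    return importances

model = lambda x: 1 if x[0] > 0 else 0          # uses only feature 0
X = [[1, 5], [-1, 3], [2, 8], [-2, 1], [3, 9], [-3, 2]]
y = [1, 0, 1, 0, 1, 0]

imp = permutation_importance(model, X, y, n_features=2)
```

Shuffling the ignored column cannot change the predictions, so its importance is zero, while shuffling the decisive column typically degrades accuracy. In practice, the shuffle is repeated several times and the drops are averaged to reduce noise.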
In Qlik AutoML, the permutation importance of each feature is represented as a graph. This can be used to estimate feature importance. Another method that is visible in AutoML is SHAP importance values. The next section will cover the principles of SHAP importance values.
SHAP importance values
SHAP values are based on game theory and the concept of Shapley values, which provide a way to fairly distribute the value of a cooperative game among its players. In the context of machine learning, the game is the prediction task, and the players are the input features. The SHAP values represent the contribution of each feature to the difference between a specific prediction and the expected value of the output variable.
The SHAP values approach involves computing the contribution of each feature by evaluating the model’s output for all possible combinations of features, with and without the feature of interest. The contribution of the feature is the difference in the model’s output between the two cases averaged over all possible combinations.
SHAP values provide a more nuanced understanding of the relationship between the input features and the model’s output than other feature importance measures, as they account for interactions between features and the potential correlation between them.
SHAP values can be visualized using a SHAP plot, which shows the contribution of each feature to the model’s output for a specific prediction. This plot can help in understanding the relative importance of each feature and how they are influencing the model’s prediction.
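For a very small number of features, exact Shapley values can be computed by brute force over all coalitions, filling in "missing" features with baseline values. The additive toy model below is made up, and its exact answer is known in advance, which makes the result easy to verify:

```python
from itertools import combinations
from math import factorial

# Brute-force Shapley values for a tiny model; "missing" features are replaced
# by baseline values. The additive toy model and its inputs are made up.

def shap_values(model, x, baseline):
    n = len(x)

    def value(coalition):
        # Model output when only features in `coalition` take their real values
        point = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(point)

    phis = []
    for i in range(n):
        phi = 0.0
        others = [j for j in range(n) if j != i]
        for size in range(len(others) + 1):
            for coalition in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi += weight * (value(set(coalition) | {i}) - value(set(coalition)))
        phis.append(phi)
    return phis

model = lambda p: 2 * p[0] + 3 * p[1]   # additive, so contributions separate cleanly
x = [4.0, 1.0]
baseline = [2.0, 0.0]

phi = shap_values(model, x, baseline)   # phi[0] = 2*(4-2) = 4.0, phi[1] = 3*(1-0) = 3.0
```

Note the defining property: the Shapley values sum exactly to the difference between the model's output for this instance and its output at the baseline. Real SHAP libraries approximate this computation, since enumerating all coalitions grows exponentially with the number of features.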
Difference between SHAP and permutation importance
Permutation importance and SHAP are alternative ways of measuring feature importance. The main difference between the two is that permutation importance is based on the decrease in model performance when a feature’s values are shuffled. It is a simpler and more computationally efficient approach but may not accurately reflect the true importance of features in complex models.
SHAP importance is based on the magnitude of feature attributions. SHAP values provide a more nuanced understanding of feature importance but can be computationally expensive and may not be feasible for very large datasets or complex models.
Both measures can help us do the following:
- Understand which features to keep and which to abandon
- Understand the feature importance for model accuracy
- Understand if there is a data leakage, meaning information from outside the training dataset is used to create or evaluate a model, resulting in over-optimistic performance estimates or incorrect predictions
- Understand which features have the greatest influence on the predicted outcome
- Understand how the different values of the feature affect the model prediction
- Understand what the most influential rows are in the dataset
We can see an example of a permutation importance graph and SHAP graph in the following figure, as seen in Qlik AutoML:
Figure 1.6: Permutation importance and SHAP importance graphs
We will utilize both permutation importance and SHAP importance in our hands-on examples later in this book.
In this chapter, we first got an introduction to the Qlik tools for machine learning. We discovered the key features of the platform and how its different components can be utilized. Understanding the key components is important since we will be utilizing Insight Advisor, AutoML, and Advanced Analytics Integration later in this book.
We also discovered some of the key concepts of statistics. Understanding the basics of the underlying mathematics is crucial to understanding the models. We only scratched the surface of the mathematics, but this should be enough to familiarize you with the terminology. We also touched on the important topic of samples and sample size. When creating a model, we need to train it with training data, and determining a reasonable sample size will help us get an accurate model without wasting resources.
At the end of this chapter, we got familiar with some of the techniques used to validate a model’s performance and reliability. These are important concepts, since Qlik tools use the introduced methods to communicate the metrics of a model.
In the next chapter, we will augment our background knowledge by getting familiar with some of the most common machine-learning algorithms. These algorithms will be used in later parts of this book.