Before discussing machine learning, it makes sense to properly understand the term artificial intelligence. Broadly speaking, artificial intelligence is a branch of computer science and is the idea that machines can be made to think and act just like us humans, without explicit programming instructions.
There is a common misconception that artificial intelligence is a new thing. The term is widely considered to have been coined in 1956 by John McCarthy, then an assistant professor of mathematics, at the Dartmouth Summer Research Project on Artificial Intelligence. We are now in an AI boom, but it was not always so; artificial intelligence has a somewhat chequered history. Following the 1956 conference, funding flowed generously and rapid progress was made as researchers developed systems that could play chess and solve mathematical problems. Optimism was high, but progress stalled because the early promises made about artificial intelligence could not be fulfilled, and the funding dried up; this cycle was repeated in the 1980s. The current boom is due to the timely emergence of three key technologies:
- Big data: Giving us the amounts of data required to be able to do artificial intelligence
- High-speed high-capacity storage devices: Giving us the ability to store the data
- GPUs: Giving us the ability to process the data
Nowadays, AI is everywhere. Here are some examples of AI:
- Chatbots (for example, customer service chatbots)
- Amazon Alexa, Apple’s Siri, and other smart assistants
- Autonomous vehicles
- Spam filters
- Recommendation engines
According to experts, there are four types of AI:
- Reactive: This is the simplest type and involves machines programmed to always respond in the same predictable manner. They cannot learn.
- Limited memory: This is the most common type of AI in use today. It combines pre-programmed information with historical data to perform tasks.
- Theory of mind: This is a technology we may see in the future. The idea here is that a machine with a theory of mind AI will understand emotions, and then alter its own behavior accordingly as it interacts with humans.
- Self-aware: This is the most advanced type of AI. Machines that are self-aware of their own emotions, and the emotions of those around them, will have a level of intelligence like human beings and will be able to make assumptions, inferences, and deductions. This is certainly one for the future as the technology for this doesn’t exist just yet.
Machine learning is one way to exploit AI. Writing software programs to cater to all situations, occurrences, and eventualities is time-consuming, requires effort, and, in some cases, is not even possible. Consider the task of recognizing pictures of people. We humans can handle this task easily, but the same is not true for computers. Even more difficult is programming a computer to do this task. Machine learning tackles this problem by getting the machine to program itself by learning through experiences.
There is no universally agreed-upon definition of machine learning that everyone subscribes to. Some attempts include the following:
- A branch of computer science that focuses on the use of data and algorithms to imitate the way that humans learn
- The capability of machines to imitate intelligent human behavior
- A subset of AI that allows machines to learn from data without being programmed explicitly
Machine learning needs data – and sometimes lots and lots of it.
Lack of data is a significant weak spot in AI. Without a reasonable amount of data, machines cannot perform and generate sensible results. Indeed, in some ways, this is just like how we humans operate – we look and learn and then apply that knowledge in new, unknown situations.
And, if we think about it, everyone has data. From the smallest sole trader to the largest organization, everyone will have sales data, purchase data, customer data, and more. The format of this data may differ between different organizations, but it is all useful data that can be used in machine learning. This data can be collected and processed and can be used to build machine learning models. Typically, this data is split into the following sets:
- Training set: This is always the largest of the datasets (typically 80%) and is the data that is used to train the machine learning models.
- Development set: This dataset (10%) is used to tweak and try new parameters to find the ones that work the best for the model.
- Test set: This is used to test (validate) the model (10%). The model has already seen the training data, so it cannot be used to test the model, hence this dataset is required. This dataset also allows you to determine whether the model is working well or requires more training.
It is good practice to have both development and test datasets. The process of building models involves finding the set of parameters that gives the best results. These parameters are determined by making use of the development set. Without the development set, we would be reduced to using the same datasets for training, testing, and evaluation. This is undesirable and can present further problems unless handled carefully. For example, the datasets should be constructed such that the original dataset class proportions are preserved across the test and training sets. Furthermore, as a general point, training data should be checked for the following:
- It is relevant to the problem
- It is large enough such that all use cases of the model are covered
- It is unbiased and contains no imbalance toward any particular category
Modern toolkits such as sklearn (Pedregosa et al., 2011) provide ready-made functions that will easily split your dataset for you:
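from sklearn.model_selection import train_test_split
# res holds four objects: training data, test data, training labels, test labels
res = train_test_split(data, labels, train_size=0.8,
                       test_size=0.2,
                       random_state=42,
                       stratify=labels)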
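The 80/10/10 split into training, development, and test sets described earlier can be obtained by calling train_test_split twice. The following is a minimal sketch rather than a prescribed recipe; the toy data, the variable names, and the choice of random_state are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up for illustration): 100 samples, 2 features, two balanced classes.
rng = np.random.default_rng(42)
data = rng.normal(size=(100, 2))
labels = np.array([0, 1] * 50)

# First split: keep 80% for training and hold out 20%, preserving class proportions.
X_train, X_rest, y_train, y_rest = train_test_split(
    data, labels, train_size=0.8, random_state=42, stratify=labels)

# Second split: divide the held-out 20% equally into development and test sets.
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest)

print(len(X_train), len(X_dev), len(X_test))  # prints: 80 10 10
```

Passing stratify at both steps keeps the class proportions of the original dataset in each of the three sets.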
However, there are times when the data scientist will not have enough data available to be able to warrant splitting it multiple ways – for example, there is no data relevant to the problem, or the process to collect the data is too difficult, expensive, or time-consuming. This is known as data scarcity and it can be responsible for poor model performance. In such cases, various solutions may help alleviate the problem:
- Augmentation: For example, taking an image and performing processing (for example, rotation, scaling, and modifying the colors) so that new instances are slightly different
- Synthetic data: Data that is artificially generated using computer programs
To evaluate models where data is scarce, a technique known as k-fold cross-validation is used. This is discussed more fully in Chapter 2. Briefly, the dataset is split into a number (k) of groups; then, in turn, each group is taken as the test dataset with the remaining groups as the training dataset, and the model is fit and evaluated. Because this is repeated for each group, each member of the original dataset is used in the test dataset exactly once and in a training dataset k-1 times. Finally, the model accuracy is calculated by combining the results from the individual evaluations, typically by averaging them.
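The procedure can be sketched with sklearn's KFold class. This is an illustrative sketch only: the choice of k=5, the logistic regression model, and the use of sklearn's bundled copy of the Iris data are assumptions, not the setup used later in the book:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# Split the data into k=5 groups (folds); shuffle so each fold mixes the classes.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
times_tested = np.zeros(len(X), dtype=int)  # how often each sample lands in a test fold
for train_idx, test_idx in kfold.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                 # train on the other k-1 folds
    scores.append(model.score(X[test_idx], y[test_idx]))  # evaluate on the held-out fold
    times_tested[test_idx] += 1

mean_accuracy = float(np.mean(scores))
print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {mean_accuracy:.3f}")
```

The times_tested array confirms the property stated above: every sample appears in a test fold exactly once.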
This poses an interesting question about how much data is needed. There are no hard-and-fast rules but, generally speaking, the more the better. However, regardless of the amount of data, there are typically other issues that need to be addressed:
- Missing values
- Inconsistencies
- Duplicate values
- Ambiguity
- Inaccuracies
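Several of these issues can be detected with a few pandas calls. The following sketch uses a small made-up table; the column names and the rule that a negative amount counts as an inaccuracy are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# Made-up sales records containing typical data quality problems.
records = pd.DataFrame({
    "customer": ["Ann", "Bob", "Bob", "Cara", "dan"],
    "amount": [10.0, 25.5, 25.5, np.nan, -3.0],
})

# Count missing values per column.
missing_per_column = records.isna().sum()

# Count rows that exactly repeat an earlier row.
duplicate_rows = int(records.duplicated().sum())

# Flag values that cannot be right (here: a negative sale amount).
negative_amounts = int((records["amount"] < 0).sum())

# One possible clean-up: drop duplicates, drop incomplete rows, drop invalid amounts.
cleaned = records.drop_duplicates().dropna()
cleaned = cleaned[cleaned["amount"] >= 0]

print(missing_per_column)
print(duplicate_rows, negative_amounts, len(cleaned))
```

Whether to drop, correct, or impute a problematic value depends on the dataset; dropping is simply the shortest option to show here.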
Machine learning is important. It has many real-world applications that can allow businesses and individuals to save time, money, and effort by, for example, automating business processes. Consider a customer service center where staff are required to take calls, answer queries, and help customers. In such a scenario, machine learning can be used to handle some of the more simple repetitive tasks, hence relieving burden from staff and getting things done more quickly and efficiently.
Machine learning has dramatically altered the traditional ways of doing things over the past few years. However, in many aspects, it still lags far behind human levels of performance. Often, the best solutions are hybrid human-in-the-loop solutions where humans are needed to perform final verification of the outcome.
There are several types of machine learning:
- Supervised learning
- Unsupervised learning
- Semi-supervised learning
- Reinforcement learning
Supervised learning models must be trained with labeled data. Hence, both the inputs and the outputs of the model are specified. For example, a machine learning model could be trained with human-labeled images of apples and other fruits, labeled as apple and non-apple. This would allow the machine to learn the best way to identify pictures of apples. Supervised machine learning is the most common type of machine learning used today. In some ways, this matches how we humans function; we look and learn from experiences and then apply that knowledge in unknown, new situations to work out an answer. Technically speaking, there are two main types of supervised learning problems:
- Classification: Problems that involve predicting labels (for example, apple)
- Regression: Problems that involve predicting a numerical value (for example, a house price)
Both of these types of problems can have any number of inputs of any type. These problems are known as supervised from the idea that the output is supplied by a teacher that shows the system what to do.
Unsupervised learning is a type of machine learning that, in contrast to supervised learning, involves training algorithms on data that is unlabeled. Unsupervised algorithms examine datasets looking for meaningful patterns or trends that would not otherwise be apparent; that is, the aim is for the algorithm to find the structure in the data on its own. For example, unsupervised machine learning algorithms can examine sales data and pinpoint the different types of products being purchased. However, although these models can perform more complex tasks than their supervised counterparts, they are also much more unpredictable. Some use cases that adopt this approach are as follows:
- Dimensionality reduction: The process of reducing the number of inputs into a model by identifying the key (principal) components that capture the majority of the data without losing key information.
- Association rules: The process of finding associations between different inputs in the input dataset by discovering the probabilities of the co-occurrence of items. For example, when people buy ice cream, they also typically buy sunglasses.
- Clustering: Finds hidden patterns in a dataset based on similarities or differences and groups the data into clusters or groups. Unsupervised learning can be used to perform clustering when the exact details of the clusters are unknown.
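Two of these use cases, dimensionality reduction and clustering, can be sketched in a few lines of sklearn. The example below is a rough illustration under stated assumptions: it uses sklearn's bundled Iris data with the labels deliberately ignored, reduces to two components, and asks for three clusters, all of which are choices made for this sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the measurements only; the labels are discarded because this is unsupervised.
X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Dimensionality reduction: compress four inputs into two principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()
print(f"Variance retained by 2 components: {explained:.2f}")

# Clustering: group the reduced data into three clusters without using any labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_reduced)
print(cluster_ids[:10])
```

The cluster numbers are arbitrary identifiers; the algorithm has no notion of which cluster corresponds to which variety.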
Semi-supervised learning is, unsurprisingly, a combination of supervised and unsupervised learning. A small amount of labeled data and a large amount of unlabeled data are used. This offers the benefits of both unsupervised and supervised learning while avoiding the challenge of requiring large amounts of labeled data. Consequently, models can be trained to label data without requiring huge amounts of labeled training data.
Reinforcement learning is about learning the best behavior so that the maximum reward is achieved. This behavior is learned by interacting with the environment and observing how it responds. In other words, the sequence of actions that maximize the reward must be independently discovered via a trial-and-error process. In this way, the model can learn the actions that result in success in an unseen environment.
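The trial-and-error idea can be illustrated with a deliberately tiny example, often called a multi-armed bandit. This is only a sketch of the principle: the three actions, their hidden reward probabilities, and the exploration rate epsilon are all made-up values, and real reinforcement learning problems involve far richer environments:

```python
import random

random.seed(0)

# Hidden from the agent: the chance that each of three actions pays a reward.
reward_probability = [0.2, 0.5, 0.8]

estimates = [0.0, 0.0, 0.0]  # the agent's running estimate of each action's reward
counts = [0, 0, 0]           # how many times each action has been tried
epsilon = 0.1                # fraction of steps spent exploring at random

for step in range(5000):
    if random.random() < epsilon:
        action = random.randrange(3)              # explore: try a random action
    else:
        action = estimates.index(max(estimates))  # exploit: pick the best so far

    # The environment responds with a reward of 1 or 0.
    reward = 1.0 if random.random() < reward_probability[action] else 0.0

    # Update the running average reward for the chosen action.
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = estimates.index(max(estimates))
print(f"Estimated rewards: {[round(e, 2) for e in estimates]}")
print(f"Action the agent prefers: {best_action}")
```

The agent is never told which action is best; it discovers this by interacting with the environment and observing the rewards.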
Briefly, here are the typical steps that are followed in a machine learning project:
- Data collection: Data can come from a database, Excel, or text file – essentially it can come from anywhere.
- Data preparation: The quality of the data used is crucial. Hence, time must be spent fixing issues such as missing data and duplicates. Initial exploratory data analysis (EDA) is performed on the data to discover patterns, spot anomalies, and test theories about the data by using visual techniques.
- Model training: An appropriate algorithm and model is chosen to represent the data. The data is split into training data for developing the model and test data for testing the model.
- Evaluation: To test the accuracy, the test data is used.
- Improve performance: Here, a different model may be chosen, or other inputs may be used.
Let’s start with the technical requirements.
Technical requirements
This book describes a series of experiments with machine learning algorithms – some standard algorithms, some developed especially for this book. These algorithms, along with various worked examples, are available as Python programs at https://github.com/PacktPublishing/Machine-Learning-for-Emotion-Analysis/tree/main, split into directories corresponding to the chapters in which the specific algorithms will be discussed.
One of the reasons why we implemented these programs in Python is that there is a huge amount of useful material to build upon. In particular, there are good-quality, efficient implementations of several standard machine learning algorithms. Using these helps us be confident that where an algorithm doesn’t work as well as expected on some dataset, it is because the algorithm isn’t very well suited to that dataset, rather than because we haven’t implemented it very well. Some of the programs in the repository use very particular libraries, but there are several packages that we will use throughout this book. These are listed here. If you are going to use the code in the repository (which we hope you will, because looking at what actual programs do is one of the best ways of learning), you will need to install these libraries. Most of them can be installed very easily, either by using the built-in package installer pip or by following the directions on the relevant website:
- pandas: This is one of the most commonly used libraries and is used primarily for cleaning and preparing data, as well as analyzing tabular data. It provides tools to explore, clean, manipulate, and analyze all types of structured data. Typically, machine learning libraries and projects use pandas structures as inputs. You can install it by typing the following command in the command prompt:
pip install pandas
Or you can go to https://pandas.pydata.org/docs/getting_started/install.html for other options.
- NumPy: This is used primarily for its support of N-dimensional arrays. It has functions for linear algebra and matrices and is also used by other libraries. Python provides several collection classes that can be used to represent arrays, notably as lists, but they are computationally slow to work with – NumPy provides objects that are up to 50 times faster than Python lists. To install it, run the following command in the command prompt:
pip install numpy
Alternatively, you can refer to the documentation for more options: https://numpy.org/install/.
- SciPy: This provides a range of scientific functions built on top of NumPy, including ways of representing sparse arrays (arrays where most elements are 0) that can be manipulated thousands of times faster than standard NumPy arrays if the vast majority of elements are 0. You can install it using the following command:
pip install scipy
You can also refer to the SciPy documentation for more details: https://scipy.org/install/.
- scikit-learn (Pedregosa et al., 2011): This is used to build machine learning models as it has functions for building supervised and unsupervised machine learning models, analysis, and dimensionality reduction. A large part of this book is about investigating how well various standard machine learning algorithms work on particular datasets, and it is useful to have reliable good-quality implementations of the most widely used algorithms so that we are not distracted by issues due to the way we have implemented them.
scikit-learn is also known as sklearn – when you want to import it into a program, you should refer to it as sklearn. You can install it as follows:
pip install scikit-learn
Refer to the documentation for more information: https://scikit-learn.org/stable/install.html.
The sklearn implementations of the various algorithms generally make the internal representations of the data available to other programs. This can be particularly valuable when you are trying to understand the behavior of some algorithm on a given dataset and is something we will use extensively as we carry out our experiments.
- TensorFlow: This is a popular library for building neural networks as well as performing other tasks. It uses tensors (multi-dimensional arrays) to perform operations. It is built to take advantage of parallelism, so it is used to train neural networks in a highly efficient manner. Again, it makes sense to reuse a reliable good-quality implementation when testing neural network models on our data so that we know that any poor performances arise because of problems with the algorithm rather than with our implementation of it. As ever, you can just install it using pip:
pip install tensorflow
For more information, refer to the TensorFlow documentation: https://www.tensorflow.org/install.
You will not benefit from its use of parallelism unless you have a GPU or other hardware accelerator built into your machine, and training complex models is likely to be intolerably slow. We will consider how to use remote facilities such as Google Colab to obtain better performance in Chapter 9, Exploring Transformers. For now, just be aware that running tensorflow on a standard computer without any kind of hardware accelerator probably won’t do anything within a reasonable period.
- Keras: This is also used for building neural networks. It is built on top of TensorFlow. It creates computational graphs to represent machine learning algorithms, so it is slow compared to other libraries. Keras comes as part of TensorFlow, so there is no need to install anything beyond TensorFlow itself.
- Matplotlib: This is an interactive library for plotting graphs, charts, plots, and visualizing data. It comes with a wide range of plots that help data scientists understand trends and patterns. Matplotlib is extremely powerful and allows users to create almost any visualization imaginable. Use the following command to install matplotlib:
pip install matplotlib
Refer to the documentation for more information: https://matplotlib.org/stable/users/installing/index.html.
Matplotlib may install NumPy if you do not have it already installed, but it is more sensible to install them separately (NumPy first).
- Seaborn: This is built on top of Matplotlib and is another library for creating visualizations. It is useful for making attractive plots and helps users explore and understand data. Seaborn makes it easy for users to switch between different visualizations. You can easily install Seaborn by running the following command:
pip install seaborn
For more installation options, please refer to https://seaborn.pydata.org/installing.html.
We will use these libraries throughout this book, so we advise you to install them now, before trying out any of the programs and examples that we’ll discuss as we go along. You only have to install them once so that they will be available whenever you need them. We will specify any other libraries that the examples depend on as we go along, but from now on, we will assume that you have at least these ones.
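Once the libraries are installed, a quick way to confirm that they are available is to ask Python for their installed versions. The following sketch uses only the standard library; the helper name installed_versions is hypothetical and is introduced here purely for illustration:

```python
from importlib import metadata

# Distribution names as they are passed to pip install.
REQUIRED = ["pandas", "numpy", "scipy", "scikit-learn",
            "tensorflow", "matplotlib", "seaborn"]


def installed_versions(packages):
    """Return a dict mapping each package name to its version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name:<14}{version or 'NOT INSTALLED'}")
```

Note that the check uses the name scikit-learn, which is the name given to pip, even though the library is imported as sklearn.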
A sample project
The best way to learn is by doing! In this section, we will discover how to complete a small machine learning project in Python. Completing, and understanding, this project will allow you to become familiar with machine learning concepts and techniques.
Typically, the first step in developing any Python program is to import the modules that are going to be needed using the import statement:
import sklearn
import pandas as pd
Note
Other imports are needed for this exercise; these can be found in the GitHub repository.
The next step is to load the data that is needed to build the model. Like most tutorials, we will use the famous Iris dataset. The Iris dataset contains data on the length and width of sepals and petals. We will use pandas to load the dataset. The dataset can be downloaded from the internet and read from your local filesystem, as follows:
# the r prefix stops Python from treating the backslash as an escape character
df = pd.read_csv(r"c:\iris.csv")
Alternatively, pandas can read it directly from a URL:
df = pd.read_csv("https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv")
The read_csv command returns a DataFrame. It is probably the most commonly used pandas object and is simply a two-dimensional data structure with rows and columns, just like a spreadsheet.
Since we will be using sklearn, it is interesting to see that sklearn also makes it easy to access the dataset:
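from sklearn import datasets
# as_frame=True returns pandas objects rather than NumPy arrays
iris = datasets.load_iris(as_frame=True)
df = iris.frame
Note that the sklearn copy of the dataset names its columns differently (for example, sepal length (cm)) and stores the class as a number in a column called target, so the examples that follow assume the CSV version loaded previously.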
We can now check that the dataset has been successfully loaded by using the describe function:
df.describe()
The describe function returns a descriptive summary of a DataFrame reporting values such as the mean, count, and standard deviation:
       sepal.length  sepal.width  petal.length  petal.width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
This function is useful to check that the data has been loaded correctly but also to provide a first glance at some interesting attributes of the data.
Some other useful commands tell us more about the DataFrame:
- This shows the first five rows of the DataFrame:
df.head(5)
- This shows the last five rows of the DataFrame:
df.tail(5)
- This describes the columns of the DataFrame:
df.columns
- This describes the number of rows and columns in the DataFrame:
df.shape
It is usually a good idea to use these functions to check that the dataset has been successfully and correctly loaded into the DataFrame and that everything looks as it should.
It is also important to ensure that the dataset is balanced – that is, there are relatively equal numbers of each class.
The majority of machine learning algorithms have been developed with the assumption that there are equal numbers of instances of each class. Consequently, imbalanced datasets present a big problem for machine learning models as this results in models with poor predictive performance.
In the Iris example, this means that we have to check that we have equal numbers of each type of flower. This can be verified by running the following command:
df['variety'].value_counts()
This prints the following output:
Setosa        50
Versicolor    50
Virginica     50
Name: variety, dtype: int64
We can see that there are 50 examples of each variety. The next step is to create some visualizations. Although we used the describe function to get an idea of the statistical properties of the dataset, it is much easier to observe these in a visual form as opposed to in a table.
Box plots (see Figure 1.9) plot the distribution of data based on the sample minimum, the lower quartile, the median, the upper quartile, and the sample maximum. This helps us analyze the data to establish any outliers and the data variation to better understand each attribute:
import matplotlib.pyplot as plt
attributes = df[['sepal.length', 'sepal.width',
                 'petal.length', 'petal.width']]
attributes.boxplot()
plt.show()
This outputs the following plot:
Figure 1.9 – Box plot
Heatmaps are useful for understanding the relationships between attributes. Heatmaps are an important tool for data scientists to explore and visualize data. They represent the data in a two-dimensional format and allow the data to be summarized visually as a colored graph. Although we can use matplotlib to create heatmaps, it is much easier in seaborn and requires significantly fewer lines of code – something we like!
import seaborn as sns
# correlate only the four numeric columns; the variety column contains text
sns.heatmap(attributes.corr(), annot=True)
plt.show()
This outputs the following heatmap:
Figure 1.10 – Heatmap
The squares in the heatmap represent the correlation (a measure that shows how much two variables are related) between the variables. The correlation values range from -1 to +1:
- The closer the value is to 1, the more positively correlated they are – that is, as one increases, so does the other
- Conversely, the closer the value is to -1, the more negatively correlated they are – that is, as one variable decreases, the other will increase
- Values close to 0 indicate that there is no linear trend between the variables
In Figure 1.10, the diagonals are all 1. This is because, in those squares, the two variables are the same and hence the correlation is to itself. For the remainder, the scale shows that the lighter the color (toward the top of the scale), the higher the correlation. For example, the petal length and petal width are highly correlated, whereas petal length and sepal width are not. Finally, it can also be seen that the plot is symmetrical on both sides of the diagonal. This is because the same set of variables are paired in the squares that are the same.
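To make the meaning of the correlation values concrete, the Pearson correlation coefficient (the default measure used by the pandas corr function) can be computed by hand. The helper name pearson and the small made-up series below are assumptions introduced for illustration:

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return float((x_centered * y_centered).sum()
                 / np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum()))


x = [1, 2, 3, 4, 5]
r_up = pearson(x, [2, 4, 6, 8, 10])     # y rises with x    -> 1.0
r_down = pearson(x, [10, 8, 6, 4, 2])   # y falls as x rises -> -1.0
r_none = pearson(x, [1, -1, 0, -1, 1])  # no linear trend    -> 0.0

print(r_up, r_down, r_none)
```

The function is also symmetric, pearson(x, y) equals pearson(y, x), which is why the heatmap mirrors itself across the diagonal.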
We can now build a model using the data and estimate the accuracy of the model on data that it has not seen previously. Let’s start by separating the data and the labels from each other by using Python:
data = df.iloc[:, 0:4]
labels = df.iloc[:, 4]
Before we can train a machine learning model, it is necessary to split the data and labels into testing and training data. As discussed previously, we can use the train_test_split function from sklearn:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data,
    labels, test_size=0.2)
The capital X and lowercase y are a nod to math notation, where it is common practice to write matrix variable names in uppercase and vector variable names using lowercase letters. This convention has no special meaning in Python and can be ignored if desired. For now, note that X refers to data, and y refers to the associated labels. Hence, X_train can be understood to refer to an object that contains the training data.
Before we can begin to work on the machine learning model, we must normalize the data. Normalization is a scaling technique that updates the numeric columns to use a common scale. This helps improve the performance, reliability, and accuracy of the model. The two most common normalization techniques are min-max scaling and standardization scaling:
- Min-max scaling: This method uses the minimum and maximum values for scaling and rescales the values so that they end up in the range 0 to 1 or -1 to 1. It is most useful when the features are of different scales. It is typically used when the feature distribution is unknown, such as in k-NN or neural network models.
- Standardization scaling: This method uses the mean and the standard deviation to rescale values so that they have a mean of 0 and a variance of 1. The resultant scaled values are not confined to a specific range. It is typically used when the feature distribution is normal.
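Both techniques reduce to a one-line formula. The following NumPy sketch applies them to a small made-up column of values; it is meant to show the arithmetic rather than to replace sklearn's scaler classes:

```python
import numpy as np

# A made-up feature column.
values = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-max scaling: subtract the minimum, divide by the range -> values in [0, 1].
min_max = (values - values.min()) / (values.max() - values.min())

# Standardization: subtract the mean, divide by the standard deviation
# -> mean 0 and variance 1, with no fixed range.
standardized = (values - values.mean()) / values.std()

print(min_max)       # [0.   0.25 0.5  0.75 1.  ]
print(standardized)
```

The standard deviation here is the population form (NumPy's default), which is also what sklearn's StandardScaler uses.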
It is uncommon to come across datasets that perfectly follow a certain specific distribution. Typically, every dataset will need to be standardized. For the Iris dataset, we will use sklearn’s StandardScaler to scale the data by making the mean of the data 0 and the standard deviation 1:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Now that the data is ready, sklearn makes it easy for us to test and compare various machine learning models. A brief explanation of each model is provided here, but don’t worry: we explain these models in more detail in later chapters.
Logistic regression
Logistic regression is one of the most popular machine learning techniques. It is used to predict a categorical dependent variable using a set of independent variables and makes use of a sigmoid function. The sigmoid is a mathematical function whose values lie between 0 and 1 and that approaches both values asymptotically. This makes it useful for binary classification with 0 and 1 as potential output values. For problems with more than two classes, such as Iris, sklearn’s LogisticRegression extends the approach automatically:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
score = lr.score(X_train, y_train)
print(f"Training data accuracy {score}")
score = lr.score(X_test, y_test)
print(f"Testing data accuracy {score}")
Training data accuracy 0.9666666666666667
Testing data accuracy 0.9666666666666667
Note
There is also a technique called linear regression but, as its name suggests, this is used for regression problems, whereas the current Iris problem is a classification problem.
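The sigmoid function that logistic regression relies on is short enough to write out directly. The sketch below is illustrative only; the 0.5 decision threshold is the conventional choice for binary classification, and the function name to_class is a hypothetical helper:

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, this equivalent form avoids overflow in math.exp.
    e = math.exp(z)
    return e / (1.0 + e)


def to_class(z, threshold=0.5):
    """Turn the sigmoid output into a binary class label (0 or 1)."""
    return 1 if sigmoid(z) >= threshold else 0


for z in (-10, -1, 0, 1, 10):
    print(f"z={z:>3}  sigmoid={sigmoid(z):.5f}  class={to_class(z)}")
```

Large negative inputs give values close to 0 and large positive inputs give values close to 1, while an input of 0 gives exactly 0.5.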
Support vector machines (SVMs)
Support vector machine (SVM) is one of the best “out-of-the-box” classification techniques. SVM constructs a hyperplane that can then be used for classification. It works by calculating the distance between observations and then determining a hyperplane that maximizes the distance between the closest members of separate classes. The support vector classifier (SVC) used in the following example does not specify a kernel, so sklearn applies its default radial basis function (RBF) kernel; passing kernel='linear' would apply a linear kernel function instead:
from sklearn.svm import SVC
svm = SVC(random_state=0, gamma='auto', C=1.0)
svm.fit(X_train, y_train)
score = svm.score(X_train, y_train)
print(f"Training data accuracy {score}")
score = svm.score(X_test, y_test)
print(f"Testing data accuracy {score}")
Training data accuracy 0.9666666666666667
Testing data accuracy 0.9666666666666667
The following parameters are used:
random_state: This seeds the random number generation that is used to shuffle the data. In this example, it is set to 0, so the same seed is used on every run. In the SVC class, this setting only has an effect when probability estimates are enabled, which is not the case here.
gamma: This controls how far the influence of a single training example reaches when the decision boundary is drawn. Low values mean “far” and high values mean “close.” In this example, gamma is set to “auto,” which tells sklearn to use 1 divided by the number of features.
C: This controls the trade-off between maximizing the distance between classes and correctly classifying the data.
K-nearest neighbors (k-NN)
k-NN is another widely used classification technique. k-NN classifies objects based on the closest training examples in the feature space. It is a simple algorithm that stores all available cases and classifies new cases by a majority vote of their k nearest neighbors. Each new case is assigned to the class that is most common among its k nearest neighbors, as measured by a distance function:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train,y_train)
score = knn.score(X_train, y_train)
print(f"Training data accuracy {score}")
score = knn.score(X_test, y_test)
print(f"Testing data accuracy {score}")
Training data accuracy 0.9583333333333334
Testing data accuracy 0.9333333333333333
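The majority-vote idea behind k-NN is simple enough to write from scratch. The following sketch is not how sklearn implements it (sklearn uses more efficient data structures); the function name knn_predict, the toy points, and the use of Euclidean distance are assumptions made for illustration:

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by a majority vote among its k nearest training points."""
    # Measure the Euclidean distance from the query to every stored example.
    distances = [(math.dist(point, query), label)
                 for point, label in zip(train_points, train_labels)]
    # Keep the labels of the k closest examples.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among those neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Two made-up groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (2, 2)))  # prints: a
print(knn_predict(points, labels, (7, 8)))  # prints: b
```

There is no training step beyond storing the examples; all the work happens at prediction time.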
Decision trees
Decision trees attempt to create a tree-like model that predicts the value of a variable by learning simple decision rules that are inferred from the data features. Decision trees classify examples by sorting down the tree from the root to a leaf node, with the leaf node providing the classification for our example:
from sklearn import tree
dt = tree.DecisionTreeClassifier()
dt.fit(X_train, y_train)
score = dt.score(X_train, y_train)
print(f"Training data accuracy {score}")
score = dt.score(X_test, y_test)
print(f"Testing data accuracy {score}")
Training data accuracy 1.0
Testing data accuracy 0.9333333333333333
Random forest
Random forest builds decision trees using different samples and then takes the majority vote as the answer. In other words, random forest builds multiple decision trees and then merges them to get a more accurate prediction. Due to its simplicity, it is also one of the most commonly used algorithms:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
score = rf.score(X_train, y_train)
print(f"Training data accuracy {score}")
score = rf.score(X_test, y_test)
print(f"Testing data accuracy {score}")
Training data accuracy 1.0
Testing data accuracy 0.9666666666666667
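The individual trees inside a fitted forest can be examined through its estimators_ attribute. The following sketch counts how the trees vote for a single test sample. It assumes the Iris data bundled with scikit-learn, 25 trees, and a fixed random_state, none of which come from the example above. One caveat: scikit-learn's forest combines its trees by averaging their predicted class probabilities rather than by a strict majority vote, so the tally shown here is an approximation of what the forest does internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: use the Iris data bundled with scikit-learn
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# n_estimators=25 is an arbitrary choice for illustration
forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_train, y_train)

sample = X_test[:1]  # a single flower, kept two-dimensional

# Ask every tree in the forest for its own prediction (a class index)
tree_votes = np.array([int(t.predict(sample)[0]) for t in forest.estimators_])
vote_counts = np.bincount(tree_votes, minlength=len(forest.classes_))

print("Votes per class:", vote_counts)
print("Forest prediction:", forest.predict(sample)[0])
```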
Neural networks
Neural networks (the basis of deep learning) are algorithms that are inspired by how the human brain works, and are designed to recognize patterns in numerical data. Neural networks consist of an input layer, an output layer, and (optionally) one or more hidden layers between them; networks with many hidden layers are what is usually meant by deep learning. These layers contain units (neurons) that transform the inputs into something useful for the output layer. These neurons are connected and work together. We will look at these in more detail later in this book.
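As a small preview, the sketch below passes one input through a network with a single hidden layer. Everything in it is a hypothetical choice for illustration: the weights are hand-picked rather than learned, and ReLU is just one common activation function.

```python
import numpy as np


def relu(z):
    """Activation function: negative values become zero."""
    return np.maximum(0, z)


# Hypothetical hand-picked weights; a real network learns these from data
x = np.array([1.0, 2.0])            # input layer: two features
W_hidden = np.array([[0.5, -1.0],
                     [0.25, 1.0]])  # connects 2 inputs to 2 hidden neurons
b_hidden = np.array([0.0, 0.5])
W_out = np.array([[1.0],
                  [2.0]])           # connects 2 hidden neurons to 1 output
b_out = np.array([0.1])

# Each layer computes a weighted sum of its inputs plus a bias
hidden = relu(x @ W_hidden + b_hidden)
output = hidden @ W_out + b_out

print("Hidden layer:", hidden)
print("Output layer:", output)
```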
Making predictions
Once we have chosen and fit a machine learning model, it can easily be used to make predictions on new, unseen data – that is, take the final model and one or more data instances and then predict the classes for each of the data instances. The model is needed because the result classes are not known for the new data. The class for the unseen data can be predicted using scikit-learn’s predict() function.
First, the unseen data must be transformed into a pandas DataFrame, along with the column names:
df_predict = pd.DataFrame([[5.9, 3.0, 5.1, 1.8]], columns = ['sepal.length', 'sepal.width',
'petal.length', 'petal.width'])
This DataFrame can then be passed to scikit-learn’s predict() function to predict the class value:
print(dt.predict(df_predict))
['Virginica']
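The same approach extends to several instances at once, and many scikit-learn classifiers can also report how confident they are through predict_proba(). The sketch below is self-contained and therefore makes some assumptions: it uses the Iris data bundled with scikit-learn (whose column names and lowercase class names differ from the CSV used in this chapter) and trains on the whole dataset purely for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: use the Iris data bundled with scikit-learn
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target_names[iris.target]  # class names instead of class indices

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Two unseen instances, using the same column names as the training data
new_flowers = pd.DataFrame(
    [[5.9, 3.0, 5.1, 1.8], [5.0, 3.4, 1.5, 0.2]],
    columns=iris.feature_names)

predictions = model.predict(new_flowers)          # one class per row
probabilities = model.predict_proba(new_flowers)  # one probability per class

for label, probs in zip(predictions, probabilities):
    print(label, dict(zip(model.classes_, probs.round(2))))
```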
A sample text classification problem
Given that this is a book on emotion classification and emotions are generally expressed in written form, it makes sense to take a look at how a text classification problem is tackled.
We have all received spam emails. These are typically emails that are sent to huge numbers of email addresses, usually for marketing or phishing purposes. Often, these emails are sent by bots. They are of no interest to the recipients and have not been requested by them. Consequently, email servers will often automatically detect and remove these messages by looking for recognizable phrases and patterns, and sometimes placing them into special folders labeled Junk or Spam.
In this example, we will build a spam detector and use the machine learning abilities of scikit-learn to train the spam detector to detect and classify text as spam and non-spam. There are many labeled datasets available online (for example, from Kaggle); we chose to use the dataset from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset?resource=download.
The dataset contains SMS messages that have been collected for spam research. It contains 5,574 SMS messages in English that are labeled as spam or non-spam (ham). The file contains one message per line, and each line has two columns: the label and the message text.
We have seen some of the basic pandas commands already, so let’s load the file and split it into training and test sets, as we did previously:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
spam = pd.read_csv("spam.csv", encoding_errors="ignore")
labels = spam["v1"]
data = spam["v2"]
X_train,X_test,y_train,y_test = train_test_split(data,
labels, test_size = 0.2)
Note
The file may have an encoding error; for now, we will ignore this as it is not relevant to the task at hand.
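Before training, it can be worth checking how many messages carry each label, because a random split of a lopsided dataset may leave the test set with very few examples of the rarer class. The sketch below shows one possible way to handle this, using the stratify parameter of train_test_split() to keep the label proportions the same in both sets. The toy DataFrame is a hypothetical stand-in for spam.csv, and the random_state value is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for spam.csv: 15 ham and 5 spam messages
toy = pd.DataFrame({
    "v1": ["ham"] * 15 + ["spam"] * 5,
    "v2": [f"message number {i}" for i in range(20)],
})

# How many messages are there per label?
print(toy["v1"].value_counts())

# stratify keeps the ham/spam ratio the same in the training and test sets
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy["v2"], toy["v1"], test_size=0.2,
    stratify=toy["v1"], random_state=42)

print(toy_y_test.value_counts())
```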
A handy class called CountVectorizer is available in sklearn. This can be used to transform text into a vector of token counts. It is also able to preprocess the text data (for example, by converting it to lowercase) before generating the vector representations, hence it is extremely useful. CountVectorizer converts the raw text into a numerical vector representation, which makes it easy to use the text as input in machine learning tasks:
count_vectorizer = CountVectorizer()
X_train_features = count_vectorizer.fit_transform(X_train)
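Essentially, it assigns an index to each unique word in the text and then counts the number of occurrences of each. (scikit-learn orders the vocabulary alphabetically; the illustration here numbers the words in order of first appearance for readability.) For example, consider the following sentence: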
The quick brown fox jumps over the lazy dog.
This would be converted as follows:
| word | the | quick | brown | fox | jumps | over | lazy | dog |
| index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| count | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Notice that there are eight unique words, hence eight columns. Each column represents a unique word in the vocabulary. Each row of counts represents one item in the dataset (here, a single sentence; in our dataset, a single message). The values in the cells are the word counts. Armed with this knowledge about the types and counts of common words that appear in spam, the model will be able to classify text as spam or non-spam.
We will use the simple k-NN model introduced earlier:
knn = KNeighborsClassifier(n_neighbors = 5)
The fit() function, as we have seen earlier, trains the model with the vectorized counts from the training data and the training labels. For k-NN, training simply stores these labeled examples so that new messages can later be compared against them; hyperparameters such as n_neighbors are chosen by us and are not tuned by fit(). Note that, since this is a supervised classification task, the labels must also be passed to the fit() function, just as they were in the Iris example earlier:
knn.fit(X_train_features, y_train)
We now have a model that we can use on the test data to test for accuracy:
X_test_features = count_vectorizer.transform(X_test)
score = knn.score(X_test_features, y_test)
print(f"Testing data accuracy {score}")
Testing data accuracy 0.9255605381165919
Note how this time, we use transform() instead of fit_transform(). The difference is subtle but important. The fit_transform() function does fit(), followed by transform(); that is, it learns the vocabulary from the training data and then uses that vocabulary to generate a term-count matrix. This is needed when a model is being trained. The transform() method, on the other hand, only generates and returns the term-count matrix using the vocabulary that has already been learned, so the test data is encoded with the same columns as the training data, and any words that were not seen during training are ignored. The score() function then scores the predictions for the test data term-count matrix against the actual test labels, y_test. Even with a simplistic model, we can classify spam with reasonably high accuracy.