Machine Learning for Emotion Analysis in Python

By Allan Ramsay, Tariq Ahmad
About this book
Artificial intelligence and machine learning are the technologies of the future, and this is the perfect time to tap into their potential and add value to your business. Machine Learning for Emotion Analysis in Python helps you employ these cutting-edge technologies in your customer feedback system and in turn grow your business exponentially. With this book, you’ll take your foundational data science skills and grow them in the exciting realm of emotion analysis. By following a practical approach, you’ll turn customer feedback into meaningful insights assisting you in making smart and data-driven business decisions. The book will help you understand how to preprocess data, build a serviceable dataset, and ensure top-notch data quality. Once you’re set up for success, you’ll explore complex ML techniques, uncovering the concepts of deep neural networks, support vector machines, conditional probabilities, and more. Finally, you’ll acquire practical knowledge using in-depth use cases showing how the experimental results can be transformed into real-life examples and how emotion mining can help track short- and long-term changes in public opinion. By the end of this book, you’ll be well-equipped to use emotion mining and analysis to drive business decisions.
Publication date: September 2023
Publisher: Packt
Pages: 334
ISBN: 9781803240688

 

Foundations

Emotions play a key role in our daily lives. Some people define them as the reactions that we as human beings experience as a response to events or situations, some describe them simply as a class of feelings, and others say they describe physiological states and are generated subconsciously. Psychologists describe emotions as “a complex state of feeling that results in physical and psychological changes that influence thought and behavior.” So, it appears that although we feel emotions, they are much harder to describe.

Our brains play a crucial role when creating and processing emotions. Historically, it was believed that each emotion was located in a specific part of the brain. However, research has shown that there is no single region of the brain that’s responsible for processing emotions – several brain regions are activated when emotions are being processed. Furthermore, different parts of the brain can generate the same emotion and different parts can also contribute to generating an emotion.

The reality may even be that emotion and sentiment are experiences that result from combined influences of biological, cognitive, and social aspects. Whatever the case, emotions matter because they help us decide what actions to take, how to negotiate tricky situations, and, at a basic level, how to survive. Different emotions rule our everyday lives; for example, we make decisions based on whether we are happy, angry, or sad, and we choose our daily pastimes and routines based on the emotions they facilitate. So, emotions are important, and understanding them may make our lives easier.

In this chapter, you will learn about the main concepts and differences between sentiment analysis and emotion analysis, and also understand why emotion analysis is important in the modern world. By combining this with a basic introduction to natural language processing (NLP) and machine learning, we will lay the foundations for successfully using these techniques for emotion analysis.

In this chapter, we’ll cover the following topics:

  • Emotions
  • Sentiment
  • Why is emotion analysis important?
  • Introduction to natural language processing
  • Introduction to machine learning
 

Emotions

This book is about writing programs that can detect emotions expressed in texts, particularly informal texts. Emotions play a crucial role in our daily lives. They impact how we feel, how we think, and how we behave. Consequently, it stands to reason that they impact the decisions we make. If this is the case, then being able to detect emotions from written text (for example, social media posts) is a useful thing to do because the impact it would have on many practical everyday applications in sectors such as marketing, industry, health, and security would be huge.

However, while it is clear that we all experience emotions and that they play a significant role in our plans and actions, it is much less clear what they are. Given that we are about to embark on a detailed study of how to write programs to detect them, it is perhaps worth beginning by investigating the notion of what an emotion is and looking at the various theories that attempt to pin them down. This is a topic that has fascinated philosophers and psychologists from antiquity to the present day, and it is still far from settled. We will briefly look at a number of the most prominent theories and approaches. This overview will not lead us to a definitive view, but before we start trying to identify them in written texts, we should at least become aware of the problems that people still have in pinning them down.

Darwin believed that emotions allowed humans and animals to survive and reproduce. He argued that they evolved, were adaptive, and that all humans, and even other animals, expressed emotion through similar behaviors. He believed that emotions had an evolutionary history that could be traced across cultures and species. Today, psychologists agree that emotions such as fear, surprise, disgust, happiness, and sadness can be regarded as universal regardless of culture.

The James-Lange theory proposes that it is our physical responses that are responsible for emotions. For example, if someone jumps out at you from behind a bush, your heart rate will increase, and it is this increase that causes you to feel fear. The facial-feedback theory builds on this idea and suggests that physical activity influences emotion: for example, if you smile, you will likely feel happier than if you did not. The Cannon-Bard theory, however, rejects James-Lange, suggesting instead that people experience emotional and physical responses simultaneously. The Schachter-Singer theory is a cognitive theory of emotion that suggests that it is our thoughts that are responsible for emotions, and similarly, cognitive appraisal theory suggests that thinking must come before experiencing an emotion. For instance, the brain might interpret a situation as threatening, and hence fear is experienced.

To try to obtain a deeper understanding of emotions, let’s look at the three main theories of emotion:

  • Physiological: Psychologists take the view that emotions are formed when a stimulus triggers a bodily response, so the physiological changes the individual experiences are themselves felt as an emotion
  • Neurological: Biologists claim that hormones (for example, estrogen, progesterone, and testosterone) that are produced by the body’s glands impact the chemistry and circuitry of the brain and these lead to emotional responses
  • Cognitive: Cognitive scientists believe that thoughts and other mental activities play a crucial role in forming emotions

In all likelihood, all three theories are valid to some extent. It has also been postulated that instead of thinking of these as mutually exclusive, it is more likely that they are complementary and that each explains and accounts for a different aspect of what we think of as an emotion.

Although emotions have been studied for many decades, it is probably still true that we do not fully understand them.

Humans can experience a huge number of emotions, but only a handful are considered basic. However, the number of emotions considered in emotion analysis research is not always limited to just these basic emotions. Furthermore, it is not straightforward to demarcate emotions, and hence boundaries are very rarely clearly defined.

We will now consider what are known as the primary emotions. These have been described as a reaction to an event or situation, or the immediate strong first reaction experienced when something happens. There has been much research on identifying these primary emotions, but there is still no general agreement, and different models have been suggested by eminent researchers such as Ekman, Plutchik, and Parrott. Some emotions such as anger, fear, joy, and surprise are universally agreed upon. However, the same is not true for other emotions, with disagreements over which emotions count as basic and how many there are. Although there is, again, no consensus on which model best covers the basic emotions, the models proposed by Ekman and Plutchik are most commonly used. There are two popular approaches: categorical and dimensional.

Categorical

Ekman is an advocate of the categorical theory, which suggests that emotions arise from separate neural systems. This approach also suggests that there are a limited number of primary, distinct emotions, such as anger, anxiety, joy, and sadness. Ekman suggested that primary emotions must have a distinct facial expression that is recognizable across all cultures. For example, the corners of the lips being turned down demonstrates sadness – and this facial expression is recognized universally as portraying sadness. Similarly, smiling with teeth exposed and the corners of the mouth pointing upwards is universally recognized as joy.

Amazingly, people blind from birth use the same facial expressions when expressing sadness and joy. They have never seen these facial expressions, so it is impossible that these expressions were learned. It is much more likely that these are an integral part of human nature. Using this understanding of distinct, universal facial expressions, Ekman proposed six primary emotions (Ekman, 1993):

  • Anger
  • Disgust
  • Fear
  • Joy
  • Sadness
  • Surprise

Ekman suggested that these basic emotions were biologically primitive, had evolved to increase the reproductive fitness of animals, and that all other emotions were combinations of these six primary emotions. Later, Ekman expanded this list to include other emotions that he considered basic, such as embarrassment, excitement, contempt, shame, pride, satisfaction, and amusement.

Another of the most influential works in the area of emotions is Plutchik’s psychoevolutionary theory of emotion. Plutchik proposed eight primary emotions (Plutchik, 2001):

  • Anger
  • Anticipation
  • Disgust
  • Fear
  • Joy
  • Sadness
  • Surprise
  • Trust

From this theory, Plutchik developed a Wheel of Emotions (see Figure 1.1). This wheel was developed to help understand the nuances of emotion and how emotions contrast. It has eight sectors representing the eight emotions. Emotions intensify as they move from outside toward the center of the wheel. For example, annoyance increases to anger and then further increases to outright rage. Each sector of the circle has an opposite emotion that is placed directly opposite in the wheel. For example, the opposite of sadness is joy, and the opposite of anger is fear. It also shows how different emotions can be combined.

Figure 1.1 – Plutchik’s Wheel of Emotions


Although Ekman and Plutchik’s theories are the most common, there are other works, but there is little agreement on what the basic emotions are. However, in the area of emotion analysis research, Ekman and Plutchik’s models are the most often used classification schemes.
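To make the structure of the wheel concrete, it can be written down as a small data structure. The following is a minimal sketch in Python: the opposites and the milder and more intense forms are read off the wheel in Figure 1.1, while the dictionary layout itself is simply one convenient choice rather than any standard representation.

```python
# A minimal sketch of Plutchik's Wheel of Emotions as plain Python data.
# The intensity labels (mild -> basic -> intense) follow the wheel itself;
# the data structure is only an illustrative choice.
PLUTCHIK_WHEEL = {
    "joy":          {"opposite": "sadness",      "mild": "serenity",     "intense": "ecstasy"},
    "trust":        {"opposite": "disgust",      "mild": "acceptance",   "intense": "admiration"},
    "fear":         {"opposite": "anger",        "mild": "apprehension", "intense": "terror"},
    "surprise":     {"opposite": "anticipation", "mild": "distraction",  "intense": "amazement"},
    "sadness":      {"opposite": "joy",          "mild": "pensiveness",  "intense": "grief"},
    "disgust":      {"opposite": "trust",        "mild": "boredom",      "intense": "loathing"},
    "anger":        {"opposite": "fear",         "mild": "annoyance",    "intense": "rage"},
    "anticipation": {"opposite": "surprise",     "mild": "interest",     "intense": "vigilance"},
}

def opposite_of(emotion: str) -> str:
    """Return the emotion sitting directly across the wheel."""
    return PLUTCHIK_WHEEL[emotion]["opposite"]

if __name__ == "__main__":
    print(opposite_of("anger"))                 # fear
    print(PLUTCHIK_WHEEL["anger"]["intense"])   # rage
```

Having the labels in a structure like this is also handy later on, when we want to map model outputs onto a fixed set of emotion classes.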

Dimensional

The dimensional approach posits that to understand emotional experiences, the fundamental dimensions of valence (the pleasantness or unpleasantness of the emotion) and arousal (the intensity of the emotion) are vital. This approach suggests that a common and interconnected neurophysiological system is responsible for all affective states. Every emotion can then be defined in terms of these two measures, so emotions can be viewed as points in a continuous two-dimensional plane, with dimensions of valence and arousal, and each point in the plane corresponds to a separate emotional state.

Figure 1.2 – Russell’s circumplex model


The most common dimensional model is Russell’s circumplex model (Russell, 1980; see Figure 1.2). The model posits that emotions are made up of two core dimensions: valence and arousal. Figure 1.2 shows that valence ranges from −1 (unpleasant) to 1 (pleasant), and arousal also ranges from −1 (calm) to 1 (excited). Each emotion is then a linear combination of these two dimensions. For example, anger is an unpleasant emotional state (a negative valence) with a high intensity (a positive arousal). Other basic emotions can be seen in Figure 1.2 with their approximate positions in the two-dimensional space.

Some emotions have similar arousal and valence (for example, grief and rage). Hence, a third dimension (control) has also been suggested that can be used to distinguish between these. Control ranges from no control to full control. So, the entire range of human emotions can be represented as a set of points in the three-dimensional space using these three dimensions.
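The dimensional view lends itself naturally to simple geometry. The following minimal sketch places a handful of emotions at illustrative (valence, arousal) coordinates, loosely read off Figure 1.2 rather than taken from any published dataset, and maps an arbitrary point to the nearest named emotion by Euclidean distance.

```python
import math

# Illustrative (valence, arousal) coordinates, each in [-1, 1],
# loosely based on the positions in Russell's circumplex model.
EMOTION_COORDS = {
    "happy":   ( 0.8,  0.5),
    "excited": ( 0.6,  0.9),
    "calm":    ( 0.6, -0.7),
    "bored":   (-0.5, -0.7),
    "sad":     (-0.8, -0.4),
    "angry":   (-0.7,  0.8),
    "afraid":  (-0.6,  0.7),
}

def nearest_emotion(valence: float, arousal: float) -> str:
    """Return the named emotion whose point lies closest to (valence, arousal)."""
    return min(
        EMOTION_COORDS,
        key=lambda e: math.dist(EMOTION_COORDS[e], (valence, arousal)),
    )

if __name__ == "__main__":
    # An unpleasant, highly aroused state should land near anger or fear.
    print(nearest_emotion(-0.65, 0.75))
```

Adding the third (control) dimension would simply mean storing triples instead of pairs; the nearest-point idea stays the same.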

The dimensional model has a poorer resolution of emotions; that is, it is harder to distinguish between ambiguous emotions. The categorical model is simpler to understand, but some emotions are not part of the set of basic emotions.

Most emotion analysis research uses a categorical perspective; there seems to be a lack of research using the dimensional approach.

 

Sentiment

There is a second closely-related term known as sentiment. The terms sentiment and emotion seem to be used in an ad hoc manner, with different writers using them almost interchangeably. Given the difficulty we have found in working out what emotions are, and in deciding exactly how many emotions there are, having yet another ill-defined term is not exactly helpful. To try to clarify the situation, note that when people work on sentiment mining, they generally make use of a simple, limited system of classification using positive, negative, and neutral cases. This is a much simpler scheme to process and ascertain, and yields results that are also easier to understand. In some ways, emotion analysis may be regarded as an upgrade to sentiment analysis; a more complex solution that analyzes much more than the simple positive and negative markers and instead tries to determine specific emotions (anger, joy, sadness). This may be more useful but also involves much more effort, time, and cost. Emotion and sentiment are, thus, not the same. An emotion is a complex psychological state, whereas a sentiment is a mental attitude that is created through the very existence of the emotion.

For us, sentiment refers exclusively to an expressed opinion that is positive, negative, or neutral. There is some degree of overlap here because, for example, emotions such as joy and love could both be considered positive sentiments. It may be that the terms simply have different granularity – in the same way that ecstasy, joy, and contentment provide a fine-grained classification of a single generic emotion class that we might call happiness, happiness and love are a fine-grained classification of the general notion of feeling positive. Alternatively, it may be that sentiment is the name for one of the axes in the dimensional model – for example, the valence axis in Russell’s analysis. Given the range of theories of emotion, it seems best to just avoid having another term for much the same thing. In this book, we will stick to the term emotion; we will take an entirely pragmatic approach by accepting some set of labels from an existing theory such as Plutchik’s or Russell’s as denoting emotions, without worrying too much about what it is that they denote. We can all agree that I hate the people who did that and I wish they were all dead expresses hate and anger, and that it is overall negative, even if we’re not sure what hate and anger are or what the scale from negative to positive actually measures.
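Because sentiment can be seen as a coarser-grained view of emotion, one pragmatic way to bridge the two label sets is a many-to-one mapping from emotions to polarities. The mapping below is only a sketch, and deciding which emotions count as positive, negative, or neutral is itself a modeling choice.

```python
# A sketch of collapsing fine-grained emotion labels into coarse sentiment labels.
# The particular mapping is a modeling choice, not a standard.
EMOTION_TO_SENTIMENT = {
    "joy": "positive",
    "trust": "positive",
    "anticipation": "positive",
    "anger": "negative",
    "disgust": "negative",
    "fear": "negative",
    "sadness": "negative",
    "surprise": "neutral",   # surprise can be pleasant or unpleasant
}

def to_sentiment(emotion: str) -> str:
    """Map an emotion label onto a positive/negative/neutral sentiment label."""
    return EMOTION_TO_SENTIMENT.get(emotion, "neutral")

print(to_sentiment("anger"))   # negative
print(to_sentiment("joy"))     # positive
```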

Now that we know a bit more about what emotion is and how it is categorized and understood, it is essential to understand why emotion analysis is an important topic.

 

Why emotion analysis is important

The amount of data generated daily from online sources such as social media and blogs is staggering. In 2019, Forbes estimated this to be around 2.5 quintillion bytes of data, though this figure is more than likely even higher now. Due to this, much research has focused on using this data for analysis and for gaining hitherto unknown insights (for example, predicting flu trends and disease outbreaks using Twitter (now known as “X”) data).

Similarly, people are also increasingly expressing their opinions online – and many of these opinions are, explicitly or implicitly, highly emotional (for example, I love summer). Nowadays, social network platforms such as Facebook, LinkedIn, and Twitter are at the hub of everything we do. Twitter is one of the most popular social network platforms, with more than 300 million monthly active users. Twitter is used by people from all walks of life: celebrities, movie stars, politicians, sports stars, and everyday people. Users post short messages, known as tweets, and, every day, millions share their opinions about themselves, news, sports, movies, and other topics. Consequently, this makes platforms such as Twitter rich sources of data for public opinion mining and sentiment analysis.

As we have seen, emotions play an important role in human intelligence, decision-making, social interaction, perception, memory, learning, creativity, and much much more.

Emotion analysis is the process of recognizing the emotions that are expressed in texts (for example, social media posts). It is a complex task because user-generated content, such as tweets, is typically:

  • Written in natural language
  • Often unstructured, informal, and misspelled
  • Liable to contain slang and made-up words
  • Liable to contain emojis and emoticons whose usage does not always correspond to their original purpose (for example, using the pizza emoji to express love)

Furthermore, it is also entirely possible to express emotion without using any obvious emotion markers.
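To give a flavor of the cleanup these characteristics force on us, here is a deliberately minimal sketch of tweet normalization: lowercasing, stripping URLs and user mentions, and expanding a few slang terms and emojis via tiny hand-made lookup tables. The tables and the example tweet are invented for illustration; real preprocessing, covered in Chapter 4, is considerably more involved.

```python
import re

# Tiny, hand-made lookup tables purely for illustration.
SLANG = {"gr8": "great", "luv": "love", "2day": "today"}
EMOJI = {"🍕": "pizza", "😢": "sad", "❤️": "love"}

def normalize_tweet(text: str) -> str:
    """A minimal sketch of tweet normalization; not production preprocessing."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    text = re.sub(r"@\w+", " ", text)           # strip user mentions
    text = re.sub(r"#", "", text)               # keep hashtag words, drop the '#'
    for emoji, word in EMOJI.items():
        text = text.replace(emoji, f" {word} ")
    tokens = [SLANG.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

print(normalize_tweet("@bbc I luv this pizza 🍕 gr8 day!! https://t.co/xyz #happy"))
# -> "i love this pizza pizza great day!! happy"
```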

One of the big unsolved problems in emotion analysis is detecting emotions such as anticipation, pessimism, and sarcasm. Consider the following tweet:

We lost again. Great.

We humans are fairly good at drilling down to the true meaning that is implied and would understand that the user was being sarcastic: we know that a team losing again is not a good thing, and by making use of this knowledge, we can easily identify the implied meaning.

The problem is that simply considering each word that has sentiment in isolation will not do a good job. Instead, further rules must be applied to understand the context of the word. These rules will help the analyzer differentiate between sentences that might contain similar words but have completely different meanings. However, even with these rules, analyzers will still make mistakes.
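The point is easy to demonstrate with a toy, word-by-word scorer. The lexicon weights below are invented for illustration; because great carries a positive weight and lost only a mild negative one, the sarcastic tweet above comes out as positive.

```python
# A deliberately naive word-level scorer, with invented lexicon weights,
# to show why context-free scoring misreads sarcasm.
TOY_LEXICON = {"great": 1.0, "lost": -0.3}

def naive_score(text: str) -> float:
    """Sum per-word scores, ignoring context entirely."""
    tokens = text.lower().replace(".", " ").split()
    return sum(TOY_LEXICON.get(tok, 0.0) for tok in tokens)

tweet = "We lost again. Great."
score = naive_score(tweet)                        # 1.0 - 0.3 = 0.7
print("positive" if score > 0 else "negative")    # prints "positive" -- wrong for a sarcastic tweet
```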

Social media is now viewed as a valuable resource, so organizations are showing an increased interest in social media monitoring to analyze the massive amounts of free-form, short, user-generated text posted on social media sites. Exploiting these allows organizations to gain insights into their customers’ opinions, concerns, and needs regarding their products and services.

Because social media operates in real time, governments are also interested in using it to identify threats and to monitor and analyze public responses to current events.

Emotion analysis has many interesting applications:

  • Marketing: Lots of Twitter users follow brands (for example, Nike), so there are many marketing opportunities. Twitter can help spread awareness of a brand, generate leads, drive traffic to sites, build a customer base, and more. Some of the biggest marketing campaigns of previous years include #ShareACoke by Coca-Cola, #WantAnR8 by Audi, and #BeTheFastest by Virgin Media.
  • Stock markets: Academics have attempted to use Twitter to anticipate trends in financial markets. In 2013, the Associated Press Twitter account posted a (false) tweet stating that there had been explosions in the White House and that Obama was injured. The post was debunked very quickly but the stock markets still took a nosedive, resulting in hundreds of billions of dollars changing hands.
  • Social studies: Millions of people regularly interact with the world by tweeting, providing invaluable insights into their feelings, actions, routines, emotions, and behavior. This vast amount of public communication can be used to generate forecasts of various types of events. For example, large-scale data analysis of social media has demonstrated that not only did Brexit supporters have a more powerful, emotional message, but they were also more effective in the use of social media. They routinely outmuscled their rivals and had more vocal and active supporters across nearly all social media platforms. This led to the activation of a greater number of Leave supporters and enabled them to dominate social media platforms – thus influencing many undecided voters.

Gaining an understanding of emotions is also important for organizations seeking insights into public opinion about their products and services. However, it is also important to automate this process so that decisions can be made and actions can be taken in real time. For example, analysis techniques can automatically analyze and process thousands of reviews about a particular product and extract insights that show whether consumers are satisfied with the product or service. This can be done at the level of sentiment or emotion, although emotion may be more useful because it is more granular.

Research has shown that tweets posted by dissatisfied users are shared more often and spread faster and wider than other types of tweets. Therefore, organizations have to provide customer service beyond the old-fashioned agent at the end of the phone line. Due to this, many organizations today also provide social media-based customer support in an attempt to head off bad reviews and give a good impression. Nowadays, there is so much consumer choice, and it is so much easier for customers to switch to competitors, that it is vitally important for organizations to retain and increase their customer base. Hence, the quicker an organization reacts to a bad post, the better chance it has of retaining the customer. Furthermore, there is no better advertising than word of mouth – such as that generated by happy customers. Emotion analysis is one way to quickly analyze hundreds of tweets, find the ones where customers are unhappy, and use this to drive other processes that attempt to resolve the problem before the customer becomes too unhappy and decides to take their business elsewhere. Emotion analysis not only requires data – it also generates a lot of data. This data can be further analyzed to determine, for example, what the top items on user wishlists are, or what the top user gripes are. These can then be used to drive the next iteration or version of the product or service.

Although sentiment analysis and emotion analysis are not mutually exclusive and can be used in conjunction, the consensus is that sentiment analysis is not adequate for classifying something as complex, multi-layered, and nuanced as emotion. Simply taking the whole range of emotions and considering them as only positive, negative, or neutral runs the considerable risk of missing out on deeper insights and understandings.

Emotion analysis also provides more in-depth insights: understanding why someone ignored or liked a post, and turning that understanding into actionable insights, requires more than just a sentiment score.

Emotion analysis is a sub-field of NLP, so it makes sense to gain a better understanding of that next.

 

Introduction to NLP

Sentiment mining is about finding the sentiments that are expressed by natural language texts – often quite short texts such as tweets and online reviews, but also larger items such as newspaper articles. There are many other ways of getting computers to do useful things with natural language texts and spoken language: you can write programs that can have conversations (with people or with each other), you can write programs to extract facts and events from articles and stories, you can write programs to translate from one language to another, and so on. These applications all share some basic notions and techniques, but they each lay more emphasis on some topics and less on others. In Chapter 4, Preprocessing – Stemming, Tagging, and Parsing, we will look at the things that matter most for sentiment mining, but we will give a brief overview of the main principles of NLP here. As noted, not all of the stages outlined here are needed for every application, but it is nonetheless useful to have a picture of how everything fits together when considering specific subtasks later.

We will start with a couple of basic observations:

  • Natural language is linear. The fundamental form of language is speech, which is necessarily linear. You make one sound, and then you make another, and then you make another. There may be some variation in the way you make each sound – louder or softer, with a higher pitch or a lower one, quicker or slower – and this may be used to overlay extra information on the basic message, but fundamentally, spoken language is made up of a sequence of identifiable units, namely sounds; and since written language is just a way of representing spoken language, it too must be made up of a sequence of identifiable units.
  • Natural language is hierarchical. Smaller units are grouped into larger units, which are grouped into larger units, which are grouped into larger units, and so on. Consider the sentence smaller units are grouped into larger units. In the written form of English, for instance, the smallest units are characters; these are grouped into morphemes (meaning-bearing word-parts), as small er unit s are group ed into large er unit s, which are grouped into words (small-er unit-s are group-ed into large-er unit-s), which are grouped into base-level phrases ([small-er unit-s] [are group-ed] [into] [large-er unit-s]), which are grouped into higher-level phrases ([[small-er unit-s] [[are group-ed] [[into] [large-er unit-s]]]]).

These two properties hold for all natural languages. All natural languages were spoken before they were written (some widely spoken languages have no universally accepted written form!), and hence are fundamentally linear. But they all express complex hierarchical relations, and hence to understand them, you have to be able to find the ways that smaller units are grouped into larger ones.
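This hierarchical grouping can be written down directly as a bracketed tree. The following minimal sketch uses NLTK’s Tree class (assuming NLTK is installed); the phrase labels (S, NP, VP, PP) and part-of-speech tags are conventional names, and the bracketing follows the example above.

```python
from nltk import Tree

# The bracketing from the example, written as a labeled phrase-structure tree.
bracketed = """
(S
  (NP (JJ smaller) (NNS units))
  (VP (VBP are) (VBN grouped)
      (PP (IN into) (NP (JJ larger) (NNS units)))))
"""

tree = Tree.fromstring(bracketed)
tree.pretty_print()          # draws the hierarchy as ASCII art
print(tree.leaves())         # ['smaller', 'units', 'are', 'grouped', 'into', 'larger', 'units']
```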

What the bottom-level units are like, and how they are grouped, differs from language to language. The sounds of a language are made by moving your articulators (tongue, teeth, lips, vocal cords, and various other things) around while trying to expel air from your lungs. The sound that you get by closing and then opening your lips with your vocal cords tensed (/b/, as in the English word bat) is different from the sound you get by doing the same things with your lips while your vocal cords are relaxed (/p/, as in pat). Different languages use different combinations – Arabic doesn’t use /p/ and English doesn’t use the sound you get by closing the exit from the chamber containing the vocal cords (a glottal stop): the combinations that are used in a particular language are called its phonemes. Speakers of a language that don’t use a particular combination find it hard to distinguish words that use it from ones that use a very similar combination, and very hard to produce that combination when they learn a language that does.

To make matters worse, the relationship between the bottom-level units in spoken language and written language can vary from language to language. The phonemes of a language can be represented in the written form of that language in a wide variety of ways. The written form may make use of graphemes, which are combinations of ways of making a shape out of strokes and marks (so, the different ways of writing the letter A all involve producing two near-vertical more-or-less-straight lines joined at the top with a cross-piece about half-way up), just as phonemes are combinations of ways of making a sound; a single phoneme may be represented by one grapheme (the short vowel /a/ from pat is represented in English by the character a) or by a combination of graphemes (the sound /sh/ from should is represented by the pair of graphemes s and h); a sound may have no representation in the written form (Arabic text omits short vowels and some other distinctions between phonemes); or there may simply be no connection between the written form and the way it is pronounced (written Chinese, Japanese kanji symbols). Given that we are going to be largely looking at text, we can at least partly ignore the wide variety of ways that written and spoken language are related, but we will still have to be aware that different languages combine the basic elements of the written forms in completely different ways to make up words.

The bottom-level units of a language, then, are either identifiable sounds or identifiable marks. These are combined into groups that carry meaning – morphemes. A morpheme can carry quite a lot of meaning; for example, cat (made out of the graphemes c, a, and t) denotes a small mammal with pointy ears and an inscrutable outlook on life, whereas s just says that you’ve got more than one item of the kind you are thinking about, so cats denotes a group of several small mammals with pointy ears and an opaque view of the world. Morphemes of the first kind are sometimes called lexemes, with a single lexeme combining with one or more other morphemes to express a concept (so, the French lexeme noir (black) might combine with e (feminine) and s (plural) to make noires – several black female things). Morphemes that add information to a lexeme, such as about how many things were involved or when an event happened, are called inflectional morphemes, whereas ones that radically change their meaning (for example, an incomplete solution to a problem is not complete) are called derivational morphemes, since they derive a new concept from the original. Again, most languages make use of inflectional and derivational morphemes to enrich the basic set of lexemes, but exactly how this works varies from language to language. We will revisit this at some length in Chapter 5, Sentiment Lexicons and Vector Space Models, since finding the core lexemes can be significant when we are trying to assign emotions to texts.
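Finding the lexeme inside an inflected word is roughly what stemmers and lemmatizers do, and we will return to them in detail in Chapter 4 and Chapter 5. As a minimal sketch with NLTK (the lemmatizer needs the WordNet data downloaded once), note how the stemmer strips endings rather bluntly while the lemmatizer maps words to dictionary forms:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# import nltk; nltk.download("wordnet")   # needed once for the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["cats", "grouped", "larger", "incomplete"]:
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word))

# Typical output (stemming is heuristic, so some results look odd):
#   cats -> cat / cat
#   grouped -> group / grouped        (the lemmatizer treats words as nouns by default)
#   larger -> larger / larger
#   incomplete -> incomplet / incomplete
```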

A lexeme plus a suitable set of morphemes is often referred to as a word. Words are typically grouped into larger tree-like structures, with the way that they are grouped carrying a substantial part of the message conveyed by the text. In the sentence John believes that Mary expects Peter to marry Susan, for instance, Peter to marry Susan is a group that describes a particular kind of event, Mary expects [Peter to marry Susan] is a group that describes Mary’s attitude to this event, and John believes [that Mary expects [Peter to marry Susan]] is a group that describes John’s view of Mary’s expectation.

Yet again, different languages carry out this kind of grouping in different ways, and there are numerous ways of approaching the task of analyzing the grouping in particular cases. This is not the place for a review of all the grammatical theories that have ever been proposed to analyze the ways that words get grouped together or of all the algorithms that have ever been proposed for applying those theories to specific cases (parsers), but there are a few general observations that are worth making.

Phrase structure grammar versus dependency grammar

In some languages, groups are mainly formed by merging adjacent groups. The previous sentence, for instance, can be analyzed if we group it as follows:

In some languages groups are mainly formed by merging adjacent groups

In [some languages]np groups are mainly formed by merging [adjacent groups]np

[In [some languages]]pp groups are mainly formed by [merging [adjacent groups]]vp

[In [some languages]]pp groups are mainly formed [by [merging [adjacent groups]]]pp

[In [some languages]]pp groups are mainly [formed [by [merging [adjacent groups]]]]vp

[In [some languages]]pp groups are [mainly [formed [by [merging [adjacent groups]]]]]vp

[In [some languages]]pp groups [are [mainly [formed [by [merging [adjacent groups]]]]]]vp

[In [some languages]]pp [groups [are [mainly [formed [by [merging [adjacent groups]]]]]]]s

[[In [some languages]][groups [are [mainly [formed [by [merging [adjacent groups]]]]]]]]s

This tends to work well for languages where word order is largely fixed – no languages have completely fixed word order (for example, the preceding sentence could be rewritten as Groups are mainly formed by merging adjacent groups in some languages with very little change in meaning), but some languages allow more freedom than others. For languages such as English, analyzing the relationships between words in terms of adjacent phrases, such as using a phrase structure grammar, works quite well.

For languages where words and phrases are allowed to move around fairly freely, it can be more convenient to record pairwise relationships between words. The following tree describes the same sentence using a dependency grammar – that is, by assigning a parent word to every word (apart from the full stop, which we are taking to be the root of the tree):

Figure 1.3 – Analysis of “In some languages, groups are mainly formed by merging adjacent groups” using a rule-based dependency parser


There are many variations of phrase structure grammar and many variations of dependency grammar. Roughly speaking, dependency grammar provides an easier handle on languages where words can move around very freely, while phrase structure grammar makes it easier to deal with invisible items such as the subject of merging in the preceding example. The difference between the two is, in any case, less clear than it might seem from the preceding figure: a dependency tree can easily be transformed into a phrase structure tree by treating each subtree as a phrase, and a phrase structure tree can be transformed into a dependency tree if you can specify which item in a phrase is its head – for example, in the preceding phrase structure tree, the head of a group labeled as nn is its noun and the head of a group labeled as np is the head of nn.
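As a concrete illustration of a dependency analysis, the following minimal sketch prints the head and dependency label that spaCy assigns to each word of the example sentence. It assumes spaCy and its small English model, en_core_web_sm, are installed; being a statistical model, it will not necessarily produce the same tree as the rule-based parser behind Figure 1.3, and, as discussed below, its output should always be inspected rather than trusted blindly.

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("In some languages, groups are mainly formed by merging adjacent groups.")

# One row per word: the token, its dependency label, and its parent (head).
for token in doc:
    print(f"{token.text:10s} --{token.dep_:>7s}--> {token.head.text}")
```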

Rule-based parsers versus data-driven parsers

As well as having a theory of how to describe the structure of a piece of text, you need a program that applies that theory to specific texts – a parser. There are two ways to approach the development of a parser:

  • Rule-based: You can try to devise a set of rules that describe the way that a particular language works (a grammar), and then implement a program that tries to apply these rules to the texts you want analyzed (a small sketch of this approach follows this list). Devising such rules is difficult and time-consuming, and programs that try to apply them tend to be slow and to fail if the target text does not obey the rules.
  • Data-driven: You can somehow produce a set of analyses of a large number of texts (a treebank), and then implement a program that extracts patterns from these analyses. Producing a treebank is difficult and time-consuming – you need hundreds of thousands of examples, and the trees all have to be consistently annotated, which means that if this is to be done by people, then they have to be given consistent guidelines that cover every example they will see (which is, in effect, a grammar); and if it is not done by people, then you must already have an automated way of doing it – that is, a parser!
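Here is the small sketch of the rule-based approach promised above: a toy grammar just wide enough to cover one sentence, handed to NLTK’s chart parser. Writing a grammar that covers a realistic fragment of a language is, of course, a vastly larger undertaking, which is exactly the difficulty noted in the first bullet.

```python
import nltk

# A toy grammar, wide enough only for the example sentence.
grammar = nltk.CFG.fromstring("""
  S   -> NP VP
  NP  -> JJ NNS
  VP  -> VBP VBN PP
  PP  -> IN NP
  JJ  -> 'smaller' | 'larger'
  NNS -> 'units'
  VBP -> 'are'
  VBN -> 'grouped'
  IN  -> 'into'
""")

parser = nltk.ChartParser(grammar)
sentence = "smaller units are grouped into larger units".split()
for tree in parser.parse(sentence):
    tree.pretty_print()
```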

Both approaches have advantages and disadvantages: when considering whether to use a dependency grammar or a phrase structure grammar and then when considering whether to follow a rule-based approach or a data-driven one, there are several criteria to be considered. Since no existing system optimizes all of these, you should think about which ones matter most for your application and then decide which way to go:

  • Speed: The first criterion to consider is the speed at which the parser runs. Some parsers can become very slow when faced with long sentences. The worst-case complexity of the standard chart-parsing algorithm for rule-based approaches is O(N³), where N is the length of the sentence, which means that for long sentences, the algorithm can take a very long time. Some other algorithms have much better complexity than this (the MALT (Nivre et al., 2006) and MST (McDonald et al., 2005) parsers, for instance, are linear in the length of the sentence), while others have much worse. If two parsers are equally good according to all the other criteria, then the faster one will be preferable, but there will be situations where one (or more) of the other criteria is more important.
  • Robustness: Some parsers, particularly rule-based ones, can fail to produce any analysis at all for some sentences. This will happen if the input is ungrammatical, but it will also happen if the rules are not a complete description of the language. A parser that fails to produce an analysis for a perfectly grammatical input sentence is less useful than one that can analyze every grammatically correct sentence of the target language. It is less clear that parsers that will do something with every input sentence are necessarily more useful than ones that will reject some sentences as being ungrammatical. In some applications, detecting ungrammaticality is a crucial part of the task (for example, in language learning programs), but in any case, assigning an analysis to an ungrammatical sentence cannot be either right or wrong, and hence any program that makes use of such an analysis cannot be sure that it is doing the right thing.
  • Accuracy: A parser that assigns the right analysis to every input text will generally be more useful than one that does not. This does, of course, beg the question of how to decide what the right analysis is. For data-driven parsers, it is impossible to say what the right analysis of a sentence that does not appear in the treebank is. For rule-based parsers, any analysis that is returned will be right in the sense that it obeys the rules. So, if an analysis looks odd, you have to work out how the rules led to it and revise them accordingly.

There is a trade-off between accuracy and robustness. A parser that fails to return any analysis at all in complex cases will produce fewer wrong analyses than one that tries to find some way of interpreting every input text: the one that simply rejects some sentences will have lower recall but may have higher precision, and that can be a good thing. It may be better to have a system that says Sorry, I didn’t quite understand what you just said than one that goes ahead with whatever it is supposed to be doing based on an incorrect interpretation.

  • Sensitivity and consistency: Sometimes, sentences that look superficially similar have different underlying structures. Consider the following examples:
    1. a) I want to see the queen b) I went to see the queen

1(a) is the answer to What do you want? and 1(b) is the answer to Why did you go? If the structures that are assigned to these two sentences do not reflect the different roles for to see the queen, then it will be impossible to make this distinction:

Figure 1.4 – Trees for 1(a) and 1(b) from the Stanford dependency parser (Dozat et al., 2017)


  2. a) One of my best friends is watching old movies b) One of my favorite pastimes is watching old movies
Figure 1.5 – Trees for 2(a) and 2(b) from the Stanford dependency parser


The Stanford dependency parser (SDP) trees both say that the subject (One of my best friends, One of my favorite pastimes) is carrying out the action of watching old movies – it is sitting in its most comfortable armchair with the curtains drawn and the TV on. The first of these makes sense, but the second doesn’t: pastimes don’t watch old movies. What we need is an equational analysis that says that One of my favorite pastimes and watching old movies are the same thing, as in Figure 1.6:

Figure 1.6 – Equational analysis of “One of my favorite pastimes is watching old movies”


Spotting that 2(b) requires an analysis like this, where my favorite pastime is the predication in an equational use of be rather than the agent of a watching-old-movies event, requires more detail about the words in question than is usually embodied in a treebank.

It can also happen that sentences that look superficially different have very similar underlying structures:

  3. a) Few great tenors are poor b) Most great tenors are rich

This time, the SDP assigns quite different structures to the two sentences:

Figure 1.7 – Trees for 3(a) and 3(b) from the SDP


The analysis of 3(b) assigns most as a modifier of great, whereas the analysis of 3(a) assigns few as a modifier of tenors. Most can indeed be used for modifying adjectives, as in He is the most annoying person I know, but in 3(b), it is acting as something more like a determiner, just as few is in 3(a).

  4. a) There are great tenors who are rich b) Are there great tenors who are rich?

It is clear that 4(a) and 4(b) should have almost identical analyses – 4(b) is just 4(a) turned into a question. Again, this can cause problems for treebank-based parsers:

Figure 1.8 – Trees for 4(a) and 4(b) from MALTParser


The analysis in Figure 1.8 for 4(a) makes are the head of the tree, with there, great tenors who are rich, and the full stop as daughters, whereas 4(b) is given tenors as its head and are, there, great, who are rich, and ? as daughters. It would be difficult, given these analyses, to see that 4(a) is the answer to 4(b)!

Treebank-based parsers frequently fail to cope with issues of the kind raised by the examples given here. The problem is that the treebanks on which they are trained tend not to include detailed information about the words that appear in them – that went is an intransitive verb and want requires a sentential complement, that friends are human and can therefore watch old movies while pastimes are events, and can therefore be equated with the activity of watching something, or that most can be used in a wide variety of ways.

It is not possible to say that all treebank-based parsers suffer from these problems, but several very widely used ones (the SDP, the version of MALT distributed with the NLTK, the EasyCCG parser (Lewis & Steedman, 2014), spaCy (Kitaev & Klein, 2018)) do. Some of these issues are fairly widespread (the failure to distinguish 1(a) and 1(b)), and some arise because of specific properties of either the treebank or the parsing algorithm. Most of the pre-trained models for parsers such as MALT and spaCy are trained on the well-known Wall Street Journal corpus, and since this treebank does not distinguish between sentences such as 1(a) and 1(b), it is impossible for parsers trained on it to do so. All the parsers listed previously assign different structures to 3(a) and 3(b), which may be a characteristic of the treebank or it may be some property of the training algorithms. It is worth evaluating the output of any such parser to check that it does give distinct analyses for obvious cases such as 1(a) and 1(b) and does give parallel analyses for obvious cases such as 4(a) and 4(b).

So, when choosing a parser, you have to weigh up a range of factors. Do you care if it sometimes makes mistakes? Do you want it to assign different trees to texts whose underlying representations are different (this isn’t quite the same as accuracy because it could happen that what the parser produces isn’t wrong, it just doesn’t contain all the information you need, as in 1(a) and 1(b))? Do you want it to always produce a tree, even for texts that don’t conform to any of the rules of normal language (should it produce a parse for #anxious don’t know why ................. #worry 😶 slowly going #mad hahahahahahahahaha)? Does it matter if it takes 10 or 20 seconds to parse some sentences? Whatever you do, do not trust what anyone says about a parser: try it for yourself, on the data that you are intending to use it on, and check that its output matches your needs.

Semantics (the study of meaning)

As we’ve seen, finding words, assigning them to categories, and finding the relationships between them is quite hard work. There would be no point in doing this work unless you had some application in mind that could make use of it. The key here is that the choice of words and the relationships between them are what allow language to carry messages, to have meaning. That is why language is important: it carries messages. Almost all application programs that do anything with natural language are concerned with the message carried by the input text, so almost all such programs have to identify the words that are present and the way they are arranged.

The study of how language encodes messages is known as semantics. As just noted, the message is encoded by the words that are present (lexical semantics) and the way they are arranged (compositional semantics). They are both crucial: you can’t understand the difference between John loves Mary and John hates Mary if you don’t know what loves and hates mean, and you can’t understand the difference between John loves Mary and Mary loves John if you don’t know how being the subject or object of a verb encodes the relationship between the things denoted by John and Mary and the event denoted by loves.

The key test for a theory of semantics is the ability to carry out inference between sets of natural language texts. If you can’t do the inferences in 1–6 (where P1, …, Pn |- Q means that Q can be inferred from the premises P1, …, Pn), then you cannot be said to understand English:

  1. John hates Mary |- John dislikes Mary
  2. (a) John and Mary are divorced |- John and Mary are not married
     (b) John and Mary are divorced |- John and Mary used to be married
  3. I saw a man with a big nose |- I saw a man
  4. Every woman distrusts John, Mary is a woman |- Mary distrusts John
  5. I saw more than three pigeons |- I saw at least four birds
  6. I doubt that she saw anyone |- I do not believe she saw a fat man

These are very simple inferences. If someone said that the conclusions didn’t follow from the premises, you would have to say that they just don’t understand English properly. They involve a range of different kinds of knowledge – simple entailment relationships between words (hates entails dislikes (1)); more complex relationships between words (getting divorced means canceling an existing marriage (2), so if John and Mary are divorced, then they are not now married but at one time they were); the fact that a man with a big nose is something that is a man and has a big nose plus the fact that A and B entails A (3); an understanding of how quantifiers work ((4) and (5)); combinations of all of these (6) – but they are all inferences that anyone who understands English would agree with.

Some of this information can be fairly straightforwardly extracted from corpora. There is a great deal of work, for instance, on calculating the similarity between pairs of words, though extending that to cover entailments between words has proved more difficult. Some of it is much more difficult to find using data-driven methods – the relationships between more than and at least, for instance, cannot easily be found in corpora, and the complex concepts that lie behind the word divorce would also be difficult to extract unsupervised from a corpus.
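As a minimal sketch of what can be read off an existing lexical resource, WordNet (accessed here through NLTK, and requiring the WordNet data to be downloaded once) records that hate is a kind of dislike, which licenses inference 1 above. Whether any particular pair of words is covered depends entirely on WordNet’s coverage, so this is an illustration rather than a general solution.

```python
from nltk.corpus import wordnet as wn   # needs nltk.download("wordnet") once

def is_kind_of(specific: str, general: str, pos: str = "v") -> bool:
    """Rough check: does any sense of `specific` have `general` among its hypernyms?"""
    general_synsets = set(wn.synsets(general, pos=pos))
    for synset in wn.synsets(specific, pos=pos):
        for path in synset.hypernym_paths():
            if general_synsets & set(path):
                return True
    return False

# In standard WordNet, the verb "hate" is listed as a kind of "dislike",
# so this prints True and licenses inference 1 above.
print(is_kind_of("hate", "dislike"))
```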

Furthermore, some of it can be applied by using tree-matching algorithms of various kinds, from simple algorithms that just compute whether one tree is a subtree of another to more complex approaches that pay attention to polarity (that doubt flicks a switch that turns the direction of the matching algorithm round – I know she loves him |- I know she likes him, I doubt she likes him |- I doubt she loves him) and to the relationships between quantifiers (the |- some, more than N |- at least N-1) (Alabbas & Ramsay, 2013) (MacCartney & Manning, 2014). Some of it requires more complex strategies, in particular examples with multiple premises (4), but all but the very simplest (for example, just treating a sentence as a bag of words) require accurate, or at least consistent, trees.

Exactly how much of this machinery you need depends on your ultimate application. Fortunately for us, sentiment mining can be done reasonably effectively with fairly shallow approaches, but it should not be forgotten that there is a great deal more to understanding a text than simply knowing lexical relationships such as similarity or subsumption between words.

Before wrapping up this chapter, we will spend some time learning about machine learning, looking at various machine learning models, and then working our way through a sample project using Python.

 

Introduction to machine learning

Before discussing machine learning, it makes sense to properly understand the term artificial intelligence. Broadly speaking, artificial intelligence is a branch of computer science built around the idea that machines can be made to think and act like humans, without explicit programming instructions.

There is a common misconception that artificial intelligence is a new thing. The term is widely considered to have been coined in 1956 by John McCarthy, then an assistant professor of mathematics, at the Dartmouth Summer Research Project on Artificial Intelligence. We are now in an AI boom – but it was not always so; artificial intelligence has a somewhat chequered history. Following the 1956 conference, funding flowed generously and rapid progress was made as researchers developed systems that could play chess and solve mathematical problems. Optimism was high, but progress stalled because early promises about artificial intelligence could not be fulfilled, and the funding dried up; this cycle was repeated in the 1980s. The current boom is due to the timely emergence of three key technologies:

  • Big data: Giving us the amounts of data required for artificial intelligence
  • High-speed high-capacity storage devices: Giving us the ability to store the data
  • GPUs: Giving us the ability to process the data

Nowadays, AI is everywhere. Here are some examples of AI:

  • Chatbots (for example, customer service chatbots)
  • Amazon Alexa, Apple’s Siri, and other smart assistants
  • Autonomous vehicles
  • Spam filters
  • Recommendation engines

According to experts, there are four types of AI:

  • Reactive: This is the simplest type and involves machines programmed to always respond in the same predictable manner. They cannot learn.
  • Limited memory: This is the most common type of AI in use today. It combines pre-programmed information with historical data to perform tasks.
  • Theory of mind: This is a technology we may see in the future. The idea here is that a machine with a theory of mind AI will understand emotions, and then alter its own behavior accordingly as it interacts with humans.
  • Self-aware: This is the most advanced type of AI. Machines that are aware of their own emotions, and of the emotions of those around them, will have a level of intelligence similar to human beings and will be able to make assumptions, inferences, and deductions. This is certainly one for the future, as the technology for it doesn’t exist just yet.

Machine learning is one way to exploit AI. Writing software programs to cater to all situations, occurrences, and eventualities is time-consuming, requires effort, and, in some cases, is not even possible. Consider the task of recognizing pictures of people. We humans can handle this task easily, but the same is not true for computers. Even more difficult is programming a computer to do this task. Machine learning tackles this problem by getting the machine to program itself by learning through experiences.

There is no universally agreed-upon definition of machine learning that everyone subscribes to. Some attempts include the following:

  • A branch of computer science that focuses on the use of data and algorithms to imitate the way that humans learn
  • The capability of machines to imitate intelligent human behavior
  • A subset of AI that allows machines to learn from data without being programmed explicitly

Machine learning needs data – and sometimes lots and lots of it.

Lack of data is a significant weak spot in AI. Without a reasonable amount of data, machines cannot perform and generate sensible results. Indeed, in some ways, this is just like how we humans operate – we look and learn and then apply that knowledge in new, unknown situations.

And, if we think about it, everyone has data. From the smallest sole trader to the largest organization, everyone will have sales data, purchase data, customer data, and more. The format of this data may differ between different organizations, but it is all useful data that can be used in machine learning. This data can be collected and processed and can be used to build machine learning models. Typically, this data is split into the following sets:

  • Training set: This is always the largest of the datasets (typically 80%) and is the data that is used to train the machine learning models.
  • Development set: This dataset (10%) is used to tweak and try new parameters to find the ones that work the best for the model.
  • Test set: This is used to test (validate) the model (10%). The model has already seen the training data, so it cannot be used to test the model, hence this dataset is required. This dataset also allows you to determine whether the model is working well or requires more training.

It is good practice to have both development and test datasets. The process of building models involves finding the set of parameters that gives the best results, and these parameters are determined by making use of the development set. Without a development set, we would be reduced to using the same data for training, testing, and evaluation. This is not only undesirable but can also present further problems unless handled carefully; for example, the splits should be constructed such that the class proportions of the original dataset are preserved across the training and test sets. Furthermore, as a general point, training data should be checked for the following:

  • It is relevant to the problem
  • It is large enough such that all use cases of the model are covered
  • It is unbiased and contains no imbalance toward any particular category

Modern toolkits such as sklearn (Pedregosa et al., 2011) provide ready-made functions that will easily split your dataset for you:

from sklearn.model_selection import train_test_split

# Split data and labels 80/20, preserving the class proportions (stratify)
res = train_test_split(data, labels,
    train_size=0.8,
    test_size=0.2,
    random_state=42,
    stratify=labels)
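
The preceding call produces a single train/test split. To obtain the 80/10/10 train/development/test division described earlier, train_test_split can simply be applied twice – a minimal sketch, assuming data and labels have been loaded as before:

from sklearn.model_selection import train_test_split

# First carve off 80% for training, preserving the class proportions
X_train, X_rest, y_train, y_rest = train_test_split(
    data, labels, train_size=0.8, random_state=42, stratify=labels)
# Then split the remaining 20% equally into development and test sets
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, train_size=0.5, random_state=42, stratify=y_rest)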

However, there are times when the data scientist will not have enough data available to be able to warrant splitting it multiple ways – for example, there is no data relevant to the problem, or the process to collect the data is too difficult, expensive, or time-consuming. This is known as data scarcity and it can be responsible for poor model performance. In such cases, various solutions may help alleviate the problem:

  • Augmentation: For example, taking an image and performing processing (for example, rotation, scaling, and modifying the colors) so that new instances are slightly different
  • Synthetic data: Data that is artificially generated using computer programs

To evaluate models where data is scarce, a technique known as k-fold cross-validation is used. This is discussed more fully in Chapter 2; briefly, the dataset is split into a number (k) of groups, then, in turn, each group is taken as the test dataset, with the remaining groups as the training dataset, and the model is fit and evaluated. This is repeated for each group, so each member of the original dataset is used in the test dataset exactly once and in a training dataset k-1 times. Finally, the model accuracy is calculated from the results of the individual evaluations.
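
As a minimal sketch of what this looks like with sklearn (assuming data and labels as in the splitting example above, and using a k-NN classifier purely as an illustration):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 5-fold cross-validation: the model is fit and evaluated five times, each
# time holding out a different fifth of the data as the test fold
scores = cross_val_score(KNeighborsClassifier(), data, labels, cv=5)
print(scores.mean())   # the overall accuracy is the mean of the fold scores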

This poses an interesting question about how much data is needed. There are no hard-and-fast rules but, generally speaking, the more the better. However, regardless of the amount of data, there are typically other issues that need to be addressed:

  • Missing values
  • Inconsistencies
  • Duplicate values
  • Ambiguity
  • Inaccuracies

Machine learning is important. It has many real-world applications that can allow businesses and individuals to save time, money, and effort by, for example, automating business processes. Consider a customer service center where staff are required to take calls, answer queries, and help customers. In such a scenario, machine learning can be used to handle some of the simpler, repetitive tasks, relieving the burden on staff and getting things done more quickly and efficiently.

Machine learning has dramatically altered the traditional ways of doing things over the past few years. However, in many aspects, it still lags far behind human levels of performance. Often, the best solutions are hybrid human-in-the-loop solutions where humans are needed to perform final verification of the outcome.

There are several types of machine learning:

  • Supervised learning
  • Unsupervised learning
  • Semi-supervised learning
  • Reinforcement learning

Supervised learning models must be trained with labeled data. Hence, both the inputs and the outputs of the model are specified. For example, a machine learning model could be trained with human-labeled images of apples and other fruits, labeled as apple and non-apple. This would allow the machine to learn the best way to identify pictures of apples. Supervised machine learning is the most common type of machine learning used today. In some ways, this matches how we humans function; we look and learn from experiences and then apply that knowledge in unknown, new situations to work out an answer. Technically speaking, there are two main types of supervised learning problems:

  • Classification: Problems that involve predicting labels (for example, apple)
  • Regression: Problems that involve predicting a numerical value (for example, a house price)

Both of these types of problems can have any number of inputs of any type. These problems are known as supervised because the correct output is supplied by a “teacher” that shows the system what to do.

Unsupervised learning is a type of machine learning that, in contrast to supervised learning, involves training algorithms on unlabeled data. Unsupervised algorithms examine datasets looking for meaningful patterns or trends that would not otherwise be apparent – that is, the aim is for the algorithm to find the structure in the data on its own. For example, unsupervised machine learning algorithms can examine sales data and pinpoint the different types of products being purchased. However, although these models can perform more complex tasks than their supervised counterparts, they are also much more unpredictable. Some use cases that adopt this approach are as follows:

  • Dimensionality reduction: The process of reducing the number of inputs into a model by identifying the key (principal) components that capture most of the information in the data.
  • Association rules: The process of finding associations between different inputs in the input dataset by discovering the probabilities of the co-occurrence of items. For example, when people buy ice cream, they also typically buy sunglasses.
  • Clustering: Finds hidden patterns in a dataset based on similarities or differences and groups the data into clusters or groups. Unsupervised learning can be used to perform clustering when the exact details of the clusters are unknown.
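
For example, clustering can be sketched in a few lines using sklearn’s KMeans – here on a small, made-up set of two-dimensional points rather than real data:

from sklearn.cluster import KMeans

# A few made-up two-dimensional points forming two loose groups
points = [[1, 2], [1, 4], [2, 3], [8, 8], [9, 9], [8, 10]]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)   # one cluster label per point, e.g. [0 0 0 1 1 1]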

Semi-supervised learning is, unsurprisingly, a combination of supervised and unsupervised learning. A small amount of labeled data and a large amount of unlabeled data is used. This has the benefits of both unsupervised and supervised learning but at the same time avoids the challenges of requiring large amounts of labeled data. Consequently, models can be trained to label data without requiring huge amounts of labeled training data.

Reinforcement learning is about learning the best behavior so that the maximum reward is achieved. This behavior is learned by interacting with the environment and observing how it responds. In other words, the sequence of actions that maximize the reward must be independently discovered via a trial-and-error process. In this way, the model can learn the actions that result in success in an unseen environment.
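
As a toy illustration of this trial-and-error process (a hypothetical two-armed bandit, not a full reinforcement learning algorithm), an agent can keep a running estimate of the reward for each action, mostly choose the action that has worked best so far, and occasionally explore the alternative:

import random

rewards = {"A": 0.3, "B": 0.7}      # hidden probability of a reward for each action
estimates = {"A": 0.0, "B": 0.0}    # the agent's running reward estimates
counts = {"A": 0, "B": 0}

for step in range(1000):
    if random.random() < 0.1:                      # explore 10% of the time
        action = random.choice(["A", "B"])
    else:                                          # otherwise exploit the best estimate
        action = max(estimates, key=estimates.get)
    reward = 1 if random.random() < rewards[action] else 0
    counts[action] += 1
    # update the running average reward for the chosen action
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)   # the estimate for "B" should end up higher than for "A"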

Briefly, here are the typical steps that are followed in a machine learning project:

  1. Data collection: Data can come from a database, Excel, or text file – essentially it can come from anywhere.
  2. Data preparation: The quality of the data used is crucial. Hence, time must be spent fixing issues such as missing data and duplicates. Initial exploratory data analysis (EDA) is performed on the data to discover patterns, spot anomalies, and test theories about the data by using visual techniques.
  3. Model training: An appropriate algorithm and model is chosen to represent the data. The data is split into training data for developing the model and test data for testing the model.
  4. Evaluation: To test the accuracy, the test data is used.
  5. Improve performance: Here, a different model may be chosen, or other inputs may be used.

Let’s start with the technical requirements.

Technical requirements

This book describes a series of experiments with machine learning algorithms – some standard algorithms, some developed especially for this book. These algorithms, along with various worked examples, are available as Python programs at https://github.com/PacktPublishing/Machine-Learning-for-Emotion-Analysis/tree/main, split into directories corresponding to the chapters in which the specific algorithms will be discussed.

One of the reasons why we implemented these programs in Python is that there is a huge amount of useful material to build upon. In particular, there are good-quality, efficient implementations of several standard machine learning algorithms, and using these helps us be confident that where an algorithm doesn’t work as well as expected on some dataset, it is because the algorithm isn’t very well suited to that dataset, rather than that we just haven’t implemented it very well. Some of the programs in the repository use very particular libraries, but there are several packages that we will use throughout this book. These are listed here. If you are going to use the code in the repository – which we hope you will because looking at what actual programs do is one of the best ways of learning – you will need to install these libraries. Most of them can be installed very easily, either by using the built-in package installer pip or by following the directions on the relevant website:

  • pandas: This is one of the most commonly used libraries and is used primarily for cleaning and preparing data, as well as analyzing tabular data. It provides tools to explore, clean, manipulate, and analyze all types of structured data. Typically, machine learning libraries and projects use pandas structures as inputs. You can install it by typing the following command in the command prompt:
    pip install pandas
Alternatively, you can go to https://pandas.pydata.org/docs/getting_started/install.html for other options.
  • NumPy: This is used primarily for its support of N-dimensional arrays. It has functions for linear algebra and matrices and is also used by other libraries. Python provides several collection classes that can be used to represent arrays, notably lists, but they are computationally slow to work with – NumPy provides objects that are up to 50 times faster than Python lists. To install it, run the following command in the command prompt:
    pip install numpy

Alternatively, you can refer to the documentation for more options: https://numpy.org/install/.

  • SciPy: This provides a range of scientific functions built on top of NumPy, including ways of representing sparse arrays (arrays where most elements are 0) that can be manipulated thousands of times faster than standard NumPy arrays if the vast majority of elements are 0. You can install it using the following command:
    pip install scipy

You can also refer to the SciPy documentation for more details: https://scipy.org/install/.
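
As a small illustration of why this matters, the following sketch builds a mostly-zero matrix and converts it to SciPy’s compressed sparse row (CSR) format, which stores only the non-zero entries:

import numpy as np
from scipy.sparse import csr_matrix

dense = np.zeros((1000, 1000))   # a million cells, almost all of them 0
dense[0, 1] = 3.0
dense[42, 7] = 1.5
sparse = csr_matrix(dense)       # stores only the two non-zero values
print(sparse.nnz)                # 2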

  • scikit-learn (Pedregosa et al., 2011): This is used to build machine learning models as it has functions for building supervised and unsupervised machine learning models, analysis, and dimensionality reduction. A large part of this book is about investigating how well various standard machine learning algorithms work on particular datasets, and it is useful to have reliable good-quality implementations of the most widely used algorithms so that we are not distracted by issues due to the way we have implemented them.

scikit-learn is also known as sklearn – when you want to import it into a program, you should refer to it as sklearn. You can install it as follows:

pip install scikit-learn

Refer to the documentation for more information: https://scikit-learn.org/stable/install.html.

The sklearn implementations of the various algorithms generally make the internal representations of the data available to other programs. This can be particularly valuable when you are trying to understand the behavior of some algorithm on a given dataset and is something we will use extensively as we carry out our experiments.
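
For example (a hypothetical illustration rather than code from the repository), after fitting a decision tree you can inspect how much each input feature contributed to its decisions:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
dt = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
print(dt.feature_importances_)   # one importance value per input feature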

  • TensorFlow: This is a popular library for building neural networks as well as performing other tasks. It uses tensors (multi-dimensional arrays) to perform operations. It is built to take advantage of parallelism, so it is used to train neural networks in a highly efficient manner. Again, it makes sense to reuse a reliable good-quality implementation when testing neural network models on our data so that we know that any poor performances arise because of problems with the algorithm rather than with our implementation of it. As ever, you can just install it using pip:
    pip install tensorflow

For more information, refer to the TensorFlow documentation: https://www.tensorflow.org/install.

You will not benefit from its use of parallelism unless you have a GPU or other hardware accelerator built into your machine, and without one, training complex models is likely to be intolerably slow. We will consider how to use remote facilities such as Google Colab to obtain better performance in Chapter 9, Exploring Transformers. For now, just be aware that running tensorflow on a standard computer without any kind of hardware accelerator probably won’t get you results within a reasonable period.

  • Keras: This is also used for building neural networks. It is built on top of TensorFlow. It creates computational graphs to represent machine learning algorithms, so it is slow compared to other libraries. Keras comes as part of TensorFlow, so there is no need to install anything beyond TensorFlow itself.
  • Matplotlib: This is an interactive library for plotting graphs, charts, plots, and visualizing data. It comes with a wide range of plots that help data scientists understand trends and patterns. Matplotlib is extremely powerful and allows users to create almost any visualization imaginable. Use the following command to install matplotlib:
    pip install matplotlib

Refer to the documentation for more information: https://matplotlib.org/stable/users/installing/index.html.

Matplotlib may install NumPy if you do not have it already installed, but it is more sensible to install them separately (NumPy first).

  • Seaborn: This is built on the top of Matplotlib, and is another library for creating visualizations. It is useful for making attractive plots and helps users explore and understand data. Seaborn makes it easy for users to switch between different visualizations. You can easily install Seaborn by running the following command:
    pip install seaborn

For more installation options, please refer to https://seaborn.pydata.org/installing.html.

We will use these libraries throughout this book, so we advise you to install them now, before trying out any of the programs and examples that we’ll discuss as we go along. You only have to install them once so that they will be available whenever you need them. We will specify any other libraries that the examples depend on as we go along, but from now on, we will assume that you have at least these ones.

A sample project

The best way to learn is by doing! In this section, we will discover how to complete a small machine learning project in Python. Completing, and understanding, this project will allow you to become familiar with machine learning concepts and techniques.

Typically, the first step in developing any Python program is to import the modules that are going to be needed using the import statement:

import sklearn
import pandas as pd

Note

Other imports are needed for this exercise; these can be found in the GitHub repository.

The next step is to load the data that is needed to build the model. Like most tutorials, we will use the famous Iris dataset. The Iris dataset contains data on the length and width of sepals and petals. We will use pandas to load the dataset. The dataset can be downloaded from the internet and read from your local filesystem, as follows:

df = pd.read_csv("c:\iris.csv")

Alternatively, pandas can read it directly from a URL:

df = pd.read_csv("https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv")

The read_csv command returns a DataFrame. It is probably the most commonly used pandas object and is simply a two-dimensional data structure with rows and columns, just like a spreadsheet.

Since we will be using sklearn, it is interesting to see that sklearn also makes it easy to access the dataset:

from sklearn import datasets
iris = datasets.load_iris()
df = iris.data

Note, however, that load_iris returns the data as NumPy arrays rather than a DataFrame; the rest of this example assumes the DataFrame loaded from the CSV file above.

We can now check that the dataset has been successfully loaded by using the describe function:

df.describe()

The describe function returns a descriptive summary of a DataFrame reporting values such as the mean, count, and standard deviation:

       sepal.length  sepal.width  petal.length  petal.width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

This function is useful to check that the data has been loaded correctly but also to provide a first glance at some interesting attributes of the data.

Some other useful commands tell us more about the DataFrame:

  • This shows the first five elements in the DataFrame:
    df.head(5)
  • This shows the last five elements in the DataFrame:
    df.tail(5)
  • This describes the columns of the DataFrame:
    df.columns
  • This describes the number of rows and columns in the DataFrame:
    df.shape

It is usually a good idea to use these functions to check that the dataset has been successfully and correctly loaded into the DataFrame and that everything looks as it should.

It is also important to ensure that the dataset is balanced – that is, there are relatively equal numbers of each class.

The majority of machine learning algorithms have been developed with the assumption that there are equal numbers of instances of each class. Consequently, imbalanced datasets present a big problem for machine learning models as this results in models with poor predictive performance.

In the Iris example, this means that we have to check that we have equal numbers of each type of flower. This can be verified by running the following command:

df['variety'].value_counts()

This prints the following output:

Setosa        50
Versicolor    50
Virginica     50
Name: variety, dtype: int64

We can see that there are 50 examples of each variety. The next step is to create some visualizations. Although we used the describe function to get an idea of the statistical properties of the dataset, it is much easier to observe these in a visual form as opposed to in a table.

Box plots (see Figure 1.9) plot the distribution of data based on the sample minimum, the lower quartile, the median, the upper quartile, and the sample maximum. This helps us analyze the data to establish any outliers and the data variation to better understand each attribute:

import matplotlib.pyplot as plt
attributes = df[['sepal.length', 'sepal.width',
    'petal.length', 'petal.width']]
attributes.boxplot()
plt.show()

This outputs the following plot:

Figure 1.9 – Box plot

Heatmaps are useful for understanding the relationships between attributes. Heatmaps are an important tool for data scientists to explore and visualize data. They represent the data in a two-dimensional format and allow the data to be summarized visually as a colored graph. Although we can use matplotlib to create heatmaps, it is much easier in seaborn and requires significantly fewer lines of code – something we like!

import seaborn as sns
# correlations between the four numeric attributes selected earlier
sns.heatmap(attributes.corr(), annot=True)
plt.show()

This outputs the following heatmap:

Figure 1.10 – Heatmap

The squares in the heatmap represent the correlation (a measure that shows how much two variables are related) between the variables. The correlation values range from -1 to +1:

  • The closer the value is to 1, the more positively correlated they are – that is, as one increases, so does the other
  • Conversely, the closer the value is to -1, the more negatively correlated they are – that is, as one variable decreases, the other will increase
  • Values close to 0 indicate that there is no linear trend between the variables

In Figure 1.10, the diagonals are all 1. This is because, in those squares, each variable is being compared with itself, and every variable is perfectly correlated with itself. For the remainder, the scale shows that the lighter the color (toward the top of the scale), the higher the correlation. For example, petal length and petal width are highly correlated, whereas petal length and sepal width are not. Finally, it can also be seen that the plot is symmetrical about the diagonal. This is because each pair of variables appears twice, once on either side of the diagonal.

We can now build a model using the data and estimate the accuracy of the model on data that it has not seen previously. Let’s start by separating the data and the labels from each other by using Python:

data = df.iloc[:, 0:4]
labels = df.iloc[:, 4]

Before we can train a machine learning model, it is necessary to split the data and labels into testing and training data. As discussed previously, we can use the train_test_split function from sklearn:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data,
    labels, test_size=0.2)

The capital X and lowercase y are a nod to mathematical notation, where it is common practice to write matrix variable names in uppercase and vector variable names in lowercase. This has no special meaning in Python, and the convention can be ignored if desired. For now, note that X refers to data and y refers to the associated labels; hence, X_train can be understood to refer to an object that contains the training data.

Before we can begin to work on the machine learning model, we must normalize the data. Normalization is a scaling technique that updates the numeric columns to use a common scale. This helps improve the performance, reliability, and accuracy of the model. The two most common normalization techniques are min-max scaling and standardization scaling:

  • Min-max scaling: This method uses the minimum and maximum values for scaling and rescales the values so that they end up in the range 0 to 1 or -1 to 1. It is most useful when the features are of different scales. It is typically used when the feature distribution is unknown, such as in k-NN or neural network models.
  • Standardization scaling: This method uses the mean and the standard deviation to rescale values so that they have a mean of 0 and a variance of 1. The resultant scaled values are not confined to a specific range. It is typically used when the feature distribution is normal.

It is uncommon to come across datasets that perfectly follow a specific distribution, so, typically, every dataset will need some form of scaling. For the Iris dataset, we will use sklearn’s StandardScaler, which scales the data so that each feature has a mean of 0 and a standard deviation of 1:

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
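
If min-max scaling were preferred instead, the code would look almost identical – a sketch using sklearn’s MinMaxScaler in place of StandardScaler (applied to the original, unscaled X_train and X_test):

from sklearn.preprocessing import MinMaxScaler

mm_scaler = MinMaxScaler()        # rescales each feature to the range 0 to 1
mm_scaler.fit(X_train)
X_train_mm = mm_scaler.transform(X_train)
X_test_mm = mm_scaler.transform(X_test)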

Now that the data is ready, sklearn makes it easy for us to test and compare various machine learning models. A brief explanation of each model is provided but don’t worry – we will explain these models in more detail in later chapters.

Logistic regression

Logistic regression is one of the most popular machine learning techniques. It is used to predict a categorical dependent variable using a set of independent variables and makes use of a sigmoid function. The sigmoid is a mathematical function whose values lie between 0 and 1, asymptotically approaching both of those values. This makes it useful for binary classification with 0 and 1 as the potential output values:

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
score = lr.score(X_train, y_train)
print(f"Training data accuracy {score}")
score = lr.score(X_test, y_test)
print(f"Testing data accuracy {score}")
Training data accuracy 0.9666666666666667
Testing data accuracy 0.9666666666666667

Note

There is also a technique called linear regression but, as its name suggests, this is used for regression problems, whereas the current Iris problem is a classification problem.

Support vector machines (SVMs)

Support vector machine (SVM) is one of the best “out-of-the-box” classification techniques. An SVM constructs a hyperplane that can then be used for classification. It works by calculating the distances between observations and then determining the hyperplane that maximizes the distance between the closest members of separate classes. The support vector classifier (SVC) estimator (as used in the following example) applies a kernel function (by default, the RBF kernel) to perform the classification:

from sklearn.svm import SVC
svm = SVC(random_state=0, gamma='auto', C=1.0)
svm.fit(X_train, y_train)
score = svm.score(X_train, y_train)
print(f"Training data accuracy {score}")
score = svm.score(X_test, y_test)
print(f"Testing data accuracy {score}")
Training data accuracy 0.9666666666666667
Testing data accuracy 0.9666666666666667

The following parameters are used:

  • random_state: This controls the random number generation that is used to shuffle the data. In this example, it is set to 0, so the shuffling is reproducible and the results will be the same between runs.
  • gamma: This controls how much influence a single data point has on the decision boundary. Low values mean “far” and high values mean “close.” In this example, gamma is set to “auto,” hence allowing it to automatically define its own value based on the characteristics of the dataset.
  • C: This controls the trade-off between maximizing the distance between classes and correctly classifying the data.

K-nearest neighbors (k-NN)

k-NN is another widely used classification technique. k-NN classifies objects based on the closest training examples in the feature space. It is a simple algorithm that stores all available cases and classifies new cases by a majority vote of their k nearest neighbors, as measured by a distance function:

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train,y_train)
score = knn.score(X_train, y_train)
print(f"Training data accuracy {score}")
score = knn.score(X_test, y_test)
print(f"Testing data accuracy {score}")
Training data accuracy 0.9583333333333334
Testing data accuracy 0.9333333333333333

Decision trees

Decision trees attempt to create a tree-like model that predicts the value of a variable by learning simple decision rules that are inferred from the data features. Decision trees classify examples by sorting down the tree from the root to a leaf node, with the leaf node providing the classification for our example:

from sklearn import tree
dt = tree.DecisionTreeClassifier()
dt.fit(X_train, y_train)
score = dt.score(X_train, y_train)
print(f"Training data accuracy {score}")
score = dt.score(X_test, y_test)
print(f"Testing data accuracy {score}")
Training data accuracy 1.0
Testing data accuracy 0.9333333333333333

Random forest

Random forest builds decision trees using different samples and then takes the majority vote as the answer. In other words, random forest builds multiple decision trees and then merges them to get a more accurate prediction. Due to its simplicity, it is also one of the most commonly used algorithms:

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
score = rf.score(X_train, y_train)
print(f"Training data accuracy {score}")
score = rf.score(X_test, y_test)
print(f"Testing data accuracy {score}")
Training data accuracy 1.0
Testing data accuracy 0.9666666666666667

Neural networks

Neural networks (also referred to as deep learning) are algorithms that are inspired by how the human brain works, and are designed to recognize numerical patterns. Neural networks consist of input and output layers and (optionally) hidden layers. These layers contain units (neurons) that transform the inputs into something useful for the output layer. These neurons are connected and work together. We will look at these in more detail later in this book.
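
Although we will not tune a neural network here, sklearn’s MLPClassifier can be dropped into the same pattern as the models above – a minimal sketch using the same training and test splits (the exact scores will vary with the network size, the number of training iterations, and the random split):

from sklearn.neural_network import MLPClassifier

nn = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0)
nn.fit(X_train, y_train)
print(f"Training data accuracy {nn.score(X_train, y_train)}")
print(f"Testing data accuracy {nn.score(X_test, y_test)}")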

Making predictions

Once we have chosen and fit a machine learning model, it can easily be used to make predictions on new, unseen data – that is, take the final model and one or more data instances and then predict the classes for each of the data instances. The model is needed because the result classes are not known for the new data. The class for the unseen data can be predicted using scikit-learn’s predict() function.

First, the unseen data must be transformed into a pandas DataFrame, along with the column names:

df_predict = pd.DataFrame([[5.9, 3.0, 5.1, 1.8]],
    columns = ['sepal.length', 'sepal.width',
    'petal.length', 'petal.width'])

Because the decision tree was trained on scaled data, the new data must be scaled in the same way before being passed to scikit-learn’s predict() function to predict the class value:

print(dt.predict(scaler.transform(df_predict)))
['Virginica']

A sample text classification problem

Given that this is a book on emotion classification and emotions are generally expressed in written form, it makes sense to take a look at how a text classification problem is tackled.

We have all received spam emails. These are typically emails that are sent to huge numbers of email addresses, usually for marketing or phishing purposes. Often, these emails are sent by bots. They are of no interest to the recipients and have not been requested by them. Consequently, email servers will often automatically detect and remove these messages by looking for recognizable phrases and patterns, and sometimes placing them into special folders labeled Junk or Spam.

In this example, we will build a spam detector and use the machine learning abilities of scikit-learn to train the spam detector to detect and classify text as spam and non-spam. There are many labeled datasets available online (for example, from Kaggle); we chose to use the dataset from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset?resource=download.

The dataset contains SMS messages that have been collected for spam research. It contains 5,574 SMS messages in English that are labeled as spam or non-spam (ham). The file contains one message per line, and each line has two columns: the label and the message text.

We have seen some of the basic pandas commands already, so let’s load the file and split it into training and test sets, as we did previously:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
spam = pd.read_csv("spam.csv", encoding_errors="ignore")
labels = spam["v1"]
data = spam["v2"]
X_train,X_test,y_train,y_test = train_test_split(data,
    labels, test_size = 0.2)

Note

The file may have an encoding error; for now, we will ignore this as it is not relevant to the task at hand.

A handy function called CountVectorizer is available in sklearn. This can be used to transform text into a vector of term-token counts. It is also able to preprocess the text data before generating the vector representations, hence it is an extremely useful function. CountVectorizer converts the raw text into a numerical vector representation, which makes it easy to use the text as inputs in machine learning tasks:

count_vectorizer = CountVectorizer()
X_train_features = count_vectorizer.fit_transform(X_train)

Essentially, it assigns an index to each unique word and then counts the number of occurrences of each word. For example, consider the following sentence:

The quick brown fox jumps over the lazy dog.

This would be converted as follows:

word     the  quick  brown  fox  jumps  over  lazy  dog
index     0     1      2     3     4      5     6    7
count     2     1      1     1     1      1     1    1

Notice that there are eight unique words, hence eight columns. Each column represents a unique word in the vocabulary, each row of counts corresponds to one item (row) in the dataset, and the values in the cells are the word counts. Armed with this knowledge about the types and counts of words that commonly appear in spam, the model will be able to classify text as spam or non-spam.
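
You can reproduce a table like this yourself – a small sketch (note that CountVectorizer lowercases the text and, by default, orders its vocabulary alphabetically, so the indices will differ from the illustrative table above):

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
counts = cv.fit_transform(["The quick brown fox jumps over the lazy dog."])
print(cv.get_feature_names_out())   # ['brown' 'dog' 'fox' 'jumps' 'lazy' 'over' 'quick' 'the']
print(counts.toarray())             # [[1 1 1 1 1 1 1 2]]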

We will use the simple k-NN model introduced earlier:

knn = KNeighborsClassifier(n_neighbors = 5)

The fit() function, as we saw earlier, trains the model using the vectorized counts from the training data together with the training labels. For k-NN, fitting simply stores the training examples and their labels so that new cases can later be compared against them. Note that, since this is a supervised classification task, the labels must be passed to the fit() function, just as they were for the classifiers in the Iris example (whereas the StandardScaler’s fit() needed only the data):

knn.fit(X_train_features, y_train)

We now have a model that we can use on the test data to test for accuracy:

X_test_features = count_vectorizer.transform(X_test)
score = knn.score(X_test_features, y_test)
print(f"Testing data accuracy {score}")
Testing data accuracy 0.9255605381165919

Note how, this time, we use transform() instead of fit_transform(). The difference is subtle but important. The fit_transform() function performs fit() followed by transform() – that is, it learns the vocabulary and other parameters from the training data and then uses them to generate the term-count matrix. This is needed when a model is being trained. The transform() method, on the other hand, reuses those learned parameters and only generates and returns the term-count matrix. The score() function then compares the model’s predictions for the test-data term-count matrix against the actual test labels, y_test. Even using a simplistic model, we can classify spam with reasonably high accuracy.

 

Summary

In this chapter, we started by examining emotion and sentiment, and their origins. Emotion is not the same as sentiment; emotion is more fine-grained and is much harder to quantify and work with. Hence, we learned about the three main theories of emotion, with psychologists, neurologists, and cognitive scientists each having slightly different views as to how emotions are formed. We explored the approaches of Ekman and Plutchik, and how the categorical and dimensional models are laid out.

We also examined the reasons why emotion analysis is important but difficult, owing to the nuances of working with content written in natural language, particularly the kind of informal language we are concerned with in this book. We looked at the basic issues in NLP and will return to the most relevant aspects of NLP in Chapter 4, Preprocessing – Stemming, Tagging, and Parsing. Finally, we introduced machine learning and worked through some sample projects.

In the next chapter, we will explore where to find suitable data, the steps needed to make it fit for purpose, and how to construct a dataset suitable for emotion analysis.

 

References

To learn more about the topics that were covered in this chapter, take a look at the following resources:

  • Alabbas, M., & Ramsay, A. M. (2013). Natural language inference for Arabic using extendedtree edit distance with subtrees. Journal of Artificial Intelligence Research, 48, 1–22.
  • Dozat, T., Qi, P., & Manning, C. D. (2017). Stanford’s Graph-based Neural Dependency Parser at the CoNLL 2017 Shared Task. Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, 20–30. https://doi.org/10.18653/v1/K17-3002.
  • Ekman, P. (1993). Facial expression and emotion. American Psychologist, 48(4), 384.
  • Kitaev, N., & Klein, D. (2018). Constituency Parsing with a Self-Attentive Encoder. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2,676–2,686. https://doi.org/10.18653/v1/P18-1249.
  • Lewis, M., & Steedman, M. (2014). A* CCG Parsing with a Supertag-factored Model. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 990–1,000. https://doi.org/10.3115/v1/D14-1107.
  • MacCartney, B., & Manning, C. D. (2014). Natural logic and natural language inference. In Computing Meaning (pp. 129–147). Springer.
  • McDonald, R., Pereira, F., Ribarov, K., & Hajič, J. (2005). Non-Projective Dependency Parsing using Spanning Tree Algorithms. Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, 523–530. https://aclanthology.org/H05-1066.
  • Nivre, J., Hall, J., & Nilsson, J. (2006). MaltParser: A data-driven parser-generator for dependency parsing. Proceedings of the International Conference on Language Resources and Evaluation (LREC), 6, 2216–2219.
  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
  • Plutchik, R. (2001). The Nature of Emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice. American Scientist, 89(4), 344–350.
  • Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178. https://doi.org/10.1037/h0077714.