You're reading from  MATLAB for Machine Learning - Second Edition

Product type: Book
Published in: Jan 2024
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781835087695
Edition: 2nd Edition
Author (1)
Giuseppe Ciaburro

Giuseppe Ciaburro holds a PhD and two master's degrees. He works at the Built Environment Control Laboratory - Università degli Studi della Campania "Luigi Vanvitelli". He has over 25 years of work experience in programming, first in the field of combustion and then in acoustics and noise control. His core programming knowledge is in MATLAB, Python and R. As an expert in AI applications to acoustics and noise control problems, Giuseppe has wide experience in researching and teaching. He has several publications to his credit: monographs, scientific journals, and thematic conferences. He was recently included in the world's top 2% scientists list by Stanford University (2022).

Natural Language Processing Using MATLAB

Natural language processing (NLP) is the automatic processing of information conveyed through spoken or written language. The task is difficult and complex, largely because human language is inherently ambiguous. For machines to learn and to interact with the world in ways typical of humans, storing data is not enough: they must also be taught to translate that data into meaningful concepts at the same time. As natural language interacts with the environment, it generates knowledge that can be used for prediction. In this chapter, we will learn the basic concepts of NLP and how to build a model to label sentences.

In this chapter, we’re going to cover the following main topics:

  • Explaining NLP
  • Exploring corpora and word and sentence tokenizers
  • Implementing a MATLAB model to label sentences
  • Understanding gradient boosting techniques

Technical requirements

In this chapter, we will introduce basic ML concepts. To understand these topics, a basic knowledge of algebra and mathematical modeling is needed. You will also require working knowledge of the MATLAB environment.

To work with the MATLAB code in this chapter, you’ll need the following files (available on GitHub at https://github.com/PacktPublishing/MATLAB-for-Machine-Learning-second-edition):

  • IMDBSentimentClassification.m
  • ImdbDataset.xlsx

Explaining NLP

NLP is a field that’s dedicated to the development of technology that enables computers to interact with, understand, and generate human language in a way that mimics natural human communication. This involves various techniques and approaches aimed at processing and analyzing the complexities of natural languages, such as English, Chinese, Arabic, and more. The goal is to bridge the gap between human language and computer language, allowing computers to comprehend and generate text as if they were engaging in a conversation with a human interlocutor (Figure 7.1):

Figure 7.1 – NLP tasks


NLP strives to develop information technology tools for analyzing, comprehending, and creating texts in a manner that resonates with human understanding, mimicking interactions with another human rather than a machine. Natural language, both spoken and written, represents the most instinctive and widespread mode of communication. In contrast...

Exploring corpora and word and sentence tokenizers

The analysis of corpora, words, and sentence tokenization forms the basis for comprehensive language understanding. Corpora provide real-world language data for analysis, words constitute the elements of expression, and sentence tokenization structures the text into meaningful units for further investigation. This trio of concepts plays a central role in advancing linguistic research and enhancing NLP capabilities.
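The two levels of tokenization mentioned above can be illustrated with a short MATLAB sketch. It assumes that the Text Analytics Toolbox is installed, and the sample text is made up for illustration; this preview does not show which functions the book itself uses at this point.

```matlab
% Illustrative sample text (hypothetical, not taken from the book's dataset)
txt = "The film was long. I still enjoyed it!";

% Sentence tokenization: split the text into one string per sentence
sentences = splitSentences(txt);

% Word tokenization: split each sentence into word and punctuation tokens
docs = tokenizedDocument(sentences);

% Extract the tokens of the first sentence as a string vector
firstTokens = string(docs(1));

disp(sentences)
disp(firstTokens)
```

Here, sentences holds one string per sentence, while docs holds one tokenized document per sentence, so later steps can work at either level.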

Corpora

In linguistics and NLP, corpora are extensive collections of written or spoken texts that serve as valuable sources of data for linguistic analysis and language-related studies. Corpora provide a diverse range of language samples, enabling researchers to examine patterns, trends, and variations in language usage, syntax, and semantics across different contexts and genres.

Linguistic corpora represent sizable collections of spoken or written texts, often originating from authentic communication contexts...

Implementing a MATLAB model to label sentences

In this section, we will discuss a topic of wide practical interest: the role that reviews play in shaping a customer's interest and in helping them make the right decision.

Introducing sentiment analysis

Sentiment analysis, a technique that utilizes NLP, extracts and analyzes subjective information from text. Analyzing vast datasets reveals collective opinions that impact various domains. While manual sentiment analysis is challenging, automated methods have emerged. However, automating language modeling is complex and costly due to the nuances of human language. Additionally, the methodology varies across languages, increasing complexity.

A major challenge lies in determining the polarity of opinions. Polarity classification is subjective, with one sentence perceived differently by individuals based on their value systems. The rise of social media has heightened interest in sentiment...
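To make polarity classification concrete, the following is a minimal sketch of one common approach: turn each review into a bag-of-words count vector and fit a linear classifier on those counts. It assumes the Text Analytics Toolbox and the Statistics and Machine Learning Toolbox. The four reviews, the labels, and the helper name prep are hypothetical and chosen only for illustration; they do not come from ImdbDataset.xlsx, and the book's own script may proceed differently.

```matlab
% Tiny made-up training set (hypothetical; far too small for a real model)
reviews = ["A wonderful film with a great cast"; ...
           "I loved every minute of this movie"; ...
           "A boring plot and terrible acting"; ...
           "I hated this dull, awful movie"];
labels = categorical(["positive"; "positive"; "negative"; "negative"]);

% Preprocessing helper (hypothetical name): lowercase, tokenize,
% strip punctuation, and remove stop words
prep = @(s) removeStopWords(erasePunctuation(tokenizedDocument(lower(s))));

% Bag-of-words model: one row of word counts per review
bag = bagOfWords(prep(reviews));
XTrain = full(bag.Counts);

% Linear binary classifier trained on the word counts
mdl = fitclinear(XTrain, labels);

% Encode an unseen review with the same vocabulary and predict its polarity
newReview = "What a great movie";
XNew = full(encode(bag, prep(newReview)));
predicted = predict(mdl, XNew);
disp(predicted)
```

With only four training sentences the predicted label carries no weight; the point is the pipeline of preprocessing, encoding, training, and prediction, which stays the same when the toy strings are replaced by a real labeled dataset.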

Understanding gradient boosting techniques

To improve the performance of an algorithm, we can perform a series of steps and use different techniques, depending on the type of algorithm and the specific problems being addressed. The first approach involves a thorough analysis of the data to identify possible inaccuracies or shortcomings. In addition, many algorithms have parameters that can be adjusted to achieve better performance – not to mention techniques such as feature scaling or feature selection. A popular technique is to combine the capabilities offered by different algorithms to achieve better overall performance.
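Of the steps listed above, feature scaling is the quickest to illustrate. The sketch below standardizes each feature to zero mean and unit standard deviation; the feature matrix is made up for illustration, and z-score standardization is only one of several possible scaling choices.

```matlab
% Made-up feature matrix: rows are observations, columns are features
% measured on very different scales
X = [170 65000; 160 42000; 180 98000; 175 51000];

% Standardize each column (z-score is the default method of normalize)
Xz = normalize(X);

% After scaling, each column has mean close to 0 and standard deviation 1
disp(mean(Xz))
disp(std(Xz))
```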

Approaching ensemble learning

The concept of ensemble learning involves the use of multiple models combined in a way that maximizes performance by exploiting their strengths and mitigating their relative weaknesses. These ensemble learning methods are based on weak learning models that do not achieve high levels of accuracy on their own, but when combined...
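The idea of combining weak learners can be sketched with boosting, one family of ensemble methods. The example below is an illustration under stated assumptions rather than the book's own code: it uses least-squares boosting (the LSBoost method of fitrensemble in the Statistics and Machine Learning Toolbox) on synthetic regression data, with decision stumps as the weak learners. The data, the number of learning cycles, and the learning rate are arbitrary choices.

```matlab
rng(0);  % fix the random seed so the run is repeatable

% Synthetic regression data (made up for illustration): a noisy sine curve
x = linspace(0, 2*pi, 200)';
y = sin(x) + 0.1*randn(size(x));

% Weak learner: a decision stump (a tree with a single split)
stump = templateTree('MaxNumSplits', 1);

% Boosted ensemble: each new stump is fit to the residuals left by the
% stumps trained before it, and its contribution is shrunk by LearnRate
ens = fitrensemble(x, y, ...
    'Method', 'LSBoost', ...
    'Learners', stump, ...
    'NumLearningCycles', 100, ...
    'LearnRate', 0.1);

% Baseline: one stump on its own
single = fitrtree(x, y, 'MaxNumSplits', 1);

% Compare mean squared error on the training data
mseSingle = resubLoss(single);
mseEnsemble = resubLoss(ens);
fprintf('Single stump MSE: %.4f\n', mseSingle);
fprintf('Boosted ensemble MSE: %.4f\n', mseEnsemble);
```

Both errors are measured on the training data, so they show only that the combined stumps fit the curve more closely than one stump can. A fair comparison of predictive performance would use held-out data or cross-validation.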

Summary

In this chapter, we studied NLP, which automatically processes information that's transmitted through spoken or written language. To begin, we analyzed the basic concepts of NLP by identifying the tasks that can be tackled, and then looked at the main approaches to text analysis and text generation. Next, we analyzed corpora, words, and sentence tokenization. Corpora offer authentic language data for examination, words serve as the fundamental components of expression, and sentence tokenization organizes the text into coherent units for in-depth analysis.

In the second part of this chapter, we analyzed a practical case of using NLP for labeling movie reviews. This is a sentiment analysis problem that aims to automatically identify the polarity of a textual comment. Working through this example, we learned which tools to use in MATLAB to perform this type of analysis. In the final part of this chapter, we analyzed ensemble learning...
