Chapter 10. Natural Language Processing

Language is an integral part of our daily life and a natural way of conveying ideas from person to person. But as easy as it is for us to understand our native language, it is just as difficult for computers to process it. The internet changed the science of language forever because it allowed the collection of huge volumes of text and audio recordings. The field of knowledge that arose at the intersection of linguistics, computer science, and machine learning is called natural language processing (NLP).

In this chapter, we will get acquainted with the basic concepts and applications of NLP that are relevant in the context of mobile development. We will talk about the powerful tools provided by the iOS and macOS SDKs for language processing. We will also learn about the theory of distributional semantics and vector representations of words as its embodiment. They will allow us to express the meaning of sentences in the computer's favorite format—in the form of numbers....

NLP in the mobile development world


Usually, NLP specialists deal with large amounts of raw text organized into linguistic corpora. The algorithms in this domain are resource-consuming and often contain many hand-crafted heuristics. None of this looks like a good match for mobile applications, where every megabyte and every frame per second matters. Despite these obstacles, NLP is widely used on mobile platforms, usually in tight integration with a server-side backend that handles the heavy computations. Here is a list of some common NLP features that can be found in many mobile applications:

  • Chatbots
  • Spam filtering
  • Automated translation
  • Sentiment analysis
  • Speech-to-text and text-to-speech
  • Automatic spelling and grammar correction
  • Automatic completion
  • Keyboard suggestions

Until recently, all but the last two tasks were done on the server side, but as mobile computational power grows, more apps tend to do processing (at least partially) locally on the client. When we talk about NLP on a mobile device, in most...

Word Association game


Many of us may have played this game as kids. The rules are very simple and form an endless loop:

  • You say a word
  • I say the first association to your word that comes to my mind
  • You give an association to my association
  • ...and the loop repeats

For example, Dog → Cat → Pet → Toy → Baby → Girl → Wedding → Funeral. In the game, people reveal their life experience and way of thinking to each other; maybe that's why we could play it for hours as kids. Different people have different associations with the same word, and associations often head in a completely unexpected direction. Psychologists have been studying associative series for more than a century, hoping to find in them the key to the mysteries of the conscious and subconscious mind. Can you code a game AI that plays like that? Perhaps you think you will need a manually composed database of associations. But what if you want your AI to have several personalities? Thanks to machine learning, this is definitely possible, and you don't even need to...

Python NLP libraries


The two Python libraries that we're going to use in this chapter are the Natural Language Toolkit (NLTK) and Gensim. We will use the first one for text preprocessing and the second one for training our machine learning models. To install them, first activate your Python virtual environment:

> cd ~
> virtualenv swift-ml-book

And run pip install:

> pip install -U nltk gensim

Note

Official sites:

  • NLTK: http://www.nltk.org
  • Gensim: https://radimrehurek.com/gensim/

Other popular libraries for NLP in Python:

  • spaCy: https://spacy.io
  • TextBlob: https://textblob.readthedocs.io

Textual corpuses


For our NLP experiments, we need some reasonably big texts. I used the complete works of classical writers and statesmen from Project Gutenberg because they are in the public domain, but you can find your own texts and train models on them. If you want to use the same texts as I did, you'll find them in the supplementary material for this chapter under the Corpuses folder. There should be five of them: Benjamin Franklin, John Galsworthy, Mark Twain, William Shakespeare, and Winston Churchill. Create a new Jupyter notebook and load Mark Twain's corpus as one long string:

import zipfile 
zip_ref = zipfile.ZipFile('Corpuses.zip', 'r') 
zip_ref.extractall('') 
zip_ref.close() 
In [1]: 
import codecs 
In [2]: 
one_long_string = "" 
with codecs.open('Corpuses/MarkTwain.txt', 'r', 'utf-8-sig') as text_file: 
    one_long_string = text_file.read() 
In [3]: 
one_long_string[99000:99900] 
Out[3]: 
u"size, very elegantly wrought and dressed in the fancifulrncostumes of two centuries...

Common NLP approaches and subtasks


Most programmers are familiar with the simplest way of processing natural language: regular expressions. There are many regular expression implementations for different programming languages that differ in small details. Because of these details, the same regular expression can produce different results on different platforms, or not work at all. The two most popular standards are POSIX and Perl. The Foundation framework, however, contains its own version of regular expressions, based on the ICU C++ library; it is an extension of the POSIX standard for Unicode strings.

Why are we even talking about regular expressions here? Regular expressions are a great example of what NLP specialists call heuristics—manually written rules and ad hoc solutions that describe a complex structure in such a way that all exceptions and variations are taken into account. Sophisticated heuristics require deep domain expertise to build. Only when we are not able to capture all the...
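As a toy illustration (not from the original text), here is a hand-written date-matching heuristic in Python, and the kind of exceptions it immediately runs into:

import re

# A hand-crafted heuristic: match dates written as MM/DD/YYYY.
date_pattern = re.compile(r'\b(\d{1,2})/(\d{1,2})/(\d{2,4})\b')
date_pattern.findall("Born 4/7/1998, last seen 12/31/2017.")
# [('4', '7', '1998'), ('12', '31', '2017')]
# Each new variation -- "31/12/2017", "Dec 31, 2017", "2017-12-31" --
# demands another manually written rule; this is how heuristics grow.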

Distributional semantics hypothesis


It's difficult to say what it means "to understand the meaning of a text", but everyone will agree that people can do it and computers cannot. Natural language understanding is one of the hardest problems in artificial intelligence. How can we capture the semantics of a sentence?

Traditionally, there were two opposite approaches to the problem. The first one goes like this: start from the definitions of separate words, hard-code the relations between them, and write down the sentence structures. If you are persistent enough, hopefully you will end up with a complex model that incorporates enough expert knowledge to parse some natural questions and produce meaningful answers. And then you'll find out that for a new language, you need to start all over again.

That's why many researchers turned to the opposite approach: statistical methods. Here, we start from a large amount of textual data and let the computer figure out the meaning of the text. The hypothesis...

Word vector representations


Distributional semantics represents words as vectors in a space of senses. The vectors corresponding to words with similar meanings should be close to each other in this space. How to build such vectors is not a simple question, however. The simplest approach is to start from one-hot vectors for the words, but then the vectors will be both sparse and giant, each one of the same length as the number of words in the vocabulary. That's why we apply dimensionality reduction using an autoencoder-like architecture.
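To make the sparsity concrete, here is a small illustration (not from the original text) of one-hot vectors over a toy vocabulary:

import numpy as np

# Toy vocabulary; a real corpus has tens of thousands of distinct words.
vocabulary = ['dog', 'cat', 'pet', 'toy']
one_hot = {word: np.eye(len(vocabulary))[i]
           for i, word in enumerate(vocabulary)}
one_hot['cat']            # array([0., 1., 0., 0.])
# Every vector has one dimension per vocabulary word and a single non-zero
# entry, so for a 50,000-word vocabulary each vector has 50,000 components.
# Dense embeddings of a fixed small size (say, 100) avoid this blow-up.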

Autoencoder neural networks


An autoencoder is a neural network whose goal is to produce an output identical to its input. For example, if you pass a picture into it, it should return the same picture on the other end. This seems... not complicated! But the trick is the special architecture—its inner layers have fewer neurons than the input and output layers, usually with some extreme bottleneck in the middle. The part of the network before the bottleneck is called the encoder, and the part after it the decoder. The encoder converts the input into some inner representation, and the decoder then restores the data to its original form. During training, the network must figure out how to compress the input data most effectively and then decompress it with the least possible information loss. This architecture can also be employed to train neural networks that change input data in a way we want. For example, autoencoders have been successfully used to remove noise from images.
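As a minimal sketch of this architecture (using Keras, which is an assumption of this example rather than code from the chapter), a dense autoencoder with a 32-neuron bottleneck for 784-dimensional inputs might look like this:

from keras.models import Model
from keras.layers import Input, Dense

# 784-dimensional input (for example, a flattened 28x28 image).
inputs = Input(shape=(784,))
encoded = Dense(32, activation='relu')(inputs)       # encoder: the bottleneck
decoded = Dense(784, activation='sigmoid')(encoded)  # decoder: reconstruction

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
# The target is the input itself -- the network learns to compress its own
# input and reconstruct it:
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)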

Note

Autoencoder neural...

Word2Vec


Word2Vec is an efficient neural-network-based algorithm for generating word embeddings. It was originally described by Mikolov et al. in Distributed Representations of Words and Phrases and their Compositionality (2013). The original C implementation, in the form of a command-line application, is available at https://code.google.com/archive/p/word2vec/.

Figure 10.4: Architecture of Word2Vec

Word2Vec is often referred to as an instance of deep learning, but the architecture is actually quite shallow: only three layers deep. This misconception is likely related to its wide adoption for improving the performance of deep networks in NLP. The Word2Vec architecture is similar to an autoencoder's. The input of the neural network is a sufficiently big text corpus, and the output is a list of vectors (arrays of numbers), one vector for each word in the corpus. The algorithm uses the context of each word to encode information about word co-occurrences into those vectors. As a result, the...
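For reference, the skip-gram training objective from the Mikolov et al. paper cited above (one of Word2Vec's two training schemes, the other being CBOW) maximizes the average log-probability of the context words within a window of size c around each word in the corpus of length T:

\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)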

Word2Vec in Gensim


There is no point in running Word2Vec training on an iOS device: in the app, we need only the vectors it generates. We don't want to load large corpuses of text onto a mobile phone, so we will learn the vector representation using the Gensim Python library. This library is popular for topic modeling and contains a fast Word2Vec implementation with a nice API. We will then do some preprocessing (remove everything except nouns) and plug the resulting database into our iOS application:

In [39]: 
import gensim 
In [40]: 
# Vocabulary filter: keep only the nouns we collected earlier and drop
# stop words. `words_to_keep` and `stop_words` come from earlier
# preprocessing cells (see the sketch below).
def trim_rule(word, count, min_count): 
    if word not in words_to_keep or word in stop_words: 
        return gensim.utils.RULE_DISCARD 
    else: 
        return gensim.utils.RULE_DEFAULT 
In [41]: 
# min_count=15 discards words occurring fewer than 15 times; trim_rule
# prunes the vocabulary further while it is being built.
model = gensim.models.Word2Vec(sentences_to_train_on, min_count=15, trim_rule=trim_rule) 
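The notebook cells that define sentences_to_train_on, words_to_keep, and stop_words are not shown in this excerpt; a plausible NLTK-based reconstruction (an assumption, not the book's exact code) looks like this:

import nltk
from nltk.corpus import stopwords
# One-time downloads of the required NLTK resources:
# nltk.download('punkt'); nltk.download('stopwords')
# nltk.download('averaged_perceptron_tagger')

# Split the corpus into sentences, then into lowercase word tokens.
sentences = nltk.sent_tokenize(one_long_string)
sentences_to_train_on = [[w.lower() for w in nltk.word_tokenize(s)]
                         for s in sentences]

stop_words = set(stopwords.words('english'))

# Keep only nouns (POS tags starting with 'NN').
words_to_keep = set()
for tokens in sentences_to_train_on:
    for word, tag in nltk.pos_tag(tokens):
        if tag.startswith('NN'):
            words_to_keep.add(word)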

Vector space properties


"The Hatter opened his eyes very wide on hearing this; but all he SAID was, 'Why is a raven like a writing-desk?' 'Come, we shall have some fun now!' thought Alice. 'I'm glad they've begun asking riddles. - I believe I can guess that,' she added aloud. 'Do you mean that you think you can find out the answer to it?' said the March Hare."

– Lewis Carroll, Alice's Adventures in Wonderland

Why is a raven like a writing desk? With the help of distributional semantics and vector word representations, we can finally help Alice solve the Hatter's riddle (in a mathematically precise way):

In [42]: 
model.most_similar('house', topn=5) 
Out[42]: 
[(u'camp', 0.8188982009887695), 
 (u'cabin', 0.8176383972167969), 
 (u'town', 0.7998955845832825), 
 (u'room', 0.7963996529579163), 
 (u'street', 0.7951667308807373)] 
In [43]: 
model.most_similar('America', topn=5) 
Out[43]: 
[(u'India', 0.8678370714187622), 
 (u'Europe', 0.8501001596450806), 
 (u'number', 0.8464810848236084), 
 (u'member', 0.8352445363998413...
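The same vector space supports arithmetic on meanings. As an illustration (these calls are assumptions for demonstration, not output reproduced from the book, and both query words must have survived the vocabulary trimming), Gensim lets you combine positive and negative example words and measure pairwise similarity:

# Analogy query: which word relates to 'woman' as 'king' relates to 'man'?
model.most_similar(positive=['king', 'woman'], negative=['man'], topn=3)

# Cosine similarity between two specific words:
model.similarity('raven', 'desk')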

iOS application


To use vectors in an iOS application, we must export them in a binary format:

In [47]: 
model.wv.save_word2vec_format(fname='MarkTwain.bin', binary=True) 

This binary file contains the words and their embedding vectors, all of the same length. The original implementation of Word2Vec was written in C, so I took it and adapted the code for our purpose—to parse the binary file and find the words closest to the one we specify.
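For reference, the binary layout is simple enough to sketch in a few lines of Python (an illustration of the file format, not the adapted C code used in the app): a text header with the vocabulary and vector sizes, then each word as a space-terminated string followed by its float32 components.

import numpy as np

def load_word2vec_bin(path):
    # Header line: "<vocab_size> <vector_size>\n"
    vectors = {}
    with open(path, 'rb') as f:
        vocab_size, vector_size = map(int, f.readline().split())
        for _ in range(vocab_size):
            # The word is a space-terminated byte string; skip the
            # newlines some writers place between entries.
            word = b''
            while True:
                ch = f.read(1)
                if ch == b' ':
                    break
                if ch != b'\n':
                    word += ch
            vec = np.frombuffer(f.read(4 * vector_size), dtype=np.float32)
            vectors[word.decode('utf-8')] = vec
    return vectors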

Chatbot anatomy

Most chatbots look like reincarnations of console applications: you have a predefined set of commands, and the bot produces an output for each of your commands. Someone even joked that Linux includes an awesome chatbot called console. But chatbots don't always have to be that way. Let's see how we can make them more interesting. A typical chatbot consists of one or several input streams, a brain, and output streams. Inputs can be a keyboard, voice recognition, or a set of predefined phrases. The brain is a sort of algorithm for transforming input into output...

Word2Vec friends and relatives


Popular alternatives to Word2Vec include GloVe (Global Vectors), LexVec, and fastText.

For vector representations of whole documents, see Doc2Vec and the Doc2VecC model described in Efficient Vector Representation for Documents Through Corruption:

https://openreview.net/pdf?id=B1Igu2ogg

https://github.com/mchen24/iclr2017

Both Word2Vec and GloVe learn geometric encodings (vectors) of words from their co-occurrence information (how frequently they appear together in large text corpora). They differ in that Word2Vec is a "predictive" model, whereas GloVe is a "count-based" model. See this paper for more on the distinctions between these two approaches: http://clic.cimec.unitn.it/marco

Predictive models learn their vectors by improving their predictive ability on Loss(target word | context words; Vectors), that is, by minimizing the loss of predicting the target word from the context words given the vector representations. In Word2Vec, this is cast as a feed-forward neural network and optimized as such using SGD, and so on.

Count-based models learn their vectors by...

Where to go from here?


Word embeddings are such an elegant idea that they immediately became an indispensable part of many applications in NLP and other domains. Here are several possible directions for your further exploration:

  • You can easily transform the Word Association game into a question-answer system by replacing vectors of words with vectors of sentences. The simplest way to get sentence vectors is to add all the word vectors together (see the sketch after this list). Interestingly, such sentence vectors still keep the semantics, so you can use them to find similar sentences.
  • Using clustering on embedding vectors, you can separate words, sentences, and documents into groups by similarity.
  • As we have mentioned, Word2Vec vectors are popular as parts of the more complex NLP pipelines. For example, you can feed them into a neural network or some other machine learning algorithm. In this way, you can train a classifier for pieces of text, for example, to recognize text sentiments or topics.
  • Word2Vec itself is just...
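A minimal sketch of the sentence-vector idea from the first bullet, assuming the Gensim model trained earlier (averaging instead of plain summation keeps the scale independent of sentence length):

import numpy as np

def sentence_vector(model, tokens):
    # Average the vectors of the words the model knows about.
    known = [model.wv[w] for w in tokens if w in model.wv]
    if not known:
        return None
    return np.mean(known, axis=0)

# Usage: vectors of semantically similar sentences end up close together.
v1 = sentence_vector(model, ['dog', 'house'])
v2 = sentence_vector(model, ['cat', 'cabin'])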

Summary


For developing applications that can understand voice or text input, we use techniques from the natural language processing domain. We have just seen several widely used ways to preprocess texts: tokenization, stop words removal, stemming, lemmatization, POS tagging, and named entity recognition.

Word embedding algorithms, mainly Word2Vec, draw inspiration from the distributional semantics hypothesis, which states that the meaning of a word is defined by its context. Using an autoencoder-like neural network, we learn a fixed-size vector for each word in a text corpus. Effectively, this neural network captures the context of each word and encodes it in the corresponding vector. Then, using linear algebra operations on those vectors, we can discover different interesting relationships between words. For example, it allows us to find semantically close words (by cosine similarity between vectors).
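For reference, cosine similarity is just the normalized dot product; a one-function sketch (an illustration, not code from the book):

import numpy as np

def cosine_similarity(a, b):
    # 1.0 for identical directions, 0.0 for orthogonal vectors.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))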

In the next section of the book, we are going to dig deeper into some practical questions...
