Machine Learning with Core ML

Introduction to Machine Learning

Let's begin our journey by peering into the future and envision how we'll see ourselves interacting with computers. Unlike today's computers, where we are required to continuously type in our emails and passwords to access information, the computers of the future will easily be able to recognize us by our face, voice, or activity. Unlike today's computers, which require step-by-step instructions to perform an action, the computer of the future will anticipate our intent and provide a natural way for us to converse with it, similar to how we engage with other people, and then proceed to help us achieve our goal. Our computer will not only assist us but also be our friend, our doctor, and so on. It could deliver our groceries at the door and be our interface with an increasingly complex and information-rich physical world.

What is exciting about this vision is that it is no longer in the realm of science fiction but an emergent reality. One of the major drivers of this is the progress and adoption of machine learning (ML) techniques, a discipline that gives computers the perceptual power of humans, thus giving them the ability to see, hear, and make sense of the world—physical and digital.

But despite all the great progress over the last 3-4 years, most of the ideas and potential are locked away in research projects and papers rather than being in the hands of the user. So it's the aim of this book to help developers understand these concepts better. It will enable you to put them into practice so that we can arrive at this future—a future where computers augment us, rather than enslave us due to their inability to understand our world.

Because of the constraint of Core ML—it being only able to perform inference—this book differs vastly from other ML books, in the sense that the core focus is on the application of ML. Specifically we'll focus on computer vision applications rather than the details of ML. But in order to better enable you to take full advantage of ML, we will spend some time introducing the associated concepts with each example.

And before jumping into the hands-on examples, let's start from the beginning and build an appreciation for what ML is and how it can be applied. In this chapter we will:

Start by introducing ML. We'll learn how it differs from classical programming and why you might choose it.
Look at some examples of how ML is being used today, along with the type of data and ML algorithm being used.
Finally, present the typical workflow for ML projects.

Let's kick off by first discussing what ML is and why everyone is talking about it.

What is machine learning?

ML is a subfield of Artificial Intelligence (AI), a topic of computer science born in the 1950s with the goal of trying to get computers to think or provide a level of automated intelligence similar to that of us humans.

Early success in AI was achieved by using an extensive set of defined rules, known as symbolic AI, allowing expert decision making to be mimicked by computers. This approach worked well for many domains but had a big shortfall in that in order to create an expert, you needed one. Not only this, but also their expertise needed to be digitized somehow, which normally required explicit programming.

ML provides an alternative; instead of having to handcraft rules, it learns from examples and experience. It also differs from classical programming in that it is probabilistic as opposed to being discrete. That is, it is able to handle fuzziness or uncertainty much better than its counterpart, which will likely fail when given an ambiguous input that wasn't explicitly identified and handled.

I am going to borrow an example used by Google engineer Josh Godron in an introductory video to ML to better highlight the differences and value of ML.

Suppose you were given the task of classifying apples and oranges. Let's first approach this using what we will call classical programming:

Our input is an array of pixels for each image, and for each input, we will need to explicitly define some rules that will be able to distinguish an apple from an orange. Using the preceding examples, you can solve this by simply counting the number of orange and green pixels. Those with a higher ratio of green pixels would be classified as an apple, while those with a higher ratio of orange pixels would be classified as an orange. This works well with these examples but breaks if our input becomes more complex:

The introduction of new images means our simple color-counting function can no longer sufficiently differentiates our apples from our oranges, or even classify apples. We are required to reimplement the function to handle the new nuances introduced. As a result, our function grows in complexity and becomes more tightly coupled to the inputs and less likely able to generalize to other inputs. Our function might resemble something like the following:

func countColors(_ image:UIImage) -> [(color:UIColor, count:Int)]{
// lots of code 
}

func detectEdges(_ image:UIImage) -> [(x1:Int, y1:Int, x2:Int, y2:Int)]
{
// lots of code
}

func analyseTexture(_ image:UIImage) -> [String]
{
// lots of code 
} 

func fitBoundingBox(_ image:UIImage) -> [(x:Int, y:Int, w:Int, h:Int)]
{
// lots of code 
}

This function can be considered our model, which models the relationship of the inputs with respect to their labels (apple or orange), as illustrated in the following diagram:

The alternative, and the approach we're interested in, is getting this model created to automatically use examples; this, in essence, is what ML is all about. It provides us with an effective tool to model complex tasks that would otherwise be nearly impossible to define by rules.

The creation phase of the ML model is called training and is determined by the type of ML algorithm selected and data being fed. Once the model is trained, that is, once it has learned, we can use it to make inferences from the data, as illustrated in the following diagram:

The example we have presented here, classifying oranges and apples, is a specific type of ML algorithm called a classifier, or, more specifically, a multi-class classifier. The model was trained through supervision; that is, we fed in examples of input with their associated labels (or classes). It is useful to understand the types of ML algorithms that exist along with the types of training, which is the topic of the next section.

A brief tour of ML algorithms

In this section, we will look at some examples of how ML is used, and with each example, we'll speculate about the type of data, learning style, and ML algorithm used. I hope that by the end of this section, you will be inspired by what is possible with ML and gain some appreciation for the types of data, algorithms, and learning styles that exist.

In this section, we will be presenting some real-life examples in the context of introducing types of data, algorithms, and learning styles. It is not our intention to show accurate data representations or implementations for the example, but rather use the examples as a way of making the ideas more tangible.

Netflix – making recommendations

No ML book is complete without mentioning recommendation engines—probably one of the most well known applications of ML. In part, this is thanks to the publicity gained when Netflix announced a $1 million competition for movie rating predictions, also known as recommendations. Add to this Amazon's commercial success in making use of it.

The goal of recommendation engines is to predict the likelihood of someone wanting a particular product or service. In the context of Netflix, this would mean recommending movies or TV shows to its users.

One intuitive way of making recommendations is to try and mimic the real world, where a person is likely to seek recommendations from like-minded people. What constitutes likeness is dependent on the domain. For example, you are most likely to have one group of friends that you would ask for restaurant recommendations and another group of friends for movie recommendations. What determines these groups is how similar their tastes are to your own taste for that particular domain. We can replicate this using the (user-based) Collaborative Filtering (CF) algorithm. This algorithm achieves this by finding the distance between each user and then using these distances as a similarity metric to infer predictions on movies for a particular user; that is, those that are more similar will contribute more to the prediction than those that have different preferences. Let's have a look at what form the data might take from Netflix:

User	Movie	Rating
0: Jo	A: Monsters Inc	5
	B: The Bourne Identity	2
	C: The Martian	2
	D: Blade Runner	1
1: Sam	C: The Martian	4
	D: Blade Runner	4
	E: The Matrix	4
	F: Inception	5
2: Chris	B: The Bourne Identity	4
	C: The Martian	5
	D: Blade Runner	5
	F: Inception	4

For each example, we have a user, a movie, and an assigned rating. To find the similarity between each user, we can first calculate the Euclidean distance of the shared movies between each pair of users. The Euclidean distance gives us larger values for users who are most dissimilar; we invert this by dividing 1 by this distance to give us a result, where 1 represents perfect matches and 0 means the users are most dissimilar. The following is the formula for Euclidean distance and the function used to calculate similarities between two users:

Equation for Euclidian distance and similarity

func calcSimilarity(userRatingsA: [String:Float], userRatingsB:[String:Float]) -> Float{
  var distance = userRatingsA.map( { (movieRating) -> Float in 
    if userRatingsB[movieRating.key] == nil{
      return 0 
    }
    let diff = movieRating.value - (userRatingsB[movieRating.key] ?? 0)
    return diff * diff
  }).reduce(0) { (prev, curr) -> Float in 
    return prev + curr
  }.squareRoot()
  return 1 / (1 + distance)
}

To make this more concrete, let's walk through how we can find the most similar user for Sam, who has rated the following movies: ["The Martian" : 4, "Blade Runner" : 4, "The Matrix" : 4, "Inception" : 5]. Let's now calculate the similarity between Sam and Jo and then between Sam and Chris.

Sam and Jo

Jo has rated the movies ["Monsters Inc." : 5, "The Bourne Identity" : 2, "The Martian" : 2, "Blade Runner" : 1]; by calculating the similarity of intersection of the two sets of ratings for each user, we get a value of 0.22.

Sam and Chris

Similar to the previous ones, but now, by calculating the similarity using the movie ratings from Chris (["The Bourne Identity" : 4, "The Martian" : 5, "Blade Runner" : 5, "Inception" : 4]), we get a value of 0.37.

Through manual inspection, we can see that Chris is more similar to Sam than Jo is, and our similarity rating shows this by giving Chris a higher value than Jo.

To help illustrate why this works, let's project the ratings of each user onto a chart as shown in the following graph:

The preceding graph shows the users plotted in a preference space; the closer two users are in this preference space, the more similar their preferences are. Here, we are just showing two axes, but, as seen in the preceding table, this extends to multiple dimensions.

We can now use these similarities as weights that contribute to predicting the rating a particular user would give to a particular movie. Then, using these predictions, we can recommend some movies that a user is likely to want to watch.

The preceding approach is a type of clustering algorithm that falls under unsupervised learning, a learning style where examples have no associated label and the job of the ML algorithm is to find patterns within the data. Other common unsupervised learning algorithms include the Apriori algorithm (basket analysis) and K-means.

Recommendations are applicable anytime when there is an abundance of information that can benefit from being filtered and ranked before being presented to the user. Having recommendations performed on the device offers many benefits, such as being able to incorporate the context of the user when filtering and ranking the results.

Shadow draw – real-time user guidance for freehand drawing

To highlight the synergies between man and machine, AI is sometimes referred to as Augmented Intelligence (AI), putting the emphasis on the system to augment our abilities rather than replacing us altogether.

One area that is becoming increasingly popular—and of particular interest to myself—is assisted creation systems, an area that sits at the intersection of the fields of human-computer interaction (HCI) and ML. These are systems created to assist in some creative tasks such as drawing, writing, video, and music.

The example we will discuss in this section is shadow draw, a research project undertaken at Microsoft in 2011 by Y.J. Lee, L. Zitnick, and M. Cohen. Shadow draw is a system that assists the user in drawing by matching and aligning a reference image from an existing dataset of objects and then lightly rendering shadows in the background to be used as guidelines for the user. For example, if the user is predicted to be drawing a bicycle, then the system would render guidelines under the user's pen to assist them in drawing the object, as illustrated in this diagram:

As we did before, let's walk through how we might approach this, focusing specifically on classifying the sketch; that is, we'll predict what object the user is drawing. This will give us the opportunity to see new types of data, algorithms, and applications of ML.

The dataset used in this project consisted of 30,000 natural images collected from the internet via 40 category queries such as face, car, and bicycle, with each category stored in its own directory; the following diagram shows some examples of these images:

After obtaining the raw data, the next step, and typical of any ML project, is to perform data preprocessing and feature engineering. The following diagram shows the preprocessing steps, which consist of:

Rescaling each image
Desaturating (turning black and white)
Edge detection

Our next step is to abstract our data into something more meaningful and useful for our ML algorithm to work with; this is known as feature engineering, and is a critical step in a typical ML workflow.

One approach, and the approach we will describe, is creating something known as a visual bag of words. This is essentially a histogram of features (visual words) used to describe each image, and collectively to describe each category. What constitutes a feature is dependent on the data and ML algorithm; for example, we can extract and count the colors of each image, where the colors become our features and collectively describe our image, as shown in the following diagram:

But because we are dealing with sketches, we want something fairly coarse—something that can capture the general strokes directions that will encapsulate the general structure of the image. For example, if we were to describe a square and a circle, the square would consist of horizontal and vertical strokes, while the circle would consist mostly of diagonal strokes. To extract these features, we can use a computer vision algorithm called histogram of oriented gradients (HOG); after processing an image you are returned a histogram of gradient orientations in localized portions of the image. Exactly what we want! To help illustrate the concept, this process is summarized for a single image here:

After processing all the images in our dataset, our next step is to find a histogram (or histograms) that can be used to identify each category; we can use an unsupervised learning clustering technique called K-means, where each category histogram is the centroid for that cluster. The following diagram describes this process; we first extract features for each image and then cluster these using K-means, where the distance is calculated using the histogram of gradients. Once our images have been clustered into their groups, we extract the center (mean) histogram of each of these groups to act as our category descriptor:

Once we have obtained a histogram for each category (codebook), we can train a classifier using each image's extracted features (visual words) and the associated category (label). One popular and effective classifier is support vector machines (SVM). What SVM tries to find is a hyperplane that best separates the categories; here, best refers to a plane that has the largest distance between each of the category members. The term hyper is used because it transforms the vectors into high-dimensional space such that the categories can be separated with a linear plane (plane because we are working within a space). The following diagram shows how this may look for two categories in a two-dimensional space:

With our model now trained, we can perform real-time classification on the image as the user is drawing, thus allowing us to assist the user by providing them with guidelines for the object they are wanting to draw (or at least, mention the object we predicted them to be drawing). Perfectly suited for touch interfaces such as your iPhone or iPad! This assists not just in drawing applications, but anytime where an input is required by the user, such as image-based searching or note taking.

In this example, we showed how feature engineering and unsupervised learning are used to augment data, making it easier for our model to sufficiently perform classification using the supervised learning algorithm SVM. Prior to deep neural networks, feature engineering was a critical step in ML and sometimes a limiting factor for these reasons:

It required special skills and sometimes domain expertise
It was at the mercy of a human being able to find and extract meaningful features
It required that the features extracted would generalize across the population, that is, be expressive enough to be applied to all examples

In the next example, we introduce a type of neural network called a convolutional neural network (CNN or ConvNet), which takes care of a lot of the feature engineering itself.

The paper describing the actual project and approach can be found here: http://vision.cs.utexas.edu/projects/shadowdraw/shadowdraw.html.

Shutterstock – image search based on composition

Over the past 10 years, we have seen an explosive growth in visual content created and consumed on the Web, but before the success of CNNs, images were found by performing simple keyword searches on the tags assigned manually. All this changed around 2012, when A. Krizhevsky, I. Sutskever, and G. E. Hinton published their paper ImageNet Classification with Deep Convolutional Networks. The paper described their architecture used to win the 2012 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). It's a competition like the Olympics of computer vision, where teams compete across a range of CV tasks such as classification, detection, and object localization. And that was the first year a CNN gained the top position with a test error rate of 15.4% (the next best entry achieved an test error rate of 26.2%). Ever since then, CNNs have become the de facto approach for computer vision tasks, including becoming the new approach for performing visual search. Most likely, it has been adopted by the likes of Google, Facebook, and Pinterest, making it easier than ever to find that right image.

Recently, (October 2017), Shutterstock announced one of the more novel uses of CNNs, where they introduced the ability for their users to search for not only multiple items in an image, but also the composition of those items. The following screenshot shows an example search for a kitten and a computer, with the kitten on the left of the computer:

So what are CNNs? As previously mentioned, CNNs are a type of neural network that are well suited for visual content due to their ability to retain spatial information. They are somewhat similar to the previous example, where we explicitly define a filter to extract localized features from the image. A CNN performs a similar operation, but unlike our previous example, filters are not explicitly defined. They are learned through training, and they are not confined to a single layer but rather build with many layers. Each layer builds upon the previous one and each becomes increasingly more abstract (abstract here means a higher-order representation, that is, from pixels to shapes) in what it represents.

To help illustrate this, the following diagram visualizes how a network might build up its understanding of a cat. The first layer's filters extract simple features, such as edges and corners. The next layer builds on top of these with its own filters, resulting in higher-level concepts being extracted, such as shapes or parts of the cat. These high-level concepts are then combined for classification purposes:

This ability to get a deeper understanding of the data and reduce the dependency on manual feature engineering has made deep neural networks one of the most popular ML algorithms over the past few years.

To train the model, we feed the network examples using images as inputs and labels as the expected outputs. Given enough examples, the model will build an internal representation for each label, which can be sufficiently used for classification; this, of course, is a type of supervised learning.

Our last task is to find the location of the item or items; to achieve this, we can inspect the weights of the network to find out which pixels activated a particular class, and then create a bounding box around the inputs with the largest weights.

We have now identified the items and their locations within the image. With this information, we can preprocess our repository of images and cache it as metadata to make it accessible via search queries. We will revisit this idea later in the book when you will get a chance to implement a version of this to assist the user in finding images in their photo album.

In this section, we saw how ML can be used to improve user experience and briefly introduced the intuition behind CNNs, a neural network well suited for visual contexts, where retaining proximity of features and building higher levels of abstraction is important. In the next section, we will continue our exploration of ML applications by introducing another example that improves the user experience and a new type of neural network that is well suited for sequential data such as text.

iOS keyboard prediction – next letter prediction

Quoting usability expert Jared Spool, Good design, when done well, should be invisible. This holds true for ML as well. The application of ML need not be apparent to the user and sometimes (more often than not) more subtle uses of ML can prove just as impactful.

A good example of this is an iOS feature called dynamic target resizing; it is working every time you type on an iOS keyboard, where it actively tries to predict what word you're trying to type:

Using this prediction, the iOS keyboard dynamically changes the touch area of a key (here illustrated by the red circles) that is the most likely character based on what has already been typed before it.

For example, in the preceding diagram, the user has entered "Hell"; now it would be reasonable to assume that the most likely next character the user wants to tap is "o". This is intuitive given our knowledge of the English language, but how do we teach a machine to know this?

This is where recurrent neural networks (RNNs) come in; it's a type of neural network that persists state over time. You can think of this persisted state as a form of memory, making RNNs suitable for sequential data such as text (any data where the inputs and outputs are dependent on each other). This state is created by using a feedback loop from the output of the cell, as shown in the following diagram:

The preceding diagram shows a single RNN cell. If we unroll this over time, we would get something that looks like the following:

Using hello as our example, the preceding diagram shows an unrolled RNN over five time steps; at each time step, the RNN predicts the next likely character. This prediction is determined by its internal representation of the language (from training) and subsequent inputs. This internal representation is built by training it on samples of text where the output is using the inputs but at the next time step (as illustrated earlier). Once trained, the inference follows a similar path, except that we feed to the network the predicted character from the output, to get the next output (to generate the sequence, that is, words).

Neural networks and most ML algorithms require their inputs to be numbers, so we need to convert our characters to numbers, and back again. When dealing with text (characters and words), there are generally two approaches: one-hot encoding and embeddings. Let's quickly cover each of these to get some intuition of how to handle text.

Text (characters and words) is considered categorical, meaning that we cannot use a single number to represent text because there is no inherit relationship between the text and the value; that is, assigning the 10 and cat 20 implies that cat has a greater value than the. Instead, we need to encode them into something where no bias is introduced. One solution to this is encoding them using one-hot encoding, which uses an array of the size of your vocabulary (number of characters in our case), with the index of the specific character set to 1 and the rest set to 0. The following diagram illustrates the encoding process for the corpus "hello":

In the preceding diagram, we show some of the steps required when encoding characters; we start off by splitting the corpus into individual characters (called tokens, and the process is called tokenization). Then we create a set that acts as our vocabulary, and finally we encode this with each character being assigned a vector.

Here, we'll only present some of the steps required for preparing text before passing it to our ML algorithm.

Once our inputs are encoded, we can feed them into our network. Outputs will also be represented in this format, with the most likely character being the index with the greatest value. For example, if 'e' is predicted, then the most likely the output may resemble something like [0.95, 0.2, 0.2, 0.1].

But there are two problems with one-hot encoding. The first is that for a large vocabulary, we end up with a very sparse data structure. This is not only an inefficient use of memory, but also requires additional calculations for training and inference. The second problem, which is more obvious when operating on words, is that we lose any contextual meaning after they have been encoded. For example, if we were to encode the words dog and dogs, we would lose any relationship between these words after encoding.

An alternative, and something that addresses these two problems, is using an embedding. These are generally weights from a trained network that use a dense vector representation for each token, one that preserves some contextual meaning. This book focuses on computer vision tasks, so we won't be going into the details here. Just remember that we need to encode our text (characters) into something our ML algorithm will accept.

We train the model using weak supervision, similar to supervised learning, but inferring the label without it having been explicitly labelled. Once trained, we can predict the next character using multi-class classification, as described earlier.

Over the past couple of years, we have seen the evolution of assistive writing; one example is Google's Smart Reply, which provides an end-to-end method for automatically generating short email responses. Exciting times!

This concludes our brief tour of introducing types of ML problems along with the associated data types, algorithms, and learning style. We have only scratched the surface of each, but as you make your way through this book, you will be introduced to more data types, algorithms, and learning styles.

In the next section, we will take a step back and review the overall workflow for training and inference before wrapping up this chapter.

Yiannis Aug 21, 2018

With Core ML, iOS (as of iOS 11) now allows iPhone developers to easily integrate trained machine learning models into their apps with a few line of code. As Machine Learning is the talk of the town nowadays, it's a well sought-after skill for any iOS developer to know how they can introduce some AI/ML magic into their apps. And this is exactly what this book achieves and it does so in a way that:1. Doesn't require much knowledge of machine learning. As a matter of fact, it delivers a short and sweet introduction of key ML concepts that is all you need to know to understand how to utilise trained models.2. Explains with clear examples what the possibilities are. Examples are easy to follow and run on your device. As seeing is believing, these examples can offer food for thought for you or your clients (if you are in the business for developing apps for other).By far the best book about Core ML in the market!

Amazon Verified review

Machine Learning with Core ML: An iOS developer's guide to implementing machine learning in mobile apps

What do you get with Print?

Machine Learning with Core ML

Introduction to Machine Learning

What is machine learning?

A brief tour of ML algorithms

Netflix – making recommendations

Shadow draw – real-time user guidance for freehand drawing

Shutterstock – image search based on composition

iOS keyboard prediction – next letter prediction

A typical ML workflow

Summary

Page 1 of 5

Key benefits

Description

Who is this book for?

What you will learn

Product Details

What do you get with Print?

Product Details

Frequently bought together

Table of Contents

Recommendations for you

Customer reviews

People who bought this also bought

About the author

FAQs

Machine Learning with Core ML: An iOS developer's guide to implementing machine learning in mobile apps

What do you get with Print?

Contact Details

Shipping Address

Billing Address

Key benefits

Description

Who is this book for?

What you will learn

Product Details

What do you get with Print?

Contact Details

Shipping Address

Billing Address

Product Details

Packt Subscriptions

Frequently bought together

Table of Contents

Recommendations for you

Customer reviews

People who bought this also bought

About the author

FAQs

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access