Analyzing Data with the Help of Mathematica

Sergiy Suchok

December 2015

In this article, Sergiy Suchok, author of the book Mathematica Data Analysis, shows how the many algorithms already implemented in Mathematica can be used to build fast and compact solutions for a given task.

In this article, you will learn the following:

  • Clustering information and classifying data
  • Recognizing an object in a picture and assigning it to a particular category
  • Identifying people's faces in a photo
  • Recognizing text information and determining the language of the text
  • Reading the different types of barcodes

Data clustering

Clusters are groups of data elements that are very close or similar to one another. For example, a group of people can be divided into clusters according to age, height, sex, social status, and so on. Clustering helps us understand the input better: if we know the properties of one element of a cluster, the other elements are likely to share them. Finding clusters requires no labeled examples (it is an unsupervised learning technique) and can be based on two functions: a distance function, which gives the distance between elements (the closer two elements are to each other, the more likely they are to be in the same cluster), and a dissimilarity function, which returns the degree of dissimilarity between elements.

To cluster data, we'll use the FindClusters function. First, let's consider its application in simple examples:

By default, the FindClusters function finds clusters on the basis of the shortest distance. Since we have not specified how many clusters we would like to find, the result of running FindClusters[list] is two clusters: {1, -2, 6, 3} and {11, 17, 15, 21}. In the second case, FindClusters[list, 3], we have explicitly indicated that the data list contains three clusters, and based on this, the function has found them.
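The two calls described above can be sketched as follows. The input list here is an assumption (the original list is not shown); it simply contains the values mentioned in the text, and the grouping returned may vary between Mathematica versions.

```wolfram
(* Hypothetical input list; the list used in the article is not shown *)
list = {1, 11, -2, 17, 6, 15, 3, 21};

(* Let FindClusters choose the number of clusters itself *)
auto = FindClusters[list]

(* Explicitly ask for three clusters *)
three = FindClusters[list, 3]
```

Whatever grouping is chosen, every input element ends up in exactly one cluster.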

Let's generate three sets of random points distributed by a normal distribution and see how the FindClusters function will divide them:

As you can see, we have got four quite clear data sets.
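A minimal sketch of such an experiment is shown here. The centers, the standard deviation, and the number of points per set are assumptions chosen for illustration, not the values used in the article.

```wolfram
SeedRandom[1];  (* fix the seed so that the sketch is reproducible *)

(* Three sets of 2D points, each normally distributed around its own center *)
centers = {{0, 0}, {5, 5}, {10, 0}};
data = Join @@ Table[
    Map[c + # &, RandomVariate[NormalDistribution[0, 1], {50, 2}]],
    {c, centers}];

(* Cluster the points and plot each cluster as a separate series *)
clusters = FindClusters[data];
ListPlot[clusters]
```

Because the number of clusters is chosen automatically here, it does not have to match the number of sets that were generated.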

To fine-tune the recognition of clusters, you can use advanced settings. One of them is the distance function f, which should satisfy the following conditions for any elements e_i and e_j:

  • f(e_i, e_i) = 0
  • f(e_i, e_j) ≥ 0
  • f(e_i, e_j) = f(e_j, e_i)

Mathematica supports the following distance functions:

  • EuclideanDistance[u,v]: The Euclidean norm, Norm[u-v]
  • SquaredEuclideanDistance[u,v]: The squared Euclidean norm, Norm[u-v]^2
  • ManhattanDistance[u,v]: The Manhattan distance, Total[Abs[u-v]]
  • ChessboardDistance[u,v]: The chessboard or Chebyshev distance, Max[Abs[u-v]]
  • CanberraDistance[u,v]: The Canberra distance, Total[Abs[u-v]/(Abs[u]+Abs[v])]
  • CosineDistance[u,v]: The cosine distance, 1-u.v/(Norm[u] Norm[v])
  • CorrelationDistance[u,v]: The correlation distance, 1-(u-Mean[u]).(v-Mean[v])/(Norm[u-Mean[u]] Norm[v-Mean[v]])
  • BrayCurtisDistance[u,v]: The Bray–Curtis distance, Total[Abs[u-v]]/Total[Abs[u+v]]

In order to use a specific function, simply specify it in the parameters as FindClusters[data, DistanceFunction -> ManhattanDistance].
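As a small illustration, the sketch below evaluates two of the built-in distances on sample vectors and then passes one of them to FindClusters. The vectors and the data are assumptions made up for this example.

```wolfram
u = {1, 2, 3};
v = {4, 0, 3};

(* Manhattan distance: sum of absolute coordinate differences, 3 + 2 + 0 *)
ManhattanDistance[u, v]   (* 5 *)

(* Correlation distance written out with vector norms, for comparison
   with the built-in function *)
corr = 1 - (u - Mean[u]).(v - Mean[v])/(Norm[u - Mean[u]] Norm[v - Mean[v]]);
N[{CorrelationDistance[u, v], corr}]

(* Use a specific distance function when clustering *)
data = {{1, 1}, {1, 2}, {2, 1}, {8, 8}, {9, 8}, {8, 9}};
FindClusters[data, DistanceFunction -> ManhattanDistance]
```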

You can also specify the method of finding clusters (Agglomerate or Optimize), conduct a test to determine the best number of clusters (the SignificanceTest parameter), as well as specify the clustering linkage to be used (the Linkage parameter):

In this case, we have used the agglomerative method with the largest intercluster dissimilarity and got three clusters.
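A call of this kind might look as follows. The data is an assumption, and the option syntax is the one documented for FindClusters in Mathematica 10; later versions may expect a different form, so treat this as a sketch rather than the article's original listing.

```wolfram
(* Illustrative data; the data set used in the article is not shown *)
data = {1, 2, 3, 10, 11, 12, 30, 31, 33};

(* Agglomerative clustering with complete linkage,
   that is, the largest intercluster dissimilarity *)
FindClusters[data, Method -> {"Agglomerate", "Linkage" -> "Complete"}]
```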

Similarly, you can cluster arrays of higher dimensions and data of various types: text, Boolean, and numeric.

Data classification

Whereas clustering is unsupervised learning, classification is supervised: we know which groups part of the data belongs to, and we want to determine the probability with which a new, unknown element belongs to one group or another.

For example, using the Classify function, let's try to teach Mathematica which numbers are even and which are odd:

We have labeled several even and several odd numbers, then trained a classifier on them with the Classify function using its default parameters, and finally we can see how the missing elements, 5, 0, and 10, are classified. According to the result, they were classified correctly.

With the help of the Probabilities parameter, you can determine how likely it is for an element to belong to a particular class.

The ClassifierInformation function provides information about the sample data based on which the classification took place:
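The steps described in this section can be sketched as follows. The training set is an assumption (the original one is not shown), and the predictions depend on the training data and on the method that Classify selects, so no particular output is claimed here.

```wolfram
(* A few numbers labeled by parity; 5, 0, and 10 are deliberately left out *)
training = {1 -> "odd", 2 -> "even", 3 -> "odd", 4 -> "even",
            6 -> "even", 7 -> "odd", 8 -> "even", 9 -> "odd"};

(* Train a classifier with default parameters *)
c = Classify[training];

(* Classify the elements that were not in the training set *)
c[{5, 0, 10}]

(* Class probabilities for a single element *)
c[5, "Probabilities"]

(* Summary of the trained classifier *)
ClassifierInformation[c]
```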

In Mathematica, there are also built-in classifiers from different areas of knowledge:

  • CountryFlag: Which country a flag image is for
  • FacebookTopic: Which topic a Facebook post is about
  • Language: Which natural language the text is in
  • NameGender: Which gender a first name is
  • Profanity: Whether the text contains profanity
  • Sentiment: The sentiment of a social media post
  • Spam: Whether an email is spam

In order to use one of these classifiers, you should specify its name as the first parameter of the Classify function:
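For instance, calls of the following form could be used. The sample strings are made up for illustration, and the built-in classifiers may need to be downloaded from Wolfram servers on first use.

```wolfram
(* Which natural language is this text written in? *)
Classify["Language", "Bonjour tout le monde"]

(* What is the sentiment of this short message? *)
Classify["Sentiment", "I really enjoyed this book"]
```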

Let's consider how we can use the data obtained with the help of the Classify function. For example, let's take a set of data that contains a list of Titanic passengers with their age, sex, ticket class, and survival and classify this using the logistic regression method:

Now we'll construct the survival probability graph depending on the passenger's sex and the class in which they travelled:

Based on the graph, we see that women from the first class had a very high probability of survival, whereas men from the third class had a very low one. The graph also suggests that men from the first class gave up their places to women from the third class.
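A sketch of this analysis is given below. It assumes that the Titanic sample from ExampleData stores each passenger as {class, age, sex} -> outcome with the labels "survived" and "died". The helper name survival is hypothetical, and plotting against age is a choice made for this example.

```wolfram
(* Titanic sample data; assumed form: {class, age, sex} -> outcome *)
titanic = ExampleData[{"MachineLearning", "Titanic"}, "TrainingData"];

(* Train a classifier using logistic regression *)
c = Classify[titanic, Method -> "LogisticRegression"];

(* Hypothetical helper: probability of the assumed label "survived" *)
survival[class_, sex_, age_?NumericQ] :=
  c[{class, age, sex}, "Probabilities"]["survived"];

(* Survival probability by age for several class and sex combinations *)
Plot[{survival["1st", "female", age], survival["3rd", "female", age],
      survival["1st", "male", age], survival["3rd", "male", age]},
  {age, 0, 80},
  PlotLegends -> {"1st, female", "3rd, female", "1st, male", "3rd, male"},
  AxesLabel -> {"age", "P(survived)"}]
```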

The Classify function is also useful for recognizing an author's writing style. For example, let's take classic works by three authors: Shakespeare, Oscar Wilde, and Victor Hugo:

We'll use these works to train the classifier and then present other works to see whether it manages to guess the author:

As you can see, we have used the Markov method for training, and the authors of the works were accurately identified.

Image recognition

With the help of Mathematica's knowledge base on pattern recognition, we can engage in the development of artificial intelligence. Mathematica needs only one photo to report what is depicted in it. This is possible, thanks to the ImageIdentify function:

However, this definition is not limited to only one category. Since it is not always possible to identify exactly what is depicted in an image, you should specify additional parameters to be able to select among options. For example, in this case, we are asking for 10 possible options to be shown:

If we exactly know that there is an edible fruit in the image, then Mathematica will help to classify it. For example, in the next case, we ask for 10 types of edible fruits, which could match this image together with their probabilities:

Note one very useful function—WordCloud. This allows you to build a cloud of words depending on their frequency, which helps to define what is in the image more clearly.
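The calls discussed in this section might be combined as in the sketch below. The test image is an assumption, and any Image expression can be used instead. Restricting the search to a category such as edible fruit requires the corresponding entity, which is easiest to enter interactively and is therefore not shown.

```wolfram
(* A standard test image, assumed to be available through ExampleData *)
img = ExampleData[{"TestImage", "Peppers"}];

(* Single best identification *)
ImageIdentify[img]

(* Ten candidate identifications from all categories *)
ImageIdentify[img, All, 10]

(* The same candidates together with their probabilities *)
probs = ImageIdentify[img, All, 10, "Probability"]

(* Word cloud in which each candidate is sized by its probability *)
WordCloud[probs]
```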

The ImageInstanceQ function is also quite interesting. It allows you to identify whether the object depicted in the image is related to a specific category. For example, we can determine whether there is a tree depicted in the image:

To enter the name of the category (tree), press the Ctrl + = key combination and choose the definition you have in mind from the knowledge base.

Recognizing faces

When a photo is added to a social networking website, there is an option to tag friends who are depicted in it. This is done with the help of a face detection function. Using the FindFaces Mathematica function, you can implement a similar feature in your applications. The input parameter of the function is the image and the output parameters are the coordinates of all the rectangles that show the people's faces:

Pay attention to the /@ notation, which is a short form of the Map[f, expr] function that applies f to each element at the first level of expr. For example, the result of f /@ {a, b, c, d, e} is a list of applications of f to each of these elements: {f[a], f[b], f[c], f[d], f[e]}.

In this example, the FindFaces function has detected multiple faces in a photo, and the HighlightImage function has made it possible to highlight these faces directly in the photo.

NOTE: Take into account the Rectangle @@@ FindFaces[fam] notation. It is a short form of the Apply[f, expr, {1}] function, which replaces the heads at level 1 of expr with f. For example, the result of f @@@ {{a}, {b}, {c}, {d}} will be {f[a], f[b], f[c], f[d]}.
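Both short forms can be tried on symbolic input. The last line uses made-up corner coordinates to show how pairs of points become Rectangle objects.

```wolfram
ClearAll[f, a, b, c, d, e];  (* make sure the symbols have no values *)

(* Map: apply f to every element at the first level *)
mapped = f /@ {a, b, c, d, e}
(* {f[a], f[b], f[c], f[d], f[e]} *)

(* Apply at level 1: replace the head of each sublist with f *)
applied = f @@@ {{a}, {b}, {c}, {d}}
(* {f[a], f[b], f[c], f[d]} *)

(* The same idea with a real head: corner pairs become rectangles *)
rects = Rectangle @@@ {{{0, 0}, {1, 1}}, {{2, 2}, {3, 4}}}
(* {Rectangle[{0, 0}, {1, 1}], Rectangle[{2, 2}, {3, 4}]} *)
```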

The FindFaces function also has a parameter of the minimum face area size, which helps to prevent incorrect face recognition.

A demonstration of this parameter is shown as follows:

In this picture, we see five areas that were wrongly recognized as faces. We can remove them by limiting the size of the face area from 140 pixels to infinity.
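A minimal sketch of face highlighting with a size limit is given below. The function name highlightFaces and its default argument are hypothetical, and the photo has to be supplied by the reader.

```wolfram
(* Hypothetical helper: highlight detected faces in img, accepting only
   faces whose size lies between minSize pixels and infinity *)
highlightFaces[img_Image, minSize_: 0] :=
  HighlightImage[img, Rectangle @@@ FindFaces[img, {minSize, Infinity}]]

(* Usage with your own photo, for example:
   fam = Import["family.jpg"];   (hypothetical file name)
   highlightFaces[fam]           (all detected faces)
   highlightFaces[fam, 140]      (only faces of at least 140 pixels) *)
```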

In order to insert the infinity symbol, ∞, you have to press Esc inf Esc.

Recognizing text information

You can solve the problem of textual information recognition using the TextRecognize function, which recognizes the text in an image and returns it as a string. This function works with both grayscale and multichannel images. By default, it recognizes text written in English; however, using the Language parameter, you can specify that the text is written in one of the following languages: French, German, Italian, Portuguese, Russian, and Spanish.

This function handles different ways to write text, such as text that is written at an angle:

In addition to text recognition, you can also speak this text out loud if you choose the Speak representation:
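The sketch below builds its own test image by rasterizing a string, so it does not depend on any external file. The sample text, font size, and rotation angle are assumptions, and recognition quality may vary.

```wolfram
(* Create a test image that contains text *)
img = Rasterize[Style["Mathematica reads text", 48]];

(* Recognize the text; the result is a string *)
text = TextRecognize[img]

(* The same text written at an angle, on a white background *)
TextRecognize[ImageRotate[img, 10 Degree, Background -> White]]

(* For text in another supported language, add the Language option,
   for example: TextRecognize[frenchImage, Language -> "French"]
   where frenchImage is a hypothetical image of French text *)

(* Read the recognized text aloud *)
Speak[text]
```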

Another important capability is determining the language in which a text is written, which helps with automatic translation:
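Two ways of doing this are sketched below, assuming Mathematica 10.1 or later for LanguageIdentify. The sample sentence is made up for illustration.

```wolfram
sample = "Guten Morgen, wie geht es Ihnen?";

(* Dedicated function for language detection *)
LanguageIdentify[sample]

(* The built-in "Language" classifier mentioned earlier *)
Classify["Language", sample]
```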

Recognizing barcodes

Information in the form of a barcode is compact and can be easily recognized by software. In Mathematica, there is a BarcodeRecognize function, which can recognize barcodes in an image and return their content. It recognizes 11 one-dimensional barcode formats, including the following:

  • "UPC" (UPC-A): 12 numerical digits
  • "UPCE" (UPC-E): 6 numerical digits
  • "EAN8" (EAN-8): 8 numerical digits
  • "EAN13" (EAN-13): 13 numerical digits
  • "Code39" (Code 39): Up to 43 characters of uppercase letters, numeric digits, and special characters such as -, ., $, /, +, %, and space
  • "Code93" (Code 93): Uppercase letters, numeric digits, and special characters such as -, ., $, /, +, %, and space
  • "Code128" (Code 128): Up to 80 ASCII characters
  • "ITF" (ITF): Up to 80 numerical digits of an even length
  • "Codabar" (Codabar): Numerical digits and special characters such as :, /, +, .
  • "GS1" (GS1 DataBar, or RSS): 14 numerical digits
  • "ExpandedGS1" (GS1 Expanded and Expanded Stacked): 74 digits or 41 alphanumeric characters in a single row, or up to 11 stacked rows (GS1 DataBar Expanded Stacked)

The function also recognizes five two-dimensional barcodes, as follows:

  • {"QR",lev} (QR and error correction level): Variable-length ASCII characters
  • {"PDF417",lev} (PDF417 and error correction level): Variable-length ASCII characters
  • "Aztec" (Aztec code): Variable-length ASCII characters
  • "DataMatrix" (Data Matrix code): Variable-length ASCII characters
  • "MaxiCode" (MaxiCode): Up to 93 ASCII characters

To decode a barcode, simply pass its image to the BarcodeRecognize function:

Using additional parameters, you can also find out the type of a barcode:

In the following example, let's write the procedure that decodes the barcode meaning and adds it to the original image:

Let's consider this function. The Show function displays graphics with the specified options added. Using the SetAlphaChannel function, we have set the transparency level of the original image. Then, using the Inset function, we have defined the red-colored area in which the barcode and its decoded text are highlighted. The boundaries of this area were obtained with the BoundingBox parameter, which we specified when scanning the code with the BarcodeRecognize function. Recall that #[[1]] denotes the first element of a list, so in our case, #[[1]] is the bc image, #[[2]] holds the coordinates of the barcode's borders, #[[3]] is the decoded text of the barcode, and #[[4]] is the format of the barcode.

In addition, using the BarcodeImage function, you can generate different barcodes programmatically.
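A round trip from text to barcode and back can be sketched as follows. The encoded string is an assumption, and the property names "Data" and "Format" follow the Wolfram documentation for BarcodeRecognize; check them against your Mathematica version.

```wolfram
(* Generate a QR code image from a string *)
bc = BarcodeImage["Mathematica Data Analysis", "QR"];

(* Decode the content of the barcode *)
BarcodeRecognize[bc]

(* Ask for the content together with the barcode format *)
BarcodeRecognize[bc, {"Data", "Format"}]
```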

Summary

In this article, we reviewed the main points of data analysis. We learned about the Mathematica functions that help to perform data classification (a supervised learning technique) and data clustering (an unsupervised learning technique). We got to know how to recognize faces, classify objects in an image, and work with textual information by recognizing the text in an image and identifying its language. Apart from this, we looked at barcodes as a way of simplifying the machine recognition of information, and we learned to read and create them.
