You're reading from Learning Data Mining with Python - Second Edition

Product type Book

Published in Apr 2017

Publisher Packt

ISBN-13 9781787126787

Pages 358 pages

Edition 2nd Edition

Languages

Python

Concepts

Data Mining

Table of Contents (20) Chapters

Title Page

Credits

About the Author

About the Reviewer

www.PacktPub.com

Customer Feedback

Preface

Getting Started with Data Mining

Classifying with scikit-learn Estimators

Predicting Sports Winners with Decision Trees

Recommending Movies Using Affinity Analysis

Features and scikit-learn Transformers

Social Media Insight using Naive Bayes

Follow Recommendations Using Graph Mining

Beating CAPTCHAs with Neural Networks

Authorship Attribution

Clustering News Articles

Object Detection in Images using Deep Neural Networks

Working with Big Data

Next Steps...

Chapter 8. Beating CAPTCHAs with Neural Networks

Images pose interesting and difficult challenges for data miners. Until recently, only limited progress had been made in analyzing images to extract information. Recently, however, significant advances have been made in a very short time frame, as the progress on self-driving cars shows. The latest research is providing algorithms that can understand images for commercial surveillance, self-driving vehicles, and person identification.

There is a lot of raw data in an image, and the standard method for encoding images (pixels) isn't that informative by itself. Images and photos can be blurry, too close to the targets, too dark, too light, scaled, cropped, skewed, or affected by any of a variety of other problems that cause havoc for a computer system trying to extract useful information. Neural networks can combine these lower-level features into higher-level patterns that generalize better and cope with these issues.

In this...

Artificial neural networks

Neural networks are a class of algorithms that were originally designed based on the way that human brains work. However, modern advances are generally based on mathematics rather than biological insights. A neural network is a collection of neurons that are connected together. Each neuron is a simple function of its inputs, which are combined using some function to generate an output:

The function that defines a neuron's processing can be any standard function, such as a linear combination of the inputs, and is called the activation function. For the commonly used learning algorithms to work, we need the activation function to be differentiable and smooth. A frequently used activation function is the logistic function, which is defined by the following equation (k is often simply 1, x is the input into the neuron, and L is normally 1, that is, the maximum value of the function):
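As a minimal sketch of the idea, assuming the standard logistic form L / (1 + e^(-k * x)), the following shows a single neuron that combines its inputs as a weighted sum and passes the result through the logistic activation. The names logistic, neuron_output, weights, and bias are illustrative assumptions, not names taken from the chapter:

```python
import math


def logistic(x, L=1.0, k=1.0):
    """Logistic function L / (1 + e^(-k * x)); L is the maximum value."""
    return L / (1.0 + math.exp(-k * x))


def neuron_output(inputs, weights, bias=0.0):
    """Combine the inputs as a weighted sum, then apply the activation.

    The weights and bias are hypothetical parameters that a learning
    algorithm would normally adjust during training.
    """
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return logistic(total)


print(logistic(0))                             # midpoint of the curve
print(logistic(-6), logistic(6))               # near 0 and near 1
print(neuron_output([1, 1], [0.5, -0.5]))      # weighted sum is 0
```

With L and k both set to 1, the output always lies between 0 and 1, which makes it easy to interpret as a degree of activation.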

The graph of this function, from -6 to +6, is shown below. The red lines indicate that the value...

Creating the dataset

In this chapter, to spice things up a little, let us take on the role of the bad guy. We want to create a program that can beat CAPTCHAs, allowing our comment spam program to advertise on someone's website. It should be noted that our CAPTCHAs will be a little easier than those used on the web today, and that spamming isn't a very nice thing to do.

Note

We play the bad guy today, but please don't use this against real-world sites. One reason to "play the bad guy" is to help improve the security of our own website by looking for issues with it.

Our experiment will simplify a CAPTCHA to a single four-letter English word, as shown in the following image:

Our goal will be to create a program that can recover the word from images like this. To do this, we will use four steps:

1. Break the image into individual letters.
2. Classify each individual letter.
3. Recombine the letters to form a word.
4. Rank words with a dictionary to try to fix errors.
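The last step can be illustrated with a small sketch. It assumes a simple ranking rule (count the letter positions that differ and pick the dictionary word with the fewest mismatches), which is only one possible way to rank candidates. The function names and the tiny word list are made up for illustration:

```python
def letter_mismatches(predicted, candidate):
    """Count the positions at which two equal-length words differ."""
    return sum(a != b for a, b in zip(predicted, candidate))


def closest_word(predicted, valid_words):
    """Return the valid word with the fewest mismatched letters.

    Only words of the same length are considered. If several words tie,
    the first one in valid_words wins.
    """
    same_length = [w for w in valid_words if len(w) == len(predicted)]
    return min(same_length, key=lambda w: letter_mismatches(predicted, w))


# Hypothetical stand-in for a real dictionary of four-letter words
valid_words = ["BEAR", "BONE", "GENE", "WORD"]

# A prediction with one misclassified letter
print(closest_word("BQNE", valid_words))
```

A rule like this can only repair a prediction when enough of the other letters were classified correctly to single out one dictionary word.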

Note

Our CAPTCHA-busting algorithm...

Training and classifying

We are now going to build a neural network that will take an image as input and try to predict which (single) letter is in the image.

We will use the training set of single letters we created earlier. The dataset itself is quite simple. Each sample is a 20-by-20-pixel image in which every pixel is either 1 (black) or 0 (white). These pixels are the 400 features that we will use as inputs into the neural network. The outputs will be 26 values between 0 and 1, where higher values indicate a higher likelihood that the associated letter (the first neuron is A, the second is B, and so on) is the letter represented by the input image.

We are going to use scikit-learn's MLPClassifier for our neural network in this chapter.
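To make the shapes concrete, here is a small sketch of how one such image maps to 400 input features and how a letter maps to a 26-value target. The image is invented for illustration, and the one_hot helper is a hypothetical name rather than part of the chapter's code:

```python
import numpy as np

letters = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

# A made-up 20x20 binary image containing a single vertical stroke
image = np.zeros((20, 20), dtype=np.float32)
image[2:18, 9:11] = 1

# Flatten the 20x20 grid into one row of 400 features
features = image.reshape(-1)
print(features.shape)


def one_hot(letter):
    """Return 26 values: 1 at the letter's position, 0 elsewhere."""
    target = np.zeros(len(letters), dtype=np.float32)
    target[letters.index(letter)] = 1
    return target


# The second output corresponds to B
target = one_hot("B")
print(target.shape, int(np.argmax(target)))
```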

Note

You will need a recent version of scikit-learn to use MLPClassifier. If the import statement below fails, try again after updating scikit-learn. You can do this using the following Anaconda command: conda update scikit-learn

As for other scikit-learn classifiers, we import...

Predicting words

Now that we have a classifier for predicting individual letters, we move on to the next step in our plan: predicting words. To do this, we predict the letter in each segment of the image and put those predictions together to form the predicted word for a given CAPTCHA.

Our function will accept a CAPTCHA and the trained neural network, and it will return the predicted word:

def predict_captcha(captcha_image, neural_network):
    subimages = segment_image(captcha_image)
    # Perform the same transformations we did for our training data
    dataset = np.array([np.resize(subimage, (20, 20)) for subimage in subimages])
    X_test = dataset.reshape((dataset.shape[0], dataset.shape[1] * dataset.shape[2]))
    # Use predict_proba and argmax to get the most likely prediction
    y_pred = neural_network.predict_proba(X_test)
    predictions = np.argmax(y_pred, axis=1)

    # Convert predictions to letters
    predicted_word = str.join("", [letters[prediction] for prediction...
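The argmax-and-join step can be tried in isolation. The sketch below uses invented probabilities in place of real predict_proba output, and it assumes that letters is the list of the 26 uppercase letters in alphabetical order, which is not shown in this excerpt:

```python
import numpy as np

# Assumed mapping from output index to letter (A is 0, B is 1, ...)
letters = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

# Made-up probabilities for two segments, one row per segment
y_pred = np.full((2, 26), 0.01)
y_pred[0, 7] = 0.75   # first segment most likely H
y_pred[1, 8] = 0.75   # second segment most likely I

# Index of the highest probability in each row
predictions = np.argmax(y_pred, axis=1)

# Map each index back to its letter and join into a word
predicted_word = "".join(letters[p] for p in predictions)
print(predicted_word)  # HI
```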

Summary

In this chapter, we worked with images, using simple pixel values to predict the letter being portrayed in a CAPTCHA. Our CAPTCHAs were a bit simplified; we only used complete four-letter English words. In practice, the problem is much harder, as it should be! With some improvements, it would be possible to solve much harder CAPTCHAs with neural networks and a methodology similar to what we discussed. The scikit-image library contains lots of useful functions for extracting shapes from images, functions for improving contrast, and other image tools that will help.

We took our larger problem of predicting words and created a smaller, simpler problem of predicting letters. From here, we were able to create a feed-forward neural network to accurately predict which letter was in the image. At this stage, our results were very good, with 97 percent accuracy.

Neural networks are simply connected sets of neurons, which are basic computation devices consisting of a single function...