Chapter 8. Beating CAPTCHAs with Neural Networks
Images pose interesting and difficult challenges for data miners. Until recently, only limited progress had been made in analyzing images to extract information. Recently, however, significant advances have been made in a very short time frame, as the progress on self-driving cars shows. The latest research is providing algorithms that can understand images for commercial surveillance, self-driving vehicles, and person identification.
An image contains a lot of raw data, and the standard method for encoding images, pixels, isn't that informative by itself. Images and photos can be blurry, too close to the targets, too dark, too light, scaled, cropped, skewed, or affected by any of a variety of other problems that cause havoc for a computer system trying to extract useful information. Neural networks can combine these lower-level features into higher-level patterns that generalize better and cope with these issues.
In this...
Artificial neural networks
Neural networks are a class of algorithm that was originally designed based on the way that human brains work. However, modern advances are generally based on mathematics rather than biological insights. A neural network is a collection of neurons that are connected together. Each neuron is a simple function of its inputs, which are combined using some function to generate an output:
The function that defines a neuron's processing can be any standard function, such as a linear combination of the inputs, and is called the activation function. For the commonly used learning algorithms to work, we need the activation function to be differentiable and smooth. A frequently used activation function is the logistic function, which is defined by the following equation (k is often simply 1, x is the input into the neuron, and L is normally 1, that is, the maximum value of the function):
f(x) = \frac{L}{1 + e^{-kx}}
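As a minimal sketch of the idea, the logistic function and a single neuron can be written in a few lines of Python. The names `logistic` and `neuron_output`, and the use of per-input weights plus a bias, are assumptions made for this illustration; they are not taken from the chapter's own code.

```python
import math


def logistic(x, L=1.0, k=1.0):
    """Logistic function L / (1 + exp(-k * x)), with the curve centred on x = 0."""
    return L / (1.0 + math.exp(-k * x))


def neuron_output(inputs, weights, bias=0.0):
    """Hypothetical neuron: a weighted sum of the inputs passed through the logistic function."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return logistic(total)


if __name__ == "__main__":
    # With L = 1 and k = 1 the output always lies strictly between 0 and 1
    for x in (-6, 0, 6):
        print(x, logistic(x))
    print(neuron_output([1, 0, 1], [0.5, -0.5, 0.25]))
```

With the default parameters, an input of 0 gives exactly 0.5, and the output approaches 0 and 1 towards the ends of the -6 to +6 range.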
A graph of this function, from -6 to +6, is shown below. The red lines indicate that the value...
In this chapter, to spice things up a little, let us take on the role of the bad guy. We want to create a program that can beat CAPTCHAs, allowing our comment spam program to advertise on someone's website. It should be noted that our CAPTCHAs will be a little easier than those used on the web today and that spamming isn't a very nice thing to do.
Note
We play the bad guy today, but please don't use this against real-world sites. One reason to "play the bad guy" is to help improve the security of our own website by looking for issues with it.
Our experiment will simplify a CAPTCHA to be individual English words of four letters only, as shown in the following image:
Our goal will be to create a program that can recover the word from images like this. To do this, we will use four steps:
- Break the image into individual letters.
- Classify each individual letter.
- Recombine the letters to form a word.
- Rank words with a dictionary to try to fix errors.
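The last step can be illustrated with a small sketch. It uses the similarity ratio from the standard library's `difflib` module as a stand-in; this is an assumption made for the example, not necessarily the measure used later in the chapter. The function name `closest_dictionary_word` and the tiny word list are hypothetical.

```python
import difflib


def closest_dictionary_word(predicted_word, valid_words):
    """Return the dictionary word most similar to the prediction.

    Falls back to the prediction itself if the dictionary is empty.
    """
    # cutoff=0.0 keeps every candidate, so the single best match is always returned
    matches = difflib.get_close_matches(predicted_word, valid_words, n=1, cutoff=0.0)
    return matches[0] if matches else predicted_word


if __name__ == "__main__":
    valid_words = ["bath", "pool", "tree", "moon"]  # hypothetical tiny dictionary
    # One misclassified letter: "bati" is closest to "bath"
    print(closest_dictionary_word("bati", valid_words))
```

A real dictionary would contain every valid four-letter word, so a single wrong letter can often be corrected by picking the nearest valid word.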
Note
Our CAPTCHA-busting algorithm...
We are now going to build a neural network that will take an image as input and try to predict which (single) letter is in the image.
We will use the training set of single letters we created earlier. The dataset itself is quite simple. Each image is 20 by 20 pixels, and each pixel is either 1 (black) or 0 (white). These pixels are the 400 features that we will use as inputs into the neural network. The outputs will be 26 values between 0 and 1, where higher values indicate a higher likelihood that the associated letter (the first neuron is A, the second is B, and so on) is the letter represented by the input image.
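The shapes involved can be sketched with NumPy. The example below is only an illustration under stated assumptions: the batch of images and the labels are made up, and the `letters` list is assumed to hold the uppercase alphabet in order.

```python
import string

import numpy as np

letters = list(string.ascii_uppercase)  # assumed order: index 0 is "A", index 1 is "B", ...

# A hypothetical batch of three blank 20-by-20 binary letter images
images = np.zeros((3, 20, 20), dtype=np.float32)
images[0, 5:15, 9:11] = 1  # draw a small vertical bar in the first image

# Flatten each image into a single row of 400 features
X = images.reshape((images.shape[0], -1))

# One target column per letter, with a 1 in the column of the true letter
true_letters = ["I", "A", "Z"]  # hypothetical labels
y = np.zeros((len(true_letters), len(letters)), dtype=np.float32)
for row, letter in enumerate(true_letters):
    y[row, letters.index(letter)] = 1

print(X.shape, y.shape)
```

Each row of `X` is one image and each row of `y` has 26 entries, matching the 400 inputs and 26 outputs described above.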
We are going to use scikit-learn's MLPClassifier for our neural network in this chapter.
Note
You will need a recent version of scikit-learn to use MLPClassifier. If the import statement below fails, try again after updating scikit-learn. You can do this using the following Anaconda command: conda update scikit-learn
As for other scikit-learn classifiers, we import...
Now that we have a classifier for predicting individual letters, we move on to the next step in our plan: predicting words. To do this, we predict the letter in each of the segments and put those predictions together to form the predicted word for a given CAPTCHA.
Our function will accept a CAPTCHA and the trained neural network, and it will return the predicted word:
def predict_captcha(captcha_image, neural_network):
    subimages = segment_image(captcha_image)
    # Perform the same transformations we did for our training data
    dataset = np.array([np.resize(subimage, (20, 20)) for subimage in subimages])
    X_test = dataset.reshape((dataset.shape[0], dataset.shape[1] * dataset.shape[2]))
    # Use predict_proba and argmax to get the most likely prediction
    y_pred = neural_network.predict_proba(X_test)
    predictions = np.argmax(y_pred, axis=1)
    # Convert predictions to letters
    predicted_word = str.join("", [letters[prediction] for prediction...
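The step from class probabilities to a word can be tried in isolation. The sketch below uses made-up probabilities in place of real classifier output and assumes that `letters` holds the uppercase alphabet in the same order as the training labels.

```python
import string

import numpy as np

letters = list(string.ascii_uppercase)  # assumed to match the training label order

# Hypothetical probabilities for a four-letter CAPTCHA: one row per segment,
# one column per letter, standing in for the output of predict_proba
y_pred = np.full((4, 26), 0.01)
for row, letter in enumerate("WORD"):
    y_pred[row, letters.index(letter)] = 0.75

# The most likely class index for each segment
predictions = np.argmax(y_pred, axis=1)

# Map each index back to its letter and join the letters into a word
predicted_word = "".join(letters[p] for p in predictions)
print(predicted_word)  # WORD
```

Taking the argmax along `axis=1` picks one letter per segment, so the order of the segments determines the order of the letters in the word.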
In this chapter, we worked with images in order to use simple pixel values to predict the letter being portrayed in a CAPTCHA. Our CAPTCHAs were a bit simplified; we only used complete four-letter English words. In practice, the problem is much harder, as it should be! With some improvements, it would be possible to solve much harder CAPTCHAs with neural networks and a methodology similar to what we discussed. The scikit-image library contains lots of useful functions for extracting shapes from images, functions for improving contrast, and other image tools that will help.
We took our larger problem of predicting words and created a smaller, simpler problem of predicting letters. From here, we were able to create a feed-forward neural network to accurately predict which letter was in the image. At this stage, our results were very good, with 97 percent accuracy.
Neural networks are simply connected sets of neurons, which are basic computation devices consisting of a single function...