Learning Data Mining with Python, - Second Edition


Product type Book
Published in Apr 2017
Publisher Packt
ISBN-13 9781787126787
Pages 358 pages
Edition 2nd Edition


Chapter 8. Beating CAPTCHAs with Neural Networks

Images pose interesting and difficult challenges for data miners. Until recently, little progress had been made in analyzing images to extract information. Recently, however, significant advances have been made in a very short time frame, as seen in the progress on self-driving cars. The latest research is providing algorithms that can understand images for commercial surveillance, self-driving vehicles, and person identification.

There is a lot of raw data in an image, and the standard method for encoding images, pixels, isn't that informative by itself. Images and photos can be blurry, too close to the targets, too dark, too light, scaled, cropped, skewed, or affected by any of a variety of other problems that cause havoc for a computer system trying to extract useful information. Neural networks can combine these lower-level features into higher-level patterns that generalize better and cope with these issues.

In this...

Artificial neural networks


Neural networks are a class of algorithm that was originally designed based on the way that human brains work. However, modern advances are generally based on mathematics rather than biological insights. A neural network is a collection of neurons that are connected together. Each neuron is a simple function of its inputs, which are combined using some function to generate an output:

The function that defines a neuron's processing can be any standard function, such as a linear combination of the inputs, and is called the activation function. For the commonly used learning algorithms to work, we need the activation function to be differentiable and smooth. A frequently used activation function is the logistic function, which is defined by the following equation (k is often simply 1, x is the input into the neuron, and L is normally 1, that is, the maximum value of the function):
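The equation image did not survive on this page. The standard form of the logistic function is f(x) = L / (1 + e^(-k(x - x0))). The following is a minimal sketch of it, assuming a midpoint x0 of 0 (x0 is not mentioned in the text above; the name and default are assumptions for illustration):

```python
import math

def logistic(x, L=1.0, k=1.0, x0=0.0):
    """Standard logistic function: L / (1 + exp(-k * (x - x0))).

    L is the maximum value and k the steepness, as described in the text.
    x0 (the midpoint) is an assumption; it is not part of the chapter text.
    """
    return L / (1.0 + math.exp(-k * (x - x0)))

# Evaluate at the ends and the middle of the -6 to +6 range
for x in (-6, 0, 6):
    print(x, logistic(x))
```

With L = 1 and k = 1, the function returns exactly 0.5 at x = 0 and approaches 0 and 1 toward the two ends of the range.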

The value of this function, from -6 to +6, is shown below. The red lines indicate that the value...

Creating the dataset


In this chapter, to spice things up a little, let us take on the role of the bad guy. We want to create a program that can beat CAPTCHAs, allowing our comment spam program to advertise on someone's website. It should be noted that our CAPTCHAs will be a little easier than those used on the web today, and that spamming isn't a very nice thing to do.

Note

We play the bad guy today, but please don't use this against real-world sites. One reason to "play the bad guy" is to help improve the security of our own website by looking for issues with it.

Our experiment will simplify a CAPTCHA to a single English word of exactly four letters, as shown in the following image:

Our goal will be to create a program that can recover the word from images like this. To do this, we will use four steps:

  1. Break the image into individual letters.
  2. Classify each individual letter.
  3. Recombine the letters to form a word.
  4. Rank words with a dictionary to try to fix errors.
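Step 4 can be sketched as follows. This is only an illustration under stated assumptions: the text above does not say how words are ranked, so the sketch uses a simple count of mismatched letter positions, and the function names and the tiny word list are hypothetical.

```python
def letter_mismatches(prediction, word):
    """Count the positions at which two equal-length words differ."""
    return sum(a != b for a, b in zip(prediction, word))

def closest_word(prediction, valid_words):
    """Return the dictionary word with the fewest mismatched letters.

    Only words of the same length as the prediction are considered.
    """
    candidates = [w for w in valid_words if len(w) == len(prediction)]
    return min(candidates, key=lambda w: letter_mismatches(prediction, w))

# Hypothetical dictionary of four-letter words
valid_words = ["GENE", "GONE", "BEAT", "WORD"]

# A prediction with one misclassified letter is mapped to the nearest word
print(closest_word("GENF", valid_words))
```

A position-wise count suits this setting because every CAPTCHA word has the same length and errors come from misclassified letters rather than inserted or dropped ones.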

Note

Our CAPTCHA-busting algorithm...

Training and classifying


We are now going to build a neural network that will take an image as input and try to predict which (single) letter is in the image.

We will use the training set of single letters we created earlier. The dataset itself is quite simple. We have a 20-by-20-pixel image, each pixel 1 (black) or 0 (white). These represent the 400 features that we will use as inputs into the neural network. The outputs will be 26 values between 0 and 1, where higher values indicate a higher likelihood that the associated letter (the first neuron is A, the second is B, and so on) is the letter represented by the input image.
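To make the input and output shapes concrete, here is a small sketch using NumPy. The image contents and the output values are made up for illustration; only the 20-by-20 input size, the 400 features, and the 26 outputs come from the text.

```python
import numpy as np
from string import ascii_uppercase

# A made-up 20x20 binary image: 1 is black, 0 is white
image = np.zeros((20, 20), dtype=np.float32)
image[5:15, 8:12] = 1

# Flatten the image into a single row of 400 features
features = image.reshape(1, -1)
print(features.shape)

# Made-up network outputs: one value between 0 and 1 per letter
outputs = np.full(26, 0.01)
outputs[2] = 0.9  # pretend the third neuron has the highest value

# The index of the largest output selects the letter (0 is A, 1 is B, ...)
letters = list(ascii_uppercase)
predicted_letter = letters[int(np.argmax(outputs))]
print(predicted_letter)
```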

We are going to use scikit-learn's MLPClassifier for our neural network in this chapter.

Note

You will need a recent version of scikit-learn to use MLPClassifier. If the import statement below fails, try again after updating scikit-learn. You can do this using the following Anaconda command: conda update scikit-learn
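As a rough sketch of how MLPClassifier is typically used (not necessarily the chapter's exact code), the following trains a small network on synthetic stand-in data. The data, the three-class setup, and the parameter values (hidden_layer_sizes, random_state, max_iter) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(14)

# Hypothetical stand-in data: 3 classes, each a noisy copy of a
# random 400-pixel binary prototype (not the chapter's letter dataset)
prototypes = rng.randint(0, 2, size=(3, 400))
y = np.repeat(np.arange(3), 40)
X = prototypes[y].astype(float)
flip = rng.rand(*X.shape) < 0.05  # flip about 5% of the pixels
X[flip] = 1 - X[flip]

# One hidden layer of 100 neurons (an assumed, illustrative size)
clf = MLPClassifier(hidden_layer_sizes=(100,), random_state=14, max_iter=500)
clf.fit(X, y)

# predict_proba returns one row per sample and one column per class
probabilities = clf.predict_proba(X[:5])
print(probabilities.shape)
```

With the real dataset there would be 26 classes, so predict_proba would return 26 columns per sample, one per letter.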

As for other scikit-learn classifiers, we import...

Predicting words


Now that we have a classifier for predicting individual letters, we move on to the next step in our plan: predicting words. To do this, we predict the letter in each segment of the image and put those predictions together to form the predicted word for a given CAPTCHA.

Our function will accept a CAPTCHA and the trained neural network, and it will return the predicted word:

def predict_captcha(captcha_image, neural_network):
    subimages = segment_image(captcha_image)
    # Perform the same transformations we did for our training data
    dataset = np.array([np.resize(subimage, (20, 20)) for subimage in subimages])
    X_test = dataset.reshape((dataset.shape[0], dataset.shape[1] * dataset.shape[2]))
    # Use predict_proba and argmax to get the most likely prediction
    y_pred = neural_network.predict_proba(X_test)
    predictions = np.argmax(y_pred, axis=1)

    # Convert predictions to letters
    predicted_word = str.join("", [letters[prediction] for prediction...

Summary


In this chapter, we worked with images in order to use simple pixel values to predict the letter being portrayed in a CAPTCHA. Our CAPTCHAs were a bit simplified; we only used complete four-letter English words. In practice, the problem is much harder, as it should be! With some improvements, it would be possible to solve much harder CAPTCHAs with neural networks and a methodology similar to what we discussed. The scikit-image library contains lots of useful functions for extracting shapes from images, functions for improving contrast, and other image tools that will help.

We took our larger problem of predicting words and created a smaller and simpler problem of predicting letters. From here, we were able to create a feed-forward neural network to accurately predict which letter was in the image. At this stage, our results were very good, with 97 percent accuracy.

Neural networks are simply connected sets of neurons, which are basic computation devices consisting of a single function...
