Learning Data Mining with Python - Second Edition


Product type: Book
Published: Apr 2017
Publisher: Packt
ISBN-13: 9781787126787
Pages: 358
Edition: 2nd

Table of Contents (20 chapters)

Title Page
Credits
About the Author
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface
1. Getting Started with Data Mining
2. Classifying with scikit-learn Estimators
3. Predicting Sports Winners with Decision Trees
4. Recommending Movies Using Affinity Analysis
5. Features and scikit-learn Transformers
6. Social Media Insight using Naive Bayes
7. Follow Recommendations Using Graph Mining
8. Beating CAPTCHAs with Neural Networks
9. Authorship Attribution
10. Clustering News Articles
11. Object Detection in Images using Deep Neural Networks
12. Working with Big Data
13. Next Steps...

Chapter 13. Next Steps...

Over the course of this book, there were many avenues not taken, options not presented, and subjects not fully explored. In this appendix, I've created a collection of next steps for those wishing to undertake extra learning and progress their data mining with Python.

This appendix is for learning more about data mining. Also included are some challenges to extend the work performed. Some of these will be small improvements; some will be quite a bit more work. I've made a note of the tasks that are noticeably more difficult and involved than the others.

Getting Started with Data Mining


The following are a few avenues that readers of this chapter can explore:

Scikit-learn tutorials

URL: http://scikit-learn.org/stable/tutorial/index.html

Included in the scikit-learn documentation is a series of tutorials on data mining. The tutorials range from basic introductions to toy datasets, all the way through to comprehensive tutorials on techniques used in recent research. The tutorials here will take quite a while to get through—they are very comprehensive—but are well worth the effort to learn.

There are also a large number of algorithms that have been implemented for compatibility with scikit-learn. These algorithms are not always included in scikit-learn itself for a number of reasons, but a list of many of them is maintained at https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets.

Extending the Jupyter Notebook

URL: http://ipython.org/ipython-doc/1/interactive/public_server.html

The Jupyter Notebook is a powerful tool...

Classifying with scikit-learn Estimators


A naïve implementation of the nearest neighbor algorithm is quite slow—it checks all pairs of points to find those that are close together. Better implementations exist, with some implemented in scikit-learn.

Scalability with the nearest neighbor

URL: https://github.com/jnothman/scikit-learn/tree/pr2532

For instance, a kd-tree can be created that speeds up the algorithm (and this is already included in scikit-learn).
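As a small sketch (not the book's own code, and using a synthetic dataset in place of a real one), you can ask scikit-learn's nearest neighbor classifier to build a kd-tree index explicitly via the algorithm parameter:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# A synthetic dataset stands in for a real one here
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# algorithm='kd_tree' builds a kd-tree index instead of running
# the brute-force all-pairs search
clf = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
clf.fit(X, y)
accuracy = clf.score(X, y)
```

The default, algorithm='auto', lets scikit-learn choose an index structure for you, so forcing 'kd_tree' is mostly useful for benchmarking the different options on your data.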

Another way to speed up this search is to use Locality-Sensitive Hashing (LSH). This is a proposed improvement for scikit-learn and hadn't made it into the package at the time of writing. The preceding link gives a development branch of scikit-learn that will allow you to test out LSH on a dataset. Read through the documentation attached to this branch for details on doing this.
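The development branch may be hard to build today, so here is a minimal sketch of the core idea behind one common LSH scheme (random-hyperplane hashing) in plain NumPy; this is illustrative only, not the branch's implementation:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 10)  # 100 points in 10 dimensions

# Each random hyperplane contributes one bit to a point's hash:
# True if the point lies on the positive side, False otherwise
planes = rng.randn(8, 10)

def lsh_hash(point):
    return tuple((planes @ point) > 0)

# Points with identical hashes land in the same bucket; nearby points
# tend to fall on the same side of most hyperplanes
buckets = {}
for i, point in enumerate(X):
    buckets.setdefault(lsh_hash(point), []).append(i)

# Candidate neighbors of X[0] are just the other points in its bucket,
# so only a small fraction of pairs ever needs an exact distance check
candidates = buckets[lsh_hash(X[0])]
```

The trade-off is that LSH is approximate: a true neighbor can land in a different bucket, which is why real implementations use several hash tables.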

To install it, clone the repository and follow the instructions to install the bleeding-edge code available at http://scikit-learn.org...

Predicting Sports Winners with Decision Trees


URL: http://pandas.pydata.org/pandas-docs/stable/tutorials.html

The pandas library is a great package—anything you normally write to do data loading is probably already implemented in pandas. You can learn more about it from their tutorial.
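As a small illustration of the point (the column names and values here are made up for the example), loading and aggregating a dataset takes only a few lines in pandas:

```python
import pandas as pd
from io import StringIO

# StringIO stands in for a real CSV file on disk
csv_data = StringIO("team,points\nORL,102\nORL,95\nMIA,110\nMIA,98\n")
scores = pd.read_csv(csv_data)

# One group-and-aggregate call replaces a hand-written loop
mean_points = scores.groupby("team")["points"].mean()
```

In practice you would pass a filename to read_csv; the rest of the code is unchanged.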

There is also a great blog post written by Chris Moffitt that overviews common tasks people do in Excel and how to do them in pandas: http://pbpython.com/excel-pandas-comp.html

You can also handle large datasets with pandas; see the answer, from user Jeff, to this StackOverflow question for an extensive overview of the process: http://stackoverflow.com/a/14268804/307363.
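One technique covered in that answer is reading a file in chunks, so that only part of the dataset sits in memory at any time. A minimal sketch, with a tiny in-memory file standing in for a large CSV:

```python
import pandas as pd
from io import StringIO

# A stand-in for a CSV too large to load in one go
big_csv = StringIO("value\n" + "\n".join(str(i) for i in range(10)))

# chunksize makes read_csv return an iterator of DataFrames,
# each holding at most that many rows
total = 0
for chunk in pd.read_csv(big_csv, chunksize=4):
    total += chunk["value"].sum()
```

Any aggregation that can be updated incrementally (sums, counts, running means) works well with this pattern.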

Another great tutorial on pandas is written by Brian Connelly: http://bconnelly.net/2013/10/summarizing-data-in-python-with-pandas/.

More complex features

URL: http://www.basketball-reference.com/teams/ORL/2014_roster_status.html

Larger exercise!

Sports teams change regularly from game to game. An easy win for a team can turn into a difficult game...

Recommending Movies Using Affinity Analysis


There are many recommendation-based datasets that are worth investigating, each with its own issues.

New datasets

URL: http://www2.informatik.uni-freiburg.de/~cziegler/BX/

Larger exercise!

For example, the Book-Crossing dataset contains more than 278,000 users and over a million ratings. Some of these ratings are explicit (the user did give a rating), while others are more implicit. The weighting given to these implicit ratings probably shouldn't be as high as for explicit ratings. The music website www.last.fm has released a great dataset for music recommendation: http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/.

There is also a joke recommendation dataset! See here: http://eigentaste.berkeley.edu/dataset/.

The Eclat algorithm

URL: http://www.borgelt.net/eclat.html

The Apriori algorithm implemented here is easily the most famous of the association...

Extracting Features with Transformers


The following topics are also relevant for a deeper understanding of extracting features with transformers:

Adding noise

We covered removing noise to improve features; however, improved performance can be obtained for some datasets by adding noise. The reason for this is simple: it helps stop overfitting by forcing the classifier to generalize its rules a little (although too much noise will make the model too general). Try implementing a Transformer that can add a given amount of noise to a dataset. Test it out on some of the datasets from the UCI Machine Learning Repository and see if it improves test-set performance.
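One way to start this exercise is shown below. The class name and the Gaussian noise model are my own choices for the sketch; it simply follows scikit-learn's fit/transform convention:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class NoiseAdder(BaseEstimator, TransformerMixin):
    """Adds Gaussian noise with the given standard deviation to every feature."""

    def __init__(self, scale=0.1, random_state=None):
        self.scale = scale
        self.random_state = random_state

    def fit(self, X, y=None):
        # Nothing to learn from the data
        return self

    def transform(self, X):
        rng = np.random.RandomState(self.random_state)
        X = np.asarray(X, dtype=float)
        return X + rng.normal(0.0, self.scale, size=X.shape)

noisy = NoiseAdder(scale=0.1, random_state=42).fit_transform(np.zeros((5, 3)))
```

Because it implements fit and transform, the class drops straight into a scikit-learn Pipeline, so you can grid search over the noise scale like any other parameter.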

Vowpal Wabbit

URL: http://hunch.net/~vw/

Vowpal Wabbit is a great project, providing very fast feature extraction for text-based problems. It comes with a Python wrapper, allowing you to call it from within Python code. Test it out on large datasets.

word2vec

URL: https://radimrehurek.com/gensim/models/word2vec.html

Word embeddings are receiving a...

Social Media Insight Using Naive Bayes


Consider the following points after finishing Social Media Insight Using Naive Bayes.

Spam detection

URL: http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

Using the concepts here, you can create a spam detection method that is able to view a social media post and determine whether it is spam or not. Try this out by first creating a dataset of spam/not-spam posts, implementing the text mining algorithms, and then evaluating them.

One important consideration with spam detection is the false-positive/false-negative ratio. Many people would prefer to have a couple of spam messages slip through rather than miss a legitimate message because the filter was too aggressive in stopping spam. In order to tune your method for this trade-off, you can use a grid search with the f1-score as the evaluation criterion. See the preceding link for information on how to do this.
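A sketch of that tuning loop follows, with a synthetic dataset standing in for real social media posts; the choice of BernoulliNB and the alpha grid are my assumptions for the example, not necessarily what your own pipeline will use:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB

# A toy binary dataset stands in for spam/not-spam feature vectors
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# scoring='f1' makes the search optimize the f1-score rather than accuracy,
# balancing precision (few false positives) against recall
grid = GridSearchCV(BernoulliNB(), {"alpha": [0.1, 1.0, 10.0]},
                    scoring="f1", cv=3)
grid.fit(X, y)
best_alpha = grid.best_params_["alpha"]
```

If false positives are the bigger worry, scoring="precision" (or a custom fbeta score with beta < 1) shifts the search further in that direction.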

Natural language processing and part-of-speech tagging

URL...

Discovering Accounts to Follow Using Graph Mining


Give the following a read when you are done with the chapter.

More complex algorithms

URL: https://www.cs.cornell.edu/home/kleinber/link-pred.pdf

Larger exercise!

There has been extensive research on predicting links in graphs, including in social networks. For instance, David Liben-Nowell and Jon Kleinberg published a paper on this topic, linked previously, that would serve as a great starting point for more complex algorithms.

NetworkX

URL: https://networkx.github.io/

If you are going to be using graphs and networks more, going in-depth into the NetworkX package is well worth your time—the visualization options are great and the algorithms are well implemented. Another library called SNAP is also available with Python bindings, at http://snap.stanford.edu/snappy/index.html.
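A tiny sketch of NetworkX in action, using its built-in Jaccard coefficient for link prediction (the graph and node names are made up for the example):

```python
import networkx as nx

# A toy friendship graph
G = nx.Graph()
G.add_edges_from([("alice", "bob"), ("alice", "carol"),
                  ("bob", "dave"), ("carol", "dave")])

# Score the non-edge (alice, dave): the Jaccard coefficient is the
# fraction of their combined neighbors that they share
((u, v, score),) = nx.jaccard_coefficient(G, [("alice", "dave")])
```

Here alice and dave share both of their neighbors (bob and carol), so the score is 1.0, the strongest possible hint that this link may form. NetworkX ships several such link-prediction scores alongside its drawing and graph-algorithm functions.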

Beating CAPTCHAs with Neural Networks


You may find the following topics interesting as well:

Better (worse?) CAPTCHAs

URL: http://scikit-image.org/docs/dev/auto_examples/applications/plot_geometric.html

Larger exercise!

The CAPTCHAs we beat in this example were not as complex as those normally used today. You can create more complex variants using a number of techniques, such as the geometric transforms shown at the preceding link.
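As one example of such a technique, you can distort each glyph geometrically before rendering it. This rough sketch uses scipy.ndimage as a stand-in for the scikit-image transforms at the link, with a crude synthetic "letter" in place of a rendered character:

```python
import numpy as np
from scipy import ndimage

# A crude 'letter': a white bar on a black background
image = np.zeros((20, 20))
image[5:15, 9:11] = 1.0

# Rotating (or shearing) the glyph makes segmentation and recognition
# harder for a classifier trained only on upright letters
rotated = ndimage.rotate(image, angle=30, reshape=False)
```

Applying a random angle per character, plus noise and overlapping glyphs, quickly produces CAPTCHAs much closer to the real-world variety.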

Deeper networks

These techniques will probably fool our current implementation, so improvements will need to be made to make the method better. Try some of the deeper networks we used. Larger networks need more data, though, so you will probably need to generate more than the few thousand samples we did here in order to get good performance. Generating these datasets...

Authorship Attribution


When it comes to authorship attribution, give the following topics a read.

Increasing the sample size

The Enron application we used ended up using just a portion of the overall dataset. There is lots more data available in this dataset. Increasing the number of authors will likely lead to a drop in accuracy, but it is possible to boost the accuracy further than was achieved here, using similar methods. Using a Grid Search, try different values for n-grams and different parameters for support vector machines, in order to get better performance on a larger number of authors.
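A hedged sketch of that search is below; the toy documents and the particular parameter grid are illustrative only, not the chapter's setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# A toy corpus: two 'authors' with different writing habits
docs = ["the cat sat on the mat", "the cat ran to the mat",
        "stocks rallied in early trading", "markets fell in late trading"] * 3
authors = [0, 0, 1, 1] * 3

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svm", SVC())])

# Step-prefixed keys let the grid search tune both stages at once
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs unigrams+bigrams
    "svm__C": [0.1, 1, 10],
}
grid = GridSearchCV(pipeline, param_grid, cv=3)
grid.fit(docs, authors)
best = grid.best_params_
```

On real email data you would swap in the Enron documents and likely widen both the n-gram range and the SVM parameter grid.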

Blogs dataset

The blog dataset used provides authorship-based classes (each blogger ID is a separate author). This dataset can be tested using this kind of method as well. In addition, there are other classes (gender, age, industry, and star sign) that can be tested: are authorship-based methods good for these classification tasks?

Clustering News Articles


It won't hurt to read a little on the following topics:

Clustering Evaluation

The evaluation of clustering algorithms is a difficult problem—on the one hand, we can sort of tell what good clusters look like; on the other hand, if we really know that, we should label some instances and use a supervised classifier! Much has been written on this topic. One slideshow on the topic that is a good introduction to the challenges follows: http://www.cs.kent.edu/~jin/DM08/ClusterValidation.pdf.

In addition, a very comprehensive (although now a little dated) paper on this topic is here: http://web.itu.edu.tr/sgunduz/courses/verimaden/paper/validity_survey.pdf.

The scikit-learn package does implement a number of the metrics described in those links, with an overview here: http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation.
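Two of those scikit-learn metrics in action on toy data: the silhouette score needs no ground truth, while the adjusted Rand index compares predicted clusters against known labels (the blob centers below are chosen arbitrarily for the example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Three well-separated synthetic clusters
X, true_labels = make_blobs(n_samples=150,
                            centers=[[0, 0], [10, 10], [-10, 10]],
                            random_state=42)
predicted = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

sil = silhouette_score(X, predicted)               # internal: no labels needed
ari = adjusted_rand_score(true_labels, predicted)  # external: needs labels
```

In a real clustering task you usually have no true_labels, which is exactly why internal measures such as the silhouette score matter: they are what you can feed to a parameter search.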

Using some of these, you can start evaluating which parameters need to be used for better clusterings. Using a Grid Search...

Classifying Objects in Images Using Deep Learning


The following topics are also important for deeper study into classifying objects in images.

Mahotas

URL: http://luispedro.org/software/mahotas/

Another package for image processing is Mahotas, which includes further and more complex image processing techniques. These can help achieve better accuracy, although they may come at a high computational cost. However, many image processing tasks are good candidates for parallelization. More techniques for image classification can be found in the research literature, with this survey paper as a good start: http://ijarcce.com/upload/january/22-A%20Survey%20on%20Image%20Classification.pdf.

Other image datasets are available at http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html.

There are many datasets of images available from a number of academic and industry-based sources. The linked website lists a bunch of datasets and some of the best algorithms to use on them. Implementing...

Working with Big Data


The following resources on big data will be helpful:

Courses on Hadoop

Both Yahoo and Google have great tutorials on Hadoop, which go from beginner to quite advanced levels. They don't specifically address using Python, but learning the Hadoop concepts and then applying them in Pydoop or a similar library can yield great results.

Yahoo's tutorial: https://developer.yahoo.com/hadoop/tutorial/

Google's tutorial: https://cloud.google.com/hadoop/what-is-hadoop

Pydoop

URL: http://crs4.github.io/pydoop/tutorial/index.html

Pydoop is a Python library for running Hadoop jobs. Pydoop also works with HDFS, the Hadoop Distributed File System, although you can get that functionality in mrjob as well. Pydoop will give you a bit more control over running some jobs.

Recommendation engine

Building a large recommendation engine is a good test of your Big data skills. A great blog post by Mark Litwintschik covers an engine using Apache Spark, a big data technology: http://tech.marksblogg.com/recommendation-engine...

More resources


The following would serve as a really good resource for additional information:

Kaggle competitions

URL: www.kaggle.com/

Kaggle runs data mining competitions regularly, often with monetary prizes. Testing your skills on Kaggle competitions is a fast and great way to learn to work with real-world data mining problems. The forums are nice sharing environments; often, you will see code released for a top-10 entry during a competition!

Coursera

URL: www.coursera.org

Coursera contains many courses on data mining and data science. Many of the courses are specialized, covering topics such as big data and image processing. A great general one to start with is Andrew Ng's famous course: https://www.coursera.org/learn/machine-learning/. It is a bit more advanced than this book and would be a great next step for interested readers. For neural networks, check out this course: https://www.coursera.org/course/neuralnets. If you complete all of these, try out the course on probabilistic graphical models at https...
