
How-To Tutorials - Data


First Principle and a Useful Way to Think

Packt
08 Oct 2015
8 min read
In this article by Timothy Washington, author of the book Clojure for Finance, we will cover the following topics:

- Modeling stock price activity
- Function evaluation
- First-class functions
- Lazy evaluation
- Basic Clojure functions and immutability
- Namespace modifications and creating our first function

Modeling the stock price activity

There are many types of banks. Commercial entities (large stores, parking areas, hotels, and so on) that collect and retain credit card information are either quasi banks, or farm out credit operations to bank-like processing companies. There are the more well-known consumer banks, which accept demand deposits from the public. There is also a range of other banks, such as commercial banks, insurance companies and trusts, credit unions, and, in our case, investment banks.

As promised, this article will slowly build up a set of lagging price indicators that follow a moving stock price time series. In order to do that, I think it's useful to touch on stock markets and to crudely model stock price activity. A stock (or equity) market is a collection of buyers and sellers trading economic assets (usually companies). The stock (or shares) of those companies can be equities listed on an exchange (the New York Stock Exchange, the London Stock Exchange, and others), or may be traded privately. In this exercise, we will do the following:

- Crudely model stock price movement, which will give us a test bed for writing our lagging price indicators
- Introduce some basic features of the Clojure language

Function evaluation

The Clojure website has a cheatsheet (http://clojure.org/cheatsheet) with all of the language's core functions. The first function we'll look at is rand, a function that randomly gives you a number within a given range. So, in your edgar/ project, launch a repl with the lein repl shell command. After a few seconds, you will enter the repl (Read-Eval-Print-Loop). Again, Clojure functions are executed by being placed in the first position of a list, with the function's arguments placed directly afterwards:

- In your repl, evaluate (rand 35), (rand 99), (rand 43.21), or any number you fancy.
- Run it many times to see that you get a different floating point number between 0 and the upper bound you provided.

First-class functions

The next functions we'll look at are repeatedly and fn. repeatedly is a function that takes another function and returns an infinite (or length n, if supplied) lazy sequence of calls to the latter function. This is our first encounter with a function that can take another function. We'll also encounter functions that return other functions. Described as first-class functions, this falls out of lambda calculus and is one of the central features of functional programming. As such, we need to wrap our previous (rand 35) call in another function. fn is one of Clojure's core functions and produces an anonymous, unnamed function. We can now supply this function to repeatedly. In your repl, if you evaluate (take 25 (repeatedly (fn [] (rand 35)))), you should see a long list of floating point numbers with the list's tail elided.

Lazy evaluation

We only took the first 25 elements of the (repeatedly (fn [] (rand 35))) result list, because the list (actually a lazy sequence) is infinite. Lazy evaluation (or laziness) is a common feature in functional programming languages.
Being infinite, Clojure chooses to delay evaluating most of the list until it's needed by some other function that pulls out some values. Laziness benefits us by increasing performance and letting us more easily construct control flow. We can avoid needless calculation, repeated evaluations, and potential error conditions in compound expressions. Let's try to pull out some values with the take function. take itself returns another lazy sequence, of the first n items of a collection. Evaluating (take 25 (repeatedly (fn [] (rand 35)))) will pull out the first 25 repeatedly calls to rand, each of which generates a float between 0 and 35.

Basic Clojure functions and immutability

There are many operations we can perform over our result list (or lazy sequence). One of the main approaches of functional programming is to take a data structure and perform operations on top of it to produce a new data structure, or some atomic result (a string, a number, and so on). This may sound inefficient at first, but most FP languages employ something called immutability to make these operations efficient. Immutable data structures are ones that cannot change once they've been created. This is feasible because most immutable, FP languages use some kind of structural data sharing between an original and a modified version. The idea is that if we evaluate (conj [1 2 3] 4), the resulting [1 2 3 4] vector shares the original vector of [1 2 3]. The only additional resource that's allocated is for any novelty that's been introduced to the data structure (the 4). More detailed explanations of Clojure's persistent vectors are available elsewhere. Here are some of the basic functions to try:

- conj: This conjoins an element to a collection, and the collection decides where. So conjoining an element to a vector (conj [1 2 3] 4) versus conjoining an element to a list (conj '(1 2 3) 4) yields different results. Try it in your repl.
- map: This passes a function over one or many lists, yielding another list. (map inc [1 2 3]) increments each element by 1.
- reduce (or left fold): This passes a function over each element, accumulating one result. (reduce + (take 100 (repeatedly (fn [] (rand 35))))) sums the list.
- filter: This constrains the input by some condition.
- >=: This is a conditional function, which tests whether the first argument is greater than or equal to the second argument. Try (>= 4 9) and (>= 9 1).
- fn: This is a function that creates a function. This unnamed or anonymous function can have any instructions you choose to put in there.

So, if we only want numbers above 12, we can put that assertion in a predicate function. Try entering the following expression into your repl:

    (take 25 (filter (fn [x] (>= x 12)) (repeatedly (fn [] (rand 35)))))

Modifying the namespace and creating our first function

We now have the basis for creating a function. It will return a lazy, infinite sequence of floating point numbers within an upper and lower bound. defn is a Clojure function, which takes an anonymous function and binds a name to it in a given namespace. A Clojure namespace is an organizational tool for mapping human-readable names to things like functions, named data structures, and such. Here, we're going to bind our function to the name generate-prices in our current namespace. You'll notice that our function is starting to span multiple lines. This will be a good time to author the code in your text editor of choice. I'll be using Emacs. Open your text editor and add the following code to the file src/edgar/core.clj. Make sure that (ns edgar.core) is at the top of that file.
After adding the following code, you can either restart the repl or evaluate (load "edgar/core"), which uses the load function to load the Clojure code in src/edgar/core.clj:

    (defn generate-prices [lower-bound upper-bound]
      (filter (fn [x] (>= x lower-bound))
              (repeatedly (fn [] (rand upper-bound)))))

The Read-Eval-Print-Loop

In our repl, we can pull in code from various namespaces with the require function. This applies to the src/edgar/core.clj file we've just edited. That code will be in the edgar.core namespace:

- In your repl, evaluate (require '[edgar.core :as c]). c is just a handy alias we can use instead of the long name.
- You can then generate random prices within an upper and lower bound. Take the first 10 of them like this: (take 10 (c/generate-prices 12 35)).

You should see results akin to the following output. All elements should be within the range of 12 to 35:

    (29.60706184716407 12.507593971664075 19.79939384292759 31.322074615579716 19.737852534147326 25.134649707849572 19.952195022152488 12.94569843904663 23.618693004455086 14.695872710062428)

There's a subtle abstraction in the preceding code that deserves attention. (require '[edgar.core :as c]) introduces the quote symbol. ' is the reader shorthand for the quote function, so the equivalent invocation would be (require (quote [edgar.core :as c])). Quoting a form tells the Clojure reader not to evaluate the subsequent expression (or form). So, evaluating '(a b c) returns a list of three symbols without trying to evaluate a. Even though those symbols haven't yet been assigned, that's okay, because that expression (or form) has not yet been evaluated.

But that begs a larger question: what is the reader? Clojure (and all Lisps) are what's known as homoiconic. This means that Clojure code is also data, and data can be directly output and evaluated as code. The reader is the thing that parses our src/edgar/core.clj file (or (+ 1 1) input from the repl prompt) and produces the data structures that are evaluated. read and eval are the two essential processes by which Clojure code runs. The evaluation result is printed (or output), usually to the standard output device. Then we loop the process back to the read function. So, when the repl reads your src/edgar/core.clj file, it's directly transforming that text representation into data and evaluating it. A few things fall out of that. For example, it becomes simpler for Clojure programs to directly read, transform, and write out other Clojure programs. The implications of that will become clearer when we look at macros. But for now, know that there are ways to modify or delay the evaluation process, in this case by quoting a form.

Summary

In this article, we learned about basic features of the Clojure language and how to model stock price activity. Besides these, we also learned about function evaluation, first-class functions, lazy evaluation, namespace modifications, and creating our first function.


Introduction to Data Analysis and Libraries

Packt
07 Oct 2015
13 min read
In this article by Martin Czygan and Phuong Vothihong, the authors of the book Getting Started with Python Data Analysis, we introduce data analysis and the libraries that support it. Data is raw information that can exist in any form, usable or not. We can easily get data everywhere in our life; for example, the price of gold today is $1,158 per ounce. On its own, this has no meaning beyond describing the gold price, which also shows that data is useful only in context. When we relate and connect data, information appears and allows us to expand our knowledge beyond the range of our senses. When we possess gold price data gathered over time, one piece of information we might derive is that the price has continuously risen from $1,152 to $1,158 over three days. This information is used by someone who tracks gold prices. Knowledge helps people to create value in their lives and work. It is based on information that is organized, synthesized, or summarized to enhance comprehension, awareness, or understanding, and it represents a state or potential for action and decisions. An observation such as "when the gold price increases continuously for three days, it tends to decrease slightly on the next day" is useful knowledge. The steps from data to knowledge form what we call the data analysis process, and we will introduce it in the next section.

In this article, we will cover the following topics:

- Data analysis and its process
- An overview of libraries in data analysis using different programming languages
- Common Python data analysis libraries

Data analysis and process

Data is getting bigger and more diversified every day. Therefore, analyzing and processing data to advance human knowledge or to create value are big challenges. To tackle these challenges, you will need domain knowledge and a variety of skills, drawing from areas such as computer science, artificial intelligence (AI) and machine learning (ML), statistics and mathematics, and the knowledge domain itself. Let us go through data analysis and its domain knowledge:

- Computer science: We need this knowledge to provide abstractions for efficient data processing. Basic Python programming experience is required. We will introduce the Python libraries used in data analysis.
- Artificial intelligence and machine learning: If computer science knowledge helps us to program data analysis tools, artificial intelligence and machine learning help us to model the data and learn from it in order to build smart products.
- Statistics and mathematics: We cannot extract useful information from raw data if we do not use statistical techniques or mathematical functions.
- Knowledge domain: Besides technology and general techniques, it is important to have insight into the specific domain. What do the data fields mean? What data do we need to collect?

Based on this expertise, we explore and analyze raw data by applying the above techniques, step by step. Data analysis is a process composed of the following steps:

- Data requirements: We have to define what kind of data will be collected based on the requirements or problem analysis. For example, if we want to detect a user's behavior while reading news on the internet, we should be aware of visited article links, date and time, article categories, and the user's time spent on different pages.
- Data collection: Data can be collected from a variety of sources: mobile, personal computer, camera, or recording device.
It may also be obtained in different ways: communication, events, and interactions between person and person, person and device, or device and device. Data appears whenever and wherever in the world. The problem is: how can we find and gather it to solve our problem? This is the mission of this step.
- Data processing: Data that is initially obtained must be processed or organized for analysis. This process is performance-sensitive: how fast can we create, insert, update, or query data? When building a real product that has to process big data, we should consider this step carefully. What kind of database should we use to store the data? What kind of data structure is suitable for our purposes, such as analysis, statistics, or visualization?
- Data cleaning: After being processed and organized, the data may still contain duplicates or errors. Therefore, we need a cleaning step to reduce those situations and increase the quality of the results in the following steps. Common tasks include record matching, deduplication, or column segmentation. Depending on the type of data, we can apply several types of data cleaning. For example, a user's history on a news website might contain a lot of duplicate rows, because the user might have refreshed certain pages many times. For our specific issue, these rows might not carry any meaning when we explore the user's behavior, so we should remove them before saving the data to our database. Another situation we may encounter is click fraud on news, where someone just wants to improve their website ranking or sabotage the website. In this case, the data will not help us to explore a user's behavior. We can use thresholds to check whether a visit-page event comes from a real person or from malicious software.
- Exploratory data analysis: Now, we can start to analyze data through a variety of techniques referred to as exploratory data analysis. We may detect additional problems in data cleaning or discover requests for further data. Therefore, these steps may be iterative and repeated throughout the whole data analysis process. Data visualization techniques are also used to examine the data in graphs or charts. Visualization often facilitates the understanding of data sets, especially if they are large or high-dimensional.
- Modelling and algorithms: A lot of mathematical formulas and algorithms may be applied to detect or predict useful knowledge from the raw data. For example, we can use similarity measures to cluster users who have exhibited similar news-reading behavior and recommend articles of interest to them next time. Alternatively, we can detect users' gender based on their news-reading behavior by applying classification models such as Support Vector Machines (SVM) or linear regression. Depending on the problem, we may use different algorithms to get an acceptable result. It can take a lot of time to evaluate the accuracy of the algorithms and to choose the best one to implement for a certain product.
- Data product: The goal of this step is to build data products that receive data input and generate output according to the problem requirements. We will apply computer science knowledge to implement our selected algorithms as well as manage the data storage.
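The following is a minimal, hedged sketch of how a few of these steps might look in Python with pandas; the file name and column names (news_visits.csv, time_on_page, category) are made up for illustration and are not part of any dataset referenced in this article:

    import pandas as pd

    # Hypothetical news-visit log; file and column names are illustrative only.
    visits = pd.read_csv("news_visits.csv")                     # data collection / processing
    visits = visits.drop_duplicates()                           # data cleaning: drop refreshed pages
    visits = visits[visits["time_on_page"] > 1]                 # crude threshold against click fraud
    print(visits.describe())                                    # exploratory data analysis
    print(visits.groupby("category")["time_on_page"].mean())    # a first, simple model of behaviour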
Overview of libraries in data analysis

There are numerous data analysis libraries that help us to process and analyze data. They use different programming languages, and each has its own advantages and disadvantages for solving various data analysis problems. Now, we introduce some common libraries that may be useful for you. They should give you an overview of the libraries in the field; however, the rest of this article focuses on Python-based libraries.

Some of the libraries that use the Java language for data analysis are as follows:

- Weka: This is the library that I became familiar with the first time I learned about data analysis. It has a graphical user interface that allows you to run experiments on a small dataset. This is great if you want to get a feel for what is possible in the data processing space. However, if you build a complex product, I think it is not the best choice because of its performance: sketchy API design, non-optimal algorithms, and little documentation (http://www.cs.waikato.ac.nz/ml/weka/).
- Mallet: This is another Java library that is used for statistical natural language processing, document classification, clustering, topic modelling, information extraction, and other machine learning applications on text. There is an add-on package to Mallet, called GRMM, that contains support for inference in general graphical models and training of Conditional Random Fields (CRF) with arbitrary graphical structure. In my experience, the library's performance as well as its algorithms are better than Weka's. However, its only focus is on text processing problems. The reference page is at http://mallet.cs.umass.edu/.
- Mahout: This is Apache's machine learning framework built on top of Hadoop; its goal is to build a scalable machine learning library. It looks promising, but comes with all the baggage and overhead of Hadoop. The homepage is at http://mahout.apache.org/.
- Spark: This is a relatively new Apache project, supposedly up to a hundred times faster than Hadoop. It is also a scalable library that consists of common machine learning algorithms and utilities. Development can be done in Python as well as in any JVM language. The reference page is at https://spark.apache.org/docs/1.5.0/mllib-guide.html.

Here are a few libraries that are implemented in C++:

- Vowpal Wabbit: This library is a fast out-of-core learning system sponsored by Microsoft Research and (previously) Yahoo! Research. It has been used to learn a tera-feature (10^12) dataset on 1000 nodes in one hour. More information can be found in the publication at http://arxiv.org/abs/1110.4198.
- MultiBoost: This package is a multiclass, multilabel, and multitask classification boosting software implemented in C++. If you use this software, you should refer to the paper published in 2012 in the Journal of Machine Learning Research, MultiBoost: A Multi-purpose Boosting Package, by D. Benbouzid, R. Busa-Fekete, N. Casagrande, F.-D. Collin, and B. Kégl.
- MLpack: This is also a C++ machine learning library, developed by the Fundamental Algorithmic and Statistical Tools Laboratory (FASTLab) at Georgia Tech. It focuses on scalability, speed, and ease of use, and was presented at the BigLearning workshop of NIPS 2011. Its homepage is at http://www.mlpack.org/about.html.
- Caffe: The last C++ library we want to mention is Caffe. This is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and community contributors. You can find more information about it at http://caffe.berkeleyvision.org/.

Other libraries for data processing and analysis are as follows:

- Statsmodels: This is a great Python library for statistical modelling and is mainly used for predictive and exploratory analysis.
- Modular toolkit for data processing (MDP): This is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures (http://mdp-toolkit.sourceforge.net/index.html).
- Orange: This is an open source data visualization and analysis tool for novices and experts. It is packed with features for data analysis and has add-ons for bioinformatics and text mining. It contains an implementation of self-organizing maps, which sets it apart from the other projects (http://orange.biolab.si/).
- Mirador: This is a tool for the visual exploration of complex datasets, supporting Mac and Windows. It enables users to discover correlation patterns and derive new hypotheses from data.
- RapidMiner: This is another GUI-based tool for data mining, machine learning, and predictive analysis (https://rapidminer.com/).
- Theano: This bridges the gap between Python and lower-level languages. Theano gives very significant performance gains, particularly for large matrix operations, and is therefore a good choice for deep learning models. However, it is not easy to debug because of the additional compilation layer.
- Natural Language Toolkit (NLTK): This is written in Python with very unique and salient features.

We could not list all libraries for data analysis here, but the ones above are more than enough to keep you busy learning and building data analysis applications.

Python libraries in data analysis

Python is a multi-platform, general-purpose programming language that can run on Windows, Linux/Unix, and Mac OS X, and has been ported to the Java and .NET virtual machines as well. It has a powerful standard library. In addition, it has many libraries for data analysis: Pylearn2, Hebel, Pybrain, Pattern, MontePython, and MILK. We will cover some common Python data analysis libraries such as NumPy, Pandas, Matplotlib, PyMongo, and scikit-learn. Now, to help you get started, I will briefly present an overview of each library for those who are less familiar with the scientific Python stack.

NumPy

One of the fundamental packages used for scientific computing with Python is NumPy. Among other things, it contains the following:

- A powerful N-dimensional array object
- Sophisticated (broadcasting) functions for performing array computations
- Tools for integrating C/C++ and Fortran code
- Useful linear algebra operations, Fourier transforms, and random number capabilities

Besides this, it can also be used as an efficient multidimensional container of generic data. Arbitrary data types can be defined and integrated with a wide variety of databases.
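As a quick, hedged taste of the array object and its vectorized operations, here is a small sketch; the numbers are made up, loosely echoing the gold-price example from the beginning of this article:

    import numpy as np

    # Three days of (illustrative) gold prices held in an N-dimensional array
    prices = np.array([1152.0, 1154.5, 1158.0])

    # Vectorized, element-wise computation of day-over-day returns
    returns = np.diff(prices) / prices[:-1]

    # Broadcasting: scale every element without writing an explicit loop
    prices_in_cents = prices * 100
    print(returns, prices_in_cents)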
Pandas

Pandas is a Python package that supports rich data structures and functions for analyzing data, and is developed by the PyData Development Team. It is focused on the improvement of Python's data libraries. Pandas consists of the following things:

- A set of labelled array data structures, the primary of which are Series, DataFrame, and Panel
- Index objects enabling both simple axis indexing and multilevel/hierarchical axis indexing
- An integrated group-by engine for aggregating and transforming datasets
- Date range generation and custom date offsets
- Input/output tools that load and save data from flat files or the PyTables/HDF5 format
- Memory-efficient versions of the standard data structures
- Moving window statistics and static and moving window linear/panel regression

Because of these features, Pandas is an ideal tool for systems that need complex data structures or high-performance time series functions, such as financial data analysis applications.

Matplotlib

Matplotlib is the single most used Python package for 2D graphics. It provides both a very quick way to visualize data from Python and publication-quality figures in many formats: line plots, contour plots, scatter plots, or Basemap plots. It comes with a set of default settings but allows customization of all kinds of properties, and in most cases the defaults are good enough to create a chart easily.

PyMongo

MongoDB is a type of NoSQL database. It is highly scalable, robust, and perfect to work with JavaScript-based web applications, because we can store data as JSON documents and use flexible schemas. PyMongo is a Python distribution containing tools for working with MongoDB. Many tools have also been written for working with PyMongo to add more features, such as MongoKit, Humongolus, MongoAlchemy, and Ming.

scikit-learn

scikit-learn is an open source machine learning library using the Python programming language. It supports various machine learning models, such as classification, regression, and clustering algorithms, interoperating with the Python numerical and scientific libraries NumPy and SciPy. The latest scikit-learn version is 0.16.1, published in April 2015.

Summary

In this article, there were three main points that we presented. Firstly, we figured out the relationship between raw data, information, and knowledge. Secondly, we discussed an overview of the data analysis process and its steps. Finally, we introduced a few common libraries that are useful for practical data analysis applications; among those, we will focus on Python libraries for data analysis.


Fingerprint detection using OpenCV 3

Packt
07 Oct 2015
11 min read
In this article by Joseph Howse, Quan Hua, Steven Puttemans, and Utkarsh Sinha, the authors of OpenCV Blueprints, we delve into the topic of fingerprint detection using OpenCV.

Fingerprint identification, how is it done?

We have already discussed the use of the first biometric, which is the face of the person trying to log in to the system. However, since we mentioned that using a single biometric can be quite risky, we suggest adding secondary biometric checks to the system, like the fingerprint of a person. There are many off-the-shelf fingerprint scanners that are quite cheap and return the scanned image. However, you will still have to write your own registration software for these scanners, which can be done using OpenCV. Examples of such fingerprint images can be found below.

Examples of a single individual's thumb fingerprint in different scanning positions

This dataset can be downloaded from the FVC2002 competition website released by the University of Bologna. The website (http://bias.csr.unibo.it/fvc2002/databases.asp) contains four databases of fingerprints available for public download, with the following structure:

- Four fingerprint capturing devices, DB1 - DB4
- For each device, the prints of 10 individuals are available
- For each person, 8 different positions of prints were recorded

We will use this publicly available dataset to build our system upon. We will focus on the first capturing device, using up to four fingerprints of each individual for training the system and making an average descriptor of the fingerprint. Then we will use the other four fingerprints to evaluate our system and make sure that the person is still recognized by our system. You could apply exactly the same approach on the data grabbed from the other devices if you would like to investigate the difference between a system that captures almost binary images and one that captures grayscale images. However, we will provide techniques for doing the binarization yourself.

Implement the approach in OpenCV 3

The complete fingerprint software for processing fingerprints derived from a fingerprint scanner can be found at https://github.com/OpenCVBlueprints/OpenCVBlueprints/tree/master/chapter_6/source_code/fingerprint/fingerprint_process/. In this subsection, we will describe how you can implement this approach in the OpenCV interface. We will start by grabbing the image from the fingerprint system and applying binarization. This will enable us to remove any undesired noise from the image as well as help us to improve the contrast between the skin and the wrinkled surface of the finger.

    // Start by reading in an image
    Mat input = imread("/data/fingerprints/image1.png", CV_LOAD_IMAGE_GRAYSCALE);
    // Binarize the image, through local thresholding
    Mat input_binary;
    threshold(input, input_binary, 0, 255, CV_THRESH_BINARY | CV_THRESH_OTSU);

The Otsu thresholding will automatically choose the best generic threshold for the image to obtain a good contrast between foreground and background information. This is because the image contains a bimodal distribution (which means that we have an image with a two-peak histogram) of pixel values. For such an image, we can approximately take a value in the middle of those peaks as the threshold value (for images which are not bimodal, binarization won't be accurate). Otsu allows us to avoid using a fixed threshold value, thus making the system more general for any capturing device. However, we do acknowledge that if you have only a single capturing device, then playing around with a fixed threshold value could result in a better image for that specific setup. The result of the thresholding can be seen below. In order to make thinning in the next skeletonization step as effective as possible, we need the inverse binary image.

Comparison between grayscale and binarized fingerprint image
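For readers who prefer OpenCV's Python bindings over C++, a hedged equivalent of the read-and-binarize step above might look as follows; it assumes the same image path as the C++ snippet and is only a sketch, not code from the book:

    import cv2

    # Read the fingerprint image as grayscale and binarize it with Otsu's method
    input_img = cv2.imread("/data/fingerprints/image1.png", cv2.IMREAD_GRAYSCALE)
    _, input_binary = cv2.threshold(input_img, 0, 255,
                                    cv2.THRESH_BINARY | cv2.THRESH_OTSU)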
Once we have a binary image, we are actually already set to calculate our feature points and feature point descriptors. However, in order to improve the process a bit more, we suggest skeletonizing the image. This will create more unique and stronger interest points. The following piece of code applies skeletonization on top of the binary image. The skeletonization is based on the Zhang-Suen line thinning approach. Special thanks to @bsdNoobz of the OpenCV Q&A forum, who supplied this iteration approach.

    #include <opencv2/imgproc.hpp>
    #include <opencv2/highgui.hpp>

    using namespace std;
    using namespace cv;

    // Perform a single thinning iteration, which is repeated until the skeletonization is finalized
    void thinningIteration(Mat& im, int iter)
    {
        Mat marker = Mat::zeros(im.size(), CV_8UC1);
        for (int i = 1; i < im.rows-1; i++)
        {
            for (int j = 1; j < im.cols-1; j++)
            {
                uchar p2 = im.at<uchar>(i-1, j);
                uchar p3 = im.at<uchar>(i-1, j+1);
                uchar p4 = im.at<uchar>(i, j+1);
                uchar p5 = im.at<uchar>(i+1, j+1);
                uchar p6 = im.at<uchar>(i+1, j);
                uchar p7 = im.at<uchar>(i+1, j-1);
                uchar p8 = im.at<uchar>(i, j-1);
                uchar p9 = im.at<uchar>(i-1, j-1);

                int A = (p2 == 0 && p3 == 1) + (p3 == 0 && p4 == 1) +
                        (p4 == 0 && p5 == 1) + (p5 == 0 && p6 == 1) +
                        (p6 == 0 && p7 == 1) + (p7 == 0 && p8 == 1) +
                        (p8 == 0 && p9 == 1) + (p9 == 0 && p2 == 1);
                int B = p2 + p3 + p4 + p5 + p6 + p7 + p8 + p9;
                int m1 = iter == 0 ? (p2 * p4 * p6) : (p2 * p4 * p8);
                int m2 = iter == 0 ? (p4 * p6 * p8) : (p2 * p6 * p8);

                if (A == 1 && (B >= 2 && B <= 6) && m1 == 0 && m2 == 0)
                    marker.at<uchar>(i,j) = 1;
            }
        }
        im &= ~marker;
    }

    // Function for thinning any given binary image within the range of 0-255.
    // If not, you should first make sure that your image has this range preset and configured!
    void thinning(Mat& im)
    {
        // Enforce the range to be in between 0 - 255
        im /= 255;
        Mat prev = Mat::zeros(im.size(), CV_8UC1);
        Mat diff;
        do {
            thinningIteration(im, 0);
            thinningIteration(im, 1);
            absdiff(im, prev, diff);
            im.copyTo(prev);
        } while (countNonZero(diff) > 0);
        im *= 255;
    }

The code above can then simply be applied to our previous steps by calling the thinning function on top of our previously generated binary image. The code for this is:

    // Apply thinning algorithm
    Mat input_thinned = input_binary.clone();
    thinning(input_thinned);

This will result in the following output:

Comparison between binarized and thinned fingerprint image using skeletonization techniques

When we have this skeleton image, the following step is to look for crossing points on the ridges of the fingerprint, which are then called minutiae points. We can do this with a keypoint detector that looks at a large change in local contrast, like the Harris corner detector. Since the Harris corner detector is able to detect both strong corners and edges, it is ideal for the fingerprint problem, where the most important minutiae are short edges and bifurcations, the positions where edges come together. More information about minutiae points and Harris corner detection can be found in the following publications:
"Toward reconstructing fingerprints from minutiae points." Defense and Security. International Society for Optics and Photonics, 2005. Harris, Chris, and Mike Stephens. "A combined corner and edge detector." Alvey vision conference. Vol. 15. 1988. Calling the Harris Corner operation on a skeletonized and binarized image in OpenCV is quite straightforward. The Harris corners are stored as positions corresponding in the image with their cornerness response value. If we want to detect points with a certain cornerness, than we should simply threshold the image. Mat harris_corners, harris_normalised; harris_corners = Mat::zeros(input_thinned.size(), CV_32FC1); cornerHarris(input_thinned, harris_corners, 2, 3, 0.04, BORDER_DEFAULT); normalize(harris_corners, harris_normalised, 0, 255, NORM_MINMAX, CV_32FC1, Mat()); We now have a map with all the available corner responses rescaled to the range of [0 255] and stored as float values. We can now manually define a threshold, that will generate a good amount of keypoints for our application. Playing around with this parameter could improve performance in other cases. This can be done by using the following code snippet: float threshold = 125.0; vector<KeyPoint> keypoints; Mat rescaled; convertScaleAbs(harris_normalised, rescaled); Mat harris_c(rescaled.rows, rescaled.cols, CV_8UC3); Mat in[] = { rescaled, rescaled, rescaled }; int from_to[] = { 0,0, 1,1, 2,2 }; mixChannels( in, 3, &harris_c, 1, from_to, 3 ); for(int x=0; x<harris_normalised.cols; x++){ for(int y=0; y<harris_normalised.rows; y++){ if ( (int)harris_normalised.at<float>(y, x) > threshold ){ // Draw or store the keypoint location here, just like you decide. In our case we will store the location of the keypoint circle(harris_c, Point(x, y), 5, Scalar(0,255,0), 1); circle(harris_c, Point(x, y), 1, Scalar(0,0,255), 1); keypoints.push_back( KeyPoint (x, y, 1) ); } } } Comparison between thinned fingerprint and Harris corner response, as well as the selected Harris corners. Now that we have a list of keypoints we will need to create some of formal descriptor of the local region around that keypoint to be able to uniquely identify it among other keypoints. Chapter 3, Recognizing facial expressions with machine learning, discusses in more detail the wide range of keypoints out there. In this article, we will mainly focus on the process. Feel free to adapt the interface with other keypoint detectors and descriptors out there, for better or for worse performance. Since we have an application where the orientation of the thumb can differ (since it is not a fixed position), we want a keypoint descriptor that is robust at handling these slight differences. One of the mostly used descriptors for that is the SIFT descriptor, which stands for scale invariant feature transform. However SIFT is not under a BSD license and can thus pose problems to use in commercial software. A good alternative in OpenCV is the ORB descriptor. In OpenCV you can implement it in the following way. Ptr<Feature2D> orb_descriptor = ORB::create(); Mat descriptors; orb_descriptor->compute(input_thinned, keypoints, descriptors); This will enable us to calculate only the descriptors using the ORB approach, since we already retrieved the location of the keypoints using the Harris corner approach. At this point we can retrieve a descriptor for each detected keypoint of any given fingerprint. The descriptors matrix will contain a row for each keypoint containing the representation. 
Let us now start from the case where we have only a single reference image for each fingerprint. In that case, we will have a database containing a set of feature descriptors for the training persons in the database. We then have a single new entry, consisting of multiple descriptors for the keypoints found at registration time. We now have to match these descriptors to the descriptors stored in the database to see which one has the best match. The simplest way is to perform brute-force matching using the Hamming distance criterion between descriptors of different keypoints.

    // Imagine we have a vector of single entry descriptors as a database
    // We will still need to fill those once we compare everything, by using the code snippets above
    vector<Mat> database_descriptors;
    Mat current_descriptors;
    // Create the matcher interface
    Ptr<DescriptorMatcher> matcher = DescriptorMatcher::create("BruteForce-Hamming");
    // Now loop over the database and start the matching
    vector< vector< DMatch > > all_matches;
    for(int entry=0; entry<database_descriptors.size(); entry++){
        vector< DMatch > matches;
        matcher->match(database_descriptors[entry], current_descriptors, matches);
        all_matches.push_back(matches);
    }

We now have all the matches stored as DMatch objects. This means that for each matching couple, we will have the original keypoint, the matched keypoint, and a floating point score between both matches, representing the distance between the matched points. The idea of finding the best match seems pretty straightforward. We take a look at the number of matches that have been returned by the matching process and weigh them by their Euclidean distance in order to add some certainty. We then look for the matching process that yielded the biggest score. This will be our best match and the match we want to return as the selected one from the database. If you want to avoid an imposter being assigned the best matching score, you can add a manual threshold on top of the scoring, so that matches that are not good enough are ignored. However, it is possible, and should be taken into consideration, that if you raise the threshold too high, people whose prints have changed a little will be rejected by the system, for example in the case where someone cuts their finger and thus changes their pattern too drastically.

Fingerprint matching process visualized

To summarize, we saw how to detect fingerprints and implement the approach using OpenCV 3.


Integrating Elasticsearch with the Hadoop ecosystem

Packt
07 Oct 2015
14 min read
In this article by Vishal Shukla, author of the book Elasticsearch for Hadoop, we will take a look at how ES-Hadoop can integrate with Pig and Spark with ease. Elasticsearch is great at getting insights into the indexed data. The Hadoop ecosystem does a great job in making Hadoop easily usable for different users by providing comfortable interfaces; Hive and Pig are two examples. Apart from these, Hadoop integrates well with other computing engines and platforms, such as Spark and Cascading.

Pigging out Elasticsearch

For many use cases, Pig is one of the easiest ways to fiddle around with the data in the Hadoop ecosystem. Pig wins when it comes to ease of use and simple syntax for designing data flow pipelines without getting into complex programming. Assuming that you know Pig, we will cover how to move the data to and from Elasticsearch. If you don't know Pig yet, never mind. You can still carry on with the steps, and by the end of the article, you will at least know how to use Pig to perform data ingestion and reading with Elasticsearch.

Setting up Apache Pig for Elasticsearch

Let's start by setting up Apache Pig. At the time of writing this article, the latest Pig version available is 0.15.0. You can use the following steps to set up this version:

1. First, download the Pig distribution using the following command:

    $ sudo wget -O /usr/local/pig.tar.gz http://mirrors.sonic.net/apache/pig/pig-0.15.0/pig-0.15.0.tar.gz

2. Then, extract Pig to the desired location and rename it to a convenient name:

    $ cd /usr/local
    $ sudo tar -xvf pig.tar.gz
    $ sudo mv pig-0.15.0 pig

3. Now, export the required environment variables by appending the following two lines to the /home/eshadoop/.bashrc file:

    export PIG_HOME=/usr/local/pig
    export PATH=$PATH:$PIG_HOME/bin

4. You can either log out and log back in to see the newly set environment variables, or source the environment configuration with the following command:

    $ source ~/.bashrc

5. Now, start the job history server daemon with the following command:

    $ mr-jobhistory-daemon.sh start historyserver

6. You should see the Pig console with the following command:

    $ pig
    grunt>

It's easy to forget to start the job history daemon once you restart your machine or VM. You may make this daemon run on startup, or you need to ensure this manually. Now we have Pig up and running. In order to use Pig with Elasticsearch, we must ensure that the ES-Hadoop JAR file is available in the Pig classpath. Let's take the ES-Hadoop JAR file and import it into HDFS using the following steps:

1. First, download the ES-Hadoop JAR used to develop the examples in this article, as shown in the following command:

    $ wget http://central.maven.org/maven2/org/elasticsearch/elasticsearch-hadoop/2.1.1/elasticsearch-hadoop-2.1.1.jar

2. Then, create a convenient location to keep library JARs in:

    $ sudo mkdir /opt/lib

3. Now, import the JAR into HDFS:

    $ hadoop fs -mkdir /lib
    $ hadoop fs -put elasticsearch-hadoop-2.1.1.jar /lib/elasticsearch-hadoop-2.1.1.jar

Throughout this article, we will use a crime dataset that is tailored from the open dataset provided at https://data.cityofchicago.org/. This tailored dataset can be downloaded from http://www.packtpub.com/support, where all the code files required for this article are available. Once you have downloaded the dataset, import it into HDFS at /ch07/crimes_dataset.csv.

Importing data to Elasticsearch

Let's import the crime dataset to Elasticsearch using Pig with ES-Hadoop.
ES-Hadoop provides the EsStorage class to use as the Pig storage. In order to use the EsStorage class, you need to have the ES-Hadoop JAR registered with Pig. You can register a JAR located in the local filesystem, HDFS, or other shared filesystems. The REGISTER command registers a JAR file that contains UDFs (user-defined functions) with Pig, as shown in the following code:

    grunt> REGISTER hdfs://localhost:9000/lib/elasticsearch-hadoop-2.1.1.jar;

Then, load the CSV data file as a relation with the following code:

    grunt> SOURCE = load '/ch07/crimes_dataset.csv' using PigStorage(',') as (id:chararray, caseNumber:chararray, date:datetime, block:chararray, iucr:chararray, primaryType:chararray, description:chararray, location:chararray, arrest:boolean, domestic:boolean, lat:double, lon:double);

This command reads the CSV fields and maps each token in the data to the respective field in the preceding command. The resulting relation, SOURCE, is a relation with the Bag data structure that contains multiple Tuples. Now, generate the target Pig relation with a structure that closely matches the target Elasticsearch index mapping, as shown in the following code:

    grunt> TARGET = foreach SOURCE generate id, caseNumber, date, block, iucr, primaryType, description, location, arrest, domestic, TOTUPLE(lon, lat) AS geoLocation;

Here, we need a nested object with the name geoLocation in the target Elasticsearch document. We can achieve this with a Tuple to represent the lat and lon fields. TOTUPLE() helps us to create this tuple. We then assigned the geoLocation alias to this tuple. Let's store the TARGET relation to the Elasticsearch index with the following code:

    grunt> STORE TARGET INTO 'esh_pig/crimes' USING org.elasticsearch.hadoop.pig.EsStorage('es.http.timeout = 5m', 'es.index.auto.create = true', 'es.mapping.names=arrest:isArrest, domestic:isDomestic', 'es.mapping.id=id');

We can specify the target index and type to store the indexed documents. The EsStorage class can accept multiple Elasticsearch configurations. es.mapping.names maps a Pig field name to the Elasticsearch document's field name. You can use Pig's field id to assign a custom _id value for the Elasticsearch document using the es.mapping.id option. Similarly, you can set the _ttl and _timestamp metadata fields as well.

Pig uses just one reducer in the default configuration. It is recommended to change this behavior to have a parallelism that matches the number of shards available, as shown in the following command:

    grunt> SET default_parallel 5;

Pig also combines the input splits, irrespective of their size. This makes it efficient for small files by reducing the number of mappers, but it will cause performance issues for large files. You can disable this behavior in the Pig script, as shown in the following command:

    grunt> SET pig.splitCombination FALSE;

Executing the preceding commands will create the Elasticsearch index and import the crime data documents. If you observe the created documents in Elasticsearch, you can see that the geoLocation value is an array in the [-87.74274476, 41.87404405] format. This is because, by default, ES-Hadoop ignores the tuple field names and simply converts them into an ordered array.
If you wish to make your geoLocation field look like a key/value-based object with lat/lon keys, you can do so by including the following configuration in EsStorage:

    es.mapping.pig.tuple.use.field.names=true

Writing from the JSON source

If you have the input as a well-formed JSON file, you can avoid conversions and transformations and directly pass the JSON document to Elasticsearch for indexing. You may have the JSON data in Pig as chararray, bytearray, or any other form that translates to well-formed JSON by calling the toString() method, as shown in the following code:

    grunt> JSON_DATA = LOAD '/ch07/crimes.json' USING PigStorage() AS (json:chararray);
    grunt> STORE JSON_DATA INTO 'esh_pig/crimes_json' USING org.elasticsearch.hadoop.pig.EsStorage('es.input.json=true');

Type conversions

Take a look at the type mapping of the esh_pig index in Elasticsearch. It maps the geoLocation type to double. This is because Elasticsearch inferred the double type based on the field type we specified in Pig. To map geoLocation to geo_point, you must create the Elasticsearch mapping for it manually before executing the script. Although Elasticsearch provides data type detection based on the type of field in the incoming document, it is always good to create the type mapping beforehand in Elasticsearch. This is a one-time activity; afterwards, you can run the MapReduce, Pig, Hive, Cascading, or Spark jobs multiple times. This will avoid any surprises in the type detection. For your reference, here is a list of some of the field types of Pig and Elasticsearch that map to each other (leaving out the obvious, intuitive type mappings):

    Pig type      Elasticsearch type
    chararray     string
    bytearray     binary
    tuple         array (default) or object
    bag           array
    map           object
    bigdecimal    not supported
    biginteger    not supported

Reading data from Elasticsearch

Reading data from Elasticsearch using Pig is as simple as writing a single command with the Elasticsearch query. Here is a snippet that prints the tuples for crimes related to theft:

    grunt> REGISTER hdfs://localhost:9000/lib/elasticsearch-hadoop-2.1.1.jar
    grunt> ES = LOAD 'esh_pig/crimes' using org.elasticsearch.hadoop.pig.EsStorage('{"query" : { "term" : { "primaryType" : "theft" } } }');
    grunt> dump ES;

Executing the preceding commands will print the tuples to the Pig console.

Giving Spark to Elasticsearch

Spark is a distributed computing system that provides a huge performance boost compared to Hadoop MapReduce. It works on the abstraction of RDDs (Resilient Distributed Datasets), which can be created for any data residing in Hadoop. Without any surprises, ES-Hadoop provides easy integration with Spark by enabling the creation of RDDs from the data in Elasticsearch. Spark's increasing support for integrating with various data sources, such as HDFS, Parquet, Avro, S3, Cassandra, relational databases, and streaming data, makes it special when it comes to data integration. This means that when you use ES-Hadoop with Spark, you can make all these sources integrate with Elasticsearch easily.
Setting up Spark

In order to set up Apache Spark to execute a job, you can perform the following steps:

1. First, download the Apache Spark distribution with the following command:

    $ sudo wget -O /usr/local/spark.tgz http://www.apache.org/dyn/closer.cgi/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.4.tgz

2. Then, extract Spark to the desired location and rename it to a convenient name, as shown in the following command:

    $ cd /usr/local
    $ sudo tar -xvf spark.tgz
    $ sudo mv spark-1.4.1-bin-hadoop2.4 spark

Importing data to Elasticsearch

To import the crime dataset to Elasticsearch with Spark, let's see how we can write a Spark job. We will continue using Java to write Spark jobs for consistency. Here are the driver program's snippets:

    SparkConf conf = new SparkConf().setAppName("esh-spark").setMaster("local[4]");
    conf.set("es.index.auto.create", "true");
    JavaSparkContext context = new JavaSparkContext(conf);

Set up the SparkConf object to configure the Spark job. As always, you can also set most options (such as es.index.auto.create) and other configurations that we have seen throughout the article. Using this configuration, we created the JavaSparkContext object as follows:

    JavaRDD<String> textFile = context.textFile("hdfs://localhost:9000/ch07/crimes_dataset.csv");

Read the crime data CSV file as a JavaRDD. Here, the RDD is still of the type String, where each element represents one line:

    JavaRDD<Crime> dataSplits = textFile.map(new Function<String, Crime>() {
        @Override
        public Crime call(String line) throws Exception {
            CSVParser parser = CSVParser.parse(line, CSVFormat.RFC4180);
            Crime c = new Crime();
            CSVRecord record = parser.getRecords().get(0);
            c.setId(record.get(0));
            // ...
            String lat = record.get(10);
            String lon = record.get(11);
            Map<String, Double> geoLocation = new HashMap<>();
            geoLocation.put("lat", StringUtils.isEmpty(lat) ? null : Double.parseDouble(lat));
            geoLocation.put("lon", StringUtils.isEmpty(lon) ? null : Double.parseDouble(lon));
            c.setGeoLocation(geoLocation);
            return c;
        }
    });

In the preceding snippet, we called the map() method on the JavaRDD to map each input line to a Crime object. Note that we created a simple JavaBean class called Crime that implements the Serializable interface and maps to the Elasticsearch document structure. Using CSVParser, we parsed each field into the Crime object. We mapped the nested geoLocation object by embedding a Map in the Crime object; this map is populated with the lat and lon fields. The map() method returns another JavaRDD that contains the Crime objects. Save this JavaRDD<Crime> to Elasticsearch with the JavaEsSpark class provided by ES-Hadoop:

    JavaEsSpark.saveToEs(dataSplits, "esh_spark/crimes");

For all the ES-Hadoop integrations, such as Pig, Hive, Cascading, Apache Storm, and Spark, you can use all the standard ES-Hadoop configurations and techniques. This includes dynamic/multi-resource writes with a pattern similar to esh_spark/{primaryType}, and you can use JSON strings to directly import the data to Elasticsearch as well. To control the Elasticsearch document metadata being indexed, you can use the saveToEsWithMeta() method of JavaEsSpark. You can pass an instance of JavaPairRDD that contains Tuple2<Metadata, Object>, where Metadata represents a map that has the key/value pairs of the document metadata fields, such as id, ttl, timestamp, and version.
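This article's Spark examples are written in Java; if you would rather drive a similar import from PySpark, ES-Hadoop's Map/Reduce output format can also be used from Python. The following is only a hedged sketch: it assumes the elasticsearch-hadoop JAR is passed to spark-submit via --jars, and the single document shown is illustrative rather than taken from the crime dataset.

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("esh-pyspark").setMaster("local[4]")
    sc = SparkContext(conf=conf)

    # One illustrative (key, document) pair; EsOutputFormat ignores the key
    doc = ("1", {"id": "1", "caseNumber": "HX100", "primaryType": "THEFT"})
    rdd = sc.parallelize([doc])

    rdd.saveAsNewAPIHadoopFile(
        path="-",
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf={"es.resource": "esh_spark/crimes", "es.nodes": "localhost"})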
Using SparkSQL

ES-Hadoop also bridges Elasticsearch with the SparkSQL module. SparkSQL 1.3+ provides the DataFrame abstraction, which represents a collection of Row objects. We will not discuss the details of DataFrame here. ES-Hadoop lets you persist your DataFrame instance to Elasticsearch transparently. Let's see how we can do this with the following code:

    SQLContext sqlContext = new SQLContext(context);
    DataFrame df = sqlContext.createDataFrame(dataSplits, Crime.class);

Create an SQLContext instance using the JavaSparkContext instance. Using the SQLContext instance, you can create a DataFrame by calling the createDataFrame() method and passing the existing JavaRDD<T> and Class<T>, where T is a JavaBean class that implements the Serializable interface. Note that passing the class instance is required to infer a schema for the DataFrame. If you wish to use a non-JavaBean-based RDD, you can create the schema manually. The article's source code contains the implementations of both approaches for your reference. Take a look at the following code:

    JavaEsSparkSQL.saveToEs(df, "esh_sparksql/crimes_reflection");

Once you have the DataFrame instance, you can save it to Elasticsearch with the JavaEsSparkSQL class, as shown in the preceding code.

Reading data from Elasticsearch

Here is the snippet of SparkEsReader that finds crimes related to theft:

    JavaRDD<Map<String, Object>> esRDD = JavaEsSpark.esRDD(context, "esh_spark/crimes",
        "{\"query\" : { \"term\" : { \"primaryType\" : \"theft\" } } }").values();
    for(Map<String, Object> item: esRDD.collect()){
        System.out.println(item);
    }

We used the same JavaEsSpark class to create an RDD with documents that match the Elasticsearch query.

Using SparkSQL

ES-Hadoop provides an org.elasticsearch.spark.sql data source provider to read the data from Elasticsearch using SparkSQL, as shown in the following code:

    Map<String, String> options = new HashMap<>();
    options.put("pushdown", "true");
    options.put("es.nodes", "localhost");
    DataFrame df = sqlContext.read()
        .options(options)
        .format("org.elasticsearch.spark.sql")
        .load("esh_sparksql/crimes_reflection");

The preceding code snippet uses the org.elasticsearch.spark.sql data source to load data from Elasticsearch. You can set the pushdown option to true to push the query execution down to Elasticsearch. This greatly increases efficiency, as the query execution is collocated where the data resides, as shown in the following code:

    df.registerTempTable("crimes");
    DataFrame theftCrimes = sqlContext.sql("SELECT * FROM crimes WHERE primaryType='THEFT'");
    for(Row row: theftCrimes.javaRDD().collect()){
        System.out.println(row);
    }

We registered the table with the DataFrame and executed the SQL query on the SQLContext. Note that we need to collect the final results locally in order to print them in the driver class.

Summary

In this article, we looked at the various Hadoop ecosystem technologies. We set up Pig with ES-Hadoop and developed a script to interact with Elasticsearch. You also learned how to use ES-Hadoop to integrate Elasticsearch with Spark and empower it with the powerful SparkSQL engine.
Hand Gesture Recognition Using a Kinect Depth Sensor

Packt
06 Oct 2015
26 min read
In this article by Michael Beyeler, author of the book OpenCV with Python Blueprints, the goal is to develop an app that detects and tracks simple hand gestures in real time using the output of a depth sensor, such as that of a Microsoft Kinect 3D sensor or an Asus Xtion. The app will analyze each captured frame to perform the following tasks: Hand region segmentation: The user's hand region will be extracted in each frame by analyzing the depth map output of the Kinect sensor, which is done by thresholding, applying some morphological operations, and finding connected components Hand shape analysis: The shape of the segmented hand region will be analyzed by determining contours, convex hull, and convexity defects Hand gesture recognition: The number of extended fingers will be determined based on the hand contour's convexity defects, and the gesture will be classified accordingly (with no extended finger corresponding to a fist, and five extended fingers corresponding to an open hand) Gesture recognition is an ever-popular topic in computer science. This is because it not only enables humans to communicate with machines (human-machine interaction or HMI), but also constitutes the first step for machines to begin understanding human body language. With affordable sensors, such as Microsoft Kinect or Asus Xtion, and open source software such as OpenKinect and OpenNI, it has never been easier to get started in the field yourself. So what shall we do with all this technology? The beauty of the algorithm that we are going to implement in this article is that it works well for a number of hand gestures, yet is simple enough to run in real time on a generic laptop. And if we want, we can easily extend it to incorporate more complicated hand pose estimations. The end product looks like this: No matter how many fingers of my left hand I extend, the algorithm correctly segments the hand region (white), draws the corresponding convex hull (the green line surrounding the hand), finds all convexity defects that belong to the spaces between fingers (large green points) while ignoring others (small red points), and infers the correct number of extended fingers (the number in the bottom-right corner), even for a fist. This article assumes that you have a Microsoft Kinect 3D sensor installed. Alternatively, you may install Asus Xtion or any other depth sensor for which OpenCV has built-in support. First, install OpenKinect and libfreenect from http://www.openkinect.org/wiki/Getting_Started. Then, you need to build (or rebuild) OpenCV with OpenNI support. The GUI used in this article will again be designed with wxPython, which can be obtained from http://www.wxpython.org/download.php. Planning the app The final app will consist of the following modules and scripts: gestures: A module that consists of an algorithm for recognizing hand gestures. We separate this algorithm from the rest of the application so that it can be used as a standalone module without the need for a GUI. gestures.HandGestureRecognition: A class that implements the entire process flow of hand gesture recognition. It accepts a single-channel depth image (acquired from the Kinect depth sensor) and returns an annotated RGB color image with an estimated number of extended fingers. gui: A module that provides a wxPython GUI application to access the capture device and display the video feed. In order to have it access the Kinect depth sensor instead of a generic camera, we will have to extend some of the base class functionality. 
gui.BaseLayout: A generic layout from which more complicated layouts can be built. Setting up the app Before we can get to the nitty-grittyof our gesture recognition algorithm, we need to make sure that we can access the Kinect sensor and display a stream of depth frames in a simple GUI. Accessing the Kinect 3D sensor Accessing Microsoft Kinect from within OpenCV is not much different from accessing a computer's webcam or camera device. The easiest way to integrate a Kinect sensor with OpenCV is by using an OpenKinect module called freenect. For installation instructions, see the preceding information box. The following code snippet grants access to the sensor using cv2.VideoCapture: import cv2 import freenect device = cv2.cv.CV_CAP_OPENNI capture = cv2.VideoCapture(device) On some platforms, the first call to cv2.VideoCapture fails to open a capture channel. In this case, we provide a workaround by opening the channel ourselves: if not(capture.isOpened(device)): capture.open(device) If you want to connect to your Asus Xtion, the device variable should be assigned the cv2.cv.CV_CAP_OPENNI_ASUS value instead. In order to give our app a fair chance to run in real time, we will limit the frame size to 640 x 480 pixels: capture.set(cv2.cv.CV_CAP_PROP_FRAME_WIDTH, 640) capture.set(cv2.cv.CV_CAP_PROP_FRAME_HEIGHT, 480) If you are using OpenCV 3, the constants you are looking for might be called cv3.CAP_PROP_FRAME_WIDTH and cv3.CAP_PROP_FRAME_HEIGHT. The read() method of cv2.VideoCapture is inappropriate when we need to synchronize a set of cameras or a multihead camera, such as a Kinect. In this case, we should use the grab() and retrieve() methods instead. An even easier way when working with OpenKinect is to use the sync_get_depth() and sync_get_video()methods. For the purpose of this article, we will need only the Kinect's depth map, which is a single-channel (grayscale) image in which each pixel value is the estimated distance from the camera to a particular surface in the visual scene. The latest frame can be grabbed via this code: depth, timestamp = freenect.sync_get_depth() The preceding code returns both the depth map and a timestamp. We will ignore the latter for now. By default, the map is in 11-bit format, which is inadequate to be visualized with cv2.imshow right away. Thus, it is a good idea to convert the image to 8-bit precision first. In order to reduce the range of depth values in the frame, we will clip the maximal distance to a value of 1,023 (or 2**10-1). 
This will get rid of values that correspond either to noise or distances that are far too large to be of interest to us: np.clip(depth, 0, 2**10-1, depth) depth >>= 2 Then, we convert the image into 8-bit format and display it: depth = depth.astype(np.uint8) cv2.imshow("depth", depth) Running the app In order to run our app, we will need to execute a main function routine that accesses the Kinect, generates the GUI, and executes the main loop of the app: import numpy as np import wx import cv2 import freenect from gui import BaseLayout from gestures import HandGestureRecognition def main(): device = cv2.cv.CV_CAP_OPENNI capture = cv2.VideoCapture() if not(capture.isOpened()): capture.open(device) capture.set(cv2.cv.CV_CAP_PROP_FRAME_WIDTH, 640) capture.set(cv2.cv.CV_CAP_PROP_FRAME_HEIGHT, 480) We will design a suitable layout (KinectLayout) for the current project: # start graphical user interface app = wx.App() layout = KinectLayout(None, -1, 'Kinect Hand Gesture Recognition', capture) layout.Show(True) app.MainLoop() The Kinect GUI The layout chosen for the current project (KinectLayout) is as plain as it gets. It should simply display the live stream of the Kinect depth sensor at a comfortable frame rate of 10 frames per second. Therefore, there is no need to further customize BaseLayout: class KinectLayout(BaseLayout): def _create_custom_layout(self): pass The only parameter that needs to be initialized this time is the recognition class. This will be useful in just a moment: def _init_custom_layout(self): self.hand_gestures = HandGestureRecognition() Instead of reading a regular camera frame, we need to acquire a depth frame via the freenect method sync_get_depth(). This can be achieved by overriding the following method: def _acquire_frame(self): As mentioned earlier, by default, this function returns a single-channel depth image with 11-bit precision and a timestamp. However, we are not interested in the timestamp, and we simply pass on the frame if the acquisition was successful: frame, _ = freenect.sync_get_depth() # return success if frame size is valid if frame is not None: return (True, frame) else: return (False, frame) The rest of the visualization pipeline is handled by the BaseLayout class. We only need to make sure that we provide a _process_frame method. This method accepts a depth image with 11-bit precision, processes it, and returns an annotated 8-bit RGB color image. Conversion to a regular grayscale image is the same as mentioned in the previous subsection: def _process_frame(self, frame): # clip max depth to 1023, convert to 8-bit grayscale np.clip(frame, 0, 2**10 – 1, frame) frame >>= 2 frame = frame.astype(np.uint8) The resulting grayscale image can then be passed to the hand gesture recognizer, which will return the estimated number of extended fingers (num_fingers) and the annotated RGB color image mentioned earlier (img_draw): num_fingers, img_draw = self.hand_gestures.recognize(frame) In order to simplify the segmentation task of the HandGestureRecognition class, we will instruct the user to place their hand in the center of the screen. 
To provide a visual aid for this, let's draw a rectangle around the image center and highlight the center pixel of the image in orange: height, width = frame.shape[:2] cv2.circle(img_draw, (width/2, height/2), 3, [255, 102, 0], 2) cv2.rectangle(img_draw, (width/3, height/3), (width*2/3, height*2/3), [255, 102, 0], 2) In addition, we print num_fingers on the screen: cv2.putText(img_draw, str(num_fingers), (30, 30),cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255)) return img_draw Tracking hand gestures in real time The bulk of the work is done by the HandGestureRecognition class, especially by its recognize method. This class starts off with a few parameter initializations, which will be explained and used later: class HandGestureRecognition: def __init__(self): # maximum depth deviation for a pixel to be considered # within range self.abs_depth_dev = 14 # cut-off angle (deg): everything below this is a convexity # point that belongs to two extended fingers self.thresh_deg = 80.0 The recognize method is where the real magic takes place. This method handles the entire process flow, from the raw grayscale image all the way to a recognized hand gesture. It implements the following procedure: It extracts the user's hand region by analyzing the depth map (img_gray) and returning a hand region mask (segment): def recognize(self, img_gray): segment = self._segment_arm(img_gray) It performs contour analysis on the hand region mask (segment). Then, it returns the largest contour area found in the image (contours) and any convexity defects (defects): [contours, defects] = self._find_hull_defects(segment) Based on the contours found and the convexity defects, it detects the number of extended fingers (num_fingers) in the image. Then, it annotates the output image (img_draw) with contours, defect points, and the number of extended fingers: img_draw = cv2.cvtColor(img_gray, cv2.COLOR_GRAY2RGB) [num_fingers, img_draw] = self._detect_num_fingers(contours, defects, img_draw) It returns the estimated number of extended fingers (num_fingers) as well as the annotated output image (img_draw): return (num_fingers, img_draw) Hand region segmentation The automatic detection of an arm, and later the hand region, could be designed to be arbitrarily complicated, maybe by combining information about the shape and color of an arm or hand. However, using a skin color as a determining feature to find hands in visual scenes might fail terribly in poor lighting conditions or when the user is wearing gloves. Instead, we choose to recognize the user's hand by its shape in the depth map. Allowing hands of all sorts to be present in any region of the image unnecessarily complicates the mission of this article, so we make two simplifying assumptions: We will instruct the user of our app to place their hand in front of the center of the screen, orienting their palm roughly parallel to the orientation of the Kinect sensor so that it is easier to identify the corresponding depth layer of the hand. We will also instruct the user to sit roughly 1 to 2 meters away from the Kinect, and to slightly extend their arm in front of their body so that the hand will end up in a slightly different depth layer than the arm. However, the algorithm will still work even if the full arm is visible. In this way, it will be relatively straightforward to segment the image based on the depth layer alone. Otherwise, we would have to come up with a hand detection algorithm first, which would unnecessarily complicate our mission. 
If you feel adventurous, feel free to do this on your own. Finding the most prominent depth of the image center region Once the hand is placed roughly in the center of the screen, we can start finding all image pixels that lie on the same depth plane as the hand. For this, we simply need to determine the most prominent depth value of the center region of the image. The simplest approach would be as follows: look only at the depth value of the center pixel: width, height = depth.shape center_pixel_depth = depth[width/2, height/2] Then, create a mask in which all pixels at a depth of center_pixel_depth are white and all others are black: import numpy as np depth_mask = np.where(depth == center_pixel_depth, 255, 0).astype(np.uint8) However, this approach will not be very robust, because chances are that: Your hand is not placed perfectly parallel to the Kinect sensor Your hand is not perfectly flat The Kinect sensor values are noisy Therefore, different regions of your hand will have slightly different depth values. The _segment_arm method takes a slightly better approach, that is, looking at a small neighborhood in the center of the image and determining the median (meaning the most prominent) depth value. First, we find the center (for example, 21 x 21 pixels) region of the image frame: def _segment_arm(self, frame): """ segments the arm region based on depth """ center_half = 10 # half-width of 21 is 21/2-1 lowerHeight = self.height/2 – center_half upperHeight = self.height/2 + center_half lowerWidth = self.width/2 – center_half upperWidth = self.width/2 + center_half center = frame[lowerHeight:upperHeight,lowerWidth:upperWidth] We can then reshape the depth values of this center region into a one-dimensional vector and determine the median depth value, med_val: med_val = np.median(center) We can now compare med_val with the depth value of all pixels in the image and create a mask in which all pixels whose depth values are within a particular range [med_val-self.abs_depth_dev, med_val+self.abs_depth_dev] are white and all other pixels are black. However, for reasons that will be clear in a moment, let's paint the pixels gray instead of white: frame = np.where(abs(frame – med_val) <= self.abs_depth_dev, 128, 0).astype(np.uint8) The result will look like this: Applying morphological closing to smoothen the segmentation mask A common problem with segmentation is that a hard threshold typically results in small imperfections (that is, holes, as in the preceding image) in the segmented region. These holes can be alleviated using morphological opening and closing. Opening removes small objects from the foreground (assuming that the objects are bright on a dark foreground), whereas closing removes small holes (dark regions). This means that we can get rid of the small black regions in our mask by applying morphological closing (dilation followed by erosion) with a small 3 x 3 pixel kernel: kernel = np.ones((3, 3), np.uint8) frame = cv2.morphologyEx(frame, cv2.MORPH_CLOSE, kernel) The result looks a lot smoother, as follows: Notice, however, that the mask still contains regions that do not belong to the hand or arm, such as what appears to be one of my knees on the left and some furniture on the right. These objects just happen to be on the same depth layer as my arm and hand. If possible, we could now combine the depth information with another descriptor, maybe a texture-based or skeleton-based hand classifier, that would weed out all non-skin regions. 
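As a possible refinement of the idea just mentioned, the depth mask could be combined with a simple skin-color mask computed from the Kinect's RGB stream. The following is only a rough sketch of ours, not part of the original pipeline: the helper name refine_with_skin_color and the HSV thresholds are assumptions and would need tuning for real lighting conditions and skin tones:

import cv2
import numpy as np
import freenect

def refine_with_skin_color(depth_mask):
    # grab the matching RGB frame from the Kinect (640 x 480, like the depth map)
    rgb, _ = freenect.sync_get_video()
    # convert to HSV, which is more convenient for color thresholding
    hsv = cv2.cvtColor(rgb, cv2.COLOR_RGB2HSV)
    # assumed (very rough) HSV bounds for skin; tune these for your setup
    lower = np.array([0, 40, 60], np.uint8)
    upper = np.array([25, 180, 255], np.uint8)
    skin_mask = cv2.inRange(hsv, lower, upper)
    # keep a pixel only if it lies in the segmented depth layer and looks like skin
    return cv2.bitwise_and(depth_mask, skin_mask)

With such a combination, objects such as furniture that merely share the hand's depth layer would be suppressed, at the cost of reintroducing a dependency on lighting conditions.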
Finding connected components in a segmentation mask An easier approach is to realize that most of the times, hands are not connected to knees or furniture. We already know that the center region belongs to the hand, so we can simply apply cv2.floodfill to find all the connected image regions. Before we do this, we want to be absolutely certain that the seed point for the flood fill belongs to the right mask region. This can be achieved by assigning a grayscale value of 128 to the seed point. But we also want to make sure that the center pixel does not, by any coincidence, lie within a cavity that the morphological operation failed to close. So, let's set a small 7 x 7 pixel region with a grayscale value of 128 instead: small_kernel = 3 frame[self.height/2-small_kernel : self.height/2+small_kernel, self.width/2-small_kernel : self.width/2+small_kernel] = 128 Because flood filling (as well as morphological operations) is potentially dangerous, the Python version of later OpenCV versions requires specifying a mask that avoids flooding the entire image. This mask has to be 2 pixels wider and taller than the original image and has to be used in combination with the cv2.FLOODFILL_MASK_ONLY flag. It can be very helpful in constraining the flood filling to a small region of the image or a specific contour so that we need not connect two neighboring regions that should have never been connected in the first place. It's better to be safe than sorry, right? Ah, screw it! Today, we feel courageous! Let's make the mask entirely black: mask = np.zeros((self.height+2, self.width+2), np.uint8) Then we can apply the flood fill to the center pixel (seed point) and paint all the connected regions white: flood = frame.copy() cv2.floodFill(flood, mask, (self.width/2, self.height/2), 255, flags=4 | (255 << 8)) At this point, it should be clear why we decided to start with a gray mask earlier. We now have a mask that contains white regions (arm and hand), gray regions (neither arm nor hand but other things in the same depth plane), and black regions (all others). With this setup, it is easy to apply a simple binary threshold to highlight only the relevant regions of the pre-segmented depth plane: ret, flooded = cv2.threshold(flood, 129, 255, cv2.THRESH_BINARY) This is what the resulting mask looks like: The resulting segmentation mask can now be returned to the recognize method, where it will be used as an input to _find_hull_defects as well as a canvas for drawing the final output image (img_draw). Hand shape analysis Now that we (roughly) know where the hand is located, we aim to learn something about its shape. Determining the contour of the segmented hand region The first step involves determining the contour of the segmented hand region. Luckily, OpenCV comes with a pre-canned version of such an algorithm—cv2.findContours. This function acts on a binary image and returns a set of points that are believed to be part of the contour. Because there might be multiple contours present in the image, it is possible to retrieve an entire hierarchy of contours: def _find_hull_defects(self, segment): contours, hierarchy = cv2.findContours(segment, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE) Furthermore, because we do not know which contour we are looking for, we have to make an assumption to clean up the contour result. 
Since it is possible that some small cavities are left over even after the morphological closing—but we are fairly certain that our mask contains only the segmented area of interest—we will assume that the largest contour found is the one that we are looking for. Thus, we simply traverse the list of contours, calculate the contour area (cv2.contourArea), and store only the largest one (max_contour): max_contour = max(contours, key=cv2.contourArea) Finding the convex hull of a contour area Once we have identified the largest contour in our mask, it is straightforward to compute the convex hull of the contour area. The convex hull is basically the envelope of the contour area. If you think of all the pixels that belong to the contour area as a set of nails sticking out of a board, then the convex hull is the shape formed by a tight rubber band that surrounds all the nails. We can get the convex hull directly from our largest contour (max_contour): hull = cv2.convexHull(max_contour, returnPoints=False) Because we now want to look at convexity deficits in this hull, we are instructed by the OpenCV documentation to set the returnPoints optional flag to False. The convex hull drawn in green around a segmented hand region looks like this: Finding convexity defects of a convex hull As is evident from the preceding screenshot, not all points on the convex hull belong to the segmented hand region. In fact, all the fingers and the wrist cause severe convexity defects, that is, points of the contour that are far away from the hull. We can find these defects by looking at both the largest contour (max_contour) and the corresponding convex hull (hull): defects = cv2.convexityDefects(max_contour, hull) The output of this function (defects) is a 4-tuple that contains start_index (the point of the contour where the defect begins), end_index (the point of the contour where the defect ends), farthest_pt_index (the farthest from the convex hull point within the defect), and fixpt_depth (distance between the farthest point and the convex hull). We will make use of this information in just a moment when we reason about fingers. But for now, our job is done. The extracted contour (max_contour) and convexity defects (defects) can be passed to recognize, where they will be used as inputs to _detect_num_fingers: return (cnt,defects) Hand gesture recognition What remains to be done is classifying the hand gesture based on the number of extended fingers. For example, if we find five extended fingers, we assume the hand to be open, whereas no extended fingers imply a fist. All that we are trying to do is count from zero to five and make the app recognize the corresponding number of fingers. This is actually trickier than it might seem at first. For example, people in Europe might count to three by extending their thumb, index finger, and middle finger. If you do that in the US, people there might get horrendously confused, because people do not tend to use their thumbs when signaling the number two. This might lead to frustration, especially in restaurants (trust me). If we could find a way to generalize these two scenarios—maybe by appropriately counting the number of extended fingers—we would have an algorithm that could teach simple hand gesture recognition to not only a machine but also (maybe) to an average waitress. As you might have guessed, the answer has to do with convexity defects. As mentioned earlier, extended fingers cause defects in the convex hull. 
However, the inverse is not true; that is, not all convexity defects are caused by fingers! There might be additional defects caused by the wrist as well as the overall orientation of the hand or the arm. How can we distinguish between these different causes for defects? Distinguishing between different causes for convexity defects The trick is to look at the angle between the farthest point from the convex hull point within the defect (farthest_pt_index) and the start and end points of the defect (start_index and end_index, respectively), as illustrated in the following screenshot: In this screenshot, the orange markers serve as a visual aid to center the hand in the middle of the screen, and the convex hull is outlined in green. Each red dot corresponds to a farthest from the convex hull point (farthest_pt_index) for every convexity defect detected. If we compare a typical angle that belongs to two extended fingers (such as θj) to an angle that is caused by general hand geometry (such as θi), we notice that the former is much smaller than the latter. This is obviously because humans can spread their finger only a little, thus creating a narrow angle made by the farthest defect point and the neighboring fingertips. Therefore, we can iterate over all convexity defects and compute the angle between the said points. For this, we will need a utility function that calculates the angle (in radians) between two arbitrary, list-like vectors, v1 and v2: def angle_rad(v1, v2): return np.arctan2(np.linalg.norm(np.cross(v1, v2)), np.dot(v1, v2)) This method uses the cross product to compute the angle, rather than the standard way. The standard way of calculating the angle between two vectors v1 and v2 is by calculating their dot product and dividing it by the norm of v1 and the norm of v2. However, this method has two imperfections: You have to manually avoid division by zero if either the norm of v1 or the norm of v2 is zero The method returns relatively inaccurate results for small angles Similarly, we provide a simple function to convert an angle from degrees to radians: def deg2rad(angle_deg): return angle_deg/180.0*np.pi Classifying hand gestures based on the number of extended fingers What remains to be done is actually classifying the hand gesture based on the number of extended fingers. The _detect_num_fingers method will take as input the detected contour (contours), the convexity defects (defects), and a canvas to draw on (img_draw): def _detect_num_fingers(self, contours, defects, img_draw): Based on these parameters, it will then determine the number of extended fingers. However, we first need to define a cut-off angle that can be used as a threshold to classify convexity defects as being caused by extended fingers or not. Except for the angle between the thumb and the index finger, it is rather hard to get anything close to 90 degrees, so anything close to that number should work. We do not want the cut-off angle to be too high, because that might lead to misclassifications: self.thresh_deg = 80.0 For simplicity, let's focus on the special cases first. If we do not find any convexity defects, it means that we possibly made a mistake during the convex hull calculation, or there are simply no extended fingers in the frame, so we return 0 as the number of detected fingers: if defects is None: return [0, img_draw] But we can take this idea even further. 
Due to the fact that arms are usually slimmer than hands or fists, we can assume that the hand geometry will always generate at least two convexity defects (which usually belong to the wrists). So if there are no additional defects, it implies that there are no extended fingers: if len(defects) <= 2: return [0, img_draw] Now that we have ruled out all special cases, we can begin counting real fingers. If there are a sufficient number of defects, we will find a defect between every pair of fingers. Thus, in order to get the number right (num_fingers), we should start counting at 1: num_fingers = 1 Then we can start iterating over all convexity defects. For each defect, we will extract the four elements and draw its hull for visualization purposes: for i in range(defects.shape[0]): # each defect point is a 4-tuple start_idx, end_idx, farthest_idx, _ = defects[i, 0] start = tuple(contours[start_idx][0]) end = tuple(contours[end_idx][0]) far = tuple(contours[farthest_idx][0]) # draw the hull cv2.line(img_draw, start, end, [0, 255, 0], 2) Then we will compute the angle between the two edges from far to start and from far to end. If the angle is smaller than self.thresh_deg degrees, it means that we are dealing with a defect that is most likely caused by two extended fingers. In this case, we want to increment the number of detected fingers (num_fingers), and we draw the point in green. Otherwise, we draw the point in red: # if angle is below a threshold, defect point belongs # to two extended fingers if angle_rad(np.subtract(start, far), np.subtract(end, far)) < deg2rad(self.thresh_deg): # increment number of fingers num_fingers = num_fingers + 1 # draw point as green cv2.circle(img_draw, far, 5, [0, 255, 0], -1) else: # draw point as red cv2.circle(img_draw, far, 5, [255, 0, 0], -1) After iterating over all convexity defects, we pass the number of detected fingers and the assembled output image to the recognize method: return (min(5, num_fingers), img_draw) This will make sure that we do not exceed the common number of fingers per hand. The result can be seen in the following screenshots: Interestingly, our app is able to detect the correct number of extended fingers in a variety of hand configurations. Defect points between extended fingers are easily classified as such by the algorithm, and others are successfully ignored. Summary This article showed a relatively simple and yet surprisingly robust way of recognizing a variety of hand gestures by counting the number of extended fingers. The algorithm first shows how a task-relevant region of the image can be segmented using depth information acquired from a Microsoft Kinect 3D Sensor, and how morphological operations can be used to clean up the segmentation result. By analyzing the shape of the segmented hand region, the algorithm comes up with a way to classify hand gestures based on the types of convexity defects found in the image. Once again, mastering our use of OpenCV to perform a desired task did not require us to produce a large amount of code. Instead, we were challenged to gain an important insight that made us use the built-in functionality of OpenCV in the most effective way possible. Gesture recognition is a popular but challenging field in computer science, with applications in a large number of areas, such as human-computer interaction, video surveillance, and even the video game industry. 
You can now use your advanced understanding of segmentation and structure analysis to build your own state-of-the-art gesture recognition system. Resources for Article: Tracking Faces with Haar Cascades Our First Machine Learning Method - Linear Classification Solving problems with Python: Closest good restaurant
Creating Interactive Spreadsheets using Tables and Slicers

Packt
06 Oct 2015
10 min read
In this article by Hernan D Rojas, author of the book Data Analysis and Business Modeling with Excel 2013, we introduce additional materials for the advanced Excel developer in the presentation stage of the data analysis life cycle. What you will learn in this article is how to leverage Excel's new features to add interactivity to your spreadsheets. (For more resources related to this topic, see here.) What are slicers? Slicers are essentially buttons that automatically filter your data. Excel has always been able to filter data, but slicers are more practical and visually appealing. Let's compare the two in the following steps: First, fire up Excel 2013, and create a new spreadsheet. Manually enter the data, as shown in the following figure: Highlight cells A1 through E11, and press Ctrl + T to convert our data to an Excel table. Converting your data to a table is the first step that you need to take in order to introduce slicers in your spreadsheet. Let's filter our data using the default filtering capabilities that we are already familiar with. Filter the Type column and only select the rows that have the value equal to SUV, as shown in the following figure. Click on the OK button to apply the filter to the table. You will now be left with four rows that have the Type column equal to SUV. Using a typical Excel filter, we were able to filter our data and only show all of the SUV cars. We can then continue to filter by other columns, such as MPG (miles per gallon) and Price. How can we accomplish the same results using slicers? Continue reading this article to find out. How to create slicers? In this article, we will be going through simple but powerful steps that are required to build slicers. After we create our first slicer, make sure that you compare and contrast the old way of filtering versus the new way of filtering data. Remove the filter that we just applied to our table by clicking on the option named Clear Filter From "Type", as shown in the following figure: With your Excel table selected, click on the TABLE TOOLS tab. Click on the Insert Slicer button. In the Insert Slicers dialog box, select the Type checkbox, and click on the OK button, as shown in the following screenshot: You should now have a slicer that looks similar to the one in the following figure. Notice that you can resize and move the slicer anywhere you want in the spreadsheet. Click on the Sedan filter in the slicer that we built in the previous step. Wow! The data is filtered and only the rows where the Type column is equal to Sedan are shown in the results. Click on the Sport filter and see what happens. The data is now filtered where the Type column is equal to Sport. Notice that the previous filter of Sedan was removed as soon as we clicked on the Sport filter. What if we want to filter the data by both Sport and Sedan? We can just highlight both the filters with our mouse, or click on Sedan, press Ctrl, and then click on the Sport filter. The end result will look like this: To clear the filter, just click on the Clear Filter button. Do you see the advantage of slicers over filters? Yes, of course, they are simply better. Filtering between Sedan, Sport, or SUV is very easy and convenient. It will certainly take fewer keystrokes, and the feedback is instant. Think about the end users interacting with your spreadsheet. At the touch of a button, they can answer questions that arise in their heads. This is what you call an interactive spreadsheet or an interactive dashboard. 
Styling slicers There are not many options to style slicers but Excel does give you a decent amount of color schemes that you can experiment with: With the Type slicer selected, navigate to the SLICER TOOLS tab, as shown in the following figure: Click on the various slicer styles available to get a feel of what Excel offers. Adding multiple slicers You are able to add multiple slicers and multiple charts in one spreadsheet. Why would we do this? Well, this is the beginning of a dashboard creation. Let's expand on the example we have just been working on, and see how we can turn raw data into an interactive dashboard: Let's start with creating slicers for # of Passengers and MPG, as shown in the following figure: Rename Sheet1 as Data, and create a new sheet called Dashboard, as shown here: Move the three slicers by cutting and pasting them from the Data sheet to the Dashboard sheet. Create a line chart using the columns Company and MPG, as shown in the following figure: Create a bar chart using the columns Type and MPG. Create another bar chart with the columns company and # of Passengers, as shown in the following figure. These types of charts are technically called column charts, but you can get away by calling them bar charts. Now, move the three charts from the Data tab to the Dashboard tab. Right-click on the bar chart, and select the Move Chart… option. In the Move Chart dialog box, change the Object in: parameter from Data to Dashboard, and then click on the OK button. Move the other two charts to the Dashboard tab so that there are no more charts in the Data tab. Rearrange the charts and slicers so that they look as closely as possible to the ones in the following figure. As you can see that this tab is starting to look like a dashboard. The Type slicer will look better if Sedan, Sport, and SUV are laid out horizontally. Select the Type slicer, and click on the SLICER TOOLS menu option. Change the Columns parameter from 1 to 3, as shown in the following figure. This is how we are able to change the layout or shape of the slicer. Resize the Type slicer so that it looks like the one in the following figure: Clearing filters You can click on one or more filters in the dashboard that we just created. Very cool! Every time we select a filter, all of the three charts that we created get updated. This again is called adding interactivity to your spreadsheets. This allows the end users of your dashboard to interact with your data and perform their own analysis. If you notice, there is really a no good way of removing multiple filters at once. For example, if you select Sedans that have a MPG of greater or equal to 30, how would you remove all of the filters? You would have to clear the filters from the Type slicer and then from the MPG slicer. This can be a little tedious to your end user, and you will want to avoid this at any cost. The next steps will show you how to create a button using VBA that will filter all of our data in a flash: Press Alt + F11, and create a sub procedure called Clear_Slicer, as shown in the following figure. This code will basically find all of the filters that you have selected and then manually clears them for you one at a time. The next step is to bind this code to a button: Sub Clear_Slicer() ' Declare Variables Dim cache As SlicerCache ' Loop through each filter For Each cache In ActiveWorkbook.SlicerCaches ' clear filter cache.ClearManualFilter Next cache End Sub Select the DEVELOPER tab, and click on the Insert button. 
In the pop-up menu called Form Controls, select the Button option. Now, click anywhere on the sheet, and you will get a dialog box that looks like the following figure. This is where we are going to assign a macro to the button. This means that whenever you click on the button we are creating, Excel will run the macro of our choice. Since we have already created a macro called Clear_Slicer, it will make sense to select this macro, and then click on the OK button. Change the text of the button to Clear All Filters and resize it so that it looks like this: Adjust the properties of the button by right-clicking on the button and selecting the Format Control… option. Here, you can change the font size and the color of your button label. Now, select a bunch of filters, and click on our new shiny button. Yes, that was pretty cool. The most important part is that it is now even easier to "reset" your dashboard and start a brand new analysis. What do I mean by start a brand new analysis? In general, when a user initially starts using your dashboard, he/she will click on the filters aimlessly. The users do this just to figure out how to mechanically use the dashboard. Then, after they get the hang of it, they want to start with a clean slate and perform some data analysis. If we did not have the Clear All Filters button, the users would have to figure out how they would clear every slicer one at a time to start over. The worst case scenario is when the user does not realize when the filters are turned on and when they are turned off. Now, do not laugh at this situation, or assume that your end user is not as smart as you are. This just means that you need to lower the learning curve of your dashboard and make it easy to use. With the addition of the clear button, the end user can think of a question, use the slicers to answer it, click on the clear button, and start the process all over again. These little details are what that is going to separate you from the average Excel developer. Summary The aim of this article was to give you ideas and tools to present your data artistically. Whether you like it or not, sometimes, a better looking analysis will trump the better but less attractive one. Excel gives you the tools to not be on the short end of the stick but to always be able to present visually stunning analysis. You now know have your Excel slicers, and you learned how to bind them to your data. Users of your spreadsheet can now slice and dice your data to answer multiple questions. Executives like flashy visualizations, and when you combine them with a strong analysis, you have a very powerful combination. In this article, we also went through a variety of strategies to customize slicers and chart elements. These little changes made to your dashboard will make them standout and help you get your message across. Excel as always has been an invaluable tool that gives you all of the tools necessary to overcome any data challenges you might come across. As I tell all my students, the key to become better is simply to practice, practice, and practice. Resources for Article: Further resources on this subject: Introduction to Stata and Data Analytics [article] Getting Started with Apache Spark DataFrames [article] Big Data Analysis (R and Hadoop) [article]
Mathematical Imaging

Packt
05 Oct 2015
17 min read
 In this article by Francisco J. Blanco-Silva, author of the book Mastering SciPy, you will learn about image editing and the purpose of editing is the alteration of digital images, usually to perform improvement of its properties or to turn them into an intermediate step for further processing. Let's examine different examples of editing: Transformations of the domain Intensity adjustment Image restoration (For more resources related to this topic, see here.) Transformations of the domain In this setting, we address transformations to images by first changing the location of pixels, rotations, compressions, stretching, swirls, cropping, perspective control, and so on. Once the transformation to the pixels in the domain of the original is performed, we observe the size of the output. If there are more pixels in this image than in the original, the extra locations are filled with numerical values obtained by interpolating the given data. We do have some control over the kind of interpolation performed, of course. To better illustrate these techniques, we will pair an actual image (say, Lena) with a representation of its domain as a checkerboard: In [1]: import numpy as np, matplotlib.pyplot as plt In [2]: from scipy.misc import lena; ...: from skimage.data import checkerboard In [3]: image = lena().astype('float') ...: domain = checkerboard() In [4]: print image.shape, domain.shape Out[4]: (512, 512) (200, 200) Rescale and resize Before we proceed with the pairing of image and domain, we have to make sure that they both have the same size. One quick fix is to rescale both objects, or simply resize one of the images to match the size of the other. Let's go with the first choice, so that we can illustrate the usage of the two functions available for this task in the module skimage.transform to resize and rescale: In [5]: from skimage.transform import rescale, resize In [6]: domain = rescale(domain, scale=1024./200); ...: image = resize(image, output_shape=(1024, 1024), order=3) Observe how, in the resizing operation, we requested a bicubic interpolation. Swirl To perform a swirl, we call the function swirl from the module skimage.transform: In all the examples of this section, we will present the results visually after performing the requested computations. In all cases, the syntax of the call to offer the images is the same. For a given operation mapping, we issue the command display (mapping, image, and domain) where the routine display is defined as follows: def display(mapping, image, domain): plt.figure() plt.subplot(121) plt.imshow(mapping(image)) plt.gray() plt.subplot(122) plt.imshow(mapping(domain)) plt.show() For the sake of brevity, we will not include this command in the following code, but assume it is called every time: In [7]: from skimage.transform import swirl In [8]: mapping = lambda img: swirl(img, strength=6, ...: radius=512) Geometric transformations A simple rotation around any location (no matter whether inside or outside the domain of the image) can be achieved with the function rotate from either module scipy.ndimage or skimage.transform. They are basically the same under the hood, but the syntax of the function in the scikit-image toolkit is more user-friendly: In [10]: from skimage.transform import rotate In [11]: mapping = lambda img: rotate(img, angle=30, resize=True, ....: center=None) This gives a counter-clockwise rotation of 30 degrees (angle=30) around the center of the image (center=None). 
The size of the output image is expanded to guarantee that all the original pixels are present in the output (resize=True): Rotations are a special case of what we call an affine transformation—a combination of rotation with scales (one for each dimension), shear, and translation. Affine transformations are in turn a special case of a homography (a projective transformation). Rather than learning a different set of functions, one for each kind of geometric transformation, the library skimage.transform allows a very comfortable setting. There is a common function (warp) that gets called with the requested geometric transformation and performs the computations. Each suitable geometric transformation is previously initialized with an appropriate python class. For example, to perform an affine transformation with a counter-clockwise rotation angle of 30 degrees about the point with coordinates (512, -2048), and scale factors of 2 and 3 units, respectively for the x and y coordinates, we issue the following command: In [13]: from skimage.transform import warp, AffineTransform In [14]: operation = AffineTransform(scale=(2,3), rotation=np.pi/6, ....: translation = (512, -2048)) In [15]: mapping = lambda img: warp(img, operation) Observe how all the lines in the transformed checkerboard are either parallel or perpendicular—affine transformations preserve angles. The following illustrates the effect of a homography: In [17]: from skimage.transform import ProjectiveTransform In [18]: generator = np.matrix('1,0,10; 0,1,20; -0.0007,0.0005,1'); ....: homography = ProjectiveTransform(matrix=generator); ....: mapping = lambda img: warp(img, homography) Observe how, unlike in the case of an affine transformation, the lines cease to be all parallel or perpendicular. All vertical lines are now incident at a point at infinity. All horizontal lines are also incident at a different point at infinity. The real usefulness of homographies arises, for example, when we need to perform perspective control. For instance, the image skimage.data.text is clearly slanted. By choosing the four corners of what we believe is a perfect rectangle (we estimate such a rectangle by visual inspection), we can compute a homography that transforms the given image into one that is devoid of any perspective. The Python classes representing geometric transformations allow us to perform this estimation very easily, as the following example shows: In [20]: from skimage.data import text In [21]: text().shape Out[21]: (172, 448) In [22]: source = np.array(((155, 15), (65, 40), ....: (260, 130), (360, 95), ....: (155, 15))) In [23]: mapping = ProjectiveTransform() Let's estimate the homography that transforms the given set of points into a perfect rectangle of the size 48 x 256 centered in an output image of size 512 x 512. The choice of size of the output image is determined by the length of the diagonal of the original image (about 479 pixels). This way, after the homography is computed, the output is likely to contain all the pixels from the original. Observe that we have included one of vertices in the source, twice. This is not strictly necessary for the following computations, but will make the visualization of rectangles much easier to code. We use the same trick for the target rectangle. 
In [24]: target = np.array(((256-128, 256-24), (256-128, 256+24), ....: (256+128, 256+24), (256+128, 256-24), ....: (256-128, 256-24))) In [25]: mapping.estimate(target, source) Out[25]: True In [26]: plt.figure(); ....: plt.subplot(121); ....: plt.imshow(text()); ....: plt.gray(); ....: plt.plot(source[:,0], source[:,1],'-', lw=1, color='red'); ....: plt.xlim(0, 448); ....: plt.ylim(172, 0); ....: plt.subplot(122); ....: plt.imshow(warp(text(), mapping,output_shape=(512, 512))); ....: plt.plot(target[:,0], target[:,1],'-', lw=1, color='red'); ....: plt.xlim(0, 512); ....: plt.ylim(512, 0); ....: plt.show() Other more involved geometric operations are needed, for example, to fix vignetting and some of the other kinds of distortions produced by photographic lenses. Traditionally, once we acquire an image, we assume that all these distortions are present. By knowing the technical specifications of the equipment used to take the photographs, we can automatically rectify these defects. With this purpose in mind, in the SciPy stack, we have access to the lensfun library (http://lensfun.sourceforge.net/) through the package lensfunpy (https://pypi.python.org/pypi/lensfunpy). For examples of usage and documentation, an excellent resource is the API reference of lensfunpy at http://pythonhosted.org/lensfunpy/api/. Intensity adjustment In this category, we have operations that only modify the intensity of an image obeying a global formula. All these operations can therefore be easily coded by using purely NumPy operations, by creating vectorized functions adapting the requested formulas. The applications of these operations can be explained in terms of exposure in black and white photography, for example. For this reason, all the examples in this section are applied on gray-scale images. We have mainly three approaches to enhancing images by working on its intensity: Histogram equalizations Intensity clipping/resizing Contrast adjustment Histogram equalization The starting point in this setting is always the concept of intensity histogram (or more precisely, the histogram of pixel intensity values)—a function that indicates the number of pixels in an image at each different intensity value found in that image. For instance, for the original version of Lena, we could issue the following command: In [27]: plt.figure(); ....: plt.hist(lena().flatten(), 256); ....: plt.show() The operations of histogram equalization improve the contrast of images by modifying the histogram in a way that most of the relevant intensities have the same impact. We can accomplish this enhancement by calling, from the module skimage.exposure, any of the functions equalize_hist (pure histogram equalization) or equalize_adaphist (contrast limited adaptive histogram equalization (CLAHE)). Note the obvious improvement after the application of the histogram equalization of the image skimage.data.moon. In the following examples, we include the corresponding histogram below all relevant images for comparison. 
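Since the lensfunpy calls themselves are not covered in this article, here is only a rough sketch of how such a lens correction might look, following the pattern in the lensfunpy API reference; the camera model, lens name, file names, and shot parameters are placeholders that you would replace with values matching your own equipment:

import cv2
import lensfunpy

db = lensfunpy.Database()
# placeholder camera and lens; query the database with your own maker/model strings
cam = db.find_cameras('NIKON CORPORATION', 'NIKON D3S')[0]
lens = db.find_lenses(cam, 'Nikon', 'Nikkor 28mm f/2.8D AF')[0]

image = cv2.imread('distorted.jpg')
height, width = image.shape[:2]

# build a modifier for this camera/lens combination and the shot parameters
mod = lensfunpy.Modifier(lens, cam.crop_factor, width, height)
mod.initialize(28.0, 2.8, 10.0)   # focal length (mm), aperture, subject distance (m)

# compute the undistorted pixel coordinates and remap the image accordingly
undistorted_coords = mod.apply_geometry_distortion()
fixed = cv2.remap(image, undistorted_coords, None, cv2.INTER_LANCZOS4)
cv2.imwrite('undistorted.jpg', fixed)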
A suitable code to perform this visualization could be as follows: def display(image, transform, bins): target = transform(image) plt.figure() plt.subplot(221) plt.imshow(image) plt.gray() plt.subplot(222) plt.imshow(target) plt.subplot(223) plt.hist(image.flatten(), bins) plt.subplot(224) plt.hist(target.flatten(), bins) plt.show() In [28]: from skimage.exposure import equalize_hist; ....: from skimage.data import moon In [29]: display(moon(), equalize_hist, 256) Intensity clipping/resizing A peak at the histogram indicates the presence of a particular intensity that is remarkably more predominant than its neighboring ones. If we desire to isolate intensities around a peak, we can do so using purely NumPy masking/clipping operations on the original image. If storing the result is not needed, we can request a quick visualization of the result by employing the command clim in the library matplotlib.pyplot. For instance, to isolate intensities around the highest peak of Lena (roughly, these are between 150 and 160), we could issue the following command: In [30]: plt.figure(); ....: plt.imshow(lena()); ....: plt.clim(vmin=150, vmax=160); ....: plt.show() Note how this operation, in spite of having reduced the representative range of intensities from 256 to 10, offers us a new image that keeps sufficient information to recreate the original one. Naturally, we can regard this operation also as a lossy compression method. Contrast enhancement An obvious drawback of clipping intensities is the loss of perceived lightness contrast. To overcome this loss, it is preferable to employ formulas that do not reduce the size of the range. Among the many available formulas conforming to this mathematical property, the most successful ones are those that replicate an optical property of the acquisition method. We explore the following three cases: Gamma correction: Human vision follows a power function, with greater sensitivity to relative differences between darker tones than between lighter ones. Each original image, as captured by the acquisition device, might allocate too many bits or too much bandwidth to highlight that humans cannot actually differentiate. Similarly, too few bits/bandwidth could be allocated to the darker areas of the image. By manipulation of this power function, we are capable of addressing the correct amount of bits and bandwidth. Sigmoid correction: Independently of the amount of bits and bandwidth, it is often desirable to maintain the perceived lightness contrast. Sigmoidal remapping functions were then designed based on empirical contrast enhancement model developed from the results of psychophysical adjustment experiments. Logarithmic correction: This is a purely mathematical process designed to spread the range of naturally low-contrast images by transforming to a logarithmic range. To perform gamma correction on images, we could employ the function adjust_gamma in the module skimage.exposure. The equivalent mathematical operation is the power-law relationship output = gain * input^gamma. Observe the great improvement in definition of the brighter areas of a stained micrograph of colonic glands, when we choose the exponent gamma=2.5 and no gain (gain=1.0): In [31]: from skimage.exposure import adjust_gamma; ....: from skimage.color import rgb2gray; ....: from skimage.data import immunohistochemistry In [32]: image = rgb2gray(immunohistochemistry()) In [33]: correction = lambda img: adjust_gamma(img, gamma=2.5, ....: gain=1.) 
Note the huge improvement in contrast in the lower right section of the micrograph, allowing a better description and differentiation of the observed objects: Immunohistochemical staining with hematoxylin counterstaining. This image was acquired at the Center for Microscopy And Molecular Imaging (CMMI). To perform sigmoid correction with given gain and cutoff coefficients, according to the formula output = 1/(1 + exp*(gain*(cutoff - input))), we employ the function adjust_sigmoid in skimage.exposure. For example, with gain=10.0 and cutoff=0.5 (the default values), we obtain the following enhancement: In [35]: from skimage.exposure import adjust_sigmoid In [36]: display(image[:256, :256], adjust_sigmoid, 256) Note the improvement in the definition of the walls of cells in the enhanced image: We have already explored logarithmic correction in the previous section, when enhancing the visualization of the frequency of an image. This is equivalent to applying a vectorized version of np.log1p to the intensities. The corresponding function in the scikit-image toolkit is adjust_log in the sub module exposure. Image restoration In this category of image editing, the purpose is to repair the damage produced by post or preprocessing of the image, or even the removal of distortion produced by the acquisition device. We explore two classic situations: Noise reduction Sharpening and blurring Noise reduction In mathematical imaging, noise is by definition a random variation of the intensity (or the color) produced by the acquisition device. Among all the possible types of noise, we acknowledge four key cases: Gaussian noise: We add to each pixel a value obtained from a random variable with normal distribution and a fixed mean. We usually allow the same variance on each pixel of the image, but it is feasible to change the variance depending on the location. Poisson noise: To each pixel, we add a value obtained from a random variable with Poisson distribution. Salt and pepper: This is a replacement noise, where some pixels are substituted by zeros (black or pepper), and some pixels are substituted by ones (white or salt). Speckle: This is a multiplicative kind of noise, where each pixel gets modified by the formula output = input + n * input. The value of the modifier n is a value obtained from a random variable with uniform distribution of fixed mean and variance. To emulate all these kinds of noise, we employ the utility random_noise from the module skimage.util. Let's illustrate the possibilities in a common image: In [37]: from skimage.data import camera; ....: from skimage.util import random_noise In [38]: gaussian_noise = random_noise(camera(), 'gaussian', ....: var=0.025); ....: poisson_noise = random_noise(camera(), 'poisson'); ....: saltpepper_noise = random_noise(camera(), 's&p', ....: salt_vs_pepper=0.45); ....: speckle_noise = random_noise(camera(), 'speckle', var=0.02) In [39]: variance_generator = lambda i,j: 0.25*(i+j)/1022. + 0.001; ....: variances = np.fromfunction(variance_generator,(512,512)); ....: lclvr_noise = random_noise(camera(), 'localvar', ....: local_vars=variances) In the last example, we have created a function that assigns a variance between 0.001 and 0.026 depending on the distance to the upper corner of an image. When we visualize the corresponding noisy version of skimage.data.camera, we see that the level of degradation gets stronger as we get closer to the lower right corner of the picture. 
The following is an example of visualization of the corresponding noisy images: The purpose of noise reduction is to remove as much of this unwanted signal, so we obtain an image as close to the original as possible. The trick, of course, is to do so without any previous knowledge of the properties of the noise. The most basic methods of denoising are the application of either a Gaussian or a median filter. We explored them both in the previous section. The former was presented as a smoothing filter (gaussian_filter), and the latter was discussed when we explored statistical filters (median_filter). They both offer decent noise removal, but they introduce unwanted artifacts as well. For example, the Gaussian filter does not preserve edges in images. The application of any of these methods is also not recommended if preserving texture information is needed. We have a few more advanced methods in the module skimage.restoration, able to tailor denoising to images with specific properties: denoise_bilateral: This is the bilateral filter. It is useful when preserving edges is important. denoise_tv_bregman, denoise_tv_chambolle: We will use this if we require a denoised image with small total variation. nl_means_denoising: The so-called non-local means denoising. This method ensures the best results to denoise areas of the image presenting texture. wiener, unsupervised_wiener: This is the Wiener-Hunt deconvolution. It is useful when we have knowledge of the point-spread function at acquisition time. Let's show you, by example, the performance of one of these methods on some of the noisy images we computed earlier: In [40]: from skimage.restoration import nl_means_denoising as dnoise In [41]: images = [gaussian_noise, poisson_noise, ....: saltpepper_noise, speckle_noise]; ....: names = ['Gaussian', 'Poisson', 'Salt & Pepper', 'Speckle'] In [42]: plt.figure() Out[42]: <matplotlib.figure.Figure at 0x118301490> In [43]: for index, image in enumerate(images): ....: output = dnoise(image, patch_size=5, patch_distance=7) ....: plt.subplot(2, 4, index+1) ....: plt.imshow(image) ....: plt.gray() ....: plt.title(names[index]) ....: plt.subplot(2, 4, index+5) ....: plt.imshow(output) ....: In [44]: plt.show() Under each noisy image, we have presented the corresponding result after employing, by nonlocal means, denoising: It is also possible to perform denoising by thresholding coefficient, provided we represent images with a transform. 
For example, to do a soft thresholding employing biorthonormal 2.8 wavelets, we will use the package PyWavelets:

In [45]: import pywt
In [46]: def dnoise(image, wavelet, noise_var):
   ....:     levels = int(np.floor(np.log2(image.shape[0])))
   ....:     coeffs = pywt.wavedec2(image, wavelet, level=levels)
   ....:     value = noise_var * np.sqrt(2 * np.log2(image.size))
   ....:     threshold = lambda x: pywt.thresholding.soft(x, value)
   ....:     coeffs = map(threshold, coeffs)
   ....:     return pywt.waverec2(coeffs, wavelet)
   ....:
In [47]: plt.figure()
Out[47]: <matplotlib.figure.Figure at 0x10e5ed790>
In [48]: for index, image in enumerate(images):
   ....:     output = dnoise(image, pywt.Wavelet('bior2.8'),
   ....:                     noise_var=0.02)
   ....:     plt.subplot(2, 4, index+1)
   ....:     plt.imshow(image)
   ....:     plt.gray()
   ....:     plt.title(names[index])
   ....:     plt.subplot(2, 4, index+5)
   ....:     plt.imshow(output)
   ....:
In [49]: plt.show()

Observe that the results are of comparable quality to those obtained with the previous method:

Sharpening and blurring

There are many situations that produce blurred images:

Incorrect focus at acquisition
Movement of the imaging system
The point-spread function of the imaging device (like in electron microscopy)
Graphic-art effects

For blurring images, we could replicate the effect of a point-spread function by performing convolution of the image with the corresponding kernel. The Gaussian filter that we used for denoising performs blurring in this fashion. In the general case, convolution with a given kernel can be done with the routine convolve from the module scipy.ndimage. For instance, for a constant kernel supported on a 10 x 10 square, we could do as follows:

In [50]: from scipy.ndimage import convolve;
   ....: from skimage.data import page
In [51]: kernel = np.ones((10, 10))/100.;
   ....: blurred = convolve(page(), kernel)

To emulate the blurring produced by movement, too, we could convolve with a kernel created as follows:

In [52]: from skimage.draw import polygon
In [53]: x_coords = np.array([14, 14, 24, 26, 24, 18, 18]);
   ....: y_coords = np.array([ 2, 18, 26, 24, 22, 18, 2]);
   ....: kernel_2 = np.zeros((32, 32));
   ....: kernel_2[polygon(x_coords, y_coords)] = 1.;
   ....: kernel_2 /= kernel_2.sum()
In [54]: blurred_motion = convolve(page(), kernel_2)

In order to reverse the effects of convolution when we have knowledge of the degradation process, we perform deconvolution. For example, if we have knowledge of the point-spread function, in the module skimage.restoration we have implementations of the Wiener filter (wiener), the unsupervised Wiener filter (unsupervised_wiener), and Lucy-Richardson deconvolution (richardson_lucy). We perform deconvolution on blurred images by applying a Wiener filter. Note the huge improvement in the readability of the text:

In [55]: from skimage.restoration import wiener
In [56]: deconv = wiener(blurred, kernel, balance=0.025, clip=False)

Summary

In this article, we have seen how the SciPy stack helps us solve many problems in mathematical imaging, from the effective representation of digital images to modifying, restoring, or analyzing them. Although certainly not exhaustive, this article scratches but the surface of this challenging field of engineering.

Resources for Article:

Further resources on this subject:
Symbolizers [article]
Machine Learning [article]
Analyzing a Complex Dataset [article]

Object Detection Using Image Features in JavaScript

Packt
05 Oct 2015
16 min read
In this article by Foat Akhmadeev, author of the book Computer Vision for the Web, we will discuss how we can detect an object on an image using several JavaScript libraries. In particular, we will see techniques such as FAST feature detection, and BRIEF and ORB descriptor matching. Finally, the object detection example will be presented.

There are many ways to detect an object on an image. Color object detection and the detection of changes in the intensity of an image are just simple computer vision methods. There are some fundamental things that every computer vision enthusiast should know. The libraries we use here are:

JSFeat (http://inspirit.github.io/jsfeat/)
tracking.js (http://trackingjs.com)

(For more resources related to this topic, see here.)

Detecting key points

What information do we get when we see an object on an image? An object usually consists of some regular parts or unique points, which represent this particular object. Of course, we can compare each pixel of an image, but that is not a good thing in terms of computational speed. Probably, we can take unique points randomly, thus reducing the computation cost significantly, but we will still not get much information from random points. Using all of the information, we can get too much noise and lose important parts of an object representation. Eventually, we need to consider that both ideas, getting all pixels and selecting random pixels, are really bad. So, what can we do in that case?

We are working with a grayscale image and we need to get unique points of an image. Then, we need to focus on the intensity information. For example, we can get object edges with the Canny edge detector or the Sobel filter. We are closer to the solution! But still not close enough. What if we have a long edge? Don't you think that it is a bit bad that we have too many unique pixels that lie on this edge? An edge of an object has end points or corners; if we reduce our edge to those corners, we will get enough unique pixels and remove unnecessary information.

There are various methods of getting keypoints from an image, many of which extract corners as those keypoints. To get them, we will use the FAST (Features from Accelerated Segment Test) algorithm. It is really simple and you can easily implement it by yourself if you want. But you do not need to. The algorithm implementation is provided by both the tracking.js and JSFeat libraries. The idea of the FAST algorithm can be captured from the following image:

Suppose we want to check whether the pixel P is a corner. We will check 16 pixels around it. If at least 9 pixels in an arc around P are much darker or brighter than the value of P, then we say that P is a corner. How much darker or brighter should the pixels around P be? The decision is made by applying a threshold to the difference between the value of P and the value of the pixels around P.

A practical example

First, we will start with an example of FAST corner detection for the tracking.js library. Before we do anything else, we can set the detector threshold. The threshold defines the minimum difference between a tested corner and the points around it:

tracking.Fast.THRESHOLD = 30;

It is usually a good practice to apply a Gaussian blur on an image before we start the method.
It significantly reduces the noise of an image: var imageData = context.getImageData(0, 0, cols, rows); var gray = tracking.Image.grayscale(imageData.data, cols, rows, true); var blurred4 = tracking.Image.blur(gray, cols, rows, 3); Remember that the blur function returns a 4 channel array—RGBA. In that case, we need to convert it to 1-channel. Since we can easily skip other channels, it should not be a problem: var blurred1 = new Array(blurred4.length / 4); for (var i = 0, j = 0; i < blurred4.length; i += 4, ++j) { blurred1[j] = blurred4[i]; } Next, we run a corner detection function on our image array: var corners = tracking.Fast.findCorners(blurred1, cols, rows); The result returns an array with its length twice the length of the corner's number. The array is returned in the format [x0,y0,x1,y1,...]. Where [xn, yn] are coordinates of a detected corner. To print the result on a canvas, we will use the fillRect function: for (i = 0; i < corners.length; i += 2) { context.fillStyle = '#0f0'; context.fillRect(corners[i], corners[i + 1], 3, 3); } Let's see an example with the JSFeat library,. for which the steps are very similar to that of tracking.js. First, we set the global threshold with a function: jsfeat.fast_corners.set_threshold(30); Then, we apply a Gaussian blur to an image matrix and run the corner detection: jsfeat.imgproc.gaussian_blur(matGray, matBlurred, 3); We need to preallocate keypoints for a corners result. The keypoint_t function is just a new type which is useful for keypoints of an image. The first two parameters represent coordinates of a point and the other parameters set: point score (which checks whether the point is good enough to be a key point), point level (which you can use it in an image pyramid, for example), and point angle (which is usually used for the gradient orientation): var corners = []; var i = cols * rows; while (--i >= 0) { corners[i] = new jsfeat.keypoint_t(0, 0, 0, 0, -1); } After all this, we execute the FAST corner detection method. As a last parameter of detection function, we define a border size. The border is used to constrain circles around each possible corner. For example, you cannot precisely say whether the point is a corner for the [0,0] pixel. There is no [0, -3] pixel in our matrix: var count = jsfeat.fast_corners.detect(matBlurred, corners, 3); Since we preallocated the corners, the function returns the number of calculated corners for us. The result returns an array of structures with the x and y fields, so we can print it using those fields: for (var i = 0; i < count; i++) { context.fillStyle = '#0f0'; context.fillRect(corners[i].x, corners[i].y, 3, 3); } The result is nearly the same for both algorithms. The difference is in some parts of realization. Let's look at the following example: From left to right: tracking.js without blur, JSFeat without blur, tracking.js and JSFeat with blur. If you look closely, you can see the difference between tracking.js and JSFeat results, but it is not easy to spot it. Look at how much noise was reduced by applying just a small 3 x 3 Gaussian filter! A lot of noisy points were removed from the background. And now the algorithm can focus on points that represent flowers and the pattern of the vase. We have extracted key points from our image, and we successfully reached the goal of reducing the number of keypoints and focusing on the unique points of an image. Now, we need to compare or match those points somehow. How we can do that? 
Descriptors and object matching

Image features by themselves are a bit useless. Yes, we have found unique points on an image. But what did we get? Only values of pixels and that's it. If we try to compare these values, it will not give us much information. Moreover, if we change the overall image brightness, we will not find the same keypoints on the same image! Taking all of this into account, we need the information that surrounds our key points. Moreover, we need a method to efficiently compare this information. First, we need to describe the image features, which is what image descriptors are for. In this part, we will see how these descriptors can be extracted and matched. The tracking.js and JSFeat libraries provide different methods for image descriptors. We will discuss both.

BRIEF and ORB descriptors

The descriptors theory is focused on changes in image pixels' intensities. The tracking.js library provides the BRIEF (Binary Robust Independent Elementary Features) descriptors and its JSFeat extension—ORB (Oriented FAST and Rotated BRIEF). As we can see from the ORB naming, it is rotation invariant. This means that even if you rotate an object, the algorithm can still detect it. Moreover, the authors of the JSFeat library provide an example using the image pyramid, which is scale invariant too.

Let's start by explaining BRIEF, since it is the source for ORB descriptors. As a first step, the algorithm takes computed image features and, for each feature, it takes unique pairs of elements around it. Based on these pairs' intensities, it forms a binary string. For example, if we have a pair of positions i and j, and if I(i) < I(j) (where I(pos) means the image value at the position pos), then the result is 1, else 0. We add this result to the binary string. We do that for N pairs, where N is taken as a power of 2 (128, 256, 512).

Since descriptors are just binary strings, we can compare them in an efficient manner. To match these strings, the Hamming distance is usually used. It shows the minimum number of substitutions required to change one string to another. For example, if we have two binary strings, 10011 and 11001, the Hamming distance between them is 2, since we need to change 2 bits of information to change the first string to the second.

The JSFeat library provides the functionality to apply ORB descriptors. The core idea is very similar to BRIEF. However, there are two major differences:

The implementation is scale invariant, since the descriptors are computed for an image pyramid.
The descriptors are rotation invariant; the direction is computed using the intensity of the patch around a feature. Using this orientation, ORB manages to compute the BRIEF descriptor in a rotation-invariant manner.

Implementation of descriptors and their matching

Our goal is to find an object from a template on a scene image. We can do that by finding features and descriptors on both images and matching the descriptors from the template to the image. We start from the tracking.js library and BRIEF descriptors. The first thing that we can do is set the number of location pairs:

tracking.Brief.N = 512

By default, it is 256, but you can choose a higher value. The larger the value, the more information you will get, and the more memory and computational cost it requires. Before starting the computation, do not forget to apply the Gaussian blur to reduce the image noise. Next, we find the FAST corners and compute descriptors on both images.
Here and in the next example, we use the suffix Object for a template image and Scene for a scene image: var cornersObject = tracking.Fast.findCorners(grayObject, colsObject, rowsObject); var cornersScene = tracking.Fast.findCorners(grayScene, colsScene, rowsScene); var descriptorsObject = tracking.Brief.getDescriptors(grayObject, colsObject, cornersObject); var descriptorsScene = tracking.Brief.getDescriptors(grayScene, colsScene, cornersScene); Then we do the matching: var matches = tracking.Brief.reciprocalMatch(cornersObject, descriptorsObject, cornersScene, descriptorsScene); We need to pass information of both corners and descriptors to the function, since it returns coordinate information as a result. Next, we print both images on one canvas. To draw the matches using this trick, we need to shift our scene keypoints for the width of a template image as a keypoint1 matching returns a point on a template and keypoint2 returns a point on a scene image. The keypoint1 and keypoint2 are arrays with x and y coordinates at 0 and 1 indexes, respectively: for (var i = 0; i < matches.length; i++) { var color = '#' + Math.floor(Math.random() * 16777215).toString(16); context.fillStyle = color; context.strokeStyle = color; context.fillRect(matches[i].keypoint1[0], matches[i].keypoint1[1], 5, 5); context.fillRect(matches[i].keypoint2[0] + colsObject, matches[i].keypoint2[1], 5, 5); context.beginPath(); context.moveTo(matches[i].keypoint1[0], matches[i].keypoint1[1]); context.lineTo(matches[i].keypoint2[0] + colsObject, matches[i].keypoint2[1]); context.stroke(); } The JSFeat library provides most of the code for pyramids and scale invariant features not in the library, but in the examples, which are available on https://github.com/inspirit/jsfeat/blob/gh-pages/sample_orb.html. We will not provide the full code here, because it requires too much space. But do not worry, we will highlight main topics here. Let's start from functions that are included in the library. First, we need to preallocate the descriptors matrix, where 32 is the length of a descriptor and 500 is the maximum number of descriptors. Again, 32 is a power of two: var descriptors = new jsfeat.matrix_t(32, 500, jsfeat.U8C1_t); Then, we compute the ORB descriptors for each corner, we need to do that for both template and scene images: jsfeat.orb.describe(matBlurred, corners, num_corners, descriptors); The function uses global variables, which mainly define input descriptors and output matching: function match_pattern() The result match_t contains the following fields: screen_idx: This is the index of a scene descriptor pattern_lev: This is the index of a pyramid level pattern_idx: This is the index of a template descriptor Since ORB works with the image pyramid, it returns corners and matches for each level: var s_kp = screen_corners[m.screen_idx]; var p_kp = pattern_corners[m.pattern_lev][m.pattern_idx]; We can print each matching as shown here. Again, we use Shift, since we computed descriptors on separate images, but print the result on one canvas: context.fillRect(p_kp.x, p_kp.y, 4, 4); context.fillRect(s_kp.x + shift, s_kp.y, 4, 4); Working with a perspective Let's take a step away. Sometimes, an object you want to detect is affected by a perspective distortion. In that case, you may want to rectify an object plane. For example, a building wall: Looks good, doesn't it? How do we do that? 
Let's look at the code:

var imgRectified = new jsfeat.matrix_t(mat.cols, mat.rows, jsfeat.U8_t | jsfeat.C1_t);
var transform = new jsfeat.matrix_t(3, 3, jsfeat.F32_t | jsfeat.C1_t);
jsfeat.math.perspective_4point_transform(transform,
    0, 0, 0, 0, // first pair x1_src, y1_src, x1_dst, y1_dst
    640, 0, 640, 0, // x2_src, y2_src, x2_dst, y2_dst and so on.
    640, 480, 640, 480,
    0, 480, 180, 480);
jsfeat.matmath.invert_3x3(transform, transform);
jsfeat.imgproc.warp_perspective(mat, imgRectified, transform, 255);

First, as we did earlier, we define a result matrix object. Next, we assign a matrix for the image perspective transformation. We calculate it based on four pairs of corresponding points. For example, the last, that is, the fourth point of the original image, which is [0, 480], should be projected to the point [180, 480] on the rectified image. Here, the first coordinate refers to x and the second to y. Then, we invert the transform matrix to be able to apply it to the original image—the mat variable. We pick the background color as white (255 for an unsigned byte). As a result, we get a nice image without any perspective distortion.

Finding an object location

Returning to our primary goal, we found a match. That is great. But what we did not do is find the object's location. There is no function for that in the tracking.js library, but JSFeat provides such functionality in the examples section.

First, we need to compute a perspective transform matrix. This is why we discussed the example of such a transformation previously. We have points from two images but we do not have a transformation for the whole image. First, we define a transform matrix:

var homo3x3 = new jsfeat.matrix_t(3, 3, jsfeat.F32C1_t);

To compute the homography, we need only four points. But after the matching, we get too many. In addition, there can be noisy points, which we will need to skip somehow. For that, we use the RANSAC (Random Sample Consensus) algorithm. It is an iterative method for estimating a mathematical model from a dataset that contains outliers (noise). It estimates outliers and generates a model that is computed without the noisy data.

Before we start, we need to define the algorithm parameters. The first parameter is a match mask, where all matches will be marked as good (1) or bad (0):

var match_mask = new jsfeat.matrix_t(500, 1, jsfeat.U8C1_t);

Our mathematical model to find:

var mm_kernel = new jsfeat.motion_model.homography2d();

Minimum number of points to estimate a model (4 points to get a homography):

var num_model_points = 4;

Maximum threshold to classify a data point as an inlier or a good match:

var reproj_threshold = 3;

Finally, the variable that holds the main parameters; the last two arguments define the maximum ratio of outliers and the probability of success, that is, the algorithm stops at the point where the number of inliers is 99 percent:

var ransac_param = new jsfeat.ransac_params_t(num_model_points,
    reproj_threshold, 0.5, 0.99);

Then, we run the RANSAC algorithm. The last parameter represents the maximum number of iterations for the algorithm:

jsfeat.motion_estimator.ransac(ransac_param, mm_kernel,
    object_xy, screen_xy, count, homo3x3, match_mask, 1000);

The shape finding can be applied to both the tracking.js and JSFeat libraries; you just need to set the matches as object_xy and screen_xy, where those arguments must hold an array of objects with the x and y fields.
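Following the note above, a minimal sketch of how those arrays could be assembled from the tracking.js matches computed earlier might look like this (this is illustrative glue code, not part of either library):

// Build the point lists expected by the RANSAC call from the BRIEF matches.
// keypoint1 belongs to the template image, keypoint2 to the scene image.
// Note that match_mask above was allocated for up to 500 matches.
var object_xy = [];
var screen_xy = [];
for (var i = 0; i < matches.length; i++) {
  object_xy.push({ x: matches[i].keypoint1[0], y: matches[i].keypoint1[1] });
  screen_xy.push({ x: matches[i].keypoint2[0], y: matches[i].keypoint2[1] });
}
var count = matches.length;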
After we find the transformation matrix, we compute the projected shape of the object on a new image:

var shape_pts = tCorners(homo3x3.data, colsObject, rowsObject);

After the computation is done, we draw the computed shapes on our images:

As we can see, our program successfully found the object in both cases. Actually, both methods can show different performance; it mainly depends on the thresholds you set.

Summary

Image features and descriptor matching are powerful tools for object detection. Both the JSFeat and tracking.js libraries provide different functionalities to match objects using these features. In addition to this, the JSFeat project contains algorithms for object finding. These methods can be useful for tasks such as unique object detection, face tracking, and creating a human interface by tracking various objects very efficiently.

Resources for Article:

Further resources on this subject:
Welcome to JavaScript in the full stack [article]
Introducing JAX-RS API [article]
Using Google Maps APIs with Knockout.js [article]

Data Around Us

Packt
29 Sep 2015
25 min read
In this article by Gergely Daróczi, author of the book Mastering Data Analysis with R, we will discuss spatial data, also known as geospatial data, which identifies geographic locations, such as natural or constructed features around us. Although all observations have some spatial content, such as the location of the observation, this is out of most data analysis tools' range due to the complex nature of spatial information; alternatively, the spatiality might not be that interesting (at first sight) in the given research topic. On the other hand, analyzing spatial data can reveal some very important underlying structures of the data, and it is well worth spending time visualizing the differences and similarities between close or far data points. In this article, we are going to help with this and will use a variety of R packages to:

Retrieve geospatial information from the Internet
Visualize points and polygons on a map

(For more resources related to this topic, see here.)

Geocoding

We will use the hflights dataset to demonstrate how one can deal with data bearing spatial information. To this end, let's aggregate our dataset, but instead of generating daily data, let's view the aggregated characteristics of the airports. For the sake of performance, we will use the data.table package:

> library(hflights)
> library(data.table)
> dt <- data.table(hflights)[, list(
+     N = .N,
+     Cancelled = sum(Cancelled),
+     Distance = Distance[1],
+     TimeVar = sd(ActualElapsedTime, na.rm = TRUE),
+     ArrDelay = mean(ArrDelay, na.rm = TRUE)), by = Dest]

So we have loaded and then immediately transformed the hflights dataset to a data.table object. At the same time, we aggregated by the destination of the flights to compute:

The number of rows
The number of cancelled flights
The distance
The standard deviation of the elapsed time of the flights
The arithmetic mean of the delays

The resulting R object looks like this:

> str(dt)
Classes 'data.table' and 'data.frame': 116 obs. of 6 variables:
 $ Dest     : chr "DFW" "MIA" "SEA" "JFK" ...
 $ N        : int 6653 2463 2615 695 402 6823 4893 5022 6064 ...
 $ Cancelled: int 153 24 4 18 1 40 40 27 33 28 ...
 $ Distance : int 224 964 1874 1428 3904 305 191 140 1379 862 ...
 $ TimeVar  : num 10 12.4 16.5 19.2 15.3 ...
 $ ArrDelay : num 5.961 0.649 9.652 9.859 10.927 ...
 - attr(*, ".internal.selfref")=<externalptr>

So we have 116 observations all around the world and five variables describing those. Although this seems to be a spatial dataset, we have no geospatial identifiers that a computer can understand per se, so let's fetch the geocodes of these airports from the Google Maps API via the ggmap package. First, let's see how it works when we are looking for the geo-coordinates of Houston:

> library(ggmap)
> (h <- geocode('Houston, TX'))
Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Houston,+TX&sensor=false
       lon      lat
1 -95.3698 29.76043

So the geocode function can return the matched latitude and longitude of the string we sent to Google. Now let's do the very same thing for all flight destinations:

> dt[, c('lon', 'lat') := geocode(Dest)]

Well, this took some time, as we had to make 116 separate queries to the Google Maps API. Please note that Google limits you to 2,500 queries a day without authentication, so do not run this on a large dataset. There is a helper function in the package, called geocodeQueryCheck, which can be used to check the remaining number of free queries for the day.
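As a quick illustration, the quota check is a single call; the exact message it prints depends on your ggmap version:

> geocodeQueryCheck()   # reports the number of free geocoding queries left for the day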
Some of the methods and functions we plan to use in some later sections of this article do not support data.table, so let's fall back to the traditional data.frame format and also print the structure of the current object: > str(setDF(dt)) 'data.frame': 116 obs. of 8 variables: $ Dest : chr "DFW" "MIA" "SEA" "JFK" ... $ N : int 6653 2463 2615 695 402 6823 4893 5022 6064 ... $ Cancelled: int 153 24 4 18 1 40 40 27 33 28 ... $ Distance : int 224 964 1874 1428 3904 305 191 140 1379 862 ... $ TimeVar : num 10 12.4 16.5 19.2 15.3 ... $ ArrDelay : num 5.961 0.649 9.652 9.859 10.927 ... $ lon : num -97 136.5 -122.3 -73.8 -157.9 ... $ lat : num 32.9 34.7 47.5 40.6 21.3 ... This was pretty quick and easy, wasn't it? Now that we have the longitude and latitude values of all the airports, we can try to show these points on a map. Visualizing point data in space For the first time, let's keep it simple and load some package-bundled polygons as the base map. To this end, we will use the maps package. After loading it, we use the map function to render the polygons of the United States of America, add a title, and then some points for the airports and also for Houston with a slightly modified symbol: > library(maps) > map('state') > title('Flight destinations from Houston,TX') > points(h$lon, h$lat, col = 'blue', pch = 13) > points(dt$lon, dt$lat, col = 'red', pch = 19) And showing the airport names on the plot is pretty easy as well: we can use the well-known functions from the base graphics package. Let's pass the three character names as labels to the text function with a slightly increased y value to shift the preceding text the previously rendered data points: > text(dt$lon, dt$lat + 1, labels = dt$Dest, cex = 0.7) Now we can also specify the color of the points to be rendered. This feature can be used to plot our first meaningful map to highlight the number of flights in 2011 to different parts of the USA: > map('state') > title('Frequent flight destinations from Houston,TX') > points(h$lon, h$lat, col = 'blue', pch = 13) > points(dt$lon, dt$lat, pch = 19, + col = rgb(1, 0, 0, dt$N / max(dt$N))) > legend('bottomright', legend = round(quantile(dt$N)), pch = 19, + col = rgb(1, 0, 0, quantile(dt$N) / max(dt$N)), box.col = NA) So the intensity of red shows the number of flights to the given points (airports); the values range from 1 to almost 10,000. Probably it would be more meaningful to compute these values on a state level, as there are many airports, very close to each other, that might be better aggregated at a higher administrative area level. To this end, we load the polygon of the states, match the points of interest (airports) with the overlaying polygons (states), and render the polygons as a thematic map instead of points, like we did on the previous pages. Finding polygon overlays of point data We already have all the data we need to identify the parent state of each airport. The dt dataset includes the geo-coordinates of the locations, and we managed to render the states as polygons with the map function. Actually, this latter function can return the underlying dataset without rendering a plot: > str(map_data <- map('state', plot = FALSE, fill = TRUE)) List of 4 $ x : num [1:15599] -87.5 -87.5 -87.5 -87.5 -87.6 ... $ y : num [1:15599] 30.4 30.4 30.4 30.3 30.3 ... $ range: num [1:4] -124.7 -67 25.1 49.4 $ names: chr [1:63] "alabama" "arizona" "arkansas" "california" ... 
- attr(*, "class")= chr "map" So we have around 16,000 points describing the boundaries of the US states, but this map data is more detailed than we actually need (see for example the name of the polygons starting with Washington): > grep('^washington', map_data$names, value = TRUE) [1] "washington:san juan island" "washington:lopez island" [3] "washington:orcas island" "washington:whidbey island" [5] "washington:main" In short, the non-connecting parts of a state are defined as separate polygons. To this end, let's save a list of the state names without the string after the colon: > states <- sapply(strsplit(map_data$names, ':'), '[[', 1) We will use this list as the basis of aggregation from now on. Let's transform this map dataset into another class of object, so that we can use the powerful features of the sp package. We will use the maptools package to do this transformation: > library(maptools) > us <- map2SpatialPolygons(map_data, IDs = states, + proj4string = CRS("+proj=longlat +datum=WGS84")) An alternative way of getting the state polygons might be to directly load those instead of transforming from other data formats as described earlier. To this end, you may find the raster package especially useful to download free map shapefiles from gadm.org via the getData function. Although these maps are way too detailed for such a simple task, you can always simplify those—for example, with the gSimplify function of the rgeos package. So we have just created an object called us, which includes the polygons of map_data for each state with the given projection. This object can be shown on a map just like we did previously, although you should use the general plot method instead of the map function: > plot(us) Besides this, however, the sp package supports so many powerful features! For example, it's very easy to identify the overlay polygons of the provided points via the over function. As this function name conflicts with the one found in the grDevices package, it's better to refer to the function along with the namespace using a double colon: > library(sp) > dtp <- SpatialPointsDataFrame(dt[, c('lon', 'lat')], dt, + proj4string = CRS("+proj=longlat +datum=WGS84")) > str(sp::over(us, dtp)) 'data.frame': 49 obs. of 8 variables: $ Dest : chr "BHM" "PHX" "XNA" "LAX" ... $ N : int 2736 5096 1172 6064 164 NA NA 2699 3085 7886 ... $ Cancelled: int 39 29 34 33 1 NA NA 35 11 141 ... $ Distance : int 562 1009 438 1379 926 NA NA 1208 787 689 ... $ TimeVar : num 10.1 13.61 9.47 15.16 13.82 ... $ ArrDelay : num 8.696 2.166 6.896 8.321 -0.451 ... $ lon : num -86.8 -112.1 -94.3 -118.4 -107.9 ... $ lat : num 33.6 33.4 36.3 33.9 38.5 ... What happened here? First, we passed the coordinates and the whole dataset to the SpatialPointsDataFrame function, which stored our data as spatial points with the given longitude and latitude values. Next we called the over function to left-join the values of dtp to the US states. An alternative way of identifying the state of a given airport is to ask for more detailed information from the Google Maps API. By changing the default output argument of the geocode function, we can get all address components for the matched spatial object, which of course includes the state as well. Look for example at the following code snippet: geocode('LAX','all')$results[[1]]$address_components Based on this, you might want to get a similar output for all airports and filter the list for the short name of the state. 
The rlist package would be extremely useful in this task, as it offers some very convenient ways of manipulating lists in R. The only problem here is that we matched only one airport to the states, which is definitely not okay. See for example the fourth column in the earlier output: it shows LAX as the matched airport for California (returned by states[4]), although there are many others there as well. To overcome this issue, we can do at least two things. First, we can use the returnList argument of the over function to return all matched rows of dtp, and we will then post-process that data: > str(sapply(sp::over(us, dtp, returnList = TRUE), + function(x) sum(x$Cancelled))) Named int [1:49] 51 44 34 97 23 0 0 35 66 149 ... - attr(*, "names")= chr [1:49] "alabama" "arizona" "arkansas" ... So we created and called an anonymous function that will sum up the Cancelled values of the data.frame in each element of the list returned by over. Another, probably cleaner, approach is to redefine dtp to only include the related values and pass a function to over to do the summary: > dtp <- SpatialPointsDataFrame(dt[, c('lon', 'lat')], + dt[, 'Cancelled', drop = FALSE], + proj4string = CRS("+proj=longlat +datum=WGS84")) > str(cancels <- sp::over(us, dtp, fn = sum)) 'data.frame': 49 obs. of 1 variable: $ Cancelled: int 51 44 34 97 23 NA NA 35 66 149 ... Either way, we have a vector to merge back to the US state names: > val <- cancels$Cancelled[match(states, row.names(cancels))] And to update all missing values to zero (as the number of cancelled flights in a state without any airport is not missing data, but exactly zero for sure): > val[is.na(val)] <- 0 Plotting thematic maps Now we have everything to create our first thematic map. Let's pass the val vector to the previously used map function (or plot it using the us object), specify a plot title, add a blue point for Houston, and then create a legend, which shows the quantiles of the overall number of cancelled flights as a reference: > map("state", col = rgb(1, 0, 0, sqrt(val/max(val))), fill = TRUE) > title('Number of cancelled flights from Houston to US states') > points(h$lon, h$lat, col = 'blue', pch = 13) > legend('bottomright', legend = round(quantile(val)), + fill = rgb(1, 0, 0, sqrt(quantile(val)/max(val))), box.col = NA) Please note that, instead of a linear scale, we decided to compute the square root of the relative values to define the intensity of the fill color, so that we can visually highlight the differences between the states. This was necessary as most flight cancellations happened in Texas (748), and there were no more than 150 cancelled flights in any other state (with the average being around 45). You can also easily load ESRI shape files or other geospatial vector data formats into R as points or polygons with a bunch of packages already discussed and a few others as well, such as the maptools, rgdal, dismo, raster, or shapefile packages. Another, probably easier, way to generate country-level thematic maps, especially choropleth maps, is to load the rworldmap package made by Andy South, and rely on the convenient mapCountryData function. Rendering polygons around points Besides thematic maps, another really useful way of presenting spatial data is to draw artificial polygons around the data points based on the data values. This is especially useful if there is no available polygon shape file to be used to generate a thematic map. 
A level plot, contour plot, or isopleths, might be an already familiar design from tourist maps, where the altitude of the mountains is represented by a line drawn around the center of the hill at the very same levels. This is a very smart approach having maps present the height of hills—projecting this third dimension onto a 2-dimensional image. Now let's try to replicate this design by considering our data points as mountains on the otherwise flat map. We already know the heights and exact geo-coordinates of the geometric centers of these hills (airports); the only challenge here is to draw the actual shape of these objects. In other words: Are these mountains connected? How steep are the hillsides? Should we consider any underlying spatial effects in the data? In other words, can we actually render these as mountains with a 3D shape instead of plotting independent points in space? If the answer for the last question is positive, then we can start trying to answer the other questions by fine-tuning the plot parameters. For now, let's simply suppose that there is a spatial effect in the underlying data, and it makes sense to visualize the data in such a way. Later, we will have the chance to disprove or support this statement either by analyzing the generated plots, or by building some geo-spatial models—some of these will be discussed later, in the Spatial Statistics section. Contour lines First, let's expand our data points into a matrix with the fields package. The size of the resulting R object is defined arbitrarily but, for the given number of rows and columns, which should be a lot higher to generate higher resolution images, 256 is a good start: > library(fields) > out <- as.image(dt$ArrDelay, x = dt[, c('lon', 'lat')], + nrow = 256, ncol = 256) The as.image function generates a special R object, which in short includes a 3‑dimensional matrix-like data structure, where the x and y axes represent the longitude and latitude ranges of the original data respectively. To simplify this even more, we have a matrix with 256 rows and 256 columns, where each of those represents a discrete value evenly distributed between the lowest and highest values of the latitude and longitude. And on the z axis, we have the ArrDelay values—which are in most cases of course missing: > table(is.na(out$z)) FALSE TRUE 112 65424 What does this matrix look like? It's better to see what we have at the moment: > image(out) Well, this does not seem to be useful at all. What is shown there? We rendered the x and y dimensions of the matrix with z colors here, and most tiles of this map are empty due to the high amount of missing values in z. Also, it's pretty straightforward now that the dataset included many airports outside the USA as well. How does it look if we focus only on the USA? > image(out, xlim = base::range(map_data$x, na.rm = TRUE), + ylim = base::range(map_data$y, na.rm = TRUE)) An alternative and more elegant approach to rendering only the US part of the matrix would be to drop the non-US airports from the database before actually creating the out R object. Although we will continue with this example for didactic purposes, with real data make sure you concentrate on the target subset of your data instead of trying to smooth and model unrelated data points as well. A lot better! So we have our data points as a tile, now let's try to identify the slope of these mountain peaks, to be able to render them on a future map. 
This can be done by smoothing the matrix: > look <- image.smooth(out, theta = .5) > table(is.na(look$z)) FALSE TRUE 14470 51066 As can be seen in the preceding table, this algorithm successfully eliminated many missing values from the matrix. The image.smooth function basically reused our initial data point values in the neighboring tiles, and computed some kind of average for the conflicting overrides. This smoothing algorithm results in the following arbitrary map, which does not respect any political or geographical boundaries: > image(look) It would be really nice to plot these artificial polygons along with the administrative boundaries, so let's clear out all cells that do not belong to the territory of the USA. We will use the point.in.polygon function from the sp package to do so: > usa_data <- map('usa', plot = FALSE, region = 'main') > p <- expand.grid(look$x, look$y) > library(sp) > n <- which(point.in.polygon(p$Var1, p$Var2, + usa_data$x, usa_data$y) == 0) > look$z[n] <- NA In a nutshell, we have loaded the main polygon of the USA without any sub-administrative areas, and verified our cells in the look object, if those are overlapping the polygon. Then we simply reset the value of the cell, if not. The next step is to render the boundaries of the USA, plot our smoothed contour plot, then add some eye-candy in the means of the US states and, the main point of interest, the airport: > map("usa") > image(look, add = TRUE) > map("state", lwd = 3, add = TRUE) > title('Arrival delays of flights from Houston') > points(dt$lon, dt$lat, pch = 19, cex = .5) > points(h$lon, h$lat, pch = 13) Now this is pretty neat, isn't it? Voronoi diagrams An alternative way of visualizing point data with polygons is to generate Voronoi cells between them. In short, the Voronoi map partitions the space into regions around the data points by aligning all parts of the map to one of the regions to minimize the distance from the central data points. This is extremely easy to interpret, and also to implement in R. The deldir package provides a function with the very same name for Delaunay triangulation: > library(deldir) > map("usa") > plot(deldir(dt$lon, dt$lat), wlines = "tess", lwd = 2, + pch = 19, col = c('red', 'darkgray'), add = TRUE) Here, we represented the airports with red dots, as we did before, but also added the Dirichlet tessellation (Voronoi cells) rendered as dark-gray dashed lines. For more options on how to fine-tune the results, see the plot.deldir method. In the next section, let's see how to improve this plot by adding a more detailed background map to it. Satellite maps There are many R packages on CRAN that can fetch data from Google Maps, Stamen, Bing, or OpenStreetMap—even some of the packages we previously used in this article, like the ggmap package, can do this. Similarly, the dismo package also comes with both geo-coding and Google Maps API integration capabilities, and there are some other packages focused on that latter, such as the RgoogleMaps package. Now we will use the OpenStreetMap package, mainly because it supports not only the awesome OpenStreetMap database back-end, but also a bunch of other formats as well. 
For example, we can render really nice terrain maps via Stamen: > library(OpenStreetMap) > map <- openmap(c(max(map_data$y, na.rm = TRUE), + min(map_data$x, na.rm = TRUE)), + c(min(map_data$y, na.rm = TRUE), + max(map_data$x, na.rm = TRUE)), + type = 'stamen-terrain') So we defined the left upper and right lower corners of the map we need, and also specified the map style to be a satellite map. As the data by default arrives from the remote servers with the Mercator projections, we first have to transform that to WGS84 (we used this previously), so that we can render the points and polygons on the top of the fetched map: > map <- openproj(map, + projection = '+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs') And Showtime at last: > plot(map) > plot(deldir(dt$lon, dt$lat), wlines = "tess", lwd = 2, + col = c('red', 'black'), pch = 19, cex = 0.5, add = TRUE) This seems to be a lot better compared to the outline map we created previously. Now you can try some other map styles as well, such as mapquest-aerial, or some of the really nice-looking cloudMade designs. Interactive maps Besides being able to use Web-services to download map tiles for the background of the maps created in R, we can also rely on some of those to generate truly interactive maps. One of the best known related services is the Google Visualization API, which provides a platform for hosting visualizations made by the community; you can also use it to share maps you've created with others. Querying Google Maps In R, you can access this API via the googleVis package written and maintained by Markus Gesmann and Diego de Castillo. Most functions of the package generate HTML and JavaScript code that we can directly view in a Web browser as an SVG object with the base plot function; alternatively, we can integrate them in a Web page, for example via the IFRAME HTML tag. The gvisIntensityMap function takes a data.frame with country ISO or USA state codes and the actual data to create a simple intensity map. We will use the cancels dataset we created in the Finding Polygon Overlays of Point Data section but, before that, we have to do some data transformations. Let's add the state name as a new column to the data.frame, and replace the missing values with zero: > cancels$state <- rownames(cancels) > cancels$Cancelled[is.na(cancels$Cancelled)] <- 0 Now it's time to load the package and pass the data along with a few extra parameters, signifying that we want to generate a state-level US map: > library(googleVis) > plot(gvisGeoChart(cancels, 'state', 'Cancelled', + options = list( + region = 'US', + displayMode = 'regions', + resolution = 'provinces'))) The package also offers opportunities to query the Google Map API via the gvisMap function. We will use this feature to render the airports from the dt dataset as points on a Google Map with an auto-generated tooltip of the variables. But first, as usual, we have to do some data transformations again. The location argument of the gvisMap function takes the latitude and longitude values separated by a colon: > dt$LatLong <- paste(dt$lat, dt$lon, sep = ':') We also have to generate the tooltips as a new variable, which can be done easily with an apply call. 
We will concatenate the variable names and actual values separated by a HTML line break: > dt$tip <- apply(dt, 1, function(x) + paste(names(dt), x, collapse = '<br/ >')) And now we just pass these arguments to the function for an instant interactive map: > plot(gvisMap(dt, 'LatLong', tipvar = 'tip')) Another nifty feature of the googleVis package is that you can easily merge the different visualizations into one by using the gvisMerge function. The use of this function is quite simple: specify any two gvis objects you want to merge, and also whether they are to be placed horizontally or vertically. JavaScript mapping libraries The great success of the trending JavaScript data visualization libraries is only partly due to their great design. I suspect other factors also contribute to the general spread of such tools: it's very easy to create and deploy full-blown data models, especially since the release and on-going development of Mike Bostock's D3.js. Although there are also many really useful and smart R packages to interact directly with D3 and topojson (see for example my R user activity compilation at http://bit.ly/countRies). Now we will only focus on how to use Leaflet— probably the most used JavaScript library for interactive maps. What I truly love in R is that there are many packages wrapping other tools, so that R users can rely on only one programming language, and we can easily use C++ programs and Hadoop MapReduce jobs or build JavaScript-powered dashboards without actually knowing anything about the underlying technology. This is especially true when it comes to Leaflet! There are at least two very nice packages that can generate a Leaflet plot from the R console, without a single line of JavaScript. The Leaflet reference class of the rCharts package was developed by Ramnath Vaidyanathan, and includes some methods to create a new object, set the viewport and zoom level, add some points or polygons to the map, and then render or print the generated HTML and JavaScript code to the console or to a file. Unfortunately, this package is not on CRAN yet, so you have to install it from GitHub: > devtools::install_github('ramnathv/rCharts') As a quick example, let's generate a Leaflet map of the airports with some tooltips, like we did with the Google Maps API in the previous section. As the setView method expects numeric geo-coordinates as the center of the map, we will use Kansas City's airport as a reference: > library(rCharts) > map <- Leaflet$new() > map$setView(as.numeric(dt[which(dt$Dest == 'MCI'), + c('lat', 'lon')]), zoom = 4) > for (i in 1:nrow(dt)) + map$marker(c(dt$lat[i], dt$lon[i]), bindPopup = dt$tip[i]) > map$show() Similarly, RStudio's leaflet package and the more general htmlwidgets package also provide some easy ways to generate JavaScript-powered data visualizations. Let's load the library and define the steps one by one using the pipe operator from the magrittr package, which is pretty standard for all packages created or inspired by RStudio or Hadley Wickham: > library(leaflet) > leaflet(us) %>% + addProviderTiles("Acetate.terrain") %>% + addPolygons() %>% + addMarkers(lng = dt$lon, lat = dt$lat, popup = dt$tip) I especially like this latter map, as we can load a third-party satellite map in the background, then render the states as polygons; we also added the original data points along with some useful tooltips on the very same map with literally a one-line R command. 
We could even color the state polygons based on the aggregated results we computed in the previous sections! Ever tried to do the same in Java? Alternative map designs Besides being able to use some third-party tools, another main reason why I tend to use R for all my data analysis tasks is that R is extremely powerful in creating custom data exploration, visualization, and modeling designs. As an example, let's create a flow-map based on our data, where we will highlight the flights from Houston based on the number of actual and cancelled flights. We will use lines and circles to render these two variables on a 2-dimensional map, and we will also add a contour plot in the background based on the average time delay. But, as usual, let's do some data transformations first! To keep the number of flows at a minimal level, let's get rid of the airports outside the USA at last: > dt <- dt[point.in.polygon(dt$lon, dt$lat, + usa_data$x, usa_data$y) == 1, ] We will need the diagram package (to render curved arrows from Houston to the destination airports) and the scales package to create transparent colors: > library(diagram) > library(scales) Then let's render the contour map described in the Contour Lines section: > map("usa") > title('Number of flights, cancellations and delays from Houston') > image(look, add = TRUE) > map("state", lwd = 3, add = TRUE) And then add a curved line from Houston to each of the destination airports, where the width of the line represents the number of cancelled flights and the diameter of the target circles shows the number of actual flights: > for (i in 1:nrow(dt)) { + curvedarrow( + from = rev(as.numeric(h)), + to = as.numeric(dt[i, c('lon', 'lat')]), + arr.pos = 1, + arr.type = 'circle', + curve = 0.1, + arr.col = alpha('black', dt$N[i] / max(dt$N)), + arr.length = dt$N[i] / max(dt$N), + lwd = dt$Cancelled[i] / max(dt$Cancelled) * 25, + lcol = alpha('black', + dt$Cancelled[i] / max(dt$Cancelled))) + } Well, this article ended up being about visualizing spatial data, and not really about analyzing spatial data by fitting models and filtering raw data. Summary In case you are interested in knowing other R-related books that Packt has in store for you, here is the link: R for Data Science Practical Data Science Cookbook Resources for Article: Further resources on this subject: R ─ Classification and Regression Trees[article] An overview of common machine learning tasks[article] Reduction with Principal Component Analysis [article]

The Dashboard Design – Best Practices

Packt
25 Sep 2015
11 min read
In this article by Julian Villafuerte, author of the book Creating Stunning Dashboards with QlikView, you will learn more about best practices for dashboard design.

(For more resources related to this topic, see here.)

Data visualization is a field that is constantly evolving. However, some concepts have proven their value time and again through the years and have become what we call best practices. These notions should not be seen as strict rules that must be applied without any further consideration, but as a series of tips that will help you create better applications. If you are a beginner, try to stick to them as much as you can. These best practices will save you a lot of trouble and will greatly enhance your first endeavors. On the other hand, if you are an advanced developer, combine them with your personal experiences in order to build the ultimate dashboard.

Some guidelines in this article come from widely known figures in the field of data visualization, such as Stephen Few, Edward Tufte, John Tukey, Alberto Cairo, and Nathan Yau. So, if a concept catches your attention, I strongly recommend that you read more about it in their books. Throughout this article, we will review some useful recommendations that will help you create not only engaging, but also effective and user-friendly dashboards. Remember that they may apply differently depending on the information displayed and the audience you are working with. Nevertheless, they are great guidelines for the field of data visualization, so do not hesitate to consider them in all of your developments.

Gestalt principles

In the early 1900s, the Gestalt school of psychology conducted a series of studies on human perception in order to understand how our brain interprets forms and recognizes patterns. Understanding these principles may help you create a better structure for your dashboard and make your charts easier to interpret:

Proximity: When we see multiple elements located near one another, we tend to see them as groups. For example, we can visually distinguish clusters in a scatter plot by grouping the dots according to their position.

Similarity: Our brain associates the elements that are similar to each other (in terms of shape, size, color, or orientation). For example, in color-coded bar charts, we can associate the bars that share the same color even if they are not grouped.

Enclosure: If a border surrounds a series of objects, we perceive them as part of a group. For example, if a scatter plot has reference lines that wrap the elements between 20 and 30 percent, we will automatically see them as a cluster.

Closure: When we detect a figure that looks incomplete, we tend to perceive it as a closed structure. For example, even if we discard the borders of a bar chart, the axes will form a region that our brain will isolate without needing the extra lines.

Continuity: If a number of objects are aligned, we will perceive them as a continuous body. For example, the different blocks of code when you indent QlikView script are perceived as one continuous block of code.

Connection: If objects are connected by a line, we will see them as a group. For example, we tend to associate the dots connected by lines on a scatter plot with lines and symbols.

Giving context to the data

When it comes to analyzing data, context is everything. If you present isolated figures, the users will have a hard time trying to find the story hidden behind them.
For example, if I told you that the gross margin of our company was 16.5 percent during the first quarter of 2015, would you evaluate it as a positive or negative sign? This is pretty difficult, right? However, what if we added some extra information to complement this KPI? Then, the following image would make a lot more sense: As you can see, adding context to the data can make the landscape look quite different. Now, it is easy to see that even though the gross margin has substantially improved during the last year, our company has some work to do in order to be competitive and surpass the industry standard. The appropriate references may change depending on the KPI you are dealing with and the goals of the organization, but some common examples are as follows:

Last year's performance
The quota, budget, or objective
Comparison with the closest competitor, product, or employee
The market share
The industry standards

Another good tip in this regard is to anticipate the comparisons. If you display figures regarding the monthly quota and the actual sales, you can save the users the mental calculations by including complementary indicators, such as the gap between them and the percentage of completion.

Data-Ink Ratio

One of the most interesting principles in the field of data visualization is the Data-Ink Ratio, introduced by Edward R. Tufte in his book, The Visual Display of Quantitative Information, a must-read for every designer. In this publication, he states that there are two different types of ink (or, in our case, pixels) in any chart, as follows:

Data-ink: This includes all the nonerasable portions of a graphic that are used to represent the actual data. These pixels are at the core of the visualization and cannot be removed without losing some of its content.

Non-data-ink: This includes any element that is not directly related to the data or doesn't convey anything meaningful to the reader.

Based on these concepts, he defined the Data-Ink Ratio as the proportion of the graphic's ink that is devoted to the nonredundant display of data information:

Data-Ink Ratio = Data-Ink / Total Ink

As you can imagine, our goal is to maximize this number by decreasing the non-data-ink used in our dashboards. For example, the chart to the left has a low data-ink ratio due to the usage of 3D effects, shadows, backgrounds, and multiple grid lines. On the contrary, the chart to the right presents a higher ratio, as most of the pixels are data-related.

Avoiding chart junk

Chart junk is another term coined by Tufte that refers to all the elements that distract the viewer from the actual information in a graphic. Evidently, chart junk is considered non-data-ink and comprises features such as heavy gridlines, frames, redundant labels, ornamental axes, backgrounds, overly complex fonts, shadows, images, or other effects included only as decoration. Take, for instance, the following charts: As you can see, by removing all the unnecessary elements in a chart, it becomes easier to interpret and looks much more elegant.

Balance

Colors, icons, reference lines, and other visual cues can be very useful to help the users focus on the most important elements in a dashboard. However, misusing or overusing these features can be a real hazard, so try to find the adequate balance for each of them.

Excessive precision

QlikView applications should use the appropriate language for each audience. When designing, think about whether precise figures will be useful or if they are going to become a distraction.
Most of the time, dashboards show high-level KPIs, so it may be more comfortable for certain users to see rounded numbers, as in the following image:

3D charts

One of Microsoft Excel's greatest wrongdoings is making everyone believe that 3D charts are good for data analysis. For some reason, people seem to love them; but, believe me, they are a real threat to business analysts. Despite their visual charm, these representations can easily hide some parts of the information and convey wrong perceptions depending on their usage of colors, shadows, and axis inclination. I strongly recommend avoiding them in any context.

Sorting

Whether you are working with a list box, a bar chart, or a straight table, sorting an object is always advisable, as it adds context to the data. It can help you find the most commonly selected items in a list box, distinguish which slice is bigger on a pie chart when the sizes are similar, or easily spot the outliers in other graphic representations.

Alignment and distribution

Most of my colleagues argue that I am on the verge of an obsessive-compulsive disorder, but I cannot stand an application with unaligned objects. (Actually, I am still struggling with the fact that the paragraphs in this book are not justified, but anyway...). The design toolbar offers useful options in this regard, so there is no excuse for not having a tidy dashboard. If you take care of the alignment and sizing of all the charts and filters, your interface will display a clean and professional look that every user will appreciate:

Animations

I have a rule of thumb regarding chart animation in QlikView: if you are Hans Rosling, go ahead. If not, better think it over twice. Even though they can be very illustrative, most of the time chart animations end up being a distraction rather than a tool that helps us visualize data, so be conservative about their use. For those of you who do not know him, Hans Rosling is a Swedish professor of international health who works in Stockholm. However, he is best known for his amazing way of presenting data with GapMinder, a simple piece of software that allows him to animate a scatter plot. If you are a data enthusiast, you ought to watch his appearances in TED Talks.

Avoiding scroll bars

Throughout his work, Stephen Few emphasizes that all the information in a dashboard must fit on a single screen. Whilst I believe that there is no harm in splitting the data across multiple sheets, it is undeniable that scroll bars reduce the overall usability of an application. If the user has to continuously scroll right and left to read all the figures in a table, or if she must go up and down to see the filter panel, she will end up getting tired and eventually discard your dashboard.

Consistency

If you want to create an easy way to navigate your dashboard, you cannot forget about consistency. Locating standard objects (such as Current Selections Box, Search Object, and Filter Panels) in the same area in every tab will help the users easily find all the items they need. In addition, applying the same style, fonts, and color palettes in all your charts will make your dashboard look more elegant and professional.

White space

The space between charts, tables, and filters is often referred to as white space, and even though you may not notice it, it is a vital part of any dashboard. Displaying dozens of objects without letting them breathe makes your interface look cluttered and, therefore, harder to understand.
Some of the benefits of using white space adequately are as follows:

It improves readability
It focuses attention and emphasizes the important objects
It guides the users' eyes, creating a sense of hierarchy in the dashboard
It fosters a balanced layout, making your interface look clear and sophisticated

Applying makeup

Every now and then, you stumble upon delicate situations where some business users try their best to hide certain parts of the data. Whether it is about low sales or the insane amount of defective products, they often ask you to remove a few charts or avoid visual cues so that those numbers go unnoticed. Needless to say, dashboards are tools intended to inform and guide the decisions of the viewers, so avoid presenting misleading visualizations.

Meaningless variety

As a designer, you will often hesitate to use the same chart type multiple times in your application, fearing that the users will get bored of it. Though this may be a nagging concern, if you present valuable data in an adequate format, there is no need to add new types of charts just for variety's sake. We want to keep the users engaged with great analyses, not just with pretty graphics.

Summary

In this article, you learned about the best practices to be followed when designing dashboards in QlikView.

Resources for Article: Further resources on this subject: Analyzing Financial Data in QlikView [article] Securing QlikView Documents [article] Common QlikView script errors [article]

Integration with Spark SQL

Packt
24 Sep 2015
11 min read
In this article by Sumit Gupta, the author of the book Learning Real-time Processing with Spark Streaming, we will discuss the integration of Spark Streaming with other advanced Spark libraries such as Spark SQL. (For more resources related to this topic, see here.) No single piece of software in today's world can fulfill the varied, versatile, and complex demands/needs of the enterprises, and to be honest, neither should it! Software is made to fulfill specific needs arising out of the enterprise at a particular point in time, and these needs may change in the future due to many other factors. These factors, such as government policies and business/market dynamics, may or may not be within the enterprise's control. Considering all these factors, the integration and interoperability of any software system with internal/external systems is pivotal in fulfilling enterprise needs. Integration and interoperability are categorized as nonfunctional requirements, which are always implicit and may or may not be explicitly stated by the end users. Over the period of time, architects have realized the importance of these implicit requirements in modern enterprises, and now all enterprise architectures provide support, due diligence, and provisions for fulfilling these requirements. Enterprise architecture frameworks such as The Open Group Architecture Framework (TOGAF) even define specific sets of procedures and guidelines for defining and establishing the interoperability and integration requirements of modern enterprises. The Spark community realized the importance of both these factors and provided a versatile and scalable framework with certain hooks for integration and interoperability with different systems/libraries; for example, data consumed and processed via Spark streams can also be loaded into a structured (table: rows/columns) format and can be further queried using SQL. The data can even be stored in the form of Hive tables in HDFS as persistent tables, which will exist even after our Spark program has restarted. In this article, we will discuss querying streaming data in real time using Spark SQL.

Querying streaming data in real time

Spark Streaming is developed on the principle of integration and interoperability, where it not only provides a framework for consuming data in near real time from varied data sources, but at the same time also provides integration with Spark SQL, where existing DStreams can be converted into a structured data format for querying using standard SQL constructs. There are many use cases where SQL on streaming data is a much-needed feature; for example, in our distributed log analysis use case, we may need to combine the precomputed datasets with the streaming data for performing exploratory analysis using interactive SQL queries, which is difficult to implement only with streaming operators, as they are not designed for introducing new datasets and performing ad hoc queries. Moreover, SQL's success at expressing complex data transformations derives from the fact that it is based on a set of very powerful data processing primitives that do filtering, merging, correlation, and aggregation, which are not available in lower-level programming languages such as Java/C++ and may result in long development cycles and high maintenance costs. Let's move forward and first understand a few things about Spark SQL, and then we will also see the process of converting existing DStreams into structured formats.
Understanding Spark SQL

Spark SQL is one of the modules developed over the Spark framework for processing structured data, which is stored in the form of rows and columns. At a very high level, it is similar to data residing in an RDBMS in the form of rows and columns, over which SQL queries are executed for performing analysis, but Spark SQL is much more versatile and flexible compared to an RDBMS. Spark SQL provides distributed processing of SQL queries and can be compared to frameworks such as Hive, Impala, or Drill. Here are a few notable features of Spark SQL:

Spark SQL is capable of loading data from a variety of data sources such as text files, JSON, Hive, HDFS, the Parquet format, and of course RDBMSes too, so that we can consume/join and process datasets from different and varied data sources.
It supports static and dynamic schema definition for the data loaded from various sources, which helps in defining schemas for known data structures/types, and also for those datasets where the columns and their types are not known until runtime.
It can work as a distributed query engine using the thrift JDBC/ODBC server or command-line interface, where end users or applications can interact with Spark SQL directly to run SQL queries.
Spark SQL provides integration with Spark Streaming, where DStreams can be transformed into the structured format and SQL queries can then be executed.
It is capable of caching tables using an in-memory columnar format for faster reads and in-memory data processing.
It supports schema evolution, so that new columns can be added/deleted to the existing schema, and Spark SQL still maintains compatibility between all versions of the schema.

Spark SQL defines a higher-level programming abstraction called DataFrames, which is also an extension to the existing RDD API. DataFrames are distributed collections of objects in the form of rows and named columns, similar to tables in an RDBMS, but with much richer functionality, encompassing all the previously defined features. The DataFrame API is inspired by the concepts of data frames in R (http://www.r-tutor.com/r-introduction/data-frame) and Python (http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe). Let's move ahead and understand how Spark SQL works with the help of an example: As a first step, let's create sample JSON data containing basic information about a company's departments, such as Name, Employees, and so on, and save this data into the file company.json. The JSON file would look like this: [ { "Name":"DEPT_A", "No_Of_Emp":10, "No_Of_Supervisors":2 }, { "Name":"DEPT_B", "No_Of_Emp":12, "No_Of_Supervisors":2 }, { "Name":"DEPT_C", "No_Of_Emp":14, "No_Of_Supervisors":3 }, { "Name":"DEPT_D", "No_Of_Emp":10, "No_Of_Supervisors":1 }, { "Name":"DEPT_E", "No_Of_Emp":20, "No_Of_Supervisors":5 } ] You can use any online JSON editor such as http://codebeautify.org/online-json-editor to see and edit the data defined in the preceding JSON code. Next, let's extend our Spark-Examples project and create a new package by the name chapter.six, and within this new package, create a new Scala object and name it ScalaFirstSparkSQL.scala.
Next, add the following import statements just below the package declaration: import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.sql._ import org.apache.spark.sql.functions._ Further, in your main method, add the following set of statements to create an SQLContext from the SparkContext: //Creating Spark Configuration val conf = new SparkConf() //Setting Application/ Job Name conf.setAppName("My First Spark SQL") // Define Spark Context which we will use to initialize our SQL Context val sparkCtx = new SparkContext(conf) //Creating SQL Context val sqlCtx = new SQLContext(sparkCtx) SQLContext, or any of its descendants such as HiveContext (for working with Hive tables) or CassandraSQLContext (for working with Cassandra tables), is the main entry point for accessing all functionalities of Spark SQL. It allows the creation of data frames, and also provides functionality to fire SQL queries over data frames. Next, we will define the following code to load the JSON file (company.json) using the SQLContext, and further, we will also create a data frame: //Define path of your JSON File (company.json) which needs to be processed val path = "/home/softwares/spark/data/company.json"; //Use SQLContext and Load the JSON file. //This will return the DataFrame which can be further Queried using SQL queries. val dataFrame = sqlCtx.jsonFile(path) In the preceding piece of code, we used the jsonFile(…) method for loading the JSON data. There are other utility methods defined by SQLContext for reading raw data from the filesystem, creating data frames from existing RDDs, and many more. Spark SQL supports two different methods for converting existing RDDs into data frames. The first method uses reflection to infer the schema of an RDD from the given data. This approach leads to more concise code and helps in instances where we already know the schema while writing the Spark application. We have used the same approach in our example. The second method is through a programmatic interface that allows us to construct a schema and then apply it to an existing RDD and finally generate a data frame. This method is more verbose, but provides flexibility and helps in those instances where columns and data types are not known until the data is received at runtime. Refer to https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.sql.SQLContext for a complete list of methods exposed by SQLContext. Once the DataFrame is created, we need to register the DataFrame as a temporary table within the SQL context so that we can execute SQL queries over the registered table. Let's add the following piece of code for registering our DataFrame with our SQL context and name it company: //Register the data as a temporary table within SQL Context //Temporary table is destroyed as soon as SQL Context is destroyed. dataFrame.registerTempTable("company"); And we are done… Our JSON data is automatically organized into the table (rows/columns) and is ready to accept SQL queries. Even the data types are inferred from the type of data entered within the JSON file itself. Now, we will start executing the SQL queries on our table, but before that, let's see the schema being created/defined by SQLContext: //Printing the Schema of the Data loaded in the Data Frame dataFrame.printSchema(); The execution of the preceding statement will provide results similar to the following illustration: The preceding illustration shows the schema of the JSON data loaded by Spark SQL. Pretty simple and straightforward, isn't it?
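Since the original screenshot is not reproduced here, the printed schema should look roughly like the following. This is an approximation; the exact nullability flags and the inferred numeric types may vary slightly across Spark versions:

root
 |-- Name: string (nullable = true)
 |-- No_Of_Emp: long (nullable = true)
 |-- No_Of_Supervisors: long (nullable = true)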
Spark SQL has automatically created our schema based on the data defined in our company.json file. It has also defined the data type of each of the columns. We can also define the schema using reflection (https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#inferring-the-schema-using-reflection) or programmatically define the schema (https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#inferring-the-schema-using-reflection). Next, let's execute some SQL queries to see the data stored in the DataFrame; the first SQL query will print all records: //Executing SQL Queries to Print all records in the DataFrame println("Printing All records") sqlCtx.sql("Select * from company").collect().foreach(print) The execution of the preceding statement will produce the following results on the console where the driver is executed: Next, let's also select only a few columns instead of all records and print them on the console: //Executing SQL Queries to Print Name and Employees //in each Department println("\n Printing Number of Employees in All Departments") sqlCtx.sql("Select Name, No_Of_Emp from company").collect().foreach(println) The execution of the preceding statement will produce the following results on the console where the driver is executed: Now, finally, let's do some aggregation and count the total number of employees across the departments: //Using the aggregate function (agg) to print the //total number of employees in the Company println("\n Printing Total Number of Employees in Company_X") val allRec = sqlCtx.sql("Select * from company").agg(Map("No_Of_Emp"->"sum")) allRec.collect.foreach ( println ) In the preceding piece of code, we used the agg(…) function and performed the sum of all employees across the departments, where sum can be replaced by avg, max, min, or count. The execution of the preceding statement will produce the following results on the console where the driver is executed: The preceding image shows the results of executing the aggregation on our company.json data. Refer to the DataFrame API at https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.sql.DataFrame for further information on the available functions for performing aggregation. As a last step, we will stop our Spark SQL context by invoking the stop() function on the SparkContext: sparkCtx.stop(). This is required so that your application can notify the master or resource manager to release all the resources allocated to the Spark job. It also ensures the graceful shutdown of the job and avoids any resource leakage, which may happen otherwise. Also, as of now, there can be only one Spark context active per JVM, and we need to stop the active SparkContext before creating a new one.

Summary

In this article, we have seen the step-by-step process of using Spark SQL as a standalone program. Though we have considered JSON files as an example, we can also leverage Spark SQL with Cassandra (https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md), MongoDB (https://github.com/Stratio/spark-mongodb), or Elasticsearch (http://chapeau.freevariable.com/2015/04/elasticsearch-and-spark-1-dot-3.html).

Resources for Article: Further resources on this subject: Getting Started with Apache Spark DataFrames [article] Sabermetrics with Apache Spark [article] Getting Started with Apache Spark [article]

Find Friends on Facebook

Packt
22 Sep 2015
13 min read
In this article by the authors, Vikram Garg and Sharan Kumar Ravindran, of the book, Mastering Social Media Mining with R, we learn about data mining using Facebook as our resource. (For more resources related to this topic, see here.) We will see how to use the R package Rfacebook, which provides access to the Facebook Graph API from R. It includes a series of functions that allow us to extract various data about our network, such as friends, likes, comments, followers, newsfeeds, and much more. We will discuss how to visualize our Facebook network, and we will see some methodologies to make use of the available data to implement business cases.

Rfacebook package installation and authentication

The Rfacebook package is authored and maintained by Pablo Barbera and Michael Piccirilli. It provides an interface to the Facebook API. It needs Version 2.12.0 or later of R and it is dependent on a few other packages, such as httr, rjson, and httpuv. Before starting, make sure those packages are installed. It is preferred to have Version 0.6 of the httr package installed.

Installation

We will now install the Rfacebook package. We can download and install the latest package from GitHub using the following code and load the package using the library function. Alternatively, the Rfacebook package can also be installed from the CRAN network. One prerequisite for installing the package using the function install_github is to have the package devtools loaded into the R environment. The code is as follows: library(devtools) install_github("Rfacebook", "pablobarbera", subdir="Rfacebook") library(Rfacebook) After installing the Rfacebook package for connecting to the API, make an authentication request. This can be done via two different methods. The first method is by using the access token generated for the app, which is short-lived (valid for two hours); on the other hand, we can create a long-lasting token using the OAuth function. Let's first create a temporary token. Go to https://developers.facebook.com/tools/explorer, click on Get Token, and select the required user data permissions. The Facebook Graph API explorer will open with an access token. This access token will be valid for two hours. The status of the access token as well as the scope can be checked by clicking on the Debug button. Once the token expires, we can generate a new one. Now, we can access the data from R using the following code. The access token generated using the link should be copied and passed to the token variable. The use of username in the function getUsers is deprecated in the latest Graph API; hence, we are passing the ID of a user. You can get your ID from the same link that was used for token generation. This function can be used to pull the details of any user, provided the generated token has the access. Usually, access is limited to a few users with a public setting or those who use your app. It is also based on the items selected in the user data permission check page during token generation. In the following code, paste your token inside the double quotes, so that it can be reused across the functions without explicitly mentioning the actual token. token<- "XXXXXXXXX"

A closer look at how the package works

The getUsers function using the token will hit the Facebook Graph API. Facebook will be able to uniquely identify the users as well as the permissions to access information. If all the check conditions are satisfied, we will be able to get the required data.
Copy the token from the mentioned URL and paste it within the double quotes. Remember that the token generated will be active only for two hours. Use the getUsers function to get the details of the user. Earlier, the getUsers function used to work based on the Facebook friend's name as well as ID; in API Version 2.0, we cannot access the data using the name. Consider the following code for example: token<- "XXXXXXXXX" me<- getUsers("778278022196130", token, private_info = TRUE) Then, the details of the user, such as name and hometown, can be retrieved using the following code: me$name The output is also mentioned for your reference: [1] "Sharan Kumar R" For the following code: me$hometown The output is as follows: [1] "Chennai, Tamil Nadu" Now, let's see how to create a long-lasting token. Open your Facebook app page by going to https://developers.facebook.com/apps/ and choosing your app. On the Dashboard tab, you will be able to see the App ID and Secret Code values. Use those in the following code. require("Rfacebook") fb_oauth<- fbOAuth(app_id="11",app_secret="XX",extended_permissions = TRUE) On executing the preceding statements, you will find the following message in your console: Copy and paste into Site URL on Facebook App Settings: http://localhost:1410/ When done, press any key to continue... Copy the URL displayed and open your Facebook app; on the Settings tab, click on the Add Platform button and paste the copied URL in the Site URL text box. Make sure to save the changes. Then, return to the R console and press any key to continue; you will be prompted to enter your Facebook username and password. On completing that, you will return to the R console. If you find the following message, it means your long-lived token is ready to use. When you get the completion status, you might not be able to access any of the information immediately; it is advisable to use the OAuth function a few minutes after the creation of the Facebook application. Authentication complete. Authentication successful. After successfully authenticating, we can save the token and load it on demand using the following code: save(fb_oauth, file="fb_oauth") load("fb_oauth") When we need to automate tasks or use Rfacebook extensively, regenerating short-lived tokens every two hours becomes cumbersome. Hence, it is advisable to create a long-lasting token to authenticate the user and then save it. Whenever required, we can just load it from a local file. Note that Facebook authentication might take several minutes. Hence, if your authentication fails, please wait for some time before pressing any key on the retry, and check whether you have installed Version 0.6 of the httr package. If you continue to experience issues in generating the long-lasting token, it's not a problem; we are good to go with the temporary token.

Exercise

Create an app in Facebook and authenticate by any one of the methods discussed.

A basic analysis of your network

In this section, we will discuss how to extract the Facebook network of friends and some more information about the people in our network. After completing the app creation and authentication steps, let's move forward and learn to pull some basic network data from Facebook. First, let's find out which friends we have access to, using the following command in R.
Let's use the temporary token for accessing the data: token<- "XXXXXXXXX" friends<- getFriends(token, simplify = TRUE) head(friends) # To see a few of your friends The preceding function will return all our Facebook friends whose data is accessible. Version 1 of the API would allow us to download all the friends' data by default, but in the new version, we have limited access. Since we have set simplify to TRUE, we will pull only the usernames and their Facebook IDs. By setting the same parameter to FALSE, we will be able to access additional data such as gender, location, hometown, profile picture, relationship status, and full name. We can use the function getUsers to get additional information about a particular user. The following information is available by default: gender, location, and language. We can, however, get some additional information such as relationship status, birthday, and the current location by setting the parameter private_info to TRUE: friends_data<- getUsers(friends$id, token, private_info = TRUE) table(friends_data$gender) The output is as follows: female male 5 21 We can also find out the language, location, and relationship status. The commands to generate the details as well as the respective outputs are given here for your reference: #Language table(substr(friends_data$locale, 1, 2)) The output is as follows: en 26 The code to find the location is as follows: # Location (Country) table(substr(friends_data$locale, 4, 5)) The output is as follows: GB US 1 25 Here's the code to find the relationship status: # Relationship Status table(friends_data$relationship_status) Here's the output: Engaged Married Single 1 1 3 Now, let's see what we have liked on Facebook. We can use the function getLikes to get the likes data. In order to know about your own likes data, specify user as me. The same function can be used to extract information about our friends, in which case we should pass the user's Facebook ID. This function will provide us with a list of Facebook pages liked by the user, their ID, name, and the website associated with the page. We can even restrict the number of results retrieved by setting a value for the parameter n. The same function will be used to get the likes of people in our network; instead of the keyword me, we should give the Facebook ID of those users. Remember that we can only access the data of users who are accessible to our app. The code is as follows: likes<- getLikes(user="me", token=token) head(likes) After exploring the use of functions to pull data, let's see how to use the Facebook Query Language via the function getFQL, which can be used to pass queries. The following query will get you the list of friends in your network: friends<- getFQL("SELECT uid2 FROM friend WHERE uid1=me()", token=token) In order to get the complete details of your friends, the following query can be used. The query will return the username, Facebook ID, and the link to their profile picture. Note that we might not be able to access the complete network of friends' data, since access to the data of all your friends is deprecated with Version 2.0. The code is as follows: # Details about friends Friends_details<- getFQL("SELECT uid, name, pic_square FROM user WHERE uid = me() OR uid IN (SELECT uid2 FROM friend WHERE uid1 = me())", token=token) In order to know more about the Facebook Query Language, check out the following link. This method of extracting the information might be preferred by people familiar with query languages.
It can also help extract data satisfying only specific conditions (https://developers.facebook.com/docs/technical-guides/fql).

Exercise

Download your Facebook network and do an exploratory analysis of the languages your friends speak, the places where they live, the total number of pages they have liked, and their marital status. Try all these with the Facebook Query Language as well.

Network analysis and visualization

So far, we used a few functions to get the details about our Facebook profile as well as friends' data. Let's see how to get to know more about our network. Before learning to get the network data, let's understand what a network is as well as a few important concepts about networks. Anything connected to a few other things could be a network. Everything in real life is connected to everything else, for example, people, machines, events, and so on. It would make a lot of sense if we analyzed them as a network. Let's consider a network of people; here, people will be the nodes in the network and the relationships between them would be the edges (lines connecting them).

Social network analysis

The technique to study/analyze a network is called social network analysis. We will see how to create a simple plot of friends in our network in this section. To understand the nodes (people, places, and so on) in a network in social network analysis, we need to evaluate the position of the nodes. We can evaluate the nodes using centrality. Centrality can be measured using different methods such as degree, betweenness, and closeness (a short sketch of computing these measures with the igraph package appears at the end of this article). Let's first get our Facebook network and then get to know the centrality measures in detail. We use the function getNetwork to download our Facebook network. We need to mention how we would like to format the data. When the parameter format is set to adj.matrix, it will produce the data in matrix format, where the people in the network become the row names and column names of the matrix, and if they are connected to each other, then the corresponding cell in the matrix will hold a value. The command is as follows: network<- getNetwork(token, format="adj.matrix") We now have our Facebook network downloaded. Let's visualize our network before getting to understand the centrality concepts one by one with our own network. To visualize the network, we need to use the package called igraph in R. Since we downloaded our network in the adjacency matrix format, we will use the corresponding function in igraph. We use the layout function to determine the placement of vertices in the network for drawing the graph, and then we use the plot function to draw the network. In order to explore the various other functionalities of these functions, you can execute ?<function_name> in RStudio and the help window will show the description of the function. Let's use the following code to load the package igraph into R. require(igraph) We will now build the graph using the function graph.adjacency; this function helps in creating a network graph from an adjacency matrix. In order to build a force-directed graph, we will use the function layout.drl. The force-directed layout will help in making the graph more readable. The commands are as follows: social_graph<- graph.adjacency(network) layout<- layout.drl(social_graph, options=list(simmer.attraction=0)) Finally, we will use the plot function with various built-in parameters to make the graph more readable.
For example, we can name the nodes in our network, we can set the size of the nodes as well as the edges in the network, and we can color the graph and the components of the graph. Use the following code to see what the network looks like. The output that was plotted can be saved locally using the function dev.copy, and the size of the image as well as the type can be passed as parameters to the function: plot(social_graph, vertex.size=10, vertex.color="green", vertex.label=NA, vertex.label.cex=0.5, edge.arrow.size=0, edge.curved=TRUE, layout=layout.fruchterman.reingold) dev.copy(png,filename= "C:/Users/Sharan/Desktop/3973-03-community.png", width=600, height=600); dev.off (); With the preceding plot function, my network will look like the following one. In the following network, the node labels (the names of the people) have been disabled. They can be enabled by removing the vertex.label=NA parameter.

Summary

In this article, we discussed how to use the various functions implemented in the Rfacebook package and how to analyze the network. This article covers the important techniques that help in performing vital network analysis and also highlights the wide range of business problems that could be addressed with Facebook data. It gives us a glimpse of the great potential for implementing various analyses.

Resources for Article: Further resources on this subject: Using App Directory in HootSuite [article] Supervised learning [article] Warming Up [article]
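To complement the centrality discussion in the Social network analysis section above, the following is a minimal sketch of computing degree, betweenness, and closeness with igraph on the social_graph object built earlier. These are standard igraph functions; the variable names follow this article, and the actual values will of course depend on your own network:

library(igraph)

# Centrality measures on the Facebook network built with graph.adjacency()
deg <- degree(social_graph)       # number of connections per person
btw <- betweenness(social_graph)  # how often a person lies on shortest paths between others
cls <- closeness(social_graph)    # how close a person is to everyone else in the network

# The five most central friends by degree
head(sort(deg, decreasing = TRUE), 5)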

Getting Started with Apache Spark DataFrames

Packt
22 Sep 2015
5 min read
In this article about Arun Manivannan's book Scala Data Analysis Cookbook, we will cover the following recipes: Getting Apache Spark ML – a framework for large-scale machine learning, and Creating a data frame from CSV. (For more resources related to this topic, see here.)

Getting started with Apache Spark

Breeze is the building block of Spark MLlib, the machine learning library for Apache Spark. In this recipe, we'll see how to bring Spark into our project (using SBT) and look at how it works internally. The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter1-spark-csv/build.sbt.

How to do it...

Pulling Spark ML into our project is just a matter of adding a few dependencies in our build.sbt file: spark-core, spark-sql, and spark-mllib: Under a brand new folder (which will be our project root), we create a new file called build.sbt. Next, let's add the Spark libraries to the project dependencies: organization := "com.packt" name := "chapter1-spark-csv" scalaVersion := "2.10.4" val sparkVersion="1.3.0" libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % sparkVersion, "org.apache.spark" %% "spark-sql" % sparkVersion, "org.apache.spark" %% "spark-mllib" % sparkVersion ) resolvers ++= Seq( "Apache HBase" at "https://repository.apache.org/content/repositories/releases", "Typesafe repository" at "http://repo.typesafe.com/typesafe/releases/" )

How it works...

Spark has four major higher-level tools built on top of the Spark Core: Spark Streaming, Spark MLlib (machine learning), Spark SQL (an SQL interface for accessing data), and GraphX (for graph processing). The Spark Core is the heart of Spark, providing higher-level abstractions in various languages for data representation, serialization, scheduling, metrics, and so on. For this recipe, we skipped streaming and GraphX and added the remaining three libraries.

There's more…

Apache Spark is a cluster computing platform that claims to run about 100 times faster than Hadoop (that's a mouthful). In our terms, we could consider that as a means to run our complex logic over a massive amount of data at a blazingly high speed. The other good thing about Spark is that the programs we write are much smaller than the typical MapReduce classes that we write for Hadoop. So, not only do our programs run faster, but it also takes less time to write them in the first place.

Creating a data frame from CSV

In this recipe, we'll look at how to create a new data frame from a Delimiter Separated Values (DSV) file. The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/tree/master/chapter1-spark-csv in the DataFrameCSV class.

How to do it...

CSV support isn't first-class in Spark but is available through an external library from Databricks.
So, let's go ahead and add that up in build.sbt: After adding the spark-csv dependency, our complete build.sbt looks as follows: organization := "com.packt" name := "chapter1-spark-csv" scalaVersion := "2.10.4" val sparkVersion="1.3.0" libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % sparkVersion, "org.apache.spark" %% "spark-sql" % sparkVersion, "org.apache.spark" %% "spark-mllib" % sparkVersion, "com.databricks" %% "spark-csv" % "1.0.3" ) resolvers ++= Seq( "Apache HBase" at "https://repository.apache.org/content/repositories/releases", "Typesafe repository" at "http://repo.typesafe.com/typesafe/releases/" ) fork := true Before we create the actual data frame, there are three steps that we ought to do: create the Spark configuration, create the Spark context, and create the SQL context. SparkConf holds all of the information for running this Spark cluster. For this recipe, we are running locally, and we intend to use only two cores in the machine (local[2]): val conf = new SparkConf().setAppName("csvDataFrame").setMaster("local[2]") For this recipe, we'll be running Spark in standalone mode. The Spark context and the SQL context are then created from this configuration: val sc = new SparkContext(conf) val sqlContext = new SQLContext(sc) Now let's load our pipe-separated file: val students: org.apache.spark.sql.DataFrame = sqlContext.csvFile(filePath="StudentData.csv", useHeader=true, delimiter='|')

How it works...

The csvFile function of sqlContext accepts the full filePath of the file to be loaded. If the CSV has a header, then the useHeader flag will read the first row as column names. The delimiter flag, as expected, defaults to a comma, but you can override the character as needed. Instead of using the csvFile function, you can also use the load function available in the SQL context. The load function accepts the format of the file (in our case, it is CSV) and options as a map. We can specify the same parameters that we specified earlier using a Map, like this: val options=Map("header"->"true", "path"->"ModifiedStudent.csv") val newStudents=sqlContext.load("com.databricks.spark.csv",options)

Summary

In this article, you learned about Apache Spark ML, a framework for large-scale machine learning. Then, we saw how to create a data frame from a CSV file with the help of example code.

Resources for Article: Further resources on this subject: Integrating Scala, Groovy, and Flex Development with Apache Maven [article] Ridge Regression [article] Reactive Data Streams [article]

Implementing Decision Trees

Packt
22 Sep 2015
4 min read
In this article by the author, Sunila Gollapudi, of the book, Practical Machine Learning, we will outline a business problem that can be addressed by building a decision tree-based model, and see how it can be implemented in Apache Mahout, R, Julia, Apache Spark, and Python. (For more resources related to this topic, see here.)

Implementing decision trees

Here, we will explore implementing decision trees using various frameworks and tools.

The R example

We will use the rpart and ctree packages in R to build decision tree-based models. The steps are as follows (the code for these steps appeared as screenshots in the original article; a reconstructed sketch is given at the end of this article, after the Julia example):

Import the packages for data import and the decision tree libraries.
Start data manipulation: create a categorical variable on Sales and append it to the existing dataset.
Using random functions, split the data into training and testing datasets.
Fit the tree model with the training data, check how the model works with the testing data, and measure the error.
Prune the tree.
Plot the pruned tree.

The Spark example

A Java-based example using MLlib is shown here: import java.util.HashMap; import java.util.Map; import scala.Tuple2; import org.apache.spark.api.java.JavaPairRDD; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.api.java.function.Function; import org.apache.spark.api.java.function.PairFunction; import org.apache.spark.mllib.regression.LabeledPoint; import org.apache.spark.mllib.tree.DecisionTree; import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.util.MLUtils; import org.apache.spark.SparkConf; SparkConf sparkConf = new SparkConf().setAppName("JavaDecisionTree"); JavaSparkContext sc = new JavaSparkContext(sparkConf); // Load and parse the data file. String datapath = "data/mllib/sales.txt"; JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), datapath).toJavaRDD(); // Split the data into training and test sets (30% held out for testing) JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3}); JavaRDD<LabeledPoint> trainingData = splits[0]; JavaRDD<LabeledPoint> testData = splits[1]; // Set parameters. // Empty categoricalFeaturesInfo indicates all features are continuous. Integer numClasses = 2; Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>(); String impurity = "gini"; Integer maxDepth = 5; Integer maxBins = 32; // Train a DecisionTree model for classification.
final DecisionTreeModel model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins); // Evaluate model on test instances and compute test error JavaPairRDD<Double, Double> predictionAndLabel = testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() { @Override public Tuple2<Double, Double> call(LabeledPoint p) { return new Tuple2<Double, Double>(model.predict(p.features()), p.label()); } }); Double testErr = 1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() { @Override public Boolean call(Tuple2<Double, Double> pl) { return !pl._1().equals(pl._2()); } }).count() / testData.count(); System.out.println("Test Error: " + testErr); System.out.println("Learned classification tree model:\n" + model.toDebugString());

The Julia example

We will use the DecisionTree package in Julia as shown here:
julia> Pkg.add("DecisionTree")
julia> using DecisionTree
We will use the RDatasets package to load the dataset for the example in context:
julia> Pkg.add("RDatasets"); using RDatasets
julia> sales = data("datasets", "sales");
julia> features = array(sales[:, 1:4]); # use matrix() for Julia v0.2
julia> labels = array(sales[:, 5]); # use vector() for Julia v0.2
julia> stump = build_stump(labels, features);
julia> print_tree(stump) Feature 3, Threshold 3.0 L-> price : 50/50 R-> shelvelock : 50/100
Pruning the tree:
julia> length(tree) 11
julia> pruned = prune_tree(tree, 0.9);
julia> length(pruned) 9

Summary

In this article, we implemented decision trees using R, Spark, and Julia.

Resources for Article: Further resources on this subject: An overview of common machine learning tasks [article] How to do Machine Learning with Python [article] Modeling complex functions with artificial neural networks [article]
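As noted in the R example above, the original article showed the R code for those steps only as screenshots. The following is a minimal reconstruction of the steps, assuming a data frame named sales.df with a numeric Sales column; the dataset name, the threshold of 8, and the use of the rpart.plot helper for plotting are illustrative assumptions rather than the author's exact code:

library(rpart)       # recursive partitioning trees
library(rpart.plot)  # tree plotting (assumed helper; the article itself mentions rpart and ctree)

# Assumed data frame 'sales.df' with a numeric Sales column
# 1. Create a categorical variable on Sales and append it to the dataset
sales.df$High <- as.factor(ifelse(sales.df$Sales > 8, "Yes", "No"))

# 2. Split the data into training and testing datasets using a random sample
set.seed(123)
idx   <- sample(1:nrow(sales.df), size = round(0.7 * nrow(sales.df)))
train <- sales.df[idx, ]
test  <- sales.df[-idx, ]

# 3. Fit the tree on the training data and measure the error on the test data
fit  <- rpart(High ~ . - Sales, data = train, method = "class")
pred <- predict(fit, newdata = test, type = "class")
mean(pred != test$High)   # misclassification error on the test set

# 4. Prune the tree at the cp value with the lowest cross-validated error and plot it
best.cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best.cp)
rpart.plot(pruned)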

R ─ Classification and Regression Trees

Packt
22 Sep 2015
16 min read
"The classifiers most likely to be the best are the random forest (RF) versions, the best of which (implemented in R and accessed via caret), achieves 94.1 percent of the maximum accuracy overcoming 90 percent in the 84.3 percent of the data sets."                                                                          – Fernández-Delgado et al (2014) "You can't see the forest for the trees!"                                                                                                     – An old saying (For more resources related to this topic, see here.) In this article by Cory Lesmeister, the author of Mastering Machine Learning with R, the first item of discussion is the basic decision tree, which is both simple to build and understand. However, the single decision tree method does not perform as well as the other methods such as support vector machines or neural networks. Therefore, we will discuss the creation of multiple, sometimes hundreds of, different trees with their individual results combined, leading to a single overall prediction. The first quote written above is from Fernández-Delgado et al in the Journal of Machine Learning Research and is meant to set the stage that the techniques in this article are quite powerful, particularly when used for the classification problems. Certainly, they are not always the best solution, but they do provide a good starting point. Regression trees For an understanding of the tree-based methods, it is probably easier to start with a quantitative outcome and then move on to how it works on a classification problem. The essence of a tree is that the features are partitioned, starting with the first split that improves the residual sum of squares the most. These binary splits continue until the termination of the tree. Each subsequent split/partition is not done on the entire dataset but only on the portion of the prior split that it falls under. This top-down process is referred as recursive partitioning. It is also a process that is greedy, a term you may stumble on in reading about the machine learning methods. Greedy means that in each split in the process, the algorithm looks for the greatest reduction in the residual sum of squares without a regard to how well it will perform in the later partitions. The result is that you may end up with a full tree of unnecessary branches, leading to a low bias but high variance. To control this effect, you need to appropriately prune the tree to an optimal size after building a full tree. The following figure provides a visual of the technique in action. The data is hypothetical with 30 observations, a response ranging from 1 to 10, and two predictor features, both ranging in value from 0 to 10 named X1 and X2. The tree has three splits that lead to four terminal nodes. Each split is basically an if or then statement or uses an R syntax, ifelse(). In the first split, if X1 < 3.5, then the response is split into 4 observations with an average value of 2.4 and the remaining 26 observations. This left branch of 4 observations is a terminal node as any further splits would not substantially improve the residual sum of squares. The predicted value for the 4 observations in that partition of the tree becomes the average. The next split is at X2 < 4 and finally X1 < 7.5. An advantage of this method is that it can handle the highly nonlinear relationships; but can you see a couple of potential problems? The first issue is that an observation is given the average of the terminal node that it falls under. 
This can hurt the overall predictive performance (high bias). Conversely, if you keep partitioning the data further and further to achieve a low bias, high variance can become an issue. As with the other methods, you can use cross-validation to select the appropriate tree size.

Regression tree with 3 splits and 4 terminal nodes and the corresponding node average and number of observations.

Classification trees

Classification trees operate under the same principle as regression trees, except that the splits are not determined by the residual sum of squares but by an error rate. The error rate used is not what you would expect, where the calculation is simply the misclassified observations divided by the total observations. As it turns out, when it comes to tree splitting, a misclassification rate by itself may lead to a situation where you can gain information with a further split but not improve the misclassification rate. Let's look at an example. Suppose we have a node, let's call it N0, where you have 7 observations labeled No and 3 observations labeled Yes, and we say that the misclassification rate is 30 percent. With this in mind, let's calculate a common alternative error measure called the Gini index. The formula for a single node Gini index is as follows: Gini = 1 – (probability of Class 1)^2 – (probability of Class 2)^2. For N0, the Gini is 1 - (.7)^2 - (.3)^2, which is equal to 0.42, versus the misclassification rate of 30 percent. Taking this example further, we will now create an N1 node with 3 observations of Class 1 and none of Class 2, along with N2, which has 4 observations from Class 1 and 3 from Class 2. Now, the overall misclassification rate for this branch of the tree is still 30 percent, but look at the following to see how the overall Gini index has improved: Gini(N1) = 1 – (3/3)^2 – (0/3)^2 = 0. Gini(N2) = 1 – (4/7)^2 – (3/7)^2 = 0.49. The new Gini index = (proportion of N1 x Gini(N1)) + (proportion of N2 x Gini(N2)), which is equal to (.3 x 0) + (.7 x 0.49), or 0.343. By doing a split on a surrogate error rate, we actually improved our model impurity by reducing it from 0.42 to 0.343, whereas the misclassification rate did not change. This is the methodology used by the rpart package.

Random forest

To greatly improve our model's predictive ability, we can produce numerous trees and combine the results. The random forest technique does this by applying two different tricks in the model development. The first is the use of bootstrap aggregation, or bagging, as it is called. In bagging, an individual tree is built on a sample of the dataset, roughly two-thirds of the total observations. It is important to note that the remaining one-third is referred to as Out of Bag (OOB). This is repeated dozens or hundreds of times, and the results are averaged. Each of these trees is grown and not pruned based on any error measure, which means that the variance of each of these individual trees is high. However, by averaging the results, you can reduce the variance without increasing the bias. The next thing that the random forest brings to the table is that, concurrently with the random sample of the data, it also takes a random sample of the input features at each split. In the randomForest package, we will use the default number of randomly sampled predictors, which is the square root of the total number of predictors for classification problems and the total number of predictors divided by 3 for regression. The number of predictors that the algorithm randomly chooses at each split can be changed via the model tuning process.
By doing this random sampling of the features at each split and incorporating it into the methodology, you mitigate the effect of a highly correlated predictor from becoming the main driver in all of your bootstrapped trees, which would prevent you from reducing the variance that you hoped to achieve with bagging. Averaging these trees, which are less correlated with each other than they would be if you only performed bagging, makes the result more generalizable and more robust to outliers.

Gradient boosting

The boosting methods can become extremely complicated to learn and understand, but you should keep in mind what is fundamentally happening behind the curtain. The main idea is to build an initial model of some kind (linear, spline, tree, and so on), called the base learner, examine the residuals, and fit a model based on these residuals around the so-called loss function. A loss function is merely the function that measures the discrepancy between the model and the desired prediction, for example, squared error for regression or the logistic function for classification. The process continues until it reaches some specified stopping criterion. This is like the student who takes a practice exam and gets 30 out of 100 questions wrong and, as a result, studies only those 30 questions that were missed. On the next practice exam, they get 10 of these 30 wrong, and so they focus only on those 10 questions, and so on. If you would like to explore the theory behind this further, a great resource for you is available in Frontiers in Neurorobotics, Gradient boosting machines, a tutorial, Natekin A., Knoll A., (2013), at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3885826/. As previously mentioned, boosting can be applied to many different base learners, but here we will only focus on the specifics of tree-based learning. Each tree iteration is small, and we will determine how small it is with one of the tuning parameters, referred to as interaction depth. In fact, it may be as small as one split, which is referred to as a stump. Trees are sequentially fit to the residuals according to the loss function up to the number of trees that we specified (our stopping criterion). There is another tuning parameter that we will need to identify, and that is shrinkage. You can think of shrinkage as the rate at which your model is learning generally and, specifically, as the contribution of each tree or stump to the model. This learning rate acts as a regularization parameter. The other thing about our boosting algorithm is that it is stochastic, meaning that it adds randomness by taking a random sample of our data at each tree. Introducing some randomness to a boosted model usually improves the accuracy and speed and reduces overfitting (Friedman 2002). As you may have guessed, tuning these parameters can be quite a challenge. These parameters can interact with each other, and if you just tinker with one without considering the others, your model may actually perform worse. The caret package will help us in this endeavor (an illustrative tuning grid is sketched below, just before we turn to the regression tree).

Business case

The overall business objective in this situation is to see if we can improve the predictive ability for some of the cases. For regression, we will visit the prostate cancer data. For classification purposes, we will utilize both the breast cancer biopsy data and the Pima Indian Diabetes data. Both random forests and boosting will be applied to all three datasets. The simple tree method will be used only on the breast and prostate cancer sets.
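To make the boosting tuning discussion above concrete, here is a minimal sketch of how a grid of the three parameters could be tuned with caret's train function and the gbm method. The grid values, the n.minobsinnode setting, and the placeholder data frame train.df with its class outcome are illustrative assumptions, not the exact settings used later in the chapter:

library(caret)
library(gbm)

# Candidate values for the tuning parameters discussed above: the number of trees,
# the interaction depth, and the shrinkage (learning rate)
grid <- expand.grid(n.trees = c(100, 300, 500),
                    interaction.depth = c(1, 2, 3),
                    shrinkage = c(0.01, 0.1),
                    n.minobsinnode = 10)

ctrl <- trainControl(method = "cv", number = 5)

# 'train.df' and its 'class' outcome are placeholders for one of the datasets
set.seed(123)
gbm.tune <- train(class ~ ., data = train.df,
                  method = "gbm",
                  trControl = ctrl,
                  tuneGrid = grid,
                  verbose = FALSE)
gbm.tune$bestTune   # the best combination found by cross-validation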
Business case

The overall business objective in this situation is to see if we can improve the predictive ability for some of the cases. For regression, we will revisit the prostate cancer data. For classification purposes, we will utilize both the breast cancer biopsy data and the Pima Indian diabetes data. Both random forests and boosting will be applied to all three datasets. The simple tree method will be used only on the breast and prostate cancer datasets.

Regression tree

We will jump right into the prostate dataset, but first let's load the necessary R packages, as follows:

> library(rpart) #classification and regression trees
> library(partykit) #tree plots
> library(MASS) #breast cancer and Pima Indian diabetes data
> library(ElemStatLearn) #prostate data
> library(randomForest) #random forests
> library(gbm) #gradient boosting
> library(caret) #tuning hyper-parameters

First, we will do regression on the prostate data. This involves calling the dataset, coding the Gleason score as an indicator variable using the ifelse() function, and creating the training and test sets. The training set will be pros.train and the test set will be pros.test, as follows:

> data(prostate)
> prostate$gleason = ifelse(prostate$gleason == 6, 0, 1)
> pros.train = subset(prostate, train==TRUE)[,1:9]
> pros.test = subset(prostate, train==FALSE)[,1:9]

To build a regression tree on the training data, we will use the rpart() function from the rpart package. The syntax is quite similar to what we used in the other modeling techniques:

> tree.pros <- rpart(lpsa~., data=pros.train)

We can print this object's cptable and examine the error per split in order to determine the optimal number of splits in the tree, as follows:

> print(tree.pros$cptable)
          CP nsplit rel error    xerror      xstd
1 0.35852251      0 1.0000000 1.0364016 0.1822698
2 0.12295687      1 0.6414775 0.8395071 0.1214181
3 0.11639953      2 0.5185206 0.7255295 0.1015424
4 0.05350873      3 0.4021211 0.7608289 0.1109777
5 0.01032838      4 0.3486124 0.6911426 0.1061507
6 0.01000000      5 0.3382840 0.7102030 0.1093327

This is a very important table to analyze. The first column, labeled CP, is the cost complexity parameter. The second column, nsplit, is the number of splits in the tree. The rel error column stands for relative error and is the residual sum of squares for that number of splits divided by the residual sum of squares for no splits (RSS(k)/RSS(0)). Both xerror and xstd are based on a ten-fold cross-validation, with xerror being the average error and xstd the standard deviation of the cross-validation process. We can see that four splits produced the lowest error under cross-validation, while five splits produced the lowest error on the full dataset. You can examine this using the plotcp() function, as follows:

> plotcp(tree.pros)

The resulting plot shows the relative error by tree size, with the corresponding error bars. The horizontal line on the plot is the upper limit of the lowest standard error. Selecting tree size 5, which is four splits, we can build a new tree object in which xerror is minimized by pruning our tree accordingly. To do this, we first create an object for the cp associated with the pruned tree from the table, and then the prune() function handles the rest, as follows:

> cp = min(tree.pros$cptable[5,]) #the CP value is the smallest entry in row 5
> prune.tree.pros <- prune(tree.pros, cp = cp)

With this done, you can plot and compare the full and pruned trees. The tree plots produced by the partykit package are much better than those produced by the party package. You can simply use the as.party() function as a wrapper in the plot() function:

> plot(as.party(tree.pros))
> plot(as.party(prune.tree.pros))

Note that the splits are exactly the same in the two trees, with the exception of the last split, which includes the age variable in the full tree. Interestingly, both the first and second splits in the tree are related to the log of cancer volume (lcavol). These plots are quite informative, as they show the splits, nodes, observations per node, and box plots of the outcome that we are trying to predict.
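One brief aside before moving on: instead of picking row 5 of the cptable by eye, you can select the cp associated with the lowest cross-validated error programmatically. This is just an alternative sketch of the selection step we performed above, using the column names from the printed cptable:

> opt <- which.min(tree.pros$cptable[, "xerror"]) #row with the lowest cross-validated error
> cp.opt <- tree.pros$cptable[opt, "CP"]          #the same cp value we used above, 0.01032838
> prune(tree.pros, cp = cp.opt)                   #reproduces the pruned tree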
Let's see how well the pruned tree performs on the test data. We will create an object of predicted values using the predict() function on the test data, calculate the residuals as the predicted values minus the actual values, and finally compute the mean of the squared errors, as follows:

> party.pros.test <- predict(prune.tree.pros, newdata=pros.test)
> rpart.resid = party.pros.test - pros.test$lpsa #calculate residuals
> mean(rpart.resid^2) #calculate MSE
[1] 0.5267748

One can look at the tree plots that we produced and easily explain the primary drivers behind the response. As mentioned in the introduction, trees are easy to interpret and explain, which, in many cases, may be more important than accuracy.

Classification tree

For the classification problem, we will prepare the breast cancer data. After loading the data, you will delete the patient ID, rename the features, eliminate the few missing values, and then create the training and test datasets, as follows:

> data(biopsy)
> biopsy <- biopsy[,-1] #delete ID
> names(biopsy) = c("thick", "u.size", "u.shape", "adhsn", "s.size", "nucl", "chrom", "n.nuc", "mit", "class") #change the feature names
> biopsy.v2 = na.omit(biopsy) #delete the observations with missing values
> set.seed(123) #random number generator
> ind = sample(2, nrow(biopsy.v2), replace=TRUE, prob=c(0.7, 0.3))
> biop.train = biopsy.v2[ind==1,] #the training data set
> biop.test = biopsy.v2[ind==2,] #the test data set

With the data set up appropriately, we will use the same syntax style for the classification problem as we did previously for the regression problem, but before creating a classification tree, we need to ensure that the outcome is a factor, which can be verified using the str() function, as follows:

> str(biop.test[,10])
 Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 2 1 1 ...

First, we will create the tree:

> set.seed(123)
> tree.biop <- rpart(class~., data=biop.train)

Then, examine the cptable for the optimal number of splits:

> print(tree.biop$cptable)
          CP nsplit rel error    xerror       xstd
1 0.79651163      0 1.0000000 1.0000000 0.06086254
2 0.07558140      1 0.2034884 0.2674419 0.03746996
3 0.01162791      2 0.1279070 0.1453488 0.02829278
4 0.01000000      3 0.1162791 0.1744186 0.03082013

The cross-validation error is at a minimum with only two splits (row 3). We can now prune the tree, plot the full and pruned trees, and see how the pruned tree performs on the test set, as follows:

> cp = min(tree.biop$cptable[3,]) #again, the CP value is the smallest entry in the row
> prune.tree.biop <- prune(tree.biop, cp = cp)
> plot(as.party(tree.biop))
> plot(as.party(prune.tree.biop))

An examination of the tree plots shows that the uniformity of the cell size is the first split, followed by bare nuclei. The full tree had an additional split on cell thickness. We can predict the test observations by using type="class" in the predict() function:

> rparty.test <- predict(prune.tree.biop, newdata=biop.test, type="class")
> table(rparty.test, biop.test$class)
rparty.test benign malignant
  benign       136         3
  malignant      6        64
> (136+64)/209
[1] 0.9569378

The basic tree with just two splits gets us almost 96 percent accuracy. This still falls a bit short, but it should encourage us to believe that we can improve on it with the upcoming methods, starting with random forests.
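Since the caret package is already loaded, you can also get the accuracy along with sensitivity, specificity, and other statistics in a single call, which gives us a convenient baseline to compare against the ensemble methods. A minimal sketch using the objects created above (choosing malignant as the positive class is our own convention here):

> confusionMatrix(rparty.test, biop.test$class, positive = "malignant") #treat malignant as the positive class

The reported accuracy will match our hand calculation of roughly 96 percent.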
Summary

In this article, we learned about both the power and the limitations of tree-based learning methods for classification and regression problems. To improve on their predictive ability, we have the tools of random forests and gradient boosted trees at our disposal.