
How-To Tutorials - Data


First Principle and a Useful Way to Think

Packt
08 Oct 2015
8 min read
In this article by Timothy Washington, author of the book Clojure for Finance, we will cover the following topics:

- Modeling stock price activity
- Function evaluation
- First-class functions
- Lazy evaluation
- Basic Clojure functions and immutability
- Namespace modifications and creating our first function

Modeling the stock price activity

There are many types of banks. Commercial entities (large stores, parking areas, hotels, and so on) that collect and retain credit card information are either quasi banks, or farm out credit operations to bank-like processing companies. There are the more well-known consumer banks, which accept demand deposits from the public. There is also a range of other banks, such as commercial banks, insurance companies and trusts, credit unions, and, in our case, investment banks.

As promised, this article will slowly build up a set of lagging price indicators that follow a moving stock price time series. In order to do that, I think it's useful to touch on stock markets and to crudely model stock price activity. A stock (or equity) market is a collection of buyers and sellers trading economic assets (usually companies). The stock (or shares) of those companies can be equities listed on an exchange (the New York Stock Exchange, the London Stock Exchange, and others), or may be traded privately. In this exercise, we will do the following:

- Crudely model stock price movement, which will give us a test bed for writing our lagging price indicators
- Introduce some basic features of the Clojure language

Function evaluation

The Clojure website has a cheatsheet (http://clojure.org/cheatsheet) with all of the language's core functions. The first function we'll look at is rand, a function that randomly gives you a number within a given range. So, in your edgar/ project, launch a repl with the lein repl shell command. After a few seconds, you will enter the repl (Read-Eval-Print-Loop). Again, Clojure functions are executed by being placed in the first position of a list, with the function's arguments placed directly afterwards:

- In your repl, evaluate (rand 35), (rand 99), (rand 43.21), or any number you fancy.
- Run it many times to see that you get a different floating point number between 0 and the upper bound you provided.

First-class functions

The next functions we'll look at are repeatedly and fn. repeatedly is a function that takes another function and returns an infinite (or length n, if supplied) lazy sequence of calls to the latter function. This is our first encounter with a function that can take another function. We'll also encounter functions that return other functions. Described as first-class functions, this falls out of lambda calculus and is one of the central features of functional programming. As such, we need to wrap our previous (rand 35) call in another function. fn is one of Clojure's core functions and produces an anonymous, unnamed function. We can now supply this function to repeatedly. In your repl, if you evaluate (take 25 (repeatedly (fn [] (rand 35)))), you should see a long list of floating point numbers with the list's tail elided.

Lazy evaluation

We only took the first 25 elements of the (repeatedly (fn [] (rand 35))) result list, because the list (actually a lazy sequence) is infinite. Lazy evaluation (or laziness) is a common feature in functional programming languages.
Being infinite, Clojure chooses to delay evaluating most of the list until it's needed by some other function that pulls out some values. Laziness benefits us by increasing performance and letting us more easily construct control flow. We can avoid needless calculation, repeated evaluations, and potential error conditions in compound expressions. Let's try to pull out some values with the take function. take itself returns another lazy sequence, of the first n items of a collection. Evaluating (take 25 (repeatedly (fn [] (rand 35)))) will pull out the first 25 repeatedly calls to rand, each of which generates a float between 0 and 35.

Basic Clojure functions and immutability

There are many operations we can perform over our result list (or lazy sequence). One of the main approaches of functional programming is to take a data structure and perform operations on top of it to produce a new data structure, or some atomic result (a string, a number, and so on). This may sound inefficient at first, but most FP languages employ something called immutability to make these operations efficient. Immutable data structures are ones that cannot change once they've been created. This is feasible because most immutable, FP languages use some kind of structural data sharing between an original and a modified version. The idea is that if we evaluate (conj [1 2 3] 4), the resulting [1 2 3 4] vector shares the original vector of [1 2 3]. The only additional resource that's allocated is for any novelty that's been introduced to the data structure (the 4). More detailed explanations of Clojure's persistent vectors are available elsewhere. Here are some of the basic functions to try:

- conj: This conjoins an element to a collection, and the collection decides where. So conjoining an element to a vector (conj [1 2 3] 4) versus conjoining an element to a list (conj '(1 2 3) 4) yields different results. Try it in your repl.
- map: This passes a function over one or many lists, yielding another list. (map inc [1 2 3]) increments each element by 1.
- reduce (or left fold): This passes a function over each element, accumulating one result. (reduce + (take 100 (repeatedly (fn [] (rand 35))))) sums the list.
- filter: This constrains the input by some condition.
- >=: This is a conditional function, which tests whether the first argument is greater than or equal to the second argument. Try (>= 4 9) and (>= 9 1).
- fn: This is a function that creates a function. This unnamed or anonymous function can have any instructions you choose to put in there.

So, if we only want numbers above 12, we can put that assertion in a predicate function. Try entering the following expression into your repl:

    (take 25 (filter (fn [x] (>= x 12)) (repeatedly (fn [] (rand 35)))))

Modifying the namespace and creating our first function

We now have the basis for creating a function. It will return a lazy, infinite sequence of floating point numbers within an upper and lower bound. defn is a Clojure function, which takes an anonymous function and binds a name to it in a given namespace. A Clojure namespace is an organizational tool for mapping human-readable names to things like functions, named data structures, and such. Here, we're going to bind our function to the name generate-prices in our current namespace. You'll notice that our function is starting to span multiple lines. This will be a good time to author the code in your text editor of choice. I'll be using Emacs. Open your text editor and add the following code to the file src/edgar/core.clj. Make sure that (ns edgar.core) is at the top of that file.
After adding the following code, you can either restart the repl or evaluate (load "edgar/core"), which uses the load function to load the Clojure code in src/edgar/core.clj:

    (defn generate-prices [lower-bound upper-bound]
      (filter (fn [x] (>= x lower-bound))
              (repeatedly (fn [] (rand upper-bound)))))

The Read-Eval-Print-Loop

In our repl, we can pull in code from various namespaces with the require function. This applies to the src/edgar/core.clj file we've just edited. That code will be in the edgar.core namespace:

- In your repl, evaluate (require '[edgar.core :as c]). c is just a handy alias we can use instead of the long name.
- You can then generate random prices within an upper and lower bound. Take the first 10 of them like this: (take 10 (c/generate-prices 12 35)).

You should see results akin to the following output. All elements should be within the range of 12 to 35:

    (29.60706184716407 12.507593971664075 19.79939384292759 31.322074615579716 19.737852534147326 25.134649707849572 19.952195022152488 12.94569843904663 23.618693004455086 14.695872710062428)

There's a subtle abstraction in the preceding code that deserves attention. (require '[edgar.core :as c]) introduces the quote symbol. ' is the reader shorthand for the quote function, so the equivalent invocation would be (require (quote [edgar.core :as c])). Quoting a form tells the Clojure reader not to evaluate the subsequent expression (or form). So, evaluating '(a b c) returns a list of three symbols without trying to evaluate a. Even though those symbols haven't yet been assigned, that's okay, because that expression (or form) has not yet been evaluated.

But that begs a larger question: what is the reader? Clojure (and all Lisps) are what's known as homoiconic. This means that Clojure code is also data, and data can be directly output and evaluated as code. The reader is the thing that parses our src/edgar/core.clj file (or (+ 1 1) input from the repl prompt) and produces the data structures that are evaluated. read and eval are the two essential processes by which Clojure code runs. The evaluation result is printed (or output), usually to the standard output device. Then we loop the process back to the read function. So, when the repl reads your src/edgar/core.clj file, it's directly transforming that text representation into data and evaluating it. A few things fall out of that. For example, it becomes simpler for Clojure programs to directly read, transform, and write out other Clojure programs. The implications of that will become clearer when we look at macros. But for now, know that there are ways to modify or delay the evaluation process, in this case by quoting a form.

Summary

In this article, we learned about basic features of the Clojure language and how to model stock price activity. Besides these, we also learned about function evaluation, first-class functions, lazy evaluation, namespace modifications, and creating our first function.


Introduction to Data Analysis and Libraries

Packt
07 Oct 2015
13 min read
In this article by Martin Czygan and Phuong Vothihong, the authors of the book Getting Started with Python Data Analysis, we introduce data analysis and the libraries that support it. Data is raw information that can exist in any form, usable or not. We can easily get data everywhere in our life; for example, the price of gold today is $1,158 per ounce. On its own, this has no meaning beyond describing the gold price, which also shows that data is useful only in context. When we relate and connect data, information appears and allows us to expand our knowledge beyond the range of our senses. When we possess gold price data gathered over time, one piece of information we might derive is that the price has continuously risen from $1,152 to $1,158 over three days. This information is used by someone who tracks gold prices. Knowledge helps people to create value in their lives and work. It is based on information that is organized, synthesized, or summarized to enhance comprehension, awareness, or understanding, and it represents a state or potential for action and decisions. An observation such as "when the gold price increases continuously for three days, it tends to decrease slightly on the next day" is useful knowledge. The steps from data to knowledge form what we call the data analysis process, and we will introduce it in the next section.

In this article, we will cover the following topics:

- Data analysis and its process
- An overview of libraries in data analysis using different programming languages
- Common Python data analysis libraries

Data analysis and process

Data is getting bigger and more diversified every day. Therefore, analyzing and processing data to advance human knowledge or to create value are big challenges. To tackle these challenges, you will need domain knowledge and a variety of skills, drawing from areas such as computer science, artificial intelligence (AI) and machine learning (ML), statistics and mathematics, and the knowledge domain itself. Let us go through data analysis and its domain knowledge:

- Computer science: We need this knowledge to provide abstractions for efficient data processing. Basic Python programming experience is required. We will introduce the Python libraries used in data analysis.
- Artificial intelligence and machine learning: If computer science knowledge helps us to program data analysis tools, artificial intelligence and machine learning help us to model the data and learn from it in order to build smart products.
- Statistics and mathematics: We cannot extract useful information from raw data if we do not use statistical techniques or mathematical functions.
- Knowledge domain: Besides technology and general techniques, it is important to have insight into the specific domain. What do the data fields mean? What data do we need to collect?

Based on this expertise, we explore and analyze raw data by applying the above techniques, step by step. Data analysis is a process composed of the following steps:

- Data requirements: We have to define what kind of data will be collected based on the requirements or problem analysis. For example, if we want to detect a user's behavior while reading news on the internet, we should be aware of visited article links, date and time, article categories, and the user's time spent on different pages.
- Data collection: Data can be collected from a variety of sources: mobile, personal computer, camera, or recording device.
It may also be obtained in different ways: communication, events, and interactions between person and person, person and device, or device and device. Data appears whenever and wherever in the world. The problem is: how can we find and gather it to solve our problem? This is the mission of this step.
- Data processing: Data that is initially obtained must be processed or organized for analysis. This process is performance-sensitive: how fast can we create, insert, update, or query data? When building a real product that has to process big data, we should consider this step carefully. What kind of database should we use to store the data? What kind of data structure is suitable for our purposes, such as analysis, statistics, or visualization?
- Data cleaning: After being processed and organized, the data may still contain duplicates or errors. Therefore, we need a cleaning step to reduce those situations and increase the quality of the results in the following steps. Common tasks include record matching, deduplication, or column segmentation. Depending on the type of data, we can apply several types of data cleaning. For example, a user's history on a news website might contain a lot of duplicate rows, because the user might have refreshed certain pages many times. For our specific issue, these rows might not carry any meaning when we explore the user's behavior, so we should remove them before saving the data to our database. Another situation we may encounter is click fraud on news, where someone just wants to improve their website ranking or sabotage the website. In this case, the data will not help us to explore a user's behavior. We can use thresholds to check whether a visit-page event comes from a real person or from malicious software.
- Exploratory data analysis: Now, we can start to analyze data through a variety of techniques referred to as exploratory data analysis. We may detect additional problems in data cleaning or discover requests for further data. Therefore, these steps may be iterative and repeated throughout the whole data analysis process. Data visualization techniques are also used to examine the data in graphs or charts. Visualization often facilitates the understanding of data sets, especially if they are large or high-dimensional.
- Modelling and algorithms: A lot of mathematical formulas and algorithms may be applied to detect or predict useful knowledge from the raw data. For example, we can use similarity measures to cluster users who have exhibited similar news-reading behavior and recommend articles of interest to them next time. Alternatively, we can detect users' gender based on their news-reading behavior by applying classification models such as Support Vector Machines (SVM) or linear regression. Depending on the problem, we may use different algorithms to get an acceptable result. It can take a lot of time to evaluate the accuracy of the algorithms and to choose the best one to implement for a certain product.
- Data product: The goal of this step is to build data products that receive data input and generate output according to the problem requirements. We will apply computer science knowledge to implement our selected algorithms as well as manage the data storage.
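The following is a minimal, hedged sketch of how a few of these steps might look in Python with pandas; the file name and column names (news_visits.csv, time_on_page, category) are made up for illustration and are not part of any dataset referenced in this article:

    import pandas as pd

    # Hypothetical news-visit log; file and column names are illustrative only.
    visits = pd.read_csv("news_visits.csv")                     # data collection / processing
    visits = visits.drop_duplicates()                           # data cleaning: drop refreshed pages
    visits = visits[visits["time_on_page"] > 1]                 # crude threshold against click fraud
    print(visits.describe())                                    # exploratory data analysis
    print(visits.groupby("category")["time_on_page"].mean())    # a first, simple model of behaviour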
Overview of libraries in data analysis

There are numerous data analysis libraries that help us to process and analyze data. They use different programming languages, and each has its own advantages and disadvantages for solving various data analysis problems. Now, we introduce some common libraries that may be useful for you. They should give you an overview of the libraries in the field; however, the rest of this article focuses on Python-based libraries.

Some of the libraries that use the Java language for data analysis are as follows:

- Weka: This is the library that I became familiar with the first time I learned about data analysis. It has a graphical user interface that allows you to run experiments on a small dataset. This is great if you want to get a feel for what is possible in the data processing space. However, if you build a complex product, I think it is not the best choice because of its performance: sketchy API design, non-optimal algorithms, and little documentation (http://www.cs.waikato.ac.nz/ml/weka/).
- Mallet: This is another Java library that is used for statistical natural language processing, document classification, clustering, topic modelling, information extraction, and other machine learning applications on text. There is an add-on package to Mallet, called GRMM, that contains support for inference in general graphical models and training of Conditional Random Fields (CRF) with arbitrary graphical structure. In my experience, the library's performance as well as its algorithms are better than Weka's. However, its only focus is on text processing problems. The reference page is at http://mallet.cs.umass.edu/.
- Mahout: This is Apache's machine learning framework built on top of Hadoop; its goal is to build a scalable machine learning library. It looks promising, but comes with all the baggage and overhead of Hadoop. The homepage is at http://mahout.apache.org/.
- Spark: This is a relatively new Apache project, supposedly up to a hundred times faster than Hadoop. It is also a scalable library that consists of common machine learning algorithms and utilities. Development can be done in Python as well as in any JVM language. The reference page is at https://spark.apache.org/docs/1.5.0/mllib-guide.html.

Here are a few libraries that are implemented in C++:

- Vowpal Wabbit: This library is a fast out-of-core learning system sponsored by Microsoft Research and (previously) Yahoo! Research. It has been used to learn a tera-feature (10^12) dataset on 1000 nodes in one hour. More information can be found in the publication at http://arxiv.org/abs/1110.4198.
- MultiBoost: This package is a multiclass, multilabel, and multitask classification boosting software implemented in C++. If you use this software, you should refer to the paper published in 2012 in the Journal of Machine Learning Research, MultiBoost: A Multi-purpose Boosting Package, by D. Benbouzid, R. Busa-Fekete, N. Casagrande, F.-D. Collin, and B. Kégl.
- MLpack: This is also a C++ machine learning library, developed by the Fundamental Algorithmic and Statistical Tools Laboratory (FASTLab) at Georgia Tech. It focuses on scalability, speed, and ease of use, and was presented at the BigLearning workshop of NIPS 2011. Its homepage is at http://www.mlpack.org/about.html.
- Caffe: The last C++ library we want to mention is Caffe. This is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and community contributors. You can find more information about it at http://caffe.berkeleyvision.org/.

Other libraries for data processing and analysis are as follows:

- Statsmodels: This is a great Python library for statistical modelling and is mainly used for predictive and exploratory analysis.
- Modular toolkit for data processing (MDP): This is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures (http://mdp-toolkit.sourceforge.net/index.html).
- Orange: This is an open source data visualization and analysis tool for novices and experts. It is packed with features for data analysis and has add-ons for bioinformatics and text mining. It contains an implementation of self-organizing maps, which sets it apart from the other projects (http://orange.biolab.si/).
- Mirador: This is a tool for the visual exploration of complex datasets, supporting Mac and Windows. It enables users to discover correlation patterns and derive new hypotheses from data.
- RapidMiner: This is another GUI-based tool for data mining, machine learning, and predictive analysis (https://rapidminer.com/).
- Theano: This bridges the gap between Python and lower-level languages. Theano gives very significant performance gains, particularly for large matrix operations, and is therefore a good choice for deep learning models. However, it is not easy to debug because of the additional compilation layer.
- Natural Language Toolkit (NLTK): This is written in Python with very unique and salient features.

We could not list all libraries for data analysis here, but the ones above are more than enough to keep you busy learning and building data analysis applications.

Python libraries in data analysis

Python is a multi-platform, general-purpose programming language that can run on Windows, Linux/Unix, and Mac OS X, and has been ported to the Java and .NET virtual machines as well. It has a powerful standard library. In addition, it has many libraries for data analysis: Pylearn2, Hebel, Pybrain, Pattern, MontePython, and MILK. We will cover some common Python data analysis libraries such as NumPy, Pandas, Matplotlib, PyMongo, and scikit-learn. Now, to help you get started, I will briefly present an overview of each library for those who are less familiar with the scientific Python stack.

NumPy

One of the fundamental packages used for scientific computing with Python is NumPy. Among other things, it contains the following:

- A powerful N-dimensional array object
- Sophisticated (broadcasting) functions for performing array computations
- Tools for integrating C/C++ and Fortran code
- Useful linear algebra operations, Fourier transforms, and random number capabilities

Besides this, it can also be used as an efficient multidimensional container of generic data. Arbitrary data types can be defined and integrated with a wide variety of databases.
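As a quick, hedged taste of the array object and its vectorized operations, here is a small sketch; the numbers are made up, loosely echoing the gold-price example from the beginning of this article:

    import numpy as np

    # Three days of (illustrative) gold prices held in an N-dimensional array
    prices = np.array([1152.0, 1154.5, 1158.0])

    # Vectorized, element-wise computation of day-over-day returns
    returns = np.diff(prices) / prices[:-1]

    # Broadcasting: scale every element without writing an explicit loop
    prices_in_cents = prices * 100
    print(returns, prices_in_cents)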
Pandas

Pandas is a Python package that supports rich data structures and functions for analyzing data, and is developed by the PyData Development Team. It is focused on the improvement of Python's data libraries. Pandas consists of the following things:

- A set of labelled array data structures, the primary of which are Series, DataFrame, and Panel
- Index objects enabling both simple axis indexing and multilevel/hierarchical axis indexing
- An integrated group-by engine for aggregating and transforming datasets
- Date range generation and custom date offsets
- Input/output tools that load and save data from flat files or the PyTables/HDF5 format
- Memory-efficient versions of the standard data structures
- Moving window statistics and static and moving window linear/panel regression

Because of these features, Pandas is an ideal tool for systems that need complex data structures or high-performance time series functions, such as financial data analysis applications.

Matplotlib

Matplotlib is the single most used Python package for 2D graphics. It provides both a very quick way to visualize data from Python and publication-quality figures in many formats: line plots, contour plots, scatter plots, or Basemap plots. It comes with a set of default settings but allows customization of all kinds of properties, and in most cases the defaults are good enough to create a chart easily.

PyMongo

MongoDB is a type of NoSQL database. It is highly scalable, robust, and perfect to work with JavaScript-based web applications, because we can store data as JSON documents and use flexible schemas. PyMongo is a Python distribution containing tools for working with MongoDB. Many tools have also been written for working with PyMongo to add more features, such as MongoKit, Humongolus, MongoAlchemy, and Ming.

scikit-learn

scikit-learn is an open source machine learning library using the Python programming language. It supports various machine learning models, such as classification, regression, and clustering algorithms, interoperating with the Python numerical and scientific libraries NumPy and SciPy. The latest scikit-learn version is 0.16.1, published in April 2015.

Summary

In this article, there were three main points that we presented. Firstly, we figured out the relationship between raw data, information, and knowledge. Secondly, we discussed an overview of the data analysis process and its steps. Finally, we introduced a few common libraries that are useful for practical data analysis applications; among those, we will focus on Python libraries for data analysis.


Fingerprint detection using OpenCV 3

Packt
07 Oct 2015
11 min read
In this article by Joseph Howse, Quan Hua, Steven Puttemans, and Utkarsh Sinha, the authors of OpenCV Blueprints, we delve into the topic of fingerprint detection using OpenCV.

Fingerprint identification, how is it done?

We have already discussed the use of the first biometric, which is the face of the person trying to log in to the system. However, since we mentioned that using a single biometric can be quite risky, we suggest adding secondary biometric checks to the system, like the fingerprint of a person. There are many off-the-shelf fingerprint scanners that are quite cheap and return the scanned image. However, you will still have to write your own registration software for these scanners, which can be done using OpenCV. Examples of such fingerprint images can be found below.

Examples of a single individual's thumb fingerprint in different scanning positions

This dataset can be downloaded from the FVC2002 competition website released by the University of Bologna. The website (http://bias.csr.unibo.it/fvc2002/databases.asp) contains four databases of fingerprints available for public download, with the following structure:

- Four fingerprint capturing devices, DB1 - DB4
- For each device, the prints of 10 individuals are available
- For each person, 8 different positions of prints were recorded

We will use this publicly available dataset to build our system upon. We will focus on the first capturing device, using up to four fingerprints of each individual for training the system and making an average descriptor of the fingerprint. Then we will use the other four fingerprints to evaluate our system and make sure that the person is still recognized by our system. You could apply exactly the same approach on the data grabbed from the other devices if you would like to investigate the difference between a system that captures almost binary images and one that captures grayscale images. However, we will provide techniques for doing the binarization yourself.

Implement the approach in OpenCV 3

The complete fingerprint software for processing fingerprints derived from a fingerprint scanner can be found at https://github.com/OpenCVBlueprints/OpenCVBlueprints/tree/master/chapter_6/source_code/fingerprint/fingerprint_process/. In this subsection, we will describe how you can implement this approach in the OpenCV interface. We will start by grabbing the image from the fingerprint system and applying binarization. This will enable us to remove any undesired noise from the image as well as help us to improve the contrast between the skin and the wrinkled surface of the finger.

    // Start by reading in an image
    Mat input = imread("/data/fingerprints/image1.png", CV_LOAD_IMAGE_GRAYSCALE);
    // Binarize the image, through local thresholding
    Mat input_binary;
    threshold(input, input_binary, 0, 255, CV_THRESH_BINARY | CV_THRESH_OTSU);

The Otsu thresholding will automatically choose the best generic threshold for the image to obtain a good contrast between foreground and background information. This is because the image contains a bimodal distribution (which means that we have an image with a two-peak histogram) of pixel values. For such an image, we can approximately take a value in the middle of those peaks as the threshold value (for images which are not bimodal, binarization won't be accurate). Otsu allows us to avoid using a fixed threshold value, thus making the system more general for any capturing device. However, we do acknowledge that if you have only a single capturing device, then playing around with a fixed threshold value could result in a better image for that specific setup. The result of the thresholding can be seen below. In order to make thinning in the next skeletonization step as effective as possible, we need the inverse binary image.

Comparison between grayscale and binarized fingerprint image
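For readers who prefer OpenCV's Python bindings over C++, a hedged equivalent of the read-and-binarize step above might look as follows; it assumes the same image path as the C++ snippet and is only a sketch, not code from the book:

    import cv2

    # Read the fingerprint image as grayscale and binarize it with Otsu's method
    input_img = cv2.imread("/data/fingerprints/image1.png", cv2.IMREAD_GRAYSCALE)
    _, input_binary = cv2.threshold(input_img, 0, 255,
                                    cv2.THRESH_BINARY | cv2.THRESH_OTSU)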
Once we have a binary image, we are actually already set to calculate our feature points and feature point descriptors. However, in order to improve the process a bit more, we suggest skeletonizing the image. This will create more unique and stronger interest points. The following piece of code applies skeletonization on top of the binary image. The skeletonization is based on the Zhang-Suen line thinning approach. Special thanks to @bsdNoobz of the OpenCV Q&A forum, who supplied this iteration approach.

    #include <opencv2/imgproc.hpp>
    #include <opencv2/highgui.hpp>

    using namespace std;
    using namespace cv;

    // Perform a single thinning iteration, which is repeated until the skeletonization is finalized
    void thinningIteration(Mat& im, int iter)
    {
        Mat marker = Mat::zeros(im.size(), CV_8UC1);
        for (int i = 1; i < im.rows-1; i++)
        {
            for (int j = 1; j < im.cols-1; j++)
            {
                uchar p2 = im.at<uchar>(i-1, j);
                uchar p3 = im.at<uchar>(i-1, j+1);
                uchar p4 = im.at<uchar>(i, j+1);
                uchar p5 = im.at<uchar>(i+1, j+1);
                uchar p6 = im.at<uchar>(i+1, j);
                uchar p7 = im.at<uchar>(i+1, j-1);
                uchar p8 = im.at<uchar>(i, j-1);
                uchar p9 = im.at<uchar>(i-1, j-1);

                int A = (p2 == 0 && p3 == 1) + (p3 == 0 && p4 == 1) +
                        (p4 == 0 && p5 == 1) + (p5 == 0 && p6 == 1) +
                        (p6 == 0 && p7 == 1) + (p7 == 0 && p8 == 1) +
                        (p8 == 0 && p9 == 1) + (p9 == 0 && p2 == 1);
                int B = p2 + p3 + p4 + p5 + p6 + p7 + p8 + p9;
                int m1 = iter == 0 ? (p2 * p4 * p6) : (p2 * p4 * p8);
                int m2 = iter == 0 ? (p4 * p6 * p8) : (p2 * p6 * p8);

                if (A == 1 && (B >= 2 && B <= 6) && m1 == 0 && m2 == 0)
                    marker.at<uchar>(i,j) = 1;
            }
        }
        im &= ~marker;
    }

    // Function for thinning any given binary image within the range of 0-255.
    // If not, you should first make sure that your image has this range preset and configured!
    void thinning(Mat& im)
    {
        // Enforce the range to be in between 0 - 255
        im /= 255;
        Mat prev = Mat::zeros(im.size(), CV_8UC1);
        Mat diff;
        do {
            thinningIteration(im, 0);
            thinningIteration(im, 1);
            absdiff(im, prev, diff);
            im.copyTo(prev);
        } while (countNonZero(diff) > 0);
        im *= 255;
    }

The code above can then simply be applied to our previous steps by calling the thinning function on top of our previously generated binary image. The code for this is:

    // Apply thinning algorithm
    Mat input_thinned = input_binary.clone();
    thinning(input_thinned);

This will result in the following output:

Comparison between binarized and thinned fingerprint image using skeletonization techniques

When we have this skeleton image, the following step is to look for crossing points on the ridges of the fingerprint, which are then called minutiae points. We can do this with a keypoint detector that looks at a large change in local contrast, like the Harris corner detector. Since the Harris corner detector is able to detect both strong corners and edges, it is ideal for the fingerprint problem, where the most important minutiae are short edges and bifurcations, the positions where edges come together. More information about minutiae points and Harris corner detection can be found in the following publications:
"Toward reconstructing fingerprints from minutiae points." Defense and Security. International Society for Optics and Photonics, 2005. Harris, Chris, and Mike Stephens. "A combined corner and edge detector." Alvey vision conference. Vol. 15. 1988. Calling the Harris Corner operation on a skeletonized and binarized image in OpenCV is quite straightforward. The Harris corners are stored as positions corresponding in the image with their cornerness response value. If we want to detect points with a certain cornerness, than we should simply threshold the image. Mat harris_corners, harris_normalised; harris_corners = Mat::zeros(input_thinned.size(), CV_32FC1); cornerHarris(input_thinned, harris_corners, 2, 3, 0.04, BORDER_DEFAULT); normalize(harris_corners, harris_normalised, 0, 255, NORM_MINMAX, CV_32FC1, Mat()); We now have a map with all the available corner responses rescaled to the range of [0 255] and stored as float values. We can now manually define a threshold, that will generate a good amount of keypoints for our application. Playing around with this parameter could improve performance in other cases. This can be done by using the following code snippet: float threshold = 125.0; vector<KeyPoint> keypoints; Mat rescaled; convertScaleAbs(harris_normalised, rescaled); Mat harris_c(rescaled.rows, rescaled.cols, CV_8UC3); Mat in[] = { rescaled, rescaled, rescaled }; int from_to[] = { 0,0, 1,1, 2,2 }; mixChannels( in, 3, &harris_c, 1, from_to, 3 ); for(int x=0; x<harris_normalised.cols; x++){ for(int y=0; y<harris_normalised.rows; y++){ if ( (int)harris_normalised.at<float>(y, x) > threshold ){ // Draw or store the keypoint location here, just like you decide. In our case we will store the location of the keypoint circle(harris_c, Point(x, y), 5, Scalar(0,255,0), 1); circle(harris_c, Point(x, y), 1, Scalar(0,0,255), 1); keypoints.push_back( KeyPoint (x, y, 1) ); } } } Comparison between thinned fingerprint and Harris corner response, as well as the selected Harris corners. Now that we have a list of keypoints we will need to create some of formal descriptor of the local region around that keypoint to be able to uniquely identify it among other keypoints. Chapter 3, Recognizing facial expressions with machine learning, discusses in more detail the wide range of keypoints out there. In this article, we will mainly focus on the process. Feel free to adapt the interface with other keypoint detectors and descriptors out there, for better or for worse performance. Since we have an application where the orientation of the thumb can differ (since it is not a fixed position), we want a keypoint descriptor that is robust at handling these slight differences. One of the mostly used descriptors for that is the SIFT descriptor, which stands for scale invariant feature transform. However SIFT is not under a BSD license and can thus pose problems to use in commercial software. A good alternative in OpenCV is the ORB descriptor. In OpenCV you can implement it in the following way. Ptr<Feature2D> orb_descriptor = ORB::create(); Mat descriptors; orb_descriptor->compute(input_thinned, keypoints, descriptors); This will enable us to calculate only the descriptors using the ORB approach, since we already retrieved the location of the keypoints using the Harris corner approach. At this point we can retrieve a descriptor for each detected keypoint of any given fingerprint. The descriptors matrix will contain a row for each keypoint containing the representation. 
Let us now start from the case where we have only a single reference image for each fingerprint. In that case, we will have a database containing a set of feature descriptors for the training persons in the database. We then have a single new entry, consisting of multiple descriptors for the keypoints found at registration time. We now have to match these descriptors to the descriptors stored in the database to see which one has the best match. The simplest way is to perform brute-force matching using the Hamming distance criterion between descriptors of different keypoints.

    // Imagine we have a vector of single entry descriptors as a database
    // We will still need to fill those once we compare everything, by using the code snippets above
    vector<Mat> database_descriptors;
    Mat current_descriptors;
    // Create the matcher interface
    Ptr<DescriptorMatcher> matcher = DescriptorMatcher::create("BruteForce-Hamming");
    // Now loop over the database and start the matching
    vector< vector< DMatch > > all_matches;
    for(int entry=0; entry<database_descriptors.size(); entry++){
        vector< DMatch > matches;
        matcher->match(database_descriptors[entry], current_descriptors, matches);
        all_matches.push_back(matches);
    }

We now have all the matches stored as DMatch objects. This means that for each matching couple, we will have the original keypoint, the matched keypoint, and a floating point score between both matches, representing the distance between the matched points. The idea of finding the best match seems pretty straightforward. We take a look at the number of matches that have been returned by the matching process and weigh them by their Euclidean distance in order to add some certainty. We then look for the matching process that yielded the biggest score. This will be our best match and the match we want to return as the selected one from the database. If you want to avoid an imposter being assigned the best matching score, you can add a manual threshold on top of the scoring, so that matches that are not good enough are ignored. However, it is possible, and should be taken into consideration, that if you raise the threshold too high, people whose prints have changed a little will be rejected by the system, for example in the case where someone cuts their finger and thus changes their pattern too drastically.

Fingerprint matching process visualized

To summarize, we saw how to detect fingerprints and implement the approach using OpenCV 3.


Integrating Elasticsearch with the Hadoop ecosystem

Packt
07 Oct 2015
14 min read
In this article by Vishal Shukla, author of the book Elasticsearch for Hadoop, we will take a look at how ES-Hadoop can integrate with Pig and Spark with ease. Elasticsearch is great at getting insights into the indexed data. The Hadoop ecosystem does a great job in making Hadoop easily usable for different users by providing comfortable interfaces; Hive and Pig are two examples. Apart from these, Hadoop integrates well with other computing engines and platforms, such as Spark and Cascading.

Pigging out Elasticsearch

For many use cases, Pig is one of the easiest ways to fiddle around with the data in the Hadoop ecosystem. Pig wins when it comes to ease of use and simple syntax for designing data flow pipelines without getting into complex programming. Assuming that you know Pig, we will cover how to move the data to and from Elasticsearch. If you don't know Pig yet, never mind. You can still carry on with the steps, and by the end of the article, you will at least know how to use Pig to perform data ingestion and reading with Elasticsearch.

Setting up Apache Pig for Elasticsearch

Let's start by setting up Apache Pig. At the time of writing this article, the latest Pig version available is 0.15.0. You can use the following steps to set up this version:

1. First, download the Pig distribution using the following command:

    $ sudo wget -O /usr/local/pig.tar.gz http://mirrors.sonic.net/apache/pig/pig-0.15.0/pig-0.15.0.tar.gz

2. Then, extract Pig to the desired location and rename it to a convenient name:

    $ cd /usr/local
    $ sudo tar -xvf pig.tar.gz
    $ sudo mv pig-0.15.0 pig

3. Now, export the required environment variables by appending the following two lines to the /home/eshadoop/.bashrc file:

    export PIG_HOME=/usr/local/pig
    export PATH=$PATH:$PIG_HOME/bin

4. You can either log out and log back in to see the newly set environment variables, or source the environment configuration with the following command:

    $ source ~/.bashrc

5. Now, start the job history server daemon with the following command:

    $ mr-jobhistory-daemon.sh start historyserver

6. You should see the Pig console with the following command:

    $ pig
    grunt>

It's easy to forget to start the job history daemon once you restart your machine or VM. You may make this daemon run on startup, or you need to ensure this manually. Now we have Pig up and running. In order to use Pig with Elasticsearch, we must ensure that the ES-Hadoop JAR file is available in the Pig classpath. Let's take the ES-Hadoop JAR file and import it into HDFS using the following steps:

1. First, download the ES-Hadoop JAR used to develop the examples in this article, as shown in the following command:

    $ wget http://central.maven.org/maven2/org/elasticsearch/elasticsearch-hadoop/2.1.1/elasticsearch-hadoop-2.1.1.jar

2. Then, create a convenient location to keep library JARs in:

    $ sudo mkdir /opt/lib

3. Now, import the JAR into HDFS:

    $ hadoop fs -mkdir /lib
    $ hadoop fs -put elasticsearch-hadoop-2.1.1.jar /lib/elasticsearch-hadoop-2.1.1.jar

Throughout this article, we will use a crime dataset that is tailored from the open dataset provided at https://data.cityofchicago.org/. This tailored dataset can be downloaded from http://www.packtpub.com/support, where all the code files required for this article are available. Once you have downloaded the dataset, import it into HDFS at /ch07/crimes_dataset.csv.

Importing data to Elasticsearch

Let's import the crime dataset to Elasticsearch using Pig with ES-Hadoop.
ES-Hadoop provides the EsStorage class to use as the Pig storage. In order to use the EsStorage class, you need to have the ES-Hadoop JAR registered with Pig. You can register a JAR located in the local filesystem, HDFS, or other shared filesystems. The REGISTER command registers a JAR file that contains UDFs (user-defined functions) with Pig, as shown in the following code:

    grunt> REGISTER hdfs://localhost:9000/lib/elasticsearch-hadoop-2.1.1.jar;

Then, load the CSV data file as a relation with the following code:

    grunt> SOURCE = load '/ch07/crimes_dataset.csv' using PigStorage(',') as (id:chararray, caseNumber:chararray, date:datetime, block:chararray, iucr:chararray, primaryType:chararray, description:chararray, location:chararray, arrest:boolean, domestic:boolean, lat:double, lon:double);

This command reads the CSV fields and maps each token in the data to the respective field in the preceding command. The resulting relation, SOURCE, is a relation with the Bag data structure that contains multiple Tuples. Now, generate the target Pig relation with a structure that closely matches the target Elasticsearch index mapping, as shown in the following code:

    grunt> TARGET = foreach SOURCE generate id, caseNumber, date, block, iucr, primaryType, description, location, arrest, domestic, TOTUPLE(lon, lat) AS geoLocation;

Here, we need a nested object with the name geoLocation in the target Elasticsearch document. We can achieve this with a Tuple to represent the lat and lon fields. TOTUPLE() helps us to create this tuple. We then assigned the geoLocation alias to this tuple. Let's store the TARGET relation to the Elasticsearch index with the following code:

    grunt> STORE TARGET INTO 'esh_pig/crimes' USING org.elasticsearch.hadoop.pig.EsStorage('es.http.timeout = 5m', 'es.index.auto.create = true', 'es.mapping.names=arrest:isArrest, domestic:isDomestic', 'es.mapping.id=id');

We can specify the target index and type to store the indexed documents. The EsStorage class can accept multiple Elasticsearch configurations. es.mapping.names maps a Pig field name to the Elasticsearch document's field name. You can use Pig's field id to assign a custom _id value for the Elasticsearch document using the es.mapping.id option. Similarly, you can set the _ttl and _timestamp metadata fields as well.

Pig uses just one reducer in the default configuration. It is recommended to change this behavior to have a parallelism that matches the number of shards available, as shown in the following command:

    grunt> SET default_parallel 5;

Pig also combines the input splits, irrespective of their size. This makes it efficient for small files by reducing the number of mappers, but it will cause performance issues for large files. You can disable this behavior in the Pig script, as shown in the following command:

    grunt> SET pig.splitCombination FALSE;

Executing the preceding commands will create the Elasticsearch index and import the crime data documents. If you observe the created documents in Elasticsearch, you can see that the geoLocation value is an array in the [-87.74274476, 41.87404405] format. This is because, by default, ES-Hadoop ignores the tuple field names and simply converts them into an ordered array.
If you wish to make your geoLocation field look like a key/value-based object with lat/lon keys, you can do so by including the following configuration in EsStorage:

    es.mapping.pig.tuple.use.field.names=true

Writing from the JSON source

If you have the input as a well-formed JSON file, you can avoid conversions and transformations and directly pass the JSON document to Elasticsearch for indexing. You may have the JSON data in Pig as chararray, bytearray, or any other form that translates to well-formed JSON by calling the toString() method, as shown in the following code:

    grunt> JSON_DATA = LOAD '/ch07/crimes.json' USING PigStorage() AS (json:chararray);
    grunt> STORE JSON_DATA INTO 'esh_pig/crimes_json' USING org.elasticsearch.hadoop.pig.EsStorage('es.input.json=true');

Type conversions

Take a look at the type mapping of the esh_pig index in Elasticsearch. It maps the geoLocation type to double. This is because Elasticsearch inferred the double type based on the field type we specified in Pig. To map geoLocation to geo_point, you must create the Elasticsearch mapping for it manually before executing the script. Although Elasticsearch provides data type detection based on the type of field in the incoming document, it is always good to create the type mapping beforehand in Elasticsearch. This is a one-time activity; afterwards, you can run the MapReduce, Pig, Hive, Cascading, or Spark jobs multiple times. This will avoid any surprises in the type detection. For your reference, here is a list of some of the field types of Pig and Elasticsearch that map to each other (leaving out the obvious, intuitive type mappings):

    Pig type      Elasticsearch type
    chararray     string
    bytearray     binary
    tuple         array (default) or object
    bag           array
    map           object
    bigdecimal    not supported
    biginteger    not supported

Reading data from Elasticsearch

Reading data from Elasticsearch using Pig is as simple as writing a single command with the Elasticsearch query. Here is a snippet that prints the tuples for crimes related to theft:

    grunt> REGISTER hdfs://localhost:9000/lib/elasticsearch-hadoop-2.1.1.jar
    grunt> ES = LOAD 'esh_pig/crimes' using org.elasticsearch.hadoop.pig.EsStorage('{"query" : { "term" : { "primaryType" : "theft" } } }');
    grunt> dump ES;

Executing the preceding commands will print the tuples to the Pig console.

Giving Spark to Elasticsearch

Spark is a distributed computing system that provides a huge performance boost compared to Hadoop MapReduce. It works on the abstraction of RDDs (Resilient Distributed Datasets), which can be created for any data residing in Hadoop. Without any surprises, ES-Hadoop provides easy integration with Spark by enabling the creation of RDDs from the data in Elasticsearch. Spark's increasing support for integrating with various data sources, such as HDFS, Parquet, Avro, S3, Cassandra, relational databases, and streaming data, makes it special when it comes to data integration. This means that when you use ES-Hadoop with Spark, you can make all these sources integrate with Elasticsearch easily.
Setting up Spark

In order to set up Apache Spark to execute a job, you can perform the following steps:

1. First, download the Apache Spark distribution with the following command:

    $ sudo wget -O /usr/local/spark.tgz http://www.apache.org/dyn/closer.cgi/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.4.tgz

2. Then, extract Spark to the desired location and rename it to a convenient name, as shown in the following command:

    $ cd /usr/local
    $ sudo tar -xvf spark.tgz
    $ sudo mv spark-1.4.1-bin-hadoop2.4 spark

Importing data to Elasticsearch

To import the crime dataset to Elasticsearch with Spark, let's see how we can write a Spark job. We will continue using Java to write Spark jobs for consistency. Here are the driver program's snippets:

    SparkConf conf = new SparkConf().setAppName("esh-spark").setMaster("local[4]");
    conf.set("es.index.auto.create", "true");
    JavaSparkContext context = new JavaSparkContext(conf);

Set up the SparkConf object to configure the Spark job. As always, you can also set most options (such as es.index.auto.create) and other configurations that we have seen throughout the article. Using this configuration, we created the JavaSparkContext object as follows:

    JavaRDD<String> textFile = context.textFile("hdfs://localhost:9000/ch07/crimes_dataset.csv");

Read the crime data CSV file as a JavaRDD. Here, the RDD is still of the type String, where each element represents one line:

    JavaRDD<Crime> dataSplits = textFile.map(new Function<String, Crime>() {
        @Override
        public Crime call(String line) throws Exception {
            CSVParser parser = CSVParser.parse(line, CSVFormat.RFC4180);
            Crime c = new Crime();
            CSVRecord record = parser.getRecords().get(0);
            c.setId(record.get(0));
            // ...
            String lat = record.get(10);
            String lon = record.get(11);
            Map<String, Double> geoLocation = new HashMap<>();
            geoLocation.put("lat", StringUtils.isEmpty(lat) ? null : Double.parseDouble(lat));
            geoLocation.put("lon", StringUtils.isEmpty(lon) ? null : Double.parseDouble(lon));
            c.setGeoLocation(geoLocation);
            return c;
        }
    });

In the preceding snippet, we called the map() method on the JavaRDD to map each input line to a Crime object. Note that we created a simple JavaBean class called Crime that implements the Serializable interface and maps to the Elasticsearch document structure. Using CSVParser, we parsed each field into the Crime object. We mapped the nested geoLocation object by embedding a Map in the Crime object; this map is populated with the lat and lon fields. The map() method returns another JavaRDD that contains the Crime objects. Save this JavaRDD<Crime> to Elasticsearch with the JavaEsSpark class provided by ES-Hadoop:

    JavaEsSpark.saveToEs(dataSplits, "esh_spark/crimes");

For all the ES-Hadoop integrations, such as Pig, Hive, Cascading, Apache Storm, and Spark, you can use all the standard ES-Hadoop configurations and techniques. This includes dynamic/multi-resource writes with a pattern similar to esh_spark/{primaryType}, and you can use JSON strings to directly import the data to Elasticsearch as well. To control the Elasticsearch document metadata being indexed, you can use the saveToEsWithMeta() method of JavaEsSpark. You can pass an instance of JavaPairRDD that contains Tuple2<Metadata, Object>, where Metadata represents a map that has the key/value pairs of the document metadata fields, such as id, ttl, timestamp, and version.
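This article's Spark examples are written in Java; if you would rather drive a similar import from PySpark, ES-Hadoop's Map/Reduce output format can also be used from Python. The following is only a hedged sketch: it assumes the elasticsearch-hadoop JAR is passed to spark-submit via --jars, and the single document shown is illustrative rather than taken from the crime dataset.

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("esh-pyspark").setMaster("local[4]")
    sc = SparkContext(conf=conf)

    # One illustrative (key, document) pair; EsOutputFormat ignores the key
    doc = ("1", {"id": "1", "caseNumber": "HX100", "primaryType": "THEFT"})
    rdd = sc.parallelize([doc])

    rdd.saveAsNewAPIHadoopFile(
        path="-",
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf={"es.resource": "esh_spark/crimes", "es.nodes": "localhost"})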
Using SparkSQL

ES-Hadoop also bridges Elasticsearch with the SparkSQL module. SparkSQL 1.3+ provides the DataFrame abstraction, which represents a collection of Row objects. We will not discuss the details of DataFrame here. ES-Hadoop lets you persist your DataFrame instance to Elasticsearch transparently. Let's see how we can do this with the following code:

    SQLContext sqlContext = new SQLContext(context);
    DataFrame df = sqlContext.createDataFrame(dataSplits, Crime.class);

Create an SQLContext instance using the JavaSparkContext instance. Using the SQLContext instance, you can create a DataFrame by calling the createDataFrame() method and passing the existing JavaRDD<T> and Class<T>, where T is a JavaBean class that implements the Serializable interface. Note that passing the class instance is required to infer a schema for the DataFrame. If you wish to use a non-JavaBean-based RDD, you can create the schema manually. The article's source code contains the implementations of both approaches for your reference. Take a look at the following code:

    JavaEsSparkSQL.saveToEs(df, "esh_sparksql/crimes_reflection");

Once you have the DataFrame instance, you can save it to Elasticsearch with the JavaEsSparkSQL class, as shown in the preceding code.

Reading data from Elasticsearch

Here is the snippet of SparkEsReader that finds crimes related to theft:

    JavaRDD<Map<String, Object>> esRDD = JavaEsSpark.esRDD(context, "esh_spark/crimes",
        "{\"query\" : { \"term\" : { \"primaryType\" : \"theft\" } } }").values();
    for(Map<String, Object> item: esRDD.collect()){
        System.out.println(item);
    }

We used the same JavaEsSpark class to create an RDD with documents that match the Elasticsearch query.

Using SparkSQL

ES-Hadoop provides an org.elasticsearch.spark.sql data source provider to read the data from Elasticsearch using SparkSQL, as shown in the following code:

    Map<String, String> options = new HashMap<>();
    options.put("pushdown", "true");
    options.put("es.nodes", "localhost");
    DataFrame df = sqlContext.read()
        .options(options)
        .format("org.elasticsearch.spark.sql")
        .load("esh_sparksql/crimes_reflection");

The preceding code snippet uses the org.elasticsearch.spark.sql data source to load data from Elasticsearch. You can set the pushdown option to true to push the query execution down to Elasticsearch. This greatly increases efficiency, as the query execution is collocated where the data resides, as shown in the following code:

    df.registerTempTable("crimes");
    DataFrame theftCrimes = sqlContext.sql("SELECT * FROM crimes WHERE primaryType='THEFT'");
    for(Row row: theftCrimes.javaRDD().collect()){
        System.out.println(row);
    }

We registered the table with the DataFrame and executed the SQL query on the SQLContext. Note that we need to collect the final results locally in order to print them in the driver class.

Summary

In this article, we looked at the various Hadoop ecosystem technologies. We set up Pig with ES-Hadoop and developed a script to interact with Elasticsearch. You also learned how to use ES-Hadoop to integrate Elasticsearch with Spark and empower it with the powerful SparkSQL engine.
Hand Gesture Recognition Using a Kinect Depth Sensor

Packt
06 Oct 2015
26 min read
In this article by Michael Beyeler, author of the book OpenCV with Python Blueprints, the goal is to develop an app that detects and tracks simple hand gestures in real time using the output of a depth sensor, such as that of a Microsoft Kinect 3D sensor or an Asus Xtion. The app will analyze each captured frame to perform the following tasks: Hand region segmentation: The user's hand region will be extracted in each frame by analyzing the depth map output of the Kinect sensor, which is done by thresholding, applying some morphological operations, and finding connected components Hand shape analysis: The shape of the segmented hand region will be analyzed by determining contours, convex hull, and convexity defects Hand gesture recognition: The number of extended fingers will be determined based on the hand contour's convexity defects, and the gesture will be classified accordingly (with no extended finger corresponding to a fist, and five extended fingers corresponding to an open hand) Gesture recognition is an ever-popular topic in computer science. This is because it not only enables humans to communicate with machines (human-machine interaction or HMI), but also constitutes the first step for machines to begin understanding human body language. With affordable sensors, such as Microsoft Kinect or Asus Xtion, and open source software such as OpenKinect and OpenNI, it has never been easier to get started in the field yourself. So what shall we do with all this technology? The beauty of the algorithm that we are going to implement in this article is that it works well for a number of hand gestures, yet is simple enough to run in real time on a generic laptop. And if we want, we can easily extend it to incorporate more complicated hand pose estimations. The end product looks like this: No matter how many fingers of my left hand I extend, the algorithm correctly segments the hand region (white), draws the corresponding convex hull (the green line surrounding the hand), finds all convexity defects that belong to the spaces between fingers (large green points) while ignoring others (small red points), and infers the correct number of extended fingers (the number in the bottom-right corner), even for a fist. This article assumes that you have a Microsoft Kinect 3D sensor installed. Alternatively, you may install Asus Xtion or any other depth sensor for which OpenCV has built-in support. First, install OpenKinect and libfreenect from http://www.openkinect.org/wiki/Getting_Started. Then, you need to build (or rebuild) OpenCV with OpenNI support. The GUI used in this article will again be designed with wxPython, which can be obtained from http://www.wxpython.org/download.php. Planning the app The final app will consist of the following modules and scripts: gestures: A module that consists of an algorithm for recognizing hand gestures. We separate this algorithm from the rest of the application so that it can be used as a standalone module without the need for a GUI. gestures.HandGestureRecognition: A class that implements the entire process flow of hand gesture recognition. It accepts a single-channel depth image (acquired from the Kinect depth sensor) and returns an annotated RGB color image with an estimated number of extended fingers. gui: A module that provides a wxPython GUI application to access the capture device and display the video feed. In order to have it access the Kinect depth sensor instead of a generic camera, we will have to extend some of the base class functionality. 
gui.BaseLayout: A generic layout from which more complicated layouts can be built. Setting up the app Before we can get to the nitty-grittyof our gesture recognition algorithm, we need to make sure that we can access the Kinect sensor and display a stream of depth frames in a simple GUI. Accessing the Kinect 3D sensor Accessing Microsoft Kinect from within OpenCV is not much different from accessing a computer's webcam or camera device. The easiest way to integrate a Kinect sensor with OpenCV is by using an OpenKinect module called freenect. For installation instructions, see the preceding information box. The following code snippet grants access to the sensor using cv2.VideoCapture: import cv2 import freenect device = cv2.cv.CV_CAP_OPENNI capture = cv2.VideoCapture(device) On some platforms, the first call to cv2.VideoCapture fails to open a capture channel. In this case, we provide a workaround by opening the channel ourselves: if not(capture.isOpened(device)): capture.open(device) If you want to connect to your Asus Xtion, the device variable should be assigned the cv2.cv.CV_CAP_OPENNI_ASUS value instead. In order to give our app a fair chance to run in real time, we will limit the frame size to 640 x 480 pixels: capture.set(cv2.cv.CV_CAP_PROP_FRAME_WIDTH, 640) capture.set(cv2.cv.CV_CAP_PROP_FRAME_HEIGHT, 480) If you are using OpenCV 3, the constants you are looking for might be called cv3.CAP_PROP_FRAME_WIDTH and cv3.CAP_PROP_FRAME_HEIGHT. The read() method of cv2.VideoCapture is inappropriate when we need to synchronize a set of cameras or a multihead camera, such as a Kinect. In this case, we should use the grab() and retrieve() methods instead. An even easier way when working with OpenKinect is to use the sync_get_depth() and sync_get_video()methods. For the purpose of this article, we will need only the Kinect's depth map, which is a single-channel (grayscale) image in which each pixel value is the estimated distance from the camera to a particular surface in the visual scene. The latest frame can be grabbed via this code: depth, timestamp = freenect.sync_get_depth() The preceding code returns both the depth map and a timestamp. We will ignore the latter for now. By default, the map is in 11-bit format, which is inadequate to be visualized with cv2.imshow right away. Thus, it is a good idea to convert the image to 8-bit precision first. In order to reduce the range of depth values in the frame, we will clip the maximal distance to a value of 1,023 (or 2**10-1). 
This will get rid of values that correspond either to noise or distances that are far too large to be of interest to us: np.clip(depth, 0, 2**10-1, depth) depth >>= 2 Then, we convert the image into 8-bit format and display it: depth = depth.astype(np.uint8) cv2.imshow("depth", depth) Running the app In order to run our app, we will need to execute a main function routine that accesses the Kinect, generates the GUI, and executes the main loop of the app: import numpy as np import wx import cv2 import freenect from gui import BaseLayout from gestures import HandGestureRecognition def main(): device = cv2.cv.CV_CAP_OPENNI capture = cv2.VideoCapture() if not(capture.isOpened()): capture.open(device) capture.set(cv2.cv.CV_CAP_PROP_FRAME_WIDTH, 640) capture.set(cv2.cv.CV_CAP_PROP_FRAME_HEIGHT, 480) We will design a suitable layout (KinectLayout) for the current project: # start graphical user interface app = wx.App() layout = KinectLayout(None, -1, 'Kinect Hand Gesture Recognition', capture) layout.Show(True) app.MainLoop() The Kinect GUI The layout chosen for the current project (KinectLayout) is as plain as it gets. It should simply display the live stream of the Kinect depth sensor at a comfortable frame rate of 10 frames per second. Therefore, there is no need to further customize BaseLayout: class KinectLayout(BaseLayout): def _create_custom_layout(self): pass The only parameter that needs to be initialized this time is the recognition class. This will be useful in just a moment: def _init_custom_layout(self): self.hand_gestures = HandGestureRecognition() Instead of reading a regular camera frame, we need to acquire a depth frame via the freenect method sync_get_depth(). This can be achieved by overriding the following method: def _acquire_frame(self): As mentioned earlier, by default, this function returns a single-channel depth image with 11-bit precision and a timestamp. However, we are not interested in the timestamp, and we simply pass on the frame if the acquisition was successful: frame, _ = freenect.sync_get_depth() # return success if frame size is valid if frame is not None: return (True, frame) else: return (False, frame) The rest of the visualization pipeline is handled by the BaseLayout class. We only need to make sure that we provide a _process_frame method. This method accepts a depth image with 11-bit precision, processes it, and returns an annotated 8-bit RGB color image. Conversion to a regular grayscale image is the same as mentioned in the previous subsection: def _process_frame(self, frame): # clip max depth to 1023, convert to 8-bit grayscale np.clip(frame, 0, 2**10 – 1, frame) frame >>= 2 frame = frame.astype(np.uint8) The resulting grayscale image can then be passed to the hand gesture recognizer, which will return the estimated number of extended fingers (num_fingers) and the annotated RGB color image mentioned earlier (img_draw): num_fingers, img_draw = self.hand_gestures.recognize(frame) In order to simplify the segmentation task of the HandGestureRecognition class, we will instruct the user to place their hand in the center of the screen. 
To provide a visual aid for this, let's draw a rectangle around the image center and highlight the center pixel of the image in orange: height, width = frame.shape[:2] cv2.circle(img_draw, (width/2, height/2), 3, [255, 102, 0], 2) cv2.rectangle(img_draw, (width/3, height/3), (width*2/3, height*2/3), [255, 102, 0], 2) In addition, we print num_fingers on the screen: cv2.putText(img_draw, str(num_fingers), (30, 30),cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255)) return img_draw Tracking hand gestures in real time The bulk of the work is done by the HandGestureRecognition class, especially by its recognize method. This class starts off with a few parameter initializations, which will be explained and used later: class HandGestureRecognition: def __init__(self): # maximum depth deviation for a pixel to be considered # within range self.abs_depth_dev = 14 # cut-off angle (deg): everything below this is a convexity # point that belongs to two extended fingers self.thresh_deg = 80.0 The recognize method is where the real magic takes place. This method handles the entire process flow, from the raw grayscale image all the way to a recognized hand gesture. It implements the following procedure: It extracts the user's hand region by analyzing the depth map (img_gray) and returning a hand region mask (segment): def recognize(self, img_gray): segment = self._segment_arm(img_gray) It performs contour analysis on the hand region mask (segment). Then, it returns the largest contour area found in the image (contours) and any convexity defects (defects): [contours, defects] = self._find_hull_defects(segment) Based on the contours found and the convexity defects, it detects the number of extended fingers (num_fingers) in the image. Then, it annotates the output image (img_draw) with contours, defect points, and the number of extended fingers: img_draw = cv2.cvtColor(img_gray, cv2.COLOR_GRAY2RGB) [num_fingers, img_draw] = self._detect_num_fingers(contours, defects, img_draw) It returns the estimated number of extended fingers (num_fingers) as well as the annotated output image (img_draw): return (num_fingers, img_draw) Hand region segmentation The automatic detection of an arm, and later the hand region, could be designed to be arbitrarily complicated, maybe by combining information about the shape and color of an arm or hand. However, using a skin color as a determining feature to find hands in visual scenes might fail terribly in poor lighting conditions or when the user is wearing gloves. Instead, we choose to recognize the user's hand by its shape in the depth map. Allowing hands of all sorts to be present in any region of the image unnecessarily complicates the mission of this article, so we make two simplifying assumptions: We will instruct the user of our app to place their hand in front of the center of the screen, orienting their palm roughly parallel to the orientation of the Kinect sensor so that it is easier to identify the corresponding depth layer of the hand. We will also instruct the user to sit roughly 1 to 2 meters away from the Kinect, and to slightly extend their arm in front of their body so that the hand will end up in a slightly different depth layer than the arm. However, the algorithm will still work even if the full arm is visible. In this way, it will be relatively straightforward to segment the image based on the depth layer alone. Otherwise, we would have to come up with a hand detection algorithm first, which would unnecessarily complicate our mission. 
If you feel adventurous, feel free to do this on your own. Finding the most prominent depth of the image center region Once the hand is placed roughly in the center of the screen, we can start finding all image pixels that lie on the same depth plane as the hand. For this, we simply need to determine the most prominent depth value of the center region of the image. The simplest approach would be as follows: look only at the depth value of the center pixel: width, height = depth.shape center_pixel_depth = depth[width/2, height/2] Then, create a mask in which all pixels at a depth of center_pixel_depth are white and all others are black: import numpy as np depth_mask = np.where(depth == center_pixel_depth, 255, 0).astype(np.uint8) However, this approach will not be very robust, because chances are that: Your hand is not placed perfectly parallel to the Kinect sensor Your hand is not perfectly flat The Kinect sensor values are noisy Therefore, different regions of your hand will have slightly different depth values. The _segment_arm method takes a slightly better approach, that is, looking at a small neighborhood in the center of the image and determining the median (meaning the most prominent) depth value. First, we find the center (for example, 21 x 21 pixels) region of the image frame: def _segment_arm(self, frame): """ segments the arm region based on depth """ center_half = 10 # half-width of 21 is 21/2-1 lowerHeight = self.height/2 – center_half upperHeight = self.height/2 + center_half lowerWidth = self.width/2 – center_half upperWidth = self.width/2 + center_half center = frame[lowerHeight:upperHeight,lowerWidth:upperWidth] We can then reshape the depth values of this center region into a one-dimensional vector and determine the median depth value, med_val: med_val = np.median(center) We can now compare med_val with the depth value of all pixels in the image and create a mask in which all pixels whose depth values are within a particular range [med_val-self.abs_depth_dev, med_val+self.abs_depth_dev] are white and all other pixels are black. However, for reasons that will be clear in a moment, let's paint the pixels gray instead of white: frame = np.where(abs(frame – med_val) <= self.abs_depth_dev, 128, 0).astype(np.uint8) The result will look like this: Applying morphological closing to smoothen the segmentation mask A common problem with segmentation is that a hard threshold typically results in small imperfections (that is, holes, as in the preceding image) in the segmented region. These holes can be alleviated using morphological opening and closing. Opening removes small objects from the foreground (assuming that the objects are bright on a dark foreground), whereas closing removes small holes (dark regions). This means that we can get rid of the small black regions in our mask by applying morphological closing (dilation followed by erosion) with a small 3 x 3 pixel kernel: kernel = np.ones((3, 3), np.uint8) frame = cv2.morphologyEx(frame, cv2.MORPH_CLOSE, kernel) The result looks a lot smoother, as follows: Notice, however, that the mask still contains regions that do not belong to the hand or arm, such as what appears to be one of my knees on the left and some furniture on the right. These objects just happen to be on the same depth layer as my arm and hand. If possible, we could now combine the depth information with another descriptor, maybe a texture-based or skeleton-based hand classifier, that would weed out all non-skin regions. 
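As a possible refinement of the idea just mentioned, the depth mask could be combined with a simple skin-color mask computed from the Kinect's RGB stream. The following is only a rough sketch of ours, not part of the original pipeline: the helper name refine_with_skin_color and the HSV thresholds are assumptions and would need tuning for real lighting conditions and skin tones:

import cv2
import numpy as np
import freenect

def refine_with_skin_color(depth_mask):
    # grab the matching RGB frame from the Kinect (640 x 480, like the depth map)
    rgb, _ = freenect.sync_get_video()
    # convert to HSV, which is more convenient for color thresholding
    hsv = cv2.cvtColor(rgb, cv2.COLOR_RGB2HSV)
    # assumed (very rough) HSV bounds for skin; tune these for your setup
    lower = np.array([0, 40, 60], np.uint8)
    upper = np.array([25, 180, 255], np.uint8)
    skin_mask = cv2.inRange(hsv, lower, upper)
    # keep a pixel only if it lies in the segmented depth layer and looks like skin
    return cv2.bitwise_and(depth_mask, skin_mask)

With such a combination, objects such as furniture that merely share the hand's depth layer would be suppressed, at the cost of reintroducing a dependency on lighting conditions.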
Finding connected components in a segmentation mask An easier approach is to realize that most of the times, hands are not connected to knees or furniture. We already know that the center region belongs to the hand, so we can simply apply cv2.floodfill to find all the connected image regions. Before we do this, we want to be absolutely certain that the seed point for the flood fill belongs to the right mask region. This can be achieved by assigning a grayscale value of 128 to the seed point. But we also want to make sure that the center pixel does not, by any coincidence, lie within a cavity that the morphological operation failed to close. So, let's set a small 7 x 7 pixel region with a grayscale value of 128 instead: small_kernel = 3 frame[self.height/2-small_kernel : self.height/2+small_kernel, self.width/2-small_kernel : self.width/2+small_kernel] = 128 Because flood filling (as well as morphological operations) is potentially dangerous, the Python version of later OpenCV versions requires specifying a mask that avoids flooding the entire image. This mask has to be 2 pixels wider and taller than the original image and has to be used in combination with the cv2.FLOODFILL_MASK_ONLY flag. It can be very helpful in constraining the flood filling to a small region of the image or a specific contour so that we need not connect two neighboring regions that should have never been connected in the first place. It's better to be safe than sorry, right? Ah, screw it! Today, we feel courageous! Let's make the mask entirely black: mask = np.zeros((self.height+2, self.width+2), np.uint8) Then we can apply the flood fill to the center pixel (seed point) and paint all the connected regions white: flood = frame.copy() cv2.floodFill(flood, mask, (self.width/2, self.height/2), 255, flags=4 | (255 << 8)) At this point, it should be clear why we decided to start with a gray mask earlier. We now have a mask that contains white regions (arm and hand), gray regions (neither arm nor hand but other things in the same depth plane), and black regions (all others). With this setup, it is easy to apply a simple binary threshold to highlight only the relevant regions of the pre-segmented depth plane: ret, flooded = cv2.threshold(flood, 129, 255, cv2.THRESH_BINARY) This is what the resulting mask looks like: The resulting segmentation mask can now be returned to the recognize method, where it will be used as an input to _find_hull_defects as well as a canvas for drawing the final output image (img_draw). Hand shape analysis Now that we (roughly) know where the hand is located, we aim to learn something about its shape. Determining the contour of the segmented hand region The first step involves determining the contour of the segmented hand region. Luckily, OpenCV comes with a pre-canned version of such an algorithm—cv2.findContours. This function acts on a binary image and returns a set of points that are believed to be part of the contour. Because there might be multiple contours present in the image, it is possible to retrieve an entire hierarchy of contours: def _find_hull_defects(self, segment): contours, hierarchy = cv2.findContours(segment, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE) Furthermore, because we do not know which contour we are looking for, we have to make an assumption to clean up the contour result. 
Since it is possible that some small cavities are left over even after the morphological closing—but we are fairly certain that our mask contains only the segmented area of interest—we will assume that the largest contour found is the one that we are looking for. Thus, we simply traverse the list of contours, calculate the contour area (cv2.contourArea), and store only the largest one (max_contour): max_contour = max(contours, key=cv2.contourArea) Finding the convex hull of a contour area Once we have identified the largest contour in our mask, it is straightforward to compute the convex hull of the contour area. The convex hull is basically the envelope of the contour area. If you think of all the pixels that belong to the contour area as a set of nails sticking out of a board, then the convex hull is the shape formed by a tight rubber band that surrounds all the nails. We can get the convex hull directly from our largest contour (max_contour): hull = cv2.convexHull(max_contour, returnPoints=False) Because we now want to look at convexity deficits in this hull, we are instructed by the OpenCV documentation to set the returnPoints optional flag to False. The convex hull drawn in green around a segmented hand region looks like this: Finding convexity defects of a convex hull As is evident from the preceding screenshot, not all points on the convex hull belong to the segmented hand region. In fact, all the fingers and the wrist cause severe convexity defects, that is, points of the contour that are far away from the hull. We can find these defects by looking at both the largest contour (max_contour) and the corresponding convex hull (hull): defects = cv2.convexityDefects(max_contour, hull) The output of this function (defects) is a 4-tuple that contains start_index (the point of the contour where the defect begins), end_index (the point of the contour where the defect ends), farthest_pt_index (the farthest from the convex hull point within the defect), and fixpt_depth (distance between the farthest point and the convex hull). We will make use of this information in just a moment when we reason about fingers. But for now, our job is done. The extracted contour (max_contour) and convexity defects (defects) can be passed to recognize, where they will be used as inputs to _detect_num_fingers: return (cnt,defects) Hand gesture recognition What remains to be done is classifying the hand gesture based on the number of extended fingers. For example, if we find five extended fingers, we assume the hand to be open, whereas no extended fingers imply a fist. All that we are trying to do is count from zero to five and make the app recognize the corresponding number of fingers. This is actually trickier than it might seem at first. For example, people in Europe might count to three by extending their thumb, index finger, and middle finger. If you do that in the US, people there might get horrendously confused, because people do not tend to use their thumbs when signaling the number two. This might lead to frustration, especially in restaurants (trust me). If we could find a way to generalize these two scenarios—maybe by appropriately counting the number of extended fingers—we would have an algorithm that could teach simple hand gesture recognition to not only a machine but also (maybe) to an average waitress. As you might have guessed, the answer has to do with convexity defects. As mentioned earlier, extended fingers cause defects in the convex hull. 
However, the inverse is not true; that is, not all convexity defects are caused by fingers! There might be additional defects caused by the wrist as well as the overall orientation of the hand or the arm. How can we distinguish between these different causes for defects? Distinguishing between different causes for convexity defects The trick is to look at the angle between the farthest point from the convex hull point within the defect (farthest_pt_index) and the start and end points of the defect (start_index and end_index, respectively), as illustrated in the following screenshot: In this screenshot, the orange markers serve as a visual aid to center the hand in the middle of the screen, and the convex hull is outlined in green. Each red dot corresponds to a farthest from the convex hull point (farthest_pt_index) for every convexity defect detected. If we compare a typical angle that belongs to two extended fingers (such as θj) to an angle that is caused by general hand geometry (such as θi), we notice that the former is much smaller than the latter. This is obviously because humans can spread their finger only a little, thus creating a narrow angle made by the farthest defect point and the neighboring fingertips. Therefore, we can iterate over all convexity defects and compute the angle between the said points. For this, we will need a utility function that calculates the angle (in radians) between two arbitrary, list-like vectors, v1 and v2: def angle_rad(v1, v2): return np.arctan2(np.linalg.norm(np.cross(v1, v2)), np.dot(v1, v2)) This method uses the cross product to compute the angle, rather than the standard way. The standard way of calculating the angle between two vectors v1 and v2 is by calculating their dot product and dividing it by the norm of v1 and the norm of v2. However, this method has two imperfections: You have to manually avoid division by zero if either the norm of v1 or the norm of v2 is zero The method returns relatively inaccurate results for small angles Similarly, we provide a simple function to convert an angle from degrees to radians: def deg2rad(angle_deg): return angle_deg/180.0*np.pi Classifying hand gestures based on the number of extended fingers What remains to be done is actually classifying the hand gesture based on the number of extended fingers. The _detect_num_fingers method will take as input the detected contour (contours), the convexity defects (defects), and a canvas to draw on (img_draw): def _detect_num_fingers(self, contours, defects, img_draw): Based on these parameters, it will then determine the number of extended fingers. However, we first need to define a cut-off angle that can be used as a threshold to classify convexity defects as being caused by extended fingers or not. Except for the angle between the thumb and the index finger, it is rather hard to get anything close to 90 degrees, so anything close to that number should work. We do not want the cut-off angle to be too high, because that might lead to misclassifications: self.thresh_deg = 80.0 For simplicity, let's focus on the special cases first. If we do not find any convexity defects, it means that we possibly made a mistake during the convex hull calculation, or there are simply no extended fingers in the frame, so we return 0 as the number of detected fingers: if defects is None: return [0, img_draw] But we can take this idea even further. 
Due to the fact that arms are usually slimmer than hands or fists, we can assume that the hand geometry will always generate at least two convexity defects (which usually belong to the wrists). So if there are no additional defects, it implies that there are no extended fingers: if len(defects) <= 2: return [0, img_draw] Now that we have ruled out all special cases, we can begin counting real fingers. If there are a sufficient number of defects, we will find a defect between every pair of fingers. Thus, in order to get the number right (num_fingers), we should start counting at 1: num_fingers = 1 Then we can start iterating over all convexity defects. For each defect, we will extract the four elements and draw its hull for visualization purposes: for i in range(defects.shape[0]): # each defect point is a 4-tuple start_idx, end_idx, farthest_idx, _ = defects[i, 0] start = tuple(contours[start_idx][0]) end = tuple(contours[end_idx][0]) far = tuple(contours[farthest_idx][0]) # draw the hull cv2.line(img_draw, start, end, [0, 255, 0], 2) Then we will compute the angle between the two edges from far to start and from far to end. If the angle is smaller than self.thresh_deg degrees, it means that we are dealing with a defect that is most likely caused by two extended fingers. In this case, we want to increment the number of detected fingers (num_fingers), and we draw the point in green. Otherwise, we draw the point in red: # if angle is below a threshold, defect point belongs # to two extended fingers if angle_rad(np.subtract(start, far), np.subtract(end, far)) < deg2rad(self.thresh_deg): # increment number of fingers num_fingers = num_fingers + 1 # draw point as green cv2.circle(img_draw, far, 5, [0, 255, 0], -1) else: # draw point as red cv2.circle(img_draw, far, 5, [255, 0, 0], -1) After iterating over all convexity defects, we pass the number of detected fingers and the assembled output image to the recognize method: return (min(5, num_fingers), img_draw) This will make sure that we do not exceed the common number of fingers per hand. The result can be seen in the following screenshots: Interestingly, our app is able to detect the correct number of extended fingers in a variety of hand configurations. Defect points between extended fingers are easily classified as such by the algorithm, and others are successfully ignored. Summary This article showed a relatively simple and yet surprisingly robust way of recognizing a variety of hand gestures by counting the number of extended fingers. The algorithm first shows how a task-relevant region of the image can be segmented using depth information acquired from a Microsoft Kinect 3D Sensor, and how morphological operations can be used to clean up the segmentation result. By analyzing the shape of the segmented hand region, the algorithm comes up with a way to classify hand gestures based on the types of convexity defects found in the image. Once again, mastering our use of OpenCV to perform a desired task did not require us to produce a large amount of code. Instead, we were challenged to gain an important insight that made us use the built-in functionality of OpenCV in the most effective way possible. Gesture recognition is a popular but challenging field in computer science, with applications in a large number of areas, such as human-computer interaction, video surveillance, and even the video game industry. 
You can now use your advanced understanding of segmentation and structure analysis to build your own state-of-the-art gesture recognition system. Resources for Article: Tracking Faces with Haar Cascades Our First Machine Learning Method - Linear Classification Solving problems with Python: Closest good restaurant
Creating Interactive Spreadsheets using Tables and Slicers

Packt
06 Oct 2015
10 min read
In this article by Hernan D Rojas, author of the book Data Analysis and Business Modeling with Excel 2013, we introduce additional materials for the advanced Excel developer in the presentation stage of the data analysis life cycle. What you will learn in this article is how to leverage Excel's new features to add interactivity to your spreadsheets. (For more resources related to this topic, see here.) What are slicers? Slicers are essentially buttons that automatically filter your data. Excel has always been able to filter data, but slicers are more practical and visually appealing. Let's compare the two in the following steps: First, fire up Excel 2013, and create a new spreadsheet. Manually enter the data, as shown in the following figure: Highlight cells A1 through E11, and press Ctrl + T to convert our data to an Excel table. Converting your data to a table is the first step that you need to take in order to introduce slicers in your spreadsheet. Let's filter our data using the default filtering capabilities that we are already familiar with. Filter the Type column and only select the rows that have the value equal to SUV, as shown in the following figure. Click on the OK button to apply the filter to the table. You will now be left with four rows that have the Type column equal to SUV. Using a typical Excel filter, we were able to filter our data and only show all of the SUV cars. We can then continue to filter by other columns, such as MPG (miles per gallon) and Price. How can we accomplish the same results using slicers? Continue reading this article to find out. How to create slicers? In this article, we will be going through simple but powerful steps that are required to build slicers. After we create our first slicer, make sure that you compare and contrast the old way of filtering versus the new way of filtering data. Remove the filter that we just applied to our table by clicking on the option named Clear Filter From "Type", as shown in the following figure: With your Excel table selected, click on the TABLE TOOLS tab. Click on the Insert Slicer button. In the Insert Slicers dialog box, select the Type checkbox, and click on the OK button, as shown in the following screenshot: You should now have a slicer that looks similar to the one in the following figure. Notice that you can resize and move the slicer anywhere you want in the spreadsheet. Click on the Sedan filter in the slicer that we built in the previous step. Wow! The data is filtered and only the rows where the Type column is equal to Sedan are shown in the results. Click on the Sport filter and see what happens. The data is now filtered where the Type column is equal to Sport. Notice that the previous filter of Sedan was removed as soon as we clicked on the Sport filter. What if we want to filter the data by both Sport and Sedan? We can just highlight both the filters with our mouse, or click on Sedan, press Ctrl, and then click on the Sport filter. The end result will look like this: To clear the filter, just click on the Clear Filter button. Do you see the advantage of slicers over filters? Yes, of course, they are simply better. Filtering between Sedan, Sport, or SUV is very easy and convenient. It will certainly take fewer keystrokes, and the feedback is instant. Think about the end users interacting with your spreadsheet. At the touch of a button, they can answer questions that arise in their heads. This is what you call an interactive spreadsheet or an interactive dashboard. 
Styling slicers There are not many options to style slicers but Excel does give you a decent amount of color schemes that you can experiment with: With the Type slicer selected, navigate to the SLICER TOOLS tab, as shown in the following figure: Click on the various slicer styles available to get a feel of what Excel offers. Adding multiple slicers You are able to add multiple slicers and multiple charts in one spreadsheet. Why would we do this? Well, this is the beginning of a dashboard creation. Let's expand on the example we have just been working on, and see how we can turn raw data into an interactive dashboard: Let's start with creating slicers for # of Passengers and MPG, as shown in the following figure: Rename Sheet1 as Data, and create a new sheet called Dashboard, as shown here: Move the three slicers by cutting and pasting them from the Data sheet to the Dashboard sheet. Create a line chart using the columns Company and MPG, as shown in the following figure: Create a bar chart using the columns Type and MPG. Create another bar chart with the columns company and # of Passengers, as shown in the following figure. These types of charts are technically called column charts, but you can get away by calling them bar charts. Now, move the three charts from the Data tab to the Dashboard tab. Right-click on the bar chart, and select the Move Chart… option. In the Move Chart dialog box, change the Object in: parameter from Data to Dashboard, and then click on the OK button. Move the other two charts to the Dashboard tab so that there are no more charts in the Data tab. Rearrange the charts and slicers so that they look as closely as possible to the ones in the following figure. As you can see that this tab is starting to look like a dashboard. The Type slicer will look better if Sedan, Sport, and SUV are laid out horizontally. Select the Type slicer, and click on the SLICER TOOLS menu option. Change the Columns parameter from 1 to 3, as shown in the following figure. This is how we are able to change the layout or shape of the slicer. Resize the Type slicer so that it looks like the one in the following figure: Clearing filters You can click on one or more filters in the dashboard that we just created. Very cool! Every time we select a filter, all of the three charts that we created get updated. This again is called adding interactivity to your spreadsheets. This allows the end users of your dashboard to interact with your data and perform their own analysis. If you notice, there is really a no good way of removing multiple filters at once. For example, if you select Sedans that have a MPG of greater or equal to 30, how would you remove all of the filters? You would have to clear the filters from the Type slicer and then from the MPG slicer. This can be a little tedious to your end user, and you will want to avoid this at any cost. The next steps will show you how to create a button using VBA that will filter all of our data in a flash: Press Alt + F11, and create a sub procedure called Clear_Slicer, as shown in the following figure. This code will basically find all of the filters that you have selected and then manually clears them for you one at a time. The next step is to bind this code to a button: Sub Clear_Slicer() ' Declare Variables Dim cache As SlicerCache ' Loop through each filter For Each cache In ActiveWorkbook.SlicerCaches ' clear filter cache.ClearManualFilter Next cache End Sub Select the DEVELOPER tab, and click on the Insert button. 
In the pop-up menu called Form Controls, select the Button option. Now, click anywhere on the sheet, and you will get a dialog box that looks like the following figure. This is where we are going to assign a macro to the button. This means that whenever you click on the button we are creating, Excel will run the macro of our choice. Since we have already created a macro called Clear_Slicer, it will make sense to select this macro, and then click on the OK button. Change the text of the button to Clear All Filters and resize it so that it looks like this: Adjust the properties of the button by right-clicking on the button and selecting the Format Control… option. Here, you can change the font size and the color of your button label. Now, select a bunch of filters, and click on our new shiny button. Yes, that was pretty cool. The most important part is that it is now even easier to "reset" your dashboard and start a brand new analysis. What do I mean by start a brand new analysis? In general, when a user initially starts using your dashboard, he/she will click on the filters aimlessly. The users do this just to figure out how to mechanically use the dashboard. Then, after they get the hang of it, they want to start with a clean slate and perform some data analysis. If we did not have the Clear All Filters button, the users would have to figure out how they would clear every slicer one at a time to start over. The worst case scenario is when the user does not realize when the filters are turned on and when they are turned off. Now, do not laugh at this situation, or assume that your end user is not as smart as you are. This just means that you need to lower the learning curve of your dashboard and make it easy to use. With the addition of the clear button, the end user can think of a question, use the slicers to answer it, click on the clear button, and start the process all over again. These little details are what that is going to separate you from the average Excel developer. Summary The aim of this article was to give you ideas and tools to present your data artistically. Whether you like it or not, sometimes, a better looking analysis will trump the better but less attractive one. Excel gives you the tools to not be on the short end of the stick but to always be able to present visually stunning analysis. You now know have your Excel slicers, and you learned how to bind them to your data. Users of your spreadsheet can now slice and dice your data to answer multiple questions. Executives like flashy visualizations, and when you combine them with a strong analysis, you have a very powerful combination. In this article, we also went through a variety of strategies to customize slicers and chart elements. These little changes made to your dashboard will make them standout and help you get your message across. Excel as always has been an invaluable tool that gives you all of the tools necessary to overcome any data challenges you might come across. As I tell all my students, the key to become better is simply to practice, practice, and practice. Resources for Article: Further resources on this subject: Introduction to Stata and Data Analytics [article] Getting Started with Apache Spark DataFrames [article] Big Data Analysis (R and Hadoop) [article]
Mathematical Imaging

Packt
05 Oct 2015
17 min read
 In this article by Francisco J. Blanco-Silva, author of the book Mastering SciPy, you will learn about image editing and the purpose of editing is the alteration of digital images, usually to perform improvement of its properties or to turn them into an intermediate step for further processing. Let's examine different examples of editing: Transformations of the domain Intensity adjustment Image restoration (For more resources related to this topic, see here.) Transformations of the domain In this setting, we address transformations to images by first changing the location of pixels, rotations, compressions, stretching, swirls, cropping, perspective control, and so on. Once the transformation to the pixels in the domain of the original is performed, we observe the size of the output. If there are more pixels in this image than in the original, the extra locations are filled with numerical values obtained by interpolating the given data. We do have some control over the kind of interpolation performed, of course. To better illustrate these techniques, we will pair an actual image (say, Lena) with a representation of its domain as a checkerboard: In [1]: import numpy as np, matplotlib.pyplot as plt In [2]: from scipy.misc import lena; ...: from skimage.data import checkerboard In [3]: image = lena().astype('float') ...: domain = checkerboard() In [4]: print image.shape, domain.shape Out[4]: (512, 512) (200, 200) Rescale and resize Before we proceed with the pairing of image and domain, we have to make sure that they both have the same size. One quick fix is to rescale both objects, or simply resize one of the images to match the size of the other. Let's go with the first choice, so that we can illustrate the usage of the two functions available for this task in the module skimage.transform to resize and rescale: In [5]: from skimage.transform import rescale, resize In [6]: domain = rescale(domain, scale=1024./200); ...: image = resize(image, output_shape=(1024, 1024), order=3) Observe how, in the resizing operation, we requested a bicubic interpolation. Swirl To perform a swirl, we call the function swirl from the module skimage.transform: In all the examples of this section, we will present the results visually after performing the requested computations. In all cases, the syntax of the call to offer the images is the same. For a given operation mapping, we issue the command display (mapping, image, and domain) where the routine display is defined as follows: def display(mapping, image, domain): plt.figure() plt.subplot(121) plt.imshow(mapping(image)) plt.gray() plt.subplot(122) plt.imshow(mapping(domain)) plt.show() For the sake of brevity, we will not include this command in the following code, but assume it is called every time: In [7]: from skimage.transform import swirl In [8]: mapping = lambda img: swirl(img, strength=6, ...: radius=512) Geometric transformations A simple rotation around any location (no matter whether inside or outside the domain of the image) can be achieved with the function rotate from either module scipy.ndimage or skimage.transform. They are basically the same under the hood, but the syntax of the function in the scikit-image toolkit is more user-friendly: In [10]: from skimage.transform import rotate In [11]: mapping = lambda img: rotate(img, angle=30, resize=True, ....: center=None) This gives a counter-clockwise rotation of 30 degrees (angle=30) around the center of the image (center=None). 
The size of the output image is expanded to guarantee that all the original pixels are present in the output (resize=True): Rotations are a special case of what we call an affine transformation—a combination of rotation with scales (one for each dimension), shear, and translation. Affine transformations are in turn a special case of a homography (a projective transformation). Rather than learning a different set of functions, one for each kind of geometric transformation, the library skimage.transform allows a very comfortable setting. There is a common function (warp) that gets called with the requested geometric transformation and performs the computations. Each suitable geometric transformation is previously initialized with an appropriate python class. For example, to perform an affine transformation with a counter-clockwise rotation angle of 30 degrees about the point with coordinates (512, -2048), and scale factors of 2 and 3 units, respectively for the x and y coordinates, we issue the following command: In [13]: from skimage.transform import warp, AffineTransform In [14]: operation = AffineTransform(scale=(2,3), rotation=np.pi/6, ....: translation = (512, -2048)) In [15]: mapping = lambda img: warp(img, operation) Observe how all the lines in the transformed checkerboard are either parallel or perpendicular—affine transformations preserve angles. The following illustrates the effect of a homography: In [17]: from skimage.transform import ProjectiveTransform In [18]: generator = np.matrix('1,0,10; 0,1,20; -0.0007,0.0005,1'); ....: homography = ProjectiveTransform(matrix=generator); ....: mapping = lambda img: warp(img, homography) Observe how, unlike in the case of an affine transformation, the lines cease to be all parallel or perpendicular. All vertical lines are now incident at a point at infinity. All horizontal lines are also incident at a different point at infinity. The real usefulness of homographies arises, for example, when we need to perform perspective control. For instance, the image skimage.data.text is clearly slanted. By choosing the four corners of what we believe is a perfect rectangle (we estimate such a rectangle by visual inspection), we can compute a homography that transforms the given image into one that is devoid of any perspective. The Python classes representing geometric transformations allow us to perform this estimation very easily, as the following example shows: In [20]: from skimage.data import text In [21]: text().shape Out[21]: (172, 448) In [22]: source = np.array(((155, 15), (65, 40), ....: (260, 130), (360, 95), ....: (155, 15))) In [23]: mapping = ProjectiveTransform() Let's estimate the homography that transforms the given set of points into a perfect rectangle of the size 48 x 256 centered in an output image of size 512 x 512. The choice of size of the output image is determined by the length of the diagonal of the original image (about 479 pixels). This way, after the homography is computed, the output is likely to contain all the pixels from the original. Observe that we have included one of vertices in the source, twice. This is not strictly necessary for the following computations, but will make the visualization of rectangles much easier to code. We use the same trick for the target rectangle. 
In [24]: target = np.array(((256-128, 256-24), (256-128, 256+24), ....: (256+128, 256+24), (256+128, 256-24), ....: (256-128, 256-24))) In [25]: mapping.estimate(target, source) Out[25]: True In [26]: plt.figure(); ....: plt.subplot(121); ....: plt.imshow(text()); ....: plt.gray(); ....: plt.plot(source[:,0], source[:,1],'-', lw=1, color='red'); ....: plt.xlim(0, 448); ....: plt.ylim(172, 0); ....: plt.subplot(122); ....: plt.imshow(warp(text(), mapping,output_shape=(512, 512))); ....: plt.plot(target[:,0], target[:,1],'-', lw=1, color='red'); ....: plt.xlim(0, 512); ....: plt.ylim(512, 0); ....: plt.show() Other more involved geometric operations are needed, for example, to fix vignetting and some of the other kinds of distortions produced by photographic lenses. Traditionally, once we acquire an image, we assume that all these distortions are present. By knowing the technical specifications of the equipment used to take the photographs, we can automatically rectify these defects. With this purpose in mind, in the SciPy stack, we have access to the lensfun library (http://lensfun.sourceforge.net/) through the package lensfunpy (https://pypi.python.org/pypi/lensfunpy). For examples of usage and documentation, an excellent resource is the API reference of lensfunpy at http://pythonhosted.org/lensfunpy/api/. Intensity adjustment In this category, we have operations that only modify the intensity of an image obeying a global formula. All these operations can therefore be easily coded by using purely NumPy operations, by creating vectorized functions adapting the requested formulas. The applications of these operations can be explained in terms of exposure in black and white photography, for example. For this reason, all the examples in this section are applied on gray-scale images. We have mainly three approaches to enhancing images by working on its intensity: Histogram equalizations Intensity clipping/resizing Contrast adjustment Histogram equalization The starting point in this setting is always the concept of intensity histogram (or more precisely, the histogram of pixel intensity values)—a function that indicates the number of pixels in an image at each different intensity value found in that image. For instance, for the original version of Lena, we could issue the following command: In [27]: plt.figure(); ....: plt.hist(lena().flatten(), 256); ....: plt.show() The operations of histogram equalization improve the contrast of images by modifying the histogram in a way that most of the relevant intensities have the same impact. We can accomplish this enhancement by calling, from the module skimage.exposure, any of the functions equalize_hist (pure histogram equalization) or equalize_adaphist (contrast limited adaptive histogram equalization (CLAHE)). Note the obvious improvement after the application of the histogram equalization of the image skimage.data.moon. In the following examples, we include the corresponding histogram below all relevant images for comparison. 
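Since the lensfunpy calls themselves are not covered in this article, here is only a rough sketch of how such a lens correction might look, following the pattern in the lensfunpy API reference; the camera model, lens name, file names, and shot parameters are placeholders that you would replace with values matching your own equipment:

import cv2
import lensfunpy

db = lensfunpy.Database()
# placeholder camera and lens; query the database with your own maker/model strings
cam = db.find_cameras('NIKON CORPORATION', 'NIKON D3S')[0]
lens = db.find_lenses(cam, 'Nikon', 'Nikkor 28mm f/2.8D AF')[0]

image = cv2.imread('distorted.jpg')
height, width = image.shape[:2]

# build a modifier for this camera/lens combination and the shot parameters
mod = lensfunpy.Modifier(lens, cam.crop_factor, width, height)
mod.initialize(28.0, 2.8, 10.0)   # focal length (mm), aperture, subject distance (m)

# compute the undistorted pixel coordinates and remap the image accordingly
undistorted_coords = mod.apply_geometry_distortion()
fixed = cv2.remap(image, undistorted_coords, None, cv2.INTER_LANCZOS4)
cv2.imwrite('undistorted.jpg', fixed)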
A suitable code to perform this visualization could be as follows: def display(image, transform, bins): target = transform(image) plt.figure() plt.subplot(221) plt.imshow(image) plt.gray() plt.subplot(222) plt.imshow(target) plt.subplot(223) plt.hist(image.flatten(), bins) plt.subplot(224) plt.hist(target.flatten(), bins) plt.show() In [28]: from skimage.exposure import equalize_hist; ....: from skimage.data import moon In [29]: display(moon(), equalize_hist, 256) Intensity clipping/resizing A peak at the histogram indicates the presence of a particular intensity that is remarkably more predominant than its neighboring ones. If we desire to isolate intensities around a peak, we can do so using purely NumPy masking/clipping operations on the original image. If storing the result is not needed, we can request a quick visualization of the result by employing the command clim in the library matplotlib.pyplot. For instance, to isolate intensities around the highest peak of Lena (roughly, these are between 150 and 160), we could issue the following command: In [30]: plt.figure(); ....: plt.imshow(lena()); ....: plt.clim(vmin=150, vmax=160); ....: plt.show() Note how this operation, in spite of having reduced the representative range of intensities from 256 to 10, offers us a new image that keeps sufficient information to recreate the original one. Naturally, we can regard this operation also as a lossy compression method. Contrast enhancement An obvious drawback of clipping intensities is the loss of perceived lightness contrast. To overcome this loss, it is preferable to employ formulas that do not reduce the size of the range. Among the many available formulas conforming to this mathematical property, the most successful ones are those that replicate an optical property of the acquisition method. We explore the following three cases: Gamma correction: Human vision follows a power function, with greater sensitivity to relative differences between darker tones than between lighter ones. Each original image, as captured by the acquisition device, might allocate too many bits or too much bandwidth to highlight that humans cannot actually differentiate. Similarly, too few bits/bandwidth could be allocated to the darker areas of the image. By manipulation of this power function, we are capable of addressing the correct amount of bits and bandwidth. Sigmoid correction: Independently of the amount of bits and bandwidth, it is often desirable to maintain the perceived lightness contrast. Sigmoidal remapping functions were then designed based on empirical contrast enhancement model developed from the results of psychophysical adjustment experiments. Logarithmic correction: This is a purely mathematical process designed to spread the range of naturally low-contrast images by transforming to a logarithmic range. To perform gamma correction on images, we could employ the function adjust_gamma in the module skimage.exposure. The equivalent mathematical operation is the power-law relationship output = gain * input^gamma. Observe the great improvement in definition of the brighter areas of a stained micrograph of colonic glands, when we choose the exponent gamma=2.5 and no gain (gain=1.0): In [31]: from skimage.exposure import adjust_gamma; ....: from skimage.color import rgb2gray; ....: from skimage.data import immunohistochemistry In [32]: image = rgb2gray(immunohistochemistry()) In [33]: correction = lambda img: adjust_gamma(img, gamma=2.5, ....: gain=1.) 
Note the huge improvement in contrast in the lower right section of the micrograph, allowing a better description and differentiation of the observed objects: Immunohistochemical staining with hematoxylin counterstaining. This image was acquired at the Center for Microscopy And Molecular Imaging (CMMI). To perform sigmoid correction with given gain and cutoff coefficients, according to the formula output = 1/(1 + exp*(gain*(cutoff - input))), we employ the function adjust_sigmoid in skimage.exposure. For example, with gain=10.0 and cutoff=0.5 (the default values), we obtain the following enhancement: In [35]: from skimage.exposure import adjust_sigmoid In [36]: display(image[:256, :256], adjust_sigmoid, 256) Note the improvement in the definition of the walls of cells in the enhanced image: We have already explored logarithmic correction in the previous section, when enhancing the visualization of the frequency of an image. This is equivalent to applying a vectorized version of np.log1p to the intensities. The corresponding function in the scikit-image toolkit is adjust_log in the sub module exposure. Image restoration In this category of image editing, the purpose is to repair the damage produced by post or preprocessing of the image, or even the removal of distortion produced by the acquisition device. We explore two classic situations: Noise reduction Sharpening and blurring Noise reduction In mathematical imaging, noise is by definition a random variation of the intensity (or the color) produced by the acquisition device. Among all the possible types of noise, we acknowledge four key cases: Gaussian noise: We add to each pixel a value obtained from a random variable with normal distribution and a fixed mean. We usually allow the same variance on each pixel of the image, but it is feasible to change the variance depending on the location. Poisson noise: To each pixel, we add a value obtained from a random variable with Poisson distribution. Salt and pepper: This is a replacement noise, where some pixels are substituted by zeros (black or pepper), and some pixels are substituted by ones (white or salt). Speckle: This is a multiplicative kind of noise, where each pixel gets modified by the formula output = input + n * input. The value of the modifier n is a value obtained from a random variable with uniform distribution of fixed mean and variance. To emulate all these kinds of noise, we employ the utility random_noise from the module skimage.util. Let's illustrate the possibilities in a common image: In [37]: from skimage.data import camera; ....: from skimage.util import random_noise In [38]: gaussian_noise = random_noise(camera(), 'gaussian', ....: var=0.025); ....: poisson_noise = random_noise(camera(), 'poisson'); ....: saltpepper_noise = random_noise(camera(), 's&p', ....: salt_vs_pepper=0.45); ....: speckle_noise = random_noise(camera(), 'speckle', var=0.02) In [39]: variance_generator = lambda i,j: 0.25*(i+j)/1022. + 0.001; ....: variances = np.fromfunction(variance_generator,(512,512)); ....: lclvr_noise = random_noise(camera(), 'localvar', ....: local_vars=variances) In the last example, we have created a function that assigns a variance between 0.001 and 0.026 depending on the distance to the upper corner of an image. When we visualize the corresponding noisy version of skimage.data.camera, we see that the level of degradation gets stronger as we get closer to the lower right corner of the picture. 
The following is an example of visualization of the corresponding noisy images: The purpose of noise reduction is to remove as much of this unwanted signal, so we obtain an image as close to the original as possible. The trick, of course, is to do so without any previous knowledge of the properties of the noise. The most basic methods of denoising are the application of either a Gaussian or a median filter. We explored them both in the previous section. The former was presented as a smoothing filter (gaussian_filter), and the latter was discussed when we explored statistical filters (median_filter). They both offer decent noise removal, but they introduce unwanted artifacts as well. For example, the Gaussian filter does not preserve edges in images. The application of any of these methods is also not recommended if preserving texture information is needed. We have a few more advanced methods in the module skimage.restoration, able to tailor denoising to images with specific properties: denoise_bilateral: This is the bilateral filter. It is useful when preserving edges is important. denoise_tv_bregman, denoise_tv_chambolle: We will use this if we require a denoised image with small total variation. nl_means_denoising: The so-called non-local means denoising. This method ensures the best results to denoise areas of the image presenting texture. wiener, unsupervised_wiener: This is the Wiener-Hunt deconvolution. It is useful when we have knowledge of the point-spread function at acquisition time. Let's show you, by example, the performance of one of these methods on some of the noisy images we computed earlier: In [40]: from skimage.restoration import nl_means_denoising as dnoise In [41]: images = [gaussian_noise, poisson_noise, ....: saltpepper_noise, speckle_noise]; ....: names = ['Gaussian', 'Poisson', 'Salt & Pepper', 'Speckle'] In [42]: plt.figure() Out[42]: <matplotlib.figure.Figure at 0x118301490> In [43]: for index, image in enumerate(images): ....: output = dnoise(image, patch_size=5, patch_distance=7) ....: plt.subplot(2, 4, index+1) ....: plt.imshow(image) ....: plt.gray() ....: plt.title(names[index]) ....: plt.subplot(2, 4, index+5) ....: plt.imshow(output) ....: In [44]: plt.show() Under each noisy image, we have presented the corresponding result after employing, by nonlocal means, denoising: It is also possible to perform denoising by thresholding coefficient, provided we represent images with a transform. 
For example, to do a soft thresholding employing biorthonormal 2.8 wavelets, we will use the package PyWavelets:

In [45]: import pywt
In [46]: def dnoise(image, wavelet, noise_var):
   ....:     levels = int(np.floor(np.log2(image.shape[0])))
   ....:     coeffs = pywt.wavedec2(image, wavelet, level=levels)
   ....:     value = noise_var * np.sqrt(2 * np.log2(image.size))
   ....:     threshold = lambda x: pywt.thresholding.soft(x, value)
   ....:     coeffs = map(threshold, coeffs)
   ....:     return pywt.waverec2(coeffs, wavelet)
   ....:
In [47]: plt.figure()
Out[47]: <matplotlib.figure.Figure at 0x10e5ed790>
In [48]: for index, image in enumerate(images):
   ....:     output = dnoise(image, pywt.Wavelet('bior2.8'),
   ....:                     noise_var=0.02)
   ....:     plt.subplot(2, 4, index+1)
   ....:     plt.imshow(image)
   ....:     plt.gray()
   ....:     plt.title(names[index])
   ....:     plt.subplot(2, 4, index+5)
   ....:     plt.imshow(output)
   ....:
In [49]: plt.show()

Observe that the results are of comparable quality to those obtained with the previous method:

Sharpening and blurring

There are many situations that produce blurred images:

Incorrect focus at acquisition
Movement of the imaging system
The point-spread function of the imaging device (like in electron microscopy)
Graphic-art effects

For blurring images, we could replicate the effect of a point-spread function by performing convolution of the image with the corresponding kernel. The Gaussian filter that we used for denoising performs blurring in this fashion. In the general case, convolution with a given kernel can be done with the routine convolve from the module scipy.ndimage. For instance, for a constant kernel supported on a 10 x 10 square, we could do as follows:

In [50]: from scipy.ndimage import convolve;
   ....: from skimage.data import page
In [51]: kernel = np.ones((10, 10))/100.;
   ....: blurred = convolve(page(), kernel)

To emulate the blurring produced by movement, too, we could convolve with a kernel created as follows:

In [52]: from skimage.draw import polygon
In [53]: x_coords = np.array([14, 14, 24, 26, 24, 18, 18]);
   ....: y_coords = np.array([ 2, 18, 26, 24, 22, 18, 2]);
   ....: kernel_2 = np.zeros((32, 32));
   ....: kernel_2[polygon(x_coords, y_coords)] = 1.;
   ....: kernel_2 /= kernel_2.sum()
In [54]: blurred_motion = convolve(page(), kernel_2)

In order to reverse the effects of convolution when we have knowledge of the degradation process, we perform deconvolution. For example, if we have knowledge of the point-spread function, in the module skimage.restoration we have implementations of the Wiener filter (wiener), the unsupervised Wiener filter (unsupervised_wiener), and Lucy-Richardson deconvolution (richardson_lucy). We perform deconvolution on blurred images by applying a Wiener filter. Note the huge improvement in the readability of the text:

In [55]: from skimage.restoration import wiener
In [56]: deconv = wiener(blurred, kernel, balance=0.025, clip=False)

Summary

In this article, we have seen how the SciPy stack helps us solve many problems in mathematical imaging, from the effective representation of digital images to modifying, restoring, or analyzing them. Although certainly not exhaustive, this article scratches but the surface of this challenging field of engineering.

Resources for Article:

Further resources on this subject:
Symbolizers [article]
Machine Learning [article]
Analyzing a Complex Dataset [article]

Object Detection Using Image Features in JavaScript

Packt
05 Oct 2015
16 min read
In this article by Foat Akhmadeev, author of the book Computer Vision for the Web, we will discuss how we can detect an object on an image using several JavaScript libraries. In particular, we will see techniques such as FAST feature detection, and BRIEF and ORB descriptor matching. Finally, the object detection example will be presented.

There are many ways to detect an object on an image. Color object detection and the detection of changes in the intensity of an image are just simple computer vision methods. There are some fundamental things that every computer vision enthusiast should know. The libraries we use here are:

JSFeat (http://inspirit.github.io/jsfeat/)
tracking.js (http://trackingjs.com)

(For more resources related to this topic, see here.)

Detecting key points

What information do we get when we see an object on an image? An object usually consists of some regular parts or unique points, which represent this particular object. Of course, we can compare each pixel of an image, but that is not a good thing in terms of computational speed. Probably, we can take unique points randomly, thus reducing the computation cost significantly, but we will still not get much information from random points. Using all of the information, we can get too much noise and lose important parts of an object representation. Eventually, we need to consider that both ideas, getting all pixels and selecting random pixels, are really bad. So, what can we do in that case?

We are working with a grayscale image and we need to get unique points of an image. Then, we need to focus on the intensity information. For example, we can get object edges with the Canny edge detector or the Sobel filter. We are closer to the solution! But still not close enough. What if we have a long edge? Don't you think that it is a bit bad that we have too many unique pixels that lie on this edge? An edge of an object has end points or corners; if we reduce our edge to those corners, we will get enough unique pixels and remove unnecessary information.

There are various methods of getting keypoints from an image, many of which extract corners as those keypoints. To get them, we will use the FAST (Features from Accelerated Segment Test) algorithm. It is really simple and you can easily implement it by yourself if you want. But you do not need to. The algorithm implementation is provided by both the tracking.js and JSFeat libraries. The idea of the FAST algorithm can be captured from the following image:

Suppose we want to check whether the pixel P is a corner. We will check 16 pixels around it. If at least 9 pixels in an arc around P are much darker or brighter than the value of P, then we say that P is a corner. How much darker or brighter should the pixels around P be? The decision is made by applying a threshold to the difference between the value of P and the value of the pixels around P.

A practical example

First, we will start with an example of FAST corner detection for the tracking.js library. Before we do anything else, we can set the detector threshold. The threshold defines the minimum difference between a tested corner and the points around it:

tracking.Fast.THRESHOLD = 30;

It is usually a good practice to apply a Gaussian blur on an image before we start the method.
It significantly reduces the noise of an image: var imageData = context.getImageData(0, 0, cols, rows); var gray = tracking.Image.grayscale(imageData.data, cols, rows, true); var blurred4 = tracking.Image.blur(gray, cols, rows, 3); Remember that the blur function returns a 4 channel array—RGBA. In that case, we need to convert it to 1-channel. Since we can easily skip other channels, it should not be a problem: var blurred1 = new Array(blurred4.length / 4); for (var i = 0, j = 0; i < blurred4.length; i += 4, ++j) { blurred1[j] = blurred4[i]; } Next, we run a corner detection function on our image array: var corners = tracking.Fast.findCorners(blurred1, cols, rows); The result returns an array with its length twice the length of the corner's number. The array is returned in the format [x0,y0,x1,y1,...]. Where [xn, yn] are coordinates of a detected corner. To print the result on a canvas, we will use the fillRect function: for (i = 0; i < corners.length; i += 2) { context.fillStyle = '#0f0'; context.fillRect(corners[i], corners[i + 1], 3, 3); } Let's see an example with the JSFeat library,. for which the steps are very similar to that of tracking.js. First, we set the global threshold with a function: jsfeat.fast_corners.set_threshold(30); Then, we apply a Gaussian blur to an image matrix and run the corner detection: jsfeat.imgproc.gaussian_blur(matGray, matBlurred, 3); We need to preallocate keypoints for a corners result. The keypoint_t function is just a new type which is useful for keypoints of an image. The first two parameters represent coordinates of a point and the other parameters set: point score (which checks whether the point is good enough to be a key point), point level (which you can use it in an image pyramid, for example), and point angle (which is usually used for the gradient orientation): var corners = []; var i = cols * rows; while (--i >= 0) { corners[i] = new jsfeat.keypoint_t(0, 0, 0, 0, -1); } After all this, we execute the FAST corner detection method. As a last parameter of detection function, we define a border size. The border is used to constrain circles around each possible corner. For example, you cannot precisely say whether the point is a corner for the [0,0] pixel. There is no [0, -3] pixel in our matrix: var count = jsfeat.fast_corners.detect(matBlurred, corners, 3); Since we preallocated the corners, the function returns the number of calculated corners for us. The result returns an array of structures with the x and y fields, so we can print it using those fields: for (var i = 0; i < count; i++) { context.fillStyle = '#0f0'; context.fillRect(corners[i].x, corners[i].y, 3, 3); } The result is nearly the same for both algorithms. The difference is in some parts of realization. Let's look at the following example: From left to right: tracking.js without blur, JSFeat without blur, tracking.js and JSFeat with blur. If you look closely, you can see the difference between tracking.js and JSFeat results, but it is not easy to spot it. Look at how much noise was reduced by applying just a small 3 x 3 Gaussian filter! A lot of noisy points were removed from the background. And now the algorithm can focus on points that represent flowers and the pattern of the vase. We have extracted key points from our image, and we successfully reached the goal of reducing the number of keypoints and focusing on the unique points of an image. Now, we need to compare or match those points somehow. How we can do that? 
Descriptors and object matching

Image features by themselves are a bit useless. Yes, we have found unique points on an image. But what did we get? Only values of pixels and that's it. If we try to compare these values, it will not give us much information. Moreover, if we change the overall image brightness, we will not find the same keypoints on the same image! Taking all of this into account, we need the information that surrounds our key points. Moreover, we need a method to efficiently compare this information. First, we need to describe the image features, which is what image descriptors are for. In this part, we will see how these descriptors can be extracted and matched. The tracking.js and JSFeat libraries provide different methods for image descriptors. We will discuss both.

BRIEF and ORB descriptors

The descriptors theory is focused on changes in image pixels' intensities. The tracking.js library provides the BRIEF (Binary Robust Independent Elementary Features) descriptors and its JSFeat extension—ORB (Oriented FAST and Rotated BRIEF). As we can see from the ORB naming, it is rotation invariant. This means that even if you rotate an object, the algorithm can still detect it. Moreover, the authors of the JSFeat library provide an example using the image pyramid, which is scale invariant too.

Let's start by explaining BRIEF, since it is the source for ORB descriptors. As a first step, the algorithm takes computed image features and, for each feature, it takes unique pairs of elements around it. Based on these pairs' intensities, it forms a binary string. For example, if we have a pair of positions i and j, and if I(i) < I(j) (where I(pos) means the image value at the position pos), then the result is 1, else 0. We add this result to the binary string. We do that for N pairs, where N is taken as a power of 2 (128, 256, 512).

Since descriptors are just binary strings, we can compare them in an efficient manner. To match these strings, the Hamming distance is usually used. It shows the minimum number of substitutions required to change one string to another. For example, if we have two binary strings, 10011 and 11001, the Hamming distance between them is 2, since we need to change 2 bits of information to change the first string to the second.

The JSFeat library provides the functionality to apply ORB descriptors. The core idea is very similar to BRIEF. However, there are two major differences:

The implementation is scale invariant, since the descriptors are computed for an image pyramid.
The descriptors are rotation invariant; the direction is computed using the intensity of the patch around a feature. Using this orientation, ORB manages to compute the BRIEF descriptor in a rotation-invariant manner.

Implementation of descriptors and their matching

Our goal is to find an object from a template on a scene image. We can do that by finding features and descriptors on both images and matching the descriptors from the template to the image. We start from the tracking.js library and BRIEF descriptors. The first thing that we can do is set the number of location pairs:

tracking.Brief.N = 512

By default, it is 256, but you can choose a higher value. The larger the value, the more information you will get, and the more memory and computational cost it requires. Before starting the computation, do not forget to apply the Gaussian blur to reduce the image noise. Next, we find the FAST corners and compute descriptors on both images.
Here and in the next example, we use the suffix Object for a template image and Scene for a scene image: var cornersObject = tracking.Fast.findCorners(grayObject, colsObject, rowsObject); var cornersScene = tracking.Fast.findCorners(grayScene, colsScene, rowsScene); var descriptorsObject = tracking.Brief.getDescriptors(grayObject, colsObject, cornersObject); var descriptorsScene = tracking.Brief.getDescriptors(grayScene, colsScene, cornersScene); Then we do the matching: var matches = tracking.Brief.reciprocalMatch(cornersObject, descriptorsObject, cornersScene, descriptorsScene); We need to pass information of both corners and descriptors to the function, since it returns coordinate information as a result. Next, we print both images on one canvas. To draw the matches using this trick, we need to shift our scene keypoints for the width of a template image as a keypoint1 matching returns a point on a template and keypoint2 returns a point on a scene image. The keypoint1 and keypoint2 are arrays with x and y coordinates at 0 and 1 indexes, respectively: for (var i = 0; i < matches.length; i++) { var color = '#' + Math.floor(Math.random() * 16777215).toString(16); context.fillStyle = color; context.strokeStyle = color; context.fillRect(matches[i].keypoint1[0], matches[i].keypoint1[1], 5, 5); context.fillRect(matches[i].keypoint2[0] + colsObject, matches[i].keypoint2[1], 5, 5); context.beginPath(); context.moveTo(matches[i].keypoint1[0], matches[i].keypoint1[1]); context.lineTo(matches[i].keypoint2[0] + colsObject, matches[i].keypoint2[1]); context.stroke(); } The JSFeat library provides most of the code for pyramids and scale invariant features not in the library, but in the examples, which are available on https://github.com/inspirit/jsfeat/blob/gh-pages/sample_orb.html. We will not provide the full code here, because it requires too much space. But do not worry, we will highlight main topics here. Let's start from functions that are included in the library. First, we need to preallocate the descriptors matrix, where 32 is the length of a descriptor and 500 is the maximum number of descriptors. Again, 32 is a power of two: var descriptors = new jsfeat.matrix_t(32, 500, jsfeat.U8C1_t); Then, we compute the ORB descriptors for each corner, we need to do that for both template and scene images: jsfeat.orb.describe(matBlurred, corners, num_corners, descriptors); The function uses global variables, which mainly define input descriptors and output matching: function match_pattern() The result match_t contains the following fields: screen_idx: This is the index of a scene descriptor pattern_lev: This is the index of a pyramid level pattern_idx: This is the index of a template descriptor Since ORB works with the image pyramid, it returns corners and matches for each level: var s_kp = screen_corners[m.screen_idx]; var p_kp = pattern_corners[m.pattern_lev][m.pattern_idx]; We can print each matching as shown here. Again, we use Shift, since we computed descriptors on separate images, but print the result on one canvas: context.fillRect(p_kp.x, p_kp.y, 4, 4); context.fillRect(s_kp.x + shift, s_kp.y, 4, 4); Working with a perspective Let's take a step away. Sometimes, an object you want to detect is affected by a perspective distortion. In that case, you may want to rectify an object plane. For example, a building wall: Looks good, doesn't it? How do we do that? 
Let's look at the code:

var imgRectified = new jsfeat.matrix_t(mat.cols, mat.rows, jsfeat.U8_t | jsfeat.C1_t);
var transform = new jsfeat.matrix_t(3, 3, jsfeat.F32_t | jsfeat.C1_t);
jsfeat.math.perspective_4point_transform(transform,
    0, 0, 0, 0, // first pair x1_src, y1_src, x1_dst, y1_dst
    640, 0, 640, 0, // x2_src, y2_src, x2_dst, y2_dst and so on.
    640, 480, 640, 480,
    0, 480, 180, 480);
jsfeat.matmath.invert_3x3(transform, transform);
jsfeat.imgproc.warp_perspective(mat, imgRectified, transform, 255);

First, as we did earlier, we define a result matrix object. Next, we assign a matrix for the image perspective transformation. We calculate it based on four pairs of corresponding points. For example, the last, that is, the fourth point of the original image, which is [0, 480], should be projected to the point [180, 480] on the rectified image. Here, the first coordinate refers to x and the second to y. Then, we invert the transform matrix to be able to apply it to the original image—the mat variable. We pick the background color as white (255 for an unsigned byte). As a result, we get a nice image without any perspective distortion.

Finding an object location

Returning to our primary goal, we found a match. That is great. But what we did not do is find the object's location. There is no function for that in the tracking.js library, but JSFeat provides such functionality in the examples section.

First, we need to compute a perspective transform matrix. This is why we discussed the example of such a transformation previously. We have points from two images but we do not have a transformation for the whole image. First, we define a transform matrix:

var homo3x3 = new jsfeat.matrix_t(3, 3, jsfeat.F32C1_t);

To compute the homography, we need only four points. But after the matching, we get too many. In addition, there can be noisy points, which we will need to skip somehow. For that, we use the RANSAC (Random Sample Consensus) algorithm. It is an iterative method for estimating a mathematical model from a dataset that contains outliers (noise). It estimates outliers and generates a model that is computed without the noisy data.

Before we start, we need to define the algorithm parameters. The first parameter is a match mask, where all matches will be marked as good (1) or bad (0):

var match_mask = new jsfeat.matrix_t(500, 1, jsfeat.U8C1_t);

Our mathematical model to find:

var mm_kernel = new jsfeat.motion_model.homography2d();

Minimum number of points to estimate a model (4 points to get a homography):

var num_model_points = 4;

Maximum threshold to classify a data point as an inlier or a good match:

var reproj_threshold = 3;

Finally, the variable that holds the main parameters; the last two arguments define the maximum ratio of outliers and the probability of success, that is, the algorithm stops at the point where the number of inliers is 99 percent:

var ransac_param = new jsfeat.ransac_params_t(num_model_points,
    reproj_threshold, 0.5, 0.99);

Then, we run the RANSAC algorithm. The last parameter represents the maximum number of iterations for the algorithm:

jsfeat.motion_estimator.ransac(ransac_param, mm_kernel,
    object_xy, screen_xy, count, homo3x3, match_mask, 1000);

The shape finding can be applied to both the tracking.js and JSFeat libraries; you just need to set the matches as object_xy and screen_xy, where those arguments must hold an array of objects with the x and y fields.
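Following the note above, a minimal sketch of how those arrays could be assembled from the tracking.js matches computed earlier might look like this (this is illustrative glue code, not part of either library):

// Build the point lists expected by the RANSAC call from the BRIEF matches.
// keypoint1 belongs to the template image, keypoint2 to the scene image.
// Note that match_mask above was allocated for up to 500 matches.
var object_xy = [];
var screen_xy = [];
for (var i = 0; i < matches.length; i++) {
  object_xy.push({ x: matches[i].keypoint1[0], y: matches[i].keypoint1[1] });
  screen_xy.push({ x: matches[i].keypoint2[0], y: matches[i].keypoint2[1] });
}
var count = matches.length;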
After we find the transformation matrix, we compute the projected shape of the object on a new image:

var shape_pts = tCorners(homo3x3.data, colsObject, rowsObject);

After the computation is done, we draw the computed shapes on our images:

As we can see, our program successfully found the object in both cases. Actually, both methods can show different performance; it mainly depends on the thresholds you set.

Summary

Image features and descriptor matching are powerful tools for object detection. Both the JSFeat and tracking.js libraries provide different functionalities to match objects using these features. In addition to this, the JSFeat project contains algorithms for object finding. These methods can be useful for tasks such as unique object detection, face tracking, and creating a human interface by tracking various objects very efficiently.

Resources for Article:

Further resources on this subject:
Welcome to JavaScript in the full stack [article]
Introducing JAX-RS API [article]
Using Google Maps APIs with Knockout.js [article]

Data Around Us

Packt
29 Sep 2015
25 min read
In this article by Gergely Daróczi, author of the book Mastering Data Analysis with R, we will discuss spatial data, also known as geospatial data, which identifies geographic locations, such as natural or constructed features around us. Although all observations have some spatial content, such as the location of the observation, this is out of most data analysis tools' range due to the complex nature of spatial information; alternatively, the spatiality might not be that interesting (at first sight) in the given research topic. On the other hand, analyzing spatial data can reveal some very important underlying structures of the data, and it is well worth spending time visualizing the differences and similarities between close or far data points. In this article, we are going to help with this and will use a variety of R packages to:

Retrieve geospatial information from the Internet
Visualize points and polygons on a map

(For more resources related to this topic, see here.)

Geocoding

We will use the hflights dataset to demonstrate how one can deal with data bearing spatial information. To this end, let's aggregate our dataset, but instead of generating daily data, let's view the aggregated characteristics of the airports. For the sake of performance, we will use the data.table package:

> library(hflights)
> library(data.table)
> dt <- data.table(hflights)[, list(
+     N = .N,
+     Cancelled = sum(Cancelled),
+     Distance = Distance[1],
+     TimeVar = sd(ActualElapsedTime, na.rm = TRUE),
+     ArrDelay = mean(ArrDelay, na.rm = TRUE)), by = Dest]

So we have loaded and then immediately transformed the hflights dataset to a data.table object. At the same time, we aggregated by the destination of the flights to compute:

The number of rows
The number of cancelled flights
The distance
The standard deviation of the elapsed time of the flights
The arithmetic mean of the delays

The resulting R object looks like this:

> str(dt)
Classes 'data.table' and 'data.frame': 116 obs. of 6 variables:
 $ Dest     : chr "DFW" "MIA" "SEA" "JFK" ...
 $ N        : int 6653 2463 2615 695 402 6823 4893 5022 6064 ...
 $ Cancelled: int 153 24 4 18 1 40 40 27 33 28 ...
 $ Distance : int 224 964 1874 1428 3904 305 191 140 1379 862 ...
 $ TimeVar  : num 10 12.4 16.5 19.2 15.3 ...
 $ ArrDelay : num 5.961 0.649 9.652 9.859 10.927 ...
 - attr(*, ".internal.selfref")=<externalptr>

So we have 116 observations all around the world and five variables describing those. Although this seems to be a spatial dataset, we have no geospatial identifiers that a computer can understand per se, so let's fetch the geocodes of these airports from the Google Maps API via the ggmap package. First, let's see how it works when we are looking for the geo-coordinates of Houston:

> library(ggmap)
> (h <- geocode('Houston, TX'))
Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Houston,+TX&sensor=false
       lon      lat
1 -95.3698 29.76043

So the geocode function can return the matched latitude and longitude of the string we sent to Google. Now let's do the very same thing for all flight destinations:

> dt[, c('lon', 'lat') := geocode(Dest)]

Well, this took some time, as we had to make 116 separate queries to the Google Maps API. Please note that Google limits you to 2,500 queries a day without authentication, so do not run this on a large dataset. There is a helper function in the package, called geocodeQueryCheck, which can be used to check the remaining number of free queries for the day.
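As a quick illustration, the quota check is a single call; the exact message it prints depends on your ggmap version:

> geocodeQueryCheck()   # reports the number of free geocoding queries left for the day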
Some of the methods and functions we plan to use in some later sections of this article do not support data.table, so let's fall back to the traditional data.frame format and also print the structure of the current object: > str(setDF(dt)) 'data.frame': 116 obs. of 8 variables: $ Dest : chr "DFW" "MIA" "SEA" "JFK" ... $ N : int 6653 2463 2615 695 402 6823 4893 5022 6064 ... $ Cancelled: int 153 24 4 18 1 40 40 27 33 28 ... $ Distance : int 224 964 1874 1428 3904 305 191 140 1379 862 ... $ TimeVar : num 10 12.4 16.5 19.2 15.3 ... $ ArrDelay : num 5.961 0.649 9.652 9.859 10.927 ... $ lon : num -97 136.5 -122.3 -73.8 -157.9 ... $ lat : num 32.9 34.7 47.5 40.6 21.3 ... This was pretty quick and easy, wasn't it? Now that we have the longitude and latitude values of all the airports, we can try to show these points on a map. Visualizing point data in space For the first time, let's keep it simple and load some package-bundled polygons as the base map. To this end, we will use the maps package. After loading it, we use the map function to render the polygons of the United States of America, add a title, and then some points for the airports and also for Houston with a slightly modified symbol: > library(maps) > map('state') > title('Flight destinations from Houston,TX') > points(h$lon, h$lat, col = 'blue', pch = 13) > points(dt$lon, dt$lat, col = 'red', pch = 19) And showing the airport names on the plot is pretty easy as well: we can use the well-known functions from the base graphics package. Let's pass the three character names as labels to the text function with a slightly increased y value to shift the preceding text the previously rendered data points: > text(dt$lon, dt$lat + 1, labels = dt$Dest, cex = 0.7) Now we can also specify the color of the points to be rendered. This feature can be used to plot our first meaningful map to highlight the number of flights in 2011 to different parts of the USA: > map('state') > title('Frequent flight destinations from Houston,TX') > points(h$lon, h$lat, col = 'blue', pch = 13) > points(dt$lon, dt$lat, pch = 19, + col = rgb(1, 0, 0, dt$N / max(dt$N))) > legend('bottomright', legend = round(quantile(dt$N)), pch = 19, + col = rgb(1, 0, 0, quantile(dt$N) / max(dt$N)), box.col = NA) So the intensity of red shows the number of flights to the given points (airports); the values range from 1 to almost 10,000. Probably it would be more meaningful to compute these values on a state level, as there are many airports, very close to each other, that might be better aggregated at a higher administrative area level. To this end, we load the polygon of the states, match the points of interest (airports) with the overlaying polygons (states), and render the polygons as a thematic map instead of points, like we did on the previous pages. Finding polygon overlays of point data We already have all the data we need to identify the parent state of each airport. The dt dataset includes the geo-coordinates of the locations, and we managed to render the states as polygons with the map function. Actually, this latter function can return the underlying dataset without rendering a plot: > str(map_data <- map('state', plot = FALSE, fill = TRUE)) List of 4 $ x : num [1:15599] -87.5 -87.5 -87.5 -87.5 -87.6 ... $ y : num [1:15599] 30.4 30.4 30.4 30.3 30.3 ... $ range: num [1:4] -124.7 -67 25.1 49.4 $ names: chr [1:63] "alabama" "arizona" "arkansas" "california" ... 
- attr(*, "class")= chr "map" So we have around 16,000 points describing the boundaries of the US states, but this map data is more detailed than we actually need (see for example the name of the polygons starting with Washington): > grep('^washington', map_data$names, value = TRUE) [1] "washington:san juan island" "washington:lopez island" [3] "washington:orcas island" "washington:whidbey island" [5] "washington:main" In short, the non-connecting parts of a state are defined as separate polygons. To this end, let's save a list of the state names without the string after the colon: > states <- sapply(strsplit(map_data$names, ':'), '[[', 1) We will use this list as the basis of aggregation from now on. Let's transform this map dataset into another class of object, so that we can use the powerful features of the sp package. We will use the maptools package to do this transformation: > library(maptools) > us <- map2SpatialPolygons(map_data, IDs = states, + proj4string = CRS("+proj=longlat +datum=WGS84")) An alternative way of getting the state polygons might be to directly load those instead of transforming from other data formats as described earlier. To this end, you may find the raster package especially useful to download free map shapefiles from gadm.org via the getData function. Although these maps are way too detailed for such a simple task, you can always simplify those—for example, with the gSimplify function of the rgeos package. So we have just created an object called us, which includes the polygons of map_data for each state with the given projection. This object can be shown on a map just like we did previously, although you should use the general plot method instead of the map function: > plot(us) Besides this, however, the sp package supports so many powerful features! For example, it's very easy to identify the overlay polygons of the provided points via the over function. As this function name conflicts with the one found in the grDevices package, it's better to refer to the function along with the namespace using a double colon: > library(sp) > dtp <- SpatialPointsDataFrame(dt[, c('lon', 'lat')], dt, + proj4string = CRS("+proj=longlat +datum=WGS84")) > str(sp::over(us, dtp)) 'data.frame': 49 obs. of 8 variables: $ Dest : chr "BHM" "PHX" "XNA" "LAX" ... $ N : int 2736 5096 1172 6064 164 NA NA 2699 3085 7886 ... $ Cancelled: int 39 29 34 33 1 NA NA 35 11 141 ... $ Distance : int 562 1009 438 1379 926 NA NA 1208 787 689 ... $ TimeVar : num 10.1 13.61 9.47 15.16 13.82 ... $ ArrDelay : num 8.696 2.166 6.896 8.321 -0.451 ... $ lon : num -86.8 -112.1 -94.3 -118.4 -107.9 ... $ lat : num 33.6 33.4 36.3 33.9 38.5 ... What happened here? First, we passed the coordinates and the whole dataset to the SpatialPointsDataFrame function, which stored our data as spatial points with the given longitude and latitude values. Next we called the over function to left-join the values of dtp to the US states. An alternative way of identifying the state of a given airport is to ask for more detailed information from the Google Maps API. By changing the default output argument of the geocode function, we can get all address components for the matched spatial object, which of course includes the state as well. Look for example at the following code snippet: geocode('LAX','all')$results[[1]]$address_components Based on this, you might want to get a similar output for all airports and filter the list for the short name of the state. 
The rlist package would be extremely useful in this task, as it offers some very convenient ways of manipulating lists in R. The only problem here is that we matched only one airport to the states, which is definitely not okay. See for example the fourth column in the earlier output: it shows LAX as the matched airport for California (returned by states[4]), although there are many others there as well. To overcome this issue, we can do at least two things. First, we can use the returnList argument of the over function to return all matched rows of dtp, and we will then post-process that data: > str(sapply(sp::over(us, dtp, returnList = TRUE), + function(x) sum(x$Cancelled))) Named int [1:49] 51 44 34 97 23 0 0 35 66 149 ... - attr(*, "names")= chr [1:49] "alabama" "arizona" "arkansas" ... So we created and called an anonymous function that will sum up the Cancelled values of the data.frame in each element of the list returned by over. Another, probably cleaner, approach is to redefine dtp to only include the related values and pass a function to over to do the summary: > dtp <- SpatialPointsDataFrame(dt[, c('lon', 'lat')], + dt[, 'Cancelled', drop = FALSE], + proj4string = CRS("+proj=longlat +datum=WGS84")) > str(cancels <- sp::over(us, dtp, fn = sum)) 'data.frame': 49 obs. of 1 variable: $ Cancelled: int 51 44 34 97 23 NA NA 35 66 149 ... Either way, we have a vector to merge back to the US state names: > val <- cancels$Cancelled[match(states, row.names(cancels))] And to update all missing values to zero (as the number of cancelled flights in a state without any airport is not missing data, but exactly zero for sure): > val[is.na(val)] <- 0 Plotting thematic maps Now we have everything to create our first thematic map. Let's pass the val vector to the previously used map function (or plot it using the us object), specify a plot title, add a blue point for Houston, and then create a legend, which shows the quantiles of the overall number of cancelled flights as a reference: > map("state", col = rgb(1, 0, 0, sqrt(val/max(val))), fill = TRUE) > title('Number of cancelled flights from Houston to US states') > points(h$lon, h$lat, col = 'blue', pch = 13) > legend('bottomright', legend = round(quantile(val)), + fill = rgb(1, 0, 0, sqrt(quantile(val)/max(val))), box.col = NA) Please note that, instead of a linear scale, we decided to compute the square root of the relative values to define the intensity of the fill color, so that we can visually highlight the differences between the states. This was necessary as most flight cancellations happened in Texas (748), and there were no more than 150 cancelled flights in any other state (with the average being around 45). You can also easily load ESRI shape files or other geospatial vector data formats into R as points or polygons with a bunch of packages already discussed and a few others as well, such as the maptools, rgdal, dismo, raster, or shapefile packages. Another, probably easier, way to generate country-level thematic maps, especially choropleth maps, is to load the rworldmap package made by Andy South, and rely on the convenient mapCountryData function. Rendering polygons around points Besides thematic maps, another really useful way of presenting spatial data is to draw artificial polygons around the data points based on the data values. This is especially useful if there is no available polygon shape file to be used to generate a thematic map. 
A level plot, contour plot, or isopleths, might be an already familiar design from tourist maps, where the altitude of the mountains is represented by a line drawn around the center of the hill at the very same levels. This is a very smart approach having maps present the height of hills—projecting this third dimension onto a 2-dimensional image. Now let's try to replicate this design by considering our data points as mountains on the otherwise flat map. We already know the heights and exact geo-coordinates of the geometric centers of these hills (airports); the only challenge here is to draw the actual shape of these objects. In other words: Are these mountains connected? How steep are the hillsides? Should we consider any underlying spatial effects in the data? In other words, can we actually render these as mountains with a 3D shape instead of plotting independent points in space? If the answer for the last question is positive, then we can start trying to answer the other questions by fine-tuning the plot parameters. For now, let's simply suppose that there is a spatial effect in the underlying data, and it makes sense to visualize the data in such a way. Later, we will have the chance to disprove or support this statement either by analyzing the generated plots, or by building some geo-spatial models—some of these will be discussed later, in the Spatial Statistics section. Contour lines First, let's expand our data points into a matrix with the fields package. The size of the resulting R object is defined arbitrarily but, for the given number of rows and columns, which should be a lot higher to generate higher resolution images, 256 is a good start: > library(fields) > out <- as.image(dt$ArrDelay, x = dt[, c('lon', 'lat')], + nrow = 256, ncol = 256) The as.image function generates a special R object, which in short includes a 3‑dimensional matrix-like data structure, where the x and y axes represent the longitude and latitude ranges of the original data respectively. To simplify this even more, we have a matrix with 256 rows and 256 columns, where each of those represents a discrete value evenly distributed between the lowest and highest values of the latitude and longitude. And on the z axis, we have the ArrDelay values—which are in most cases of course missing: > table(is.na(out$z)) FALSE TRUE 112 65424 What does this matrix look like? It's better to see what we have at the moment: > image(out) Well, this does not seem to be useful at all. What is shown there? We rendered the x and y dimensions of the matrix with z colors here, and most tiles of this map are empty due to the high amount of missing values in z. Also, it's pretty straightforward now that the dataset included many airports outside the USA as well. How does it look if we focus only on the USA? > image(out, xlim = base::range(map_data$x, na.rm = TRUE), + ylim = base::range(map_data$y, na.rm = TRUE)) An alternative and more elegant approach to rendering only the US part of the matrix would be to drop the non-US airports from the database before actually creating the out R object. Although we will continue with this example for didactic purposes, with real data make sure you concentrate on the target subset of your data instead of trying to smooth and model unrelated data points as well. A lot better! So we have our data points as a tile, now let's try to identify the slope of these mountain peaks, to be able to render them on a future map. 
This can be done by smoothing the matrix: > look <- image.smooth(out, theta = .5) > table(is.na(look$z)) FALSE TRUE 14470 51066 As can be seen in the preceding table, this algorithm successfully eliminated many missing values from the matrix. The image.smooth function basically reused our initial data point values in the neighboring tiles, and computed some kind of average for the conflicting overrides. This smoothing algorithm results in the following arbitrary map, which does not respect any political or geographical boundaries: > image(look) It would be really nice to plot these artificial polygons along with the administrative boundaries, so let's clear out all cells that do not belong to the territory of the USA. We will use the point.in.polygon function from the sp package to do so: > usa_data <- map('usa', plot = FALSE, region = 'main') > p <- expand.grid(look$x, look$y) > library(sp) > n <- which(point.in.polygon(p$Var1, p$Var2, + usa_data$x, usa_data$y) == 0) > look$z[n] <- NA In a nutshell, we have loaded the main polygon of the USA without any sub-administrative areas, and verified our cells in the look object, if those are overlapping the polygon. Then we simply reset the value of the cell, if not. The next step is to render the boundaries of the USA, plot our smoothed contour plot, then add some eye-candy in the means of the US states and, the main point of interest, the airport: > map("usa") > image(look, add = TRUE) > map("state", lwd = 3, add = TRUE) > title('Arrival delays of flights from Houston') > points(dt$lon, dt$lat, pch = 19, cex = .5) > points(h$lon, h$lat, pch = 13) Now this is pretty neat, isn't it? Voronoi diagrams An alternative way of visualizing point data with polygons is to generate Voronoi cells between them. In short, the Voronoi map partitions the space into regions around the data points by aligning all parts of the map to one of the regions to minimize the distance from the central data points. This is extremely easy to interpret, and also to implement in R. The deldir package provides a function with the very same name for Delaunay triangulation: > library(deldir) > map("usa") > plot(deldir(dt$lon, dt$lat), wlines = "tess", lwd = 2, + pch = 19, col = c('red', 'darkgray'), add = TRUE) Here, we represented the airports with red dots, as we did before, but also added the Dirichlet tessellation (Voronoi cells) rendered as dark-gray dashed lines. For more options on how to fine-tune the results, see the plot.deldir method. In the next section, let's see how to improve this plot by adding a more detailed background map to it. Satellite maps There are many R packages on CRAN that can fetch data from Google Maps, Stamen, Bing, or OpenStreetMap—even some of the packages we previously used in this article, like the ggmap package, can do this. Similarly, the dismo package also comes with both geo-coding and Google Maps API integration capabilities, and there are some other packages focused on that latter, such as the RgoogleMaps package. Now we will use the OpenStreetMap package, mainly because it supports not only the awesome OpenStreetMap database back-end, but also a bunch of other formats as well. 
For example, we can render really nice terrain maps via Stamen: > library(OpenStreetMap) > map <- openmap(c(max(map_data$y, na.rm = TRUE), + min(map_data$x, na.rm = TRUE)), + c(min(map_data$y, na.rm = TRUE), + max(map_data$x, na.rm = TRUE)), + type = 'stamen-terrain') So we defined the left upper and right lower corners of the map we need, and also specified the map style to be a satellite map. As the data by default arrives from the remote servers with the Mercator projections, we first have to transform that to WGS84 (we used this previously), so that we can render the points and polygons on the top of the fetched map: > map <- openproj(map, + projection = '+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs') And Showtime at last: > plot(map) > plot(deldir(dt$lon, dt$lat), wlines = "tess", lwd = 2, + col = c('red', 'black'), pch = 19, cex = 0.5, add = TRUE) This seems to be a lot better compared to the outline map we created previously. Now you can try some other map styles as well, such as mapquest-aerial, or some of the really nice-looking cloudMade designs. Interactive maps Besides being able to use Web-services to download map tiles for the background of the maps created in R, we can also rely on some of those to generate truly interactive maps. One of the best known related services is the Google Visualization API, which provides a platform for hosting visualizations made by the community; you can also use it to share maps you've created with others. Querying Google Maps In R, you can access this API via the googleVis package written and maintained by Markus Gesmann and Diego de Castillo. Most functions of the package generate HTML and JavaScript code that we can directly view in a Web browser as an SVG object with the base plot function; alternatively, we can integrate them in a Web page, for example via the IFRAME HTML tag. The gvisIntensityMap function takes a data.frame with country ISO or USA state codes and the actual data to create a simple intensity map. We will use the cancels dataset we created in the Finding Polygon Overlays of Point Data section but, before that, we have to do some data transformations. Let's add the state name as a new column to the data.frame, and replace the missing values with zero: > cancels$state <- rownames(cancels) > cancels$Cancelled[is.na(cancels$Cancelled)] <- 0 Now it's time to load the package and pass the data along with a few extra parameters, signifying that we want to generate a state-level US map: > library(googleVis) > plot(gvisGeoChart(cancels, 'state', 'Cancelled', + options = list( + region = 'US', + displayMode = 'regions', + resolution = 'provinces'))) The package also offers opportunities to query the Google Map API via the gvisMap function. We will use this feature to render the airports from the dt dataset as points on a Google Map with an auto-generated tooltip of the variables. But first, as usual, we have to do some data transformations again. The location argument of the gvisMap function takes the latitude and longitude values separated by a colon: > dt$LatLong <- paste(dt$lat, dt$lon, sep = ':') We also have to generate the tooltips as a new variable, which can be done easily with an apply call. 
We will concatenate the variable names and actual values separated by a HTML line break: > dt$tip <- apply(dt, 1, function(x) + paste(names(dt), x, collapse = '<br/ >')) And now we just pass these arguments to the function for an instant interactive map: > plot(gvisMap(dt, 'LatLong', tipvar = 'tip')) Another nifty feature of the googleVis package is that you can easily merge the different visualizations into one by using the gvisMerge function. The use of this function is quite simple: specify any two gvis objects you want to merge, and also whether they are to be placed horizontally or vertically. JavaScript mapping libraries The great success of the trending JavaScript data visualization libraries is only partly due to their great design. I suspect other factors also contribute to the general spread of such tools: it's very easy to create and deploy full-blown data models, especially since the release and on-going development of Mike Bostock's D3.js. Although there are also many really useful and smart R packages to interact directly with D3 and topojson (see for example my R user activity compilation at http://bit.ly/countRies). Now we will only focus on how to use Leaflet— probably the most used JavaScript library for interactive maps. What I truly love in R is that there are many packages wrapping other tools, so that R users can rely on only one programming language, and we can easily use C++ programs and Hadoop MapReduce jobs or build JavaScript-powered dashboards without actually knowing anything about the underlying technology. This is especially true when it comes to Leaflet! There are at least two very nice packages that can generate a Leaflet plot from the R console, without a single line of JavaScript. The Leaflet reference class of the rCharts package was developed by Ramnath Vaidyanathan, and includes some methods to create a new object, set the viewport and zoom level, add some points or polygons to the map, and then render or print the generated HTML and JavaScript code to the console or to a file. Unfortunately, this package is not on CRAN yet, so you have to install it from GitHub: > devtools::install_github('ramnathv/rCharts') As a quick example, let's generate a Leaflet map of the airports with some tooltips, like we did with the Google Maps API in the previous section. As the setView method expects numeric geo-coordinates as the center of the map, we will use Kansas City's airport as a reference: > library(rCharts) > map <- Leaflet$new() > map$setView(as.numeric(dt[which(dt$Dest == 'MCI'), + c('lat', 'lon')]), zoom = 4) > for (i in 1:nrow(dt)) + map$marker(c(dt$lat[i], dt$lon[i]), bindPopup = dt$tip[i]) > map$show() Similarly, RStudio's leaflet package and the more general htmlwidgets package also provide some easy ways to generate JavaScript-powered data visualizations. Let's load the library and define the steps one by one using the pipe operator from the magrittr package, which is pretty standard for all packages created or inspired by RStudio or Hadley Wickham: > library(leaflet) > leaflet(us) %>% + addProviderTiles("Acetate.terrain") %>% + addPolygons() %>% + addMarkers(lng = dt$lon, lat = dt$lat, popup = dt$tip) I especially like this latter map, as we can load a third-party satellite map in the background, then render the states as polygons; we also added the original data points along with some useful tooltips on the very same map with literally a one-line R command. 
We could even color the state polygons based on the aggregated results we computed in the previous sections! Ever tried to do the same in Java? Alternative map designs Besides being able to use some third-party tools, another main reason why I tend to use R for all my data analysis tasks is that R is extremely powerful in creating custom data exploration, visualization, and modeling designs. As an example, let's create a flow-map based on our data, where we will highlight the flights from Houston based on the number of actual and cancelled flights. We will use lines and circles to render these two variables on a 2-dimensional map, and we will also add a contour plot in the background based on the average time delay. But, as usual, let's do some data transformations first! To keep the number of flows at a minimal level, let's get rid of the airports outside the USA at last: > dt <- dt[point.in.polygon(dt$lon, dt$lat, + usa_data$x, usa_data$y) == 1, ] We will need the diagram package (to render curved arrows from Houston to the destination airports) and the scales package to create transparent colors: > library(diagram) > library(scales) Then let's render the contour map described in the Contour Lines section: > map("usa") > title('Number of flights, cancellations and delays from Houston') > image(look, add = TRUE) > map("state", lwd = 3, add = TRUE) And then add a curved line from Houston to each of the destination airports, where the width of the line represents the number of cancelled flights and the diameter of the target circles shows the number of actual flights: > for (i in 1:nrow(dt)) { + curvedarrow( + from = rev(as.numeric(h)), + to = as.numeric(dt[i, c('lon', 'lat')]), + arr.pos = 1, + arr.type = 'circle', + curve = 0.1, + arr.col = alpha('black', dt$N[i] / max(dt$N)), + arr.length = dt$N[i] / max(dt$N), + lwd = dt$Cancelled[i] / max(dt$Cancelled) * 25, + lcol = alpha('black', + dt$Cancelled[i] / max(dt$Cancelled))) + } Well, this article ended up being about visualizing spatial data, and not really about analyzing spatial data by fitting models and filtering raw data. Summary In case you are interested in knowing other R-related books that Packt has in store for you, here is the link: R for Data Science Practical Data Science Cookbook Resources for Article: Further resources on this subject: R ─ Classification and Regression Trees[article] An overview of common machine learning tasks[article] Reduction with Principal Component Analysis [article]

The Dashboard Design – Best Practices

Packt
25 Sep 2015
11 min read
In this article by Julian Villafuerte, author of the book Creating Stunning Dashboards with QlikView, you will learn more about best practices for dashboard design.

(For more resources related to this topic, see here.)

Data visualization is a field that is constantly evolving. However, some concepts have proven their value time and again through the years and have become what we call best practices. These notions should not be seen as strict rules that must be applied without any further consideration, but as a series of tips that will help you create better applications. If you are a beginner, try to stick to them as much as you can. These best practices will save you a lot of trouble and will greatly enhance your first endeavors. On the other hand, if you are an advanced developer, combine them with your personal experiences in order to build the ultimate dashboard.

Some guidelines in this article come from widely known figures in the field of data visualization, such as Stephen Few, Edward Tufte, John Tukey, Alberto Cairo, and Nathan Yau. So, if a concept catches your attention, I strongly recommend that you read more about it in their books. Throughout this article, we will review some useful recommendations that will help you create not only engaging, but also effective and user-friendly dashboards. Remember that they may apply differently depending on the information displayed and the audience you are working with. Nevertheless, they are great guidelines for the field of data visualization, so do not hesitate to consider them in all of your developments.

Gestalt principles

In the early 1900s, the Gestalt school of psychology conducted a series of studies on human perception in order to understand how our brain interprets forms and recognizes patterns. Understanding these principles may help you create a better structure for your dashboard and make your charts easier to interpret:

Proximity: When we see multiple elements located near one another, we tend to see them as groups. For example, we can visually distinguish clusters in a scatter plot by grouping the dots according to their position.

Similarity: Our brain associates the elements that are similar to each other (in terms of shape, size, color, or orientation). For example, in color-coded bar charts, we can associate the bars that share the same color even if they are not grouped.

Enclosure: If a border surrounds a series of objects, we perceive them as part of a group. For example, if a scatter plot has reference lines that wrap the elements between 20 and 30 percent, we will automatically see them as a cluster.

Closure: When we detect a figure that looks incomplete, we tend to perceive it as a closed structure. For example, even if we discard the borders of a bar chart, the axes will form a region that our brain will isolate without needing the extra lines.

Continuity: If a number of objects are aligned, we will perceive them as a continuous body. For example, the different blocks of code when you indent QlikView script are perceived as one continuous block of code.

Connection: If objects are connected by a line, we will see them as a group. For example, we tend to associate the dots connected by lines on a scatter plot with lines and symbols.

Giving context to the data

When it comes to analyzing data, context is everything. If you present isolated figures, the users will have a hard time trying to find the story hidden behind them.
For example, if I told you that the gross margin of our company was 16.5 percent during the first quarter of 2015, would you evaluate it as a positive or negative sign? This is pretty difficult, right? However, what if we added some extra information to complement this KPI? Then, the following image would make a lot more sense: As you can see, adding context to the data can make the landscape look quite different. Now, it is easy to see that even though the gross margin has substantially improved during the last year, our company has some work to do in order to be competitive and surpass the industry standard. The appropriate references may change depending on the KPI you are dealing with and the goals of the organization, but some common examples are as follows:

Last year's performance
The quota, budget, or objective
Comparison with the closest competitor, product, or employee
The market share
The industry standards

Another good tip in this regard is to anticipate the comparisons. If you display figures regarding the monthly quota and the actual sales, you can save the users the mental calculations by including complementary indicators, such as the gap between them and the percentage of completion.

Data-Ink Ratio

One of the most interesting principles in the field of data visualization is the Data-Ink Ratio, introduced by Edward R. Tufte in his book, The Visual Display of Quantitative Information, a must-read for every designer. In this publication, he states that there are two different types of ink (or, in our case, pixels) in any chart, as follows:

Data-ink: This includes all the nonerasable portions of a graphic that are used to represent the actual data. These pixels are at the core of the visualization and cannot be removed without losing some of its content.

Non-data-ink: This includes any element that is not directly related to the data or doesn't convey anything meaningful to the reader.

Based on these concepts, he defined the Data-Ink Ratio as the proportion of the graphic's ink that is devoted to the nonredundant display of data information:

Data-Ink Ratio = Data-Ink / Total Ink

As you can imagine, our goal is to maximize this number by decreasing the non-data-ink used in our dashboards. For example, the chart to the left has a low data-ink ratio due to the usage of 3D effects, shadows, backgrounds, and multiple grid lines. On the contrary, the chart to the right presents a higher ratio, as most of the pixels are data-related.

Avoiding chart junk

Chart junk is another term coined by Tufte that refers to all the elements that distract the viewer from the actual information in a graphic. Evidently, chart junk is considered non-data-ink and comprises features such as heavy gridlines, frames, redundant labels, ornamental axes, backgrounds, overly complex fonts, shadows, images, or other effects included only as decoration. Take, for instance, the following charts: As you can see, by removing all the unnecessary elements in a chart, it becomes easier to interpret and looks much more elegant.

Balance

Colors, icons, reference lines, and other visual cues can be very useful to help the users focus on the most important elements in a dashboard. However, misusing or overusing these features can be a real hazard, so try to find the adequate balance for each of them.

Excessive precision

QlikView applications should use the appropriate language for each audience. When designing, think about whether precise figures will be useful or if they are going to become a distraction.
Most of the time, dashboards show high-level KPIs, so it may be more comfortable for certain users to see rounded numbers, as in the following image:

3D charts

One of Microsoft Excel's greatest wrongdoings is making everyone believe that 3D charts are good for data analysis. For some reason, people seem to love them; but, believe me, they are a real threat to business analysts. Despite their visual charm, these representations can easily hide some parts of the information and convey wrong perceptions depending on their usage of colors, shadows, and axis inclination. I strongly recommend avoiding them in any context.

Sorting

Whether you are working with a list box, a bar chart, or a straight table, sorting an object is always advisable, as it adds context to the data. It can help you find the most commonly selected items in a list box, distinguish which slice is bigger on a pie chart when the sizes are similar, or easily spot the outliers in other graphic representations.

Alignment and distribution

Most of my colleagues argue that I am on the verge of an obsessive-compulsive disorder, but I cannot stand an application with unaligned objects. (Actually, I am still struggling with the fact that the paragraphs in this book are not justified, but anyway...). The design toolbar offers useful options in this regard, so there is no excuse for not having a tidy dashboard. If you take care of the alignment and sizing of all the charts and filters, your interface will display a clean and professional look that every user will appreciate:

Animations

I have a rule of thumb regarding chart animation in QlikView: if you are Hans Rosling, go ahead. If not, better think it over twice. Even though they can be very illustrative, most of the time chart animations end up being a distraction rather than a tool that helps us visualize data, so be conservative about their use. For those of you who do not know him, Hans Rosling is a Swedish professor of international health who works in Stockholm. However, he is best known for his amazing way of presenting data with GapMinder, a simple piece of software that allows him to animate a scatter plot. If you are a data enthusiast, you ought to watch his appearances in TED Talks.

Avoiding scroll bars

Throughout his work, Stephen Few emphasizes that all the information in a dashboard must fit on a single screen. Whilst I believe that there is no harm in splitting the data across multiple sheets, it is undeniable that scroll bars reduce the overall usability of an application. If the user has to continuously scroll right and left to read all the figures in a table, or if she must go up and down to see the filter panel, she will end up getting tired and eventually discard your dashboard.

Consistency

If you want to create an easy way to navigate your dashboard, you cannot forget about consistency. Locating standard objects (such as Current Selections Box, Search Object, and Filter Panels) in the same area in every tab will help the users easily find all the items they need. In addition, applying the same style, fonts, and color palettes in all your charts will make your dashboard look more elegant and professional.

White space

The space between charts, tables, and filters is often referred to as white space, and even though you may not notice it, it is a vital part of any dashboard. Displaying dozens of objects without letting them breathe makes your interface look cluttered and, therefore, harder to understand.
Some of the benefits of using white space adequately are as follows:

It improves readability
It focuses attention and emphasizes the important objects
It guides the users' eyes, creating a sense of hierarchy in the dashboard
It fosters a balanced layout, making your interface look clear and sophisticated

Applying makeup

Every now and then, you stumble upon delicate situations where some business users try their best to hide certain parts of the data. Whether it is about low sales or the insane amount of defective products, they often ask you to remove a few charts or avoid visual cues so that those numbers go unnoticed. Needless to say, dashboards are tools intended to inform and guide the decisions of the viewers, so avoid presenting misleading visualizations.

Meaningless variety

As a designer, you will often hesitate to use the same chart type multiple times in your application, fearing that the users will get bored of it. Though this may be a nagging concern, if you present valuable data in an adequate format, there is no need to add new types of charts just for variety's sake. We want to keep the users engaged with great analyses, not just with pretty graphics.

Summary

In this article, you learned about the best practices to be followed when designing dashboards in QlikView.

Resources for Article: Further resources on this subject: Analyzing Financial Data in QlikView [article] Securing QlikView Documents [article] Common QlikView script errors [article]

Integration with Spark SQL

Packt
24 Sep 2015
11 min read
In this article by Sumit Gupta, the author of the book Learning Real-time Processing with Spark Streaming, we will discuss the integration of Spark Streaming with other advanced Spark libraries such as Spark SQL. (For more resources related to this topic, see here.) No single piece of software in today's world can fulfill the varied, versatile, and complex demands/needs of the enterprises, and to be honest, neither should it! Software is made to fulfill specific needs arising out of the enterprise at a particular point in time, and these needs may change in the future due to many other factors. These factors, such as government policies and business/market dynamics, may or may not be within the enterprise's control. Considering all these factors, the integration and interoperability of any software system with internal/external systems is pivotal in fulfilling enterprise needs. Integration and interoperability are categorized as nonfunctional requirements, which are always implicit and may or may not be explicitly stated by the end users. Over the period of time, architects have realized the importance of these implicit requirements in modern enterprises, and now all enterprise architectures provide support, due diligence, and provisions for fulfilling these requirements. Enterprise architecture frameworks such as The Open Group Architecture Framework (TOGAF) even define specific sets of procedures and guidelines for defining and establishing the interoperability and integration requirements of modern enterprises. The Spark community realized the importance of both these factors and provided a versatile and scalable framework with certain hooks for integration and interoperability with different systems/libraries; for example, data consumed and processed via Spark streams can also be loaded into a structured (table: rows/columns) format and can be further queried using SQL. The data can even be stored in the form of Hive tables in HDFS as persistent tables, which will exist even after our Spark program has restarted. In this article, we will discuss querying streaming data in real time using Spark SQL.

Querying streaming data in real time

Spark Streaming is developed on the principle of integration and interoperability, where it not only provides a framework for consuming data in near real time from varied data sources, but at the same time also provides integration with Spark SQL, where existing DStreams can be converted into a structured data format for querying using standard SQL constructs. There are many use cases where SQL on streaming data is a much-needed feature; for example, in our distributed log analysis use case, we may need to combine the precomputed datasets with the streaming data for performing exploratory analysis using interactive SQL queries, which is difficult to implement only with streaming operators, as they are not designed for introducing new datasets and performing ad hoc queries. Moreover, SQL's success at expressing complex data transformations derives from the fact that it is based on a set of very powerful data processing primitives that do filtering, merging, correlation, and aggregation, which are not available in lower-level programming languages such as Java/C++ and may result in long development cycles and high maintenance costs. Let's move forward and first understand a few things about Spark SQL, and then we will also see the process of converting existing DStreams into structured formats.
Understanding Spark SQL

Spark SQL is one of the modules developed over the Spark framework for processing structured data, which is stored in the form of rows and columns. At a very high level, it is similar to data residing in an RDBMS in the form of rows and columns, over which SQL queries are executed for performing analysis, but Spark SQL is much more versatile and flexible compared to an RDBMS. Spark SQL provides distributed processing of SQL queries and can be compared to frameworks such as Hive, Impala, or Drill. Here are a few notable features of Spark SQL:

Spark SQL is capable of loading data from a variety of data sources such as text files, JSON, Hive, HDFS, the Parquet format, and of course RDBMSes too, so that we can consume/join and process datasets from different and varied data sources.
It supports static and dynamic schema definition for the data loaded from various sources, which helps in defining schemas for known data structures/types, and also for those datasets where the columns and their types are not known until runtime.
It can work as a distributed query engine using the thrift JDBC/ODBC server or command-line interface, where end users or applications can interact with Spark SQL directly to run SQL queries.
Spark SQL provides integration with Spark Streaming, where DStreams can be transformed into the structured format and SQL queries can then be executed.
It is capable of caching tables using an in-memory columnar format for faster reads and in-memory data processing.
It supports schema evolution, so that new columns can be added/deleted to the existing schema, and Spark SQL still maintains compatibility between all versions of the schema.

Spark SQL defines a higher-level programming abstraction called DataFrames, which is also an extension to the existing RDD API. DataFrames are distributed collections of objects in the form of rows and named columns, similar to tables in an RDBMS, but with much richer functionality, encompassing all the previously defined features. The DataFrame API is inspired by the concepts of data frames in R (http://www.r-tutor.com/r-introduction/data-frame) and Python (http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe). Let's move ahead and understand how Spark SQL works with the help of an example: As a first step, let's create sample JSON data containing basic information about a company's departments, such as Name, Employees, and so on, and save this data into the file company.json. The JSON file would look like this: [ { "Name":"DEPT_A", "No_Of_Emp":10, "No_Of_Supervisors":2 }, { "Name":"DEPT_B", "No_Of_Emp":12, "No_Of_Supervisors":2 }, { "Name":"DEPT_C", "No_Of_Emp":14, "No_Of_Supervisors":3 }, { "Name":"DEPT_D", "No_Of_Emp":10, "No_Of_Supervisors":1 }, { "Name":"DEPT_E", "No_Of_Emp":20, "No_Of_Supervisors":5 } ] You can use any online JSON editor such as http://codebeautify.org/online-json-editor to see and edit the data defined in the preceding JSON code. Next, let's extend our Spark-Examples project and create a new package by the name chapter.six, and within this new package, create a new Scala object and name it ScalaFirstSparkSQL.scala.
Next, add the following import statements just below the package declaration: import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.sql._ import org.apache.spark.sql.functions._ Further, in your main method, add the following set of statements to create an SQLContext from the SparkContext: //Creating Spark Configuration val conf = new SparkConf() //Setting Application/ Job Name conf.setAppName("My First Spark SQL") // Define Spark Context which we will use to initialize our SQL Context val sparkCtx = new SparkContext(conf) //Creating SQL Context val sqlCtx = new SQLContext(sparkCtx) SQLContext, or any of its descendants such as HiveContext (for working with Hive tables) or CassandraSQLContext (for working with Cassandra tables), is the main entry point for accessing all functionalities of Spark SQL. It allows the creation of data frames, and also provides functionality to fire SQL queries over data frames. Next, we will define the following code to load the JSON file (company.json) using the SQLContext, and further, we will also create a data frame: //Define path of your JSON File (company.json) which needs to be processed val path = "/home/softwares/spark/data/company.json"; //Use SQLContext and Load the JSON file. //This will return the DataFrame which can be further Queried using SQL queries. val dataFrame = sqlCtx.jsonFile(path) In the preceding piece of code, we used the jsonFile(…) method for loading the JSON data. There are other utility methods defined by SQLContext for reading raw data from the filesystem, creating data frames from existing RDDs, and many more. Spark SQL supports two different methods for converting existing RDDs into data frames. The first method uses reflection to infer the schema of an RDD from the given data. This approach leads to more concise code and helps in instances where we already know the schema while writing the Spark application. We have used the same approach in our example. The second method is through a programmatic interface that allows us to construct a schema and then apply it to an existing RDD and finally generate a data frame. This method is more verbose, but provides flexibility and helps in those instances where columns and data types are not known until the data is received at runtime. Refer to https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.sql.SQLContext for a complete list of methods exposed by SQLContext. Once the DataFrame is created, we need to register the DataFrame as a temporary table within the SQL context so that we can execute SQL queries over the registered table. Let's add the following piece of code for registering our DataFrame with our SQL context and name it company: //Register the data as a temporary table within SQL Context //Temporary table is destroyed as soon as SQL Context is destroyed. dataFrame.registerTempTable("company"); And we are done… Our JSON data is automatically organized into the table (rows/columns) and is ready to accept SQL queries. Even the data types are inferred from the type of data entered within the JSON file itself. Now, we will start executing the SQL queries on our table, but before that, let's see the schema being created/defined by SQLContext: //Printing the Schema of the Data loaded in the Data Frame dataFrame.printSchema(); The execution of the preceding statement will provide results similar to the following illustration: The preceding illustration shows the schema of the JSON data loaded by Spark SQL. Pretty simple and straightforward, isn't it?
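Since the original screenshot is not reproduced here, the printed schema should look roughly like the following. This is an approximation; the exact nullability flags and the inferred numeric types may vary slightly across Spark versions:

root
 |-- Name: string (nullable = true)
 |-- No_Of_Emp: long (nullable = true)
 |-- No_Of_Supervisors: long (nullable = true)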
Spark SQL has automatically created our schema based on the data defined in our company.json file. It has also defined the data type of each of the columns. We can also define the schema using reflection (https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#inferring-the-schema-using-reflection) or programmatically define the schema (https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#inferring-the-schema-using-reflection). Next, let's execute some SQL queries to see the data stored in the DataFrame; the first SQL query will print all records: //Executing SQL Queries to Print all records in the DataFrame println("Printing All records") sqlCtx.sql("Select * from company").collect().foreach(print) The execution of the preceding statement will produce the following results on the console where the driver is executed: Next, let's also select only a few columns instead of all records and print them on the console: //Executing SQL Queries to Print Name and Employees //in each Department println("\n Printing Number of Employees in All Departments") sqlCtx.sql("Select Name, No_Of_Emp from company").collect().foreach(println) The execution of the preceding statement will produce the following results on the console where the driver is executed: Now, finally, let's do some aggregation and count the total number of employees across the departments: //Using the aggregate function (agg) to print the //total number of employees in the Company println("\n Printing Total Number of Employees in Company_X") val allRec = sqlCtx.sql("Select * from company").agg(Map("No_Of_Emp"->"sum")) allRec.collect.foreach ( println ) In the preceding piece of code, we used the agg(…) function and performed the sum of all employees across the departments, where sum can be replaced by avg, max, min, or count. The execution of the preceding statement will produce the following results on the console where the driver is executed: The preceding image shows the results of executing the aggregation on our company.json data. Refer to the DataFrame API at https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.sql.DataFrame for further information on the available functions for performing aggregation. As a last step, we will stop our Spark SQL context by invoking the stop() function on the SparkContext: sparkCtx.stop(). This is required so that your application can notify the master or resource manager to release all the resources allocated to the Spark job. It also ensures the graceful shutdown of the job and avoids any resource leakage, which may happen otherwise. Also, as of now, there can be only one Spark context active per JVM, and we need to stop the active SparkContext before creating a new one.

Summary

In this article, we have seen the step-by-step process of using Spark SQL as a standalone program. Though we have considered JSON files as an example, we can also leverage Spark SQL with Cassandra (https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md), MongoDB (https://github.com/Stratio/spark-mongodb), or Elasticsearch (http://chapeau.freevariable.com/2015/04/elasticsearch-and-spark-1-dot-3.html).

Resources for Article: Further resources on this subject: Getting Started with Apache Spark DataFrames [article] Sabermetrics with Apache Spark [article] Getting Started with Apache Spark [article]

Find Friends on Facebook

Packt
22 Sep 2015
13 min read
In this article by the authors, Vikram Garg and Sharan Kumar Ravindran, of the book, Mastering Social Media Mining with R, we learn about data mining using Facebook as our resource. (For more resources related to this topic, see here.) We will see how to use the R package Rfacebook, which provides access to the Facebook Graph API from R. It includes a series of functions that allow us to extract various data about our network, such as friends, likes, comments, followers, newsfeeds, and much more. We will discuss how to visualize our Facebook network, and we will see some methodologies to make use of the available data to implement business cases.

Rfacebook package installation and authentication

The Rfacebook package is authored and maintained by Pablo Barbera and Michael Piccirilli. It provides an interface to the Facebook API. It needs Version 2.12.0 or later of R and it is dependent on a few other packages, such as httr, rjson, and httpuv. Before starting, make sure those packages are installed. It is preferred to have Version 0.6 of the httr package installed.

Installation

We will now install the Rfacebook package. We can download and install the latest package from GitHub using the following code and load the package using the library function. Alternatively, the Rfacebook package can also be installed from the CRAN network. One prerequisite for installing the package using the function install_github is to have the package devtools loaded into the R environment. The code is as follows: library(devtools) install_github("Rfacebook", "pablobarbera", subdir="Rfacebook") library(Rfacebook) After installing the Rfacebook package for connecting to the API, make an authentication request. This can be done via two different methods. The first method is by using the access token generated for the app, which is short-lived (valid for two hours); on the other hand, we can create a long-lasting token using the OAuth function. Let's first create a temporary token. Go to https://developers.facebook.com/tools/explorer, click on Get Token, and select the required user data permissions. The Facebook Graph API explorer will open with an access token. This access token will be valid for two hours. The status of the access token as well as the scope can be checked by clicking on the Debug button. Once the token expires, we can generate a new one. Now, we can access the data from R using the following code. The access token generated using the link should be copied and passed to the token variable. The use of username in the function getUsers is deprecated in the latest Graph API; hence, we are passing the ID of a user. You can get your ID from the same link that was used for token generation. This function can be used to pull the details of any user, provided the generated token has the access. Usually, access is limited to a few users with a public setting or those who use your app. It is also based on the items selected in the user data permission check page during token generation. In the following code, paste your token inside the double quotes, so that it can be reused across the functions without explicitly mentioning the actual token. token<- "XXXXXXXXX"

A closer look at how the package works

The getUsers function using the token will hit the Facebook Graph API. Facebook will be able to uniquely identify the users as well as the permissions to access information. If all the check conditions are satisfied, we will be able to get the required data.
Copy the token from the mentioned URL and paste it within the double quotes. Remember that the token generated will be active only for two hours. Use the getUsers function to get the details of the user. Earlier, the getUsers function used to work based on the Facebook friend's name as well as ID; in API Version 2.0, we cannot access the data using the name. Consider the following code for example: token<- "XXXXXXXXX" me<- getUsers("778278022196130", token, private_info = TRUE) Then, the details of the user, such as name and hometown, can be retrieved using the following code: me$name The output is also mentioned for your reference: [1] "Sharan Kumar R" For the following code: me$hometown The output is as follows: [1] "Chennai, Tamil Nadu" Now, let's see how to create a long-lasting token. Open your Facebook app page by going to https://developers.facebook.com/apps/ and choosing your app. On the Dashboard tab, you will be able to see the App ID and Secret Code values. Use those in the following code. require("Rfacebook") fb_oauth<- fbOAuth(app_id="11",app_secret="XX",extended_permissions = TRUE) On executing the preceding statements, you will find the following message in your console: Copy and paste into Site URL on Facebook App Settings: http://localhost:1410/ When done, press any key to continue... Copy the URL displayed and open your Facebook app; on the Settings tab, click on the Add Platform button and paste the copied URL in the Site URL text box. Make sure to save the changes. Then, return to the R console and press any key to continue; you will be prompted to enter your Facebook username and password. On completing that, you will return to the R console. If you find the following message, it means your long-lived token is ready to use. When you get the completion status, you might not be able to access any of the information immediately; it is advisable to use the OAuth function a few minutes after the creation of the Facebook application. Authentication complete. Authentication successful. After successfully authenticating, we can save the token and load it on demand using the following code: save(fb_oauth, file="fb_oauth") load("fb_oauth") When we need to automate tasks or use Rfacebook extensively, regenerating short-lived tokens every two hours becomes cumbersome. Hence, it is advisable to create a long-lasting token to authenticate the user and then save it. Whenever required, we can just load it from a local file. Note that Facebook authentication might take several minutes. Hence, if your authentication fails, please wait for some time before pressing any key on the retry, and check whether you have installed Version 0.6 of the httr package. If you continue to experience issues in generating the long-lasting token, it's not a problem; we are good to go with the temporary token.

Exercise

Create an app in Facebook and authenticate by any one of the methods discussed.

A basic analysis of your network

In this section, we will discuss how to extract the Facebook network of friends and some more information about the people in our network. After completing the app creation and authentication steps, let's move forward and learn to pull some basic network data from Facebook. First, let's find out which friends we have access to, using the following command in R.
Let's use the temporary token for accessing the data: token<- "XXXXXXXXX" friends<- getFriends(token, simplify = TRUE) head(friends) # To see a few of your friends The preceding function will return all our Facebook friends whose data is accessible. Version 1 of the API would allow us to download all the friends' data by default, but in the new version, we have limited access. Since we have set simplify to TRUE, we will pull only the usernames and their Facebook IDs. By setting the same parameter to FALSE, we will be able to access additional data such as gender, location, hometown, profile picture, relationship status, and full name. We can use the function getUsers to get additional information about a particular user. The following information is available by default: gender, location, and language. We can, however, get some additional information such as relationship status, birthday, and the current location by setting the parameter private_info to TRUE: friends_data<- getUsers(friends$id, token, private_info = TRUE) table(friends_data$gender) The output is as follows: female male 5 21 We can also find out the language, location, and relationship status. The commands to generate the details as well as the respective outputs are given here for your reference: #Language table(substr(friends_data$locale, 1, 2)) The output is as follows: en 26 The code to find the location is as follows: # Location (Country) table(substr(friends_data$locale, 4, 5)) The output is as follows: GB US 1 25 Here's the code to find the relationship status: # Relationship Status table(friends_data$relationship_status) Here's the output: Engaged Married Single 1 1 3 Now, let's see what we have liked on Facebook. We can use the function getLikes to get the likes data. In order to know about your own likes data, specify user as me. The same function can be used to extract information about our friends, in which case we should pass the user's Facebook ID. This function will provide us with a list of Facebook pages liked by the user, their ID, name, and the website associated with the page. We can even restrict the number of results retrieved by setting a value for the parameter n. The same function will be used to get the likes of people in our network; instead of the keyword me, we should give the Facebook ID of those users. Remember that we can only access the data of users who are accessible to our app. The code is as follows: likes<- getLikes(user="me", token=token) head(likes) After exploring the use of functions to pull data, let's see how to use the Facebook Query Language via the function getFQL, which can be used to pass queries. The following query will get you the list of friends in your network: friends<- getFQL("SELECT uid2 FROM friend WHERE uid1=me()", token=token) In order to get the complete details of your friends, the following query can be used. The query will return the username, Facebook ID, and the link to their profile picture. Note that we might not be able to access the complete network of friends' data, since access to the data of all your friends is deprecated with Version 2.0. The code is as follows: # Details about friends Friends_details<- getFQL("SELECT uid, name, pic_square FROM user WHERE uid = me() OR uid IN (SELECT uid2 FROM friend WHERE uid1 = me())", token=token) In order to know more about the Facebook Query Language, check out the following link. This method of extracting the information might be preferred by people familiar with query languages.
It can also help extract data satisfying only specific conditions (https://developers.facebook.com/docs/technical-guides/fql).

Exercise

Download your Facebook network and do an exploratory analysis of the languages your friends speak, the places where they live, the total number of pages they have liked, and their marital status. Try all these with the Facebook Query Language as well.

Network analysis and visualization

So far, we used a few functions to get the details about our Facebook profile as well as friends' data. Let's see how to get to know more about our network. Before learning to get the network data, let's understand what a network is as well as a few important concepts about networks. Anything connected to a few other things could be a network. Everything in real life is connected to everything else, for example, people, machines, events, and so on. It would make a lot of sense if we analyzed them as a network. Let's consider a network of people; here, people will be the nodes in the network and the relationships between them would be the edges (lines connecting them).

Social network analysis

The technique to study/analyze a network is called social network analysis. We will see how to create a simple plot of friends in our network in this section. To understand the nodes (people, places, and so on) in a network in social network analysis, we need to evaluate the position of the nodes. We can evaluate the nodes using centrality. Centrality can be measured using different methods such as degree, betweenness, and closeness (a short sketch of computing these measures with the igraph package appears at the end of this article). Let's first get our Facebook network and then get to know the centrality measures in detail. We use the function getNetwork to download our Facebook network. We need to mention how we would like to format the data. When the parameter format is set to adj.matrix, it will produce the data in matrix format, where the people in the network become the row names and column names of the matrix, and if they are connected to each other, then the corresponding cell in the matrix will hold a value. The command is as follows: network<- getNetwork(token, format="adj.matrix") We now have our Facebook network downloaded. Let's visualize our network before getting to understand the centrality concepts one by one with our own network. To visualize the network, we need to use the package called igraph in R. Since we downloaded our network in the adjacency matrix format, we will use the corresponding function in igraph. We use the layout function to determine the placement of vertices in the network for drawing the graph, and then we use the plot function to draw the network. In order to explore the various other functionalities of these functions, you can execute ?<function_name> in RStudio and the help window will show the description of the function. Let's use the following code to load the package igraph into R. require(igraph) We will now build the graph using the function graph.adjacency; this function helps in creating a network graph from an adjacency matrix. In order to build a force-directed graph, we will use the function layout.drl. The force-directed layout will help in making the graph more readable. The commands are as follows: social_graph<- graph.adjacency(network) layout<- layout.drl(social_graph, options=list(simmer.attraction=0)) Finally, we will use the plot function with various built-in parameters to make the graph more readable.
For example, we can name the nodes in our network, we can set the size of the nodes as well as the edges in the network, and we can color the graph and the components of the graph. Use the following code to see what the network looks like. The output that was plotted can be saved locally using the function dev.copy, and the size of the image as well as the type can be passed as parameters to the function: plot(social_graph, vertex.size=10, vertex.color="green", vertex.label=NA, vertex.label.cex=0.5, edge.arrow.size=0, edge.curved=TRUE, layout=layout.fruchterman.reingold) dev.copy(png,filename= "C:/Users/Sharan/Desktop/3973-03-community.png", width=600, height=600); dev.off (); With the preceding plot function, my network will look like the following one. In the following network, the node labels (the names of the people) have been disabled. They can be enabled by removing the vertex.label=NA parameter.

Summary

In this article, we discussed how to use the various functions implemented in the Rfacebook package and how to analyze the network. This article covers the important techniques that help in performing vital network analysis and also highlights the wide range of business problems that could be addressed with Facebook data. It gives us a glimpse of the great potential for implementing various analyses.

Resources for Article: Further resources on this subject: Using App Directory in HootSuite [article] Supervised learning [article] Warming Up [article]
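To complement the centrality discussion in the Social network analysis section above, the following is a minimal sketch of computing degree, betweenness, and closeness with igraph on the social_graph object built earlier. These are standard igraph functions; the variable names follow this article, and the actual values will of course depend on your own network:

library(igraph)

# Centrality measures on the Facebook network built with graph.adjacency()
deg <- degree(social_graph)       # number of connections per person
btw <- betweenness(social_graph)  # how often a person lies on shortest paths between others
cls <- closeness(social_graph)    # how close a person is to everyone else in the network

# The five most central friends by degree
head(sort(deg, decreasing = TRUE), 5)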

Getting Started with Apache Spark DataFrames

Packt
22 Sep 2015
5 min read
In this article about Arun Manivannan's book Scala Data Analysis Cookbook, we will cover the following recipes: Getting Apache Spark ML – a framework for large-scale machine learning, and Creating a data frame from CSV. (For more resources related to this topic, see here.)

Getting started with Apache Spark

Breeze is the building block of Spark MLlib, the machine learning library for Apache Spark. In this recipe, we'll see how to bring Spark into our project (using SBT) and look at how it works internally. The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter1-spark-csv/build.sbt.

How to do it...

Pulling Spark ML into our project is just a matter of adding a few dependencies in our build.sbt file: spark-core, spark-sql, and spark-mllib: Under a brand new folder (which will be our project root), we create a new file called build.sbt. Next, let's add the Spark libraries to the project dependencies: organization := "com.packt" name := "chapter1-spark-csv" scalaVersion := "2.10.4" val sparkVersion="1.3.0" libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % sparkVersion, "org.apache.spark" %% "spark-sql" % sparkVersion, "org.apache.spark" %% "spark-mllib" % sparkVersion ) resolvers ++= Seq( "Apache HBase" at "https://repository.apache.org/content/repositories/releases", "Typesafe repository" at "http://repo.typesafe.com/typesafe/releases/" )

How it works...

Spark has four major higher-level tools built on top of the Spark Core: Spark Streaming, Spark MLlib (machine learning), Spark SQL (an SQL interface for accessing data), and GraphX (for graph processing). The Spark Core is the heart of Spark, providing higher-level abstractions in various languages for data representation, serialization, scheduling, metrics, and so on. For this recipe, we skipped streaming and GraphX and added the remaining three libraries.

There's more…

Apache Spark is a cluster computing platform that claims to run about 100 times faster than Hadoop (that's a mouthful). In our terms, we could consider that as a means to run our complex logic over a massive amount of data at a blazingly high speed. The other good thing about Spark is that the programs we write are much smaller than the typical MapReduce classes that we write for Hadoop. So, not only do our programs run faster, but it also takes less time to write them in the first place.

Creating a data frame from CSV

In this recipe, we'll look at how to create a new data frame from a Delimiter Separated Values (DSV) file. The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/tree/master/chapter1-spark-csv in the DataFrameCSV class.

How to do it...

CSV support isn't first-class in Spark but is available through an external library from Databricks.
So, let's go ahead and add that up in build.sbt: After adding the spark-csv dependency, our complete build.sbt looks as follows: organization := "com.packt" name := "chapter1-spark-csv" scalaVersion := "2.10.4" val sparkVersion="1.3.0" libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % sparkVersion, "org.apache.spark" %% "spark-sql" % sparkVersion, "org.apache.spark" %% "spark-mllib" % sparkVersion, "com.databricks" %% "spark-csv" % "1.0.3" ) resolvers ++= Seq( "Apache HBase" at "https://repository.apache.org/content/repositories/releases", "Typesafe repository" at "http://repo.typesafe.com/typesafe/releases/" ) fork := true Before we create the actual data frame, there are three steps that we ought to do: create the Spark configuration, create the Spark context, and create the SQL context. SparkConf holds all of the information for running this Spark cluster. For this recipe, we are running locally, and we intend to use only two cores in the machine (local[2]): val conf = new SparkConf().setAppName("csvDataFrame").setMaster("local[2]") For this recipe, we'll be running Spark in standalone mode. The Spark context and the SQL context are then created from this configuration: val sc = new SparkContext(conf) val sqlContext = new SQLContext(sc) Now let's load our pipe-separated file: val students: org.apache.spark.sql.DataFrame = sqlContext.csvFile(filePath="StudentData.csv", useHeader=true, delimiter='|')

How it works...

The csvFile function of sqlContext accepts the full filePath of the file to be loaded. If the CSV has a header, then the useHeader flag will read the first row as column names. The delimiter flag, as expected, defaults to a comma, but you can override the character as needed. Instead of using the csvFile function, you can also use the load function available in the SQL context. The load function accepts the format of the file (in our case, it is CSV) and options as a map. We can specify the same parameters that we specified earlier using a Map, like this: val options=Map("header"->"true", "path"->"ModifiedStudent.csv") val newStudents=sqlContext.load("com.databricks.spark.csv",options)

Summary

In this article, you learned about Apache Spark ML, a framework for large-scale machine learning. Then, we saw how to create a data frame from a CSV file with the help of example code.

Resources for Article: Further resources on this subject: Integrating Scala, Groovy, and Flex Development with Apache Maven [article] Ridge Regression [article] Reactive Data Streams [article]

Implementing Decision Trees

Packt
22 Sep 2015
4 min read
In this article by the author, Sunila Gollapudi, of the book, Practical Machine Learning, we will outline a business problem that can be addressed by building a decision tree-based model, and see how it can be implemented in Apache Mahout, R, Julia, Apache Spark, and Python. (For more resources related to this topic, see here.)

Implementing decision trees

Here, we will explore implementing decision trees using various frameworks and tools.

The R example

We will use the rpart and ctree packages in R to build decision tree-based models. The steps are as follows (the code for these steps appeared as screenshots in the original article; a reconstructed sketch is given at the end of this article, after the Julia example):

Import the packages for data import and the decision tree libraries.
Start data manipulation: create a categorical variable on Sales and append it to the existing dataset.
Using random functions, split the data into training and testing datasets.
Fit the tree model with the training data, check how the model works with the testing data, and measure the error.
Prune the tree.
Plot the pruned tree.

The Spark example

A Java-based example using MLlib is shown here: import java.util.HashMap; import java.util.Map; import scala.Tuple2; import org.apache.spark.api.java.JavaPairRDD; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.api.java.function.Function; import org.apache.spark.api.java.function.PairFunction; import org.apache.spark.mllib.regression.LabeledPoint; import org.apache.spark.mllib.tree.DecisionTree; import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.util.MLUtils; import org.apache.spark.SparkConf; SparkConf sparkConf = new SparkConf().setAppName("JavaDecisionTree"); JavaSparkContext sc = new JavaSparkContext(sparkConf); // Load and parse the data file. String datapath = "data/mllib/sales.txt"; JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), datapath).toJavaRDD(); // Split the data into training and test sets (30% held out for testing) JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3}); JavaRDD<LabeledPoint> trainingData = splits[0]; JavaRDD<LabeledPoint> testData = splits[1]; // Set parameters. // Empty categoricalFeaturesInfo indicates all features are continuous. Integer numClasses = 2; Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>(); String impurity = "gini"; Integer maxDepth = 5; Integer maxBins = 32; // Train a DecisionTree model for classification.
final DecisionTreeModel model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins); // Evaluate model on test instances and compute test error JavaPairRDD<Double, Double> predictionAndLabel = testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() { @Override public Tuple2<Double, Double> call(LabeledPoint p) { return new Tuple2<Double, Double>(model.predict(p.features()), p.label()); } }); Double testErr = 1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() { @Override public Boolean call(Tuple2<Double, Double> pl) { return !pl._1().equals(pl._2()); } }).count() / testData.count(); System.out.println("Test Error: " + testErr); System.out.println("Learned classification tree model:\n" + model.toDebugString());

The Julia example

We will use the DecisionTree package in Julia as shown here:
julia> Pkg.add("DecisionTree")
julia> using DecisionTree
We will use the RDatasets package to load the dataset for the example in context:
julia> Pkg.add("RDatasets"); using RDatasets
julia> sales = data("datasets", "sales");
julia> features = array(sales[:, 1:4]); # use matrix() for Julia v0.2
julia> labels = array(sales[:, 5]); # use vector() for Julia v0.2
julia> stump = build_stump(labels, features);
julia> print_tree(stump) Feature 3, Threshold 3.0 L-> price : 50/50 R-> shelvelock : 50/100
Pruning the tree:
julia> length(tree) 11
julia> pruned = prune_tree(tree, 0.9);
julia> length(pruned) 9

Summary

In this article, we implemented decision trees using R, Spark, and Julia.

Resources for Article: Further resources on this subject: An overview of common machine learning tasks [article] How to do Machine Learning with Python [article] Modeling complex functions with artificial neural networks [article]
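As noted in the R example above, the original article showed the R code for those steps only as screenshots. The following is a minimal reconstruction of the steps, assuming a data frame named sales.df with a numeric Sales column; the dataset name, the threshold of 8, and the use of the rpart.plot helper for plotting are illustrative assumptions rather than the author's exact code:

library(rpart)       # recursive partitioning trees
library(rpart.plot)  # tree plotting (assumed helper; the article itself mentions rpart and ctree)

# Assumed data frame 'sales.df' with a numeric Sales column
# 1. Create a categorical variable on Sales and append it to the dataset
sales.df$High <- as.factor(ifelse(sales.df$Sales > 8, "Yes", "No"))

# 2. Split the data into training and testing datasets using a random sample
set.seed(123)
idx   <- sample(1:nrow(sales.df), size = round(0.7 * nrow(sales.df)))
train <- sales.df[idx, ]
test  <- sales.df[-idx, ]

# 3. Fit the tree on the training data and measure the error on the test data
fit  <- rpart(High ~ . - Sales, data = train, method = "class")
pred <- predict(fit, newdata = test, type = "class")
mean(pred != test$High)   # misclassification error on the test set

# 4. Prune the tree at the cp value with the lowest cross-validated error and plot it
best.cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best.cp)
rpart.plot(pruned)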

R ─ Classification and Regression Trees

Packt
22 Sep 2015
16 min read
"The classifiers most likely to be the best are the random forest (RF) versions, the best of which (implemented in R and accessed via caret), achieves 94.1 percent of the maximum accuracy overcoming 90 percent in the 84.3 percent of the data sets."                                                                          – Fernández-Delgado et al (2014) "You can't see the forest for the trees!"                                                                                                     – An old saying (For more resources related to this topic, see here.) In this article by Cory Lesmeister, the author of Mastering Machine Learning with R, the first item of discussion is the basic decision tree, which is both simple to build and understand. However, the single decision tree method does not perform as well as the other methods such as support vector machines or neural networks. Therefore, we will discuss the creation of multiple, sometimes hundreds of, different trees with their individual results combined, leading to a single overall prediction. The first quote written above is from Fernández-Delgado et al in the Journal of Machine Learning Research and is meant to set the stage that the techniques in this article are quite powerful, particularly when used for the classification problems. Certainly, they are not always the best solution, but they do provide a good starting point. Regression trees For an understanding of the tree-based methods, it is probably easier to start with a quantitative outcome and then move on to how it works on a classification problem. The essence of a tree is that the features are partitioned, starting with the first split that improves the residual sum of squares the most. These binary splits continue until the termination of the tree. Each subsequent split/partition is not done on the entire dataset but only on the portion of the prior split that it falls under. This top-down process is referred as recursive partitioning. It is also a process that is greedy, a term you may stumble on in reading about the machine learning methods. Greedy means that in each split in the process, the algorithm looks for the greatest reduction in the residual sum of squares without a regard to how well it will perform in the later partitions. The result is that you may end up with a full tree of unnecessary branches, leading to a low bias but high variance. To control this effect, you need to appropriately prune the tree to an optimal size after building a full tree. The following figure provides a visual of the technique in action. The data is hypothetical with 30 observations, a response ranging from 1 to 10, and two predictor features, both ranging in value from 0 to 10 named X1 and X2. The tree has three splits that lead to four terminal nodes. Each split is basically an if or then statement or uses an R syntax, ifelse(). In the first split, if X1 < 3.5, then the response is split into 4 observations with an average value of 2.4 and the remaining 26 observations. This left branch of 4 observations is a terminal node as any further splits would not substantially improve the residual sum of squares. The predicted value for the 4 observations in that partition of the tree becomes the average. The next split is at X2 < 4 and finally X1 < 7.5. An advantage of this method is that it can handle the highly nonlinear relationships; but can you see a couple of potential problems? The first issue is that an observation is given the average of the terminal node that it falls under. 
This can hurt the overall predictive performance (high bias). Conversely, if you keep partitioning the data further and further to achieve a low bias, high variance can become an issue. As with the other methods, you can use cross-validation to select the appropriate tree size.

Regression tree with 3 splits and 4 terminal nodes and the corresponding node average and number of observations.

Classification trees

Classification trees operate under the same principle as regression trees, except that the splits are not determined by the residual sum of squares but by an error rate. The error rate used is not what you would expect, where the calculation is simply the misclassified observations divided by the total observations. As it turns out, when it comes to tree splitting, a misclassification rate by itself may lead to a situation where you can gain information with a further split but not improve the misclassification rate. Let's look at an example. Suppose we have a node, let's call it N0, where you have 7 observations labeled No and 3 observations labeled Yes, and we say that the misclassification rate is 30 percent. With this in mind, let's calculate a common alternative error measure called the Gini index. The formula for a single node Gini index is as follows: Gini = 1 – (probability of Class 1)^2 – (probability of Class 2)^2. For N0, the Gini is 1 - (.7)^2 - (.3)^2, which is equal to 0.42, versus the misclassification rate of 30 percent. Taking this example further, we will now create an N1 node with 3 observations of Class 1 and none of Class 2, along with N2, which has 4 observations from Class 1 and 3 from Class 2. Now, the overall misclassification rate for this branch of the tree is still 30 percent, but look at the following to see how the overall Gini index has improved: Gini(N1) = 1 – (3/3)^2 – (0/3)^2 = 0. Gini(N2) = 1 – (4/7)^2 – (3/7)^2 = 0.49. The new Gini index = (proportion of N1 x Gini(N1)) + (proportion of N2 x Gini(N2)), which is equal to (.3 x 0) + (.7 x 0.49), or 0.343. By doing a split on a surrogate error rate, we actually improved our model impurity by reducing it from 0.42 to 0.343, whereas the misclassification rate did not change. This is the methodology used by the rpart package.

Random forest

To greatly improve our model's predictive ability, we can produce numerous trees and combine the results. The random forest technique does this by applying two different tricks in the model development. The first is the use of bootstrap aggregation, or bagging, as it is called. In bagging, an individual tree is built on a sample of the dataset, roughly two-thirds of the total observations. It is important to note that the remaining one-third is referred to as Out of Bag (OOB). This is repeated dozens or hundreds of times, and the results are averaged. Each of these trees is grown and not pruned based on any error measure, which means that the variance of each of these individual trees is high. However, by averaging the results, you can reduce the variance without increasing the bias. The next thing that the random forest brings to the table is that, concurrently with the random sample of the data, it also takes a random sample of the input features at each split. In the randomForest package, we will use the default number of randomly sampled predictors, which is the square root of the total number of predictors for classification problems and the total number of predictors divided by 3 for regression. The number of predictors that the algorithm randomly chooses at each split can be changed via the model tuning process.
By doing this random sampling of the features at each split and incorporating it into the methodology, you mitigate the effect of a highly correlated predictor from becoming the main driver in all of your bootstrapped trees, which would prevent you from reducing the variance that you hoped to achieve with bagging. Averaging these trees, which are less correlated with each other than they would be if you only performed bagging, makes the result more generalizable and more robust to outliers.

Gradient boosting

The boosting methods can become extremely complicated to learn and understand, but you should keep in mind what is fundamentally happening behind the curtain. The main idea is to build an initial model of some kind (linear, spline, tree, and so on), called the base learner, examine the residuals, and fit a model based on these residuals around the so-called loss function. A loss function is merely the function that measures the discrepancy between the model and the desired prediction, for example, squared error for regression or the logistic function for classification. The process continues until it reaches some specified stopping criterion. This is like the student who takes a practice exam and gets 30 out of 100 questions wrong and, as a result, studies only those 30 questions that were missed. On the next practice exam, they get 10 of these 30 wrong, and so they focus only on those 10 questions, and so on. If you would like to explore the theory behind this further, a great resource for you is available in Frontiers in Neurorobotics, Gradient boosting machines, a tutorial, Natekin A., Knoll A., (2013), at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3885826/. As previously mentioned, boosting can be applied to many different base learners, but here we will only focus on the specifics of tree-based learning. Each tree iteration is small, and we will determine how small it is with one of the tuning parameters, referred to as interaction depth. In fact, it may be as small as one split, which is referred to as a stump. Trees are sequentially fit to the residuals according to the loss function up to the number of trees that we specified (our stopping criterion). There is another tuning parameter that we will need to identify, and that is shrinkage. You can think of shrinkage as the rate at which your model is learning generally and, specifically, as the contribution of each tree or stump to the model. This learning rate acts as a regularization parameter. The other thing about our boosting algorithm is that it is stochastic, meaning that it adds randomness by taking a random sample of our data at each tree. Introducing some randomness to a boosted model usually improves the accuracy and speed and reduces overfitting (Friedman 2002). As you may have guessed, tuning these parameters can be quite a challenge. These parameters can interact with each other, and if you just tinker with one without considering the others, your model may actually perform worse. The caret package will help us in this endeavor (an illustrative tuning grid is sketched below, just before we turn to the regression tree).

Business case

The overall business objective in this situation is to see if we can improve the predictive ability for some of the cases. For regression, we will visit the prostate cancer data. For classification purposes, we will utilize both the breast cancer biopsy data and the Pima Indian Diabetes data. Both random forests and boosting will be applied to all three datasets. The simple tree method will be used only on the breast and prostate cancer sets.
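To make the boosting tuning discussion above concrete, here is a minimal sketch of how a grid of the three parameters could be tuned with caret's train function and the gbm method. The grid values, the n.minobsinnode setting, and the placeholder data frame train.df with its class outcome are illustrative assumptions, not the exact settings used later in the chapter:

library(caret)
library(gbm)

# Candidate values for the tuning parameters discussed above: the number of trees,
# the interaction depth, and the shrinkage (learning rate)
grid <- expand.grid(n.trees = c(100, 300, 500),
                    interaction.depth = c(1, 2, 3),
                    shrinkage = c(0.01, 0.1),
                    n.minobsinnode = 10)

ctrl <- trainControl(method = "cv", number = 5)

# 'train.df' and its 'class' outcome are placeholders for one of the datasets
set.seed(123)
gbm.tune <- train(class ~ ., data = train.df,
                  method = "gbm",
                  trControl = ctrl,
                  tuneGrid = grid,
                  verbose = FALSE)
gbm.tune$bestTune   # the best combination found by cross-validation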
Business case

The overall business objective in this situation is to see if we can improve the predictive ability for some of the cases. For regression, we will revisit the prostate cancer data. For classification purposes, we will utilize both the breast cancer biopsy data and the Pima Indian diabetes data. Both random forests and boosting will be applied to all three datasets. The simple tree method will be used only on the breast and prostate cancer datasets.

Regression tree

We will jump right into the prostate dataset, but first let's load the necessary R packages, as follows:

> library(rpart) #classification and regression trees
> library(partykit) #tree plots
> library(MASS) #breast cancer and Pima Indian diabetes data
> library(ElemStatLearn) #prostate data
> library(randomForest) #random forests
> library(gbm) #gradient boosting
> library(caret) #tuning hyper-parameters

First, we will do regression on the prostate data. This involves calling the dataset, coding the Gleason score as an indicator variable using the ifelse() function, and creating the training and test sets. The training set will be pros.train and the test set will be pros.test, as follows:

> data(prostate)
> prostate$gleason = ifelse(prostate$gleason == 6, 0, 1)
> pros.train = subset(prostate, train==TRUE)[,1:9]
> pros.test = subset(prostate, train==FALSE)[,1:9]

To build a regression tree on the training data, we will use the rpart() function from the rpart package. The syntax is quite similar to what we used in the other modeling techniques:

> tree.pros <- rpart(lpsa~., data=pros.train)

We can print this object's cptable and examine the error per split in order to determine the optimal number of splits in the tree, as follows:

> print(tree.pros$cptable)
          CP nsplit rel error    xerror      xstd
1 0.35852251      0 1.0000000 1.0364016 0.1822698
2 0.12295687      1 0.6414775 0.8395071 0.1214181
3 0.11639953      2 0.5185206 0.7255295 0.1015424
4 0.05350873      3 0.4021211 0.7608289 0.1109777
5 0.01032838      4 0.3486124 0.6911426 0.1061507
6 0.01000000      5 0.3382840 0.7102030 0.1093327

This is a very important table to analyze. The first column, labeled CP, is the cost complexity parameter. The second column, nsplit, is the number of splits in the tree. The rel error column stands for relative error and is the residual sum of squares for that number of splits divided by the residual sum of squares for no splits (RSS(k)/RSS(0)). Both xerror and xstd are based on a ten-fold cross-validation, with xerror being the average error and xstd the standard deviation of the cross-validation process. We can see that four splits produced the lowest error under cross-validation, while five splits produced the lowest error on the full dataset. You can examine this using the plotcp() function, as follows:

> plotcp(tree.pros)

The resulting plot shows the relative error by tree size, with the corresponding error bars. The horizontal line on the plot is the upper limit of the lowest standard error. Selecting tree size 5, which is four splits, we can build a new tree object in which xerror is minimized by pruning our tree accordingly. To do this, we first create an object for the cp associated with the pruned tree from the table, and then the prune() function handles the rest, as follows:

> cp = min(tree.pros$cptable[5,]) #the CP value is the smallest entry in row 5
> prune.tree.pros <- prune(tree.pros, cp = cp)

With this done, you can plot and compare the full and pruned trees. The tree plots produced by the partykit package are much better than those produced by the party package. You can simply use the as.party() function as a wrapper in the plot() function:

> plot(as.party(tree.pros))
> plot(as.party(prune.tree.pros))

Note that the splits are exactly the same in the two trees, with the exception of the last split, which includes the age variable in the full tree. Interestingly, both the first and second splits in the tree are related to the log of cancer volume (lcavol). These plots are quite informative, as they show the splits, nodes, observations per node, and box plots of the outcome that we are trying to predict.
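One brief aside before moving on: instead of picking row 5 of the cptable by eye, you can select the cp associated with the lowest cross-validated error programmatically. This is just an alternative sketch of the selection step we performed above, using the column names from the printed cptable:

> opt <- which.min(tree.pros$cptable[, "xerror"]) #row with the lowest cross-validated error
> cp.opt <- tree.pros$cptable[opt, "CP"]          #the same cp value we used above, 0.01032838
> prune(tree.pros, cp = cp.opt)                   #reproduces the pruned tree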
Let's see how well the pruned tree performs on the test data. We will create an object of predicted values using the predict() function on the test data, calculate the residuals as the predicted values minus the actual values, and finally compute the mean of the squared errors, as follows:

> party.pros.test <- predict(prune.tree.pros, newdata=pros.test)
> rpart.resid = party.pros.test - pros.test$lpsa #calculate residuals
> mean(rpart.resid^2) #calculate MSE
[1] 0.5267748

One can look at the tree plots that we produced and easily explain the primary drivers behind the response. As mentioned in the introduction, trees are easy to interpret and explain, which, in many cases, may be more important than accuracy.

Classification tree

For the classification problem, we will prepare the breast cancer data. After loading the data, you will delete the patient ID, rename the features, eliminate the few missing values, and then create the training and test datasets, as follows:

> data(biopsy)
> biopsy <- biopsy[,-1] #delete ID
> names(biopsy) = c("thick", "u.size", "u.shape", "adhsn", "s.size", "nucl", "chrom", "n.nuc", "mit", "class") #change the feature names
> biopsy.v2 = na.omit(biopsy) #delete the observations with missing values
> set.seed(123) #random number generator
> ind = sample(2, nrow(biopsy.v2), replace=TRUE, prob=c(0.7, 0.3))
> biop.train = biopsy.v2[ind==1,] #the training data set
> biop.test = biopsy.v2[ind==2,] #the test data set

With the data set up appropriately, we will use the same syntax style for the classification problem as we did previously for the regression problem, but before creating a classification tree, we need to ensure that the outcome is a factor, which can be verified using the str() function, as follows:

> str(biop.test[,10])
 Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 2 1 1 ...

First, we will create the tree:

> set.seed(123)
> tree.biop <- rpart(class~., data=biop.train)

Then, examine the cptable for the optimal number of splits:

> print(tree.biop$cptable)
          CP nsplit rel error    xerror       xstd
1 0.79651163      0 1.0000000 1.0000000 0.06086254
2 0.07558140      1 0.2034884 0.2674419 0.03746996
3 0.01162791      2 0.1279070 0.1453488 0.02829278
4 0.01000000      3 0.1162791 0.1744186 0.03082013

The cross-validation error is at a minimum with only two splits (row 3). We can now prune the tree, plot the full and pruned trees, and see how the pruned tree performs on the test set, as follows:

> cp = min(tree.biop$cptable[3,]) #again, the CP value is the smallest entry in the row
> prune.tree.biop <- prune(tree.biop, cp = cp)
> plot(as.party(tree.biop))
> plot(as.party(prune.tree.biop))

An examination of the tree plots shows that the uniformity of the cell size is the first split, followed by bare nuclei. The full tree had an additional split on cell thickness. We can predict the test observations by using type="class" in the predict() function:

> rparty.test <- predict(prune.tree.biop, newdata=biop.test, type="class")
> table(rparty.test, biop.test$class)
rparty.test benign malignant
  benign       136         3
  malignant      6        64
> (136+64)/209
[1] 0.9569378

The basic tree with just two splits gets us almost 96 percent accuracy. This still falls a bit short, but it should encourage us to believe that we can improve on it with the upcoming methods, starting with random forests.
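Since the caret package is already loaded, you can also get the accuracy along with sensitivity, specificity, and other statistics in a single call, which gives us a convenient baseline to compare against the ensemble methods. A minimal sketch using the objects created above (choosing malignant as the positive class is our own convention here):

> confusionMatrix(rparty.test, biop.test$class, positive = "malignant") #treat malignant as the positive class

The reported accuracy will match our hand calculation of roughly 96 percent.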
Summary

In this article, we learned about both the power and the limitations of tree-based learning methods for classification and regression problems. To improve on their predictive ability, we have the tools of random forests and gradient boosted trees at our disposal.