Practical Data Analysis

Chapter 1. Getting Started

Data analysis is the process in which raw data is ordered and organized so that it can be used in methods that help to explain the past and predict the future. Data analysis is not about the numbers; it is about asking questions, developing explanations, and testing hypotheses. It is a multidisciplinary field that combines computer science, artificial intelligence and machine learning, statistics and mathematics, and domain knowledge, as shown in the following figure:

The nature of data

Data is the plural of datum, so it is traditionally treated as plural. We can find data in every situation of the world around us, whether structured or unstructured, continuous or discrete: in weather records, stock market logs, photo albums, music playlists, or our Twitter accounts. In fact, data can be seen as the essential raw material of any kind of human activity. According to the Oxford English Dictionary:

Data are known facts or things used as basis for inference or reckoning.

As shown in the following figure, we can classify data into two distinct types, categorical and numerical:

Categorical data are values or observations that can be sorted into groups or categories. There are two types of categorical values: nominal and ordinal. A nominal variable has no intrinsic ordering to its categories. For example, housing is a categorical variable with two categories (own and rent). An ordinal variable has an established ordering. For example, age can be treated as a variable with three ordered categories (young, adult, and elder).

Numerical data are values or observations that can be measured. There are two kinds of numerical values: discrete and continuous. Discrete data are values or observations that can be counted and are distinct and separate, for example, the number of lines in a piece of code. Continuous data are values or observations that may take on any value within a finite or infinite interval, for example, an economic time series such as historic gold prices.
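As a rough illustration of the four kinds of values described above, the following sketch uses only the Python standard library. The variable names and sample values are made up for this example and are not taken from the book's datasets.

```python
from enum import IntEnum


class AgeGroup(IntEnum):
    """Ordinal categories: the integer values encode the ordering."""
    YOUNG = 1
    ADULT = 2
    ELDER = 3


# Nominal: categories with no intrinsic order (hypothetical sample values).
housing = ["own", "rent", "rent", "own"]

# Ordinal: categories with an established order.
ages = [AgeGroup.ELDER, AgeGroup.YOUNG, AgeGroup.ADULT]

# Discrete: countable, separate values (lines of code per file).
lines_of_code = [120, 45, 300]

# Continuous: any value within an interval (made-up prices).
gold_prices = [1251.75, 1249.5, 1260.1]

# Nominal data can only be grouped and counted ...
housing_counts = {category: housing.count(category) for category in set(housing)}

# ... while ordinal data can also be sorted meaningfully.
sorted_ages = [age.name for age in sorted(ages)]

print(housing_counts)
print(sorted_ages)
```

Sorting the housing values would only give alphabetical order, which carries no meaning for a nominal variable; sorting the age groups follows the order defined by the categories themselves.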

The kinds of datasets used in this book are as follows:

E-mails (unstructured, discrete)
Digital images (unstructured, discrete)
Stock market logs (structured, continuous)
Historic gold prices (structured, continuous)
Credit approval records (structured, discrete)
Social media friends and relationships (unstructured, discrete)
Tweets and trending topics (unstructured, continuous)
Sales records (structured, continuous)

Each project in this book uses a different kind of data, so that the reader learns to address different kinds of data problems.

The data analysis process

When you have a good understanding of a phenomenon, it is possible to make predictions about it. Data analysis helps us to make this possible through exploring the past and creating predictive models.

The data analysis process is composed of the following steps:

Statement of the problem
Obtain your data
Clean the data
Normalize the data
Transform the data
Exploratory statistics
Exploratory visualization
Predictive modeling
Validate your model
Visualize and interpret your results
Deploy your solution

All these activities can be grouped as shown in the following figure:

The problem

The problem definition starts with high-level questions such as how to track differences in behavior between groups of customers, or what the gold price is going to be next month. Understanding the objectives and requirements from a domain perspective is the key to a successful data analysis project.

Types of data analysis questions are listed as follows:

Inferential
Predictive
Descriptive
Exploratory
Causal
Correlational

Data preparation

Data preparation is about how to obtain, clean, normalize, and transform the data into an optimal dataset, trying to avoid any possible data quality issues such as invalid, ambiguous, out-of-range, or missing values. This process can take a lot of your time. In Chapter 2, Working with Data, we go into more detail about working with data, using OpenRefine to address the complicated tasks. Analyzing data that has not been carefully prepared can lead you to highly misleading results.
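To make the cleaning and normalizing steps concrete, here is a minimal sketch in plain Python. It is not the OpenRefine workflow used in Chapter 2; the function names, the valid range, and the sample values are assumptions chosen for illustration, and min-max scaling is only one of several possible normalizations.

```python
def clean(records, low, high):
    """Drop missing (None) and out-of-range values."""
    return [r for r in records if r is not None and low <= r <= high]


def min_max_normalize(values):
    """Rescale values linearly to the interval [0, 1]."""
    if not values:
        return []
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values are identical; map them to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Hypothetical raw ages with a missing value and two out-of-range values.
raw_ages = [25, None, 40, -3, 55, 130, 70]

valid_ages = clean(raw_ages, low=0, high=120)
normalized = min_max_normalize(valid_ages)

print(valid_ages)
print(normalized)
```

Dropping invalid records is the simplest policy; depending on the problem, imputing or flagging them may be more appropriate.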

The characteristics of good data are listed as follows:

Complete
Coherent
Unambiguous
Countable
Correct
Standardized
Non-redundant

Data exploration

Data exploration is essentially looking at the data in a graphical or statistical form, trying to find patterns, connections, and relations in the data. Visualization is used to provide overviews in which meaningful patterns may be found.

In Chapter 3, Data Visualization, we present a visualization framework (D3.js) and we implement some examples of how to use visualization as a data exploration tool.
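The statistical side of exploration can be sketched with the standard library alone. The example below summarizes one variable and measures the linear relation between two; the data and the names spend and sales are hypothetical, and the pearson helper is written out by hand rather than taken from a library.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return covariance / (spread_x * spread_y)


# Hypothetical observations: advertising spend and units sold.
spend = [10, 20, 30, 40, 50]
sales = [12, 25, 29, 41, 53]

summary = {
    "mean": statistics.mean(sales),
    "median": statistics.median(sales),
    "stdev": statistics.stdev(sales),
}
correlation = pearson(spend, sales)

print(summary)
print(round(correlation, 3))
```

A correlation close to 1 suggests a strong linear relation worth plotting and examining further; it does not by itself say anything about cause.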

Predictive modeling

Predictive modeling is a process used in data analysis to create or choose a statistical model that best predicts the probability of an outcome. In this book, we use a variety of those models, and we can group them into three categories based on their outcome:

Outcome	Chapter	Algorithm
Categorical outcome (Classification)	4	Naïve Bayes Classifier
Categorical outcome (Classification)	11	Natural Language Toolkit + Naïve Bayes Classifier
Numerical outcome (Regression)	6	Random Walk
	8	Support Vector Machines
	9	Cellular Automata
	8	Distance Based Approach + k-nearest neighbor
Descriptive modeling (Clustering)	5	Fast Dynamic Time Warping (FDTW) + Distance Metrics
Descriptive modeling (Clustering)	10	Force Layout and Fruchterman-Reingold layout

Another important task we need to accomplish in this step is evaluating whether the model we chose is the best fit for the particular problem.

The No Free Lunch Theorem proposed by Wolpert in 1996 stated:

"No Free Lunch theorems have shown that learning algorithms cannot be universally good."

Model evaluation helps us to ensure that our analysis is not over-optimistic or overfitted. In this book, we are going to present two different ways to validate the model:

Cross-validation: We divide the data into subsets of equal size, train the predictive model on all but one subset, and test it on the remaining subset, repeating the procedure so that each subset is used for testing once. This estimates how the model is going to perform in practice. We will implement cross-validation in order to validate the robustness of our model as well as to evaluate multiple models and identify the best one based on their performance.
Hold-Out: Typically, a large dataset is randomly divided into three subsets: training set, validation set, and test set.
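The two validation schemes can be sketched as index-splitting helpers in plain Python. The function names, the 60/20/20 proportions, and the fixed seed are assumptions made for this illustration; the later chapters may split the data differently.

```python
import random


def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k folds of near-equal size."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def hold_out_split(items, train=0.6, validation=0.2, seed=0):
    """Randomly split items into training, validation, and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


folds = list(k_fold_indices(10, k=5))
train_set, validation_set, test_set = hold_out_split(range(100))

print(len(folds), [len(test) for _, test in folds])
print(len(train_set), len(validation_set), len(test_set))
```

In cross-validation every sample is used for testing exactly once, which makes it attractive for small datasets; the hold-out split is cheaper and is usually preferred when data is plentiful.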

Visualization of results

This is the final step in our analysis process, and we need to answer the following questions:

How are we going to present the results?

For example, in tabular reports, 2D plots, dashboards, or infographics.

Where are they going to be deployed?

For example, in printed hard copy, posters, mobile devices, a desktop interface, or the web.

Each choice will depend on the kind of analysis and the particular data. In the following chapters, we will learn how to use standalone plotting in Python with matplotlib and web visualization with D3.js.

What about big data?

Big data is a term used when the data exceeds the processing capacity of a typical database. We need big data analytics when the data grows quickly and we need to uncover hidden patterns, unknown correlations, and other useful information.

There are three main features in big data:

Volume: Large amounts of data
Variety: Different types of structured, unstructured, and multi-structured data
Velocity: Needs to be analyzed quickly

As shown in the following figure, we can see the interaction between the three Vs:

Big data is the opportunity for any company to gain advantages from data aggregation, data exhaust, and metadata. This makes big data a useful business analytic tool, but there is a common misunderstanding about what big data is.

The most common architecture for big data processing is MapReduce, a programming model for processing large datasets in parallel on a distributed cluster.

Apache Hadoop is the most popular implementation of MapReduce to solve large-scale distributed data storage, analysis, and retrieval tasks. However, MapReduce is just one of the three classes of technologies for storing and managing big data. The other two classes are NoSQL and massively parallel processing (MPP) data stores. In this book, we implement MapReduce functions and NoSQL storage through MongoDB; see Chapter 12, Data Processing and Aggregation with MongoDB, and Chapter 13, Working with MapReduce.
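The programming model itself is small enough to sketch in a single process. The word count below follows the usual map, shuffle, and reduce phases in plain Python; it is not Hadoop or MongoDB code, and the function names and sample documents are invented for the illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


documents = ["big data is big", "data is everywhere"]

# On a cluster each document would be mapped on a different node;
# here the map calls simply run one after another.
emitted = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(emitted))

print(word_counts)
```

Because each map call looks at one document only and each reduce call at one key only, both phases can be distributed across machines without coordination, which is what makes the model scale.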

MongoDB provides us with document-oriented storage, high availability, and map/reduce flexible aggregation for data processing.

A paper published by the IEEE in 2009, The Unreasonable Effectiveness of Data, states:

But invariably, simple models and a lot of data trump over more elaborate models based on less data.

This is a fundamental idea in big data (you can find the full paper at http://bit.ly/1dvHCom). The trouble with real-world data is that the probability of finding false correlations is high and gets higher as the datasets grow. That's why, in this book, we focus on meaningful data instead of big data.
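The point about false correlations can be checked with a small simulation. The sketch below generates columns of pure random noise and counts how many pairs look strongly correlated by chance. The number of columns, the sample size of 20, the 0.5 threshold, and the seed are arbitrary assumptions, and "growth" is modeled here as adding more variables rather than more rows.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return covariance / (spread_x * spread_y)


def count_strong_pairs(columns, threshold=0.5):
    """Count column pairs whose absolute correlation exceeds the threshold."""
    return sum(1 for a, b in combinations(columns, 2)
               if abs(pearson(a, b)) > threshold)


rng = random.Random(42)
# 40 unrelated variables with 20 observations each: pure noise.
columns = [[rng.random() for _ in range(20)] for _ in range(40)]

few = count_strong_pairs(columns[:5])   # 10 pairs to compare
many = count_strong_pairs(columns)      # 780 pairs to compare

print(few, many)
```

None of these variables are related, yet the number of apparently strong pairs rises with the number of comparisons, simply because there are more chances for noise to line up.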

One of the main challenges for big data is how to store, protect, back up, organize, and catalog the data at petabyte scale. Another main challenge is data ubiquity. With the proliferation of smart devices with several sensors and cameras, the amount of data available for each person increases every minute. Big data must process all this data in real time.

Sensors and cameras

Interaction with the outside world is highly important in data analysis. Using sensors such as RFID (Radio-frequency identification) or a smartphone to scan a QR code (Quick Response Code) is an easy way to interact directly with the customer, make recommendations, and analyze consumer trends.

On the other hand, people are using their smartphones all the time, using their cameras as a tool. In Chapter 5, Similarity-based Image Retrieval, we will use these digital images to perform search by image. This can be used, for example, in face recognition or to find reviews of a restaurant just by taking a picture of the front door.

The interaction with the real world can give you a competitive advantage and a real-time data source directly from the customer.

Social networks analysis

Formally, social network analysis (SNA) is the analysis of social relationships in terms of network theory, with nodes representing individuals and ties representing relationships between the individuals, as we can see in the following figure. The social network creates groups of related individuals (friendship) based on different aspects of their interaction. We can find important information such as hobbies (for product recommendation) or who has the most influential opinion in the group (centrality). In Chapter 10, Working with Social Graphs, we will present a project, Who is your closest friend?, and show a solution for Twitter clustering.

Social networks are strongly connected, and these connections are often not symmetric. This makes SNA computationally expensive, so it needs to be addressed with high-performance solutions that are less statistical and more algorithmic.
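As a small illustration of nodes, ties, and centrality, the sketch below computes degree centrality on a toy friendship graph. The names and ties are hypothetical, the ties are treated as symmetric for simplicity, and degree centrality is only one of several centrality measures; this is not the Chapter 10 project.

```python
# Hypothetical friendship ties; each pair is an undirected relationship.
ties = [
    ("ana", "ben"),
    ("ana", "carla"),
    ("ana", "dev"),
    ("ben", "carla"),
    ("dev", "eli"),
]


def build_graph(ties):
    """Build an adjacency mapping: node -> set of directly tied nodes."""
    graph = {}
    for a, b in ties:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def degree_centrality(graph):
    """Fraction of the other nodes each node is directly tied to."""
    n = len(graph)
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


graph = build_graph(ties)
centrality = degree_centrality(graph)
most_central = max(centrality, key=centrality.get)

print(centrality)
print(most_central)
```

For directed relationships such as Twitter follows, the same idea splits into in-degree (followers) and out-degree (following).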

The visualization of a social network can help us to get a good insight into how people are connected. The exploration of the graph is done through displaying nodes and ties in various colors, sizes, and distributions. The D3.js library has animation capabilities that enable us to visualize the social graph with an interactive animation. This helps us to simulate behaviors such as information diffusion or distance between nodes.

Facebook processes more than 500 TB of data daily (images, text, video, likes, and relationships). This amount of data needs non-conventional treatment such as NoSQL databases and MapReduce frameworks. In this book, we work with MongoDB, a document-based NoSQL database that also has great functions for aggregations and MapReduce processing.

Tools and toys for this book

The main goal of this book is to provide the reader with self-contained projects that are ready to deploy. To do this, as you go through the book you will use and implement tools such as Python, D3, and MongoDB. These tools will help you to program and deploy the projects. You can also download all the code from the author's GitHub repository at https://github.com/hmcuesta.

You can see a detailed installation and setup process of all the tools in Appendix, Setting Up the Infrastructure.

Why Python?

Python is a scripting language: an interpreted language with its own built-in memory management and good facilities for calling and cooperating with other programs. There are two popular versions, 2.7 and 3.x. In this book, we will focus on the 3.x version because it is under active development and has already seen over two years of stable releases.

Python is multi-platform: it runs on Windows, Linux/Unix, and Mac OS X, and has been ported to the Java and .NET virtual machines. Python has powerful standard libraries and a wealth of third-party packages for numerical computation and machine learning, such as NumPy, SciPy, pandas, SciKit, mlpy, and so on.

Python is excellent for beginners, yet great for experts, and is highly scalable: suitable for large projects as well as small ones. It is also easily extensible and object-oriented.

Python is widely used by organizations such as Google, Yahoo Maps, NASA, RedHat, Raspberry Pi, IBM, and so on.

A list of organizations using Python is available at http://wiki.python.org/moin/OrganizationsUsingPython.

Python has excellent documentation and examples at http://docs.python.org/3/.

Python is free to use, even for commercial products, and can be downloaded for free from http://python.org/.

Why mlpy?

mlpy (Machine Learning Python) is a Python module built on top of NumPy, SciPy, and the GNU Scientific Libraries. It is open source and supports Python 3.x. The mlpy module has a large number of machine learning algorithms for supervised and unsupervised problems.

Some of the features of mlpy that will be used in this book are as follows:

We will perform a numeric regression with kernel ridge regression (KRR)
We will explore the dimensionality reduction through principal component analysis (PCA)
We will work with support vector machines (SVM) for classification
We will perform text classification with Naive Bayes
We will measure how different two time series are with the dynamic time warping (DTW) distance metric
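To give a feel for the last item, here is a textbook dynamic time warping distance written in plain Python. It is a minimal sketch of the classic quadratic-time algorithm with absolute difference as the local cost, not mlpy's implementation or API, and the sample series are invented.

```python
def dtw_distance(series_a, series_b):
    """Classic O(n*m) dynamic time warping with absolute difference as cost."""
    n, m = len(series_a), len(series_b)
    inf = float("inf")
    # cost[i][j] holds the cheapest alignment of the first i and j points.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(series_a[i - 1] - series_b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # repeat a point of b
                                    cost[i][j - 1],      # repeat a point of a
                                    cost[i - 1][j - 1])  # advance both series
    return cost[n][m]


# Same shape, but the second series starts one step late.
same_shape = dtw_distance([1, 2, 3, 2, 1], [1, 1, 2, 3, 2, 1])

# A peak compared with a flat line.
different = dtw_distance([1, 2, 3, 2, 1], [3, 3, 3, 3, 3])

print(same_shape, different)
```

Unlike a point-by-point Euclidean distance, DTW lets one series stretch or compress in time, so two series with the same shape but different timing end up close together.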

We can download the latest version of mlpy from http://mlpy.sourceforge.net/.

For reference, see the paper mlpy: Machine Learning Python (http://arxiv.org/abs/1202.6548), submitted in 2012 by D. Albanese, R. Visintainer, S. Merler, S. Riccadonna, G. Jurman, and C. Furlanello.

Why D3.js?

D3.js (Data-Driven Documents) was developed by Mike Bostock. D3 is a JavaScript library for visualizing data and manipulating the document object model that runs in a browser without a plugin. In D3.js you can manipulate all the elements of the DOM (Document Object Model); it is as flexible as the client-side web technology stack (HTML, CSS, and SVG).

D3.js supports large datasets and includes animation capabilities that make it a really good choice for web visualization.

D3 has excellent documentation, examples, and community at https://github.com/mbostock/d3/wiki/Gallery and https://github.com/mbostock/d3/wiki.

You can download the latest version of D3.js from http://d3js.org/d3.v3.zip.

Why MongoDB?

NoSQL (Not only SQL) is a term that covers different types of data storage technologies, used when you can't fit your business model into a classical relational data model. NoSQL is mainly used in Web 2.0 and in social media applications.

MongoDB is a document-based database. This means that MongoDB stores and organizes the data as a collection of documents that gives you the possibility to store the view models almost exactly like you model them in the application. Also, you can perform complex searches for data and elementary data mining with MapReduce.

MongoDB is highly scalable, robust, and well suited to JavaScript-based web applications because you can store your data in JSON (JavaScript Object Notation) documents and implement a flexible schema, which makes it a good fit for unstructured data.
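The idea of a flexible schema can be shown without a database at all. In the sketch below, two documents from the same hypothetical collection carry different fields and still round-trip through JSON; it uses only Python's json module, not a MongoDB driver, and all field names and values are made up.

```python
import json

# Two hypothetical documents from the same collection. They share some
# fields but not all, which a fixed relational schema would not allow.
tweets = [
    {"user": "alice", "text": "Learning data analysis",
     "hashtags": ["python", "data"]},
    {"user": "bob", "text": "Hello",
     "location": {"city": "Austin", "country": "US"}},
]

# Serialize to JSON text and read it back, as a document store would.
serialized = json.dumps(tweets)
restored = json.loads(serialized)

# A simple query expressed in plain Python: users whose document
# carries a hashtag list.
with_hashtags = [doc["user"] for doc in restored if "hashtags" in doc]

print(serialized)
print(with_hashtags)
```

Nested values such as the location sub-document are stored inside the document itself, where a relational design would need a separate table and a join.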

MongoDB is used by highly recognized corporations such as Foursquare, Craigslist, Firebase, SAP, and Forbes. We can see a detailed list at http://www.mongodb.org/about/production-deployments/.

MongoDB has a big and active community and well-written documentation at http://docs.mongodb.org/manual/.

MongoDB is easy to learn and free; we can download it from http://www.mongodb.org/downloads.

Amazon verified reviews

Carlos Rodriguez Contreras Feb 19, 2014

This is a very useful text for all people trying to get into Big Data Analysis. Concepts are clearly explained and readers do not need to be experts in any topic covered; this is why I chose Cuesta's book over a lot of books on Big Data that apparently try to show mainly the expertise of authors. If you, like me, are interested in Big Data, this is a must on your shelf.

Amazon Verified review

José Carlos Dec 07, 2013

This book is not about theories of data analysis; it is about how to move your hacking skills into the data analysis world. If you are a programmer/hacker who wants to understand a problem from a data-oriented perspective, this book is for you.

This book is a fast introduction to data analysis methods, including some of the most used techniques for classification, regression, and clustering. The book provides a wide range of tools like Python, mlpy, Pandas, D3js, and MongoDB. The recipes are clear and easy to follow; you can get into data analysis in a fast way if you already have some programming skills.

I can highly recommend chapters 10 and 11, which focus on Social Networks Analytics and Social Networks Graph's Visualization.

Mark Kerzner Nov 27, 2013

This is a very practical book, which teaches you how to "make data talk to you," that is, how to extract information, quantitative and qualitative, out of your data, and make it useful beyond just numbers.

Following the by now ubiquitous quote by Hal Varian of Google that "the sexy job in the next ten years will be statisticians" [...] the book teaches not the theory and not the programming languages, but methods and operations on the data.

Programming languages do come in (Python with its mathematical and word analysis packages), but only as tools for the practical applications. So, if you are not looking for the theoretical mathematical proofs or for computer science implementation details but are rather interested in the answers that the data can provide, you have come to the right place. Here are some of the areas that the book covers:

Data formats and visualization
Text classification
Finding similar images
Simulation of stock price and predicting the prices of gold
Machine learning
Modeling infectious diseases
Working with social graphs
Sentiment analysis of Twitter data

The reader will do well to go deeper and to read the description of the algorithms mentioned in the book. As mentioned, the book is practical in that it explains the benefits of the analysis but not the analysis itself. However, it gives you a good list of areas you need to go deeper into, and sets you on the right track with that. Later, you will be able to use it as a handbook and a cheat sheet.

View2 Nov 24, 2013

This book gives a very practical introduction to data analysis. It covers a wide range of topics, including data visualization, text analysis (spam recognition, sentiment analysis), image analysis, social graph analysis, Bayes classification, SVM, etc. The examples are very practical, and teach the user how to use popular languages and libraries like d3.js, python3, nltk, mlpy etc. to do basic data analysis.

The book is a great read for beginners. To read and fully appreciate it, no data analysis experience is required. The book provides an introduction to the very basic techniques. Some basic understanding of python and javascript would be necessary, though.

What I like about this book is its hands-on style: while reading, you can easily get started with your first data analyses. The examples are very simple, the code easy to read, and a very detailed appendix helps to install the tools used. This book is a great help to learn data analysis by doing.

What may be improved is precision. I found some grammar mistakes. Not so big a problem, but not perfect, either. For instance reading sentences like "we will use Pillow due to its compatibility with Python 3.2 and can be downloaded ..." [p. 97] does hurt a little. More problematic is the section "Classifier accuracy" [p. 90]. It simply uses the ratio of correctly predicted emails to be a measure of accuracy, although actually every discussion of classification accuracy must contain the ratios of false positives and false negatives as well.

Overall, this book is a very practical introduction to data analysis for beginners.

R. Friesel Jr. Dec 09, 2013

I just finished up reading "Practical Data Analysis" by Hector Cuesta (Packt Publishing, 2013) and overall, it was a pretty good overview and recommends some good tools. I would say that the book is a good place for someone to get started if they have no real experience performing these kinds of analyses, and though Cuesta doesn't go deep into the math behind it all, he isn't afraid to use the technical names for different formulae, which should make it easy for you to do your own follow-up research.

Jeff Leek's Data Analysis on Coursera provides the lens through which I read this book. That being said, I found myself doing a lot of comparing and contrasting between the two. For example, they both use practical, reasonably small "real world" sample problems to highlight specific analytical techniques and/or features of their chosen toolkits. However, whereas Leek's course focused exclusively on using R, Cuesta assembles his own all-star team of tools using Python and D3.js. Perhaps it goes without saying, but there are pros and cons to each approach (e.g., Leek's "pure R" vs. Cuesta's "Python plus D3.js"), and I felt that it was best to consider them together.

Cuesta's approach with this book is to present a sample scenario in each chapter that introduces a class of problem, a solution to that problem, and his recommended toolkit. For example, chapter six creates a stock price simulation, introducing simple simulation problems (especially for apparently stochastic data), time series data and Monte Carlo methods, and then how to simulate the data using Python and visualizing it in D3.js. Although the book is not strictly a "cookbook", the chapters very much feel like macro-level "recipes". There's quite a bit of code and some decent discussion around the concepts that govern the analytical model, and (true to the "practical" in the title) the emphasis is on the "how" and not the "why".

While I did not read the entire book cover-to-cover, I would definitely recommend it to anyone that wants an introduction to some basic data analysis techniques and tools. You'll get more out of this book if you have some base to compare it to -- e.g., some experience in R (academic or otherwise); and you'll get the most out of this book if you also have a solid foundation in the mathematics and/or statistics that underlie these analytical approaches.

DISCLOSURE: I was given an electronic copy of this book from the publisher in exchange for writing a review.

Practical Data Analysis: For small businesses, analyzing the information contained in their data using open source technology could be game-changing. All you need is some basic programming and mathematical skills to do just that.
