Practical Data Analysis

By Hector Cuesta

About this book

Plenty of small businesses face large amounts of data but lack the internal skills to support quantitative analysis. Understanding how to harness the power of data analysis using the latest open source technology can help them provide better customer service, visualize customer needs, or even obtain fresh insights about the performance of previous products. Practical Data Analysis is a book ideal for home and small business users who want to slice and dice the data they have on hand with minimum hassle.

Practical Data Analysis is a hands-on guide to understanding the nature of your data and turning it into insight. It will introduce you to the use of machine learning techniques, social network analytics, and econometrics to help your clients get insights about the pool of data they have at hand. Data preparation and processing over several kinds of data, such as text, images, graphs, documents, and time series, will also be covered.

Practical Data Analysis presents a detailed exploration of the current work in data analysis through self-contained projects. First, you will explore the basics of data preparation and transformation through OpenRefine. Then you will get started with exploratory data analysis using the D3.js visualization framework. You will also be introduced to machine learning techniques such as classification, regression, and clustering through practical projects such as spam classification, predicting gold prices, and finding clusters in your Facebook friends' network. You will learn how to solve problems in text classification, simulation, time series forecasting, social media, and MapReduce through detailed projects. Finally, you will work with large amounts of Twitter data using MapReduce to perform a sentiment analysis implemented in Python and MongoDB.

Practical Data Analysis contains a combination of carefully selected algorithms and data scrubbing that enables you to turn your data into insight.

Publication date:
October 2013
Publisher
Packt
Pages
360
ISBN
9781783280995

 

Chapter 1. Getting Started

Data analysis is the process in which raw data is ordered and organized so that it can be used in methods that help to explain the past and predict the future. Data analysis is not about the numbers; it is about asking questions, developing explanations, and testing hypotheses. Data analysis is a multidisciplinary field that combines computer science, artificial intelligence and machine learning, statistics and mathematics, and knowledge of the domain, as shown in the following figure:

 

Computer science


Computer science creates the tools for data analysis. The vast amount of data generated has made computational analysis critical and has increased the demand for skills such as programming, database administration, network administration, and high-performance computing. Some programming experience in Python (or any high-level programming language) is needed to understand the chapters.

 

Artificial intelligence (AI)


According to Stuart Russell and Peter Norvig:

"[AI] has to do with smart programs, so let's get on and write some."

In other words, AI studies the algorithms that can simulate intelligent behavior. In data analysis, we use AI to perform those activities that require intelligence, such as inference, similarity search, or unsupervised classification.

 

Machine Learning (ML)


Machine learning is the study of computer algorithms that learn how to react in a certain situation or recognize patterns. According to Arthur Samuel (1959),

"Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed."

ML has a large number of algorithms, generally split into three groups according to how the algorithm is trained:

  • Supervised learning

  • Unsupervised learning

  • Reinforcement learning

A relevant number of these algorithms are used throughout the book and are combined with practical examples, leading the reader through the process from the data problem to its programming solution.

 

Statistics


In January 2009, Google's Chief Economist, Hal Varian said,

"I keep saying the sexy job in the next ten years will be statisticians. People think I'm joking, but who would've guessed that computer engineers would've been the sexy job of the 1990s?"

Statistics is the development and application of methods to collect, analyze, and interpret data.

Data analysis encompasses a variety of statistical techniques such as simulation, Bayesian methods, forecasting, regression, time-series analysis, and clustering.

 

Mathematics


Data analysis makes use of many mathematical techniques in its algorithms, such as linear algebra (vectors and matrices, factorization, and eigenvalues), numerical methods, and conditional probability. In this book, all the chapters are self-contained and include the necessary math involved.

 

Knowledge domain


One of the most important activities in data analysis is asking questions, and a good understanding of the knowledge domain can give you the expertise and intuition needed to ask good questions. Data analysis is used in almost all the domains such as finance, administration, business, social media, government, and science.

 

Data, information, and knowledge


Data are facts of the world. For example, financial transactions, age, temperature, and the number of steps from my house to my office are simply numbers. Information appears when we work with those numbers and find value and meaning in them. Information can help us to make informed decisions.

We can talk about knowledge when the data and the information turn into a set of rules that assist decisions. In fact, we can't store knowledge because it implies theoretical or practical understanding of a subject. However, using predictive analytics, we can simulate intelligent behavior and provide a good approximation. An example of how to turn data into knowledge is shown in the following figure:

 

The nature of data


Data is the plural of datum, so it is always treated as plural. We can find data in every situation of the world around us, whether structured or unstructured, continuous or discrete: in weather records, stock market logs, photo albums, music playlists, or our Twitter accounts. In fact, data can be seen as the essential raw material of any kind of human activity. According to the Oxford English Dictionary:

Data are known facts or things used as basis for inference or reckoning.

As shown in the following figure, data can be classified in two distinct ways, categorical and numerical:

Categorical data are values or observations that can be sorted into groups or categories. There are two types of categorical values, nominal and ordinal. A nominal variable has no intrinsic ordering to its categories. For example, housing is a categorical variable with two categories (own and rent). An ordinal variable has an established ordering. For example, age can be a variable with three ordered categories (young, adult, and elder).

Numerical data are values or observations that can be measured. There are two kinds of numerical values, discrete and continuous. Discrete data are values or observations that can be counted and are distinct and separate, for example, the number of lines in a program. Continuous data are values or observations that may take on any value within a finite or infinite interval, for example, an economic time series such as historic gold prices.

The kinds of datasets used in this book are as follows:

  • E-mails (unstructured, discrete)

  • Digital images (unstructured, discrete)

  • Stock market logs (structured, continuous)

  • Historic gold prices (structured, continuous)

  • Credit approval records (structured, discrete)

  • Social media friends and relationships (unstructured, discrete)

  • Tweets and trending topics (unstructured, continuous)

  • Sales records (structured, continuous)

For each of the projects in this book, we try to use a different kind of data. The aim is to give the reader the ability to address different kinds of data problems.

 

The data analysis process


When you have a good understanding of a phenomenon, it is possible to make predictions about it. Data analysis helps us to make this possible through exploring the past and creating predictive models.

The data analysis process is composed of the following steps:

  • State the problem

  • Obtain your data

  • Clean the data

  • Normalize the data

  • Transform the data

  • Exploratory statistics

  • Exploratory visualization

  • Predictive modeling

  • Validate your model

  • Visualize and interpret your results

  • Deploy your solution

All these activities can be grouped as shown in the following figure:

The problem

The problem definition starts with high-level questions such as how to track differences in behavior between groups of customers, or what the gold price is going to be next month. Understanding the objectives and requirements from a domain perspective is the key to a successful data analysis project.

Types of data analysis questions are listed as follows:

  • Inferential

  • Predictive

  • Descriptive

  • Exploratory

  • Causal

  • Correlational

Data preparation

Data preparation is about how to obtain, clean, normalize, and transform the data into an optimal dataset, trying to avoid any possible data quality issues such as invalid, ambiguous, out-of-range, or missing values. This process can take a lot of your time. In Chapter 2, Working with Data, we go into more detail about working with data, using OpenRefine to address the complicated tasks. Analyzing data that has not been carefully prepared can lead you to highly misleading results.
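
As a rough illustration of the cleaning and normalizing steps described above, the following minimal sketch drops missing and out-of-range values and then rescales what remains to the [0, 1] interval. The sample ages, the valid range of 0 to 120, and the function names are assumptions made up for this example; they do not come from the book's projects.

```python
def clean(records, low=0, high=120):
    """Drop records that are missing or outside the assumed valid range."""
    valid = []
    for value in records:
        if value is None:
            continue  # missing value
        if not (low <= value <= high):
            continue  # out-of-range value
        valid.append(float(value))
    return valid


def min_max_normalize(values):
    """Rescale values linearly to the [0, 1] interval."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        return [0.0 for _ in values]  # avoid dividing by zero
    return [(v - smallest) / (largest - smallest) for v in values]


raw_ages = [23, None, 45, -1, 67, 300, 34]  # made-up sample data
cleaned = clean(raw_ages)
normalized = min_max_normalize(cleaned)
print(cleaned)
print(normalized)
```

Min-max scaling is only one of several normalization choices; it is used here because it is easy to verify by hand.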

The characteristics of good data are listed as follows:

  • Complete

  • Coherent

  • Unambiguous

  • Countable

  • Correct

  • Standardized

  • Non-redundant

Data exploration

Data exploration is essentially looking at the data in a graphical or statistical form trying to find patterns, connections, and relations in the data. Visualization is used to provide overviews in which meaningful patterns may be found.

In Chapter 3, Data Visualization, we present a visualization framework (D3.js) and we implement some examples on how to use visualization as a data exploration tool.

Predictive modeling

Predictive modeling is a process used in data analysis to create or choose a statistical model that tries to best predict the probability of an outcome. In this book, we use a variety of those models, and we can group them into three categories based on their outcome:

 

Category                              Chapter  Algorithm
Categorical outcome (Classification)  4        Naïve Bayes Classifier
                                      11       Natural Language Toolkit + Naïve Bayes Classifier
Numerical outcome (Regression)        6        Random Walk
                                      8        Support Vector Machines
                                      9        Cellular Automata
                                      8        Distance Based Approach + k-nearest neighbor
Descriptive modeling (Clustering)     5        Fast Dynamic Time Warping (FDTW) + Distance Metrics
                                      10       Force Layout and Fruchterman-Reingold layout

Another important task we need to accomplish in this step is evaluating the model we chose to be optimal for the particular problem.

The No Free Lunch theorem, proposed by Wolpert in 1996, states:

"No Free Lunch theorems have shown that learning algorithms cannot be universally good."

The model evaluation helps us to ensure that our analysis is not over-optimistic or over-fitted. In this book, we are going to present two different ways to validate the model:

  • Cross-validation: We divide the data into subsets of equal size and test the predictive model in order to estimate how it is going to perform in practice. We will implement cross-validation in order to validate the robustness of our model as well as evaluate multiple models to identify the best model based on their performance.

  • Hold-out: Mostly used with large datasets, the data is randomly divided into three subsets: training set, validation set, and test set.
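
The two validation strategies above can be sketched in plain Python as follows. This is a minimal illustration and not the book's implementation: the toy "model" simply predicts the mean of the training data, the 60/20/20 split proportions are an assumption, and all function names are hypothetical.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Shuffle item indices and split them into k folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(data, k, train_fn, score_fn):
    """Train on k-1 folds, score on the held-out fold, and average the scores."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train = [data[j] for f, fold in enumerate(folds) if f != i for j in fold]
        test = [data[j] for j in test_idx]
        scores.append(score_fn(train_fn(train), test))
    return sum(scores) / k


def hold_out_split(data, train_frac=0.6, valid_frac=0.2, seed=0):
    """Randomly divide data into training, validation, and test sets."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])


def train_mean(train):
    """Toy model: remember the mean of the training values."""
    return sum(train) / len(train)


def mean_abs_error(model, test):
    """Average absolute difference between the model and the test values."""
    return sum(abs(x - model) for x in test) / len(test)


data = [float(x) for x in range(20)]  # made-up sample data
print(cross_validate(data, 5, train_mean, mean_abs_error))
train, valid, test = hold_out_split(data)
print(len(train), len(valid), len(test))
```

Fixing the random seed keeps the folds reproducible, which makes it easier to compare several models on exactly the same splits.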

Visualization of results

This is the final step in our analysis process and we need to answer the following questions:

How are the results going to be presented?

For example, in tabular reports, 2D plots, dashboards, or infographics.

Where is it going to be deployed?

For example, in printed hard copy, posters, mobile devices, desktop interfaces, or the web.

Each choice will depend on the kind of analysis and the particular data. In the following chapters, we will learn how to use standalone plotting in Python with matplotlib and web visualization with D3.js.

 

Quantitative versus qualitative data analysis


Quantitative and qualitative analysis can be defined as follows:

  • Quantitative data: Numerical measurements expressed in terms of numbers

  • Qualitative data: Categorical measurements expressed in terms of natural language descriptions

As shown in the following figure, we can observe the differences between quantitative and qualitative analysis:

Quantitative analytics involves analysis of numerical data. The type of the analysis will depend on the level of measurement. There are four kinds of measurements:

  • Nominal: Data has no logical order and is used as classification data

  • Ordinal: Data has a logical order and differences between values are not constant

  • Interval: Data is continuous and has a logical order. The data has standardized differences between values, but has no true zero point

  • Ratio: Data is continuous with a logical order as well as regular interval differences between values, and has a true zero point

Qualitative analysis can explore the complexity and meaning of social phenomena. Data for qualitative study may include written texts (for example, documents or email) and/or audible and visual data (for example, digital images or sounds). In Chapter 11, Sentiment Analysis of Twitter Data, we present a sentiment analysis from Twitter data as an example of qualitative analysis.

 

Importance of data visualization


The goal of data visualization is to expose something new about the underlying patterns and relationships contained within the data. The visualization not only needs to look good but also needs to be meaningful in order to help organizations make better decisions. Visualization is an easy way to jump into a complex dataset (small or big) to describe and explore the data efficiently.

Many kinds of data visualizations are available, such as bar charts, histograms, line charts, pie charts, heat maps, frequency Wordles (as shown in the following figure), and so on, for one variable, two variables, and many variables in one, two, or three dimensions.

Data visualization is an important part of our data analysis process because it is a fast and easy way to perform an exploratory data analysis by summarizing the main characteristics of the data with a visual graph.

The goals of exploratory data analysis are listed as follows:

  • Detection of data errors

  • Checking of assumptions

  • Finding hidden patterns (such as tendency)

  • Preliminary selection of appropriate models

  • Determining relationships between the variables
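
The first goal in the list above, detection of data errors, can also be approached numerically before any chart is drawn. The following minimal sketch flags values that lie far from the median, measured in units of the median absolute deviation. The price list and the threshold of 5 are assumptions chosen for illustration only.

```python
import statistics

# Made-up gold prices; 13.3 is a deliberate data-entry error.
gold_prices = [1310.0, 1325.5, 1318.2, 13.3, 1332.8, 1340.1]

median = statistics.median(gold_prices)

# Median absolute deviation: a spread measure that outliers barely affect.
mad = statistics.median([abs(p - median) for p in gold_prices])

# Flag values more than 5 MADs away from the median (arbitrary threshold).
suspects = [p for p in gold_prices if abs(p - median) > 5 * mad]
print(suspects)
```

The median is used instead of the mean because a single extreme value would pull the mean toward itself and could hide the very error we are looking for.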

We will get into more detail about data visualization and exploratory data analysis in Chapter 3, Data Visualization.

 

What about big data?


Big data is a term used when the data exceeds the processing capacity of a typical database. We need big data analytics when the data grows quickly and we need to uncover hidden patterns, unknown correlations, and other useful information.

There are three main features in big data:

  • Volume: Large amounts of data

  • Variety: Different types of structured, unstructured, and multi-structured data

  • Velocity: Needs to be analyzed quickly

As shown in the following figure, we can see the interaction between the three Vs:

Big data is the opportunity for any company to gain advantages from data aggregation, data exhaust, and metadata. This makes big data a useful business analytic tool, but there is a common misunderstanding about what big data is.

The most common architecture for big data processing is MapReduce, which is a programming model for processing large datasets in parallel using a distributed cluster.
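
To make the programming model concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce phases applied to a word count. It only imitates the data flow; a real framework would distribute each phase across a cluster. The input sentences and function names are made up for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group the emitted values by key, as a framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


documents = ["big data is big", "data analysis with big data"]  # made-up input
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
print(counts)
```

Because each call to map_phase sees only one document and each call to reduce_phase sees only one key, both phases can run in parallel without sharing state.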

Apache Hadoop is the most popular implementation of MapReduce to solve large-scale distributed data storage, analysis, and retrieval tasks. However, MapReduce is just one of the three classes of technologies for storing and managing big data. The other two classes are NoSQL and massively parallel processing (MPP) data stores. In this book, we implement MapReduce functions and NoSQL storage through MongoDB; see Chapter 12, Data Processing and Aggregation with MongoDB, and Chapter 13, Working with MapReduce.

MongoDB provides us with document-oriented storage, high availability, and map/reduce flexible aggregation for data processing.

A paper published by the IEEE in 2009, The Unreasonable Effectiveness of Data, states:

But invariably, simple models and a lot of data trump more elaborate models based on less data.

This is a fundamental idea in big data (you can find the full paper at http://bit.ly/1dvHCom). The trouble with real-world data is that the probability of finding false correlations is high and gets higher as the datasets grow. That's why, in this book, we focus on meaningful data instead of big data.

One of the main challenges for big data is how to store, protect, back up, organize, and catalog the data at petabyte scale. Another main challenge of big data is the concept of data ubiquity. With the proliferation of smart devices with several sensors and cameras, the amount of data available for each person increases every minute. Big data must process all this data in real time.

Sensors and cameras

Interaction with the outside world is highly important in data analysis. Using sensors such as RFID (Radio-frequency identification) or a smartphone to scan a QR code (Quick Response Code) is an easy way to interact directly with the customer, make recommendations, and analyze consumer trends.

On the other hand, people are using their smartphones all the time, using their cameras as a tool. In Chapter 5, Similarity-based Image Retrieval, we will use these digital images to perform search by image. This can be used, for example, in face recognition or to find reviews of a restaurant just by taking a picture of the front door.

The interaction with the real world can give you a competitive advantage and a real-time data source directly from the customer.

Social networks analysis

Formally, social network analysis (SNA) performs the analysis of social relationships in terms of network theory, with nodes representing individuals and ties representing relationships between the individuals, as we can see in the following figure. The social network creates groups of related individuals (friendship) based on different aspects of their interaction. We can find important information such as hobbies (for product recommendation) or who has the most influential opinion in the group (centrality). In Chapter 10, Working with Social Graphs, we will present a project that asks who your closest friend is, and we will show a solution for Twitter clustering.
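
As a small illustration of nodes, ties, and centrality, the following sketch computes degree centrality, one of the simplest centrality measures: the fraction of the other individuals that each person is directly tied to. The names and friendships are invented, and this is not the method used in Chapter 10.

```python
from collections import defaultdict

# Hypothetical friendship ties; each pair is an undirected relationship.
ties = [("ana", "ben"), ("ana", "cruz"), ("ana", "dee"),
        ("ben", "cruz"), ("dee", "eli")]


def degree_centrality(edges):
    """Return, for each node, its number of ties divided by (nodes - 1)."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


centrality = degree_centrality(ties)
most_central = max(centrality, key=centrality.get)
print(centrality)
print(most_central)
```

Degree centrality only counts direct ties; other measures, such as betweenness or closeness, also take the position of a node within the whole network into account.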

Social networks are strongly connected and these connections are often not symmetric. This makes SNA computationally expensive, so it needs to be addressed with high-performance solutions that are less statistical and more algorithmic.

The visualization of a social network can help us to get a good insight into how people are connected. The exploration of the graph is done through displaying nodes and ties in various colors, sizes, and distributions. The D3.js library has animation capabilities that enable us to visualize the social graph with an interactive animation. This helps us to simulate behaviors such as information diffusion or distance between nodes.

Facebook processes more than 500 TB of data daily (images, text, video, likes, and relationships). This amount of data needs non-conventional treatment such as NoSQL databases and MapReduce frameworks. In this book, we work with MongoDB, a document-based NoSQL database that also has great functions for aggregations and MapReduce processing.

Tools and toys for this book

The main goal of this book is to provide the reader with self-contained projects ready to deploy. To do this, as you go through the book you will use and implement tools such as Python, D3, and MongoDB. These tools will help you to program and deploy the projects. You can also download all the code from the author's GitHub repository at https://github.com/hmcuesta.

You can see a detailed installation and setup process of all the tools in Appendix, Setting Up the Infrastructure.

Why Python?

Python is a scripting language: an interpreted language with its own built-in memory management and good facilities for calling and cooperating with other programs. There are two popular versions, 2.7 and 3.x. In this book, we will focus on the 3.x version because it is under active development and has already seen over two years of stable releases.

Python is multi-platform: it runs on Windows, Linux/Unix, and Mac OS X, and has been ported to the Java and .NET virtual machines. Python has powerful standard libraries and a wealth of third-party packages for numerical computation and machine learning such as NumPy, SciPy, pandas, SciKit, mlpy, and so on.

Python is excellent for beginners yet great for experts, and it is highly scalable: suitable for large projects as well as small ones. It is also easily extensible and object-oriented.

Python is widely used by organizations such as Google, Yahoo Maps, NASA, RedHat, Raspberry Pi, IBM, and so on.

A list of organizations using Python is available at http://wiki.python.org/moin/OrganizationsUsingPython.

Python has excellent documentation and examples at http://docs.python.org/3/.

Python is free to use, even for commercial products. It can be downloaded for free from http://python.org/.

Why mlpy?

mlpy (Machine Learning Python) is a Python module built on top of NumPy, SciPy, and the GNU Scientific Library. It is open source and supports Python 3.x. The mlpy module has a large number of machine learning algorithms for supervised and unsupervised problems.

Some of the features of mlpy that will be used in this book are as follows:

  • We will perform a numeric regression with kernel ridge regression (KRR)

  • We will explore the dimensionality reduction through principal component analysis (PCA)

  • We will work with support vector machines (SVM) for classification

  • We will perform text classification with Naive Bayes

  • We will see how different two time series are with dynamic time warping (DTW) distance metric
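
To give an idea of what the last item measures, here is a minimal pure-Python sketch of the classic dynamic time warping distance. It does not use mlpy or reproduce its API; the function name and the sample series are assumptions for illustration.

```python
def dtw_distance(series_a, series_b):
    """Dynamic time warping distance using absolute difference as local cost."""
    n, m = len(series_a), len(series_b)
    inf = float("inf")
    # cost[i][j] holds the best alignment cost of the first i and j points.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(series_a[i - 1] - series_b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # repeat a point of b
                                    cost[i][j - 1],      # repeat a point of a
                                    cost[i - 1][j - 1])  # advance both series
    return cost[n][m]


# Same shape, but the second series lingers on its first value.
print(dtw_distance([1, 2, 3, 4], [1, 1, 2, 3, 4]))  # 0.0
# Same shape, shifted up by one unit.
print(dtw_distance([1, 2, 3], [2, 3, 4]))  # 2.0
```

Unlike a point-by-point comparison, DTW lets one series stretch or compress in time, so two series of different lengths with the same shape can still have a distance of zero.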

We can download the latest version of mlpy from http://mlpy.sourceforge.net/.

For reference, see the paper mlpy: Machine Learning Python (http://arxiv.org/abs/1202.6548), submitted in 2012 by D. Albanese, R. Visintainer, S. Merler, S. Riccadonna, G. Jurman, and C. Furlanello.

Why D3.js?

D3.js (Data-Driven Documents) was developed by Mike Bostock. D3 is a JavaScript library for visualizing data and manipulating the document object model that runs in a browser without a plugin. In D3.js you can manipulate all the elements of the DOM (Document Object Model); it is as flexible as the client-side web technology stack (HTML, CSS, and SVG).

D3.js supports large datasets and includes animation capabilities that make it a really good choice for web visualization.

D3 has excellent documentation, examples, and community at https://github.com/mbostock/d3/wiki/Gallery and https://github.com/mbostock/d3/wiki.

You can download the latest version of D3.js from http://d3js.org/d3.v3.zip.

Why MongoDB?

NoSQL (Not only SQL) is a term that covers different types of data storage technologies, used when you can't fit your business model into a classical relational data model. NoSQL is mainly used in Web 2.0 and in social media applications.

MongoDB is a document-based database. This means that MongoDB stores and organizes the data as a collection of documents, which gives you the possibility to store the view models almost exactly as you model them in the application. Also, you can perform complex searches for data and elementary data mining with MapReduce.

MongoDB is highly scalable, robust, and perfect to work with JavaScript-based web applications because you can store your data in a JSON (JavaScript Object Notation) document and implement a flexible schema, which makes it perfect for unstructured data.
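
The idea of a flexible schema can be illustrated without a database, using only Python's standard json module. In the following minimal sketch, two documents intended for the same collection carry different fields, and both survive a round trip through JSON. The field names and values are invented, and no MongoDB driver is involved.

```python
import json

# Two hypothetical documents for one collection; their fields differ.
tweets = [
    {"user": "alice", "text": "Loving this book", "tags": ["python", "data"]},
    {"user": "bob", "text": "Rainy day", "geo": {"lat": 19.4, "lon": -99.1}},
]

# Serialize each document to a JSON string and read it back.
serialized = [json.dumps(doc) for doc in tweets]
restored = [json.loads(text) for text in serialized]

# Code that reads the documents must tolerate missing fields.
with_tags = [doc["user"] for doc in restored if "tags" in doc]
print(serialized[0])
print(with_tags)
```

The flexibility comes at a price: since nothing enforces a common structure, the code that reads the documents has to check which fields are present.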

MongoDB is used by highly recognized corporations such as Foursquare, Craigslist, Firebase, SAP, and Forbes. We can see a detailed list at http://www.mongodb.org/about/production-deployments/.

MongoDB has a big and active community and well-written documentation at http://docs.mongodb.org/manual/.

MongoDB is easy to learn and it's free. We can download MongoDB from http://www.mongodb.org/downloads.

 

Summary


In this chapter, we presented an overview of the data analysis ecosystem, explaining basic concepts of the data analysis process and tools, and giving some insight into the practical applications of data analysis. We also provided an overview of the different kinds of data: numerical and categorical. We got into the nature of data, structured (databases, logs, and reports) and unstructured (image collections, social networks, and text mining). Then, we introduced the importance of data visualization and how a fine visualization can help us in exploratory data analysis. Finally, we explored some of the concepts of big data and social network analysis.

In the next chapter, we will work with data: cleaning, processing, and transforming it using Python and OpenRefine.

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

About the Author

  • Hector Cuesta

    Hector Cuesta is founder and Chief Data Scientist at Dataxios, a machine intelligence research company. He holds a BA in Informatics and an M.Sc. in Computer Science. He provides consulting services for data-driven product design, with experience in a variety of industries including financial services, retail, fintech, e-learning, and human resources. He is a robotics enthusiast in his spare time.

    You can follow him on Twitter at https://twitter.com/hmCuesta.

