Data analysis is the process in which raw data is ordered and organized so that it can be used in methods that help to explain the past and predict the future. Data analysis is not about the numbers; it is about asking questions, developing explanations, and testing hypotheses. Data analysis is a multidisciplinary field that combines computer science, artificial intelligence and machine learning, statistics and mathematics, and the knowledge domain, as shown in the following figure:
Computer science creates the tools for data analysis. The vast amount of data generated has made computational analysis critical and has increased the demand for skills such as programming, database administration, network administration, and high-performance computing. Some programming experience in Python (or any high-level programming language) is needed to understand the chapters.
According to Stuart Russell and Peter Norvig:
"[AI] has to do with smart programs, so let's get on and write some."
In other words, AI studies the algorithms that can simulate intelligent behavior. In data analysis, we use AI to perform activities that require intelligence, such as inference, similarity search, or unsupervised classification.
"Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed."
ML has a large number of algorithms, generally split into three groups according to how the algorithm is trained:
A number of relevant algorithms are used throughout the book and are combined with practical examples, leading the reader through the process from the data problem to its programming solution.
"I keep saying the sexy job in the next ten years will be statisticians. People think I'm joking, but who would've guessed that computer engineers would've been the sexy job of the 1990s?"
Data analysis makes use of many mathematical techniques in its algorithms, such as linear algebra (vectors and matrices, factorization, and eigenvalues), numerical methods, and conditional probability. In this book, all the chapters are self-contained and include the necessary math involved.
One of the most important activities in data analysis is asking questions, and a good understanding of the knowledge domain can give you the expertise and intuition needed to ask good questions. Data analysis is used in almost every domain, including finance, administration, business, social media, government, and science.
Data are facts of the world. For example, financial transactions, age, temperature, and the number of steps from my house to my office are simply numbers. Information appears when we work with those numbers and find value and meaning in them. Information can help us to make informed decisions.
We can talk about knowledge when the data and the information turn into a set of rules that assist decisions. In fact, we can't store knowledge because it implies theoretical or practical understanding of a subject. However, using predictive analytics, we can simulate intelligent behavior and provide a good approximation. An example of how to turn data into knowledge is shown in the following figure:
Data is the plural of datum, so strictly speaking it is treated as plural. We can find data in all the situations of the world around us, whether structured or unstructured, continuous or discrete: in weather records, stock market logs, photo albums, music playlists, or our Twitter accounts. In fact, data can be seen as the essential raw material of any kind of human activity. According to the Oxford English Dictionary:
Data are known facts or things used as basis for inference or reckoning.
Categorical data are values or observations that can be sorted into groups or categories. There are two types of categorical values, nominal and ordinal. A nominal variable has no intrinsic ordering to its categories. For example, housing is a categorical variable with two categories (own and rent). An ordinal variable has an established ordering. For example, age can be treated as a variable with three ordered categories (young, adult, and elder).
Numerical data are values or observations that can be measured. There are two kinds of numerical values, discrete and continuous. Discrete data are values or observations that can be counted and are distinct and separate, for example, the number of lines of code in a program. Continuous data are values or observations that may take on any value within a finite or infinite interval, for example, an economic time series such as historic gold prices.
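The taxonomy above can be sketched in a few lines of Python. This is only an illustration: the `AgeGroup` enumeration, the `HOUSING` set, and the `kind_of` function are hypothetical names introduced here, and mapping Python's `int` to discrete and `float` to continuous is a simplifying assumption.

```python
from enum import IntEnum


class AgeGroup(IntEnum):
    """Ordinal categories: the integer values encode the established order."""
    YOUNG = 1
    ADULT = 2
    ELDER = 3


# Nominal categories have no intrinsic order, so a plain set is enough.
HOUSING = {"own", "rent"}


def kind_of(value):
    """Roughly label a single value with the taxonomy described above."""
    if isinstance(value, AgeGroup):          # checked first: IntEnum is also an int
        return "categorical (ordinal)"
    if isinstance(value, str) and value in HOUSING:
        return "categorical (nominal)"
    if isinstance(value, bool):              # checked before int: bool is also an int
        return "categorical (nominal)"
    if isinstance(value, int):
        return "numerical (discrete)"
    if isinstance(value, float):
        return "numerical (continuous)"
    return "unknown"


for sample in ("rent", AgeGroup.ADULT, 120, 1291.75):
    print(repr(sample), "->", kind_of(sample))

# Ordinal values can be compared; nominal values cannot be ranked.
print(AgeGroup.YOUNG < AgeGroup.ELDER)
```

The only design point worth noting is that ordinal categories carry an order that comparisons can use, while nominal categories support only equality tests.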
E-mails (unstructured, discrete)
Digital images (unstructured, discrete)
Stock market logs (structured, continuous)
Historic gold prices (structured, continuous)
Credit approval records (structured, discrete)
Social media friends and relationships (unstructured, discrete)
Tweets and trending topics (unstructured, continuous)
Sales records (structured, continuous)
For each of the projects in this book, we try to use a different kind of data. This book is trying to give the reader the ability to address different kinds of data problems.
When you have a good understanding of a phenomenon, it is possible to make predictions about it. Data analysis helps us to make this possible through exploring the past and creating predictive models.
The data analysis process is composed of the following steps:
The statement of problem
Obtain your data
Clean the data
Normalize the data
Transform the data
Validate your model
Visualize and interpret your results
Deploy your solution
All these activities can be grouped as shown in the following figure:
The problem definition starts with high-level questions, such as how to track differences in behavior between groups of customers, or what the gold price will be next month. Understanding the objectives and requirements from a domain perspective is the key to a successful data analysis project.
Types of data analysis questions are listed as follows:
Data preparation is about how to obtain, clean, normalize, and transform the data into an optimal dataset, trying to avoid any possible data quality issues such as invalid, ambiguous, out-of-range, or missing values. This process can take a lot of your time. In Chapter 2, Working with Data, we go into more detail about working with data, using OpenRefine to address the complicated tasks. Analyzing data that has not been carefully prepared can lead you to highly misleading results.
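As a minimal sketch of the clean and normalize steps, assuming a single numeric column: the `clean` and `min_max_normalize` functions, the age range of 0 to 120, and the sample values are all hypothetical and chosen only for illustration. Min-max scaling is one common normalization among several.

```python
def clean(values, low, high):
    """Drop missing (None), non-numeric, and out-of-range entries."""
    cleaned = []
    for v in values:
        try:
            x = float(v)
        except (TypeError, ValueError):
            continue  # missing or invalid value
        if low <= x <= high:
            cleaned.append(x)
    return cleaned


def min_max_normalize(values):
    """Rescale values linearly to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw column with typical quality issues:
# a number stored as text, a missing value, an invalid string,
# and two out-of-range values.
raw_ages = [23, "41", None, "n/a", -5, 230, 35.0, 59]

ages = clean(raw_ages, low=0, high=120)
scaled = min_max_normalize(ages)
print(ages)
print(scaled)
```

Dropping bad records is the simplest policy; depending on the problem, imputing or correcting them may be more appropriate.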
The characteristics of good data are listed as follows:
Data exploration is essentially looking at the data in a graphical or statistical form in order to find patterns, connections, and relations in the data. Visualization is used to provide overviews in which meaningful patterns may be found.
In Chapter 3, Data Visualization, we present a visualization framework (D3.js) and we implement some examples of how to use visualization as a data exploration tool.
Predictive modeling is a process used in data analysis to create or choose a statistical model that tries to best predict the probability of an outcome. In this book, we use a variety of those models, and we can group them into three categories based on their outcome:
Categorical outcome (Classification)
Naïve Bayes Classifier
Natural Language Toolkit + Naïve Bayes Classifier
Numerical outcome (Regression)
Support Vector Machines
Distance Based Approach + k-nearest neighbor
Descriptive modeling (Clustering)
Fast Dynamic Time Warping (FDTW) + Distance Metrics
Force Layout and Fruchterman-Reingold layout
The No Free Lunch Theorem proposed by Wolpert in 1996 stated:
"No Free Lunch theorems have shown that learning algorithms cannot be universally good."
Model evaluation helps us to ensure that our analysis is not over-optimistic or over-fitted. In this book, we are going to present two different ways to validate the model:
Cross-validation: We divide the data into subsets of equal size and test the predictive model in order to estimate how it is going to perform in practice. We will implement cross-validation in order to validate the robustness of our model as well as evaluate multiple models to identify the best model based on their performance.
Hold-Out: Typically, a large dataset is randomly divided into three subsets: training set, validation set, and test set.
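Both validation schemes can be sketched with the standard library alone. This is an illustrative sketch, not the implementation used later in the book: the function names `k_fold_indices` and `hold_out_split` are hypothetical, and the 60/20/20 hold-out proportions are an assumption, since the text does not fix them.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def hold_out_split(n, train_frac=0.6, valid_frac=0.2, seed=0):
    """Randomly split indices into training, validation, and test sets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * train_frac)
    n_valid = round(n * valid_frac)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test


# Cross-validation: every record is used for testing exactly once.
splits = list(k_fold_indices(n=10, k=5))
for train_idx, test_idx in splits:
    print("train:", sorted(train_idx), "test:", sorted(test_idx))

# Hold-out: one fixed partition of the same ten records.
train, valid, test = hold_out_split(n=10)
print(len(train), len(valid), len(test))
```

In practice, a model would be fitted on each training part and scored on the matching test part, and the scores averaged across folds.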
How are the results going to be presented?
For example, in tabular reports, 2D plots, dashboards, or infographics.
Where are they going to be deployed?
For example, as printed hard copy, on a poster, on mobile devices, in a desktop interface, or on the web.
Each choice will depend on the kind of analysis and the particular data. In the following chapters, we will learn how to use standalone plotting in Python with matplotlib and web visualization with D3.js.
Quantitative and qualitative analysis can be defined in terms of the data they work with:
Quantitative data: Numerical measurements expressed in terms of numbers
Qualitative data: Categorical measurements expressed in terms of natural language descriptions
As shown in the following figure, we can observe the differences between quantitative and qualitative analysis:
Nominal: Data has no logical order and is used as classification data
Ordinal: Data has a logical order and differences between values are not constant
Interval: Data is continuous and has a logical order. The data has standardized differences between values, but has no true zero point
Ratio: Data is continuous with a logical order as well as regular interval differences between values, and has a true zero point
Qualitative analysis can explore the complexity and meaning of social phenomena. Data for qualitative study may include written texts (for example, documents or email) and/or audible and visual data (for example, digital images or sounds). In Chapter 11, Sentiment Analysis of Twitter Data, we present a sentiment analysis from Twitter data as an example of qualitative analysis.
The goal of data visualization is to expose something new about the underlying patterns and relationships contained within the data. The visualization not only needs to look good but also needs to be meaningful in order to help organizations make better decisions. Visualization is an easy way to jump into a complex dataset (small or big) to describe and explore the data efficiently.
Many kinds of data visualizations are available such as bar chart, histogram, line chart, pie chart, heat maps, frequency Wordle (as shown in the following figure) and so on, for one variable, two variables, and many variables in one, two, or three dimensions.
Data visualization is an important part of our data analysis process because it is a fast and easy way to do exploratory data analysis by summarizing the main characteristics of the data in a visual graph.
The goals of exploratory data analysis are listed as follows:
Detection of data errors
Checking of assumptions
Finding hidden patterns (such as tendency)
Preliminary selection of appropriate models
Determining relationships between the variables
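Some of these goals can be approached numerically before any plot is drawn. The following sketch uses Python's `statistics` module to summarize one variable and to flag suspicious values. The `summarize` and `flag_outliers` helpers and the temperature readings are hypothetical, and the interquartile-range rule with a factor of 1.5 is a common convention rather than something prescribed here.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def flag_outliers(values, k=1.5):
    """Flag values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


# Hypothetical room temperatures in Celsius; one reading looks like
# a data entry error (perhaps a value recorded in Fahrenheit).
temperatures = [21.5, 22.0, 21.8, 22.4, 21.9, 98.6, 22.1, 21.7]

summary = summarize(temperatures)
suspicious = flag_outliers(temperatures)
print(summary)
print(suspicious)
```

Comparing the mean with the median is already informative: a large gap between the two suggests a skewed distribution or erroneous values.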
We will get into more detail about data visualization and exploratory data analysis in Chapter 3, Data Visualization.
Big data is a term used when the data exceeds the processing capacity of a typical database. We need big data analytics when the data grows quickly and we need to uncover hidden patterns, unknown correlations, and other useful information.
Volume: Large amounts of data
Variety: Different types of structured, unstructured, and multi-structured data
Velocity: Needs to be analyzed quickly
As shown in the following figure, we can see the interaction between the three Vs:
Big data is the opportunity for any company to gain advantages from data aggregation, data exhaust, and metadata. This makes big data a useful business analytic tool, but there is a common misunderstanding about what big data is.
Apache Hadoop is the most popular implementation of MapReduce to solve large-scale distributed data storage, analysis, and retrieval tasks. However, MapReduce is just one of the three classes of technologies for storing and managing big data. The other two classes are NoSQL and massively parallel processing (MPP) data stores. In this book, we implement MapReduce functions and NoSQL storage through MongoDB; see Chapter 12, Data Processing and Aggregation with MongoDB, and Chapter 13, Working with MapReduce.
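The MapReduce idea itself fits in a few lines of plain Python. The sketch below runs in a single process and does not use Hadoop or MongoDB; the function names and the two sample documents are hypothetical. It only shows the three phases (map, shuffle, reduce) on the classic word count problem.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    """Run the three phases in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


documents = ["big data big models", "simple models more data"]
counts = map_reduce(documents)
print(counts)
```

Because each call to `map_phase` and each call to `reduce_phase` is independent of the others, a real framework can distribute them across many machines.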
MongoDB provides us with document-oriented storage, high availability, and map/reduce flexible aggregation for data processing.
A paper published by the IEEE in 2009, The Unreasonable Effectiveness of Data states:
But invariably, simple models and a lot of data trump over more elaborate models based on less data.
This is a fundamental idea in big data (you can find the full paper at http://bit.ly/1dvHCom). The trouble with real world data is that the probability of finding false correlations is high and gets higher as the datasets grow. That's why, in this book, we focus on meaningful data instead of big data.
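One way to see how false correlations multiply is to generate variables that are unrelated by construction and count how many pairs still look correlated. The sketch below is a simulation under stated assumptions: the variables are independent Gaussian noise, each has only ten observations, "growth" is read as adding more variables, and the 0.6 threshold and the helper names are arbitrary choices made for illustration.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def count_spurious(variables, threshold=0.6):
    """Count pairs of variables whose |correlation| reaches the threshold."""
    return sum(
        1
        for x, y in combinations(variables, 2)
        if abs(pearson(x, y)) >= threshold
    )


rng = random.Random(42)
# 40 independent noise variables with 10 observations each.
variables = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(40)]

# The same dataset, seen with 5, 20, and 40 of its variables.
counts = {m: count_spurious(variables[:m]) for m in (5, 20, 40)}
print(counts)
```

The number of pairs grows quadratically with the number of variables, so the count of chance correlations can only stay the same or rise as variables are added.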
One of the main challenges for big data is how to store, protect, back up, organize, and catalog the data at petabyte scale. Another main challenge of big data is the concept of data ubiquity. With the proliferation of smart devices with several sensors and cameras, the amount of data available for each person increases every minute. Big data must process all this data in real time.
Interaction with the outside world is highly important in data analysis. Using sensors such as RFID (Radio-frequency identification) or a smartphone to scan a QR code (Quick Response Code) is an easy way to interact directly with the customer, make recommendations, and analyze consumer trends.
On the other hand, people are using their smartphones all the time, using their cameras as a tool. In Chapter 5, Similarity-based Image Retrieval, we will use these digital images to perform search by image. This can be used, for example, in face recognition or to find reviews of a restaurant just by taking a picture of the front door.
The interaction with the real world can give you a competitive advantage and a real-time data source directly from the customer.
Formally, social network analysis (SNA) analyzes social relationships in terms of network theory, with nodes representing individuals and ties representing relationships between the individuals, as we can see in the following figure. The social network creates groups of related individuals (friendship) based on different aspects of their interaction. We can find important information such as hobbies (for product recommendation) or who has the most influential opinion in the group (centrality). In Chapter 10, Working with Social Graphs, we will present a project that asks who your closest friend is, and we will show a solution for Twitter clustering.
Social networks are strongly connected, and these connections are often not symmetric. This makes SNA computationally expensive, so it needs to be addressed with high-performance solutions that are less statistical and more algorithmic.
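A small directed graph makes the ideas of nodes, ties, asymmetry, and centrality concrete. The sketch below is not the method used in Chapter 10: the names and the "follows" ties are hypothetical, and normalized in-degree is only one of several centrality measures.

```python
from collections import defaultdict

# Hypothetical "follows" ties as (follower, followed) pairs.
# The ties are directed: dan follows ben, but ben does not follow dan.
ties = [
    ("ana", "ben"),
    ("carla", "ben"),
    ("dan", "ben"),
    ("ben", "ana"),
    ("dan", "carla"),
]


def in_degree_centrality(ties):
    """Fraction of the other nodes that point at each node."""
    nodes = {node for tie in ties for node in tie}
    in_degree = defaultdict(int)
    for _, followed in ties:
        in_degree[followed] += 1
    denominator = max(len(nodes) - 1, 1)
    return {node: in_degree[node] / denominator for node in nodes}


centrality = in_degree_centrality(ties)
most_influential = max(centrality, key=centrality.get)
print(centrality)
print(most_influential)
```

In-degree is cheap to compute; measures that depend on paths through the whole graph, such as betweenness, are where the computational cost mentioned above becomes significant.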
The visualization of a social network can help us to get a good insight into how people are connected. The exploration of the graph is done through displaying nodes and ties in various colors, sizes, and distributions. The D3.js library has animation capabilities that enable us to visualize the social graph with an interactive animation. These help us to simulate behaviors such as information diffusion or distance between nodes.
Facebook processes more than 500 TB of data daily (images, text, video, likes, and relationships). This amount of data needs non-conventional treatment such as NoSQL databases and MapReduce frameworks. In this book, we work with MongoDB, a document-based NoSQL database that also has great functions for aggregation and MapReduce processing.
The main goal of this book is to provide the reader with self-contained projects that are ready to deploy. To do this, as you go through the book you will use and implement tools such as Python, D3, and MongoDB. These tools will help you to program and deploy the projects. You can also download all the code from the author's GitHub repository at https://github.com/hmcuesta.
You can see a detailed installation and setup process of all the tools in Appendix, Setting Up the Infrastructure.
Python is a scripting language: an interpreted language with its own built-in memory management and good facilities for calling and cooperating with other programs. There are two popular versions, 2.7 and 3.x. In this book, we will focus on the 3.x version because it is under active development and has already seen over two years of stable releases.
Python is multi-platform: it runs on Windows, Linux/Unix, and Mac OS X, and has been ported to the Java and .NET virtual machines. Python has powerful standard libraries and a wealth of third-party packages for numerical computation and machine learning, such as mlpy, and so on.
Python is excellent for beginners, yet great for experts and is highly scalable—suitable for large projects as well as small ones. Also it is easily extensible and object-oriented.
Python is widely used by organizations such as Google, Yahoo Maps, NASA, RedHat, Raspberry Pi, IBM, and so on.
A list of organizations using Python is available at http://wiki.python.org/moin/OrganizationsUsingPython.
Python has excellent documentation and examples at http://docs.python.org/3/.
Python is free to use, even for commercial products. It can be downloaded for free from http://python.org/.
mlpy (Machine Learning Python) is a Python module built on top of NumPy, SciPy, and the GNU Scientific Libraries. It is open source and supports Python 3.x. The mlpy module has a large number of machine learning algorithms for supervised and unsupervised problems.
We will perform text classification with Naive Bayes
We can download the latest version of mlpy from http://mlpy.sourceforge.net/.
For reference, see the paper mlpy: Machine Learning Python (http://arxiv.org/abs/1202.6548) submitted in 2012 by D. Albanese, R. Visintainer, S. Merler, S. Riccadonna, G. Jurman, and C. Furlanello.
With D3.js, you can manipulate all the elements of the DOM (Document Object Model); it is as flexible as the client-side web technology stack (HTML, CSS, and SVG).
You can download the latest version of D3.js from http://d3js.org/d3.v3.zip.
NoSQL (Not only SQL) is a term that covers different types of data storage technologies, used when you can't fit your business model into a classical relational data model. NoSQL is mainly used in Web 2.0 and in social media applications.
MongoDB is a document-based database. This means that MongoDB stores and organizes the data as a collection of documents, which lets you store your view models almost exactly as you model them in the application. You can also perform complex searches for data and elementary data mining with MapReduce.
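To make the document model concrete without requiring a running server, the sketch below imitates a collection with plain Python dictionaries. It does not use MongoDB or any of its drivers: the `tweets` list, its fields, and the `find` helper are hypothetical stand-ins that only illustrate the shape of the data.

```python
import json

# A "collection" is a group of documents. Documents in the same
# collection may have different fields and may nest other structures.
tweets = [
    {
        "user": "ana",
        "text": "I love data",
        "tags": ["data"],
        "geo": {"city": "Dallas"},
    },
    {
        "user": "ben",
        "text": "MapReduce rocks",
        "tags": ["bigdata", "mapreduce"],
    },
]


def find(collection, **criteria):
    """Return documents whose top-level fields equal all given criteria."""
    return [
        doc
        for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


result = find(tweets, user="ben")
print(json.dumps(result, indent=2))
```

Note that the second document has no `geo` field at all. A relational table would need a nullable column or a separate table for this, which is the kind of mismatch the document model avoids.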
MongoDB is used by highly recognized corporations such as Foursquare, Craigslist, Firebase, SAP, and Forbes. We can see a detailed list at http://www.mongodb.org/about/production-deployments/.
MongoDB has a big and active community and well-written documentation at http://docs.mongodb.org/manual/.
MongoDB is easy to learn and it's free. We can download MongoDB from http://www.mongodb.org/downloads.
In this chapter, we presented an overview of the data analysis ecosystem, explaining basic concepts of the data analysis process and tools, and giving some insight into the practical applications of data analysis. We also provided an overview of the different kinds of data: numerical and categorical. We looked at the nature of data, structured (databases, logs, and reports) and unstructured (image collections, social networks, and text mining). Then, we introduced the importance of data visualization and how a fine visualization can help us in exploratory data analysis. Finally, we explored some of the concepts of big data and social network analysis.
In the next chapter, we will work with data, cleaning, processing, and transforming, using Python and OpenRefine.