You're reading from  Predictive Analytics Using Rattle and Qlik Sense

Product type: Book
Published in: Jun 2015
Reading level: Intermediate
ISBN-13: 9781784395803
Edition: 1st Edition
Authors (2):
Ferran Garcia Pagans

Ferran Garcia Pagans studied software engineering at the University of Girona and Ramon Llull University. After that, he completed a master's in business administration at ESADE Business School. He has 16 years of experience in the software industry, where he has helped customers from different industries to create software solutions. He started his career at Ramon Llull University as a teacher and researcher. Then, he moved to the Volkswagen group as a software developer. After that, he worked with Oracle as a Java, SOA, and BPM specialist. Currently, he is a solution architect at Qlik, where he helps customers to achieve competitive advantages with data applications.

Fernando G Pagans

Chapter 3. Exploring and Understanding Your Data

In the previous chapter, we explained how to load data and how to transform it using Rattle. In this chapter, we're going to learn how to use Rattle to:

  • Summarize dataset characteristics

  • Identify missing values in the data

  • Create charts to represent data point distributions

We have two main objectives when we explore data: to understand the problem we want to solve, and to understand the structure of the dataset so that we can choose the most appropriate predictive technique.

If you are a business analyst, Qlik Sense is a great tool to explore and understand your data. With Qlik Sense, you can find relationships between customers, products, and salespeople in a very intuitive way. In the next chapter, we're going to learn how to use Qlik Sense to load and explore data.

As some predictive techniques are based on statistics, if you are preparing a dataset to apply a predictive technique, you would probably prefer a more formal or...

Text summaries


The Summary option in the Explore tab provides us with several descriptive statistics reports: Summary, Describe, Basics, Kurtosis, and Skewness. Descriptive statistics covers methods to summarize data. The Summary option also provides a very useful Show Missing report.
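To make the idea of a missing-values report concrete, here is a minimal Python sketch that counts missing entries per variable. The sample records and the function name count_missing are hypothetical, chosen only for illustration; this is not how Rattle builds its Show Missing report.

```python
# Hypothetical sample records; None marks a missing value.
records = [
    {"Age": 22, "Cabin": None, "Embarked": "S"},
    {"Age": None, "Cabin": "C85", "Embarked": "C"},
    {"Age": 26, "Cabin": None, "Embarked": None},
    {"Age": 35, "Cabin": "C123", "Embarked": "S"},
]


def count_missing(rows):
    """Return a dict mapping each variable name to its number of missing values."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts.setdefault(name, 0)
            if value is None:
                counts[name] += 1
    return counts


missing = count_missing(records)
for name, n in missing.items():
    print(f"{name}: {n} missing of {len(records)}")
```

A per-variable count like this is usually enough to decide which variables need attention before modeling.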

Summary reports

Rattle provides us with these summary reports:

  • Summary

  • Describe

  • Basics

  • Kurtosis

  • Skewness

These reports summarize variable distributions and help to give an initial understanding of our data. In order to understand these reports, you only need a basic understanding of descriptive statistics.
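As an illustration of what skewness and kurtosis measure, the following Python sketch uses the simple moment-based (population) definitions. This is an assumption made for clarity; Rattle's reports may use different estimators, so the numbers need not match exactly. The sample lists are hypothetical.

```python
import statistics


def skewness(values):
    """Moment-based skewness: the mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((x - mean) / sd) ** 3 for x in values) / len(values)


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores about 0."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((x - mean) / sd) ** 4 for x in values) / len(values) - 3.0


symmetric = [1, 2, 3, 4, 5]          # balanced around its mean
right_tailed = [1, 1, 2, 2, 3, 20]   # one large value stretches the right tail

print("skewness (symmetric):", skewness(symmetric))
print("skewness (right-tailed):", skewness(right_tailed))
print("excess kurtosis (symmetric):", excess_kurtosis(symmetric))
```

A skewness near zero suggests a symmetric distribution, while a positive value points to a longer right tail. Both functions divide by the standard deviation, so they fail on constant data.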

Measures of central tendency – mean, median, and mode

For a variable, a measure of central tendency describes the center of the distribution as follows:

  • Mean: The mean is the average and is the best central tendency measure if the distribution is normal.

  • Median: Half of the observations have a lower value than the median and the other half have a higher value. This is a good measure if there are extreme...
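The measures of central tendency named in this section can be computed with Python's standard statistics module. This is a small sketch on hypothetical ages, with one extreme value included to show how it pulls the mean away from the median.

```python
import statistics

# Hypothetical ages with one extreme value (80) to show its effect on the mean.
ages = [22, 24, 24, 25, 27, 29, 80]

mean_age = statistics.mean(ages)      # pulled upward by the extreme value
median_age = statistics.median(ages)  # middle observation of the sorted data
mode_age = statistics.mode(ages)      # most frequent value

print(f"mean={mean_age:.1f} median={median_age} mode={mode_age}")
```

Here the mean is 33 while the median is 25: the single value of 80 shifts the mean but leaves the median where most observations sit.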

Visualizing distributions


In the last section, we discussed distributions and we saw some measures that describe them. In this section, we're going to see how to visualize distributions. Visualizations are more intuitive than numeric measures and they will help us to understand our data.

Rattle offers two different sets of charts depending on the nature of the variables. For numeric variables, we can use Box Plot, Histogram, Cumulative, and Benford charts. For categorical variables, Rattle provides us with Bar Plot, Dot Plot, and Mosaic charts. We're going to explore the most common visual representations.
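To show what a box plot and a histogram summarize, the following Python sketch computes the five values a box plot draws and the bin counts behind a histogram. The ages, the bin width, and the function names are hypothetical; Rattle's charts may use different quartile and binning rules.

```python
import statistics
from collections import Counter

# Hypothetical ages; not taken from the Titanic passenger list.
ages = [2, 8, 19, 21, 22, 24, 26, 28, 30, 31, 35, 38, 42, 47, 54, 61]
BIN_WIDTH = 10  # assumed bin width, chosen for illustration


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box plot draws."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


def histogram_counts(values, bin_width):
    """Count observations per bin; each key is the bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


summary = five_number_summary(ages)
bins = histogram_counts(ages, BIN_WIDTH)

print("min, Q1, median, Q3, max:", summary)
for lower, count in bins.items():
    # One '#' per observation gives a rough text histogram.
    print(f"{lower:>3}-{lower + BIN_WIDTH - 1:<3} {'#' * count}")
```

The text histogram makes the shape of the distribution visible at a glance, which is the same job the Histogram chart does graphically.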

Before using Rattle to plot charts, make sure that the Advanced Graphics option is unchecked. With this option checked, some charts like histograms will not be plotted. This is shown in the following screenshot:

Numeric variables

We're going to use the Age variable of the Titanic passenger list to show the different types of charts for numeric variables. Load the dataset, set the variable Survived...

Correlations among input variables


An important step is to identify relationships among input variables. To measure the strength of a linear relationship, we use the correlation coefficient, a number between -1 and +1. When two variables have a correlation coefficient close to +1, they have a strong positive correlation; a coefficient of exactly +1 indicates a perfect positive linear fit. A positive correlation means that the two variables tend to increase and decrease together. A coefficient close to -1 indicates a strong negative correlation: the value of one variable tends to increase when the value of the other decreases. A coefficient close to 0 indicates a weak correlation, meaning there is little or no linear relationship between the variables; they may still be related in a nonlinear way.
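The behavior of the correlation coefficient can be checked with a short Python sketch. It assumes the Pearson coefficient, which is the usual default but is not confirmed by the text above; the function name pearson and the sample data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either sequence is constant.
    return cov / math.sqrt(var_x * var_y)


x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # moves with x
falling = [10, 8, 6, 4, 2]   # moves against x
u_shaped = [4, 1, 0, 1, 4]   # depends on x, but not linearly

print("rising:  ", pearson(x, rising))
print("falling: ", pearson(x, falling))
print("u-shaped:", pearson(x, u_shaped))
```

The U-shaped case is worth noting: its values are fully determined by x, yet the coefficient is zero because the relationship is not linear.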

Coming back to the Titanic passenger list, I've selected the Explore tab, the Correlation...

Further learning


In this chapter, we've introduced some EDA measures. If you want a more extensive EDA introduction, I recommend the Exploratory Data Analysis course on Coursera – www.coursera.org/course/exdata.

If you prefer going to the source, Exploratory Data Analysis, by John W. Tukey, is for you.

Wikipedia offers some useful insights into these EDA statistics concepts.

Summary


This chapter was divided into three main sections depending on how we are looking at data – tables, text summaries, and charts.

When we looked at text summaries, we introduced the Summary, Describe, Basics, Kurtosis, and Skewness reports. To understand these reports, we needed to recall some basic statistics concepts such as mean, median, mode, range, quartile, interquartile range, variance, and standard deviation.

In this chapter, we also introduced some important charts – histograms, correlations, Box Plot, and Bar Plot.

In the next chapter, we'll learn how to load data into Qlik Sense and how to create data visualizations. We'll use some of the charts we introduced in this chapter. You'll see that Qlik Sense is more powerful for business users who want to understand their data and create graphical representations of it. Rattle and R are tools closer to statistics, and some features, such as correlation analysis, are very powerful; for this reason, we've introduced EDA using...
