
You're reading from  Hands-On Data Analysis with Scala

1st Edition, published by Packt in May 2019
ISBN-13: 9781789346114
Author: Rajesh Gupta

Rajesh is a hands-on Big Data Tech Lead and Enterprise Architect with extensive experience across the full software development life cycle. He has successfully architected, developed, and deployed highly scalable data solutions using the Spark, Scala, and Hadoop technology stack for several enterprises. A passionate, hands-on technologist, Rajesh holds master's degrees in Mathematics and Computer Science from BITS, Pilani (India).

Applying Statistics and Hypothesis Testing

This chapter provides an overview of statistical methods used in data analysis and covers techniques for deriving meaningful insights from data. We will first look at some basic statistical techniques used to gain a better understanding of data before moving on to more advanced methods that are used to compute statistics on vectorized data instead of simple scalar data.

This chapter also covers the various techniques for generating random numbers. Random numbers play a significant part in data analysis because they help us work with sample data in much smaller datasets. A good random sample selection ensures that smaller datasets can act as a good representative of the much bigger dataset.

We will also gain an understanding of hypothesis testing and look at some Scala tools readily available to make this task easier.

The following are...

Basics of statistics

This section introduces the basics of statistics using applied examples.

Summary level statistics

Summary level statistics provide information such as the minimum, maximum, and mean values of the data.

The following Spark example summarizes the numbers from 1 to 100:

  1. Start a Spark shell in your Terminal:
$ spark-shell
  2. Import Random from Scala's util package:
scala> import scala.util.Random
import scala.util.Random
  3. Generate the integers from 1 to 100 (inclusive) and use the shuffle method of Scala's Random utility class to randomize their positions:
scala> val nums = Random.shuffle(1 to 100) // 100 numbers randomized
nums: scala.collection.immutable.IndexedSeq[Int] = Vector(70, 63...
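
The summary statistics themselves don't require Spark: shuffling changes only the order of the values, not their minimum, maximum, or mean. A minimal plain-Scala sketch of the same idea (object and value names are illustrative):

```scala
import scala.util.Random

object SummaryStats {
  def main(args: Array[String]): Unit = {
    // The integers 1 to 100 in randomized order, as in the Spark shell session
    val nums = Random.shuffle(1 to 100)

    // Order does not affect summary level statistics
    val min  = nums.min                      // 1
    val max  = nums.max                      // 100
    val mean = nums.sum.toDouble / nums.size // 50.5

    println(s"min=$min max=$max mean=$mean")
  }
}
```

Because the underlying values are always 1 to 100, every run prints the same summary even though the ordering differs.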

Vector level statistics

In the previous section, we looked at statistics for columns containing a single numeric value. For machine learning (ML), a more common way to represent data is as vectors of multiple numeric values. A vector is a generalized structure that consists of one or more elements of the same data type. For example, each of the following rows is a vector of three elements of type double:

[2.0,3.0,5.0]
[4.0,6.0,7.0]

Computing statistics in the classic scalar way won't work for vectors. It is also quite common to have weights associated with these vectors, and at times the weights have to be considered as well when computing statistics on such data.

Spark MLLib's Summarizer (https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/stat/Summarizer.html) provides several convenient methods to compute stats on vector...
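
Summarizer itself runs against a SparkSession, but the core idea of a weighted statistic over vectors can be sketched in plain Scala. The following computes an element-wise weighted mean (the object and method names here are illustrative, not Spark's API):

```scala
object WeightedVectorMean {
  // Element-wise weighted mean: for each column i, sum(w_j * v_j(i)) / sum(w_j)
  def weightedMean(vectors: Seq[Array[Double]], weights: Seq[Double]): Array[Double] = {
    require(vectors.nonEmpty && vectors.size == weights.size)
    val dim  = vectors.head.length
    val sums = Array.fill(dim)(0.0)
    for ((v, w) <- vectors.zip(weights); i <- 0 until dim)
      sums(i) += v(i) * w
    sums.map(_ / weights.sum)
  }

  def main(args: Array[String]): Unit = {
    val vectors = Seq(Array(2.0, 3.0, 5.0), Array(4.0, 6.0, 7.0))

    // Equal weights reduce to the ordinary column-wise mean: [3.0, 4.5, 6.0]
    println(weightedMean(vectors, Seq(1.0, 1.0)).mkString("[", ",", "]"))

    // Weighting the second vector 3x pulls the mean toward it: [3.5, 5.25, 6.5]
    println(weightedMean(vectors, Seq(1.0, 3.0)).mkString("[", ",", "]"))
  }
}
```

The second call shows how weights can significantly alter the outcome: the same two vectors yield a noticeably different mean once one of them is weighted more heavily.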

Random data generation

Random data generation is useful for several purposes and plays a significant role in performance testing. This technique is also useful for generating synthetic data that can be used for various simulation experiment purposes. In fact, it is randomness that facilitates an unbiased sample selection from a large dataset.

We will look at random data generation with some specific properties:

  • Pseudorandom with no specific distribution
  • Normal distribution
  • Poisson distribution
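
As a preview of the last two items, normally distributed values are available directly via nextGaussian, while Poisson sampling has no built-in support in the standard library. The sketch below fills that gap with Knuth's multiplication method (this is an illustrative implementation, not code from the book):

```scala
import scala.util.Random

object Distributions {
  private val rng = new Random(42) // fixed seed so runs are reproducible

  // Sample from Normal(mu, sigma) by scaling the built-in standard Gaussian
  def normal(mu: Double, sigma: Double): Double =
    mu + sigma * rng.nextGaussian()

  // Sample from Poisson(lambda) using Knuth's multiplication method
  // (simple and adequate for small lambda; not suited to very large lambda)
  def poisson(lambda: Double): Int = {
    val limit = math.exp(-lambda)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  def main(args: Array[String]): Unit = {
    val normals  = Seq.fill(10000)(normal(10.0, 2.0))
    val poissons = Seq.fill(10000)(poisson(4.0))
    println(normals.sum / normals.size)             // close to 10.0
    println(poissons.sum.toDouble / poissons.size)  // close to 4.0
  }
}
```

With 10,000 samples, the empirical means land very close to the distribution parameters, which is a quick sanity check that the generators behave as intended.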

Pseudorandom numbers

Scala provides built-in support to generate pseudorandom numbers using the scala.util.Random class. Let's explore some features of this class using Scala REPL:

  1. Import the Random class from the scala.util...
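
The defining property of pseudorandom numbers, which the REPL session above begins to explore, is that a fixed seed makes the sequence fully reproducible. A small self-contained sketch (seed and bounds chosen for illustration):

```scala
import scala.util.Random

object PseudoRandomDemo {
  def main(args: Array[String]): Unit = {
    // Two generators seeded identically produce identical sequences
    val a = new Random(1234)
    val b = new Random(1234)
    val seqA = Seq.fill(5)(a.nextInt(100))
    val seqB = Seq.fill(5)(b.nextInt(100))
    println(seqA == seqB) // true

    // nextInt(n) is bounded: every value falls in [0, n)
    println(seqA.forall(x => x >= 0 && x < 100)) // true
  }
}
```

Reproducibility is what makes seeded pseudorandom generators so useful in testing and experiments: a run that exposes a bug can be replayed exactly.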

Hypothesis testing

Hypothesis testing is a statistical tool that is used for the following purposes:

  • Determining whether a result or model is statistically significant or not
  • Ensuring that a result or model did not occur by chance

A statistical hypothesis is used to establish a relationship in data using a sample set of observations. We can call this relationship a result or a model. The goal of hypothesis testing is to eliminate cases where a result occurs by chance. The null hypothesis, on the other hand, states that there is no statistically significant relationship, that is, that the observed result could have arisen by chance.

We typically start with a sample set of observations that consists of values associated with more than one variable. In the Basics of statistics section, we looked at properties of a single variable in isolation, except for Pearson's correlation methodology, where we measured the linear relationship...
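
The section's tooling discussion is truncated here, but to make concrete what a hypothesis test computes, the following plain-Scala sketch derives Welch's two-sample t statistic, one common measure of whether two sample means differ by more than chance would suggest (this is an illustrative example, not the book's code; the sample data is invented):

```scala
object TTest {
  // Welch's two-sample t statistic:
  //   t = (mean(x) - mean(y)) / sqrt(var(x)/n_x + var(y)/n_y)
  def welchT(x: Seq[Double], y: Seq[Double]): Double = {
    def mean(s: Seq[Double]): Double = s.sum / s.size
    def variance(s: Seq[Double]): Double = { // unbiased sample variance
      val m = mean(s)
      s.map(v => (v - m) * (v - m)).sum / (s.size - 1)
    }
    (mean(x) - mean(y)) / math.sqrt(variance(x) / x.size + variance(y) / y.size)
  }

  def main(args: Array[String]): Unit = {
    val control   = Seq(5.1, 4.9, 5.0, 5.2, 4.8)
    val treatment = Seq(5.6, 5.4, 5.7, 5.5, 5.8)
    // A large |t| suggests the difference in means is unlikely to be chance
    println(f"t = ${welchT(control, treatment)}%.2f")
  }
}
```

In practice, the statistic is compared against a t distribution to obtain a p-value; libraries such as Spark MLlib and Apache Commons Math handle that step.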

Summary

Statistics play an important role in the data analysis life cycle. This chapter provided an overview of basic statistics. We also learned how to extend basic statistical techniques and use them on data represented as vectors. In the vector-based stats, we gained some insight into how weights can significantly alter statistical outcomes. We also learned various techniques for random data generation, and, finally, we took a high-level view of how to perform hypothesis testing.

In the next chapter, we will focus on Spark, a distributed data analysis and processing framework.
