Packt+ | Advance your knowledge in tech

You're reading from Java for Data Science

Product typeBook

Published inJan 2017

Reading LevelIntermediate

PublisherPackt

ISBN-139781785280115

Edition1st Edition

Languages

Java

Tools

Weka Deeplearning4j

Concepts

Data Science

Authors (2):

Richard M. Reese

Jennifer L. Reese

View More author details

Chapter 5. Statistical Data Analysis Techniques

The intent of this chapter is not to make the reader an expert in statistical techniques. Rather, it is to familiarize the reader with the basic statistical techniques in use and demonstrate how Java can support statistical analysis. While there are quite a variety of data analysis techniques, in this chapter, we will focus on the more common tasks.

These techniques range from the relatively simple mean calculation to sophisticated regression analysis models. Statistical analysis can be a very complicated process and requires significant study to be conducted properly. We will start with an introduction to basic statistical analysis techniques, including calculating the mean, median, mode, and standard deviation of a dataset. There are numerous approaches used to calculate these values, which we will demonstrate using standard Java and third-party APIs. We will also briefly discuss sample size and hypothesis testing.

Regression analysis is an...

Working with mean, mode, and median

The mean, median, and mode are basic ways to describe characteristics or summarize information from a dataset. When a new, large dataset is first encountered, it can be helpful to know basic information about it to direct further analysis. These values are often used in later analysis to generate more complex measurements and conclusions. This can occur when we use the mean of a dataset to calculate the standard deviation, which we will demonstrate in the Standard deviation section of this chapter.

Calculating the mean

The term mean, also called the average, is computed by adding values in a list and then dividing the sum by the number of values. This technique is useful for determining the general trend for a set of numbers. It can also be used to fill in missing data elements. We are going to examine several ways to calculate the mean for a given set of data using standard Java libraries as well as third-party APIs.

Using simple Java techniques to find...

Standard deviation

Standard deviation is a measurement of how values are spread around the mean. A high deviation means that there is a wide spread, whereas a low deviation means that the values are more tightly grouped around the mean. This measurement can be misleading if there is not a single focus point or there are numerous outliers.

We begin by showing a simple example using basic Java techniques. We are using our testData array from previous examples, duplicated here:

double[] testData = {12.5, 18.3, 11.2, 19.0, 22.1, 14.3, 16.2, 12.5,
   17.8, 16.5, 11.2};

Before we can calculate the standard deviation, we need to find the average. We could use any of our techniques listed in the Calculating the mean section, but we will add up our values and divide by the length of testData for simplicity's sake:

int sum = 0; 
for(double value : testData){ 
   sum += value; 
} 
double mean = sum/testData.length;

Next, we create a variable, sdSum,...

Sample size determination

Sample size determination involves identifying the quantity of data required to conduct accurate statistical analysis. When working with large datasets it is not always necessary to use the entire set. We use sample size determination to ensure we choose a sample small enough to manipulate and analyze easily, but large enough to represent our population of data accurately.

It is not uncommon to use a subset of data to train a model and another subset is used to test the model. This can be helpful for verifying accuracy and reliability of data. Some common consequences for a poorly determined sample size include false-positive results, false-negative results, identifying statistical significance where none exists, or suggesting a lack of significance where it is actually present. Many tools exist online for determining appropriate sample sizes, each with varying levels of complexity. One simple example is available at https://www.surveymonkey.com/mp/sample-size-calculator...

Hypothesis testing

Hypothesis testing is used to test whether certain assumptions, or premises, about a dataset could not happen by chance. If this is the case, then the results of the test are considered to be statistically significant.

Performing hypothesis testing is not a simple task. There are many different pitfalls to avoid such as the placebo effect or the observer effect. In the former, a participant will attain a result that they think is expected. In the observer effect, also called the Hawthorne effect, the results are skewed because the participants know they are being watched. Due to the complex nature of human behavior analysis, some types of statistical analysis are particularly subject to skewing or corruption.

The specific methods for performing hypothesis testing are outside the scope of this book and require a solid background in statistical processes and best practices. Apache Commons provides a package, org.apache.commons.math3.stat.inference, with tools for performing...

Regression analysis

Regression analysis is useful for determining trends in data. It indicates the relationship between dependent and independent variables. The independent variables determine the value of a dependent variable. Each independent variable can have either a strong or a weak effect on the value of the dependent variable. Linear regression uses a line in a scatterplot to show the trend. Non-linear regression uses some sort of curve to depict the relationships.

For example, there is a relationship between blood pressure and various factors such as age, salt intake, and Body Mass Index (BMI). The blood pressure can be treated as the dependent variable and the other factors as independent variables. Given a dataset containing these factors for a group of individuals we can perform regression analysis to see trends.

There are several types of regression analysis supported by Java. We will be examining simple linear regression and multiple linear regression. Both approaches take a dataset...

Summary

In this chapter, we provided a brief introduction to the basic statistical analysis techniques you may encounter in data science applications. We started with simple techniques to calculate the mean, median, and mode of a set of numerical data. Both standard Java and third-party Java APIs were used to show how these attributes are calculated. While these techniques are relatively simple, there are some issues that need to be considered when calculating them.

Next, we examined linear regression. This technique is more predictive in nature and attempts to calculate other values in the future or past based on a sample dataset. We examine both simple linear regression and multiple regression and used Apache Commons classes to perform the regression and JavaFX to draw graphs.

Simple linear regression uses a single independent variable to predict a dependent variable. Multiple regression uses more than one independent variable. Both of these techniques have statistical attributes used to...

The rest of the chapter is locked

You have been reading a chapter from

Java for Data Science

Published in: Jan 2017Publisher: PacktISBN-13: 9781785280115

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime

Authors (2)

Richard M. Reese

Richard Reese has worked in the industry and academics for the past 29 years. For 10 years he provided software development support at Lockheed and at one point developed a C based network application. He was a contract instructor providing software training to industry for 5 years. Richard is currently an Associate Professor at Tarleton State University in Stephenville Texas. Richard is the author of various books and video courses some of which are as follows: Natural Language Processing with Java. Java for Data Science Getting Started with Natural Language Processing in Java
Read more about Richard M. Reese

Jennifer L. Reese

Jennifer L. Reese studied computer science at Tarleton State University. She also earned her M.Ed. from Tarleton in December 2016. She currently teaches computer science to high-school students. Her interests include the integration of computer science concepts with other academic disciplines, increasing diversity in computer science courses, and the application of data science to the field of education. She has co-authored two books: Java for Data Science and Java 7 New Features Cookbook. She previously worked as a software engineer. In her free time she enjoys reading, cooking, and traveling—especially to any destination with a beach. She is a musician and appreciates a variety of musical genres.
Read more about Jennifer L. Reese

Other recommended products

Related to this chapter

Natural Language Processing with Java Cookbook

This book will teach you how to perform basic and advanced NLP tasks in Java, using independent recipes. The book not only covers the essential aspects of NLP but also addresses other important areas such as the acquisition of text and techniques for utilizing NLP in varied domains

BookApr 2019386 pages

Natural Language Processing with Java

Natural Language Processing with Java will explore how to automatically organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. You will leverage the power of Java to extract relationships within different elements of text and documents.

BookJul 2018318 pages

Java Data Science Cookbook

Java has been one of the most popular languages for developers for several decades and yet the potential of the Java ecosystem still remains untapped when it comes to using JVM-based languages and platforms to solve data science related problems. A variety of tools and libraries are available such as Spark, Hadoop, and Mahout for computation and libraries such as MLlib, Weka, DL4j to implement smart data models. This book uncovers practically all these techniques in the form of recipes showing you how these tools and libraries can solve statistical, analytical, data mining, and information science related problems.

BookMar 2017372 pages

Machine Learning in Java

Machine Learning in Java will provide you with the techniques and tools you need to quickly gain insight from complex data. You will start by learning how to apply machine learning methods to a variety of common tasks including classification, prediction, forecasting, market basket analysis, and clustering.

BookNov 2018300 pages

Java Deep Learning Projects

You will build full-fledged, deep learning applications with Java and different open-source libraries. Master numerical computing, deep learning, and the latest Java programming features to carry out complex advanced tasks. This book is filled with best practices/tips after every project to help you optimize your deep learning models with ease.

BookJun 2018436 pages

Java Deep Learning Cookbook

Deep Learning is a trending topic in AI currently, as it allows you to make faster and more accurate predictions using the power of neural networks. This book will teach you the process of neural network design, and show you how to develop efficient deep learning applications using Deeplearning4j through practical and easy to implement recipes.

BookNov 2019304 pages

Deep Learning with Hadoop

BookFeb 2017206 pages

Mastering Java for Data Science

Java is the most wide-spread programming language nowadays, and you fill find it everywhere, from small startup companies to large enterprises. It is also a common choice for developing Data Science applications thanks to Java's prevalence and rich data processing toolbox. This book will explain how to use Java for Data Science, overview the available Machine Learning libraries, and cover many topics including supervised and unsupervised learning, natural language processing, deep learning, and big data

BookApr 2017364 pages

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages