You're reading from Hands-On Data Analysis with Scala

Product type: Book

Published in: May 2019

Reading level: Expert

Publisher: Packt

ISBN-13: 9781789346114

Edition: 1st Edition

Languages

Scala

Concepts

Data Analysis

Author (1)

Rajesh Gupta

Traditional Machine Learning for Data Analysis

This chapter provides an overview of machine learning (ML) techniques for doing data analysis. In the previous chapters, we have explored some of the techniques that can be used by human beings to analyze and understand data. In this chapter, we look at how ML techniques could be used for similar purposes.

At the heart of ML are a number of algorithms that have proven highly effective at solving specific categories of problems. This chapter covers the following popular ML methods:

Decision trees
Random forests
Ridge and lasso regression
k-means cluster analysis

It also covers the role of natural language processing (NLP) in effectively analyzing certain types of data problems. The discussion in this chapter is limited to traditional machine learning methods. It does not cover newer methods such as deep...

ML overview

Let's first look at what ML is. In a traditional sense, in order to solve a computational problem, we typically write explicit computer instructions that solve the problem based on all of the possible scenarios. The assumption here is that all of the rules associated with the specific problem being solved are known and well-defined in advance and could be codified into computer instructions. This assumption, however, is not always true. There are times when the rules are not known in advance and it is impractical to define deterministic rules that could be applied to solve the problem.

Let's look at this problem using a concrete example of an app store where a consumer has the option of buying an app from a fairly large catalog of available apps. When the consumer logs into the app store, it displays a set of recommended apps that the consumer is highly...

Decision trees

As the name suggests, decision trees in ML build a tree-like structure with decision conditions on each branch. Conditions define the flow of the decision-making process. We can also think of decision trees as being similar to flow charts.

Decision trees are supervised ML algorithms, which means that they learn from labeled data. They can be used for classification as well as regression.
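To make the tree-with-conditions idea concrete, here is a minimal sketch in Scala. It only shows how a tree that has already been built is used to reach a decision; it does not learn anything from data. The names `DecisionTreeSketch`, `Node`, `Leaf`, and `classify`, as well as the feature names, thresholds, and labels, are hypothetical and chosen purely for illustration.

```scala
import scala.annotation.tailrec

// A hand-built decision tree: internal nodes hold a condition, leaves hold a label.
// Feature names, thresholds, and labels are hypothetical, not taken from the book.
object DecisionTreeSketch {
  type Features = Map[String, Double]

  sealed trait Tree
  final case class Leaf(label: String) extends Tree
  // If features(feature) <= threshold, follow `left`; otherwise follow `right`.
  final case class Node(feature: String, threshold: Double, left: Tree, right: Tree) extends Tree

  // Walk from the root to a leaf, taking one branch per condition.
  @tailrec
  def classify(tree: Tree, x: Features): String = tree match {
    case Leaf(label)      => label
    case Node(f, t, l, r) => if (x(f) <= t) classify(l, x) else classify(r, x)
  }

  // A supervised algorithm would learn these splits from labeled data;
  // here they are simply written by hand.
  val example: Tree =
    Node("height", 172.0,
      Leaf("small"),
      Node("weight", 198.0, Leaf("medium"), Leaf("large")))
}
```

Calling `DecisionTreeSketch.classify(DecisionTreeSketch.example, Map("height" -> 165.0, "weight" -> 190.0))` follows the left branch of the root, much like reading a flow chart from top to bottom.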

Implementing decision trees

Let's look at a simple example to understand and explore this concept. We have the following observations:

Age in Years	Height in Inches	Weight in Pounds	Gender	Shoe Size
25	180	200	M	12
35	165	190	F	9
20	175	195	M	11
70	170	200...

Random forest

Random forest is an easy-to-use and powerful ML algorithm. It is also a supervised algorithm and requires labeled data to learn from. In fact, the decision tree acts as the building block for the random forest algorithm. Just like the decision tree, the random forest ML algorithm can be used for classification as well as regression.

The fundamental motivation behind the random forest algorithm is to combine results from multiple random decision trees into a single model. One very nice outcome of the random forest algorithm is that it reduces the risk of overfitting the model to the training dataset.
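The combining step can be sketched as follows, under the assumption that each tree is already trained and is represented as a plain function from features to a prediction. The names `ForestVoteSketch`, `TreeModel`, `majorityVote`, and `average` are hypothetical. The sketch covers only how results are combined, not how the individual random trees are built.

```scala
// Combining the outputs of several already-trained trees into one prediction.
// Each tree is modeled as a function; this is an illustrative assumption.
object ForestVoteSketch {
  type Features = Map[String, Double]
  type TreeModel = Features => String

  // Classification: every tree votes and the most common label wins.
  // Assumes `trees` is non-empty; ties are broken arbitrarily.
  def majorityVote(trees: Seq[TreeModel], x: Features): String =
    trees.map(tree => tree(x)).groupBy(identity).maxBy(_._2.size)._1

  // Regression: average the numeric predictions of all trees.
  // Assumes `predictions` is non-empty.
  def average(predictions: Seq[Double]): Double =
    predictions.sum / predictions.size
}
```

Because a single tree's mistake can be outvoted by the others, the combined prediction tends to depend less on the quirks of any one tree.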

Random forest algorithms

The random forest algorithm can be summarized as follows:

Each decision tree in a random forest...

Ridge and lasso regression

Ridge and lasso regression are supervised linear regression ML algorithms. Both of these algorithms aim to reduce model complexity and prevent overfitting. When a training dataset has a large number of features or variables, the resulting model generally tends to be complex.

Characteristics of ridge regression

The key characteristics of ridge regression are as follows:

Coefficient shrinkage: This helps in reducing model complexity
Regularization: This adds information to prevent overfitting
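Coefficient shrinkage is easiest to see in the simplest possible case. The sketch below assumes a single feature and no intercept, where ridge regression minimizes sum((y - w*x)^2) + lambda * w^2 and the coefficient has the closed form w = sum(x*y) / (sum(x*x) + lambda). The names `RidgeSketch`, `ridgeCoefficient`, and `lambda` are hypothetical, and a real dataset with many features would need a matrix-based solver instead.

```scala
// Ridge regression for one feature and no intercept (a simplifying assumption).
object RidgeSketch {
  // Minimizes sum((y - w*x)^2) + lambda * w^2.
  // Closed form: w = sum(x*y) / (sum(x*x) + lambda).
  def ridgeCoefficient(xs: Seq[Double], ys: Seq[Double], lambda: Double): Double = {
    require(xs.nonEmpty && xs.size == ys.size, "xs and ys must be non-empty and equal in length")
    require(lambda >= 0.0, "lambda must be non-negative")
    val sxy = xs.zip(ys).map { case (x, y) => x * y }.sum
    val sxx = xs.map(x => x * x).sum
    sxy / (sxx + lambda)
  }
}
```

For `xs = Seq(1.0, 2.0, 3.0)` and `ys = Seq(2.0, 4.0, 6.0)`, a `lambda` of 0 gives the ordinary least squares coefficient 2.0, while a `lambda` of 14 shrinks it to 1.0. A larger penalty pulls the coefficient further toward zero.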

Characteristics of lasso regression

Lasso...

k-means cluster analysis

k-means is an unsupervised clustering ML algorithm. Its primary use is to group closely related data and to gain an understanding of the structural properties of the data.

As the name suggests, this algorithm tries to form k clusters around k mean values. The number of clusters, that is, the value of k, is something a human being has to determine at the outset. This algorithm relies on the Euclidean distance to calculate the distance between two points. We can think of each observation as a point in n-dimensional space, where n is the number of features. The distance between two observations is the Euclidean distance between these points in n-dimensional space.
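The distance calculation can be sketched as follows, treating each observation as a sequence of n feature values. The names `DistanceSketch`, `euclidean`, and `nearest` are hypothetical. The `nearest` helper reflects how the distance is typically used in k-means, which is an assumption rather than something taken from the text.

```scala
// Euclidean distance between observations viewed as points in n-dimensional space.
object DistanceSketch {
  // Square root of the sum of squared per-feature differences.
  def euclidean(a: Seq[Double], b: Seq[Double]): Double = {
    require(a.size == b.size, "observations must have the same number of features")
    math.sqrt(a.zip(b).map { case (p, q) => (p - q) * (p - q) }.sum)
  }

  // Index of the closest of the k current mean values (assumes `means` is non-empty).
  def nearest(means: Seq[Seq[Double]], x: Seq[Double]): Int =
    means.indices.minBy(i => euclidean(means(i), x))
}
```

For example, `DistanceSketch.euclidean(Seq(0.0, 0.0), Seq(3.0, 4.0))` is 5.0, the familiar 3-4-5 right triangle. Note that features measured on very different scales will dominate the distance unless they are rescaled first.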

To begin with, the algorithm picks k random records from the dataset. These are the initial k mean values. In the next step, for each record...

Natural language processing for data analysis

Natural language processing (NLP) is the ability of a machine to analyze and understand human language. Human language is highly complex, which makes parsing and understanding it difficult. Spoken and written language carry a great deal of context, whereas machines work best with precise rules applied within a well-defined context. Even so, NLP techniques can yield useful insight from text. An excellent example of this is Twitter sentiment analysis: based on the contents of tweets, NLP can be used to determine whether the sentiment of a group of people is generally positive or negative. Another example is the successful application of NLP techniques to analyzing customer reviews of a product or service.
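As a rough illustration of group-level sentiment, the sketch below counts positive and negative words in each text and sums the scores. This lexicon-counting approach is a deliberately simple stand-in, not the technique the chapter goes on to describe: it ignores negation, sarcasm, and context. The word lists and the names `SentimentSketch`, `score`, and `groupScore` are hypothetical.

```scala
// A naive lexicon-based sentiment score; word lists are tiny and hypothetical.
object SentimentSketch {
  val positive: Set[String] = Set("good", "great", "love", "excellent")
  val negative: Set[String] = Set("bad", "poor", "hate", "terrible")

  // Score one text: +1 per positive word, -1 per negative word.
  def score(text: String): Int = {
    // Lowercase, then split on anything that is not a letter a-z.
    val words = text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty)
    words.count(w => positive.contains(w)) - words.count(w => negative.contains(w))
  }

  // Sentiment of a group of texts: the sum of the individual scores.
  def groupScore(texts: Seq[String]): Int = texts.map(score).sum
}
```

A positive `groupScore` suggests that the texts lean positive as a group, and a negative one suggests the opposite. A phrase such as "not good" would be scored as positive here, which shows why context makes the real problem hard.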

The ML algorithms explored so far in this chapter...

Algorithm selections

Each ML algorithm has its own strengths and weaknesses. Selecting an appropriate ML algorithm and tuning the model require a fair amount of experience working with these algorithms. However, the following factors also play a significant role in applying these techniques effectively:

Asking the right question: A great deal of effort is generally required in formulating the right question.
Understanding the business domain: Having a good understanding of the relevant business domain and context is equally important to build good models.
Understanding data: Ultimately, the data is used to train the model. If the data is not understood correctly or the data quality is poor, the built model is unlikely to be effective.

All of the preceding aspects outlined are somewhat interdependent and a mastery of all of these is a prerequisite to selecting the appropriate...

Summary

In this chapter, we learned about ML and some of the most popular ML algorithms. The primary goal of ML is to build an analytical model using historical data without much human intervention. ML algorithms can be divided into two categories, namely, supervised learning and unsupervised learning. Supervised learning algorithms rely on labeled data to build models, whereas unsupervised learning algorithms use data that is not labeled. We looked at the k-means cluster analysis algorithm, which is an unsupervised ML algorithm. Of the supervised ML algorithms, we explored decision trees, random forests, and ridge and lasso regression. We also got an overview of using NLP for text data analysis.

In the next chapter, we will examine the processing of data in real time and perform data analysis as the data becomes available.


Author (1)

Rajesh Gupta

Rajesh is a hands-on Big Data Tech Lead and Enterprise Architect with extensive experience in the full life cycle of software development. He has successfully architected, developed, and deployed highly scalable data solutions using the Spark, Scala, and Hadoop technology stack for several enterprises. A passionate, hands-on technologist, Rajesh has master's degrees in Mathematics and Computer Science from BITS, Pilani (India).