Hands-On Data Analysis with Scala

Product type: Book
Published in: May 2019
Reading level: Expert
Publisher: Packt
ISBN-13: 9781789346114
Edition: 1st Edition
Author (1)
Rajesh Gupta

Rajesh is a hands-on Big Data Tech Lead and Enterprise Architect with extensive experience in the full life cycle of software development. He has successfully architected, developed, and deployed highly scalable data solutions using the Spark, Scala, and Hadoop technology stack for several enterprises. A passionate, hands-on technologist, Rajesh has master's degrees in Mathematics and Computer Science from BITS, Pilani (India).

Preface

Efficient business decisions with an accurate understanding of business data help to deliver better performance across products and services. This book will help you to leverage the popular Scala libraries and tools to perform core data analysis tasks with ease.

The book begins with a quick overview of the building blocks of a standard data analysis process. You will learn how to perform basic tasks such as the extraction, staging, validation, cleaning, and shaping of datasets. You will then dive deep into the data exploration and visualization areas of the data analysis life cycle. You will use popular Scala libraries such as Saddle, Breeze, and Vegas to process your datasets, and you will learn statistical methods for deriving meaningful insights from data. You will also learn how to build Apache Spark 2.x applications that perform complex data analysis in real time. Finally, you will discover traditional machine learning (ML) techniques for data analysis.

By the end of this book, you will be capable of handling large sets of structured and unstructured data, performing exploratory analysis, and building efficient Scala applications to discover and deliver insights.

Who this book is for

If you are a data scientist or a data analyst who wants to learn how to perform data analysis using Scala, this book is for you. All you need is knowledge of the fundamentals of Scala programming.

What this book covers

Chapter 1, Scala Overview, gives you a quick run-through of Scala and its features. It will prepare you for the upcoming chapters.

Chapter 2, Data Analysis Life Cycle, turns the focus exclusively to data analysis and its typical life cycle. It provides an overview of the steps involved in the data analysis life cycle.

Chapter 3, Data Ingestion, deep-dives into the data ingestion aspects of the data life cycle. It covers the tasks of extracting, staging, validating, cleaning, and shaping data. It highlights how to deal with the variety aspect of data, that is, how to handle data from different sources in different formats.

Chapter 4, Data Exploration and Visualization, deep-dives into the data exploration and visualization parts of the life cycle. It familiarizes the reader with techniques for discovering inherent properties associated with data using statistical as well as visual methods.

Chapter 5, Applying Statistics and Hypothesis Testing, provides an overview of the statistical methods used in data analysis and covers techniques for deriving meaningful insights from data.

Chapter 6, Intro to Spark for Distributed Data Analysis, covers the transition to doing data analysis on distributed systems and doing it at scale. It provides a good introduction to Spark, a Scala-based distributed framework for data processing. It will guide you through Spark setup on your computer and introduce key features using practical examples.

Chapter 7, Traditional Machine Learning for Data Analysis, covers topics such as decision trees, random forests, lasso regression, and k-means cluster analysis. It also covers the role of NLP in effectively analyzing certain types of data.

Chapter 8, Near Real-Time Data Analysis Using Streaming, introduces the concept of stream-oriented processing and compares it to traditional batch-oriented processing. It also illustrates how streaming can be used to perform near real-time data analysis. This chapter deep-dives into Spark Streaming and will guide you on implementing clustering and a classifier leveraging Spark Streaming APIs.

Chapter 9, Working with Data at Scale, is dedicated to processing data at scale. It looks at data analysis from multiple dimensions, such as cost, reliability, and performance. It provides guidance on some of the best reliability and performance practices. It provides a complete picture of how a practical real-world data analysis life cycle works and will help you to put this into practice in a production environment.

To get the most out of this book

  • You should be familiar with the fundamentals of the Scala programming language
  • You should have a passion for analyzing data and extracting insight from it
  • You should have basic familiarity with statistical methods and machine learning algorithms

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

  1. Log in or register at www.packt.com.
  2. Select the SUPPORT tab.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR/7-Zip for Windows
  • Zipeg/iZip/UnRarX for Mac
  • 7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-Data-Analysis-with-Scala. If the code is updated, the changes will appear in the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Create a package called handson.example by expanding to src/main/scala and right-clicking on this folder."

A block of code is set as follows:

scala> def factorial(n: Int): Long = if (n <= 1) 1 else n * factorial(n-1)
factorial: (n: Int)Long

scala> factorial(5)
res0: Long = 120

Any command-line input or output is written as follows:

$ brew install sbt@1

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Click on Create New Project, and then click on Scala and select the sbt console."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at customercare@packtpub.com.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.
