You're reading from Machine Learning with Scala Quick Start Guide

Product type: Book
Published in: Apr 2019
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789345070
Edition: 1st Edition
Authors (2):
Md. Rezaul Karim

Md. Rezaul Karim is a researcher, author, and data science enthusiast with a strong computer science background, coupled with 10 years of research and development experience in machine learning, deep learning, and data mining algorithms to solve emerging bioinformatics research problems by making them explainable. He is passionate about applied machine learning, knowledge graphs, and explainable artificial intelligence (XAI). Currently, he is working as a research scientist at Fraunhofer FIT, Germany. He is also a PhD candidate at RWTH Aachen University, Germany. Before joining FIT, he worked as a researcher at the Insight Centre for Data Analytics, Ireland. Previously, he worked as a lead software engineer at Samsung Electronics, Korea.

Ajay Kumar N

Ajay Kumar N has experience in big data, and specializes in cloud computing and various big data frameworks, including Apache Spark and Apache Hadoop. His primary language of choice is Python, but he also has a special interest in functional programming languages such as Scala. He has worked extensively with NumPy, pandas, and scikit-learn, and often contributes to open source projects related to data science and machine learning.

Preface

Machine learning has made a huge impact not only in academia, but also in industry, by turning data into actionable intelligence. Scala is not only an object-oriented and functional programming language, but it also runs on the Java Virtual Machine (JVM) and can leverage its advantages. Scala helps reduce code complexity and offers concise notation, which is probably the reason it has seen a steady rise in adoption over the last few years, especially in data science and analytics.

This book is aimed at aspiring data scientists, data engineers, and deep learning enthusiasts who are new to the field and want a head start on machine learning best practices. Even if you're not well versed in machine learning concepts but still want to expand your knowledge by delving into practical implementations of supervised learning, unsupervised learning, and recommender systems with Scala, you will be able to grasp the content easily!

Throughout the chapters, you'll become acquainted with popular machine learning libraries in Scala, learning how to carry out regression and classification analysis using both linear methods and tree-based ensemble techniques, as well as looking at clustering analysis, dimensionality reduction, and recommender systems, before delving into deep learning at the end.

After reading this book, you will have a good head start in solving more complex machine learning tasks. This book isn't meant to be read cover to cover. You can turn the pages to a chapter that looks like something you're trying to accomplish or that ignites your interest.

Suggestions for improvement are always welcome. Happy reading!

Who this book is for

Machine learning developers looking to learn how to train machine learning models in Scala, without spending too much time and effort, will find this book very useful. Some fundamental knowledge of Scala programming and some basics of statistics and linear algebra are all you need to get started with this book.

What this book covers

Chapter 1, Introduction to Machine Learning with Scala, first explains some basic concepts of machine learning and different learning tasks. It then discusses Scala-based machine learning libraries, which is followed by configuring your programming environment. Finally, it covers Apache Spark briefly, before demonstrating a step-by-step example.

Chapter 2, Scala for Regression Analysis, covers a supervised learning task called regression analysis with examples, followed by regression metrics. It then explains some regression analysis algorithms, including linear regression and generalized linear regression. Finally, it demonstrates a step-by-step solution to a regression analysis task using Spark ML in Scala.

Chapter 3, Scala for Learning Classification, briefly explains another supervised learning task called classification with examples, followed by explaining how to interpret performance evaluation metrics. It then covers widely used classification algorithms such as logistic regression, Naïve Bayes, and support vector machines (SVMs). Finally, it demonstrates a step-by-step solution to a classification problem using Spark ML in Scala.

Chapter 4, Scala for Tree-Based Ensemble Techniques, covers very powerful and widely used tree-based approaches, including decision trees, gradient-boosted trees, and random forest algorithms, for both classification and regression analysis. It then revisits the examples of Chapter 2, Scala for Regression Analysis, and Chapter 3, Scala for Learning Classification, before solving them using these tree-based algorithms.

Chapter 5, Scala for Dimensionality Reduction and Clustering, briefly discusses different clustering analysis algorithms, followed by a step-by-step example of solving a clustering problem. Finally, it discusses the curse of dimensionality in high-dimensional data, before showing an example of solving it using principal component analysis (PCA).

Chapter 6, Scala for Recommender System, briefly covers similarity-based, content-based, and collaborative filtering approaches for developing recommendation systems. Finally, it demonstrates an example of a book recommender system with Spark ML in Scala.

Chapter 7, Introduction to Deep Learning with Scala, briefly covers deep learning, artificial neural networks, and neural network architectures. It then discusses some available deep learning frameworks. Finally, it demonstrates a step-by-step example of solving a cancer type prediction problem using a long short-term memory (LSTM) network.

To get the most out of this book

All the examples have been implemented in Scala with some open source libraries, including Apache Spark MLlib/ML and Deeplearning4j. However, to get the best out of this book, you should have a powerful computer and software stack.

A Linux distribution is preferable (for example, Debian, Ubuntu, or CentOS). For example, for Ubuntu, it is recommended to have at least a 14.04 (LTS) 64-bit complete installation on VMware Workstation Player 12 or VirtualBox. You can run Spark jobs on Windows (7/8/10) or macOS X (10.4.7+) as well.

A computer with a Core i5 processor, enough storage (for example, for running Spark jobs, you'll need at least 50 GB of free disk storage for a standalone cluster and for the SQL warehouse), and at least 16 GB of RAM is recommended. Optionally, if you want to perform neural network training on the GPU (for the last chapter only), the NVIDIA GPU driver has to be installed with CUDA and cuDNN configured.

The following APIs and tools are required in order to execute the source code in this book:

  • Java/JDK, version 1.8
  • Scala, version 2.11.8
  • Spark, version 2.2.0 or higher
  • Spark csv_2.11, version 1.3.0
  • ND4j backend version nd4j-cuda-9.0-platform for GPU; otherwise, nd4j-native
  • ND4j, version 1.0.0-alpha
  • DL4j, version 1.0.0-alpha
  • Datavec, version 1.0.0-alpha
  • Arbiter, version 1.0.0-alpha
  • Eclipse Mars or Luna (latest version) or IntelliJ IDEA
  • Maven Eclipse plugin (2.9 or higher)
  • Maven compiler plugin for Eclipse (2.3.2 or higher)
  • Maven assembly plugin for Eclipse (2.4.1 or higher)
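
As a quick sanity check of the first two requirements in the list above, the following minimal sketch prints the Java and Scala versions visible to the running program and compares them with the versions named there. The object name EnvCheck and its helper methods are hypothetical names introduced for illustration only; they are not part of the book's code bundle. The check relies on the assumption that Java 8 reports its version as a string starting with "1.8" and Scala 2.11.x as a string starting with "2.11.".

```scala
// Minimal environment check (a sketch; EnvCheck and its helpers are
// hypothetical names, not part of the book's code bundle).
object EnvCheck {

  // Assumption: Java 8 reports its version as "1.8.0_<update>".
  def isJava8(javaVersion: String): Boolean =
    javaVersion.startsWith("1.8")

  // Assumption: Scala 2.11.x reports its version as, for example, "2.11.8".
  def isScala211(scalaVersion: String): Boolean =
    scalaVersion.startsWith("2.11.")

  def main(args: Array[String]): Unit = {
    // Version of the JVM that is running this program
    val javaVersion = System.getProperty("java.version")
    // Version of the Scala library on the classpath
    val scalaVersion = scala.util.Properties.versionNumberString

    println(s"Java $javaVersion (1.8 expected): ${isJava8(javaVersion)}")
    println(s"Scala $scalaVersion (2.11.x expected): ${isScala211(scalaVersion)}")
  }
}
```

Running EnvCheck inside the same project that will run the examples is more informative than checking the versions in a separate shell, since an IDE or build tool may be configured with a different JDK or Scala version than the system default.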

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

  1. Log in or register at www.packt.com.
  2. Select the SUPPORT tab.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR/7-Zip for Windows
  • Zipeg/iZip/UnRarX for Mac
  • 7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Machine-Learning-with-Scala-Quick-Start-Guide. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "It gave me a Matthews correlation coefficient of 0.3888239300421191."

A block of code is set as follows:

rawTrafficDF.select("Hour (Coded)", "Immobilized bus", "Broken Truck",
  "Vehicle excess", "Fire", "Slowness in traffic (%)").show(5)

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

// Create a decision tree estimator
val dt = new DecisionTreeClassifier()
  .setImpurity("gini")
  .setMaxBins(10)
  .setMaxDepth(30)
  .setLabelCol("label")
  .setFeaturesCol("features")

Any command-line input or output is written as follows:

+-----+-----+
|churn|count|
+-----+-----+
|False| 2278|
| True|  388|
+-----+-----+

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Clicking the Next button moves you to the next screen."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at customercare@packtpub.com.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.
