You're reading from Machine Learning with Apache Spark Quick Start Guide

Product type: Book
Published in: Dec 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789346565
Edition: 1st Edition
Author (1)
Jillur Quddus

Jillur Quddus is a lead technical architect, polyglot software engineer and data scientist with over 10 years of hands-on experience in architecting and engineering distributed, scalable, high-performance, and secure solutions used to combat serious organized crime, cybercrime, and fraud. Jillur has extensive experience of working within central government, intelligence, law enforcement, and banking, and has worked across the world including in Japan, Singapore, Malaysia, Hong Kong, and New Zealand. Jillur is both the founder of Keisan, a UK-based company specializing in open source distributed technologies and machine learning, and the lead technical architect at Methods, the leading digital transformation partner for the UK public sector.

Setting Up a Local Development Environment

In this chapter, we will install, configure, and deploy a local analytical development environment by provisioning a self-contained single-node cluster that will allow us to do the following:

  • Prototype and develop machine learning models and pipelines in Python
  • Demonstrate the functionality and usage of Apache Spark's machine learning library, MLlib, via the Spark Python API (PySpark)
  • Develop and test machine learning models on a single-node cluster using small sample datasets, and thereafter scale up to multi-node clusters processing much larger datasets with little or no code changes required
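The last point, moving from a single-node cluster to a multi-node cluster with little or no code change, can be sketched as follows. This is a minimal illustration and not the book's own code: the environment variable name APP_SPARK_MASTER and the helper names are hypothetical, and the sketch assumes PySpark is installed before build_session is called. The idea is that only the master URL changes between environments, while the application code stays the same.

```python
import os

# Hypothetical environment variable name, used only for this sketch.
MASTER_ENV_VAR = "APP_SPARK_MASTER"


def resolve_master(env=None):
    """Return the Spark master URL, defaulting to local mode on all cores."""
    env = os.environ if env is None else env
    return env.get(MASTER_ENV_VAR, "local[*]")


def build_session(app_name="local-dev-sketch", env=None):
    """Create (or reuse) a SparkSession against the resolved master."""
    # Imported lazily so resolve_master works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(resolve_master(env))
        .appName(app_name)
        .getOrCreate()
    )


if __name__ == "__main__":
    # Only prints the master URL; call build_session() once PySpark is set up.
    print(resolve_master())
```

With the variable unset, the session runs in local mode on the development machine. Setting it to a cluster URL such as spark://host:7077 (a placeholder, not a real address) would point the same code at a multi-node cluster.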

Our single-node cluster will host the following technologies:

CentOS Linux 7 virtual machine

First of all, we will assume that you have access to a physical or virtual machine provisioned with the CentOS 7 operating system. CentOS 7 is a free Linux distribution derived from Red Hat Enterprise Linux (RHEL). It is commonly used, along with its licensed upstream parent, RHEL, as the operating system of choice for Linux-based servers, since it is stable and backed by a large active community with detailed documentation. All the commands that we will use to install the various technologies listed previously will be Linux shell commands to be executed on a single CentOS 7 (or RHEL) machine, whether physical or virtual. If you do not have access to a CentOS 7 machine, then there are quite a few options available to provision a CentOS 7 virtual machine:

  • Cloud computing platforms such as Amazon Web Services (AWS), Microsoft Azure, and the Google...
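Whichever way the machine is provisioned, it may be worth confirming that it really is CentOS 7 (or RHEL 7) before running the installation commands. The following is a minimal sketch, assuming the machine exposes the standard /etc/os-release file. The function names and the SAMPLE text are illustrative and were not captured from a real machine.

```python
def parse_os_release(text):
    """Parse KEY=value lines from os-release content into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in single or double quotes.
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def is_el7(fields):
    """True when the ID is centos or rhel and the major version is 7."""
    major = fields.get("VERSION_ID", "").split(".")[0]
    return fields.get("ID") in {"centos", "rhel"} and major == "7"


# Illustrative sample content, not taken from a real machine.
SAMPLE = 'NAME="CentOS Linux"\nVERSION_ID="7"\nID="centos"\n'

if __name__ == "__main__":
    try:
        with open("/etc/os-release") as handle:
            print(is_el7(parse_os_release(handle.read())))
    except FileNotFoundError:
        print("No /etc/os-release file found on this machine")
```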

Summary

In this chapter, we have installed, configured, and deployed a local analytical development environment consisting of a single-node Apache Spark 2.3.2 and Apache Kafka 2.0.0 cluster that will also allow us to interactively develop Spark applications using Python 3.6 via Jupyter Notebook.
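As a quick sanity check of such an environment, a short script can report the Python version and whether the expected command-line tools can be found. This is a sketch under assumptions: it presumes the spark-submit, pyspark, and jupyter launchers have been added to the PATH, which depends on how each component was installed, and the function name check_environment is hypothetical.

```python
import shutil
import sys


def check_environment(commands=("spark-submit", "pyspark", "jupyter")):
    """Report the Python version and which commands resolve on the PATH."""
    report = {"python": "{}.{}".format(*sys.version_info[:2])}
    for command in commands:
        # shutil.which returns the executable's path, or None if not found.
        report[command] = shutil.which(command)
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value or 'not found on PATH'}")
```

A missing entry does not necessarily mean a component is absent. It may simply be installed in a directory that is not on the PATH.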

In the next chapter, we will discuss some of the high-level concepts behind common artificial intelligence and machine learning algorithms, and introduce Apache Spark's machine learning library, MLlib.
