About this video

Apache Spark is an open-source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. This course shows you how to put Python to work in the Spark ecosystem. You will start by getting a firm understanding of the Spark 2.0 architecture and how to set up a Python environment for Spark. You will then get familiar with the modules available in PySpark, learn how to abstract data with RDDs and DataFrames, and understand the streaming capabilities of PySpark. You will also get a thorough overview of the machine learning capabilities of PySpark using ML and MLlib, graph processing using GraphFrames, and polyglot persistence using Blaze. Finally, you will learn how to deploy your applications to the cloud using the spark-submit command. By the end of this course, you will have a firm understanding of the Spark Python API and how it can be used to build data-intensive applications.

All the code and supporting files for this course are available on GitHub at https://github.com/PacktPublishing/PySpark-for-Beginners

Style and Approach

This course takes a comprehensive, step-by-step approach so you understand how the Spark ecosystem can be used with Python to develop efficient, scalable solutions. Every section is standalone and explained in an easy-to-understand manner.

Publication date: June 2018
Publisher: Packt
Duration: 1 hour and 34 minutes
ISBN: 9781789538762

About the Author

  • Tomasz Drabas

    Tomasz Drabas is a Data Scientist working for Microsoft and currently residing in the Seattle area. He has over 12 years' international experience in data analytics and data science in numerous fields: advanced technology, airlines, telecommunications, finance, and consulting.

    Tomasz started his career in 2003 with LOT Polish Airlines in Warsaw, Poland, while finishing his Master's degree in strategy management. In 2007, he moved to Sydney to pursue a doctoral degree in operations research at the University of New South Wales, School of Aviation; his research crossed boundaries between discrete choice modeling and airline operations research. During his time in Sydney, he worked as a Data Analyst for Beyond Analysis Australia and as a Senior Data Analyst/Data Scientist for Vodafone Hutchison Australia, among others. He has also published scientific papers, attended international conferences, and served as a reviewer for scientific journals.

    In 2015, he relocated to Seattle to begin his work for Microsoft. While there, he has worked on numerous projects involving solving problems in high-dimensional feature space.

