You're reading from Mastering Geospatial Analysis with Python

Product type: Book
Published in: Apr 2018
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781788293334
Edition: 1st Edition
Authors (3):
Silas Toms

Silas Toms is a long-time geospatial professional and author who has previously published ArcPy and ArcGIS and Mastering Geospatial Analysis with Python. His career highlights include developing the real-time common operational picture used at Super Bowl 50, building geospatial software for autonomous cars, designing computer vision for next-gen insurance, and developing mapping systems for Zillow. He now works at Volta Charging, predicting the future of electric vehicle adoption and electric charging infrastructure.

Paul Crickard

Paul Crickard authored a book on the Leaflet JavaScript module. He has been programming for over 15 years and has focused on GIS and geospatial programming for 7 years. He spent 3 years working as a planner at an architecture firm, where he combined GIS with Building Information Modeling (BIM) and CAD. Currently, he is the CIO at the 2nd Judicial District Attorney's Office in New Mexico.

Eric van Rees

Eric van Rees was first introduced to Geographical Information Systems (GIS) when studying Human Geography in the Netherlands. For 9 years, he was the editor-in-chief of GeoInformatics, an international GIS, surveying, and mapping publication and a contributing editor of GIS Magazine. During that tenure, he visited many geospatial user conferences, trade fairs, and industry meetings. He focuses on producing technical content, such as software tutorials, tech blogs, and innovative new use cases in the mapping industry.

Chapter 16. Python Geoprocessing with Hadoop

Most of the examples in this book work with relatively small datasets on a single computer. As data grows, datasets and even individual files may be spread across a cluster of machines, and working with them requires different tools. In this chapter, you will learn how to use Apache Hadoop to work with big data, and the Esri GIS tools for Hadoop to query that data spatially.

This chapter will teach you how to:

  • Install Linux
  • Install and run Docker
  • Install and configure a Hadoop environment
  • Work with files in HDFS
  • Run basic queries using Hive
  • Install the Esri GIS tools for Hadoop
  • Perform spatial queries in Hive

What is Hadoop?


Hadoop is an open-source framework for working with large quantities of data stored on anything from a single computer to a cluster of thousands of computers. Hadoop is composed of four modules:

  • Hadoop Core (called Hadoop Common in current Apache releases)
  • Hadoop Distributed File System (HDFS)
  • Yet Another Resource Negotiator (YARN)
  • MapReduce

Hadoop Core provides the components needed to run the other three modules. HDFS is a Java-based distributed file system capable of storing large files, on the order of terabytes, across many machines. YARN manages resources and scheduling in your Hadoop framework. The MapReduce engine allows you to process data in parallel.
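
To make the MapReduce idea more concrete, the following is a minimal single-machine sketch in plain Python, not Hadoop code. It assumes a word-count task, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical. A real MapReduce job distributes the same three steps across the cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(record):
    """Emit (key, 1) pairs: here, one pair per word in a line of text."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(mapped_chunks):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for chunk in mapped_chunks:
        for key, value in chunk:
            grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine the values collected for one key."""
    return key, sum(values)


def word_count(records, workers=4):
    """Count words across records using map, shuffle, and reduce steps."""
    # Map step: each record is processed independently, so it can run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, records))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

For example, word_count(["big data", "Big files"]) returns a dictionary in which "big" is counted twice and "data" and "files" once each. The records are independent of each other, which is what lets the map step be split across many machines.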

There are several other projects that can be installed to work with the Hadoop framework. In this chapter, you will use Hive and Ambari. Hive allows you to read and write data using SQL. You will use Hive to run the spatial queries on your data at the end of the chapter. Ambari provides a web user interface to Hadoop and Hive. In this...

Installing the Hadoop framework


In this chapter, you will not configure each of the Hadoop framework components yourself. You will run a Docker image, which requires you to install Docker. Currently, Docker runs on Windows 10 Pro or Enterprise, but it runs much better on Linux or Mac. Hadoop also runs on Windows but requires you to build it from source, and so it will be much easier to run it on Linux. Also, the Docker image you will use is running Linux, so getting familiar with Linux may be beneficial. In this section, you will learn how to install Linux.

Installing Linux

The first step to set up the Hadoop framework is to install Linux. You will need to get a copy of a Linux operating system. There are many flavors of Linux. You can choose whichever version you like; however, this chapter was written using CentOS 7 because most of the tools you will be installing have also been tested on CentOS. CentOS is a Red Hat-based version of Linux. You can download an ISO at: https://www.centos.org...

Hadoop basics


In this section, you will launch your Hadoop image and learn how to connect using ssh and Ambari. You will also move files and perform a basic Hive query. Once you understand how to interact with the framework, the next section will show you how to use a spatial query.

First, from the terminal, launch the Hortonworks Sandbox using the provided Bash script. The following command will show you how:

sudo sh start_sandbox-hdp.sh

The previous command executes the script you downloaded with the sandbox. Again, it uses sudo to run as root. Depending on your machine, it may take some time to completely load and start all the services. When it is done, your terminal should look like it does in the following screenshot:

[Screenshot: terminal output after the sandbox has started]
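
Because startup can take a while, it may help to poll until a service port accepts connections. The following is a minimal sketch using only the Python standard library. The function name wait_for_port is hypothetical, and an open port is only a rough signal that a service is ready. Port 2222 is the SSH port used in the next section.

```python
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=5.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connection means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: give up if the deadline has passed, else retry.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling wait_for_port("127.0.0.1", 2222) blocks until the sandbox accepts SSH connections or five minutes have passed.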

Connecting via Secure Shell

Now that the sandbox is running, you can connect using Secure Shell (SSH). The secure shell allows you to log in remotely to another machine. Open a new terminal and enter the following command:

ssh raj_ops@127.0.0.1 -p2222

The previous command uses...

Esri GIS tools for Hadoop


With your environment set up and some basic knowledge of Ambari, HDFS, and Hive, you will now learn how to add a spatial component to your queries. To do so, we will use the Esri GIS tools for Hadoop.

The first step is to download the files from the GitHub repository at: https://github.com/Esri/gis-tools-for-hadoop. You will be using Ambari to move the files to HDFS, not to the container, so download these files to your local machine.

Note

Esri has a tutorial for downloading the files by using ssh to connect to the container and then using git to clone the repository. You can follow these instructions here: https://github.com/Esri/gis-tools-for-hadoop/wiki/GIS-Tools-for-Hadoop-for-Beginners.

You can download the files by using the GitHub Clone or download button on the right-hand side of the repository. To unzip the archive, use one of the following commands:

unzip gis-tools-for-hadoop-master.zip
unzip gis-tools-for-hadoop-master.zip -d /home/pcrickard...
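
The same two variants (extract in place, or into a chosen directory) can also be sketched in Python with the standard-library zipfile module. The helper name extract_archive is hypothetical, and the target directory is whatever you choose on your local machine.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir="."):
    """Extract every member of a zip archive into target_dir and return their names."""
    target = Path(target_dir)
    # Create the destination if it does not exist yet, like `unzip -d` does.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return archive.namelist()
```

For example, extract_archive("gis-tools-for-hadoop-master.zip") extracts into the current directory, and passing a second argument plays the role of the -d option.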

HDFS and Hive in Python


This book is about Python for geospatial development, so in this section, you will learn how to use Python for HDFS operations and Hive queries. Several Python wrapper libraries exist for Hadoop, but no single library has emerged as the standout choice, and some, like Snakebite, don't appear ready to run on Python 3. In this section, you will learn how to use two libraries: PyHive and PyWebHDFS. You will also learn how you can use the Python subprocess module to execute HDFS and Hive commands.
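
As a rough sketch of the subprocess approach, the helpers below build argument lists for the hdfs dfs and hive -e command-line tools and run them. The helper names are hypothetical, and the sketch assumes it is executed somewhere those tools are on the PATH, for example inside the sandbox after connecting over SSH.

```python
import subprocess


def hdfs_command(*args):
    """Build the argument list for an `hdfs dfs` call, e.g. hdfs_command('-ls', '/user')."""
    return ["hdfs", "dfs", *args]


def hive_command(query):
    """Build the argument list for running one HiveQL statement with `hive -e`."""
    return ["hive", "-e", query]


def run(command):
    """Run a command and return its standard output as text; raises on a non-zero exit."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, run(hdfs_command("-ls", "/user")) lists a directory in HDFS, and run(hive_command("SHOW TABLES")) runs a single Hive statement. Passing the arguments as a list avoids shell quoting problems in the query string.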

To get PyHive, you can use conda and the following command:

conda install -c blaze pyhive

You may also need to install the sasl library:

conda install -c blaze sasl
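
Once PyHive is installed, a query could look roughly like the following sketch. The host, the port 10000 (the usual HiveServer2 default), and the raj_ops user are assumptions about your sandbox, and the helper names are hypothetical. PyHive follows the Python DB-API, so fetch_rows works with any DB-API connection.

```python
def connect_hive(host="127.0.0.1", port=10000, username="raj_ops"):
    """Open a HiveServer2 connection with PyHive (the default values are assumptions)."""
    # Imported lazily so that fetch_rows below can be used without PyHive installed.
    from pyhive import hive
    return hive.Connection(host=host, port=port, username=username)


def fetch_rows(connection, query):
    """Run a query on any DB-API 2.0 connection and return all rows."""
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        # Always release the cursor, even if the query fails.
        cursor.close()
```

A call such as fetch_rows(connect_hive(), "SHOW TABLES") would then return the table names as a list of rows.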

The previous libraries will give you the ability to run Hive queries from Python. You will also want to be able to move files to HDFS. To do so, you can install pywebhdfs:

conda install -c conda-forge pywebhdfs

The preceding command will install the library, and as always, you can...
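
A first use of pywebhdfs might look like the sketch below. The host, the port 50070 (the NameNode web port in Hadoop 2.x), and the raj_ops user are assumptions about your sandbox, and the helper names are hypothetical. Note that pywebhdfs paths are written without a leading slash.

```python
def connect_webhdfs(host="127.0.0.1", port="50070", user_name="raj_ops"):
    """Create a WebHDFS client with pywebhdfs (host, port, and user are assumptions)."""
    # Imported lazily so that upload_text below can be tried without pywebhdfs installed.
    from pywebhdfs.webhdfs import PyWebHdfsClient
    return PyWebHdfsClient(host=host, port=port, user_name=user_name)


def upload_text(client, hdfs_dir, file_name, text):
    """Create hdfs_dir if needed, write text to a file in it, and return the HDFS path."""
    directory = hdfs_dir.strip("/")  # pywebhdfs expects paths without a leading slash
    path = f"{directory}/{file_name}"
    client.make_dir(directory)
    client.create_file(path, text)
    return path
```

For example, upload_text(connect_webhdfs(), "user/raj_ops/demo", "points.csv", "x,y\n1,2\n") would create the directory and write a small CSV file into it.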

Summary


In this chapter, you learned how to set up a Hadoop environment. This required you to install Linux and Docker, download an image from Hortonworks, and learn the ropes of that environment. Much of this chapter was spent on the environment and how to perform a spatial query using the GUI tools provided. This is because the Hadoop environment is complex, and without a proper understanding of it, it would be hard to use it from Python. Lastly, you learned how to use HDFS and Hive in Python. The Python libraries for working with Hadoop, Hive, and HDFS are still developing. This chapter provided you with a foundation so that when these libraries improve, you will have enough knowledge of Hadoop and the accompanying technologies to implement these new Python libraries.
