First Script – Geocoding with Web APIs

Now that we know how to write functions, let's apply that knowledge to a practical task. In this chapter, we will build a function that will communicate with a web service via a REST API in order to get the latitude and longitude of a given address. Furthermore, we'll discuss how to use built-in Python libraries to read and write data from and to files. Finally, we will wrap this functionality into a standalone script, so that it can be used from the command line, with no Jupyter Notebook attached.

In this chapter, we will learn how to do the following:

Work with Python's built-in libraries in general and with the third-party requests library in particular
Communicate with web services via APIs
Read and write data using the CSV file format
Wrap code into a standalone script with a command-line interface, using sys.argv from the built-in sys module, and...

Technical requirements

In this chapter, we will use two third-party libraries—requests and tqdm. Both are included in Anaconda Distribution, so if you use Anaconda, they are already installed. Otherwise, please install them.

To install them, run one of the following commands:

If you have conda installed, use conda install requests tqdm.
Otherwise, if you have pip, use pip install requests tqdm.

You will also need an internet connection, as we'll be working with a web service API.

The code for this chapter is available in the GitHub repository, specifically the Chapter06 folder (https://github.com/PacktPublishing/Learn-Python-by-Building-Data-Science-Applications).

Geocoding as a service

Often, the data we work with requires preprocessing; sometimes, that includes gathering additional information to add context or transform existing information. Typical examples of that are geocoding and reverse geocoding—the processes of converting an address into geocoordinates and vice versa, respectively. Converting an address into coordinates allows us to visualize data on a map, measure distances, and check membership (seeing things such as what country, neighborhood, or school district an address belongs to).

This is actually a hard task, as it requires you to have a large hierarchical database of relevant addresses and a complex parsing engine to make sense of semi-structured, often misspelled and ambiguous, addresses. Realistically, a service like that requires a large investment of time and resources.

The good news is that we can use some...

Learning about web APIs

First, what is an API? Well, an Application Programming Interface (API) is an interface for working with a specific application programmatically—that is, via code. Think of Twitter bots or email clients—all of them use APIs to work with their corresponding applications (Twitter and email servers, respectively).

An API does not have to involve the web—many local applications on your computer have APIs of their own, so we can interact with them through Python or any other language. In our case, however, we need to work with a web API. Those APIs operate via HTTP requests and responses. Many contemporary APIs follow REST guidelines—a set of six design constraints that were put forward by Roy Fielding. You can learn more about REST architecture via REST API Tutorial (https://restfulapi.net/) or the Packt books cited at the end of...
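Every HTTP response carries a numeric status code that tells the client how the request went. As a small illustration that relies only on the standard library (no request is actually sent), the built-in http.HTTPStatus enumeration can translate a few common codes into their standard phrases. The describe helper is a hypothetical name introduced just for this sketch.

```python
from http import HTTPStatus


def describe(code):
    """Return a short human-readable label for an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    return f"{status.value} {status.phrase}"


# A successful request, a missing resource, a rate limit, a server failure
for code in (200, 404, 429, 500):
    print(describe(code))
```

Codes in the 200 range signal success, those in the 400 range point to a problem with the request, and those in the 500 range point to a problem on the server side.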

Working with the Nominatim API

In this particular case, we are going to use the Nominatim service from OpenStreetMap (OSM). Its API is simple, free, does not require authorization, and has a relatively open license. Moreover, as OSM is an open, community-edited project, we can in theory add to and improve its content if our project needs it.

In order to work with an API, we first need to read its documentation. Documentation often includes example code snippets for the service in question, and these are frequently written in Python. Nominatim's documentation can be found at nominatim.openstreetmap.org. According to it, to get information for a given address, we should send a request to the following URL:

https://nominatim.openstreetmap.org/search?

All our parameters—the address, response format, geographic limitations, and so on—need to be added using standard URL escaping (don...
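To make the idea of URL escaping concrete, here is a minimal sketch that builds a search URL with the standard library's urllib.parse.urlencode, without sending any request. The parameter names q, format, and limit are assumed from Nominatim's public documentation, and build_search_url is a hypothetical helper, not the function developed in the chapter.

```python
from urllib.parse import urlencode

BASE_URL = "https://nominatim.openstreetmap.org/search?"


def build_search_url(address, response_format="json", limit=1):
    """Return a search URL with properly escaped query parameters."""
    params = {"q": address, "format": response_format, "limit": limit}
    # urlencode escapes spaces, commas, and other special characters
    return BASE_URL + urlencode(params)


url = build_search_url("Brooklyn Bridge, New York")
print(url)
```

When the requests library is used, the same dictionary can be passed through the params argument of requests.get, which performs the escaping for us.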

Caching with decorators

As you can see, geocoding takes time: communicating with a server is slow, and so is being nice and waiting between requests. Thus, we probably don't want to waste time asking the same questions over and over again. For example, if many records within the same session have the same address, it makes sense to pull that data once and then reuse it. Whether caching is appropriate depends on the nature of the data. For instance, if we're checking air ticket availability, we shouldn't cache the results, because the data might change any second. But for geolocation, we don't anticipate any changes any time soon.

The process of storing retrieved data locally and then reusing it, instead of requesting the same data again, is called caching. For example, all modern browsers do this: they cache some secondary elements of the web page for you to use and they'...
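As a rough illustration of caching with a decorator, the sketch below stores results in a dictionary keyed by the function's arguments. Both cached and fake_geocode are hypothetical names; fake_geocode stands in for a slow web request and returns made-up coordinates, so this is not the chapter's actual implementation.

```python
import functools


def cached(func):
    """Remember results of func, keyed by its positional arguments."""
    storage = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in storage:
            storage[args] = func(*args)  # first call: do the real work
        return storage[args]  # repeated calls: reuse the stored result

    return wrapper


calls = []  # records every time the slow function really runs


@cached
def fake_geocode(address):
    """Stand-in for a slow web request; returns made-up coordinates."""
    calls.append(address)
    return (0.0, 0.0)


fake_geocode("Some Street 1")
fake_geocode("Some Street 1")  # served from the cache, no second call
fake_geocode("Other Street 2")
print(len(calls))
```

The standard library offers a ready-made version of the same idea in functools.lru_cache, which may be a better fit when the cache size needs to be bounded.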

Reading and writing data

Now that the function works, we can apply it to any single address, or loop over a collection of addresses. Addresses could be copied and pasted into Jupyter, but that is not a sustainable solution. Most of the time, our data is stored in a database or a file. Let's learn how to read addresses from a file and write the results to another file.

CSV is a popular text-based format for tabular data, where each line represents a row and cells are separated by separator symbols—usually commas, but it could be a semicolon or a pipe. Cells containing separator or newline symbols are usually "escaped" using quotes. This format is not the most efficient, but it is widespread and easy to read using any text editor.

Python has a built-in library for dealing with .csv files—it is called csv. It has two ways to parse...
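The following sketch shows the csv module in action on a small in-memory sample, assuming the two parsing approaches in question are csv.reader (rows as lists) and csv.DictReader (rows as dictionaries). The sample data is made up, and io.StringIO stands in for a real file so that the example runs anywhere.

```python
import csv
import io

# In-memory stand-in for a file such as cities.csv (made-up sample data)
raw = 'name,address\nA,"1 Main St, Springfield"\nB,2 Side Rd\n'

# csv.reader yields each row as a list of strings
rows = list(csv.reader(io.StringIO(raw)))

# csv.DictReader uses the first row as keys and yields dictionaries
records = list(csv.DictReader(io.StringIO(raw)))

# csv.DictWriter writes dictionaries back out, quoting cells where needed
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["name", "address"])
writer.writeheader()
writer.writerows(records)

print(rows[1])
print(records[0]["address"])
print(out.getvalue())
```

Note how the quoted cell keeps its comma intact in both directions. With real files, the csv documentation recommends opening them with newline="" so that line endings are handled by the module itself.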

Moving code to a separate module

We now have everything we need to process data and get the coordinates in bulk. In the Jupyter Notebook, this could be as short as the following three lines, assuming we have the path_in and path_out variables predefined (of course, here we don't actually do anything with the errors):

path_in = './cities.csv'
path_out = './geocoded.csv'

data = read_csv(path_in)
result, errors = geocode_bulk(data, column='address', verbose=True)
write_csv(result, path_out)
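The body of geocode_bulk is not reproduced in this section. As a guess at its general shape, the sketch below loops over records, keeps going when a lookup fails or finds nothing, and returns the successes and the failures separately. The geocoder parameter and fake_geocoder are assumptions added so that the example runs without a network connection; the real function presumably calls the chapter's own geocoding function instead.

```python
def geocode_bulk(data, column="address", geocoder=None, verbose=False):
    """Geocode every record in data; collect failures instead of stopping."""
    result, errors = [], []
    for row in data:
        try:
            coords = geocoder(row[column])
            if coords is None:
                raise ValueError("no location found")
        except Exception as e:
            # Remember the failed record and move on to the next one
            errors.append((row, str(e)))
            if verbose:
                print(f"failed: {row[column]!r} ({e})")
            continue
        lat, lon = coords
        result.append({**row, "lat": lat, "lon": lon})
    return result, errors


def fake_geocoder(address):
    """Hypothetical stand-in for a web request, with made-up coordinates."""
    if address == "boom":
        raise ConnectionError("server unavailable")
    return {"A": (1.0, 2.0)}.get(address)  # None for unknown addresses


data = [{"address": "A"}, {"address": "boom"}, {"address": "Z"}]
result, errors = geocode_bulk(data, geocoder=fake_geocoder, verbose=True)
print(result)
print(errors)
```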

It is not very convenient, however, to fire up Jupyter and run through all the cells every time just to load the functions we write. Instead, we can store our functions in a separate module—a text file with the .py extension—and import the functions from there.

Let's create a new text file using Visual Studio Code (which is what we recommend...
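To show how a module can double as a command-line script, here is a minimal sketch of such a file. The filename geocode.py and the parse_args and main helpers are hypothetical; the sketch only reports what it would do, where a real script would call the read, geocode, and write functions.

```python
import sys


def parse_args(argv):
    """Split a sys.argv-style list into (path_in, path_out), or None."""
    if len(argv) != 3:  # script name plus two paths
        return None
    _, path_in, path_out = argv
    return path_in, path_out


def main(argv):
    """Validate arguments, then report what would be processed."""
    paths = parse_args(argv)
    if paths is None:
        print("usage: python geocode.py PATH_IN PATH_OUT")
        return 1
    path_in, path_out = paths
    # A real script would read, geocode, and write the data here
    print(f"would geocode {path_in} into {path_out}")
    return 0


# Runs only when the file is executed directly, not when it is imported
if __name__ == "__main__":
    main(sys.argv)
```

Because of the guard at the bottom, importing this file from a notebook only defines the functions, while running it from the command line also executes main. A production script would typically pass the return value of main to sys.exit so that the shell sees the exit code.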

Collecting NYC Open Data from the Socrata service

In Chapter 12, Data Exploration and Visualization, Chapter 16, Data Pipelines with Luigi, Chapter 17, Let's Build a Dashboard, Chapter 18, Serving Models with a RESTful API, and Chapter 19, Serverless API Using Chalice, we'll be working with a dataset of New York City 311 complaints (311 is the non-urgent counterpart of the 911 service). This data is available via a public portal (https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9), both through a web interface and programmatically through an API. The code for pulling this data via the API is rather dull and similar to what we've written already, so we won't cover it in detail. In Chapter 16, Data Pipelines with Luigi, we'll discuss how to pull this dataset systematically and on a schedule. If you want, however, feel free...
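For readers who want a starting point, the sketch below generates paged request URLs for this dataset without sending anything. It assumes the usual Socrata (SODA) conventions: a /resource/<dataset id>.json endpoint and the $limit and $offset query parameters. The page_urls helper is a hypothetical name, and this is not the code used later in the book.

```python
from urllib.parse import urlencode

# Assumed SODA endpoint for the dataset ID that appears in the portal URL
ENDPOINT = "https://data.cityofnewyork.us/resource/erm2-nwe9.json"


def page_urls(total, page_size=1000):
    """Yield one URL per page of page_size records, up to total records."""
    for offset in range(0, total, page_size):
        params = {"$limit": min(page_size, total - offset), "$offset": offset}
        # safe="$" keeps the dollar sign of the SODA parameters readable
        yield ENDPOINT + "?" + urlencode(params, safe="$")


urls = list(page_urls(total=2500))
for url in urls:
    print(url)
```

Each URL could then be fetched one at a time, pausing between requests. A real pull would probably also need a stable sort order so that pages do not overlap.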

Summary

We've done a lot in this chapter. First, we learned about geocoding in general, including geocoding services and their web APIs. We also discussed how you can interact with web APIs programmatically, from Python, using the requests library. Then, we experimented with a specific API from Nominatim and wrote a thin wrapper function that geocodes any arbitrary address. On top of that, we wrote another function to geocode addresses in bulk that keeps working even if a specific request fails or no location was found for some addresses. We used the built-in csv library both to read data from and write to CSV files. Finally, as the code we used seemed as though it might be useful in the future, we moved it from a notebook into a dedicated Python file, which can be used as a standalone script with its own interface or as a module to import functions from.

In the next chapter...

Questions

What is an API? Why would we use it?
What do the various HTTP response status codes mean?
Is there a built-in library for dealing with HTTP? Why do we use requests instead?
How do you define command-line interface parameters for Python scripts?
What does if __name__ == '__main__' mean and why do we need it at the end of a script?