Chapter 8. Extracting Data from the Internet
In this chapter, we will look at ways to extract data and files from the Internet using a range of data formats and services, namely web services (also known as
Application Programming Interfaces (APIs)) that use the Extensible Markup Language (XML) and JavaScript Object Notation (JSON) data formats.
We will also look at how we can use Python to download files and extract information from web pages for cases where a website does not offer an API to access its data.
Using urllib2 to download data
Before we get on to processing the data we extract from online sources, we will first demonstrate the use of the built-in urllib2
Python module for downloading data from the internet.
This will be used in all the examples later on in the chapter for parsing information downloaded from the various online sources.
In the following example, we will write a simple script that downloads the text contents of a web page and prints them to the terminal. This is not a practical application in itself, but it demonstrates how the module is used to retrieve data from web resources.
We will start by importing the Python modules required for this script. We will save this script file as urllib_example.py
:
In this line, we take the first command-line argument as the URL whose HTML contents we will open and return:
Now, we will create a request object that represents a request to be sent to the web server. Creating the request does not contact the server; it simply describes the request (allowing, for example, extra headers to be set) before it is opened:
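Putting these steps together, the script looks something like the following sketch. The chapter targets Python 2, where the module is called urllib2; this version uses Python 3's equivalent, urllib.request, and the User-Agent header value is an arbitrary illustrative choice:

```python
# urllib_example.py -- a minimal sketch of the download script.
import sys
import urllib.request


def fetch(url):
    # Build a Request object: this describes the request (the URL plus
    # any headers) but does not contact the server until urlopen() runs.
    request = urllib.request.Request(
        url, headers={"User-Agent": "urllib-example"})
    # Opening the request sends it and returns a file-like response;
    # read() gives the raw bytes, which we decode to text.
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Take the first command-line argument as the URL to download
    # and print its contents to the terminal.
    print(fetch(sys.argv[1]))
```

Running `python urllib_example.py http://www.example.com` would print that page's HTML to the terminal.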
In this section, we will create a simple currency converter application, run from the command line, using the free-to-use Fixer.io
API (http://fixer.io) to provide the exchange rates. The rates are updated daily, which is less frequent than some paid-for APIs, but is good enough for our use.
This is a JSON API; an example URL is: http://api.fixer.io/latest?base=GBP&symbols=JPY,EUR
This requests the exchange rates for converting British pounds to Japanese yen and euros, and returns data in the following format:
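A representative response body looks like this (the date and rate values shown are illustrative, not live data):

```json
{
  "base": "GBP",
  "date": "2017-03-13",
  "rates": {
    "EUR": 1.1465,
    "JPY": 131.82
  }
}
```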
As we will see in the next piece of code, this data can be parsed using the json
Python module, which returns the JSON tree as a nested structure of Python dictionaries.
We will start by importing the required Python modules for this script, which we will save as currency_converter.py
:
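The overall shape of the converter can be sketched as follows. This version uses Python 3's urllib.request and json modules (the chapter's Python 2 code would use urllib2), and the command-line argument layout is an illustrative choice rather than the book's exact interface:

```python
# currency_converter.py -- a minimal sketch of the converter.
import json
import sys
import urllib.request

API_URL = "http://api.fixer.io/latest?base={base}&symbols={symbols}"


def get_rates(base, symbols):
    """Download the latest exchange rates from base to each symbol."""
    url = API_URL.format(base=base, symbols=",".join(symbols))
    with urllib.request.urlopen(url) as response:
        # json.loads turns the JSON document into nested dictionaries;
        # the "rates" key maps currency codes to exchange rates.
        return json.loads(response.read().decode("utf-8"))["rates"]


def convert(amount, rate):
    # One unit of the base currency buys `rate` units of the target.
    return amount * rate


if __name__ == "__main__" and len(sys.argv) == 4:
    # Example usage: python currency_converter.py 100 GBP JPY
    amount, base, target = float(sys.argv[1]), sys.argv[2], sys.argv[3]
    rates = get_rates(base, [target])
    print("{:.2f} {} = {:.2f} {}".format(
        amount, base, convert(amount, rates[target]), target))
```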
In this section, we will look at creating a simple weather forecast application using the OpenWeatherMap 5 day forecast API (http://openweathermap.org/forecast#5days), which can return an XML document containing the forecast data.
This API is accessed through a URL in the following format; in this case, we are searching for the weather in Harwell, UK:
This gives output in the following format, where the time
element is repeated for each forecast available in the 5-day range:
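An XML document of that shape can be parsed with the standard library's xml.etree.ElementTree module, as sketched below. The sample document is a trimmed, hand-written stand-in for the API's output (the element and attribute names are based on the API's XML mode but should be checked against a real response):

```python
import xml.etree.ElementTree as ET

# Illustrative sample, not a live API response.
SAMPLE = """<weatherdata>
  <forecast>
    <time from="2017-06-01T09:00:00" to="2017-06-01T12:00:00">
      <symbol name="light rain"/>
      <temperature unit="celsius" value="14.5"/>
    </time>
    <time from="2017-06-01T12:00:00" to="2017-06-01T15:00:00">
      <symbol name="clear sky"/>
      <temperature unit="celsius" value="17.2"/>
    </time>
  </forecast>
</weatherdata>"""


def forecasts(xml_text):
    """Yield (start time, description, temperature) per <time> element."""
    root = ET.fromstring(xml_text)
    # iter() walks the tree and finds every <time> element,
    # however deeply it is nested.
    for time in root.iter("time"):
        yield (time.get("from"),
               time.find("symbol").get("name"),
               float(time.find("temperature").get("value")))


for start, description, temperature in forecasts(SAMPLE):
    print(start, description, temperature)
```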
Parsing a web page using BeautifulSoup
In this section, we will use the BeautifulSoup
library to parse an HTML web page and extract information from it. This is particularly useful when you wish to interact with a web page that does not provide an API to access its data. The drawback is that an application using this method is more likely to be broken by a change in the web page's structure, whereas an API rarely changes, and when it does, developers are typically given advance warning.
In this next example, we will write a simple script to download low resolution previews of images from Pixiv (www.pixiv.net). This script will start in a similar way to the others we have written so far. Note that the UTF-8 character encoding is required here as the contents of the web pages are likely to contain Japanese characters.
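The core extraction step can be sketched as follows, using BeautifulSoup (the bs4 package, installed with pip install beautifulsoup4). The HTML below is a hand-written stand-in for a search results page, not Pixiv's real markup:

```python
from bs4 import BeautifulSoup

# Illustrative sample markup, not real Pixiv HTML.
SAMPLE_HTML = """<html><body>
  <ul class="results">
    <li><img src="http://example.com/thumb/1.jpg" alt="first"></li>
    <li><img src="http://example.com/thumb/2.jpg" alt="second"></li>
  </ul>
</body></html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# find_all() returns every matching tag; each tag's attributes
# can be read like a dictionary.
thumbnails = [img["src"] for img in soup.find_all("img")]
print(thumbnails)
```

Against a real page, the same pattern applies, but the tags would be narrowed down with class or attribute filters to select only the preview images.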
This string template is used to build the URLs that the script will request; its placeholders are filled in with the search parameters:
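The technique can be sketched like this (the template and parameter names here are illustrative, not Pixiv's real URL scheme):

```python
from urllib.parse import quote

# A URL template with placeholders for the search term and page number.
SEARCH_URL = "http://www.pixiv.net/search.php?word={word}&p={page}"


def search_url(word, page):
    # quote() percent-encodes the search term so that non-ASCII
    # characters (such as Japanese) are safe to place in a URL.
    return SEARCH_URL.format(word=quote(word), page=page)


print(search_url("猫", 1))
```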
In this chapter, we looked at the urllib2
Python module and how this can be used to download data from the internet, as well as a series of modules and libraries for parsing the data in a variety of formats once it has been downloaded.
In the next chapter, we will start looking at building complete applications as we start designing and implementing command line interfaces.