Chapter 6. Interacting with Forms
In earlier chapters, we downloaded static web pages that always return the same content. Now, in this chapter, we will interact with web pages that depend on user input and state to return relevant content. This chapter will cover the following topics:
Sending a POST request to submit a form
Using cookies to log in to a website
The high-level Mechanize module for easier form submissions
To interact with these forms, you will need a user account to log in to the website. You can register an account manually at http://example.webscraping.com/user/register. Unfortunately, we cannot automate the registration form until the next chapter, which deals with CAPTCHA.
Note
Form methods
HTML forms define two methods for submitting data to the server: GET and POST. With the GET method, data like ?name1=value1&name2=value2 is appended to the URL, which is known as a "query string". The browser sets a limit on the URL length, so this is only useful for small amounts...
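As a small illustration of the query string described in this note, the following sketch builds one and then decodes it again using Python 3's standard urllib.parse module. The /search path and the field names are placeholders chosen for the example, not part of the example website.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical form fields, following the name1/name2 pattern in the note
fields = {'name1': 'value1', 'name2': 'value2'}

# urlencode() joins the pairs with & and percent-escapes unsafe characters
query_string = urlencode(fields)

# With GET, the encoded data is appended to the URL after a ?
# (the /search path is an assumption for illustration only)
url = 'http://example.webscraping.com/search?' + query_string
print(url)

# The receiving side reverses the process to recover the fields
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```

Note that parse_qs() returns a list for each field, because the same name may legally appear more than once in a query string.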
The first form that we will automate is the Login form, which is available at http://example.webscraping.com/user/login. To understand the form, we will use Firebug Lite. With the full version of Firebug or Chrome DevTools, it is possible to just submit the form and check what data was transmitted in the network tab. However, the Lite version is restricted to viewing the structure, as follows:
The important parts here are the action, enctype, and method attributes of the form tag, and the two input fields. The action attribute sets the location where the form data will be submitted, in this case, #, which means the same URL as the Login form. The enctype attribute sets the encoding used for the submitted data, in this case, application/x-www-form-urlencoded. Also, the method attribute is set to post to submit form data in the body to the server. For the input tags, the important attribute is name, which sets the name of the field when submitted to the server.
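To make these attributes concrete, here is a minimal sketch of how such a submission could be prepared by hand with Python 3's standard library. The field names email and password and the credentials are assumptions for illustration; the real names should be read from the name attributes of the input tags, and a real form may also contain hidden fields that need to be submitted. The request is only built here, not sent.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener

LOGIN_URL = 'http://example.webscraping.com/user/login'

# Assumed field names and placeholder credentials
form_data = {'email': 'user@example.com', 'password': 'example'}

# enctype application/x-www-form-urlencoded: key=value pairs joined by &
encoded_data = urlencode(form_data).encode('ascii')

# Supplying a body makes urllib issue a POST, matching method="post".
# action="#" means the form posts back to the login URL itself.
request = Request(LOGIN_URL, data=encoded_data)

# An opener with a cookie jar keeps any session cookie across requests
cookie_jar = CookieJar()
opener = build_opener(HTTPCookieProcessor(cookie_jar))

print(request.get_method())
print(encoded_data)
# Sending the form would then be: response = opener.open(request)
```

Keeping the cookie jar attached to the opener is what would allow later requests made through the same opener to stay logged in.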
Extending the login script to update content
Now that the login automation is working, we can make the script more interesting by extending it to interact with the website and update the country data. The code used in this section is available at https://bitbucket.org/wswp/code/src/tip/chapter06/edit.py. You may have noticed an Edit link at the bottom of each country:
When logged in, this leads to another page where each property of a country can be edited:
We will make a script to increase the population of a country by one person each time it is run. The first step is to extract the current values of the country by reusing the parse_form() function:
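The idea behind this step can be sketched as follows. This is not the chapter's own parse_form() implementation but a stand-in written with Python 3's standard html.parser module, and the sample HTML, including the population value and the hidden _formkey field, is made up for the example.

```python
from html.parser import HTMLParser


class FormParser(HTMLParser):
    """Collect the name and value of every input tag in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'input':
            attrs = dict(attrs)
            # Inputs without a name are not submitted, so skip them
            if attrs.get('name'):
                self.fields[attrs['name']] = attrs.get('value')


def parse_form(html):
    """Stand-in for parse_form(): map input names to their current values."""
    parser = FormParser()
    parser.feed(html)
    return parser.fields


# Made-up HTML resembling an edit form; the real page will differ
sample_html = '''
<form action="#" method="post">
  <input name="population" type="text" value="62348447" />
  <input name="_formkey" type="hidden" value="abc123" />
  <input type="submit" value="Update" />
</form>
'''

data = parse_form(sample_html)
# Form values are strings, so convert before adding one person
data['population'] = str(int(data['population']) + 1)
print(data)
```

The updated dictionary could then be URL-encoded and posted back to the edit page in the same way as the login form.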
Automating forms with the Mechanize module
The examples built so far work, but each form requires a fair amount of work and testing. This effort can be minimized by using Mechanize, which provides a high-level interface to interact with forms. Mechanize can be installed via pip using this command:
Here is how to implement the previous population increase example with Mechanize:
This code is much simpler than the previous example because we no longer need to manage cookies and the form inputs are easily accessible. This script first creates the Mechanize browser object...
Interacting with forms is a necessary skill when scraping web pages. This chapter covered two approaches: first, analyzing the form to generate the expected POST request manually, and second, using the high-level Mechanize module.
In the following chapter, we will expand our form skillset and learn how to submit forms that require passing CAPTCHA.