Web Scraping with Python

Product type: Book
Published in: Oct 2015
Publisher: Packt
ISBN-13: 9781782164364
Pages: 174
Edition: 1st Edition
Author: Richard Penman

Chapter 6. Interacting with Forms

In earlier chapters, we downloaded static web pages that always return the same content. In this chapter, we will interact with web pages that depend on user input and state to return relevant content. The chapter covers the following topics:

  • Sending a POST request to submit a form

  • Using cookies to log in to a website

  • The high-level Mechanize module for easier form submissions

To interact with these forms, you will need a user account to log in to the website. You can register an account manually at http://example.webscraping.com/user/register. Unfortunately, we cannot automate the registration form until the next chapter, which deals with CAPTCHA.

Note

Form methods

HTML forms define two methods for submitting data to the server: GET and POST. With the GET method, data such as ?name1=value1&name2=value2 is appended to the URL; this suffix is known as a "query string". The browser sets a limit on the URL length, so this is only useful for small amounts...
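As a rough illustration of the two methods, the sketch below encodes the same fields once as a GET query string and once as a POST body. It uses Python 3's standard library rather than the Python 2 modules used elsewhere in the book, and the host, path, and field values are hypothetical placeholders.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical field names and values, following the pattern in the note.
fields = {'name1': 'value1', 'name2': 'value2'}
encoded = urlencode(fields)

# GET: the encoded fields are appended to the URL as a query string.
# The host and path here are placeholders, not a real endpoint.
get_url = 'http://example.com/search?' + encoded

# POST: the same encoded fields travel in the request body, as bytes.
post_body = encoded.encode('utf-8')

print(get_url)
# Decoding the query string recovers the original fields.
print(parse_qs(urlsplit(get_url).query))
```

The encoding step is identical in both cases; only the place where the encoded string is carried differs.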

The Login form


The first form that we will automate is the Login form, which is available at http://example.webscraping.com/user/login. To understand the form, we will use Firebug Lite. With the full version of Firebug or Chrome DevTools, it is possible to just submit the form and check what data was transmitted in the network tab. However, the Lite version is restricted to viewing the structure, as follows:

The important parts here are the action, enctype, and method attributes of the form tag, and the two input fields. The action attribute sets the location where the form data will be submitted, in this case, #, which means the same URL as the Login form. The enctype attribute sets the encoding used for the submitted data, in this case, application/x-www-form-urlencoded. Also, the method attribute is set to post to submit form data in the body to the server. For the input tags, the important attribute is name, which sets the name of the field when submitted to the server.
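The attributes described above can also be read programmatically. The following is a minimal sketch, assuming Python 3 and its built-in html.parser module rather than whatever parser the chapter's own helper uses. The FormInspector class and the sample markup are hypothetical; the markup is only shaped like the description above and is not copied from the site.

```python
from html.parser import HTMLParser

class FormInspector(HTMLParser):
    """Collect the form tag's attributes and the name/value of each input."""

    def __init__(self):
        super().__init__()
        self.form_attrs = {}
        self.inputs = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form':
            self.form_attrs = attrs
        elif tag == 'input' and attrs.get('name'):
            # Only named inputs are sent to the server on submission.
            self.inputs[attrs['name']] = attrs.get('value', '')

# Made-up markup for illustration only.
SAMPLE_HTML = '''
<form action="#" enctype="application/x-www-form-urlencoded" method="post">
  <input name="email" type="text" value="" />
  <input name="password" type="password" value="" />
  <input type="submit" value="Login" />
</form>
'''

inspector = FormInspector()
inspector.feed(SAMPLE_HTML)
print(inspector.form_attrs)
print(inspector.inputs)
```

The submit button is skipped because it has no name attribute, which matches the point that name determines what is sent to the server.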

Note

Form encoding...

Extending the login script to update content


Now that the login automation is working, we can make the script more interesting by extending it to interact with the website and update the country data. The code used in this section is available at https://bitbucket.org/wswp/code/src/tip/chapter06/edit.py. You may have noticed an Edit link at the bottom of each country:

When logged in, this leads to another page where each property of a country can be edited:

We will make a script to increase the population of a country by one person each time it is run. The first step is to extract the current values of the country by reusing the parse_form() function:

>>> import pprint
>>> import login
>>> COUNTRY_URL = 'http://example.webscraping.com/edit/United-Kingdom-239'
>>> opener = login.login_cookies()
>>> country_html = opener.open(COUNTRY_URL).read()
>>> # parse_form() is the helper being reused from earlier in the chapter
>>> data = parse_form(country_html)
>>> pprint.pprint(data)
{'_formkey': '4cf0294d-ea71-4cd8-ae2a-43d4ca0d46dd',...
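Once the current values are extracted, the remaining step is to increment the population and resubmit the form. The sketch below shows one way this might look in Python 3; it only builds the request and does not send it. The build_update_request() function and the sample values are hypothetical stand-ins for the output of parse_form().

```python
from urllib.parse import urlencode
from urllib.request import Request

COUNTRY_URL = 'http://example.webscraping.com/edit/United-Kingdom-239'

def build_update_request(url, form_data):
    """Return a POST request that resubmits form_data with population + 1."""
    updated = dict(form_data)  # copy so the caller's dict is left untouched
    updated['population'] = str(int(updated['population']) + 1)
    body = urlencode(updated).encode('utf-8')
    # Supplying a body makes urllib issue a POST rather than a GET.
    return Request(url, data=body)

# Hypothetical form values standing in for the parsed country form.
sample = {'_formkey': 'placeholder-key', 'population': '100'}
request = build_update_request(COUNTRY_URL, sample)
print(request.get_method(), request.data)
```

To actually submit the change, the request would have to be opened with the logged-in opener, so that the session cookies are sent along with it.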

Automating forms with the Mechanize module


The examples built so far work, but each form requires a fair amount of work and testing. This effort can be minimized by using Mechanize, which provides a high-level interface to interact with forms. Mechanize can be installed via pip using this command:

pip install mechanize

Here is how to implement the previous population increase example with Mechanize:

>>> import mechanize
>>> br = mechanize.Browser()
>>> br.open(LOGIN_URL)
>>> br.select_form(nr=0)
>>> br['email'] = LOGIN_EMAIL
>>> br['password'] = LOGIN_PASSWORD
>>> response = br.submit()
>>> br.open(COUNTRY_URL)
>>> br.select_form(nr=0)
>>> br['population'] = str(int(br['population']) + 1)
>>> br.submit()

This code is much simpler than the previous example because we no longer need to manage cookies and the form inputs are easily accessible. This script first creates the Mechanize browser object...
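For contrast, the cookie handling that Mechanize takes over can be sketched with the standard library alone. This is a minimal Python 3 sketch, not the chapter's own login code; make_cookie_opener() is a hypothetical helper name, and no request is sent.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener

def make_cookie_opener():
    """Return an opener that stores and resends cookies, plus its jar."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar

opener, jar = make_cookie_opener()
# No request has been sent yet, so the jar starts out empty.
print(len(jar))
```

Every request made through this opener would record cookies set by the server and send them back on later requests, which is the bookkeeping the Mechanize browser object performs internally.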

Summary


Interacting with forms is a necessary skill when scraping web pages. This chapter covered two approaches: first, analyzing the form to generate the expected POST request manually, and second, using the high-level Mechanize module.

In the following chapter, we will expand our form skill set and learn how to submit forms that require passing a CAPTCHA.
