You're reading from  Web Scraping with Python

Product type: Book
Published in: Oct 2015
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781782164364
Edition: 1st Edition
Author (1)
Richard Penman

Richard Lawson is from Australia and studied Computer Science at the University of Melbourne. Since graduating, he built a business specializing in web scraping while travelling the world, working remotely from over 50 countries. He is a fluent Esperanto speaker, conversational in Mandarin and Korean, and active in contributing to and translating open source software. He is currently undertaking postgraduate studies at Oxford University and in his spare time enjoys developing autonomous drones.

Three approaches to scrape a web page


Now that we understand the structure of this web page, we will investigate three different approaches to scraping its data: first with regular expressions, then with the popular BeautifulSoup module, and finally with the powerful lxml module.

Regular expressions

If you are unfamiliar with regular expressions or need a reminder, there is a thorough overview available at https://docs.python.org/2/howto/regex.html.

To scrape the country's area using regular expressions, we will first try matching the contents of each <td> element, as follows:

>>> import re
>>> url = 'http://example.webscraping.com/view/UnitedKingdom-239'
>>> html = download(url)
>>> re.findall('<td class="w2p_fw">(.*?)</td>', html)
['<img src="/places/static/images/flags/gb.png" />',
  '244,820 square kilometres',
  '62,348,447',
  'GB',
  'United Kingdom',
  'London',
  '<a href="/continent/EU">EU</a>',
  '.uk',
  'GBP',
  'Pound...
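
The same idea can be tried without a network connection. The following is a minimal sketch, not the book's own code: SAMPLE_HTML is a hypothetical fragment built from the first three results shown above (the real page's markup may differ), and extract_cells is an illustrative helper name. It shows that the non-greedy group (.*?) captures each cell separately, and that the area can then be picked out by position.

```python
import re

# Hypothetical stand-in for the downloaded page: three table cells that
# mirror the first three results in the output above.
SAMPLE_HTML = (
    '<tr><td class="w2p_fw"><img src="/places/static/images/flags/gb.png" /></td></tr>'
    '<tr><td class="w2p_fw">244,820 square kilometres</td></tr>'
    '<tr><td class="w2p_fw">62,348,447</td></tr>'
)

# (.*?) is non-greedy, so each match stops at the first closing </td>
# instead of running on to the last one in the document.
CELL_PATTERN = re.compile(r'<td class="w2p_fw">(.*?)</td>')


def extract_cells(html):
    """Return the inner HTML of every <td class="w2p_fw"> cell, in order."""
    return CELL_PATTERN.findall(html)


cells = extract_cells(SAMPLE_HTML)
area = cells[1]  # assumes the area is the second matching cell, as in the output above
print(area)
```

Selecting the area by index works only while the cell order stays fixed; if a row is added or removed above it, the index points at the wrong value.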