Search Using Beautiful Soup

by Vineeth G. Nair | January 2014 | Open Source Web Development

In this article by Vineeth G. Nair, the author of the book Getting Started with Beautiful Soup, we will learn the different searching methods provided by Beautiful Soup to search based on tag name, attribute values of tag, text within the document, regular expression, and so on. At the end, we will make use of these searching methods to scrape data from an online web page.


Searching with find_all()

The find() method was used to find the first result matching a particular search criterion applied on a BeautifulSoup object. As the name implies, find_all() will give us all the items matching the search criteria we defined. The different filters that we saw in find() can be used in the find_all() method. In fact, these filters can be used in any of the searching methods, such as find_parents() and find_next_siblings().
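To make the contrast concrete, here is a minimal sketch of find() versus find_all(). The markup below is an assumed fragment in the style of the book's ecosystem example (the primaryconsumerlist class):

```python
from bs4 import BeautifulSoup

# Assumed markup, following the book's ecosystem example
html = """
<ul id="primaryconsumers">
  <li class="primaryconsumerlist"><div class="name">deer</div></li>
  <li class="primaryconsumerlist"><div class="name">rabbit</div></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find() stops at the first match...
first = soup.find(class_="primaryconsumerlist")
print(first.div.string)  # deer

# ...while find_all() collects every match into a list
all_consumers = soup.find_all(class_="primaryconsumerlist")
print([li.div.string for li in all_consumers])  # ['deer', 'rabbit']
```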

Let us consider an example of using find_all().

Finding all tertiary consumers

We saw how to find the first and second primary consumers. If we need to find all the tertiary consumers, we can't use find(). In this case, find_all() comes in handy.

all_tertiaryconsumers = soup.find_all(class_="tertiaryconsumerlist")

The preceding line of code finds all the tags with the tertiaryconsumerlist class. A type check on this variable shows that it is simply a list of Tag objects, as follows:

print(type(all_tertiaryconsumers))
#output <class 'list'>

We can iterate through this list to display all tertiary consumer names by using the following code:

for tertiaryconsumer in all_tertiaryconsumers:
    print(tertiaryconsumer.div.string)
#output
#lion
#tiger

Understanding parameters used with find_all()

Like find(), the find_all() method has a similar set of parameters, plus an extra parameter, limit, as shown in the following signature:

find_all(name,attrs,recursive,text,limit,**kwargs)
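The signature above can be illustrated with a short sketch. Note how the **kwargs part works: any keyword argument that find_all() doesn't recognize is treated as an attribute filter, and the attrs parameter accepts the same filter as an explicit dictionary (the markup here is an assumed fragment for illustration):

```python
from bs4 import BeautifulSoup

html = '<ul id="producers"><li class="producerlist">plants</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# name: filter by tag name
lis = soup.find_all("li")

# **kwargs: an unrecognized keyword argument becomes an attribute filter
producers_ul = soup.find_all(id="producers")

# attrs: the same filter expressed as an explicit dictionary
same_ul = soup.find_all(attrs={"id": "producers"})

print(len(lis), producers_ul == same_ul)  # 1 True
```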

The limit parameter is used to specify a limit to the number of results that we get. For example, from the e-mail ID sample we saw, we can use find_all() to get all the e-mail IDs. Refer to the following code:

email_ids = soup.find_all(text=emailid_regexp)
print(email_ids)
#output [u'abc@example.com', u'xyz@example.com', u'foo@example.com']

Here, if we pass limit, it will limit the result set to the limit we impose, as shown in the following example:

email_ids_limited = soup.find_all(text=emailid_regexp, limit=2)
print(email_ids_limited)
#output [u'abc@example.com', u'xyz@example.com']

From the output, we can see that the result is limited to two.

The find() method is find_all() with limit=1.
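We can verify this equivalence directly; the first element of a limit=1 result set is the same object that find() returns (assumed minimal markup):

```python
from bs4 import BeautifulSoup

html = "<ul><li>plants</li><li>algae</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first_via_find = soup.find("li")
first_via_limit = soup.find_all("li", limit=1)[0]
print(first_via_find == first_via_limit)  # True
```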

We can pass True or False values to the find methods. If we pass True to find_all(), it will return all tags in the soup object. In the case of find(), it will be the first tag within the object. The print(soup.find_all(True)) line of code will print out all the tags associated with the soup object.
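A quick way to see what True matches is to collect the name of every tag it returns (assumed minimal markup):

```python
from bs4 import BeautifulSoup

html = '<div><ul id="producers"><li class="producerlist">plants</li></ul></div>'
soup = BeautifulSoup(html, "html.parser")

# True matches every tag; collecting .name shows the full tag inventory
tag_names = [tag.name for tag in soup.find_all(True)]
print(tag_names)  # ['div', 'ul', 'li']
```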

In the case of searching for text, passing True will return all text within the document as follows:

all_texts = soup.find_all(text=True)
print(all_texts)
#output [u'\n', u'\n', u'\n', u'\n', u'\n', u'plants', u'\n', u'100000', u'\n', u'\n', u'\n', u'algae', u'\n', u'100000', u'\n', u'\n', u'\n', u'\n', u'\n', u'deer', u'\n', u'1000', u'\n', u'\n', u'\n', u'rabbit', u'\n', u'2000', u'\n', u'\n', u'\n', u'\n', u'\n', u'fox', u'\n', u'100', u'\n', u'\n', u'\n', u'bear', u'\n', u'100', u'\n', u'\n', u'\n', u'\n', u'\n', u'lion', u'\n', u'80', u'\n', u'\n', u'\n', u'tiger', u'\n', u'50', u'\n', u'\n', u'\n', u'\n', u'\n']

The preceding output prints every piece of text content within the soup object, including the newline characters.

Also, in the case of text, we can pass a list of strings and find_all() will find every string defined in the list:

all_texts_in_list = soup.find_all(text=["plants", "algae"])
print(all_texts_in_list)
#output [u'plants', u'algae']

The same applies when searching for tags, attribute values of tags, custom attributes, and CSS classes.

For finding all the div and li tags, we can use the following code line:

div_li_tags = soup.find_all(["div","li"])

Similarly, for finding tags with the producerlist and primaryconsumerlist classes, we can use the following code lines:

all_css_class = soup.find_all(class_=["producerlist","primaryconsumerlist"])
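Putting the two list filters together, here is a sketch of what each returns, using an assumed fragment of the ecosystem markup:

```python
from bs4 import BeautifulSoup

# Assumed fragment of the ecosystem markup
html = """<ul>
  <li class="producerlist"><div>plants</div></li>
  <li class="primaryconsumerlist"><div>deer</div></li>
  <li class="secondaryconsumerlist"><div>fox</div></li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

# A list of tag names matches any of them
div_li_tags = soup.find_all(["div", "li"])
print(len(div_li_tags))  # 6 (three li tags plus three div tags)

# A list of class names matches tags carrying any of those classes
all_css_class = soup.find_all(class_=["producerlist", "primaryconsumerlist"])
print([tag.div.string for tag in all_css_class])  # ['plants', 'deer']
```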

Both find() and find_all() search an object's descendants, that is, its children, their children, and so on. We can control this behavior by using the recursive parameter: if recursive=False, the search happens only among the object's direct children.

For example, in the following code, the search for div and li tags happens only among the direct children. Since the only direct child of the soup object is html, the following code gives an empty list:

div_li_tags = soup.find_all(["div", "li"], recursive=False)
print(div_li_tags)
#output []

If find_all() can't find results, it will return an empty list, whereas find() returns None.
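This difference matters when guarding against missing results; the idiomatic checks differ, as this sketch shows:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>plants</p>", "html.parser")

# find_all() on a missing tag returns an empty list; find() returns None
print(soup.find_all("table"))  # []
print(soup.find("table"))      # None

# So the guards differ: an empty list is falsy...
if not soup.find_all("table"):
    print("no tables found")

# ...while a missing find() result is checked against None
if soup.find("table") is None:
    print("find returned None")
```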

Navigation using Beautiful Soup

Navigation in Beautiful Soup is almost the same as searching, except that instead of methods, there are certain attributes that facilitate the navigation. Each Tag or NavigableString object is a member of the resulting tree, with the BeautifulSoup object placed at the top and the other objects as the nodes of the tree.

The following code snippet is an example for an HTML tree:

html_markup = """<div class="ecopyramid">
  <ul id="producers">
    <li class="producerlist">
      <div class="name">plants</div>
      <div class="number">100000</div>
    </li>
    <li class="producerlist">
      <div class="name">algae</div>
      <div class="number">100000</div>
    </li>
  </ul>
</div>"""

For the previous code snippet, the following HTML tree is formed:

[Figure: HTML tree for html_markup, with the BeautifulSoup object as the root, Tag objects as the internal nodes, and NavigableString objects as the leaves]

In the figure, Beautiful Soup is the root of the tree, the Tag objects make up the different nodes of the tree, while the NavigableString objects make up the leaves of the tree.

Navigation in Beautiful Soup is intended to help us visit the nodes of this HTML/XML tree. From a particular node, it is possible to:

  • Navigate down to the children
  • Navigate up to the parent
  • Navigate sideways to the siblings
  • Navigate to the next and previous objects parsed

We will be using the previous html_markup as an example to discuss the different navigations using Beautiful Soup.
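As a preview, the four directions listed above can be sketched on this markup with navigation attributes and methods. One detail worth hedging: with whitespace in the markup, .next_sibling lands on a newline NavigableString, so find_next_sibling() is used here to skip to the next tag:

```python
from bs4 import BeautifulSoup

html_markup = """<div class="ecopyramid">
  <ul id="producers">
    <li class="producerlist">
      <div class="name">plants</div>
      <div class="number">100000</div>
    </li>
    <li class="producerlist">
      <div class="name">algae</div>
      <div class="number">100000</div>
    </li>
  </ul>
</div>"""
soup = BeautifulSoup(html_markup, "html.parser")

producer = soup.find("li")
# down to the children
print(producer.div.string)                           # plants
# up to the parent
print(producer.parent["id"])                         # producers
# sideways: find_next_sibling() skips whitespace NavigableStrings
print(producer.div.find_next_sibling("div").string)  # 100000
print(producer.find_next_sibling("li").div.string)   # algae
```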

Summary

In this article, we discussed in detail the different search methods in Beautiful Soup, namely find(), find_all(), find_next(), and find_parents(); saw code examples for a scraper that uses search methods to get information from a website; and understood how the search methods can be applied in combination.

We also discussed in detail the different navigation methods provided by Beautiful Soup: methods specific to navigating downwards, upwards, sideways, and to the previous and next elements of the HTML tree.

About the Author


Vineeth G. Nair

Vineeth G. Nair completed his bachelors in Computer Science and Engineering from Model Engineering College, Cochin, Kerala. He is currently working with Oracle India Pvt. Ltd. as a Senior Applications Engineer.

He developed an interest in Python during his college days and began working as a freelance programmer. This led him to work on several web scraping projects using Beautiful Soup. It helped him gain a fair level of mastery on the technology and a good reputation in the freelance arena. He can be reached at vineethgnair.mec@gmail.com. You can visit his website at www.kochi-coders.com.
