Getting Started with Beautiful Soup — Save 50%
Build your own web scraper and learn all about web scraping with Beautiful Soup with this book and ebook
In this article by Vineeth G. Nair, the author of the book Getting Started with Beautiful Soup, we will learn the different searching methods provided by Beautiful Soup to search based on tag name, attribute values of tag, text within the document, regular expression, and so on. At the end, we will make use of these searching methods to scrape data from an online web page.
Searching with find_all()
The find() method returns the first result that matches the search criteria we apply to a BeautifulSoup object. As the name implies, find_all() gives us all the items that match those criteria. The filters we used with find() work with find_all() as well. In fact, these filters can be used in any of the searching methods, such as find_parents() and find_next_siblings().
Let us consider an example of using find_all().
Finding all tertiary consumers
We saw how to find the first and second primary consumers. If we need to find all the tertiary consumers, we can't use find(), because it returns only a single result. In this case, find_all() comes in handy.
all_tertiaryconsumers = soup.find_all(class_="tertiaryconsumerslist")
The preceding code line finds all the tags with the tertiaryconsumerslist class. If we check the type of this variable, we can see that it is a ResultSet, a subclass of list that holds Tag objects:
print(type(all_tertiaryconsumers))
#output <class 'bs4.element.ResultSet'>
We can iterate through this list to display all tertiary consumer names by using the following code:
for tertiaryconsumer in all_tertiaryconsumers:
    print(tertiaryconsumer.div.string)
#output
#lion
#tiger
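To try the steps above in isolation, here is a minimal, self-contained sketch. The markup is a trimmed-down, hypothetical stand-in for the article's sample page, with the class name taken from the find_all() call above, and the built-in "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup used only for this illustration.
html_markup = """
<ul id="tertiaryconsumers">
  <li class="tertiaryconsumerslist">
    <div class="name">lion</div>
    <div class="number">80</div>
  </li>
  <li class="tertiaryconsumerslist">
    <div class="name">tiger</div>
    <div class="number">50</div>
  </li>
</ul>
"""

soup = BeautifulSoup(html_markup, "html.parser")

# find_all() returns every tag whose class matches, in document order.
all_tertiaryconsumers = soup.find_all(class_="tertiaryconsumerslist")

# tag.div is shorthand for the first <div> found inside each <li>.
names = [item.div.string for item in all_tertiaryconsumers]
for name in names:
    print(name)

# The result behaves like an ordinary list (indexing, len(), iteration).
print(isinstance(all_tertiaryconsumers, list))
```

Because the result supports everything a list does, it can be sliced, counted with len(), or passed to any code that expects a list.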
Understanding parameters used with find_all()
Like find(), the find_all() method takes a similar set of parameters, with one extra parameter, limit, as shown in the following code line:
find_all(name,attrs,recursive,text,limit,**kwargs)
The limit parameter is used to specify a limit to the number of results that we get. For example, from the e-mail ID sample we saw, we can use find_all() to get all the e-mail IDs. Refer to the following code:
email_ids = soup.find_all(text=emailid_regexp)
print(email_ids)
#output [u'email@example.com', u'firstname.lastname@example.org', u'email@example.com']
Here, if we pass limit, it will limit the result set to the limit we impose, as shown in the following example:
email_ids_limited = soup.find_all(text=emailid_regexp,limit=2)
print(email_ids_limited)
#output [u'email@example.com', u'firstname.lastname@example.org']
From the output, we can see that the result is limited to two.
The find() method behaves like find_all() with limit=1, except that it returns the single result itself (or None) instead of a list.
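The effect of limit, and its relationship to find(), can be checked with a small sketch. The markup, the addresses, and the regular expression below are hypothetical and chosen only for illustration. The string argument is the newer name for text (Beautiful Soup 4.4.0 and later); on older versions, text should work the same way.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup and addresses, for illustration only.
html_markup = """
<ul>
  <li>alice@example.com</li>
  <li>bob@example.org</li>
  <li>carol@example.net</li>
</ul>
"""
soup = BeautifulSoup(html_markup, "html.parser")

# A deliberately simple pattern; real e-mail validation is more involved.
emailid_regexp = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

# Without limit: every matching string in the document.
all_ids = soup.find_all(string=emailid_regexp)

# With limit=2: the search stops after the first two matches.
first_two = soup.find_all(string=emailid_regexp, limit=2)

# find() gives the first match as a single object, not a list.
first_only = soup.find(string=emailid_regexp)

print(len(all_ids), len(first_two), first_only)
```

Since limit stops the search early, it can also save work on large documents when only the first few matches are needed.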
We can also pass True or False values to the find methods. If we pass True to find_all(), it will return all the tags in the soup object. In the case of find(), it will return the first tag within the object. The print(soup.find_all(True)) line of code prints all the tags in the soup object.
In the case of searching for text, passing True will return all text within the document as follows:
all_texts = soup.find_all(text=True)
print(all_texts)
#output [u'\n', u'\n', u'\n', u'\n', u'\n', u'plants', u'\n', u'100000', u'\n', u'\n', u'\n', u'algae', u'\n', u'100000', u'\n', u'\n', u'\n', u'\n', u'\n', u'deer', u'\n', u'1000', u'\n', u'\n', u'\n', u'rabbit', u'\n', u'2000', u'\n', u'\n', u'\n', u'\n', u'\n', u'fox', u'\n', u'100', u'\n', u'\n', u'\n', u'bear', u'\n', u'100', u'\n', u'\n', u'\n', u'\n', u'\n', u'lion', u'\n', u'80', u'\n', u'\n', u'\n', u'tiger', u'\n', u'50', u'\n', u'\n', u'\n', u'\n', u'\n']
The preceding output contains every piece of text within the soup object, including the newline characters between tags.
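The newline-only entries are usually noise. The following sketch, which uses a hypothetical one-item version of the sample markup, shows what passing True returns for tags and for text, and two ways of dropping the whitespace-only strings. The string argument is the newer name for text (Beautiful Soup 4.4.0 and later).

```python
from bs4 import BeautifulSoup

# Hypothetical one-item version of the sample markup.
html_markup = """<ul id="producers">
<li class="producerlist">
<div class="name">plants</div>
<div class="number">100000</div>
</li>
</ul>"""
soup = BeautifulSoup(html_markup, "html.parser")

# True as the name filter matches every tag, in document order.
tag_names = [tag.name for tag in soup.find_all(True)]

# True as the string filter matches every text node, newlines included.
all_texts = soup.find_all(string=True)

# Option 1: filter out whitespace-only strings by hand.
non_blank = [text for text in all_texts if text.strip()]

# Option 2: let Beautiful Soup strip and skip them for us.
stripped = list(soup.stripped_strings)

print(tag_names)
print(non_blank)
```

Both options give the same strings here; stripped_strings additionally trims leading and trailing whitespace from each string it yields.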
Also, in the case of text, we can pass a list of strings and find_all() will find every string defined in the list:
all_texts_in_list = soup.find_all(text=["plants","algae"])
print(all_texts_in_list)
#output [u'plants', u'algae']
The same applies when searching for tags, attribute values of a tag, custom attributes, and CSS classes: each of these filters also accepts a list, and an item matches if it matches any entry in the list.
For finding all the div and li tags, we can use the following code line:
div_li_tags = soup.find_all(["div","li"])
Similarly, for finding tags with the producerlist and primaryconsumerlist classes, we can use the following code lines:
all_css_class = soup.find_all(class_=["producerlist","primaryconsumerlist"])
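A short sketch of both list filters follows, run against a hypothetical, shortened version of the sample markup (one item per list, plus an extra secondaryconsumerlist item to show what is excluded).

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup for illustration.
html_markup = """
<ul id="producers">
  <li class="producerlist"><div class="name">plants</div></li>
</ul>
<ul id="primaryconsumers">
  <li class="primaryconsumerlist"><div class="name">deer</div></li>
</ul>
<ul id="secondaryconsumers">
  <li class="secondaryconsumerlist"><div class="name">fox</div></li>
</ul>
"""
soup = BeautifulSoup(html_markup, "html.parser")

# A list of tag names matches any tag whose name is in the list.
div_li_tags = soup.find_all(["div", "li"])
tag_names = [tag.name for tag in div_li_tags]

# A list of class names matches tags carrying any one of those classes.
all_css_class = soup.find_all(class_=["producerlist", "primaryconsumerlist"])
matched_classes = [tag["class"][0] for tag in all_css_class]

print(tag_names)
print(matched_classes)
```

Results come back in document order, not in the order of the entries in the list, and the secondaryconsumerlist item is left out of the class search because its class is not listed.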
Both find() and find_all() search all of an object's descendants: its children, their children, and so on. We can control this behavior by using the recursive parameter. If recursive=False, the search covers only the object's direct children.
For example, in the following code, search happens only at direct children for div and li tags. Since the direct child of the soup object is html, the following code will give an empty list:
div_li_tags = soup.find_all(["div","li"],recursive=False)
print(div_li_tags)
#output []
If find_all() can't find results, it will return an empty list, whereas find() returns None.
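The following sketch puts the recursive parameter and the two "no result" return values side by side. The markup is a hypothetical minimal page, and table is simply a tag name chosen because it does not occur in it.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal page for illustration.
html_markup = "<html><body><ul><li><div>plants</div></li></ul></body></html>"
soup = BeautifulSoup(html_markup, "html.parser")

# The only direct child of the soup object is <html>, so nothing matches.
top_level = soup.find_all(["div", "li"], recursive=False)

# <li> is a direct child of <ul>, so it is found without recursion...
direct_li = soup.ul.find_all("li", recursive=False)

# ...but <div> is a grandchild of <ul>, so it is not.
direct_div = soup.ul.find_all("div", recursive=False)

# Different "nothing found" values: an empty list versus None.
no_tables = soup.find_all("table")
no_table = soup.find("table")

# Guard against None before touching attributes of a find() result.
table_text = no_table.get_text() if no_table is not None else ""

print(top_level, len(direct_li), direct_div, no_tables, no_table)
```

The None check matters in practice: calling a method on the result of a failed find() raises an AttributeError, whereas looping over an empty find_all() result simply does nothing.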
Navigation using Beautiful Soup
Navigation in Beautiful Soup is closely related to searching, but instead of methods it relies on attributes of the objects in the parse tree. Each Tag or NavigableString object is a member of that tree, with the BeautifulSoup object placed at the top and the other objects as the nodes of the tree.
The following code snippet is an example for an HTML tree:
html_markup = """<div class="ecopyramid">
<ul id="producers">
<li class="producerlist">
<div class="name">plants</div>
<div class="number">100000</div>
</li>
<li class="producerlist">
<div class="name">algae</div>
<div class="number">100000</div>
</li>
</ul>
</div>"""
For the previous code snippet, the following HTML tree is formed:
In the previous figure, we can see that the BeautifulSoup object is the root of the tree, the Tag objects make up the different nodes of the tree, and the NavigableString objects make up the leaves of the tree.
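The same structure can be printed as an indented outline. The describe() helper below is a hypothetical function written for this illustration, not part of Beautiful Soup, and the markup is a shortened, whitespace-free variant of html_markup so that no newline-only strings appear in the tree.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Shortened, whitespace-free variant of html_markup (an assumption made
# here to keep the outline small).
compact_markup = (
    '<div class="ecopyramid"><ul id="producers"><li class="producerlist">'
    '<div class="name">plants</div><div class="number">100000</div>'
    '</li></ul></div>'
)
soup = BeautifulSoup(compact_markup, "html.parser")


def describe(node, depth=0, lines=None):
    """Collect one outline line per node, indented by tree depth."""
    if lines is None:
        lines = []
    indent = "  " * depth
    if isinstance(node, Tag):
        # BeautifulSoup is itself a Tag subclass, so the root lands here too.
        lines.append(f"{indent}{type(node).__name__}: {node.name}")
        for child in node.children:
            describe(child, depth + 1, lines)
    elif isinstance(node, NavigableString) and node.strip():
        lines.append(f"{indent}NavigableString: {node.strip()}")
    return lines


outline = describe(soup)
print("\n".join(outline))
```

The first line of the outline is the BeautifulSoup root, whose name is the special value [document]; every string appears one level below the tag that contains it.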
Navigation in Beautiful Soup is intended to help us visit the nodes of this HTML/XML tree. From a particular node, it is possible to:
- Navigate down to the children
- Navigate up to the parent
- Navigate sideways to the siblings
- Navigate to the next and previous objects parsed
We will be using the previous html_markup as an example to discuss the different navigations using Beautiful Soup.
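As a quick preview of the four directions listed above, here is a minimal sketch. It assumes a whitespace-free copy of html_markup; with the original line breaks, next_sibling and next_element would often return newline strings between the tags instead.

```python
from bs4 import BeautifulSoup

# Whitespace-free copy of html_markup (an assumption made for clarity).
compact_markup = (
    '<div class="ecopyramid"><ul id="producers">'
    '<li class="producerlist"><div class="name">plants</div>'
    '<div class="number">100000</div></li>'
    '<li class="producerlist"><div class="name">algae</div>'
    '<div class="number">100000</div></li>'
    '</ul></div>'
)
soup = BeautifulSoup(compact_markup, "html.parser")
first_li = soup.find("li")

# Down: iterate over the direct children of the first <li>.
child_classes = [child["class"][0] for child in first_li.children]

# Up: the parent of the <li> is the <ul id="producers"> tag.
parent_id = first_li.parent["id"]

# Sideways: the next sibling is the second <li>.
second_li = first_li.next_sibling
second_name = second_li.div.string

# Parse order: the element parsed right after a <div> is its own text.
next_parsed = first_li.div.next_element
previous_parsed = first_li.div.previous_element

print(child_classes, parent_id, second_name, next_parsed)
```

Note the difference between the last two directions: next_sibling stays on the same level of the tree, while next_element follows the order in which the parser encountered the objects, so it can step down into a tag's contents.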
In this article, we discussed in detail the different search methods in Beautiful Soup, namely, find(), find_all(), find_next(), and find_parents(); code examples for a scraper using search methods to get information from a website; and understanding the application of search methods in combination.
We also discussed in detail the different navigation methods provided by Beautiful Soup, methods specific to navigating downwards and upwards, and sideways, to the previous and next elements of the HTML tree.
About the Author
Vineeth G. Nair completed his bachelor's degree in Computer Science and Engineering at Model Engineering College, Cochin, Kerala. He is currently working with Oracle India Pvt. Ltd. as a Senior Applications Engineer.
He developed an interest in Python during his college days and began working as a freelance programmer. This led him to work on several web scraping projects using Beautiful Soup, which helped him gain a fair level of mastery of the technology and a good reputation in the freelance arena. He can be reached at firstname.lastname@example.org. You can visit his website at www.kochi-coders.com.