Search Using Beautiful Soup

Learn how to extract information from websites using Beautiful Soup and the Python urllib2 module. This practical, hands-on guide covers everything you need to know to get a head start in website scraping.


Searching with find_all()

The find() method returns the first result that matches the search criteria we apply to a BeautifulSoup object. As the name implies, find_all() returns every item that matches those criteria. The filters we saw with find() also work with find_all(). In fact, they work with any of the search methods, such as find_parents() and find_next_siblings().

Let us consider an example of using find_all().

Finding all tertiary consumers

We saw how to find the first and second primary consumers. If we need to find all the tertiary consumers, we can't use find(). In this case, find_all() comes in handy.

all_tertiaryconsumers = soup.find_all(class_="tertiaryconsumerlist")

The preceding code line finds all the tags with the tertiaryconsumerlist class. If we check the type of this variable, we can see that it is a ResultSet, a subclass of list that holds Tag objects, as follows:

print(type(all_tertiaryconsumers))
#output
<class 'bs4.element.ResultSet'>

We can iterate through this list to display all tertiary consumer names by using the following code:

for tertiaryconsumer in all_tertiaryconsumers:
    print(tertiaryconsumer.div.string)
#output
lion
tiger
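The steps above can be tried end to end with a small self-contained sketch. The markup here is a hypothetical fragment modelled on the ecological pyramid example, and the choice of the built-in html.parser is an assumption; the article's full document may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the article's ecological pyramid markup.
html_markup = """
<ul id="tertiaryconsumers">
  <li class="tertiaryconsumerlist">
    <div class="name">lion</div>
    <div class="number">80</div>
  </li>
  <li class="tertiaryconsumerlist">
    <div class="name">tiger</div>
    <div class="number">50</div>
  </li>
</ul>
"""

# html.parser ships with Python, so no extra parser needs to be installed.
soup = BeautifulSoup(html_markup, "html.parser")

# Every tag whose CSS class is tertiaryconsumerlist.
all_tertiaryconsumers = soup.find_all(class_="tertiaryconsumerlist")

# tag.div is shorthand for tag.find("div"): the first div inside each li.
names = [tag.div.string for tag in all_tertiaryconsumers]

for name in names:
    print(name)

# The result behaves like a list because it is a list subclass.
print(isinstance(all_tertiaryconsumers, list))
```

Because the result is a list subclass, the usual list operations such as len(), indexing, and slicing work on it directly.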

Understanding parameters used with find_all()

The find_all() method takes the same parameters as find(), plus one extra parameter, limit, as shown in the following signature:

find_all(name,attrs,recursive,text,limit,**kwargs)

The limit parameter is used to specify a limit to the number of results that we get. For example, from the e-mail ID sample we saw, we can use find_all() to get all the e-mail IDs. Refer to the following code:

email_ids = soup.find_all(text=emailid_regexp)
print(email_ids)
#output
[u'abc@example.com',u'xyz@example.com',u'foo@example.com']

If we pass limit, find_all() stops after collecting that many results, as shown in the following example:

email_ids_limited = soup.find_all(text=emailid_regexp,limit=2)
print(email_ids_limited)
#output
[u'abc@example.com',u'xyz@example.com']

From the output, we can see that the result is limited to two.

The find() method behaves like find_all() with limit=1, except that it returns the single result itself rather than a list containing it.
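The effect of limit, and its relationship to find(), can be checked with a short sketch. The markup and the regular expression below are hypothetical stand-ins, since the article's e-mail sample and its emailid_regexp are defined elsewhere. The sketch also assumes a Beautiful Soup release of 4.4 or later, where the text argument is also available under the name string.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample; the article's own e-mail markup may differ.
html_markup = """
<p>abc@example.com</p>
<p>xyz@example.com</p>
<p>foo@example.com</p>
"""

# Hypothetical pattern standing in for the article's emailid_regexp.
emailid_regexp = re.compile(r"\w+@\w+\.\w+")

soup = BeautifulSoup(html_markup, "html.parser")

# string= is the newer name for the text= argument used in the article.
email_ids = soup.find_all(string=emailid_regexp)
email_ids_limited = soup.find_all(string=emailid_regexp, limit=2)

# find() hands back the first match itself; find_all(limit=1) wraps it in a list.
first_match = soup.find(string=emailid_regexp)
first_as_list = soup.find_all(string=emailid_regexp, limit=1)

print(len(email_ids))
print(email_ids_limited)
print(first_match == first_as_list[0])
```

A limit is mainly useful on large documents, since the search can stop as soon as enough matches have been collected.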

We can pass True or False values to the find methods. If we pass True to find_all(), it returns all the tags in the soup object. In the case of find(), it returns the first tag within the object. The print(soup.find_all(True)) line of code prints out all the tags in the soup object.

In the case of searching for text, passing True will return all text within the document as follows:

all_texts = soup.find_all(text=True)
print(all_texts)
#output
[u'\n', u'\n', u'\n', u'\n', u'\n', u'plants', u'\n', u'100000', u'\n', u'\n', u'\n', u'algae', u'\n', u'100000', u'\n', u'\n', u'\n', u'\n', u'\n', u'deer', u'\n', u'1000', u'\n', u'\n', u'\n', u'rabbit', u'\n', u'2000', u'\n', u'\n', u'\n', u'\n', u'\n', u'fox', u'\n', u'100', u'\n', u'\n', u'\n', u'bear', u'\n', u'100', u'\n', u'\n', u'\n', u'\n', u'\n', u'lion', u'\n', u'80', u'\n', u'\n', u'\n', u'tiger', u'\n', u'50', u'\n', u'\n', u'\n', u'\n', u'\n']

The preceding output contains every piece of text within the soup object, including the newline characters.

Also, in the case of text, we can pass a list of strings and find_all() will find every string defined in the list:

all_texts_in_list = soup.find_all(text=["plants","algae"])
print(all_texts_in_list)
#output
[u'plants', u'algae']

The same applies when searching for tags, attribute values of a tag, custom attributes, and CSS classes.

For finding all the div and li tags, we can use the following code line:

div_li_tags = soup.find_all(["div","li"])

Similarly, for finding tags with either the producerlist or the primaryconsumerlist class, we can use the following code line:

all_css_class = soup.find_all(class_=["producerlist","primaryconsumerlist"])
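The True filter and the list filters described above can be compared side by side in one sketch. The markup is a hypothetical, shortened version of the pyramid document, and string= is used as the newer name for the text argument (an assumption that Beautiful Soup 4.4 or later is installed).

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the article's pyramid markup.
html_markup = """
<ul id="producers">
  <li class="producerlist"><div class="name">plants</div></li>
  <li class="producerlist"><div class="name">algae</div></li>
</ul>
<ul id="primaryconsumers">
  <li class="primaryconsumerlist"><div class="name">deer</div></li>
</ul>
"""

soup = BeautifulSoup(html_markup, "html.parser")

# True matches every tag, in document order.
all_tags = soup.find_all(True)

# A list matches any tag whose name is one of the items.
div_li_tags = soup.find_all(["div", "li"])

# A list of class names matches tags carrying any one of them.
all_css_class = soup.find_all(class_=["producerlist", "primaryconsumerlist"])

# A list of strings matches text nodes equal to any one of them.
matching_texts = soup.find_all(string=["plants", "algae"])

print([tag.name for tag in all_tags])
print([tag.name for tag in div_li_tags])
print(len(all_css_class))
print(matching_texts)
```

In each case the list acts as an "or" filter: an item is returned if it matches any entry in the list.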

Both find() and find_all() search all of an object's descendants: its children, their children, and so on. We can control this behavior with the recursive parameter. If recursive=False, the search covers only the object's direct children.

For example, the following code searches only the direct children of the soup object for div and li tags. Since the direct child of the soup object is html, the code returns an empty list:

div_li_tags = soup.find_all(["div","li"],recursive=False)
print(div_li_tags)
#output
[]

If find_all() can't find results, it will return an empty list, whereas find() returns None.
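A short sketch makes the difference between recursive and non-recursive searches concrete, along with the two "nothing found" return values. The markup is a hypothetical fragment wrapped in html and body tags so that html is the direct child of the soup object.

```python
from bs4 import BeautifulSoup

# Hypothetical document; html is the only tag directly under the soup object.
html_markup = """
<html>
<body>
<div class="ecopyramid">
<ul id="producers">
<li class="producerlist"><div class="name">plants</div></li>
</ul>
</div>
</body>
</html>
"""

soup = BeautifulSoup(html_markup, "html.parser")

# No div or li sits directly under the soup object, so nothing matches.
top_level_matches = soup.find_all(["div", "li"], recursive=False)
top_level_first = soup.find("div", recursive=False)

# Starting from the ul tag, li is a direct child but div is a grandchild.
ul_tag = soup.find("ul")
direct_li = ul_tag.find_all("li", recursive=False)
direct_div = ul_tag.find_all("div", recursive=False)
nested_div = ul_tag.find_all("div")

print(top_level_matches)   # empty list from find_all()
print(top_level_first)     # None from find()
print(len(direct_li), len(direct_div), len(nested_div))
```

Because find() can return None, it is worth checking the result before accessing attributes on it, whereas an empty list from find_all() can be iterated over safely.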

Navigation using Beautiful Soup

Navigation in Beautiful Soup works much like searching, except that it relies on attributes rather than methods. Each Tag or NavigableString object is a member of the resulting tree, with the BeautifulSoup object at the top and the other objects as the nodes of the tree.

The following code snippet is an example of markup that forms an HTML tree:

html_markup = """<div class="ecopyramid">
<ul id="producers">
<li class="producerlist">
<div class="name">plants</div>
<div class="number">100000</div>
</li>
<li class="producerlist">
<div class="name">algae</div>
<div class="number">100000</div>
</li>
</ul>
</div>"""

For the previous code snippet, the following HTML tree is formed:

In this tree, the BeautifulSoup object is the root, the Tag objects make up the different nodes, and the NavigableString objects make up the leaves.
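The shape of this tree can also be printed directly. The following is a minimal sketch, with describe() being a hypothetical helper written for this illustration (it is not part of Beautiful Soup). The markup is a shortened version of html_markup, written without whitespace between tags so that no whitespace-only strings appear as leaves.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened version of html_markup with no whitespace between tags.
html_markup = (
    '<div class="ecopyramid"><ul id="producers">'
    '<li class="producerlist">'
    '<div class="name">plants</div><div class="number">100000</div>'
    "</li></ul></div>"
)

soup = BeautifulSoup(html_markup, "html.parser")


def describe(node, depth=0):
    """Hypothetical helper: return one indented line per node of the tree."""
    indent = "  " * depth
    if isinstance(node, NavigableString):
        # Text nodes are leaves, so there is nothing to recurse into.
        return [f"{indent}NavigableString: {str(node)!r}"]
    lines = [f"{indent}{type(node).__name__}: {node.name}"]
    for child in node.children:
        lines.extend(describe(child, depth + 1))
    return lines


tree_lines = describe(soup)
print("\n".join(tree_lines))
```

The first line printed is the BeautifulSoup object itself, whose name is [document]; every other non-leaf line is a Tag, and every leaf is a NavigableString.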

Navigation in Beautiful Soup is intended to help us visit the nodes of this HTML/XML tree. From a particular node, it is possible to:

  • Navigate down to the children
  • Navigate up to the parent
  • Navigate sideways to the siblings
  • Navigate to the next and previous objects parsed

We will be using the previous html_markup as an example to discuss the different navigations using Beautiful Soup.
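As a brief preview, the four directions listed above correspond to attributes such as .children, .parent, .next_sibling, and .next_element. The sketch below uses the same content as html_markup but written without whitespace between tags; that is an assumption made here so that sibling navigation lands on tags rather than on whitespace strings.

```python
from bs4 import BeautifulSoup

# Same content as html_markup, but with no whitespace between tags.
html_markup = (
    '<div class="ecopyramid"><ul id="producers">'
    '<li class="producerlist">'
    '<div class="name">plants</div><div class="number">100000</div>'
    "</li>"
    '<li class="producerlist">'
    '<div class="name">algae</div><div class="number">100000</div>'
    "</li>"
    "</ul></div>"
)

soup = BeautifulSoup(html_markup, "html.parser")
first_li = soup.find("li")

# Down: the direct children of the first li tag.
child_classes = [child["class"][0] for child in first_li.children]

# Up: the tag that contains the first li tag.
parent_name = first_li.parent.name

# Sideways: the next node at the same level, here the second li tag.
sibling_name = first_li.next_sibling.div.string

# Next and previous objects parsed, regardless of nesting level.
next_parsed = first_li.next_element
previous_parsed = first_li.previous_element

print(child_classes)
print(parent_name)
print(sibling_name)
print(next_parsed.name, previous_parsed.name)
```

Note that .next_element follows parse order, so it steps into the first child of the li tag, while .next_sibling stays at the same level of the tree.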

Summary

In this article, we discussed the different search methods in Beautiful Soup, namely find(), find_all(), find_next(), and find_parents(). We also looked at code examples for a scraper that uses search methods to get information from a website, and at how search methods can be applied in combination.

We also discussed the navigation methods provided by Beautiful Soup: navigating downwards, upwards, and sideways, and to the previous and next elements of the HTML tree.
