Automate it! - Recipes to upskill your business

By Chetan Giridhar

About this book

This book gives you a selection of recipes for automating your business processes with Python, and helps you understand how Python can make time-consuming, repetitive business tasks more efficient. Python is a mature, high-level language with object-oriented programming features, a huge set of modules, and great community support, and it powers a wide variety of apps. Python is easy to use, can help you get complex tasks done efficiently, and is an apt choice for our needs.

With a classic problem-solution approach and real-world examples, you will delve into things that automate your business processes. You will begin by learning about the Python modules to work with the Web, worksheets, presentations, and PDFs. You'll leverage Python recipes to automate processes in HR and Finance, making them efficient and reliable. For instance, company payroll, an integral process in HR, will be automated with Python recipes.

A few chapters of this book will also help you gain knowledge on working with bots and computer vision. You will learn how to build bots for automating business use cases by integrating artificial intelligence. You’ll also understand how Python is helpful in face detection and building a scanner of your own. You will see how to effectively and easily use Python code to manage SMS and voice notifications, opening a world of possibilities using cloud telephony to solve your business needs. Moving forward, you will learn to work with APIs, Webhooks and Emails to automate Marketing and Customer Support processes. Finally, using the various Python libraries, this book will arm you with knowledge to customize data solutions and generate reports to meet your business needs.

This book will help you up-skill and make your business processes efficient with the various Python recipes it covers.

Publication date: January 2017
Publisher: Packt
Pages: 392
ISBN: 9781786460516
Download code from GitHub

Chapter 1. Working with the Web

Can you imagine a life without the Internet? For almost everything, right from exchanging information to ordering food, we rely heavily on the Internet today. Let's go through the interesting world of the World Wide Web and cover numerous ways in which we can interact with it using Python modules.

In this chapter, we will cover the following recipes:

Making HTTP requests
A brief look at web scraping
Parsing and extracting web content
Downloading content from the Web
Working with third-party REST APIs
Asynchronous HTTP server in Python
Web automation with selenium bindings
Automating lead generation with web scraping

Introduction

The Internet has made life so easy that sometimes you just don't realize the power of it. Checking out your friend's status, calling your parents, responding to an important business e-mail, or playing a game: we rely on the World Wide Web (WWW) today for almost everything.

Thankfully, Python has a rich set of modules that help us perform various tasks on the Web. Phew! Not only can you make simple HTTP requests to retrieve data from websites or download pages and images, you can also parse the page content to gather information and analyze it to generate meaningful insights with Python. And wait; did I mention that you could spawn a browser in an automated fashion to perform a daily mundane task?

The recipes in this chapter will primarily focus on the Python modules that can be treated as the tool of choice while performing the preceding operations on the Web. Specifically, we will focus on the following Python modules in this chapter:

requests (http://docs.python-requests.org/en/master/)
urllib2 (https://docs.python.org/2/library/urllib2.html)
lxml (https://pypi.python.org/pypi/lxml)
BeautifulSoup4 (https://pypi.python.org/pypi/beautifulsoup4)
selenium (http://selenium-python.readthedocs.org/)

Note

While the recipes in this chapter will give you an overview of how to interact with the Web using Python modules, I encourage you to try out and develop code for multiple use cases, which will benefit you as an individual and your project on an organizational scale.

Making HTTP requests

Throughout the following recipes in this chapter, we will use Python v2.7 and the requests (v2.9.1) module of Python. This recipe will show you how to make HTTP requests to web pages on the Internet.

But before going there, let's understand the Hypertext Transfer Protocol (HTTP) in brief. HTTP is a stateless application protocol for data communication on the WWW. A typical HTTP session involves a sequence of request and response transactions. The client initiates a TCP connection to the server on a dedicated IP address and port; when the server receives the request, it responds with a response code and text. HTTP defines request methods (HTTP verbs such as GET and POST), which indicate the desired action to be taken on the given Web URL.
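To make the request and response cycle concrete, here is a minimal sketch that starts a throwaway HTTP server on the local machine and sends it a single GET request. It uses only the Python 3 standard library (not the Python 2.7 used in the recipes). The HelloHandler class, the response body, and the choice of port 0 (which lets the operating system pick a free port) are assumptions made for illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a short text body."""

    def do_GET(self):
        body = b"hello from the server"
        self.send_response(200)                       # the response code
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)                        # the response text

    def log_message(self, *args):
        pass  # keep the example output quiet


# The server listens on a dedicated IP and port (port 0 = any free port).
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
host, port = server.server_address
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client opens a TCP connection and sends a request with the GET verb.
conn = http.client.HTTPConnection(host, port, timeout=5)
conn.request("GET", "/")
response = conn.getresponse()
status, text = response.status, response.read().decode()
conn.close()

server.shutdown()
server.server_close()

print(status, text)
```

Each call to conn.request() is an independent transaction; the server keeps no memory of earlier requests, which is what "stateless" means here.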

In this recipe, we'll learn how to make HTTP GET/POST requests using Python's requests module. We'll also learn how to POST json data and handle HTTP exceptions. Cool, let's jump in.

Getting ready

To step through this recipe, you will need to install Python v2.7. Once installed, you will need to install Python pip. PIP stands for Pip Installs Packages and is a program that can be used to download and install the required Python packages on your computer. Lastly, we'll need the requests module to make HTTP requests.

We will start by installing the requests module (I'll leave the Python and pip installation for you to perform on your machine, based on your operating system). No other prerequisites are required. So, hurry up and let's get going!

How to do it...

On your Linux/Mac computer, go to Terminal and run the following command:
```
        pip install -U requests
```
You only need to use sudo if you don't have permissions to Python site packages, else sudo is not required.

The following code helps you make an HTTP GET request with Python's requests module:

        import requests

        r = requests.get('http://ip.jsontest.com/')
        print("Response object:", r)
        print("Response Text:", r.text)

You will observe the following output:
Creating an HTTP GET request with a data payload is also trivial with requests. The following code shows how to do this, and how to check the request URL that is sent:
```
        payload = {'q': 'chetan'}
        r = requests.get('https://github.com/search', params=payload)
        print("Request URL:", r.url)
```
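If you want to inspect the URL without sending anything over the network, one option is to build a prepared request. This is a minimal sketch in Python 3 syntax; the variable names prepared and request_url are chosen here for illustration.

```python
import requests

payload = {'q': 'chetan'}

# Build the request without sending it, so no network access is needed.
prepared = requests.Request('GET', 'https://github.com/search',
                            params=payload).prepare()
request_url = prepared.url
print("Request URL:", request_url)
```

The params dictionary ends up in the query string, which is why the URL ends with ?q=chetan.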

Let's now make an HTTP POST call using the requests module. This is similar to filling up and posting a login or signup form on a website:

        payload = {'key1': 'value1'}
        r = requests.post("http://httpbin.org/post", data=payload)
        print("Response text:", r.json())
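The recipe introduction also mentions posting JSON data. As a hedged sketch (Python 3 syntax, nothing is sent over the network), the difference between the data= and json= arguments can be seen by preparing both requests and comparing their headers and bodies. The variable names form_request and json_request are illustrative.

```python
import json
import requests

payload = {'key1': 'value1'}
url = 'http://httpbin.org/post'

# data= encodes the dictionary like an HTML form submission.
form_request = requests.Request('POST', url, data=payload).prepare()

# json= serializes the dictionary as a JSON document instead.
json_request = requests.Request('POST', url, json=payload).prepare()

print("Form:", form_request.headers['Content-Type'], form_request.body)
print("JSON:", json_request.headers['Content-Type'],
      json.loads(json_request.body))
```

To actually send JSON, the same keyword works with requests.post(url, json=payload).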

Handling errors and exceptions is also very convenient with requests. The following code snippet shows an example of error handling. If you run this code without an Internet connection on your machine, it will result in an exception. The exception handler catches the exception and states that it failed to establish a new connection, as expected:
```
        try:
            r = requests.get("http://www.google.com/")
        except requests.exceptions.RequestException as e:
            # Print the exception itself; the .message attribute is not
            # available on exceptions in Python 3.
            print("Error Response:", e)
```
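In practice it often helps to tell different failures apart. The following is a minimal sketch, in Python 3 syntax, of a hypothetical fetch_text() helper that sets a timeout, treats 4xx/5xx responses as errors, and catches the more specific exception classes before the generic RequestException. To keep the example runnable offline, it is called with a URL that lacks a scheme, which requests rejects before any connection is attempted.

```python
import requests


def fetch_text(url, timeout=5):
    """Return the body of url, or None if the request fails."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()  # turn 4xx/5xx responses into HTTPError
        return r.text
    except requests.exceptions.ConnectionError as e:
        print("Could not connect:", e)
    except requests.exceptions.Timeout as e:
        print("Timed out:", e)
    except requests.exceptions.RequestException as e:
        # Base class: catches everything else raised by requests
        print("Request failed:", e)
    return None


# No "http://" prefix, so requests raises MissingSchema without any
# network traffic; the base-class handler catches it.
result = fetch_text("ip.jsontest.com")
print("Result:", result)
```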

How it works...

In this recipe, we looked at how to make different types of HTTP requests with Python's requests module. Let's look at how this code works:

In the first example, we made a GET request to http://ip.jsontest.com and got the response code and response text. It returns the current IP address of our computer on the Internet.
In the second example, we made an HTTP GET request with the payload data. Look how the request URL contains ?q=chetan, and it searches all the repositories by the name, Chetan, on GitHub.
Next, we made a POST request with the payload data being {'key1': 'value1'}. This is like submitting an online form, as we observed in the How to do it section.
The requests module has a Response object, r, which exposes various attributes and methods. These help in extracting the response body, status code, and other information required while working with the Web:
- r.status_code - Returns the HTTP response code
- r.json() - Parses a JSON response body into Python objects
- r.text - Returns the response body as a decoded string
- r.content - Returns the raw response body as bytes
- r.url - Returns the final Web URL of the request made
We also looked at the exception handling with the requests module, wherein, if there was no Internet connection, an exception occurred and our code could easily catch it. This was achieved with the RequestException class from the requests.exceptions module.

There's more...

Cool, that was neat! Making HTTP requests on the Web is just the beginning. There's still more in terms of what we can do with the Web, such as working with page contents. So, let's see what's next.

A brief look at web scraping

Before we learn how to perform web scraping, let's understand what scraping means. In the Web world, scraping is a way to sift through the pages of a website with a computer program, with the intention of extracting the required information in a specified format. For example, if I want to get the title and date of all the articles published on a blog, I could write a program to scrape through the blog, get the required data, and store it in a database or a flat file, based on the requirement.

Web scraping is often confused with web crawling. The web crawler is a bot that systematically browses the Web with the purpose of web indexing and is used by search engines to index web pages so that users can search the Web more effectively.

But scraping is not easy. The data, which is interesting to us, is available on a blog or website in a particular format, say XML tags or embedded in HTML tags. So, it is important for us to know the format before we begin extracting the data we need. Also, the web scraper should know the format in which the extracted data needs to be stored in order to act on it later. It is also important to understand that the scraping code will fail should the HTML or XML format change, even though the browser display may be the same.

Legality of web scraping

Web scraping has always been under the scanner in legal terms. Can you do web scraping? How legal or ethical is it? Can we use the data obtained from scraping for profit?

This subject has been under a lot of discussion, but at a high level, you may get into issues with web scraping if you scrape the Web for copyrighted information, violate the Computer Fraud and Abuse Act, or violate a website's terms of service. For instance, if you're scraping the Web to get public data, you should still be fine. However, it is very contextual and you need to be careful about what you're scraping and how you are using the data.

Here are a few pointers on the Web on data scraping:

Getting ready

We take an example of pricing data from the https://github.com/ website to demonstrate web scraping with Python. This is a really trivial example but gets us up to speed with scraping. Let's get started and scrape some interesting data with this Python recipe.

How to do it...

Open the Google Chrome browser on your computer and open the https://github.com/pricing/ web page. On this page, you will notice multiple pricing plans, namely Personal, Organization, and Enterprise.
Now, on your browser, right-click on the pricing of the Personal plan and click on Inspect, as shown in the following screenshot:
Once you click on Inspect, Chrome's Developer Tools panel opens up, which will help you understand the HTML structure of GitHub's pricing page, as follows:
If you look at the highlighted HTML span - $7, you'll see that this web page uses the default-currency class to list the pricing of plans. We'll now use this property to extract the prices of multiple GitHub plans.
But before doing that, let's install the Python module, lxml, which will be needed to extract content from the preceding HTML document. Install the lxml and requests modules:
```
        pip install lxml

        pip install requests
```

Now, open your favorite editor and type this code snippet:

        from lxml import html
        import requests

        page = requests.get('https://github.com/pricing/')
        # Parse the raw page content into an HTML element tree
        tree = html.fromstring(page.content)
        print("Page Object:", tree)
        # Select the plan names and prices by their CSS classes
        plans = tree.xpath('//h2[@class="pricing-card-name alt-h3"]/text()')
        pricing = tree.xpath('//span[@class="default-currency"]/text()')
        print("Plans:", plans, "\nPricing:", pricing)

If you look at the preceding code, we used the default-currency class and the pricing-card-name alt-h3 classes to get the pricing and pricing plans. If you run the code snippet, the output of the program will be as follows:

Note
With web scraping, you will see issues when the HTML tags for the web content have changed. For instance, if a CSS class name gets changed or an anchor is replaced with a button, the scraping code may not fetch the data you need. So, make sure you change your Python code accordingly.

How it works...

As we discussed earlier, we need to find out an appropriate way of extracting information. So, in this example, we first got the HTML tree for the https://github.com/pricing/ page. We got the tree with the help of the fromstring() method, which parses the contents of the page (a string) into an HTML element tree.

Then, using the lxml module and the tree.xpath() method, we looked for the default-currency class and the pricing-card-name alt-h3 classes to get the pricing and pricing plans.

See how we used the complete XPath, //h2[@class="pricing-card-name alt-h3"], to locate the pricing plans and the //span[@class="default-currency"] XPath to select the actual pricing data. Once the elements were selected, we printed the text data that was returned to us as a Python list.
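Because the live pricing page changes over time, the same idea can be tried offline. The sketch below (Python 3 syntax) swaps lxml for the standard library's xml.etree.ElementTree, which supports only a subset of XPath, and runs against a hypothetical, well-formed snippet. The plan-name class and the $9 price are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a pricing page.
snippet = """
<div>
  <h2 class="plan-name">Personal</h2>
  <span class="default-currency">$7</span>
  <h2 class="plan-name">Organization</h2>
  <span class="default-currency">$9</span>
</div>
"""

tree = ET.fromstring(snippet)

# ElementTree's XPath subset includes .//tag[@attr='value'].
plans = [el.text for el in tree.findall(".//h2[@class='plan-name']")]
pricing = [el.text for el in tree.findall(".//span[@class='default-currency']")]

# An empty result usually means the markup changed and the XPath is stale.
if not plans or not pricing:
    raise ValueError("Expected elements not found; check the page structure")

print("Plans:", plans, "\nPricing:", pricing)
```

Real-world HTML is rarely well-formed XML, which is one reason the recipe uses lxml's HTML parser rather than ElementTree.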

That's it; we scraped the GitHub page for the required data. Nice and simple.

There's more...

You learnt what web scrapers are, and how they go ahead and extract interesting information from the Web. You also understood how they are different from web crawlers. But then, there's always something more!

Web scraping involves extraction, which cannot happen until we parse the HTML content from the web page to get the data interesting to us. In the next recipe, we'll learn about parsing HTML and XML content in detail.

Parsing and extracting web content

Well, now we're confident about making HTTP requests to multiple URLs. We also looked at a simple example of web scraping.

But WWW is made up of pages with multiple data formats. If we want to scrape the Web and make sense of the data, we should also know how to parse different formats in which data is available on the Web.

In this recipe, we'll discuss how to s.

Getting ready

Data on the Web is mostly in the HTML or XML format. To understand how to parse web content, we'll take an example of an HTML file. We'll learn how to select certain HTML elements and extract the desired data. For this recipe, you need to install the BeautifulSoup module of Python. The BeautifulSoup module is one of the most comprehensive Python modules that will do a good job of parsing HTML content. So, let's get started.

How to do it...

We start by installing BeautifulSoup on our Python instance. The following command will help us install the module. We install the latest version, which is beautifulsoup4:
```
        pip install beautifulsoup4
```

Now, let's take a look at the following HTML file, which will help us learn how to parse the HTML content:

        <html xmlns="http://www.w3.org/1999/html">
        <head>
            <title>Enjoy Facebook!</title> 
        </head>
        <body>
            <p>
              <span>You know it's easy to get intouch with
              your <strong>Friends</strong> on web!<br></span>
              Click here <a href="https://facebook.com">here</a>
              to sign up and enjoy<br>
            </p>
            <p class="wow"> Your gateway to social web! </p>
            <div id="inventor">Mark Zuckerberg</div>
            Facebook, a webapp used by millions
        </body>
        </html>

Let's name this file python.html. Our HTML file is hand-crafted so that we can learn the multiple ways of parsing it to get the required data from it. python.html has typical HTML tags given as follows:
- <head> - It is the container of all head elements like <title>.
- <body> - It defines the body of the HTML document.
- <p> - This element defines a paragraph in HTML.
- <span> - It is used to group inline elements in a document.
- <strong> - It is used to apply a bold style to the text present under this tag.
- <a> - It represents a hyperlink or anchor and contains the href attribute that points to the link target.
- class - It is an attribute that points to a class in a style sheet.
- <div> - It is a container that encapsulates other page elements and divides the content into sections. Every section can be identified by its id attribute.
If we open this HTML in a browser, this is how it'll look:
Let's now write some Python code to parse this HTML file. We start by creating a BeautifulSoup object.
Tip
We always need to define the parser. In this case we used lxml as the parser. The parser helps us read files in a designated format so that querying data becomes easy.
```
        import bs4
        myfile = open('python.html')
        soup = bs4.BeautifulSoup(myfile, "lxml")
        #Making the soup
        print "BeautifulSoup Object:", type(soup)
```
The output of the preceding code is seen in the following screenshot:
OK, that's neat, but how do we retrieve data? Before we try to retrieve data, we need to select the HTML elements that contain the data we need.

We can select or find HTML elements in different ways. We could select elements with ID, CSS, or tags. The following code uses python.html to demonstrate this concept:

        #Find Elements By tags
        print soup.find_all('a')
        print soup.find_all('strong')
        #Find Elements By id
        print soup.find('div', {"id":"inventor"})
        print soup.select('#inventor')
        #Find Elements by css
        print soup.select('.wow')

The output of the preceding code can be viewed in the following screenshot:

Now let's move on and get the actual content from the HTML file. The following are a few ways in which we can extract the data of interest:

        print "Facebook URL:", soup.find_all('a')[0]['href']
        print "Inventor:", soup.find('div', {"id":"inventor"}).text 
        print "Span content:", soup.select('span')[0].getText()

The output of the preceding code snippet is as follows:

Whoopie! See how we got all the text we wanted from the HTML elements.

How it works...

In this recipe, you learnt the skill of finding or selecting different HTML elements based on ID, CSS, or tags.

In the second code example of this recipe, we used find_all('a') to get all the anchor elements from the HTML file. The find_all() method returns every match as a list. The select() method takes a CSS selector and also returns a list of the matching elements, whereas find() returns only the first match.

We also used find('div', <divId>) or select(<divId>) to select HTML elements by div ID. Note how we selected the inventor element with div ID #inventor in two ways using the find() and select() methods. Actually, the select method can also be used as select(<class-name>) to select HTML elements with a CSS class name. We used this method to select the element with the wow class in our example.

In the third code example, we searched for all the anchor elements in the HTML page and looked at the first index with soup.find_all('a')[0]. Note that since we have only one anchor tag, we used the index 0 to select that element; if we had multiple anchor tags, the second one could be accessed with index 1. Methods like getText() and attributes like text (as seen in the preceding examples) help in extracting the actual content from the elements.
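For comparison, here is a rough sketch of the same kind of extraction done with only the Python 3 standard library's html.parser module instead of BeautifulSoup. The Extractor class and the shortened HTML_DOC string are assumptions made for this illustration; the extra bookkeeping shows how much work find() and select() save you.

```python
from html.parser import HTMLParser

# Shortened, hypothetical version of python.html
HTML_DOC = """
<p class="wow"> Your gateway to social web! </p>
<p>Click <a href="https://facebook.com">here</a> to sign up</p>
<div id="inventor">Mark Zuckerberg</div>
"""


class Extractor(HTMLParser):
    """Hypothetical helper: collects link targets and the #inventor text."""

    def __init__(self):
        super().__init__()
        self.links = []           # href of every <a> tag
        self.inventor = ""        # text inside <div id="inventor">
        self._in_inventor = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        if tag == "div" and attrs.get("id") == "inventor":
            self._in_inventor = True

    def handle_endtag(self, tag):
        if tag == "div":
            self._in_inventor = False

    def handle_data(self, data):
        if self._in_inventor:
            self.inventor += data


parser = Extractor()
parser.feed(HTML_DOC)
parser.close()

print("Facebook URL:", parser.links[0])
print("Inventor:", parser.inventor)
```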

There's more...

Cool, so we understood how to parse a web page (or an HTML page) with Python. You also learnt how to select or find HTML elements by ID, CSS, or tags. We also looked at examples of how to extract the required content from HTML. What if we want to download the contents of a page or file from the Web? Let's see if we can achieve that in our next recipe.

Downloading content from the Web

So, in the earlier recipe, we saw how to make HTTP requests, and you also learnt how to parse a web response. It's time to move ahead and download content from the Web. You know that the WWW is not just about HTML pages. It contains other resources, such as text files, documents, and images, among many other formats. Here, in this recipe, you'll learn ways to download images in Python with an example.

Getting ready

To download images, we will need two Python modules, namely BeautifulSoup and urllib2. We could use the requests module instead of urllib2, but this will help you learn about urllib2 as an alternative that can be used for HTTP requests, so you can boast about it.

How to do it...

Before starting this recipe, we need to answer two questions. What kind of images would we like to download? From which location on the Web do I download the images? In this recipe, we download Avatar movie images from Google (https://google.com) images search. We download the top five images that match the search criteria. For doing this, let's import the Python modules and define the variables we'll need:
```
        from bs4 import BeautifulSoup
        import re
        import urllib2
        import os 
        ## Download parameters
        image_type = "Project"
        movie = "Avatar"
        url = "https://www.google.com/search?q="+movie+"&source=lnms&tbm=isch"
```
OK then, let's now create a BeautifulSoup object with URL parameters and appropriate headers. See the use of User-Agent while making HTTP calls with Python's urllib2 module. The requests module uses its own User-Agent while making HTTP calls:
```
        header = {'User-Agent': 'Mozilla/5.0'}
        # Prepare the request with the custom header, then open it
        request = urllib2.Request(url, headers=header)
        soup = BeautifulSoup(urllib2.urlopen(request))
```
Google images are hosted as static content under the domain name http://www.gstatic.com/. So, using the BeautifulSoup object, we now try to find all the images whose source URL contains http://www.gstatic.com/. The following code does exactly the same thing:
```
        images = [a['src'] for a in soup.find_all("img", {"src":
                  re.compile("gstatic.com")})][:5]
        for img in images:
            print "Image Source:", img
```
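The pattern "gstatic.com" contains an unescaped dot, which in a regular expression matches any character. A slightly stricter filter can be sketched as follows (Python 3 syntax), applied here to a hypothetical list of src values rather than a live results page:

```python
import re

# Hypothetical src values, as they might appear on a results page.
sources = [
    "https://encrypted-tbn0.gstatic.com/images?q=abc",
    "https://www.google.com/logo.png",
    "https://encrypted-tbn1.gstatic.com/images?q=def",
]

# Escape the dot so that it only matches a literal "."
pattern = re.compile(r"gstatic\.com")

# Keep at most the first five matching URLs
images = [src for src in sources if pattern.search(src)][:5]
for img in images:
    print("Image Source:", img)
```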
The output of the preceding code snippet can be seen in the following screenshot. Note how we get the image source URL on the Web for the top five images:

Now that we have the source URL of all the images, let's download them. The following Python code uses the urlopen() method to read() the image and downloads it onto the local file system:

        for img in images:
            raw_img = urllib2.urlopen(img).read()
            # Number the file after the images already saved in this directory
            cntr = len([i for i in os.listdir(".") if image_type in i]) + 1
            f = open(image_type + "_" + str(cntr) + ".jpg", 'wb')
            f.write(raw_img)
            f.close()

When the images get downloaded, we can see them on our editor. The following snapshot shows the top five images we downloaded and Project_3.jpg looks as follows:

How it works...

So, in this recipe, we looked at downloading content from the Web. First, we defined the parameters for download. Parameters are like configurations that define the location where the downloadable resource is available and what kind of content is to be downloaded. In our example, we defined that we have to download Avatar movie images and, that too, from Google.

Then we created the BeautifulSoup object, which will make the URL request using the urllib2 module. Actually, urllib2.Request() prepares the request with the configuration, such as headers and the URL itself, and urllib2.urlopen() actually makes the request. We wrapped the HTML response of the urlopen() method and created a BeautifulSoup object so that we could parse the HTML response.
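The split between preparing and sending a request can be seen without touching the network. This sketch uses urllib.request, the Python 3 counterpart of urllib2; building the query with urlencode() instead of string concatenation is a choice made here, not something the recipe does.

```python
import urllib.parse
import urllib.request

movie = "Avatar"

# urlencode() escapes the query values, so titles with spaces also work.
query = urllib.parse.urlencode({"q": movie, "source": "lnms", "tbm": "isch"})
url = "https://www.google.com/search?" + query

# Request() only prepares the request; nothing is sent yet.
request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
print(request.get_method(), request.full_url)
# urllib stores header names in capitalized form ("User-agent")
print(request.get_header("User-agent"))

# urlopen() is the call that would actually hit the network:
# html = urllib.request.urlopen(request, timeout=10).read()
```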

Next, we used the soup object to search for the top five images present in the HTML response. We searched for images based on the img tag with the find_all() method. find_all() returns a list of the matching img tags, and we read the src attribute of each tag to get the URL where the picture is available on Google.
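
To make the filtering step concrete without needing BeautifulSoup installed, here is a rough stand-in built on the standard library's html.parser module (Python 3). It is not the recipe's method: the class and function names are made up for this sketch, and the sample HTML is invented. It applies the same idea, though: keep img tags whose src matches a pattern, then take the first five.

```python
import re
from html.parser import HTMLParser


class ImageSourceCollector(HTMLParser):
    """Collects the src of every <img> tag whose src matches a pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src and self.pattern.search(src):
            self.sources.append(src)


def top_image_sources(html, pattern=r"gstatic\.com", limit=5):
    """Return up to `limit` matching image URLs, in document order."""
    collector = ImageSourceCollector(pattern)
    collector.feed(html)
    return collector.sources[:limit]


# Invented sample markup standing in for a search results page.
sample_html = (
    '<img src="https://encrypted-tbn0.gstatic.com/images?q=1">'
    '<img src="https://www.example.com/logo.png">'
    '<img src="https://encrypted-tbn0.gstatic.com/images?q=2"/>'
)
matches = top_image_sources(sample_html)
for src in matches:
    print("Image Source:", src)
```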

Finally, we iterated through all the URLs and again used the urlopen() method on each URL to read() the images. The read() method returns the image in a raw format as binary data. We then wrote this raw image to a file on our local file system. We also added logic to name the images (the counter auto-increments) so that they're uniquely identified in the local file system.
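
The naming logic can be pulled out into a small helper and tried on its own. The sketch below is written for Python 3; the function names, the "Avatar" prefix, and the fake image bytes are assumptions made for illustration, and a temporary directory is used so that nothing is written next to your scripts:

```python
import os
import tempfile


def next_image_name(image_type, directory="."):
    """Return '<image_type>_<n>.jpg', where n is one more than the number
    of files in `directory` whose names already contain `image_type`."""
    existing = [name for name in os.listdir(directory) if image_type in name]
    return "%s_%d.jpg" % (image_type, len(existing) + 1)


def save_image(raw_img, image_type, directory="."):
    """Write the raw bytes to the next free file name and return its path."""
    path = os.path.join(directory, next_image_name(image_type, directory))
    with open(path, "wb") as f:
        f.write(raw_img)
    return path


# Demonstration with made-up bytes in a throwaway directory.
demo_dir = tempfile.mkdtemp()
first = save_image(b"fake image bytes", "Avatar", demo_dir)
second = save_image(b"more fake bytes", "Avatar", demo_dir)
print(first)
print(second)
```

Because the counter is derived from the directory listing, this scheme assumes that no files of the same type are deleted between downloads; otherwise a name could be reused.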

That's nice! Exactly what we wanted to achieve! Now let's up the ante a bit and see what else we can explore in the next recipe.

Working with third-party REST APIs

Now that we've covered ground on scraping, crawling, and parsing, it's time for another interesting task that we can do with Python: working with third-party APIs. I assume many of us are aware of REST APIs and have a basic understanding of them. So, let's get started!

Getting ready

To demonstrate, we take the case of GitHub gists. Gists in GitHub are the best way to share your work, be it a small code snippet that helps your colleague or a small app with multiple files that gives an understanding of a concept. GitHub allows the creation, listing, deleting, and updating of gists, and it presents a classic case of working with GitHub REST APIs.

So, in this section, we use our very own requests module to make HTTP requests to GitHub REST API to create, update, list, or delete gists.

The following steps will show you how to work with GitHub REST APIs using Python.

How to do it...

To work with GitHub REST APIs, we need to create a personal access token. To do that, log in to https://github.com/, browse to https://github.com/settings/tokens, and click on Generate new token:

You'll now be taken to the New personal access token page. Enter a description at the top of the page and check the gists option among the scopes on offer. Note that the scope represents the access granted to your token. For instance, if you just select gists, you can use GitHub APIs to work on the gists resource but not on other resources such as repo or users. For this recipe, the gists scope is just what we need:

Once you click on Generate token, you'll be presented with a screen containing your personal access token. Keep this token confidential.

With the access token available, let's start working with APIs and create a new gist. With create, we add a new resource, and for doing this, we make an HTTP POST request on GitHub APIs, such as in the following code:

        import requests
        import json
        BASE_URL = 'https://api.github.com'
        Link_URL = 'https://gist.github.com'
        username = '<username>' ## Fill in your github username
        api_token = '<api_token>'  ## Fill in your token
        header = {  'X-Github-Username': '%s' % username,
                    'Content-Type': 'application/json',
                    'Authorization': 'token %s' % api_token,
        }
        url = "/gists" 
        data ={
          "description": "the description for this gist",
          "public": True,
          "files": { 
            "file1.txt": { 
              "content": "String file contents" 
            } 
          }
        }
        r = requests.post('%s%s' % (BASE_URL, url),
                          headers=header,
                          data=json.dumps(data))
        print r.json()['url']

If I now go to my gists page on GitHub, I should see the newly created gist. And voila, it's available!

Hey, we were successful in creating the gist with the GitHub APIs. That's cool, but can we now view this gist? In the preceding example, we also printed the API URL of the newly created gist, https://api.github.com/gists/<gist_id>; the same gist_id appears in the gist's page URL, which is in the format https://gist.github.com/<username>/<gist_id>. We now use this gist_id to get the details of the gist, which means we make an HTTP GET request on the gist_id:

        import requests
        import json
        BASE_URL = 'https://api.github.com'
        Link_URL = 'https://gist.github.com'

        username = '<username>'
        api_token = '<api_token>'
        gist_id = '<gist_id>'

        header = { 'X-Github-Username': '%s' % username,
                   'Content-Type': 'application/json',
                   'Authorization': 'token %s' % api_token,
        }
        url = "/gists/%s" % gist_id
        r = requests.get('%s%s' % (BASE_URL, url),
                          headers=header)
        print r.json()
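
The gist_id placeholder above has to come from somewhere. Whether you start from the API URL of a gist or from its page URL, the id is the final path segment, so a one-line helper is enough to recover it. This is a small sketch; the function name and the sample id abc123 are made up for illustration:

```python
def gist_id_from_url(gist_url):
    """Return the last path segment of a gist URL, which is the gist id."""
    return gist_url.rstrip("/").split("/")[-1]


# Hypothetical URLs; abc123 is not a real gist id.
api_style = "https://api.github.com/gists/abc123"
page_style = "https://gist.github.com/<username>/abc123/"

print(gist_id_from_url(api_style))
print(gist_id_from_url(page_style))
```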

We created a new gist with the HTTP POST request and got the details of the gist with the HTTP GET request in the previous steps. Now, let's update this gist with the HTTP PATCH request.
Note
Many third-party libraries choose to use the PUT request to update a resource, but HTTP PATCH can also be used for this operation, as chosen by GitHub.

The following code demonstrates updating the gist:

        import requests
        import json

        BASE_URL = 'https://api.github.com'
        Link_URL = 'https://gist.github.com'

        username = '<username>'
        api_token = '<api_token>'
        gist_id = '<gist_id>'

        header = { 'X-Github-Username': '%s' % username,
                   'Content-Type': 'application/json',
                   'Authorization': 'token %s' % api_token,
        }
        data = {   "description": "Updating the description for this gist",
                   "files": {
                     "file1.txt": {
                       "content": "Updating file contents.."
                     }
                   } 
        }
        url = "/gists/%s" % gist_id
        r = requests.patch('%s%s' %(BASE_URL, url), 
                           headers=header,
                           data=json.dumps(data))
        print r.json()

Now, if I look at my GitHub login and browse to this gist, the contents of the gist have been updated. Awesome! Don't forget to see the Revisions in the screenshot--see it got updated to revision 2:

Now comes the most destructive API operation: deleting the gist. GitHub provides an API for removing the gist by making use of the HTTP DELETE operation on its /gists/<gist_id> resource. The following code helps us delete the gist:

        import requests
        import json
        BASE_URL = 'https://api.github.com'
        Link_URL = 'https://gist.github.com'
        username = '<username>'
        api_token = '<api_token>'
        gist_id = '<gist_id>'

        header = {  'X-Github-Username': '%s' % username,
                    'Content-Type': 'application/json', 
                    'Authorization': 'token %s' % api_token,
        }
        url = "/gists/%s" % gist_id 
        r = requests.delete('%s%s' %(BASE_URL, url),
                            headers=header, )

Let's quickly find out whether the gist is still available on the GitHub website. We can do that by browsing to the gist URL in any web browser. And what does the browser say? It says 404 resource not found, so we have successfully deleted the gist! Refer to the following screenshot:
Finally, let's list all the gists in your account. For this we make an HTTP GET API call on the /users/<username>/gists resource:

        import requests

        BASE_URL = 'https://api.github.com'
        Link_URL = 'https://gist.github.com'

        username = '<username>'      ## Fill in your github username 
        api_token = '<api_token>'  ## Fill in your token

        header = {  'X-Github-Username': '%s' % username, 
                    'Content-Type': 'application/json',
                    'Authorization': 'token %s' % api_token,
        }
        url = "/users/%s/gists" % username
        r = requests.get('%s%s' % (BASE_URL, url),
                          headers=header)
        gists = r.json()
        for gist in gists:
            data = gist['files'].values()[0]
            print data['filename'], \
                  data['raw_url'], data['language']

The output of the preceding code for my account is as follows:

How it works...

Python's requests module helps in making HTTP GET/POST/PUT/PATCH and DELETE API calls on GitHub's resources. These operations, also known as HTTP verbs in the REST terminology, are responsible for taking certain actions on the URL resources.

As we saw in the examples, the HTTP GET request helps in listing the gists, POST creates a new gist, PATCH updates a gist, and DELETE completely removes the gist. Thus, in this recipe, you learnt how to work with third-party REST APIs--an essential part of WWW today--using Python.
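
The five snippets in this recipe differ only in the verb, the path, and the optional body, so the mapping can be written down as data. The following sketch (Python 3) only builds a description of each call and sends nothing; the names GIST_ACTIONS and build_request and the id abc123 are made up for illustration, while the paths, verbs, and headers are the ones used above:

```python
import json

BASE_URL = "https://api.github.com"

# action -> (HTTP verb, path template), as used in this recipe
GIST_ACTIONS = {
    "create": ("POST", "/gists"),
    "read": ("GET", "/gists/{gist_id}"),
    "update": ("PATCH", "/gists/{gist_id}"),
    "delete": ("DELETE", "/gists/{gist_id}"),
    "list": ("GET", "/users/{username}/gists"),
}


def build_request(action, api_token, payload=None, **path_params):
    """Return the method, URL, headers and (optional) JSON body for an action."""
    method, template = GIST_ACTIONS[action]
    kwargs = {
        "method": method,
        "url": BASE_URL + template.format(**path_params),
        "headers": {
            "Content-Type": "application/json",
            "Authorization": "token %s" % api_token,
        },
    }
    if payload is not None:
        kwargs["data"] = json.dumps(payload)
    return kwargs


update = build_request("update", "<api_token>",
                       payload={"description": "new text"}, gist_id="abc123")
listing = build_request("list", "<api_token>", username="<username>")
print(update["method"], update["url"])
print(listing["method"], listing["url"])
```

With the requests module from this recipe, a dictionary built this way could be passed along as requests.request(**update), since requests.request() accepts the method and URL followed by the same keyword arguments as requests.get() and friends.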

Asynchronous HTTP server in Python

If you think about it, many of the web applications that we interact with are synchronous by default. A client connection gets established for every request made by the client and a callable method gets invoked on the server side. The server performs the business operation and writes the response body to the client socket. Once the response has been written, the client connection gets closed. All these operations happen in sequence one after the other--hence, synchronous.

But the Web today, as we see it, cannot rely on synchronous modes of operation only. Consider the case of a website that queries data from the Web and retrieves the information for you. (For instance, your website allows for integration with Facebook and every time a user visits a certain page of your website, you pull data from their Facebook account.) Now, if we develop this web application in a synchronous manner, for every request made by the client, the server would make an I/O call to either the database or over the network to retrieve information and then present it back to the client. If these I/O requests take a long time to respond, the server gets blocked waiting for the response. Typically, web servers maintain a thread pool that handles multiple requests from clients. If the threads spend long enough waiting on such calls, the thread pool may soon get exhausted and the server will stall.

Solution? In comes the asynchronous way of doing things!

Getting ready

For this recipe, we will use Tornado, an asynchronous framework developed in Python. It has support for both Python 2 and Python 3 and was originally developed at FriendFeed (http://blog.friendfeed.com/). Tornado uses non-blocking network I/O and solves the problem of scaling to tens of thousands of live connections (the C10K problem). I like this framework and enjoy developing code with it. I hope you will too! Before we get into the How to do it section, let's first install tornado by executing the following command:

    pip install -U tornado

How to do it...

We're now ready to develop our own HTTP server that works on an asynchronous philosophy. The following code represents an asynchronous server developed in the tornado web framework:

        import tornado.ioloop
        import tornado.web
        import httplib2

        class AsyncHandler(tornado.web.RequestHandler):
            @tornado.web.asynchronous
            def get(self):
                http = httplib2.Http()
                self.response, self.content = \
                    http.request("http://ip.jsontest.com/", "GET")
                self._async_callback(self.response, self.content)

            def _async_callback(self, response, content):
                print "Content:", content
                print "Response:\nStatusCode: %s Location: %s" \
                    % (response['status'], response['content-location'])
                self.finish()
                tornado.ioloop.IOLoop.instance().stop()

        application = tornado.web.Application([
            (r"/", AsyncHandler)], debug=True)

        if __name__ == "__main__":
            application.listen(8888)
            tornado.ioloop.IOLoop.instance().start()

Run the server as:
```
        python tornado_async.py
```
The server is now running on port 8888 and ready to receive requests.
Now, launch any browser of your choice and browse to http://localhost:8888/. On the server, you'll see the following output:

How it works...

Our asynchronous web server is now up and running and accepting requests on port 8888. But what is asynchronous about this? In fact, tornado works on the philosophy of a single-threaded event loop. This event loop keeps polling for events and passes them on to the corresponding event handlers.
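
To see what "polling for events and passing them on to handlers" looks like in its barest form, here is a sketch using Python 3's standard selectors module rather than tornado itself. It performs a single turn of a hand-rolled loop; the handler name and the message are made up, and a connected socket pair stands in for a real client connection:

```python
import selectors
import socket

# DefaultSelector picks the most efficient mechanism the platform offers.
selector = selectors.DefaultSelector()
handled = []


def on_readable(sock):
    """Event handler: runs only once the loop sees data waiting on `sock`."""
    handled.append(sock.recv(1024))


# A connected socket pair stands in for a client connection.
reader, writer = socket.socketpair()
reader.setblocking(False)
selector.register(reader, selectors.EVENT_READ, data=on_readable)

writer.send(b"hello event loop")

# One turn of the loop: poll for ready events, dispatch each to its handler.
for key, _events in selector.select(timeout=1):
    key.data(key.fileobj)

selector.unregister(reader)
reader.close()
writer.close()
selector.close()
print(handled)
```

A real event loop simply repeats the select-and-dispatch step forever, on one thread, across all registered connections.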

In the preceding example, when the app is run, it starts by running the ioloop. The ioloop is a single-threaded event loop and is responsible for receiving requests from the clients. We have defined the get() method, which is decorated with @tornado.web.asynchronous, which makes it asynchronous. When a user makes an HTTP GET request on http://localhost:8888/, the get() method is triggered, which internally makes an I/O call to http://ip.jsontest.com.

Now, a typical synchronous web server would wait for the response of this I/O call and block the request thread. But tornado, being an asynchronous framework, triggers a task, adds it to a queue, makes the I/O call, and returns the thread of execution back to the event loop.

The event loop now keeps monitoring the task queue and polls for a response from the I/O call. When the event is available, it executes the event handler, _async_callback(), to print the content and its response and then stops the event loop.
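
The payoff of handing control back to the event loop is easiest to see when several slow calls are in flight at once. The sketch below uses Python 3's built-in asyncio rather than tornado, and asyncio.sleep() as a stand-in for a slow network call; the function names and delays are made up for illustration:

```python
import asyncio
import time


async def fake_io_call(name, delay):
    """Stand-in for a slow network call; awaiting hands control to the loop."""
    await asyncio.sleep(delay)
    return "%s done" % name


async def main():
    # Three "requests" wait on I/O at the same time, on a single thread.
    return await asyncio.gather(
        fake_io_call("a", 0.2),
        fake_io_call("b", 0.2),
        fake_io_call("c", 0.2),
    )


start = time.monotonic()
results = asyncio.run(main())
elapsed = time.monotonic() - start
print(results, round(elapsed, 2))
```

The three waits overlap, so the total time is close to the longest single delay rather than the sum of all three, which is what a blocking server would need.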

There's more...

Event-driven web servers such as tornado make use of kernel-level libraries to monitor for events. These libraries are kqueue, epoll, and so on. If you're really interested, you should do more reading on this. Here are a few resources:

Web automation with selenium bindings

In all the recipes so far, we had a dedicated URL to make HTTP requests, be it calling a REST API or downloading content from the Web. But then, there are services that don't have a defined API resource or that require you to log in before you can perform operations. In such cases, you don't have much control over the requests, as the same URL serves different content based on the user session or cookie. Then what do we do?

Well, how about controlling the browser itself to achieve tasks in such scenarios? Controlling the browser itself? Interesting, isn't it?

Getting ready

For this recipe, we'll use Python's selenium module. Selenium (http://www.seleniumhq.org/) is a portable software framework for web applications and automates browser actions. You could automate mundane tasks with selenium. Selenium spawns a browser and helps you perform tasks as though a human is doing them. Selenium supports some of the most popularly used browsers like Firefox, Chrome, Safari, and Internet Explorer, among others. Let's take an example of logging in to Facebook with Python's selenium in this recipe.

How to do it...

We start by installing selenium bindings for Python. Installing selenium can be done with the following command:
```
           pip install selenium
```

Let's start by first creating a browser object. We use the Firefox browser for spawning the browser instance:

        from selenium import webdriver
        browser = webdriver.Firefox()
        print "WebDriver Object", browser

The following screenshot shows how a selenium web driver object got created. It also has a unique session ID:
Next, we ask the browser to browse to the Facebook home page. The following code helps us achieve this:
```
        browser.maximize_window()
        browser.get('https://facebook.com')
```
Once you run the preceding code, you will see a Firefox browser opened, and it connects to the Facebook login page, as in the following screenshot:

For the next step, we locate the e-mail and password elements and enter the appropriate data:

        email = browser.find_element_by_name('email')
        password = browser.find_element_by_name('pass')
        print "Html elements:"
        print "Email:", email, "\nPassword:", password

The output of the preceding code is as follows:

Once we have selected the Email and Password text inputs, we now fill them with the correct Email and Password. The following code will enable entering Email and Password:
```
        email.send_keys('abc@gmail.com') #Enter correct email address
        password.send_keys('pass123') #Enter correct password
```
Now that we have entered Email and Password, the last thing to do is submit the form by clicking on the Log In button. We do this by finding the element by ID and clicking on the element:
```
        browser.find_element_by_id('loginbutton').click()
```
If you have entered the correct e-mail ID and password, you will be logged in to Facebook!

How it works...

For this recipe, we used the selenium WebDriver Python APIs. WebDriver is the latest inclusion in selenium APIs and drives browsers natively like a user. It can drive locally or on a remote machine using the selenium server. In this example, we ran it on the local machine. Basically, the selenium server runs on a local machine on a default port 4444 and selenium WebDriver APIs interact with the selenium server to take actions on the browser.

In this recipe, we first created a WebDriver instance using the Firefox browser. We then used the WebDriver API to browse to the Facebook homepage. We then parsed the HTML page and located the Email and Password input elements. How did we find the elements? Yes, similar to what we did in the web scraping example. As we have the developer console in Chrome, we can install the firebug plugin in Firefox. Using this plugin, we can get the HTML elements for Email and Password. See the following screenshot:

Once we figured out the HTML element names, we programmatically created an HTML element object using WebDriver's find_element_by_name() method. WebDriver API has a method send_keys() that can work on element objects and enter the required text (in this case email and password). The last operation is to submit the form, and we performed it by finding the Log In object and clicking on it.
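
The locate, type, and click sequence can be gathered into one function. The sketch below assumes the Selenium 4 style of locating elements, find_element(by, value), where "name" and "id" are the string values behind By.NAME and By.ID; the element names are the ones used in this recipe and may not match the live page. So that it runs without a browser, it is exercised here with a small recording stand-in (RecordingBrowser is made up for this sketch); in real use you would pass a webdriver.Firefox() instance instead:

```python
def log_in(browser, email, password):
    """Fill in and submit the login form on the page `browser` currently shows."""
    browser.find_element("name", "email").send_keys(email)
    browser.find_element("name", "pass").send_keys(password)
    browser.find_element("id", "loginbutton").click()


class RecordingElement:
    """Stand-in for a WebElement that records what is done to it."""

    def __init__(self, log, locator):
        self.log = log
        self.locator = locator

    def send_keys(self, text):
        self.log.append(("send_keys", self.locator, text))

    def click(self):
        self.log.append(("click", self.locator))


class RecordingBrowser:
    """Stand-in for a WebDriver; only find_element() is imitated."""

    def __init__(self):
        self.log = []

    def find_element(self, by, value):
        return RecordingElement(self.log, (by, value))


fake_browser = RecordingBrowser()
log_in(fake_browser, "abc@gmail.com", "pass123")
for entry in fake_browser.log:
    print(entry)
```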

There's more...

We looked at a very basic example with the selenium WebDriver Python bindings. Now it's up to your imagination what you can achieve with selenium, automating mundane tasks.

Automating lead generation with web scraping

Ryan is a marketing manager at Dely Inc. Dely is a food delivery start-up and is trying to establish itself in the city of London. Dely is good at logistics and wants to aggregate restaurants on its platform, so when consumers order food from these restaurants, Dely will be responsible for the actual delivery. Dely is hoping that with every delivery it makes, it will get a percentage cut from the restaurants. In return, restaurants only have to think about their kitchen and not the logistical aspects. If you think about it carefully, virtually every restaurant, big or small, is a probable lead. Dely wants to reach out to these restaurants and hopes to add them to its platform and fulfill their delivery needs.

Ryan is responsible for getting in touch with restaurants and wants to run a marketing campaign on all the target restaurants. But before he can do this, he needs to create a database of all the restaurants in London. He needs details, such as the name of the restaurant, the street address, and the contact number so that he can reach these restaurants. Ryan knows all his leads are listed on Yelp, but doesn't know where to start. Also, if he starts looking at all restaurants manually, it will take him a huge amount of time. With the knowledge you gained in this chapter, can you help Ryan with lead generation?

Legality of web scraping

We covered the legal aspects of web scraping in the initial parts of the chapter. I would like to warn you again on this. The example covered in this chapter, again, is for you to understand how to perform web scraping. Also, here we're scraping Yelp for public data that is commonly available; in this case, it is available on the restaurant's website itself.

Getting ready

Now, if you look at Ryan's problem, he needs an automated way of collecting the database of all the restaurants listed in London. Yes, you got it right. Web scraping can help Ryan build this database. Can it be that easy? Let's see in this recipe.

For this recipe, we don't need any extra modules. We'll use the BeautifulSoup and urllib Python modules that we used in the previous recipes of this chapter.

How to do it...

We start by going to the Yelp website (https://yelp.com/) and searching for all the restaurants in the city of London. When you do that, you'll get a list of all the restaurants in London. Observe the URL that displays the search criteria. It is https://www.yelp.com/search?find_desc=Restaurants&find_loc=London. See the following screenshot for reference:
Now, if you click on any of the restaurants' link that shows up in the search results, we should get the details that Ryan needs. See the following screenshot, where we get the details of Ffiona's Restaurant. Note how every restaurant has a dedicated URL; in this case, it is https://www.yelp.com/biz/ffionas-restaurant-london?osq=Restaurants. Also note that on this page, we have the name of the restaurant, the street address, and even the contact number. All the details that Ryan needs for his campaign; that's cool!
OK nice, so we now know how to get the list of restaurants and also fetch the relevant details for a restaurant. But how do we achieve this in an automated way? As we saw in the web scraping example, we need to look for the HTML elements on the web pages from where we can collect this data.
Let's start with the search page. Open the search page (https://www.yelp.com/search?find_desc=Restaurants&find_loc=London) in your Chrome browser. Now, right-click on the first restaurant's URL and click on Inspect to get the HTML elements. If you notice, in the following screenshot, all the restaurants that are listed on the search page have a common CSS class name, biz-name, which indicates the name of the restaurant. The element also carries the href attribute, which points to the dedicated URL of the restaurant. In our screenshot, we get the name, Ffiona's Restaurant, and the href points to the restaurant's URL, https://yelp.com/biz/ffionas-restaurant-london?osq=Restaurants.
Now, let's look at the dedicated page of the restaurant to see how we collect the street address and the contact number of the restaurant with the HTML elements. We perform the same operation, right-click, and Inspect to get the HTML elements of street address and contact number. See the following screenshot for reference. Note that for the street address, we have a separate CSS class, street-address, and the contact number is available under a span with the class name, biz-phone.

Awesome! So, we now have all the HTML elements that can be used to scrape the data in an automated way. Let's now look at the implementation. The following Python code performs these operations in an automated way:

        from bs4 import BeautifulSoup 
        from threading import Thread 
        import urllib 
 
        #Location of restaurants 
        home_url = "https://www.yelp.com" 
        find_what = "Restaurants" 
        location = "London" 
 
        #Get all restaurants that match the search criteria 
        search_url = ("https://www.yelp.com/search?find_desc=" +
                      find_what + "&find_loc=" + location)
        s_html = urllib.urlopen(search_url).read() 
        soup_s = BeautifulSoup(s_html, "lxml") 
 
        #Get URLs of top 10 Restaurants in London 
        s_urls = soup_s.select('.biz-name')[:10] 
        url = [] 
        for u in range(len(s_urls)):
            url.append(home_url + s_urls[u]['href'])
 
 
        #Function that will do actual scraping job 
        def scrape(ur): 
                html = urllib.urlopen(ur).read() 
                soup = BeautifulSoup(html, "lxml") 
 
                title = soup.select('.biz-page-title') 
                saddress = soup.select('.street-address') 
                phone = soup.select('.biz-phone') 
 
                if title: 
                     print "Title: ", title[0].getText().strip() 
                if saddress: 
                     print "Street Address: ", \
                         saddress[0].getText().strip()
                if phone: 
                     print "Phone Number: ", phone[0].getText().strip() 
                print "-------------------" 
 
        threadlist = [] 
        i=0 
        #Making threads to perform scraping 
        while i<len(url): 
                  t = Thread(target=scrape,args=(url[i],)) 
                  t.start() 
                  threadlist.append(t) 
                  i=i+1 
 
        for t in threadlist: 
                  t.join()

OK, great! Now, if we run the preceding Python code, we get the details of the top 10 restaurants in London, along with their names, street addresses and contact numbers. Refer to the following screenshot:
In the preceding screenshot, we get the records of 10 restaurants in London provided by Yelp. Title is the name of the restaurant and Street Address and Phone Number are self-explanatory. Awesome! We did it for Ryan.

How it works...

In the preceding code snippet, we built the search criteria. We searched on https://yelp.com and looked for restaurants in London. With these details, we got the search URL on Yelp.
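
Plain string concatenation works for one-word search terms such as Restaurants and London, but terms with spaces or ampersands need escaping. As a hedged sketch in Python 3, the standard urllib.parse.urlencode() function can build the same query string; the function name build_search_url and the second example search are made up for illustration:

```python
from urllib.parse import urlencode

home_url = "https://www.yelp.com"


def build_search_url(find_what, location):
    """Build the Yelp search URL, escaping the search terms as needed."""
    query = urlencode({"find_desc": find_what, "find_loc": location})
    return "%s/search?%s" % (home_url, query)


search_url = build_search_url("Restaurants", "London")
escaped_url = build_search_url("Fish & Chips", "New York")
print(search_url)
print(escaped_url)
```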

We then used the urlopen() method of the urllib module on this search URL to read() the list of all the restaurants provided by Yelp matching the search criteria. The list of all the restaurants is returned as an HTML page, which we stored in the variable, s_html.

Using the BeautifulSoup module, we created a soup instance on the HTML content so that we could start extracting the required data using CSS selectors.

Initially, we browsed the top 10 results of the search on Yelp and got the URLs of the restaurants. We stored these URLs in the Python list named url. To get the URLs, we selected the CSS class name biz-name using the code soup_s.select('.biz-name')[:10].

We also defined a method, scrape(), which takes the restaurant URL as a parameter. In this method, we read the details of the restaurant, such as name, street address, and contact number, using the CSS class names biz-page-title, street-address, and biz-phone, respectively. To get the exact data, we selected the HTML elements using title = soup.select('.biz-page-title') and got the data with title[0].getText().strip(). Note that the select() method returns the matching elements as a list, so we need to take index 0 to get the actual text.

We iterated through all the restaurant URLs in a while loop and scraped the URL using the scrape() method to get the details for each restaurant. It prints the name, street address, and contact number for each restaurant on your console, as we saw in the preceding screenshot.

To improve the performance of our scraping program, we performed data extraction for every restaurant in an independent thread. We created a new thread with t = Thread(target=scrape,args=(url[i],)) and waited for each of them to finish with the t.join() call.
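
One consequence of printing from inside each thread is that the output of different restaurants can arrive in any order. If Ryan needs the records as data, a variation is to have each worker return its result and collect them afterwards. The sketch below (Python 3) swaps the raw Thread objects for the standard library's ThreadPoolExecutor; the scrape() shown here is a stand-in that fetches nothing, and the URLs are invented:

```python
from concurrent.futures import ThreadPoolExecutor


def scrape(url):
    """Stand-in for the recipe's scrape(); returns a record instead of printing."""
    return {"url": url, "title": url.rsplit("/", 1)[-1]}


# Invented URLs standing in for the restaurant pages.
urls = ["https://www.yelp.com/biz/example-%d" % n for n in range(1, 4)]

# Each URL is handled on a worker thread; map() yields results in input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    records = list(pool.map(scrape, urls))

for record in records:
    print(record["title"], record["url"])
```

Leaving the with block waits for all workers to finish, which plays the same role as the t.join() loop in the recipe.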

That’s it, folks! Ryan is extremely happy with this effort. In this example, we helped Ryan and automated a critical business task for him. Throughout this book we'll look at various use cases where Python can be leveraged to automate business processes and make them efficient. Interested in more? Well, see you in the next chapter.

About the Author

Chetan Giridhar

Chetan Giridhar is a technology leader and an open source evangelist. He is the author of Learning Python Design Patterns, Second Edition, has been an invited speaker at international PyCon conferences, and is an associate editor at the Python Papers journal. He takes keen interest in platform engineering, distributed systems, mobile app development, and real-time cloud applications. You can take a look at his experiments at https://github.com/cjgiridhar and his website https://technobeans.com. In his current role as the chief technology officer, Chetan owns the product strategy and drives technology for CallHub. Prior to CallHub, he was associated with BlueJeans Networks and NetApp, where he worked on cloud, video, and enterprise storage products. Chetan believes that the world is full of knowledge; he's always curious to learn new things and share them with open source community, friends, and colleagues. You can connect with him on LinkedIn at https://www.linkedin.c

