Scraping the web and collecting files
In this recipe, we will learn how to collect data from the web by scraping it. We will write a small shell script that downloads a site and extracts the links it contains.
Getting ready
Besides having a Terminal open, you need to have basic knowledge of the grep and wget commands.
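If you want to confirm that both commands are available before you start, a quick check like the following will do (this is just a convenience sketch; if either tool is missing, install it with your distribution's package manager):
$ command -v wget || echo "wget is not installed"
$ command -v grep || echo "grep is not installed"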
How to do it…
Now, we will write a script to scrape content from imdb.com. We will use the grep and wget commands in the script to get the content. Create a script named scrap_contents.sh and write the following code in it:
#!/bin/bash
# Download the site recursively and collect the links it contains.
mkdir -p data
cd data
wget -q -r -l 5 -x https://imdb.com
cd ..
# Extract every href attribute value from the downloaded pages.
grep -r -Po -h '(?<=href=")[^"]*' data/ > links.csv
# Keep only absolute links that start with http.
grep "^http" links.csv > links_filtered.csv
# Sort the links and remove duplicates.
sort -u links_filtered.csv > links_final.csv
# Remove the downloaded data and the intermediate files.
rm -rf data links.csv links_filtered.csv
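Save the script, make it executable, and run it. The crawl can take a while depending on the recursion depth; when it finishes, the unique links are left in links_final.csv. The head command below is just one way to inspect the result:
$ chmod +x scrap_contents.sh
$ ./scrap_contents.sh
$ head -5 links_final.csv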
How it works…
In the preceding script, we have written code to get content from a website. The wget utility is used for retrieving files from the web using the HTTP, HTTPS, and FTP protocols. In this example, we are getting data from imdb.com, and therefore we specified https://imdb.com as the URL. The -q option suppresses wget's normal output, -r downloads the site recursively, -l 5 limits the recursion to five levels of depth, and -x forces wget to recreate the site's directory structure under the data directory.
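The grep command then does the link extraction: -r searches the downloaded tree recursively, -P enables Perl-compatible regular expressions, -o prints only the matched text, and -h omits the file-name prefix from each match. The pattern '(?<=href=")[^"]*' uses a lookbehind, so it matches everything between href=" and the closing quote. Here is a minimal sketch you can try in isolation (the HTML snippet is made up for illustration):
$ echo '<a href="https://imdb.com/chart/top">Top Movies</a>' | grep -Po '(?<=href=")[^"]*'
https://imdb.com/chart/top
The second grep keeps only the lines starting with http, which discards relative links, and sort -u removes duplicates before the final list is written to links_final.csv.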