Nokogiri

Instant Nokogiri


August 2013

$12.99

Learning data scraping and parsing in Ruby using the Nokogiri gem

(For more resources related to this topic, see here.)

Spoofing browser agents

When you request a web page, you send metainformation along with your request in the form of headers. One of these headers, User-agent, informs the web server which web browser you are using. By default open-uri, the library we are using to scrape, will report your browser as Ruby.

There are two issues with this. First, it makes it very easy for an administrator to look through their server logs and see if someone has been scraping the server. Ruby is not a standard web browser. Second, some web servers will deny requests that are made by a nonstandard browsing agent.

We are going to spoof our browser agent so that the server thinks we are just another Mac using Safari.

An example is as follows:


# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# this string is the browser agent for Safari running on a Mac
browser = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4)
AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5
Safari/536.30.1'

# create a new Nokogiri HTML document from the scraped URL and pass in
the
# browser agent as a second parameter
doc = Nokogiri::HTML(open('http://nytimes.com', browser))

# you can now go along with your request as normal
# you will show up as just another safari user in the logs
puts doc.at_css('h2 a').to_s

Caching

It's important to remember that every time we scrape content, we are using someone else's server's resources. While it is true that we are not using any more resources than a standard web browser request, the automated nature of our requests leave the potential for abuse.

In the previous examples we have searched for the top headline on The New York Times website. What if we took this code and put it in a loop because we always want to know the latest top headline? The code would work, but we would be launching a mini denial of service (DOS) attack on the server by hitting their page potentially thousands of times every minute.

Many servers, Google being one example, have automatic blocking set up to prevent these rapid requests. They ban IP addresses that access their resources too quickly. This is known as rate limiting.

To avoid being rate limited and in general be a good netizen, we need to implement a caching layer. Traditionally in a large app this would be implemented with a database. That's a little out of scope for this article, so we're going to build our own caching layer with a simple TXT file. We will store the headline in the file and then check the file modification date to see if enough time has passed before checking for new headlines.

Start by creating the cache.txt file in the same directory as your code:

$ touch cache.txt

We're now ready to craft our caching solution:

# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# set how long in minutes until our data is expired
# multiplied by 60 to convert to seconds
expiration = 1 * 60

# file to store our cache in
cache = "cache.txt"

# Calculate how old our cache is by subtracting it's modification time
# from the current time.

# Time.new gets the current time
# The mtime methods gets the modification time on a file
cache_age = Time.new - File.new(cache).mtime

# if the cache age is greater than our expiration time
if cache_age > expiration
# our cache has expire
puts "cache has expired. fetching new headline"

# we will now use our code from the quick start to
# snag a new headline

# scrape the web page
data = open('http://nytimes.com')

# create a Nokogiri HTML Document from our data
doc = Nokogiri::HTML(data)

# parse the top headline and clean it up
headline = doc.at_css('h2 a').content.gsub(/\n/," ").strip

# we now need to save our new headline
# the second File.open parameter "w" tells Ruby to overwrite
# the old file
File.open(cache, "w") do |file|
# we then simply puts our text into the file
file.puts headline
end

puts "cache updated"

else
# we should use our cached copy
puts "using cached copy"
# read cache into a string using the read method
headline = IO.read("cache.txt")
end

puts "The top headline on The New York Times is ..."
puts headline

Our cache is set to expire in one minute, so assuming it has been one minute since you created your cache.txt file, let's fire up our Ruby script:

ruby cache.rb
cache has expired. fetching new headline
cache updated
The top headline on The New York Times is ...
Supreme Court Invalidates Key Part of Voting Rights Act

If we run our script again before another minute passes, it should use the cached copy:

$ ruby cache.rb
using cached copy
The top headline on The New York Times is ...
Supreme Court Invalidates Key Part of Voting Rights Act

SSL

By default, open-uri does not support scraping a page with SSL. This means any URL that starts with https will give you an error. We can get around this by adding one line below our require statements:

# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# disable SSL checking to allow scraping
OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE

Mechanize

Sometimes you need to interact with a page before you can scrape it. The most common examples are logging in or submitting a form. Nokogiri is not set up to interact with pages. Nokogiri doesn't even scrape or download the page. That duty falls on open-uri. If you need to interact with a page, there is another gem you will have to use: Mechanize.

Mechanize is created by the same team as Nokogiri and is used for automating interactions with websites. Mechanize includes a functioning copy of Nokogiri.

To get started, install the mechanize gem:

$ gem install mechanize
Successfully installed mechanize-2.7.1

We're going to recreate the code sample from the installation where we parsed the top Google results for "packt", except this time we are going to start by going to the Google home page and submitting the search form:

# mechanize takes the place of Nokogiri and open-uri
require 'mechanize'

# create a new mechanize agent
# think of this as launching your web browser
agent = Mechanize.new

# open a URL in your agent / web browser
page = agent.get('http://google.com/')

# the google homepage has one big search box
# if you inspect the HTML, you will find a form with the name 'f'
# inside of the form you will find a text input with the name 'q'
google_form = page.form('f')

# tell the page to set the q input inside the f form to 'packt'
google_form.q = 'packt'

# submit the form
page = agent.submit(google_form)

# loop through an array of objects matching a CSS
# selector. mechanize uses the search method instead of
# xpath or css. search supports xpath and css
# you can use the search method in Nokogiri too if you
# like it
page.search('h3.r').each do |link|
# print the link text
puts link.content
end

Now execute the Ruby script and you should see the titles for the top results:

$ ruby mechanize.rb
Packt Publishing: Home
Books
Latest Books
Login/register
PacktLib
Support
Contact
Packt - Wikipedia, the free encyclopedia
Packt Open Source (PacktOpenSource) on Twitter
Packt Publishing (packtpub) on Twitter
Packt Publishing | LinkedIn
Packt Publishing | Facebook

For more information refer to the site:

http://mechanize.rubyforge.org/

People and places you should get to know

If you need help with Nokogiri, here are some people and places that will prove invaluable.

Official sites

The following are the sites you can refer:

Articles and tutorials

The top five Nokogiri resources are as follows:

Community

The community sites are as follows:

Twitter

Nokogiri leaders on Twitter are:

  • Nokogiri co-author Mike Dalessio: @flavorjones
  • Nokogiri co-author Aaron Patterson: @tenderlove
  • Me: @TheHunter
  • For more information on open source, follow Packt Publishing: @PacktOpenSource

Summary

Thus, we learnt about Nokogiri open source library in this article.

Resources for Article :


Further resources on this subject:


Books to Consider

comments powered by Disqus