Nokogiri

by Hunter Powers | August 2013 | Open Source Web Development

In this article by Hunter Powers, author of the book Instant Nokogiri, you will get an insight into Nokogiri, the open source library for parsing XML and HTML in Ruby.


Spoofing browser agents

When you request a web page, you send metadata along with your request in the form of headers. One of these headers, User-Agent, informs the web server which web browser you are using. By default, open-uri, the library we are using to scrape, will report your browser as Ruby.

There are two issues with this. First, it makes it very easy for an administrator to look through their server logs and see if someone has been scraping the server. Ruby is not a standard web browser. Second, some web servers will deny requests that are made by a non-standard browser agent.

We are going to spoof our browser agent so that the server thinks we are just another Mac using Safari.

An example is as follows:


# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# this string is the browser agent for Safari running on a Mac
browser = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1'

# create a new Nokogiri HTML document from the scraped URL, passing
# the browser agent as the User-Agent request header
doc = Nokogiri::HTML(open('http://nytimes.com', 'User-Agent' => browser))

# you can now go on with your request as normal
# you will show up as just another Safari user in the logs
puts doc.at_css('h2 a').to_s
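
open-uri treats string-keyed entries in its options hash as request headers, so you can send other headers in exactly the same way. Here is a minimal sketch; the Referer value is purely illustrative:

# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

browser = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1'

# string keys in the options hash become request headers
doc = Nokogiri::HTML(open('http://nytimes.com',
  'User-Agent' => browser,
  'Referer' => 'http://www.google.com/')) # illustrative referrer

puts doc.at_css('h2 a').to_s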

Caching

It's important to remember that every time we scrape content, we are using someone else's server resources. While it is true that we are not using any more resources than a standard web browser request, the automated nature of our requests leaves the potential for abuse.

In the previous examples we have searched for the top headline on The New York Times website. What if we took this code and put it in a loop because we always want to know the latest top headline? The code would work, but we would be launching a mini denial of service (DoS) attack on the server by hitting their page potentially thousands of times every minute.

Many servers, Google being one example, have automatic blocking set up to prevent these rapid requests. They ban IP addresses that access their resources too quickly. This is known as rate limiting.
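
Rate limits aside, it is simply polite to space out repeated requests. Here is a minimal sketch of a throttled headline loop; the 60-second pause is an arbitrary interval chosen for illustration:

# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# check the headline at most once per minute instead of
# hitting the server as fast as the loop can run
loop do
  doc = Nokogiri::HTML(open('http://nytimes.com'))
  puts doc.at_css('h2 a').content.gsub(/\n/," ").strip

  # pause before the next request so we stay well clear
  # of any reasonable rate limit
  sleep 60
end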

To avoid being rate limited and in general be a good netizen, we need to implement a caching layer. Traditionally in a large app this would be implemented with a database. That's a little out of scope for this article, so we're going to build our own caching layer with a simple TXT file. We will store the headline in the file and then check the file modification date to see if enough time has passed before checking for new headlines.

Start by creating the cache.txt file in the same directory as your code:

$ touch cache.txt

We're now ready to craft our caching solution:

# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# set how long in minutes until our data is expired
# multiplied by 60 to convert to seconds
expiration = 1 * 60

# file to store our cache in
cache = "cache.txt"

# Calculate how old our cache is by subtracting its modification time
# from the current time.

# Time.new gets the current time
# The mtime method gets the modification time of a file
cache_age = Time.new - File.new(cache).mtime

# if the cache age is greater than our expiration time
if cache_age > expiration
  # our cache has expired
  puts "cache has expired. fetching new headline"

  # we will now use our code from the quick start to
  # snag a new headline

  # scrape the web page
  data = open('http://nytimes.com')

  # create a Nokogiri HTML Document from our data
  doc = Nokogiri::HTML(data)

  # parse the top headline and clean it up
  headline = doc.at_css('h2 a').content.gsub(/\n/," ").strip

  # we now need to save our new headline
  # the second File.open parameter "w" tells Ruby to overwrite
  # the old file
  File.open(cache, "w") do |file|
    # we then simply puts our text into the file
    file.puts headline
  end

  puts "cache updated"
else
  # we should use our cached copy
  puts "using cached copy"

  # read the cache into a string using IO.read
  headline = IO.read(cache)
end

puts "The top headline on The New York Times is ..."
puts headline

Our cache is set to expire in one minute, so assuming it has been one minute since you created your cache.txt file, let's fire up our Ruby script:

$ ruby cache.rb
cache has expired. fetching new headline
cache updated
The top headline on The New York Times is ...
Supreme Court Invalidates Key Part of Voting Rights Act

If we run our script again before another minute passes, it should use the cached copy:

$ ruby cache.rb
using cached copy
The top headline on The New York Times is ...
Supreme Court Invalidates Key Part of Voting Rights Act

SSL

By default, open-uri verifies SSL certificates, and on systems without a usable certificate store this check fails. This means any URL that starts with https will give you an error. We can get around this by adding one line below our require statements:

# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# disable SSL certificate checking for the whole process
# (Ruby will print an "already initialized constant" warning
# because this line reassigns a constant)
OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE
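
A gentler alternative, assuming your Ruby's open-uri supports the ssl_verify_mode option (Ruby 1.9 and later), is to disable verification for a single request rather than globally. A sketch:

# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# pass ssl_verify_mode for this request only, leaving
# the global OpenSSL constants untouched
doc = Nokogiri::HTML(open('https://nytimes.com',
  ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE))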

Mechanize

Sometimes you need to interact with a page before you can scrape it. The most common examples are logging in or submitting a form. Nokogiri is not set up to interact with pages. Nokogiri doesn't even scrape or download the page. That duty falls on open-uri. If you need to interact with a page, there is another gem you will have to use: Mechanize.

Mechanize is created by the same team as Nokogiri and is used for automating interactions with websites. Mechanize includes a functioning copy of Nokogiri.

To get started, install the mechanize gem:

$ gem install mechanize
Successfully installed mechanize-2.7.1

We're going to recreate the code sample from the installation where we parsed the top Google results for "packt", except this time we are going to start by going to the Google home page and submitting the search form:

# mechanize takes the place of Nokogiri and open-uri
require 'mechanize'

# create a new mechanize agent
# think of this as launching your web browser
agent = Mechanize.new

# open a URL in your agent / web browser
page = agent.get('http://google.com/')

# the google homepage has one big search box
# if you inspect the HTML, you will find a form with the name 'f'
# inside of the form you will find a text input with the name 'q'
google_form = page.form('f')

# tell the page to set the q input inside the f form to 'packt'
google_form.q = 'packt'

# submit the form
page = agent.submit(google_form)

# loop through the array of objects matching a CSS
# selector. mechanize uses the search method, which
# accepts both XPath and CSS selectors. you can use
# the search method in Nokogiri too if you like it
page.search('h3.r').each do |link|
  # print the link text
  puts link.content
end
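
Because Mechanize includes Nokogiri, every Mechanize page also exposes the underlying Nokogiri document through its parser method, so you can fall back to plain Nokogiri whenever you prefer its API. A quick sketch:

# parser returns the underlying Nokogiri::HTML::Document,
# so all the usual Nokogiri methods are available
doc = page.parser
puts doc.at_css('h3.r').content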

Now execute the Ruby script and you should see the titles for the top results:

$ ruby mechanize.rb
Packt Publishing: Home
Books
Latest Books
Login/register
PacktLib
Support
Contact
Packt - Wikipedia, the free encyclopedia
Packt Open Source (PacktOpenSource) on Twitter
Packt Publishing (packtpub) on Twitter
Packt Publishing | LinkedIn
Packt Publishing | Facebook

For more information, refer to the Mechanize site:

http://mechanize.rubyforge.org/

People and places you should get to know

If you need help with Nokogiri, here are some people and places that will prove invaluable.

Official sites

The following are the sites you can refer to:

Articles and tutorials

The top five Nokogiri resources are as follows:

Community

The community sites are as follows:

Twitter

Nokogiri leaders on Twitter are:

  • Nokogiri co-author Mike Dalessio: @flavorjones
  • Nokogiri co-author Aaron Patterson: @tenderlove
  • Me: @TheHunter
  • For more information on open source, follow Packt Publishing: @PacktOpenSource

Summary

In this article, we learned about the Nokogiri open source library, covering browser agent spoofing, caching, scraping over SSL, and interacting with pages using Mechanize.

About the Author


Hunter Powers

Hunter Powers, a Full Stack Web Developer, began programming at age 6, and has been gathering steam ever since. From childhood awards (including “Most Outstanding Presentation in the Field of Engineering” from NC State University and “Superior Achievement in Computer Science” from the North Carolina Student Academy of Science) to more recent achievements at TechCrunch Disrupt 2012, he has never lost his passion for the science, languages, and dynamics of computers. Powers’ early accomplishments in the field led him to open WireThePlanet.com, his first business, at the age of 13. The company has prospered through the years, developing local and national websites, producing national television advertisements, and directing the art and design for an international card and print company. With a range of products including websites, book covers, advertising, logos, national festivals, and many more, his customer base includes e-commerce companies, charities and clubs, renowned foodies, and even the weatherman.

In 2011, Powers joined a big data streaming company. His work there has been represented by projects for AOL’s TechCrunch, Engadget, The Washington Post, The New York Times, Fox’s X-Factor, Fox Sports Australia, New York Presbyterian Hospital, and The NFL, to name a few. A graduate of The University of Virginia, Powers’ interests extend beyond technology into filmmaking, photography, and writing. He was first published while in high school and has written many short as well as feature length science fiction screenplays. Powers also directs with credits for multiple commercials, web videos, and short films.

Currently residing in the Logan Circle arts district in Washington DC, Powers is working on his next book. You can find his blog at http://www.HunterPowers.com.

