URL Shorteners – Designing the TinyURL Clone with Ruby

Exclusive offer: get 50% off this eBook here
Cloning Internet Applications with Ruby

Cloning Internet Applications with Ruby — Save 50%

Make clones of some of the best applications on the Web using the dynamic and object-oriented features of Ruby

$23.99    $12.00
by Chang Sau Sheong | August 2010 | Open Source

This article by Chang Sau Sheong, author of the book Cloning Internet Applications with Ruby, explains about the popular Internet application, TinyURL. It describes how to create a TinyURL clone, its basic principles, and algorithms used.

(For more resources on Ruby, see here.)

We start off with an easy application, a simple yet very useful Internet application, URL shorteners. We will take a quick tour of URL shorteners before jumping into the design of a simple URL shortener, followed by an in-depth discussion of how we clone our own URL shortener, Tinyclone.

All about URL shorteners

Internet applications don't always need to be full of features or cover all aspects of your Internet life to be successful. Sometimes it's ok to be simple and just focus on providing a single feature. It doesn't even need to be earth-shatteringly important—it should be just useful enough for its target users. The archetypical and probably most extreme example of this is the URL shortening application or URL shortener.

This service offers a very simple but surprisingly useful feature. It provides a shorter URL that represents a normally longer URL. When a user goes to the short URL, he will be redirected to the original URL. For this simple feature, top three most popular URL shortening services (TinyURL, bit.ly, and is.gd) collectively had about 11 million unique visitors, 110 million page views and a reach of about one percent of the Internet in June 2009. In 2008, the most popular URL shortener at that time, TinyURL, was made one of Time Magazine's Top 50 Best Websites.

The idea to shorten long and unwieldy URLs into shorter, more manageable ones has been around for some time. One of the earlier attempts to make it a public service is Make A Shorter Link (MASL), which appeared around July 2001. MASL did just that, though the usefulness was debatable as the domain name was long and the shortened URL could potentially be longer than the original.

However, the pioneering site that popularized this concept (and subsequently bought over MASL and a few other similar sites) is TinyURL. TinyURL was launched in January 2002 by Kevin Gilbertson to help him to link directly to newsgroup postings which frequently had long URLs. It rapidly became one of the most popular URL shorteners around. In 2008, an estimated 100 similar services came to existence in various forms.

URLs or Uniform Resource Locators are resource identifiers that specify where identified resources are available and how they can be retrieved. A popular term for URL is a Web address. Every URL is made up of the following:

<resource type>://<username>:<password>@<domain>:<port
>/<file path name>?<query string>#<anchor>

Not all parts of the URL are required by a browser, if the resource type is missing, it is normally assumed to be http, if the port is missing, it is normally assumed to be 80 (for http). The username, password, query string and anchor components are optional.

Initially, TinyURL and similar types of URL shorteners focused on simply providing a short representative URL to their users. Naturally the competitive breadth for shortening URLs was rather well, short. Many chose TinyURL over MASL because TinyURL had a shorter and easier to remember domain name (http://tinyurl.com over http://makeashorterlink.com)

Subsequent competition over this space intensified and extended to providing various other features, including custom short URLs (TinyURL, bit.ly), analysis of click-through statistics (bit.ly), advertisements (Adjix, Linkbee), preview pages (TinyURL, is.gd) and so on.

The explosive growth of Twitter (from June 2008 to June 2009, Twitter grew 1,164%) opened a new chapter for URL shorteners. Twitter chose a limit of 140 characters for each tweet to accommodate the 160 characters in an SMS message (Twitter was invented as a service for people to use SMS to tell small groups what they are doing). With Twitter's popularity skyrocketing, came the need for users to shorten URLs to fit into the 140 characters limit. Originally Twitter used TinyURL as its default URL shortener and this triggered a steep climb in the usage of TinyURL during the early days of Twitter.

However, in May 2009, bit.ly replaced TinyURL as Twitter's default URL shortener and the impact was immediate. For the first time in that period, TinyURL recorded a drop in the number of users in May 2009, dropping from 6.1 million to 5.3 million unique users, while bit.ly jumped from 1.8 million to 2.9 million almost overnight. That's not the end of the story though. In April 2010 during Twitter's Chirp conference, Twitter announced its own URL shortener (twt.tl). As of writing it is still unclear the market share will pan out but it's clear that URL shorteners have good value and everyone is jumping into this market. In December 2009, Google came up with its own two URL shorteners goo.gl and youtu.be. Amazon.com (amzn.to), Facebook (fb.me) and Wordpress (wp.me) all have their own URL shorteners as well.

Next, let's do a quick review of why URLs shorteners are so popular and why they attract criticism as well.

Here's a quick summary of the benefits:

  • Create short and easy to remember URLs
  • Allow passing of links in character-limited services such as Twitter
  • Create vanity URLs for marketing purposes
  • Can verbally pass URLs

The most obvious benefit of having a shortened URL is that it's, well, short. A typical example of an URL gone bad is a link to a location in Google Maps:

http://maps.google.com/maps?f=q&source=s_q&hl=en&geocode=&q=singapore +flyer&vps=1&jsv=169c&sll=1.352083,103.819836&sspn=0.68645,1.382904&g =singapore&ie=UTF8&latlng=8354962237652576151&ei=Shh3SsSRDpb4vAPsxLS3 BQ&cd=1&usq=Singapore+Flyer

Such URLs are meant to be clicked on as it is virtually impossible to pass it around verbally. It might be justifiable if the URL is cut and pasted on documents, but sometimes certain applications will truncate parts of the URL while processing. This makes a long URL difficult to click on and even produces erroneous links. In fact, this was the main motivation in creating most of the earlier URL shorteners—older email clients tend to truncate URLs when they are more than 80 characters.

Short links are of course crucial in character-limited message passing systems like Twitter, Plurk, and SMS. Passing long URLs is impossible without URL shorteners.

Short URLs are very useful in cases of vanity URLs where for example, the Google Maps link above could be shortened to http://tinyurl.com/singapore-flyer. Such vanity URLs are useful when passing from one person to another, or even when using in a mass marketing way. Sticking to the maps theme in our examples, if you want to give a Google Maps link to your restaurant and put it up in catalogs and brochures, you will not want to give the long URL. Instead you would want a nice, descriptive and short URL.

Short URLs are also useful in cases of accessibility. For example, reading out the Google Maps link above is almost impossible, but reading out the TinyURL link (vanity or otherwise) is much easier in comparison.

Many popular URL shorteners also provide some form of statistics and analytics on the usage of the links. This feature allows you to track your short URLs to see how many clicks it received and what kind of patterns can be derived from the clicks. Although the metrics are usually not advanced, they do provide basic usefulness.

On the other hand, URL shorteners have it fair share of criticisms as well. Here is a summary of the bad side of URL shorteners:

  • Provides opportunity to spammers because it hide original URLs
  • Could be unreliable if dependent on it for redirection
  • Possible undesirable or vulgar short URLs

URL shorteners have security issues. When a URL shortener creates a short URL, it effectively hides the original link and this provides opportunity for spammers or other abusers to redirect users to their sites. One relatively mild form of such attack is 'rickrolling'. Rickrolling uses a classic bait-and-switch trick to redirect users to a Rick Astley music video of Never Gonna Give You Up. For example, you might feel that the URL http://tinyurl.com/singapore-flyer goes to Google Map, but when you click on it, you might be rickrolled and redirected to that Rick Astley music video instead.

Also, because most short URLs are not customized, it is quite difficult to see if the link is genuine or not just from the URL. Many prominent websites and applications have such concerns, including MySpace, Flickr and even Microsoft Live Messenger, and have one time or another banned or restricted usage of TinyURL because of this problem. To combat spammers and fraud, URL shortening services have come up with the idea of link previews, which allows users to preview a short URL before it redirects the user to the long URL. For example TinyURL will show the user the long URL on a preview page and requires the user to explicitly go to the long URL.

Another problem is performance and reliability. When you access a website, your browser goes to a few DNS servers to resolve the address, but the URL shortener adds another layer of indirection. While DNS servers have redundancy and failsafe measures, there is no such assurance from URL shorteners. If the traffic to a particular link becomes too high, will the shortening service provider be able to add more servers to improve performance or even prevent a meltdown altogether? The problem of course lies in over-dependency on the shortening service.

Finally, a negative side effect of random or even customized short URLs is that undesirable, vulgar or embarrassing short URLs can be created. Earlier on TinyURL short URLs were predictable and it was exploited, such as embarrassing short URLs that were made to redirect to the White House websites of then U.S. Vice President Dick Cheney and Second Lady Lynne Cheney.

We have just covered significant ground on URL shorteners. If you are a programmer you might be wondering, "Why do I need to know such information? I am really interested in the programming bits, the others are just fluff to me."

Background information on the application we want to clone is very important. It tells us what why that application exists in the first place and gives us an idea what are the main features (what makes it popular). It also tells us what problems it faces such that we are aware of the problem while programming it, or even avoid it altogether. This is important when we come to the design of the application. Finally it gives us better appreciation of the application and the motivations and issues faced by the product and technical people behind the application we wish to clone.

Main features

Next, let's list down the features of a URL shortener. The intention in this section is to distill the basic features of the application, features that define the service. Features listed here will be features that make the application what it is.

However, as much as possible we want to also explore some additional features that extend the application and are provided by many of its competitors. Most importantly, the features here are mostly features of the most popular and definitive web application in the category. In this article, this will be TinyURL.

These are the main features of a URL shortener:

  • Users can create a short URL that represents a long URL
  • Users who visit the short URL will be redirected to the long URL
  • Users can preview a short URL to enable them to see what the long URL is
  • Users can provide a custom URL to represent the long URL
  • Undesirable words are not allowed in the short URL
  • Users are able to view various statistics involving the short URL, including the number of clicks and where the clicks come from (optional, not in TinyURL)

URL shorteners are simple web applications and the one that we will design and build will also be simple.

Designing the clone

Cloning TinyURL is relatively simple but there is some thought behind the design of the application. We will be building a clone of TinyURL called Tinyclone, which will be hosted at the domain http://tinyclone.saush.com.

Creating a short URL for each long URL

The domain of the short URL is fixed. What's left is the file path name. We need to represent the long URL with a unique file path name (a key), one for each long URL. This means we need to persist the relationship between the key and the URL.

One of the ways we can associate the long URL with a unique key is to hash the long URL and use the resulting hash as the unique key. However, the resulting hash might be long and hashing functions could be slow.

The faster and easier way is to use a relational database's auto-incremented row ID as the unique key. The database will help ensure the uniqueness of the ID. However, the running row ID number is base 10. To represent a million URLs would already require 7 characters, to represent 1 billion would take up 9 characters. In order to keep the number of characters smaller, we will need a larger base numbering system.

In this clone we will use base 36, which is 26 characters of the alphabet (case insensitive) and 10 numbers. Using this system, we will only need 5 characters to represent 1 million URLs:

1,000,000 base 36 = lfls

And 1 billion URLs can be represented in just six characters:

1,000,000,000 base 36 = gjdgxs

Cloning Internet Applications with Ruby Make clones of some of the best applications on the Web using the dynamic and object-oriented features of Ruby
Published: August 2010
eBook Price: $23.99
Book Price: $39.99
See more
Select your format and quantity:

(For more resources on Ruby, see here.)

Automatically redirecting from a short URL to a long URL

HTTP has a built-in mechanism for redirection. In fact it has a whole class of HTTP status codes to do this. HTTP status codes that start with 3 (such as 301, 302) tell the browser to go look for that resource in another location. This is used in the case where a web page has moved to another location or is no longer at the original location. The two most commonly used redirection status codes are 301 Move Permanently and 302 Found.

301 tells the browser that the resource has moved away permanently and that it should look at that location as the permanent location of the resource. 302 on the other hand (despite its name) tells the browser that the resource it is looking for has moved away temporarily.

While the difference seems trivial, search engines (as user agents) treat the status codes differently. 301 tells the search engines that the short URL's location has moved permanently away to the long URL, so credit for the long URL goes to the long URL. However because 302 only tells the search engine that the location has moved temporarily, the credit goes to the short URL. This can cause issues with search engine marketing accounting.

Obviously in our design we will need to use the 301 Moved Permanently HTTP status code to do the redirection. When the short URL http://tinyclone.saush.com/singapore-flyer is requested, we need to send a HTTP response:

HTTP/1.1 301 Moved Permanently
Location: http://maps.google.com/maps?f=q&source=s_q&hl=en&geocode=&q=
singapore+flyer&vps=1&jsv=169c&sll=1.352083,103.819836&sspn=0.68645,1.
382904&g=singapore&ie=UTF8&latlng=8354962237652576151&ei=Shh3SsSRDpb4v
APsxLS3BQ&cd=1&usq=Singapore+Flyer

Content-Type: text/html
Content-Length: 235
<html>
<head>
<title>Moved</title>
</head>
<body>
<h1>Moved</h1>
<p>This page has moved to <a href="http://maps.google.com/maps?f=q
&source=s_q&hl=en&geocode=&q=singapore+flyer&vps=1&jsv=169c&sll=1.35
2083,103.819836&sspn=0.68645,1.382904&g=singapore&ie=UTF8&latlng=8354-
962237652576151&ei=Shh3SsSRDpb4vAPsxLS3BQ&cd=1&usq=Singapore+Flyer">Si
ngapore Flyer</a>.</p>
</body>
</html>

Providing a customized short URL

Providing a customized short URL with the above design we had in mind before makes things less straightforward. Remember that our design uses the database row ID in base 36 as the unique key. To customize the short URL we cannot use this database row ID, so the customized short URL needs to be stored separately.

In Tinyclone we store the customized short URL in a separate secondary table called Links, which in turn points to the actual data in a table called Url. When a short URL is created and the user doesn't request a customized URL, we store the database record ID from the Url table as a base-36 string into the Links table. If the user requests a customized URL, we store the customized URL instead of the record ID.

A record in the Links table therefore maps a string to the actual record ID in the URL table. When a short URL is requested, we first look into the secondary table, which in turn points us to the actual record in the primary table.

Filtering undesirable words out

While we could use more complex filtering mechanisms, URL shorteners are simple web applications, so we stick to a simpler filtering mechanism. When we create the secondary table record, we compare the key with a list of banned words loaded in memory on startup.

If it is a customized short URL and the word is in the list, we prevent the user from using it. If the key was the actual record ID in the primary table, we create another record (and therefore using another record ID) to store the URL.

What happens if the new record ID is also coincidentally in the banned words list? We'll just have to create another one recursively until we find one that is not in the list. There is a probability that 2 or more consequent record IDs are in the banned words list, but the frequency of it happening is low enough that we don't need to worry about it.

Previewing the long URL

This feature is simple to implement. When the short URL preview function is called, we will show a page that displays the long URL instead of redirecting to that page. In Tinyclone we lump the preview long URL page together with the statistics information page.

Providing statistics

To provide statistics for the usage of the short URL, we need to store the number of times the short URL has been clicked and where the user is coming from. To do this, we create a Visits table to store the number of times the short URL has been visited. Each record in the Visits table stores the information about a particular visit, including its date and where the visitor comes from.

We use the environment variable from the server called REMOTE_ADDR to find out where the visitor comes. REMOTE_ADDR provides the remote IP address of the client that is accessing the short URL. We use this IP address with an IP geocoding provider to find the country that the visitor comes from then store the country code as well as the IP address.

Collecting the data is only the first step though. We will need to display it properly. There are plenty of visualization APIs and tools in the market; many of them are freely available. For the purpose of this article, we have chosen to use the Google Charts API to generate the following charts:

  • A bar chart showing the daily number of visits
  • A bar chart showing the total number of visits from individual countries
  • A map of the world visualizing the number of visits from individual countries

You might notice that in our design the user does not need to log into this application to use it. Tinyclone is a clone that does not have any access control on its pages. Most URL shorteners have a public and main feature that redirects short URLs to their original, long URLs. In addition to that some URL shorteners have user-specific access controlled pages that provide information to the users such as the statistics and reporting feature shown above. However, in this clone we will not be implementing any access controlled pages.

Technologies and platforms used

We use a number of technologies in this article, mainly revolving around the Ruby programming language and its various libraries.

Sinatra

Sinatra is a Domain Specific Language (DSL) for quickly creating web applications in Ruby. It keeps a minimal feature set for developers and is an excellent tool for creating small to mid-sized web applications using Ruby.

Haml

Haml (HTML Abstraction Markup Language) is a simple markup language that is used to cleanly describe HTML in a web page. Haml was originally built for Ruby but has also been ported to other languages and platforms.

DataMapper

DataMapper is a object-relational mapping library for Ruby. While there are a number of Ruby object-relational mapping libraries, DataMapper has a number of good features. It is independent and doesn't tie in with any particular frameworks. It is also built to be fast and efficient, often delaying interacting with the data store until it is needed. It is also very Ruby centric and fits in well in the rest of the technologies that we use in the article.

Blueprint CSS

Blueprint CSS is a simple and effective CSS framework. It provides a basic set of CSS styles that makes developing web applications much easier. One of its most useful features is a set of grid layout styles that allows simple to complex layouts to be created easily and effectively. Used together with HAML, it allows us to create great looking front-ends for our web applications.

Mashups

While the main features in the applications are all implemented within the article itself, sometimes we still depend on other services provided by generally well-known providers. In this article we use two services—Google Chart API for visualizing the statistics we gather on the short URLs and HostIP to geocode IP addresses we get.

Google Chart API

The Google Chart API provides its users a means of dynamically generating various types of charts. In this article, we use the Google Chart API to generate statistics visualizations, specifically bar charts and maps.

The Google Chart API returns a PNG-format image in response to a URL. Several types of image can be generated, and for each image type, you can specify attributes such as size, colors, and labels. The Google Chart API is not rate-limited but Google advises users to let them know if they use more than 250,000 API calls per day.

HostIP

HostIP is a free service that provides geocoding services based on IP addresses. Its usage is very simple—we just need to call a HostIP URL with an IP address and it will return geocoded information on the IP address. HostIP is a community-based project that gets its data from its users so accuracy is not perfect. However, for our purpose it is good enough.

Heroku

Heroku is a Ruby-specific cloud-computing platform that provides specialized Ruby hosting services for developers. It allows Ruby developers to easily and almost instantly deploy web applications to the Internet. Heroku supports Rack-based web applications so deploying our Sinatra applications to Heroku is a breeze. While Heroku charges for hosting, it also provides a free basic tier account.

Summary

In this article, we designed one of the simplest popular Internet applications around, TinyURL. We started off the article with a general introduction to URL shorteners and a list of pros and cons of using URL shorteners. Then we listed the main features and discussed the design of our URL shortener, called Tinyclone. After designing our clone, we went briefly into a short refresher on the technologies used.


Further resources on this subject:


Cloning Internet Applications with Ruby Make clones of some of the best applications on the Web using the dynamic and object-oriented features of Ruby
Published: August 2010
eBook Price: $23.99
Book Price: $39.99
See more
Select your format and quantity:

About the Author :


Chang Sau Sheong

Chang Sau Sheong has more than 12 years experience in software application development and has spent much of his career in web and Internet-based applications. He has a wide range of experience in banking payment-related as well as Internet-based e-commerce software. Currently he is the Director of Software Development of a 50+ strong software development team in Welcome Real-time, a multi-national payment/loyalty software company based in France and Singapore.

Sau Sheong hails from tropical Malaysia but has spent most of his adult and working life in sunny Singapore, where he shares his spare time between enthusiastically writing software and equally enthusiastically playing Nintendo Wii with his wife and son.

Books From Packt

jQuery 1.4 Reference Guide
jQuery 1.4 Reference Guide

FreeSWITCH 1.0.6
FreeSWITCH 1.0.6

Agile Web Application Development with Yii1.1 and   PHP5
Agile Web Application Development with Yii1.1 and PHP5

Joomla! Social Networking with JomSocial
Joomla! Social Networking with JomSocial

Grok 1.0 Web Development
Grok 1.0 Web Development

Plone 3.3 Site Administration
Plone 3.3 Site Administration

Nginx HTTP Server
Nginx HTTP Server

NetBeans Platform 6.9 Developer's Guide
NetBeans Platform 6.9 Developer's Guide

Code Download and Errata
Packt Anytime, Anywhere
Register Books
Print Upgrades
eBook Downloads
Video Support
Contact Us
Awards Voting Nominations Previous Winners
Judges Open Source CMS Hall Of Fame CMS Most Promising Open Source Project Open Source E-Commerce Applications Open Source JavaScript Library Open Source Graphics Software
Resources
Open Source CMS Hall Of Fame CMS Most Promising Open Source Project Open Source E-Commerce Applications Open Source JavaScript Library Open Source Graphics Software