Chapter 6. robots.txt, .htaccess, and W3C Validation
Much of the SEO that we've accomplished so far is visible to your visitors (for example, titles, headings, body text, and even a sitemap or two). In this chapter, we're going to address some of the more technical aspects of on-page SEO. Over the last ten years, many elements have been added to the HTML specification. The search engines themselves have developed other elements to help you communicate better with them. Since our ultimate goal is to do well by the search engines and our visitors, it's time to embrace your inner geek and get technical with your SEO. Pocket protectors ready? Let's do this thing.
In this chapter, we're going to cover:
The robots.txt files and common directives used in these files
Problems with Drupal's standard robots.txt and how to fix them
Adding the XML Sitemap to the robots.txt
Understanding and editing the .htaccess file
W3C Validation
Note
Take care when upgrading your Drupal installation!
In this chapter, we...
Optimizing the robots.txt file
The robots.txt file sits at the root level of your web site and asks spiders and bots to behave themselves when they're on your site. You can take a look at it by pointing your browser to http://www.yourDrupalsite.com/robots.txt. Think of it as an electronic No Trespassing sign that tells the search engines not to crawl a certain directory or page of your site. Using wildcards, you can even tell the engines not to crawl certain file types like .jpg or .pdf. This means your JPEG images and PDF files won't be crawled, so they are unlikely to show up in the search engines. (I'm not recommending that you do that…but you could.)
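To see how a well-behaved bot reads these rules, here is a minimal sketch using Python's standard urllib.robotparser module. The rules, the paths, and the www.example.com host are made up for illustration and are not Drupal's actual robots.txt. Note that this standard-library parser only does simple prefix matching, so wildcard rules are left out of the sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /user/login
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# An ordinary content page matches no Disallow rule, so it may be crawled.
allowed_node = parser.can_fetch("Googlebot", "http://www.example.com/node/1")

# Anything under /admin/ matches the first Disallow rule.
allowed_admin = parser.can_fetch("Googlebot", "http://www.example.com/admin/settings")

print(allowed_node, allowed_admin)
```

Keep in mind that robots.txt is a polite request. Well-behaved crawlers honor it, but nothing in the file enforces it.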
Note
The robots.txt file is required by Google
On December 1, 2008, John Mueller, a Google analyst, said that if the Googlebot can't access the robots.txt file (say the server is unreachable or returns a 5xx error result code) then it won't crawl the web site at all. In other words, the robots.txt file must be there if you want the web site...
Mastering the .htaccess file
There is a server configuration file at the root level of your Drupal 6 site called the .htaccess file. This file is a list of instructions to your web server software, usually Apache. These instructions are very helpful for cleaning up some redirects and otherwise making your site function a bit better for the search engines. In Chapter 1, The Tools You'll Need, we told Google Webmaster Tools that we wanted our site to show up in Google with or without the www in the URL. The .htaccess file allows you to do the same thing directly on your web site. Why are both necessary? In Google's tool, you're only telling Google how you want them to display your URLs; you're not actually changing the URLs on your web site. With the .htaccess file, you're actually affecting how the files are served. This will change how your site is displayed in all search engines.
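The real work in .htaccess is done with Apache rewrite directives, but the logic behind a www redirect is easy to model. The following is a rough sketch in Python of what such a rule decides for each incoming URL. The host names and the canonical_redirect function are hypothetical, and the sketch assumes you have chosen the www form as your preferred address.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Hypothetical host names; substitute your own domain.
CANONICAL_HOST = "www.example.com"
BARE_HOST = "example.com"


def canonical_redirect(url: str) -> Optional[str]:
    """Return the URL a 301 redirect should point to, or None if no redirect is needed.

    This models the behavior of a rewrite rule pair along the lines of:
        RewriteCond %{HTTP_HOST} ^example\\.com$ [NC]
        RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()  # host names are case-insensitive
    if host != BARE_HOST:
        return None  # already canonical, or not a host this rule handles
    # Keep the scheme, path, and query string; swap only the host.
    return urlunsplit((parts.scheme, CANONICAL_HOST, parts.path, parts.query, parts.fragment))


redirect_target = canonical_redirect("http://example.com/node/1?page=2")
no_redirect = canonical_redirect("http://www.example.com/node/1")
print(redirect_target, no_redirect)
```

Because the redirect is permanent (a 301), search engines treat the www address as the one true location of each page rather than seeing two copies of your site.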
Note
Hey, why can't I see the .htaccess file?
In Unix/Linux Operating Systems, any file that begins with...
W3C Validation
Drupal is a well-written piece of software that produces well-formed web sites. However, don't assume that it will still be that way when you're done with it. Not all of the modules, themes, or content on your site will pass muster. This is especially true if your site is open to users to create their own content.
You should run a comprehensive scan of the site to check for improperly formed code, broken links, and other oversights that could hinder your search engine positioning. Obviously, Google can't reject sites just because they have bad markup (most of the sites out there have at least one thing wrong with them). However, bad HTML can confuse the search engine spiders, which are not as forgiving of technical issues as a modern browser is. By eliminating problem markup, you remove this concern from your site.
There is a great, and free, tool that you can use to scan your site. It's called the W3C HTML Validator.
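To get a feel for the kind of problem a validator catches, here is a rough local pre-check written in Python. It is not the W3C HTML Validator and checks far less: it only looks for tags that are opened but never closed, or closed out of order. It assumes XHTML-style markup in which every non-void element is closed explicitly, and the class and function names are made up for this sketch.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "param", "source", "track", "wbr",
}


class TagBalanceChecker(HTMLParser):
    """Track open tags on a stack and record any mismatched closing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.problems.append(f"unexpected </{tag}>")


def find_markup_problems(html: str) -> list:
    """Return a list of human-readable tag-balance problems (empty if none)."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    unclosed = [f"unclosed <{tag}>" for tag in checker.open_tags]
    return checker.problems + unclosed


clean = find_markup_problems("<div><span>Hello</span><br /></div>")
broken = find_markup_problems("<div><span>Hello</div>")
print(clean)
print(broken)
```

A check like this can flag an obviously broken template quickly, but it knows nothing about doctypes, attributes, or nesting rules, so it is no substitute for the full validator.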
Scanning your site with the W3C HTML Validator...
In this chapter, we covered some of the most technical aspects of good SEO. We discussed:
The robots.txt file
The .htaccess file
We've got one more chapter of technical, on-page optimization, and then you'll be ready to start populating your site with content.