Since I recently got into this whole blogging thing, and I’m someone who tends to exhaustively research anything I’m interested in (I guess that’s why I like my job), I wanted to share a few WordPress tips I’ve worked out that may help others. One of those is figuring out the right robots.txt file for your WordPress site. The goal is not so much SEO (search engine optimization) as making sure the right content is indexed by sites like Google, and the wrong stuff isn’t. I’ll break this down into fairly basic terms for people who may be new to the process. There are a variety of blog posts on the subject, and this is my own spin on the issue. The key is that you don’t want to block too much, so try to block only things that are meaningless to readers (like script files).
The root folder of your site can have a text file in it named robots.txt. This file contains rules that tell search engines which files and folders they may crawl and which ones are off-limits. Google has a bad rap for ignoring robots.txt files, but I believe that comes from confusion about how Google interprets the file. By playing with their robots.txt analysis tool I found something that I think many neophytes are missing.
First, a general primer. Below are the first few lines from my robots.txt file.
User-agent: *
# disallow all files in these directories
Disallow: /blog/wp-*
You can see the first line defines the “user agent,” which is the name of the search engine spider that should obey the section that follows. The second starts with a #, which indicates that it’s a comment and not an instruction for anyone. The third line says “do not crawl anything under /blog/ whose name starts with wp-.” That covers WordPress’s own folders and files, such as wp-content, wp-admin, and so on.
The robots.txt file allows you to have multiple sections for different search engines. For example, you could have a specific section for Google which starts with:
User-agent: Googlebot
Now here’s a trick. If you have a section specifically for Google, then Google will ignore the “User-agent: *” section entirely. So if you want special instructions for Google in addition to the * section, you need to duplicate the star section’s rules inside the Google section.
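To make that behavior concrete, here is a minimal Python sketch of how a crawler might pick the one section it obeys. The function names (parse_groups, rules_for) and the sample file are made up for illustration, and the lookup is a simple exact-name match, which is an assumption and not Google’s actual implementation.

```python
def parse_groups(robots_txt):
    """Split robots.txt text into {lowercased user agent: [(directive, path), ...]}."""
    groups = {}
    current_agents = []
    last_was_agent = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not last_was_agent:
                current_agents = []  # a new section starts here
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
            last_was_agent = True
        elif field in ("allow", "disallow"):
            for agent in current_agents:
                groups[agent].append((field, value))
            last_was_agent = False
    return groups


def rules_for(groups, crawler):
    """Return the single section a crawler obeys: its own if present, else '*'."""
    crawler = crawler.lower()
    if crawler in groups:
        return groups[crawler]  # the '*' section is NOT merged in
    return groups.get("*", [])


# Hypothetical file: a star section plus a Google-only section.
EXAMPLE = """\
User-agent: *
Disallow: /blog/wp-*
Disallow: /blog/category/

User-agent: Googlebot
Disallow: /blog/*/feed/$
"""

groups = parse_groups(EXAMPLE)
googlebot_rules = rules_for(groups, "Googlebot")
other_rules = rules_for(groups, "SomeOtherBot")
```

In this sketch googlebot_rules ends up holding only the feed rule, so the wp-* and category rules from the star section would have to be copied into the Googlebot section to apply to Google.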
A great tool for testing this out is Google’s Webmaster Tools site. Here you can register your site with Google, indicating that you are the owner, and then it will give you a section for “robots.txt analysis.” This is what I based my tweaks on. I made a list of URLs that I wanted Google to include, and URLs that I did not want it to include, and tweaked my robots.txt file until it gave me the results I wanted.
As for what I want indexed and what I don’t, here were my goals.
- Do not index my tags or my categories. If someone finds one of my posts, I want the link in Google to go to the post itself, not to a list of posts. Otherwise the searcher might see a link to both the original post and to some category page that contains the post.
- Do not index my post-specific RSS feeds. It bugs the crap out of me to click on a link in Google and have it turn out to be a link to a feed rather than the content itself. It’s fine to index the main feed, but not a post-specific one.
- Do not index any of the administrative pages, PHP files, or anything that’s not a re-written URL (so no URLs with a ? in them).
The final result of my tweaking is below:
User-agent: *
# disallow all files in these directories
Disallow: /blog/wp-*
Disallow: /blog/contact/
Disallow: /blog/category/
Disallow: /blog/about/
Disallow: /blog/*?*
Disallow: /blog/*/trackback/$
Disallow: /mint$
Disallow: /stats/$
Disallow: /feeder/$
Disallow: /blog/*/feed/$
Allow: /blog/feed/$

# allow google image bot to search all images
User-agent: Googlebot-Image
Allow: /*

# disallow archiving site
User-agent: ia_archiver
Disallow: /*

# disable duggmirror
User-agent: duggmirror
Disallow: /*
It could probably be improved, but I think it does what I want it to do. I’ll explain a few lines:
- Disallow: /blog/wp-*
  Blocks all of your WordPress admin pages, content folders, and anything else that’s part of WordPress rather than a post.
- Disallow: /blog/category/
  Blocks the post category indexes. I use a WordPress plugin to auto-submit sitemaps to the main search engines, so I’m not worried about my posts being found.
- Disallow: /blog/about/
  I’m not interested in being found by who I am, only by what I post.
- Disallow: /blog/*/trackback/$
  Disallow: /blog/*/feed/$
  These block Google from linking to non-content pages. Every post has a /feed subdirectory that links to its RSS feed. I hate clicking those in Google, and I don’t want to inflict it on others. Note that the $ here means “end of the URL,” so these only match URLs that end in /trackback/ and /feed/.
- Allow: /blog/feed/$
  Keeps the main feed indexable; only the post-specific feeds are blocked.
- Disallow: /blog/*?*
  Blocks the non-rewritten pages, if any are out there. This means Google will only find the “pretty” URLs.
- Disallow: /mint$
  I use the wonderful Mint for web statistics, and I don’t want search engines to find it. If you have your own site you should check out my little review of the software.
- Disallow: /feeder/$
  I use a Mint add-in for seeing how many people subscribe to the site, and it uses the /feeder/ folder, which has no user-readable content.
Now to test my robots.txt I used the following list of URLs for my site. The first four lines should show up as “Allowed” and the rest should be “Blocked.” I use this with the robots analysis tool to verify that it’s doing what I expect. The blocked URLs are just random bits of PHP, scripts and so forth.
http://variablefragment.com/
http://variablefragment.com/blog/archives/
http://variablefragment.com/blog/feed/
http://variablefragment.com/blog/2007/04/06/mint/
http://variablefragment.com/blog/2007/04/06/mint/feed/
http://variablefragment.com/mint
http://variablefragment.com/blog/wp-content/themes/k2/
http://variablefragment.com/blog/wp-login.php
http://variablefragment.com/blog/wp-content/themes/k2/js/livesearch.js.php
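If you would rather run the same check locally, the Python sketch below approximates it. It assumes Google-style matching, where * matches any run of characters, a trailing $ anchors the end of the URL, the longest matching pattern wins, and Allow wins a tie. The function names are made up for illustration, and this is not the analysis tool’s actual code. Under those assumptions the post URL (/blog/2007/04/06/mint/) matches no Disallow rule, so it comes out allowed along with the first three.

```python
import re
from urllib.parse import urlsplit

# The "User-agent: *" rules from the robots.txt above, in file order.
RULES = [
    ("disallow", "/blog/wp-*"),
    ("disallow", "/blog/contact/"),
    ("disallow", "/blog/category/"),
    ("disallow", "/blog/about/"),
    ("disallow", "/blog/*?*"),
    ("disallow", "/blog/*/trackback/$"),
    ("disallow", "/mint$"),
    ("disallow", "/stats/$"),
    ("disallow", "/feeder/$"),
    ("disallow", "/blog/*/feed/$"),
    ("allow", "/blog/feed/$"),
]


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces; each '*' becomes "match anything".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(url, rules=RULES):
    """Return True if the URL may be crawled under the given rules."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    best = None  # (pattern length, is_allow) of the strongest match so far
    for kind, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


TEST_URLS = [
    "http://variablefragment.com/",
    "http://variablefragment.com/blog/archives/",
    "http://variablefragment.com/blog/feed/",
    "http://variablefragment.com/blog/2007/04/06/mint/",
    "http://variablefragment.com/blog/2007/04/06/mint/feed/",
    "http://variablefragment.com/mint",
    "http://variablefragment.com/blog/wp-content/themes/k2/",
    "http://variablefragment.com/blog/wp-login.php",
    "http://variablefragment.com/blog/wp-content/themes/k2/js/livesearch.js.php",
]

results = [is_allowed(url) for url in TEST_URLS]
for url, allowed in zip(TEST_URLS, results):
    print("Allowed" if allowed else "Blocked", url)
```

Python’s built-in urllib.robotparser does not understand the * and $ wildcards, which is why this sketch does its own matching.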
A final note: once you have your robots.txt file configured the way you like, that doesn’t seem to mean that Google will remove now-forbidden pages from its index. I assume they will expire, but I haven’t looked into it. If anyone knows, drop me a note in the comments. You can manually remove pages using the Webmaster Tools interface. There is a link on the left called “URL Removals.”
Note, too, that you can always check out what your favorite blogs are using for their robots.txt file. Just append /robots.txt to the domain name and you’ll see it. I found some ideas by checking that out as well. Just be careful, and don’t get too creative! Remember, you can torpedo yourself if you block too much content from being indexed.
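As a small convenience, this Python sketch turns any page URL into the matching robots.txt URL and, optionally, downloads it. The helper names robots_url and fetch_robots are hypothetical; the fetch uses the standard library’s urllib.request.urlopen and assumes the file is UTF-8 text.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_url(page_url):
    """Build the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_robots(page_url, timeout=10):
    """Download the site's robots.txt as text (needs network access)."""
    with urlopen(robots_url(page_url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


example = robots_url("http://variablefragment.com/blog/2007/04/06/mint/")
```

Calling fetch_robots on a favorite blog’s address should return the same text you would see in the browser, assuming the site serves a robots.txt at all.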
Also, there is a very nice writeup by John Wiseman of his SEO tips for WordPress. It contains more general tips for SEO, such as creating a sitemap or tweaking your page names.