In the previous post on Search Engine Friendly Websites we learnt what search engine spiders are. Now we can look at what we can do to influence spidering behaviour. Spidering is an often overlooked part of SEO. If we write a page of content and put it on the web, getting it spidered is the first thing we should be doing. Remember, if Google doesn't have a page in its index, it can't rank it. Right?
We can actually dictate to each engine how we want its spiders to behave on our site; in fact, they like us to do this. So how do we do it?
Technically we can do this with the robots.txt file and the robots meta tag, and we can also use fresh content like a blog to attract a spider's attention. Today we're going to stick to the "robots.txt" file and tomorrow we will deal with the robots meta tag.
The robot exclusion standard (also known as the Robots Exclusion Protocol) is a method of stipulating to spiders what they can or can't see. You do this in the robots.txt file, a plain text file in the root of your site that lists the folders and files you don't want the spiders to visit.
Firstly you stipulate which spiders your rules apply to, and secondly what you don't want them to spider. The line "User-agent: *" stipulates that all bots follow your list. You may also want to address only certain bots, say if a particular engine is penalising you for duplicate content but others aren't, so you can name each one individually, for example "User-agent: Slurp".
If you didn't know, "Slurp" is Yahoo's bot. You then tell the bots what you don't want them to see. The line "Disallow: /admin/" tells a spider that you don't want it looking at anything in the admin directory. Just as Disallow is supported, so is Allow, though only by some engines; these do include the major player, Google. This is useful if you wish to include a particular file in a folder you have disallowed: you add an "Allow:" line for that file next to the "Disallow:" line for its folder.
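One way to check how a rule set like this will be read is to run it through a parser before uploading it. The sketch below uses Python's standard urllib.robotparser module. The robots.txt content, the /duplicates/ folder and the help.html file are made-up examples, not taken from a real site. Python's parser applies the first matching rule, so the Allow line is placed before the Disallow line; Google picks the most specific rule, so this order suits both.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the folders and file names are made up for illustration.
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /duplicates/

User-agent: *
Allow: /admin/help.html
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(bot, path):
    """Return True if the named bot may fetch the path on our example site."""
    return parser.can_fetch(bot, "http://www.mysite.com" + path)


print(allowed("Googlebot", "/admin/login.php"))       # False: the admin folder is disallowed
print(allowed("Googlebot", "/admin/help.html"))       # True: the Allow line makes an exception
print(allowed("Slurp", "/duplicates/page.html"))      # False: the rule aimed at Slurp only
print(allowed("Googlebot", "/duplicates/page.html"))  # True: other bots are unaffected
```

Note that in this sketch Slurp has its own group, so it reads only that group and not the "*" rules. If you want a named bot to stay out of the admin folder too, repeat the Disallow line inside its group.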
You can also tell a spider where your sitemap is. If you don't have a sitemap, create one. Even if you don't think it's necessary, check by typing site:mysite.com into each engine; it will tell you how many pages from that site are in its index. If there are big variations between engines, you know some spiders are having problems, and once you have checked your site's navigational architecture a sitemap is the best way to sort that out.
MSNbot isn't as good at spidering as the other two big engines. It tends to find problems and respond to them by either spidering a bunch of unimportant content or ignoring it altogether. Unlike Yahoo or Google, it also doesn't have a webmaster tool where you can tell it directly where your sitemap is, so listing the sitemap in robots.txt is the next best thing. You do this with a "Sitemap:" line followed by the full URL of your sitemap.
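As a rough illustration, a sitemap entry can be checked the same way. The sitemap URL below is an assumed example, and the site_maps() method needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the sitemap URL is an assumed example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: http://www.mysite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)  # ['http://www.mysite.com/sitemap.xml']
```

The Sitemap line is not tied to any User-agent group, so it can sit anywhere in the file.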
You can also, if you want, stipulate how often a spider may visit. For example, "Crawl-delay: 60" asks for 1 visit every 60 seconds.
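A minimal sketch of reading that delay back, assuming the rule is aimed at MSNbot (the choice of bot here is only for illustration, and not every engine honours this line):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking one named bot to wait 60 seconds between visits.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 60
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# crawl_delay() returns the delay in seconds for the named bot.
delay = parser.crawl_delay("msnbot")
print(delay)  # 60
```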
You can also stipulate a visit time with a "Visit-time" line. This may be useful if you have a particular promotion or conference call with your users at a set time each day, or if your response times slow down at certain times of the day: you can free your web site's server from spiders during those periods.
Note: this is highly important. Make sure you spell everything in your robots.txt file correctly. I have seen examples of major engines not spidering whole sites because the robots.txt file had typos in it.
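A quick spell check can be automated. The sketch below is one possible approach, not an official validator: it flags any line whose directive name is not in a short list. The list of known names is an assumption covering only the directives discussed in this post plus Request-rate, and the sample file with its deliberate typo is made up.

```python
# Directive names this sketch accepts; an assumed list, extend it as needed.
KNOWN_DIRECTIVES = {
    "user-agent", "disallow", "allow", "sitemap",
    "crawl-delay", "visit-time", "request-rate",
}


def find_unknown_directives(robots_txt):
    """Return (line number, line) pairs for lines with an unrecognised directive."""
    problems = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding spaces
        if not line:
            continue
        name, separator, _value = line.partition(":")
        if not separator or name.strip().lower() not in KNOWN_DIRECTIVES:
            problems.append((number, raw.strip()))
    return problems


# Hypothetical file with a deliberate typo on line 2.
SAMPLE = """\
User-agent: *
Dissalow: /admin/
Sitemap: http://www.mysite.com/sitemap.xml
"""

problems = find_unknown_directives(SAMPLE)
print(problems)  # [(2, 'Dissalow: /admin/')]
```

This only catches misspelt directive names. It will not notice a misspelt folder or bot name, so reading the file through by eye is still worthwhile.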
Good spiders follow this to the letter; bad spiders don't. You can trust spiders from the big search engines like Google, Yahoo and MSN. A rogue spider, such as one harvesting emails to sell for spam, cannot be trusted in the same way. So listing confidential information in the robots.txt file will not on its own make those parts of your site secure; don't rely on it solely for this purpose. But you can be assured that good spiders will follow your robots.txt file to the letter.