Robots.txt Optimization

Today I am going to talk about an interesting file: robots.txt. The purpose of this file is to tell search engine robots (crawlers) which parts of your website they may access. The main use of robots.txt is to tell spiders not to crawl the links you list in it; no one can force spiders to crawl a website, since that is entirely up to them, but you can block spiders from certain parts of your site, or even from all of it.

At this point many of you might wonder why you should place a robots.txt file in the root directory of your site at all, when you want spiders to crawl the website completely and crawling is their normal job anyway. Here is the reason: spiders routinely request /robots.txt before crawling a site, and if the file does not exist the server returns a 404 error. Spiders may record that 404 as a broken link, and such a broken-link report may reduce the importance of your website in a search engine's view. To avoid this situation it is always advisable to upload this simple text file to the root directory of your server.
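If you want to see that 404 behavior for yourself, here is a minimal sketch using Python's standard library (with example.com standing in as a placeholder for your own domain):

from urllib.request import urlopen
from urllib.error import HTTPError

try:
    # A spider's first request: fetch /robots.txt from the site root.
    with urlopen("https://example.com/robots.txt") as response:
        print("robots.txt found, status:", response.status)
except HTTPError as err:
    # With no robots.txt uploaded, the server answers with a 404 error.
    print("no robots.txt, status:", err.code)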

Let me give an example: you may not want Google to crawl the /images directory of your site, since it is both meaningless to you and a waste of your site's bandwidth. Robots.txt lets you tell Google exactly that with a simple text file, as shown below.
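For instance, these two lines would block Google's main crawler (Googlebot) from everything under /images/, assuming that is where your images live:

User-agent: Googlebot
Disallow: /images/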

Let’s start with the optimization process. Create a regular text file called “robots.txt”, and make sure it is named exactly that. This file must be uploaded to the root directory of your site, not a subdirectory. The format is simple enough for most intents and purposes: a User-agent line identifies the crawler in question, followed by one or more Disallow: lines that keep it from crawling certain parts of your site.

1) Here's a basic "robots.txt":

User-agent: *
Disallow: /

With the above declared, all robots (indicated by "*") are instructed not to crawl any of your pages (indicated by "/").

2) Let's get a little more discriminating now. While every webmaster loves Google, you may not want Google's image bot crawling your site's images and making them searchable online, if only to save bandwidth. The declaration below will do the trick:

User-agent: Googlebot-Image
Disallow: /

3) The following disallows all search engines and robots from crawling selected directories and pages:

User-agent: *
Disallow: /cgi-bin/
Disallow: /privatedir/
Disallow: /tutorials/blank.htm

4) You can conditionally target multiple robots in "robots.txt". Take a look at the example below:

User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/

This is interesting: here we declare that crawlers in general should not crawl any part of our site, EXCEPT for Googlebot, which is allowed to crawl everything apart from /cgi-bin/ and /privatedir/. In other words, the most specific matching record wins; the rules are not inherited across records.
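If you want to verify how a crawler will interpret a record like this, you can test it with Python's standard urllib.robotparser module. A minimal sketch (the paths checked here are just illustrations):

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/
"""

# Parse the rules exactly as a well-behaved crawler would.
rp = RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot matches its own record, so only the listed paths are blocked.
print(rp.can_fetch("Googlebot", "/index.html"))         # True
print(rp.can_fetch("Googlebot", "/privatedir/a.html"))  # False

# Every other robot falls back to the "*" record and is blocked everywhere.
print(rp.can_fetch("SomeOtherBot", "/index.html"))      # False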
