What is a robots.txt file?
It is a file at the root location of your domain. It contains instructions that guide search engine bots as they crawl. In the robots.txt file we can specify which portions of the site should not be crawled. These instructions follow the Robots Exclusion Protocol (robots exclusion standard). We can also specify which crawler a rule applies to, such as a desktop or mobile crawler.
# 1 How can the robots.txt file be used for non-image files (web pages)?
- For web pages that are not image files, robots.txt should be used to control crawling traffic.
- It should not be used to hide pages from Google Search. If any other page links to the blocked page, it can still get indexed through that link.
- If you do not want a page to be indexed, use a noindex tag or password protection.
# 2 The robots.txt file and image files
- The robots.txt file prevents image files from appearing in search results. However, it does not stop other pages or users from linking to the image.
# 3 The robots.txt file and resource files
- robots.txt can be used to block unimportant resources such as images, scripts, and style files. However, if blocking them makes it harder for the crawler to understand the page, we should not block these resources.
# 4 Limitations of the robots.txt file
Instructions in the robots.txt file cannot enforce the behavior of bots on your site. They only guide crawlers on which pages to access.
# 5 Interpretation of instructions by crawlers
- Googlebot and other reputable crawlers obey robots.txt, but some crawlers may not. Crawlers can also interpret the syntax differently, so we should address each crawler with the proper syntax.
# 6 How to prevent a page from being indexed?
- Google will not crawl or index content that robots.txt blocks. However, Google may still index the URL if other websites link to it. To keep the page out of search results, password protect it or use a noindex meta tag or response header.
# 6.1 Implementing noindex
- There are two methods: a meta tag and an HTTP response header.
- To prevent Googlebot from indexing a page on your blog, place the following meta tag in the <head> section of the page.
<meta name="googlebot" content="noindex">
To prevent all bots from indexing the page, replace googlebot with robots.
- Search engines have different bots for different purposes. If we want a page to show in Google web search but not in Google News, use name="googlebot-news" in place of googlebot.
- Help Google spot the meta tag.
- Google has to crawl the page to see the meta tag. If your page still appears in search results after you add noindex, the bots have probably not crawled it since the tag was added.
- Use Fetch as Google to recrawl.
- We can ask Google to recrawl the page with Fetch as Google. Another possible reason is that the robots.txt file is blocking the page, so the crawler never sees the tag.
- Robots.txt file tester
You can edit and test your robots.txt file in Google Search Console.
- HTTP response header method
- Instead of the meta tag, the noindex instruction can also be sent in an HTTP response header (X-Robots-Tag: noindex) returned with the page.
# 7 Password protection
If you do not want a page to be publicly available or listed in search engine results, password protect it. Store it in a password-protected directory on the server. This way no bots can crawl it or show it in search results.
# 8 How to create a robots.txt file
- A robots.txt file has one or more rules. Each rule blocks or allows access for a given bot to a specified file path on the blog or site.
Below is a simple robots.txt file with two rules.
User-agent: googlebot
Disallow: /tag/
User-agent: *
Allow: /
1. The user agent named "googlebot" should not crawl the /tag/ folder or any of its subfolders.
2. All other user agents (bots) can access the entire website.
# 8.1 Basic robots.txt file guidelines
To understand the robots.txt file better, we should know its syntax, so read through the following points.
- A crawler is a service run by a search engine that automatically accesses the known URLs of a site whose content can be reached with a standard web browser.
- A user-agent identifies a specific crawler or a set of crawlers. For example, in the line User-agent: googlebot, "googlebot" is the specific crawler chosen.
- The robots.txt file should be located in the top-level (root) directory of the website. It must be accessible over the appropriate protocol, such as HTTP or HTTPS.
- Google also accepts robots.txt files on FTP sites, accessed through the FTP protocol.
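# 8.2 File editor
- We can use any text editor that saves standard ASCII or UTF-8 text files. Do not use a word processor, because word processors often save in their own formats and can add unexpected characters.
- The file name should be robots.txt.
- We can have only one robots.txt file per website.
- The file should be at the root location so that it controls crawling of all URLs below it.
- It cannot be placed in a subdirectory.
- The file can be applied to a subdomain when it is placed at that subdomain's root.
- The file can contain any number of comment lines. A comment starts with the # character.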
# 8.3 Syntax
- The file should be an ASCII or UTF-8 text file. Google does not accept other formats, so we must make sure of this; otherwise the rules may not be read correctly.
- A robots.txt file may consist of multiple rules, and each rule may have multiple directives, one directive per line.
- A rule gives standard information: who the rule applies to (the user agent),
- and which directories or files the specified crawler can or cannot access.
- Rules written in the file are processed from top to bottom.
- By default, a user agent (crawler) can access any page or directory that is not blocked by a Disallow: rule.
- Rules are case sensitive, for example:
- Disallow: /apk.asp applies to http://www.gpatrika.com/apk.asp but not to http://www.gpatrika.com/APK.asp .
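As a rough illustration of the syntax points above, a minimal file might look like the sketch below. The path /apk.asp is reused from the case-sensitivity example; the comment text and the choice of googlebot as the user agent are assumptions made only for illustration.

```text
# A comment runs from the hash mark to the end of the line.
# One rule: who it applies to, then what that crawler may not access.
User-agent: googlebot
Disallow: /apk.asp
```

Because path matching is case sensitive, this rule would still leave /APK.asp open to crawling. A second Disallow line would be needed to cover it.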
# 8.4 The following directives are used in the robots.txt file
User-agent: This specifies the name of the search engine robot or web crawler software that the rule applies to.
Example 1: Block AdsBot
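A minimal sketch of this example, assuming "adsbot" refers to Google's AdsBot-Google user agent token:

```text
# Example 1: block only AdsBot (token name is an assumption)
User-agent: AdsBot-Google
Disallow: /
```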
Example 2: Block Googlebot and AdsBot
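A possible form of this example is sketched here. It assumes the two crawlers are addressed with the tokens Googlebot and AdsBot-Google. Listing several User-agent lines before one set of directives applies that set to each of them.

```text
# Example 2: block Googlebot and AdsBot (token names are assumptions)
User-agent: Googlebot
User-agent: AdsBot-Google
Disallow: /
```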
Example 3: Block all crawlers except AdsBot
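A sketch of how this is usually written follows. It relies on Google's documented behavior that the AdsBot crawlers ignore the wildcard user agent and have to be named explicitly, so a wildcard rule alone leaves them unaffected.

```text
# Example 3: the wildcard matches every crawler that honors it;
# Google's AdsBot crawlers do not match * and must be named explicitly.
User-agent: *
Disallow: /
```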
This will block all crawlers except the AdsBot crawlers.
Disallow: This tells the bots not to crawl the given path. In Example 1 and Example 2 above, the bots are not allowed anywhere under the root directory ( Disallow: / ).
Allow: This tells the bots that they may crawl the given path. It is also used to override a Disallow rule, so that a subdirectory or web page inside a blocked directory can still be crawled.
Sitemap: This is the full URL of the sitemap, written in the file. It is a good way to tell bots which content to crawl.
# Another example of the robots.txt file
A robots.txt file contains one or more blocks of rules, each starting with a User-agent line. Here are some examples.
# Block googlebot from the directories gpatrika.com/catagary/ and gpatrika.com/tag/ but allow access to gpatrika.com/catagary/seo
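One way this block could be written is sketched here, using the paths named above. Google applies the most specific (longest) matching rule, so the Allow line wins for /catagary/seo. Other crawlers may resolve the conflict differently.

```text
User-agent: googlebot
Disallow: /catagary/
Disallow: /tag/
Allow: /catagary/seo
```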
# Block the entire website from anothercrawler
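A minimal sketch, treating "anothercrawler" as a placeholder user agent name:

```text
User-agent: anothercrawler
Disallow: /
```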
# Allow access to a single crawler
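The source does not say which crawler is meant, so this sketch assumes googlebot is the one allowed in and that every other crawler is kept out:

```text
# The allowed crawler (googlebot here is an assumption)
User-agent: googlebot
Allow: /

# Everyone else
User-agent: *
Disallow: /
```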
# Allow access to all but a single crawler
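A sketch of the reverse case, again using the placeholder name "anothercrawler" for the one crawler that is blocked:

```text
# The blocked crawler (placeholder name)
User-agent: anothercrawler
Disallow: /

# Everyone else
User-agent: *
Allow: /
```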
# Disallow crawling of a single web page
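A minimal sketch, borrowing /apk.asp from the case-sensitivity example as a stand-in for the page to block:

```text
User-agent: *
Disallow: /apk.asp
```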
# Block a particular image from Googlebot
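This sketch assumes the crawler to address is Google's image crawler, Googlebot-Image, and uses a hypothetical image path:

```text
User-agent: Googlebot-Image
# /images/example.jpg is a hypothetical path
Disallow: /images/example.jpg
```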
# Block all images on your site from Google Images
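A minimal sketch, assuming Googlebot-Image is the user agent that feeds Google Images:

```text
User-agent: Googlebot-Image
Disallow: /
```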
# Disallow crawling of a specific file type, for example *.jpg
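A sketch using pattern matching, assuming the rule is aimed at googlebot. The * matches any sequence of characters and the $ anchors the match to the end of the URL. Google supports both, but not every crawler necessarily does.

```text
User-agent: googlebot
Disallow: /*.jpg$
```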
# When we disallow the entire site, how do we still allow AdSense ads on those pages?
The trick is to disallow all web crawlers except Mediapartners-Google. This way you can hide your pages from search engines, while Mediapartners-Google can still analyze them to decide which ads to show to visitors.
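A sketch of the trick described above, with one group for all crawlers and a separate group for the AdSense crawler:

```text
# Block every crawler that honors the wildcard
User-agent: *
Disallow: /

# Let the AdSense crawler read the pages
User-agent: Mediapartners-Google
Allow: /
```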
# 9 Test your robots.txt file with the Google tester
- You can test the file you created in Google Search Console, which has a robots.txt tester.
The robots.txt tester shows the instructions and directives written in the file, and also shows errors and warnings at the bottom.
- We can test a URL.
- Enter the URL in the text box at the bottom of the page.
- Select the crawler from the drop-down menu.
- Press the Test button.
- Check whether the result is ACCEPTED or BLOCKED.
- You can edit the file accordingly and retest the URL.
- Changes made in the tester are not saved to the original file, so we need to copy the changes to the actual file at the server's root location.
# 9.1 Limitations of the Google robots.txt tester
- The main limitation is that the tester does not edit the copy of the file on the server.
- The tester only tests your file against Google's crawlers and cannot predict the behavior of other crawlers.
# 10 How to submit the updated robots.txt file to Google
- Click the Submit button in the tester. A dialog box appears for downloading the file.
- Upload this file to the root location of the domain on the server ( http://www.gpatrika.com/robots.txt ).
- After uploading the file to the correct location, press the button to verify the live version of the file.
- Submit the live version of the robots.txt file to notify Google that changes have been made. This signals Google to crawl the file.
- Refresh your browser and you can now see the new version of the robots.txt file. You can also see the date and time when Google last successfully crawled the newest version of your file.