How To Use Robots.txt

Don't block search engines from your site

By Martin Laetsch, December 2, 2010. Martin is chief strategy officer at AboutUs. He has developed search programs at large companies and small startups, and speaks at global search-marketing conferences.
Latest revision as of 06:22, 4 December 2013


What is the robots.txt file?


Robots.txt is a file that a website owner can put on his or her web server to give instructions to computer programs that travel around the web, looking for specific kinds of web pages and content. These computer programs are often called "robots," which is why this file is called robots.txt.

People usually use robots.txt to keep search engines from indexing pages that shouldn't appear in search results - for example, web contact forms, print versions of web pages and other content that's duplicated elsewhere on the site. Robots.txt can also be used to ask specific robots not to index a site.

Crawlers, robots, agents, bots and spiders


These five terms all describe basically the same thing: an automated software program used to locate, collect, or analyze data from web pages. Search engines like Google use a spider to collect data on web pages for inclusion in their databases. The spider also follows links on web pages to find new pages.

AboutUs uses a robot to analyze web pages for its Website Analysis. Our bot's name is "AboutUsBot".

How robots.txt works


When a legitimate robot wants to visit a web page like www.example.com/good-content, it first checks for www.example.com/robots.txt to make sure the site owner is willing to let the robot examine the page.

The robot looks for several things:

  • Does a robots.txt file exist?
  • Is the robot explicitly excluded in the robots.txt file?
  • Is the robot excluded from the web page, the directory the page is in, or both?

If the robot isn't explicitly excluded from the page, the robot will do what it's designed to do: collect data, analyze it, or whatever.
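The decision a polite robot makes can be sketched in a few lines of Python. This is an illustrative toy, not a real crawler: the is_allowed function and the sample rules are invented for the example, and the matching is the simple prefix matching described later in this article.

```python
# Toy sketch of the decision a polite robot makes after reading robots.txt.
# "is_allowed" and the sample rules are illustrative, not a real crawler API.
def is_allowed(disallowed_prefixes, path):
    """A path is allowed unless it falls under a disallowed prefix."""
    return not any(path.startswith(prefix) for prefix in disallowed_prefixes)

rules = ["/print/", "/temp/"]  # Disallow lines that apply to this robot

print(is_allowed(rules, "/good-content"))     # True: not excluded, crawl it
print(is_allowed(rules, "/print/page.html"))  # False: excluded, skip it
```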

How to create a robots.txt file


Robots.txt is a standard text file that you can create in Notepad or any other text editor. You can use Word or another word processor, but be sure to save the file as raw text (.txt) when you are done.

The file name should be in lower case: "robots.txt," not "Robots.Txt."

When the file is ready, upload it to the top-level directory of your web server. Robots and spiders should be able to find it at www.YourDomain.com/robots.txt. Don't forget to go there - or have your web designer go there - and verify that the file works.

What belongs in a robots.txt file?


Robots.txt files usually contain a single record. They look something like this:

User-agent: *
Disallow: /print/
Disallow: /temp/

In this example, all robots have been excluded from the /print and /temp directories.
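If you'd like to check a record like this without waiting for a crawler to visit, Python's standard-library urllib.robotparser implements the same rules. In this sketch the record is parsed directly rather than fetched from a server, and the bot name and URLs are placeholders:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse the record directly; normally set_url() + read() would fetch
# the live robots.txt file over the network.
rp.parse([
    "User-agent: *",
    "Disallow: /print/",
    "Disallow: /temp/",
])

print(rp.can_fetch("AnyBot", "http://www.example.com/good-content"))     # True
print(rp.can_fetch("AnyBot", "http://www.example.com/print/page.html"))  # False
```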

How do I prevent robots from scanning my site?


There is no easy way to prevent all robots from crawling your site. However, you can request that well-behaved robots not visit your site by adding these two lines to your robots.txt file:

User-agent: *
Disallow: /

This asks all robots to refrain from crawling any pages on the site. Keep in mind that this will block search engine robots from indexing your site and including it in their search results too.
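The effect of those two lines can be confirmed with Python's standard-library urllib.robotparser (the bot name and URLs here are placeholders):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])  # the two lines shown above

# Every page on the site is now off-limits to well-behaved robots.
print(rp.can_fetch("AnyBot", "http://www.example.com/"))
print(rp.can_fetch("AnyBot", "http://www.example.com/any/page.html"))
```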

Things to keep in mind


  • The robots.txt file is a polite request to robots, not a mandate they must follow. Malicious bots will pay no attention to this request -- for example, robots that scan the web for security vulnerabilities or email address harvesters used by spammers.
  • The robots.txt file is public. Anyone can see which sections of your server you don't want robots to examine. If you want to hide information, protect it with a password.
  • You need a separate "Disallow" line for every page, file, or directory you want to exclude.
  • Everything not explicitly disallowed is considered fair game for a robot to retrieve.
  • You can't have blank lines within a record. Blank lines are used to separate multiple records.
  • Regular expressions cannot be used in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif" won't work.
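You can see the last point in action with Python's standard-library urllib.robotparser, which follows the original standard: a line like "Disallow: *.gif" is treated as a literal path prefix, not a wildcard pattern, so it doesn't actually block .gif files (the bot name and URL are placeholders):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: *.gif"])

# "*.gif" is matched as a literal prefix, not a wildcard,
# so a path like /photo.gif is still allowed.
print(rp.can_fetch("AnyBot", "http://www.example.com/photo.gif"))
```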

Uses for robots.txt


To exclude all robots from the entire website

User-agent: *
Disallow: /

User-agent: * means this section applies to all robots. Disallow: / asks the robot not to visit any pages on the site.

To allow all robots complete access

User-agent: *
Disallow:

This is exactly the same as having an empty robots.txt file or not having one at all.

To exclude all robots from part of the website

User-agent: *
Disallow: /directory 1/
Disallow: /directory 2/
Disallow: /directory 3/

This asks all robots to avoid content from directory 1, directory 2, and directory 3. Robots are welcome to look at content in any other directory on the site.

To exclude a single robot

User-agent: Robot-Name
Disallow: /

This asks the robot named Robot-Name to refrain from crawling any part of the website.

To allow a single robot


User-agent: Robot-Name
Disallow:

User-agent: *
Disallow: /

This tells the robot named Robot-Name that it's welcome to examine the entire website, but asks all other robots to refrain from crawling any part of the site.
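Python's standard-library urllib.robotparser handles the two records the same way. In this sketch the bot names and URL are placeholders:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: Robot-Name",
    "Disallow:",
    "",  # a blank line separates the two records
    "User-agent: *",
    "Disallow: /",
])

print(rp.can_fetch("Robot-Name", "http://www.example.com/page"))  # True
print(rp.can_fetch("OtherBot", "http://www.example.com/page"))    # False
```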

To exclude all pages except one

There isn't a way to explicitly "allow" a page. Therefore, you need to "disallow" all the pages except the one you want robots to find. The easiest way to do this is to put all files you want disallowed into a separate directory, and leave the one page you want crawled in the level above the "disallow" directory:

User-agent: *
Disallow: /good/bad/

This tells all robots they can examine everything in the /good directory but they shouldn't look in the /bad directory that lives under the /good directory.
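Checking this record with Python's standard-library urllib.robotparser confirms the effect (the bot name and URLs are placeholders):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /good/bad/"])

# Pages directly under /good are allowed; anything under /good/bad is not.
print(rp.can_fetch("AnyBot", "http://www.example.com/good/index.html"))      # True
print(rp.can_fetch("AnyBot", "http://www.example.com/good/bad/hidden.html")) # False
```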

You can also explicitly disallow all pages you don't want the robots to examine. For example:

User-agent: *
Disallow: /good/page 1.html
Disallow: /good/page 2.html
Disallow: /good/page 3.html

This tells all robots that you don't want them to look at page 1, page 2 or page 3 in the /good directory.

To specify the location of your sitemap

You can tell search engine bots where to find your XML sitemap like this:

Sitemap: http://www.yourwebsite.com/sitemap.xml
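Since Python 3.8, the standard-library urllib.robotparser also exposes any Sitemap lines it finds, via its site_maps() method. A quick sketch, using the placeholder sitemap URL from above:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow:",
    "Sitemap: http://www.yourwebsite.com/sitemap.xml",
])

# site_maps() (Python 3.8+) returns the listed Sitemap URLs,
# or None if the file contained no Sitemap lines.
print(rp.site_maps())  # ['http://www.yourwebsite.com/sitemap.xml']
```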
