Learn/404-Errors-Drive-Visitors-Away
Suzi Ziegler (talk | contribs)
Revision as of 23:45, 6 October 2010
What does a "page not found" message mean?
Have you ever typed a URL into the address bar only to receive a "404 error" or learn that the page no longer exists? It can be frustrating and feel like a waste of time, especially when repeated attempts produce the same result. It's bad for site visibility, and it certainly isn't good for a customer trying to find your page. It's like a customer crossing town to visit your business and finding the "closed" sign on your door. Will that customer return? They might if you have a niche business, but if you're selling something that can be obtained elsewhere, they probably won't. The same can happen with your website.
When and why do these messages occur?
"Page not found," or a "404 error," is the standard HTTP response a server returns when it cannot find the requested resource. In other words, when the server looks up a specific URL and the page is dead, broken, or no longer exists, it sends back a message indicating that the page cannot be loaded or opened. Link rot is the informal term for links that break this way over time.
While there are many reasons for a link to break, it is frequently due to some form of blocking, such as content filters or firewalls. Links may also go dead when the server hosting a page stops working or the site moves to a new domain name.
How do I avoid a 404 error message?
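One practical safeguard is to check your own links regularly, so broken ones turn up before your visitors hit them. Below is a minimal sketch using only Python's standard library; the function name and example URL are illustrative, not part of any particular tool:

```python
import urllib.request
import urllib.error

def check_link(url, timeout=5):
    """Return the HTTP status code for url; 404 means the page was not found."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; the code is on the exception.
        return err.code

# Hypothetical usage: a 404 here means the link should be fixed or redirected.
# status = check_link("https://example.com/old-page")
```

Running a check like this over your site's internal links on a schedule lets you add redirects or fix typos before customers meet the "closed" sign.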
What belongs in a robots.txt file?
A robots.txt file usually contains a single record, which looks something like this:
- User-agent: *
- Disallow: /print/
- Disallow: /temp/
In this example, the /print/ and /temp/ directories are off-limits to all robots.
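You can verify how a record like this will be interpreted using Python's standard-library robots.txt parser (the bot name "ExampleBot" below is illustrative):

```python
import urllib.robotparser

# Feed the example record above directly to the parser.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /print/",
    "Disallow: /temp/",
])

rp.can_fetch("ExampleBot", "/print/catalog.html")    # False: /print/ is disallowed
rp.can_fetch("ExampleBot", "/products/widget.html")  # True: everything else is allowed
```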
How do I prevent robots from scanning my site?
There is no easy way to prevent all robots from visiting your site. However, you can request that well-behaved robots not visit your site by adding these two lines into your robots.txt file:
- User-agent: *
- Disallow: /
This asks all robots to please not visit any pages on the site.
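The same standard-library parser confirms that this two-line record blocks every path, not just the site root (again, "ExampleBot" is just an illustrative bot name):

```python
import urllib.robotparser

# The block-everything record from above.
rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

rp.can_fetch("ExampleBot", "/")          # False: the root is disallowed
rp.can_fetch("ExampleBot", "/any/page")  # False: "Disallow: /" matches every path
```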
Things to be aware of
- You need a separate "Disallow" line for every page, file, or directory that you want to exclude.
- You can't have blank lines in a record. Blank lines are used to separate multiple records.
- Regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif" won't work.
- Everything not explicitly disallowed is considered fair game to retrieve.
- The robots.txt file is a polite request to robots and not a mandate they have to follow. Robots that scan the web for security vulnerabilities, email address harvesters used by spammers, and other malicious bots will pay no attention to the request.
- The robots.txt file is publicly available. Anyone can see which sections of your server you don't want robots to examine, so if you want to keep information private, password-protect that section of the site rather than relying on robots.txt.