Don’t Block Search Engine Crawlers

By Michael Cottam

Make sure your important pages can rank

A great page on your website will never rank well in search results – and be discovered by more people – if the search engines don’t know it’s there in the first place.

First, some background: Search engines use computer programs called “crawlers” (or “spiders” or “robots”) to discover web pages and add them to their indexes. Google calls its crawler “Googlebot,” and lots of SEO (search engine optimization) professionals call it that, too.

How Search Engines Crawl Your Site

A search engine crawler examines a page on your site – your home page, for instance – and looks for links to other pages. Those links can go to pages on other websites (“external” links) or to other pages on your own website (“internal” links).

Every time a search engine crawler finds an internal link on your site, it checks to see if it already knows about that page. If not, the crawler puts that page into a list of “pages to crawl” on your site. As it crawls each page, the search engine analyzes the content to see what the page is about. The crawler looks at:

  • the page title
  • the words in the body text (the main text) of the page
  • the alt text of images – if there is any
  • any H1 or other headings on the page

All this information is stored in the search engine’s index, in the entry for that particular web page.
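The list above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any real search engine works: the class name PageScanner, the sample HTML, and the example.com URLs are all made up for this example, and body text is left out to keep the sketch short.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

HEADING_TAGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class PageScanner(HTMLParser):
    """Hypothetical helper: collects title, headings, alt text and links."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.title = ""
        self.headings = []        # text of h1..h6 elements
        self.alt_texts = []       # non-empty alt attributes of images
        self.internal_links = []  # links to the same host as page_url
        self.external_links = []  # links to any other host
        self._current = None      # tag whose text is being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._current = tag
        elif tag in HEADING_TAGS:
            self._current = tag
            self.headings.append("")
        elif tag == "img" and attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])
        elif tag == "a" and attrs.get("href"):
            # Resolve relative links against the page's own URL.
            url = urljoin(self.page_url, attrs["href"])
            same_host = urlparse(url).netloc == urlparse(self.page_url).netloc
            target = self.internal_links if same_host else self.external_links
            target.append(url)

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current is not None:
            self.headings[-1] += data


# Made-up sample page for illustration.
SAMPLE_HTML = """<html><head><title>Blue Widgets</title></head>
<body><h1>Our Widgets</h1>
<img src="w.png" alt="A blue widget">
<a href="/about">About</a>
<a href="https://other.example/page">Partner</a>
</body></html>"""

scanner = PageScanner("https://www.example.com/")
scanner.feed(SAMPLE_HTML)
print(scanner.title, scanner.headings, scanner.alt_texts)
print(scanner.internal_links, scanner.external_links)
```

A real crawler would add each new internal link to its “pages to crawl” list and repeat the process for every page on that list.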

Explicitly Blocking the Crawlers

There’s a meta tag called ROBOTS that goes in the <head> section of a web page and tells a crawler what to do when it finds that page. For instance, the following ROBOTS tag tells the search engines not to place this particular page in their index, and not to follow any links from this page to other pages:

  <meta name="robots" content="noindex, nofollow">

There are legitimate uses for the robots tag – for example, you might not want search engines to bother indexing pages a site visitor should see only when he or she is logged in. You don’t want that kind of privileged information turning up in search results.

Printable versions of pages on your site are another good example. A print-friendly page normally has the same content as the original version. Search engines can regard duplicate content on a website as an indication of poor site quality, and can downgrade a site in search results for that reason. So marking print pages NOINDEX is a good idea.

Alternatively, you can mark the printable pages with a rel=canonical statement, pointing back to the version of the page that’s not optimized for printing. This tells search engines that the original version of the page is the standard, or canonical, version. Then these two pages won’t be regarded as duplicates.

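There’s another good reason to mark certain pages on your site NOINDEX. Search engines are going to spend only a limited amount of time crawling your site. You don’t want them to waste time on lots of pages you don’t care about and fail to index the pages you do care about. Keep in mind that a crawler still has to fetch a page to read its NOINDEX tag, so if the goal is to stop certain pages from being crawled at all, the robots.txt file is the tool for that.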

Robots Tag Mistakes To Avoid

1) Sometimes people copy an existing web page to make a new one, and forget to change the meta tags in the <head> section of the page. If the page you copied has the meta tag in it that blocks robots, then your new page will block robots too – even if you actually want that page indexed. Don’t forget to change the meta tags.

2) People often misinterpret this version of the ROBOTS meta tag:

  <meta name="robots" content="none">

You might think this means “no blocking at all”. That’s not correct. It actually means the same thing as “NOINDEX, NOFOLLOW”. In other words, it tells the robots to refrain from indexing that page, and also tells them not to follow the links on that page.

3) Be careful when you code a robots.txt file. Don’t confuse the robots.txt file with the single-line meta tag discussed above. Robots.txt is a global file in the root folder of your website. Search engines look for it to give them directions about what to crawl on your site – and what they should stay away from. A mistake in this file could end up blocking search engines from large portions of your website, or even your entire website.
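To see how the content value of a ROBOTS meta tag is read, here is a minimal Python sketch. It assumes the common convention that a missing directive defaults to “index, follow” and that NONE is shorthand for NOINDEX, NOFOLLOW. The function name interpret_robots is made up for this example, and it ignores other directives a search engine may support.

```python
def interpret_robots(content):
    """Return (may_index, may_follow) for a robots meta tag's content value.

    Hypothetical helper for illustration; only handles the index/follow
    directives and the NONE shorthand.
    """
    directives = {d.strip().lower() for d in content.split(",") if d.strip()}
    may_index = True   # default when nothing says otherwise
    may_follow = True  # default when nothing says otherwise
    if "none" in directives:
        # NONE blocks both indexing and link following.
        may_index = False
        may_follow = False
    if "noindex" in directives:
        may_index = False
    if "nofollow" in directives:
        may_follow = False
    return may_index, may_follow


for value in ("NONE", "NOINDEX, NOFOLLOW", "noindex", "index, follow"):
    print(value, "->", interpret_robots(value))
```

Running a check like this over the pages you copied from a template is one way to catch a blocking tag that was carried over by accident.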

Want to make sure your website’s robots.txt file isn’t telling search engines to ignore your site? Enter your site’s domain name in the search bar at the top right of any page and submit it. Once you reach the page about your site, click “Home Page Analysis” and then “Search Engine Friendliness” to learn what you’re saying to the robots.
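You can also test a robots.txt file yourself. The sketch below uses Python’s standard urllib.robotparser module to report which paths a given crawler would be blocked from. The function name blocked_paths, the two sample files, and the paths are made up for illustration; real crawlers may interpret some rules differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt, paths, user_agent="Googlebot"):
    """Return the paths that the given robots.txt text disallows.

    Hypothetical helper: parses the text directly, so nothing is fetched
    from the network.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]


# A one-character mistake: "Disallow: /" blocks the entire site.
RISKY = """User-agent: *
Disallow: /
"""

# Blocks only one folder and leaves the rest of the site crawlable.
SAFE = """User-agent: *
Disallow: /private/
"""

pages = ["/", "/products/", "/private/account"]
print(blocked_paths(RISKY, pages))
print(blocked_paths(SAFE, pages))
```

If the pages you care about show up in the blocked list, the robots.txt file needs fixing before those pages can rank.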
