Rewrite PageScrapeBot

Contributors: Athar Hameed, Ali Aslam, Arif Iqbal, Hassan Javeed, Laiq Jafri, Mohammad Ghufran, Umar Sheikh

What (summary)

The current PageScrapeBot is part of a large Java Tomcat process running on our database master. We want to rewrite it in Ruby and move it off the database master. Rewrite PageScrapeBot is one Task in the larger WhoisRefresh Project.

The bot currently fetches the whois record from Alexa, crawls the site looking for a logo, meta information for category guessing, "about us" information, and several other things, then dumps all of that information into the database, from where it is collected and reassembled by the PageCreationBot.

Why this is important

  • Gives us mastery over our technology. We can change it easily (e.g., add Amazon books that mention the domain) or adapt it for a different problem.
  • Allows us to turn off Apache and the large Java Tomcat process on our database master.

DoneDone

Given a domain, the bot collects the following information and adds it to the database:

  • A snippet about the domain, gathered from the site itself
  • A thumbnail portrait of the front page (done with Alexa; should we consider capturing it from NameIntel instead?)
  • A best guess at the logo for the site
  • A list of related domains
  • The address (or perhaps addresses) for the site, along with a citation for where the address came from
  • Meta information about the owner, email, Alexa rank, language of the site, whether it is online or not, and whether it is classified as adult by Alexa or not
  • DMOZ categories that contain the site (categories are fetched from Alexa, not directly from DMOZ)
  • Meta information about the domain (for example, is it an IDN domain?)
  • Location of the homepage for the site (www.example.com vs. example.com)
  • Incoming links to the front page
  • Keywords from the meta information on the front page
  • All the meta information ... for example: is it parked? does it forward? is it online? is the registration protected?
  • Sites that it links out to (home page only)
  • About Us text extraction

Status as of 1st November, 2007

PageScrapeBot does the following.

  • When PageScrapeBot's process method is called with a domain name as its argument, it fetches information about the domain by calling the various extractors, then dumps the values into the relevant tables of the whois-refresh database. (A sketch of this flow appears after this list.)
  • First, the process method fetches related links by calling GoogleDataParser's process method. Currently it fetches only the top 10 related links and stores them in an instance variable. These related links are eventually dumped into the related_links table.
  • Second, the process method fetches inlinks, categories, title, description, and thumbnail information from AlexaExtractor's process methods. The thumbnail is also stored in a local file in the specified directory. Eventually, the inlinks are dumped into the inbound_links table; the thumbnail info is dumped into the image_links table; and the description and title are dumped into the alexas and domains tables.
  • Lastly, the process method fetches keyword meta info, the logo, and the aboutus text by calling the relevant methods of the WebsiteInfo class. The logo is also stored in a local file in the specified directory. Eventually, the keyword info is dumped into the keywords table, the logo info into the image_links table, and the aboutus text into the domains table.
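
A minimal sketch of this flow, assuming the class names above; the per-extractor method names and the save_to_database helper are placeholders, not the actual signatures:

    # Sketch of PageScrapeBot#process as described above. The extractor
    # method names and save_to_database are hypothetical stand-ins for
    # the ActiveRecord code that writes to the whois-refresh tables.
    class PageScrapeBot
      def process(domain)
        # 1. Related links from Google (top 10) -> related_links table
        @related_links = GoogleDataParser.new.process(domain)

        # 2. Alexa data: inlinks, categories, title, description, thumbnail
        alexa = AlexaExtractor.new
        @inlinks   = alexa.process_inlinks(domain)    # -> inbound_links
        @thumbnail = alexa.process_thumbnail(domain)  # also saved to local dir
        @title, @description = alexa.process_site_info(domain) # -> alexas, domains

        # 3. Site crawl: keywords, logo, aboutus text
        site = WebsiteInfo.new(domain)
        @keywords = site.keywords      # -> keywords table
        @logo     = site.logo          # -> image_links table, plus local file
        @aboutus  = site.aboutus_text  # -> domains table

        save_to_database
      end
    end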

Test suites for all of PageScrapeBot's extractors can be run using the command . The individual extractors' test cases can be run using the following rake commands:

  • GoogleDataParser: Use
  • AlexaExtractor: Use
  • AboutUsExtractor: Use
  • LogoFinder: Use

Suggested Tweaks

  • PageScrapeBot's process method should invoke the various extractors in separate threads instead of a single thread, as is the case now (see the sketch after this list).
  • Update the database to do away with unnecessary tables. (caution: this can have repercussions for other applications using the current database)
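
A minimal sketch of the threaded invocation suggested above, using plain Ruby threads; the extractor names follow the status notes, and save_to_database is a placeholder:

    # Run the extractors concurrently instead of one after another.
    # Each thread hands its result back through Thread#value.
    def process(domain)
      google_thread = Thread.new { GoogleDataParser.new.process(domain) }
      alexa_thread  = Thread.new { AlexaExtractor.new.process(domain) }
      site_thread   = Thread.new { WebsiteInfo.new(domain) }

      @related_links = google_thread.value
      @alexa_info    = alexa_thread.value
      @site_info     = site_thread.value

      # Keep DB writes on the main thread; see the Rails threading
      # issue noted under "Things that still need some minor effort".
      save_to_database
    end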

Things that still need some minor effort

  • Rails multithreading issues kill DB commands.
  • Facebook does not respond, complaining of an unrecognized browser (see the sketch after this list).
  • A robots.txt mock for the test suite.
  • A better AboutUs information extractor.
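
For the Facebook problem, pretending to be a mainstream browser may help; a sketch using Mechanize's built-in user-agent aliases (whether Facebook accepts this is untested):

    require 'rubygems'
    require 'mechanize'

    agent = WWW::Mechanize.new
    # Present a browser-like User-Agent instead of Mechanize's default,
    # which some sites (e.g. Facebook) reject as an unrecognized browser.
    agent.user_agent_alias = 'Windows IE 7'
    page = agent.get('http://www.facebook.com/')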

Steps to get to DoneDone

Improve AboutUs Information Extractor

Improve Logo Finder

  • Find the logo on the site's AboutUs page (see the sketch below).
    Currently, the bot tries to extract the site's logo from the main page. A possible pitfall with this approach is that the main page contains lots of extra HTML, so parsing for the logo image tag can be time-consuming and incorrect in some instances.
    The idea is to first look for the logo on the aboutus page, if it exists. The rationale is that the HTML making up the site's aboutus page will be significantly shorter and less complex than that of the main page.
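
A minimal sketch of that idea with Hpricot; the "logo" substring heuristic is an assumption for illustration, not the current LogoFinder behavior:

    require 'rubygems'
    require 'hpricot'
    require 'open-uri'

    # Look for a likely logo image on the (shorter) aboutus page first,
    # falling back to the main page if nothing matches.
    def find_logo(aboutus_url, main_url)
      [aboutus_url, main_url].compact.each do |url|
        doc = Hpricot(open(url))
        img = (doc / 'img').find do |i|
          "#{i['src']} #{i['alt']}".downcase.include?('logo')
        end
        return img['src'] if img
      end
      nil
    end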

Improve Test Cases

  • Update test cases for the following:
    • Google Data Parser
    • Alexa Data Parser
    • Website Crawler (including logo & aboutus text tests)
    • Integration Test Suite for PageScrapeBot

Schema

  • Redefine the schema.
  • Put it on the PageCreationBot page.
  • Write migrations for the new schema (see the sketch below).
  • Verify the migration using the console.
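
A sketch of what one of the new-schema migrations might look like, in the Rails style of the day; the related_links columns are guesses based on the status notes above:

    # db/migrate/001_create_related_links.rb
    class CreateRelatedLinks < ActiveRecord::Migration
      def self.up
        create_table :related_links do |t|
          t.column :domain_id,  :integer   # FK to domains
          t.column :url,        :string
          t.column :position,   :integer   # rank among Google's related links
          t.column :created_at, :datetime
        end
        add_index :related_links, :domain_id
      end

      def self.down
        drop_table :related_links
      end
    end

Run it with rake db:migrate; the console verification step is then a matter of opening script/console and checking RelatedLink.column_names.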

Get compost up and running on all dev machines

  • Get compost up and running on everyone's machines.
  • Make sure that commits are going into the repository and the database/models.


Fetch Domain Info from Google

  • Fetch related links (a sketch follows this list)
    • Indexing issues because of threading.
    • Proper handling of the case when more results are demanded than are available.
    • Test Case: Domain name has no related links (2)
    • Test Case: Domain name has fewer related links than requested (2)
    • Test Case: Scrape all related links from the returned Google search page(s) (2)
    • Test Case: Scrape a subset of related links from the returned Google search page (2)
    • Performance tests (2)
  • Fetch incoming links
    • Test Case: Domain name has no incoming links (2)
    • Test Case: Domain name has fewer incoming links than requested (2)
    • Test Case: Scrape a subset of incoming links from the returned Google search page (2)
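
A sketch of the related-links fetch, scraping Google's related: query with Mechanize and Hpricot; the link-extraction selector is an assumption and would have to track Google's actual result markup:

    require 'rubygems'
    require 'mechanize'
    require 'hpricot'

    # Fetch up to +count+ related links for +domain+. If Google returns
    # fewer results than requested, we simply return what was found,
    # which covers the "fewer results demanded than available" case.
    def related_links(domain, count = 10)
      agent = WWW::Mechanize.new
      page  = agent.get("http://www.google.com/search?q=related:#{domain}")
      doc   = Hpricot(page.body)
      links = (doc / 'a').map { |a| a['href'] }.compact
      links.select { |href| href =~ /^http/ }.uniq.first(count)
    end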

Site crawler that finds the logo and meta tags on the front page

  • Pass rake test.
  • Write test cases for websites containing logos and aboutus pages.
  • Parse the aboutus pages of the website.
  • Fetch the logo.
  • Meta tag extraction (a sketch follows this list).
    • Description (5)
    • Title
    • Keywords
    • Language
  • Capture external links.
  • Capture internal links.
  • Domain information
    • forwarded/redirected
    • parked
  • Location of pages
    • home
  • Respect robots.txt and robots info in meta tags.
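
A sketch of the meta tag extraction with Hpricot, covering the description/title/keywords/robots fields listed above:

    require 'rubygems'
    require 'hpricot'
    require 'open-uri'

    # Pull the front page's meta tags into a hash keyed by tag name,
    # e.g. { 'description' => '...', 'keywords' => '...', 'robots' => 'noindex' }.
    def meta_tags(url)
      doc  = Hpricot(open(url))
      tags = {}
      (doc / 'meta').each do |m|
        name = m['name'] || m['http-equiv']
        tags[name.downcase] = m['content'] if name && m['content']
      end
      tags['title'] = doc.at('title').inner_text if doc.at('title')
      tags
    end

    # Respecting robots meta info: skip the page if it says noindex.
    info = meta_tags('http://www.example.com/')
    puts 'skipping' if info['robots'].to_s.include?('noindex')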

Capture Alexa information

  • Rewrite code to push data into the new database schema (a sketch of the Alexa call follows this list).
  • Sites linking in
  • Contact information
  • Related domains
  • Site rank
  • Status
  • Adult content
  • Thumbnail
  • Multithreading support
  • Test cases
    • Connectivity problems (3)
    • Invalid domain (5)
    • Domain with some information
    • Domain with massive information (5)
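
A sketch of an Alexa Web Information Service UrlInfo call; the signing scheme (HMAC-SHA1 over the action name plus timestamp) and the response-group names are our reading of the AWIS docs of the time and should be double-checked against the account setup:

    require 'rubygems'
    require 'open-uri'
    require 'openssl'
    require 'base64'
    require 'cgi'
    require 'time'

    ACCESS_KEY = 'our-aws-access-key'
    SECRET_KEY = 'our-aws-secret-key'

    # Build and fetch a signed AWIS UrlInfo request for +domain+.
    def url_info_xml(domain)
      action    = 'UrlInfo'
      timestamp = Time.now.utc.iso8601
      digest    = OpenSSL::HMAC.digest(OpenSSL::Digest::SHA1.new, SECRET_KEY,
                                       action + timestamp)
      signature = Base64.encode64(digest).strip
      open("http://awis.amazonaws.com/?Action=#{action}" +
           "&AWSAccessKeyId=#{ACCESS_KEY}" +
           "&Timestamp=#{CGI.escape(timestamp)}" +
           "&Signature=#{CGI.escape(signature)}" +
           "&ResponseGroup=Rank,LinksInCount,ContactInfo,AdultContent" +
           "&Url=#{CGI.escape(domain)}").read
    end

    xml = url_info_xml('example.com')
    # Crude extraction for illustration; real code would parse the XML.
    puts xml[%r{<aws:Rank>(\d+)</aws:Rank>}, 1]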

Integrate with MediaWiki

  • Call PageScrapeBot from MediaWiki and see if that works.

Concerns/Issues

Tuesday Sep 11, 2007

  • Home page information: how is it different from domain information?
  • Should we split Keywords?

Monday Sep 10, 2007

  • Alexa key for thumbnail information?
    • Same as for all Alexa web services ... the account now has thumbnails enabled.
  • What is "ticket" in the Alexa information?

Thursday Sep 6, 2007

  • Where to put the thumbnail information: image or URL?
  • Schema related
    • Why anchor text?
    • Primary key (id) for Rails.
    • Alexa info in the domain table or a separate table?
    • display, extension, isidn, rank in the domain table?
    • Owned domain?
    • What is queue?
    • Why schema info?
    • Thumbnail information?
    • "online" in Alexa?

Wednesday Sep 5, 2007

  • Position of related domain?
  • Schema of the database?
  • Number of links?
  • Multiple versions of data?
  • Whois parsing?

Tuesday Sep 4, 2007

  • Where is the code of the current crawler?
  • Offline crawling
    • Versions of information
  • Testing of Alexa information?
    • Monitoring rather than testing: check that results from Alexa make sense, call it for things we already know, etc.
  • The way we are using the Google API to fetch pages, etc.
  • Crawl the complete website to get meta tags, etc.
    • Try to get all information within 5 pages (heuristic approach).
    • robots.txt: don't crawl disallowed pages.
  • Changing the schema, if we feel like it?

Monday Sep 3, 2007

  • Scalability concerns?
    • Where is the code of the current crawler?
  • Names of people in the commits.
  • Amazon account
    • Alexa information is months old.
    • Incoming links through Google.
    • Outgoing links? Crawl the website.
  • Extracting information from web pages.
  • Small for small people.
  • Web server.
  • Web interface to see where we need to put in the calls.
    • app/web_services/web_services.rb in compost.
    • wiki/extensions/AboutUsWebServices in aboutus.
  • Schema?

Schema

  • Domain
    • url
    • homepage

Development Repositories

  • /opt/git/compost-whois-refresh
  • /opt/git/aboutus-whois-refresh

Gather Information

Ruby Libraries

  • mechanize
  • hpricot
  • topsoil web spider

