Rewrite PageScrapeBot
Contributors: Athar Hameed, Ali Aslam, Arif Iqbal, Hassan Javeed, Laiq Jafri, Mohammad Ghufran, Umar Sheikh
What (summary)
The current PageScrapeBot is part of a large Java Tomcat process running on our database master. We want to rewrite it in Ruby and move it off of the database master. Rewrite PageScrapeBot is one Task in the larger WhoisRefresh Project.
The bot currently fetches the whois record from Alexa, crawls the site looking for a logo and for meta information for category guessing, looks for "about us" information on the site, and does several other things, then dumps all of that information into the database, from where it is collected and reassembled by the PageCreationBot.
Why this is important
- Gives us mastery over our technology. We can change it easily (e.g., add amazon books that mention the domain), or adapt it for a different problem.
- Allows us to turn off Apache and the large Java Tomcat process on our database master.
DoneDone
Given a domain, the bot collects the following information and adds it to the database:
- A snippet about the domain, gathered from the site itself
- A thumbnail portrait of the front page
- A best guess at the logo for the site
- A list of related domains
- The address (or perhaps addresses) for the site, along with a citation for where the address came from
- Meta information about the owner, email, Alexa rank, language of the site, whether it is online or not, and whether Alexa classifies it as adult or not
- dmoz categories that contain the site
- Meta information about the domain (for example, is it an IDN domain)
- Location of the homepage for the site (www.example.com vs example.com)
- Incoming links to the front page
- Keywords from the meta information on the front page
- All the status meta information ... for example: is it parked? does it forward? is it online? is the registration protected?
- Sites that it links out to
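Taken together, the list above amounts to one record per domain. As a rough sketch (field names here are illustrative only, not the final schema, which is still being redefined below), the bot might assemble something like:

```ruby
# Hypothetical sketch of the per-domain record the bot assembles.
# Field names are illustrative, not the final schema.
def empty_scrape_record(domain)
  {
    :domain          => domain,
    :snippet         => nil,   # text gathered from the site itself
    :thumbnail_url   => nil,   # portrait of the front page
    :logo_url        => nil,   # best guess at the logo
    :related_domains => [],
    :addresses       => [],    # each entry should carry a citation
    :alexa           => { :rank => nil, :online => nil, :adult => nil },
    :dmoz_categories => [],
    :homepage        => nil,   # www.example.com vs example.com
    :incoming_links  => [],
    :outgoing_links  => [],
    :keywords        => [],
    :status          => { :parked => nil, :forwards => nil, :protected => nil }
  }
end
```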
Steps to get to DoneDone
Schema
- Redefine the schema.
- Put it on the page creation bot.
- Write migrations for the new schema.
- Verify the migration using the console.
Get compost up and running on all dev machines
- Get compost up and running on everyone's machines.
- Make sure that the commits are going into the repository and database/models.
Find dmoz categories of the website
- Fetch domain categories from dmoz data.
Fetch Domain Info from Google
- Fetch Related links
- Indexing issues because of threading.
- Proper handling of the case when more results are demanded than available.
- Test Case: Domain name has no related links (2)
- Test Case: Domain name has fewer related links than requested (2)
- Test Case: Scrape all related links from the returned Google search page(s) (2)
- Test Case: Scrape a subset of related links from the returned Google search page (2)
- Performance tests (2)
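The "fewer results than requested" handling that the related-links test cases call for could be sketched as follows. The real bot would drive mechanize/hpricot against the Google results page; this stdlib-only version (`related_domains` is a hypothetical helper, not existing code) just shows the extraction and slicing behavior:

```ruby
require 'uri'

# Pull distinct outbound domains from a search-results HTML page,
# returning at most `limit` of them. Returns fewer than `limit` when
# the page simply has fewer links -- no error in that case.
def related_domains(html, limit)
  hosts = html.scan(/href="(https?:\/\/[^"\/]+)/).flatten.map do |url|
    URI.parse(url).host rescue nil
  end.compact.uniq
  hosts[0, limit]
end
```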
- Fetch Incoming links
- Test Case: Domain name has no incoming links (2)
- Test Case: Domain name has fewer incoming links than requested (2)
- Test Case: Scrape all incoming links from the returned Google search page(s) (2)
- Test Case: Scrape a subset of incoming links from the returned Google search page (2)
- Performance tests (2)
Site crawler that finds the logo and meta tags on the front page
- Metatags extraction.
  - Description (5)
  - Title
  - Keywords
  - Language
- Capture external links.
- Capture internal links.
- Logo
- Domain information
  - forwarded/redirected
  - parked
- Location of pages
  - home
  - aboutus. Also extract the text.
  - contactus?
- Respect robots.txt and robots info in metatags.
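The metatags extraction (title, description, keywords, and the robots meta the crawler must respect) could be sketched like this. The real crawler would use hpricot rather than regexes, and `extract_meta` is a hypothetical helper, not existing code. It also splits keywords, per the open question in the Sep 11 concerns:

```ruby
# Regex-based sketch of front-page metatag extraction.
def extract_meta(html)
  meta = {}
  meta[:title] = $1.strip if html =~ /<title>(.*?)<\/title>/im
  html.scan(/<meta\s+name="(\w+)"\s+content="([^"]*)"/i) do |name, content|
    meta[name.downcase.to_sym] = content
  end
  # split the keywords string into a list
  meta[:keywords] = meta[:keywords].split(/\s*,\s*/) if meta[:keywords]
  # honor <meta name="robots" content="noindex">
  meta[:indexable] = !(meta[:robots].to_s =~ /noindex/i)
  meta
end
```

robots.txt itself would be fetched and checked separately, before crawling any page.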
Capture Alexa information
- Rewrite code to push data into the new database schema.
  - sites linking in
  - contact information
  - related domains
  - site rank
  - status
  - adult content
  - thumbnail
- Multi-threading support
- Test cases
  - Connectivity problems (3)
  - Invalid domain (5)
  - Domain with some information
  - Domain with massive information (5)
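Parsing the Alexa response into those fields might look like the following. Note the element names here are a simplified, hypothetical payload for illustration; the real AWIS response uses namespaced elements, so the XPaths would need adjusting against the actual XML:

```ruby
require 'rexml/document'

# Sketch: pull rank / adult flag / links-in count out of a url-info style
# XML response. Missing elements come back as nil rather than raising.
def parse_alexa_info(xml)
  doc   = REXML::Document.new(xml)
  rank  = doc.elements["//Rank"]
  adult = doc.elements["//AdultContent"]
  links = doc.elements["//LinksInCount"]
  {
    :rank           => rank && rank.text.to_i,
    :adult          => adult ? adult.text == "yes" : nil,
    :links_in_count => links && links.text.to_i
  }
end
```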
Integrate with MediaWiki
- Call the Scrape Bot from MediaWiki and see if that works.
Concerns/Issues
Tuesday Sep 11, 2007
- Keywords should be split?
Monday Sep 10, 2007
- Alexa key for thumbnail information?
- Same as for all alexa web-services ... account now has thumbnail enabled
- what is ticket in alexa information?
Thursday Sep 6, 2007
- where to put the thumbnail information. Image or url?
- Schema Related
- Why anchor text
- primary key(id) for rails.
- alexa info in domain table or separate table.
- display, extension, isidn, rank in domain table?
- owned domain ?
- what is queue?
- why schema info?
- thumbnail information?
- online in alexa
Wednesday Sep 5, 2007
- position of related domain?
- schema of the database?
- number of links?
- multiple versions of data?
- whois parsing?
Tuesday Sep 4, 2007
- Code of the current crawler is in ?
- Offline Crawling
- versions of information
- Testing of alexa information?
- Monitoring rather than testing. Check that results from Alexa make sense. Call for things that we know, etc.
- The way we are using google API to fetch pages etc.
- Crawl the complete website to get meta tags etc.
- Try to get all information in 5 pages. Heuristic approach
- robots.txt: don't crawl the disallowed pages.
- Changing the schema, if we feel like?
Monday Sep 3, 2007
- Scalability concerns. ?
- code of the current crawler is in ?
- Name of people in the commit.
- Amazon account
- Alexa information is months old.
- Incoming links through google.
- Outgoing links? crawling the website.
- Extracting information from web pages.
- Small for small people.
- web server.
- web interface to see where we need to put in the calling.
- app/web_services/web_services.rb in compost.
- wiki/extensions/AboutUsWebServices in aboutus.
- Schema ?
Schema
- Domain
  - url
  - homepage
Development Repositories
- /opt/git/compost-whois-refresh
- /opt/git/aboutus-whois-refresh
Gather Information
Ruby Libraries
- mechanize
- hpricot
- topsoil web spider
rake test errors 2007.09.10
1) Error:
test_url_info_google(AlexaDataParserTest):
NoMethodError: private method `gsub' called for #<Scrape:0x2f74098>
/www/lib/ruby/gems/1.8/gems/activerecord-1.15.3/lib/active_record/base.rb:1860:in `method_missing'
/www/lib/ruby/1.8/cgi.rb:342:in `escape'
/Users/brandon/Code/compost-whois-refresh/config/../lib/page_scrape_bot/urlinfo.rb:38:in `process'
/Users/brandon/Code/compost-whois-refresh/config/../lib/page_scrape_bot/urlinfo.rb:38:in `collect'
/Users/brandon/Code/compost-whois-refresh/config/../lib/page_scrape_bot/urlinfo.rb:38:in `process'
./test/unit/pagescraperbot/alexadataparser_test.rb:40:in `test_url_info_google'
2) Error:
test_url_info_facebook(TestGoogleUrlInfo):
ArgumentError: wrong number of arguments (3 for 4)
./test/unit/pagescraperbot/urlinfo_test.rb:36:in `process'
./test/unit/pagescraperbot/urlinfo_test.rb:36:in `test_url_info_facebook'
3) Error:
test_url_info_google(TestGoogleUrlInfo):
ArgumentError: wrong number of arguments (3 for 4)
./test/unit/pagescraperbot/urlinfo_test.rb:7:in `process'
./test/unit/pagescraperbot/urlinfo_test.rb:7:in `test_url_info_google'
4) Error:
test_altavista(WebsiteCrawlerTest):
NoMethodError: You have a nil object when you didn't expect it!
The error occurred while evaluating nil.innerHTML
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:101:in `parse_title'
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:460:in `process'
./test/unit/pagescraperbot/website_crawler_test.rb:74:in `test_altavista'
5) Error:
test_blizzard(WebsiteCrawlerTest):
NoMethodError: You have a nil object when you didn't expect it!
The error occurred while evaluating nil.innerHTML
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:101:in `parse_title'
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:460:in `process'
./test/unit/pagescraperbot/website_crawler_test.rb:18:in `test_blizzard'
6) Error:
test_blizzard_store(WebsiteCrawlerTest):
NoMethodError: You have a nil object when you didn't expect it!
The error occurred while evaluating nil.innerHTML
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:101:in `parse_title'
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:460:in `process'
./test/unit/pagescraperbot/website_crawler_test.rb:30:in `test_blizzard_store'
7) Error:
test_cnn(WebsiteCrawlerTest):
NoMethodError: You have a nil object when you didn't expect it!
The error occurred while evaluating nil.innerHTML
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:101:in `parse_title'
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:460:in `process'
./test/unit/pagescraperbot/website_crawler_test.rb:66:in `test_cnn'
8) Error:
test_geo(WebsiteCrawlerTest):
NoMethodError: You have a nil object when you didn't expect it!
The error occurred while evaluating nil.innerHTML
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:101:in `parse_title'
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:460:in `process'
./test/unit/pagescraperbot/website_crawler_test.rb:47:in `test_geo'
9) Error:
test_google(WebsiteCrawlerTest):
NoMethodError: You have a nil object when you didn't expect it!
The error occurred while evaluating nil.innerHTML
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:101:in `parse_title'
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:460:in `process'
./test/unit/pagescraperbot/website_crawler_test.rb:38:in `test_google'
10) Error:
test_uptownla(WebsiteCrawlerTest):
NoMethodError: You have a nil object when you didn't expect it!
The error occurred while evaluating nil.innerHTML
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:101:in `parse_title'
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:460:in `process'
./test/unit/pagescraperbot/website_crawler_test.rb:58:in `test_uptownla'

