Rewrite PageScrapeBot
Contributors: Athar Hameed, Ali Aslam, Arif Iqbal, Hassan Javeed, Laiq Jafri, Mohammad Ghufran, Umar Sheikh
What (summary)
The current PageScrapeBot is part of a large Java Tomcat process running on our database master. We want to rewrite it in Ruby and move it off of the database master. Rewrite PageScrapeBot is one Task in the larger WhoisRefresh Project.
The bot currently fetches the whois record from Alexa, crawls the site looking for a logo and for meta information for category guessing, looks for "about us" information on the site, and does several other things, then dumps all of that information into the database, from where it is collected and reassembled by the PageCreationBot.
Why this is important
- Gives us mastery over our technology. We can change it easily (e.g., add amazon books that mention the domain), or adapt it for a different problem.
- Allows us to turn off Apache and the large Java Tomcat process on our database master.
DoneDone
Given a domain, the bot collects the following information and adds it to the database:
- A snippet about the domain, gathered from the site itself
- A thumbnail portrait of the front page
- A best guess at the logo for the site
- A list of related domains
- The address (or perhaps addresses) for the site, along with a citation for where the address came from
- Meta information about the owner, email, Alexa rank, language of the site, whether it is online or not, and whether Alexa classifies it as adult or not
- dmoz categories that contain the site
- Meta information about the domain (for example, is it an IDN domain)
- Location of the homepage for the site (www.example.com vs example.com)
- Incoming links to the front page
- Keywords from the meta information on the front page
- All the status meta information ... for example: is it parked? does it forward? is it online? is the registration protected?
- Sites that it links out to
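Taken together, the list above amounts to one record per domain. As a rough sketch (field names here are illustrative only, not the final schema, which is still being redefined below), the bot might assemble something like:

```ruby
# Hypothetical sketch of the per-domain record the bot assembles.
# Field names are illustrative, not the final schema.
def empty_scrape_record(domain)
  {
    :domain          => domain,
    :snippet         => nil,   # text gathered from the site itself
    :thumbnail_url   => nil,   # portrait of the front page
    :logo_url        => nil,   # best guess at the logo
    :related_domains => [],
    :addresses       => [],    # each entry should carry a citation
    :alexa           => { :rank => nil, :online => nil, :adult => nil },
    :dmoz_categories => [],
    :homepage        => nil,   # www.example.com vs example.com
    :incoming_links  => [],
    :outgoing_links  => [],
    :keywords        => [],
    :status          => { :parked => nil, :forwards => nil, :protected => nil }
  }
end
```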
Steps to get to DoneDone
Schema
- Redefine the schema.
- Put it on the page creation bot.
- Write migrations for the new schema.
- Verify the migration using the console.
Get compost up and running on all dev machines
- Get compost up and running on everyone's machines.
- Make sure that the commits are going into the repository and database/models.
Find dmoz categories of the website
- Fetch domain categories from dmoz data.
Fetch Domain Info from Google
- Fetch Related links
- Indexing issues because of threading.
- Proper handling of the case when more results are demanded than available.
- Test Case: Domain name has no related links (2)
- Test Case: Domain name has fewer related links than requested (2)
- Test Case: Scrape all related links from the returned Google search page(s) (2)
- Test Case: Scrape a subset of related links from the returned Google search page (2)
- Performance tests (2)
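The "fewer results than requested" handling that the related-links test cases call for could be sketched as follows. The real bot would drive mechanize/hpricot against the Google results page; this stdlib-only version (`related_domains` is a hypothetical helper, not existing code) just shows the extraction and slicing behavior:

```ruby
require 'uri'

# Pull distinct outbound domains from a search-results HTML page,
# returning at most `limit` of them. Returns fewer than `limit` when
# the page simply has fewer links -- no error in that case.
def related_domains(html, limit)
  hosts = html.scan(/href="(https?:\/\/[^"\/]+)/).flatten.map do |url|
    URI.parse(url).host rescue nil
  end.compact.uniq
  hosts[0, limit]
end
```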
- Fetch Incoming links
- Test Case: Domain name has no incoming links (2)
- Test Case: Domain name has fewer incoming links than requested (2)
- Test Case: Scrape all incoming links from the returned Google search page(s) (2)
- Test Case: Scrape a subset of incoming links from the returned Google search page (2)
- Performance tests (2)
Site crawler that finds the logo and meta tags on the front page
- Metatags extraction.
  - Description (5)
  - Title
  - Keywords
  - Language
- Capture external links.
- Capture internal links.
- Logo
- Domain information
  - forwarded/redirected
  - parked
- Location of pages
  - home
  - aboutus. Also extract the text.
  - contactus?
- Respect robots.txt and robots info in metatags.
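The metatags extraction (title, description, keywords, and the robots meta the crawler must respect) could be sketched like this. The real crawler would use hpricot rather than regexes, and `extract_meta` is a hypothetical helper, not existing code. It also splits keywords, per the open question in the Sep 11 concerns:

```ruby
# Regex-based sketch of front-page metatag extraction.
def extract_meta(html)
  meta = {}
  meta[:title] = $1.strip if html =~ /<title>(.*?)<\/title>/im
  html.scan(/<meta\s+name="(\w+)"\s+content="([^"]*)"/i) do |name, content|
    meta[name.downcase.to_sym] = content
  end
  # split the keywords string into a list
  meta[:keywords] = meta[:keywords].split(/\s*,\s*/) if meta[:keywords]
  # honor <meta name="robots" content="noindex">
  meta[:indexable] = !(meta[:robots].to_s =~ /noindex/i)
  meta
end
```

robots.txt itself would be fetched and checked separately, before crawling any page.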
Capture Alexa information
- Rewrite code to push data into the new database schema.
  - sites linking in
  - contact information
  - related domains
  - site rank
  - status
  - adult content
  - thumbnail
- Multi-threading support
- Test cases
  - Connectivity problems (3)
  - Invalid domain (5)
  - Domain with some information
  - Domain with massive information (5)
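Parsing the Alexa response into those fields might look like the following. Note the element names here are a simplified, hypothetical payload for illustration; the real AWIS response uses namespaced elements, so the XPaths would need adjusting against the actual XML:

```ruby
require 'rexml/document'

# Sketch: pull rank / adult flag / links-in count out of a url-info style
# XML response. Missing elements come back as nil rather than raising.
def parse_alexa_info(xml)
  doc   = REXML::Document.new(xml)
  rank  = doc.elements["//Rank"]
  adult = doc.elements["//AdultContent"]
  links = doc.elements["//LinksInCount"]
  {
    :rank           => rank && rank.text.to_i,
    :adult          => adult ? adult.text == "yes" : nil,
    :links_in_count => links && links.text.to_i
  }
end
```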
Integrate with MediaWiki
- Call the Scrape Bot from MediaWiki and see if that works.
Concerns/Issues
Tuesday Sep 11, 2007
- Keywords should be split?
Monday Sep 10, 2007
- Alexa key for thumbnail information?
- Same as for all alexa web-services ... account now has thumbnail enabled
- what is ticket in alexa information?
Thursday Sep 6, 2007
- where to put the thumbnail information. Image or url?
- Schema Related
- Why anchor text
- primary key(id) for rails.
- alexa info in domain table or separate table.
- display, extension, isidn, rank in domain table?
- owned domain ?
- what is queue?
- why schema info?
- thumbnail information?
- online in alexa
Wednesday Sep 5, 2007
- position of related domain?
- schema of the database?
- number of links?
- multiple versions of data?
- whois parsing?
Tuesday Sep 4, 2007
- Code of the current crawler is in ?
- Offline Crawling
- versions of information
- Testing of alexa information?
- Monitoring rather than testing. Check that results from Alexa make sense. Call for things that we know, etc.
- The way we are using google API to fetch pages etc.
- Crawl the complete website to get meta tags etc.
- Try to get all information in 5 pages. Heuristic approach
- robots.txt: don't crawl the disallowed pages.
- Changing the schema, if we feel like?
Monday Sep 3, 2007
- Scalability concerns. ?
- code of the current crawler is in ?
- Name of people in the commit.
- Amazon account
- Alexa information is months old.
- Incoming links through google.
- Outgoing links? crawling the website.
- Extracting information from web pages.
- Small for small people.
- web server.
- web interface to see where we need to put in the calling.
- app/web_services/web_services.rb in compost.
- wiki/extensions/AboutUsWebServices in aboutus.
- Schema ?
Schema
- Domain
  - url
  - homepage
Development Repositories
- /opt/git/compost-whois-refresh
- /opt/git/aboutus-whois-refresh
Gather Information
Ruby Libraries
- mechanize
- hpricot
- topsoil web spider
rake test errors 2007.09.10
1) Error:
test_url_info_google(AlexaDataParserTest):
NoMethodError: private method `gsub' called for #<Scrape:0x2f74098>
/www/lib/ruby/gems/1.8/gems/activerecord-1.15.3/lib/active_record/base.rb:1860:in `method_missing'
/www/lib/ruby/1.8/cgi.rb:342:in `escape'
/Users/brandon/Code/compost-whois-refresh/config/../lib/page_scrape_bot/urlinfo.rb:38:in `process'
/Users/brandon/Code/compost-whois-refresh/config/../lib/page_scrape_bot/urlinfo.rb:38:in `collect'
/Users/brandon/Code/compost-whois-refresh/config/../lib/page_scrape_bot/urlinfo.rb:38:in `process'
./test/unit/pagescraperbot/alexadataparser_test.rb:40:in `test_url_info_google'
2) Error:
test_url_info_facebook(TestGoogleUrlInfo):
ArgumentError: wrong number of arguments (3 for 4)
./test/unit/pagescraperbot/urlinfo_test.rb:36:in `process'
./test/unit/pagescraperbot/urlinfo_test.rb:36:in `test_url_info_facebook'
3) Error:
test_url_info_google(TestGoogleUrlInfo):
ArgumentError: wrong number of arguments (3 for 4)
./test/unit/pagescraperbot/urlinfo_test.rb:7:in `process'
./test/unit/pagescraperbot/urlinfo_test.rb:7:in `test_url_info_google'
4) Error:
test_altavista(WebsiteCrawlerTest):
NoMethodError: You have a nil object when you didn't expect it!
The error occurred while evaluating nil.innerHTML
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:101:in `parse_title'
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:460:in `process'
./test/unit/pagescraperbot/website_crawler_test.rb:74:in `test_altavista'
5) Error:
test_blizzard(WebsiteCrawlerTest):
NoMethodError: You have a nil object when you didn't expect it!
The error occurred while evaluating nil.innerHTML
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:101:in `parse_title'
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:460:in `process'
./test/unit/pagescraperbot/website_crawler_test.rb:18:in `test_blizzard'
6) Error:
test_blizzard_store(WebsiteCrawlerTest):
NoMethodError: You have a nil object when you didn't expect it!
The error occurred while evaluating nil.innerHTML
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:101:in `parse_title'
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:460:in `process'
./test/unit/pagescraperbot/website_crawler_test.rb:30:in `test_blizzard_store'
7) Error:
test_cnn(WebsiteCrawlerTest):
NoMethodError: You have a nil object when you didn't expect it!
The error occurred while evaluating nil.innerHTML
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:101:in `parse_title'
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:460:in `process'
./test/unit/pagescraperbot/website_crawler_test.rb:66:in `test_cnn'
8) Error:
test_geo(WebsiteCrawlerTest):
NoMethodError: You have a nil object when you didn't expect it!
The error occurred while evaluating nil.innerHTML
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:101:in `parse_title'
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:460:in `process'
./test/unit/pagescraperbot/website_crawler_test.rb:47:in `test_geo'
9) Error:
test_google(WebsiteCrawlerTest):
NoMethodError: You have a nil object when you didn't expect it!
The error occurred while evaluating nil.innerHTML
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:101:in `parse_title'
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:460:in `process'
./test/unit/pagescraperbot/website_crawler_test.rb:38:in `test_google'
10) Error:
test_uptownla(WebsiteCrawlerTest):
NoMethodError: You have a nil object when you didn't expect it!
The error occurred while evaluating nil.innerHTML
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:101:in `parse_title'
./test/unit/pagescraperbot/../../../lib/page_scrape_bot/website_crawler.rb:460:in `process'
./test/unit/pagescraperbot/website_crawler_test.rb:58:in `test_uptownla'

