* all the meta information ... for example, is it parked? does it forward? is it online? is the registration protected?
* <s>sites that it links out to</s> only for the home page
* <s>About Us Text extraction</s>
== Status as of 1st November, 2007 ==

PageScrapeBot does the following.

* When PageScrapeBot's process method is called with a domain name as argument, it fetches information about that domain by calling the various extractors. It then dumps these values into the relevant tables of the whois-refresh database.
* First, the process method fetches related links by calling GoogleDataParser's process method. Currently, it fetches just the top 10 related links and stores them in an instance variable. These related links are eventually dumped into the related_links table.
* Second, the process method fetches inlink, category, title, description, and thumbnail information from AlexaExtractor's process methods. The thumbnail is also stored in a local file in the specified dir <Thumbnail_Dir>. Eventually, the inlinks are dumped into the inbound_links table, the thumbnail info into the image_links table, and the description and title into the alexas as well as the domains table.
* Lastly, the process method fetches keyword meta info, the logo, and the aboutus text by calling the relevant methods of the WebsiteInfo class. The logo is also stored in a local file in the specified dir <Logo_Dir>. Eventually, the keyword info is dumped into the keywords table, the logo info into the image_links table, and the aboutus text into the domains table.
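The flow above can be sketched roughly as follows. This is a minimal illustration, not the real code: GoogleDataParser here is a stand-in stub, and only the related-links step is fleshed out.

```ruby
# Rough sketch of the flow described above. GoogleDataParser is a
# stand-in stub, not the real extractor.
class GoogleDataParser
  def process(domain, limit = 10)
    # The real extractor scrapes Google; this stub fabricates links.
    (1..limit).map { |i| "related-#{i}.#{domain}" }
  end
end

class PageScrapeBot
  attr_reader :related_links

  def process(domain)
    # Step 1: related links, kept in an instance variable and later
    # dumped into the related_links table.
    @related_links = GoogleDataParser.new.process(domain, 10)
    # Steps 2 and 3 (AlexaExtractor, WebsiteInfo) would follow here,
    # each dumping its results into its own whois-refresh table.
    @related_links
  end
end
```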
Test suites for all extractors of PageScrapeBot can be run with the command <rake test:bot>.

An individual extractor's test cases can be run with the following rake commands:

* GoogleDataParser: <rake test:bot:google>
* AlexaExtractor: <rake test:bot:alexa>
* AboutUsExtractor: <rake test:bot:aboutus>
* LogoFinder: <rake test:bot:logo>
==== Suggested Tweaks ====

* PageScrapeBot's process method should invoke the various extractors in separate threads instead of a single thread, as is the case now.
* Update the database to do away with unnecessary tables. (caution: this can have repercussions for other applications using the current database)
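The first tweak could look something like this. A minimal sketch using Ruby's Thread; the lambdas are placeholders for the real GoogleDataParser, AlexaExtractor and WebsiteInfo calls.

```ruby
# Sketch of the suggested tweak: run each extractor in its own thread.
# The lambdas are placeholders for the real extractor calls.
extractors = {
  related_links: ->(domain) { ["related.#{domain}"] },
  alexa:         ->(domain) { { title: domain } },
  website_info:  ->(domain) { { keywords: [] } }
}

results = {}
threads = extractors.map do |name, fn|
  # Each thread writes to a distinct key, so no lock is needed here;
  # the db dump should still happen only after all joins complete.
  Thread.new { results[name] = fn.call("example.com") }
end
threads.each(&:join)
```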
== Things that still need some minor effort ==

* <s>Rails multi threading issues kill db commands.</s>
* Facebook does not respond, saying the browser is unrecognized.
* robots.txt mock
* <s>Better [[AboutUs]] Information Extractor.</s>
== '''Steps to get to [[DoneDone]]''' ==
The current PageScrapeBot is part of a large Java tomcat process running on our database master. We want to rewrite it in ruby and move it off of the database master. Rewriting PageScrapeBot is one Task in the larger WhoisRefreshProject.

The bot currently gets the whois record from alexa, crawls the site looking for a logo and for meta information for category guessing, looks for "aboutus" information on the site and several other things, then dumps all of that information into the database, from where it is collected and reassembled by the PageCreationBot.

==== Why this is important ====

* Gives us mastery over our technology. We can change it easily (e.g., add amazon books that mention the domain), or adapt it for a different problem.
* Allows us to turn off Apache and the large Java tomcat process on our database master.

Given a domain, the bot collects the following information and adds it to the database:

* A snippet about the domain, gathered from the site itself
* A thumbnail portrait of the front page (done with alexa; should we consider capturing it from nameintel?)
* A best guess at the logo for the site
* A list of related domains
* The address (or perhaps addresses) for the site, along with a citation for where the address came from
* Meta information about the owner, email, alexa rank, language of the site, whether it is online or not, and whether it is classified as adult by alexa or not
* dmoz categories that contain the site (categories are fetched from alexa, not directly from dmoz)
* Meta information about the domain (for example, is it an idn domain)
* Location of the homepage for the site (www.example.com vs example.com)
* Incoming links to the frontpage
* Keywords from the meta information on the frontpage
Currently, the bot tries to extract the site's logo from the main page. A possible pitfall with this approach is that the main page contains lots of extra HTML, so parsing for the logo image tag can be time-consuming and incorrect in some instances. The idea is to first look for the logo on the aboutus page, if it exists. The rationale is that the HTML making up the site's aboutus page will be significantly shorter and less complex than that of the main page.
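That idea could be sketched like this. The find_logo method and the page hash are illustrative assumptions; the real LogoFinder API may well differ.

```ruby
# Rough sketch of the aboutus-first logo hunt described above.
def find_logo(pages)
  html = pages[:aboutus] || pages[:main]  # prefer the shorter aboutus page
  # Crude heuristic: first <img> tag whose src or alt mentions "logo".
  html[/<img[^>]*(?:src|alt)="[^"]*logo[^"]*"[^>]*>/i]
end

pages = {
  main:    '<html>...much more markup...<img src="/img/logo.png">...</html>',
  aboutus: '<html><img alt="Acme logo" src="/about/logo.gif"></html>'
}
```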
==== Improve Test Cases ====

Update test cases for the following:
* Google Data Parser
* Alexa Data Parser
* Website Crawler (including logo & aboutus text tests)
* Integration Test Suite for PageScrapeBot

==== Schema ====

* Redefine the schema.
* Put it on the page creation bot.
* Write migrations for the new schema.
* Verify the migration using the console.

==== Get compost up and running on all dev machines ====

* Get compost up and running on everyone's machines.
* Make sure that the commits are going into the repository and database/models.

==== Fetch Domain Info from Google ====

* Fetch Related links
** Indexing issues because of threading.
** Proper handling of the case when more results are demanded than are available.
** Test Case: Domain name has no related links (2)
** Test Case: Domain name has fewer related links than requested (2)
** Test Case: Scrape all related links from the returned Google search page(s) (2)
** Test Case: Scrape a subset of related links from the returned Google search page (2)
** Performance tests (2)
* Fetch Incoming links
** Test Case: Domain name has no incoming links (2)
** Test Case: Domain name has fewer incoming links than requested (2)
** Test Case: Scrape a subset of incoming links from the returned Google search page (2)
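The "fewer results than requested" edge cases above might be exercised along these lines. FakeGoogleDataParser is a stand-in for illustration; the real parser's interface is not shown on this page.

```ruby
# Sketch of handling "more results demanded than available": the parser
# should return what exists rather than raising. FakeGoogleDataParser
# is a stand-in for the real class.
class FakeGoogleDataParser
  def initialize(available)
    @available = available
  end

  # Return at most `limit` related links, tolerating sparse domains.
  def related_links(domain, limit = 10)
    @available.fetch(domain, []).first(limit)
  end
end

parser = FakeGoogleDataParser.new(
  "popular.example" => (1..25).map { |i| "rel-#{i}.example" },
  "obscure.example" => ["rel-1.example"],
  "empty.example"   => []
)
```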
==== Site crawler that finds logo and meta tags on the front page ====

* Pass Rake test
* Write test cases for websites containing logos, aboutus
* Parse aboutus pages of the website
* Fetch Logo
* Metatags extraction:
** Description (5)
** Title
** Keywords
** Language
* Capture external links.
* Capture internal links.
* Domain Information:
** forwarded/redirected
** parked
* Location of pages:
** home
* Respect robots.txt and robots info in metatags.
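The robots.txt item could start from something like the sketch below. Ruby's stdlib has no robots.txt parser, so this is a hand-rolled minimal check; a real implementation would also honor per-agent sections and the robots metatags.

```ruby
# Minimal robots.txt check for the crawler (illustrative only).
def disallowed_paths(robots_txt, agent = "*")
  rules = []
  current = nil
  robots_txt.each_line do |line|
    case line.strip
    when /\AUser-agent:\s*(.+)\z/i then current = Regexp.last_match(1)
    when /\ADisallow:\s*(\S+)\z/i  then rules << Regexp.last_match(1) if current == agent
    end
  end
  rules
end

# A path is crawlable when no Disallow rule prefixes it.
def allowed?(path, robots_txt)
  disallowed_paths(robots_txt).none? { |p| path.start_with?(p) }
end

ROBOTS = "User-agent: *\nDisallow: /private/\nDisallow: /tmp/\n"
```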
==== Capture alexa information ====

* Rewrite code to push data into the new database schema:
** sites linking in
** contact information
** related domains
** site rank
** status
** adult content
** thumbnail

==== Multi threading support ====

* Test cases:
** Connectivity problems (3)
** Invalid domain (5)
** Domain with some information
** Domain with massive information (5)

==== Integrate with MediaWiki ====

* Call the Scrape Bot from the Mediawiki and see if that works.
== Concern/Issues ==

'''Tuesday Sep 11, 2007'''
* Home page information. How is it different from domain information?
* Should we split Keywords?

'''Monday Sep 10, 2007'''
* Alexa key for thumbnail information? Same as for all alexa web-services ... account now has thumbnail enabled
* What is ticket in alexa information?

'''Thursday Sep 6, 2007'''
* Where to put the thumbnail information: image or url?
* Schema related:
** Why anchor text?
** primary key (id) for rails
** alexa info in domain table or separate table?
** display, extension, isidn, rank in domain table?
** owned domain?
** what is queue?
** why schema info?
** thumbnail information?
** online in alexa

'''Wednesday Sep 5, 2007'''
* position of related domain?
* schema of the database?
* number of links?
* multiple versions of data?
* whoisparsing?

'''Tuesday Sep 4, 2007'''
* Code of the current crawler is in ?
* Offline crawling
* versions of information
* Testing of alexa information? Monitoring rather than testing: check that results from alexa make sense, call for things that we know, etc.
* The way we are using the google API to fetch pages etc.
* Crawl the complete website to get meta tags etc.
* Try to get all information in 5 pages. Heuristic approach
* robots.txt: don't crawl the disallowed parts of the website.
* Changing the schema, if we feel like it?

'''Monday Sep 3, 2007'''
* Scalability concerns?
* Code of the current crawler is in ?
* Name of people in the commit.
* Amazon account
* Alexa information is months old.
* Incoming links through google.
* Outgoing links? Crawling the website.
* Extracting information from web pages.
* Small for small people.
* web server.
* web interface to see where we need to put in the calling.