Difference between revisions of "WhoisRefreshRunRefresh"

Revision as of 12:54, 5 November 2007

Run over all website-information pages that have zero human edits, then fetch and insert fresh whois information. For example, www.aboutus.org/facebook.com.

Steps to DoneDone

  • Find out how many pages this would hit - approx 7,659,827
  • modify one page
    • Contact information:
      • Contact name
      • Contact email (protected)
      • Street Address (protected)
      • City, State/Province, Postal Code
      • Geocode for maps location
      • Contact Phone Number
      • Contact Fax Number
      • Wiki comment for as-of date of whois info
  • design a process to modify all pages
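
The contact fields above could be held in a single record and rendered into a wiki section. The following is only a minimal sketch: the `WhoisContact` class, the `render_contact_section` function and the `== Contact ==` heading are hypothetical names, and skipping the "protected" fields in the output is an assumption about what "protected" means here. The as-of date is written as an HTML comment, which wikitext accepts and does not display.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass
class WhoisContact:
    """Hypothetical record mirroring the contact fields listed above."""
    name: str
    email: str                 # listed as "protected"
    street_address: str        # listed as "protected"
    city: str
    state_province: str
    postal_code: str
    geocode: Optional[Tuple[float, float]] = None  # (latitude, longitude)
    phone: Optional[str] = None
    fax: Optional[str] = None


def render_contact_section(contact: WhoisContact, as_of: date) -> str:
    """Build wikitext for the contact section, stamped with an as-of comment."""
    lines = [
        "== Contact ==",
        # HTML comments are not shown on the rendered wiki page.
        f"<!-- whois info as of {as_of.isoformat()} -->",
        contact.name,
        f"{contact.city}, {contact.state_province} {contact.postal_code}",
    ]
    # Assumption: "protected" fields (email, street address) are not written
    # out in plain text, so they are skipped here.
    if contact.phone:
        lines.append(f"Phone: {contact.phone}")
    if contact.fax:
        lines.append(f"Fax: {contact.fax}")
    return "\n".join(lines)


example = WhoisContact(
    name="Jane Doe",
    email="jane@example.com",
    street_address="1 Example St",
    city="Portland",
    state_province="OR",
    postal_code="97201",
    phone="+1 555 0100",
)
print(render_contact_section(example, date(2007, 11, 5)))
```

Keeping the as-of date in a comment means a later refresh run can find and replace it without changing what readers see.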

Status

Currently, PageCreationBot, PageScrapeBot and WhoisParsing together generate a page as follows:

  • PageCreationBot creates a domainbox template and uses the thumbnail tag to embed the thumbnail in it. PageScrapeBot fetches the thumbnail from Alexa for a given domain name and stores it locally in a predefined directory. We need to figure out the mechanism behind the thumbnail tag, i.e. how it locates a particular thumbnail image. We then need a way to place the fetched thumbnail where MediaWiki can locate it.
  • PageCreationBot creates a section named 'Logo' where it puts the logo that PageScrapeBot fetched from the site itself. The logo is inserted into the page using the wiki Image tag. We need to find a better way of doing this, à la thumbnails.
  • Next, PageCreationBot creates a description section, filled with the description fetched from Alexa followed by any "about us" text extracted from the site. (The about-us text is contained in a subsection.)
  • The Related Domains and Inlinking Domains sections are populated. Related domains are fetched from Google, whereas sites linking in are fetched from Alexa.
  • Keywords fetched from meta tags on the home page are placed in a separate section, 'Keyword'.
  • Categories fetched from Alexa are used to create the categories that the page belongs to, using the categories tag.
  • Contact info is to be fetched from the contact table that is populated by WhoisParsing and put in its own section.
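
The section order described above can be sketched as a single assembly function. This is an illustration only, not PageCreationBot's actual code: `build_page`, `bullets`, the `DomainBox` template name and its parameters, and the exact section headings other than 'Logo' and 'Keyword' are all assumptions. The `[[Image:...]]` and `[[Category:...]]` forms are standard wikitext.

```python
from typing import Sequence


def bullets(items: Sequence[str]) -> str:
    """Render items as a wikitext bullet list."""
    return "\n".join(f"* {item}" for item in items)


def build_page(domain: str, thumbnail: str, logo: str, description: str,
               about_us: str, related: Sequence[str],
               inlinking: Sequence[str], keywords: Sequence[str],
               categories: Sequence[str], contact: str) -> str:
    """Assemble page wikitext in the order the bullets above describe."""
    parts = [
        # Hypothetical template call; the real domainbox parameters may differ.
        "{{DomainBox|domain=" + domain + "|thumbnail=" + thumbnail + "}}",
        "== Logo ==\n[[Image:" + logo + "]]",
        "== Description ==\n" + description
        + "\n=== About Us ===\n" + about_us,
        "== Related Domains ==\n" + bullets(related),
        "== Inlinking Domains ==\n" + bullets(inlinking),
        "== Keyword ==\n" + ", ".join(keywords),
        "\n".join(f"[[Category:{name}]]" for name in categories),
        "== Contact ==\n" + contact,
    ]
    return "\n\n".join(parts)


page = build_page(
    domain="example.com",
    thumbnail="example.com_thumb.png",
    logo="example.com_logo.png",
    description="Placeholder description.",
    about_us="Placeholder about-us text.",
    related=["example.org"],
    inlinking=["example.net"],
    keywords=["placeholder", "sample"],
    categories=["Wikis"],
    contact="Placeholder contact info.",
)
print(page)
```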

Things to do:

  • Embed logos and thumbnails in the same manner.
  • Understand the mechanism behind the thumbnail tag.
  • Devise a mechanism to detect registration by proxy. Decide on a plan of action if proxy registration is encountered.
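
One possible starting point for proxy detection is a keyword check over the parsed contact fields. The sketch below is an assumption, not a decided design: the marker list is illustrative and incomplete, `looks_like_proxy_registration` is a hypothetical name, and plain substring matching can produce false positives (for example, a company whose real name contains "proxy").

```python
from typing import Mapping

# Illustrative markers only; a real list would need to be maintained.
PROXY_MARKERS = ("domains by proxy", "whoisguard", "privacy", "proxy")


def looks_like_proxy_registration(contact_fields: Mapping[str, str]) -> bool:
    """Return True if any contact field contains a known proxy marker."""
    haystack = " ".join(str(value) for value in contact_fields.values()).lower()
    return any(marker in haystack for marker in PROXY_MARKERS)


print(looks_like_proxy_registration({"name": "Domains by Proxy, Inc."}))
print(looks_like_proxy_registration({"name": "Jane Doe", "email": "jane@example.com"}))
```

A page flagged this way could, for instance, omit the contact section until the plan of action is decided.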

Possible Scenario

  • One of our valued clients enters the following URL: http://www.aboutus.org/i_am_not_on_aboutus_yet.com
  • Unfortunately, this page does not currently exist in our db.
  • The default wiki behavior is to return a newly created empty page to the client.
  • Surely, we can do better.
  • So we try to make a best-effort autogenerated page.
  • Our top-level glue will first call PageScrapeBot's process method with this new domain as its argument. This will result in domain-specific information being dumped into the database.
  • It will then do the same to fetch whois information by calling WhoisParsing's parse method.
  • At the end of this process, the db is populated with relevant details regarding this domain.
  • Once loaded with all this ammunition, it will fire a request to PageCreationBot to create this new page using the relevant data from the db.
  • And voila, we have a newly created page for our valued client.
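
The top-level glue in this scenario might look like the sketch below. Everything here is hypothetical: `generate_missing_page` is an invented name, the three bots are passed in as plain callables rather than through their real interfaces, and a dictionary stands in for the db. Only the call order (scrape, then whois, then page creation) comes from the steps above.

```python
from typing import Callable, Dict

# Stand-in for the real database: domain -> field name -> value.
Database = Dict[str, Dict[str, str]]


def generate_missing_page(
    domain: str,
    db: Database,
    scrape: Callable[[str, Database], None],
    parse_whois: Callable[[str, Database], None],
    create_page: Callable[[str, Database], str],
) -> str:
    """Populate the db for a new domain, then build its page from the db."""
    scrape(domain, db)        # plays the role of PageScrapeBot's process method
    parse_whois(domain, db)   # plays the role of WhoisParsing's parse method
    return create_page(domain, db)  # plays the role of PageCreationBot


# Placeholder implementations so the sketch can run on its own.
def fake_scrape(domain: str, db: Database) -> None:
    db.setdefault(domain, {})["description"] = "placeholder description"


def fake_parse_whois(domain: str, db: Database) -> None:
    db.setdefault(domain, {})["contact"] = "placeholder contact"


def fake_create_page(domain: str, db: Database) -> str:
    record = db[domain]
    return f"{domain}: {record['description']} / {record['contact']}"


db: Database = {}
page = generate_missing_page(
    "i_am_not_on_aboutus_yet.com", db,
    fake_scrape, fake_parse_whois, fake_create_page,
)
print(page)
```

Passing the bots in as arguments keeps the glue testable without network access; whether the real glue should work this way is an open design choice.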



