WhoisRefreshRunRefresh

Revision as of 07:47, 7 November 2007


Run over all pages pertaining to website information that have 0 human edits, and get and insert fresh whois information. For example, www.aboutus.org/facebook.com.

Steps to Done

  • Find out how many pages this would hit - approx 7,659,827
  • modify one page
    • Contact information:
      • Contact name
      • Contact email (protected)
      • Street Address (protected)
      • City, State/Province, Postal Code
      • Geocode for maps location
      • Contact Phone Number
      • Contact Fax Number
      • Wiki comment for as-of date of whois info
  • design a process to modify all pages
    • Find the page to update (slow page walk)
    • Get and Parse fresh Whois information
      • Since we can only query for a given domain once, we should probably keep the full response around in case we parse it wrong
    • Change the address on the page
      • WikiTransformer can be gutted and used for this
      • When saving the new page revision, make sure to suppress creation of the linksUpdate job
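The fetch, parse, and rewrite steps above can be sketched as follows. The parsing is deliberately rough (whois formats vary per registrar), and the section layout and field names are hypothetical examples, not the bot's actual output; the key point is archiving the raw response so a parsing bug never costs us a second query for the same domain.

```python
import subprocess

def fetch_whois(domain):
    """Query whois exactly once; the caller should archive the raw text."""
    result = subprocess.run(["whois", domain], capture_output=True, text=True)
    return result.stdout

def parse_contact(raw):
    """Rough 'Key: value' parser; real whois output varies per registrar."""
    fields = {}
    for line in raw.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields.setdefault(key.strip().lower(), value.strip())
    return fields

def contact_section(fields, as_of):
    """Render contact fields as wikitext, with an as-of date comment."""
    lines = ["== Contact ==", "<!-- whois as of %s -->" % as_of]
    for key in ("registrant name", "registrant city", "registrant phone"):
        if key in fields:
            lines.append("* %s: %s" % (key.title(), fields[key]))
    return "\n".join(lines)

# Example with a canned response instead of a live query:
raw = "Registrant Name: Example Person\nRegistrant City: Portland"
section = contact_section(parse_contact(raw), "2007-11-07")
```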


Status

Currently, PageCreationBot, PageScrapeBot and WhoisParsing together generate a page as follows:

  • PageCreationBot creates a domainbox template and uses the thumbnail tag to embed the thumbnail into it. PageScrapeBot fetches the thumbnail from Alexa for a given domain name and stores it locally in a pre-defined directory. We need to figure out the mechanism behind the thumbnail tag, i.e. how it locates a particular thumbnail image. Correspondingly, we need to provide a mechanism to put the fetched thumbnail image where MediaWiki can locate it.
  • PageCreationBot creates a section named 'Logo' where it puts the logo that PageScrapeBot fetched from the site itself. The logo is inserted into the page using the wiki Image tag. We need to find a better way of doing this, as with thumbnails.
  • Next, PageCreationBot creates a description section, which is filled with the description fetched from Alexa followed by any about-us text extracted from the site. (The about-us text is contained in a sub-section.)
  • The Related and Inlinking Domains sections are populated: Related Domains are fetched from Google, whereas sites linking in are fetched from Alexa.
  • Keywords fetched from meta tags in the home page are placed in a separate section, 'Keyword'.
  • Categories fetched from Alexa are used to create the categories that the page belongs to, using the categories tag.
  • Contact info is to be fetched from the contact table that is populated by WhoisParsing, and put in its own section.
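On the question of how the thumbnail tag locates an image: MediaWiki (with hashed uploads enabled, the default) stores an uploaded file in a directory tree derived from the MD5 of its name, so the bots would need to write fetched images to the matching path. A minimal sketch of that path computation:

```python
import hashlib

def mediawiki_file_path(filename):
    """Compute where MediaWiki stores an uploaded file.

    With hashed uploads enabled, a file named 'Foo.png' lives at
    images/<a>/<ab>/Foo.png, where <a> and <ab> are the first one and
    two hex digits of the MD5 of the underscore-normalized name.
    """
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return "images/{}/{}/{}".format(digest[0], digest[:2], name)

path = mediawiki_file_path("Facebook.com.png")
```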

Things to do:

  • Need to embed logos and thumbnails in the same manner.
  • Understand the mechanism behind the thumbnail tag.
  • Devise a mechanism to detect registration by proxy, and decide on a plan of action if a proxy registration is encountered.
  • Decide on a course of action based on the status of the domain, e.g. parked or locked.
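For the proxy-registration question, a minimal heuristic would be to scan the raw whois text for known privacy/proxy services. The marker list below is illustrative, not exhaustive:

```python
# Hypothetical heuristic: flag whois responses that mention known
# privacy/proxy registration services so the bot can skip or defer them.
PROXY_MARKERS = [
    "domains by proxy",
    "whoisguard",
    "privacy protect",
    "contact privacy",
]

def is_proxy_registration(raw_whois):
    """Return True if the whois text looks like a proxy registration."""
    text = raw_whois.lower()
    return any(marker in text for marker in PROXY_MARKERS)
```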

Possible Scenario

  • One of our valued clients enters the following URL: http://www.aboutus.org/i_am_not_on_aboutus_yet.com
  • Unfortunately, this page currently does not exist in our db.
  • The default wiki behavior is to return a newly created empty page to the client.
  • Surely, we can do better.
  • So we try to make a best-effort autogenerated page.
  • Our top-level glue will first call PageScrapeBot's process method with this new domain as its argument. This will result in domain-specific information being dumped into database.
  • It will then do the same to fetch whois information by calling WhoIsParsing's parse method.
  • At the end of this process, the db is populated with relevant details regarding this domain.
  • Once loaded with all this ammunition...it will fire a request to PageCreationBot to create this new page using relevant data from the db.
  • And voila, we have a newly created page for our valued client.
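The scenario above amounts to a small piece of top-level glue. A sketch, where the method names process, parse, and create_page are assumptions about the actual interfaces of PageScrapeBot, WhoisParsing, and PageCreationBot:

```python
# Hypothetical glue class tying the three bots together for a domain
# that has no page yet; the bot objects are injected so they can be
# swapped out or stubbed in tests.
class AutoPageBuilder:
    def __init__(self, scrape_bot, whois_parser, page_bot):
        self.scrape_bot = scrape_bot
        self.whois_parser = whois_parser
        self.page_bot = page_bot

    def build(self, domain):
        # 1. Dump domain-specific info (thumbnail, logo, keywords) into the db.
        self.scrape_bot.process(domain)
        # 2. Fetch and parse whois contact info into the contact table.
        self.whois_parser.parse(domain)
        # 3. Create the page from whatever landed in the db.
        return self.page_bot.create_page(domain)
```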




Retrieved from "http://aboutus.com/index.php?title=WhoisRefreshRunRefresh&oldid=12221075"