WhoisRefreshRunRefresh
Revision as of 07:47, 7 November 2007
Run over all pages pertaining to website information that have 0 human edits, and fetch and insert fresh whois information for each one. For example, www.aboutus.org/facebook.com.
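How those pages might be identified is not spelled out here; one plausible approach, assuming MediaWiki's standard page and revision tables and that "0 human edits" means every revision belongs to one of our bot accounts, is a direct query against the wiki database. The bot account names, driver, and connection details below are placeholders, not taken from the real setup.

 import MySQLdb  # assumes the wiki's MySQL backend is reachable directly

 BOT_ACCOUNTS = ("PageCreationBot", "PageScrapeBot")  # assumed account names

 def candidate_page_ids(db):
     """Main-namespace pages whose every revision was made by a bot account."""
     cur = db.cursor()
     cur.execute("""
         SELECT page_id
         FROM page
         WHERE page_namespace = 0
           AND NOT EXISTS (
                 SELECT 1 FROM revision
                 WHERE rev_page = page_id
                   AND rev_user_text NOT IN (%s, %s))
     """, BOT_ACCOUNTS)
     return [row[0] for row in cur.fetchall()]

 if __name__ == "__main__":
     db = MySQLdb.connect(host="localhost", db="wikidb", user="wiki", passwd="...")
     print("pages to hit:", len(candidate_page_ids(db)))  # ballpark of 7,659,827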
Steps to Done
- Find out how many pages this would hit - approx 7,659,827
- Modify one page
  - Contact information:
    - Contact name
    - Contact email (protected)
    - Street Address (protected)
    - City, State/Province, Postal Code
    - Geocode for maps location
    - Contact Phone Number
    - Contact Fax Number
    - Wiki comment for as-of date of whois info
- Design a process to modify all pages (a sketch of the fetch-and-keep step follows this list)
  - Find the page to update (slow page walk)
  - Get and parse fresh whois information
    - Since we can only query for a given domain once, we should probably keep the full response around in case we parse it wrong
  - Change the address on the page
    - WikiTransformer can be gutted and used for this
    - When saving the new page revision, make sure to suppress creation of the LinksUpdate job
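A rough sketch of the fetch-and-keep step above, not the real bot code: the whois lookup is a plain port-43 query, and the raw response is stored before any parsing so a bad parse can be redone later without re-querying the domain. The whois server, table layout, and SQLite stand-in database are assumptions; rewriting the address via a gutted WikiTransformer and suppressing the LinksUpdate job on save are not shown.

 import socket
 import sqlite3
 import time

 def whois_lookup(domain, server="whois.verisign-grs.com"):
     """Raw port-43 whois query; returns the full response text."""
     sock = socket.create_connection((server, 43), timeout=10)
     try:
         sock.sendall((domain + "\r\n").encode("ascii"))
         chunks = []
         while True:
             data = sock.recv(4096)
             if not data:
                 break
             chunks.append(data)
     finally:
         sock.close()
     return b"".join(chunks).decode("latin-1")

 def save_raw_response(db, domain, raw):
     """Keep the whole response: we only get one query per domain, so the
     raw text must survive in case we parse it wrong."""
     db.execute("CREATE TABLE IF NOT EXISTS whois_raw"
                " (domain TEXT, fetched_at TEXT, response TEXT)")
     db.execute("INSERT INTO whois_raw VALUES (?, ?, ?)",
                (domain, time.strftime("%Y-%m-%d %H:%M:%S"), raw))
     db.commit()

 if __name__ == "__main__":
     db = sqlite3.connect("whois_cache.db")  # stand-in for the real contact db
     raw = whois_lookup("facebook.com")
     save_raw_response(db, "facebook.com", raw)
     print(raw.splitlines()[0])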
Status
Currently, PageCreationBot, PageScrapeBot and WhoisParsing together generate a page as follows:
- PageCreationBot creates a domainbox template and uses the thumbnail tag to embed the thumbnail into it. PageScrapeBot fetches the thumbnail from Alexa for a given domain name and stores it locally in a pre-defined directory. We need to figure out the mechanism behind the thumbnail tag, i.e. how it locates a particular thumbnail image, and then provide a matching mechanism for placing the fetched thumbnail image where MediaWiki can locate it (see the path sketch after this list).
- PageCreationBot creates a section named 'Logo' where it puts the logo that PageScrapeBot fetched from the site itself. The logo is inserted into the page using the wiki Image tag. We need to find a better way of doing this, as with thumbnails.
- Next, PageCreationBot creates a description section, which is filled with the description fetched from Alexa followed by any about-us text extracted from the site. (The about-us text is contained in a sub-section.)
- The Related and Inlinking Domains sections are populated. Related Domains are fetched from Google, whereas sites linking in are fetched from Alexa.
- Keywords fetched from meta tags in the home page are placed in a separate 'Keyword' section.
- Categories fetched from Alexa are used, via the categories tag, to create the categories that the page belongs to.
- Contact info is to be fetched from the contact table that WhoisParsing populates, and put in its own section.
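On the open question of how the thumbnail tag locates an image: if it resolves files the way MediaWiki's own hashed upload directory does, the path is derived from the MD5 of the underscore-normalized filename. This is a guess at the mechanism still to be confirmed, and the directory and filename below are illustrative only.

 import hashlib
 import os

 def hashed_upload_path(upload_dir, filename):
     """MediaWiki-style hashed path: images/<a>/<ab>/<filename>, where a and
     ab are the first one and two hex digits of md5(filename)."""
     name = filename.replace(" ", "_")
     digest = hashlib.md5(name.encode("utf-8")).hexdigest()
     return os.path.join(upload_dir, digest[0], digest[:2], name)

 # e.g. where PageScrapeBot would have to drop the fetched Alexa thumbnail
 # so the wiki can find it
 print(hashed_upload_path("/var/www/wiki/images", "Facebook.com-thumbnail.png"))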
Things to do:
- Need to embed the logo and thumbnails in the same manner.
- Understand the mechanism behind the thumbnail tag.
- Devise a mechanism to detect registration by proxy, and decide on a plan of action if proxy registration is encountered.
- Decide on a course of action based on the status of the domain, i.e. parked or locked (one possible approach is sketched after this list).
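One possible shape for the last two items: flag registration-by-proxy by matching the raw whois text against known privacy services, and collect the domain status lines so the run can decide how to handle parked or locked domains. The proxy signature list below is an assumption, not data from our contact table.

 PROXY_SIGNATURES = (          # assumed list of privacy/proxy registrars
     "domains by proxy",
     "whoisguard",
     "privacyprotect",
     "contact privacy inc",
 )

 def looks_like_proxy(raw_whois):
     """True if the registrant fields appear to belong to a proxy service."""
     text = raw_whois.lower()
     return any(sig in text for sig in PROXY_SIGNATURES)

 def domain_status_flags(raw_whois):
     """Collect the 'Status:' lines (e.g. clientHold, redemptionPeriod) so a
     later step can skip or specially handle parked and locked domains."""
     flags = []
     for line in raw_whois.lower().splitlines():
         if line.strip().startswith(("status:", "domain status:")):
             flags.append(line.split(":", 1)[1].strip())
     return flags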
Possible Scenario
- One of our valued clients enters the following URL: http://www.aboutus.org/i_am_not_on_aboutus_yet.com
- Unfortunately, this page currently does not exist in our db.
- The default wiki behavior is to return a newly created empty page to the client.
- Surely, we can do better.
- So we try to make a best-effort autogenerated page.
- Our top-level glue will first call PageScrapeBot's process method with this new domain as its argument. This results in domain-specific information being dumped into the database.
- It will then do the same to fetch whois information by calling WhoisParsing's parse method.
- At the end of this process, the db is populated with relevant details regarding this domain.
- Once loaded with all this ammunition, it will fire a request to PageCreationBot to create the new page using the relevant data from the db (a sketch of this glue follows the list).
- And voila, we have a newly created page for our valued client.
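A sketch of the top-level glue for this scenario. The process() and parse() method names come from the description above; create_page() on PageCreationBot is an assumed name, and all three objects stand in for the real bots.

 def autogenerate_page(domain, scrape_bot, whois_parser, creation_bot):
     """Best-effort page creation when a visitor hits a domain we do not have yet."""
     scrape_bot.process(domain)       # dump domain-specific info into the db
     whois_parser.parse(domain)       # populate the contact table with whois data
     return creation_bot.create_page(domain)  # build the page from the db (assumed name)

 # e.g. autogenerate_page("i_am_not_on_aboutus_yet.com",
 #                        PageScrapeBot(), WhoisParsing(), PageCreationBot())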

