WhoisRefreshRunRefresh
This is part of the WhoisRefresh project. Run over all website-information pages that have zero human edits, then fetch and insert fresh whois information. For example, www.aboutus.org/facebook.com.
Steps to DoneDone
- Find out how many pages this would hit - approx 7,659,827
- modify one page
- Contact information:
- Contact name
- Contact email (protected)
- Street Address (protected)
- City, State/Province, Postal Code
- Geocode for maps location
- Contact Phone Number
- Contact Fax Number
- Wiki comment for as-of date of whois info
- design a process to modify all pages
- Find the page to update (slow page walk)
- Get and Parse fresh Whois information
- Since we can only query for a given domain once, we should probably keep the full response around in case we parse it wrong
- Change the address on the page
- WikiTransformer can be gutted and used for this
- When saving the new page revision, make sure to suppress creation of the linksUpdate job
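The slow page walk in the steps above can be sketched roughly as follows. This is only an illustration in Python: `walk_pages`, the JSON checkpoint file and the `handle_page` callback are hypothetical names, not part of the actual WhoisInserter code. The idea is that the id of the last finished page is written to disk after every page, so an interrupted run picks up where it stopped instead of walking the pages table again.

```python
import json
import os


def walk_pages(page_ids, checkpoint_path, handle_page):
    """Visit page ids in ascending order, skipping ids already checkpointed."""
    last_done = -1
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            last_done = json.load(f)["last_page_id"]

    visited = []
    for page_id in sorted(page_ids):
        if page_id <= last_done:
            continue  # finished in an earlier session
        handle_page(page_id)
        # Record progress only after the page was handled successfully.
        with open(checkpoint_path, "w") as f:
            json.dump({"last_page_id": page_id}, f)
        visited.append(page_id)
    return visited
```

If `handle_page` raises, the checkpoint still points at the previous page, so the failed page is retried on the next run.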
Anything having to do with the page creation bot is out of scope for this particular project and should be refactored away to a more appropriate place.
Status
Wed Dec 05, 2007
Achieved
- Used Pair Distance algorithm to compare registrant's and admin's addresses. This approximate string matching scheme will ignore minor differences in the two strings.
- Tweaked the proxy identification method to better detect proxy registrations.
- Figured out how to extend graphic protection extension to incorporate multi-line images. Can push it into live branch of aboutus if and when required.
- Added test cases to better check new functionality.
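As an illustration of the address comparison mentioned above, here is a minimal Python sketch. It assumes that "Pair Distance" means the letter-pair similarity measure (twice the number of shared adjacent character pairs, divided by the total number of pairs in both strings); the function names are made up for this example and the real implementation may differ.

```python
def letter_pairs(text):
    """Return adjacent character pairs for every whitespace-separated word."""
    pairs = []
    for word in text.upper().split():
        pairs.extend(word[i:i + 2] for i in range(len(word) - 1))
    return pairs


def pair_similarity(a, b):
    """Score from 0.0 (nothing in common) to 1.0 (same pairs)."""
    pairs_a = letter_pairs(a)
    pairs_b = letter_pairs(b)
    total = len(pairs_a) + len(pairs_b)
    if total == 0:
        # No pairs at all (empty or one-letter words): fall back to equality.
        return 1.0 if a.strip().upper() == b.strip().upper() else 0.0

    remaining = list(pairs_b)
    matches = 0
    for pair in pairs_a:
        if pair in remaining:
            remaining.remove(pair)  # each pair in b can be matched only once
            matches += 1
    return 2.0 * matches / total
```

Two addresses could then be treated as the same when the score exceeds some threshold; the right threshold would have to be tuned against real whois data.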
Things to Do:
- Need to talk with Jason Parmer and say this task is DoneDone
Objectives
- Make address comparison smarter.
- Make proxy identification more encompassing.
- Continue work on extending graphic protection extension to incorporate multi-line images.
Tue Dec 04, 2007
Achieved
- Started work on extending graphic protection extension. (work in progress)
- Paired with Jason Parmer to merge the whoisrefresh branch into live branch.
- Started running it in live.
- Shelving the geocode http switch scheme for another task.
Objectives
- Extend the graphic protection extension to incorporate multi-line images. Make it disabled by default.
- Make the address comparison smarter by looking into a suitable approximate string matching algorithm.
- Develop high-level architecture for moving geocode http from server to client side.
Mon Dec 03, 2007
Achieved
- Address comparison now ignores punctuation marks. Needs to be made smarter. (work in progress)
- Organization name inserted.
- The name of the registrar that registered the domain is now also searched for in the contact info to identify proxy registration.
- Address section checked for human edits. WhoisInserter removes address section from pages it updates.
- Added test cases to check for the aforementioned modifications.
- Refactored test cases.
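The registrar check described above might look something like this Python sketch. The hint list and the function name `looks_like_proxy` are assumptions made for illustration only; the actual proxy identification rules are not documented on this page.

```python
# Hypothetical hint words; the real list would come from observed whois data.
PROXY_HINTS = ("privacy", "proxy", "whoisguard", "private registration")


def looks_like_proxy(contact_text, registrar_name=""):
    """Guess whether a whois contact block belongs to a proxy service."""
    # Lowercase and collapse whitespace so multi-line records match cleanly.
    haystack = " ".join(contact_text.lower().split())
    if any(hint in haystack for hint in PROXY_HINTS):
        return True
    # A contact that is really the registrar itself also suggests a proxy.
    registrar = " ".join(registrar_name.lower().split())
    return bool(registrar) and registrar in haystack
```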
Objectives
- Make the following modifications (along with test cases):
- Insert contact's organization name as well.
- Try to make the address comparison smarter if possible.
- Search for domain registrars in the contact information to better identify proxy registration.
- Check address section of AboutUsBot revision. If no human edits exist, remove it from the WhoisRefreshBot version.
- If possible, start work on extending the current graphic extension to incorporate multi-line images.
Fri Nov 30, 2007
Achieved
- Communicated possible improvements/modifications in WhoisRefreshRunRefresh via email to all concerned. The text of that email is given below for public consumption:
- Geocodes in the map tag have a precision of up to 6 decimal places. There might be a need to fuzz it a little bit by decreasing the precision to 3 decimal places.
- Google Map is showing for proxy registration as well. This happens because the geocodes get embedded by default. That's a definite no-no and will be changed.
- In some cases, the registrant and administrative address looked very similar. There is a case for making our address comparison routine smarter.
- In a couple of cases, it missed identifying proxy registration. Some tweaking is required in the proxy identification routine.
- In the case of http://www.aboutus.org/4lomza.pl , the whois parser returned bad contact information.
- If contact information was retrieved but failed to parse further, all of the contact information will be obfuscated and displayed as a single line in the wiki page. This is a side effect of our graphic extension only being able to generate images out of single-line text. To improve formatting, we can use multiple address tags in this case to divide the protected text into multiple lines.
- The original aboutusbot puts the whois information in two separate sections: the address in the 'address' section and the rest of the info in the 'contact' section. On the other hand, whoisrefreshbot puts the whois information in only the contact section. In some cases, this has resulted in two addresses showing up on the updated pages. We need to either remove the old address section or divide our information into two sections as well.
- Added test cases to define the following behavior:
- In case of proxy registration, no geocodes are embedded in the map tag.
- Covered more proxy registration cases resulting in better test coverage.
- Only registrant address should be displayed if it is a subset of administrative address and not an exact equal. This is necessary because, in many cases, whois parser also prepends the organization name with its address.
- Check for precision of embedded geocodes.
- Check code behaves normally when whois contact information contains arbitrary whitespace and newline characters.
- Refactored WhoisInserter's code
- Following tweaks in WhoisInserter were made:
- Fixed google maps showing up on proxy registration pages.
- Able to change geocode precision.
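Two of the behaviors defined above (showing only the registrant address when it is contained in the administrative address, and reducing geocode precision while skipping geocodes for proxy registrations) can be sketched in Python as below. The function names are hypothetical and the normalization rules are an assumption; this is not the WhoisInserter code itself.

```python
import re


def normalize(address):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", address.lower()).split())


def addresses_to_show(registrant, administrative):
    """Show only the registrant address if the admin address contains it."""
    reg = normalize(registrant)
    adm = normalize(administrative)
    # Pad with spaces so the match is aligned on whole words.
    if reg and f" {reg} " in f" {adm} ":
        return [registrant]
    return [registrant, administrative]


def map_coordinates(lat, lon, is_proxy, precision=3):
    """Return fuzzed coordinates, or None when no map should be shown."""
    if is_proxy or lat is None or lon is None:
        return None
    return (round(lat, precision), round(lon, precision))
```

The containment check covers the case where the whois parser prepends the organization name to the administrative address.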
Objectives
- Go through the 100 pages updated by WhoisRefreshBot and identify possible modifications.
- Add test cases for WhoisInserter.
Wed Nov 28, 2007
- WhoisInserter's Test Cases are able to define WhoisInserter's behavior in the following scenarios:
- Domain is registered through proxy service.
- Whois Parser fails to parse registrant address.
- Whois Parser succeeds in parsing registrant address.
- Whois Record does not contain registrant's email/phone and its address is the same as the administrative contact's.
- Whois Record does not contain registrant's email/phone and its address is different from the administrative contact's.
- Verify geocodes embedded in correct place in wiki text.
- Suppression of linksUpdate jobs.
- WhoisInserter refactoring
- WhoisInserter is also able to do the following:
- Suppress LinksUpdate Jobs while creating new revisions of wiki pages.
- Conditionally put the process to sleep for a time period so as to respect our rate limits.
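The conditional sleep mentioned above could be implemented along the lines of this Python sketch. The class name `Throttle` and the minimum-interval approach are assumptions; the page does not say how the real rate limit is expressed. The clock and sleep functions are injectable so the behavior can be tested without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum gap between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                # Sleep only when the previous call was too recent.
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now
```

Calling `wait()` before every whois query keeps queries at least `min_interval` seconds apart and does not sleep at all when page processing is already slower than that.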
Tue Nov 27, 2007
- WhoisInserter's Test Cases are able to define WhoisInserter's behavior in the following scenarios:
- Domain Page with no human edits with previous contact section.
- Domain Page with no human edits with no previous contact section.
- Non Domain Page
- Whois Parsing Failure
- Domain Page with human edits in contact section.
- Domain Page with human edits in sections other than contact section.
- WhoisInserter is now also able to do the following:
- Will only walk through the pages table once. On restart, it resumes from the page record reached in the previous session.
- Defined directory structure for storing whois records. Instead of dumping data in a single directory, mimic the directory structure followed by mediawiki graphic extension.
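One way to lay out such a directory structure is sketched below in Python. It assumes a layout like MediaWiki's hashed upload directories, where the first one and the first two hex characters of the MD5 of the name form two directory levels. Whether the graphic extension uses exactly this scheme is not stated here, and `whois_record_path` and the `.whois` suffix are made-up names.

```python
import hashlib
import os


def whois_record_path(root, domain):
    """Return a hashed two-level path for a domain's raw whois record."""
    name = domain.strip().lower()
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    # e.g. <root>/<d>/<dd>/<domain>.whois, where d and dd come from the digest
    return os.path.join(root, digest[0], digest[:2], name + ".whois")
```

Spreading records over up to 256 leaf directories avoids a single directory holding millions of files.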
Things to Do
- Make WhoisInserter respect our current 20k whois records limit.
- Finish test cases.
Monday Nov 26, 2007
- Started work on the test cases. Simple cases taken care of.
- Refactored code to improve the program's structure.
- Do not embed geo_coordinates into the wiki text if none are found.
Things to do
- Complete test cases.
- Say it's ready to hit 100 pages in production db.
Friday Nov 23, 2007
- WhoisInserter is now also able to do the following:
- Alternate between enom and ni to fetch whois records if nothing found locally.
- Only touch domain pages.
- Only hits those pages with zero human edits in the contact section.
- Brainstorming the test cases (work in progress)
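The lookup order described above (local copy first, otherwise alternate between the two whois sources and keep the fetched record) can be sketched in Python as follows. `make_fetcher`, the dict used as a local store and the query callables are stand-ins for illustration; the real enom and ni clients are not shown on this page.

```python
import itertools


def make_fetcher(local_store, sources):
    """Build a fetch(domain) that rotates through sources on local misses."""
    rotation = itertools.cycle(sources)

    def fetch(domain):
        if domain in local_store:
            return local_store[domain]  # no remote query needed
        query = next(rotation)  # alternate between the remote sources
        record = query(domain)
        local_store[domain] = record  # keep the raw record for later runs
        return record

    return fetch
```

Because the raw record is stored on every miss, a domain is queried remotely at most once, which matters when each source limits repeated queries.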
Things to do Monday
- Decide on the wiki text if contact info is not parsed. Currently obfuscates all of it in an address tag.
- Do major work on test suite.
- Looks good. Do you think this will be running the first 100 by the end of today or possibly tomorrow? --Brandon 21:23, 25 November 2007 (PST)
Thurs Nov 22, 2007
- Started refactoring whois_inserter script to incorporate the following:
- Fetch whois records from domain registrars if not found locally. This also includes alternating between ni and enom. As a side effect, the fetched record is stored in our local system. (done for ni, enom refactoring pending)
- Hit only those pages with zero human edits. As an improvement, might want to relax this condition and update all pages with zero human edits in contact section. Performance Implications need to be taken into account though.
- Improve Test Suite for whois-inserter.
Things to do tomorrow:
- Finish enom refactoring
- Handle human edits issue.
- Spend some time on test suite
- If new feedback comes out of the staging server demo, try to understand and possibly incorporate it.
- Excellent! Last time we chatted it was next week for first 100 records will be run on live. Is that still what you guys are thinking? --Brandon 22:16, 22 November 2007 (PST)
Wed Nov 21, 2007
WhoisRefreshRunRefresh is able to do the following:
- Insert new Contact Section if no such section exists.
- Replace Contact Section contents with new whois data.
- Street Address and Email are obfuscated.
- City, State, Country, Postal Code and phone number are plain text. (tweaked today)
- Proxy Registration is detected and appropriate message is inserted into contact section rather than the proxy address, etc. (tweaked today)
- Google Maps is enabled.
- Geo-Coordinates are embedded in the page_html and no longer show up in the edit textbox. (done today)
- Some scripts written for importing/exporting table records in chunks from prod. database (work in progress)
Things to do:
- Work on the import/export scripts.
- Get feedback of the current status of WhoisRefreshRunRefresh from Ray and others. Tweak functionality if required.
- I'd like you guys to run the bot on 100 pages on the live site today. --Brandon 21:27, 21 November 2007 (PST)
Old Stuff
Currently, PageCreationBot, PageScrapeBot and WhoisParsing together generate a page as follows:
- PageCreationBot creates a domainbox template and uses the thumbnail tag to embed the thumbnail in it. PageScrapeBot fetches the thumbnail for a given domain name from alexa and stores it locally in a pre-defined directory. We need to figure out the mechanism behind the thumbnail tag, i.e. how it locates a particular thumbnail image. Correspondingly, we need a way to store the fetched thumbnail image where mediawiki can locate it.
- PageCreationBot creates a section named 'Logo' where it puts the logo that PageScrapeBot fetched from the site itself. The logo is inserted into the page using the wiki Image tag. Need to find a better way of doing this, in the same manner as thumbnails.
- Next, the PageCreationBot creates a description section which is filled with description fetched from alexa followed by any about us text extracted from the site. (The aboutus text is contained in a sub-section)
- Related and Inlinking Domains sections are populated. Related Domains are fetched from google, whereas sites linking in are fetched from alexa.
- Keywords fetched from meta tags in the home page are placed in a separate section 'Keyword'
- Categories fetched from alexa are used to create categories that the page belongs to using the categories tag.
- Contact info is to be fetched from the contact table that is populated by WhoisParsing and put in its own section.
Things to do:
- Need to embed logo and thumbnails in same manner.
- Understand mechanism behind thumbnail tag.
- Devise a mechanism to detect registration by proxy. Decide on plan of action if proxy registration encountered.
- Decide on a course of action based on the status of the domain, e.g. parked or locked.
Possible Scenario
- One of our valued clients enters the following url : http://www.aboutus.org/i_am_not_on_aboutus_yet.com
- Unfortunately, this page currently does not exist in our db.
- The default wiki behavior is to return a newly created empty page to the client.
- Surely, we can do better.
- So we try to make a best-effort autogenerated page
- Our top-level glue will first call PageScrapeBot's process method with this new domain as its argument. This will result in domain-specific information being dumped into database.
- It will then do the same to fetch whois information by calling WhoisParsing's parse method.
- At the end of this process, the db is populated with relevant details regarding this domain.
- Once loaded with all this ammunition, it will fire a request to PageCreationBot to create this new page using relevant data from the db.
- And voila, we have a newly created page for our valued client.
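The top-level glue from this scenario can be sketched in Python as below. Everything here is hypothetical: the three callables stand in for PageScrapeBot's process method, WhoisParsing's parse method and PageCreationBot, and a plain dict stands in for the database.

```python
def autogenerate_page(domain, db, scrape, parse_whois, create_page):
    """Populate the db for a new domain, then build its page from that data."""
    record = db.setdefault(domain, {})
    record["site"] = scrape(domain)        # stand-in for PageScrapeBot
    record["whois"] = parse_whois(domain)  # stand-in for WhoisParsing
    # Page creation reads only from the db record, as in the steps above.
    return create_page(domain, record)
```

Keeping the fetch steps separate from page creation means a page can be regenerated later from the stored data without scraping or querying whois again.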