Difference between revisions of "WhoisRefresh"

Latest revision as of 18:47, 6 November 2007

Rating: 0 - 0 votes

Company Logo

Company Name

Company Contact

Page Type

This page is about a company.

DevelopmentTeam 10-9, ~~8-6~~, ~~4-2~~, 4)

This project actually brings together several smaller projects into one. Essentially, it involves a rewrite of the page creation bot to use several new data sources.

(10) ~~WhoisParsing~~ (Arif Iqbal) (7-10)
- AddressParsing
RewritePageCreationBot
MediaWiki:InitialDomainPage
RewritePageScrapeBot
(4) WhoisRefreshRunRefresh (Jason and Ali)

Steps to DoneDone

Branch whois-refresh
Get it running on our local machines
Assess where we currently are
Ruthlessly prune ... spin off any tasks that aren't absolutely essential onto their own pages that will be considered later

Synopsis

A new domain-page is scraped, and populated

When a red-link is clicked
When a search for a domain doesn't return a page

AcceptanceTest

Look at 50 pages that we know don't have contact info and verify that the contact info is coming in.

Background

rfc3912 the current protocol specification
Domain Statistics by TLD
100 oldest dot com domains
Registrar Stats

The ranking and percentages come from http://populicio.us/toptlds.html and are at least as stale as November 2006.

Rank	TLD	Purpose	Percentage	Whois Server
1	.com	commercial organizations	58.3973	whois.verisign-grs.com
2	.org	organizations	12.8734	whois.pir.org
3	.net	network infrastructures	7.3600	whois.verisign-grs.com
4	.uk	United Kingdom	3.2604	whois.nic.uk
5	.edu	educational establishments accredited in the United States	2.7008	whois.educause.edu
6	.jp	Japan	2.6159	whois.jprs.jp
7	.de	Germany	2.1484	whois.denic.de
8	.br	Brazil	0.8066	whois.registro.br
9	.ca	Canada	0.7208	whois.cira.ca
10	.gov	governments and their agencies in the United States	0.6832	whois.dotgov.gov
11	.au	Australia	0.6463	whois.aunic.net
12	.info	informational sites	0.5674	whois.afilias.net
13	.nl	Netherlands	0.5380	whois.domain-registry.nl
14	.fr	France	0.5108	whois.nic.fr
15	.us	United States	0.5030	whois.nic.us
16	.ru	Russian Federation	0.4610	whois.ripn.net
17	.it	Italy	0.3527	whois.nic.it
18	.cn	China	0.3480	whois.cnnic.net.cn
19	.ch	Switzerland	0.2761	whois.nic.ch
20	.tw	Taiwan	0.2727	whois.twnic.net.tw
21	.es	Spain	0.2699
22	.se	Sweden	0.2493	whois.iis.se
23	.dk	Denmark	0.1957	whois.dk-hostmaster.dk
24	.be	Belgium	0.1956	whois.dns.be
25	.pl	Poland	0.1816	whois.dns.pl
26	.at	Austria	0.1659	whois.nic.at
27	.il	Israel	0.1559	whois.isoc.org.il
28	.tv	Tuvalu	0.1553
29	.nz	New Zealand	0.1233	whois.srs.net.nz
30	.biz	business use	0.1188	whois.biz
??	.eu	European Union	???	whois.eu

Information We Need

Date of lookup
Registrant Address
Admin Address
Phone
Email

Question: do we need both the registrant and admin addresses or is one enough? In the past we've only used one. - Ray | talk


<address>fe565342385cbcce7cb35b486876b8d5</address>
 .
 .
 .
<address asgraphic="HASH" latitude="" longitude="" error="This is where an error message goes">
...This address is displayed as a graphic to make 
it more difficult for web robots to harvest it.  
If you would like to change the address that is 
displayed, simply replace these instructions 
with the new address and then save the page...
</address>

Latitude/Longitude

Need a table so that we can associate one or more lat/long pairs with a page.

Some neat regular expressions

These are from the book, "The Ruby Way"

The following regex matches a phone number in the NANP format (North American Numbering Plan). It allows three common ways of writing such a phone number:phone = /^(($\d{3}$ |\d{3}-)\d{3}-\d{4}|\d{3}\.\d{3}\.\d{4})$/

"(512) 555-1234" =~ phone # true "512.555.1234" =~ phone # true "512-555-1234" =~ phone # true "(512)-555-1234" =~ phone # false "512-555.1234" =~ phone # false

Here is a regex to match a U.S. ZIP Code (which may be five or nine digits):

zip = /^\d{5}(-\d{4})?$/

The following regex matches all the 51 usual codes (50 states and the District of Columbia):

state = /^A[LKZR] | C[AOT] | D[EC] | FL | GA | HI | I[DLNA] | K[SY] | LA | M[EDAINSOT] | N[EVHJMYCD] | O[HKR] | PA | RI | S[CD] | T[NX] | UT | V[TA] | W[AVIY]$/x

Whois Records We Need

We need 50 more whois records covering the range of formats for each of the following registrars:

public domain
ONLINENIC
STRATO
BASICFUSION.COM
DOMAINNAMESALES | domain name sales
core
METAPREDICT.COM

We have a few but need 50 whois records covering the range of formats for each of the following registrars:

ascio
beijing Innovative
belgium Domains
capitol_domain
discount_domain
domaindiscover
domain_doorman
domainsite
dotregistrar
dotster
fabulous.com
gandi
hichina
innerwise
joker.com
Key systems
Mark Monitor
Melbourne IT
Moniker
NameKing
Name.Net
Names4ever
namesdirect
nameview
nicline
ovh
psi-usa
registerfly
schlund
srsplus
wild west domains

@@ Line 1: / Line 1: @@
-<noinclude><big>[[DevelopmentPriorities]] < </noinclude>[[WhoisRefresh]] ('''10''', <strike>8-6</strike>, <strike>4-2</strike>, <strike>4</strike>)  ... ([[Brandon CS Sanders|Brandon]], [[User:Arif_Iqbal|Arif]], [[Athar Hameed|Athar]])<noinclude></big>
+<noinclude><big>[[DevelopmentTeam]] < [[DevelopmentPriorities|Priorities]] < </noinclude>[[WhoisRefresh]] ('''14''', <strike>10-9</strike>, <strike>8-6</strike>, <strike>4-2</strike>, <strike>4</strike>)  <noinclude></big>
 {{RightTOC}}
 This project actually brings together several smaller projects into one.  Essentially, it involves a rewrite of the page creation bot to use several new data sources.
-== MediaWiki:InitialDomainPage ==
+# {{:WhoisParsing}}
-* Pull PageCreationBot info from [[MediaWiki:InitialDomainPage]] (which is editable only by a sysop)
+#* [[AddressParsing]]
-* Add Community Review and Discussion sections
+# {{:RewritePageCreationBot}}
-* Phase out the graphic tags in favor of email and address
+# {{:MediaWiki:InitialDomainPage}}
+# {{:RewritePageScrapeBot}}
+# {{:WhoisRefreshRunRefresh}}
-== RewritePageCreationBot ==
+== Steps to [[DoneDone]] ==
-* This allows us to move it from Thunderclap (shut down Tomcat)
+* Branch whois-refresh
-* Tie it into partner APIs
+* Get it running on our local machines
-* Cleaner, better tested, etc.
+* Assess where we currently are
-* Turn off categories and related domains.
+* Ruthlessly prune ... spin off any tasks that aren't absolutely essential onto their own pages that will be considered later
-** really?  This isn't what I'd heard. [[User:Jason Parmer|Jason Parmer]] 15:46, 18 June 2007 (PDT)
-* When a user enters an allcaps domain, force it to camelcase
-== Create Domain Page Specification ==
-* Create adds the domain to the electric sheep queue
-* The electric sheep pull the domain from the queue and begin nibbling information about the domain
-**
-== Finding Addresses ==
-We've demonstrated that it's really easy to create simple regular expression detectors and that the regular expressions themselves can be quite large with no problem.  We've also demonstrated that it isn't much more complicated to combine them in a sequence, and that the speed of the sequence is determined by the speed of the first recognizer run in the sequence.
-Things like adding word boundaries '\b' to regular expressions can dramatically speed them up.  In fact, moving the '\b' from inside the alternation to outside of it on country.rb resulted in a 4x speedup for it and all recognizers that start by recognizing a country.
-Finding complete addresses depends upon what we lead with.  If we lead with an organization or a street, it will be slow.  If on the other hand we lead with the postal_code it will be fast.
-It seems like what we want to be able to do is create a complex pattern from simpler patterns -- A sequence of simple patterns.  The key is being able to select the first one to be run from the sequence and pin the pattern down.  Then move out and recognize the immediately preceding and immediately following patterns (do this recursively).
-The StringScanner class already allows us to set the position to begin searching for a match from, and our recognizers are already outputing the location they matched.  Now we just need to create a simple mechanism for specifying which recognizer to run first, then adding something onto the start or end of the string that we've already matched.
- recognizer.match_first       postal_code
- recognizer.then_skip_after   whitespace
- recognizer.then_match_after  country
- recognizer.then_skip_before  whitespace
- recognizer.then_match_before state_or_province
- recognizer.then_skip_before  whitespace
- recognizer.then_match_before street
-== Refresh Specification ==
-When a domain page is viewed:
-* Do we have a contact info section that has been human edited?  If so do nothing
-* Else
-** Pull new contact information and dump into contact info section
-** Overwrite if contact info section already exists
-** Protect with the "address" tag
-** Lookup lat/longitude via google api
-** Mashup lat/longitude with google maps
-== We're done when ==
-* Clicking through random pages ...
-== Next ==
-* <strike>Get the tests failing</strike>
-* <strike>Check the tests in</strike>
-* Research our address extraction strategy: ([[HMM]]s, [[MDL]])
-* Update Brandon's test environment to include the current test database
-== Problems ==
-* The formats across registrars are inconsistent
-* Different countries use different conventions for their addresses
-* .org registry limits requests to 4/minute on port 43. (Web-based queries limited to 50/minute)
 == Synopsis ==
+A new domain-page is scraped, and populated
-When a page is viewed we check to see if we have the contact info for it.  If not, we get the whois record for it and use that to populate the page.  In the process we obfuscate the email and postal addresses to make it harder for spammers.  We also grab the lat/long for the postal address so that we can display a map.
+* When a red-link is clicked
+* When a search for a domain doesn't return a page
 == [[AcceptanceTest]] ==
@@ Line 155: / Line 101: @@
 | ??   || .eu   || European Union                                              ||  ???       || whois.eu
 |}
 == Information We Need ==
@@ Line 192: / Line 136: @@
 Need a table so that we can associate one or more lat/long pairs with a page.
-== Attemp to Fix Freetype ==
-http://lists.freedesktop.org/archives/fontconfig/2006-October/002501.html
-When you get the followin error
-<pre>
-dyld: lazy symbol binding failed: Symbol not found: _FSPathMakeRef
-  Referenced from: /usr/local/lib/libfreetype.6.dylib
-  Expected in: flat namespace
-</pre>
-Fix your freetype to work with MacOSX fonts
-<pre>
-  freetype-config:
-  - libs="-lfreetype -lz"
-  + libs="-lfreetype -lz -Xlinker -framework -Xlinker CoreServices -Xlinker -framework -Xlinker ApplicationServices"
-  freetype2.pc:
-  - Libs: -L${libdir} -lfreetype -lz
-  + Libs: -L${libdir} -lfreetype -lz -Xlinker -framework -Xlinker CoreServices -Xlinker -framework -Xlinker ApplicationServices
-</pre>
-Then rebuild php from scratch (make clean; make; make install)
-'''Turns out this was a false start ... didn't work'''
-[[InstallingAboutUsOnOSX]]
 == Some neat regular expressions ==
@@ Line 240: / Line 156: @@
 state =  /^A[LKZR] | C[AOT] | D[EC] | FL | GA | HI | I[DLNA] | K[SY] | LA | M[EDAINSOT] | N[EVHJMYCD] | O[HKR] | PA | RI | S[CD] | T[NX] | UT | V[TA] | W[AVIY]$/x
-== Athar's thoughts for today (21/06/07) ==
+== Whois Records We Need ==
+We need 50 more whois records covering the range of formats for each of the following registrars:
-I have been wrestling with regular expressions. The assigned task is fairly straightforward. I don't understand what I am doing wrong in my regexs. I will look at it again tonight.
+* public domain
+* ONLINENIC
-== Work Process ==
+* STRATO
-Things that you must do at the start of every day
+* BASICFUSION.COM
-* svn update , git pull
+* DOMAINNAMESALES | domain name sales
-* svn diff, git diff
+* core
-** Pretent to be the person who made the changes and ask yourself the reason of every change e.g. if I am arif who is looking at the changes that Brandon has made then i ll pretend to be Brandon(i.e. the guy who made those changes) and expain to Arif(the guy who need to understand the changes) the reason of every change.
+* METAPREDICT.COM
-** If you don understand the reason, write a NO and ask the guy about the changes by writing a comment in the code or ask the question on the wiki
-* Write on the wiki the work you need to do that you say at the end of the day that it was a good productive day. Also write down the minimum changes that you must complete by the end of the day.
-* You do the work throughout the day and keep on committing the code. Keep on improving the code base throughout the day.
-* At the end of the day, you actually write what you accomplished throughout the day.
-* Write the messages on the wiki page for the team members that will carry on the work from where you left.
-== Thanks Arif ==
-I'll try to be more disciplined in the future!  [[Image:Brandon CS Sanders Caricature.jpg|25px]] [[Brandon CS Sanders]]
+We have a few but need 50 whois records covering the range of formats for each of the following registrars:
+* ascio
+* beijing Innovative
+* belgium Domains
+* capitol_domain
+* discount_domain
+* domaindiscover
+* domain_doorman
+* domainsite
+* dotregistrar
+* dotster
+* fabulous.com
+* gandi
+* hichina
+* innerwise
+* joker.com
+* Key systems
+* Mark Monitor
+* Melbourne IT
+* Moniker
+* NameKing
+* Name.Net
+* Names4ever
+* namesdirect
+* nameview
+* nicline
+* ovh
+* psi-usa
+* registerfly
+* schlund
+* srsplus
+* wild west domains
 </noinclude>
+[[Category:DevelopmentTeamProject]]

Difference between revisions of "WhoisRefresh"

Company Logo

Company Name

Company Contact

Page Type

Edit Page Image

Edit Name

Edit Contact Information

Edit Page Type

Map

Edit Page Rating

Latest revision as of 18:47, 6 November 2007

Company Logo

Company Name

Company Contact

Page Type

Contents

Steps to DoneDone

Synopsis

AcceptanceTest

Background

Information We Need

Next

Latitude/Longitude

Some neat regular expressions

Whois Records We Need

Edit Page Image

Edit Name

Edit Contact Information

Edit Page Type

Map

Edit Page Rating

Company Logo

Company Name

Company Contact

Page Type

Edit Page Image

Edit Name

Edit Contact Information

Edit Page Type

Map

Edit Page Rating