Difference between revisions of "WhoisRefresh"

(Starting to elucidate what needs to be done next)
 

(45 intermediate revisions by 7 users not shown)



Line 1: Line 1:
<noinclude><big>[[DevelopmentPriorities]] < </noinclude>[[WhoisRefresh]] ('''4''')  ... ([[Brandon CS Sanders|Brandon]])<noinclude></big>
+
<noinclude><big>[[DevelopmentTeam]] < [[DevelopmentPriorities|Priorities]] < </noinclude>[[WhoisRefresh]] ('''14''', <strike>10-9</strike>, <strike>8-6</strike>, <strike>4-2</strike>, <strike>4</strike>)  <noinclude></big>
 +
{{RightTOC}}
  
== Next ==
+
This project actually brings together several smaller projects into one.  Essentially, it involves a rewrite of the page creation bot to use several new data sources.
 +
 
 +
# {{:WhoisParsing}}
 +
#* [[AddressParsing]]
 +
# {{:RewritePageCreationBot}}
 +
# {{:MediaWiki:InitialDomainPage}}
 +
# {{:RewritePageScrapeBot}}
 +
# {{:WhoisRefreshRunRefresh}}
  
* <strike>Get the tests failing</strike>
+
== Steps to [[DoneDone]] ==
* Check the tests in
+
* Branch whois-refresh
* Research our address extraction strategy: ([[HMM]]s, [[MDL]])
+
* Get it running on our local machines
* Update Brandon's test environment to include the current test database
+
* Assess where we currently are
 +
* Ruthlessly prune ... spin off any tasks that aren't absolutely essential onto their own pages that will be considered later
  
 
== Synopsis ==
 
== Synopsis ==
 
+
A new domain-page is scraped, and populated
When a page is viewed we check to see if we have the contact info for it.  If not, we get the whois record for it and use that to populate the page.  In the process we obfuscate the email and postal addresses to make it harder for spammers.  We also grab the lat/long for the postal address so that we can display a map.
+
* When a red-link is clicked
 +
* When a search for a domain doesn't return a page
  
 
== [[AcceptanceTest]] ==
 
== [[AcceptanceTest]] ==
Line 91: Line 101:
 
| ??  || .eu  || European Union                                              ||  ???      || whois.eu
 
| ??  || .eu  || European Union                                              ||  ???      || whois.eu
 
|}
 
|}
 
 
  
 
== Information We Need ==
 
== Information We Need ==
Line 129: Line 137:
 
Need a table so that we can associate one or more lat/long pairs with a page.
 
Need a table so that we can associate one or more lat/long pairs with a page.
  
== Attempt to Fix Freetype ==
+
== Some neat regular expressions ==  
 +
 
 +
These are from the book, "The Ruby Way"
 +
 
 +
* The following regex matches a phone number in the NANP format (North American Numbering Plan). It allows three common ways of writing such a phone number:phone = /^((\(\d{3}\) |\d{3}-)\d{3}-\d{4}|\d{3}\.\d{3}\.\d{4})$/
  
http://lists.freedesktop.org/archives/fontconfig/2006-October/002501.html
+
"(512) 555-1234" =~ phone    # true
 +
"512.555.1234"  =~ phone    # true
 +
"512-555-1234"  =~ phone    # true
 +
"(512)-555-1234" =~ phone    # false
 +
"512-555.1234"  =~ phone    # false
  
When you get the following error
+
* Here is a regex to match a U.S. ZIP Code (which may be five or nine digits):
<pre>
+
 
dyld: lazy symbol binding failed: Symbol not found: _FSPathMakeRef
+
zip = /^\d{5}(-\d{4})?$/
  Referenced from: /usr/local/lib/libfreetype.6.dylib
 
  Expected in: flat namespace
 
</pre>
 
  
Fix your freetype to work with MacOSX fonts
+
* The following regex matches all the 51 usual codes (50 states and the District of Columbia):
<pre>
+
state = /^A[LKZR] | C[AOT] | D[EC] | FL | GA | HI | I[DLNA] | K[SY] | LA | M[EDAINSOT] | N[EVHJMYCD] | O[HKR] | PA | RI | S[CD] | T[NX] | UT | V[TA] | W[AVIY]$/x
  freetype-config:
 
  - libs="-lfreetype -lz"
 
  + libs="-lfreetype -lz -Xlinker -framework -Xlinker CoreServices -Xlinker -framework -Xlinker ApplicationServices"
 
  
  freetype2.pc:
+
== Whois Records We Need ==
  - Libs: -L${libdir} -lfreetype -lz
+
We need 50 more whois records covering the range of formats for each of the following registrars:
  + Libs: -L${libdir} -lfreetype -lz -Xlinker -framework -Xlinker CoreServices -Xlinker -framework -Xlinker ApplicationServices
+
* public domain
</pre>
+
* ONLINENIC
 +
* STRATO
 +
* BASICFUSION.COM
 +
* DOMAINNAMESALES | domain name sales
 +
* core
 +
* METAPREDICT.COM
  
Then rebuild php from scratch (make clean; make; make install)
+
We have a few but need 50 whois records covering the range of formats for each of the following registrars:
 +
* ascio
 +
* beijing Innovative
 +
* belgium Domains
 +
* capitol_domain
 +
* discount_domain
 +
* domaindiscover
 +
* domain_doorman
 +
* domainsite
 +
* dotregistrar
 +
* dotster
 +
* fabulous.com
 +
* gandi
 +
* hichina
 +
* innerwise
 +
* joker.com
 +
* Key systems
 +
* Mark Monitor
 +
* Melbourne IT
 +
* Moniker
 +
* NameKing
 +
* Name.Net
 +
* Names4ever
 +
* namesdirect
 +
* nameview
 +
* nicline
 +
* ovh
 +
* psi-usa
 +
* registerfly
 +
* schlund
 +
* srsplus
 +
* wild west domains
  
'''Turns out this was a false start ... didn't work'''
 
  
[[InstallingAboutUsOnOSX]]
 
 
</noinclude>
 
</noinclude>
 +
[[Category:DevelopmentTeamProject]]

Latest revision as of 18:47, 6 November 2007

DevelopmentTeam < Priorities < WhoisRefresh (14, 10-9, 8-6, 4-2, 4)


This project actually brings together several smaller projects into one. Essentially, it involves a rewrite of the page creation bot to use several new data sources.

  1. (10) WhoisParsing (Arif Iqbal) (7-10)
  2. RewritePageCreationBot
  3. MediaWiki:InitialDomainPage
  4. RewritePageScrapeBot
  5. (4) WhoisRefreshRunRefresh (Jason and Ali)

Steps to DoneDone

  • Branch whois-refresh
  • Get it running on our local machines
  • Assess where we currently are
  • Ruthlessly prune ... spin off any tasks that aren't absolutely essential onto their own pages that will be considered later

Synopsis

A new domain-page is scraped, and populated

  • When a red-link is clicked
  • When a search for a domain doesn't return a page

AcceptanceTest

Look at 50 pages that we know don't have contact info and verify that the contact info is coming in.

Background

The ranking and percentages come from http://populicio.us/toptlds.html and are at least as stale as November 2006.

Rank | TLD | Purpose | Percentage | Whois Server
1 | .com | commercial organizations | 58.3973 | whois.verisign-grs.com
2 | .org | organizations | 12.8734 | whois.pir.org
3 | .net | network infrastructures | 7.3600 | whois.verisign-grs.com
4 | .uk | United Kingdom | 3.2604 | whois.nic.uk
5 | .edu | educational establishments accredited in the United States | 2.7008 | whois.educause.edu
6 | .jp | Japan | 2.6159 | whois.jprs.jp
7 | .de | Germany | 2.1484 | whois.denic.de
8 | .br | Brazil | 0.8066 | whois.registro.br
9 | .ca | Canada | 0.7208 | whois.cira.ca
10 | .gov | governments and their agencies in the United States | 0.6832 | whois.dotgov.gov
11 | .au | Australia | 0.6463 | whois.aunic.net
12 | .info | informational sites | 0.5674 | whois.afilias.net
13 | .nl | Netherlands | 0.5380 | whois.domain-registry.nl
14 | .fr | France | 0.5108 | whois.nic.fr
15 | .us | United States | 0.5030 | whois.nic.us
16 | .ru | Russian Federation | 0.4610 | whois.ripn.net
17 | .it | Italy | 0.3527 | whois.nic.it
18 | .cn | China | 0.3480 | whois.cnnic.net.cn
19 | .ch | Switzerland | 0.2761 | whois.nic.ch
20 | .tw | Taiwan | 0.2727 | whois.twnic.net.tw
21 | .es | Spain | 0.2699 |
22 | .se | Sweden | 0.2493 | whois.iis.se
23 | .dk | Denmark | 0.1957 | whois.dk-hostmaster.dk
24 | .be | Belgium | 0.1956 | whois.dns.be
25 | .pl | Poland | 0.1816 | whois.dns.pl
26 | .at | Austria | 0.1659 | whois.nic.at
27 | .il | Israel | 0.1559 | whois.isoc.org.il
28 | .tv | Tuvalu | 0.1553 |
29 | .nz | New Zealand | 0.1233 | whois.srs.net.nz
30 | .biz | business use | 0.1188 | whois.biz
?? | .eu | European Union | ??? | whois.eu
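
The table above maps each TLD to a whois server. As a rough sketch of how that mapping could drive a lookup, the Ruby below picks the server by TLD and speaks the plain whois protocol (TCP port 43, query terminated by CRLF). The names WHOIS_SERVERS, whois_server_for and fetch_whois are made up for illustration and are not taken from the page creation bot; only a few rows of the table are copied in, and individual registries may expect a different query syntax.

```ruby
require "socket"

# Hypothetical lookup table built from a few rows of the table above.
WHOIS_SERVERS = {
  "com" => "whois.verisign-grs.com",
  "org" => "whois.pir.org",
  "net" => "whois.verisign-grs.com",
  "uk"  => "whois.nic.uk",
  "de"  => "whois.denic.de"
}.freeze

# Return the whois server for a domain's TLD, or nil if the TLD is not listed.
def whois_server_for(domain)
  tld = domain.downcase.split(".").last
  WHOIS_SERVERS[tld]
end

# Fetch the raw whois record: connect to port 43, send the query followed
# by CRLF, and read until the server closes the connection.
def fetch_whois(domain)
  server = whois_server_for(domain)
  return nil if server.nil?

  TCPSocket.open(server, 43) do |socket|
    socket.write("#{domain}\r\n")
    socket.read
  end
end
```

Calling fetch_whois("example.com") would return the raw record text, which still has to be parsed per registrar format.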

Information We Need

  • Date of lookup
  • Registrant Address
  • Admin Address
  • Phone
  • Email
Question: do we need both the registrant and admin addresses or is one enough? In the past we've only used one. - Ray | talk

Next

  • Get gpMakeImage working so that tests pass
  • Send email
  • Given an address, get a latitude/longitude
  • Obfuscate address

<address>fe565342385cbcce7cb35b486876b8d5</address>
 .
 .
 .
<address asgraphic="HASH" latitude="" longitude="" error="This is where an error message goes">
...This address is displayed as a graphic to make 
it more difficult for web robots to harvest it.  
If you would like to change the address that is 
displayed, simply replace these instructions 
with the new address and then save the page...
</address>
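
A minimal sketch of producing the hashed form of the tag shown above. The 32-hex-digit placeholder looks like an MD5 digest, but that is an assumption; the page does not say which hash is used or what exactly is hashed. The helper names address_hash and obfuscated_address_tag are hypothetical.

```ruby
require "digest/md5"

# Assumption: the placeholder is an MD5 hex digest of the address text
# after trimming and collapsing runs of whitespace.
def address_hash(address)
  normalized = address.strip.gsub(/\s+/, " ")
  Digest::MD5.hexdigest(normalized)
end

# Wrap the digest in the <address> element used on the page.
def obfuscated_address_tag(address)
  "<address>#{address_hash(address)}</address>"
end
```

Normalizing whitespace first means that two copies of the same address that differ only in spacing map to the same graphic.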

Latitude/Longitude

Need a table so that we can associate one or more lat/long pairs with a page.
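
As a sketch of the one-page-to-many-coordinates relationship described above, here is an in-memory stand-in in Ruby. The real thing would be a database table; the class and field names (PageLocations, Coordinate) are hypothetical.

```ruby
# Hypothetical stand-in for the table: page title => list of coordinates.
Coordinate = Struct.new(:latitude, :longitude)

class PageLocations
  def initialize
    @by_page = Hash.new { |hash, page| hash[page] = [] }
  end

  # Associate one more lat/long pair with a page.
  def add(page, latitude, longitude)
    @by_page[page] << Coordinate.new(latitude, longitude)
  end

  # All pairs recorded for a page (an empty array when there are none).
  def for_page(page)
    @by_page.fetch(page, [])
  end
end
```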

Some neat regular expressions

These are from the book, "The Ruby Way"

  • The following regex matches a phone number in the NANP format (North American Numbering Plan). It allows three common ways of writing such a phone number:

phone = /^((\(\d{3}\) |\d{3}-)\d{3}-\d{4}|\d{3}\.\d{3}\.\d{4})$/

"(512) 555-1234" =~ phone    # true
"512.555.1234"   =~ phone    # true
"512-555-1234"   =~ phone    # true
"(512)-555-1234" =~ phone    # false
"512-555.1234"   =~ phone    # false

  • Here is a regex to match a U.S. ZIP Code (which may be five or nine digits):

zip = /^\d{5}(-\d{4})?$/

  • The following regex matches all the 51 usual codes (50 states and the District of Columbia):

state = /^(A[LKZR] | C[AOT] | D[EC] | FL | GA | HI | I[DLNA] | K[SY] | LA | M[EDAINSOT] | N[EVHJMYCD] | O[HKR] | PA | RI | S[CD] | T[NX] | UT | V[TA] | W[AVIY])$/x
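
A small sketch of how the three patterns might be used together to sanity-check fields pulled out of a whois record. The helper check_contact and the record keys are hypothetical. The state alternation is wrapped in a group here so that ^ and $ apply to every alternative rather than only to the first and last.

```ruby
PHONE = /^((\(\d{3}\) |\d{3}-)\d{3}-\d{4}|\d{3}\.\d{3}\.\d{4})$/
ZIP   = /^\d{5}(-\d{4})?$/
# Grouped so the anchors cover the whole alternation; /x ignores the spaces.
STATE = /^(A[LKZR] | C[AOT] | D[EC] | FL | GA | HI | I[DLNA] | K[SY] | LA |
           M[EDAINSOT] | N[EVHJMYCD] | O[HKR] | PA | RI | S[CD] | T[NX] |
           UT | V[TA] | W[AVIY])$/x

# Hypothetical helper: report which fields of a parsed contact record
# look well formed. Missing fields count as invalid.
def check_contact(record)
  {
    phone: PHONE.match?(record[:phone].to_s),
    zip:   ZIP.match?(record[:zip].to_s),
    state: STATE.match?(record[:state].to_s)
  }
end
```

Note that in Ruby ^ and $ match at line boundaries, so multi-line input should be split into lines first, or the patterns anchored with \A and \z instead.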

Whois Records We Need

We need 50 more whois records covering the range of formats for each of the following registrars:

  • public domain
  • ONLINENIC
  • STRATO
  • BASICFUSION.COM
  • DOMAINNAMESALES | domain name sales
  • core
  • METAPREDICT.COM

We have a few but need 50 whois records covering the range of formats for each of the following registrars:

  • ascio
  • beijing Innovative
  • belgium Domains
  • capitol_domain
  • discount_domain
  • domaindiscover
  • domain_doorman
  • domainsite
  • dotregistrar
  • dotster
  • fabulous.com
  • gandi
  • hichina
  • innerwise
  • joker.com
  • Key systems
  • Mark Monitor
  • Melbourne IT
  • Moniker
  • NameKing
  • Name.Net
  • Names4ever
  • namesdirect
  • nameview
  • nicline
  • ovh
  • psi-usa
  • registerfly
  • schlund
  • srsplus
  • wild west domains


Retrieved from "http://aboutus.com/index.php?title=WhoisRefresh&oldid=12209162"