Difference between revisions of "WhoisRefresh"
Jason Parmer (talk | contribs) (→RewritePageCreationBot) |
(Thinking about composing recognizers) |
Revision as of 05:36, 20 June 2007
This project actually brings together several smaller projects into one. Essentially, it involves a rewrite of the page creation bot to use several new data sources.
MediaWiki:InitialDomainPage
- Pull PageCreationBot info from MediaWiki:InitialDomainPage (which is editable only by a sysop)
- Add Community Review and Discussion sections
- Phase out the graphic tags in favor of email and address
RewritePageCreationBot
- This allows us to move it from Thunderclap (shut down Tomcat)
- Tie it into partner APIs
- Cleaner, better tested, etc.
- Turn off categories and related domains.
- really? This isn't what I'd heard. Jason Parmer 15:46, 18 June 2007 (PDT)
Create Domain Page Specification
- Create adds the domain to the electric sheep queue
- The electric sheep pull the domain from the queue and begin nibbling information about the domain
Finding Addresses
We've demonstrated that it's really easy to create simple regular expression detectors and that the regular expressions themselves can be quite large with no problem. We've also demonstrated that it isn't much more complicated to combine them in a sequence, and that the speed of the sequence is determined by the speed of the first recognizer run in the sequence.
Things like adding word boundaries '\b' to regular expressions can dramatically speed them up. In fact, moving the '\b' from inside the alternation to outside of it on country.rb resulted in a 4x speedup for it and all recognizers that start by recognizing a country.
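To make the difference concrete, here is a minimal sketch of the two forms. The country list is made up for illustration (the real country.rb is not shown here), and the timings will vary by machine and Ruby version, so don't read the 4x figure into this toy benchmark.

```ruby
require 'benchmark'

# Hypothetical country list; the contents of the real country.rb are an assumption.
COUNTRIES = ['United States', 'United Kingdom', 'Germany', 'France', 'Japan'].freeze

# '\b' repeated inside every alternative: \bUnited\ States\b|\bUnited\ Kingdom\b|...
INSIDE = Regexp.new(COUNTRIES.map { |c| "\\b#{Regexp.escape(c)}\\b" }.join('|'))

# '\b' hoisted outside the alternation: \b(?:United\ States|United\ Kingdom|...)\b
OUTSIDE = Regexp.new("\\b(?:#{COUNTRIES.map { |c| Regexp.escape(c) }.join('|')})\\b")

# Made-up input: a long run of text with a single country at the very end.
TEXT = ('Registrant: Example Org, 1 Main St, Springfield. ' * 2000) + 'Country: Germany'

Benchmark.bm(8) do |bm|
  bm.report('inside')  { 50.times { INSIDE.match(TEXT) } }
  bm.report('outside') { 50.times { OUTSIDE.match(TEXT) } }
end
```

Both expressions accept the same strings; only the position of the boundary check differs.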
Finding complete addresses depends upon what we lead with. If we lead with an organization or a street, it will be slow. If on the other hand we lead with the postal_code it will be fast.
It seems that what we want is to build a complex pattern out of simpler ones: a sequence of simple patterns. The key is being able to select which pattern in the sequence runs first and use it to pin the match down. Then move out and recognize the immediately
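One way to read the idea of leading with the fast recognizer is sketched below, assuming US-style single-line addresses. The patterns and the `find_address` name are hypothetical stand-ins, not the project's recognizers. The postal code is found first, then each remaining pattern is anchored with `\z` against the text immediately to its left.

```ruby
# Fast anchor: a US ZIP code. This is an assumption made for illustration only.
POSTAL_CODE = /\b\d{5}(?:-\d{4})?\b/

# Recognizers applied right-to-left, each anchored to the end (\z) of whatever
# text remains to the left of the previous match.
LEFTWARD = {
  state:  /\b([A-Z]{2})\s*\z/,
  city:   /([A-Z][A-Za-z. ]+?),\s*\z/,
  street: /(\d+\s+[A-Za-z0-9. ]+?),\s*\z/
}.freeze

def find_address(text)
  anchor = POSTAL_CODE.match(text)
  return nil unless anchor

  parts = { postal_code: anchor[0] }
  left = anchor.pre_match
  LEFTWARD.each do |name, pattern|
    m = pattern.match(left)
    break unless m          # stop expanding as soon as a neighbor is missing

    parts[name] = m[1]
    left = m.pre_match
  end
  parts
end

p find_address('Example Org, 123 Main St, Springfield, IL 62701')
```

Because the slower patterns only ever run on the short stretch of text next to a postal code, their cost matters much less than it would if they led the sequence.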
Refresh Specification
When a domain page is viewed:
- Do we have a contact info section that has been human edited? If so do nothing
- Else
- Pull new contact information and dump into contact info section
- Overwrite if contact info section already exists
- Protect with the "address" tag
- Look up lat/longitude via the Google API
- Mash up lat/longitude with Google Maps
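The decision above could be sketched roughly as follows. `Page`, `refresh_contact_info`, and the two callables are hypothetical names. The whois fetch and the geocoding call are passed in as lambdas so the sketch doesn't pretend to know the real APIs.

```ruby
# Hypothetical page model; the real wiki page structure is an assumption.
Page = Struct.new(:contact_info, :human_edited, :tags, :lat_long)

# fetch_contact: callable returning fresh contact text (e.g. parsed from whois)
# geocode:       callable mapping an address to a [lat, long] pair
def refresh_contact_info(page, fetch_contact:, geocode:)
  # A human-edited contact section is left alone.
  return :unchanged if page.contact_info && page.human_edited

  page.contact_info = fetch_contact.call   # overwrites any existing section
  page.human_edited = false
  page.tags |= ['address']                 # protect with the "address" tag
  page.lat_long = geocode.call(page.contact_info)
  :refreshed
end

page = Page.new(nil, false, [], nil)
refresh_contact_info(page,
                     fetch_contact: -> { '123 Main St, Springfield, IL 62701' },
                     geocode: ->(_address) { [39.8, -89.6] }) # stubbed coordinates
p page
```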
We're done when
- Clicking through random pages ...
Next
- Get the tests failing
- Check the tests in
- Research our address extraction strategy: (HMMs, MDL)
- Update Brandon's test environment to include the current test database
Problems
- The formats across registrars are inconsistent
- Different countries use different conventions for their addresses
- .org registry limits requests to 4/minute on port 43. (Web-based queries limited to 50/minute)
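A limit like the 4 requests/minute above could be respected with a small sliding-window throttle. This is only a sketch: the `Throttle` class and its parameters are hypothetical, and the clock is injectable so the behavior can be tested without sleeping.

```ruby
# Sliding-window throttle: allows at most `limit` requests in any `window` seconds.
class Throttle
  def initialize(limit:, window:, clock: -> { Time.now.to_f })
    @limit = limit
    @window = window
    @clock = clock
    @sent = []   # timestamps of requests still inside the window
  end

  # Returns true (and records the request) if a request may be sent now.
  def allow?
    now = @clock.call
    @sent.reject! { |t| now - t >= @window }
    return false if @sent.size >= @limit

    @sent << now
    true
  end
end

# Assumed usage for the .org port-43 limit quoted above.
org_throttle = Throttle.new(limit: 4, window: 60)
puts org_throttle.allow?
```

A caller that gets `false` can requeue the domain and try again later rather than blocking.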
Synopsis
When a page is viewed we check to see if we have the contact info for it. If not, we get the whois record for it and use that to populate the page. In the process we obfuscate the email and postal addresses to make it harder for spammers. We also grab the lat/long for the postal address so that we can display a map.
AcceptanceTest
Look at 50 pages that we know don't have contact info and verify that the contact info is coming in.
Background
- rfc3912 the current protocol specification
- Domain Statistics by TLD
- 100 oldest dot com domains
- Registrar Stats
The ranking and percentages come from http://populicio.us/toptlds.html and are at least as stale as November 2006.
Rank | TLD | Purpose | Percentage | Whois Server |
---|---|---|---|---|
1 | .com | commercial organizations | 58.3973 | whois.verisign-grs.com |
2 | .org | organizations | 12.8734 | whois.pir.org |
3 | .net | network infrastructures | 7.3600 | whois.verisign-grs.com |
4 | .uk | United Kingdom | 3.2604 | whois.nic.uk |
5 | .edu | educational establishments accredited in the United States | 2.7008 | whois.educause.edu |
6 | .jp | Japan | 2.6159 | whois.jprs.jp |
7 | .de | Germany | 2.1484 | whois.denic.de |
8 | .br | Brazil | 0.8066 | whois.registro.br |
9 | .ca | Canada | 0.7208 | whois.cira.ca |
10 | .gov | governments and their agencies in the United States | 0.6832 | whois.dotgov.gov |
11 | .au | Australia | 0.6463 | whois.aunic.net |
12 | .info | informational sites | 0.5674 | whois.afilias.net |
13 | .nl | Netherlands | 0.5380 | whois.domain-registry.nl |
14 | .fr | France | 0.5108 | whois.nic.fr |
15 | .us | United States | 0.5030 | whois.nic.us |
16 | .ru | Russian Federation | 0.4610 | whois.ripn.net |
17 | .it | Italy | 0.3527 | whois.nic.it |
18 | .cn | China | 0.3480 | whois.cnnic.net.cn |
19 | .ch | Switzerland | 0.2761 | whois.nic.ch |
20 | .tw | Taiwan | 0.2727 | whois.twnic.net.tw |
21 | .es | Spain | 0.2699 | |
22 | .se | Sweden | 0.2493 | whois.iis.se |
23 | .dk | Denmark | 0.1957 | whois.dk-hostmaster.dk |
24 | .be | Belgium | 0.1956 | whois.dns.be |
25 | .pl | Poland | 0.1816 | whois.dns.pl |
26 | .at | Austria | 0.1659 | whois.nic.at |
27 | .il | Israel | 0.1559 | whois.isoc.org.il |
28 | .tv | Tuvalu | 0.1553 | |
29 | .nz | New Zealand | 0.1233 | whois.srs.net.nz |
30 | .biz | business use | 0.1188 | whois.biz |
?? | .eu | European Union | ??? | whois.eu |
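As a rough sketch of how the table could be used, the snippet below maps a domain's TLD to its whois server and sends an RFC 3912 style query (the query text followed by CRLF on TCP port 43, reading until the server closes the connection). Only a few rows of the table are copied in, the method names are hypothetical, and some servers may expect a different query syntax than the bare domain name.

```ruby
require 'socket'

# A few rows from the table above; extend as needed.
WHOIS_SERVERS = {
  'com' => 'whois.verisign-grs.com',
  'org' => 'whois.pir.org',
  'net' => 'whois.verisign-grs.com',
  'uk'  => 'whois.nic.uk'
}.freeze

# Picks the whois server from the last label of the domain, or nil if unknown.
def whois_server_for(domain)
  WHOIS_SERVERS[domain.downcase.split('.').last]
end

# RFC 3912: the request is a single line terminated by CRLF.
def whois_request(domain)
  "#{domain}\r\n"
end

# Performs the lookup over the network; not called here.
def whois_lookup(domain, port: 43)
  server = whois_server_for(domain)
  raise ArgumentError, "no whois server known for #{domain}" if server.nil?

  TCPSocket.open(server, port) do |socket|
    socket.write(whois_request(domain))
    socket.read   # the server closes the connection when it is done
  end
end

puts whois_server_for('example.org')
```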
Information We Need
- Date of lookup
- Registrant Address
- Admin Address
- Phone
- Question: do we need both the registrant and admin addresses or is one enough? In the past we've only used one. - Ray | talk
Next
- Get gpMakeImage working so that tests pass
- Send email
- Given an address get a latitude/longitude
- Obfuscate address
<address>fe565342385cbcce7cb35b486876b8d5</address>
. . .
<address asgraphic="HASH" latitude="" longitude="" error="This is where an error message goes">
...This address is displayed as a graphic to make it more difficult for web robots to harvest it. If you would like to change the address that is displayed, simply replace these instructions with the new address and then save the page...
</address>
Latitude/Longitude
Need a table so that we can associate one or more lat/long pairs with a page.
Attempt to Fix Freetype
http://lists.freedesktop.org/archives/fontconfig/2006-October/002501.html
When you get the following error
dyld: lazy symbol binding failed: Symbol not found: _FSPathMakeRef
  Referenced from: /usr/local/lib/libfreetype.6.dylib
  Expected in: flat namespace
Fix your freetype to work with MacOSX fonts
freetype-config:
- libs="-lfreetype -lz"
+ libs="-lfreetype -lz -Xlinker -framework -Xlinker CoreServices -Xlinker -framework -Xlinker ApplicationServices"
freetype2.pc:
- Libs: -L${libdir} -lfreetype -lz
+ Libs: -L${libdir} -lfreetype -lz -Xlinker -framework -Xlinker CoreServices -Xlinker -framework -Xlinker ApplicationServices
Then rebuild php from scratch (make clean; make; make install)
Turns out this was a false start ... didn't work