Difference between revisions of "IncreasePagesWithAdvertisements"

(Steps to get to DoneDone)
 
== Steps to get to [[DoneDone]] ==
 
 
* <s> Clean up the page </s>
 
* Brief Ali about the search improvements staging.
 
* Merge the changes with the live branch.
 
 
==(December 5, 2007)==
 
'''Objectives'''
 
* Today we will continue to clean up the test data so that we can verify and remove all the files which are actually not adult pages.
 
* We will also try to add some very obvious keywords in foreign languages to our regular expression (RE).
 
 
'''Done'''
 
We skipped the first part to be sure that we check as many pages as possible before we get rid of them.
 
* We added keywords for four languages, namely French, Dutch, Russian and Japanese.
 
*This has produced the following numbers:
 
*True positives: 32335
 
*True negatives: 1281
 
*False positives: 137
 
*False negatives: 3694
 
 
The results do not show a major change in the numbers, which suggests that the pages are indeed mostly clean.
 
 
==Current Status (December 4, 2007)==
 
Today we tweaked the regular expressions some more. We analyzed both false positives and false negatives and improved the results a notch.
 
The current results are as follows:
 
*True positives: 31928
 
*True negatives: 1287
 
*False positives: 131
 
*False negatives: 4113
 
: 1287 / (1287 + 131) = 90.76 %
 
: 4113 / (4113 + 31928) = 11.4 %
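 
The two percentages above can be reproduced from the four counts. This is a minimal Python sketch, not code from the adult-filter repository. It assumes (the page does not label the formulas) that the first figure is the share of non-adult pages correctly marked clean and the second is the share of adult pages incorrectly marked clean. The function names are illustrative.

```python
def clean_pass_rate(true_negatives, false_positives):
    """Share of non-adult pages that were correctly marked clean."""
    return true_negatives / (true_negatives + false_positives)


def adult_miss_rate(false_negatives, true_positives):
    """Share of adult pages that were incorrectly marked clean."""
    return false_negatives / (false_negatives + true_positives)


# Counts reported for December 4, 2007 on this page.
tp, tn, fp, fn = 31928, 1287, 131, 4113

print(f"Clean pages marked clean: {clean_pass_rate(tn, fp):.2%}")
print(f"Adult pages marked clean: {adult_miss_rate(fn, tp):.2%}")
```

Read against the DoneDone goals, the first figure should be at least 90% and the second at most 5%.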
 
 
===Issues===
 
We still don't know how to handle webpages with text in foreign languages. We tried to work out some foreign-language keywords, with little success. Is leaving out foreign-language pages a good idea? I am inclined to do this until we find a workable method to handle such pages.
 
 
Today we started hand-cleaning the data to remove files that were obviously not adult but had the adult tag on them. We will continue this tomorrow so that we can get a more accurate number for false negatives.
 
 
==Current Status==
 
We made some changes to the code. There were a lot of cases where innocent words that contain suggestive words as substrings were being marked adult. We added functionality to check for these words, and it has improved the results.
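 
One way such a check could work is sketched below in Python. This is an illustration only, not the code in the repository: the keyword list, the allow-list of innocent words, and the function name are all hypothetical, and the real regular expression is much larger.

```python
import re

# Hypothetical examples; the project's actual keyword lists are not shown on this page.
SUGGESTIVE_KEYWORDS = ["sex"]
INNOCENT_WORDS = {"essex", "sussex", "middlesex"}

KEYWORD_RE = re.compile("|".join(map(re.escape, SUGGESTIVE_KEYWORDS)), re.IGNORECASE)
WORD_RE = re.compile(r"[A-Za-z]+")


def suggestive_hits(text):
    """Return the words in text that contain a keyword and are not on the allow-list."""
    hits = []
    for word in WORD_RE.findall(text):
        if word.lower() in INNOCENT_WORDS:
            continue  # innocent word that merely contains a keyword as a substring
        if KEYWORD_RE.search(word):
            hits.append(word)
    return hits


print(suggestive_hits("Hotels in Essex and Sussex"))
print(suggestive_hits("free sex videos"))
```

The allow-list is checked per word before the keyword match, so a page is not flagged just because a place name happens to contain a keyword.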
 
 
There are 4255 pages in our test set whose only recognizable feature is the adult tag embedded within them. Keeping those pages, the results are:
 
*True positives: 31129
 
*True negatives: 1246
 
*False positives: 172
 
*False negatives: 4979
 
 
If we do filter out pages based on the adult category tag, the results are:
 
*True positives: 36100
 
*True negatives: 1246
 
*False positives: 172
 
*False negatives: 8
 
 
There is a difference of ~700 pages (2%) that are incorrectly marked clean if we don't use the adult tag filter. This number needs to be brought down.
 
 
Is it possible to have some more data? I observed that quite a few pages are placed in the wrong category. Examples are:
 
* http://chessmaniac.com/
 
* http://travelhealthresource.com/
 
* http://oaksretirement.com/
 
* http://reliabletransfer.com/
 
 
==Current Numbers ==
 
Numbers and percentages as of November 23, 2007 <br />
 
The following results are achieved after considering the presence of the Adult Tag on the pages.
 
*Correctly marked Adult: 36100
 
*Correctly marked as NotAdult: 1220
 
*Incorrectly marked as Adult: 198
 
*Incorrectly marked as NotAdult: 8
 
: 1220 / ( 198 + 1220 ) = 86%
 
: 8 / ( 36100 + 8 ) = 0.022%
 
 
: '''Unfortunately we can't consider the presence of the Adult Tag.  The vast majority of our pages that are adult content do not have it.  Our ground truth data is skewed because we used the presence of the Adult flag to create our test set.  So almost every page in the Adult set has the tag, so using it gives very incorrect results.''' --[[Brandon]] 21:30, 25 November 2007 (PST)
 
 
Numbers and percentages as of today (November 22, 2007)
 
*Correctly marked Adult: 25495
 
*Correctly marked as NotAdult: 1023
 
*Incorrectly marked as Adult: 395
 
*Incorrectly marked as NotAdult: 10613
 
: 1023 / (395 + 1023) = 72.1 %
 
: 10613 / (25495 + 10613) = 29.4 %
 
====Issues====
 
The features are getting overwhelmed by foreign language results and domain names. As a result, our classifier gets confused by the features from foreign language pages.
 
We are working on, and will continue to work on, removing domain names and other noise features from the content of the page to improve the numbers.
 
: Suggestions and comments are more than welcome!
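 
As a rough illustration of the noise removal described above, the following Python sketch strips URLs and bare domain names from page text before features are extracted. It is an assumption about how this could be done, not the project's implementation; the patterns and the function name are hypothetical.

```python
import re

# Full URLs such as http://example.com/path
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
# Bare domain names such as example.com or www.example.co.uk.
# Note: this also matches "end.Next" when a space is missing after a full stop.
DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)


def strip_domains(text):
    """Remove URLs and bare domain names, then collapse the leftover whitespace."""
    text = URL_RE.sub(" ", text)
    text = DOMAIN_RE.sub(" ", text)
    return " ".join(text.split())


print(strip_domains("Visit http://chessmaniac.com/ or oaksretirement.com today"))
```

Stripping domains this way would also hide pages like the one under Interesting Cases, where the only signal is in the domain names, so it may be worth counting the removed domains as a separate feature rather than discarding them.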
 
 
This seems to be moving in the wrong direction.  Perhaps it is time to start a different approach?  Perhaps it is time to try to improve the regular expression results that were already almost to the goals? --[[Brandon]] 22:26, 22 November 2007 (PST)
 
 
====Numbers with Regular Expressions====
 
*Total: 37,526
 
*Correctly Marked as Adult: 33,447
 
*Correctly Marked as Not Adult: 1,097
 
*Incorrectly Marked as Adult: 321
 
*Incorrectly Marked as Not Adult: 2,661
 
 
: 1097/(321+1097)*100.0 = 77.4%
 
: 2661/(33447+2661)*100.0 = 7.4%
 
 
These results are not considering the Possibly Adult flag and they are most of the way to the goals.  Perhaps it would be fruitful to try and improve the regular expression?  I don't think that I've pushed it nearly as far as it will go. --[[Brandon]] 21:27, 25 November 2007 (PST)
 
 
== Interesting Cases ==
 
* [[NiceRoundAsses.com]] ... because all of the naughty words are buried in domain names ... but there are a lot of them
 
 
== See ==
 
* http://domains.googlesyndication.com/apps/domainpark/domainpark.cgi?client=ca-dp-adultcheck_xml&s=example.com
 
* repository is at nimbus.aboutus.com:/opt/git/adult-filter
 
  
 
[[Category:DevelopmentTask]]
 
 
</noinclude>
 
</noinclude>

Revision as of 04:52, 22 January 2008


What (summary)

Improve the AdSidebar adult content filter so that fewer non-adult pages are flagged and advertisements show up on more pages. Even more importantly, catch more of the pages that actually do have adult content on them so that Google doesn't get angry at us for showing ads on adult pages.

Why this is important

Revenue determines how many resources we have to spend on developing cool features and tools for our community. We need to improve our revenue.

DoneDone

  • Advertisements show up on at least 90% of the pages that are not adult content
  • Advertisements show up on at most 5% of the pages that are adult content
    • Statistics as of 7th Jan, 2008 are shown below.

Statistics

7th Jan, 2008

  • Correctly marked as adult: 32452 (92.11%)
  • Correctly marked as clean: 2224
  • Incorrectly marked as adult: 90 (3.89%)
  • Incorrectly marked as clean: 2781

Steps to get to DoneDone

  • Clean up the page
  • Brief Ali about the search improvements staging.
  • Merge the changes with the live branch.

