IncreasePagesWithAdvertisements

Revision as of 06:44, 11 January 2008 by Ali Aslam (Steps to get to DoneDone: figure out mediawiki entry point.)




What (summary)

Improve the AdSidebar adult content filter so that fewer non-adult pages are flagged and advertisements show up on more pages. Even more importantly, catch more of the pages that actually do have adult content so that Google doesn't get angry at us for showing ads on adult pages.

Why this is important

Revenue determines how many resources we have to spend on developing cool features and tools for our community. We need to improve our revenue.

DoneDone

  • Advertisements show up on at least 90% of the pages that are not adult content
  • Advertisements show up on at most 5% of the pages that are adult content
    • Statistics as of 7th Jan, 2008 are shown below.

Statistics

7th Jan, 2008

  • Correctly marked as adult: 32452 (92.11%)
  • Correctly marked as clean: 2224
  • Incorrectly marked as adult: 90 (3.89%)
  • Incorrectly marked as clean: 2781
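
The percentages above can be reproduced from the four counts. The following is a minimal sketch in Python; the function and key names are illustrative and not taken from our code base, and it assumes that advertisements are shown exactly on the pages the filter marks as clean.

```python
def filter_rates(true_adult, true_clean, false_adult, false_clean):
    """Return per-class rates (as fractions) from the four filter counts.

    true_adult  : adult pages correctly marked as adult
    true_clean  : clean pages correctly marked as clean
    false_adult : clean pages incorrectly marked as adult
    false_clean : adult pages incorrectly marked as clean
    """
    adult_pages = true_adult + false_clean
    clean_pages = true_clean + false_adult
    return {
        "adult_caught": true_adult / adult_pages,
        "adult_missed": false_clean / adult_pages,
        "clean_with_ads": true_clean / clean_pages,
        "clean_flagged": false_adult / clean_pages,
    }


# Counts from 7th Jan, 2008.
rates = filter_rates(32452, 2224, 90, 2781)
for name, value in rates.items():
    print(f"{name}: {value:.2%}")

# DoneDone targets: ads on at least 90% of clean pages,
# and on at most 5% of adult pages.
meets_clean_goal = rates["clean_with_ads"] >= 0.90
meets_adult_goal = rates["adult_missed"] <= 0.05
```

With the 7th Jan counts, 96.11% of clean pages would show ads and 7.89% of adult pages would slip through, so under this reading the first target is met and the second is not.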

Steps to get to DoneDone

  • Tinker with the order of the index.html and index.php in the apache directive "DirectoryIndex" to determine if the order matters. It matters.
  • Are the differences between apache 1.3, 2.0 and 2.2 significant, and what about apache 2.1? Apache 2.1 was a development branch, not for public consumption. There are major differences between apache 1.3 and 2.0, and significant changes between 2.0 and 2.2. So read the documentation for the version of apache you're running.
  • Learn how to quickly make sense of the documentation of a particular apache directive.
  • Find and understand apache's "DirectoryIndex" directive documentation.
  • Understand apache configuration, specifically which files get run and how they are chosen. index.html is first, then comes index.php.
  • Determine which file gets run first when mediawiki is accessed through the web server and from the command line.
  • Understand the aboutus mediawiki configuration.
  • Determine the exact differences between our local setup and aboutustest's. Configurations are different.
  • Test the extension on the aboutustest server.
  • Port code base from compost to mediawiki. The rationale behind it is that code in mediawiki scales a lot better than that in compost.
  • Adult Filter currently only trips on adult keywords. Google also gets grumpy on sites that have alcohol and gambling content. There is a case to incorporate these into our adult filter. Let go for now. Will include if and when google asks us to.
  • Create a webservice in compost to check a particular page for adult content.
    • Test the webservice through local wiki.
  • Replace existing adult content filter in php with a call to this web service.
    • Set up adsidebar extension to work in local wiki.
  • Collect a sample of pages
    • hand-audit for adult pages in sample
  • Partition the adult content keywords into levels of suggestiveness
    • If 1 of these words is detected ... label as adult
    • If 2 or more of these words show up at least once each in the page ... label as adult
    • If 3 or more of these words show up at least once each in the page ... label as adult
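
The keyword levels in the last step can be sketched as follows. This is only one reading of the rule: each level has its own word list and its own threshold of distinct words (1, 2 or 3). The word lists and the names TIERS and is_adult are placeholders for illustration; the real lists are not given on this page.

```python
import re

# Hypothetical keyword tiers: (minimum distinct words required, word set).
# The most suggestive words need only one hit; milder words need more.
TIERS = [
    (1, {"porn", "xxx"}),
    (2, {"nude", "erotic", "escort"}),
    (3, {"adult", "sexy", "babes", "hot"}),
]


def is_adult(text):
    """Label text as adult if any tier reaches its distinct-word threshold."""
    # Distinct lowercase words; a word repeated many times counts once.
    words = set(re.findall(r"[a-z]+", text.lower()))
    return any(len(words & keywords) >= threshold
               for threshold, keywords in TIERS)
```

For example, a page containing only "hot" and "adult" from the mildest tier stays clean, while a single word from the first tier is enough to label the page as adult.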

(December 5, 2007)

Objectives

  • Today we will continue to clean up the test data so that we can verify and remove all the files which are actually not adult pages.
  • Also, will try to add some very obvious keywords in foreign languages to our RE.

Done

We skipped the first part, just to be sure that we check as many pages as possible before we get rid of them.

  • We added keywords for four languages, namely French, Dutch, Russian and Japanese.
  • This has produced the following numbers:
  • True positives: 32335
  • True negatives: 1281
  • False positives: 137
  • False negatives: 3694

The results, which do not reflect a major change in the numbers, suggest that the pages are indeed mostly clean.

Current Status (December 4, 2007)

Today we tweaked the regular expressions some more. Analyzed both false positives and false negatives. Improved the results a notch. The current results are as follows:

  • True positives: 31928
  • True negatives: 1287
  • False positives: 131
  • False negatives: 4113
1287 / (1287 + 131) = 90.76 %
4113 / (4113 + 31928) = 11.4 %

Issues

We still don't know how to handle webpages with text in foreign languages. We tried to work out some foreign-language keywords, with little success. Is leaving out pages in foreign languages a good idea? I am inclined to do this until we find a workable method to handle such pages.

Today we started out by hand cleaning data to remove files that were obviously not adult but had the adult tag on them. Will continue on this tomorrow so that we can get a more accurate number for False Negatives.

Current Status

We made some changes to the code. There were a lot of cases where innocent words that contain suggestive words as substrings were being marked adult. We added functionality to check for these words and it has improved results.
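
A minimal sketch of one way to do such a check, assuming a list of known innocent words that is blanked out before the suggestive patterns are searched. Both lists and the function name are made up for illustration and are not our actual regular expressions.

```python
import re

# Hypothetical patterns, for illustration only.
SUGGESTIVE = re.compile(r"sex|cock", re.IGNORECASE)
INNOCENT = re.compile(
    r"\b(?:essex|sussex|middlesex|cocktails?|peacocks?)\b",
    re.IGNORECASE,
)


def has_suggestive_word(text):
    """Search for suggestive substrings after removing known innocent words."""
    cleaned = INNOCENT.sub(" ", text)
    return SUGGESTIVE.search(cleaned) is not None
```

With these lists, "Cocktail bars in Essex" is no longer flagged, while "Essex sex shop" still is.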

There are 4255 pages in our test set whose only recognizable feature is the adult tag embedded within them. Keeping those pages, the results are:

  • True positives: 31129
  • True negatives: 1246
  • False positives: 172
  • False negatives: 4979

If we do filter pages based on the adult category tag, the results are:

  • True positives: 36100
  • True negatives: 1246
  • False positives: 172
  • False negatives: 8

There is a difference of ~700 pages (2%) which are being marked clean incorrectly if we don't use the adult tag filter. This number needs to be brought down.

Is it possible to have some more data? I observed that quite a few pages are placed in the wrong category. Examples are:

Current Numbers

Numbers and percentages as of November 23, 2007
The following results are achieved after considering the presence of the Adult Tag on the pages.

  • Correctly marked Adult: 36100
  • Correctly marked as NotAdult: 1220
  • Incorrectly marked as Adult: 198
  • Incorrectly marked as NotAdult: 8
1220 / ( 198 + 1220 ) = 86%
8 / ( 36100 + 8 ) = 0.022%
Unfortunately we can't consider the presence of the Adult Tag. The vast majority of our pages that are adult content do not have it. Our ground truth data is skewed because we used the presence of the Adult flag to create our test set. So almost every page in the Adult set has the tag, so using it gives very incorrect results. --Brandon 21:30, 25 November 2007 (PST)

Numbers and percentages as of today (November 22, 2007)

  • Correctly marked Adult: 25495
  • Correctly marked as NotAdult: 1023
  • Incorrectly marked as Adult: 395
  • Incorrectly marked as NotAdult: 10613
1023 / (395 + 1023) = 72.1 %
10613 / (25495 + 10613) = 29.3 %

Issues

The features are getting overwhelmed by foreign language results and domain names. As a result, our classifier gets confused by the features from foreign language pages. We are trying, and will continue, to remove domain names and other noise features from the content of the page to improve the numbers.
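
As a rough sketch of the domain-name clean-up, the snippet below strips URLs and bare domain names from page text before features are extracted. The pattern, the short list of top-level domains and the function name are assumptions for illustration, not the code we run.

```python
import re

# Rough pattern for URLs and bare domain names such as "example.com".
# The top-level domain list is a small, hypothetical sample.
DOMAIN = re.compile(
    r"(?:https?://)?(?:[a-z0-9-]+\.)+(?:com|net|org|info|biz)\b(?:/\S*)?",
    re.IGNORECASE,
)


def strip_domains(text):
    """Replace URLs and domain names with a space and collapse whitespace."""
    return " ".join(DOMAIN.sub(" ", text).split())
```

One trade-off: as the NiceRoundAsses.com entry under Interesting Cases shows, some pages carry their adult signal mostly in domain names, so stripping them may hide exactly the words the filter needs.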

Suggestions and comments are more than welcome!

This seems to be moving in the wrong direction. Perhaps it is time to start a different approach? Perhaps it is time to try to improve the regular expression results that were already almost to the goals? --Brandon 22:26, 22 November 2007 (PST)

Numbers with Regular Expressions

  • Total: 37,526
  • Correctly Marked as Adult: 33,447
  • Correctly Marked as Not Adult: 1,097
  • Incorrectly Marked as Adult: 321
  • Incorrectly Marked as Not Adult: 2,661
1097/(321+1097)*100.0 = 77.3%
2661/(33447+2661)*100.0 = 7.3%

These results are not considering the Possibly Adult flag and they are most of the way to the goals. Perhaps it would be fruitful to try and improve the regular expression? I don't think that I've pushed it nearly as far as it will go. --Brandon 21:27, 25 November 2007 (PST)

Interesting Cases

  • NiceRoundAsses.com ... because all of the naughty words are buried in domain names ... but there are a lot of them
