Difference between revisions of "StructuredDataFromWikiPages"

Latest revision as of 16:00, 2 December 2008


What (summary)

Extract the data that users have entered on wiki pages and turn it into structured data for easier manipulation.

We need to extract

  • Contact info
    • Address
    • Phone #
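
As a rough illustration of what a structured record for the fields above could look like, here is a minimal Python sketch. The class name ContactInfo, its field names, and the sample values are assumptions made for illustration; they are not an existing schema.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ContactInfo:
    """Hypothetical structured record for the contact fields listed above."""
    address: Optional[str] = None
    phone: Optional[str] = None

    def is_empty(self) -> bool:
        # True when no field has been filled in yet.
        return self.address is None and self.phone is None


# Made-up sample values, only to show the shape of a filled-in record.
record = ContactInfo(address="123 Main St, Portland, OR", phone="503-555-0100")
print(asdict(record))
```

Keeping every field optional lets a record be filled in gradually as more data is extracted from a page.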

What is structured data?

What else besides address is important for us?

Why this is important

We are moving towards more highly structured data, but we still need to make use of the large quantity of data users have already entered on our site.

DoneDone

  • Easily identify which data human edits have added to or changed on a wiki page. (standard diff may work?)
  • Apply heuristics (section, regular expression, machine learning, something else) to determine if a piece of data should belong in a structured field.
  • If all human-added data can be extracted, indicate that the entire wiki page should be deleted.
  • If some human-added data can't be identified and extracted, return wikitext containing only that unidentified human data, with all bot-created data removed.
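
The four points above can be sketched roughly as follows, assuming a line-based diff against the original bot scrape is good enough and using a single regular expression as the only heuristic. The function names, the phone pattern, and the sample page text are all hypothetical; a real version would need more heuristics and more fields.

```python
import difflib
import re
from typing import Dict, List, Tuple

# Assumed pattern for North American style phone numbers (illustrative only).
PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}")


def human_added_lines(bot_text: str, current_text: str) -> List[str]:
    """Return lines that appear in the current revision but not in the bot scrape."""
    diff = difflib.ndiff(bot_text.splitlines(), current_text.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]


def extract(bot_text: str, current_text: str) -> Tuple[Dict[str, str], str, bool]:
    """Return (structured fields, leftover wikitext, whether the page can be deleted)."""
    fields: Dict[str, str] = {}
    leftover: List[str] = []
    for line in human_added_lines(bot_text, current_text):
        if not line.strip():
            continue  # ignore blank lines added by humans
        match = PHONE_RE.search(line)
        if match and "phone" not in fields:
            # Simplification: the whole line counts as extracted once a phone matches.
            fields["phone"] = match.group(0)
        else:
            leftover.append(line)
    # The page can go only when nothing human-added is left unexplained.
    delete_page = not leftover
    return fields, "\n".join(leftover), delete_page


bot = "== Contact ==\nExample Corp\n"
current = bot + "Phone: (503) 555-0100\nWe also sell used bicycles.\n"
print(extract(bot, current))
```

In this sample the phone number is extracted, the sentence about bicycles is returned as leftover wikitext, and the page is not marked for deletion.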


Steps to get to DoneDone

  • Build many test cases: pick many random human-edited pages.
    • Pull out revision histories (or at least diffs to compare to original bot scrape)
    • Identify and extract human-edited data yourself. Great fun!
  • Make the test cases pass. (In the order below?)
    • First identify all human-edited data
    • Then classify and extract said data
    • Then determine if a page should be deleted, and, if not, which data should be left behind.
  • Throw a wild and crazy party
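
One possible shape for the hand-built test cases in the steps above is sketched below, assuming an extractor that takes the bot scrape and the current revision and returns the extracted fields, the leftover wikitext, and a delete flag. PageCase, run_cases, null_extractor, and the sample case are hypothetical names and data, not existing code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumed return shape of an extractor: (fields, leftover wikitext, delete page?)
Result = Tuple[Dict[str, str], str, bool]


@dataclass
class PageCase:
    """One hand-labelled page: inputs plus the answer a human worked out."""
    name: str
    bot_text: str       # original bot scrape
    current_text: str   # latest human-edited revision
    expected_fields: Dict[str, str]
    expected_leftover: str
    expected_delete: bool


def run_cases(extract: Callable[[str, str], Result], cases: List[PageCase]) -> List[str]:
    """Run an extractor over every case and return the names of failing cases."""
    failures = []
    for case in cases:
        got = extract(case.bot_text, case.current_text)
        want = (case.expected_fields, case.expected_leftover, case.expected_delete)
        if got != want:
            failures.append(case.name)
    return failures


CASES = [
    PageCase(
        name="phone-only",
        bot_text="Example Corp\n",
        current_text="Example Corp\nPhone: 503-555-0100\n",
        expected_fields={"phone": "503-555-0100"},
        expected_leftover="",
        expected_delete=True,
    ),
]


def null_extractor(bot_text: str, current_text: str) -> Result:
    # Stand-in that extracts nothing, so labelled cases are expected to fail.
    return {}, current_text, False


print(run_cases(null_extractor, CASES))
```

Starting from a stand-in that fails every case makes progress visible as each stage (identify, classify, decide on deletion) is implemented.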

Discussion

Zoetrope is a research project at the University of Washington that carries this idea further by adding history and a drag-and-drop interface. Watch the video:

  • http://uwnews.org/article.asp?articleID=45255

