Currently the dataset is relatively small. It will be continuously updated as our Hadoop job processes more data.

Top 200 Anchor Texts

About

This project is our entry to the CommonCrawl contest. The idea is inspired by Google's release of its entity linking dataset, which provides a baseline for research on entity linking and other information retrieval and natural language processing tasks.

Human language is ambiguous, and synonymy and polysemy are fundamental problems in natural language processing (NLP) and information retrieval (IR). One approach to Word Sense Disambiguation (WSD) uses an external ontology, e.g. Wikipedia, to determine the meaning of a word from the probabilities that it maps to each of the possible Wikipedia concepts. Our entry aims to build such a corpus of anchortext-WikipediaConcept-Count triples from the CommonCrawl dataset, so as to benefit research on WSD, NLP and IR. More specifically, we extract every anchortext (the text you click on in a webpage link) that points to a Wikipedia page, together with the Wikipedia page it points to, and count how often each pair occurs. Based on the corpus, we developed this web application to demonstrate the anchortext-WikipediaConcept-Count structure.
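The extraction and counting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Hadoop job used for the project: the class and function names are hypothetical, anchor texts are lowercased and whitespace-normalized (a choice made here, not confirmed by the project), and a link counts as pointing to Wikipedia when its host is under wikipedia.org and its path starts with /wiki/.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class WikipediaAnchorParser(HTMLParser):
    """Collects (anchor text, Wikipedia page title) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._concept = None  # page title of the Wikipedia link currently open
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        url = urlparse(dict(attrs).get("href") or "")
        host = url.hostname or ""
        is_wikipedia = host == "wikipedia.org" or host.endswith(".wikipedia.org")
        if is_wikipedia and url.path.startswith("/wiki/"):
            self._concept = unquote(url.path[len("/wiki/"):])
            self._text = []

    def handle_data(self, data):
        if self._concept is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._concept is not None:
            # Assumption: normalize whitespace and case before counting.
            anchor = " ".join("".join(self._text).split()).lower()
            if anchor:
                self.pairs.append((anchor, self._concept))
            self._concept = None


def count_triples(pages):
    """Returns a Counter mapping (anchor text, concept) to its link count."""
    counts = Counter()
    for html in pages:
        parser = WikipediaAnchorParser()
        parser.feed(html)
        parser.close()
        counts.update(parser.pairs)
    return counts


def concept_probabilities(counts, anchor):
    """Estimates P(concept | anchor text) from the raw counts."""
    matching = {c: n for (a, c), n in counts.items() if a == anchor}
    total = sum(matching.values())
    return {c: n / total for c, n in matching.items()}
```

Each entry of the resulting counter is one anchortext-WikipediaConcept-Count triple, and dividing a count by the total for its anchor text gives the mapping probability used for disambiguation.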

Application scenarios

  • Given a concept (represented as a Wikipedia page), it can tell which terms people most commonly use to describe that concept. This can be seen as a form of "Explicit Topic Modeling".
  • Given a sentence, it can help identify entities (person, location, organization) in the sentence and map them onto Wikipedia concepts.
  • It allows a comparison of CommonCrawl and Google with regard to the richness and precision of their anchortext-WikipediaConcept-Count corpora.
  • For entity linking tasks, it lets us ask whether combining both corpora boosts performance compared with using each dataset individually.
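The first two scenarios can be illustrated with a small Python sketch. It is not the code behind the demo: the triples below are invented toy data, the function names are hypothetical, and the entity linker is a deliberately simple greedy longest-match that picks the most frequent concept for each matched phrase without looking at context or handling punctuation.

```python
from collections import defaultdict

# Hypothetical toy triples (anchor text, Wikipedia concept, count); not real corpus data.
TRIPLES = [
    ("big apple", "New_York_City", 40),
    ("new york", "New_York_City", 120),
    ("new york", "New_York_(state)", 30),
    ("nyc", "New_York_City", 75),
    ("apple", "Apple_Inc.", 90),
    ("apple", "Apple", 60),
]


def build_indexes(triples):
    """Indexes the triples by concept and by anchor text."""
    by_concept = defaultdict(dict)
    by_anchor = defaultdict(dict)
    for anchor, concept, count in triples:
        by_concept[concept][anchor] = count
        by_anchor[anchor][concept] = count
    return by_concept, by_anchor


def top_terms(by_concept, concept, k=3):
    """Returns the k anchor texts most often used to link to a concept."""
    terms = by_concept.get(concept, {})
    return sorted(terms.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


def link_entities(by_anchor, sentence, max_len=3):
    """Greedy longest-match linking of phrases to their most frequent concept."""
    tokens = sentence.lower().split()
    links = []
    i = 0
    while i < len(tokens):
        # Try the longest phrase starting at token i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in by_anchor:
                concepts = by_anchor[phrase]
                best = max(concepts, key=lambda c: (concepts[c], c))
                links.append((phrase, best))
                i += n
                break
        else:
            i += 1  # no phrase starting here is a known anchor text
    return links


by_concept, by_anchor = build_indexes(TRIPLES)
print(top_terms(by_concept, "New_York_City", k=2))
print(link_entities(by_anchor, "I moved to New York for Apple"))
```

A real linker would normally weigh the surrounding words as well; choosing the most frequent concept is only the simplest baseline that the counts make possible.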

Code: https://github.com/chrishan/wikientities
Live Demo: http://wikientities.appspot.com/

If you find our work interesting, please vote for our entry on the CommonCrawl website and stay tuned for the release of the dataset.

The corpus was generated from a single segment (about 3/4 TB) of the CommonCrawl data. There are 10,330,169 anchors, 2,277,936 unique anchor texts and 6,697,367 unique concepts in the corpus.