Top 200 Anchor Texts
- edit - 1876780 Wikipedia entities
- search - 258636 Wikipedia entities
- navigation - 258625 Wikipedia entities
- talk - 258465 Wikipedia entities
- mobile view - 256791 Wikipedia entities
- permanent link - 238645 Wikipedia entities
- read - 236670 Wikipedia entities
- view history - 236637 Wikipedia entities
- printable version - 236130 Wikipedia entities
- create a book - 236127 Wikipedia entities
- article - 230844 Wikipedia entities
- english - 84043 Wikipedia entities
- internal link - 15353 Wikipedia entities
- talk page - 12073 Wikipedia entities
- history - 3962 Wikipedia entities
- category - 3923 Wikipedia entities
- file - 3867 Wikipedia entities
- contribs - 3666 Wikipedia entities
- expansion - 2804 Wikipedia entities
- logs - 2640 Wikipedia entities
- discuss - 2453 Wikipedia entities
- details - 2426 Wikipedia entities
- related articles - 2260 Wikipedia entities
- view source - 2236 Wikipedia entities
- metadata - 1909 Wikipedia entities
- file history - 1904 Wikipedia entities
- file usage - 1904 Wikipedia entities
- global file usage - 1668 Wikipedia entities
- discography - 1523 Wikipedia entities
- info - 1463 Wikipedia entities
- user page - 1452 Wikipedia entities
- user contributions - 1431 Wikipedia entities
- watch - 1271 Wikipedia entities
- view - 1071 Wikipedia entities
- list - 1050 Wikipedia entities
- music - 991 Wikipedia entities
- deaths - 931 Wikipedia entities
- births - 922 Wikipedia entities
- canada - 920 Wikipedia entities
- delete - 919 Wikipedia entities
- disestablishments - 915 Wikipedia entities
- establishments - 915 Wikipedia entities
- france - 910 Wikipedia entities
- top - 862 Wikipedia entities
- link - 829 Wikipedia entities
- germany - 825 Wikipedia entities
- australia - 816 Wikipedia entities
- discussion - 773 Wikipedia entities
- links - 749 Wikipedia entities
- united states - 733 Wikipedia entities
- italy - 732 Wikipedia entities
- architecture - 717 Wikipedia entities
- state leaders - 688 Wikipedia entities
- literature - 680 Wikipedia entities
- georgia - 660 Wikipedia entities
- japan - 659 Wikipedia entities
- washington - 658 Wikipedia entities
- source - 648 Wikipedia entities
- india - 633 Wikipedia entities
- president - 630 Wikipedia entities
- art - 620 Wikipedia entities
- ireland - 615 Wikipedia entities
- flag - 607 Wikipedia entities
- russia - 602 Wikipedia entities
- greece - 599 Wikipedia entities
- spain - 593 Wikipedia entities
- mexico - 586 Wikipedia entities
- characters - 585 Wikipedia entities
- government - 569 Wikipedia entities
- poland - 555 Wikipedia entities
- people - 538 Wikipedia entities
- israel - 535 Wikipedia entities
- norway - 533 Wikipedia entities
- men - 532 Wikipedia entities
- education - 532 Wikipedia entities
- turkey - 531 Wikipedia entities
- netherlands - 530 Wikipedia entities
- works - 530 Wikipedia entities
- south africa - 525 Wikipedia entities
- quality scale - 523 Wikipedia entities
- episodes - 519 Wikipedia entities
- denmark - 512 Wikipedia entities
- film - 509 Wikipedia entities
- culture - 506 Wikipedia entities
- united kingdom - 504 Wikipedia entities
- seasons - 501 Wikipedia entities
- belgium - 498 Wikipedia entities
- science - 494 Wikipedia entities
- la liste des auteurs - 492 Wikipedia entities
- template - 485 Wikipedia entities
- project page - 484 Wikipedia entities
- portal - 477 Wikipedia entities
- importance scale - 476 Wikipedia entities
- economy - 474 Wikipedia entities
- geography - 472 Wikipedia entities
- philippines - 466 Wikipedia entities
- romania - 465 Wikipedia entities
- portugal - 462 Wikipedia entities
- sports - 461 Wikipedia entities
- austria - 459 Wikipedia entities
- sweden - 458 Wikipedia entities
- hungary - 457 Wikipedia entities
- switzerland - 454 Wikipedia entities
- purge - 452 Wikipedia entities
- smith - 451 Wikipedia entities
- pakistan - 448 Wikipedia entities
- women - 441 Wikipedia entities
- singapore - 430 Wikipedia entities
- luxembourg - 429 Wikipedia entities
- book - 425 Wikipedia entities
- players - 424 Wikipedia entities
- brazil - 424 Wikipedia entities
- china - 421 Wikipedia entities
- bulgaria - 420 Wikipedia entities
- sovereign states - 418 Wikipedia entities
- finland - 410 Wikipedia entities
- report - 408 Wikipedia entities
- politics - 407 Wikipedia entities
- cyprus - 402 Wikipedia entities
- notes - 400 Wikipedia entities
- full list - 399 Wikipedia entities
- egypt - 398 Wikipedia entities
- ukraine - 395 Wikipedia entities
- elections - 394 Wikipedia entities
- start - 392 Wikipedia entities
- king - 392 Wikipedia entities
- argentina - 378 Wikipedia entities
- scotland - 378 Wikipedia entities
- malaysia - 374 Wikipedia entities
- jackson - 373 Wikipedia entities
- iran - 373 Wikipedia entities
- archaeology - 368 Wikipedia entities
- armenia - 365 Wikipedia entities
- england - 362 Wikipedia entities
- czech republic - 358 Wikipedia entities
- brown - 354 Wikipedia entities
- croatia - 353 Wikipedia entities
- law - 349 Wikipedia entities
- hong kong - 347 Wikipedia entities
- serbia - 345 Wikipedia entities
- lebanon - 344 Wikipedia entities
- azerbaijan - 343 Wikipedia entities
- albania - 343 Wikipedia entities
- governor - 343 Wikipedia entities
- malta - 343 Wikipedia entities
- prime minister - 343 Wikipedia entities
- military - 341 Wikipedia entities
- discussion page - 340 Wikipedia entities
- south korea - 338 Wikipedia entities
- florida - 337 Wikipedia entities
- estonia - 334 Wikipedia entities
- union - 332 Wikipedia entities
- indonesia - 332 Wikipedia entities
- jordan - 332 Wikipedia entities
- demographics - 331 Wikipedia entities
- lithuania - 331 Wikipedia entities
- high - 329 Wikipedia entities
- german - 328 Wikipedia entities
- mid - 323 Wikipedia entities
- religion - 323 Wikipedia entities
- cities - 323 Wikipedia entities
- media - 323 Wikipedia entities
- slovenia - 323 Wikipedia entities
- french - 321 Wikipedia entities
- constitution - 319 Wikipedia entities
- slovakia - 318 Wikipedia entities
- awards - 318 Wikipedia entities
- michigan - 315 Wikipedia entities
- iceland - 315 Wikipedia entities
- california - 313 Wikipedia entities
- williams - 313 Wikipedia entities
- thailand - 312 Wikipedia entities
- texas - 311 Wikipedia entities
- usa - 311 Wikipedia entities
- parliament - 307 Wikipedia entities
- johnson - 307 Wikipedia entities
- kazakhstan - 305 Wikipedia entities
- records - 303 Wikipedia entities
- macedonia - 303 Wikipedia entities
- minnesota - 301 Wikipedia entities
- wales - 301 Wikipedia entities
- transport - 301 Wikipedia entities
- iraq - 299 Wikipedia entities
- chile - 299 Wikipedia entities
- football - 298 Wikipedia entities
- dated info - 297 Wikipedia entities
- vietnam - 297 Wikipedia entities
- green - 297 Wikipedia entities
- latvia - 296 Wikipedia entities
- white - 296 Wikipedia entities
- managers - 294 Wikipedia entities
- political parties - 294 Wikipedia entities
- central - 291 Wikipedia entities
- sri lanka - 288 Wikipedia entities
- television - 283 Wikipedia entities
- belarus - 280 Wikipedia entities
- members - 280 Wikipedia entities
- complete list - 277 Wikipedia entities
- moldova - 277 Wikipedia entities
- wilson - 275 Wikipedia entities
About
This project is our entry to the CommonCrawl contest. The idea is inspired by Google's release of its entity linking dataset, which provides a baseline for research on entity linking and other information retrieval and natural language processing tasks.
Human language is ambiguous, and synonymy and polysemy are fundamental problems in natural language processing (NLP) and information retrieval (IR). One approach to Word Sense Disambiguation (WSD) is to use an external ontology such as Wikipedia, and to determine the meaning of a word from the probabilities that it maps to each of the possible Wikipedia concepts. Our entry aims to build such a corpus of anchortext-WikipediaConcept-Count triples from the CommonCrawl dataset, so as to benefit research on WSD, NLP and IR. More specifically, we extract every anchortext (the text you click on in a webpage link) that points to a Wikipedia page, together with the Wikipedia page it points to. Based on this corpus, we developed this web application to demonstrate the anchortext-WikipediaConcept-Count structure.
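As a rough illustration of the extraction step described above, here is a minimal Python sketch that collects anchortext-WikipediaConcept pairs from HTML pages and counts them. It is not the code used for this entry: the class and function names are hypothetical, and the choices to lowercase anchortexts, to accept only `/wiki/` paths on `wikipedia.org` hosts, and to use the URL path as the concept identifier are assumptions made for the example.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse, unquote


class WikiAnchorParser(HTMLParser):
    """Collects (anchortext, Wikipedia concept) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._concept = None  # concept of the currently open Wikipedia link
        self._text = []       # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parsed = urlparse(href)
        host = parsed.netloc.lower()
        is_wikipedia = host == "wikipedia.org" or host.endswith(".wikipedia.org")
        if is_wikipedia and parsed.path.startswith("/wiki/"):
            # Assumption: the page title in the URL path identifies the concept.
            self._concept = unquote(parsed.path[len("/wiki/"):])
            self._text = []

    def handle_data(self, data):
        if self._concept is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._concept is not None:
            # Assumption: collapse whitespace and lowercase the anchortext.
            anchor = " ".join("".join(self._text).split()).lower()
            if anchor:
                self.pairs.append((anchor, self._concept))
            self._concept = None


def count_triples(pages):
    """Returns a Counter mapping (anchortext, concept) to its link count."""
    counts = Counter()
    for html in pages:
        parser = WikiAnchorParser()
        parser.feed(html)
        parser.close()
        counts.update(parser.pairs)
    return counts
```

Each key of the returned counter, together with its value, forms one anchortext-WikipediaConcept-Count triple. On a crawl-sized input the same per-page logic would have to run as a distributed job; the sketch only shows what happens to a single page.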
Application scenarios
- Given a concept (represented as a Wikipedia page), it can tell which terms people most commonly use to describe the concept. This can be seen as a form of "Explicit Topic Modeling".
- Given a sentence, it can help identify entities (person, location, organization) in the sentence and map them onto Wikipedia concepts.
- How do CommonCrawl and Google compare with regard to the richness and precision of the anchortext-WikipediaConcept-Count corpus?
- For entity linking tasks, will combining both corpora boost performance compared with using each dataset individually?
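The first two scenarios in the list above reduce to two lookups over the triples: the most frequent anchortexts for a concept, and the distribution of concepts for an anchortext. The Python sketch below shows one way to do this, assuming the triples are available as (anchortext, concept, count) tuples. The function names are hypothetical and the sample triples in the usage note are invented for illustration; they are not taken from the corpus.

```python
from collections import defaultdict


def build_indexes(triples):
    """Indexes (anchortext, concept, count) triples in both directions."""
    by_anchor = defaultdict(dict)   # anchortext -> {concept: count}
    by_concept = defaultdict(dict)  # concept -> {anchortext: count}
    for anchor, concept, count in triples:
        by_anchor[anchor][concept] = by_anchor[anchor].get(concept, 0) + count
        by_concept[concept][anchor] = by_concept[concept].get(anchor, 0) + count
    return by_anchor, by_concept


def top_anchors(by_concept, concept, k=5):
    """Most common anchortexts for a concept, ties broken alphabetically."""
    items = by_concept.get(concept, {}).items()
    return sorted(items, key=lambda kv: (-kv[1], kv[0]))[:k]


def concept_probabilities(by_anchor, anchor):
    """Relative frequency of each concept among links with this anchortext."""
    counts = by_anchor.get(anchor.lower(), {})
    total = sum(counts.values())
    if total == 0:
        return {}
    return {concept: n / total for concept, n in counts.items()}
```

For example, with the invented triples `("jaguar", "Jaguar", 3)`, `("jaguar", "Jaguar_Cars", 1)` and `("big cat", "Jaguar", 2)`, `top_anchors(by_concept, "Jaguar")` ranks "jaguar" ahead of "big cat", and `concept_probabilities(by_anchor, "jaguar")` assigns 0.75 to `Jaguar` and 0.25 to `Jaguar_Cars`. These raw relative frequencies are only a starting point for disambiguation; a real entity linker would also use the surrounding sentence.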
Code: https://github.com/chrishan/wikientities
Live Demo: http://wikientities.appspot.com/
If you find our work interesting, please vote for our entry on the CommonCrawl website and stay tuned for our release of the dataset.