Table 4 Data and evaluation methods for English

From: A review on Relation Extraction with an eye on Portuguese

| References | Data/corpora | Data size | Method | Evaluation | Performance (%) |
| --- | --- | --- | --- | --- | --- |
| Brin [10] | Web pages | 24 million pages | Exact pattern matching | Manual evaluation of 20 books selected from a list of over 150,000 | 19 correct books (95 %) |
| Agichtein and Gravano [1] | North American News corpus | 300,000 newspapers | Matching with a similarity function | Manual evaluation of a set of 100 tuples | 93 correct tuples (93 %) |
| Hasegawa et al. [54] | Articles from New York Times | 1 year (1995) | Clustering | Manual evaluation of the relations for 2 domains | Person-GPE F = 80 %, Company-Company F = 75 % |
| Pantel and Pennacchiotti [79] | Articles from TREC-9 and CHEM | TREC-9 = 5,951,432 words, CHEM = 313,590 words | Weakly supervised classifier | Manual annotation of 680 instances from TREC and CHEM corpora (2 experts) | TREC: part-of P = 69.9 %, succession P = 49 %; CHEM: is-a P = 76 %, reaction P = 91.4 %, production P = 55.8 % |
| Carlson et al. [17] | Web pages | 200 million pages | Coupled semi-supervised learning | Freebase database as gold standard | Category average P = 83 %; relation average P = 84 % |
| Li et al. [64] | Wikipedia and YAGO project | Wikipedia = 4,556,821 pages, YAGO = 67,973 entity pairs | Semi-supervised multi-view ranking | 5 relation types extracted by the YAGO project as gold standard | Relation average = 39 % |
| Banko and Cafarella [3], Yates et al. [102] | Web pages | 9 million pages | Naive Bayes | Manual evaluation of 400 tuples (3 experts) | 80.4 % correct tuples |
| Banko and Etzioni [4] | Sent500 corpus [13] | Sent500 = 500 sentences | Conditional Random Fields | Small set of labeled data for 4 relations from Sent500 | Open relation F = 59.8 %; pre-specified relation F = 29.5 % |
| Zhu et al. [105] | Sent500 corpus and Web1M corpus | Sent500 = 500 sentences, Web1M = 1 million blocks of Web pages | Markov Logic Networks | Manual evaluation of the extracted tuples from Sent500 | F = 76.4 % |
| Wu [99], Wu and Weld [100] | WSJ from Penn Treebank, Wikipedia and Web pages | | Conditional Random Fields | Manual evaluation of 300 sentences from each corpus (2 experts) | WSJ F = 64.7 %, Wikipedia F = 57.2 %, Web F = 65 % |
| Culotta et al. [25] | Articles from Wikipedia | 271 articles | Conditional Random Fields | Manual annotation of the 53 family relations | F = 61.36 % |
| Li et al. [65] | Articles from New York Times, articles from Wikipedia [25] | New York Times = 150 articles, Wikipedia = 271 articles | Conditional Random Fields | Manual annotation of the relations | New York Times Employment F = 80 %, Wikipedia personal/social F = 51 % |
| Fader et al. [36] | Web pages | 500 sentences | Logistic regression classifier | Manual evaluation of each extraction as correct or incorrect (2 experts) | F = 69.8 % |
| Liu et al. [66] | Expert-curated corpus | 150K words | Semantic interpretation approach | Manual annotation of 565 relation instances for protein-organism-location | F = 74.9 % |
| Gamallo et al. [48] | Sentences from Wikipedia in English, Spanish, Galician, Portuguese (2010) | English = 78,826,696, Spanish = 21,208,089, Galician = 1,461,705, Portuguese = 11,714,672 | Unsupervised extraction of verb-based triples | Manual evaluation of 200 sentences from English Wikipedia (2 experts) | P = 68 % |
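
The Performance column mixes raw correct counts, precision (P) and F-measure (F). As a minimal sketch of how such figures are usually derived from manual judgments, the Python below computes precision from counts and an F-measure from precision and recall. The function names are hypothetical, and the table does not state which F variant each study used, so the balanced setting (beta = 1) is an assumption.

```python
def precision(correct: int, evaluated: int) -> float:
    """Share of manually judged extractions that were marked correct."""
    if evaluated <= 0:
        raise ValueError("evaluated must be positive")
    return correct / evaluated


def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    """F-measure of precision p and recall r.

    beta = 1 gives the balanced harmonic mean (F1); this default is an
    assumption, since the table does not say which variant was reported.
    """
    if p == 0 and r == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


# Counts taken from the rows for Brin [10] and Agichtein and Gravano [1].
print(f"{precision(19, 20):.0%}")   # 95%
print(f"{precision(93, 100):.0%}")  # 93%
```

Because the studies use different corpora, sample sizes and judging protocols, figures computed this way describe each evaluation on its own terms and are not directly comparable across rows.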