RelHunter: a machine learning method for relation extraction from text

Abstract

We propose RelHunter, a machine learning-based method for extracting structured information from text. RelHunter’s key idea is to model the target structures as a relation over entities. The modeling effort is thus reduced to identifying the entities and generating a candidate relation, both of which are simpler problems than the original one. RelHunter fits a broad spectrum of complex computational linguistics problems. We apply it to five tasks: phrase chunking, clause identification, hedge detection, quotation extraction, and dependency parsing. We compare RelHunter to token classification approaches through computational experiments on seven multilingual corpora, where RelHunter outperforms the token classification approaches by 2.14% on average. We also compare the derived systems against the state-of-the-art systems for each corpus. Our systems achieve state-of-the-art performance on three corpora: Portuguese phrase chunking, Portuguese clause identification, and English quotation extraction. The derived systems also perform well on the other four corpora.
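To make the idea of a relation over entities concrete, the following is a minimal sketch and not the authors' implementation. It assumes a chunking-like task in which the entities are tokens, the candidate relation is every (start, end) token pair within a fixed span length, and a binary decision keeps the pairs that belong to the target relation. The names generate_candidates, extract_relation, toy_is_member, and the max_len parameter are hypothetical, and the hand-written rule merely stands in for a learned classifier.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[int, int]  # (start index, end index), both inclusive


def generate_candidates(tokens: List[str], max_len: int = 4) -> List[Candidate]:
    """Candidate relation: all (start, end) pairs spanning at most max_len tokens."""
    n = len(tokens)
    return [(i, j) for i in range(n) for j in range(i, min(i + max_len, n))]


def extract_relation(
    tokens: List[str],
    is_member: Callable[[List[str], Candidate], bool],
    max_len: int = 4,
) -> List[Candidate]:
    """Keep only the candidate pairs that the binary decision accepts."""
    return [c for c in generate_candidates(tokens, max_len) if is_member(tokens, c)]


def toy_is_member(tokens: List[str], pair: Candidate) -> bool:
    # Hypothetical stand-in for a learned binary classifier:
    # accept two-token spans that start with the word "the".
    start, end = pair
    return end == start + 1 and tokens[start].lower() == "the"


tokens = ["The", "cat", "saw", "the", "dog"]
chunks = extract_relation(tokens, toy_is_member)
print(chunks)  # [(0, 1), (3, 4)]
```

In this framing, the task-specific work is confined to choosing the entities and the candidate generator, while the membership decision is left to a generic learner.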

Author information

Corresponding author

Correspondence to Eraldo R. Fernandes.

Additional information

This work was partially funded by CNPq and FAPERJ grants 557.128/2009-9 and E-26/170028/2008. The first author holds a CNPq doctoral fellowship and is supported by Instituto Federal de Educação, Ciência e Tecnologia de Goiás, Brazil.

Rights and permissions

Open Access. This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Fernandes, E.R., Milidiú, R.L. & Rentería, R.P. RelHunter: a machine learning method for relation extraction from text. J Braz Comput Soc 16, 191–199 (2010). https://doi.org/10.1007/s13173-010-0018-y

Keywords

  • Natural language processing
  • Entity relation extraction
  • Machine learning
  • Entropy Guided Transformation Learning