
Extracting compound terms from domain corpora


The need for domain ontologies motivates research on structured information extraction from texts. A foundational part of this process is the identification of domain-relevant compound terms. This paper presents an evaluation of compound term extraction from a corpus in the domain of Pediatrics. Bigrams and trigrams were automatically extracted from a corpus composed of 283 texts from a Portuguese journal, Jornal de Pediatria, using three different extraction methods. Because these methods generate a large number of candidates, we analyzed the quality of the resulting terms for each method at different cut-off points. The evaluation is reported using precision, recall and f-measure, which are computed against a manually built reference list of domain-relevant compounds.
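As an illustration of the evaluation described above, the following minimal sketch scores a ranked candidate list against a reference list at several cut-off points. The function name `evaluate_at_cutoff`, the toy terms, and the chosen cut-offs are assumptions made for this example; they are not taken from the paper, and the ranking itself stands in for whichever extraction method produced it.

```python
def evaluate_at_cutoff(ranked_candidates, reference_terms, cutoff):
    """Score the top `cutoff` candidates against a reference list.

    Returns (precision, recall, f_measure), where f_measure is the
    harmonic mean of precision and recall.
    """
    selected = set(ranked_candidates[:cutoff])
    reference = set(reference_terms)
    hits = len(selected & reference)

    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical toy data (not from the paper's corpus): candidates are
# assumed to be sorted from most to least relevant by some extractor.
ranked = [
    "aleitamento materno",
    "recém nascido",
    "presente estudo",
    "faixa etária",
    "pressão arterial",
]
reference = [
    "aleitamento materno",
    "recém nascido",
    "pressão arterial",
    "leite materno",
]

for cutoff in (2, 4, 5):
    p, r, f = evaluate_at_cutoff(ranked, reference, cutoff)
    print(f"cutoff={cutoff}: precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")
```

Lower cut-offs tend to keep only the highest-ranked candidates, which usually favors precision, while higher cut-offs admit more candidates and usually favor recall; comparing f-measure across cut-offs is one way to weigh the two.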



Author information



Corresponding author

Correspondence to Lucelene Lopes.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Lopes, L., Vieira, R., Finatto, M.J. et al. Extracting compound terms from domain corpora. J Braz Comput Soc 16, 247–259 (2010).



  • Term extraction
  • Statistical and linguistic methods
  • Ontology automatic construction
  • Extraction from corpora