- Original Paper
- Open access
Extracting compound terms from domain corpora
Journal of the Brazilian Computer Society volume 16, pages 247–259 (2010)
Abstract
The need for domain ontologies motivates research on structured information extraction from texts. A foundational part of this process is the identification of domain-relevant compound terms. This paper presents an evaluation of compound term extraction from a corpus in the domain of Pediatrics. Bigrams and trigrams were automatically extracted from a corpus composed of 283 texts from a Portuguese journal, Jornal de Pediatria, using three different extraction methods. Because these methods generate a large number of candidates, we analyzed the quality of the resulting terms according to different methods and cut-off points. The evaluation is reported using precision, recall and f-measure, which are computed against a hand-made reference list of domain-relevant compounds.
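The evaluation scheme mentioned in the abstract (precision, recall and f-measure at a cut-off point, measured against a reference list) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `evaluate_at_cutoff`, the toy candidate ranking and the toy reference list are hypothetical, and the sketch assumes candidates are already ranked by some extraction score and are matched to reference terms by exact string equality.

```python
def evaluate_at_cutoff(ranked_candidates, reference_terms, cutoff):
    """Return (precision, recall, f_measure) for the top `cutoff` candidates.

    Assumption: a candidate counts as correct only if it matches a
    reference term exactly.
    """
    selected = set(ranked_candidates[:cutoff])
    reference = set(reference_terms)
    hits = len(selected & reference)

    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        # Balanced f-measure: harmonic mean of precision and recall.
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical toy data: extracted bigrams, best-scored first.
ranked = [
    "aleitamento materno",
    "recem nascido",
    "presente estudo",
    "idade gestacional",
]
# Hypothetical hand-made reference list.
reference = {
    "aleitamento materno",
    "recem nascido",
    "idade gestacional",
    "leite materno",
}

for k in (2, 4):
    p, r, f = evaluate_at_cutoff(ranked, reference, k)
    print(f"cutoff={k} precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")
```

Raising the cut-off admits more candidates, which typically trades precision for recall; comparing the three values across cut-off points is one way to judge where a candidate list stops being useful.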
Rights and permissions
Open Access. This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Lopes, L., Vieira, R., Finatto, M.J. et al. Extracting compound terms from domain corpora. J Braz Comput Soc 16, 247–259 (2010). https://doi.org/10.1007/s13173-010-0020-4
DOI: https://doi.org/10.1007/s13173-010-0020-4