Extracting compound terms from domain corpora

Lopes, Lucelene; Vieira, Renata; Finatto, Maria José; Martins, Daniel

doi:10.1007/s13173-010-0020-4

Original Paper
Open access
Published: 20 August 2010

Lucelene Lopes¹,
Renata Vieira¹,
Maria José Finatto² &
Daniel Martins¹

Journal of the Brazilian Computer Society volume 16, pages 247–259 (2010)

Abstract

The need for domain ontologies motivates research on extracting structured information from texts. A foundational part of this process is the identification of domain-relevant compound terms. This paper presents an evaluation of compound term extraction from a corpus in the domain of Pediatrics. Bigrams and trigrams were automatically extracted from a corpus composed of 283 texts from a Portuguese-language journal, Jornal de Pediatria, using three different extraction methods. Because these methods generate a large number of candidates, we analyzed the quality of the resulting terms for each method and at different cut-off points. The evaluation is reported using metrics such as precision, recall and f-measure, which are computed against a manually built reference list of domain-relevant compounds.
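The kind of evaluation described above can be illustrated with a minimal sketch. It assumes the candidates arrive as a list already ranked by some extraction method and that the reference list is a plain collection of strings; the function name `precision_recall_f`, the toy terms and the cut-off values are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Sequence, Tuple


def precision_recall_f(
    candidates: Sequence[str], reference: Iterable[str], cutoff: int
) -> Tuple[float, float, float]:
    """Score the top `cutoff` ranked candidates against a reference list."""
    reference_set = set(reference)
    selected = set(candidates[:cutoff])  # keep only candidates above the cut-off
    hits = len(selected & reference_set)  # candidates confirmed by the reference
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(reference_set) if reference_set else 0.0
    total = precision + recall
    # f-measure as the harmonic mean of precision and recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy data invented for illustration only; not drawn from the paper's corpus.
ranked_bigrams = ["aleitamento materno", "recém nascido", "presente estudo", "faixa etária"]
reference_terms = ["aleitamento materno", "recém nascido", "faixa etária", "leite materno"]

for cutoff in (2, 4):
    p, r, f = precision_recall_f(ranked_bigrams, reference_terms, cutoff)
    print(f"cutoff={cutoff}: precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")
```

In this toy run, raising the cut-off admits more candidates, which raises recall and lowers precision. That trade-off is the usual reason for comparing several cut-off points.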

Author information

Authors and Affiliations

PPGCC, FACIN, PUCRS, Av. Ipiranga, 6681, Porto Alegre, Brazil
Lucelene Lopes, Renata Vieira & Daniel Martins
DECLAVE, IL, UFRGS, Av. Bento Gonçalves, Porto Alegre, Brazil
Maria José Finatto

Corresponding author

Correspondence to Lucelene Lopes.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Lopes, L., Vieira, R., Finatto, M.J. et al. Extracting compound terms from domain corpora. J Braz Comput Soc 16, 247–259 (2010). https://doi.org/10.1007/s13173-010-0020-4

Received: 10 February 2010
Accepted: 01 August 2010
Published: 20 August 2010
Issue Date: November 2010
DOI: https://doi.org/10.1007/s13173-010-0020-4
