
scriptLattes: an open-source knowledge extraction system from the Lattes platform


The Lattes platform is the major scientific information system maintained by the National Council for Scientific and Technological Development (CNPq). The platform manages the curricular information of researchers and institutions working in Brazil through the so-called Lattes Curriculum. However, the public information is available only researcher by researcher, and the platform does not automatically generate reports of the scientific production of research groups. It is thus difficult to extract and summarize useful knowledge for medium to large groups of researchers. This paper describes the design, implementation and experiences with scriptLattes, an open-source system that creates academic reports for groups from curricula in the Lattes Database. The scriptLattes system is composed of the following modules: (a) data selection, (b) data preprocessing, (c) redundancy treatment, (d) collaboration graph generation among group members, (e) research map generation based on geographical information, and (f) automatic report creation of bibliographical, technical and artistic production, and academic supervisions. The system has been extensively tested on a wide variety of research groups at Brazilian institutions, and the generated reports offer an alternative way to easily extract knowledge from data in the context of the Lattes platform. The source code, usage instructions and examples are available at
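To make modules (c) and (d) more concrete, the following is a minimal sketch of how redundancy treatment could feed collaboration graph generation. It is an illustration under stated assumptions, not the scriptLattes implementation: the function names, the toy `curricula` data, and the choice to treat two entries as the same publication when their normalized titles match exactly are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
import re
import unicodedata


def normalize_title(title):
    """Lowercase and strip accents and punctuation so near-identical titles compare equal."""
    decomposed = unicodedata.normalize("NFKD", title)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", without_accents.lower()).strip()


def merge_duplicates(curricula):
    """Redundancy treatment (assumed rule): map each normalized title to the members listing it."""
    owners = defaultdict(set)
    for member, titles in curricula.items():
        for title in titles:
            owners[normalize_title(title)].add(member)
    return owners


def collaboration_graph(owners):
    """Collaboration graph: edge weight = number of publications shared by a pair of members."""
    edges = defaultdict(int)
    for members in owners.values():
        for a, b in combinations(sorted(members), 2):
            edges[(a, b)] += 1
    return dict(edges)


# Hypothetical group: each member lists the same works with slightly different spellings.
curricula = {
    "Ana": ["Visualização de Redes", "Graph Mining: A Survey"],
    "Bruno": ["Visualizacao de redes.", "Shape Analysis"],
    "Carla": ["graph mining - a survey", "Shape analysis"],
}
owners = merge_duplicates(curricula)
graph = collaboration_graph(owners)
```

Exact matching on normalized titles is the simplest possible rule; a real system would likely need approximate string comparison to catch typos and abbreviated titles.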




Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Mena-Chalco, J.P., Junior, R.M.C. scriptLattes: an open-source knowledge extraction system from the Lattes platform. J Braz Comp Soc 15, 31–39 (2009).
