Table 2 Corpora size

From: Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese

| Corpus | Unit | Train | Development | Test | Total | Tagset size |
|---|---|---|---|---|---|---|
| Mac-Morpho v1 | Sentences | 42,022 | 2,211 | 9,141 | 53,374 | 41 |
| | Tokens | 957,439 | 50,232 | 213,794 | 1,221,465 | |
| Mac-Morpho v2 | Sentences | 42,742 | 2,249 | 4,999 | 49,990 | 30 |
| | Tokens | 807,818 | 43,145 | 94,995 | 945,958 | |
| Mac-Morpho v3 | Sentences | 37,948 | 1,997 | 9,987 | 49,932 | 26 |
| | Tokens | 728,497 | 38,881 | 178,373 | 945,751 | |
| Tycho Brahe | Sentences | 29,163 | 1,535 | 10,234 | 40,932 | 265 |
| | Tokens | 734,922 | 40,679 | 259,991 | 1,035,592 | |