From: Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese
| Corpus | Unit | Train | Development | Test | Total | Tagset size |
|---|---|---|---|---|---|---|
| Mac-Morpho v1 | Sentences | 42,022 | 2,211 | 9,141 | 53,374 | 41 |
| | Tokens | 957,439 | 50,232 | 213,794 | 1,221,465 | |
| Mac-Morpho v2 | Sentences | 42,742 | 2,249 | 4,999 | 49,990 | 30 |
| | Tokens | 807,818 | 43,145 | 94,995 | 945,958 | |
| Mac-Morpho v3 | Sentences | 37,948 | 1,997 | 9,987 | 49,932 | 26 |
| | Tokens | 728,497 | 38,881 | 178,373 | 945,751 | |
| Tycho Brahe | Sentences | 29,163 | 1,535 | 10,234 | 40,932 | 265 |
| | Tokens | 734,922 | 40,679 | 259,991 | 1,035,592 | |
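
The split sizes above can be sanity-checked with a few lines of Python. The following is a minimal sketch, not part of the original article: the names `SPLITS`, `check_totals`, and `mean_sentence_length` are hypothetical, and the numbers are copied by hand from the table. It checks that the train, development, and test counts add up to the reported total, and derives the average sentence length of a corpus from its totals.

```python
# Split sizes copied by hand from the table: (train, development, test, total).
# The dictionary layout and all names below are illustrative assumptions.
SPLITS = {
    ("Mac-Morpho v1", "sentences"): (42_022, 2_211, 9_141, 53_374),
    ("Mac-Morpho v1", "tokens"): (957_439, 50_232, 213_794, 1_221_465),
    ("Mac-Morpho v2", "sentences"): (42_742, 2_249, 4_999, 49_990),
    ("Mac-Morpho v2", "tokens"): (807_818, 43_145, 94_995, 945_958),
    ("Mac-Morpho v3", "sentences"): (37_948, 1_997, 9_987, 49_932),
    ("Mac-Morpho v3", "tokens"): (728_497, 38_881, 178_373, 945_751),
    ("Tycho Brahe", "sentences"): (29_163, 1_535, 10_234, 40_932),
    ("Tycho Brahe", "tokens"): (734_922, 40_679, 259_991, 1_035_592),
}


def check_totals(splits):
    """Return the keys whose train + development + test differs from total."""
    return [
        key
        for key, (train, dev, test, total) in splits.items()
        if train + dev + test != total
    ]


def mean_sentence_length(splits, corpus):
    """Average number of tokens per sentence over the whole corpus."""
    total_tokens = splits[(corpus, "tokens")][3]
    total_sentences = splits[(corpus, "sentences")][3]
    return total_tokens / total_sentences


if __name__ == "__main__":
    # An empty list means every row's splits sum to its reported total.
    print("Inconsistent rows:", check_totals(SPLITS))
    for corpus in ("Mac-Morpho v1", "Mac-Morpho v2", "Mac-Morpho v3", "Tycho Brahe"):
        print(f"{corpus}: {mean_sentence_length(SPLITS, corpus):.1f} tokens per sentence")
```

The same dictionary could be extended to compute the share of each split, for example to compare how much of each corpus is held out for testing.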