Table 2 Corpora size

From: Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese

| Corpus | Unit | Train | Development | Test | Total | Tagset |
|---|---|---|---|---|---|---|
| Mac-Morpho v1 | Sentences | 42,022 | 2,211 | 9,141 | 53,374 | 41 |
| | Tokens | 957,439 | 50,232 | 213,794 | 1,221,465 | |
| Mac-Morpho v2 | Sentences | 42,742 | 2,249 | 4,999 | 49,990 | 30 |
| | Tokens | 807,818 | 43,145 | 94,995 | 945,958 | |
| Mac-Morpho v3 | Sentences | 37,948 | 1,997 | 9,987 | 49,932 | 26 |
| | Tokens | 728,497 | 38,881 | 178,373 | 945,751 | |
| Tycho Brahe | Sentences | 29,163 | 1,535 | 10,234 | 40,932 | 265 |
| | Tokens | 734,922 | 40,679 | 259,991 | 1,035,592 | |
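
The figures in Table 2 can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative and not part of the original article: it copies the counts from the table, confirms that the train, development, and test splits add up to the reported total, and computes each split's percentage share. The names `CORPORA`, `check_totals`, and `split_shares` are hypothetical helpers introduced here.

```python
# Counts copied from Table 2 as (train, development, test, reported total).
CORPORA = {
    "Mac-Morpho v1": {
        "sentences": (42022, 2211, 9141, 53374),
        "tokens": (957439, 50232, 213794, 1221465),
    },
    "Mac-Morpho v2": {
        "sentences": (42742, 2249, 4999, 49990),
        "tokens": (807818, 43145, 94995, 945958),
    },
    "Mac-Morpho v3": {
        "sentences": (37948, 1997, 9987, 49932),
        "tokens": (728497, 38881, 178373, 945751),
    },
    "Tycho Brahe": {
        "sentences": (29163, 1535, 10234, 40932),
        "tokens": (734922, 40679, 259991, 1035592),
    },
}


def check_totals(corpora):
    """Return (corpus, unit) pairs whose splits do not sum to the reported total."""
    mismatches = []
    for name, units in corpora.items():
        for unit, (train, dev, test, total) in units.items():
            if train + dev + test != total:
                mismatches.append((name, unit))
    return mismatches


def split_shares(train, dev, test):
    """Return the percentage share of each split, rounded to one decimal place."""
    total = train + dev + test
    return tuple(round(100 * n / total, 1) for n in (train, dev, test))


if __name__ == "__main__":
    print("Mismatched totals:", check_totals(CORPORA))
    for name, units in CORPORA.items():
        train, dev, test, _ = units["sentences"]
        print(name, "sentence split (%):", split_shares(train, dev, test))
```

Running the script prints any corpus whose splits disagree with its reported total, followed by the sentence-level split percentages for each corpus.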