Measuring grammatical status in Chinese through quantitative corpus analysis
This paper applies a quantitative model for measuring grammatical status to data from the Lancaster Corpus of Mandarin Chinese (LCMC). The model uses four quantitative factors (token frequency, collocate diversity, colligate diversity and deviation of proportions) as predictors in a binary logistic regression, which yields for each element a score of grammatical status between ‘0’ (lexical/non-grammatical) and ‘1’ (highly grammatical). The results of the LCMC model are then compared with those of a similar study of the British National Corpus (BNC). The comparison suggests that token frequency is one of the most relevant parameters for quantifying degrees of grammatical status in both languages, together with collocate diversity measured over a broad window span. By contrast, the colligational measures (left- or right-based) and the collocate diversity measures using small spans (left- or right-based) contribute very differently in the two languages, owing to their typologically distinct structures.
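
As a rough illustration of how a binary logistic regression turns the four predictors into a score between 0 and 1, the following minimal Python sketch may help. The coefficient values, the intercept, the predictor values and all names (`grammatical_score`, `WEIGHTS`, `INTERCEPT`) are hypothetical, invented for this illustration. They are not estimates from the LCMC or BNC models, and the predictors are assumed here to be standardized.

```python
import math

def grammatical_score(features, weights, intercept):
    """Map predictor values to a score in (0, 1) with the logistic function."""
    # Linear predictor: intercept plus the weighted sum of the factors.
    z = intercept + sum(weights[name] * features[name] for name in weights)
    # Logistic (sigmoid) transform, as used in binary logistic regression.
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients (NOT taken from the paper).
WEIGHTS = {
    "token_frequency": 1.2,
    "collocate_diversity": 0.8,
    "colligate_diversity": 0.3,
    "deviation_of_proportions": -0.9,
}
INTERCEPT = -0.5  # hypothetical

# Hypothetical standardized predictor values for two made-up elements.
element_a = {
    "token_frequency": 2.0,
    "collocate_diversity": 1.5,
    "colligate_diversity": 0.5,
    "deviation_of_proportions": -1.0,
}
element_b = {
    "token_frequency": -1.0,
    "collocate_diversity": -0.8,
    "colligate_diversity": -0.2,
    "deviation_of_proportions": 1.0,
}

score_a = grammatical_score(element_a, WEIGHTS, INTERCEPT)
score_b = grammatical_score(element_b, WEIGHTS, INTERCEPT)
print(f"element_a: {score_a:.3f}")
print(f"element_b: {score_b:.3f}")
```

In a sketch like this, a score above 0.5 places an element nearer the grammatical end of the scale and a score below 0.5 nearer the lexical end. In an actual study the coefficients would be estimated from elements labelled in advance as lexical or grammatical rather than set by hand.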