Corpus Analysis with AntConc

Author(s): Heather Froehlich

Corpus analysis is a form of text analysis that allows you to make comparisons between textual objects at a large scale (so-called 'distant reading').
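At its core, this kind of comparison starts from word-frequency counts across corpora, the same operation AntConc's word list and keyword tools perform. A minimal pure-Python sketch, with two invented one-sentence corpora standing in for real text files:

```python
from collections import Counter
import re

def word_freq(text):
    """Tokenize on lowercase word characters and count word frequencies."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Hypothetical miniature corpora for illustration
corpus_a = "the whale surfaced and the whale dived"
corpus_b = "the ship sailed and the crew watched"

freq_a, freq_b = word_freq(corpus_a), word_freq(corpus_b)

# Compare how often selected keywords occur in each corpus
for word in ["whale", "ship", "the"]:
    print(word, freq_a[word], freq_b[word])
```

Keyword analysis in a tool like AntConc builds on exactly these counts, normalizing them by corpus size before comparing.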

2021, Vol 40 (3)
Author(s): Zhiyu Wang, Jingyu Wu, Guang Yu, Zhiping Song

In traditional historical research, interpreting historical documents subjectively and manually causes problems such as one-sided understanding, selective analysis, and one-way knowledge connection. In this study, we use machine learning to automatically analyze and explore historical documents from a text analysis and visualization perspective, addressing the problem that large-scale historical data is difficult for humans to read and intuitively understand. Our data analysis samples are the Qing Dynasty Hetu Dangse documents preserved in the Archives of Liaoning Province; China's Hetu Dangse is the world's largest Qing Dynasty thematic archive written in Manchu and Chinese characters. Through word frequency analysis, correlation analysis, co-word clustering, the word2vec model, and SVM (Support Vector Machine) algorithms, we visualize the documents, reveal the relationships between the functions of government departments in the Shengjing area of the Qing Dynasty, and achieve automatic classification of the archives, improving the use of historical materials and building connections between items of historical knowledge. These results can offer practical guidance to archivists in the management and compilation of historical materials.
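The co-word step of such a pipeline reduces to counting how often term pairs share a document, which then feeds clustering. A minimal pure-Python sketch, with invented English stand-ins for archive index terms (the study's actual terms are Manchu and Chinese):

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-document index terms for illustration
docs = [
    ["tax", "grain", "shengjing"],
    ["tax", "land", "survey"],
    ["grain", "transport", "shengjing"],
]

# Count how often each unordered pair of terms appears in the same document
cooc = Counter()
for terms in docs:
    for a, b in combinations(sorted(set(terms)), 2):
        cooc[(a, b)] += 1

print(cooc[("grain", "shengjing")])  # 2: the pair co-occurs in two documents
```

A clustering algorithm applied to this co-occurrence matrix groups terms that tend to appear together, which is what reveals related departmental functions.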


Open Mind, 2019, Vol 3, pp. 52-67
Author(s): Mika Braginsky, Daniel Yurovsky, Virginia A. Marchman, Michael C. Frank

Why do children learn some words earlier than others? The order in which words are acquired can provide clues about the mechanisms of word learning. In a large-scale corpus analysis, we use parent-report data from over 32,000 children to estimate the acquisition trajectories of around 400 words in each of 10 languages, predicting them on the basis of independently derived properties of the words’ linguistic environment (from corpora) and meaning (from adult judgments). We examine the consistency and variability of these predictors across languages, by lexical category, and over development. The patterning of predictors across languages is quite similar, suggesting similar processes in operation. In contrast, the patterning of predictors across different lexical categories is distinct, in line with theories that posit different factors at play in the acquisition of content words and function words. By leveraging data at a significantly larger scale than previous work, our analyses identify candidate generalizations about the processes underlying word learning across languages.


2016, Vol 19 (2), pp. 212-245
Author(s): Roland Schäfer, Ulrike Sayatz

In this paper, we analyze written sentences containing the German particles obwohl (“although”) and weil (“because”). In standard written German, these particles embed clauses in verb-last constituent order, which is characteristic of subordinated clauses. In spoken and – as we show – nonstandard written German, they embed clauses in verb-second constituent order, which is characteristic of independent sentences. Our usage-based approach to the syntax–graphemics interface includes a large-scale corpus analysis of punctuation patterns in the nonstandard variants, which provides clues to their syntactic structure and degree of sentential independence. Our corpus study confirms and refines hypotheses from existing theoretical approaches by clearly showing that writers systematically mark obwohl clauses with verb-second order as independent sentences, whereas weil clauses with verb-second order are much less strongly marked as independent. This work suggests that similar corpus studies could provide deeper insight into the interplay between syntax and graphemics.
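The counting step of such a punctuation study can be sketched with simple pattern matching: sentence-final punctuation before the particle suggests the writer treats the verb-second clause as an independent sentence, a comma suggests subordination. A toy illustration with invented nonstandard-German examples (not corpus data from the study):

```python
import re

sentences = [
    "Ich bleibe zu Hause. Obwohl ich wollte eigentlich ausgehen.",
    "Ich bleibe zu Hause, weil ich bin müde.",
    "Er kam zu spät. Obwohl er hatte ein Taxi.",
]

# Sentence-final punctuation before "Obwohl": marked as an independent sentence
strong_before_obwohl = sum(bool(re.search(r"[.!?]\s+Obwohl\b", s)) for s in sentences)
# Comma before "weil": marked as a subordinate continuation
comma_before_weil = sum(bool(re.search(r",\s*weil\b", s)) for s in sentences)

print(strong_before_obwohl, comma_before_weil)  # 2 1
```

A real corpus study would of course control for clause-internal word order and normalize counts, but the contrast it reports is a contrast between exactly these kinds of tallies.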


2019, Vol 0 (8/2018), pp. 17-28
Author(s): Maciej Jankowski

Topic models are very popular methods of text analysis. The most popular algorithm for topic modelling is LDA (Latent Dirichlet Allocation). Recently, many new methods have been proposed that enable the use of this model in large-scale processing. One problem is that a data scientist has to choose the number of topics manually, a step that requires some prior analysis. A few methods have been proposed to automate this choice, but none of them works very well when LDA is used as a preprocessing step for further classification. In this paper, we propose an ensemble approach that allows us to use more than one model at the prediction phase, reducing the need to find a single best number of topics. We also analyze several methods of estimating the number of topics.
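The ensemble idea can be sketched independently of LDA itself: instead of committing to one topic count K, train a classifier on top of each candidate K and average their class probabilities at prediction time. A minimal sketch with hypothetical per-model outputs (real models would each be an LDA representation plus a classifier):

```python
def ensemble_predict(prob_lists):
    """Average per-model class-probability vectors for one document,
    then return the index of the highest-probability class."""
    n = len(prob_lists)
    k = len(prob_lists[0])
    avg = [sum(p[i] for p in prob_lists) / n for i in range(k)]
    return max(range(k), key=avg.__getitem__)

# Hypothetical class probabilities from models trained with K = 10, 20, 50 topics
per_model = [
    [0.6, 0.3, 0.1],
    [0.4, 0.5, 0.1],
    [0.7, 0.2, 0.1],
]
print(ensemble_predict(per_model))  # 0: class 0 wins on the averaged probabilities
```

Averaging over several K values hedges against any single topic count being a poor representation for the downstream classifier.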


1996, Vol 1 (1), pp. 1-37
Author(s): Michael Barlow

In this paper intuition-based studies of reflexive forms such as myself are contrasted with a corpus-based investigation of actual usage of reflexives. The examination of reflexives in English in several corpora reveals a variety of patterns, which are analysed within a schema-based approach to grammar (Barlow and Kemmer 1994). This approach follows the cognitive/functional tradition of grammatical analysis in viewing all grammatical units as composed of form-meaning pairings. The paper demonstrates that a schema-based approach is well-suited to the task of describing the major and minor patterns of use revealed by corpus analysis. The importance of text analysis in language teaching is highlighted and connections between the schema-based grammatical formalism and data-driven approaches to second language learning (Johns 1991b) are briefly explored.
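The corpus investigation behind such a study starts from a keyword-in-context (KWIC) concordance of the reflexive forms. A minimal sketch with an invented example text (not the paper's corpora):

```python
import re

text = ("I did it myself. She saw herself in the mirror. "
        "I myself believe the corpus speaks for itself.")

def kwic(text, pattern, width=3):
    """Return keyword-in-context lines for tokens matching pattern."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if re.fullmatch(pattern, tok, re.IGNORECASE):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

for line in kwic(text, r"\w+self"):
    print(line)
```

Sorting such lines by their left or right context is what surfaces the recurrent patterns (emphatic, reflexive, untriggered uses) that a schema-based grammar then describes.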


PMLA, 2017, Vol 132 (3), pp. 636-642
Author(s): Andrew Goldstone

Reading Franco Moretti's Graphs, Maps, Trees as a late-stage graduate student in 2008 was invigorating. Here was an approach to literary history free from the pieties of close reading, committed to empiricism, seeking to fulfill, with its “materialist conception of form,” the promise of the sociology of literature (92). And, at the time, it seemed natural that the way to follow the path laid out by Moretti in Graphs and in the essays he had published over the previous decade was to go to my computer, polish my rusty programming skills, and start making graphs. Yet reconsidering Moretti's Distant Reading now, one is struck by how nondigital the book is. In fact, the meaning of distant reading has undergone a rapid semantic transformation. In “Conjectures on World Literature,” originally published in 2000, Moretti introduces the phrase to describe “a patchwork of other people's research, without a single direct textual reading” (Distant Reading 48). Today, however, distant reading typically refers to computational studies of text. Introducing a 2016 cluster of essays called “Text Analysis at Scale,” Matthew K. Gold and Lauren Klein employ the term to speak of “using digital tools to ‘read’ large swaths of text” (Introduction); in his contribution to the cluster, Ted Underwood embraces “distant reading” as a name for applying machine-learning techniques to unstructured text. Discussions of distant reading have become discussions of computation with text, even if no section of Distant Reading features the elaborate computations found in the Stanford Literary Lab pamphlets to which Moretti has contributed.


Author(s): Ethan Fast, Binbin Chen, Michael S. Bernstein

Human language is colored by a broad range of topics, but existing text analysis tools only focus on a small number of them. We present Empath, a tool that can generate and validate new lexical categories on demand from a small set of seed terms (like "bleed" and "punch" to generate the category violence). Empath draws connotations between words and phrases by learning a neural embedding across billions of words on the web. Given a small set of seed words that characterize a category, Empath uses its neural embedding to discover new related terms, then validates the category with a crowd-powered filter. Empath also analyzes text across 200 built-in, pre-validated categories we have generated such as neglect, government, and social media. We show that Empath's data-driven, human validated categories are highly correlated (r=0.906) with similar categories in LIWC.
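The seed-expansion step can be sketched as nearest-neighbor search in an embedding space: average the seed vectors and rank candidate words by cosine similarity to that centroid. The toy embedding table below is invented for illustration; Empath's actual vectors come from a neural embedding trained on billions of web words.

```python
import math

# Hypothetical 3-dimensional word vectors (real embeddings have hundreds of dims)
emb = {
    "bleed":  [0.9, 0.1, 0.0],
    "punch":  [0.8, 0.2, 0.1],
    "wound":  [0.85, 0.15, 0.05],
    "ballot": [0.1, 0.9, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def expand(seeds, k=1):
    """Return the k non-seed terms closest to the mean of the seed vectors."""
    dim = len(next(iter(emb.values())))
    centroid = [sum(emb[s][i] for s in seeds) / len(seeds) for i in range(dim)]
    candidates = [w for w in emb if w not in seeds]
    return sorted(candidates, key=lambda w: -cosine(emb[w], centroid))[:k]

print(expand({"bleed", "punch"}))  # ['wound'] ranks above 'ballot'
```

Empath adds a crowd-powered filter on top of this ranking, so that only human-validated terms enter the final category.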


2019, Vol IV (II), pp. 1-6
Author(s): Mark Perkins

The huge proliferation of textual (and other) data in digital and organisational sources has led to new techniques of text analysis. The potential thereby unleashed may be underpinned by further theoretical development of Discourse Stream Analysis (DSA), as presented here. This includes the notion of change in the discourse stream in terms of discourse stream fronts, linguistic elements evolving in real time, and notions of time itself in terms of relative speed, subject orientation, and perception. Big data has also given rise to fake news, the manipulation of messages on a large scale. Fake news is conveyed in fake discourse streams and has led to a new field of description and analysis.

