Corpus Analysis with AntConc

Author(s): Heather Froehlich

Corpus analysis is a form of text analysis that allows you to make comparisons between textual objects at a large scale (so-called 'distant reading').
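At its core, this kind of comparison starts from word-frequency counts across corpora, the same operation AntConc's word list and keyword tools perform. A minimal pure-Python sketch, with two invented one-sentence corpora standing in for real text files:

```python
from collections import Counter
import re

def word_freq(text):
    """Tokenize on lowercase word characters and count word frequencies."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Hypothetical miniature corpora for illustration
corpus_a = "the whale surfaced and the whale dived"
corpus_b = "the ship sailed and the crew watched"

freq_a, freq_b = word_freq(corpus_a), word_freq(corpus_b)

# Compare how often selected keywords occur in each corpus
for word in ["whale", "ship", "the"]:
    print(word, freq_a[word], freq_b[word])
```

Keyword analysis in a tool like AntConc builds on exactly these counts, normalizing them by corpus size before comparing.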

2021, Vol 40 (3)
Author(s): Zhiyu Wang, Jingyu Wu, Guang Yu, Zhiping Song

In traditional historical research, interpreting historical documents subjectively and manually causes problems such as one-sided understanding, selective analysis, and one-way knowledge connection. In this study, we use machine learning to automatically analyze and explore historical documents from a text analysis and visualization perspective, addressing the problem that large-scale historical data is difficult for humans to read and intuitively understand. Our data analysis samples are the Qing Dynasty Hetu Dangse documents preserved in the Archives of Liaoning Province; China's Hetu Dangse is the world's largest Qing Dynasty thematic archive written in Manchu and Chinese characters. Through word frequency analysis, correlation analysis, co-word clustering, the word2vec model, and SVM (Support Vector Machine) algorithms, we visualize the documents, reveal the relationships between the functions of government departments in the Shengjing area of the Qing Dynasty, and achieve automatic classification of the archives, improving the use of historical materials and building connections between items of historical knowledge. These results can offer practical guidance to archivists in the management and compilation of historical materials.
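The co-word step of such a pipeline reduces to counting how often term pairs share a document, which then feeds clustering. A minimal pure-Python sketch, with invented English stand-ins for archive index terms (the study's actual terms are Manchu and Chinese):

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-document index terms for illustration
docs = [
    ["tax", "grain", "shengjing"],
    ["tax", "land", "survey"],
    ["grain", "transport", "shengjing"],
]

# Count how often each unordered pair of terms appears in the same document
cooc = Counter()
for terms in docs:
    for a, b in combinations(sorted(set(terms)), 2):
        cooc[(a, b)] += 1

print(cooc[("grain", "shengjing")])  # 2: the pair co-occurs in two documents
```

A clustering algorithm applied to this co-occurrence matrix groups terms that tend to appear together, which is what reveals related departmental functions.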


Open Mind, 2019, Vol 3, pp. 52-67
Author(s): Mika Braginsky, Daniel Yurovsky, Virginia A. Marchman, Michael C. Frank

Why do children learn some words earlier than others? The order in which words are acquired can provide clues about the mechanisms of word learning. In a large-scale corpus analysis, we use parent-report data from over 32,000 children to estimate the acquisition trajectories of around 400 words in each of 10 languages, predicting them on the basis of independently derived properties of the words’ linguistic environment (from corpora) and meaning (from adult judgments). We examine the consistency and variability of these predictors across languages, by lexical category, and over development. The patterning of predictors across languages is quite similar, suggesting similar processes in operation. In contrast, the patterning of predictors across different lexical categories is distinct, in line with theories that posit different factors at play in the acquisition of content words and function words. By leveraging data at a significantly larger scale than previous work, our analyses identify candidate generalizations about the processes underlying word learning across languages.


2016, Vol 19 (2), pp. 212-245
Author(s): Roland Schäfer, Ulrike Sayatz

In this paper, we analyze written sentences containing the German particles obwohl (“although”) and weil (“because”). In standard written German, these particles embed clauses in verb-last constituent order, which is characteristic of subordinated clauses. In spoken and – as we show – nonstandard written German, they embed clauses in verb-second constituent order, which is characteristic of independent sentences. Our usage-based approach to the syntax–graphemics interface includes a large-scale corpus analysis of punctuation patterns in the nonstandard variants, which provides clues to their syntactic structure and degree of sentential independence. Our corpus study confirms and refines hypotheses from existing theoretical approaches by clearly showing that writers systematically mark obwohl clauses with verb-second order as independent sentences, whereas weil clauses with verb-second order are much less strongly marked as independent. This work suggests that similar corpus studies could provide deeper insight into the interplay between syntax and graphemics.
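The counting step of such a punctuation study can be sketched with simple pattern matching: sentence-final punctuation before the particle suggests the writer treats the verb-second clause as an independent sentence, a comma suggests subordination. A toy illustration with invented nonstandard-German examples (not corpus data from the study):

```python
import re

sentences = [
    "Ich bleibe zu Hause. Obwohl ich wollte eigentlich ausgehen.",
    "Ich bleibe zu Hause, weil ich bin müde.",
    "Er kam zu spät. Obwohl er hatte ein Taxi.",
]

# Sentence-final punctuation before "Obwohl": marked as an independent sentence
strong_before_obwohl = sum(bool(re.search(r"[.!?]\s+Obwohl\b", s)) for s in sentences)
# Comma before "weil": marked as a subordinate continuation
comma_before_weil = sum(bool(re.search(r",\s*weil\b", s)) for s in sentences)

print(strong_before_obwohl, comma_before_weil)  # 2 1
```

A real corpus study would of course control for clause-internal word order and normalize counts, but the contrast it reports is a contrast between exactly these kinds of tallies.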


2019, Vol 0 (8/2018), pp. 17-28
Author(s): Maciej Jankowski

Topic models are very popular methods of text analysis. The most popular algorithm for topic modelling is LDA (Latent Dirichlet Allocation). Recently, many new methods have been proposed that enable the use of this model in large-scale processing. One problem is that a data scientist has to choose the number of topics manually, a step that requires some prior analysis. A few methods have been proposed to automate this choice, but none of them works very well when LDA is used as a preprocessing step for further classification. In this paper, we propose an ensemble approach that allows us to use more than one model at the prediction phase, reducing the need to find a single best number of topics. We also analyze several methods of estimating the number of topics.
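The ensemble idea can be sketched independently of LDA itself: instead of committing to one topic count K, train a classifier on top of each candidate K and average their class probabilities at prediction time. A minimal sketch with hypothetical per-model outputs (real models would each be an LDA representation plus a classifier):

```python
def ensemble_predict(prob_lists):
    """Average per-model class-probability vectors for one document,
    then return the index of the highest-probability class."""
    n = len(prob_lists)
    k = len(prob_lists[0])
    avg = [sum(p[i] for p in prob_lists) / n for i in range(k)]
    return max(range(k), key=avg.__getitem__)

# Hypothetical class probabilities from models trained with K = 10, 20, 50 topics
per_model = [
    [0.6, 0.3, 0.1],
    [0.4, 0.5, 0.1],
    [0.7, 0.2, 0.1],
]
print(ensemble_predict(per_model))  # 0: class 0 wins on the averaged probabilities
```

Averaging over several K values hedges against any single topic count being a poor representation for the downstream classifier.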


1996, Vol 1 (1), pp. 1-37
Author(s): Michael Barlow

In this paper intuition-based studies of reflexive forms such as myself are contrasted with a corpus-based investigation of actual usage of reflexives. The examination of reflexives in English in several corpora reveals a variety of patterns, which are analysed within a schema-based approach to grammar (Barlow and Kemmer 1994). This approach follows the cognitive/functional tradition of grammatical analysis in viewing all grammatical units as composed of form-meaning pairings. The paper demonstrates that a schema-based approach is well-suited to the task of describing the major and minor patterns of use revealed by corpus analysis. The importance of text analysis in language teaching is highlighted and connections between the schema-based grammatical formalism and data-driven approaches to second language learning (Johns 1991b) are briefly explored.
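The corpus investigation behind such a study starts from a keyword-in-context (KWIC) concordance of the reflexive forms. A minimal sketch with an invented example text (not the paper's corpora):

```python
import re

text = ("I did it myself. She saw herself in the mirror. "
        "I myself believe the corpus speaks for itself.")

def kwic(text, pattern, width=3):
    """Return keyword-in-context lines for tokens matching pattern."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if re.fullmatch(pattern, tok, re.IGNORECASE):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

for line in kwic(text, r"\w+self"):
    print(line)
```

Sorting such lines by their left or right context is what surfaces the recurrent patterns (emphatic, reflexive, untriggered uses) that a schema-based grammar then describes.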


PMLA, 2017, Vol 132 (3), pp. 636-642
Author(s): Andrew Goldstone

Reading Franco Moretti's Graphs, Maps, Trees as a late-stage graduate student in 2008 was invigorating. Here was an approach to literary history free from the pieties of close reading, committed to empiricism, seeking to fulfill, with its “materialist conception of form,” the promise of the sociology of literature (92). And, at the time, it seemed natural that the way to follow the path laid out by Moretti in Graphs and in the essays he had published over the previous decade was to go to my computer, polish my rusty programming skills, and start making graphs. Yet reconsidering Moretti's Distant Reading now, one is struck by how nondigital the book is. In fact, the meaning of distant reading has undergone a rapid semantic transformation. In “Conjectures on World Literature,” originally published in 2000, Moretti introduces the phrase to describe “a patchwork of other people's research, without a single direct textual reading” (Distant Reading 48). Today, however, distant reading typically refers to computational studies of text. Introducing a 2016 cluster of essays called “Text Analysis at Scale,” Matthew K. Gold and Lauren Klein employ the term to speak of “using digital tools to ‘read’ large swaths of text” (Introduction); in his contribution to the cluster, Ted Underwood embraces “distant reading” as a name for applying machine-learning techniques to unstructured text. Discussions of distant reading have become discussions of computation with text, even if no section of Distant Reading features the elaborate computations found in the Stanford Literary Lab pamphlets to which Moretti has contributed.


Author(s): Ethan Fast, Binbin Chen, Michael S. Bernstein

Human language is colored by a broad range of topics, but existing text analysis tools only focus on a small number of them. We present Empath, a tool that can generate and validate new lexical categories on demand from a small set of seed terms (like "bleed" and "punch" to generate the category violence). Empath draws connotations between words and phrases by learning a neural embedding across billions of words on the web. Given a small set of seed words that characterize a category, Empath uses its neural embedding to discover new related terms, then validates the category with a crowd-powered filter. Empath also analyzes text across 200 built-in, pre-validated categories we have generated such as neglect, government, and social media. We show that Empath's data-driven, human validated categories are highly correlated (r=0.906) with similar categories in LIWC.
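The seed-expansion step can be sketched as nearest-neighbor search in an embedding space: average the seed vectors and rank candidate words by cosine similarity to that centroid. The toy embedding table below is invented for illustration; Empath's actual vectors come from a neural embedding trained on billions of web words.

```python
import math

# Hypothetical 3-dimensional word vectors (real embeddings have hundreds of dims)
emb = {
    "bleed":  [0.9, 0.1, 0.0],
    "punch":  [0.8, 0.2, 0.1],
    "wound":  [0.85, 0.15, 0.05],
    "ballot": [0.1, 0.9, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def expand(seeds, k=1):
    """Return the k non-seed terms closest to the mean of the seed vectors."""
    dim = len(next(iter(emb.values())))
    centroid = [sum(emb[s][i] for s in seeds) / len(seeds) for i in range(dim)]
    candidates = [w for w in emb if w not in seeds]
    return sorted(candidates, key=lambda w: -cosine(emb[w], centroid))[:k]

print(expand({"bleed", "punch"}))  # ['wound'] ranks above 'ballot'
```

Empath adds a crowd-powered filter on top of this ranking, so that only human-validated terms enter the final category.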


2019, Vol IV (II), pp. 1-6
Author(s): Mark Perkins

The huge proliferation of textual (and other) data in digital and organisational sources has led to new techniques of text analysis. The potential thereby unleashed may be underpinned by further theoretical development of Discourse Stream Analysis (DSA), as presented here. This includes the notion of change in the discourse stream in terms of discourse stream fronts, linguistic elements evolving in real time, and notions of time itself in terms of relative speed, subject orientation, and perception. Big data has also given rise to fake news, the manipulation of messages on a large scale. Fake news is conveyed in fake discourse streams and has led to a new field of description and analysis.

