chinese text mining
Recently Published Documents


TOTAL DOCUMENTS: 6 (FIVE YEARS 0)

H-INDEX: 2 (FIVE YEARS 0)

2020 ◽  
Vol 24 (1) ◽  
pp. 46-66
Author(s):  
Joseph Dennis

Abstract: This article analyses patterns of book donations to local school libraries in the Ming (1368–1644), drawing on a data set made with LoGaRT, Chinese text-mining and processing software created by the Max Planck Institute for the History of Science. Records of donated books, together with records explaining donor motivations, make it possible to show what types of people donated and what books they selected. Donors gave books on a broad range of topics. Big data makes it possible to identify changes over time and space, and enhances our understanding of book circulation. This article builds on Timothy Brook’s work on Ming school libraries, in which he argued that they had a set of core books issued by the central government, but little else. I argue that donated books were also important for many library collections.


2016 ◽  
Vol 113 (22) ◽  
pp. 6154-6159 ◽  
Author(s):  
Ke Deng ◽  
Peter K. Bol ◽  
Kate J. Li ◽  
Jun S. Liu

With the growing availability of digitized text data both publicly and privately, there is a great need for effective computational tools to automatically extract information from texts. Because the Chinese language differs most significantly from alphabet-based languages in not specifying word boundaries, most existing Chinese text-mining methods require a prespecified vocabulary and/or a large relevant training corpus, which may not be available in some applications. We introduce an unsupervised method, top-down word discovery and segmentation (TopWORDS), for simultaneously discovering and segmenting words and phrases from large volumes of unstructured Chinese texts, and propose ways to order discovered words and conduct higher-level context analyses. TopWORDS is particularly useful for mining online and domain-specific texts where the underlying vocabulary is unknown or the texts of interest differ significantly from available training corpora. When outputs from TopWORDS are fed into context analysis tools such as topic modeling, word embedding, and association pattern finding, the results are as good as or better than those obtained from the outputs of a supervised segmentation method.
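To make the segmentation problem concrete, the sketch below splits an unspaced string into words by choosing the split with the highest total probability under a simple unigram word model. It is an illustration only and not the TopWORDS implementation: the function name `segment`, the `max_len` limit, and the probabilities in `TOY_PROBS` are hypothetical, and the word probabilities are supplied by hand here, whereas an unsupervised method would have to estimate them from the raw text.

```python
import math

# Hypothetical toy probabilities, hand-picked for illustration only.
TOY_PROBS = {
    "中": 0.1, "国": 0.1, "中国": 0.3,
    "人": 0.2, "民": 0.1, "人民": 0.2,
}

def segment(text, word_probs, max_len=4):
    """Return the most probable split of `text` into known words, or None.

    Dynamic programming over end positions: best[i] holds the highest
    log-probability of any segmentation of text[:i].
    """
    n = len(text)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)  # back[i] = start index of the last word ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prob = word_probs.get(text[start:end])
            if prob is None or best[start] == -math.inf:
                continue  # unknown word, or prefix cannot be segmented
            score = best[start] + math.log(prob)
            if score > best[end]:
                best[end] = score
                back[end] = start
    if best[n] == -math.inf:
        return None  # no segmentation uses only known words
    # Walk the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]

print(segment("中国人民", TOY_PROBS))
```

With these toy values the two-word split is preferred because 0.3 × 0.2 exceeds the probability of any split that uses single characters. A string containing a character absent from the dictionary returns `None`, which mirrors the limitation of fixed-vocabulary methods that the abstract describes.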

