A Standardized Project Gutenberg Corpus for Statistical Analysis of Natural Language and Quantitative Linguistics

Martin Gerlach; Francesc Font-Clos

doi:10.3390/e22010126

A Standardized Project Gutenberg Corpus for Statistical Analysis of Natural Language and Quantitative Linguistics

Entropy ◽

10.3390/e22010126 ◽

2020 ◽

Vol 22 (1) ◽

pp. 126 ◽

Cited By ~ 5

Author(s):

Martin Gerlach ◽

Francesc Font-Clos

Keyword(s):

Statistical Analysis ◽

Natural Language ◽

Language Processing ◽

Corpus Linguistics ◽

Open Science ◽

Quantitative Linguistics ◽

Processing Strategies ◽

Different Sources ◽

Different Levels

The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potential biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient details), raising concerns regarding the reproducibility of published results. In order to address these shortcomings, here we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3 × 10 9 word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, as well as the obtained corpus itself on three different levels of granularity (raw text, timeseries of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.

Download Full-text

Does higher education properly prepare graduates for the growing artificial intelligence market? Gaps identification using text mining

Human Systems Management ◽

10.3233/hsm-211179 ◽

2021 ◽

pp. 1-13

Author(s):

Lamiae Benhayoun ◽

Daniel Lang

Keyword(s):

Artificial Intelligence ◽

Natural Language Processing ◽

Text Mining ◽

Natural Language ◽

Language Processing ◽

Academic Training ◽

Market Requirements ◽

Job Advertisements ◽

The Individual

BACKGROUND: The renewed advent of Artificial Intelligence (AI) is inducing profound changes in the classic categories of technology professions and is creating the need for new specific skills. OBJECTIVE: Identify the gaps in terms of skills between academic training on AI in French engineering and Business Schools, and the requirements of the labour market. METHOD: Extraction of AI training contents from the schools’ websites and scraping of a job advertisements’ website. Then, analysis based on a text mining approach with a Python code for Natural Language Processing. RESULTS: Categorization of occupations related to AI. Characterization of three classes of skills for the AI market: Technical, Soft and Interdisciplinary. Skills’ gaps concern some professional certifications and the mastery of specific tools, research abilities, and awareness of ethical and regulatory dimensions of AI. CONCLUSIONS: A deep analysis using algorithms for Natural Language Processing. Results that provide a better understanding of the AI capability components at the individual and the organizational levels. A study that can help shape educational programs to respond to the AI market requirements.

Download Full-text

A Hybrid Siamese Neural Network for Natural Language Inference in Cyber-Physical Systems

ACM Transactions on Internet Technology ◽

10.1145/3418208 ◽

2021 ◽

Vol 21 (2) ◽

pp. 1-25

Author(s):

Pin Ni ◽

Yuming Li ◽

Gangmin Li ◽

Victor Chang

Keyword(s):

Natural Language ◽

Language Processing ◽

Short Term Memory ◽

Physical World ◽

Heterogeneous Data ◽

Cyber Physical Systems ◽

Physical Systems ◽

Language Data ◽

Text Language ◽

Different Sources

Cyber-Physical Systems (CPS), as a multi-dimensional complex system that connects the physical world and the cyber world, has a strong demand for processing large amounts of heterogeneous data. These tasks also include Natural Language Inference (NLI) tasks based on text from different sources. However, the current research on natural language processing in CPS does not involve exploration in this field. Therefore, this study proposes a Siamese Network structure that combines Stacked Residual Long Short-Term Memory (bidirectional) with the Attention mechanism and Capsule Network for the NLI module in CPS, which is used to infer the relationship between text/language data from different sources. This model is mainly used to implement NLI tasks and conduct a detailed evaluation in three main NLI benchmarks as the basic semantic understanding module in CPS. Comparative experiments prove that the proposed method achieves competitive performance, has a certain generalization ability, and can balance the performance and the number of trained parameters.

Download Full-text

Natural-language processing and automatic indexing

The Indexer The International Journal of Indexing ◽

10.3828/indexer.1990.17.1.8 ◽

1990 ◽

Vol 17 (1) ◽

pp. 21-29

Author(s):

C. Korycinski ◽

Alan F. Newell

Keyword(s):

Natural Language Processing ◽

Statistical Analysis ◽

Natural Language ◽

Language Processing ◽

Database Systems ◽

Statistical Techniques ◽

Free Text ◽

Automatic Indexing ◽

Text Database ◽

Processing Techniques

The task of producing satisfactory indexes by automatic means has been tackled on two fronts: by statistical analysis of text and by attempting content analysis of the text in much the same way as a human indexcr does. Though statistical techniques have a lot to offer for free-text database systems, neither method has had much success with back-of-the-bopk indexing. This review examines some problems associated with the application of natural-language processing techniques to book texts.

Download Full-text

Su1920 DETECTION AND CHARACTERIZATION OF EXTRA-INTESTINAL MANIFESTATIONS OF IBD IN CLINICAL OFFICE NOTES USING NATURAL LANGUAGE PROCESSING

Gastroenterology ◽

10.1016/s0016-5085(20)32446-x ◽

2020 ◽

Vol 158 (6) ◽

pp. S-702

Author(s):

Ryan Stidham ◽

Deahan Yu ◽

Shibamouli Lahiri ◽

Vinod Vydiswaran

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing

Download Full-text

Top-down information is more important in noisy situations: Exploring the role of pragmatic, semantic, and syntactic information in language processing

10.31234/osf.io/xp736 ◽

2019 ◽

Author(s):

Fabio Trecca ◽

Kristian Tylén ◽

Riccardo Fusaroli ◽

Christer Johansson ◽

Morten H. Christiansen

Keyword(s):

Language Processing ◽

Spoken Language ◽

Top Down ◽

Speech Input ◽

Processing Strategies ◽

Syntactic Information ◽

Noisy Conditions ◽

Relative Weighting ◽

Different Sources

Language processing depends on the integration of bottom-up information with top-down cues from several different sources—primarily our knowledge of the real world, of discourse contexts, and of how language works. Previous studies have shown that factors pertaining to both the sender and the receiver of the message affect the relative weighting of such information. Here, we suggest another factor that may change our processing strategies: perceptual noise in the environment. We hypothesize that listeners weight different sources of top-down information more in situations of perceptual noise than in noise-free situations. Using a sentence-picture matching experiment with four forced-choice alternatives, we show that degrading the speech input with noise compels the listeners to rely more on top-down information in processing. We discuss our results in light of previous findings in the literature, highlighting the need for a unified model of spoken language comprehension in different ecologically valid situations, including under noisy conditions.

Download Full-text

Digital Archiving by Nigerian and Foreign Authors in a Low Resource Context: A Content Analysis of Publications on Natural Language Processing of Nigerian Languages

Proceedings of the Annual Conference of CAIS / Actes du congrès annuel de l'ACSI ◽

10.29173/cais1175 ◽

2020 ◽

Author(s):

Toluwase Asubiaro

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Web Of Science ◽

Open Science ◽

Digital Archives ◽

Scientific Publications ◽

Digital Archiving ◽

Data Archiving ◽

Computer Codes

This study investigated if there is a difference in the number of articles, datasets and computer codes that foreign and Nigerian authors of scientific publications on natural language processing (NLP) of Nigerian languages deposited in digital archives. Relevant articles were systematically retrieved from Google, Web of Science and Scopus. Authorship type and data archiving information was extracted from the full text of the relevant publications. Result shows that papers with foreign authorship (80.4%) published their articles in non-commercial repositories, more than papers with Nigerian authorship (55.3%). Similarly, few papers with foreign authorship deposited research data (19.1%) and computer codes (10.4%), while none of the papers with Nigerian authorship did. It was recommended that librarians in Nigeria should create awareness on the benefits of digital archiving and open science. Cette étude a eximané les différences dans le nombre d'articles, d'ensembles de données et de codes informatiques dans les articles scientifiques sur le traitement du langage naturel que les auteurs nigériens et les auteurs étrangers ont soumis dans les dépôts d'autoarchivage. Les articles pertinents ont été systématiquement extraits de Google, Web of Science et Scopus. Les informations relatives au type d'auteur et à l'archivage des données ont été extraites du texte intégral des publications pertinentes. Les résultats montrent que les articles écrits par des auteurs étrangers ont davantage publié leurs articles dans des dépôts non commerciaux (80,4%) que les auteurs nigériens (55,3%). Peu d'auteurs étrangers ont déposé des données de recherche (19,1%) et des codes informatiques (10,4%) tandis qu'aucun auteur nigérien ne l'a fait. Ces résultats démontrent l'importance de la sensibilisation aux avantages des dépôt d'archivage et de la science ouverte pour les bibliothécaires nigériens.

Download Full-text

Towards a systematic characterization of protein complex function: a natural language processing and machine-learning framework

10.1101/2021.02.24.432789 ◽

2021 ◽

Author(s):

Varun S. Sharma ◽

Andrea Fossati ◽

Rodolfo Ciuffa ◽

Marija Buljan ◽

Evan G. Williams ◽

...

Keyword(s):

Natural Language ◽

Language Processing ◽

Protein Complex ◽

Large Scale ◽

Protein Complexes ◽

Complex Function ◽

Word Embedding ◽

Biological Processes ◽

Go Terms

SummaryIt is a general assumption of molecular biology that the ensemble of expressed molecules, their activities and interactions determine biological processes, cellular states and phenotypes. Quantitative abundance of transcripts, proteins and metabolites are now routinely measured with considerable depth via an array of “OMICS” technologies, and recently a number of methods have also been introduced for the parallel analysis of the abundance, subunit composition and cell state specific changes of protein complexes. In comparison to the measurement of the molecular entities in a cell, the determination of their function remains experimentally challenging and labor-intensive. This holds particularly true for determining the function of protein complexes, which constitute the core functional assemblies of the cell. Therefore, the tremendous progress in multi-layer molecular profiling has been slow to translate into increased functional understanding of biological processes, cellular states and phenotypes. In this study we describe PCfun, a computational framework for the systematic annotation of protein complex function using Gene Ontology (GO) terms. This work is built upon the use of word embedding— natural language text embedded into continuous vector space that preserves semantic relationships— generated from the machine reading of 1 million open access PubMed Central articles. PCfun leverages the embedding for rapid annotation of protein complex function by integrating two approaches: (1) an unsupervised approach that obtains the nearest neighbor (NN) GO term word vectors for a protein complex query vector, and (2) a supervised approach using Random Forest (RF) models trained specifically for recovering the GO terms of protein complex queries described in the CORUM protein complex database. PCfun consolidates both approaches by performing the statistical test for the enrichment of the top NN GO terms within the child terms of the predicted GO terms by RF models. Thus, PCfun amalgamates information learned from the gold-standard protein-complex database, CORUM, with the unbiased predictions obtained directly from the word embedding, thereby enabling PCfun to identify the potential functions of putative protein complexes. The documentation and examples of the PCfun package are available at https://github.com/sharmavaruns/PCfun. We anticipate that PCfun will serve as a useful tool and novel paradigm for the large-scale characterization of protein complex function.

Download Full-text

Ontology-Assisted Database Integration to Support Natural Language Processing and Biomedical Data-mining

Journal of Integrative Bioinformatics ◽

10.1515/jib-2004-1 ◽

2004 ◽

Vol 1 (1) ◽

pp. 1-10 ◽

Cited By ~ 6

Author(s):

Jean-Luc Verschelde ◽

Mariana Casella Dos Santos ◽

Tom Deray ◽

Barry Smith ◽

Werner Ceusters

Keyword(s):

Data Mining ◽

Natural Language ◽

Language Processing ◽

Biomedical Ontology ◽

Biological Data ◽

Heterogeneous Databases ◽

Database Integration ◽

Biomedical Data ◽

Biological Phenomena ◽

Different Levels

Summary Successful biomedical data mining and information extraction require a complete picture of biological phenomena such as genes, biological processes, and diseases; as these exist on different levels of granularity. To realize this goal, several freely available heterogeneous databases as well as proprietary structured datasets have to be integrated into a single global customizable scheme. We will present a tool to integrate different biological data sources by mapping them to a proprietary biomedical ontology that has been developed for the purposes of making computers understand medical natural language.

Download Full-text

MICE

International Journal of Corpus Linguistics ◽

10.1075/ijcl.9.1.03are ◽

2004 ◽

Vol 9 (1) ◽

pp. 53-68 ◽

Cited By ~ 6

Author(s):

Montserrat Arévalo Rodríguez ◽

Montserrat Civit Torruella ◽

Maria Antònia Martí

Keyword(s):

Information Retrieval ◽

Natural Language Processing ◽

Natural Language ◽

Information Extraction ◽

Language Processing ◽

Corpus Linguistics ◽

Proper Names ◽

Named Entity ◽

Different Types

In the field of corpus linguistics, Named Entity treatment includes the recognition and classification of different types of discursive elements like proper names, date, time, etc. These discursive elements play an important role in different Natural Language Processing applications and techniques such as Information Retrieval, Information Extraction, translations memories, document routers, etc.

Download Full-text

Selection of correction candidates for the normalization of Spanish user-generated content

Natural Language Engineering ◽

10.1017/s1351324914000011 ◽

2014 ◽

Vol 22 (1) ◽

pp. 135-161 ◽

Cited By ~ 2

Author(s):

M. MELERO ◽

M.R. COSTA-JUSSÀ ◽

P. LAMBERT ◽

M. QUIXAL

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Linear Interpolation ◽

Language Models ◽

User Generated Content ◽

Grammar Correction ◽

Word Forms ◽

Spell Checker ◽

Different Sources

AbstractWe present research aiming to build tools for the normalization of User-Generated Content (UGC). We argue that processing this type of text requires the revisiting of the initial steps of Natural Language Processing, since UGC (micro-blog, blog, and, generally, Web 2.0 user-generated texts) presents a number of nonstandard communicative and linguistic characteristics – often closer to oral and colloquial language than to edited text. We present a corpus of UGC text in Spanish from three different sources: Twitter, consumer reviews, and blogs, and describe its main characteristics. We motivate the need for UGC text normalization by analyzing the problems found when processing this type of text through a conventional language processing pipeline, particularly in the tasks of lemmatization and morphosyntactic tagging. Our aim with this paper is to seize the power of already existing spell and grammar correction engines and endow them with automatic normalization capabilities in order to pave the way for the application of standard Natural Language Processing tools to typical UGC text. Particularly, we propose a strategy for automatically normalizing UGC by adding a module on top of a pre-existing spell-checker that selects the most plausible correction from an unranked list of candidates provided by the spell-checker. To build this selector module we train four language models, each one containing a different type of linguistic information in a trade-off with its generalization capabilities. Our experiments show that the models trained on truecase and lowercase word forms are more discriminative than the others at selecting the best candidate. We have also experimented with a parametrized combination of the models by both optimizing directly on the selection task and doing a linear interpolation of the models. The resulting parametrized combinations obtain results close to the best performing model but do not improve on those results, as measured on the test set. The precision of the selector module in ranking number one the expected correction proposal on the test corpora reaches 82.5% for Twitter text (baseline 57%) and 88% for non-Twitter text (baseline 64%).

Download Full-text