Multilingual Open Information Extraction: Challenges and Opportunities

Daniela Barreiro Claro; Marlo Souza; Clarissa Castellã Xavier; Leandro Oliveira

doi:10.3390/info10070228

Multilingual Open Information Extraction: Challenges and Opportunities

Information ◽

10.3390/info10070228 ◽

2019 ◽

Vol 10 (7) ◽

pp. 228 ◽

Cited By ~ 4

Author(s):

Daniela Barreiro Claro ◽

Marlo Souza ◽

Clarissa Castellã Xavier ◽

Leandro Oliveira

Keyword(s):

Information Extraction ◽

Language Processing ◽

State Of The Art ◽

Transfer Of Knowledge ◽

Linguistic Resources ◽

Open Information Extraction ◽

General Rules ◽

Challenges And Opportunities ◽

Multilingual Approach

The number of documents published on the Web in languages other than English grows every year. As a consequence, the need to extract useful information from different languages increases, highlighting the importance of research into Open Information Extraction (OIE) techniques. Different OIE methods have dealt with features from a unique language; however, few approaches tackle multilingual aspects. In those approaches, multilingualism is restricted to processing text in different languages, rather than exploring cross-linguistic resources, which results in low precision due to the use of general rules. Multilingual methods have been applied to numerous problems in Natural Language Processing, achieving satisfactory results and demonstrating that knowledge acquisition for a language can be transferred to other languages to improve the quality of the facts extracted. We argue that a multilingual approach can enhance OIE methods as it is ideal to evaluate and compare OIE systems, and therefore can be applied to the collected facts. In this work, we discuss how the transfer knowledge between languages can increase acquisition from multilingual approaches. We provide a roadmap of the Multilingual Open IE area concerning state of the art studies. Additionally, we evaluate the transfer of knowledge to improve the quality of the facts extracted in each language. Moreover, we discuss the importance of a parallel corpus to evaluate and compare multilingual systems.

Download Full-text

Multilingual Open Information Extraction: Challenges and Opportunities

10.20944/preprints201905.0029.v1 ◽

2019 ◽

Cited By ~ 1

Author(s):

Daniela Barreiro Claro ◽

Marlo Souza ◽

Clarissa Castellã Xavier ◽

Leandro Oliveira

Keyword(s):

Information Extraction ◽

Language Processing ◽

Extraction Method ◽

State Of The Art ◽

Transfer Of Knowledge ◽

Open Information Extraction ◽

General Rules ◽

Challenges And Opportunities ◽

Multilingual Approach

The number of documents published on the Web other languages than English grows every year. As a consequence, it increases the necessity of extracting useful information from different languages, pointing out the importance of researching Open Information Extraction (OIE) techniques. Different OIE methods have been dealing with features from a unique language. On the other hand, few approaches tackle multilingual aspects. In such approaches, multilingual is only treated as an extraction method, which results in low precision due to the use of general rules. Multilingual methods have been applied to a vast amount of problems in Natural Language Processing achieving satisfactory results and demonstrating that knowledge acquisition for a language can be transferred to other languages to improve the quality of the facts extracted. We state that a multilingual approach can enhance OIE methods, being ideal to evaluate and compare OIE systems, and as a consequence, to applying it to the collected facts. In this work, we discuss how the transfer knowledge between languages can increase the acquisition from multilingual approaches. We provide a roadmap of the Multilingual Open IE area concerning the state of the art studies. Additionally, we evaluate the transfer of knowledge to improve the quality of the facts extracted in each language. Moreover, we discuss the importance of a parallel corpus to evaluate and compare multilingual systems.

Download Full-text

A Methodology for Open Information Extraction and Representation from Large Scientific Corpora: The CORD-19 Data Exploration Use Case

Applied Sciences ◽

10.3390/app10165630 ◽

2020 ◽

Vol 10 (16) ◽

pp. 5630

Author(s):

Dimitris Papadopoulos ◽

Nikolaos Papadakis ◽

Antonis Litke

Keyword(s):

Information Extraction ◽

Language Processing ◽

State Of The Art ◽

Data Exploration ◽

Graph Representation ◽

Linguistic Features ◽

Open Information Extraction ◽

Structured Knowledge ◽

Machine Readable ◽

Medical Dataset

The usefulness of automated information extraction tools in generating structured knowledge from unstructured and semi-structured machine-readable documents is limited by challenges related to the variety and intricacy of the targeted entities, the complex linguistic features of heterogeneous corpora, and the computational availability for readily scaling to large amounts of text. In this paper, we argue that the redundancy and ambiguity of subject–predicate–object (SPO) triples in open information extraction systems has to be treated as an equally important step in order to ensure the quality and preciseness of generated triples. To this end, we propose a pipeline approach for information extraction from large corpora, encompassing a series of natural language processing tasks. Our methodology consists of four steps: i. in-place coreference resolution, ii. extractive text summarization, iii. parallel triple extraction, and iv. entity enrichment and graph representation. We manifest our methodology on a large medical dataset (CORD-19), relying on state-of-the-art tools to fulfil the aforementioned steps and extract triples that are subsequently mapped to a comprehensive ontology of biomedical concepts. We evaluate the effectiveness of our information extraction method by comparing it in terms of precision, recall, and F1-score with state-of-the-art OIE engines and demonstrate its capabilities on a set of data exploration tasks.

Download Full-text

Span Model for Open Information Extraction on Accurate Corpus

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6497 ◽

2020 ◽

Vol 34 (05) ◽

pp. 9523-9530

Author(s):

Junlang Zhan ◽

Hai Zhao

Keyword(s):

Information Extraction ◽

State Of The Art ◽

Training Dataset ◽

Model Design ◽

Test Set ◽

Benchmark Test ◽

Open Information Extraction ◽

Test Sets ◽

Improved Model ◽

Benchmark Evaluation

Open Information Extraction (Open IE) is a challenging task especially due to its brittle data basis. Most of Open IE systems have to be trained on automatically built corpus and evaluated on inaccurate test set. In this work, we first alleviate this difficulty from both sides of training and test sets. For the former, we propose an improved model design to more sufficiently exploit training dataset. For the latter, we present our accurately re-annotated benchmark test set (Re-OIE2016) according to a series of linguistic observation and analysis. Then, we introduce a span model instead of previous adopted sequence labeling formulization for n-ary Open IE. Our newly introduced model achieves new state-of-the-art performance on both benchmark evaluation datasets.

Download Full-text

Deep learning for brain disorders: from data processing to disease treatment

Briefings in Bioinformatics ◽

10.1093/bib/bbaa310 ◽

2020 ◽

Author(s):

Ninon Burgos ◽

Simona Bottani ◽

Johann Faouzi ◽

Elina Thibeau-Sutre ◽

Olivier Colliot

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Language Processing ◽

State Of The Art ◽

Imaging Genetics ◽

Environmental Data ◽

Brain Disorders ◽

Disease Treatment ◽

Clinical Routine

Abstract In order to reach precision medicine and improve patients’ quality of life, machine learning is increasingly used in medicine. Brain disorders are often complex and heterogeneous, and several modalities such as demographic, clinical, imaging, genetics and environmental data have been studied to improve their understanding. Deep learning, a subpart of machine learning, provides complex algorithms that can learn from such various data. It has become state of the art in numerous fields, including computer vision and natural language processing, and is also growingly applied in medicine. In this article, we review the use of deep learning for brain disorders. More specifically, we identify the main applications, the concerned disorders and the types of architectures and data used. Finally, we provide guidelines to bridge the gap between research studies and clinical routine.

Download Full-text

Developing Core Technologies for Resource-Scarce Nguni Languages

Information ◽

10.3390/info12120520 ◽

2021 ◽

Vol 12 (12) ◽

pp. 520

Author(s):

Jakobus S. du Toit ◽

Martin J. Puttkammer

Keyword(s):

South African ◽

Language Processing ◽

Rule Based ◽

Local Resource ◽

African Languages ◽

Linguistic Resources ◽

Part Of Speech ◽

Parallel Data ◽

Further Development

The creation of linguistic resources is crucial to the continued growth of research and development efforts in the field of natural language processing, especially for resource-scarce languages. In this paper, we describe the curation and annotation of corpora and the development of multiple linguistic technologies for four official South African languages, namely isiNdebele, Siswati, isiXhosa, and isiZulu. Development efforts included sourcing parallel data for these languages and annotating each on token, orthographic, morphological, and morphosyntactic levels. These sets were in turn used to create and evaluate three core technologies, viz. a lemmatizer, part-of-speech tagger, morphological analyzer for each of the languages. We report on the quality of these technologies which improve on previously developed rule-based technologies as part of a similar initiative in 2013. These resources are made publicly accessible through a local resource agency with the intention of fostering further development of both resources and technologies that may benefit the NLP industry in South Africa.

Download Full-text

Can We Survive without Labelled Data in NLP? Transfer Learning for Open Information Extraction

Applied Sciences ◽

10.3390/app10175758 ◽

2020 ◽

Vol 10 (17) ◽

pp. 5758

Author(s):

Injy Sarhan ◽

Marco Spruit

Keyword(s):

Information Extraction ◽

Transfer Learning ◽

Language Processing ◽

Relation Extraction ◽

Training Data ◽

Open Information Extraction ◽

Comparable Performance ◽

Medical Domain ◽

Inductive Transfer Learning ◽

F Measure

Various tasks in natural language processing (NLP) suffer from lack of labelled training data, which deep neural networks are hungry for. In this paper, we relied upon features learned to generate relation triples from the open information extraction (OIE) task. First, we studied how transferable these features are from one OIE domain to another, such as from a news domain to a bio-medical domain. Second, we analyzed their transferability to a semantically related NLP task, namely, relation extraction (RE). We thereby contribute to answering the question: can OIE help us achieve adequate NLP performance without labelled data? Our results showed comparable performance when using inductive transfer learning in both experiments by relying on a very small amount of the target data, wherein promising results were achieved. When transferring to the OIE bio-medical domain, we achieved an F-measure of 78.0%, only 1% lower when compared to traditional learning. Additionally, transferring to RE using an inductive approach scored an F-measure of 67.2%, which was 3.8% lower than training and testing on the same task. Hereby, our analysis shows that OIE can act as a reliable source task.

Download Full-text

Automatic Multilingual Stopwords Identification from Very Small Corpora

Electronics ◽

10.3390/electronics10172169 ◽

2021 ◽

Vol 10 (17) ◽

pp. 2169

Author(s):

Stefano Ferilli

Keyword(s):

Natural Language Processing ◽

Language Processing ◽

State Of The Art ◽

Linguistic Knowledge ◽

Linguistic Resources ◽

Novel Approach ◽

Document Frequency ◽

Critical Problems ◽

Fully Automatic ◽

Local Languages

Tools for Natural Language Processing work using linguistic resources, that are language-specific. The complexity of building such resources causes many languages to lack them. So, learning them automatically from sample texts would be a desirable solution. This usually requires huge training corpora, which are not available for many local languages and jargons, lacking a wide literature. This paper focuses on stopwords, i.e., terms in a text which do not contribute in conveying its topic or content. It provides two main, inter-related and complementary, methodological contributions: (i) it proposes a novel approach based on term and document frequency to rank candidate stopwords, that works also on very small corpora (even single documents); and (ii) it proposes an automatic cutoff strategy to select the best candidates in the ranking, thus addressing one of the most critical problems in the stopword identification practice. Nice features of these approaches are that (i) they are generic and applicable to different languages, (ii) they are fully automatic, and (iii) they do not require any previous linguistic knowledge. Extensive experiments show that both are extremely effective and reliable. The former outperforms all comparable approaches in the state-of-the-art, both in terms of performance (Precision stays at 100% or nearly so for a large portion of the top-ranked candidate stopwords, while Recall is quite close to the maximum reachable in theory.) and in smooth behavior (Precision is monotonically decreasing, and Recall is monotonically increasing, allowing the experimenter to choose the preferred balance.). The latter is more flexible than existing solutions in the literature, requiring just one parameter intuitively related to the balance between Precision and Recall one wishes to obtain.

Download Full-text

Dependency parsing of Polish

Poznan Studies in Contemporary Linguistics ◽

10.1515/psicl-2019-0012 ◽

2019 ◽

Vol 55 (2) ◽

pp. 305-337 ◽

Cited By ~ 1

Author(s):

Alina Wróblewska ◽

Piotr Rybak

Keyword(s):

Language Processing ◽

Argument Structure ◽

Question Answering ◽

State Of The Art ◽

Dependency Parsing ◽

Dependency Theory ◽

Crucial Issue ◽

Experimental Part ◽

Predicate Argument Structure

Abstract The predicate-argument structure transparently encoded in dependency-based syntactic representations supports machine translation, question answering, information extraction, etc. The quality of dependency parsing is therefore a crucial issue in natural language processing. In the current paper we discuss the fundamental ideas of the dependency theory and provide an overview of selected dependency-based resources for Polish. Furthermore, we present some state-of-the-art dependency parsing systems whose models can be estimated on correctly annotated data. In the experimental part, we provide an in-depth evaluation of these systems on Polish data. Our results show that graph-based parsers, even those without any neural component, are better suited for Polish than transition-based parsing systems.

Download Full-text

RussianLanguage Thesauri: Automated Construction and Application For Natural Language Processing Tasks

Modeling and Analysis of Information Systems ◽

10.18255/1818-1015-2018-4-435-458 ◽

2018 ◽

Vol 25 (4) ◽

pp. 435-458

Author(s):

Nadezhda S. Lagutina ◽

Ksenia V. Lagutina ◽

Aleksey S. Adrianov ◽

Ilya V. Paramonov

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Russian Language ◽

Labor Costs ◽

Linguistic Resources ◽

Advantages And Disadvantages ◽

Text Corpora ◽

Linguistic Methods

The paper reviews the existing Russian-language thesauri in digital form and methods of their automatic construction and application. The authors analyzed the main characteristics of open access thesauri for scientific research, evaluated trends of their development, and their effectiveness in solving natural language processing tasks. The statistical and linguistic methods of thesaurus construction that allow to automate the development and reduce labor costs of expert linguists were studied. In particular, the authors considered algorithms for extracting keywords and semantic thesaurus relationships of all types, as well as the quality of thesauri generated with the use of these tools. To illustrate features of various methods for constructing thesaurus relationships, the authors developed a combined method that generates a specialized thesaurus fully automatically taking into account a text corpus in a particular domain and several existing linguistic resources. With the proposed method, experiments were conducted with two Russian-language text corpora from two subject areas: articles about migrants and tweets. The resulting thesauri were assessed by using an integrated assessment developed in the previous authors’ study that allows to analyze various aspects of the thesaurus and the quality of the generation methods. The analysis revealed the main advantages and disadvantages of various approaches to the construction of thesauri and the extraction of semantic relationships of different types, as well as made it possible to determine directions for future study.

Download Full-text

Indonesian Information Extraction : Challenges and Opportunities

JATISI (Jurnal Teknik Informatika dan Sistem Informasi) ◽

10.35957/jatisi.v8i1.710 ◽

2021 ◽

Vol 8 (1) ◽

pp. 421-429

Author(s):

Yan Puspitarani

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Information Extraction ◽

Language Processing ◽

Research Trends ◽

Structured Data ◽

Daily Lives ◽

Unstructured Text ◽

Challenges And Opportunities ◽

Data Source

Information extraction is part of natural language processing, aiming to find, retrieve, or process information. The data source for information extraction is text. Text cannot be separated from people's daily lives. Through text, a lot of confidential information can be obtained. To produce information, the unstructured text will be converted into structured data. There are many approaches that researchers take to this process. Most of the studies are in English. Therefore, this paper will present current research trends, challenges, and information extraction opportunities using Indonesian.

Download Full-text