Self-Supervised Chinese Ontology Learning from Online Encyclopedias

The Scientific World JOURNAL ◽

10.1155/2014/848631 ◽

2014 ◽

Vol 2014 ◽

pp. 1-13 ◽

Cited By ~ 6

Author(s):

Fanghuai Hu ◽

Zhiqing Shao ◽

Tong Ruan

Keyword(s):

Machine Learning ◽

Structural Information ◽

Relation Extraction ◽

Knowledge Bases ◽

The Self ◽

Supervised Machine Learning ◽

Ontology Learning ◽

High Coverage ◽

Category Labels ◽

Training Examples

Constructing ontology manually is a time-consuming, error-prone, and tedious task. We present SSCO, a self-supervised learning based chinese ontology, which contains about 255 thousand concepts, 5 million entities, and 40 million facts. We explore the three largest online Chinese encyclopedias for ontology learning and describe how to transfer the structured knowledge in encyclopedias, including article titles, category labels, redirection pages, taxonomy systems, and InfoBox modules, into ontological form. In order to avoid the errors in encyclopedias and enrich the learnt ontology, we also apply some machine learning based methods. First, we proof that the self-supervised machine learning method is practicable in Chinese relation extraction (at least for synonymy and hyponymy) statistically and experimentally and train some self-supervised models (SVMs and CRFs) for synonymy extraction, concept-subconcept relation extraction, and concept-instance relation extraction; the advantages of our methods are that all training examples are automatically generated from the structural information of encyclopedias and a few general heuristic rules. Finally, we evaluate SSCO in two aspects, scale and precision; manual evaluation results show that the ontology has excellent precision, and high coverage is concluded by comparing SSCO with other famous ontologies and knowledge bases; the experiment results also indicate that the self-supervised models obviously enrich SSCO.

Download Full-text

Entity Type Recognition for Heterogeneous Semantic Graphs

AI Magazine ◽

10.1609/aimag.v36i1.2569 ◽

2015 ◽

Vol 36 (1) ◽

pp. 75-86 ◽

Cited By ~ 4

Author(s):

Jennifer Sleeman ◽

Tim Finin ◽

Anupam Joshi

Keyword(s):

Machine Learning ◽

Background Knowledge ◽

Knowledge Bases ◽

Heterogeneous Data ◽

Unstructured Data ◽

Supervised Machine Learning ◽

Coreference Resolution ◽

Multiple Sources ◽

Fine Grained ◽

High Level

We describe an approach for identifying fine-grained entity types in heterogeneous data graphs that is effective for unstructured data or when the underlying ontologies or semantic schemas are unknown. Identifying fine-grained entity types, rather than a few high-level types, supports coreference resolution in heterogeneous graphs by reducing the number of possible coreference relations that must be considered. Big data problems that involve integrating data from multiple sources can benefit from our approach when the datas ontologies are unknown, inaccessible or semantically trivial. For such cases, we use supervised machine learning to map entity attributes and relations to a known set of attributes and relations from appropriate background knowledge bases to predict instance entity types. We evaluated this approach in experiments on data from DBpedia, Freebase, and Arnetminer using DBpedia as the background knowledge base.

Download Full-text

Detecting Controversial Articles on Citizen Journalism

Jurnal Ilmu Komputer dan Informasi ◽

10.21609/jiki.v11i1.478 ◽

2018 ◽

Vol 11 (1) ◽

pp. 34

Author(s):

Alfan Farizki Wicaksono ◽

Sharon Raissa Herdiyana ◽

Mirna Adriani

Keyword(s):

Machine Learning ◽

Structural Information ◽

Structural Features ◽

The Body ◽

Supervised Machine Learning ◽

Citizen Journalism ◽

Learning Approach ◽

Daily News ◽

Machine Learning Approach ◽

Controversial Topic

Someone's understanding and stance on a particular controversial topic can be influenced by daily news or articles he consume everyday. Unfortunately, readers usually do not realize that they are reading controversial articles. In this paper, we address the problem of automatically detecting controversial article from citizen journalism media. To solve the problem, we employ a supervised machine learning approach with several hand-crafted features that exploits linguistic information, meta-data of an article, structural information in the commentary section, and sentiment expressed inside the body of an article. The experimental results shows that our proposed method manages to perform the addressed task effectively. The best performance so far is achieved when we use all proposed feature with Logistic Regression as our model (82.89\% in terms of accuracy). Moreover, we found that information from commentary section (structural features) contributes most to the classification task.

Download Full-text

Machine learning deciphers structural features of RNA duplexes measured with solution X-ray scattering

IUCrJ ◽

10.1107/s2052252520008830 ◽

2020 ◽

Vol 7 (5) ◽

pp. 870-880

Author(s):

Yen-Lin Chen ◽

Lois Pollack

Keyword(s):

Machine Learning ◽

Structural Information ◽

Structural Parameters ◽

Supervised Machine Learning ◽

Scattering Data ◽

Length Scales ◽

X Ray ◽

X Ray Scattering ◽

Extreme Gradient Boosting ◽

Ray Scattering

Macromolecular structures can be determined from solution X-ray scattering. Small-angle X-ray scattering (SAXS) provides global structural information on length scales of 10s to 100s of Ångstroms, and many algorithms are available to convert SAXS data into low-resolution structural envelopes. Extension of measurements to wider scattering angles (WAXS or wide-angle X-ray scattering) can sharpen the resolution to below 10 Å, filling in structural details that can be critical for biological function. These WAXS profiles are especially challenging to interpret because of the significant contribution of solvent in addition to solute on these smaller length scales. Based on training with molecular dynamics generated models, the application of extreme gradient boosting (XGBoost) is discussed, which is a supervised machine learning (ML) approach to interpret features in solution scattering profiles. These ML methods are applied to predict key structural parameters of double-stranded ribonucleic acid (dsRNA) duplexes. Duplex conformations vary with salt and sequence and directly impact the foldability of functional RNA molecules. The strong structural periodicities in these duplexes yield scattering profiles with rich sets of features at intermediate-to-wide scattering angles. In the ML models, these profiles are treated as 1D images or features. These ML models identify specific scattering angles, or regions of scattering angles, which correspond with and successfully predict distinct structural parameters. Thus, this work demonstrates that ML strategies can integrate theoretical molecular models with experimental solution scattering data, providing a new framework for extracting highly relevant structural information from solution experiments on biological macromolecules.

Download Full-text

Analysis and Improvement of Minimally Supervised Machine Learning for Relation Extraction

Natural Language Processing and Information Systems - Lecture Notes in Computer Science ◽

10.1007/978-3-642-12550-8_2 ◽

2010 ◽

pp. 8-23 ◽

Cited By ~ 5

Author(s):

Hans Uszkoreit ◽

Feiyu Xu ◽

Hong Li

Keyword(s):

Machine Learning ◽

Relation Extraction ◽

Supervised Machine Learning ◽

Minimally Supervised

Download Full-text

HIERARCHICAL-INTERPOLATIVE FUZZY SYSTEM CONSTRUCTION BY GENETIC AND BACTERIAL MEMETIC PROGRAMMING APPROACHES

International Journal of Uncertainty Fuzziness and Knowledge-Based Systems ◽

10.1142/s021848851240017x ◽

2012 ◽

Vol 20 (supp02) ◽

pp. 105-131 ◽

Cited By ~ 6

Author(s):

KRISZTIÁN BALÁZS ◽

LÁSZLÓ T. KÓCZY

Keyword(s):

Machine Learning ◽

System Modeling ◽

Fuzzy Rule ◽

Knowledge Bases ◽

Learning System ◽

Supervised Machine Learning ◽

Rule Base ◽

System Construction ◽

Rule Bases ◽

Programming Algorithms

In this paper a family of new methods are proposed for constructing hierarchical-interpolative fuzzy rule bases in the frame of a fuzzy rule based supervised machine learning system modeling black box systems defined by input-output pairs. The resulting hierarchical rule base is constructed by using structure building pure evolutionary and memetic techniques, namely, Genetic and Bacterial Programming Algorithms and their memetic variants containing local search steps. Applying hierarchical-interpolative fuzzy rule bases is a rather efficient way of reducing the complexity of knowledge bases, whereas evolutionary methods (including memetic techniques) ensure a relatively fast convergence in the learning process. As it is presented in the paper, by applying a newly proposed representation schema these approaches can be combined to form hierarchical-interpolative machine learning systems.

Download Full-text

Exploring the Use of Machine Learning to Automate the Qualitative Coding of Church-related Tweets

Fieldwork in Religion ◽

10.1558/firn.40610 ◽

2020 ◽

Vol 14 (2) ◽

pp. 140-159

Author(s):

Anthony-Paul Cooper ◽

Emmanuel Awuni Kolog ◽

Erkki Sutinen

Keyword(s):

Machine Learning ◽

Online Community ◽

High Volume ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Social Media Data ◽

Twitter Data ◽

Resource Intensity ◽

Media Data ◽

Better Than

This article builds on previous research around the exploration of the content of church-related tweets. It does so by exploring whether the qualitative thematic coding of such tweets can, in part, be automated by the use of machine learning. It compares three supervised machine learning algorithms to understand how useful each algorithm is at a classification task, based on a dataset of human-coded church-related tweets. The study finds that one such algorithm, Naïve-Bayes, performs better than the other algorithms considered, returning Precision, Recall and F-measure values which each exceed an acceptable threshold of 70%. This has far-reaching consequences at a time where the high volume of social media data, in this case, Twitter data, means that the resource-intensity of manual coding approaches can act as a barrier to understanding how the online community interacts with, and talks about, church. The findings presented in this article offer a way forward for scholars of digital theology to better understand the content of online church discourse.

Download Full-text

Application of Supervised Machine Learning Algorithms for Lithofacies Classification.

10.2523/19349-ms ◽

2019 ◽

Author(s):

Subhadeep Sarkar ◽

Chandan Majumdar

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Lithofacies Classification

Download Full-text

Scalable Approach to High Coverages on Oxides via Iterative Training of a Machine-Learning Algorithm

10.26434/chemrxiv.10288514.v1 ◽

2019 ◽

Author(s):

Andrew Medford ◽

Shengchun Yang ◽

Fuzhu Liu

Keyword(s):

Machine Learning ◽

Chemical Potential ◽

Learning Algorithm ◽

Absolute Error ◽

Low Energy ◽

Training Data ◽

High Coverage ◽

Metal Compounds ◽

Adsorption Energies ◽

The Stability

Understanding the interaction of multiple types of adsorbate molecules on solid surfaces is crucial to establishing the stability of catalysts under various chemical environments. Computational studies on the high coverage and mixed coverages of reaction intermediates are still challenging, especially for transition-metal compounds. In this work, we present a framework to predict differential adsorption energies and identify low-energy structures under high- and mixed-adsorbate coverages on oxide materials. The approach uses Gaussian process machine-learning models with quantified uncertainty in conjunction with an iterative training algorithm to actively identify the training set. The framework is demonstrated for the mixed adsorption of CHx, NHx and OHx species on the oxygen vacancy and pristine rutile TiO2(110) surface sites. The results indicate that the proposed algorithm is highly efficient at identifying the most valuable training data, and is able to predict differential adsorption energies with a mean absolute error of ~0.3 eV based on <25% of the total DFT data. The algorithm is also used to identify 76% of the low-energy structures based on <30% of the total DFT data, enabling construction of surface phase diagrams that account for high and mixed coverage as a function of the chemical potential of C, H, O, and N. Furthermore, the computational scaling indicates the algorithm scales nearly linearly (N1.12) as the number of adsorbates increases. This framework can be directly extended to metals, metal oxides, and other materials, providing a practical route toward the investigation of the behavior of catalysts under high-coverage conditions.

Download Full-text

Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition

10.26434/chemrxiv.5513581.v1 ◽

2017 ◽

Author(s):

Sabrina Jaeger ◽

Simone Fulle ◽

Samo Turk

Keyword(s):

Machine Learning ◽

Language Processing ◽

Supervised Machine Learning ◽

Learning Approach ◽

Learning Approaches ◽

Unsupervised Machine Learning ◽

Feature Representations ◽

Machine Learning Approach ◽

The Individual ◽

Vector Representations

Inspired by natural language processing techniques we here introduce Mol2vec which is an unsupervised machine learning approach to learn vector representations of molecular substructures. Similarly, to the Word2vec models where vectors of closely related words are in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that are pointing in similar directions for chemically related substructures. Compounds can finally be encoded as vectors by summing up vectors of the individual substructures and, for instance, feed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as reference compound representation. Mol2vec can be easily combined with ProtVec, which employs the same Word2vec concept on protein sequences, resulting in a proteochemometric approach that is alignment independent and can be thus also easily used for proteins with low sequence similarities.

Download Full-text

A Deep Analysis and Efficient Implementation of Supervised Machine Learning Algorithms for Enhancing The Classification Ability of System

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v7i3.10941101 ◽

2019 ◽

Vol 7 (3) ◽

pp. 1094-1101

Author(s):

Sandeep Kumar Verma ◽

Turendar Sahu ◽

Manjit Jaiswal

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Efficient Implementation ◽

Machine Learning Algorithms ◽

Supervised Machine Learning

Download Full-text