Semi-Supervised Learning Using Hierarchical Mixture Models: Gene Essentiality Case Study

2021 ◽  
Vol 26 (2) ◽  
pp. 40
Author(s):  
Michael W. Daniels ◽  
Daniel Dvorkin ◽  
Rani K. Powers ◽  
Katerina Kechris

Integrating gene-level data is useful for predicting the role of genes in biological processes. This problem has typically been addressed through supervised classification, which requires large training sets of positive and negative examples. However, training data sets that are too small for supervised approaches can still provide valuable information. We describe a hierarchical mixture model that uses limited positively labeled gene training data for semi-supervised learning. We focus on the problem of predicting essential genes, where a gene is required for the survival of an organism under particular conditions. Using cross-validation, we found that including positively labeled samples in a semi-supervised learning framework with the hierarchical mixture model improves the detection of essential genes compared to unsupervised, supervised, and other semi-supervised approaches. Prediction performance also improved when genes were incorrectly assumed to be non-essential. Our comparisons indicate that incorporating even small amounts of existing knowledge improves the accuracy of prediction and decreases variability in predictions. Although we focused on gene essentiality, the hierarchical mixture model and semi-supervised framework generalize to problems focused on the prediction of genes or other features, with multiple data types characterizing the feature and a small set of positive labels.
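The core mechanism can be illustrated with a minimal sketch: a two-component Gaussian mixture fit by EM in which the responsibilities of the positively labeled genes are clamped to the "essential" component. This is a deliberate simplification of the paper's hierarchical mixture model (one data type, one dimension); the function name, its inputs, and the clamping rule are illustrative assumptions, not the authors' code.

```python
# Minimal sketch: semi-supervised EM for a two-component 1-D Gaussian
# mixture. Labeled positives have their responsibility clamped to 1, so the
# small positive training set steers the "essential" component.
import numpy as np

def semisupervised_em(x, pos_idx, n_iter=100):
    # Initialize the positive component around the labeled points
    mu = np.array([x[pos_idx].mean(), x.mean()])
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior probability that each gene is essential
        # (the 1/sqrt(2*pi) constant cancels in the normalization)
        dens = np.stack([
            pi[k] * np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2) / sigma[k]
            for k in range(2)
        ])
        resp = dens / dens.sum(axis=0)
        resp[0, pos_idx] = 1.0   # clamp known positives to the essential class
        resp[1, pos_idx] = 0.0
        # M-step: responsibility-weighted parameter updates
        for k in range(2):
            w = resp[k]
            mu[k] = (w * x).sum() / w.sum()
            sigma[k] = np.sqrt((w * (x - mu[k]) ** 2).sum() / w.sum() + 1e-9)
            pi[k] = w.mean()
    return resp[0]  # posterior essentiality score per gene
```

A call such as `semisupervised_em(gene_scores, known_essential_idx)` would then return a posterior essentiality score for every gene, with the labeled examples anchoring the meaning of the positive component.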

2021 ◽  
Author(s):  
Luciano Serafini ◽  
Artur d’Avila Garcez ◽  
Samy Badreddine ◽  
Ivan Donadello ◽  
Michael Spranger ◽  
...  

The recent availability of large-scale data combining multiple data modalities has opened various research and commercial opportunities in Artificial Intelligence (AI). Machine Learning (ML) has achieved important results in this area, mostly by adopting sub-symbolic distributed representations. It is now generally accepted that such purely sub-symbolic approaches can be data inefficient and struggle at extrapolation and reasoning. By contrast, symbolic AI is based on rich, high-level representations, ideally built on human-readable symbols. Despite being more explainable and successful at reasoning, symbolic AI usually struggles when faced with incomplete knowledge, inaccurate or large data sets, and combinatorial knowledge. Neurosymbolic AI attempts to benefit from the strengths of both approaches, combining reasoning over complex representations of knowledge with efficient learning from multiple data modalities. Hence, neurosymbolic AI seeks to ground rich knowledge into efficient sub-symbolic representations and to explain sub-symbolic representations and deep learning by offering high-level symbolic descriptions of such learning systems. Logic Tensor Networks (LTN) are a neurosymbolic AI system for querying, learning, and reasoning with rich data and abstract knowledge. LTN introduces Real Logic, a fully differentiable first-order language with concrete semantics, such that every symbolic expression has an interpretation grounded onto real numbers in the domain. In particular, LTN converts Real Logic formulas into computational graphs that enable gradient-based optimization. This chapter presents the LTN framework and illustrates its use on knowledge completion tasks to ground the relational predicates (symbols) into a concrete interpretation (vectors and tensors). It then investigates the use of LTN on semi-supervised learning, learning of embeddings, and reasoning. LTN has recently been applied to many important AI tasks, including semantic image interpretation, ontology learning and reasoning, and reinforcement learning, which use LTN for supervised classification, data clustering, semi-supervised learning, embedding learning, reasoning, and query answering. The chapter presents some of the main recent applications of LTN before analyzing results in the context of related work and discussing the next steps for neurosymbolic AI and LTN-based AI models.
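The grounding idea can be sketched in a few lines of PyTorch. This is not the LTN library's actual API: the `Predicate` class, the choice of product t-norm and Reichenbach implication, and the mean aggregator for the universal quantifier are assumptions chosen to show how formulas become differentiable graphs whose satisfaction is maximized by gradient descent.

```python
# Sketch of the Real Logic idea: predicates are grounded as neural networks
# with truth values in [0, 1], connectives become differentiable fuzzy
# operators, and the satisfaction of a knowledge base is maximized directly.
import torch
import torch.nn as nn

class Predicate(nn.Module):
    """Grounds a unary predicate P(x) as a truth value in [0, 1]."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 16), nn.ReLU(),
                                 nn.Linear(16, 1), nn.Sigmoid())
    def forward(self, x):
        return self.net(x).squeeze(-1)

def Not(a):        return 1.0 - a            # standard fuzzy negation
def And(a, b):     return a * b              # product t-norm
def Implies(a, b): return 1.0 - a + a * b    # Reichenbach implication
def Forall(a):     return a.mean()           # mean aggregator for quantifiers

# Toy knowledge base: (forall x in A: P(x)) and (forall x in B: P(x) -> Q(x))
P, Q = Predicate(2), Predicate(2)
A, B = torch.randn(64, 2), torch.randn(64, 2)
opt = torch.optim.Adam(list(P.parameters()) + list(Q.parameters()), lr=0.01)
for _ in range(200):
    sat = And(Forall(P(A)), Forall(Implies(P(B), Q(B))))
    loss = 1.0 - sat   # maximizing satisfaction = minimizing 1 - sat
    opt.zero_grad(); loss.backward(); opt.step()
```

After training, querying `P` or `Q` on new points amounts to evaluating the learned groundings, which is the sense in which the same object supports learning, querying, and (approximate) reasoning.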


2018 ◽  
Vol 10 (2) ◽  
pp. 186-206 ◽  
Author(s):  
Kevin van Hecke ◽  
Guido de Croon ◽  
Laurens van der Maaten ◽  
Daniel Hennes ◽  
Dario Izzo

Self-supervised learning is a reliable learning mechanism in which a robot uses an original, trusted sensor cue for training to recognize an additional, complementary sensor cue. We study, for the first time in self-supervised learning, how a robot's learning behavior should be organized so that the robot can keep performing its task if the original cue becomes unavailable. We study this persistent form of self-supervised learning in the context of a flying robot that has to avoid obstacles based on distance estimates from the visual cue of stereo vision. Over time it will learn to also estimate distances based on monocular appearance cues. A strategy is introduced that has the robot switch from flight based on stereo vision to flight based on monocular vision, with stereo vision used purely as "training wheels" to avoid imminent collisions. This strategy is shown to be an effective approach to the "feedback-induced data bias" problem as also experienced in learning from demonstration. Both simulations and real-world experiments with a stereo-vision-equipped ARDrone2 show the feasibility of this approach, with the robot successfully using monocular vision to avoid obstacles in a 5 × 5 m room. The experiments show the potential of persistent self-supervised learning as a robust learning approach to enhance the capabilities of robots. Moreover, the abundant training data coming from the robot's own sensors allow the gathering of the large data sets necessary for deep learning approaches.
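A schematic sketch of this setup follows, under stated assumptions: `sensor_stream` and `avoid_obstacles` are hypothetical stand-ins for the drone's data source and controller, and the tiny network, input size, and switching threshold are illustrative, not the paper's architecture.

```python
# Schematic persistent self-supervised learning: stereo distance estimates
# supervise a monocular model; once monocular error is low enough, the robot
# switches, keeping stereo only as "training wheels" against collisions.
import torch
import torch.nn as nn

mono_model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128),
                           nn.ReLU(), nn.Linear(128, 1))  # appearance -> distance
opt = torch.optim.Adam(mono_model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
SWITCH_THRESHOLD = 0.25  # assumed acceptable error (meters); illustrative

use_monocular = False
for frame, stereo_dist in sensor_stream():       # hypothetical data source
    pred = mono_model(frame)
    loss = loss_fn(pred, stereo_dist)            # trusted cue labels the new cue
    opt.zero_grad(); loss.backward(); opt.step()
    if loss.item() < SWITCH_THRESHOLD:
        use_monocular = True                     # fly on monocular estimates
    control_dist = pred.detach() if use_monocular else stereo_dist
    avoid_obstacles(control_dist)                # hypothetical controller
```

The key design point is that the switch changes which cue drives control, and thereby which states the robot visits, which is exactly where the feedback-induced data bias discussed above arises.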


Images generated today from a variety of sources can be difficult for a user to compare for similarity, or to analyze for further use, because of their differing segmentation policies. This inconsistency can introduce many errors, rendering traditional methodologies such as supervised learning techniques less effective, since they require a huge quantity of labelled training data that mirrors the desired target data. This paper therefore puts forward an alternative technique, transfer learning, for use in image diagnosis, so that efficiency and accuracy can be achieved. This mechanism handles variation between the desired and the actual training data, as well as outlier sensitivity, and ultimately enhances predictions by giving better results in various areas, leaving the traditional methodologies behind. The analysis discusses three types of transfer classifiers that can be applied using only a small volume of training data, contrasting them with the traditional method, which requires huge quantities of training data whose attributes differ only slightly. The three classifiers were compared with one another, and against the traditional methodology, on a very common application used in daily life. Commonly occurring problems such as outlier sensitivity were also taken into consideration, and measures were taken to recognize and mitigate them. The results showed that, given only a small amount of representative training data, the performance of transfer learning exceeds that of conventional supervised learning approaches, reducing stratification errors to a great extent.
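One standard way to realize the transfer-learning setup described above is to reuse a network pretrained on a large source dataset and retrain only its final layer on the small target set. The sketch below does this with a torchvision ResNet-18; `small_target_loader` and the two-class head are assumptions for illustration, not the paper's specific transfer classifiers.

```python
# Hedged sketch: freeze pretrained features, retrain only the classifier head
# on a small labeled target set.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False                      # freeze pretrained backbone
model.fc = nn.Linear(model.fc.in_features, 2)    # new head for the target task

opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for images, labels in small_target_loader:       # hypothetical small data set
    loss = loss_fn(model(images), labels)
    opt.zero_grad(); loss.backward(); opt.step()
```

Because only the head's few thousand parameters are trained, a few hundred labeled target images can suffice where training from scratch would demand orders of magnitude more.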


2012 ◽  
Vol 51 (04) ◽  
pp. 341-347 ◽  
Author(s):  
F. Mulas ◽  
L. Zagar ◽  
B. Zupan ◽  
R. Bellazzi

Summary
Objective: The assessment of the developmental potential of stem cells is a crucial step towards their clinical application in regenerative medicine. It has been demonstrated that genome-wide expression profiles can predict the cellular differentiation stage by means of dimensionality reduction methods. Here we show that these techniques can be further strengthened to support decision making with i) a novel strategy for gene selection and ii) methods for combining the evidence from multiple data sets.
Methods: We propose to exploit dimensionality reduction methods for the selection of genes specifically activated in different stages of differentiation. To obtain an integrated predictive model, the expression values of the selected genes from multiple data sets are combined. We investigated distinct approaches that either aggregate data sets or use learning ensembles.
Results: We analyzed the performance of the proposed methods on six publicly available data sets. The selection procedure identified a reduced subset of genes whose expression values gave rise to an accurate stage prediction. The assessment of predictive accuracy demonstrated a high quality of predictions for most of the data integration methods presented.
Conclusion: The experimental results highlight the main potential of the proposed approaches: the ability to predict the true staging by combining multiple training data sets when this could not be inferred from a single data source, and the ability to focus the analysis on a reduced list of genes of similar predictive performance.
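A simplified reading of the gene-selection and aggregation steps can be sketched as below. This is an assumption-laden illustration, not the paper's exact procedure: genes are ranked by the magnitude of their PCA loadings, and data sets sharing the same gene columns are stacked for an integrated model.

```python
# Sketch: loading-based gene selection followed by data-set aggregation.
import numpy as np
from sklearn.decomposition import PCA

def select_genes(X, n_components=3, n_genes=50):
    """X: samples x genes expression matrix; return indices of top genes."""
    pca = PCA(n_components=n_components).fit(X)
    score = np.abs(pca.components_).sum(axis=0)   # aggregate loading per gene
    return np.argsort(score)[::-1][:n_genes]

def combine(datasets, genes):
    """Aggregate multiple data sets restricted to the selected genes."""
    return np.vstack([X[:, genes] for X in datasets])
```

The learning-ensemble alternative mentioned in the Methods would instead train one predictor per data set and merge their stage predictions, rather than merging the expression matrices themselves.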


2017 ◽  
Vol 34 (2) ◽  
pp. 201-228 ◽  
Author(s):  
Rana Hasan ◽  
Yi Jiang ◽  
Radine Michelle Rafols

Combining multiple data sets for India, we estimate the elasticity of wages with respect to town population and density to be between 1% and 2%, which is smaller than estimates in the literature based on district-level analysis. We also find that the employment share of firms with 10 or more workers—which typically describes firms that operate in the formal sector—is positively associated with city population and negatively associated with city density. Town characteristics such as infrastructure availability, geographic location, educational services, and industrial structure also play a role in explaining city productivity and the presence of relatively large firms. Overall, we interpret our results to suggest that there is scope to realize urbanization's potential more fully by addressing issues related to urban planning, infrastructure, and public service delivery, as has been emphasized previously by observers of Indian urbanization.


2018 ◽  
Vol 37 (9) ◽  
pp. 654-654
Author(s):  
Kyle Spikes ◽  
Yongyi Li

As the volume of seismic and other data types continues to increase, the use of such data sets has extended to different approaches and techniques of data integration and interpretation. The intent of this special section on cross-disciplinary applications of geophysics is to highlight such uses of multiple data types. Although not limited to any type or location of a given reservoir, the two articles in this section primarily focus on onshore unconventional reservoirs. Nonetheless, the techniques and approaches will also be of interest to readers and practitioners who deal with conventional reservoirs, both onshore and offshore.


Author(s):  
Seema Ansari ◽  
Radha Mohanlal ◽  
Javier Poncela ◽  
Adeel Ansari ◽  
Komal Mohanlal

Combining vast amounts of heterogeneous data and increasing the processing power of existing database management tools is undoubtedly an emerging need of the IT industry in the coming years. The complexity and size of the data sets that need to be acquired, analyzed, stored, sorted, or transferred have spiked in recent years. Because of the tremendously increasing volume of multiple data types, creating Big Data applications that can extract the valuable trends and relationships required for further processing, or derive useful results, is quite a challenging task. Companies, corporate organizations, and government agencies alike need to analyze and execute Big Data implementations to pave new paths of productivity and innovation. This chapter discusses the emerging technology of the modern era, Big Data, with a detailed description of the three V's (Variety, Velocity, and Volume). Subsequent chapters enable the reader to understand the concepts of data mining and Big Data analysis, and the potential of Big Data in five domains: healthcare, the public sector, retail, manufacturing, and personal location data.


2019 ◽  
Vol 5 ◽  
pp. e242
Author(s):  
Hyukjun Gweon ◽  
Matthias Schonlau ◽  
Stefan H. Steiner

Multi-label classification is a type of supervised learning where an instance may belong to multiple labels simultaneously. Predicting each label independently has been criticized for not exploiting any correlation between labels. In this article, we propose a novel approach, Nearest Labelset using Double Distances (NLDD), that predicts the labelset observed in the training data that minimizes a weighted sum of the distances in both the feature space and the label space to the new instance. The weights specify the relative tradeoff between the two distances and are estimated from a binomial regression of the number of misclassified labels as a function of the two distances. Model parameters are estimated by maximum likelihood. NLDD only considers labelsets observed in the training data, thus implicitly taking label dependencies into account. Experiments on benchmark multi-label data sets show that the proposed method on average outperforms other well-known approaches in terms of 0/1 loss and multi-label accuracy, and ranks second on the F-measure (after a method called ECC) and on Hamming loss (after a method called RF-PCT).
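The prediction rule condenses to a few lines, as sketched below. The weights, which the paper estimates by binomial regression on the number of misclassified labels, are taken as given here, and `p_new` stands in for the per-label probabilistic predictions of an assumed base model.

```python
# Condensed sketch of the NLDD prediction step: choose the observed training
# labelset minimizing a weighted sum of feature- and label-space distances.
import numpy as np

def nldd_predict(x_new, p_new, X_train, Y_train, w_x=1.0, w_y=1.0):
    """
    x_new:   feature vector of the new instance
    p_new:   per-label predicted probabilities for the new instance
    X_train: n x d training features;  Y_train: n x L binary labelsets
    """
    d_x = np.linalg.norm(X_train - x_new, axis=1)   # feature-space distance
    d_y = np.linalg.norm(Y_train - p_new, axis=1)   # label-space distance
    best = np.argmin(w_x * d_x + w_y * d_y)
    return Y_train[best]   # always an observed labelset -> label dependencies
```

Restricting the output to labelsets that actually occur in the training data is what lets the method respect label correlations without modeling them explicitly.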


2012 ◽  
Vol 9 (4) ◽  
pp. 1513-1532 ◽  
Author(s):  
Xue Zhang ◽  
Wangxin Xiao

In order to address the insufficient training data problem, many active semi-supervised algorithms have been proposed. The self-labeled training data in semi-supervised learning may contain much noise due to the insufficient training data. Such noise may snowball in the subsequent learning process and thus hurt the generalization ability of the final hypothesis. The extremely few labeled training data available in sparsely labeled text classification aggravate this situation. If such noise could be identified and removed by some strategy, the performance of active semi-supervised algorithms should improve. However, techniques for identifying and removing noise have seldom been explored in existing active semi-supervised algorithms. In this paper, we propose an active semi-supervised framework with data editing (which we call ASSDE) to improve sparsely labeled text classification. A data editing technique is used to identify and remove noise introduced by semi-supervised labeling. We carry out the data editing technique by fully utilizing the advantage of active learning, which, to our knowledge, is novel. The fusion of active learning with data editing makes ASSDE more robust to the sparsity and the distribution bias of the training data. It further simplifies the design of semi-supervised learning, which makes ASSDE more efficient. An extensive experimental study on several real-world text data sets shows the encouraging results of the proposed framework for sparsely labeled text classification, compared with several state-of-the-art methods.
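One concrete form a data-editing step can take is an edited-nearest-neighbor rule, sketched below: self-labeled texts whose pseudo-label disagrees with their labeled neighbors are dropped before retraining. This is an illustrative stand-in; the paper's ASSDE couples editing with active learning, which this fragment omits.

```python
# Sketch of a data-editing step for self-training: keep only self-labeled
# points whose pseudo-label agrees with a k-NN vote over the labeled data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def edit_self_labeled(X_lab, y_lab, X_self, y_self, k=5):
    """Filter noisy self-labeled examples via neighbor-majority agreement."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_lab, y_lab)
    agree = knn.predict(X_self) == y_self   # neighbor-majority check
    return X_self[agree], y_self[agree]     # edited, lower-noise additions
```

Filtering before each retraining round is what keeps labeling noise from snowballing across self-training iterations, which is the failure mode the abstract describes.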

