Structure motif–centric learning framework for inorganic crystalline systems

The multiple-instance problem is a difficult machine learning problem that arises when knowledge about training examples is incomplete. In this problem, the teacher labels examples that are sets (also called bags) of instances, but does not label whether an individual instance in a bag is positive or negative. The learning algorithm must generate a classifier that correctly classifies unseen examples (i.e., bags of instances). This learning framework is receiving growing attention in the machine learning community, and since it was introduced by Dietterich, Lathrop, and Lozano-Perez (1997), a wide range of tasks have been formulated as multi-instance problems. These tasks include content-based image retrieval (Chen, Bi, & Wang, 2006) and annotation (Qi & Han, 2007), text categorization (Andrews, Tsochantaridis, & Hofmann, 2002), web index page recommendation (Zhou, Jiang, & Li, 2005; Xue, Han, Jiang, & Zhou, 2007), and drug activity prediction (Dietterich et al., 1997; Zhou & Zhang, 2007). In this chapter we introduce MOG3P-MI, a multiobjective grammar-guided genetic programming algorithm for multi-instance problems. In this algorithm, which is based on SPEA2, individuals represent classification rules that determine whether a bag is positive or negative. The quality of each individual is evaluated according to two indexes, sensitivity and specificity, both adapted to the multi-instance learning (MIL) setting. Computational experiments show that MOG3P-MI is a robust classification algorithm across different domains: it achieves competitive results and yields classifiers made of simple rules, which adds comprehensibility and simplicity to the knowledge discovery process and makes it a suitable method for solving MIL problems (Zafra & Ventura, 2007).
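
The bag-level evaluation described above can be illustrated with a minimal Python sketch. It assumes the standard multi-instance hypothesis (a bag is positive if at least one of its instances satisfies the rule), which the abstract does not spell out; the names `classify_bag`, `sensitivity_specificity`, and the toy rule are hypothetical and are not taken from MOG3P-MI.

```python
from typing import Callable, List, Sequence, Tuple

Instance = Sequence[float]
Bag = Sequence[Instance]


def classify_bag(bag: Bag, rule: Callable[[Instance], bool]) -> bool:
    """Assumed standard MI hypothesis: positive if any instance satisfies the rule."""
    return any(rule(instance) for instance in bag)


def sensitivity_specificity(
    bags: List[Bag], labels: List[bool], rule: Callable[[Instance], bool]
) -> Tuple[float, float]:
    """Compute sensitivity and specificity over bags, not over single instances."""
    tp = fn = tn = fp = 0
    for bag, label in zip(bags, labels):
        predicted = classify_bag(bag, rule)
        if label and predicted:
            tp += 1
        elif label and not predicted:
            fn += 1
        elif not label and not predicted:
            tn += 1
        else:
            fp += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Toy data: each instance has one feature; the hypothetical rule fires above 0.5.
def toy_rule(instance: Instance) -> bool:
    return instance[0] > 0.5


bags = [
    [(0.1,), (0.9,)],  # positive bag, one instance fires
    [(0.2,), (0.3,)],  # positive bag, no instance fires (missed)
    [(0.0,), (0.4,)],  # negative bag, no instance fires
    [(0.6,)],          # negative bag, one instance fires (false alarm)
]
labels = [True, True, False, False]

sens, spec = sensitivity_specificity(bags, labels, toy_rule)
print(sens, spec)
```

In a multiobjective setting, the two values would be kept as separate objectives rather than merged into one score.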

Harnessing machine learning to boost heuristic strategies for phylogenetic-tree search

10.21203/rs.3.rs-48247/v1 ◽

2020 ◽

Author(s):

Dana Azouri ◽

Shiran Abadi ◽

Yishay Mansour ◽

Itay Mayrose ◽

Tal Pupko

Keyword(s):

Machine Learning ◽

Phylogenetic Tree ◽

Learning Algorithm ◽

Search Space ◽

Large Set ◽

Tree Search ◽

Learning Approaches ◽

Tree Reconstruction ◽

Heuristic Strategies ◽

Tree Inference

Inferring a phylogenetic tree, which describes the evolutionary relationships among a set of organisms, genes, or genomes, is a fundamental step in numerous evolutionary studies. With the aim of making tree inference feasible for problems involving more than a handful of sequences, current algorithms for phylogenetic tree reconstruction utilize various heuristic approaches. Such approaches rely on performing costly likelihood optimizations, and thus evaluate only a subset of all potential trees. Consequently, all existing methods suffer from the known tradeoff between accuracy and running time. Here, we train a machine-learning algorithm over an extensive cohort of empirical data to predict the neighboring trees that increase the likelihood, without actually computing their likelihood. This provides a means to safely discard a large set of the search space, thus avoiding numerous expensive likelihood computations. Our analyses suggest that machine-learning approaches can make heuristic tree searches substantially faster without losing accuracy and thus could be incorporated for narrowing down the examined neighboring trees of each intermediate tree in any tree search methodology.

A general-purpose machine learning framework for predicting properties of inorganic materials

npj Computational Materials ◽

10.1038/npjcompumats.2016.28 ◽

2016 ◽

Vol 2 (1) ◽

Cited By ~ 326

Author(s):

Logan Ward ◽

Ankit Agrawal ◽

Alok Choudhary ◽

Christopher Wolverton

Keyword(s):

Machine Learning ◽

General Purpose ◽

Inorganic Materials ◽

Learning Framework

Estimating nitrogen and phosphorus concentrations in streams and rivers across the contiguous United States: a machine learning framework

10.7287/peerj.preprints.27585 ◽

2019 ◽

Author(s):

Longzhu Shen ◽

Giuseppe Amatulli ◽

Tushar Sethi ◽

Peter Raymond ◽

Sami Domisch

Keyword(s):

Machine Learning ◽

Large Scale ◽

Learning Algorithm ◽

External Validation ◽

Anthropogenic Activity ◽

Spatial And Temporal Variability ◽

Nitrogen And Phosphorus ◽

Learning Framework ◽

Environmental Models ◽

Improved Accuracy

Nitrogen (N) and Phosphorus (P) are essential nutrients for life processes in water bodies but in excessive quantities, they are a significant source of aquatic pollution. Eutrophication has now become widespread due to such an imbalance, and is largely attributed to anthropogenic activity. In view of this phenomenon, we present a new dataset and statistical method for estimating and mapping elemental and compound concentrations of N and P at a resolution of 30 arc-seconds (∼1 km) for the conterminous US. The model is based on a Random Forest (RF) machine learning algorithm that was fitted with environmental variables and seasonal N and P concentration observations from 230,000 stations spanning across US stream networks. Accounting for spatial and temporal variability offers improved accuracy in the analysis of N and P cycles. The algorithm has been validated with an internal and external validation procedure that is able to explain 70-83% of the variance in the model. The dataset is ready for use as input in a variety of environmental models and analyses, and the methodological framework can be applied to large-scale studies on N and P pollution, which include water quality, species distribution and water ecology research worldwide.

Methods for correcting inference based on outcomes predicted by machine learning

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2001238117 ◽

2020 ◽

Vol 117 (48) ◽

pp. 30266-30275

Author(s):

Siruo Wang ◽

Tyler H. McCormick ◽

Jeffrey T. Leek

Keyword(s):

Machine Learning ◽

Statistical Inference ◽

Variance Estimation ◽

Learning Algorithm ◽

R Package ◽

Neural Nets ◽

Learning Framework ◽

Validation Set ◽

Low Dimensional ◽

The Relationship

Many modern problems in medicine and public health leverage machine-learning methods to predict outcomes based on observable covariates. In a wide array of settings, predicted outcomes are used in subsequent statistical analysis, often without accounting for the distinction between observed and predicted outcomes. We call inference with predicted outcomes postprediction inference. In this paper, we develop methods for correcting statistical inference using outcomes predicted with arbitrarily complicated machine-learning models including random forests and deep neural nets. Rather than trying to derive the correction from first principles for each machine-learning algorithm, we observe that there is typically a low-dimensional and easily modeled representation of the relationship between the observed and predicted outcomes. We build an approach for postprediction inference that naturally fits into the standard machine-learning framework where the data are divided into training, testing, and validation sets. We train the prediction model in the training set, estimate the relationship between the observed and predicted outcomes in the testing set, and use that relationship to correct subsequent inference in the validation set. We show our postprediction inference (postpi) approach can correct bias and improve variance estimation and subsequent statistical inference with predicted outcomes. To show the broad range of applicability of our approach, we show postpi can improve inference in two distinct fields: modeling predicted phenotypes in repurposed gene expression data and modeling predicted causes of death in verbal autopsy data. Our method is available through an open-source R package: https://github.com/leekgroup/postpi.
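
The three-set workflow described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the postpi package's algorithm: it assumes the relationship between observed and predicted outcomes is a straight line, it corrects only the point estimate (not the variance), and `predict` stands in for a hypothetical, already-trained model with a systematic bias. All data are noise-free toy values.

```python
from statistics import mean
from typing import List, Tuple


def ols(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def predict(x: float) -> float:
    """Hypothetical model from the training set; biased on purpose (truth is y = 2x)."""
    return x + 1.0


# Testing set: both observed and predicted outcomes are available,
# so the relationship y_observed ~ y_predicted can be estimated.
x_test = [0.0, 1.0, 2.0, 3.0]
y_test = [2.0 * xi for xi in x_test]
g0, g1 = ols([predict(xi) for xi in x_test], y_test)

# Validation set: only predicted outcomes are available.
x_val = [4.0, 5.0, 6.0, 7.0]
y_pred_val = [predict(xi) for xi in x_val]

# Naive inference treats predictions as if they were observations.
_, naive_slope = ols(x_val, y_pred_val)

# Corrected inference first maps predictions through the estimated relationship.
y_corrected = [g0 + g1 * yp for yp in y_pred_val]
_, corrected_slope = ols(x_val, y_corrected)

print(naive_slope, corrected_slope)
```

In this toy setup the naive slope reflects the model's bias, while the corrected slope recovers the slope of the true outcome; with real, noisy data the variance would also need correcting, which this sketch does not attempt.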

Harnessing machine learning to guide phylogenetic-tree search algorithms

Nature Communications ◽

10.1038/s41467-021-22073-8 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Dana Azouri ◽

Shiran Abadi ◽

Yishay Mansour ◽

Itay Mayrose ◽

Tal Pupko

Keyword(s):

Machine Learning ◽

Phylogenetic Tree ◽

Learning Algorithm ◽

Search Space ◽

Large Set ◽

Tree Search ◽

Proof Of Concept ◽

Tree Reconstruction ◽

Promising Candidate ◽

Tree Inference

Inferring a phylogenetic tree is a fundamental challenge in evolutionary studies. Current paradigms for phylogenetic tree reconstruction rely on performing costly likelihood optimizations. With the aim of making tree inference feasible for problems involving more than a handful of sequences, inference under the maximum-likelihood paradigm integrates heuristic approaches to evaluate only a subset of all potential trees. Consequently, existing methods suffer from the known tradeoff between accuracy and running time. In this proof-of-concept study, we train a machine-learning algorithm over an extensive cohort of empirical data to predict the neighboring trees that increase the likelihood, without actually computing their likelihood. This provides a means to safely discard a large set of the search space, thus potentially accelerating heuristic tree searches without losing accuracy. Our analyses suggest that machine learning can guide tree-search methodologies towards the most promising candidate trees.
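
The idea of discarding most neighbors before any likelihood is computed can be sketched generically in Python. This is an assumption-laden illustration, not the authors' pipeline: candidates are plain integers rather than trees, `toy_predicted_gain` stands in for a hypothetical trained ranking model, `toy_log_likelihood` stands in for the expensive likelihood optimization, and `keep_fraction` is an invented parameter.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def guided_step(
    neighbors: List[T],
    predicted_gain: Callable[[T], float],
    log_likelihood: Callable[[T], float],
    keep_fraction: float = 0.25,
) -> Tuple[T, int]:
    """Rank neighbors with a cheap predictor, then score only the shortlist."""
    ranked = sorted(neighbors, key=predicted_gain, reverse=True)
    shortlist_size = max(1, int(len(ranked) * keep_fraction))
    shortlist = ranked[:shortlist_size]
    # The expensive function is called only on the shortlisted candidates.
    best = max(shortlist, key=log_likelihood)
    return best, len(shortlist)


# Toy stand-ins: the cheap predictor is slightly off from the true optimum.
def toy_predicted_gain(candidate: int) -> float:
    return -abs(candidate - 12)


def toy_log_likelihood(candidate: int) -> float:
    return -float((candidate - 13) ** 2)


candidates = list(range(20))
best, n_evaluated = guided_step(candidates, toy_predicted_gain, toy_log_likelihood)
print(best, n_evaluated)
```

Here the imperfect predictor still places the true best candidate in the shortlist, so the search finds it after evaluating a quarter of the neighbors; a poorer predictor or a smaller shortlist could miss it, which is the accuracy and running-time tradeoff the abstract refers to.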

HANDWRITTEN NUMERAL AND MACHINE PRINTED MULTIPLE FONT CHARACTER RECOGNITION USING NEURAL NETWORK CLASSIFIER

Journal of Circuits System and Computers ◽

10.1142/s0218126696000388 ◽

1996 ◽

Vol 06 (06) ◽

pp. 569-580 ◽

Cited By ~ 2

Author(s):

J. CAO ◽

M. AHMADI ◽

M. SHRIDHAR

Keyword(s):

Neural Network ◽

Character Recognition ◽

Learning Algorithm ◽

Back Propagation ◽

Large Set ◽

Real World Data ◽

Neural Network Learning ◽

Chain Code ◽

Network Learning ◽

Convergence Problems

In this paper a new neural network is proposed for recognition of handwritten digits and multi-font machine printed characters. In this system, overlapped regional chain code histograms of characters are used as features and a neural network has been used for classification. A new neural network learning algorithm that combines unsupervised learning with supervised learning has been developed. This new algorithm overcomes the slow learning and difficult convergence problems that are typical of back-propagation learning algorithms. The algorithm was tested on a large set of handwritten digits collected from real world data and a set of multi-font machine printed English letters.
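
A chain code histogram, the feature type named above, can be sketched in Python as follows. The sketch assumes an 8-direction Freeman-style code with direction 0 pointing along +x and codes increasing towards +y; that convention and the function names are assumptions, and the overlapped regional partitioning used in the paper is omitted, so the histogram here covers a whole contour.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]

# Assumed convention: code 0 is a step of (+1, 0); codes increase towards +y.
DIRECTIONS: Dict[Point, int] = {
    (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
    (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
}


def chain_code(contour: List[Point]) -> List[int]:
    """Encode each unit step between consecutive contour points as a direction code."""
    return [
        DIRECTIONS[(x1 - x0, y1 - y0)]
        for (x0, y0), (x1, y1) in zip(contour, contour[1:])
    ]


def chain_code_histogram(contour: List[Point]) -> List[float]:
    """Return the normalized frequency of each of the 8 direction codes."""
    codes = chain_code(contour)
    counts = [0] * 8
    for code in codes:
        counts[code] += 1
    return [count / len(codes) for count in counts]


# Closed contour of a 2x2 square traced from the origin.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1), (0, 0)]
print(chain_code(square))
print(chain_code_histogram(square))
```

A regional variant would compute one such histogram per image region and concatenate them into the feature vector passed to the classifier.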

Estimating nitrogen and phosphorus concentrations in streams and rivers across the contiguous United States: a machine learning framework

10.7287/peerj.preprints.27585v1 ◽

2019 ◽

Author(s):

Longzhu Shen ◽

Giuseppe Amatulli ◽

Tushar Sethi ◽

Peter Raymond ◽

Sami Domisch

Keyword(s):

Machine Learning ◽

Large Scale ◽

Learning Algorithm ◽

External Validation ◽

Anthropogenic Activity ◽

Spatial And Temporal Variability ◽

Nitrogen And Phosphorus ◽

Learning Framework ◽

Environmental Models ◽

Improved Accuracy

Nitrogen (N) and Phosphorus (P) are essential nutrients for life processes in water bodies but in excessive quantities, they are a significant source of aquatic pollution. Eutrophication has now become widespread due to such an imbalance, and is largely attributed to anthropogenic activity. In view of this phenomenon, we present a new dataset and statistical method for estimating and mapping elemental and compound concentrations of N and P at a resolution of 30 arc-seconds (∼1 km) for the conterminous US. The model is based on a Random Forest (RF) machine learning algorithm that was fitted with environmental variables and seasonal N and P concentration observations from 230,000 stations spanning across US stream networks. Accounting for spatial and temporal variability offers improved accuracy in the analysis of N and P cycles. The algorithm has been validated with an internal and external validation procedure that is able to explain 70-83% of the variance in the model. The dataset is ready for use as input in a variety of environmental models and analyses, and the methodological framework can be applied to large-scale studies on N and P pollution, which include water quality, species distribution and water ecology research worldwide.

Cellular intelligence: dynamic specialization through non-equilibrium multi-scale compartmentalization

10.1101/2021.06.25.449951 ◽

2021 ◽

Author(s):

Rémy V Tuyéras ◽

Leandro Z Agudelo ◽

Soumya P Ram ◽

Anjanet R Loon ◽

Burak Kutlu ◽

...

Keyword(s):

Machine Learning ◽

Data Mining ◽

Learning Algorithm ◽

Living Systems ◽

Emergent Properties ◽

New Approach ◽

Learning Framework ◽

Multi Scale ◽

Reference Machine ◽

Non Equilibrium

Intelligence is usually associated with the ability to perceive, retain and use information to adapt to changes in one's environment. In this context, systems of living cells can be thought of as intelligent entities. Here, we show that the concepts of non-equilibrium tuning and compartmentalization are sufficient to model manifestations of cellular intelligence such as specialization, division, fusion and communication using the language of operads. We implement our framework as an unsupervised learning algorithm, IntCyt, which we show is able to memorize, organize and abstract reference machine-learning datasets through generative and self-supervised tasks. Overall, our learning framework captures emergent properties programmed in living systems, and provides a powerful new approach for data mining.

Disease Prediction Using Machine Learning

International Journal of Scientific Research in Science and Technology ◽

10.32628/ijsrst12183118 ◽

2021 ◽

pp. 551-555

Author(s):

Gaurav Shilimkar ◽

Amol Bhilare ◽

Shivam Pisal

Keyword(s):

Machine Learning ◽

Decision Tree ◽

Imperfect Information ◽

Prediction Models ◽

Learning Algorithm ◽

Real Life ◽

Clinical Information ◽

Large Set ◽

Rule Mining ◽

Data Points

Big data plays a significant part in many businesses, and it is especially important to the rapidly growing healthcare industry. By offering a large set of data points, it supports the construction of robust systems that give better and more accurate results in disease detection. Forecasts are made from the information available, but incomplete information reduces their accuracy. Beyond incomplete data, the characteristics of particular regional diseases, which vary with their areas of origin, can weaken the prediction models further. In this paper we use data mining techniques such as association rule mining, classification, and clustering, and finally the Decision Tree machine learning algorithm, to analyze different kinds of general body-based illnesses. We implemented the Decision Tree algorithm and assessed its efficacy on real-life clinical information.
