Structure motif–centric learning framework for inorganic crystalline systems

2021 ◽  
Vol 7 (17) ◽  
pp. eabf1754
Author(s):  
Huta R. Banjade ◽  
Sandro Hauri ◽  
Shanshan Zhang ◽  
Francesco Ricci ◽  
Weiyi Gong ◽  
...  

Incorporation of physical principles in a machine learning (ML) architecture is a fundamental step toward the continued development of artificial intelligence for inorganic materials. Inspired by Pauling’s rules, we propose that structure motifs in inorganic crystals can serve as a central input to a machine learning framework. We demonstrate that the presence of structure motifs and their connections in a large set of crystalline compounds can be converted into unique vector representations using an unsupervised learning algorithm. To demonstrate the use of structure motif information, we create a motif-centric learning framework by combining motif information with atom-based graph neural networks to form an atom-motif dual graph network (AMDNet), which is more accurate in predicting electronic structure properties of metal oxides, such as bandgaps. The work illustrates a route toward the fundamental design of graph neural network learning architectures for complex materials by incorporating beyond-atom physical principles.

Author(s):  
Amelia Zafra

The multiple-instance problem is a difficult machine learning problem that appears in cases where knowledge about training examples is incomplete. In this problem, the teacher labels examples that are sets (also called bags) of instances. The teacher does not label whether an individual instance in a bag is positive or negative. The learning algorithm needs to generate a classifier that will correctly classify unseen examples (i.e., bags of instances). This learning framework is receiving growing attention in the machine learning community, and since it was introduced by Dietterich, Lathrop, and Lozano-Perez (1997), a wide range of tasks have been formulated as multi-instance problems. Among these tasks, we can cite content-based image retrieval (Chen, Bi, & Wang, 2006) and annotation (Qi and Han, 2007), text categorization (Andrews, Tsochantaridis, & Hofmann, 2002), web index page recommendation (Zhou, Jiang, & Li, 2005; Xue, Han, Jiang, & Zhou, 2007) and drug activity prediction (Dietterich et al., 1997; Zhou & Zhang, 2007). In this chapter we introduce MOG3P-MI, a multiobjective grammar-guided genetic programming algorithm to handle multi-instance problems. In this algorithm, based on SPEA2, individuals represent classification rules that make it possible to determine whether a bag is positive or negative. The quality of each individual is evaluated according to two quality indexes: sensitivity and specificity. Both measures have been adapted to MIL circumstances. Computational experiments show that MOG3P-MI is a robust algorithm for classification in different domains, where it achieves competitive results and obtains classifiers containing simple rules, which add comprehensibility and simplicity to the knowledge discovery process, making it a suitable method for solving MIL problems (Zafra & Ventura, 2007).
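The bag-level evaluation described above can be illustrated with a minimal Python sketch. It assumes the standard multi-instance convention that a bag is positive when at least one of its instances satisfies the rule; the abstract does not spell out how sensitivity and specificity were adapted, so this is one plausible reading rather than the MOG3P-MI definition. The names `classify_bag`, `sensitivity_specificity`, the toy bags, and the threshold rule are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Instance = Sequence[float]
Bag = List[Instance]


def classify_bag(bag: Bag, rule: Callable[[Instance], bool]) -> bool:
    """Label a bag positive if at least one instance satisfies the rule.

    Assumption: the standard multi-instance hypothesis, not necessarily
    the exact convention used by MOG3P-MI.
    """
    return any(rule(instance) for instance in bag)


def sensitivity_specificity(
    bags: List[Bag], labels: List[bool], rule: Callable[[Instance], bool]
) -> Tuple[float, float]:
    """Compute sensitivity and specificity over bags, not instances."""
    tp = fn = tn = fp = 0
    for bag, label in zip(bags, labels):
        predicted = classify_bag(bag, rule)
        if label and predicted:
            tp += 1
        elif label and not predicted:
            fn += 1
        elif not label and not predicted:
            tn += 1
        else:
            fp += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# Hypothetical toy data: only bag labels are known, never instance labels.
bags = [
    [(0.1, 0.2), (0.9, 0.8)],
    [(0.2, 0.1), (0.3, 0.4)],
    [(0.7, 0.1)],
    [(0.6, 0.9), (0.1, 0.1)],
]
labels = [True, False, True, False]


def rule(instance: Instance) -> bool:
    """Hypothetical classification rule: first feature above 0.5."""
    return instance[0] > 0.5


scores = sensitivity_specificity(bags, labels, rule)
```

A multiobjective search in this setting would treat the two returned values as separate objectives instead of collapsing them into a single accuracy figure.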


2020 ◽  
Author(s):  
Dana Azouri ◽  
Shiran Abadi ◽  
Yishay Mansour ◽  
Itay Mayrose ◽  
Tal Pupko

Abstract Inferring a phylogenetic tree, which describes the evolutionary relationships among a set of organisms, genes, or genomes, is a fundamental step in numerous evolutionary studies. With the aim of making tree inference feasible for problems involving more than a handful of sequences, current algorithms for phylogenetic tree reconstruction utilize various heuristic approaches. Such approaches rely on performing costly likelihood optimizations, and thus evaluate only a subset of all potential trees. Consequently, all existing methods suffer from the known tradeoff between accuracy and running time. Here, we train a machine-learning algorithm over an extensive cohort of empirical data to predict the neighboring trees that increase the likelihood, without actually computing their likelihood. This provides means to safely discard a large set of the search space, thus avoiding numerous expensive likelihood computations. Our analyses suggest that machine-learning approaches can make heuristic tree searches substantially faster without losing accuracy and thus could be incorporated for narrowing down the examined neighboring trees of each intermediate tree in any tree search methodology.


2019 ◽  
Author(s):  
Longzhu Shen ◽  
Giuseppe Amatulli ◽  
Tushar Sethi ◽  
Peter Raymond ◽  
Sami Domisch

Nitrogen (N) and Phosphorus (P) are essential nutrients for life processes in water bodies but in excessive quantities, they are a significant source of aquatic pollution. Eutrophication has now become widespread due to such an imbalance, and is largely attributed to anthropogenic activity. In view of this phenomenon, we present a new dataset and statistical method for estimating and mapping elemental and compound concentrations of N and P at a resolution of 30 arc-seconds (∼1 km) for the conterminous US. The model is based on a Random Forest (RF) machine learning algorithm that was fitted with environmental variables and seasonal N and P concentration observations from 230,000 stations spanning across US stream networks. Accounting for spatial and temporal variability offers improved accuracy in the analysis of N and P cycles. The algorithm has been validated with an internal and external validation procedure that is able to explain 70-83% of the variance in the model. The dataset is ready for use as input in a variety of environmental models and analyses, and the methodological framework can be applied to large-scale studies on N and P pollution, which include water quality, species distribution and water ecology research worldwide.


2020 ◽  
Vol 117 (48) ◽  
pp. 30266-30275
Author(s):  
Siruo Wang ◽  
Tyler H. McCormick ◽  
Jeffrey T. Leek

Many modern problems in medicine and public health leverage machine-learning methods to predict outcomes based on observable covariates. In a wide array of settings, predicted outcomes are used in subsequent statistical analysis, often without accounting for the distinction between observed and predicted outcomes. We call inference with predicted outcomes postprediction inference. In this paper, we develop methods for correcting statistical inference using outcomes predicted with arbitrarily complicated machine-learning models including random forests and deep neural nets. Rather than trying to derive the correction from first principles for each machine-learning algorithm, we observe that there is typically a low-dimensional and easily modeled representation of the relationship between the observed and predicted outcomes. We build an approach for postprediction inference that naturally fits into the standard machine-learning framework where the data are divided into training, testing, and validation sets. We train the prediction model in the training set, estimate the relationship between the observed and predicted outcomes in the testing set, and use that relationship to correct subsequent inference in the validation set. We show our postprediction inference (postpi) approach can correct bias and improve variance estimation and subsequent statistical inference with predicted outcomes. To show the broad range of applicability of our approach, we show postpi can improve inference in two distinct fields: modeling predicted phenotypes in repurposed gene expression data and modeling predicted causes of death in verbal autopsy data. Our method is available through an open-source R package: https://github.com/leekgroup/postpi.
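The three-way split described above can be sketched in plain Python. This is an illustration under strong assumptions, not the postpi package's procedure: the relationship between observed and predicted outcomes is assumed to be a simple line, the "machine-learning model" is a deliberately miscalibrated stand-in, the data are simulated, and only the point estimate is corrected (the variance correction from the paper is not reproduced). All names (`fit_line`, `simulate`, `predict`) are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


rng = random.Random(0)  # fixed seed so the sketch is reproducible


def simulate(n):
    """Hypothetical data: outcome y = 2 + 3x plus Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 + 3.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


train_x, train_y = simulate(200)
test_x, test_y = simulate(200)
valid_x, _unobserved = simulate(200)  # outcomes treated as unavailable

# 1. Training set: fit the prediction model.
a, b = fit_line(train_x, train_y)


def predict(x):
    """Stand-in for an arbitrary ML model; deliberately miscalibrated."""
    return 0.5 * (a + b * x)


# 2. Testing set: model the relationship observed ~ predicted
#    (assumed linear here).
c, d = fit_line([predict(x) for x in test_x], test_y)

# 3. Validation set: correct predicted outcomes before downstream inference.
predicted = [predict(x) for x in valid_x]
corrected = [c + d * p for p in predicted]

_, naive_slope = fit_line(valid_x, predicted)
_, corrected_slope = fit_line(valid_x, corrected)
```

In this simulation the slope estimated from raw predictions is roughly half the true value of 3, while the slope estimated from corrected predictions is close to it.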


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Dana Azouri ◽  
Shiran Abadi ◽  
Yishay Mansour ◽  
Itay Mayrose ◽  
Tal Pupko

Abstract Inferring a phylogenetic tree is a fundamental challenge in evolutionary studies. Current paradigms for phylogenetic tree reconstruction rely on performing costly likelihood optimizations. With the aim of making tree inference feasible for problems involving more than a handful of sequences, inference under the maximum-likelihood paradigm integrates heuristic approaches to evaluate only a subset of all potential trees. Consequently, existing methods suffer from the known tradeoff between accuracy and running time. In this proof-of-concept study, we train a machine-learning algorithm over an extensive cohort of empirical data to predict the neighboring trees that increase the likelihood, without actually computing their likelihood. This provides means to safely discard a large set of the search space, thus potentially accelerating heuristic tree searches without losing accuracy. Our analyses suggest that machine learning can guide tree-search methodologies towards the most promising candidate trees.
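The idea of ranking neighboring trees with a learned predictor and computing the likelihood only for the most promising ones can be sketched generically in Python. Everything here is a toy assumption: "trees" are integers, the neighborhood is a window of nearby integers, `log_likelihood` is a cheap stand-in for the costly optimization, and `predicted_gain` stands in for the trained model. None of these names or choices come from the paper.

```python
from typing import Callable, Iterable, Tuple


def guided_step(
    current: int,
    current_ll: float,
    neighbors: Callable[[int], Iterable[int]],
    predicted_gain: Callable[[int], float],
    log_likelihood: Callable[[int], float],
    top_k: int = 2,
) -> Tuple[int, float]:
    """One hill-climbing step that evaluates only the top_k ranked neighbors."""
    ranked = sorted(neighbors(current), key=predicted_gain, reverse=True)
    best_tree, best_ll = current, current_ll
    for tree in ranked[:top_k]:
        ll = log_likelihood(tree)  # the only expensive calls in this step
        if ll > best_ll:
            best_tree, best_ll = tree, ll
    return best_tree, best_ll


calls = {"n": 0}  # counts likelihood evaluations


def log_likelihood(tree: int) -> float:
    """Toy stand-in for a costly likelihood computation; optimum at 42."""
    calls["n"] += 1
    return -float((tree - 42) ** 2)


def predicted_gain(tree: int) -> float:
    """Toy stand-in for the learned ranking model (hypothetical)."""
    return -abs(tree - 42)


def neighbors(tree: int) -> Iterable[int]:
    """Toy neighborhood: ten nearby 'trees'."""
    return [tree + step for step in range(-5, 6) if step != 0]


tree = 0
ll = log_likelihood(tree)
while True:
    new_tree, new_ll = guided_step(
        tree, ll, neighbors, predicted_gain, log_likelihood
    )
    if new_tree == tree:  # no shortlisted neighbor improves the likelihood
        break
    tree, ll = new_tree, new_ll
```

With `top_k=2` out of ten neighbors per step, the search evaluates a fifth of the candidates an exhaustive neighborhood scan would; how much accuracy survives depends entirely on the quality of the ranking model.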


1996 ◽  
Vol 06 (06) ◽  
pp. 569-580 ◽  
Author(s):  
J. CAO ◽  
M. AHMADI ◽  
M. SHRIDHAR

In this paper a new neural network is proposed for recognition of handwritten digits and multi-font machine printed characters. In this system, overlapped regional chain code histograms of characters are used as features and a neural network has been used for classification. A new neural network learning algorithm that combines unsupervised learning with supervised learning has been developed. This new algorithm overcomes the slow learning and difficult convergence problems that are typical of back-propagation learning algorithms. The algorithm was tested on a large set of handwritten digits collected from real world data and a set of multi-font machine printed English letters.
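A chain code histogram of the kind used as a feature above can be sketched as follows. The sketch assumes the common Freeman 8-direction coding on an ordered contour with the y axis pointing up, and it computes one histogram for one contour; the paper's overlapped regional partitioning is not reproduced. The function names and the square contour are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]

# Freeman 8-direction codes keyed by (dx, dy); y is assumed to increase upward.
DIRECTIONS = {
    (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
    (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
}


def chain_code(contour: List[Point]) -> List[int]:
    """Encode each unit step between consecutive contour points."""
    return [
        DIRECTIONS[(x1 - x0, y1 - y0)]
        for (x0, y0), (x1, y1) in zip(contour, contour[1:])
    ]


def chain_code_histogram(contour: List[Point]) -> List[float]:
    """Normalized frequency of each of the 8 directions along the contour."""
    codes = chain_code(contour)
    counts = [0.0] * 8
    for code in codes:
        counts[code] += 1.0
    total = len(codes)
    return [count / total for count in counts] if total else counts


# Hypothetical closed contour: a unit square traced counterclockwise.
square = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
features = chain_code_histogram(square)
```

A regional variant would compute one such histogram per image region and concatenate them into the feature vector passed to the classifier.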


2021 ◽  
Author(s):  
Rémy V Tuyéras ◽  
Leandro Z Agudelo ◽  
Soumya P Ram ◽  
Anjanet R Loon ◽  
Burak Kutlu ◽  
...  

Intelligence is usually associated with the ability to perceive, retain and use information to adapt to changes in one's environment. In this context, systems of living cells can be thought of as intelligent entities. Here, we show that the concepts of non-equilibrium tuning and compartmentalization are sufficient to model manifestations of cellular intelligence such as specialization, division, fusion and communication using the language of operads. We implement our framework as an unsupervised learning algorithm, IntCyt, which we show is able to memorize, organize and abstract reference machine-learning datasets through generative and self-supervised tasks. Overall, our learning framework captures emergent properties programmed in living systems, and provides a powerful new approach for data mining.


Author(s):  
Gaurav Shilimkar ◽  
Amol Bhilare ◽  
Shivam Pisal

Big data plays a significant part in many businesses, and it is especially essential to the rapidly growing healthcare industry. By offering a large set of data points, it supports the construction of robust systems that give better and more accurate results in disease detection. Forecasts are made from the information that is accessible, but incomplete information decreases their precision. Besides incomplete data, the differing characteristics of particular regional diseases, which change with their areas of origin, can weaken the prediction models further. In this paper we use data mining techniques such as association rule mining, classification, and clustering, and finally the decision tree machine learning algorithm, to analyze different kinds of general body-based illnesses. We implemented and assessed the efficacy of the decision tree algorithm on real-life clinical information.

