Pathogenic gene prediction based on network embedding

Author(s):  
Yang Liu ◽  
Yuchen Guo ◽  
Xiaoyan Liu ◽  
Chunyu Wang ◽  
Maozu Guo

Abstract In disease research, the study of gene–disease correlation has always been an important topic. With the emergence of large-scale connected data sets in biology, we use known correlations between entities, which may come from different data sets, to build a biological heterogeneous network. We propose a new network embedding representation algorithm to calculate the correlation between diseases and genes, and use the correlation score to predict pathogenic genes. We then conduct several experiments to compare our method with other state-of-the-art methods. The results reveal that our method achieves better performance than the traditional methods.
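
The scoring step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' algorithm: it assumes disease and gene embeddings have already been learned, uses cosine similarity as a stand-in for the correlation score, and all names and vectors (`rank_genes`, `geneA`, and so on) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_genes(disease_vec, gene_vecs):
    """Return gene names sorted by descending similarity to the disease."""
    scores = {g: cosine(disease_vec, v) for g, v in gene_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy embeddings (assumed for illustration, not from the paper).
disease = [1.0, 0.0]
genes = {"geneA": [0.9, 0.1], "geneB": [0.0, 1.0], "geneC": [-1.0, 0.0]}
ranking = rank_genes(disease, genes)
```

Genes at the top of `ranking` would be treated as the strongest pathogenic candidates for the disease under this assumed scoring.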

2020 ◽  
Vol 34 (04) ◽  
pp. 4412-4419 ◽  
Author(s):  
Zhao Kang ◽  
Wangtao Zhou ◽  
Zhitong Zhao ◽  
Junming Shao ◽  
Meng Han ◽  
...  

A plethora of multi-view subspace clustering (MVSC) methods have been proposed over the past few years. Researchers have managed to boost clustering accuracy from different points of view. However, many state-of-the-art MVSC algorithms, which typically have quadratic or even cubic complexity, are inefficient and inherently difficult to apply at large scales. In the era of big data, the computational issue becomes critical. To fill this gap, we propose a large-scale MVSC (LMVSC) algorithm with linear order complexity. Inspired by the idea of the anchor graph, we first learn a smaller graph for each view. Then, a novel approach is designed to integrate those graphs so that we can implement spectral clustering on a smaller graph. Interestingly, it turns out that our model also applies to the single-view scenario. Extensive experiments on various large-scale benchmark data sets validate the effectiveness and efficiency of our approach with respect to state-of-the-art clustering methods.
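
The anchor graph idea mentioned in this abstract can be illustrated with a small sketch. It is not the LMVSC method itself: the paper learns its per-view graphs, whereas this example assumes a fixed Gaussian kernel, hand-picked anchors, and simple column-wise concatenation as the integration step. The function names and toy data are hypothetical. The point is only that each view costs work proportional to n × m (points × anchors) rather than n × n.

```python
import math

def anchor_graph(points, anchors, sigma=1.0):
    """Build an n x m row-normalized matrix of point-to-anchor affinities."""
    graph = []
    for p in points:
        weights = [
            math.exp(-sum((a - b) ** 2 for a, b in zip(p, c)) / (2 * sigma ** 2))
            for c in anchors
        ]
        total = sum(weights)
        graph.append([w / total for w in weights])
    return graph

def concat_views(graphs):
    """Concatenate per-view anchor graphs column-wise, row by row."""
    n = len(graphs[0])
    return [sum((g[i] for g in graphs), []) for i in range(n)]

# Hypothetical toy data: four points seen in two views, two anchors per view.
view1 = [[0.0], [0.1], [1.0], [0.9]]
view2 = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [0.8, 1.0]]
graphs = [
    anchor_graph(view1, [[0.0], [1.0]]),
    anchor_graph(view2, [[0.0, 0.0], [1.0, 1.0]]),
]
combined = concat_views(graphs)  # 4 rows, 2 + 2 columns
```

A spectral step would then operate on `combined`, which is much narrower than a full n × n affinity matrix when the number of anchors is small.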


2019 ◽  
Vol 12 (S10) ◽  
Author(s):  
Bo Xu ◽  
Yu Liu ◽  
Shuo Yu ◽  
Lei Wang ◽  
Jie Dong ◽  
...  

Abstract Background Prediction of pathogenic genes is crucial for disease prevention, diagnosis, and treatment. However, traditional genetic localization methods are often technically difficult and time-consuming. With the development of computer science, computational biology has gradually become one of the main methods for finding candidate pathogenic genes. Methods We propose a pathogenic gene prediction method based on network embedding, called Multipath2vec. First, we construct a heterogeneous network, called GP−network, based on three kinds of relationships between genes and phenotypes: correlations between phenotypes, interactions between genes, and known gene-phenotype pairs. Then, to embed the network better, we design the multi-path to guide random walks in GP−network. The multi-path includes multiple paths between genes and phenotypes, which can capture the complex structural information of the heterogeneous network. Finally, we use the learned vector representation of each phenotype and protein to calculate similarities, and rank candidate genes according to their similarity to the target phenotype. Results We implemented Multipath2vec and four baseline approaches (i.e., CATAPULT, PRINCE, Deepwalk and Metapath2vec) on many-genes gene-phenotype data, single-gene gene-phenotype data and whole gene-phenotype data. Experimental results show that Multipath2vec outperformed the state-of-the-art baselines in the pathogenic gene prediction task. Conclusions We propose Multipath2vec, which can be used to predict pathogenic genes; experimental results show that it achieves higher accuracy in pathogenic gene prediction.
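
A path-guided random walk of the kind described in this abstract can be sketched as below. This is an illustration under stated assumptions rather than the Multipath2vec implementation: it follows a single type pattern (gene, phenotype, gene, ...) instead of the paper's combined multi-path, and the graph, node names, and `guided_walk` helper are all hypothetical. In a full pipeline the resulting walks would typically be passed to a skip-gram model to learn the node vectors.

```python
import random

def guided_walk(adj, node_type, start, pattern, length, rng):
    """Random walk from `start` whose node types follow `pattern` cyclically."""
    walk = [start]
    for step in range(1, length):
        wanted = pattern[step % len(pattern)]
        # Only neighbors of the required type are eligible for the next hop.
        candidates = [n for n in adj[walk[-1]] if node_type[n] == wanted]
        if not candidates:
            break  # dead end: no neighbor of the required type
        walk.append(rng.choice(candidates))
    return walk

# Hypothetical toy gene-phenotype network (undirected adjacency lists).
adj = {
    "g1": ["g2", "p1"],
    "g2": ["g1", "p1", "p2"],
    "p1": ["g1", "g2", "p2"],
    "p2": ["g2", "p1"],
}
node_type = {"g1": "gene", "g2": "gene", "p1": "phenotype", "p2": "phenotype"}

walk = guided_walk(adj, node_type, "g1", ["gene", "phenotype"], 6, random.Random(0))
```

Seeding the generator makes the walk reproducible; the pattern restricts each hop, so gene-gene and phenotype-phenotype edges are skipped under this particular pattern.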


Author(s):  
Ziyao Li ◽  
Liang Zhang ◽  
Guojie Song

Many successful methods have been proposed for learning low dimensional representations on large-scale networks, while almost all existing methods are designed in inseparable processes, learning embeddings for entire networks even when only a small proportion of nodes are of interest. This leads to great inconvenience, especially on super-large or dynamic networks, where these methods become almost impossible to implement. In this paper, we formalize the problem of separated matrix factorization, based on which we elaborate a novel objective function that preserves both local and global information. We further propose SepNE, a simple and flexible network embedding algorithm which independently learns representations for different subsets of nodes in separated processes. By implementing separability, our algorithm reduces the redundant efforts to embed irrelevant nodes, yielding scalability to super-large networks, automatic implementation in distributed learning and further adaptations. We demonstrate the effectiveness of this approach on several real-world networks with different scales and subjects. With comparable accuracy, our approach significantly outperforms state-of-the-art baselines in running times on large networks.


2015 ◽  
Author(s):  
Stinus Lindgreen ◽  
Karen L Adair ◽  
Paul Gardner

Metagenome studies are becoming increasingly widespread, yielding important insights into microbial communities covering diverse environments from terrestrial and aquatic ecosystems to human skin and gut. With the advent of high-throughput sequencing platforms, the use of large scale shotgun sequencing approaches is now commonplace. However, a thorough independent benchmark comparing state-of-the-art metagenome analysis tools is lacking. Here, we present a benchmark where the most widely used tools are tested on complex, realistic data sets. Our results clearly show that the most widely used tools are not necessarily the most accurate, that the most accurate tool is not necessarily the most time consuming, and that there is a high degree of variability between available tools. These findings are important as the conclusions of any metagenomics study are affected by errors in the predicted community composition. Data sets and results are freely available from http://www.ucbioinformatics.org/metabenchmark.html


2020 ◽  
Vol 34 (01) ◽  
pp. 354-361 ◽  
Author(s):  
Chidubem Arachie ◽  
Manas Gaur ◽  
Sam Anzaroot ◽  
William Groves ◽  
Ke Zhang ◽  
...  

Social media plays a major role during and after major natural disasters (e.g., hurricanes, large-scale fires, etc.), as people “on the ground” post useful information on what is actually happening. Given the large amounts of posts, a major challenge is identifying the information that is useful and actionable. Emergency responders are largely interested in finding out what events are taking place so they can properly plan and deploy resources. In this paper we address the problem of automatically identifying important sub-events (within a large-scale emergency “event”, such as a hurricane). In particular, we present a novel, unsupervised learning framework to detect sub-events in Tweets for retrospective crisis analysis. We first extract noun-verb pairs and phrases from raw tweets as sub-event candidates. Then, we learn a semantic embedding of extracted noun-verb pairs and phrases, and rank them against a crisis-specific ontology. We filter out noisy and irrelevant information then cluster the noun-verb pairs and phrases so that the top-ranked ones describe the most important sub-events. Through quantitative experiments on two large crisis data sets (Hurricane Harvey and the 2015 Nepal Earthquake), we demonstrate the effectiveness of our approach over the state-of-the-art. Our qualitative evaluation shows better performance compared to our baseline.


Entropy ◽  
2019 ◽  
Vol 21 (3) ◽  
pp. 254 ◽  
Author(s):  
Shaokai Wang ◽  
Xutao Li ◽  
Yunming Ye ◽  
Shanshan Feng ◽  
Raymond Lau ◽  
...  

Presently, many users are involved in multiple social networks. Identifying the same user in different networks, also known as anchor link prediction, becomes an important problem, which can serve numerous applications, e.g., cross-network recommendation, user profiling, etc. Previous studies mainly use hand-crafted structure features, which, if not carefully designed, may fail to reflect the intrinsic structure regularities. Moreover, most of the methods neglect the attribute information of social networks. In this paper, we propose a novel semi-supervised network-embedding model to address the problem. In the model, each node of the multiple networks is represented by a vector for anchor link prediction, which is learnt with awareness of observed anchor links as semi-supervised information, and topology structure and attributes as input. Experimental results on the real-world data sets demonstrate the superiority of the proposed model compared to state-of-the-art techniques.


2013 ◽  
Vol 2013 ◽  
pp. 1-11 ◽  
Author(s):  
Lukasz Zwolinski ◽  
Marta Kozak ◽  
Karol Kozak

Technological advancements are constantly increasing the size and complexity of data resulting from large-scale RNA interference screens. This has led biologists to ask complex questions, which the existing, fully automated analyses are often not adequate to answer. We present the concept of 1Click1View (1C1V) as a methodology for interactive analytic software tools. 1C1V can be applied for two-dimensional visualization of image-based screening data sets from High Content Screening (HCS). Through an easy-to-use interface, a one-click, one-view concept, and a workflow-based architecture, the visualization method facilitates the linking of image data with numeric data. The method utilizes state-of-the-art interactive visualization tools optimized for fast visualization of large-scale image data sets. We demonstrate our method on an HCS dataset consisting of multiple cell features from two screening assays.


2020 ◽  
Author(s):  
Ming He ◽  
Chen Huang ◽  
Bo Liu ◽  
Yadong Wang ◽  
Junyi Li

Abstract Background Exploring the relationship between disease and gene is of great significance for understanding the pathogenesis of disease and developing corresponding therapeutic measures. The prediction of disease-gene association by computational methods accelerates the process. Results Many existing methods cannot fully utilize the multi-dimensional biological entity relationship to predict disease-gene association due to multi-source heterogeneous data. This paper proposes FactorHNE, a factor graph-aggregated heterogeneous network embedding method for disease-gene association prediction, which captures a variety of semantic relationships between the heterogeneous nodes by factorization. It produces different semantic factor graphs and effectively aggregates a variety of semantic relationships, using an end-to-end multi-perspective loss function to optimize the model. It then produces good node embeddings to predict disease-gene associations. Conclusions Experimental verification and analysis show FactorHNE has better performance and scalability than the existing models. It also has good interpretability and can be extended to large-scale biomedical network data analysis.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Ming He ◽  
Chen Huang ◽  
Bo Liu ◽  
Yadong Wang ◽  
Junyi Li

Abstract Background Exploring the relationship between disease and gene is of great significance for understanding the pathogenesis of disease and developing corresponding therapeutic measures. The prediction of disease-gene association by computational methods accelerates the process. Results Many existing methods cannot fully utilize the multi-dimensional biological entity relationship to predict disease-gene association due to multi-source heterogeneous data. This paper proposes FactorHNE, a factor graph-aggregated heterogeneous network embedding method for disease-gene association prediction, which captures a variety of semantic relationships between the heterogeneous nodes by factorization. It produces different semantic factor graphs and effectively aggregates a variety of semantic relationships, using an end-to-end multi-perspective loss function to optimize the model. It then produces good node embeddings to predict disease-gene associations. Conclusions Experimental verification and analysis show FactorHNE has better performance and scalability than the existing models. It also has good interpretability and can be extended to large-scale biomedical network data analysis.

