Optimal Network Alignment with Graphlet Degree Vectors

2010 ◽  
Vol 9 ◽  
pp. CIN.S4744 ◽  
Author(s):  
Tijana Milenković ◽  
Weng Leong Ng ◽  
Wayne Hayes ◽  
Nataša Pržulj

Important biological information is encoded in the topology of biological networks. Comparative analyses of biological networks are proving to be valuable, as they can lead to transfer of knowledge between species and give deeper insights into biological function, disease, and evolution. We introduce a new method that uses the Hungarian algorithm to produce an optimal global alignment between two networks using any cost function. We design a cost function based solely on network topology and use it in our network alignment. Our method can be applied to any two networks, not just biological ones, since it is based only on network topology. We use our new method to align protein-protein interaction networks of two eukaryotic species and demonstrate that our alignment exposes large and topologically complex regions of network similarity. At the same time, our alignment is biologically valid, since many of the aligned protein pairs perform the same biological function. From the alignment, we predict the function of as yet unannotated proteins, many of which we validate in the literature. We also apply our method to find topological similarities between metabolic networks of different species and build phylogenetic trees based on our network alignment score. The phylogenetic trees obtained in this way bear a striking resemblance to the ones obtained by sequence alignments. Our method detects topologically similar regions in large networks that are statistically significant. It does this independently of protein sequence or any other information external to network topology.
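
The assignment step mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it is a standard O(n^3) Hungarian (Kuhn-Munkres) routine in plain Python, applied to a small hypothetical cost matrix. The function name `hungarian` and the matrix values are assumptions; the paper's actual cost function is derived from network topology and is not reproduced here.

```python
def hungarian(cost):
    """Minimum-cost assignment of rows to columns (requires rows <= columns).

    cost[i][j] is the (hypothetical) cost of aligning node i of the first
    network with node j of the second. Returns (assignment, total_cost),
    where assignment[i] is the column matched to row i.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0] * (n + 1)      # row potentials
    v = [0] * (m + 1)      # column potentials
    p = [0] * (m + 1)      # p[j] = row currently matched to column j (1-based, 0 = none)
    way = [0] * (m + 1)    # back-pointers along the augmenting path

    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Shift potentials so that at least one new reduced cost becomes zero.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:   # reached an unmatched column: augmenting path found
                break
        # Flip the matching along the augmenting path.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1

    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    total = sum(cost[i][assignment[i]] for i in range(n))
    return assignment, total


# Hypothetical 3x3 cost matrix (lower cost = more similar node pair).
example_cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
example_assignment, example_total = hungarian(example_cost)
```

Because the routine only sees a cost matrix, any node-to-node cost function can be plugged in, which is the property the abstract emphasizes.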

2010 ◽  
Vol 7 (50) ◽  
pp. 1341-1354 ◽  
Author(s):  
Oleksii Kuchaiev ◽  
Tijana Milenković ◽  
Vesna Memišević ◽  
Wayne Hayes ◽  
Nataša Pržulj

Sequence comparison and alignment has had an enormous impact on our understanding of evolution, biology and disease. Comparison and alignment of biological networks will probably have a similar impact. Existing network alignments use information external to the networks, such as sequence, because no good algorithm for purely topological alignment has yet been devised. In this paper, we present a novel algorithm based solely on network topology that can be used to align any two networks. We apply it to biological networks to produce by far the most complete topological alignments of biological networks to date. We demonstrate that both species phylogeny and detailed biological function of individual proteins can be extracted from our alignments. Topology-based alignments have the potential to provide a completely new, independent source of phylogenetic information. Our alignment of the protein–protein interaction networks of two very different species, yeast and human, indicates that even distant species share a surprising amount of network topology, suggesting broad similarities in internal cellular wiring across all life on Earth.


2016 ◽  
Author(s):  
Dan DeBlasio ◽  
John Kececioglu

Motivation: While mutation rates can vary across the residues of a protein, when computing alignments of protein sequences the same setting of values for substitution score and gap penalty parameters is typically used across their entire length. We provide for the first time a new method called adaptive local realignment that automatically uses diverse parameter settings in different regions of the input sequences when computing multiple sequence alignments. This allows parameter settings to adapt to more closely match the local mutation rate across a protein. Method: Our method builds on our prior work on global alignment parameter advising with the Facet alignment accuracy estimator. Given a computed alignment, in each region that has low estimated accuracy, a collection of candidate realignments is generated using a precomputed set of alternate parameter settings. If one of these alternate realignments has higher estimated accuracy than the original subalignment, the region is replaced with the new realignment, and the concatenation of these realigned regions forms the final alignment that is output. Results: Adaptive local realignment significantly improves the quality of alignments over using the single best default parameter setting. In particular, this new method of local advising, when combined with prior methods for global advising, boosts alignment accuracy by as much as 26% over the best default setting on hard-to-align benchmarks (and by 6.4% over using global advising alone). Availability: A new version of the Opal multiple sequence aligner that incorporates adaptive local realignment using Facet for parameter advising is available free for non-commercial use at http://[email protected]
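
The region-by-region control flow described in the Method part can be sketched roughly as below. This is only an illustration under stated assumptions: the alignment is modeled as a list of column strings, the regions are assumed to partition it, and `toy_estimate` and `toy_realign` are hypothetical stand-ins for the Facet estimator and the Opal aligner, neither of which is called here.

```python
def adaptive_local_realign(alignment, regions, estimate, realign, alt_params, threshold):
    """Replace low-accuracy regions with their best-scoring candidate realignment.

    alignment  : list of alignment columns (strings, one character per sequence)
    regions    : list of half-open (start, end) column ranges covering the alignment
    estimate   : callable(sub_alignment) -> estimated accuracy (higher is better)
    realign    : callable(sub_alignment, params) -> candidate sub-alignment
    alt_params : precomputed alternate parameter settings to try
    threshold  : regions scoring below this value are considered for realignment
    """
    output = []
    for start, end in regions:
        best = alignment[start:end]
        best_score = estimate(best)
        if best_score < threshold:
            for params in alt_params:
                candidate = realign(alignment[start:end], params)
                score = estimate(candidate)
                if score > best_score:       # keep only strict improvements
                    best, best_score = candidate, score
        output.extend(best)                  # concatenate regions in order
    return output


def toy_estimate(sub):
    """Hypothetical estimator: fraction of columns that contain no gap."""
    return sum("-" not in col for col in sub) / len(sub)


def toy_realign(sub, params):
    """Hypothetical realigner: 'pack' left-justifies residues, anything else is a no-op."""
    if params != "pack":
        return list(sub)
    rows = ["".join(col[r] for col in sub).replace("-", "") for r in range(len(sub[0]))]
    width = max(len(row) for row in rows)
    rows = [row.ljust(width, "-") for row in rows]
    return ["".join(row[c] for row in rows) for c in range(width)]


# Two sequences, four columns; the middle region is gappy and gets realigned.
columns = ["AA", "A-", "-C", "GG"]
result = adaptive_local_realign(
    columns, [(0, 1), (1, 3), (3, 4)], toy_estimate, toy_realign,
    alt_params=["keep", "pack"], threshold=0.9,
)
```

The estimator and realigner are passed in as callables so that the selection logic stays independent of whichever accuracy estimator and aligner are actually used.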


2019 ◽  
Author(s):  
Benoit Morel ◽  
Alexey M. Kozlov ◽  
Alexandros Stamatakis ◽  
Gergely J. Szöllősi

Inferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges, species tree-aware methods also leverage information from a putative species tree. However, only a few methods are available that implement a full likelihood framework or account for horizontal gene transfers. Furthermore, these methods often require expensive data pre-processing (e.g., computing bootstrap trees), and rely on approximations and heuristics that limit the degree of tree space exploration. Here we present GeneRax, the first maximum likelihood species tree-aware phylogenetic inference software. It simultaneously accounts for substitutions at the sequence level as well as gene level events, such as duplication, transfer, and loss, relying on established maximum likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for multiple gene families, directly from the per-gene sequence alignments and a rooted, yet undated, species tree. We show that compared to competing tools, on simulated data GeneRax infers trees that are the closest to the true tree in 90% of the simulations in terms of relative Robinson-Foulds distance. On empirical datasets, GeneRax is the fastest among all tested methods when starting from aligned sequences, and it infers trees with the highest likelihood score, based on our model. GeneRax completed tree inferences and reconciliations for 1099 Cyanobacteria families in eight minutes on 512 CPU cores. Thus, its parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax.
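
For readers unfamiliar with the evaluation metric, the sketch below computes a relative Robinson-Foulds distance between two small trees. It is a simplified illustration and rests on assumptions: trees are nested tuples with string leaf names, the rooted (clade-based) variant is used, and the distance is normalized by the total number of non-trivial clades in both trees. The exact definition used in the GeneRax evaluation may differ.

```python
def clades(tree):
    """Return the non-trivial clades of a rooted tree as frozensets of leaf names.

    A tree is either a leaf name (str) or a tuple of subtrees. Single leaves
    and the root clade are excluded because every tree on the same leaf set
    shares them.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    root = leaves(tree)
    found.discard(root)
    return found


def relative_rf(tree_a, tree_b):
    """Clades present in exactly one tree, divided by the total clade count."""
    a, b = clades(tree_a), clades(tree_b)
    total = len(a) + len(b)
    return len(a ^ b) / total if total else 0.0


# Two hypothetical four-leaf trees that disagree on one inner clade.
true_tree = ((("A", "B"), "C"), "D")
inferred_tree = ((("A", "C"), "B"), "D")
rf_value = relative_rf(true_tree, inferred_tree)
```

A value of 0 means the two trees contain exactly the same clades, and a value of 1 means they share none beyond the trivial ones.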


2021 ◽  
Vol 4 ◽  
Author(s):  
Dalila Destanović ◽  
Lejla Ušanović ◽  
Lejla Lasić ◽  
Jasna Hanjalić ◽  
Belma Kalamujić Stroil

Chaetopteryx villosa (Fabricius, 1798) is a caddisfly species distributed throughout Europe, except in the Balkan and Apennine Peninsulas. However, phylogenetically close species belonging to the C. villosa group are widespread throughout Europe. Species of this group (C. villosa, C. gessneri, C. fusca, C. sahlbergi, C. atlantica, C. bosniaca, C. vulture, and C. trinacriae) have distinct distributions with some overlaps. Adult forms of these species are morphologically similar, whereas larval morphology is only known for some species. There are also indications of species hybridization (e.g., C. villosa x fusca). Presumably, a molecular approach to species determination in this group would be highly beneficial. In the BOLD database, there are 154 specimens with COI-5P barcodes of C. villosa species. Out of the remaining species, C. sahlbergi has 27 specimens with a barcode, C. fusca 20, C. gessneri 5, C. bosniaca 5, and C. atlantica 1, whereas sequences from the species C. vulture and C. trinacriae are missing. Therefore, we tested the power of discrimination of the COI-5P marker, the most common barcoding marker for species identification in animals, in the C. villosa group. Only sequences from public records originating from experienced research groups or taxonomists and containing a specimen photograph were taken as input. A total of 75 sequences from the BOLD database were obtained. Out of these sequences, 11 belonged to C. fusca, 5 to C. gessneri, 52 to C. villosa, 5 to C. bosniaca, and 2 to C. sahlbergi. For the generation of overview trees, COI-5P barcodes of Rhyacophila fasciata and Rh. nubila were used as outgroups. All sequences were trimmed at 5’ and 3’ ends, resulting in a final alignment length of 516 base pairs. Multiple sequence alignments and editing were done in the MEGA-X software. Analysis of nucleotide polymorphism was done in DNASP6 software.
MEGA-X was used to calculate the pairwise distance and overall mean p-distance, and to construct the overview trees. Analysis of DNA polymorphism revealed 14 haplotypes of C. villosa, 3 haplotypes of C. fusca, 2 haplotypes of C. gessneri, and one haplotype each for C. bosniaca and C. sahlbergi. There were no significant interspecific and intraspecific differences among haplotypes based on pairwise distances. The p-distance between one of the haplotypes of C. fusca and C. villosa was 0.000, whereas the p-distance among haplotypes of C. villosa varied from 0.001 to about 0.055. The mean overall p-distance among haplotypes of all species equaled 0.03. No species-specific clusters were observed when phylogenetic trees were constructed except for C. gessneri, regardless of the method used (i.e., NJ, UPGMA, ML, ME, or MP). To minimize the possibility of species misidentification, we used only records submitted by NTNU-Norwegian University of Science and Technology (Norway), SNSB-Zoologische Staatssammlung Muenchen (Germany), Zoologisches Forschungsmuseum Alexander Koenig (Germany), University of Oulu, Zoological Museum (Finland), Prof. Hans Malicky and Prof. Mladen Kučinić. No records identified as hybrids were included in the analyses. With the exception of C. gessneri, the COI-5P marker failed to separate the species of the C. villosa group. However, it is highly unlikely that poor species determination was the basis for such a result. To enable a comprehensive and unbiased evaluation of the relationships within this group, data coverage in the BOLD database for most of the studied species should be enhanced, encompassing samples from different geographical areas. Further studies are needed to detect the array of molecular markers suitable for species delineation in a complex group such as C. villosa.
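
The p-distance values reported above are simply the proportion of aligned sites at which two sequences differ. A minimal sketch follows; the function names and the toy haplotypes are hypothetical, and skipping sites with gaps or ambiguous bases (pairwise deletion) is an assumption that may not match the settings used in MEGA-X for this study.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of compared sites that differ between two aligned sequences.

    Sites where either sequence has a gap or a non-ACGT symbol are skipped
    (pairwise deletion, an assumption made for this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def mean_p_distance(sequences):
    """Average p-distance over all unordered pairs of sequences."""
    distances = [p_distance(a, b) for a, b in combinations(sequences, 2)]
    return sum(distances) / len(distances)


# Toy haplotypes, not taken from the study.
toy_haplotypes = ["ACGT", "ACGA", "TCGA"]
toy_mean = mean_p_distance(toy_haplotypes)
```

Under this definition a p-distance of 0.000 between two haplotypes, as reported for one C. fusca and one C. villosa haplotype, means the compared sites are identical.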


2015 ◽  
Vol 2015 ◽  
pp. 1-11 ◽  
Author(s):  
Bin Shen ◽  
Muwei Zhao ◽  
Wei Zhong ◽  
Jieyue He

With the continuous development of biological experiment technology, more and more data related to uncertain biological networks needs to be analyzed. However, most current alignment methods are designed for deterministic biological networks. Only a few can solve the probabilistic network alignment problem, and these approaches use only part of the probabilistic data in the original networks, allowing only one of the two networks to be probabilistic. To overcome this weakness, an improved method called completely probabilistic biological network comparison alignment (C_PBNA) is proposed in this paper. This new method is designed for complete probabilistic biological network alignment based on probabilistic biological network alignment (PBNA), in order to take full advantage of the uncertain information in biological networks. The degree of consistency (agreement) indicates that C_PBNA can find results neglected by the PBNA algorithm. Furthermore, the GO consistency (GOC) and global network alignment score (GNAS) were selected as evaluation criteria, and both showed that C_PBNA can obtain more biologically significant results than the PBNA algorithm.


2020 ◽  
Vol 37 (9) ◽  
pp. 2763-2774 ◽  
Author(s):  
Benoit Morel ◽  
Alexey M Kozlov ◽  
Alexandros Stamatakis ◽  
Gergely J Szöllősi

Inferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges, species-tree-aware methods also leverage information from a putative species tree. However, only a few methods are available that implement a full likelihood framework or account for horizontal gene transfers. Furthermore, these methods often require expensive data preprocessing (e.g., computing bootstrap trees) and rely on approximations and heuristics that limit the degree of tree space exploration. Here, we present GeneRax, the first maximum likelihood species-tree-aware phylogenetic inference software. It simultaneously accounts for substitutions at the sequence level as well as gene level events, such as duplication, transfer, and loss, relying on established maximum likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for multiple gene families, directly from the per-gene sequence alignments and a rooted, yet undated, species tree. We show that compared with competing tools, on simulated data GeneRax infers trees that are the closest to the true tree in 90% of the simulations in terms of relative Robinson–Foulds distance. On empirical data sets, GeneRax is the fastest among all tested methods when starting from aligned sequences, and it infers trees with the highest likelihood score, based on our model. GeneRax completed tree inferences and reconciliations for 1,099 Cyanobacteria families in 8 min on 512 CPU cores. Thus, its parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax (last accessed June 17, 2020).

