Gradients Do Grow on Trees: A Linear-Time O(N)-Dimensional Gradient for Statistical Phylogenetics

2020 ◽  
Vol 37 (10) ◽  
pp. 3047-3060
Author(s):  
Xiang Ji ◽  
Zhenyu Zhang ◽  
Andrew Holbrook ◽  
Akihiko Nishimura ◽  
Guy Baele ◽  
...  

Abstract Calculation of the log-likelihood stands as the computational bottleneck for many statistical phylogenetic algorithms. Even worse is its gradient evaluation, often used to target regions of high probability. O(N)-dimensional gradient calculations based on the standard pruning algorithm require O(N²) operations, where N is the number of sampled molecular sequences. With the advent of high-throughput sequencing, recent phylogenetic studies have analyzed hundreds to thousands of sequences, with an apparent trend toward even larger data sets as a result of advancing technology. Such large-scale analyses challenge phylogenetic reconstruction by requiring inference on larger sets of process parameters to model the increasing data heterogeneity. To make these analyses tractable, we present a linear-time algorithm for O(N)-dimensional gradient evaluation and apply it to general continuous-time Markov processes of sequence substitution on a phylogenetic tree without a need to assume either stationarity or reversibility. We apply this approach to learn the branch-specific evolutionary rates of three pathogenic viruses: West Nile virus, Dengue virus, and Lassa virus. Our proposed algorithm significantly improves inference efficiency with a 126- to 234-fold increase in maximum-likelihood optimization and a 16- to 33-fold computational performance increase in a Bayesian framework.
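The linear-time idea pairs the classic post-order (pruning) pass with a single pre-order pass that reuses the cached partial likelihoods, so each branch's gradient costs only a small fixed amount of extra work. Below is a minimal sketch of that idea for a toy two-state symmetric substitution model on a three-tip tree; all names (`topology`, `gradient`, the closed-form `P(t)`) are illustrative, not the authors' implementation, which handles general models.

```python
import numpy as np

# Toy two-state symmetric CTMC; the paper's algorithm covers general models.
Q = np.array([[-1.0, 1.0], [1.0, -1.0]])           # rate matrix

def P(t):
    e = np.exp(-2.0 * t)                           # expm(Q t) in closed form
    return 0.5 * np.array([[1 + e, 1 - e], [1 - e, 1 + e]])

topology = {0: [1, 2], 2: [3, 4]}                  # children of internal nodes
leaf_state = {1: 0, 3: 1, 4: 0}                    # observed tip states
pi = np.array([0.5, 0.5])                          # root state distribution

def gradient(brlen):
    """Log-likelihood and its gradient w.r.t. every branch length,
    via one post-order and one pre-order traversal (linear time)."""
    part, msg = {}, {}
    def down(v):                                   # post-order: pruning pass
        if v in leaf_state:
            part[v] = np.eye(2)[leaf_state[v]]
            return
        part[v] = np.ones(2)
        for c in topology[v]:
            down(c)
            msg[c] = P(brlen[c]) @ part[c]         # message child -> parent
            part[v] = part[v] * msg[c]
    down(0)
    lik = pi @ part[0]
    grad = {}
    def up(v, outer):                              # pre-order: reuse messages
        for c in topology.get(v, []):
            sib = outer.copy()
            for b in topology[v]:
                if b != c:
                    sib = sib * msg[b]             # siblings' contribution
            # dP/dt = Q P(t), so each branch gradient is a small contraction.
            grad[c] = (sib @ (Q @ P(brlen[c])) @ part[c]) / lik
            up(c, sib @ P(brlen[c]))
    up(0, pi)
    return np.log(lik), grad
```

Rerunning the pruning pass once per branch, as the standard approach does, would repeat the post-order work N times; here the pre-order pass amortizes it away.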

2020 ◽  
Vol 34 (04) ◽  
pp. 4412-4419 ◽  
Author(s):  
Zhao Kang ◽  
Wangtao Zhou ◽  
Zhitong Zhao ◽  
Junming Shao ◽  
Meng Han ◽  
...  

A plethora of multi-view subspace clustering (MVSC) methods have been proposed over the past few years. Researchers manage to boost clustering accuracy from different points of view. However, many state-of-the-art MVSC algorithms typically have quadratic or even cubic complexity, making them inefficient and inherently difficult to apply at large scales. In the era of big data, the computational issue becomes critical. To fill this gap, we propose a large-scale MVSC (LMVSC) algorithm with linear order complexity. Inspired by the idea of the anchor graph, we first learn a smaller graph for each view. Then, a novel approach is designed to integrate those graphs so that we can implement spectral clustering on a smaller graph. Interestingly, it turns out that our model also applies to the single-view scenario. Extensive experiments on various large-scale benchmark data sets validate the effectiveness and efficiency of our approach with respect to state-of-the-art clustering methods.
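The anchor-graph trick that makes this linear in n can be sketched in a few lines for a single view: connect every point to m << n anchors, and do the spectral step on that thin n-by-m matrix instead of a full n-by-n graph. This is a hypothetical illustration, not the LMVSC code; anchors here are a simple strided sample rather than the k-means anchors typically used, and the multi-view fusion step is omitted.

```python
import numpy as np

def anchor_graph_clustering(X, n_anchors, k):
    """Single-view anchor-graph spectral clustering sketch: the spectral
    step costs O(n * m^2) for m anchors -- linear in n -- instead of the
    O(n^3) of spectral clustering on a full n-by-n similarity graph."""
    anchors = X[::max(1, len(X) // n_anchors)][:n_anchors]
    # Gaussian similarity of every point to every anchor, row-normalized.
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    Z = np.exp(-d2 / d2.mean())                    # crude bandwidth heuristic
    Z = Z / Z.sum(1, keepdims=True)
    # Top-k left singular vectors of Z D^{-1/2} give the spectral embedding
    # of the bipartite graph without ever forming the n-by-n similarity.
    U, _, _ = np.linalg.svd(Z / np.sqrt(Z.sum(0)), full_matrices=False)
    emb = U[:, :k]
    # Tiny deterministic k-means (farthest-point init) on the embedding.
    centers = [emb[0]]
    for _ in range(k - 1):
        d = ((emb[:, None, :] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(emb[d.argmax()])
    centers = np.array(centers)
    for _ in range(50):
        labels = ((emb[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = emb[labels == j].mean(0)
    return labels
```

In the multi-view setting, one such Z is learned per view and the integrated graph is clustered the same way.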


2016 ◽  
Vol 6 (1) ◽  
Author(s):  
Stinus Lindgreen ◽  
Karen L. Adair ◽  
Paul P. Gardner

Abstract Metagenome studies are becoming increasingly widespread, yielding important insights into microbial communities covering diverse environments from terrestrial and aquatic ecosystems to human skin and gut. With the advent of high-throughput sequencing platforms, the use of large scale shotgun sequencing approaches is now commonplace. However, a thorough independent benchmark comparing state-of-the-art metagenome analysis tools is lacking. Here, we present a benchmark where the most widely used tools are tested on complex, realistic data sets. Our results clearly show that the most widely used tools are not necessarily the most accurate, that the most accurate tool is not necessarily the most time consuming and that there is a high degree of variability between available tools. These findings are important as the conclusions of any metagenomics study are affected by errors in the predicted community composition and functional capacity. Data sets and results are freely available from http://www.ucbioinformatics.org/metabenchmark.html
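"Errors in the predicted community composition" are quantified by comparing a tool's predicted abundance profile against the known truth of the simulated data set. One simple example metric (a hypothetical illustration, not necessarily the scoring used in this benchmark) is the L1 distance over taxa:

```python
def l1_error(true_abund, pred_abund):
    """L1 distance between two relative-abundance profiles keyed by taxon.
    Hypothetical example metric; the benchmark's own scoring may differ."""
    taxa = set(true_abund) | set(pred_abund)
    return sum(abs(true_abund.get(t, 0.0) - pred_abund.get(t, 0.0))
               for t in taxa)
```

Taxa missed entirely by a tool (or hallucinated by it) contribute their full abundance to the error, which is why composition mistakes propagate directly into a study's conclusions.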


2012 ◽  
Vol 10 (05) ◽  
pp. 1250009
Author(s):  
WILLIAM KRIVAN ◽  
NICK ARNOLD ◽  
CECILE MORALES ◽  
DARRICK CARTER

Although synthesizing and utilizing individual peptides and DNA primers has become relatively inexpensive, massively parallel probing and next-generation sequencing approaches have dramatically increased the number of molecules that can be subjected to screening; this, in turn, requires vast numbers of peptides and therefore results in significant expenses. To alleviate this issue, pools of related molecules are often used to downselect prior to testing individual sequences. A computational selection process to create pools of related sequences at large scale has not been reported for peptides. In the case of PCR primers, there have been successful attempts to address this problem by designing degenerate primers that can be produced at the same cost as conventional, unique primers and then be used to amplify several different genomic regions. We present an algorithm, "FlexGrePPS" (Flexible Greedy Peptide Pool Search), that can create a near-optimal set of peptide pools. This approach is also applicable to nucleotide sequences and outperforms most DNA primer selection programs. For proteomic compression with FlexGrePPS, the main focus of the work presented here, we demonstrate the feasibility of computing an exhaustive cover of pathogenic proteomes with degenerate peptides that lend themselves to antigenic screening. Furthermore, we present preliminary data demonstrating the experimental utility of highly degenerate peptides for antigenic screening. FlexGrePPS provides a near-optimal solution for proteomic compression, and no other programs are available for comparison. We also demonstrate the computational performance of our GreedyPrime implementation, a modified version of FlexGrePPS applicable to the design of degenerate primers, which is comparable to existing programs for this task.
Specifically, we focus on comparisons with PAMPS and DPS-DIP, software tools recently shown to be superior to other methods. FlexGrePPS forms the foundation of a novel antigenic screening methodology based on the representation of an entire proteome by near-optimal degenerate peptide pools. Our preliminary wet lab data indicate that the approach will likely prove successful in comprehensive wet lab studies, and hence will dramatically reduce the expenses for antigenic screening and make whole-proteome screening feasible. Although FlexGrePPS was designed for computational performance in order to handle vast data sets, we find, surprisingly, that even for small data sets the primer design version of FlexGrePPS, GreedyPrime, offers similar or even superior results on MP-DPD and most MDPD instances when compared to existing methods; despite their much longer run times, other approaches did not fare significantly better in reducing the original data sets to degenerate primers. The FlexGrePPS and GreedyPrime programs are available at no charge under the GNU LGPL license at http://sourceforge.net/projects/flexgrepps/ .
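The pooling task is essentially a covering problem: group peptides into degenerate patterns whose per-position residue sets stay within a degeneracy budget. A first-fit greedy sketch of that idea (my own toy illustration, not the published FlexGrePPS code, which searches for near-optimal covers):

```python
def degeneracy(pattern):
    """Number of concrete peptides a degenerate pattern encodes:
    the product of the residue-set sizes at each position."""
    n = 1
    for residues in pattern:
        n *= len(residues)
    return n

def greedy_pools(peptides, max_degeneracy):
    """First-fit greedy pooling of equal-length peptides into degenerate
    patterns (tuples of residue sets), keeping each pool's degeneracy
    within budget -- a toy stand-in for the FlexGrePPS greedy search."""
    pools = []
    for pep in peptides:
        for i, pat in enumerate(pools):
            cand = tuple(s | {a} for s, a in zip(pat, pep))
            if degeneracy(cand) <= max_degeneracy:
                pools[i] = cand
                break
        else:                        # no pool can absorb it: open a new one
            pools.append(tuple(frozenset([a]) for a in pep))
    return pools
```

For example, `greedy_pools(["ACD", "ACE", "AFD"], 4)` folds all three peptides into the single pattern A·{C,F}·{D,E}, synthesizable as one degenerate pool of four peptides; the degeneracy budget trades pool count against screening resolution.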


2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Yipu Zhang ◽  
Ping Wang

The new high-throughput technique ChIP-seq, coupling chromatin immunoprecipitation experiments with high-throughput sequencing technologies, has extended the identification of transcription factor binding locations to genome-wide regions. However, most existing motif discovery algorithms are time-consuming and struggle to identify binding motifs in ChIP-seq data, which is typically large scale. To improve efficiency, we propose a fast cluster motif finding algorithm, named FCmotif, to identify (l, d) motifs in large-scale ChIP-seq data sets. It is inspired by the emerging-substrings mining strategy: it finds the enriched substrings and then searches their neighborhood instances to construct PWMs and cluster motifs of different lengths. FCmotif does not follow the OOPS model constraint and can find long motifs. The effectiveness of the proposed algorithm has been demonstrated by experiments on ChIP-seq data sets from mouse ES cells. Detection of the real binding motifs and processing of the full-size data of several megabytes finished in a few minutes. The experimental results show that FCmotif is well suited to (l, d) motif finding in ChIP-seq data; it also demonstrates better performance than other widely used algorithms such as MEME, Weeder, ChIPMunk, and DREME.
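An (l, d) motif is an l-mer that every sequence contains within Hamming distance d. For tiny l the search can be done exhaustively, which makes the problem concrete; the brute force below is only an illustration of the task, since its 4^l candidate space is exactly what FCmotif's enriched-substring mining avoids enumerating.

```python
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def find_ld_motifs(seqs, l, d):
    """Brute-force (l, d) motif search: report every l-mer over ACGT that
    occurs in each sequence with at most d mismatches. Exponential in l,
    so usable only on toy inputs."""
    def occurs(motif, s):
        return any(hamming(motif, s[i:i + l]) <= d
                   for i in range(len(s) - l + 1))
    return [''.join(m) for m in product('ACGT', repeat=l)
            if all(occurs(m, s) for s in seqs)]
```

On ChIP-seq scale (tens of thousands of peaks, motifs of length 10+), enumerating all 4^l candidates is hopeless, hence the mining-based pruning.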


Identifying communities has always been a fundamental task in the analysis of complex networks. Currently used algorithms that identify community structures in large-scale real-world networks either require a priori information, such as the number and sizes of communities, or are computationally expensive. Among them, the label propagation algorithm (LPA) brings great scalability, but its inherent randomness limits its accuracy. In this paper, we study the equivalence properties of nodes on social network graphs according to the labeling criteria in order to shorten the graphs, and we develop label propagation algorithms on the shortened graphs that discover social network communities effectively, without requiring optimization of an objective function or advance information about the communities. Test results on sample data sets show that the proposed algorithm's execution time is significantly reduced compared to published algorithms. The proposed algorithm takes almost linear time and improves the overall quality of the identified communities in complex networks with a clear community structure.
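For reference, the baseline this work builds on, plain LPA, fits in a few lines: every node repeatedly adopts the most frequent label among its neighbors. The random update order and random tie-breaking visible below are exactly the source of the instability that the proposed node-equivalence shortening aims to tame. A minimal sketch with hypothetical names:

```python
import random

def label_propagation(adj, seed=0, max_iter=100):
    """Plain LPA on an adjacency dict: each node repeatedly adopts the
    most frequent label among its neighbors, with random visit order and
    random tie-breaking, until no label changes."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}                 # each node starts unique
    nodes = list(adj)
    for _ in range(max_iter):
        rng.shuffle(nodes)
        changed = False
        for v in nodes:
            if not adj[v]:
                continue
            counts = {}
            for u in adj[v]:
                counts[labels[u]] = counts.get(labels[u], 0) + 1
            best = max(counts.values())
            new = rng.choice([l for l, c in counts.items() if c == best])
            if new != labels[v]:
                labels[v] = new
                changed = True
        if not changed:                          # converged
            break
    return labels
```

On two triangles joined by a single bridge edge, the labels settle into one label per triangle; on graphs with weaker structure, different seeds can yield different partitions, which is the accuracy problem discussed above.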


2021 ◽  
Author(s):  
Miguel Mendez Sandin ◽  
Sarah Romac ◽  
Fabrice Not

Ribosomal DNA (rDNA) genes are known to be valuable markers for the barcoding of eukaryotic life and its phylogenetic classification at various taxonomic levels. The large-scale exploration of environmental microbial diversity through metabarcoding approaches has focused mainly on the hypervariable V4 and V9 regions of the 18S rDNA gene. Yet the accurate interpretation of such environmental surveys is hampered by technical biases (e.g., PCR and sequencing errors) and biological biases (e.g., intra-genomic variability). Here we explored the intra-genomic diversity of Nassellaria and Spumellaria specimens (Radiolaria) by comparing Sanger sequencing with two different high-throughput sequencing platforms: Illumina and Oxford Nanopore Technologies (MinION). Our analysis determined that intra-genomic variability of Nassellaria and Spumellaria is generally low, yet in some Spumellaria specimens we found two different copies of the V4 region with a similarity lower than 97%. Among the sequencing methods, Illumina showed the highest number of contaminations (i.e., environmental DNA, cross-contamination, tag-jumping), revealed by its high sequencing depth, and MinION showed the highest sequencing error rate (~14%). Yet the long reads produced by MinION (~2900 bp) allowed accurate phylogenetic reconstruction. These results highlight the need for careful interpretation of Illumina-based metabarcoding studies, in particular regarding low-abundance amplicons, and open future perspectives towards full-length environmental rDNA metabarcoding surveys.
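The 97% figure above refers to pairwise sequence identity between rDNA copies. For sequences that have already been aligned to equal length, that is just the fraction of matching columns; a toy version (ignoring gaps and the alignment step that real comparisons require):

```python
def percent_identity(a, b):
    """Percent identity of two pre-aligned, equal-length sequences.
    Toy version: real comparisons align first and handle gaps."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)
```

Two V4 copies below the 97% threshold commonly used for OTU clustering would be split into separate OTUs despite coming from a single organism, which is why intra-genomic variability matters for diversity estimates.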


2009 ◽  
Vol 35 (4) ◽  
pp. 559-595 ◽  
Author(s):  
Liang Huang ◽  
Hao Zhang ◽  
Daniel Gildea ◽  
Kevin Knight

Systems based on synchronous grammars and tree transducers promise to improve the quality of statistical machine translation output, but are often very computationally intensive. The complexity is exponential in the size of individual grammar rules due to arbitrary re-orderings between the two languages. We develop a theory of binarization for synchronous context-free grammars and present a linear-time algorithm for binarizing synchronous rules when possible. In our large-scale experiments, we found that almost all rules are binarizable and the resulting binarized rule set significantly improves the speed and accuracy of a state-of-the-art syntax-based machine translation system. We also discuss the more general, and computationally more difficult, problem of finding good parsing strategies for non-binarizable rules, and present an approximate polynomial-time algorithm for this problem.
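Whether a synchronous rule can be binarized depends only on the permutation relating its source- and target-side nonterminals, and the check runs in linear time with a shift-reduce stack: push each position as a unit span, and repeatedly merge the top two spans whenever their union is a contiguous interval. A sketch of just the check (the paper's algorithm also constructs the binarized rules):

```python
def binarizable(perm):
    """Linear-time stack check: a permutation admits a binary synchronous
    decomposition iff greedily merging adjacent stack spans that form a
    contiguous interval reduces the whole permutation to a single span."""
    stack = []                       # spans as (lo, hi) over target positions
    for x in perm:
        stack.append((x, x))
        while len(stack) >= 2:
            (a1, b1), (a2, b2) = stack[-2], stack[-1]
            lo, hi = min(a1, a2), max(b1, b2)
            if hi - lo == (b1 - a1) + (b2 - a2) + 1:   # union is contiguous
                stack[-2:] = [(lo, hi)]
            else:
                break
    return len(stack) == 1
```

The permutation (2, 1, 4, 3) reduces to a single span and is binarizable, while (2, 4, 1, 3), the classic non-ITG pattern, leaves four spans on the stack and is not, which is where the more expensive parsing strategies in the paper come in.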

