Gradients Do Grow on Trees: A Linear-Time O(N)-Dimensional Gradient for Statistical Phylogenetics

2020 ◽  
Vol 37 (10) ◽  
pp. 3047-3060
Author(s):  
Xiang Ji ◽  
Zhenyu Zhang ◽  
Andrew Holbrook ◽  
Akihiko Nishimura ◽  
Guy Baele ◽  
...  

Abstract Calculation of the log-likelihood stands as the computational bottleneck for many statistical phylogenetic algorithms. Even worse is its gradient evaluation, often used to target regions of high probability. O(N)-dimensional gradient calculations based on the standard pruning algorithm require O(N²) operations, where N is the number of sampled molecular sequences. With the advent of high-throughput sequencing, recent phylogenetic studies have analyzed hundreds to thousands of sequences, with an apparent trend toward even larger data sets as a result of advancing technology. Such large-scale analyses challenge phylogenetic reconstruction by requiring inference on larger sets of process parameters to model the increasing data heterogeneity. To make these analyses tractable, we present a linear-time algorithm for O(N)-dimensional gradient evaluation and apply it to general continuous-time Markov processes of sequence substitution on a phylogenetic tree without a need to assume either stationarity or reversibility. We apply this approach to learn the branch-specific evolutionary rates of three pathogenic viruses: West Nile virus, Dengue virus, and Lassa virus. Our proposed algorithm significantly improves inference efficiency with a 126- to 234-fold increase in maximum-likelihood optimization and a 16- to 33-fold computational performance increase in a Bayesian framework.
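The linear-time idea pairs the classic post-order (pruning) pass with a single pre-order pass that reuses the cached partial likelihoods, so each branch's gradient costs only a small fixed amount of extra work. Below is a minimal sketch of that idea for a toy two-state symmetric substitution model on a three-tip tree; all names (`topology`, `gradient`, the closed-form `P(t)`) are illustrative, not the authors' implementation, which handles general models.

```python
import numpy as np

# Toy two-state symmetric CTMC; the paper's algorithm covers general models.
Q = np.array([[-1.0, 1.0], [1.0, -1.0]])           # rate matrix

def P(t):
    e = np.exp(-2.0 * t)                           # expm(Q t) in closed form
    return 0.5 * np.array([[1 + e, 1 - e], [1 - e, 1 + e]])

topology = {0: [1, 2], 2: [3, 4]}                  # children of internal nodes
leaf_state = {1: 0, 3: 1, 4: 0}                    # observed tip states
pi = np.array([0.5, 0.5])                          # root state distribution

def gradient(brlen):
    """Log-likelihood and its gradient w.r.t. every branch length,
    via one post-order and one pre-order traversal (linear time)."""
    part, msg = {}, {}
    def down(v):                                   # post-order: pruning pass
        if v in leaf_state:
            part[v] = np.eye(2)[leaf_state[v]]
            return
        part[v] = np.ones(2)
        for c in topology[v]:
            down(c)
            msg[c] = P(brlen[c]) @ part[c]         # message child -> parent
            part[v] = part[v] * msg[c]
    down(0)
    lik = pi @ part[0]
    grad = {}
    def up(v, outer):                              # pre-order: reuse messages
        for c in topology.get(v, []):
            sib = outer.copy()
            for b in topology[v]:
                if b != c:
                    sib = sib * msg[b]             # siblings' contribution
            # dP/dt = Q P(t), so each branch gradient is a small contraction.
            grad[c] = (sib @ (Q @ P(brlen[c])) @ part[c]) / lik
            up(c, sib @ P(brlen[c]))
    up(0, pi)
    return np.log(lik), grad
```

Rerunning the pruning pass once per branch, as the standard approach does, would repeat the post-order work N times; here the pre-order pass amortizes it away.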

2020 ◽  
Vol 34 (04) ◽  
pp. 4412-4419 ◽  
Author(s):  
Zhao Kang ◽  
Wangtao Zhou ◽  
Zhitong Zhao ◽  
Junming Shao ◽  
Meng Han ◽  
...  

A plethora of multi-view subspace clustering (MVSC) methods have been proposed over the past few years. Researchers manage to boost clustering accuracy from different points of view. However, many state-of-the-art MVSC algorithms typically have quadratic or even cubic complexity, making them inefficient and inherently difficult to apply at large scales. In the era of big data, the computational issue becomes critical. To fill this gap, we propose a large-scale MVSC (LMVSC) algorithm with linear order complexity. Inspired by the idea of the anchor graph, we first learn a smaller graph for each view. Then, a novel approach is designed to integrate those graphs so that we can implement spectral clustering on a smaller graph. Interestingly, it turns out that our model also applies to the single-view scenario. Extensive experiments on various large-scale benchmark data sets validate the effectiveness and efficiency of our approach with respect to state-of-the-art clustering methods.
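The anchor-graph trick that makes this linear in n can be sketched in a few lines for a single view: connect every point to m << n anchors, and do the spectral step on that thin n-by-m matrix instead of a full n-by-n graph. This is a hypothetical illustration, not the LMVSC code; anchors here are a simple strided sample rather than the k-means anchors typically used, and the multi-view fusion step is omitted.

```python
import numpy as np

def anchor_graph_clustering(X, n_anchors, k):
    """Single-view anchor-graph spectral clustering sketch: the spectral
    step costs O(n * m^2) for m anchors -- linear in n -- instead of the
    O(n^3) of spectral clustering on a full n-by-n similarity graph."""
    anchors = X[::max(1, len(X) // n_anchors)][:n_anchors]
    # Gaussian similarity of every point to every anchor, row-normalized.
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    Z = np.exp(-d2 / d2.mean())                    # crude bandwidth heuristic
    Z = Z / Z.sum(1, keepdims=True)
    # Top-k left singular vectors of Z D^{-1/2} give the spectral embedding
    # of the bipartite graph without ever forming the n-by-n similarity.
    U, _, _ = np.linalg.svd(Z / np.sqrt(Z.sum(0)), full_matrices=False)
    emb = U[:, :k]
    # Tiny deterministic k-means (farthest-point init) on the embedding.
    centers = [emb[0]]
    for _ in range(k - 1):
        d = ((emb[:, None, :] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(emb[d.argmax()])
    centers = np.array(centers)
    for _ in range(50):
        labels = ((emb[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = emb[labels == j].mean(0)
    return labels
```

In the multi-view setting, one such Z is learned per view and the integrated graph is clustered the same way.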


2016 ◽  
Vol 6 (1) ◽  
Author(s):  
Stinus Lindgreen ◽  
Karen L. Adair ◽  
Paul P. Gardner

Abstract Metagenome studies are becoming increasingly widespread, yielding important insights into microbial communities covering diverse environments from terrestrial and aquatic ecosystems to human skin and gut. With the advent of high-throughput sequencing platforms, the use of large scale shotgun sequencing approaches is now commonplace. However, a thorough independent benchmark comparing state-of-the-art metagenome analysis tools is lacking. Here, we present a benchmark where the most widely used tools are tested on complex, realistic data sets. Our results clearly show that the most widely used tools are not necessarily the most accurate, that the most accurate tool is not necessarily the most time consuming and that there is a high degree of variability between available tools. These findings are important as the conclusions of any metagenomics study are affected by errors in the predicted community composition and functional capacity. Data sets and results are freely available from http://www.ucbioinformatics.org/metabenchmark.html
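"Errors in the predicted community composition" are quantified by comparing a tool's predicted abundance profile against the known truth of the simulated data set. One simple example metric (a hypothetical illustration, not necessarily the scoring used in this benchmark) is the L1 distance over taxa:

```python
def l1_error(true_abund, pred_abund):
    """L1 distance between two relative-abundance profiles keyed by taxon.
    Hypothetical example metric; the benchmark's own scoring may differ."""
    taxa = set(true_abund) | set(pred_abund)
    return sum(abs(true_abund.get(t, 0.0) - pred_abund.get(t, 0.0))
               for t in taxa)
```

Taxa missed entirely by a tool (or hallucinated by it) contribute their full abundance to the error, which is why composition mistakes propagate directly into a study's conclusions.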


2012 ◽  
Vol 10 (05) ◽  
pp. 1250009
Author(s):  
WILLIAM KRIVAN ◽  
NICK ARNOLD ◽  
CECILE MORALES ◽  
DARRICK CARTER

Although synthesizing and utilizing individual peptides and DNA primers has become relatively inexpensive, massively parallel probing and next-generation sequencing approaches have dramatically increased the number of molecules that can be subjected to screening; this, in turn, requires vast numbers of peptides and therefore results in significant expenses. To alleviate this issue, pools of related molecules are often used to downselect prior to testing individual sequences. A computational selection process to create pools of related sequences at large scale has not been reported for peptides. In the case of PCR primers, there have been successful attempts to address this problem by designing degenerate primers that can be produced at the same cost as conventional, unique primers and then be used to amplify several different genomic regions. We present an algorithm, "FlexGrePPS" (Flexible Greedy Peptide Pool Search), that can create a near-optimal set of peptide pools. This approach is also applicable to nucleotide sequences and outperforms most DNA primer selection programs. For proteomic compression with FlexGrePPS, the main focus of the work presented here, we demonstrate the feasibility of computing an exhaustive cover of pathogenic proteomes with degenerate peptides that lend themselves to antigenic screening. Furthermore, we present preliminary data demonstrating the experimental utility of highly degenerate peptides for antigenic screening. FlexGrePPS provides a near-optimal solution for proteomic compression, and no other programs are available for comparison. We also demonstrate the computational performance of our GreedyPrime implementation, a modified version of FlexGrePPS applicable to the design of degenerate primers, which is comparable to existing programs for this task.
Specifically, we focus on comparisons with PAMPS and DPS-DIP, software tools recently shown to be superior to other methods. FlexGrePPS forms the foundation of a novel antigenic screening methodology based on the representation of an entire proteome by near-optimal degenerate peptide pools. Our preliminary wet lab data indicate that the approach will likely prove successful in comprehensive wet lab studies, and hence will dramatically reduce the expenses for antigenic screening and make whole-proteome screening feasible. Although FlexGrePPS was designed for computational performance in order to handle vast data sets, we find, surprisingly, that even for small data sets the primer design version of FlexGrePPS, GreedyPrime, offers similar or even superior results on MP-DPD and most MDPD instances when compared to existing methods; despite their much longer run times, other approaches did not fare significantly better in reducing the original data sets to degenerate primers. The FlexGrePPS and GreedyPrime programs are available at no charge under the GNU LGPL license at http://sourceforge.net/projects/flexgrepps/ .
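The pooling task is essentially a covering problem: group peptides into degenerate patterns whose per-position residue sets stay within a degeneracy budget. A first-fit greedy sketch of that idea (my own toy illustration, not the published FlexGrePPS code, which searches for near-optimal covers):

```python
def degeneracy(pattern):
    """Number of concrete peptides a degenerate pattern encodes:
    the product of the residue-set sizes at each position."""
    n = 1
    for residues in pattern:
        n *= len(residues)
    return n

def greedy_pools(peptides, max_degeneracy):
    """First-fit greedy pooling of equal-length peptides into degenerate
    patterns (tuples of residue sets), keeping each pool's degeneracy
    within budget -- a toy stand-in for the FlexGrePPS greedy search."""
    pools = []
    for pep in peptides:
        for i, pat in enumerate(pools):
            cand = tuple(s | {a} for s, a in zip(pat, pep))
            if degeneracy(cand) <= max_degeneracy:
                pools[i] = cand
                break
        else:                        # no pool can absorb it: open a new one
            pools.append(tuple(frozenset([a]) for a in pep))
    return pools
```

For example, `greedy_pools(["ACD", "ACE", "AFD"], 4)` folds all three peptides into the single pattern A·{C,F}·{D,E}, synthesizable as one degenerate pool of four peptides; the degeneracy budget trades pool count against screening resolution.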


2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Yipu Zhang ◽  
Ping Wang

The new high-throughput technique ChIP-seq, coupling chromatin immunoprecipitation experiments with high-throughput sequencing technologies, has extended the identification of transcription factor binding locations to genome-wide regions. However, most existing motif discovery algorithms are time-consuming and struggle to identify binding motifs in ChIP-seq data, which is typically large scale. To improve efficiency, we propose a fast cluster motif finding algorithm, named FCmotif, to identify (l, d) motifs in large-scale ChIP-seq data sets. It is inspired by the emerging-substrings mining strategy: it finds the enriched substrings and then searches their neighborhood instances to construct PWMs and cluster motifs of different lengths. FCmotif does not follow the OOPS model constraint and can find long motifs. The effectiveness of the proposed algorithm has been demonstrated by experiments on ChIP-seq data sets from mouse ES cells. Detection of the real binding motifs and processing of the full-size data of several megabytes finished in a few minutes. The experimental results show that FCmotif is well suited to (l, d) motif finding in ChIP-seq data; it also demonstrates better performance than other widely used algorithms such as MEME, Weeder, ChIPMunk, and DREME.
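An (l, d) motif is an l-mer that every sequence contains within Hamming distance d. For tiny l the search can be done exhaustively, which makes the problem concrete; the brute force below is only an illustration of the task, since its 4^l candidate space is exactly what FCmotif's enriched-substring mining avoids enumerating.

```python
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def find_ld_motifs(seqs, l, d):
    """Brute-force (l, d) motif search: report every l-mer over ACGT that
    occurs in each sequence with at most d mismatches. Exponential in l,
    so usable only on toy inputs."""
    def occurs(motif, s):
        return any(hamming(motif, s[i:i + l]) <= d
                   for i in range(len(s) - l + 1))
    return [''.join(m) for m in product('ACGT', repeat=l)
            if all(occurs(m, s) for s in seqs)]
```

On ChIP-seq scale (tens of thousands of peaks, motifs of length 10+), enumerating all 4^l candidates is hopeless, hence the mining-based pruning.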


Identifying communities has always been a fundamental task in the analysis of complex networks. Currently used algorithms that identify community structures in large-scale real-world networks either require a priori information, such as the number and sizes of communities, or are computationally expensive. Among them, the label propagation algorithm (LPA) brings great scalability, but its inherent randomness limits its accuracy. In this paper, we study the equivalence properties of nodes on social network graphs according to the labeling criteria in order to shorten the graphs, and we develop label propagation algorithms on the shortened graphs that discover social network communities effectively, without requiring optimization of an objective function or advance information about the communities. Test results on sample data sets show that the proposed algorithm's execution time is significantly reduced compared to published algorithms. The proposed algorithm takes almost linear time and improves the overall quality of the identified communities in complex networks with a clear community structure.
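For reference, the baseline this work builds on, plain LPA, fits in a few lines: every node repeatedly adopts the most frequent label among its neighbors. The random update order and random tie-breaking visible below are exactly the source of the instability that the proposed node-equivalence shortening aims to tame. A minimal sketch with hypothetical names:

```python
import random

def label_propagation(adj, seed=0, max_iter=100):
    """Plain LPA on an adjacency dict: each node repeatedly adopts the
    most frequent label among its neighbors, with random visit order and
    random tie-breaking, until no label changes."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}                 # each node starts unique
    nodes = list(adj)
    for _ in range(max_iter):
        rng.shuffle(nodes)
        changed = False
        for v in nodes:
            if not adj[v]:
                continue
            counts = {}
            for u in adj[v]:
                counts[labels[u]] = counts.get(labels[u], 0) + 1
            best = max(counts.values())
            new = rng.choice([l for l, c in counts.items() if c == best])
            if new != labels[v]:
                labels[v] = new
                changed = True
        if not changed:                          # converged
            break
    return labels
```

On two triangles joined by a single bridge edge, the labels settle into one label per triangle; on graphs with weaker structure, different seeds can yield different partitions, which is the accuracy problem discussed above.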


2021 ◽  
Author(s):  
Miguel Mendez Sandin ◽  
Sarah Romac ◽  
Fabrice Not

Ribosomal DNA (rDNA) genes are known to be valuable markers for the barcoding of eukaryotic life and its phylogenetic classification at various taxonomic levels. The large-scale exploration of environmental microbial diversity through metabarcoding approaches has focused mainly on the hypervariable V4 and V9 regions of the 18S rDNA gene. Yet the accurate interpretation of such environmental surveys is hampered by technical biases (e.g., PCR and sequencing errors) and biological biases (e.g., intra-genomic variability). Here we explored the intra-genomic diversity of Nassellaria and Spumellaria specimens (Radiolaria) by comparing Sanger sequencing with two different high-throughput sequencing platforms: Illumina and Oxford Nanopore Technologies (MinION). Our analysis determined that intra-genomic variability of Nassellaria and Spumellaria is generally low, yet in some Spumellaria specimens we found two different copies of the V4 region with a similarity lower than 97%. Among the sequencing methods, Illumina showed the highest number of contaminations (i.e., environmental DNA, cross-contamination, tag-jumping), revealed by its high sequencing depth, and MinION showed the highest sequencing error rate (~14%). Yet the long reads produced by MinION (~2900 bp) allowed accurate phylogenetic reconstruction. These results highlight the need for careful interpretation of Illumina-based metabarcoding studies, in particular regarding low-abundance amplicons, and open future perspectives towards full-length environmental rDNA metabarcoding surveys.
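The 97% figure above refers to pairwise sequence identity between rDNA copies. For sequences that have already been aligned to equal length, that is just the fraction of matching columns; a toy version (ignoring gaps and the alignment step that real comparisons require):

```python
def percent_identity(a, b):
    """Percent identity of two pre-aligned, equal-length sequences.
    Toy version: real comparisons align first and handle gaps."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)
```

Two V4 copies below the 97% threshold commonly used for OTU clustering would be split into separate OTUs despite coming from a single organism, which is why intra-genomic variability matters for diversity estimates.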


2009 ◽  
Vol 35 (4) ◽  
pp. 559-595 ◽  
Author(s):  
Liang Huang ◽  
Hao Zhang ◽  
Daniel Gildea ◽  
Kevin Knight

Systems based on synchronous grammars and tree transducers promise to improve the quality of statistical machine translation output, but are often very computationally intensive. The complexity is exponential in the size of individual grammar rules due to arbitrary re-orderings between the two languages. We develop a theory of binarization for synchronous context-free grammars and present a linear-time algorithm for binarizing synchronous rules when possible. In our large-scale experiments, we found that almost all rules are binarizable and the resulting binarized rule set significantly improves the speed and accuracy of a state-of-the-art syntax-based machine translation system. We also discuss the more general, and computationally more difficult, problem of finding good parsing strategies for non-binarizable rules, and present an approximate polynomial-time algorithm for this problem.
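Whether a synchronous rule can be binarized depends only on the permutation relating its source- and target-side nonterminals, and the check runs in linear time with a shift-reduce stack: push each position as a unit span, and repeatedly merge the top two spans whenever their union is a contiguous interval. A sketch of just the check (the paper's algorithm also constructs the binarized rules):

```python
def binarizable(perm):
    """Linear-time stack check: a permutation admits a binary synchronous
    decomposition iff greedily merging adjacent stack spans that form a
    contiguous interval reduces the whole permutation to a single span."""
    stack = []                       # spans as (lo, hi) over target positions
    for x in perm:
        stack.append((x, x))
        while len(stack) >= 2:
            (a1, b1), (a2, b2) = stack[-2], stack[-1]
            lo, hi = min(a1, a2), max(b1, b2)
            if hi - lo == (b1 - a1) + (b2 - a2) + 1:   # union is contiguous
                stack[-2:] = [(lo, hi)]
            else:
                break
    return len(stack) == 1
```

The permutation (2, 1, 4, 3) reduces to a single span and is binarizable, while (2, 4, 1, 3), the classic non-ITG pattern, leaves four spans on the stack and is not, which is where the more expensive parsing strategies in the paper come in.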

