Alignment-free Sequence Searching over Whole Genomes Using 3D random plot of Query DNA Sequences

DaYoung Lee

doi:10.31449/inf.v42i3.2276

CLASSIFICATION AND IDENTIFICATION OF FUNGAL SEQUENCES USING CHARACTERISTIC RESTRICTION ENDONUCLEASE CUT ORDER

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720010004616 ◽

2010 ◽

Vol 08 (02) ◽

pp. 181-198 ◽

Cited By ~ 2

Author(s):

RAJIB SENGUPTA ◽

DHUNDY R. BASTOLA ◽

HESHAM H. ALI

Keyword(s):

Dna Sequences ◽

Restriction Enzymes ◽

Epidemiological Studies ◽

Global Alignment ◽

Target Sequence ◽

Alignment Algorithm ◽

Molecular Fingerprinting ◽

Data Set ◽

Alignment Free ◽

Wet Lab

Restriction Fragment Length Polymorphism (RFLP) is a powerful molecular tool that is extensively used in the molecular fingerprinting and epidemiological studies of microorganisms. In a wet-lab setting, the DNA is cut with one or more restriction enzymes and subjected to gel electrophoresis to obtain signature fragment patterns, which are used in the classification and identification of organisms. This wet-lab approach may not be practical when the experimental data set includes a large number of genetic sequences and a wide pool of restriction enzymes to choose from. In this study, we introduce a novel concept of Enzyme Cut Order, a biological property-based characteristic of DNA sequences that can be defined and analyzed computationally without any alignment algorithm. In this alignment-free approach, a similarity matrix is developed based on the pairwise Longest Common Subsequences (LCS) of the Enzyme Cut Orders. The choice of an ideal set of restriction enzymes used for analysis is augmented by using genetic algorithms. The results obtained from this approach, using internal transcribed spacer regions of rDNA from fungi as the target sequence, show that phylogenetically related organisms form a single cluster and that successful grouping of phylogenetically close or distant organisms depends on the choice of restriction enzymes used in the analysis. Additionally, comparison of trees obtained with this alignment-free method and the legacy method revealed highly similar tree topologies. This novel alignment-free method, which utilizes the Enzyme Cut Order and restriction enzyme profile, is a reliable alternative to local or global alignment-based classification and identification of organisms.
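
The Enzyme Cut Order idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the three-enzyme panel, the single-strand site search, and the normalization of the LCS length by the longer cut order are all assumptions made for illustration. The recognition sites shown (EcoRI, BamHI, HindIII) are the standard ones for those enzymes.

```python
# Illustrative sketch only: names and the similarity normalization are assumptions.
ENZYMES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def cut_order(seq, enzymes):
    """List enzyme names in the order their recognition sites occur in seq."""
    hits = []
    for name, site in enzymes.items():
        pos = seq.find(site)
        while pos != -1:
            hits.append((pos, name))
            pos = seq.find(site, pos + 1)
    return [name for _, name in sorted(hits)]


def lcs_length(a, b):
    """Length of the longest common subsequence of two lists (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: LCS length divided by the longer cut order."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


seq_a = "AAGAATTCTTGGATCCAAAAGCTT"
seq_b = "GGATCCTTGAATTCTTAAGCTT"
order_a = cut_order(seq_a, ENZYMES)
order_b = cut_order(seq_b, ENZYMES)
print(order_a, order_b, similarity(order_a, order_b))
```

Applying `similarity` to every pair of sequences would fill a similarity matrix of the kind the abstract mentions. The genetic-algorithm selection of enzymes is not sketched here.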


Alignment-free phylogeny of whole genomes using underlying subwords

Algorithms for Molecular Biology ◽

10.1186/1748-7188-7-34 ◽

2012 ◽

Vol 7 (1) ◽

Cited By ~ 45

Author(s):

Matteo Comin ◽

Davide Verzotto

Keyword(s):

Alignment Free ◽

Whole Genomes


Model-based inference of punctuated molecular evolution

10.1101/852343 ◽

2019 ◽

Cited By ~ 1

Author(s):

Marc Manceau ◽

Julie Marin ◽

Hélène Morlon ◽

Amaury Lambert

Keyword(s):

Molecular Evolution ◽

Dna Sequences ◽

Temporal Variations ◽

Multiple Sequence ◽

Model Combining ◽

Constant Rate ◽

Whole Genomes ◽

Standard Models ◽

Natural Variance ◽

Venom Proteins

Abstract. In standard models of molecular evolution, DNA sequences evolve through asynchronous substitutions according to Poisson processes with a constant rate (called the molecular clock) or a time-varying rate (relaxed clock). However, DNA sequences can also undergo episodes of fast divergence that will appear as synchronous substitutions affecting several sites simultaneously at the macroevolutionary time scale. Here, we develop a model combining basal, clock-like molecular evolution with episodes of fast divergence called spikes arising at speciation events. Given a multiple sequence alignment and its time-calibrated species phylogeny, our model is able to detect speciation events (including hidden ones) co-occurring with spike events and to estimate the probability and amplitude of these spikes on the phylogeny. We identify the conditions under which spikes can be distinguished from the natural variance of the clock-like component of molecular evolution and from temporal variations of the clock. We apply the method to genes underlying snake venom proteins and identify several spikes at gene-specific locations in the phylogeny. This work should pave the way for analyses relying on whole genomes to inform on modes of species diversification.


Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements

GigaScience ◽

10.1093/gigascience/giaa048 ◽

2020 ◽

Vol 9 (5) ◽

Cited By ~ 1

Author(s):

Morteza Hosseini ◽

Diogo Pratas ◽

Burkhard Morgenstern ◽

Armando J Pinho

Keyword(s):

Dna Sequences ◽

Large Scale ◽

High Throughput Sequencing ◽

Genetic Disorders ◽

Chromosomal Evolution ◽

Genomic Rearrangements ◽

Efficient Tool ◽

Compression Technique ◽

Alignment Free ◽

Memory Efficient

Abstract. Background: The development of high-throughput sequencing technologies and, as a result, the production of huge volumes of genomic data have accelerated biological and medical research and discovery. The study of genomic rearrangements is crucial owing to their role in chromosomal evolution, genetic disorders, and cancer. Results: We present Smash++, an alignment-free and memory-efficient tool to find and visualize small- and large-scale genomic rearrangements between 2 DNA sequences. This computational solution extracts information contents of the 2 sequences, exploiting a data compression technique to find rearrangements. We also present Smash++ visualizer, a tool that allows the visualization of the detected rearrangements along with their self- and relative complexity, by generating an SVG (Scalable Vector Graphics) image. Conclusions: Tested on several synthetic and real DNA sequences from bacteria, fungi, Aves, and Mammalia, the proposed tool was able to accurately find genomic rearrangements. The detected regions were in accordance with previous studies, which took alignment-based approaches or performed FISH (fluorescence in situ hybridization) analysis. The maximum peak memory usage among all experiments was ∼1 GB, which makes Smash++ feasible to run on present-day standard computers.


VISTA Family of Computational Tools for Comparative Analysis of DNA Sequences and Whole Genomes

Gene Mapping, Discovery, and Expression ◽

10.1385/1-59745-097-9:69 ◽

2006 ◽

pp. 69-90 ◽

Cited By ~ 3

Author(s):

Inna Dubchak ◽

Dmitriy V. Ryaboy

Keyword(s):

Comparative Analysis ◽

Dna Sequences ◽

Computational Tools ◽

Whole Genomes


Model-Based Inference of Punctuated Molecular Evolution

Molecular Biology and Evolution ◽

10.1093/molbev/msaa144 ◽

2020 ◽

Vol 37 (11) ◽

pp. 3308-3323 ◽

Cited By ~ 2

Author(s):

Marc Manceau ◽

Julie Marin ◽

Hélène Morlon ◽

Amaury Lambert

Keyword(s):

Molecular Evolution ◽

Dna Sequences ◽

Multiple Sequence ◽

Model Combining ◽

Constant Rate ◽

Whole Genomes ◽

Standard Models ◽

Natural Variance ◽

Venom Proteins ◽

Relaxed Clock

Abstract. In standard models of molecular evolution, DNA sequences evolve through asynchronous substitutions according to Poisson processes with a constant rate (called the molecular clock) or a rate that can vary (relaxed clock). However, DNA sequences can also undergo episodes of fast divergence that will appear as synchronous substitutions affecting several sites simultaneously at the macroevolutionary timescale. Here, we develop a model, which we call the Relaxed Clock with Spikes model, combining basal, clock-like molecular substitutions with episodes of fast divergence called spikes arising at speciation events. Given a multiple sequence alignment and its time-calibrated species phylogeny, our model is able to detect speciation events (including hidden ones) cooccurring with spike events and to estimate the probability and amplitude of these spikes on the phylogeny. We identify the conditions under which spikes can be distinguished from the natural variance of the clock-like component of molecular substitutions and from variations of the clock. We apply the method to genes underlying snake venom proteins and identify several spikes at gene-specific locations in the phylogeny. This work should pave the way for analyses relying on whole genomes to inform on modes of species diversification.


Site2genome: locating short DNA sequences in whole genomes

Bioinformatics ◽

10.1093/bioinformatics/bth094 ◽

2004 ◽

Vol 20 (9) ◽

pp. 1468-1469 ◽

Cited By ~ 2

Author(s):

M. C. Frith ◽

A. S. Halees ◽

U. Hansen ◽

Z. Weng

Keyword(s):

Dna Sequences ◽

Whole Genomes


CRAFT: Compact genome Representation towards large-scale Alignment-Free daTabase

10.1101/2020.07.10.196741 ◽

2020 ◽

Author(s):

Yang Young Lu ◽

Jiaxing Bai ◽

Yiwen Wang ◽

Ying Wang ◽

Fengzhu Sun

Keyword(s):

Dna Sequences ◽

Sequence Comparison ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Practical Interest ◽

Supplementary Information ◽

Computationally Efficient ◽

Sequencing Technologies ◽

Alignment Free

Abstract. Motivation: Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results: We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With a 10^2 to 10^4-fold reduction of storage space, CRAFT performs fast queries over gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/[email protected]; [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.
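
The memory concern raised in this abstract follows from the size of a dense k-mer frequency vector, which has 4^k entries for the DNA alphabet. The Python sketch below only illustrates that growth and is unrelated to CRAFT's learned embedding; the function names `kmer_counts` and `dense_frequency_vector` are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def dense_frequency_vector(counts, k, alphabet="ACGT"):
    """Relative frequency of every possible k-mer, in lexicographic order."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than k")
    return [counts["".join(p)] / total for p in product(alphabet, repeat=k)]


counts = kmer_counts("ACGTACGT", 2)
vec = dense_frequency_vector(counts, 2)
print(len(vec))  # one entry per possible 2-mer

# The dense vector length is 4**k, so it grows quickly with k.
for k in (4, 8, 12, 16):
    print(k, 4 ** k)
```

At k = 16 the dense vector already has over four billion entries per sequence, which is why compact representations are of practical interest.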


Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model

10.1101/2021.11.10.468111 ◽

2021 ◽

Author(s):

Metin Balaban ◽

Nishat Anjum Bristy ◽

Ahnaf Faisal ◽

Md Shamsuzzoha Bayzid ◽

Siavash Mirarab

Keyword(s):

Dna Sequences ◽

Distance Estimation ◽

Sequence Evolution ◽

Phylogenetic Distance ◽

Strand Bias ◽

Alignment Free ◽

Bias Model ◽

Genome Wide ◽

Genome Wide Data ◽

Complex Models

While aligning sequences has been the dominant approach for determining homology prior to phylogenetic inference, alignment-free methods have much appeal in terms of simplifying the process of inference, especially when analyzing genome-wide data. Furthermore, alignment-free methods present the only option for some emerging forms of data such as genome skims, which cannot be assembled. Despite the appeal, alignment-free methods have not been competitive with alignment-based methods in terms of accuracy. One limitation of alignment-free methods is that they typically rely on simplified models of sequence evolution such as Jukes-Cantor. It is possible to compute pairwise distances under more complex models by computing frequencies of base substitutions provided that these quantities can be estimated in the alignment-free setting. A particular limitation is that for many forms of genome-wide data, which arguably present the best use case for alignment-free methods, the strand of DNA sequences is unknown. Under such conditions, the so-called no-strand-bias models are the most complex models that can be used. Here, we show how to calculate distances under a no-strand-bias restriction of the General Time Reversible (GTR) model called TK4 without relying on alignments. The method relies on replacing letters in the input sequences, and subsequent computation of Jaccard indices between k-mer sets. For the method to work on large genomes, we also need to compute the number of k-mer mismatches after replacement due to random chance. We show in simulation that these alignment-free distances can be highly accurate when genomes evolve under the assumed models, and we examine the effectiveness of the method on real genomic data.
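
The Jaccard index between k-mer sets, which this abstract names as a building block, can be sketched in a few lines of Python. The sketch covers only that one step: the letter-replacement scheme, the correction for chance k-mer matches, and the TK4 distance itself are not reproduced here, and the function names are hypothetical.

```python
def kmer_set(seq, k):
    """Set of distinct overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; defined here as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


set_a = kmer_set("ACGTAC", 3)
set_b = kmer_set("ACGTTT", 3)
print(sorted(set_a & set_b), jaccard(set_a, set_b))
```

For real genomes the sets would normally be built from much longer k-mers, and because the strand is unknown in the setting the abstract describes, a practical version would also need to treat a k-mer and its reverse complement as equivalent.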


Phylogenomics of 42 tomato chloroplasts using assembly and alignment-free method

10.7287/peerj.preprints.3271v1 ◽

2017 ◽

Author(s):

Raúl Amado Cattáneo ◽

Luis Diambra ◽

Andrés Norman McCarthy

Keyword(s):

Dna Sequences ◽

Evolutionary Biology ◽

Low Frequency ◽

Genomic Data ◽

Data Sets ◽

High Coverage ◽

Genome Data ◽

Sequencing Technologies ◽

Alignment Free ◽

Chloroplast Phylogeny

Phylogenetics and population genetics are central disciplines in evolutionary biology. Both are based on the comparison of single DNA sequences, or a concatenation of a number of these. However, with the advent of next-generation DNA sequencing technologies, approaches that consider large genomic data sets are of growing importance for the elucidation of evolutionary relationships among species. Among these approaches, assembly and alignment-free methods, which allow efficient distance computation and phylogeny reconstruction, are of great importance. However, it is not yet clear under what conditions of quality and abundance of genomic data such methods are able to infer phylogenies accurately. In the present study we assess the method originally proposed by Fan et al. for whole genome data, in the elucidation of tomato chloroplast phylogenetics using short read sequences. We find that this assembly and alignment-free method is capable of reproducing previous results under conditions of high coverage, provided that low-frequency k-mers (i.e. error-prone data) are effectively filtered out. Finally, we present a complete chloroplast phylogeny for the best data quality candidates of the recently published 360 tomato genomes.
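
The filtering of low-frequency k-mers mentioned above can be sketched as follows. This is a generic Python illustration rather than the method of Fan et al.: the function names and the `min_count` threshold are assumptions, and a suitable threshold would depend on sequencing coverage.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count overlapping k-mers across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_low_frequency(counts, min_count):
    """Keep only k-mers seen at least min_count times (assumed threshold)."""
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


# Toy reads: the third read differs in its last base, as a sequencing error might.
reads = ["ACGTA", "ACGTA", "ACGTT"]
counts = count_kmers(reads, 4)
kept = filter_low_frequency(counts, min_count=2)
print(kept)
```

K-mers that appear only once are more likely to come from sequencing errors than from the genome, so dropping them before computing distances reduces noise when coverage is high enough for genuine k-mers to recur.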
