The importance of Chargaff’s second parity rule for genomic signatures in metagenomics

Mapping Intimacies ◽

10.1101/146001 ◽

2017 ◽

Author(s):

Fabio Gori ◽

Dimitrios Mavroeidis ◽

Mike SM Jetten ◽

Elena Marchiori

Keyword(s):

Dna Sequences ◽

Dimensional Space ◽

Mapping Function ◽

Metagenomic Data ◽

Base Calling ◽

Parity Rule ◽

Genomic Signatures ◽

Alignment Free ◽

Source Organism ◽

Reverse Complement

Abstract: An important problem in metagenomic data analysis is to identify the source organism, or at least taxon, of each sequence. Most methods tackle this problem in two steps using an alignment-free approach: first, the DNA sequences are represented as points of a real n-dimensional space via a mapping function; then, either clustering or classification algorithms are applied. Those mapping functions must be genomic signatures: the dissimilarity between the mapped points must reflect the degree of phylogenetic similarity of the source species. Designing good signatures for metagenomics can be challenging due to the special characteristics of metagenomic sequences; most existing signatures were not designed accordingly, and they were tested only on error-free sequences sampled from a few dozen species. In this work we comparatively analyze the quality of existing and novel signatures based on tetranucleotide frequencies via statistical models and computational experiments; we also study how they are affected by the generalized Chargaff’s second parity rule (GCSPR), which states that in a given sequence longer than 50 kbp, inverse oligonucleotides are approximately equally frequent. We analyze 38 million sequences of 150 bp to 1,000 bp with 1% base-calling error, sampled from 1,284 microbes. Our models indicate that GCSPR reduces strand-dependence of signatures, that is, their values are less affected by the source strand; GCSPR is further exploited by some signatures to reduce the intra-species dispersion. Two novel signatures stand out both in the models and in the experiments: the combination signature and the operation signature. The former achieves strand-independence without grouping oligonucleotides; this could be valuable for alignment-free sequence comparison methods when distinguishing inverse oligonucleotides matters. The operation signature sums the frequencies of reverse, complement, and inverse tetranucleotides; with 72 features, it reduces the computational intensity of the analysis.
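The grouping described for the operation signature can be illustrated with a minimal Python sketch. It is one reading of the abstract, not the authors' implementation: tetranucleotides are merged into classes that are closed under reversal, complementation, and inversion (reverse complement), and the frequencies within each class are summed. The function names `orbit_representative`, `grouped_signature`, and `reverse_complement` are hypothetical, and any normalization used in the paper is not reproduced here.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def orbit_representative(kmer):
    """Return the lexicographically smallest of a k-mer, its reverse,
    its complement, and its inverse (reverse complement)."""
    reverse = kmer[::-1]
    complement = kmer.translate(COMPLEMENT)
    inverse = complement[::-1]
    return min(kmer, reverse, complement, inverse)


def grouped_signature(sequence, k=4):
    """Relative frequency of each class of k-mers related by reversal
    and/or complementation. Windows with non-ACGT symbols are skipped."""
    sequence = sequence.upper()
    classes = sorted({orbit_representative("".join(p))
                      for p in product("ACGT", repeat=k)})
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[orbit_representative(kmer)] += 1
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in classes}


def reverse_complement(sequence):
    """Sequence of the opposite strand, read 5' to 3'."""
    return sequence.upper().translate(COMPLEMENT)[::-1]
```

For k = 4 this grouping yields 72 classes, which matches the feature count stated in the abstract. Because every class contains the inverse of each of its members, the resulting vector is the same whichever strand a read was sampled from.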


Numerical Characterization of DNA Sequences for Alignment-free Sequence Comparison – A Review

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207324666210811101437 ◽

2021 ◽

Vol 24 ◽

Author(s):

Natarajan Ramanathan ◽

Jayalakshmi Ramamurthy ◽

Ganapathy Natarajan

Keyword(s):

Dna Sequences ◽

Sequence Comparison ◽

Graphical Representation ◽

Dimensional Space ◽

Building Blocks ◽

Chaos Game Representation ◽

Alignment Free ◽

Comparison Methods ◽

Numerical Characterization

Background: Biological macromolecules, namely DNA, RNA, and protein, have their building blocks organized in a particular sequence, and the sequential arrangement encodes the evolutionary history of the organism (species). Hence, biological sequences have been used for studying evolutionary relationships among species. This is usually carried out by multiple sequence alignment (MSA) algorithms. Due to certain limitations of MSA, alignment-free sequence comparison methods were developed. The present review covers alignment-free sequence comparison methods based on numerical characterization of DNA sequences. Discussion: The graphical representation of DNA sequences by chaos game representation and other 2-dimensional and 3-dimensional methods is discussed. The evolution of numerical characterization from the various graphical representations and the application of the DNA invariants thus computed in phylogenetic analysis are presented. The extension of computing molecular descriptors in chemometrics to the calculation of a new set of DNA invariants and their use in alignment-free sequence comparison in an N-dimensional space and construction of phylogenetic trees is also reviewed. Conclusion: The phylogenetic trees constructed by the alignment-free sequence comparison methods using DNA invariants were found to be better than those constructed using alignment-based tools such as PHYLIP and ClustalW. One of the graphical representation methods has now been extended to study viral sequences of infectious diseases, identifying conserved regions for the design of peptide-based vaccines by combining numerical characterization and graphical representation.
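As a rough illustration of the chaos game representation mentioned above, the following Python sketch maps a DNA sequence to points in the unit square. The corner assignment (A, C, G, T at the four corners) and the starting point at the centre follow a commonly used convention and are assumptions here; the review may discuss other variants. The name `cgr_points` is hypothetical.

```python
# Corner assignment is a convention, assumed here for illustration.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence):
    """Return one (x, y) point per nucleotide: each point lies halfway
    between the previous point and the corner of the current base."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in sequence.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points
```

Numerical descriptors can then be derived from the point set, for example by counting points in a grid of cells, which is equivalent to counting k-mers for a suitable grid resolution.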


CLASSIFICATION AND IDENTIFICATION OF FUNGAL SEQUENCES USING CHARACTERISTIC RESTRICTION ENDONUCLEASE CUT ORDER

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720010004616 ◽

2010 ◽

Vol 08 (02) ◽

pp. 181-198 ◽

Cited By ~ 2

Author(s):

RAJIB SENGUPTA ◽

DHUNDY R. BASTOLA ◽

HESHAM H. ALI

Keyword(s):

Dna Sequences ◽

Restriction Enzymes ◽

Epidemiological Studies ◽

Global Alignment ◽

Target Sequence ◽

Alignment Algorithm ◽

Molecular Fingerprinting ◽

Data Set ◽

Alignment Free ◽

Wet Lab

Restriction Fragment Length Polymorphism (RFLP) is a powerful molecular tool that is extensively used in the molecular fingerprinting and epidemiological studies of microorganisms. In a wet-lab setting, the DNA is cut with one or more restriction enzymes and subjected to gel electrophoresis to obtain signature fragment patterns, which are used in the classification and identification of organisms. This wet-lab approach may not be practical when the experimental data set includes a large number of genetic sequences and a wide pool of restriction enzymes to choose from. In this study, we introduce a novel concept of Enzyme Cut Order, a biological property-based characteristic of DNA sequences that can be defined and analyzed computationally without any alignment algorithm. In this alignment-free approach, a similarity matrix is developed based on the pairwise Longest Common Subsequences (LCS) of the Enzyme Cut Orders. The choice of an ideal set of restriction enzymes used for analysis is augmented by using genetic algorithms. The results obtained from this approach, using internal transcribed spacer regions of rDNA from fungi as the target sequence, show that phylogenetically related organisms form a single cluster and that successful grouping of phylogenetically close or distant organisms depends on the choice of restriction enzymes used in the analysis. Additionally, comparison of trees obtained with this alignment-free method and the legacy method revealed highly similar tree topologies. This novel alignment-free method, which utilizes the Enzyme Cut Order and restriction enzyme profile, is a reliable alternative to local or global alignment-based classification and identification of organisms.
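The idea of comparing Enzyme Cut Orders by their longest common subsequence can be sketched in Python as follows. This is an illustrative reading of the abstract rather than the authors' code: the enzyme panel is a small hand-picked example, the position of a cut is approximated by the start of the recognition site, input is assumed to be uppercase, and the normalization in `cut_order_similarity` is an assumption. All function names are hypothetical.

```python
# Recognition sites of three well-known enzymes; the panel is an example only.
RECOGNITION_SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def enzyme_cut_order(sequence, sites=RECOGNITION_SITES):
    """List enzyme names in the order their sites occur along the sequence."""
    hits = []
    for enzyme, site in sites.items():
        start = sequence.find(site)
        while start != -1:
            hits.append((start, enzyme))
            start = sequence.find(site, start + 1)
    return [enzyme for _, enzyme in sorted(hits)]


def lcs_length(a, b):
    """Length of the longest common subsequence of two lists
    (row-by-row dynamic programming)."""
    previous = [0] * (len(b) + 1)
    for item_a in a:
        current = [0]
        for j, item_b in enumerate(b, start=1):
            if item_a == item_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def cut_order_similarity(seq_a, seq_b):
    """LCS length divided by the longer cut order (assumed normalization)."""
    order_a, order_b = enzyme_cut_order(seq_a), enzyme_cut_order(seq_b)
    longest = max(len(order_a), len(order_b))
    return lcs_length(order_a, order_b) / longest if longest else 0.0
```

Applying `cut_order_similarity` to every pair of sequences would give a similarity matrix of the kind the abstract describes, which could then be passed to a clustering or tree-building method.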


Towards end-to-end disease prediction from raw metagenomic data

10.1101/2020.10.29.360297 ◽

2020 ◽

Author(s):

Maxence Queyrel ◽

Edi Prifti ◽

Jean-Daniel Zucker

Keyword(s):

Dna Sequences ◽

Real Life ◽

Multiple Instance Learning ◽

Disease Classification ◽

Metagenomic Data ◽

Numerical Representation ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

End To End ◽

Bioinformatics Workflows

Abstract: Analysis of the human microbiome using metagenomic sequencing data has demonstrated high ability in discriminating various human diseases. Raw metagenomic sequencing data require multiple complex and computationally heavy bioinformatics steps prior to data analysis. Such data contain millions of short sequences read from the fragmented DNA sequences and are stored as fastq files. Conventional processing pipelines consist of multiple steps, including quality control, filtering, and alignment of sequences against genomic catalogs (genes, species, taxonomic levels, functional pathways, etc.). These pipelines are complex to use, time consuming, and rely on a large number of parameters that often introduce variability and affect the estimation of the microbiome elements. Recent studies have demonstrated that training Deep Neural Networks directly from raw sequencing data is a promising approach to bypass some of the challenges associated with mainstream bioinformatics pipelines. Most of these methods use the concept of word and sentence embeddings that create a meaningful numerical representation of DNA sequences, while extracting features and reducing the dimensionality of the data. In this paper we present an end-to-end approach that classifies patients into disease groups directly from raw metagenomic reads: metagenome2vec. This approach is composed of four steps: (i) generating a vocabulary of k-mers and learning their numerical embeddings; (ii) learning DNA sequence (read) embeddings; (iii) identifying the genome from which the sequence is most likely to come; and (iv) training a multiple instance learning classifier which predicts the phenotype based on the vector representation of the raw data. An attention mechanism is applied in the network so that the model can be interpreted, assigning a weight to the influence of each genome on the prediction. Using two public real-life datasets as well as a simulated one, we demonstrated that this original approach reached very high performance, comparable with state-of-the-art methods applied directly to data processed through mainstream bioinformatics workflows. These results are encouraging for this proof-of-concept work. We believe that with further dedication, the DNN models have the potential to surpass mainstream bioinformatics workflows in disease classification tasks.
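The first step listed above, generating a vocabulary of k-mers, can be sketched in a few lines of Python. This is only an illustration of tokenizing reads into overlapping k-mers and assigning integer ids; it does not reproduce metagenome2vec, and the embedding learning of the later steps is out of scope. The names `kmers`, `build_vocabulary`, and `encode` are hypothetical.

```python
def kmers(read, k):
    """Split a read into overlapping k-mers (stride 1)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_vocabulary(reads, k):
    """Assign an integer id to every distinct k-mer, in order of first use."""
    vocabulary = {}
    for read in reads:
        for kmer in kmers(read, k):
            vocabulary.setdefault(kmer, len(vocabulary))
    return vocabulary


def encode(read, vocabulary, k):
    """Turn a read into a list of k-mer ids, dropping unknown k-mers."""
    return [vocabulary[kmer] for kmer in kmers(read, k) if kmer in vocabulary]
```

The resulting id sequences play the role that word ids play in text models: an embedding method can be trained on them so that each k-mer, and then each read, receives a numerical vector.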


A topological characterization of DNA sequences based on chaos geometry and persistent homology

10.1101/2021.01.31.429071 ◽

2021 ◽

Author(s):

Dong Quan Ngoc Nguyen ◽

Phuong Dong Tan Le ◽

Lin Xing ◽

Lizhen Lin

Keyword(s):

Dna Sequences ◽

Algebraic Topology ◽

Influenza A ◽

Graphical Representation ◽

Dimensional Space ◽

Persistent Homology ◽

Point Clouds ◽

Influenza A Viruses ◽

Topological Characterization ◽

Topological Features

Abstract: Methods for analyzing similarities among DNA sequences play a fundamental role in computational biology and have a variety of applications in public health and in the field of genetics. In this paper, a novel geometric and topological method for analyzing similarities among DNA sequences is developed, based on persistent homology from algebraic topology, in combination with chaos geometry in 4-dimensional space as a graphical representation of DNA sequences. Our topological framework for DNA similarity analysis is general, alignment-free, and can deal with DNA sequences of various lengths, while providing first-of-its-kind visualization features for visual inspection of DNA sequences directly, based on topological features of point clouds that represent DNA sequences. As an application, we test our methods on three datasets including genome sequences of different types of Hantavirus, Influenza A viruses, and Human Papillomavirus.


Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements

GigaScience ◽

10.1093/gigascience/giaa048 ◽

2020 ◽

Vol 9 (5) ◽

Cited By ~ 1

Author(s):

Morteza Hosseini ◽

Diogo Pratas ◽

Burkhard Morgenstern ◽

Armando J Pinho

Keyword(s):

Dna Sequences ◽

Large Scale ◽

High Throughput Sequencing ◽

Genetic Disorders ◽

Chromosomal Evolution ◽

Genomic Rearrangements ◽

Efficient Tool ◽

Compression Technique ◽

Alignment Free ◽

Memory Efficient

Background: The development of high-throughput sequencing technologies and, as a result, the production of huge volumes of genomic data have accelerated biological and medical research and discovery. The study of genomic rearrangements is crucial owing to their role in chromosomal evolution, genetic disorders, and cancer. Results: We present Smash++, an alignment-free and memory-efficient tool to find and visualize small- and large-scale genomic rearrangements between 2 DNA sequences. This computational solution extracts information contents of the 2 sequences, exploiting a data compression technique to find rearrangements. We also present Smash++ visualizer, a tool that allows the visualization of the detected rearrangements along with their self- and relative complexity, by generating an SVG (Scalable Vector Graphics) image. Conclusions: Tested on several synthetic and real DNA sequences from bacteria, fungi, Aves, and Mammalia, the proposed tool was able to accurately find genomic rearrangements. The detected regions were in accordance with previous studies, which took alignment-based approaches or performed FISH (fluorescence in situ hybridization) analysis. The maximum peak memory usage among all experiments was ∼1 GB, which makes Smash++ feasible to run on present-day standard computers.


Local binary patterns as a feature descriptor in alignment-free visualisation of metagenomic data

2016 IEEE Symposium Series on Computational Intelligence (SSCI) ◽

10.1109/ssci.2016.7849955 ◽

2016 ◽

Cited By ~ 2

Author(s):

Samaneh Kouchaki ◽

Santosh Tirunagari ◽

Avraam Tapinos ◽

David L Robertson

Keyword(s):

Local Binary Patterns ◽

Metagenomic Data ◽

Feature Descriptor ◽

Alignment Free


Reverse-complement similarity codes for DNA sequences

2000 IEEE International Symposium on Information Theory (Cat. No.00CH37060) ◽

10.1109/isit.2000.866628 ◽

2002 ◽

Cited By ~ 3

Author(s):

A.G. D'yachkov ◽

P.A. Vilenkin ◽

D.C. Torney ◽

P.S. White

Keyword(s):

Dna Sequences ◽

Reverse Complement


CRAFT: Compact genome Representation towards large-scale Alignment-Free daTabase

10.1101/2020.07.10.196741 ◽

2020 ◽

Author(s):

Yang Young Lu ◽

Jiaxing Bai ◽

Yiwen Wang ◽

Ying Wang ◽

Fengzhu Sun

Keyword(s):

Dna Sequences ◽

Sequence Comparison ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Practical Interest ◽

Supplementary Information ◽

Computationally Efficient ◽

Sequencing Technologies ◽

Alignment Free

Motivation: Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results: We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With a 10^2- to 10^4-fold reduction of storage space, CRAFT performs fast queries for gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/[email protected]; [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.


Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model

10.1101/2021.11.10.468111 ◽

2021 ◽

Author(s):

Metin Balaban ◽

Nishat Anjum Bristy ◽

Ahnaf Faisal ◽

Md Shamsuzzoha Bayzid ◽

Siavash Mirarab

Keyword(s):

Dna Sequences ◽

Distance Estimation ◽

Sequence Evolution ◽

Phylogenetic Distance ◽

Strand Bias ◽

Alignment Free ◽

Bias Model ◽

Genome Wide ◽

Genome Wide Data ◽

Complex Models

While aligning sequences has been the dominant approach for determining homology prior to phylogenetic inference, alignment-free methods have much appeal in terms of simplifying the process of inference, especially when analyzing genome-wide data. Furthermore, alignment-free methods present the only option for some emerging forms of data such as genome skims, which cannot be assembled. Despite their appeal, alignment-free methods have not been competitive with alignment-based methods in terms of accuracy. One limitation of alignment-free methods is that they typically rely on simplified models of sequence evolution such as Jukes-Cantor. It is possible to compute pairwise distances under more complex models by computing frequencies of base substitutions, provided that these quantities can be estimated in the alignment-free setting. A particular limitation is that for many forms of genome-wide data, which arguably present the best use case for alignment-free methods, the strand of DNA sequences is unknown. Under such conditions, the so-called no-strand-bias models are the most complex models that can be used. Here, we show how to calculate distances under a no-strand-bias restriction of the General Time Reversible (GTR) model called TK4 without relying on alignments. The method relies on replacing letters in the input sequences and subsequently computing Jaccard indices between k-mer sets. For the method to work on large genomes, we also need to compute the number of k-mer mismatches after replacement due to random chance. We show in simulation that these alignment-free distances can be highly accurate when genomes evolve under the assumed models, and we examine the effectiveness of the method on real genomic data.
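The two building blocks named in the abstract, letter replacement followed by a Jaccard index between k-mer sets, can be sketched in Python as below. The abstract does not state which replacements the method uses, so the purine/pyrimidine mapping here is purely a hypothetical example, and the conversion from Jaccard indices to TK4 distances is not attempted. The function names are likewise hypothetical.

```python
# Hypothetical replacement scheme: collapse purines to R and pyrimidines to Y.
PURINE_PYRIMIDINE = {"A": "R", "G": "R", "C": "Y", "T": "Y"}


def replace_letters(sequence, mapping):
    """Rewrite a sequence over a reduced alphabet."""
    return sequence.translate(str.maketrans(mapping))


def kmer_set(sequence, k):
    """Set of all distinct k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|; defined as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def jaccard_after_replacement(seq_a, seq_b, k, mapping):
    """Jaccard index between k-mer sets of the rewritten sequences."""
    return jaccard(kmer_set(replace_letters(seq_a, mapping), k),
                   kmer_set(replace_letters(seq_b, mapping), k))
```

Collapsing letters makes some substitutions invisible to the k-mer comparison, so comparing Jaccard indices computed under different replacements gives information about the frequencies of different substitution types. With a reduced alphabet, k-mers also match by chance more often, which is the effect the abstract says must be corrected for on large genomes.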


Phylogenomics of 42 tomato chloroplasts using assembly and alignment-free method

10.7287/peerj.preprints.3271v1 ◽

2017 ◽

Author(s):

Raúl Amado Cattáneo ◽

Luis Diambra ◽

Andrés Norman McCarthy

Keyword(s):

Dna Sequences ◽

Evolutionary Biology ◽

Low Frequency ◽

Genomic Data ◽

Data Sets ◽

High Coverage ◽

Genome Data ◽

Sequencing Technologies ◽

Alignment Free ◽

Chloroplast Phylogeny

Phylogenetics and population genetics are central disciplines in evolutionary biology. Both are based on the comparison of single DNA sequences, or a concatenation of a number of these. However, with the advent of next-generation DNA sequencing technologies, approaches that consider large genomic data sets are of growing importance for the elucidation of evolutionary relationships among species. Among these approaches, assembly and alignment-free methods, which allow efficient distance computation and phylogeny reconstruction, are of great importance. However, it is not yet clear under what conditions of quality and abundance of genomic data such methods are able to infer phylogenies accurately. In the present study we assess the method originally proposed by Fan et al. for whole genome data in the elucidation of tomato chloroplast phylogenetics using short read sequences. We find that this assembly and alignment-free method is capable of reproducing previous results under conditions of high coverage, provided that low-frequency k-mers (i.e. error-prone data) are effectively filtered out. Finally, we present a complete chloroplast phylogeny for the best data quality candidates of the recently published 360 tomato genomes.
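The filtering of low-frequency k-mers mentioned above can be sketched in Python as follows. This is a generic illustration, not the procedure of Fan et al. or of the study: the threshold `min_count` is a hypothetical parameter, and the function names are assumptions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def drop_low_frequency(counts, min_count):
    """Keep only k-mers seen at least min_count times; rare k-mers are
    more likely to stem from sequencing errors when coverage is high."""
    return {kmer: n for kmer, n in counts.items() if n >= min_count}
```

At high coverage a genuine k-mer is expected to appear in many reads, whereas a k-mer created by a sequencing error tends to appear only once or twice, which is why a count threshold can separate the two reasonably well.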
