A survey and evaluations of histogram-based statistics in alignment-free sequence comparison

Brian B Luczak; Benjamin T James; Hani Z Girgis

doi:10.1093/bib/bbx161


Briefings in Bioinformatics ◽

10.1093/bib/bbx161 ◽

2017 ◽

Vol 20 (4) ◽

pp. 1222-1237 ◽

Cited By ~ 10

Author(s):

Brian B Luczak ◽

Benjamin T James ◽

Hani Z Girgis

Keyword(s):

Query Sequence ◽

Sequence Length ◽

Local Alignment ◽

Global Alignment ◽

Alignment Algorithm ◽


Earth Mover's Distance ◽

Alignment Free ◽

Length Difference ◽

Alignment Algorithms

Abstract Motivation: Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences. Results: We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover’s distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover’s distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists identify efficient alternatives to the costly alignment algorithm, saving thousands of computational hours. Availability: The source code of the benchmarking tool is available as Supplementary Materials.
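To make the idea of a statistic "calculated on two k-mer histograms" concrete, here is a minimal Python sketch. It is not the paper's benchmarking tool: the Manhattan distance is just one simple histogram statistic chosen for illustration, and the `paired_statistic` helper, including its `1 +` offset, is a hypothetical way of pairing a statistic with the length difference, not a formula taken from the abstract.

```python
from collections import Counter
from itertools import product


def kmer_histogram(seq, k, alphabet="ACGT"):
    """Count overlapping k-mers and return the counts in a fixed k-mer order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


def manhattan(h1, h2):
    """Sum of absolute count differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def length_difference(s1, s2):
    """Absolute difference between the two sequence lengths."""
    return abs(len(s1) - len(s2))


def paired_statistic(s1, s2, k):
    """Illustrative multiplicative pairing (an assumption, not the paper's formula)."""
    d = manhattan(kmer_histogram(s1, k), kmer_histogram(s2, k))
    return d * (1 + length_difference(s1, s2))


query = "ACGTACGTAC"
similar = "ACGTACGTAA"
different = "TTTTTTGGGGGGCC"
d_similar = manhattan(kmer_histogram(query, 2), kmer_histogram(similar, 2))
d_different = manhattan(kmer_histogram(query, 2), kmer_histogram(different, 2))
print(d_similar, d_different)  # a smaller value means more similar histograms
```

Each histogram is built in one pass over the sequence, which is why such statistics can serve as a fast filter before any quadratic alignment is attempted.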


LePrimAlign: local entropy-based alignment of PPI networks to predict conserved modules

BMC Genomics ◽

10.1186/s12864-019-6271-3 ◽

2019 ◽

Vol 20 (S9) ◽

Cited By ~ 1

Author(s):

Sawal Maskey ◽

Young-Rae Cho

Keyword(s):

Computational Cost ◽

Network Alignment ◽

System Level ◽

Local Network ◽

Local Alignment ◽

Global Alignment ◽

Alignment Algorithm ◽

Interaction Patterns ◽

Ppi Networks ◽

Alignment Algorithms

Abstract Background: Cross-species analysis of protein-protein interaction (PPI) networks provides an effective means of detecting conserved interaction patterns. Identifying such conserved substructures between PPI networks of different species increases our understanding of the principles driving the evolution of cellular organizations and their functions at a system level. In recent years, network alignment techniques have been applied to genome-scale PPI networks to predict evolutionarily conserved modules. Although a wide variety of network alignment algorithms have been introduced, developing a scalable local network alignment algorithm with high accuracy is still challenging. Results: We present a novel pairwise local network alignment algorithm, called LePrimAlign, to predict conserved modules between PPI networks of three different species. The proposed algorithm exploits the results of a pairwise global alignment algorithm with many-to-many node mapping. It also applies the concept of graph entropy to detect initial cluster pairs from two networks. Finally, the initial clusters are expanded to increase the local alignment score, which is formulated as a combination of intra-network and inter-network scores. The performance comparison with state-of-the-art approaches demonstrates that the proposed algorithm outperforms them in terms of the accuracy of identified protein complexes and the quality of alignments. Conclusion: The proposed method produces local network alignment of higher accuracy in predicting conserved modules even with large biological networks at a reduced computational cost.


CLASSIFICATION AND IDENTIFICATION OF FUNGAL SEQUENCES USING CHARACTERISTIC RESTRICTION ENDONUCLEASE CUT ORDER

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720010004616 ◽

2010 ◽

Vol 08 (02) ◽

pp. 181-198 ◽

Cited By ~ 2

Author(s):

RAJIB SENGUPTA ◽

DHUNDY R. BASTOLA ◽

HESHAM H. ALI

Keyword(s):

Dna Sequences ◽

Restriction Enzymes ◽

Epidemiological Studies ◽

Global Alignment ◽

Target Sequence ◽

Alignment Algorithm ◽

Molecular Fingerprinting ◽

Data Set ◽

Alignment Free ◽

Wet Lab

Restriction Fragment Length Polymorphism (RFLP) is a powerful molecular tool that is extensively used in the molecular fingerprinting and epidemiological studies of microorganisms. In a wet-lab setting, the DNA is cut with one or more restriction enzymes and subjected to gel electrophoresis to obtain signature fragment patterns, which are used in the classification and identification of organisms. This wet-lab approach may not be practical when the experimental data set includes a large number of genetic sequences and a wide pool of restriction enzymes to choose from. In this study, we introduce a novel concept of Enzyme Cut Order, a biological property-based characteristic of DNA sequences which can be defined and analyzed computationally without any alignment algorithm. In this alignment-free approach, a similarity matrix is developed based on the pairwise Longest Common Subsequences (LCS) of the Enzyme Cut Orders. The choice of an ideal set of restriction enzymes used for analysis is augmented by using genetic algorithms. The results obtained from this approach using internal transcribed spacer regions of rDNA from fungi as the target sequence show that the phylogenetically related organisms form a single cluster and that successful grouping of phylogenetically close or distant organisms is dependent on the choice of restriction enzymes used in the analysis. Additionally, comparison of trees obtained with this alignment-free method and the legacy method revealed highly similar tree topologies. This novel alignment-free method, which utilizes the Enzyme Cut Order and restriction enzyme profile, is a reliable alternative to local or global alignment-based classification and identification of organisms.
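The abstract does not spell out how an Enzyme Cut Order is computed, so the following Python sketch rests on an assumption: the cut order is taken to be the list of enzyme names sorted by the positions of their recognition sites, and two cut orders are compared by the length of their longest common subsequence. The function names and toy sequences are hypothetical; the three recognition sites are the standard ones for EcoRI, BamHI, and HindIII.

```python
def cut_order(seq, enzymes):
    """Return enzyme names ordered by where their recognition sites occur in seq."""
    hits = []
    for name, site in enzymes.items():
        start = seq.find(site)
        while start != -1:
            hits.append((start, name))
            start = seq.find(site, start + 1)
    return [name for _, name in sorted(hits)]


def lcs_length(a, b):
    """Length of the longest common subsequence of two lists (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


enzymes = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}
order_a = cut_order("TTGAATTCAAGGATCCTTAAGCTTGG", enzymes)
order_b = cut_order("GGATCCAAGAATTCTTAAGCTT", enzymes)
shared = lcs_length(order_a, order_b)
print(order_a, order_b, shared)
```

Filling a matrix with such pairwise LCS values would give a similarity matrix in the spirit of the one described, although any normalization the authors apply is not stated here.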


Acceleration of Nucleotide Semi-Global Alignment with Adaptive Banded Dynamic Programming

10.1101/130633 ◽

2017 ◽

Cited By ~ 9

Author(s):

Hajime Suzuki ◽

Masahiro Kasahara

Keyword(s):

Dynamic Programming ◽

Single Molecule ◽

Computation Time ◽

Error Rates ◽

Nucleotide Sequences ◽

Sequencing Error ◽

Local Alignment ◽

Global Alignment ◽

Alignment Algorithm ◽

Short Read

Abstract Motivation: Pairwise alignment of nucleotide sequences has previously been carried out using the seed-and-extend strategy, where we enumerate seeds (shared patterns) between sequences and then extend the seeds by Smith-Waterman-like semi-global dynamic programming to obtain full pairwise alignments. With the advent of massively parallel short read sequencers, algorithms and data structures for efficiently finding seeds have been extensively explored. However, recent advances in single-molecule sequencing technologies have enabled us to obtain millions of reads, each of which is orders of magnitude longer than those output by the short-read sequencers, demanding a faster algorithm for the extension step that accounts for most of the computation time required for pairwise local alignment. Our goal is to design a faster extension algorithm suitable for single-molecule sequencers with high sequencing error rates (e.g., 10-15%) and with more frequent insertions and deletions than substitutions. Results: We propose an adaptive banded dynamic programming algorithm for calculating pairwise semi-global alignment of nucleotide sequences that allows a relatively high insertion or deletion rate while keeping band width relatively low (e.g., 32 or 64 cells) regardless of sequence lengths. Our new algorithm eliminated mutual dependences between elements in a vector, allowing an efficient Single-Instruction-Multiple-Data parallelization. We experimentally demonstrate that our algorithm runs approximately 5× faster than the extension alignment algorithm in NCBI BLAST+ while retaining similar sensitivity (recall). We also show that our extension algorithm is more sensitive than the extension alignment routine in DALIGNER, while the computation time is comparable. Availability: The implementation of the algorithm and the benchmarking scripts are available at https://github.com/ocxtal/[email protected]
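For readers unfamiliar with banding, the Python sketch below shows the simpler fixed-width variant, in which only cells with |i - j| <= band are filled. It is deliberately not the adaptive scheme proposed in the abstract (where the band follows the alignment path), it computes a plain edit distance rather than a semi-global score, and the function name is hypothetical.

```python
def banded_edit_distance(a, b, band):
    """Edit distance restricted to cells with |i - j| <= band.

    Returns None when the length difference alone puts the end cell
    outside the band.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    inf = float("inf")
    # Row 0: turning an empty prefix of a into b[:j] costs j insertions.
    prev = {j: j for j in range(min(m, band) + 1)}
    for i in range(1, n + 1):
        curr = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                curr[j] = i  # delete the first i characters of a
                continue
            substitute = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])
            delete = prev.get(j, inf) + 1
            insert = curr.get(j - 1, inf) + 1
            curr[j] = min(substitute, delete, insert)
        prev = curr
    return prev[m]


print(banded_edit_distance("kitten", "sitting", 2))
```

A fixed band costs roughly O(band × length) instead of O(length²), but it fails once accumulated insertions and deletions push the optimal path off the diagonal, which is the limitation an adaptive band is meant to address.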


Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab001 ◽

2021 ◽

Vol 3 (1) ◽

Author(s):

Hani Z Girgis ◽

Benjamin T James ◽

Brian B Luczak

Keyword(s):

Dna Sequences ◽

Phylogenetic Trees ◽

Linear Models ◽

General Linear ◽

Global Alignment ◽

Optimal Alignment ◽

Pairwise Identity ◽

General Linear Models ◽

Alignment Free ◽

Alignment Algorithms

Abstract Pairwise global alignment is a fundamental step in sequence analysis. Optimal alignment algorithms are quadratic and therefore slow, especially on long sequences. In many applications that involve large sequence datasets, all that is needed is the identity score (the percentage of identical nucleotides in an optimal alignment, including gaps, of two sequences); there is no need for visualizing how every two sequences are aligned. For these applications, we propose Identity, which produces global identity scores for a large number of pairs of DNA sequences using alignment-free methods and self-supervised general linear models. For the first time, the new tool can predict pairwise identity scores in linear time and space. On two large-scale sequence databases, Identity provided the best compromise between sensitivity and precision while being faster than BLAST, Mash, MUMmer4 and USEARCH by 2–80 times. Identity was the best performing tool when searching for low-identity matches. While constructing phylogenetic trees from about 6000 transcripts, the tree due to the scores reported by Identity was the closest to the reference tree (in contrast to andi, FSWM and Mash). Identity is capable of producing pairwise identity scores of millions-of-nucleotides-long bacterial genomes; this task cannot be accomplished by any global-alignment-based tool. Availability: https://github.com/BioinformaticsToolsmith/Identity.
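The identity score defined in the abstract can be illustrated with a small Python sketch. It assumes an alignment is already available as two gapped rows of equal length, which is exactly the step the Identity tool avoids, so this shows only the quantity being predicted, not how the tool works. The function name and the `-` gap symbol are assumptions.

```python
def identity_score(row1, row2, gap="-"):
    """Percentage of alignment columns (gap columns included) holding identical residues."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    if not row1:
        return 0.0
    matches = sum(1 for x, y in zip(row1, row2) if x == y and x != gap)
    return 100.0 * matches / len(row1)


# Six columns: four matches, one mismatch (G/C), one gap column.
print(identity_score("ACGT-A", "ACCTTA"))
```

Because gap columns count toward the denominator, two sequences of very different lengths cannot reach a high identity score under this definition.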


High-throughput Protein Sequence Alignment on Multi-core Systems

International Journal of Integrated Engineering ◽

10.30880/ijie.2020.12.07.007 ◽

2020 ◽

Vol 12 (7) ◽

Author(s):

Muhammad Yahya ◽


Laiq Hasan ◽

Syed Asad Ali ◽


...

Keyword(s):

Sequence Alignment ◽

Large Scale ◽

Query Sequence ◽

Rapid Evolution ◽

Accurate Method ◽

Alignment Algorithm ◽

Sequencing Technologies ◽

Main Challenge ◽

Alignment Algorithms ◽

Wide Range

Rapid evolution in sequencing technologies results in generating data on an enormous scale. A main challenge in analyzing data at such a large scale is the alignment of the DNA/protein sequences, whereby reads are compared to the reference sequences. To find similar sequences, alignment algorithms are used to align a query sequence with the database. Alignment algorithms can be utilized to classify the source of a sequence, to discover similarities among the organisms, or to deduce a progenitor connection. A wide range of algorithms for alignment has been developed in recent years. In this paper, an accurate method of accelerating such algorithms using GPUs has been investigated. A Swiss-Prot database has been processed using a GPU-implemented Smith-Waterman sequence alignment algorithm. The first step in the process generates the alignment scores but not the actual alignment. Various available alignment tools like ssearch2 are then utilized to align the output file generated during the first step. The GPU-accelerated implementation is then compared with other techniques for performance/throughput improvement. The Swiss-Prot database was aligned using various alignment tools. An NVIDIA Tesla K40 GPU was used to generate the results for this research. This implementation achieves a performance of 44.3 Giga cell updates per second (GCUPS), which is 22.9 times better than the implementation on a GTX 275. Performance is improved as the workload of sequences of equal length is equally distributed among all the threads on the multiprocessors of the GPU.
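The GCUPS figure can be unpacked with a little arithmetic. The Python sketch below uses the usual definition (dynamic-programming cells filled per second, divided by 10^9); the query length, database size, and run time passed to `gcups` are hypothetical inputs chosen only to show the calculation, while 44.3 and 22.9 come from the abstract.

```python
def gcups(query_length, database_residues, seconds):
    """Giga cell updates per second for one query scanned against a database."""
    return query_length * database_residues / (seconds * 1e9)


# Hypothetical run: a 1,000-residue query against 200 million residues in 5 s.
example = gcups(1000, 200_000_000, 5.0)

# Baseline implied by the abstract: 44.3 GCUPS is 22.9 times the GTX 275 figure.
implied_gtx275 = 44.3 / 22.9

print(example, round(implied_gtx275, 2))
```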


F1000Research TMATCH: A New Algorithm for Protein Alignments using amino-acid hydrophobicities

10.1101/2019.12.16.878744 ◽

2019 ◽

Cited By ~ 1

Author(s):

David Cavanaugh ◽

Krishnan Chittur

Keyword(s):

Amino Acids ◽

Dynamic Programming ◽

Local Alignment ◽

Alignment Algorithm ◽

Hydrophobicity Scale ◽

Fundamental Properties ◽

Extra Information ◽

Alignment Algorithms ◽

Gap Opening ◽

Protein Alignments

Abstract The identification of proteins of similar structure using sequence alignment is an important problem in bioinformatics. We describe TMATCH, a basic dynamic programming alignment algorithm which can rapidly identify proteins of similar structure from a database. TMATCH was developed to utilize an optimal hydrophobicity metric for alignments traceable to fundamental properties of amino acids. Standard alignment algorithms use affine gap penalties, whereas TMATCH adapts local alignment scoring to reinforce favorable diagonal paths (transitions) and punish unfavorable transitions, paired with fixed gap opening penalties. The TMATCH algorithm is especially designed to take advantage of the extra information available within the hydrophobicity scale to detect homologies, as opposed to the probabilities derived from raw percent identities.


MATCHING ALGORITHM USING WAVELET THINNING FEATURES FOR OFFLINE SIGNATURE VERIFICATION

International Journal of Wavelets Multiresolution and Information Processing ◽

10.1142/s021969130700163x ◽

2007 ◽

Vol 05 (01) ◽

pp. 27-38 ◽

Cited By ~ 4

Author(s):

BIN FANG ◽

XINGE YOU ◽

WEN-SHENG CHEN ◽

YUAN YAN TANG

Keyword(s):

Similarity Measurement ◽

Local Alignment ◽

Alignment Algorithm ◽

Affine Model ◽

Alignment Algorithms ◽

Offline Signature Verification ◽

Training Samples ◽

Limited Training Samples ◽

Global And Local ◽

Discriminatory Information

Structure distortion evaluation allows us to directly measure the similarity between signature patterns without classification using feature vectors, which usually suffers from limited training samples. In this paper, we incorporate the merits of both global and local alignment algorithms to define structure distortion using signature skeletons identified by a robust wavelet thinning technique. A weak affine model is employed to globally register two signature skeletons, and the structure distortion between two signature patterns is determined by applying an elastic local alignment algorithm. Similarity measurement is evaluated in the form of the Euclidean distance of all found corresponding feature points. Experimental results showed that the proposed similarity measurement was able to provide sufficient discriminatory information, achieving an equal error rate of 18.6% with four training samples.


GASAL2: a GPU accelerated sequence alignment library for high-throughput NGS data

BMC Bioinformatics ◽

10.1186/s12859-019-3086-9 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 5

Author(s):

Nauman Ahmed ◽

Jonathan Lévy ◽

Shanshan Ren ◽

Hamid Mushtaq ◽

Koen Bertels ◽

...

Keyword(s):

Sequence Alignment ◽

High Throughput ◽

High Performance ◽

Local Alignment ◽

Global Alignment ◽

Pairwise Sequence Alignment ◽

Rna Sequences ◽

Dna And Rna ◽

Alignment Algorithms ◽

Ngs Data

Abstract Background: Due to the computational complexity of sequence alignment algorithms, various accelerated solutions have been proposed to speed up this analysis. NVBIO is the only available GPU library that accelerates sequence alignment of high-throughput NGS data, but has limited performance. In this article we present GASAL2, a GPU library for aligning DNA and RNA sequences that outperforms existing CPU and GPU libraries. Results: The GASAL2 library provides specialized, accelerated kernels for local, global and all types of semi-global alignment. Pairwise sequence alignment can be performed with and without traceback. GASAL2 outperforms the fastest CPU-optimized SIMD implementations such as SeqAn and Parasail, as well as NVIDIA’s own GPU-based library known as NVBIO. GASAL2 is unique in performing sequence packing on GPU, which is up to 750x faster than NVBIO. Overall, on a GeForce GTX 1080 Ti GPU, GASAL2 is up to 21x faster than Parasail on a dual socket hyper-threaded Intel Xeon system with 28 cores and up to 13x faster than NVBIO with a query length of up to 300 bases and 100 bases, respectively. GASAL2 alignment functions are asynchronous/non-blocking and allow full overlap of CPU and GPU execution. The paper shows how to use GASAL2 to accelerate BWA-MEM, speeding up the local alignment by 20x, which gives an overall application speedup of 1.3x vs. CPU with up to 12 threads. Conclusions: The library provides high-performance APIs for local, global and semi-global alignment that can be easily integrated into various bioinformatics tools.


Pin-Align: A New Dynamic Programming Approach to Align Protein-Protein Interaction Networks

Computational and Mathematical Methods in Medicine ◽

10.1155/2014/393908 ◽

2014 ◽

Vol 2014 ◽

pp. 1-13 ◽

Cited By ~ 1

Author(s):

Farid Amir-Ghiasvand ◽

Abbas Nowzari-Dalini ◽

Vida Momenzadeh

Keyword(s):

Protein Interaction ◽

Protein Interaction Networks ◽

Interaction Networks ◽

Programming Approach ◽

Local Alignment ◽

Global Alignment ◽

Interaction Patterns ◽

Protein Protein Interaction ◽

Alignment Algorithms ◽

Protein Protein Interaction Networks

To date, few tools for aligning protein-protein interaction networks have been suggested. These tools typically find conserved interaction patterns using various local or global alignment algorithms. However, the improvement of the speed, scalability, simplification, and accuracy of network alignment tools is still the target of new research. In this paper, we introduce Pin-Align, a new tool for local alignment of protein-protein interaction networks. Pin-Align accuracy is tested on protein interaction networks from IntAct, DIP, and the Stanford Network Database and the results are compared with other well-known algorithms. It is shown that Pin-Align has higher sensitivity and specificity in terms of KEGG Ortholog groups.


Qudaich: A smart sequence aligner

10.1101/060509 ◽

2016 ◽

Author(s):

Sajia Akhter ◽

Robert A Edwards

Keyword(s):

Sequence Alignment ◽

Dna Sequences ◽

High Throughput Sequencing ◽

Query Sequence ◽

Metagenomic Data ◽

Alignment Algorithm ◽

Next Generation ◽

Sequence Alignments ◽

Alignment Algorithms ◽

Local Sequence

Abstract Next generation sequencing (NGS) technology produces massive amounts of data in a reasonable time and at low cost. Analyzing and annotating these data requires sequence alignments to compare them with genes, proteins and genomes in different databases. Sequence alignment is the first step in metagenomics analysis, and pairwise comparisons of sequence reads provide a measure of similarity between environments. Most of the current aligners focus on aligning NGS datasets against long reference sequences rather than comparing between datasets. As the number of metagenomes and other genomic data increases each year, there is a demand for more sophisticated, faster sequence alignment algorithms. Here, we introduce a novel sequence aligner, Qudaich, which can efficiently process large volumes of data and is suited to de novo comparisons of next generation reads datasets. Qudaich can handle both DNA and protein sequences and attempts to provide the best possible alignment for each query sequence. Qudaich can produce more useful alignments more quickly than other contemporary alignment algorithms. Author Summary: The recent developments in sequencing technology provide high throughput sequencing data and have resulted in large volumes of genomic and metagenomic data available in public databases. Sequence alignment is an important step for annotating these data. Many sequence aligners have been developed in the last few years for efficient analysis of these data; however, most of them are only able to align DNA sequences and mainly focus on aligning NGS data against long reference genomes. Therefore, in this study we have designed a new sequence aligner, qudaich, which can generate pairwise local sequence alignment (at both the DNA and protein level) between two NGS datasets and can efficiently handle the large volume of NGS datasets. In qudaich, we introduce a unique sequence alignment algorithm, which outperforms the traditional approaches. Qudaich not only takes less time to execute, but also finds more useful alignments than contemporary aligners.
