Sequence Comparison Alignment-Free Approach Based on Suffix Tree and L-Words Frequency

The Scientific World JOURNAL ◽

10.1100/2012/450124 ◽

2012 ◽

Vol 2012 ◽

pp. 1-4 ◽

Cited By ~ 9

Author(s):

Inês Soares ◽

Ana Goios ◽

António Amorim

Keyword(s):

Sequence Comparison ◽

Suffix Tree ◽

Linear Time ◽

Distance Matrix ◽

Tree Structures ◽

Python Language ◽

Alignment Free ◽

Genetic Distance Matrix ◽

Alternative Alignment ◽

Alignment Step

The vast majority of methods available for sequence comparison rely on an initial sequence alignment step, which requires a number of assumptions about evolutionary history and is sometimes very difficult or impossible to perform because of the abundance of gaps (insertions/deletions). In such cases, an alternative alignment-free method would prove valuable. Our method starts by computing a generalized suffix tree of all sequences, which is completed in linear time. Using this tree, the frequency of all possible words of a preset length L (L-words) in each sequence is rapidly calculated. Based on the L-words frequency profile of each sequence, a pairwise standard Euclidean distance is then computed, producing a symmetric genetic distance matrix, which can be used to generate a neighbor joining dendrogram or a multidimensional scaling graph. We present an improvement to word counting alignment-free approaches for sequence comparison, by determining a single optimal word length and combining suffix tree structures with the word counting tasks. Our approach is thus a fast and simple application that proved to be efficient and powerful when applied to mitochondrial genomes. The algorithm was implemented in Python and is freely available on the web.
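The pipeline described in this abstract (L-word frequency profile per sequence, then pairwise Euclidean distance) can be illustrated with a minimal Python sketch. This is not the authors' implementation: it counts words with a plain sliding window instead of a generalized suffix tree, and the function names, the DNA alphabet default, and the use of relative (rather than absolute) frequencies are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
import math


def lword_profile(seq, L, alphabet="ACGT"):
    """Relative frequency of every possible L-word in seq.

    Assumption: a simple sliding window stands in for the suffix tree
    used in the paper; it yields the same counts, only less efficiently.
    """
    counts = Counter(seq[i:i + L] for i in range(len(seq) - L + 1))
    total = max(len(seq) - L + 1, 1)  # avoid division by zero for short input
    # Fixed word order so that all profiles are comparable vectors.
    return [counts["".join(w)] / total for w in product(alphabet, repeat=L)]


def distance_matrix(seqs, L, alphabet="ACGT"):
    """Symmetric matrix of pairwise Euclidean distances between profiles."""
    profiles = [lword_profile(s, L, alphabet) for s in seqs]
    n = len(profiles)
    return [[math.dist(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    # Toy sequences, invented for the example.
    for row in distance_matrix(["ACGTACGT", "ACGTACGA", "TTTTTTTT"], L=2):
        print(["%.3f" % d for d in row])
```

The resulting matrix could then be passed to any neighbor joining or multidimensional scaling routine; the choice of L is left to the caller here, whereas the paper determines a single optimal word length.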


ADACT: a tool for analysing (dis)similarity among nucleotide and protein sequences using minimal and relative absent words

Bioinformatics ◽

10.1093/bioinformatics/btaa853 ◽

2020 ◽

Author(s):

Mujtahid Akon ◽

Muntashir Akon ◽

Mohimenul Kabir ◽

M Saifur Rahman ◽

M Sohel Rahman

Keyword(s):

Sequence Comparison ◽

Distance Matrix ◽

Distance Measures ◽

Supplementary Information ◽

Biological Sequences ◽

Web Based ◽

Species Relationship ◽

Alignment Free ◽

Absent Words ◽

Time And Space Complexity

Motivation: Researchers and practitioners use a number of popular sequence comparison tools that use many alignment-based techniques. Due to high time and space complexity and length-related restrictions, researchers often seek alignment-free tools. Recently, some interesting ideas, namely Minimal Absent Words (MAW) and Relative Absent Words (RAW), have received much interest among the scientific community as distance measures that can give us alignment-free alternatives. This drives us to structure a framework for analysing biological sequences in an alignment-free manner. Results: In this application note, we present Alignment-free Dissimilarity Analysis & Comparison Tool (ADACT), a simple web-based tool that computes the analogy among sequences using a varied number of indexes through the distance matrix, species relation list and phylogenetic tree. This tool combines absent word (MAW or RAW) computation, dissimilarity measures and species relationship, and thus brings all required software into one platform for the ease of researchers and practitioners alike in the field of bioinformatics. We have also developed a RESTful API. Availability and implementation: ADACT has been hosted at http://research.buet.ac.bd/ADACT/. Supplementary information: Supplementary data are available at Bioinformatics online.
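To make the notion of a minimal absent word concrete: a word is a MAW of a sequence if it does not occur in the sequence while both its longest proper prefix and its longest proper suffix do. The brute-force Python sketch below enumerates MAWs up to a length bound and compares two MAW sets with a Jaccard distance. It is an illustration only, not ADACT's code or API; the function names, the length bound, and the choice of Jaccard as the dissimilarity index are assumptions.

```python
from itertools import product


def minimal_absent_words(seq, max_len, alphabet="ACGT"):
    """Brute-force MAWs of seq with length <= max_len (exponential in max_len)."""
    # Every substring of seq up to max_len, plus the empty word.
    present = {seq[i:i + k]
               for k in range(1, max_len + 1)
               for i in range(len(seq) - k + 1)}
    present.add("")
    maws = set()
    for k in range(1, max_len + 1):
        for letters in product(alphabet, repeat=k):
            w = "".join(letters)
            # Absent itself, but dropping the first or last letter gives
            # a word that does occur in seq.
            if w not in present and w[1:] in present and w[:-1] in present:
                maws.add(w)
    return maws


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; hypothetical choice of dissimilarity index."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


if __name__ == "__main__":
    x = minimal_absent_words("ACGTACGT", 4)
    y = minimal_absent_words("ACGTTCGA", 4)
    print(sorted(x), sorted(y), jaccard_distance(x, y))
```

Practical tools compute MAWs with suffix-array or suffix-automaton based algorithms rather than by enumeration; the sketch only fixes the definition.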


Extraction of high quality k-words for alignment-free sequence comparison

Journal of Theoretical Biology ◽

10.1016/j.jtbi.2014.05.016 ◽

2014 ◽

Vol 358 ◽

pp. 31-51 ◽

Cited By ~ 7

Author(s):

Upuli Gunasinghe ◽

Damminda Alahakoon ◽

Susan Bedingfield

Keyword(s):

Sequence Comparison ◽

High Quality ◽

Alignment Free


Alignment-free sequence comparison for virus genomes based on location correlation coefficient

Infection Genetics and Evolution ◽

10.1016/j.meegid.2021.105106 ◽

2021 ◽

pp. 105106

Author(s):

Lily He ◽

Siyang Sun ◽

Qianyue Zhang ◽

Xiaona Bao ◽

Peter K. Li

Keyword(s):

Correlation Coefficient ◽

Sequence Comparison ◽

Alignment Free ◽

Virus Genomes


The Suffix Tree of a Tree and Minimizing Sequential Transducers

BRICS Report Series ◽

10.7146/brics.v2i47.19948 ◽

1995 ◽

Vol 2 (47) ◽

Author(s):

Dany Breslauer

Keyword(s):

Suffix Tree ◽

Linear Time ◽

Time Algorithm ◽

Linear Time Algorithm

This paper gives a linear-time algorithm for the construction of the suffix tree of a tree. The suffix tree of a tree is used to obtain an efficient algorithm for the minimization of sequential transducers.


Agromorphological and molecular analysis discloses wide genetic variability in sunflower breeding lines from USDA, USA

Indian Journal of Genetics and Plant Breeding (The) ◽

10.31742/ijgpb.79.2.8 ◽

2019 ◽

Vol 79 (02) ◽

Author(s):

K. T. Ramya ◽

A. Vishnuvardhan Reddy ◽

M. Sujatha

Keyword(s):

Genetic Distance ◽

Polymorphic Information Content ◽

Distance Matrix ◽

Coefficient Matrix ◽

Male Sterile ◽

Breeding Lines ◽

Genetic Distance Matrix ◽

Ssr Primers ◽

Fertility Restorers ◽

Simple Sequence

The present study investigates genetic divergence among 84 fertility restorers and 32 cytoplasmic male sterile (CMS) lines of sunflower augmented from USDA, USA along with the popular Indian parental lines using simple sequence repeats (SSR). Thirty-nine polymorphic SSR primers produced 139 alleles with an average of 3.56 alleles per locus. The polymorphic information content ranged from 0.23 to 0.69 with an average of 0.45. The average genetic distance was 0.45 and 0.42 for the R and CMS lines, respectively. Dendrogram based on the dissimilarity coefficient matrix grouped the CMS and R lines into separate clusters except for Cluster A which consisted of all CMS lines along with five R lines. Genetic distance matrix estimated from three sets of mitochondrial primers (BOX, ERIC and REP) grouped the 32 CMS lines into eight clusters. The results suggest the existence of considerable genetic diversity among the restorer and CMS lines of sunflower obtained from USDA, USA.


Alignment-Free Sequence Comparison Based on Next Generation Sequencing Reads: Extended Abstract

Lecture Notes in Computer Science - Research in Computational Molecular Biology ◽

10.1007/978-3-642-29627-7_29 ◽

2012 ◽

pp. 272-285 ◽

Cited By ~ 2

Author(s):

Kai Song ◽

Jie Ren ◽

Zhiyuan Zhai ◽

Xuemei Liu ◽

Minghua Deng ◽

...

Keyword(s):

Next Generation Sequencing ◽

Sequence Comparison ◽

Next Generation ◽

Alignment Free ◽

Generation Sequencing


Alignment-free sequence comparison: A systematic survey from a machine learning perspective

IEEE/ACM Transactions on Computational Biology and Bioinformatics ◽

10.1109/tcbb.2022.3140873 ◽

2022 ◽

pp. 1-1

Author(s):

Katrin Sophie Bohnsack ◽

Marika Kaden ◽

Julia Abel ◽

Thomas Villmann

Keyword(s):

Machine Learning ◽

Sequence Comparison ◽

Systematic Survey ◽

Alignment Free


Suffix Tree Data Structures for Matrices

Pattern Matching Algorithms ◽

10.1093/oso/9780195113679.003.0013 ◽

1997 ◽

Author(s):

R. Giancarlo ◽

R. Grossi

Keyword(s):

Linear Space ◽

Suffix Tree ◽

Linear Time ◽

Suffix Trees ◽

Construction Time ◽

Matching Problems ◽

Tree Construction ◽

The Matrix ◽

Visual Databases ◽

Efficient Construction

We discuss the suffix tree generalization to matrices in this chapter. We extend the suffix tree notion (described in Chapter 3) from text strings to text matrices whose entries are taken from an ordered alphabet with the aim of solving pattern-matching problems. This suffix tree generalization can be efficiently used to implement low-level routines for Computer Vision, Data Compression, Geographic Information Systems and Visual Databases. We examine the submatrices in the form of the text’s contiguous parts that still have a matrix shape. Representing these text submatrices as “suitably formatted” strings stored in a compacted trie is the rationale behind suffix trees for matrices. The choice of the format inevitably influences suffix tree construction time and space complexity. We first deal with square matrices and show that many suffix tree families can be defined for the same input matrix according to the matrix’s string representations. We can store each suffix tree in linear space and give an efficient construction algorithm whose input is both the matrix and the string representation chosen. We then treat rectangular matrices and define their corresponding suffix trees by means of some general rules which we list formally. We show that there is a super-linear lower bound to the space required (in contrast with the linear space required by suffix trees for square matrices). We give a simple example of one of these suffix trees. The last part of the chapter illustrates some technical results regarding suffix trees for square matrices: we show how to achieve an expected linear-time suffix tree construction for a constant-size alphabet under some mild probabilistic assumptions about the input distribution. We begin by defining a wide class of string representations for square matrices. We let Σ denote an ordered alphabet of characters and introduce another alphabet of five special characters, called shapes. A shape is one of the special characters taken from the set {IN, SW, NW, SE, NE}. Shape IN encodes the 1×1 matrix generated from the empty matrix by creating a square.


Benchmarking of alignment-free sequence comparison methods

Genome Biology ◽

10.1186/s13059-019-1755-7 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 39

Author(s):

Andrzej Zielezinski ◽

Hani Z. Girgis ◽

Guillaume Bernard ◽

Chris-Andre Leimeister ◽

Kujin Tang ◽

...

Keyword(s):

Sequence Comparison ◽

Alignment Free ◽

Comparison Methods


An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction

BMC Bioinformatics ◽

10.1186/s12859-020-03738-5 ◽

2020 ◽

Vol 21 (S6) ◽

Author(s):

Sriram P. Chockalingam ◽

Jodh Pannu ◽

Sahar Hooshmand ◽

Sharma V. Thankachan ◽

Srinivas Aluru

Keyword(s):

Phylogenetic Trees ◽

Linear Time ◽

Sequence Similarity ◽

Similarity Measures ◽

Phylogeny Reconstruction ◽

Greedy Heuristics ◽

Biological Sequences ◽

Sequence Comparisons ◽

Multiple Sequence ◽

Alignment Free

Background: Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACS_k, have been shown to produce results as effective as multiple-sequence alignment based methods for reconstruction of phylogeny trees. Since computing ACS_k takes O(n log^k n) time and is hence impractical for large datasets, multiple heuristics that can approximate ACS_k have been introduced. Results: In this paper, we present a novel linear-time heuristic to approximate ACS_k, which is faster than computing the exact ACS_k while being closer to the exact ACS_k values compared to previously published linear-time greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy and runtime, and demonstrate its applicability for phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction. Conclusions: Our method produces a better approximation for ACS_k and is applicable for the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in the Rust programming language and the source code is available at https://github.com/srirampc/adyar-rs.
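For orientation, the exact average common substring measure (the k = 0 case, with no mismatches allowed) can be written down in a few lines of Python. This sketch is deliberately naive and is not the linear-time heuristic of the paper, nor a port of the adyar-rs code; the function names are hypothetical, and the substring search makes it far slower than the suffix-tree based methods the abstract refers to.

```python
def longest_match_lengths(a, b):
    """For each start i in a, length of the longest prefix of a[i:] occurring in b."""
    lengths = []
    for i in range(len(a)):
        k = 0
        # Grow the match one character at a time while it still occurs in b.
        while i + k < len(a) and a[i:i + k + 1] in b:
            k += 1
        lengths.append(k)
    return lengths


def acs(a, b):
    """Average common substring length of a with respect to b (exact, k = 0)."""
    if not a:
        return 0.0
    lengths = longest_match_lengths(a, b)
    return sum(lengths) / len(lengths)


if __name__ == "__main__":
    # Toy inputs, invented for the example.
    print(acs("ACGTACGA", "TTACGTAC"))
```

A larger value means the two sequences share longer substrings on average. The k-mismatch variant ACS_k relaxes the exact-match test to allow up to k mismatching positions, which is what makes its exact computation costly.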
