Sequence Comparison Alignment-Free Approach Based on Suffix Tree and L-Words Frequency

2012 ◽  
Vol 2012 ◽  
pp. 1-4 ◽  
Author(s):  
Inês Soares ◽  
Ana Goios ◽  
António Amorim

The vast majority of methods available for sequence comparison rely on an initial sequence alignment step, which requires a number of assumptions about evolutionary history and is sometimes very difficult or impossible to perform because of the abundance of gaps (insertions/deletions). In such cases, an alternative alignment-free method would prove valuable. Our method starts by computing a generalized suffix tree of all sequences, which is completed in linear time. Using this tree, the frequency of all possible words with a preset length L (L-words) in each sequence is rapidly calculated. Based on the L-words frequency profile of each sequence, a pairwise standard Euclidean distance is then computed, producing a symmetric genetic distance matrix, which can be used to generate a neighbor joining dendrogram or a multidimensional scaling graph. We present an improvement to word counting alignment-free approaches for sequence comparison, by determining a single optimal word length and combining suffix tree structures with the word counting tasks. Our approach is thus a fast and simple application that proved to be efficient and powerful when applied to mitochondrial genomes. The algorithm was implemented in Python and is freely available on the web.
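
The pipeline described in this abstract (L-word frequency profile per sequence, then pairwise Euclidean distance) can be illustrated with a minimal sketch. This is not the authors' implementation: it counts words with a plain sliding window instead of a generalized suffix tree, it assumes a DNA alphabet, and it assumes relative rather than absolute frequencies. The function names `lword_profile`, `euclidean` and `distance_matrix` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt


def lword_profile(seq, L, alphabet="ACGT"):
    """Relative frequency of every possible L-word over `alphabet` in `seq`.

    Assumption: words are counted with a sliding window, not a suffix tree.
    """
    windows = len(seq) - L + 1
    counts = Counter(seq[i:i + L] for i in range(max(windows, 0)))
    total = max(windows, 1)  # avoid division by zero for very short sequences
    # Fixed word order, so profiles of different sequences are comparable.
    return [counts["".join(w)] / total for w in product(alphabet, repeat=L)]


def euclidean(p, q):
    """Standard Euclidean distance between two equal-length profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distance_matrix(seqs, L, alphabet="ACGT"):
    """Symmetric matrix of pairwise distances between L-word profiles."""
    profiles = [lword_profile(s, L, alphabet) for s in seqs]
    return [[euclidean(p, q) for q in profiles] for p in profiles]
```

The resulting matrix could then be passed to any neighbor joining or multidimensional scaling routine. The choice of L is left to the caller here, whereas the abstract states that the method determines a single optimal word length.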

Author(s):  
Mujtahid Akon ◽  
Muntashir Akon ◽  
Mohimenul Kabir ◽  
M Saifur Rahman ◽  
M Sohel Rahman

Motivation: Researchers and practitioners use a number of popular sequence comparison tools, many of which rely on alignment-based techniques. Because of the high time and space complexity and the length-related restrictions of these techniques, researchers often seek alignment-free tools. Recently, two ideas, namely Minimal Absent Words (MAW) and Relative Absent Words (RAW), have received much interest in the scientific community as distance measures that offer alignment-free alternatives. This motivates us to build a framework for analysing biological sequences in an alignment-free manner. Results: In this application note, we present the Alignment-free Dissimilarity Analysis & Comparison Tool (ADACT), a simple web-based tool that compares sequences using a variety of indexes and presents the results as a distance matrix, a species relation list and a phylogenetic tree. The tool combines absent word (MAW or RAW) computation, dissimilarity measures and species relationships, and thus brings all required software onto one platform for the convenience of researchers and practitioners in bioinformatics. We have also developed a RESTful API. Availability and implementation: ADACT is hosted at http://research.buet.ac.bd/ADACT/. Supplementary information: Supplementary data are available at Bioinformatics online.
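
To make the notion of a minimal absent word concrete, here is a brute-force sketch. It is not ADACT's code or API. It uses the usual definition (a word that does not occur in the sequence although the word minus its first letter and the word minus its last letter both occur), restricts the search to words of at most `max_len` letters, and pairs it with the Jaccard distance between MAW sets as one illustrative dissimilarity index. The function names and the choice of index are assumptions.

```python
def factors(x, max_len):
    """All substrings of x with length <= max_len, plus the empty word."""
    subs = {""}
    for i in range(len(x)):
        for j in range(i + 1, min(len(x), i + max_len) + 1):
            subs.add(x[i:j])
    return subs


def minimal_absent_words(x, alphabet="ACGT", max_len=6):
    """Brute-force MAWs of x with length <= max_len (illustrative only)."""
    facs = factors(x, max_len)
    maws = set()
    for u in facs:  # u plays the role of the word minus its last letter
        if len(u) >= max_len:
            continue
        for b in alphabet:
            w = u + b
            # w is absent, while w[:-1] (= u) and w[1:] both occur in x.
            if w not in facs and w[1:] in facs:
                maws.add(w)
    return maws


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; an assumed index, not necessarily ADACT's."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)
```

For example, over the two-letter alphabet `"AC"` the sequence `"AAC"` has the minimal absent words `AAA`, `CA` and `CC`. Practical tools compute MAWs with suffix arrays or automata rather than by enumeration.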


2014 ◽  
Vol 358 ◽  
pp. 31-51 ◽  
Author(s):  
Upuli Gunasinghe ◽  
Damminda Alahakoon ◽  
Susan Bedingfield

1995 ◽  
Vol 2 (47) ◽  
Author(s):  
Dany Breslauer

This paper gives a linear-time algorithm for the construction of the suffix tree of a tree. The suffix tree of a tree is used to obtain an efficient algorithm for the minimization of sequential transducers.


2019 ◽  
Vol 79 (02) ◽  
Author(s):  
K. T. Ramya ◽  
A. Vishnuvardhan Reddy ◽  
M. Sujatha

The present study investigates genetic divergence among 84 fertility restorer (R) lines and 32 cytoplasmic male sterile (CMS) lines of sunflower augmented from USDA, USA, along with the popular Indian parental lines, using simple sequence repeats (SSR). Thirty-nine polymorphic SSR primers produced 139 alleles with an average of 3.56 alleles per locus. The polymorphic information content ranged from 0.23 to 0.69 with an average of 0.45. The average genetic distance was 0.45 and 0.42 for the R and CMS lines, respectively. A dendrogram based on the dissimilarity coefficient matrix grouped the CMS and R lines into separate clusters, except for Cluster A, which consisted of all CMS lines along with five R lines. A genetic distance matrix estimated from three sets of mitochondrial primers (BOX, ERIC and REP) grouped the 32 CMS lines into eight clusters. The results suggest the existence of considerable genetic diversity among the restorer and CMS lines of sunflower obtained from USDA, USA.


Author(s):  
R. Giancarlo ◽  
R. Grossi

We discuss the suffix tree generalization to matrices in this chapter. We extend the suffix tree notion (described in Chapter 3) from text strings to text matrices whose entries are taken from an ordered alphabet with the aim of solving pattern-matching problems. This suffix tree generalization can be efficiently used to implement low-level routines for Computer Vision, Data Compression, Geographic Information Systems and Visual Databases. We examine the submatrices in the form of the text’s contiguous parts that still have a matrix shape. Representing these text submatrices as “suitably formatted” strings stored in a compacted trie is the rationale behind suffix trees for matrices. The choice of the format inevitably influences suffix tree construction time and space complexity. We first deal with square matrices and show that many suffix tree families can be defined for the same input matrix according to the matrix’s string representations. We can store each suffix tree in linear space and give an efficient construction algorithm whose input is both the matrix and the string representation chosen. We then treat rectangular matrices and define their corresponding suffix trees by means of some general rules which we list formally. We show that there is a super-linear lower bound to the space required (in contrast with the linear space required by suffix trees for square matrices). We give a simple example of one of these suffix trees. The last part of the chapter illustrates some technical results regarding suffix trees for square matrices: we show how to achieve an expected linear-time suffix tree construction for a constant-size alphabet under some mild probabilistic assumptions about the input distribution. We begin by defining a wide class of string representations for square matrices. We let Σ denote an ordered alphabet of characters and introduce another alphabet of five special characters, called shapes. 
A shape is one of the special characters taken from set {IN,SW,NW,SE,NE}. Shape IN encodes the 1x1 matrix generated from the empty matrix by creating a square.


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Andrzej Zielezinski ◽  
Hani Z. Girgis ◽  
Guillaume Bernard ◽  
Chris-Andre Leimeister ◽  
Kujin Tang ◽  
...  

2020 ◽  
Vol 21 (S6) ◽  
Author(s):  
Sriram P. Chockalingam ◽  
Jodh Pannu ◽  
Sahar Hooshmand ◽  
Sharma V. Thankachan ◽  
Srinivas Aluru

Background: Alignment-free methods for sequence comparison have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACSk, have been shown to produce results as effective as multiple-sequence alignment based methods for the reconstruction of phylogeny trees. Since computing ACSk takes O(n log^k n) time and is hence impractical for large datasets, multiple heuristics that can approximate ACSk have been introduced. Results: In this paper, we present a novel linear-time heuristic to approximate ACSk, which is faster than computing the exact ACSk while being closer to the exact ACSk values than previously published linear-time greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy and runtime and demonstrate its applicability for phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction. Conclusions: Our method produces a better approximation of ACSk and is applicable to the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in the Rust programming language and the source code is available at https://github.com/srirampc/adyar-rs.
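
For orientation, the exact-match quantity underlying ACS (the k = 0 case) can be sketched naively: for each position of one sequence, take the length of the longest substring starting there that also occurs in the other sequence, then average these lengths. The sketch below is a quadratic-or-worse illustration, not the linear-time heuristic of the paper and not the adyar-rs code. It does not handle mismatches, and the function names are hypothetical.

```python
def longest_match_lengths(x, y):
    """For each i, length of the longest prefix of x[i:] that occurs in y."""
    lengths = []
    for i in range(len(x)):
        k = 0
        # Extend the match one character at a time while it still occurs in y.
        while i + k < len(x) and x[i:i + k + 1] in y:
            k += 1
        lengths.append(k)
    return lengths


def average_common_substring(x, y):
    """Mean of the longest-match lengths of x against y (exact matches only)."""
    if not x:
        raise ValueError("x must be a non-empty sequence")
    lengths = longest_match_lengths(x, y)
    return sum(lengths) / len(lengths)
```

With `x = "ACGT"` and `y = "CGTA"` the per-position lengths are 1, 3, 2 and 1, so the average is 1.75. Note that the measure is not symmetric in `x` and `y`.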

