gapped alignment
Recently Published Documents



2019 ◽  
Vol 17 (02) ◽  
pp. 1950008 ◽  
Author(s):  
Sanjeev Kumar ◽  
Suneeta Agarwal ◽  
Ranvijay

Next-generation sequencing machines such as Illumina and Solexa can generate millions of short reads from a given genome sequence in a single run. Aligning these reads to a reference genome is a core step in next-generation sequencing data analysis, for example in genetic variation studies and genome re-sequencing. There is therefore a need for a new approach that is efficient in both memory and time when aligning this enormous number of reads to the reference genome. Existing techniques such as MAQ, Bowtie, BWA, BWBBLE, Subread, Kart, and Minimap2 require huge memory for whole-reference-genome indexing and read alignment. The gapped alignment versions of these techniques are also 20–40% slower than their respective normal versions. This paper proposes WIT, an efficient approach to reference genome indexing and read alignment that uses the Burrows–Wheeler Transform (BWT) and the Wavelet Tree (WT). It supports both exact and approximate alignment. Experiments show that WIT performs best in the case of protein sequence indexing. For indexing the reference genome, WIT requires 0.6N space (N is the size of the reference genome), whereas the existing techniques BWA, Subread, Kart, and Minimap2 require between 1.25N and 5N. Experiments also show that, even with such a small index, the alignment time of WIT is comparable to that of BWA, Subread, Kart, and Minimap2. Two other alignment parameters, accuracy and confidentiality, are also shown experimentally to be better than those of Minimap2. The source code of WIT is available at http://www.algorithm-skg.com/wit/home.html.
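As an illustration of the kind of BWT-based indexing the abstract refers to, here is a minimal sketch of the Burrows–Wheeler Transform with backward search for exact matching. It is not the WIT implementation: the function names are hypothetical, the transform is built by naive rotation sorting, and the rank query is a plain substring count where a wavelet tree would normally be used.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler Transform via sorted rotations (only suitable for short strings)."""
    s = text + "$"  # '$' is an assumed sentinel that sorts before every other symbol
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)


def count_occurrences(bwt_text: str, pattern: str) -> int:
    """Count exact occurrences of pattern in the original text by backward search."""
    # first_row[c] = number of symbols in the text that are strictly smaller than c
    counts = {}
    for ch in bwt_text:
        counts[ch] = counts.get(ch, 0) + 1
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    def rank(ch: str, i: int) -> int:
        # Occurrences of ch in bwt_text[:i]; a wavelet tree would answer this
        # without scanning, which is the role it plays in a compressed index.
        return bwt_text[:i].count(ch)

    lo, hi = 0, len(bwt_text)  # half-open range of matching sorted rotations
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + rank(ch, lo)
        hi = first_row[ch] + rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

Calling `count_occurrences(bwt("banana"), "ana")` walks the pattern from its last symbol to its first, narrowing the range of sorted rotations at each step; approximate matching would need additional branching that this sketch does not attempt.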


2019 ◽  
Author(s):  
Al Erives

Abstract: Maximal homology alignment is a new biologically-relevant approach to DNA sequence alignment that maps the internal dispersed microhomology of individual sequences onto two dimensions. It departs from the current method of gapped alignment, which uses a simplified binary state model of nucleotide position. In gapped alignment, nucleotide positions have either no relationship (1-to-None) or else orthological relationship (1-to-1) with nucleotides in other sequences. Maximal homology alignment, however, allows additional states such as 1-to-Many and Many-to-Many, thus modeling both orthological and paralogical relationships, which together comprise the main homology types. Maximal homology alignment collects dispersed microparalogy into the same alignment columns on multiple rows, and thereby generates a two-dimensional representation of a single sequence. Sequence alignment then proceeds as the alignment of two-dimensional topological objects. The operations of producing and aligning two-dimensional auto-alignments motivate a need for tests of two-dimensional homological integrity. Here, I work out and implement basic principles for computationally testing the two dimensions of positional homology, which are inherent to biological sequences due to replication slippage and related errors. I then show that maximal homology alignment is more informative than gapped alignment in modeling the evolution of genetic sequences. In general, MHA is more suited when small insertions and deletions predominantly originate as local microparalogy. These results show that both conserved and non-conserved genomic sequences are enriched with a signature of replication slippage relative to their random permutations.


2018 ◽  
Author(s):  
Albert J Erives

Abstract: In attempting to align divergent homologs of a conserved developmental enhancer, a flaw in the homology concept embedded in gapped alignment (GA) was discovered. To correct this flaw, we developed a methodological approach called maximal homology alignment (MHA). The goal of MHA is to rescue internal microparalogy of biological sequences rather than to insert a pattern of gaps (null characters), which transform homologous sequences into strings of uniform size (1-dimensional lengths). The core operation in MHA is the “cinch”, whereby inferred tandem microparalogy is represented in multiple rows across the same span of alignment columns. Thus, MHAs have a second (vertical) paralogy dimension, which re-categorizes most indel mutations as replication slippage and attenuates the indel problem. Furthermore, internally-cinched, inferred microparalogy in a self-MHA can later be relaxed to restore uniformity to 2-dimensional widths in a multiple sequence alignment. This de-cinching operation is used as a first resort before artificial null characters are used. We implement MHA in a program called maximal, which is composed of a series of modules for cinching and cyclelizing divergent tandem repeats. In conclusion, we find that the MHA approach is of higher utility than GA in non-protein-coding regulatory sequences, which are unconstrained by codon-based reading frames and are enriched in dense microparalogical content.


2017 ◽  
Author(s):  
Krešimir Križanović ◽  
Amina Echchiki ◽  
Julien Roux ◽  
Mile Šikić

Abstract. Motivation: High-throughput sequencing has transformed the study of gene expression levels through RNA-seq, a technique that is now routinely used by various fields, such as genetic research or diagnostics. The advent of third generation sequencing technologies providing significantly longer reads opens up new possibilities. However, the high error rates common to these technologies set new bioinformatics challenges for the gapped alignment of reads to their genomic origin. In this study, we have explored how currently available RNA-seq splice-aware alignment tools cope with increased read lengths and error rates. All tested tools were initially developed for short NGS reads, but some have claimed support for long PacBio or even ONT MinION reads. Results: The tools were tested on synthetic and real datasets from the PacBio and ONT MinION technologies, and both alignment quality and resource usage were compared across tools. The effect of error correction of long reads was explored, both using self-correction and correction with an external short reads dataset. A tool was developed for evaluating RNA-seq alignment results. This tool can be used to compare the alignment of simulated reads to their genomic origin, or to compare the alignment of real reads to a set of annotated transcripts. Our tests show that while some RNA-seq aligners were unable to cope with long error-prone reads, others produced overall good results. We further show that alignment accuracy can be improved using error-corrected reads. Availability: https://github.com/kkrizanovic/[email protected]


2017 ◽  
Author(s):  
Kavya Vaddadi ◽  
Naveen Sivadasan ◽  
Kshitij Tayal ◽  
Rajgopal Srinivasan

Abstract: Genomic variations in a reference collection are naturally represented as genome variation graphs. Such graphs encode common subsequences as vertices, and the variations are captured using additional vertices and directed edges. The resulting graphs are directed graphs, possibly with cycles. Existing algorithms for aligning sequences on such graphs make use of partial order alignment (POA) techniques that work on directed acyclic graphs (DAG). For this, acyclic extensions of the input graphs are first constructed through expensive loop unrolling steps (DAGification). Such graph extensions can also blow up considerably in size, and in the worst case the blow-up factor is proportional to the input sequence length. We provide a novel alignment algorithm, V-ALIGN, that aligns the input sequence directly on the input graph while avoiding such expensive DAGification steps. V-ALIGN is based on a novel dynamic programming formulation that allows gapped alignment directly on the input graph. It supports affine and linear gaps. We also propose refinements to V-ALIGN for better performance in practice. With these refinements, the time to fill the DP table depends linearly on the sizes of the sequence, the graph, and its feedback vertex set. We perform experiments to compare against POA-based alignment. For aligning short sequences, standard approaches restrict the expensive gapped alignment to small filtered subgraphs having high ‘similarity’ to the input sequence. In such cases, the performance of V-ALIGN for gapped alignment on the filtered subgraph depends on the subgraph sizes.


2015 ◽  
Vol 7 (1) ◽  
Author(s):  
Rendong Yang ◽  
Andrew C. Nelson ◽  
Christine Henzler ◽  
Bharat Thyagarajan ◽  
Kevin A. T. Silverstein

Author(s):  
M. Vidyasagar

This chapter considers some applications of Markov processes and hidden Markov processes to computational biology. It introduces three important problems, namely: sequence alignment, the gene-finding problem, and protein classification. After providing an overview of some relevant aspects of biology, the chapter examines the problem of optimal gapped alignment between two sequences. This is a way to detect similarity between two sequences over a common alphabet, such as the four-symbol alphabet of nucleotides, or the 20-symbol alphabet of amino acids. The chapter proceeds by discussing some widely used algorithms for finding genes from DNA sequences (genomes), including the GLIMMER algorithm and the GENSCAN algorithm. Finally, it describes a special type of hidden Markov model termed profile hidden Markov model, which is commonly used to classify proteins into a small number of groups.
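To make the notion of optimal gapped alignment between two sequences concrete, the following is a minimal sketch of the standard global-alignment dynamic program with a linear gap penalty. It is not taken from the chapter: the function name and the scoring values (match, mismatch, gap) are assumptions chosen for illustration, and only the optimal score is returned, not the alignment itself.

```python
def gapped_alignment_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b under a linear gap penalty.

    The scoring values are illustrative assumptions, not values from the chapter.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a[:i] aligned entirely against gaps
    for j in range(1, cols):
        score[0][j] = j * gap  # b[:j] aligned entirely against gaps
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap inserted in b
                score[i][j - 1] + gap,               # gap inserted in a
            )
    return score[-1][-1]
```

The same recurrence works over any common alphabet, whether the four nucleotides or the 20 amino acids, since it only compares symbols for equality; a substitution matrix could replace the match and mismatch constants.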


2004 ◽  
Vol 41 (4) ◽  
pp. 975-983 ◽  
Author(s):  
John L. Spouge

In bioinformatics, the notion of an ‘island’ enhances the efficient simulation of gapped local alignment statistics. This paper generalizes several results relevant to gapless local alignment statistics from one to higher dimensions, with a particular eye to applications in gapped alignment statistics. For example, reversal of paths (rather than of discrete time) generalizes a distributional equality, from queueing theory, between the Lindley (local sum) and maximum processes. Systematic investigation of an ‘ownership’ relationship among vertices in ℤ² formalizes the notion of an island as a set of vertices having a common owner. Predictably, islands possess some stochastic ordering and spatial averaging properties. Moreover, however, the average number of vertices in a subcritical stationary island is 1, generalizing a theorem of Kac about stationary point processes. The generalization leads to alternative ways of simulating some island statistics.
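The one-dimensional relationship between the Lindley and maximum processes mentioned in the abstract can be checked numerically. The sketch below covers only the classical time-reversal case, not the paper's path-reversal generalization, and the function names are hypothetical: for any fixed list of increments, running the Lindley recursion on the reversed list gives the same final value as the maximum prefix sum of the original list.

```python
def lindley_final(steps):
    """Final value of the Lindley (local sum) recursion W_n = max(0, W_{n-1} + X_n)."""
    w = 0
    for x in steps:
        w = max(0, w + x)
    return w


def max_prefix_sum(steps):
    """Maximum of the partial sums S_0 = 0, S_1, ..., S_n of the increments."""
    best = total = 0
    for x in steps:
        total += x
        best = max(best, total)
    return best


def reversal_identity_holds(steps):
    """Check that Lindley on the time-reversed increments equals the maximum process."""
    return lindley_final(list(reversed(steps))) == max_prefix_sum(steps)
```

Because independent, identically distributed increments have the same joint distribution when reversed, this pathwise identity is what yields the distributional equality in the one-dimensional setting.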

