GSAlign – an efficient sequence alignment tool for intra-species genomes

10.1101/782193 ◽

2019 ◽

Author(s):

Hsin-Nan Lin ◽

Wen-Lian Hsu

Keyword(s):

Sequence Alignment ◽

State Of The Art ◽

Genome Comparison ◽

Sequence Variants ◽

Sequence Alignments ◽

Large Genome ◽

Alignment Tool ◽

Sequence Variations ◽

Efficient Sequence ◽

Alignment Result

Abstract: Personal genomics and comparative genomics are becoming more important in clinical practice and genome research. Both fields require sequence alignment to discover sequence conservation and variation. Though many methods have been developed, some are designed for small genome comparison while others are not efficient for large genome comparison. Moreover, most existing genome comparison tools have not been systematically evaluated for the correctness of their sequence alignments. A wrong sequence alignment would produce false sequence variants. In this study, we present GSAlign, which handles large genome sequence alignment efficiently and identifies sequence variants from the alignment result. GSAlign is an efficient sequence alignment tool for intra-species genomes. It identifies sequence variations from the sequence alignments. We estimate performance by measuring the correctness of predicted sequence variations. The experimental results demonstrate that GSAlign is not only faster than most existing state-of-the-art methods, but also identifies sequence variants with high accuracy.
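The idea of reading sequence variants off an alignment can be illustrated with a minimal sketch. It assumes the alignment is already given as two equal-length gapped strings with `-` as the gap character; the function name `call_variants` and the tuple layout are hypothetical and are not taken from GSAlign.

```python
def call_variants(ref_aln, qry_aln):
    """List variants implied by a gapped pairwise alignment.

    Returns (ref_pos, kind, ref_allele, alt_allele) tuples, where ref_pos is
    1-based. For insertions, ref_pos is the reference base just before the
    inserted bases (0 if the insertion precedes the first reference base).
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned strings must have equal length")
    variants = []
    ref_pos = 0  # number of reference bases consumed so far
    i, n = 0, len(ref_aln)
    while i < n:
        r, q = ref_aln[i], qry_aln[i]
        if r == "-" and q == "-":
            i += 1  # column with gaps on both sides carries no information
        elif r == "-":
            # Insertion: collect the whole run of reference gaps.
            j = i
            while j < n and ref_aln[j] == "-":
                j += 1
            variants.append((ref_pos, "INS", "", qry_aln[i:j].replace("-", "")))
            i = j
        elif q == "-":
            # Deletion: collect the whole run of query gaps.
            j = i
            while j < n and qry_aln[j] == "-" and ref_aln[j] != "-":
                j += 1
            variants.append((ref_pos + 1, "DEL", ref_aln[i:j], ""))
            ref_pos += j - i
            i = j
        else:
            ref_pos += 1
            if r != q:
                variants.append((ref_pos, "SNV", r, q))
            i += 1
    return variants


if __name__ == "__main__":
    for v in call_variants("ACGT-ACGT", "ACCTTAC-T"):
        print(v)
```

Because every variant is derived column by column from the alignment, a misplaced gap or mismatch in the alignment translates directly into a false variant, which is the concern the abstract raises.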

Refining pairwise sequence alignments of membrane proteins by the incorporation of anchors

10.1101/2020.09.16.299453 ◽

2020 ◽

Author(s):

René Staritzbichler ◽

Edoardo Sarti ◽

Emily Yaklich ◽

Antoniya Aleksandrova ◽

Marcus Stamm ◽

...

Keyword(s):

Membrane Proteins ◽

Sequence Alignment ◽

Ad Hoc ◽

Pairwise Alignment ◽

Low Complexity ◽

Pairwise Sequence Alignment ◽

Sequence Alignments ◽

Alignment Procedure ◽

Alignment Tool ◽

Optimum Alignment

Abstract: The alignment of primary sequences is a fundamental step in the analysis of protein structure, function, and evolution. Integral membrane proteins pose a significant challenge for such sequence alignment approaches, because their evolutionary relationships can be very remote, and because a high content of hydrophobic amino acids reduces their complexity. Frequently, biochemical or biophysical data is available that informs the optimum alignment, for example, indicating specific positions that share common functional or structural roles. Currently, if those positions are not correctly aligned by a standard pairwise alignment procedure, the incorporation of such information into the alignment is typically addressed in an ad hoc manner, with manual adjustments. However, such modifications are problematic because they reduce the robustness and reproducibility of the alignment. An alternative approach is the use of restraints, or anchors, to incorporate such position-matching explicitly during alignment. Here we introduce position anchoring in the alignment tool AlignMe as an aid to pairwise sequence alignment of membrane proteins. Applying this approach to realistic scenarios involving distantly-related and low complexity sequences, we illustrate how the addition of even a single anchor can dramatically improve the accuracy of the alignments, while maintaining the reproducibility and rigor of the overall alignment.
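A rough sketch of how a single anchor can constrain a pairwise global alignment follows. It is not AlignMe's implementation: it assumes a plain Needleman-Wunsch scheme with illustrative match, mismatch and linear gap scores, and it enforces the anchor as a hard constraint by aligning the segments on either side of the anchored pair independently. The function names and score values are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty; returns two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def anchored_align(a, b, anchor):
    """Force a[i] and b[j] into the same column, where anchor = (i, j), 0-based."""
    i, j = anchor
    left_a, left_b = needleman_wunsch(a[:i], b[:j])
    right_a, right_b = needleman_wunsch(a[i + 1:], b[j + 1:])
    return left_a + a[i] + right_a, left_b + b[j] + right_b


if __name__ == "__main__":
    x, y = anchored_align("GATTACA", "GCATGCA", (3, 2))
    print(x)
    print(y)
```

Because the anchor is an explicit input rather than a manual edit of the output, rerunning the function with the same sequences and anchor reproduces the same alignment, which is the reproducibility argument made in the abstract.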

Fast and SNP-aware short read alignment with SALT

BMC Bioinformatics ◽

10.1186/s12859-021-04088-6 ◽

2021 ◽

Vol 22 (S9) ◽

Author(s):

Wei Quan ◽

Bo Liu ◽

Yadong Wang

Keyword(s):

Sequence Alignment ◽

Genetic Variants ◽

High Throughput Sequencing ◽

Reference Genome ◽

Graph Model ◽

Sequence Alignments ◽

Short Read ◽

Read Alignment ◽

Short Read Alignment ◽

Alignment Tool

Abstract

Background: DNA sequence alignment is a common first step in most applications of high-throughput sequencing technologies. The accuracy of sequence alignments directly affects the accuracy of downstream analyses, such as variant calling and quantitative analysis of transcriptome; therefore, rapidly and accurately mapping reads to a reference genome is a significant topic in bioinformatics. Conventional DNA read aligners map reads to a linear reference genome (such as the GRCh38 primary assembly). However, such a linear reference genome represents the genome of only one or a few individuals and thus lacks information on variations in the population. This limitation can introduce bias and impact the sensitivity and accuracy of mapping. Recently, a number of aligners have begun to map reads to populations of genomes, which can be represented by a reference genome and a large number of genetic variants. However, compared to linear reference aligners, an aligner that can store and index all genetic variants has a high cost in memory (RAM) space and leads to extremely long run time. Aligning reads to a graph-model-based index that includes all types of variants is ultimately an NP-hard problem in theory. By contrast, considering only single nucleotide polymorphism (SNP) information will reduce the complexity of the index and improve the speed of sequence alignment.

Results: The SNP-aware alignment tool (SALT) is a fast, memory-efficient, and SNP-aware short read alignment tool. SALT uses 5.8 GB of RAM to index a human reference genome (GRCh38) and incorporates 12.8M UCSC common SNPs. Compared with a state-of-the-art aligner, SALT has a similar speed but higher accuracy.

Conclusions: Herein, we present an SNP-aware alignment tool (SALT) that aligns reads to a reference genome that incorporates an SNP database. We benchmarked SALT using simulated and real datasets. The results demonstrate that SALT can efficiently map reads to the reference genome with significantly improved accuracy. Incorporating SNP information can improve the accuracy of read alignment and can reveal novel variants. The source code is freely available at https://github.com/weiquan/SALT.
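To make the notion of SNP-aware mapping concrete, here is a toy sketch. It does not reflect SALT's index or algorithm: it assumes a brute-force scan over a short reference string and a hypothetical `snps` dictionary that maps 0-based reference positions to sets of known alternate alleles. A read base that matches a known alternate allele is not counted as a mismatch.

```python
def snp_aware_mismatches(read, ref, offset, snps):
    """Count mismatches of read placed at ref[offset:], ignoring known SNP alleles."""
    mismatches = 0
    for k, base in enumerate(read):
        pos = offset + k
        if base == ref[pos]:
            continue  # matches the reference allele
        if base in snps.get(pos, ()):
            continue  # matches a known alternate allele at this position
        mismatches += 1
    return mismatches


def best_hit(read, ref, snps):
    """Return (offset, mismatches) of the first placement with the fewest mismatches."""
    best = None
    for offset in range(len(ref) - len(read) + 1):
        mm = snp_aware_mismatches(read, ref, offset, snps)
        if best is None or mm < best[1]:
            best = (offset, mm)
    return best


if __name__ == "__main__":
    ref = "ACGTACGTAC"
    read = "ATGT"
    print(best_hit(read, ref, {}))          # linear reference only
    print(best_hit(read, ref, {5: {"T"}}))  # with a known C>T SNP at position 5
```

In this toy case the read is ambiguous against the linear reference (one mismatch at two placements), while the known SNP makes one placement an exact hit. That is the kind of bias reduction the abstract attributes to incorporating SNP information.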

Refining pairwise sequence alignments of membrane proteins by the incorporation of anchors

PLoS ONE ◽

10.1371/journal.pone.0239881 ◽

2021 ◽

Vol 16 (4) ◽

pp. e0239881

Author(s):

René Staritzbichler ◽

Edoardo Sarti ◽

Emily Yaklich ◽

Antoniya Aleksandrova ◽

Marcus Stamm ◽

...

Keyword(s):

Membrane Proteins ◽

Sequence Alignment ◽

Ad Hoc ◽

Low Complexity ◽

Pairwise Sequence Alignment ◽

Sequence Alignments ◽

Alignment Procedure ◽

Alignment Tool ◽

Hydrophobic Amino Acids ◽

Optimum Alignment

The alignment of primary sequences is a fundamental step in the analysis of protein structure, function, and evolution, and in the generation of homology-based models. Integral membrane proteins pose a significant challenge for such sequence alignment approaches, because their evolutionary relationships can be very remote, and because a high content of hydrophobic amino acids reduces their complexity. Frequently, biochemical or biophysical data is available that informs the optimum alignment, for example, indicating specific positions that share common functional or structural roles. Currently, if those positions are not correctly matched by a standard pairwise sequence alignment procedure, the incorporation of such information into the alignment is typically addressed in an ad hoc manner, with manual adjustments. However, such modifications are problematic because they reduce the robustness and reproducibility of the aligned regions on either side of the newly matched positions. Previous studies have introduced restraints as a means to impose the matching of positions during sequence alignments, originally in the context of genome assembly. Here we introduce position restraints, or “anchors”, as a feature in our alignment tool AlignMe, providing an aid to pairwise global sequence alignment of alpha-helical membrane proteins. Applying this approach to realistic scenarios involving distantly-related and low complexity sequences, we illustrate how the addition of anchors can be used to modify alignments, while still maintaining the reproducibility and rigor of the rest of the alignment. Anchored alignments can be generated using the online version of AlignMe available at www.bioinfo.mpg.de/AlignMe/.

GSAlign: an efficient sequence alignment tool for intra-species genomes

BMC Genomics ◽

10.1186/s12864-020-6569-1 ◽

2020 ◽

Vol 21 (1) ◽

Cited By ~ 2

Author(s):

Hsin-Nan Lin ◽

Wen-Lian Hsu

Keyword(s):

Sequence Alignment ◽

Alignment Tool ◽

Efficient Sequence

Exact Multiple Sequence Alignment by Synchronized Decision Diagrams

INFORMS Journal on Computing ◽

10.1287/ijoc.2019.0937 ◽

2020 ◽

Author(s):

Amin Hosseininasab ◽

Willem-Jan van Hoeve

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

State Of The Art ◽

Mixed Integer ◽

Mixed Integer Program ◽

Second Phase ◽

Polynomial Space ◽

Sequence Alignments ◽

Multiple Sequence ◽

First Time

This paper develops an exact solution algorithm for the multiple sequence alignment (MSA) problem. In the first step, we design a dynamic programming model and use it to construct a novel multivalued decision diagram (MDD) representation of all pairwise sequence alignments (PSA). PSA MDDs are then synchronized using side constraints to model the MSA problem as a mixed-integer program (MIP), for the first time, in polynomial space complexity. Two bound-based filtering procedures are developed to reduce the size of the MDDs, and the resulting MIP is solved using logic-based Benders decomposition. For a more effective algorithm, we develop a two-phase solution approach. In the first phase, we use optimistic filtering to quickly obtain a near-optimal bound, which we then use for exact filtering in the second phase to prove or obtain an optimal solution. Numerical results on benchmark instances show that our algorithm solves several instances to optimality for the first time, and, in case optimality cannot be proven, considerably improves upon a state-of-the-art heuristic MSA solver. Comparison with an existing state-of-the-art exact MSA algorithm shows that our approach is more time efficient and yields significantly smaller optimality gaps.

Molecular homology and multiple-sequence alignment: an analysis of concepts and practice

Australian Systematic Botany ◽

10.1071/sb15001 ◽

2015 ◽

Vol 28 (1) ◽

pp. 46 ◽

Cited By ~ 20

Author(s):

David A. Morrison ◽

Matthew J. Morgan ◽

Scot A. Kelchner

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Molecular Data ◽

Simple Relationship ◽

Sequence Alignments ◽

Multiple Sequence ◽

Molecular Change ◽

Nucleotide Homology ◽

Tree Building ◽

Molecular Homology

Sequence alignment is just as much a part of phylogenetics as is tree building, although it is often viewed solely as a necessary tool to construct trees. However, alignment for the purpose of phylogenetic inference is primarily about homology, as it is the procedure that expresses homology relationships among the characters, rather than the historical relationships of the taxa. Molecular homology is rather vaguely defined and understood, despite its importance in the molecular age. Indeed, homology has rarely been evaluated with respect to nucleotide sequence alignments, in spite of the fact that nucleotides are the only data that directly represent genotype. All other molecular data represent phenotype, just as do morphology and anatomy. Thus, efforts to improve sequence alignment for phylogenetic purposes should involve a more refined use of the homology concept at a molecular level. To this end, we present examples of molecular-data levels at which homology might be considered, and arrange them in a hierarchy. The concept that we propose has many levels, which link directly to the developmental and morphological components of homology. Of note, there is no simple relationship between gene homology and nucleotide homology. We also propose terminology with which to better describe and discuss molecular homology at these levels. Our over-arching conceptual framework is then used to shed light on the multitude of automated procedures that have been created for multiple-sequence alignment. Sequence alignment needs to be based on aligning homologous nucleotides, without necessary reference to homology at any other level of the hierarchy. In particular, inference of nucleotide homology involves deriving a plausible scenario for molecular change among the set of sequences. Our clarifications should allow the development of a procedure that specifically addresses homology, which is required when performing alignment for phylogenetic purposes, but which does not yet exist.

The BioCyc collection of microbial genomes and metabolic pathways

Briefings in Bioinformatics ◽

10.1093/bib/bbx085 ◽

2017 ◽

Vol 20 (4) ◽

pp. 1085-1093 ◽

Cited By ~ 107

Author(s):

Peter D Karp ◽

Richard Billington ◽

Ron Caspi ◽

Carol A Fulcher ◽

Mario Latendresse ◽

...

Keyword(s):

Sequence Alignment ◽

Metabolic Pathways ◽

Biomedical Literature ◽

Analysis Software ◽

New Developments ◽

Microbial Genomes ◽

Additional Information ◽

Alignment Tool ◽

Extensive Range ◽

Types Of Information

Abstract: BioCyc.org is a microbial genome Web portal that combines thousands of genomes with additional information inferred by computer programs, imported from other databases and curated from the biomedical literature by biologist curators. BioCyc also provides an extensive range of query tools, visualization services and analysis software. Recent advances in BioCyc include an expansion in the content of BioCyc in terms of both the number of genomes and the types of information available for each genome; an expansion in the amount of curated content within BioCyc; and new developments in the BioCyc software tools including redesigned gene/protein pages and metabolite pages; new search tools; a new sequence-alignment tool; a new tool for visualizing groups of related metabolic pathways; and a facility called SmartTables, which enables biologists to perform analyses that previously would have required a programmer’s assistance.

Phosphodiesterase sequence variants may predispose to prostate cancer

Endocrine Related Cancer ◽

10.1530/erc-15-0134 ◽

2015 ◽

Vol 22 (4) ◽

pp. 519-530 ◽

Cited By ~ 8

Author(s):

Rodrigo B de Alexandre ◽

Anelia D Horvath ◽

Eva Szarek ◽

Allison D Manning ◽

Leticia F Leal ◽

...

Keyword(s):

Prostate Cancer ◽

Immunohistochemical Analysis ◽

Cyclic Guanosine Monophosphate ◽

Genome Project ◽

Guanosine Monophosphate ◽

Sequence Variants ◽

Sequencing Data ◽

Sequence Variations ◽

Novel Variants ◽

Somatic State

We hypothesized that mutations that inactivate phosphodiesterase (PDE) activity and lead to increased cAMP and cyclic guanosine monophosphate levels may be associated with prostate cancer (PCa). We sequenced the entire PDE coding sequences in the DNA of 16 biopsy samples from PCa patients. Novel mutations were confirmed in the somatic or germline state by Sanger sequencing. Data were then compared to the 1000 Genomes Project. PDE, CREB and pCREB protein expression was also studied in all samples, in both normal and abnormal tissue, by immunofluorescence. We identified three previously described PDE sequence variants that were significantly more frequent in PCa. Four novel sequence variations, one each in the PDE4B, PDE6C, PDE7B and PDE10A genes, respectively, were also found in the PCa samples. Interestingly, PDE10A and PDE4B novel variants that were present in 19 and 6% of the patients were found in the tumor tissue only. In patients carrying PDE defects, there was pCREB accumulation (P < 0.001), and an increase of the pCREB:CREB ratio (patients 0.97±0.03; controls 0.52±0.03; P-value < 0.001) by immunohistochemical analysis. We conclude that PDE sequence variants may play a role in the predisposition and/or progression to PCa at the germline and/or somatic state respectively.

Benchmarking Statistical Multiple Sequence Alignment

10.1101/304659 ◽

2018 ◽

Cited By ~ 1

Author(s):

Michael Nute ◽

Ehsan Saleh ◽

Tandy Warnow

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structural Alignment ◽

Estimation Method ◽

Simulated Data ◽

Protein Sequences ◽

Data Sets ◽

Sequence Alignments ◽

Multiple Sequence ◽

Simulated Data Sets

Abstract: The estimation of multiple sequence alignments of protein sequences is a basic step in many bioinformatics pipelines, including protein structure prediction, protein family identification, and phylogeny estimation. Statistical co-estimation of alignments and trees under stochastic models of sequence evolution has long been considered the most rigorous technique for estimating alignments and trees, but little is known about the accuracy of such methods on biological benchmarks. We report the results of an extensive study evaluating the most popular protein alignment methods as well as the statistical co-estimation method BAli-Phy on 1192 protein data sets from established benchmarks as well as on 120 simulated data sets. Our study (which used more than 230 CPU years for the BAli-Phy analyses alone) shows that BAli-Phy is dramatically more accurate than the other alignment methods on the simulated data sets, but is among the least accurate on the biological benchmarks. There are several potential causes for this discordance, including model misspecification, errors in the reference alignments, and conflicts between structural alignment and evolutionary alignments; future research is needed to understand the most likely explanation for our observations. Keywords: multiple sequence alignment, BAli-Phy, protein sequences, structural alignment, homology

VirusDIP: Virus Data Integration Platform

10.1101/2020.06.08.139451 ◽

2020 ◽

Cited By ~ 1

Author(s):

Lina Wang ◽

Fengzhen Chen ◽

Xueqin Guo ◽

Lijin You ◽

Xiaoxia Yang ◽

...

Keyword(s):

Sequence Alignment ◽

Sequence Data ◽

Data Retrieval ◽

Viral Sequence ◽

Origin And Evolution ◽

Alignment Tool ◽

Public Data ◽

Virus Research ◽

Global Initiative ◽

Tree Building

Abstract: Motivation: The Coronavirus Disease 2019 (COVID-19) pandemic poses a huge threat to human public health. Viral sequence data plays an important role in the scientific prevention and control of epidemics. A comprehensive virus database will be vitally useful for virus data retrieval and deep analysis. To promote sharing of virus data, several virus databases and related analysis tools have been created. Results: To facilitate virus research and promote the global sharing of virus data, we present here VirusDIP, a one-stop service platform for the archiving, integration, access, and analysis of virus data. It accepts the submission of viral sequence data from all over the world and currently integrates data resources from the National GeneBank Database (CNGBdb), Global initiative on sharing all influenza data (GISAID), and National Center for Biotechnology Information (NCBI). Moreover, based on the comprehensive data resources, the BLAST sequence alignment tool and multi-party security computing tools are deployed for multi-sequence alignment, phylogenetic tree building and global trusted sharing. VirusDIP is gradually establishing cooperation with more databases, and paving the way for the analysis of virus origin and evolution. All public data in VirusDIP are freely available for all researchers worldwide. Availability: https://db.cngb.org/virus/
