Simple adjustment of the sequence weight algorithm remarkably enhances PSI-BLAST performance

10.1101/092742 ◽

2016 ◽

Author(s):

Toshiyuki Oda ◽

Kyungtaek Lim ◽

Kentaro Tomii

Keyword(s):

Sequence Similarity ◽

Substitution Matrix ◽

Accurate Estimation ◽

Search Performance ◽

Multiple Sequence ◽

Weight Calculation ◽

Minimum Block ◽

Narrow Width ◽

Block Width ◽

BLAST Performance

PSI-BLAST, an extremely popular tool for sequence similarity search, uses a position-specific scoring matrix (PSSM) constructed from a multiple sequence alignment (MSA). A PSSM allows the detection of more distant homologs than a general amino acid substitution matrix does. Accurate estimation of the weights of sequences in an MSA is crucially important for PSSM construction. PSI-BLAST divides a given MSA into multiple blocks, for which sequence weights are calculated. When the block width becomes very narrow, the sequence weight calculation can be difficult. We demonstrate that PSI-BLAST indeed generates a significant fraction of blocks having widths less than 5, thereby degrading the PSI-BLAST performance. We revised the code of PSI-BLAST to prevent the blocks from being narrower than a given minimum block width (MBW). We designate the modified application of PSI-BLAST as PSI-BLASTexB. When MBW is 25, PSI-BLASTexB consistently and notably outperforms PSI-BLAST on three independent benchmark sets. The performance boost is even more drastic when an MSA, instead of a sequence, is used as a query. Our results demonstrate that the generation of narrow-width blocks during the sequence weight calculation is a critically important factor that restricts the PSI-BLAST search performance. By preventing narrow blocks, PSI-BLASTexB remarkably upgrades the PSI-BLAST performance.
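To make the role of block width concrete, the following is a minimal sketch of a position-based sequence weighting computed over a single alignment block. It assumes a Henikoff-style scheme, in which each column shares one unit of weight among its residue types and then among the sequences carrying each type. It is not the NCBI implementation or the PSI-BLASTexB modification, and the function name `position_based_weights` is hypothetical.

```python
from collections import Counter


def position_based_weights(block):
    """Return normalized position-based weights for one alignment block.

    `block` is a list of equal-length aligned strings (one per sequence).
    Assumption: a Henikoff-style scheme; this is an illustrative sketch only.
    """
    if not block or not block[0]:
        raise ValueError("block must contain at least one non-empty sequence")
    width = len(block[0])
    if any(len(seq) != width for seq in block):
        raise ValueError("all sequences in a block must have the same width")

    raw = [0.0] * len(block)
    for col in range(width):
        column = [seq[col] for seq in block]
        counts = Counter(column)      # occurrences of each residue type
        n_types = len(counts)         # distinct residue types in this column
        for idx, residue in enumerate(column):
            # Each column distributes a total weight of 1 over the sequences.
            raw[idx] += 1.0 / (n_types * counts[residue])

    total = sum(raw)                  # equals the block width
    return [w / total for w in raw]


# A width-1 block: the weights depend entirely on a single column.
narrow = position_based_weights(["A", "A", "C"])
# A wider block over the same sequences averages over more columns.
wide = position_based_weights(["AKL", "ARL", "CKL"])
```

With a block only one column wide, the weights are determined by that single column, which illustrates why very narrow blocks can give unstable estimates.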

PyMod: sequence similarity searches, multiple sequence-structure alignments, and homology modeling within PyMOL

BMC Bioinformatics ◽

10.1186/1471-2105-13-s4-s2 ◽

2012 ◽

Vol 13 (Suppl 4) ◽

pp. S2 ◽

Cited By ~ 82

Author(s):

Emanuele Bramucci ◽

Alessandro Paiardini ◽

Francesco Bossa ◽

Stefano Pascarella

Keyword(s):

Homology Modeling ◽

Sequence Similarity ◽

Sequence Structure ◽

Multiple Sequence ◽

Similarity Searches

Molecular characterization and phylogenetic analysis of NBS-LRR genes in wild relatives of eggplant (Solanum melongena L.)

Indian Journal of Agricultural Research ◽

10.18805/ijare.a-4793 ◽

2018 ◽

Author(s):

Sona S. Dev ◽

P. Poornima ◽

Akhil Venu

Keyword(s):

Phylogenetic Analysis ◽

Amino Acid ◽

Sequence Similarity ◽

Interleukin 1 ◽

Preliminary Investigation ◽

Solanum Melongena ◽

Wild Relatives ◽

Amino Acid Sequences ◽

R Genes ◽

Multiple Sequence

Eggplant or brinjal (Solanum melongena L.) is highly susceptible to various soil-borne diseases. The extensive use of chemical fungicides to combat these diseases can be minimized by identification of resistance gene analogs (RGAs) in wild species of cultivated plants. In the present study, degenerate PCR primers for the conserved regions of nucleotide binding site-leucine rich repeat (NBS-LRR) were used to amplify RGAs from wild relatives of eggplant (black nightshade (Solanum nigrum), Indian nightshade (Solanum violaceum) and Solanum incanum), which showed resistance to the bacterial wilt pathogen, Ralstonia solanacearum, in the preliminary investigation. The amino acid sequences of the amplicons, when compared to each other and to the amino acid sequences of known RGAs deposited in GenBank, revealed significant sequence similarity. The phylogenetic analysis indicated that they belonged to the toll interleukin-1 receptor (TIR)-NBS-LRR type R-genes. Multiple sequence alignment with other known R genes showed significant homology with the P-loop, Kinase 2 and GLPL domains of NBS-LRR class genes. There has been no report on R genes from these wild eggplants, and hence the diversity analysis of these novel RGAs can lead to the identification of other novel R genes within the germplasm of different brinjal plants as well as other species of Solanum.

Analysis of the diversity of the glycoside hydrolase family 130 in mammal gut microbiomes reveals a novel mannoside-phosphorylase function

Microbial Genomics ◽

10.1099/mgen.0.000404 ◽

2020 ◽

Vol 6 (10) ◽

Author(s):

Ao Li ◽

Elisabeth Laville ◽

Laurence Tarquis ◽

Vincent Lombard ◽

David Ropartz ◽

...

Keyword(s):

Glycoside Hydrolase ◽

Sequence Similarity ◽

Gut Bacteria ◽

Glycoside Hydrolase Family ◽

Sequence Alignments ◽

Multiple Sequence ◽

Content Type ◽

Multiple Sequence Alignments ◽

Hydrolase Family

Mannoside phosphorylases are involved in the intracellular metabolization of mannooligosaccharides, and are also useful enzymes for the in vitro synthesis of oligosaccharides. They are found in glycoside hydrolase family GH130. Here we report on an analysis of 6308 GH130 sequences, including 4714 from the human, bovine, porcine and murine microbiomes. Using sequence similarity networks, we divided the diversity of sequences into 15 mostly isofunctional meta-nodes; of these, 9 contained no experimentally characterized member. By examining the multiple sequence alignments in each meta-node, we predicted the determinants of the phosphorolytic mechanism and linkage specificity. We thus hypothesized that eight uncharacterized meta-nodes would be phosphorylases. These sequences are characterized by the absence of signal peptides and of the catalytic base. Those sequences with the conserved E/K, E/R and Y/R pairs of residues involved in substrate binding would target β-1,2-, β-1,3- and β-1,4-linked mannosyl residues, respectively. These predictions were tested by characterizing members of three of the uncharacterized meta-nodes from gut bacteria. We discovered the first known β-1,4-mannosyl-glucuronic acid phosphorylase, which targets a motif of the Shigella lipopolysaccharide O-antigen. This work uncovers a reliable strategy for the discovery of novel mannoside-phosphorylases, reveals possible interactions between gut bacteria, and identifies a biotechnological tool for the synthesis of antigenic oligosaccharides.

Rfam 14: expanded coverage of metagenomic, viral and microRNA families

Nucleic Acids Research ◽

10.1093/nar/gkaa1047 ◽

2020 ◽

Vol 49 (D1) ◽

pp. D192-D200 ◽

Cited By ~ 2

Author(s):

Ioanna Kalvari ◽

Eric P Nawrocki ◽

Nancy Ontiveros-Palacios ◽

Joanna Argasinska ◽

Kevin Lamkiewicz ◽

...

Keyword(s):

Similarity Search ◽

Sequence Similarity ◽

Sequence Similarity Search ◽

Covariance Model ◽

RNA Sequences ◽

Multiple Sequence ◽

The Family ◽

Recent Developments ◽

Community Contribution ◽

Website Features

Rfam is a database of RNA families where each of the 3444 families is represented by a multiple sequence alignment of known RNA sequences and a covariance model that can be used to search for additional members of the family. Recent developments have involved expert collaborations to improve the quality and coverage of Rfam data, focusing on microRNAs, viral and bacterial RNAs. We have completed the first phase of synchronising microRNA families in Rfam and miRBase, creating 356 new Rfam families and updating 40. We established a procedure for comprehensive annotation of viral RNA families starting with Flavivirus and Coronaviridae RNAs. We have also increased the coverage of bacterial and metagenome-based RNA families from the ZWD database. These developments have enabled a significant growth of the database, with the addition of 759 new families in Rfam 14. To facilitate further community contribution to Rfam, expert users are now able to build and submit new families using the newly developed Rfam Cloud family curation system. New Rfam website features include a new sequence similarity search powered by RNAcentral, as well as search and visualisation of families with pseudoknots. Rfam is freely available at https://rfam.org.

Partial sequence analysis of mitochondrial cytochrome B gene of Labeo calbasu of Bangladesh

Journal of Biodiversity Conservation and Bioresource Management ◽

10.3329/jbcbm.v5i1.42182 ◽

2019 ◽

Vol 5 (1) ◽

pp. 25-30

Author(s):

RA Begum ◽

MT Alam ◽

H Jahan ◽

MS Alam

Keyword(s):

Genetic Diversity ◽

Cytochrome B ◽

Tissue Sample ◽

Sequence Similarity ◽

Cytochrome B Gene ◽

Mitochondrial Cytochrome ◽

Cyt B ◽

Multiple Sequence ◽

Mitochondrial Cytochrome B ◽

Cyt B Gene

Labeo calbasu (family Cyprinidae) was studied at the DNA level to assess genetic diversity within and between species. The mitochondrial cytochrome b (cyt-b) gene of L. calbasu was sequenced and compared to the corresponding sequences of other Labeo species. DNA was isolated from a tissue sample of L. calbasu using the phenol:chloroform extraction method. Forward and reverse primers were designed to amplify the target region of the cytochrome b gene. A standard PCR protocol was used for the amplification of the desired region. The forward and reverse sequences obtained were then aligned and edited to obtain a final sequence of 510 nucleotides, which was submitted to the NCBI GenBank database. Nucleotide BLAST of this sequence at NCBI showed 100% sequence similarity with the L. calbasu sequence of the same region of the cyt-b gene. Multiple sequence alignment of the sequence with those of seven other Labeo species revealed 120 polymorphic sites, which mark diversity among the species and might be used in molecular identification of Labeo species. A phylogenetic tree was constructed to show the relationships among the Labeo species. This research demonstrated the usefulness of the mitochondrial DNA-based approach in species identification. Further, the data will provide appropriate background for studying within-species genetic diversity of Labeo species in general and of L. calbasu in particular.
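As an illustration of what counting polymorphic sites in a multiple sequence alignment involves, here is a minimal sketch. It assumes the alignment is given as equal-length strings, that `-` marks a gap, and that gaps are ignored when deciding whether a column varies. The function name `polymorphic_sites` and the toy sequences are hypothetical and are not data from the study.

```python
def polymorphic_sites(alignment, gap="-"):
    """Return 0-based column indices where more than one residue occurs.

    `alignment` is a list of equal-length aligned strings.
    Assumption: gap characters are ignored when comparing a column.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    sites = []
    for col in range(width):
        # Collect the distinct non-gap residues seen in this column.
        residues = {row[col].upper() for row in alignment if row[col] != gap}
        if len(residues) > 1:
            sites.append(col)
    return sites


# Toy alignment: columns 2 and 3 vary among the three sequences.
toy = ["ACGT", "ACGA", "ACCT"]
variable_columns = polymorphic_sites(toy)
```

The number of polymorphic sites is then `len(variable_columns)`; whether gapped columns should count as variable is a design choice that differs between tools.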

An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction

BMC Bioinformatics ◽

10.1186/s12859-020-03738-5 ◽

2020 ◽

Vol 21 (S6) ◽

Author(s):

Sriram P. Chockalingam ◽

Jodh Pannu ◽

Sahar Hooshmand ◽

Sharma V. Thankachan ◽

Srinivas Aluru

Keyword(s):

Phylogenetic Trees ◽

Linear Time ◽

Sequence Similarity ◽

Similarity Measures ◽

Phylogeny Reconstruction ◽

Greedy Heuristics ◽

Biological Sequences ◽

Sequence Comparisons ◽

Multiple Sequence ◽

Alignment Free

Background: Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACSk, have been shown to produce results as effective as multiple-sequence alignment based methods for reconstruction of phylogeny trees. Since computing ACSk takes O(n log^k n) time and is hence impractical for large datasets, multiple heuristics that can approximate ACSk have been introduced. Results: In this paper, we present a novel linear-time heuristic to approximate ACSk, which is faster than computing the exact ACSk while being closer to the exact ACSk values compared to previously published linear-time greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy and runtime, and demonstrate its applicability for phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction. Conclusions: Our method produces a better approximation for ACSk and is applicable for the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in the Rust programming language and the source code is available at https://github.com/srirampc/adyar-rs.
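For orientation, the exact (zero-mismatch) average common substring value can be sketched naively as follows: for every start position in one sequence, find the longest substring beginning there that also occurs in the other sequence, then average those lengths. This is a deliberately simple, slow illustration. It is not the linear-time heuristic described in the abstract, nor the k-mismatch variant, and the function names are hypothetical.

```python
def longest_match_at(x, i, y):
    """Length of the longest substring of x starting at i that occurs in y."""
    length = 0
    # Extend while the next, one-character-longer substring is still in y.
    while i + length < len(x) and x[i:i + length + 1] in y:
        length += 1
    return length


def average_common_substring(x, y):
    """Average over all start positions in x of the longest match found in y.

    Naive implementation for illustration only; suffix-based data
    structures would be needed for realistic sequence lengths.
    """
    if not x:
        raise ValueError("x must be non-empty")
    total = sum(longest_match_at(x, i, y) for i in range(len(x)))
    return total / len(x)


# Identical two-letter strings: matches of length 2 and 1 average to 1.5.
score = average_common_substring("ab", "ab")
```

Larger averages indicate more shared content; note that the measure is not symmetric in its two arguments.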

Molecular and Symptom Analysis Reveal the Presence of New Phytoplasmas Associated with Sugarcane Grassy Shoot Disease in India

Plant Disease ◽

10.1094/pdis-91-11-1413 ◽

2007 ◽

Vol 91 (11) ◽

pp. 1413-1418 ◽

Cited By ~ 17

Author(s):

Kanchan Nasare ◽

Amit Yadav ◽

Anil K. Singh ◽

K. B. Shivasharanappa ◽

Y. S. Nerkar ◽

...

Keyword(s):

16S rRNA ◽

Sequence Similarity ◽

PCR Amplification ◽

Saccharum Officinarum ◽

23S rRNA ◽

rRNA Gene ◽

Sequence Alignments ◽

High Sequence Similarity ◽

Multiple Sequence ◽

Very High

A total of 240 sugarcane (Saccharum officinarum) plants showing phenotypic symptoms of sugarcane grassy shoot (SCGS) disease were collected from three states of India, Maharashtra, Karnataka, and Uttar Pradesh. Phytoplasmas were detected in all symptomatic samples by the polymerase chain reaction (PCR) amplification of phytoplasma-specific 16S rRNA gene and 16S-23S rRNA spacer region (SR) sequences. No amplification was observed when DNA from asymptomatic plant samples was used as a template. Sixteen samples were selected on the basis of phenotypic symptoms and geographic location, and cloning and sequencing of the 16S rRNA and spacer regions were performed. Multiple sequence alignments of the 16S rRNA sequences revealed that they share very high sequence similarity with phytoplasmas of rice yellow dwarf, 16SrXI. However, the 16S-23S rRNA SR sequence analysis revealed that while the majority of phytoplasmas shared very high (>99%) sequence similarity with previously reported sugarcane phytoplasmas, two of them, namely BV2 (DQ380342) and VD7 (DQ380343), shared relatively low sequence similarity (79 and 84%, respectively). Therefore, these two phytoplasmas may be previously unreported ones that cause significant yield losses in sugarcane in India.

Efficient Multiple Sequences Alignment Algorithm Generation via Components Assembly Under PAR Framework

Frontiers in Genetics ◽

10.3389/fgene.2020.628175 ◽

2021 ◽

Vol 11 ◽

Author(s):

Haipeng Shi ◽

Haihe Shi ◽

Shenghua Xu

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Sequence Similarity ◽

Alignment Algorithm ◽

Pairwise Sequence Alignment ◽

Multiple Sequence ◽

Sequence Alignment Algorithm ◽

Alignment Algorithms ◽

Sequence Similarity Analysis ◽

High Level

As a key algorithm in bioinformatics, the sequence alignment algorithm is widely used in sequence similarity analysis and genome sequence database search. Existing research focuses mainly on specific steps of the algorithm or on specific problems, and lacks a high-level abstract domain algorithm framework. Multiple sequence alignment algorithms are more complex, redundant, and difficult to understand; it is not easy for users to select the appropriate algorithm, and some computing errors may occur. Based on our constructed pairwise sequence alignment algorithm component library and the convenient software platform PAR, a few expansion domain components are developed for the multiple sequence alignment application domain, so that a specific multiple sequence alignment algorithm can be designed and its corresponding program (i.e., a C++/Java/Python program) can be generated efficiently. This improves the development efficiency of complex algorithms as well as the accuracy of sequence alignment calculation. A star alignment algorithm is designed and generated to demonstrate the development process.

Modeling variation of clinical team processes with multiple sequence alignment

Methodological Innovations ◽

10.1177/2059799119840985 ◽

2019 ◽

Vol 12 (1) ◽

pp. 205979911984098

Author(s):

Nathan J Bahr ◽

S Herzberg ◽

W Lambert ◽

M Hansen ◽

JJ McNulty ◽

...

Keyword(s):

Emergency Medical Service ◽

Sequence Alignment ◽

Medical Service ◽

Process Analysis ◽

Sequence Similarity ◽

Multiple Sequence ◽

Team Processes ◽

Performance Quality ◽

Emergency Medical ◽

Task Sequences

Our objective was to model process variation of Emergency Medical Service teams responding to simulated pediatric emergencies and determine if sequence alignment distinguishes performance quality. We performed a retrospective process analysis by watching and coding activities in videos from standardized simulations of 42 Emergency Medical Service teams. Teams were classified into high- or low-performing groups based on the Clinical Teamwork Scale™. Activities were coded according to resuscitation tasks, performer, and times. We used ClustalG to align task sequences within and between groups, and measured similarity. Teams within and between performance levels had an average sequence similarity of 52 ± 7% and 50 ± 7%, respectively. Teams performed clinically appropriate tasks that varied in prioritization, for example, performing compressions or connecting the EKG monitor early. There was no statistical difference in gross similarity between groups, but specific differences in prioritization may have had clinically meaningful implications. Alignment could improve by accounting for task duration and concurrency.

NeoRdRp: A comprehensive dataset for identifying RNA-dependent RNA polymerase of various RNA viruses from metatranscriptomic data

10.1101/2021.12.31.474423 ◽

2022 ◽

Author(s):

Shoichi Sakaguchi ◽

Syun-ichi Urayama ◽

Yoshihiro Takaki ◽

Hong Wu ◽

Youichi Suzuki ◽

...

Keyword(s):

RNA Polymerase ◽

RNA Viruses ◽

RNA Virus ◽

Sequence Similarity ◽

Virus Detection ◽

Detection Methods ◽

Amino Acid Sequence Similarity ◽

Sequencing Data ◽

RNA Dependent RNA Polymerase ◽

Multiple Sequence

RNA viruses are distributed in various environments, and most RNA viruses have been recently identified by metatranscriptome sequencing. However, due to the high nucleotide diversity of RNA viruses, it is still challenging to identify their sequences. Therefore, this study generated a dataset of RNA-dependent RNA polymerase (RdRp) domains essential for all RNA viruses belonging to Orthornavirae. Also, the collected genes with RdRp domains from various RNA viruses were clustered by amino acid sequence similarity. For each cluster, a multiple sequence alignment was generated, and a hidden Markov model (HMM) profile was created if the number of sequences was greater than five. Using the 1,467 HMM profiles, we detected RdRp domains in the RefSeq RNA virus sequences, combined the hit sequences with the RdRp domains, and reconstructed the HMM profiles. As a result, 2,234 HMM profiles were generated from 12,316 RdRp domain sequences, and the dataset was named NeoRdRp. Additionally, using the UniProt dataset, we confirmed that almost all NeoRdRp HMM profiles could specifically detect RdRps in Orthornavirae. Furthermore, we compared the NeoRdRp dataset with two previously reported RNA virus detection methods to detect RNA virus sequences from metatranscriptome sequencing data. Our methods can identify most of the RNA viruses in the datasets; however, some RNA viruses were not detected, similar to the other two methods. The NeoRdRp can be improved by repeatedly adding new RdRp sequences and can be expected to be widely applied as a system for detecting various RNA viruses from metatranscriptome data.
