The Novel Sequence Distance Measuring Algorithm Based on Optimal Transport and Cross-Attention Mechanism

Shock and Vibration ◽

10.1155/2021/3272119 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Yanmin Yu ◽

Yongcai Lai ◽

Ping Yan ◽

Haiying Liu

Keyword(s):

Optimal Transport ◽

Sequence Data ◽

Sequence Similarity ◽

Attention Mechanism ◽

Target Sequence ◽

Distance Metric ◽

Hinge Loss ◽

Ground Distance ◽

Anchor Sequence ◽

Distance Measuring

In this paper, we propose a novel sequence distance measuring algorithm based on optimal transport (OT) and a cross-attention mechanism. Given a source sequence and a target sequence, we first calculate the ground distance between each pair of source and target terms of the two sequences. The ground distance is calculated over the subsequences around the two terms. We first attend from each source term to each target term with attention weights, so that we obtain a representative source subsequence vector for each term in the target subsequence. Then, we attend from the representative vector of each target-subsequence term to the entire source subsequence. In this way, we construct the cross-attention weights and use them to calculate the pairwise ground distances. With the ground distances, we derive the OT distance between the two sequences and train the attention parameters and ground distance metric parameters together. Training is conducted with triplets of sequences, where each triplet is composed of an anchor sequence, a must-link sequence, and a cannot-link sequence. The hinge loss of each triplet is minimized, and we develop an iterative algorithm that alternates between solving the optimal transport problem and updating the attention and ground distance metric parameters. Experiments are conducted over sequence similarity search benchmark datasets, including text, video, and rice smut protein sequence data. The experimental results show the algorithm is effective.
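
The pipeline described above (ground distances, then an OT distance, then a triplet hinge loss) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the cross-attention weighting is omitted, the ground distance is assumed to be plain squared Euclidean distance, the marginals are assumed uniform, and the OT problem is approximated with Sinkhorn iterations, a technique swapped in here because the abstract does not name its solver. All function names and parameters (`reg`, `n_iter`, `margin`) are hypothetical.

```python
import numpy as np

def ground_distances(src, tgt):
    """Squared Euclidean distance between every source/target term pair.

    src has shape (n, d) and tgt has shape (m, d); the result is (n, m).
    Assumption: the paper learns an attention-weighted metric instead.
    """
    diff = src[:, None, :] - tgt[None, :, :]
    return (diff ** 2).sum(axis=-1)

def sinkhorn_distance(cost, reg=0.05, n_iter=200):
    """Entropy-regularised OT cost with uniform marginals (assumed)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    # Scale the regulariser by the largest cost to avoid underflow in exp.
    kernel = np.exp(-cost / (reg * cost.max() + 1e-12))
    u = np.ones(n)
    for _ in range(n_iter):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum())

def triplet_hinge(anchor, must_link, cannot_link, margin=1.0):
    """Hinge loss: zero once the cannot-link sequence is farther by `margin`."""
    d_pos = sinkhorn_distance(ground_distances(anchor, must_link))
    d_neg = sinkhorn_distance(ground_distances(anchor, cannot_link))
    return max(0.0, margin + d_pos - d_neg)
```

In a trainable version, the loss would be differentiated with respect to the attention and metric parameters while the transport plan is held fixed, which is one way to read the alternating scheme the abstract describes.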


Predicting bacteriophage hosts based on sequences of annotated receptor-binding proteins

Scientific Reports ◽

10.1038/s41598-021-81063-4 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Dimitri Boeckaerts ◽

Michiel Stock ◽

Bjorn Criel ◽

Hans Gerstmans ◽

Bernard De Baets ◽

...

Keyword(s):

Machine Learning ◽

Predictive Model ◽

Receptor Binding ◽

Bacterial Infections ◽

Sequence Data ◽

Sequence Similarity ◽

Area Under The Curve ◽

Local Alignment ◽

Search Tool ◽

Different Levels

Nowadays, bacteriophages are increasingly considered as an alternative treatment for a variety of bacterial infections in cases where classical antibiotics have become ineffective. However, characterizing the host specificity of phages remains a labor- and time-intensive process. In order to alleviate this burden, we have developed a new machine-learning-based pipeline to predict bacteriophage hosts based on annotated receptor-binding protein (RBP) sequence data. We focus on predicting bacterial hosts from the ESKAPE group, Escherichia coli, Salmonella enterica and Clostridium difficile. We compare the performance of our predictive model with that of the widely used Basic Local Alignment Search Tool (BLAST). Our best-performing predictive model reaches Precision-Recall Area Under the Curve (PR-AUC) scores between 73.6 and 93.8% for different levels of sequence similarity in the collected data. Our model reaches a performance comparable to that of BLASTp when sequence similarity in the data is high and starts outperforming BLASTp when sequence similarity drops below 75%. Therefore, our machine learning methods can be especially useful in settings in which sequence similarity to other known sequences is low. Predicting the hosts of novel metagenomic RBP sequences could extend our toolbox to tune the host spectrum of phages or phage tail-like bacteriocins by swapping RBPs.
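
For readers unfamiliar with the PR-AUC metric reported above, the sketch below computes average precision, a common step-wise estimate of the area under the precision-recall curve. It is an illustration only: the abstract does not say how the authors computed PR-AUC, the function name is hypothetical, and tied scores are not given any special handling.

```python
def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example.

    labels: 0/1 ground truth; scores: predicted confidence, higher is better.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / hits if hits else 0.0
```

A ranking that places every true host above every non-host scores 1.0; positives pushed further down the ranking lower the score.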


Within-Arctic horizontal gene transfer as a driver of convergent evolution in distantly related microalgae

10.1101/2021.07.31.454568 ◽

2021 ◽

Author(s):

Richard G Dorrell ◽

Alan Kuo ◽

Zoltan Fussy ◽

Elisabeth H Richardson ◽

Asaf Salamov ◽

...

Keyword(s):

Sequence Data ◽

Sequence Similarity ◽

Algal Species ◽

Gene Families ◽

The Arctic ◽

Arctic Water ◽

Diverse Range ◽

Binding Domains ◽

Ice Binding ◽

Ice Conditions

The Arctic Ocean is being impacted by warming temperatures, increasing freshwater and highly variable ice conditions. The microalgal communities underpinning Arctic marine food webs, once thought to be dominated by diatoms, include a phylogenetically diverse range of small algal species, whose biology remains poorly understood. Here, we present genome sequences of a cryptomonad, a haptophyte, a chrysophyte, and a pelagophyte, isolated from the Arctic water column and ice. Comparing protein family distributions and sequence similarity across a densely-sampled set of algal genomes and transcriptomes, we note striking convergences in the biology of distantly related small Arctic algae, compared to non-Arctic relatives, although this convergence is largely exclusive of Arctic diatoms. Using high-throughput phylogenetic approaches, incorporating environmental sequence data from Tara Oceans, we demonstrate that this convergence was partly explained by horizontal gene transfers (HGT) between Arctic species, in at least 30 discrete gene families, most notably in ice-binding domains (IBD). These Arctic-specific genes have been repeatedly transferred between Arctic algae, and are independent of equivalent HGTs in the Antarctic Southern Ocean. Our data provide insights into the specialised Arctic marine microbiome, and underline the role of geographically-limited HGT as a driver of environmental adaptation in eukaryotic algae.


Phylogenetic analysis of DNA sequence data.

Techniques for work with plant and soil nematodes ◽

10.1079/9781786391759.0265 ◽

2021 ◽

pp. 265-282

Author(s):

Sergei A. Subbotin

Keyword(s):

Phylogenetic Analysis ◽

Sequence Data ◽

Sequence Similarity ◽

Phylogenetic Study ◽

Public Database ◽

Sample Collection ◽

Minimum Evolution ◽

Molecular Phylogenetic ◽

Pcr Products ◽

Selection Of

The goal of phylogenetics is to construct relationships that are true representations of the evolutionary history of a group of organisms or genes. The history inferred from phylogenetic analysis is usually depicted as branching in tree-like diagrams or networks. In nematology, phylogenetic studies have been applied to resolve a wide range of questions dealing with improving classifications and testing evolutionary processes, such as co-evolution, biogeography and many others. There are several main steps involved in a phylogenetic study: (i) selection of ingroup and outgroup taxa for a study; (ii) selection of one or several gene fragments for a study; (iii) sample collection, obtaining PCR products and sequencing of gene fragments; (iv) visualization, editing of raw sequence data and sequence assembly; (v) search for sequence similarity in a public database; (vi) making and editing a multiple alignment of sequences; (vii) selecting an appropriate DNA model for a dataset; (viii) phylogenetic reconstruction using minimum evolution, maximum parsimony, maximum likelihood and Bayesian inference; (ix) visualization of tree files and preparation of a tree for publication; and (x) sequence submission to a public database. A molecular phylogenetic study requires particularly careful planning because it is usually relatively expensive in terms of reagents and time.


Similarity and Homology

Sequence Analysis Primer ◽

10.1093/oso/9780195098747.003.0006 ◽

1995 ◽

Author(s):

David J. States ◽

Mark S. Boguski

Keyword(s):

Sequence Alignment ◽

Common Ancestor ◽

Sequence Data ◽

Sequence Similarity ◽

Biological Macromolecules ◽

Clear Understanding ◽

Molecular Sequence Data ◽

Molecular Sequence ◽

Automated Tools ◽

Similarity Relationships

Properly approached, molecular sequence data is a rich source of knowledge capable of teaching us much about the structure, function, and evolution of biological macromolecules. To effectively realize this potential, however, some understanding of the process of and theoretical basis for sequence comparison is needed as well as a variety of practical tools to access and manipulate the data. The volume of molecular sequence data has long since surpassed human information processing capacity for even simple tasks such as searching for related sequences, and with the ever increasing rate at which new sequences are being produced, the need for computer-assisted analysis becomes more and more acute. Automated tools can extend human capabilities by orders of magnitude in both speed and accuracy. The educated application of these automated tools is an essential part of modern molecular biology research. This chapter considers the theory and practice of analyzing sequence similarity as it applies to database searching and sequence alignment. Five major areas will be examined. First, we describe the use of dot matrix plots to elucidate the structures and features relating a sequence pair. Secondly, we discuss optimal pairwise alignment of sequences using dynamic programming algorithms. Thirdly, we examine fast, approximate techniques for detecting local similarities. Fourthly, the uses of and techniques for multiple sequence alignment are described. Finally, the statistical significance of sequence similarity is considered. In the analysis of molecular sequences, the terms similarity and homology are often used without a clear understanding of their distinct implications. Similarity is a descriptive term which only implies that two sequences, by some criterion, resemble each other and carries no suggestion as to their origins or ancestry. Homology refers specifically to similarity due to descent from a common ancestor (Patterson, 1988; Reeck et al., 1987).
On the basis of similarity relationships among a group of sequences, it may be possible to infer homology, but outside of an explicit laboratory model system, descent from a common ancestor remains hypothetical. There are philosophical issues in the inference of homology as well as practical ones. In classical morphology, conjunction (the occurrence of two traits in a single individual) is considered evidence that they are not homologous (Patterson, 1982).
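
The optimal pairwise alignment by dynamic programming mentioned above can be illustrated with a short Needleman-Wunsch scoring routine. This is a sketch under assumed scoring values (match +1, mismatch -1, linear gap penalty -2); the chapter's own scoring scheme is not given in the abstract, and the function name is hypothetical. Only the optimal score is returned, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score of the best global alignment of a and b."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

A high score indicates similarity only; in line with the distinction drawn above, inferring homology from it remains a separate step.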


Investigating Microbial Eukaryotic Diversity from a Global Census: Insights from a Comparison of Pyrotag and Full-Length Sequences of 18S rRNA Genes

Applied and Environmental Microbiology ◽

10.1128/aem.00057-14 ◽

2014 ◽

Vol 80 (14) ◽

pp. 4363-4373 ◽

Cited By ~ 49

Author(s):

Alle A. Y. Lie ◽

Zhenfeng Liu ◽

Sarah K. Hu ◽

Adriane C. Jones ◽

Diane Y. Kim ◽

...

Keyword(s):

Species Richness ◽

Sanger Sequencing ◽

18S Rrna ◽

Sequence Data ◽

Sequence Similarity ◽

Hypervariable Region ◽

Full Length ◽

Rrna Genes ◽

Data Sets ◽

18S Rrna Genes

Next-generation DNA sequencing (NGS) approaches are rapidly surpassing Sanger sequencing for characterizing the diversity of natural microbial communities. Despite this rapid transition, few comparisons exist between Sanger sequences and the generally much shorter reads of NGS. Operational taxonomic units (OTUs) derived from full-length (Sanger sequencing) and pyrotag (454 sequencing of the V9 hypervariable region) sequences of 18S rRNA genes from 10 global samples were analyzed in order to compare the resulting protistan community structures and species richness. Pyrotag OTUs called at 98% sequence similarity yielded numbers of OTUs that were similar overall to those for full-length sequences when the latter were called at 97% similarity. Singleton OTUs strongly influenced estimates of species richness but not the higher-level taxonomic composition of the community. The pyrotag and full-length sequence data sets had slightly different taxonomic compositions of rhizarians, stramenopiles, cryptophytes, and haptophytes, but the two data sets had similarly high compositions of alveolates. Pyrotag-based OTUs were often derived from sequences that mapped to multiple full-length OTUs at 100% similarity. Thus, pyrotags sequenced from a single hypervariable region might not be appropriate for establishing protistan species-level OTUs. However, nonmetric multidimensional scaling plots constructed with the two data sets yielded similar clusters, indicating that beta diversity analysis results were similar for the Sanger and NGS sequences. Short pyrotag sequences can provide holistic assessments of protistan communities, although care must be taken in interpreting the results. The longer reads (>500 bp) that are now becoming available through NGS should provide powerful tools for assessing the diversity of microbial eukaryotic assemblages.
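
To make "calling OTUs at 97% similarity" concrete, the sketch below groups sequences by greedy centroid clustering. It is an assumption-laden illustration, not the pipeline used in the study: sequences are assumed to be pre-aligned to equal length, identity is a simple per-position match fraction, and both function names are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_otus(seqs, threshold=0.97):
    """Assign each sequence to the first centroid it matches at >= threshold."""
    centroids = []
    clusters = []
    for s in seqs:
        for k, c in enumerate(centroids):
            if identity(s, c) >= threshold:
                clusters[k].append(s)
                break
        else:
            # No existing centroid is close enough: s founds a new OTU.
            centroids.append(s)
            clusters.append([s])
    return clusters
```

Raising the threshold splits sequences into more OTUs and lowering it merges them, which is why the OTU counts compared above depend on the similarity level chosen for each data set.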


TE-greedy-nester: structure-based detection of LTR retrotransposons and their nesting

Bioinformatics ◽

10.1093/bioinformatics/btaa632 ◽

2020 ◽

Vol 36 (20) ◽

pp. 4991-4999

Author(s):

Matej Lexa ◽

Pavel Jedlicka ◽

Ivan Vanat ◽

Michal Cervenansky ◽

Eduard Kejnovsky

Keyword(s):

Sequence Data ◽

Sequence Similarity ◽

Recursive Algorithm ◽

Complex Mixture ◽

Computation Time ◽

Full Length ◽

Supplementary Information ◽

Ltr Retrotransposons ◽

Process Error ◽

Transposon Evolution

Motivation: Transposable elements (TEs) in eukaryotes often get inserted into one another, forming sequences that become a complex mixture of full-length elements and their fragments. The reconstruction of full-length elements and the order in which they have been inserted is important for genome and transposon evolution studies. However, the accumulation of mutations and genome rearrangements over evolutionary time makes this process error-prone and decreases the efficiency of software aiming to recover all nested full-length TEs. Results: We created software that uses a greedy recursive algorithm to mine increasingly fragmented copies of full-length LTR retrotransposons in assembled genomes and other sequence data. The software, called TE-greedy-nester, considers not only sequence similarity but also the structure of elements. This new tool was tested on a set of natural and synthetic sequences and its accuracy was compared to similar software. We found TE-greedy-nester to be superior in a number of parameters, namely computation time and full-length TE recovery in highly nested regions. Availability and implementation: http://gitlab.fi.muni.cz/lexa/nested. Supplementary information: Supplementary data are available at Bioinformatics online.


Limited Genetic Diversity of blaCMY-2-Containing IncI1-pST12 Plasmids from Enterobacteriaceae of Human and Broiler Chicken Origin in The Netherlands

Microorganisms ◽

10.3390/microorganisms8111755 ◽

2020 ◽

Vol 8 (11) ◽

pp. 1755

Author(s):

Evert Drijver ◽

Joep Stohr ◽

Jaco Verweij ◽

Carlo Verhulst ◽

Francisca Velkers ◽

...

Keyword(s):

Escherichia Coli ◽

Sequence Data ◽

Sequence Similarity ◽

Sequencing Analysis ◽

Nucleotide Polymorphisms ◽

Pan Genome ◽

Next Generation Sequencing Analysis ◽

Long Read ◽

Short Read Sequence ◽

High Degree

Distinguishing epidemiologically related and unrelated plasmids is essential to confirm plasmid transmission. We compared IncI1–pST12 plasmids of both human and livestock origin and explored the degree of sequence similarity between plasmids from Enterobacteriaceae with different epidemiological links. Short-read sequence data of Enterobacteriaceae cultured from humans and broilers were screened for the presence of both a blaCMY-2 gene and an IncI1–pST12 replicon. Isolates were long-read sequenced on a MinION sequencer (Oxford Nanopore Technologies). After plasmid reconstruction using hybrid assembly, pairwise single nucleotide polymorphisms (SNPs) were determined. The plasmids were annotated, and a pan-genome was constructed to compare genes variably present between the different plasmids. Nine Escherichia coli sequences of broiler origin, four Escherichia coli sequences, and one Salmonella enterica sequence of human origin were selected for the current analysis. A circular contig with the IncI1–pST12 replicon and blaCMY-2 gene was extracted from the assembly graph of all fourteen isolates. Analysis of the IncI1–pST12 plasmids revealed a low number of SNP differences (range of 0–9 SNPs). The range of SNP differences overlapped in isolates with different epidemiological links. One hundred and twelve of the 113 genes of the pan-genome were present in all plasmid constructs. Next generation sequencing analysis of blaCMY-2-containing IncI1–pST12 plasmids isolated from Enterobacteriaceae with different epidemiological links shows a high degree of sequence similarity in terms of SNP differences and the number of shared genes. Therefore, statements on the horizontal transfer of these plasmids based on genetic identity should be made with caution.
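
The pairwise SNP comparison described above can be sketched as a count of differing positions between aligned plasmid sequences. This is a simplified illustration, not the study's workflow: it assumes the sequences are already aligned to equal length and does not treat gaps or ambiguous bases specially. The function names and the isolate labels in the usage example are hypothetical.

```python
from itertools import combinations

def snp_distance(a, b):
    """Number of positions at which two equal-length aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

def pairwise_snps(plasmids):
    """Map each unordered pair of plasmid names to its SNP distance."""
    return {
        (p, q): snp_distance(plasmids[p], plasmids[q])
        for p, q in combinations(sorted(plasmids), 2)
    }

# Toy example with made-up sequences and labels.
example = {"human_1": "ACGTACGT", "broiler_1": "ACGTACGA", "broiler_2": "ACGTACGT"}
distances = pairwise_snps(example)
```

A table like `distances` makes it easy to check whether SNP ranges overlap between epidemiologically linked and unlinked pairs, which is the comparison the abstract reports.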


Fast and accurate HLA typing from short-read next-generation sequence data with xHLA

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1707945114 ◽

2017 ◽

Vol 114 (30) ◽

pp. 8059-8064 ◽

Cited By ~ 54

Author(s):

Chao Xie ◽

Zhen Xuan Yeo ◽

Marie Wong ◽

Jason Piper ◽

Tao Long ◽

...

Keyword(s):

Sequence Data ◽

Sequence Similarity ◽

Amino Acid Level ◽

Hla Typing ◽

Sequencing Data ◽

Desktop Computer ◽

Short Read ◽

Short Read Sequencing ◽

Hla Genes ◽

Human Chromosome 6

The HLA gene complex on human chromosome 6 is one of the most polymorphic regions in the human genome and contributes in large part to the diversity of the immune system. Accurate typing of HLA genes with short-read sequencing data has historically been difficult due to the sequence similarity between the polymorphic alleles. Here, we introduce an algorithm, xHLA, that iteratively refines the mapping results at the amino acid level to achieve 99–100% four-digit typing accuracy for both class I and II HLA genes, taking only ∼3 min to process a 30× whole-genome BAM file on a desktop computer.


Hoeflea phototrophica sp. nov., a novel marine aerobic alphaproteobacterium that forms bacteriochlorophyll a

INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY ◽

10.1099/ijs.0.63958-0 ◽

2006 ◽

Vol 56 (4) ◽

pp. 821-826 ◽

Cited By ~ 45

Author(s):

Hanno Biebl ◽

Brian J. Tindall ◽

Rüdiger Pukall ◽

Heinrich Lünsdorf ◽

Martin Allgaier ◽

...

Keyword(s):

Sea Water ◽

Sequence Data ◽

Sequence Similarity ◽

Photosynthetic Pigment ◽

Carbon Sources ◽

Rrna Gene ◽

Good Growth ◽

Reaction Centre ◽

The Novel ◽

Poor Growth

Within a collection of marine strains that were shown to contain the photosynthesis reaction-centre genes pufL and pufM, a novel group of alphaproteobacteria was found and was characterized phenotypically. The 16S rRNA gene sequence data suggested that the strains belonged to the order Rhizobiales and were closest (98·5 % sequence similarity) to the recently described species Hoeflea marina. The cells contained bacteriochlorophyll a and a carotenoid, presumably spheroidenone, in small to medium amounts. Cells of the novel strains were small rods and were motile by means of single polarly inserted flagella. Good growth occurred in complex media with 0·5–7·0 % sea salts, at 25–33 °C (optimum, 31 °C) and at pH values in the range 6–9. With the exception of acetate and malate, organic carbon sources tested supported poor growth or no growth at all. Growth factors were required; these were provided by small amounts of yeast extract, but not by standard vitamin solutions. Growth occurred under aerobic to microaerobic conditions, but not under anaerobic conditions, either in the dark or light. Nitrate was not reduced. Photosynthetic pigments were formed at low to medium salt concentrations, but not at the salt concentration of sea water (3·5 %). On the basis of smaller cell size, different substrate utilization profile and photosynthetic pigment content, the novel strains can be classified as representatives of a second species of Hoeflea, for which the name Hoeflea phototrophica sp. nov. is proposed. The type strain of Hoeflea phototrophica sp. nov. is DFL-43T (=DSM 17068T=NCIMB 14078T).


Fibrisoma limi gen. nov., sp. nov., a filamentous bacterium isolated from tidal flats

INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY ◽

10.1099/ijs.0.025395-0 ◽

2011 ◽

Vol 61 (6) ◽

pp. 1418-1424 ◽

Cited By ~ 19

Author(s):

Manuela Filippini ◽

Andres Kaech ◽

Urs Ziegler ◽

Homayoun C. Bagheri

Keyword(s):

16S Rrna ◽

16S Rrna Gene ◽

Gene Sequence ◽

Sequence Data ◽

Sequence Similarity ◽

16S Rrna Gene Sequence ◽

Rrna Gene ◽

Rrna Gene Sequence ◽

The North ◽

Gram Staining

An orange-pigmented, Gram-staining-negative, non-motile, filament-forming, rod-shaped bacterium (BUZ 3T) was isolated from a coastal mud sample from the North Sea (Fedderwardersiel, Germany) and characterized taxonomically using a polyphasic approach. According to 16S rRNA gene sequence data, it belonged to the family Cytophagaceae, exhibiting low 16S rRNA gene sequence similarity (<90 %) with members of the genera Spirosoma, Rudanella and Fibrella. The DNA G+C content was 52.0 mol%. The major fatty acids were summed feature 3 (comprising C16 : 1ω7c and/or iso-C15 : 0 2-OH), C16 : 1ω5c and iso-C17 : 0 3-OH. The major polar lipids consisted of phosphatidylethanolamine and several aminolipids. On the basis of phenotypic, chemotaxonomic and phylogenetic data, it is proposed that strain BUZ 3T represents a novel genus and species, for which the name Fibrisoma limi gen. nov., sp. nov. is proposed. The type strain is BUZ 3T ( = DSM 22564T = CCUG 58137T).
