Data Structures to Represent a Set of k-long DNA Sequences

Rayan Chikhi; Jan Holub; Paul Medvedev

doi:10.1145/3445967

Data Structures to Represent a Set of k-long DNA Sequences

ACM Computing Surveys ◽

10.1145/3445967 ◽

2021 ◽

Vol 54 (1) ◽

pp. 1-22

Author(s):

Rayan Chikhi ◽

Jan Holub ◽

Paul Medvedev

Keyword(s):

Data Structures ◽

Dna Sequences ◽

Sequencing Data ◽

String Algorithms ◽

Fixed Length ◽

The Past

The analysis of biological sequencing data has been one of the biggest applications of string algorithms. The approaches used in many such applications are based on the analysis of k-mers, which are short fixed-length strings present in a dataset. While these approaches are rather diverse, storing and querying a k-mer set has emerged as a shared underlying component. A set of k-mers has unique features and applications that, over the past 10 years, have resulted in many specialized approaches for its representation. In this survey, we give a unified presentation and comparison of the data structures that have been proposed to store and query a k-mer set. We hope this survey will serve as a resource for researchers in the field as well as make the area more accessible to researchers outside the field.
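To make the notion of "storing and querying a k-mer set" concrete, the following is a minimal sketch that uses a plain Python hash set as the baseline representation. It is not one of the specialized data structures the survey compares; the function names `build_kmer_set` and `contains_kmer` are hypothetical, and skipping windows that contain characters other than A, C, G, T is an assumption made for illustration.

```python
def build_kmer_set(sequences, k):
    """Collect every length-k substring (k-mer) of the input sequences.

    Assumption for this sketch: windows containing characters other
    than A, C, G, T (for example N) are skipped.
    """
    alphabet = set("ACGT")
    kmers = set()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k across the sequence.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= alphabet:
                kmers.add(kmer)
    return kmers


def contains_kmer(kmers, query):
    """Membership query: is the k-mer present in the set?"""
    return query.upper() in kmers


example_kmers = build_kmer_set(["ACGTAC"], 3)
print(sorted(example_kmers))
print(contains_kmer(example_kmers, "gta"), contains_kmer(example_kmers, "AAA"))
```

A hash set answers membership queries quickly but stores each k-mer explicitly, which is the kind of space cost that motivates more compact representations.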


Detection of Flaviviral-Like DNA Sequences in Aedes aegypti (Diptera: Culicidae) Collected From Argentina

Journal of Medical Entomology ◽

10.1093/jme/tjab073 ◽

2021 ◽

Author(s):

Melisa B Bonica ◽

Dario E Balcazar ◽

Ailen Chuchuy ◽

Jorge A Barneche ◽

Carolina Torres ◽

...

Keyword(s):

Public Health ◽

Aedes Aegypti ◽

Dna Sequences ◽

Immature Stages ◽

Public Health Burden ◽

The Past ◽

Health Burden ◽

The World ◽

Aegypti Genome ◽

Dengue Epidemics

Diseases caused by flaviviruses are a major public health burden across the world. In the past decades, South America has suffered dengue epidemics, the re-emergence of yellow fever and St. Louis encephalitis viruses, and the introduction of West Nile and Zika viruses. Many insect-specific flaviviruses (ISFs) that cannot replicate in vertebrate cells have recently been described. In this study, we analyzed field-collected mosquito samples from six different ecoregions of Argentina to detect flaviviruses. We did not find any RNA belonging to pathogenic flaviviruses or ISFs in adults or immature stages. However, flaviviral-like DNA similar to the flavivirus NS5 region was detected in 83–100% of Aedes aegypti (L.). Despite being previously described as an ancient element in the Ae. aegypti genome, the flaviviral-like DNA sequence was not detected in all Ae. aegypti samples, and the sequences obtained did not form a monophyletic group, possibly reflecting the genetic diversity of mosquito populations in Argentina.


The hidden layers of microbial community structure: extracting the concealed diversity dimensions from our sequencing data

FEMS Microbiology Letters ◽

10.1093/femsle/fnaa086 ◽

2020 ◽

Vol 367 (11) ◽

Author(s):

Andrea Fasolo ◽

Laura Treu ◽

Piergiorgio Stevanato ◽

Giuseppe Concheri ◽

Stefano Campanaro ◽

...

Keyword(s):

Sequencing Data ◽

16S Sequencing ◽

Evolutionary Aspect ◽

One Dimensional ◽

Sequence Identity ◽

The Past ◽

Environmental Processes ◽

Final Layer ◽

Evolutionary Success ◽

Microbial Groups

Microbial metabarcoding is the standard approach for assessing the diversity of communities. However, reports are often limited to simple OTU abundances for each phylum, giving rather one-dimensional views of microbial assemblages and overlooking other accessible aspects. The first is masked by database incompleteness: OTU picking involves clustering at 97% (near-species) sequence identity, but different OTUs regularly end up under the same taxon name. When diversity is expressed as the number of taxonomic names obtained, a large portion of the real diversity lying within the data remains underestimated. Using the 16S sequencing results of an environmental transect across a gradient of 17 coastal habitats, we first extracted the number of OTUs hidden under the same name. Further, we observed which was the deepest rank yielded by annotation, revealing the microbial groups for which we are missing the most knowledge. The data were then used to infer an evolutionary aspect: what is, in each phylum, the success of the present-time individuals (abundances for each OTU) in relation to their prior evolutionary success in differentiation (number of OTUs)? This information reveals whether the past speciation/diversification force is matched by the present competitiveness in reproduction/persistence. The final layer explored is functional diversity, i.e. the abundances of groups involved in specific environmental processes.


Predecessor Search, String Algorithms and Data Structures

Encyclopedia of Algorithms ◽

10.1007/978-1-4939-2864-4_632 ◽

2016 ◽

pp. 1605-1611

Author(s):

Djamal Belazzougui

Keyword(s):

Data Structures ◽

Algorithms And Data Structures ◽

String Algorithms ◽

Search String


Towards end-to-end disease prediction from raw metagenomic data

10.1101/2020.10.29.360297 ◽

2020 ◽

Author(s):

Maxence Queyrel ◽

Edi Prifti ◽

Jean-Daniel Zucker

Keyword(s):

Dna Sequences ◽

Real Life ◽

Multiple Instance Learning ◽

Disease Classification ◽

Metagenomic Data ◽

Numerical Representation ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

End To End ◽

Bioinformatics Workflows

Analysis of the human microbiome using metagenomic sequencing data has demonstrated a high ability to discriminate between various human diseases. Raw metagenomic sequencing data require multiple complex and computationally heavy bioinformatics steps prior to data analysis. Such data contain millions of short sequences read from the fragmented DNA sequences and are stored as fastq files. Conventional processing pipelines consist of multiple steps, including quality control, filtering, and alignment of sequences against genomic catalogs (genes, species, taxonomic levels, functional pathways, etc.). These pipelines are complex to use, time consuming, and rely on a large number of parameters that often introduce variability and impact the estimation of the microbiome elements. Recent studies have demonstrated that training Deep Neural Networks directly from raw sequencing data is a promising approach to bypass some of the challenges associated with mainstream bioinformatics pipelines. Most of these methods use the concept of word and sentence embeddings that create a meaningful numerical representation of DNA sequences, while extracting features and reducing the dimensionality of the data. In this paper we present an end-to-end approach that classifies patients into disease groups directly from raw metagenomic reads: metagenome2vec. This approach is composed of four steps: (i) generating a vocabulary of k-mers and learning their numerical embeddings; (ii) learning DNA sequence (read) embeddings; (iii) identifying the genome from which the sequence is most likely to come; and (iv) training a multiple instance learning classifier which predicts the phenotype based on the vector representation of the raw data. An attention mechanism is applied in the network so that the model can be interpreted, assigning a weight to the influence of each genome on the prediction.
Using two public real-life datasets as well as a simulated one, we demonstrated that this original approach reached very high performance, comparable with the state-of-the-art methods applied directly on data processed through mainstream bioinformatics workflows. These results are encouraging for this proof-of-concept work. We believe that with further dedication, the DNN models have the potential to surpass mainstream bioinformatics workflows in disease classification tasks.


Application of high-throughput serological tests for controlling zoonotic viral infection in Amazon rainforest and as a strategy to prevent future pandemics (Preprint)

10.2196/preprints.27747 ◽

2021 ◽

Author(s):

Marcela Santiago Pacheco De Azevedo ◽

Daniela Gomes Ribeiro ◽

Felipe Rocha Da Silva Santos ◽

Siomar De Castro Soares ◽

Vasco Ariston De Carvalho Azevedo ◽

...

Keyword(s):

Developing Countries ◽

High Throughput ◽

Low Income ◽

Dna Sequences ◽

Scientific Information ◽

Preventive Health Care ◽

Amazon Rainforest ◽

Low Income Countries ◽

Sequencing Data ◽

Serological Tests

From the bubonic plague in the 14th century to the new Coronavirus disease 2019 (COVID-19), pandemics have profoundly changed how societies function. Infectious disease outbreaks are getting shorter and shorter due to our densely populated cities, global travel, and nature mass exploration. In this regard, there is a particular concern about fires occurring in Brazil's Amazon rainforest, one of the most biodiverse places on earth, which facilitates cross-species transmission, giving rise to the emergence of new virulent pathogens. The situation is further complicated because the Amazon spans eight developing countries with limited preventive health care services. In this perspective, this review highlights the role of new methodologies best suited for epidemiological monitoring in low-income countries, such as high-throughput serological tests. Phage immunoprecipitation sequencing (PhIP-Seq), for example, can evaluate antibody-repertoire binding specificities using oligonucleotide libraries encoding epitopes covering the DNA sequences from all human pathogenic viruses or all arboviruses already described. After incubation with an individual's serum, these libraries can be immunoprecipitated for subsequent analysis by DNA sequencing. The data are then analyzed to reveal the peptides recognized by the antibodies present in the sample. Because the technique has a relatively low cost, its implementation in developing countries is feasible and can generate very interesting scientific information.


Predecessor Search, String Algorithms and Data Structures

Encyclopedia of Algorithms ◽

10.1007/978-3-642-27848-8_632-2 ◽

2015 ◽

pp. 1-8

Author(s):

Djamal Belazzougui

Keyword(s):

Data Structures ◽

Algorithms And Data Structures ◽

String Algorithms ◽

Search String


Genetic and archaeological evidence for a former breeding population of Aleutian Cackling Goose (Branta hutchinsii leucopareia) on Adak Island, central Aleutians, Alaska

Canadian Journal of Zoology ◽

10.1139/z11-027 ◽

2011 ◽

Vol 89 (8) ◽

pp. 732-743 ◽

Cited By ~ 7

Author(s):

B.J. Wilson ◽

S.J. Crockford ◽

J.W. Johnson ◽

R.S. Malhi ◽

B.M. Kemp

Keyword(s):

Mitochondrial Genome ◽

Control Region ◽

Dna Sequences ◽

Archaeological Site ◽

Breeding Population ◽

Cytb Gene ◽

Breeding Ground ◽

Before Present ◽

European Contact ◽

The Past

Many well-preserved bones of medium-sized geese have been recovered from the Zeto Point archaeological site (ADK-011) on Adak Island in the central Aleutians, Alaska, that date to ca. 170–415 years before present based on conventional radiometric dates of the deposits. This prehistoric sample includes remains of adults and unfledged goslings that defied confident identification based on osteological criteria. While the presence of newborns indicates that Adak was a breeding ground, which species was doing the nesting remained uncertain. Of the five species of medium-sized goose (order Anseriformes, family Anatidae) known or presumed to visit Adak Island, three are rarely sighted. The only common visitor is the Emperor Goose (Chen canagica (Sevastianov, 1802)). The Aleutian Cackling Goose (Branta hutchinsii leucopareia (Brandt, 1836)) breeds elsewhere in the Aleutians but does not currently breed on Adak Island, and there are no records of it nesting there in the past. Here, DNA sequences from portions of the cytochrome b (cytb) gene and the control region (CR) of the mitochondrial genome were recovered from 28 of 29 Adak prehistoric goose remains. All adult specimens identified to species were either C. canagica or B. h. leucopareia, but all specifically identified juvenile specimens were B. h. leucopareia. The results demonstrate that Adak Island was a breeding ground of the Aleutian Cackling Goose prior to European contact.


The longest common substring problem

Mathematical Structures in Computer Science ◽

10.1017/s0960129515000110 ◽

2015 ◽

Vol 27 (2) ◽

pp. 277-295 ◽

Cited By ~ 1

Author(s):

MAXIME CROCHEMORE ◽

COSTAS S. ILIOPOULOS ◽

ALESSIO LANGIU ◽

FILIPPO MIGNOSI

Keyword(s):

Data Structures ◽

Dna Sequences ◽

Simple Algorithm ◽

Suffix Trees ◽

Simple Method ◽

Suffix Arrays ◽

Lowest Common Ancestor ◽

Wide Range ◽

Efficient Data ◽

Longest Common Substring

Given a set $\mathcal{D}$ of q documents, the Longest Common Substring (LCS) problem asks, for any integer 2 ⩽ k ⩽ q, for the longest substring that appears in k documents. LCS is a well-studied problem with a wide range of applications in bioinformatics, from microarrays to DNA sequence alignment and analysis. This problem was solved by Hui (2000, International Journal of Computer Science and Engineering 15, 73–76) by using a famous constant-time solution to the Lowest Common Ancestor (LCA) problem in trees coupled with the use of suffix trees. In this article, we present a simple method for solving the LCS problem by using suffix trees (STs) and classical union-find data structures. In turn, we show how this simple algorithm can be adapted to work with other space-efficient data structures such as enhanced suffix arrays (ESA) and the compressed suffix tree.
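The problem statement above can be illustrated with a naive baseline. The sketch below is not the suffix-tree and union-find method of the article, nor Hui's LCA-based solution; it simply enumerates substrings and binary-searches on the answer length. It assumes "appears in k documents" means "appears in at least k documents", and the function names are hypothetical. The binary search is valid under that reading because any substring of a string that occurs in at least k documents also occurs in at least k documents.

```python
from collections import Counter


def common_substring_of_length(docs, k, length):
    """Return a substring of the given length occurring in at least k
    documents (the lexicographically smallest one), or None."""
    counts = Counter()
    for doc in docs:
        # A set ensures each document contributes at most once per substring.
        seen = {doc[i:i + length] for i in range(len(doc) - length + 1)}
        counts.update(seen)
    candidates = sorted(s for s, c in counts.items() if c >= k)
    return candidates[0] if candidates else None


def longest_common_substring(docs, k):
    """Naive LCS for k of q documents via binary search on the length."""
    lo, hi = 0, max((len(d) for d in docs), default=0)
    best = ""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        found = common_substring_of_length(docs, k, mid)
        if found is not None:
            best, lo = found, mid   # a common substring of this length exists
        else:
            hi = mid - 1            # none exists, so try shorter lengths
    return best


docs = ["GATTACA", "ATTACK", "TACO"]
print(longest_common_substring(docs, 2))
print(longest_common_substring(docs, 3))
```

This baseline materializes every substring of each tested length, so its time and memory grow quickly with document size; index structures such as suffix trees and suffix arrays avoid that cost.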


Is Phylotranscriptomics as Reliable as Phylogenomics?

Molecular Biology and Evolution ◽

10.1093/molbev/msaa181 ◽

2020 ◽

Vol 37 (12) ◽

pp. 3672-3683 ◽

Cited By ~ 3

Author(s):

Seongmin Cheon ◽

Jianzhi Zhang ◽

Chungoo Park

Keyword(s):

Genome Sequencing ◽

Dna Sequences ◽

Orthologous Gene ◽

Gene Identification ◽

Sequencing Data ◽

Genome Sequences ◽

Phylogenetic Information ◽

Tissue Of Origin ◽

Phylogenetic Method ◽

Rigorous Method

Phylogenomics, the study of phylogenetic relationships among taxa based on their genome sequences, has emerged as the preferred phylogenetic method because of the wealth of phylogenetic information contained in genome sequences. Genome sequencing, however, can be prohibitively expensive, especially for taxa with huge genomes and when many taxa need sequencing. Consequently, the less costly phylotranscriptomics has seen an increased use in recent years. Phylotranscriptomics reconstructs phylogenies using DNA sequences derived from transcriptomes, which are often orders of magnitude smaller than genomes. However, in the absence of corresponding genome sequences, comparative analyses of transcriptomes can be challenging and it is unclear whether phylotranscriptomics is as reliable as phylogenomics. Here, we respectively compare the phylogenomic and phylotranscriptomic trees of 22 mammals and 15 plants that have both sequenced nuclear genomes and publicly available RNA sequencing data from multiple tissues. We found that phylotranscriptomic analysis can be sensitive to orthologous gene identification. When a rigorous method for identifying orthologs is employed, phylogenomic and phylotranscriptomic trees are virtually identical to each other, regardless of the tissue of origin of the transcriptomes and whether the same tissue is used across species. These findings validate phylotranscriptomics, brighten its prospect, and illustrate the criticality of reliable ortholog detection in such practices.


An Entropy-Based Position Projection Algorithm for Motif Discovery

BioMed Research International ◽

10.1155/2016/9127474 ◽

2016 ◽

Vol 2016 ◽

pp. 1-11 ◽

Cited By ~ 1

Author(s):

Yipu Zhang ◽

Ping Wang ◽

Maode Yan

Keyword(s):

Dna Sequences ◽

Motif Discovery ◽

Optimal Solution ◽

Training Model ◽

Projection Algorithm ◽

Local Optimum ◽

The Past ◽

Local Optimal Solution ◽

And Function ◽

Discovery Algorithms

The motif discovery problem is crucial for understanding the structure and function of gene expression. Over the past decades, many attempts at motif finding using consensus and probability training models have been successful. However, most existing motif discovery algorithms are still time-consuming or easily trapped in a local optimum. To overcome these shortcomings, in this paper we propose an entropy-based position projection algorithm, called EPP, which designs a projection process to divide the dataset and explores the best local optimal solution. The experimental results on real DNA sequences, Tompa data, and ChIP-seq data show that EPP is advantageous in dealing with the motif discovery problem and outperforms current widely used algorithms.
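The abstract does not describe how EPP uses entropy, so the following is not the EPP algorithm. As a hedged illustration of the general kind of quantity an entropy-based motif method can draw on, the sketch computes the Shannon entropy (in bits) of each position across a set of equal-length candidate sites: conserved positions score near 0 and uninformative positions approach 2 bits for DNA. The function name `position_entropies` and the example sites are hypothetical.

```python
import math
from collections import Counter


def position_entropies(sites):
    """Shannon entropy (bits) of each column of equal-length sequences."""
    if not sites:
        return []
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    entropies = []
    for pos in range(width):
        counts = Counter(site[pos] for site in sites)
        total = sum(counts.values())
        # H = sum over observed bases of p * log2(1 / p)
        h = sum((c / total) * math.log2(total / c) for c in counts.values())
        entropies.append(h)
    return entropies


sites = ["ACGT", "ACGA", "ACGC", "ACGG"]
print(position_entropies(sites))
```

In this example the first three columns are perfectly conserved and the last column is uniform over the four bases, so the last column carries the maximum entropy for a four-letter alphabet.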
