Using the Sadakane Compressed Suffix Tree to Solve the All-Pairs Suffix-Prefix Problem

BioMed Research International ◽

10.1155/2014/745298 ◽

2014 ◽

Vol 2014 ◽

pp. 1-11 ◽

Cited By ~ 3

Author(s):

Maan Haj Rachid ◽

Qutaibah Malluhi ◽

Mohamed Abouelhoda

Keyword(s):

Parallel Algorithm ◽

Suffix Tree ◽

De Novo ◽

Efficient Solutions ◽

Experimental Results ◽

Matching Problem ◽

De Novo Genome Assembly ◽

Large Size ◽

String Processing ◽

Compressed Index

The all-pairs suffix-prefix matching problem is a basic problem in string processing. It has an application in de novo genome assembly, one of the major problems in bioinformatics. Because the input data are large, fast and space-efficient solutions are crucial. In this paper, we present a space-economical solution to this problem using the generalized Sadakane compressed suffix tree. We also present a parallel algorithm that provides additional speed on shared-memory computers. Both the sequential and the parallel algorithms are optimized by exploiting features of the Sadakane compressed index data structure. Experimental results show that our solution based on Sadakane’s compressed index consumes significantly less space than solutions based on noncompressed data structures such as the suffix tree and the enhanced suffix array. They also show that our parallel algorithm is efficient and scales well with an increasing number of processors.
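
To make the problem statement above concrete, here is a minimal brute-force sketch in Python. It is not the compressed-suffix-tree method of the paper: it simply computes, for every ordered pair of strings, the length of the longest suffix of the first that is also a prefix of the second. The function names and the sample reads are hypothetical, chosen only for illustration.

```python
from typing import List


def longest_suffix_prefix(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    # Try the longest candidate overlap first and shrink until one matches.
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def all_pairs_suffix_prefix(strings: List[str]) -> List[List[int]]:
    """Overlap matrix: entry [i][j] is the suffix-prefix overlap of
    strings[i] with strings[j]; the diagonal is left at 0 by convention."""
    n = len(strings)
    return [
        [0 if i == j else longest_suffix_prefix(strings[i], strings[j])
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical sample reads, for illustration only.
reads = ["ACGTAC", "TACGGA", "GGATTT"]
overlaps = all_pairs_suffix_prefix(reads)
for row in overlaps:
    print(row)
```

This brute-force version takes time roughly cubic in the string length per pair in the worst case, which is exactly the kind of cost that index-based solutions such as the one described in the abstract aim to avoid on large inputs.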

PARALLEL PATTERN MATCHING WITH SCALING

Parallel Processing Letters ◽

10.1142/s0129626401000476 ◽

2001 ◽

Vol 11 (01) ◽

pp. 125-138 ◽

Cited By ~ 1

Author(s):

H. MONGELLI ◽

S. W. SONG

Keyword(s):

Parallel Algorithms ◽

Parallel Algorithm ◽

Pattern Matching ◽

Computing Time ◽

Parallel Machine ◽

Experimental Results ◽

Coarse Grained ◽

Matching Problem ◽

Communication Round ◽

Parallel Pattern

Given a text and a pattern, the problem of pattern matching consists of determining all the positions of the text where the pattern occurs. When the text and the pattern are matrices, the matching is termed bidimensional. Some variations of this problem allow the match to use a modified pattern; the modification we allow is that the pattern can be scaled. We propose a new parallel algorithm for this problem under the CGM (Coarse Grained Multicomputer) model. The algorithm requires local computing time and memory that are linear in the input size and uses only one communication round, during which at most a linear amount of data is exchanged. To the best of our knowledge, there are no known parallel algorithms for the bidimensional pattern matching problem with scaling in the literature. The proposed algorithm was implemented in C using the PVM interface and was executed on a Parsytec PowerXplorer parallel machine. The experimental results were very promising and showed significant speedups.
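
As an illustration of the problem described above, the following is a minimal sequential sketch in Python, not the CGM algorithm of the paper. It assumes that "scaling" means integer scaling, where a scale factor k replaces every pattern cell by a k-by-k block of the same value; the abstract does not spell out the definition, so this is an assumption. The function names and sample matrices are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def scale(pattern: Matrix, k: int) -> Matrix:
    """Scale a pattern by integer factor k: each cell becomes a k x k block."""
    return [
        [value for value in row for _ in range(k)]
        for row in pattern
        for _ in range(k)
    ]


def find_scaled(text: Matrix, pattern: Matrix) -> List[Tuple[int, int, int]]:
    """Return (k, row, col) for every occurrence of the pattern scaled by k."""
    n, m = len(text), len(text[0])
    p, q = len(pattern), len(pattern[0])
    hits = []
    k = 1
    # Try every scale at which the scaled pattern still fits in the text.
    while p * k <= n and q * k <= m:
        scaled = scale(pattern, k)
        for i in range(n - p * k + 1):
            for j in range(m - q * k + 1):
                # Compare the window row by row against the scaled pattern.
                if all(text[i + r][j:j + q * k] == scaled[r]
                       for r in range(p * k)):
                    hits.append((k, i, j))
        k += 1
    return hits


# Hypothetical sample data, for illustration only.
pattern = [[1, 0],
           [0, 1]]
text = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1]]
print(find_scaled(text, pattern))
```

The brute-force search rechecks every window at every scale, so it is far from the linear bounds claimed in the abstract; it only serves to pin down what an occurrence of a scaled pattern is.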

nanotatoR: a tool for enhanced annotation of genomic structural variants

BMC Genomics ◽

10.1186/s12864-020-07182-w ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Surajit Bhattacharya ◽

Hayk Barseghyan ◽

Emmanuèle C. Délot ◽

Eric Vilain

Keyword(s):

De Novo ◽

Genome Mapping ◽

Gene List ◽

Sufficient Information ◽

Rna Seq ◽

Structural Variants ◽

De Novo Genome Assembly ◽

Pathogenic Variants ◽

Increased Sensitivity

Abstract Background Whole genome sequencing is effective at identification of small variants, but because it is based on short reads, assessment of structural variants (SVs) is limited. The advent of Optical Genome Mapping (OGM), which utilizes long fluorescently labeled DNA molecules for de novo genome assembly and SV calling, has allowed for increased sensitivity and specificity in SV detection. However, compared to small variant annotation tools, OGM-based SV annotation software has seen little development, and currently available SV annotation tools do not provide sufficient information for determination of variant pathogenicity. Results We developed an R-based package, nanotatoR, which provides comprehensive annotation as a tool for SV classification. nanotatoR uses both external (DGV; DECIPHER; Bionano Genomics BNDB) and internal (user-defined) databases to estimate SV frequency. Human genome reference GRCh37/38-based BED files are used to annotate SVs with overlapping, upstream, and downstream genes. Overlap percentages and distances for nearest genes are calculated and can be used for filtration. A primary gene list is extracted from public databases based on the patient’s phenotype and used to filter genes overlapping SVs, providing the analyst with an easy way to prioritize variants. If available, expression of overlapping or nearby genes of interest is extracted (e.g. from an RNA-Seq dataset, allowing the user to assess the effects of SVs on the transcriptome). Most quality-control filtration parameters are customizable by the user. The output is given in an Excel file format, subdivided into multiple sheets based on SV type and inheritance pattern (INDELs, inversions, translocations, de novo, etc.). nanotatoR passed all quality and run time criteria of Bioconductor, where it was accepted in the April 2019 release. 
We evaluated nanotatoR’s annotation capabilities using publicly available reference datasets: the singleton sample NA12878, mapped with two types of enzyme labeling, and the NA24143 trio. nanotatoR was also able to accurately filter the known pathogenic variants in a cohort of patients with Duchenne Muscular Dystrophy for which we had previously demonstrated the diagnostic ability of OGM. Conclusions The extensive annotation enables users to rapidly identify potential pathogenic SVs, a critical step toward use of OGM in the clinical setting.

Genome sequences reveal global dispersal routes and suggest convergent genetic adaptations in seahorse evolution

Nature Communications ◽

10.1038/s41467-021-21379-x ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Chunyan Li ◽

Melisa Olave ◽

Yali Hou ◽

Geng Qin ◽

Ralf F. Schneider ◽

...

Keyword(s):

Coastal Waters ◽

Genetic Basis ◽

De Novo ◽

Ocean Currents ◽

Rapid Adaptation ◽

Developmental Gene ◽

De Novo Genome Assembly ◽

Biogeographic Patterns ◽

Hippocampus Erectus ◽

New Habitats

Abstract Seahorses have a circum-global distribution in tropical to temperate coastal waters. Yet, seahorses show many adaptations for a sedentary, cryptic lifestyle: they require specific habitats, such as seagrass, kelp or coral reefs, lack pelvic and caudal fins, and give birth to directly developed offspring without pronounced pelagic larval stage, rendering long-range dispersal by conventional means inefficient. Here we investigate seahorses’ worldwide dispersal and biogeographic patterns based on a de novo genome assembly of Hippocampus erectus as well as 358 re-sequenced genomes from 21 species. Seahorses evolved in the late Oligocene and subsequent circum-global colonization routes are identified and linked to changing dynamics in ocean currents and paleo-temporal seaway openings. Furthermore, the genetic basis of the recurring “bony spines” adaptive phenotype is linked to independent substitutions in a key developmental gene. Analyses thus suggest that rafting via ocean currents compensates for poor dispersal and rapid adaptation facilitates colonizing new habitats.

Genomic analyses unveil helmeted guinea fowl (Numida meleagris) domestication in West Africa

Genome Biology and Evolution ◽

10.1093/gbe/evab090 ◽

2021 ◽

Author(s):

Quan-Kuan Shen ◽

Min-Sheng Peng ◽

Adeniyi C Adeola ◽

Ling Kui ◽

Shengchang Duan ◽

...

Keyword(s):

West Africa ◽

De Novo ◽

Guinea Fowl ◽

Chromatin Interaction ◽

De Novo Genome Assembly ◽

Numida Meleagris ◽

Whole Genomes ◽

Genomic Analyses ◽

Related Wild Species ◽

Wild Progenitors

Abstract Domestication of the helmeted guinea fowl (HGF; Numida meleagris) in Africa remains elusive. Here we report a high-quality de novo genome assembly for domestic HGF generated by long and short-reads sequencing together with optical and chromatin interaction mapping. Using this assembly as the reference, we performed population genomic analyses for newly sequenced whole-genomes for 129 birds from Africa, Asia, and Europe, including domestic animals (n = 89), wild progenitors (n = 34), and their closely related wild species (n = 6). Our results reveal domestication of HGF in West Africa around 1,300-5,500 years ago. Scanning for selective signals characterized the functional genes in behavior and locomotion changes involved in domestication of HGF. The pleiotropy and linkage in genes affecting plumage color and fertility were revealed in the recent breeding of Italian domestic HGF. In addition to presenting a missing piece to the jigsaw puzzle of domestication in poultry, our study provides valuable genetic resources for researchers and breeders to improve production in this species.

De Novo Genome Assembly of Limpet Bathyacmaea lactea (Gastropoda: Pectinodontidae): The First Reference Genome of a Deep-Sea Gastropod Endemic to Cold Seeps

Genome Biology and Evolution ◽

10.1093/gbe/evaa100 ◽

2020 ◽

Vol 12 (6) ◽

pp. 905-910 ◽

Cited By ~ 2

Author(s):

Ruoyu Liu ◽

Kun Wang ◽

Jun Liu ◽

Wenjie Xu ◽

Yang Zhou ◽

...

Keyword(s):

Deep Sea ◽

Metal Ion ◽

De Novo ◽

Demographic History ◽

Gene Families ◽

Phylogenetic Position ◽

Cold Seeps ◽

Nitrogen And Phosphorus ◽

De Novo Genome Assembly ◽

A Genome

Abstract Cold seeps, characterized by methane, hydrogen sulfide, and other hydrocarbon chemicals, foster one of the most widespread chemosynthetic ecosystems in the deep sea, densely populated by specialized benthos. However, scarce genomic resources severely limit our knowledge about the origin and adaptation of life in this unique ecosystem. Here, we present a genome of the deep-sea limpet Bathyacmaea lactea, a common species associated with the dominant mussel beds in cold seeps. We generated 54.6 gigabases (Gb) of Nanopore reads and 77.9 Gb of BGI-seq raw reads. Assembly produced a 754.3-Mb genome for B. lactea, with 3,720 contigs and a contig N50 of 1.57 Mb, covering 94.3% of metazoan Benchmarking Universal Single-Copy Orthologs. In total, 23,574 protein-coding genes and 463.4 Mb of repetitive elements were identified. We analyzed the phylogenetic position, substitution rate, demographic history, and TE activity of B. lactea. We also identified 80 expanded gene families and 87 rapidly evolving Gene Ontology categories in the B. lactea genome. Many of these genes were associated with heterocyclic compound metabolism, membrane-bounded organelle, metal ion binding, and nitrogen and phosphorus metabolism. The high-quality assembly and in-depth characterization suggest the B. lactea genome will serve as an essential resource for understanding the origin and adaptation of life in the cold seeps.
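
The contig N50 quoted above is a standard assembly statistic: the contig length L such that contigs of length at least L together cover at least half of the total assembly. A minimal Python sketch of that calculation follows; the function name and the toy contig lengths are hypothetical and are not taken from the B. lactea assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("N50 is undefined for an empty assembly")
    total = sum(ordered)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    return ordered[-1]  # Not reached for positive lengths; kept for safety.


# Toy contig lengths, for illustration only.
print(n50([80, 70, 50, 40, 30, 20]))
```

In the toy example the total length is 290, and the two longest contigs (80 and 70) already cover 150, which is more than half, so the N50 is 70.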

Improved hybrid de novo genome assembly of domesticated apple (Malus x domestica)

GigaScience ◽

10.1186/s13742-016-0139-0 ◽

2016 ◽

Vol 5 (1) ◽

Cited By ~ 28

Author(s):

Xuewei Li ◽

Ling Kui ◽

Jing Zhang ◽

Yinpeng Xie ◽

Liping Wang ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Malus X Domestica ◽

De Novo Genome Assembly

Meraculous: De Novo Genome Assembly with Short Paired-End Reads

PLoS ONE ◽

10.1371/journal.pone.0023501 ◽

2011 ◽

Vol 6 (8) ◽

pp. e23501 ◽

Cited By ~ 107

Author(s):

Jarrod A. Chapman ◽

Isaac Ho ◽

Sirisha Sunkara ◽

Shujun Luo ◽

Gary P. Schroth ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

De Novo Genome Assembly

Ultra Efficient Acceleration for De Novo Genome Assembly via Near-Memory Computing

10.1109/pact52795.2021.00022 ◽

2021 ◽

Author(s):

Minxuan Zhou ◽

Lingxi Wu ◽

Muzhou Li ◽

Niema Moshiri ◽

Kevin Skadron ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

De Novo Genome Assembly

De novo Genome Assembly from Next-Generation Sequencing (NGS) Reads

Next-Generation Sequencing Data Analysis ◽

10.1201/b19532-11 ◽

2016 ◽

pp. 144-155

Keyword(s):

Next Generation Sequencing ◽

Genome Assembly ◽

De Novo ◽

Next Generation ◽

De Novo Genome Assembly ◽

Next Generation Sequencing Ngs ◽

Generation Sequencing

Optimizing de novo genome assembly from PCR-amplified metagenomes

PeerJ ◽

10.7717/peerj.6902 ◽

2019 ◽

Vol 7 ◽

pp. e6902 ◽

Cited By ~ 9

Author(s):

Simon Roux ◽

Gareth Trubl ◽

Danielle Goudeau ◽

Nandita Nath ◽

Estelle Couradeau ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Pcr Amplification ◽

Error Rates ◽

De Novo Genome Assembly ◽

Low Input ◽

Assembly Algorithm ◽

Coverage Bias ◽

Size Number ◽

Assembly Pipeline

Background Metagenomics has transformed our understanding of microbial diversity across ecosystems, with recent advances enabling de novo assembly of genomes from metagenomes. These metagenome-assembled genomes are critical to provide ecological, evolutionary, and metabolic context for all the microbes and viruses yet to be cultivated. Metagenomes can now be generated from nanogram to subnanogram amounts of DNA. However, these libraries require several rounds of PCR amplification before sequencing, and recent data suggest these typically yield smaller and more fragmented assemblies than regular metagenomes. Methods Here we evaluate de novo assembly methods of 169 PCR-amplified metagenomes, including 25 for which an unamplified counterpart is available, to optimize specific assembly approaches for PCR-amplified libraries. We first evaluated coverage bias by mapping reads from PCR-amplified metagenomes onto reference contigs obtained from unamplified metagenomes of the same samples. Then, we compared different assembly pipelines in terms of assembly size (number of bp in contigs ≥ 10 kb) and error rates to evaluate which are the best suited for PCR-amplified metagenomes. Results Read mapping analyses revealed that the depth of coverage within individual genomes is significantly more uneven in PCR-amplified datasets versus unamplified metagenomes, with regions of high depth of coverage enriched in short inserts. This enrichment scales with the number of PCR cycles performed, and is presumably due to preferential amplification of short inserts. Standard assembly pipelines are confounded by this type of coverage unevenness, so we evaluated other assembly options to mitigate these issues. 
We found that a pipeline combining read deduplication and an assembly algorithm originally designed to recover genomes from libraries generated after whole genome amplification (single-cell SPAdes) frequently improved assembly of contigs ≥10 kb by 10 to 100-fold for low input metagenomes. Conclusions PCR-amplified metagenomes have enabled scientists to explore communities traditionally challenging to describe, including some with extremely low biomass or from which DNA is particularly difficult to extract. Here we show that a modified assembly pipeline can lead to an improved de novo genome assembly from PCR-amplified datasets, and enables a better genome recovery from low input metagenomes.
