Strategies for the bioinformatic treatment of high-throughput sequencing data for diatom studies and bioassessment

Comparison of three clustering approaches for detecting novel environmental microbial diversity

10.7287/peerj.preprints.1414 ◽

2015 ◽

Author(s):

Dominik Forster ◽

Micah Dunthorn ◽

Thorsten Stoeck ◽

Frédéric Mahé

Keyword(s):

Graph Theory ◽

Microbial Ecology ◽

High Throughput Sequencing ◽

De Novo ◽

Sequence Similarity ◽

Pairwise Alignment ◽

Reference Database ◽

Clustering Methods ◽

Underlying Network ◽

Network Topologies

Discovery of novel diversity in high-throughput sequencing (HTS) studies is a central task in environmental microbial ecology. To evaluate the effects that amplicon clustering methods have on novel diversity discovery, we clustered an environmental marine HTS dataset of protist reads together with accessions from the taxonomically curated PR2 reference database using three de novo approaches: sequence similarity networks, USEARCH, and Swarm. The novel diversity uncovered by each clustering approach differed drastically in the number of operational taxonomic units (OTUs) and the number of environmental amplicons in these novel diversity OTUs. Global pairwise alignment comparisons revealed that numerous amplicons classified as novel by USEARCH and Swarm were actually highly similar to reference accessions. Using graph theory we found additional novel diversity within OTUs that would have gone unnoticed without further use of their underlying network topologies. Our results suggest that novel diversity inferred from clustering approaches requires further validation, whereas graph theory provides a powerful tool for microbial ecology and the analysis of environmental HTS datasets.


Comparison of three clustering approaches for detecting novel environmental microbial diversity

10.7287/peerj.preprints.1414v1 ◽

2015 ◽

Author(s):

Dominik Forster ◽

Micah Dunthorn ◽

Thorsten Stoeck ◽

Frédéric Mahé

Keyword(s):

Graph Theory ◽

Microbial Ecology ◽

High Throughput Sequencing ◽

De Novo ◽

Sequence Similarity ◽

Pairwise Alignment ◽

Reference Database ◽

Clustering Methods ◽

Underlying Network ◽

Network Topologies

Discovery of novel diversity in high-throughput sequencing (HTS) studies is a central task in environmental microbial ecology. To evaluate the effects that amplicon clustering methods have on novel diversity discovery, we clustered an environmental marine HTS dataset of protist reads together with accessions from the taxonomically curated PR2 reference database using three de novo approaches: sequence similarity networks, USEARCH, and Swarm. The novel diversity uncovered by each clustering approach differed drastically in the number of operational taxonomic units (OTUs) and the number of environmental amplicons in these novel diversity OTUs. Global pairwise alignment comparisons revealed that numerous amplicons classified as novel by USEARCH and Swarm were actually highly similar to reference accessions. Using graph theory we found additional novel diversity within OTUs that would have gone unnoticed without further use of their underlying network topologies. Our results suggest that novel diversity inferred from clustering approaches requires further validation, whereas graph theory provides a powerful tool for microbial ecology and the analysis of environmental HTS datasets.


Exact sequence variants should replace operational taxonomic units in marker gene data analysis

10.1101/113597 ◽

2017 ◽

Cited By ~ 7

Author(s):

Benjamin J Callahan ◽

Paul J McMurdie ◽

Susan P Holmes

Keyword(s):

De Novo ◽

Marker Gene ◽

Taxonomic Resolution ◽

Reference Database ◽

Sequence Variants ◽

Sequencing Data ◽

Operational Taxonomic Units ◽

The Status ◽

Reference Databases ◽

Gene Data

Recent advances have made it possible to analyze high-throughput marker-gene sequencing data without resorting to the customary construction of molecular operational taxonomic units (OTUs): clusters of sequencing reads that differ by less than a fixed dissimilarity threshold. New methods control errors sufficiently that sequence variants (SVs) can be resolved exactly, down to the level of single-nucleotide differences over the sequenced gene region. The benefits of finer taxonomic resolution are immediately apparent, and arguments for SV methods have focused on their improved resolution. Less obvious, but we believe more important, are the broad benefits deriving from the status of SVs as consistent labels with intrinsic biological meaning identified independently from a reference database. Here we discuss how those features grant SVs the combined advantages of closed-reference OTUs (including computational costs that scale linearly with study size, simple merging between independently processed datasets, and forward prediction) and of de novo OTUs (including accurate diversity measurement and applicability to communities lacking deep coverage in reference databases). We argue that the improvements in reusability, reproducibility and comprehensiveness are sufficiently great that SVs should replace OTUs as the standard unit of marker gene analysis and reporting.
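The contrast between threshold-based OTUs and exact sequence variants can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the function names, the 3% threshold, the mismatch-fraction dissimilarity on equal-length reads, and the greedy abundance-ordered clustering are not taken from the abstract. The sketch also omits the error-control (denoising) step that real SV methods rely on, so `exact_variants` simply counts distinct sequences.

```python
from collections import Counter

def dissimilarity(a: str, b: str) -> float:
    """Fraction of mismatched positions between two equal-length reads."""
    if len(a) != len(b):
        raise ValueError("this toy example only compares equal-length reads")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def greedy_otus(reads, threshold=0.03):
    """Greedy clustering (assumed scheme): abundant unique reads seed OTUs first."""
    centroids = []   # one representative sequence per OTU
    otus = {}        # centroid sequence -> total number of reads in the OTU
    for seq, n in Counter(reads).most_common():
        for c in centroids:
            if dissimilarity(seq, c) < threshold:
                otus[c] += n          # absorbed into an existing OTU
                break
        else:
            centroids.append(seq)     # no centroid close enough: new OTU
            otus[seq] = n
    return otus

def exact_variants(reads):
    """Every distinct sequence keeps its own label (no denoising in this toy)."""
    return dict(Counter(reads))

# Hypothetical toy data: a 40-nt sequence, a one-mismatch variant, a distant one.
base = "ACGT" * 10
variant = base[:-1] + "A"            # 1 mismatch in 40 positions = 2.5%
distant = "TTTT" * 10
reads = [base] * 5 + [variant] * 3 + [distant] * 2

otu_table = greedy_otus(reads)
sv_table = exact_variants(reads)
```

In this toy case the one-mismatch variant is absorbed into the OTU seeded by `base`, whereas the variant table keeps it under its own sequence label, which is what makes such labels comparable across independently processed datasets.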


Comparison of three clustering approaches for detecting novel environmental microbial diversity

PeerJ ◽

10.7717/peerj.1692 ◽

2016 ◽

Vol 4 ◽

pp. e1692 ◽

Cited By ~ 16

Author(s):

Dominik Forster ◽

Micah Dunthorn ◽

Thorsten Stoeck ◽

Frédéric Mahé

Keyword(s):

Microbial Ecology ◽

High Throughput ◽

High Throughput Sequencing ◽

De Novo ◽

Sequence Similarity ◽

Pairwise Alignment ◽

Clustering Methods ◽

Sequencing Studies ◽

Similarity Networks ◽

Sequence Similarity Networks

Discovery of novel diversity in high-throughput sequencing studies is an important aspect in environmental microbial ecology. To evaluate the effects that amplicon clustering methods have on the discovery of novel diversity, we clustered an environmental marine high-throughput sequencing dataset of protist amplicons together with reference sequences from the taxonomically curated Protist Ribosomal Reference (PR2) database using three de novo approaches: sequence similarity networks, USEARCH, and Swarm. The potentially novel diversity uncovered by each clustering approach differed drastically in the number of operational taxonomic units (OTUs) and in the number of environmental amplicons in these novel diversity OTUs. Global pairwise alignment comparisons revealed that numerous amplicons classified as potentially novel by USEARCH and Swarm were more than 97% similar to references of PR2. Using shortest path analyses on sequence similarity network OTUs and Swarm OTUs we found additional novel diversity within OTUs that would have gone unnoticed without further exploiting their underlying network topologies. These results demonstrate that graph theory provides powerful tools for microbial ecology and the analysis of environmental high-throughput sequencing datasets. Furthermore, sequence similarity networks were most accurate in delineating novel diversity from previously discovered diversity.
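The idea of a shortest path analysis on a sequence similarity network can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the identity measure (fraction of matching positions in equal-length toy sequences), the 0.85 edge threshold, the sequence names, and the helper functions are all hypothetical.

```python
from collections import deque
from itertools import combinations

def identity(a: str, b: str) -> float:
    """Fraction of identical positions; assumes equal-length, pre-aligned sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def build_network(seqs: dict, threshold: float) -> dict:
    """Connect two sequence ids when their identity is at least `threshold`."""
    graph = {name: set() for name in seqs}
    for u, v in combinations(seqs, 2):
        if identity(seqs[u], seqs[v]) >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph

def hops_to_reference(graph: dict, start: str, references: set):
    """Breadth-first search: number of edges from `start` to the closest reference.

    Returns None when no reference is reachable, which in this toy reading
    marks the amplicon as a candidate for novel diversity.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node in references:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# Hypothetical toy data: one reference accession and three environmental amplicons.
seqs = {
    "ref1": "A" * 20,
    "env1": "A" * 18 + "CC",    # 90% identical to ref1
    "env2": "A" * 16 + "CCCC",  # 80% identical to ref1, 90% identical to env1
    "env3": "G" * 20,           # unrelated to everything else
}
graph = build_network(seqs, threshold=0.85)
distances = {name: hops_to_reference(graph, name, {"ref1"}) for name in seqs}
```

Here `env2` sits in the same connected component as the reference but is two edges away from it, the kind of within-OTU distance that a component-level summary alone would hide.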


Learning sequence patterns of AGO-sRNA affinity from high-throughput sequencing libraries to improve in silico functional small RNA detection and classification in plants

10.1101/173575 ◽

2017 ◽

Cited By ~ 1

Author(s):

Lionel Morgado ◽

Ritsert C. Jansen ◽

Frank Johannes

Keyword(s):

Small Rna ◽

High Throughput Sequencing ◽

De Novo ◽

Support Vector ◽

Sequencing Data ◽

Learning Sequence ◽

Rna Detection ◽

Binding Motifs ◽

Regulatory Pathways ◽

Vector Machines

The loading of small RNA (sRNA) into Argonaute (AGO) complexes is a crucial step in all regulatory pathways identified so far in plants that depend on such non-coding sequences. Important transcriptional and post-transcriptional silencing mechanisms can be activated depending on the specific AGO protein to which sRNA bind. It is known that sRNA-AGO associations are at least partly encoded in the sRNA primary structure, but the sequence features that drive this association have not been fully explored. Here we train support vector machines (SVMs) on sRNA sequencing data obtained from AGO-immunoprecipitation experiments to identify features that determine sRNA affinity to specific AGOs. Our SVMs reveal that AGO affinity is strongly determined by complex k-mers in the 5’ and 3’ ends of sRNA, in addition to well-known features such as sRNA length and the base composition of the first nucleotide. Moreover, we find that these k-mers tend to overlap known transcription factor (TF) binding motifs, thus highlighting a close interplay between TF and sRNA-mediated transcriptional regulation. We embedded the learned SVMs in a computational pipeline that can be used for de novo functional classification of sRNA sequences. This tool, called SAILS, is provided as a web portal accessible at http://sails.eu.nu.
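The feature types named in the abstract (sRNA length, identity of the first nucleotide, and k-mers in the 5' and 3' ends) could be encoded roughly as below. This is a hedged sketch only: the function name `srna_features`, the choice of k = 3, the 8-nt end windows, and the feature-name scheme are assumptions, and the example sequence is an arbitrary toy string. SAILS' actual encoding is not described in the abstract.

```python
from collections import Counter

def srna_features(seq: str, k: int = 3, window: int = 8) -> dict:
    """Encode one sRNA as sparse features (assumed scheme, for illustration).

    Features: total length, first nucleotide, and counts of k-mers found in
    the first and last `window` nucleotides.
    """
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper().replace("T", "U")   # accept DNA-style input
    feats = Counter()
    feats["length"] = len(seq)
    feats[f"first_nt={seq[0]}"] = 1
    five_prime, three_prime = seq[:window], seq[-window:]
    for i in range(len(five_prime) - k + 1):
        feats[f"5p_{five_prime[i:i + k]}"] += 1
    for i in range(len(three_prime) - k + 1):
        feats[f"3p_{three_prime[i:i + k]}"] += 1
    return dict(feats)

# Arbitrary 21-nt toy sequence, not taken from the study.
example = srna_features("TGACAGAAGAGAGTGAGCACA")
```

Dictionaries of this kind would then need to be converted to fixed-length vectors before being passed to an SVM implementation; that step is left out here.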


Analysis of the Genomic Basis of Functional Diversity in Dinoflagellates using a Transcriptome-Based Sequence Similarity Network

10.1101/211243 ◽

2017 ◽

Author(s):

Arnaud Meng ◽

Erwan Corre ◽

Ian Probert ◽

Andres Gutierrez-Rodriguez ◽

Raffaele Siano ◽

...

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Sequence Similarity ◽

Connected Components ◽

Similarity Network ◽

Network Analyses ◽

Comprehensive Picture ◽

Core Proteome ◽

Functional Features ◽

Genomic Basis

Dinoflagellates are one of the most abundant and functionally diverse groups of eukaryotes. Despite an overall scarcity of genomic information for dinoflagellates, constantly emerging high-throughput sequencing resources can be used to characterize and compare these organisms. We assembled de novo and processed 46 dinoflagellate transcriptomes and used a sequence similarity network (SSN) to compare the underlying genomic basis of functional features within the group. This approach constitutes the most comprehensive picture to date of the genomic potential of dinoflagellates. A core proteome composed of 252 connected components (CCs) of putative conserved protein domains (pCDs) was identified. Of these, 206 were novel and 16 lacked any functional annotation in public databases. Integration of functional information in our network analyses allowed investigation of pCDs specifically associated with functional traits. With respect to toxicity, sequences homologous to those of proteins involved in toxin biosynthesis pathways (e.g. sxtA1-4 and sxtG) were not specific to known toxin-producing species. Although not fully specific to symbiosis, the most represented functions associated with proteins involved in the symbiotic trait were related to membrane processes and ion transport. Overall, our SSN approach led to the identification of 45,207 and 90,794 specific and constitutive pCDs of the toxic and symbiotic species, respectively, represented in our analyses. Of these, 56% and 57% respectively (i.e. 25,393 and 52,193 pCDs) completely lacked annotation in public databases. This stresses the extent of our lack of knowledge, while emphasizing the potential of SSNs to identify candidate pCDs for further functional genomic characterization.
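Connected components of a similarity network can be extracted with a simple union-find pass, sketched below. Note the assumptions: the node naming scheme (`species|protein`), the toy edge list, and in particular the definition of a "core" component as one containing sequences from every species are illustrative guesses; the abstract does not state how the core proteome was defined.

```python
from collections import defaultdict

def connected_components(edges):
    """Union-find over an edge list; returns a list of sets of node ids.

    Nodes that never appear in an edge (singletons) are not reported.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())

def core_components(components, all_species):
    """Keep components with members from every species (assumed 'core' definition)."""
    wanted = set(all_species)
    return [cc for cc in components
            if {node.split("|")[0] for node in cc} == wanted]

# Hypothetical similarity edges between protein domains of three species.
edges = [
    ("sp1|a", "sp2|a"), ("sp2|a", "sp3|a"),   # present in all three species
    ("sp1|b", "sp2|b"),                       # shared by two species only
    ("sp3|c", "sp3|d"),                       # specific to one species
]
components = connected_components(edges)
core = core_components(components, {"sp1", "sp2", "sp3"})
```

The same component list can be filtered the other way round, for example keeping components restricted to a subset of species, to look for trait-specific candidates.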


Characterization of the transcriptome and EST-SSR development in Boea clarkeana, a desiccation-tolerant plant endemic to China

10.7287/peerj.preprints.2603v1 ◽

2016 ◽

Author(s):

Ying Wang ◽

Kun Liu ◽

De Bi ◽

Biao Shou Zhou ◽

Wen Jian Shao

Keyword(s):

De Novo ◽

Gene Annotation ◽

Sequence Similarity ◽

Molecular Study ◽

Sequence Information ◽

Sequencing Data ◽

Protein Database ◽

Illumina Hiseq ◽

Significant Similarity ◽

Assembly Technology

Background. Resurrection plants constitute a unique cadre within angiosperms. Boea clarkeana Hemsl. (Boea, Gesneriaceae) is a desiccation-tolerant dicotyledonous herb that is endemic to China. Although research on angiosperms with desiccation tolerance (DT) could be instructive for crops, genomic resources for B. clarkeana remain scarce. In addition, transcriptome sequencing could be an effective way to study desiccation-tolerant plants. Methods. In the present study, we used the Illumina HiSeq™ 2000 platform and de novo assembly technology to obtain leaf transcriptomes of B. clarkeana and conducted a BLASTX alignment of the sequencing data against protein databases for sequence classification and annotation. Then, based on the sequence information obtained, we developed EST-SSR markers by means of EST-SSR mining, primer design and polymorphism identification. Results. A total of 91,449 unigenes were generated from the leaf cDNA library of B. clarkeana in this study. Based on a sequence similarity search with a known protein database, 72,087 unigenes were annotated. Among the annotated unigenes, a total of 71,170 unigenes showed significant similarity to known proteins of 463 popular model species in the Nr database, and 59,962 unigenes and 32,336 unigenes were assigned to GO classifications and COG, respectively. In addition, 44,924 unigenes were mapped in 128 KEGG pathways. Furthermore, a total of 7,610 unigenes with 8,563 microsatellites were found. Seventy-four primer pairs were selected from 436 primer pairs designed for polymorphism validation. SSRs with higher polymorphism rates were concentrated on dinucleotides, pentanucleotides and hexanucleotides. Finally, 17 pairs with highly polymorphic and stable loci were selected for polymorphism screening. There were a total of 65 alleles, with 2–6 alleles at each locus. Mainly due to the unique biological characteristics of plants, the HE, HO and PIC per locus were very low, ranging from 0 to 0.196, 0.082 to 0.14 and 0 to 0.155, respectively. Discussion. A substantial fraction of the transcriptome sequences of B. clarkeana was generated in this study, which is the first molecular-level analysis of this plant. These sequences are valuable resources for gene annotation and discovery and molecular marker development. These sequences could also provide a valuable basis for the future molecular study of B. clarkeana.
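The SSR mining step can be approximated with a regular-expression scan for tandemly repeated motifs of 2 to 6 nt. The sketch below is illustrative only: the minimum repeat counts in `MIN_REPEATS`, the function name, and the toy sequence are assumptions, and the abstract does not say which tool or thresholds the authors used.

```python
import re

# Assumed minimum number of tandem repeats per motif length (2 to 6 nt).
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}

def find_ssrs(seq: str):
    """Return sorted (start, motif, repeat_count) tuples for perfect microsatellites."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # One captured motif followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AAA", "ATAT"),
            # so each locus is reported under its shortest motif only.
            if any(unit == unit[:d] * (unit_len // d)
                   for d in range(1, unit_len) if unit_len % d == 0):
                continue
            hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return sorted(hits)

# Hypothetical unigene fragment containing an (AT)7 and a (GAA)5 repeat.
toy_unigene = "GGCC" + "AT" * 7 + "CGTTC" + "GAA" * 5 + "TTGC"
ssrs = find_ssrs(toy_unigene)
```

Flanking sequence around each reported start position would then be the input for primer design, which is outside the scope of this sketch.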


Phylogenetic relationships of endornaviruses in common bean from the western highlands of Kenya and global sequences

10.7287/peerj.preprints.26904 ◽

2018 ◽

Author(s):

James M Wainaina ◽

Elijah Ateka ◽

Timothy Makori ◽

Monica A Kehoe ◽

Laura M Boykin

Keyword(s):

Phaseolus Vulgaris ◽

Complete Genome ◽

High Throughput Sequencing ◽

De Novo ◽

Current Knowledge ◽

Sequence Similarity ◽

Evolutionary Relationship ◽

Sub Saharan Africa ◽

Complete Genomes ◽

Sub Saharan

Background: Endornaviruses are non-pathogenic viruses with a global distribution that infect multiple agriculturally important crops, including legumes. However, complete genomes of endornaviruses from legumes are lacking, in particular from the sub-Saharan region. In this study, we report the first complete genomes of PvEV-1 and PvEV-2 and the evolutionary relationships of these genomes. Methods: Symptomatic common beans (Phaseolus vulgaris) showing Bean common mosaic necrosis virus (BCMNV) symptoms were collected from Vihiga county, in the western highlands of Kenya, during field surveys in the region. High-throughput sequencing (RNA-Seq) was carried out on total RNA isolated from symptomatic leaf samples. Subsequently, de novo assembly and reference mapping were carried out to obtain the complete genomes of PvEV-1 and PvEV-2. Results: We identified the complete genomes of Phaseolus vulgaris endornavirus 1 and 2 (PvEV-1 and PvEV-2) from sub-Saharan Africa (SSA). The average genome size of PvEV-1 was ~13,890 nucleotides (nt) while that of PvEV-2 was ~14,698 nt, each encoding a single open reading frame (ORF). Single ORFs ranged from 4,632 to 4,633 aa in PvEV-1 and from 4,899 to 4,954 aa in PvEV-2. Both ORFs encoded the RNA-dependent RNA polymerase (RdRP) gene. The percentage sequence similarity between PvEV-1 and PvEV-2 from this study and GenBank sequences was 29% to 99%. Bayesian phylogenetic analysis resolved two well-supported monophyletic clades, with isolates from this study clustering with sequences from Brazil. Discussion: This study provides the first insights into the evolutionary relationships of PvEV from SSA and contributes towards filling the current knowledge gaps on endornaviruses.


Phylogenetic relationships of endornaviruses in common bean from the western highlands of Kenya and global sequences

10.7287/peerj.preprints.26904v1 ◽

2018 ◽

Author(s):

James M Wainaina ◽

Elijah Ateka ◽

Timothy Makori ◽

Monica A Kehoe ◽

Laura M Boykin

Keyword(s):

Phaseolus Vulgaris ◽

Complete Genome ◽

High Throughput Sequencing ◽

De Novo ◽

Current Knowledge ◽

Sequence Similarity ◽

Evolutionary Relationship ◽

Sub Saharan Africa ◽

Complete Genomes ◽

Sub Saharan

Background: Endornaviruses are non-pathogenic viruses with a global distribution that infect multiple agriculturally important crops, including legumes. However, complete genomes of endornaviruses from legumes are lacking, in particular from the sub-Saharan region. In this study, we report the first complete genomes of PvEV-1 and PvEV-2 and the evolutionary relationships of these genomes. Methods: Symptomatic common beans (Phaseolus vulgaris) showing Bean common mosaic necrosis virus (BCMNV) symptoms were collected from Vihiga county, in the western highlands of Kenya, during field surveys in the region. High-throughput sequencing (RNA-Seq) was carried out on total RNA isolated from symptomatic leaf samples. Subsequently, de novo assembly and reference mapping were carried out to obtain the complete genomes of PvEV-1 and PvEV-2. Results: We identified the complete genomes of Phaseolus vulgaris endornavirus 1 and 2 (PvEV-1 and PvEV-2) from sub-Saharan Africa (SSA). The average genome size of PvEV-1 was ~13,890 nucleotides (nt) while that of PvEV-2 was ~14,698 nt, each encoding a single open reading frame (ORF). Single ORFs ranged from 4,632 to 4,633 aa in PvEV-1 and from 4,899 to 4,954 aa in PvEV-2. Both ORFs encoded the RNA-dependent RNA polymerase (RdRP) gene. The percentage sequence similarity between PvEV-1 and PvEV-2 from this study and GenBank sequences was 29% to 99%. Bayesian phylogenetic analysis resolved two well-supported monophyletic clades, with isolates from this study clustering with sequences from Brazil. Discussion: This study provides the first insights into the evolutionary relationships of PvEV from SSA and contributes towards filling the current knowledge gaps on endornaviruses.


A benchmarking of human mitochondrial DNA haplogroup classifiers from whole-genome and whole-exome sequence data

10.1101/2021.02.11.430775 ◽

2021 ◽

Author(s):

Víctor García-Olivares ◽

Adrián Muñoz-Barrera ◽

José Miguel Lorenzo-Salazar ◽

Carlos Zaragoza-Trello ◽

Luis A. Rubio-Rodríguez ◽

...

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Sequence Data ◽

Qualitative Assessment ◽

Whole Genome ◽

Third Generation ◽

Sequencing Data ◽

Short Read ◽

Bioinformatic Tools ◽

Whole Exome

The mitochondrial genome (mtDNA) is of interest for a range of fields including evolutionary, forensic, and medical genetics. Human mitogenomes can be classified into evolutionarily related haplogroups that provide ancestral information and pedigree relationships. Because of this and the advent of high-throughput sequencing (HTS) technology, there is a diversity of bioinformatic tools for haplogroup classification. We present a benchmarking of the 11 most salient tools for human mtDNA classification using empirical whole-genome (WGS) and whole-exome (WES) short-read sequencing data from 36 unrelated donors. In addition, because of its relevance, we also assess the best-performing tool on third-generation, noisy long-read WGS data obtained with nanopore technology for a subset of the donors. We found that, for short-read WGS, most of the tools exhibit high accuracy for haplogroup classification irrespective of the input file used for the analysis. However, for short-read WES, Haplocheck and MixEmt were the most accurate tools. Based on the performance shown for WGS and WES, and the accompanying qualitative assessment, Haplocheck stands out as the most complete tool. For third-generation HTS data, we also showed that Haplocheck was able to accurately retrieve mtDNA haplogroups for all samples assessed, although only after following assembly-based approaches (based on either a reference-based assembly or a hybrid de novo assembly). Taken together, our results provide guidance for researchers to select the most suitable tool to conduct mtDNA analyses from HTS data.
