Strategies for the bioinformatic treatment of high-throughput sequencing data for diatom studies and bioassessment

2021 ◽  
Vol 4 ◽  
Author(s):  
Kálmán Tapolczai ◽  
François Keck ◽  
Valentin Vasselon ◽  
Géza Selmeczy ◽  
Maria Kahlert ◽  
...  

Diatom biomonitoring and ecological studies can greatly benefit from DNA metabarcoding compared to conventional microscopical analysis by potentially providing more reliable and accurate data in a cost- and time-efficient way. A conventional strategy for the bioinformatic treatment of sequencing data involves the clustering of quality-filtered sequences into Operational Taxonomic Units (OTUs) based on a global sequence similarity, and their assignment to taxonomy using a reference library. Then, the obtained species lists of the successfully assigned taxa are used for subsequent analyses or quality index calculation. However, the high diversity of bioinformatic methods and parameters makes inter-study comparisons difficult, especially because OTUs are specific to a given study. Clustering sequences into OTUs aims to reduce the biasing effect of sequencing artefacts and to reach an approximate species-level delimitation, at the price of potentially grouping together sequences with different ecology. A similar bias occurs when sequences that differ from each other in their ecological preference are assigned to the same taxon. The incompleteness of reference libraries can introduce a further bias by not taking into account unassigned sequences, thus losing the ecological information they possess. In order to overcome these biases, our studies tested new approaches on de novo developed diatom indices based on periphytic samples collected from streams in France and Hungary. Index development was performed with the leave-one-out cross validation (LOOCV) technique by building a model on a training dataset containing n-1 samples and testing it on the remaining test sample. Test values were correlated with a reference environmental gradient. The model was based on the calculation of the optimum and tolerance of taxonomic units along the reference gradient and a modified Zelinka-Marvan diatom index equation.
Taxonomic units tested in the studies were morphospecies, OTUs (95% similarity threshold), Individual Sequence Units (ISUs, via minimal bioinformatic quality filtering) and Exact Sequence Variants (ESVs, via the DADA2 denoising algorithm). The “clustering-free” approach (ISU- and ESV-based indices) performed better than the OTU-based one, providing a fine taxonomic resolution at which ecological differences between genetically close sequence variants could be detected. Thus, these indices are better suited to a standardized and comparable routine bioassessment. The “taxonomy-free” approach revealed the ecological preferences of those molecular taxonomic units (ISUs/ESVs) that otherwise either (i) would have been assigned to the same taxon due to genetic similarity, or (ii) would not have been recognized because of their absence from the reference libraries. However, we also found that taxonomic information cannot be neglected in ecological studies when the presence of organisms under particular environmental conditions is to be explained or interpreted, e.g. via the traits they possess. New types of clustering methods are welcome in the future of biomonitoring, where the delimitation of taxonomic units should be refined with a higher emphasis on their ecology rather than on morphological or genetic criteria.
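
The index workflow described in this abstract can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation. It assumes abundance-weighted averaging for the optimum and tolerance of each taxonomic unit, and the classic Zelinka-Marvan form, index = Σ a·s·v / Σ a·v, with the indicator weight v taken as 1/tolerance. The abstract does not say how the equation was modified, so the weighting, the `min_tol` floor, the function names and the toy data are all assumptions.

```python
import math

def optima_tolerances(samples, gradient):
    """Abundance-weighted optimum and tolerance of each unit along the gradient.

    samples: list of dicts {unit: abundance}; gradient: one float per sample.
    """
    totals, weighted = {}, {}
    for counts, x in zip(samples, gradient):
        for unit, a in counts.items():
            totals[unit] = totals.get(unit, 0.0) + a
            weighted[unit] = weighted.get(unit, 0.0) + a * x
    optima = {u: weighted[u] / totals[u] for u in totals if totals[u] > 0}
    spread = {u: 0.0 for u in optima}
    for counts, x in zip(samples, gradient):
        for unit, a in counts.items():
            if unit in optima:
                spread[unit] += a * (x - optima[unit]) ** 2
    tolerances = {u: math.sqrt(spread[u] / totals[u]) for u in optima}
    return optima, tolerances

def zelinka_marvan(counts, optima, tolerances, min_tol=0.1):
    """Index = sum(a*s*v) / sum(a*v), with v = 1 / max(tolerance, min_tol) (assumed)."""
    num = den = 0.0
    for unit, a in counts.items():
        if unit not in optima:
            continue  # a unit unseen in the training data carries no information
        v = 1.0 / max(tolerances[unit], min_tol)
        num += a * optima[unit] * v
        den += a * v
    return num / den if den > 0 else float("nan")

def loocv_index(samples, gradient):
    """Leave-one-out: train on n-1 samples, compute the index for the held-out one."""
    predictions = []
    for i in range(len(samples)):
        train_s = samples[:i] + samples[i + 1:]
        train_g = gradient[:i] + gradient[i + 1:]
        optima, tolerances = optima_tolerances(train_s, train_g)
        predictions.append(zelinka_marvan(samples[i], optima, tolerances))
    return predictions

# Toy data (hypothetical): two units, four samples along a gradient.
toy_samples = [{"A": 10}, {"A": 5, "B": 5}, {"B": 10}, {"A": 2, "B": 8}]
toy_gradient = [1.0, 2.0, 3.0, 2.6]
toy_predictions = loocv_index(toy_samples, toy_gradient)
```

The held-out index values in `toy_predictions` would then be correlated with `toy_gradient` to judge the index, as the abstract describes. The same functions apply unchanged whether the keys are morphospecies, OTUs, ISUs or ESVs.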

2015 ◽  
Author(s):  
Dominik Forster ◽  
Micah Dunthorn ◽  
Thorsten Stoeck ◽  
Frédéric Mahé

Discovery of novel diversity in high-throughput sequencing (HTS) studies is a central task in environmental microbial ecology. To evaluate the effects that amplicon clustering methods have on novel diversity discovery, we clustered an environmental marine HTS dataset of protist reads together with accessions from the taxonomically curated PR2 reference database using three de novo approaches: sequence similarity networks, USEARCH, and Swarm. The novel diversity uncovered by each clustering approach differed drastically in the number of operational taxonomic units (OTUs) and the number of environmental amplicons in these novel diversity OTUs. Global pairwise alignment comparisons revealed that numerous amplicons classified as novel by USEARCH and Swarm were actually highly similar to reference accessions. Using graph theory we found additional novel diversity within OTUs that would have gone unnoticed without further use of their underlying network topologies. Our results suggest that novel diversity inferred from clustering approaches requires further validation, whereas graph theory provides a powerful tool for microbial ecology and the analyses of environmental HTS datasets.


2017 ◽  
Author(s):  
Benjamin J Callahan ◽  
Paul J McMurdie ◽  
Susan P Holmes

Recent advances have made it possible to analyze high-throughput marker-gene sequencing data without resorting to the customary construction of molecular operational taxonomic units (OTUs): clusters of sequencing reads that differ by less than a fixed dissimilarity threshold. New methods control errors sufficiently that sequence variants (SVs) can be resolved exactly, down to the level of single-nucleotide differences over the sequenced gene region. The benefits of finer taxonomic resolution are immediately apparent, and arguments for SV methods have focused on their improved resolution. Less obvious, but we believe more important, are the broad benefits deriving from the status of SVs as consistent labels with intrinsic biological meaning identified independently from a reference database. Here we discuss how those features grant SVs the combined advantages of closed-reference OTUs (including computational costs that scale linearly with study size, simple merging between independently processed datasets, and forward prediction) and of de novo OTUs (including accurate diversity measurement and applicability to communities lacking deep coverage in reference databases). We argue that the improvements in reusability, reproducibility and comprehensiveness are sufficiently great that SVs should replace OTUs as the standard unit of marker gene analysis and reporting.
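
The "consistent label" argument can be made concrete with a small Python sketch. It assumes each dataset has already been denoised into a table that maps exact sequences to per-sample read counts; the function name, sample names and toy sequences are hypothetical. Because the key is the sequence itself, two independently processed tables can be merged by a plain union of keys, whereas de novo OTU identifiers are only meaningful within the study that produced them.

```python
def merge_sv_tables(table_a, table_b):
    """Merge two {sequence: {sample: count}} tables keyed by exact sequence."""
    merged = {}
    for table in (table_a, table_b):
        for seq, per_sample in table.items():
            row = merged.setdefault(seq, {})
            for sample, count in per_sample.items():
                # Counts for the same sample and sequence are summed.
                row[sample] = row.get(sample, 0) + count
    return merged

# Toy tables (hypothetical) from two independently processed studies.
study_1 = {"ACGTACGT": {"s1": 12}, "ACGTACGA": {"s1": 3}}
study_2 = {"ACGTACGT": {"s2": 7}, "TTGTACGA": {"s2": 5}}
merged = merge_sv_tables(study_1, study_2)
```

In this toy case the shared variant `ACGTACGT` ends up with counts from both studies in a single row, with no re-clustering step.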


PeerJ ◽  
2016 ◽  
Vol 4 ◽  
pp. e1692 ◽  
Author(s):  
Dominik Forster ◽  
Micah Dunthorn ◽  
Thorsten Stoeck ◽  
Frédéric Mahé

Discovery of novel diversity in high-throughput sequencing studies is an important aspect in environmental microbial ecology. To evaluate the effects that amplicon clustering methods have on the discovery of novel diversity, we clustered an environmental marine high-throughput sequencing dataset of protist amplicons together with reference sequences from the taxonomically curated Protist Ribosomal Reference (PR2) database using three de novo approaches: sequence similarity networks, USEARCH, and Swarm. The potentially novel diversity uncovered by each clustering approach differed drastically in the number of operational taxonomic units (OTUs) and in the number of environmental amplicons in these novel diversity OTUs. Global pairwise alignment comparisons revealed that numerous amplicons classified as potentially novel by USEARCH and Swarm were more than 97% similar to references of PR2. Using shortest path analyses on sequence similarity network OTUs and Swarm OTUs we found additional novel diversity within OTUs that would have gone unnoticed without further exploiting their underlying network topologies. These results demonstrate that graph theory provides powerful tools for microbial ecology and the analysis of environmental high-throughput sequencing datasets. Furthermore, sequence similarity networks were most accurate in delineating novel diversity from previously discovered diversity.
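
To illustrate the idea of a sequence similarity network with shortest paths to reference sequences, here is a minimal Python sketch. It is a generic breadth-first search, not the procedure used in the study. As simplifying assumptions, sequences are taken to be equal-length and pre-aligned so that identity is a simple position-wise match fraction (the study used global pairwise alignments), and the threshold, names and toy sequences are hypothetical.

```python
from collections import deque
from itertools import combinations

def identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-aligned sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def build_network(seqs, threshold):
    """Adjacency sets linking sequences whose identity is >= threshold."""
    adj = {name: set() for name in seqs}
    for u, v in combinations(seqs, 2):
        if identity(seqs[u], seqs[v]) >= threshold:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def hops_to_reference(adj, references):
    """Multi-source BFS: number of edges from each node to its nearest reference."""
    dist = {r: 0 for r in references if r in adj}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist  # nodes missing from the result share no component with a reference

# Toy data (hypothetical): one reference and three environmental amplicons.
toy_seqs = {
    "ref1": "AAAAAAAAAA",
    "amp1": "AAAAAAAAAT",
    "amp2": "AAAAAAAATT",
    "amp3": "CCCCCCCCCC",
}
toy_adj = build_network(toy_seqs, threshold=0.9)
toy_dist = hops_to_reference(toy_adj, ["ref1"])
```

In this toy network `amp2` sits in the same connected component as `ref1` but two edges away from it, while `amp3` is not connected to any reference. Under these assumptions, path length within a component is one way to flag amplicons that are grouped with a reference yet only distantly linked to it.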


2017 ◽  
Author(s):  
Lionel Morgado ◽  
Ritsert C. Jansen ◽  
Frank Johannes

The loading of small RNA (sRNA) into Argonaute (AGO) complexes is a crucial step in all regulatory pathways identified so far in plants that depend on such non-coding sequences. Important transcriptional and post-transcriptional silencing mechanisms can be activated depending on the specific AGO protein to which sRNA bind. It is known that sRNA-AGO associations are at least partly encoded in the sRNA primary structure, but the sequence features that drive this association have not been fully explored. Here we train support vector machines (SVMs) on sRNA sequencing data obtained from AGO-immunoprecipitation experiments to identify features that determine sRNA affinity to specific AGOs. Our SVMs reveal that AGO affinity is strongly determined by complex k-mers in the 5’ and 3’ ends of sRNA, in addition to well-known features such as sRNA length and the base composition of the first nucleotide. Moreover, we find that these k-mers tend to overlap known transcription factor (TF) binding motifs, thus highlighting a close interplay between TF and sRNA-mediated transcriptional regulation. We embedded the learned SVMs in a computational pipeline that can be used for de novo functional classification of sRNA sequences. This tool, called SAILS, is provided as a web portal accessible at http://sails.eu.nu.
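
The kinds of sequence features named in this abstract (length, first nucleotide, k-mers near the 5’ and 3’ ends) can be extracted with a few lines of Python. This is only a sketch of the feature-extraction step and does not reproduce the encoding or the SVMs of SAILS. The function name, the feature naming scheme, and the values of `k` and `end_len` are assumptions chosen for illustration.

```python
def srna_features(seq, k=3, end_len=8):
    """Return a feature dict: length, 5' nucleotide, and k-mers found in each end."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper().replace("T", "U")  # accept DNA-style reads as well
    features = {"length": len(seq), "first_nt=" + seq[0]: 1}
    five_prime, three_prime = seq[:end_len], seq[-end_len:]
    for label, region in (("5p", five_prime), ("3p", three_prime)):
        for i in range(len(region) - k + 1):
            features[f"{label}_kmer={region[i:i + k]}"] = 1
    return features

# Toy read (hypothetical 20-nt sRNA).
toy_features = srna_features("UGACAGAAGAGAGUGAGCAC", k=3, end_len=5)
```

Feature dictionaries of this shape could then be turned into a numeric matrix and passed to any classifier; which k-mer lengths and end windows are informative is an empirical question that this sketch does not address.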


2017 ◽  
Author(s):  
Arnaud Meng ◽  
Erwan Corre ◽  
Ian Probert ◽  
Andres Gutierrez-Rodriguez ◽  
Raffaele Siano ◽  
...  

Dinoflagellates are one of the most abundant and functionally diverse groups of eukaryotes. Despite an overall scarcity of genomic information for dinoflagellates, constantly emerging high-throughput sequencing resources can be used to characterize and compare these organisms. We assembled de novo and processed 46 dinoflagellate transcriptomes and used a sequence similarity network (SSN) to compare the underlying genomic basis of functional features within the group. This approach constitutes the most comprehensive picture to date of the genomic potential of dinoflagellates. A core proteome composed of 252 connected components (CCs) of putative conserved protein domains (pCDs) was identified. Of these, 206 were novel and 16 lacked any functional annotation in public databases. Integration of functional information in our network analyses allowed investigation of pCDs specifically associated with functional traits. With respect to toxicity, sequences homologous to those of proteins involved in toxin biosynthesis pathways (e.g. sxtA1-4 and sxtG) were not specific to known toxin-producing species. Although not fully specific to symbiosis, the most represented functions associated with proteins involved in the symbiotic trait were related to membrane processes and ion transport. Overall, our SSN approach led to the identification of 45,207 and 90,794 specific and constitutive pCDs of the toxic and symbiotic species represented in our analyses, respectively. Of these, 56% and 57% respectively (i.e. 25,393 and 52,193 pCDs) completely lacked annotation in public databases. This stresses the extent of our lack of knowledge, while emphasizing the potential of SSNs to identify candidate pCDs for further functional genomic characterization.


2016 ◽  
Author(s):  
Ying Wang ◽  
Kun Liu ◽  
De Bi ◽  
Biao Shou Zhou ◽  
Wen Jian Shao

Background. Resurrection plants constitute a unique cadre within angiosperms. Boea clarkeana Hemsl. (Boea, Gesneriaceae) is a desiccation-tolerant dicotyledonous herb that is endemic to China. Although research on angiosperms with desiccation tolerance (DT) could be instructive for crops, genomic resources for B. clarkeana remain scarce. In addition, transcriptome sequencing could be an effective way to study desiccation-tolerant plants. Methods. In the present study, we used the Illumina HiSeq™ 2000 platform and de novo assembly technology to obtain leaf transcriptomes of B. clarkeana and conducted a BLASTX alignment of the sequencing data against protein databases for sequence classification and annotation. Then, based on the sequence information obtained, we developed EST-SSR markers by means of EST-SSR mining, primer design and polymorphism identification. Results. A total of 91,449 unigenes were generated from the leaf cDNA library of B. clarkeana in this study. Based on a sequence similarity search against a known protein database, 72,087 unigenes were annotated. Among the annotated unigenes, a total of 71,170 unigenes showed significant similarity to known proteins of 463 popular model species in the Nr database, and 59,962 unigenes and 32,336 unigenes were assigned to GO classifications and COG, respectively. In addition, 44,924 unigenes were mapped to 128 KEGG pathways. Furthermore, a total of 7,610 unigenes with 8,563 microsatellites were found. Seventy-four primer pairs were selected from 436 primer pairs designed for polymorphism validation. SSRs with higher polymorphism rates were concentrated on dinucleotides, pentanucleotides and hexanucleotides. Finally, 17 pairs with highly polymorphic and stable loci were selected for polymorphism screening. There were a total of 65 alleles, with 2–6 alleles at each locus. Mainly due to the unique biological characteristics of the plants, the HE, HO and PIC per locus were very low, ranging from 0 to 0.196, 0.082 to 0.14 and 0 to 0.155, respectively. Discussion. A substantial fraction of the transcriptome sequences of B. clarkeana was generated in this study, which is the first molecular-level analysis of this plant. These sequences are valuable resources for gene annotation and discovery and molecular marker development. These sequences could also provide a valuable basis for the future molecular study of B. clarkeana.
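
The SSR mining step mentioned in the Methods can be illustrated with a short regular-expression scan in Python. This is a generic sketch rather than the tool or settings used in the study: the motif length range (2 to 6 nt), the minimum of five tandem repeats, and the function name are assumptions, since the abstract does not state the mining criteria.

```python
import re

# A motif of 2-6 nt followed by at least four more copies of itself (>= 5 repeats in total).
SSR_PATTERN = re.compile(r"([ACGT]{2,6}?)\1{4,}")

def find_ssrs(seq):
    """Return (start, motif, repeat_count) for each simple sequence repeat found."""
    hits = []
    for match in SSR_PATTERN.finditer(seq.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAAAA
        repeats = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, repeats))
    return hits

# Toy unigene fragment (hypothetical) containing an (AT)6 repeat.
toy_hits = find_ssrs("GGCATATATATATATCCG")
```

Primer design and polymorphism screening would follow on the flanking sequence of each hit; those steps are outside the scope of this sketch.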


2018 ◽  
Author(s):  
James M Wainaina ◽  
Elijah Ateka ◽  
Timothy Makori ◽  
Monica A Kehoe ◽  
Laura M Boykin

Background: Endornaviruses are non-pathogenic viruses with a global distribution that infect multiple agriculturally important crops, including legumes. However, complete genomes of endornaviruses from legumes are lacking, in particular from the sub-Saharan region. In this study, we report the first complete genomes of PvEV-1 and PvEV-2, and the evolutionary relationship of these genomes. Methods: Symptomatic common beans (Phaseolus vulgaris) showing Bean common mosaic necrosis virus (BCMNV) symptoms were collected from Vihiga county, in the western highlands of Kenya, during field surveys in the region. High-throughput sequencing (RNA-Seq) was carried out on total RNA isolated from symptomatic leaf samples. Subsequently, de novo assembly and reference mapping were carried out to obtain the complete genomes of PvEV-1 and PvEV-2. Results: We identified the complete genomes of Phaseolus vulgaris endornavirus 1 and 2 (PvEV-1 and PvEV-2) from sub-Saharan Africa (SSA). The average genome size of PvEV-1 was ~13,890 nucleotides (nt) while that of PvEV-2 was ~14,698 nt, each encoding a single open reading frame (ORF). Single ORFs ranged from 4,632 to 4,633 aa in PvEV-1 and from 4,899 to 4,954 aa in PvEV-2. Both ORFs encoded the RNA-dependent RNA polymerase (RdRP) gene. The percentage sequence similarity between PvEV-1 and PvEV-2 from this study and GenBank sequences was 29% to 99%. Bayesian phylogenetic analysis resolved two well-supported monophyletic clades, with isolates from this study clustering with sequences from Brazil. Discussion: This study provides the first insights into the evolutionary relationships of PvEV from SSA and contributes towards filling the current knowledge gaps on endornaviruses.



2021 ◽  
Author(s):  
Víctor García-Olivares ◽  
Adrián Muñoz-Barrera ◽  
José Miguel Lorenzo-Salazar ◽  
Carlos Zaragoza-Trello ◽  
Luis A. Rubio-Rodríguez ◽  
...  

The mitochondrial genome (mtDNA) is of interest for a range of fields including evolutionary, forensic, and medical genetics. Human mitogenomes can be classified into evolutionarily related haplogroups that provide ancestral information and pedigree relationships. Because of this and the advent of high-throughput sequencing (HTS) technology, there is a diversity of bioinformatic tools for haplogroup classification. We present a benchmarking of the 11 most salient tools for human mtDNA classification using empirical whole-genome (WGS) and whole-exome (WES) short-read sequencing data from 36 unrelated donors. In addition, because of its relevance, we also assess the best-performing tool on third-generation long noisy read WGS data obtained with nanopore technology for a subset of the donors. We found that, for short-read WGS, most of the tools exhibit high accuracy for haplogroup classification irrespective of the input file used for the analysis. However, for short-read WES, Haplocheck and MixEmt were the most accurate tools. Based on the performance shown for WGS and WES, and the accompanying qualitative assessment, Haplocheck stands out as the most complete tool. For third-generation HTS data, we also showed that Haplocheck was able to accurately retrieve mtDNA haplogroups for all samples assessed, although only after following assembly-based approaches (either a reference-based assembly or a hybrid de novo assembly). Taken together, our results provide guidance for researchers to select the most suitable tool to conduct mtDNA analyses from HTS data.

