From DNA sequences to operational reference databases: an opinionated approach using R

2021 ◽  
Vol 4 ◽  
Author(s):  
François Keck ◽  
Florian Altermatt

Reference databases of taxonomically assigned sequences are a key element for DNA-based identification of organisms. Accurate and complete reference databases are necessary to associate a correct taxonomic name with the sequences obtained in metabarcoding studies. Today many research projects using DNA metabarcoding include the development of a custom reference database, often derived from large repositories like GenBank. At the same time, many projects are focussing on the development of ready-to-use databases validated by experts and targeting specific markers and taxonomic groups. While mainstream tools such as spreadsheet software may be suitable for managing small databases, they quickly become insufficient when the amount of data increases and validation operations become more complex. There is a clear need for user-friendly and powerful tools to manipulate biological sequences and manage reference databases. The R language, which is free software and has already been adopted by many researchers to perform their analyses, is highly suitable for developing such tools. In this talk, we will outline the approach we recommend for handling small- to medium-sized reference databases, which currently still account for the majority of projects. We will argue that a simple tabular approach, where each sequence constitutes an observation, may be the most adequate. While such a single table may be less flexible and less optimized than relational databases or more complex data structures, it is easy to maintain and allows the direct use of modern data-frame-centric tools. We will specifically present and discuss two R packages that can be used jointly to make reference database development more accessible and more reproducible. First, we will briefly introduce bioseq (Keck 2020), which is dedicated to biological sequence manipulation and analysis.
The package implements classes and functions to make analyses of complex datasets including DNA, RNA or protein sequences as simple as possible. The strength of bioseq is that it provides standard and more advanced functions to perform low-level operations through a simple and consistent programming interface. Then we will present refdb, which has been developed as an environment for semi-automatic and assisted construction of reference databases. The refdb package is a reference database manager offering a set of powerful functions to import, organize, clean, filter, audit and export the data. We will outline how these two packages together can speed up reference database generation and handling, and contribute to standardization and repeatability in metabarcoding studies.
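As a rough illustration of the single-table idea described above (one row per sequence, with the taxonomy stored in columns of the same row), the sketch below uses plain Python rather than R. It does not use or reproduce the bioseq or refdb APIs; the `Record` fields, the `clean` helper, the length threshold and the toy records are all hypothetical.

```python
from dataclasses import dataclass

VALID_BASES = set("ACGT")


@dataclass(frozen=True)
class Record:
    """One row of the table: a single sequence with its taxonomic labels."""
    accession: str  # hypothetical identifier
    genus: str
    species: str
    sequence: str


def clean(records, min_length=10):
    """Keep rows that have a species name, an unambiguous sequence of at
    least min_length bases, and a sequence not already seen for that species."""
    seen = set()
    kept = []
    for r in records:
        seq = r.sequence.upper()
        if not r.species or len(seq) < min_length:
            continue  # missing species name or sequence too short
        if set(seq) - VALID_BASES:
            continue  # contains ambiguous or invalid characters
        key = (r.species, seq)
        if key in seen:
            continue  # exact duplicate within the same species
        seen.add(key)
        kept.append(r)
    return kept


# Made-up toy records, for illustration only.
table = [
    Record("X1", "Genus_a", "Genus_a sp1", "ACGTACGTACGT"),
    Record("X2", "Genus_a", "Genus_a sp1", "acgtacgtacgt"),  # duplicate of X1
    Record("X3", "Genus_a", "", "ACGTACGTACGT"),             # no species name
    Record("X4", "Genus_b", "Genus_b sp1", "ACGTNNGTACGT"),  # ambiguous bases
    Record("X5", "Genus_b", "Genus_b sp1", "ACGTTTGTACGA"),
]
cleaned = clean(table)
print([r.accession for r in cleaned])
```

Because every check reads only the columns of a single row (plus a set of rows already seen), each cleaning step stays a simple filter over the table, which is the maintainability argument made in the abstract.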

2021 ◽  
Vol 168 (6) ◽  
Author(s):  
Ann Bucklin ◽  
Katja T. C. A. Peijnenburg ◽  
Ksenia N. Kosobokova ◽  
Todd D. O’Brien ◽  
Leocadio Blanco-Bercial ◽  
...  

Characterization of species diversity of zooplankton is key to understanding, assessing, and predicting the function and future of pelagic ecosystems throughout the global ocean. The marine zooplankton assemblage, including only metazoans, is highly diverse and taxonomically complex, with an estimated ~28,000 species of 41 major taxonomic groups. This review provides a comprehensive summary of DNA sequences for the barcode region of mitochondrial cytochrome oxidase I (COI) for identified specimens. The foundation of this summary is the MetaZooGene Barcode Atlas and Database (MZGdb), a new open-access data and metadata portal that is linked to NCBI GenBank and BOLD data repositories. The MZGdb provides enhanced quality control and tools for assembling COI reference sequence databases that are specific to selected taxonomic groups and/or ocean regions, with associated metadata (e.g., collection georeferencing, verification of species identification, molecular protocols), and tools for statistical analysis, mapping, and visualization. To date, over 150,000 COI sequences for ~5600 described species of marine metazoan plankton (including holo- and meroplankton) are available via the MZGdb portal. This review uses the MZGdb as a resource for summaries of COI barcode data and metadata for important taxonomic groups of marine zooplankton and selected regions, including the North Atlantic, Arctic, North Pacific, and Southern Oceans. The MZGdb is designed to provide a foundation for analysis of species diversity of marine zooplankton based on DNA barcoding and metabarcoding for assessment of marine ecosystems and rapid detection of the impacts of climate change.


Author(s):  
Nicole Foster ◽  
Kor-jent Dijk ◽  
Ed Biffin ◽  
Jennifer Young ◽  
Vicki Thomson ◽  
...  

A proliferation of environmental DNA (eDNA) research has increased the reliance on reference sequence databases to assign unknown DNA sequences to known taxa. Without comprehensive reference databases, DNA extracted from environmental samples cannot be correctly assigned to taxa, limiting the use of this genetic information to identify organisms in unknown sample mixtures. For animals, standard metabarcoding practices involve amplification of the mitochondrial Cytochrome-c oxidase subunit 1 (CO1) region, which is universally amplifiable across the majority of animal taxa. This region, however, does not work well as a DNA barcode for plants and fungi, and there is no similar universal single barcode locus that has the same species resolution. Therefore, generating reference sequences has been more difficult, and several loci have been suggested for use in parallel to achieve species identification. For this reason, we developed a multi-gene targeted capture approach to generate reference DNA sequences for plant taxa across 20 target chloroplast gene regions in a single assay. We successfully compiled a reference database for 93 temperate coastal plants including seagrasses, mangroves, and saltmarshes/samphires. We demonstrate the importance of a comprehensive reference database to prevent species going undetected in eDNA studies. We also investigate how using multiple chloroplast gene regions impacts the ability to discriminate between taxa.


2021 ◽  
Vol 5 ◽  
Author(s):  
Alexis Canino ◽  
Agnès Bouchez ◽  
Christophe Laplace-Treyture ◽  
Isabelle Domaizon ◽  
Frédéric Rimet

Methods for biomonitoring of freshwater phytoplankton are evolving rapidly with eDNA-based methods, offering great complementarity with microscopy. Metabarcoding approaches have become more common in recent years, with a continuous increase in the amount of data generated. Depending on the researchers and the way they assign barcodes to species (bioinformatic pipelines and molecular reference databases), the taxonomic assignment obtained for HTS DNA reads might vary. This is also true for traditional taxonomic studies by microscopy, with regular adjustments of the classification and taxonomy. For those reasons (leading to non-homogeneous taxonomies), gap analyses and comparisons between studies become even more challenging, and the curation processes to find potential consensus names are time-consuming. Here, we present a web-based application (Phytool), developed with ShinyApp (RStudio), that aims to make the harmonisation of taxonomy easier and more efficient, using a complete and up-to-date taxonomy reference database for freshwater microalgae. Phytool allows users to homogenise and update freshwater phytoplankton taxonomical names from sequence files and data tables directly uploaded in the application. It also gathers barcodes from curated references in a user-friendly way in which it is possible to search for specific organisms. All the data provided are downloadable, with the possibility to apply filters in order to select only the required taxa and fields (e.g. specific taxonomic ranks). The main goal is to make the connection between microscopy, molecular biology and taxonomy accessible to a broad range of users through different ready-to-use functions. This study estimates that only 25% of species of freshwater phytoplankton in Phytobs are associated with a barcode. We call for an increased effort to enrich reference databases by coupling taxonomy and molecular methods. Phytool should make this crucial work more efficient.
The application is available at https://caninuzzo.shinyapps.io/phytool_v1/


Author(s):  
Chia-Hua Lue ◽  
Matthew L. Buffington ◽  
Sonja Scheffer ◽  
Matthew Lewis ◽  
Tyler A. Elliott ◽  
...  

Molecular identification is increasingly used to speed up biodiversity surveys and laboratory experiments. However, many groups of organisms cannot be reliably identified using standard databases such as GenBank or BOLD due to a lack of sequenced voucher specimens identified by experts. Sometimes a large number of sequences are available, but with too many errors to allow identification. Here we address this problem for parasitoids of Drosophila by introducing a curated open-access molecular reference database, DROP (Drosophila parasitoids). Identifying Drosophila parasitoids is challenging and poses a major impediment to realizing the full potential of this model system in studies ranging from molecular mechanisms to food webs, and in biological control of Drosophila suzukii. In DROP (http://doi.org/10.5281/zenodo.4519656), genetic data are linked to voucher specimens and, where possible, the voucher specimens are identified by taxonomists and vetted through direct comparison with primary type material. To initiate DROP, we curated 154 laboratory strains, 853 vouchers, 545 DNA sequences, 16 genomes, 11 transcriptomes, and 6 proteomes drawn from a total of 183 operational taxonomic units (OTUs): 113 described Drosophila parasitoid species and 70 provisional species. We found species richness of Drosophila parasitoids to be acutely underestimated and provide an updated taxonomic catalogue for the community. DROP offers accurate molecular identification and improves cross-referencing between individual studies that we hope will catalyze research on this diverse and fascinating model system. Our effort should also serve as an example for researchers facing similar molecular identification problems in other groups of organisms.


2021 ◽  
Vol 4 ◽  
Author(s):  
Liz Davidson

DNA-based identification methods have been shown to have high detection capability and reduced costs compared to traditional methods and can also enable the detection of species that might be missed using traditional methods (e.g. rare species, cryptic species, larval stages). The success of DNA-based identification is dependent on the ‘DNA barcodes’ of target species being present in a barcode reference database. In order to use DNA-based identification methods to assess and monitor UK freshwater arthropods for biodiversity and ecological quality assessments, it is vital that comprehensive reference databases are available. Incomplete reference databases result in many sequences derived from metabarcoding not being assigned to species. Two current projects aim to create collections of high-quality sequences from expertly identified specimens of UK species. The Darwin Tree of Life project aims to sequence the genomes of all the eukaryotic species in Britain and Ireland, and FreshBase aims to create a genomic reference collection for UK freshwater invertebrates. The Barcode of Life Data System (BOLD) is one of the main reference databases for animal barcodes. Prioritising the sequencing of UK freshwater arthropod species that are not yet represented in BOLD would enable more complete identification of UK freshwater biodiversity using metabarcoding and would enable the development of primers to target specific arthropod groups or species. We analysed the coverage of UK freshwater arthropod species in BOLD. Our analyses show that coverage varies between taxonomic groups and large proportions of sequences in some orders are only represented by privately stored sequences in BOLD. Analyses of intra- and inter-specific variation in sequences stored in BOLD show that misidentifications or errors can reduce the barcode gap in some species, which could cause difficulties in accurately identifying sequences derived from metabarcoding.
Representation in BOLD by specimens from the UK is extremely low and analyses show that high geographic variation in sequences in some species could be important for accurate DNA-based identification of UK species. Our results have implications for prioritising the sequencing of UK freshwater arthropods and for the quality control of stored sequences in order to reduce the occurrence of misidentifications and errors that could impact the accuracy of DNA-based identification.
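To make the barcode-gap argument above concrete, the sketch below computes one common definition of the gap for a species: the smallest distance to any other species minus the largest distance within the species. The abstract does not say which distance measure or gap definition was used, so the uncorrected p-distance, the function names and the toy aligned sequences here are assumptions chosen for illustration.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def barcode_gap(records, species):
    """Smallest inter-specific distance minus largest intra-specific distance.

    records: list of (species_name, aligned_sequence) pairs.
    Returns None when the species has fewer than two sequences or
    there is no other species to compare against.
    """
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        if species not in (sp1, sp2):
            continue  # pair does not involve the focal species
        d = p_distance(s1, s2)
        (intra if sp1 == sp2 else inter).append(d)
    if not intra or not inter:
        return None
    return min(inter) - max(intra)


# Made-up aligned toy sequences, for illustration only.
reference = [
    ("A", "AAAAAAAAAA"),
    ("A", "AAAAAAAAAT"),
    ("B", "AAAAAACCCC"),
]
gap_clean = barcode_gap(reference, "A")

# Same data plus one record of species B mislabelled as species A.
with_error = reference + [("A", "AAAAAACCCC")]
gap_error = barcode_gap(with_error, "A")

print(gap_clean, gap_error)
```

In the toy data a single mislabelled record turns a positive gap into a negative one, which is the kind of effect the abstract attributes to misidentifications in stored sequences.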


2016 ◽  
Author(s):  
Panu Somervuo ◽  
Douglas Yu ◽  
Charles Xu ◽  
Yinqiu Ji ◽  
Jenni Hultman ◽  
...  

A crucial step in the use of DNA markers for biodiversity surveys is the assignment of Linnaean taxonomies (species, genus, etc.) to sequence reads. This allows the use of all the information known based on the taxonomic names. Taxonomic placement of DNA barcoding sequences is inherently probabilistic because DNA sequences contain errors, because there is natural variation among sequences within a species, and because reference databases are incomplete and can have false annotations. However, most existing bioinformatics methods for taxonomic placement either exclude uncertainty, or quantify it using metrics other than probability. In this paper we evaluate the performance of a recently proposed probabilistic taxonomic placement method PROTAX by applying it to both annotated reference sequence data as well as unknown environmental data. Our four case studies include contrasting taxonomic groups (fungi, bacteria, mammals, and insects), variation in the length and quality of the barcoding sequences (from individually Sanger-sequenced sequences to short Illumina reads), variation in the structures and sizes of the taxonomies (from 800 to 130 000 species), and variation in the completeness of the reference databases (representing 15% to 100% of the species). Our results demonstrate that PROTAX yields essentially unbiased assessment of probabilities of taxonomic placement, and thus that its quantification of species identification uncertainty is reliable. As expected, the accuracy of taxonomic placement increases with increasing coverage of taxonomic and reference sequence databases, and with increasing ratio of genetic variation among taxonomic levels over within taxonomic levels. Our results show that reliable species-level identification from environmental samples is still challenging, and thus neglecting identification uncertainty can lead to spurious inference.
A key aim for future research is the completion and pruning of taxonomic and reference sequence databases, and making these two types of data compatible.


2017 ◽  
Author(s):  
Adam L. Bazinet ◽  
Brian D. Ondov ◽  
Daniel D. Sommer ◽  
Shashikala Ratnayake

When performing bioforensic casework, it is important to be able to reliably detect the presence of a particular organism in a metagenomic sample, even if the organism is only present in a trace amount. For this task, it is common to use a sequence classification program that determines the taxonomic affiliation of individual sequence reads by comparing them to reference database sequences. As metagenomic data sets often consist of millions or billions of reads that need to be compared to reference databases containing millions of sequences, such sequence classification programs typically use search heuristics and databases with reduced sequence diversity to speed up the analysis, which can lead to incorrect assignments. Thus, in a bioforensic setting where correct assignments are paramount, assignments of interest made by “first-pass” classifiers should be confirmed using the most precise methods and comprehensive databases available. In this study we present a BLAST-based method for validating the assignments made by less precise sequence classification programs, with optimal parameters for filtering of BLAST results determined via simulation of sequence reads from genomes of interest, and we apply the method to the detection of four pathogenic organisms. The software implementing the method is open source and freely available.


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e4892 ◽  
Author(s):  
Adam L. Bazinet ◽  
Brian D. Ondov ◽  
Daniel D. Sommer ◽  
Shashikala Ratnayake

When performing bioforensic casework, it is important to be able to reliably detect the presence of a particular organism in a metagenomic sample, even if the organism is only present in a trace amount. For this task, it is common to use a sequence classification program that determines the taxonomic affiliation of individual sequence reads by comparing them to reference database sequences. As metagenomic data sets often consist of millions or billions of reads that need to be compared to reference databases containing millions of sequences, such sequence classification programs typically use search heuristics and databases with reduced sequence diversity to speed up the analysis, which can lead to incorrect assignments. Thus, in a bioforensic setting where correct assignments are paramount, assignments of interest made by “first-pass” classifiers should be confirmed using the most precise methods and comprehensive databases available. In this study we present a BLAST-based method for validating the assignments made by less precise sequence classification programs, with optimal parameters for filtering of BLAST results determined via simulation of sequence reads from genomes of interest, and we apply the method to the detection of four pathogenic organisms. The software implementing the method is open source and freely available.


2021 ◽  
Vol 4 ◽  
Author(s):  
Cristina Claver ◽  
Oriol Canals ◽  
Naiara Rodriguez-Ezpeleta

Environmental DNA (eDNA) metabarcoding, the process of sequencing DNA collected from the environment for producing biodiversity inventories, is increasingly being applied to assess fish diversity and distribution in marine environments. Yet, the successful application of this technique relies heavily on accurate and complete reference databases used for taxonomic assignment. The most used markers for fish eDNA metabarcoding studies are the cytochrome c oxidase subunit 1 (COI), 16S ribosomal RNA (16S), 12S ribosomal RNA (12S) and cytochrome b (cyt b) genes, whose sequences are usually retrieved from GenBank, the largest DNA sequence database that represents a worldwide public resource for genetic studies. Thus, the completeness and accuracy of GenBank is critical to derive reliable estimations from fish eDNA metabarcoding data. Here, we have i) compiled the checklist of European marine fishes, ii) performed a gap analysis of the four genes and, within COI and 12S, also of the most used barcodes for fish, and iii) developed a workflow to detect potentially incorrect records in GenBank. We found that of the 1965 species in the checklist (1761 Actinopterygii, 189 Elasmobranchii, 9 Holocephali, 4 Petromyzonti and 2 Myxini), about 70% have sequences for COI, whereas fewer have sequences for 12S, 16S and cyt b (45-55%). Among the species for which COI and 12S sequences are available, about 60% and 40%, respectively, have sequences covering the most used barcodes. The analysis of pairwise distances between sequences revealed pairs belonging to the same species with significantly low similarity and pairs belonging to different high-level taxonomic groups (class, order) with significantly high similarity. In light of this further confirmation of the presence of a substantial number of incorrect records in GenBank, we propose a method for identifying and removing spurious sequences to create reliable and accurate reference databases for eDNA metabarcoding.
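The gap analysis mentioned above amounts to asking, for each marker, what fraction of the checklist species has at least one reference sequence. The sketch below shows that calculation under simplifying assumptions; it is not the authors' workflow, and the `marker_coverage` helper, the placeholder species names and the toy records are hypothetical.

```python
def marker_coverage(checklist, records):
    """Fraction of checklist species with at least one record, per marker.

    checklist: iterable of species names.
    records: iterable of (species_name, marker) pairs, one per sequence.
    Records for species outside the checklist are ignored.
    """
    species = set(checklist)
    if not species:
        raise ValueError("checklist must not be empty")
    covered = {}
    for sp, marker in records:
        if sp in species:
            covered.setdefault(marker, set()).add(sp)
    return {marker: len(found) / len(species) for marker, found in covered.items()}


# Made-up toy data, for illustration only.
checklist = ["sp1", "sp2", "sp3", "sp4"]
records = [
    ("sp1", "COI"), ("sp1", "COI"), ("sp2", "COI"), ("sp3", "COI"),
    ("sp1", "12S"), ("sp2", "12S"),
    ("sp5", "12S"),  # not on the checklist, so it does not count
]
coverage = marker_coverage(checklist, records)
print(coverage)
```

Counting species rather than sequences matters here: a marker with many records for a few well-studied species can still leave most of the checklist uncovered.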


2019 ◽  
Vol 14 (7) ◽  
pp. 621-627 ◽  
Author(s):  
Youhuang Bai ◽  
Xiaozhuan Dai ◽  
Tiantian Ye ◽  
Peijing Zhang ◽  
Xu Yan ◽  
...  

Background: Long noncoding RNAs (lncRNAs) are endogenous noncoding RNAs, arbitrarily longer than 200 nucleotides, that play critical roles in diverse biological processes. LncRNAs exist in different genomes ranging from animals to plants. Objective: PlncRNADB is a searchable database of lncRNA sequences and annotation in plants. Methods: We built a pipeline for lncRNA prediction in plants, providing a convenient utility for users to quickly distinguish potential noncoding RNAs from protein-coding transcripts. Results: More than five thousand lncRNAs are collected from four plant species (Arabidopsis thaliana, Arabidopsis lyrata, Populus trichocarpa and Zea mays) in PlncRNADB. Moreover, our database provides the relationship between lncRNAs and various RNA-binding proteins (RBPs), which can be displayed through a user-friendly web interface. Conclusion: PlncRNADB can serve as a reference database to investigate the lncRNAs and their interaction with RNA-binding proteins in plants. The PlncRNADB is freely available at http://bis.zju.edu.cn/PlncRNADB/.

