An efficient and robust laboratory workflow and tetrapod database for larger scale eDNA studies

Mapping Intimacies ◽

10.1101/345082 ◽

2018 ◽

Cited By ~ 1

Author(s):

Jan Axtner ◽

Alex Crampton-Platt ◽

Lisa A. Hörig ◽

Azlan Mohamed ◽

Charles C.Y. Xu ◽

...

Keyword(s):

Pcr Amplification ◽

Environmental Dna ◽

Taxonomic Assignment ◽

Bioinformatic Pipeline ◽

Assignment Method ◽

Reference Databases ◽

Reference Sequences ◽

Taxonomic Groups ◽

Annotation Errors ◽

Technical Replication

Abstract

Background: The use of environmental DNA, ‘eDNA,’ for species detection via metabarcoding is growing rapidly. We present a co-designed lab workflow and bioinformatic pipeline to mitigate the two most important risks of eDNA: sample contamination and taxonomic mis-assignment. These risks arise from the need for PCR amplification to detect the trace amounts of DNA, combined with the necessity of using short target regions due to DNA degradation.

Findings: Our high-throughput workflow minimises these risks via a four-step strategy: (1) technical replication with two PCR replicates and two extraction replicates; (2) using multiple markers (12S, 16S, CytB); (3) a ‘twin-tagging,’ two-step PCR protocol; (4) use of the probabilistic taxonomic assignment method PROTAX, which can account for incomplete reference databases. As annotation errors in the reference sequences can result in taxonomic mis-assignment, we supply a protocol for curating sequence datasets. For some taxonomic groups and some markers, curation resulted in over 50% of sequences being deleted from public reference databases, due to (1) limited overlap between our target amplicon and reference sequences; (2) mislabelling of reference sequences; (3) redundancy. Finally, we provide a bioinformatic pipeline to process amplicons and conduct PROTAX assignment, and tested it on an ‘invertebrate derived DNA’ (iDNA) dataset from 1532 leeches from Sabah, Malaysia. Twin-tagging allowed us to detect and exclude sequences with non-matching tags. The smallest DNA fragment (16S) amplified most frequently for all samples, but was less powerful for discriminating at species rank. Using stringent and lax acceptance criteria, we found 162 (stringent) and 190 (lax) vertebrate detections in 95 (stringent) and 109 (lax) leech samples.

Conclusions: Our metabarcoding workflow should help research groups increase the robustness of their results and therefore facilitate wider usage of e/iDNA, which is turning into a valuable source of ecological and conservation information on tetrapods.
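
Two of the ideas in this abstract can be sketched in a few lines of Python: discarding reads whose forward and reverse tags do not match (twin-tagging), and accepting a taxon only when it is seen in enough replicates. This is a minimal illustration, not the authors' pipeline. The record layout, the function names, the placeholder taxon labels, and the replicate thresholds are all assumptions made for the example, and the paper's actual stringent and lax criteria are not reproduced here.

```python
from collections import defaultdict

def keep_matching_tags(reads):
    """Keep only reads whose forward and reverse tags are identical."""
    return [r for r in reads if r["fwd_tag"] == r["rev_tag"]]

def accepted_taxa(reads, min_replicates):
    """Return taxa detected in at least `min_replicates` distinct replicates."""
    seen = defaultdict(set)
    for r in keep_matching_tags(reads):
        seen[r["taxon"]].add(r["replicate"])
    return {taxon for taxon, reps in seen.items() if len(reps) >= min_replicates}

# Hypothetical toy reads: tags, replicate IDs and taxon labels are placeholders.
reads = [
    {"fwd_tag": "ACGT", "rev_tag": "ACGT", "replicate": "A1", "taxon": "taxon_A"},
    {"fwd_tag": "ACGT", "rev_tag": "ACGT", "replicate": "B1", "taxon": "taxon_A"},
    {"fwd_tag": "ACGT", "rev_tag": "TTGA", "replicate": "A1", "taxon": "taxon_B"},
    {"fwd_tag": "GGCA", "rev_tag": "GGCA", "replicate": "A2", "taxon": "taxon_C"},
]

strict = accepted_taxa(reads, min_replicates=2)  # requires two replicates
lax = accepted_taxa(reads, min_replicates=1)     # a single replicate suffices
```

In this toy input the read with mismatched tags is dropped before any counting, so its taxon cannot be accepted under either threshold.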

Assessing accuracy and completeness of GenBank for eDNA metabarcoding: towards a reliable marine fish reference database

ARPHA Conference Abstracts ◽

10.3897/aca.4.e64671 ◽

2021 ◽

Vol 4 ◽

Author(s):

Cristina Claver ◽

Oriol Canals ◽

Naiara Rodriguez-Ezpeleta

Keyword(s):

Ribosomal Rna ◽

Gap Analysis ◽

Environmental Dna ◽

Reference Database ◽

Fish Diversity ◽

Cyt B ◽

Taxonomic Assignment ◽

Reference Databases ◽

Taxonomic Groups ◽

High Level

Environmental DNA (eDNA) metabarcoding, the process of sequencing DNA collected from the environment to produce biodiversity inventories, is increasingly being applied to assess fish diversity and distribution in marine environments. Yet the successful application of this technique relies heavily on accurate and complete reference databases for taxonomic assignment. The most used markers for fish eDNA metabarcoding studies are the cytochrome c oxidase subunit 1 (COI), 16S ribosomal RNA (16S), 12S ribosomal RNA (12S) and cytochrome b (cyt b) genes, whose sequences are usually retrieved from GenBank, the largest DNA sequence database and a worldwide public resource for genetic studies. Thus, the completeness and accuracy of GenBank are critical to deriving reliable estimates from fish eDNA metabarcoding data. Here, we have i) compiled the checklist of European marine fishes, ii) performed a gap analysis of the four genes and, within COI and 12S, also of the most used barcodes for fish, and iii) developed a workflow to detect potentially incorrect records in GenBank. We found that of the 1965 species in the checklist (1761 Actinopterygii, 189 Elasmobranchii, 9 Holocephali, 4 Petromyzonti and 2 Myxini), about 70% have sequences for COI, whereas fewer have sequences for 12S, 16S and cyt b (45-55%). Among the species for which COI and 12S sequences are available, about 60% and 40%, respectively, have sequences covering the most used barcodes. The analysis of pairwise distances between sequences revealed pairs belonging to the same species with notably low similarity and pairs belonging to different high-level taxonomic groups (class, order) with notably high similarity. In light of this further confirmation that GenBank contains a substantial number of incorrect records, we propose a method for identifying and removing spurious sequences to create reliable and accurate reference databases for eDNA metabarcoding.
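
The gap analysis and the pairwise-similarity check described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' workflow: the function names, the record layout, the toy sequences, and the `low`/`high` thresholds are hypothetical, and identity is computed naively over pre-aligned, equal-length sequences.

```python
from itertools import combinations

def marker_coverage(checklist, records):
    """Fraction of checklist species with at least one record, per marker."""
    covered = {}
    for species, marker in records:
        if species in checklist:
            covered.setdefault(marker, set()).add(species)
    return {m: len(s) / len(checklist) for m, s in covered.items()}

def identity(a, b):
    """Proportion of matching positions in two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def suspicious_pairs(seqs, low=0.90, high=0.99):
    """Flag same-species pairs below `low` and different-class pairs above `high`.

    Both thresholds are assumptions chosen for the example.
    """
    flagged = []
    for (id1, r1), (id2, r2) in combinations(sorted(seqs.items()), 2):
        sim = identity(r1["seq"], r2["seq"])
        if r1["species"] == r2["species"] and sim < low:
            flagged.append((id1, id2, "same species, low similarity"))
        elif r1["class"] != r2["class"] and sim > high:
            flagged.append((id1, id2, "different class, high similarity"))
    return flagged

# Hypothetical toy data; species, classes and sequences are placeholders.
checklist = {"sp1", "sp2", "sp3", "sp4"}
records = [("sp1", "COI"), ("sp2", "COI"), ("sp3", "COI"),
           ("sp1", "12S"), ("sp9", "12S"), ("sp1", "COI")]
coverage = marker_coverage(checklist, records)

seqs = {
    "r1": {"species": "sp1", "class": "c1", "seq": "AAAAAAAAAA"},
    "r2": {"species": "sp1", "class": "c1", "seq": "AAAAACCCCC"},
    "r3": {"species": "sp2", "class": "c2", "seq": "AAAAAAAAAA"},
}
flagged = suspicious_pairs(seqs)
```

Flagged pairs are candidates for manual review rather than automatic deletion, since genuine intraspecific variation or conserved regions can also produce unusual similarities.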

Combining NCBI and BOLD databases for OTU assignment in metabarcoding and metagenomic data: The BOLD_NCBI_Merger

10.7287/peerj.preprints.3133 ◽

2017 ◽

Author(s):

Jan-Niklas Macher ◽

Till-Hendrik Macher ◽

Florian Leese

Keyword(s):

Sequence Data ◽

Metagenomic Data ◽

Biodiversity Assessment ◽

Ecological Studies ◽

Reliability Of Results ◽

Taxonomic Assignment ◽

Taxonomic Information ◽

Reference Databases ◽

Taxonomic Groups ◽

Combine Sequence

Metabarcoding and metagenomic approaches are becoming routine techniques in biodiversity assessment and ecological studies. The assignment of taxonomic information to sequences is challenging, as many reference libraries are lacking information on certain taxonomic groups and can contain erroneous sequences. Combining different reference databases is therefore a promising approach for maximizing taxonomic coverage and reliability of results. This tutorial shows how to use the “BOLD_NCBI_Merger” script to combine sequence data obtained from the National Center for Biotechnology Information (NCBI) GenBank and the Barcode of Life Database (BOLD) and prepare it for taxonomic assignment with the software MEGAN.

Combining NCBI and BOLD databases for OTU assignment in metabarcoding and metagenomic data: The BOLD_NCBI_Merger

10.7287/peerj.preprints.3133v1 ◽

2017 ◽

Author(s):

Jan-Niklas Macher ◽

Till-Hendrik Macher ◽

Florian Leese

Keyword(s):

Sequence Data ◽

Metagenomic Data ◽

Biodiversity Assessment ◽

Ecological Studies ◽

Reliability Of Results ◽

Taxonomic Assignment ◽

Taxonomic Information ◽

Reference Databases ◽

Taxonomic Groups ◽

Combine Sequence

Anacapa Toolkit: an environmental DNA toolkit for processing multilocus metabarcode datasets

10.1101/488627 ◽

2018 ◽

Cited By ~ 1

Author(s):

Emily E. Curd ◽

Zack Gold ◽

Gaurav S Kandlikar ◽

Jesse Gomer ◽

Max Ogden ◽

...

Keyword(s):

High Throughput Sequencing ◽

Sequence Data ◽

Environmental Dna ◽

Taxonomic Diversity ◽

R Package ◽

Community Diversity ◽

Taxonomic Assignment ◽

Lowest Common Ancestor ◽

Reference Databases ◽

Comprehensive Reference

Abstract

1. Environmental DNA (eDNA) metabarcoding is a promising method to monitor species and community diversity that is rapid, affordable, and non-invasive. Longstanding needs of the eDNA community are modular informatics tools, comprehensive and customizable reference databases, flexibility across high-throughput sequencing platforms, fast multilocus metabarcode processing, and accurate taxonomic assignment. As bioinformatics tools continue to improve, addressing each of these demands within a single bioinformatics toolkit is becoming a reality.

2. We present the modular metabarcode sequence toolkit Anacapa (https://github.com/limey-bean/Anacapa/), which addresses the above needs, allowing users to build comprehensive reference databases and assign taxonomy to raw multilocus metabarcode sequence data. A novel aspect of Anacapa is our database building module, Creating Reference libraries Using eXisting tools (CRUX), which generates comprehensive reference databases for specific user-defined metabarcode loci. The Quality Control and Dereplication module sorts multiple metabarcode loci and processes merged, unmerged and unpaired reads, maximizing recovered diversity. Amplicon sequence variants (ASVs) are then detected using DADA2. The Anacapa Classifier module aligns these ASVs to CRUX-generated reference databases using Bowtie2. Taxonomy is assigned to ASVs with confidence scores using a Bayesian Lowest Common Ancestor (BLCA) method. The Anacapa Toolkit also includes an R package, ranacapa, for automated results exploration through standard biodiversity statistical analysis.

3. We performed a series of benchmarking tests to verify that the Anacapa Toolkit generates comprehensive reference databases that capture wide taxonomic diversity and that it can assign high-quality taxonomy to both MiSeq-length and HiSeq-length sequence data. We demonstrate the value of the Anacapa Toolkit by assigning taxonomy to eDNA sequences from seawater samples from southern California, including the capability of this toolkit to process multilocus metabarcoding data.

4. The Anacapa Toolkit broadens the exploration of eDNA and assists in biodiversity assessment and management by generating metabarcode-specific databases, processing multilocus data, retaining all read types, and expanding non-traditional eDNA targets. Anacapa software and source code are open and available in a virtual container to ease installation.
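
The lowest-common-ancestor idea behind the taxonomic assignment step can be illustrated with a plain, non-Bayesian sketch in Python. It is not the BLCA method or Anacapa's code, and it produces no confidence scores: it only returns the deepest taxonomic path shared by all reference hits for a query. The function name and the placeholder lineages are assumptions made for the example.

```python
def lowest_common_ancestor(lineages):
    """Return the longest root-to-tip prefix shared by all given lineages."""
    if not lineages:
        return []
    shared = []
    # zip(*lineages) walks the ranks in parallel and stops at the shortest lineage.
    for ranks in zip(*lineages):
        if all(r == ranks[0] for r in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared

# Hypothetical lineages of three reference hits for one query sequence.
hits = [
    ["Eukaryota", "Chordata", "class_X", "genus_A", "species_A1"],
    ["Eukaryota", "Chordata", "class_X", "genus_A", "species_A2"],
    ["Eukaryota", "Chordata", "class_X", "genus_A", "species_A1"],
]
assignment = lowest_common_ancestor(hits)
```

Because the hits disagree at species rank, the query is assigned only as far as `genus_A`. A Bayesian variant would additionally weight hits and attach a confidence to each rank.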

Molecular Evidence for Nosocomial Transmission of Human Immunodeficiency Virus from a Surgeon to One of His Patients

Journal of Virology ◽

10.1128/jvi.72.5.4537-4540.1998 ◽

1998 ◽

Vol 72 (5) ◽

pp. 4537-4540 ◽

Cited By ~ 48

Author(s):

Alain Blanchard ◽

Stéphane Ferris ◽

Sophie Chamaret ◽

Denise Guétard ◽

Luc Montagnier

Keyword(s):

Human Immunodeficiency Virus ◽

Viral Genome ◽

Pcr Amplification ◽

Nosocomial Transmission ◽

Molecular Evidence ◽

Immunodeficiency Virus ◽

Reference Sequences ◽

Hiv Type 1 ◽

Viral Sequences

ABSTRACT We have investigated the molecular evidence in favor of the transmission of human immunodeficiency virus (HIV) from an HIV-infected surgeon to one of his patients. After PCR amplification, the env and gag sequences from the viral genome were cloned and sequenced. Phylogenetic analysis revealed that the viral sequences derived from the surgeon and his patient are closely related, which strongly suggests that nosocomial transmission occurred. In addition, these viral sequences belong to group M of HIV type 1 but are divergent from the reference sequences of the known subtypes.

Past and recent biodiversity profiling in ancient Lake Chalco Mexico by a metagenomics analysis

10.5194/egusphere-egu21-6277 ◽

2021 ◽

Author(s):

Bárbara Moguel ◽

Liseth Pérez ◽

Luis David Alcaraz ◽

Socorro Lozano-García ◽

Luis Herrera-Estrella ◽

...

Keyword(s):

Volcanic Belt ◽

Environmental Dna ◽

Radiocarbon Dates ◽

The Past ◽

Mexican Volcanic Belt ◽

History Of ◽

High Throughput Dna Sequencing ◽

High Altitude Lake ◽

Taxonomic Groups ◽

The Relationship

For decades, paleoecological studies in lake sediments have focused on reconstructing the environments of the past and explaining phenomena linked to climatic variations. Recent advances in high-throughput DNA sequencing have allowed access to environmental DNA (eDNA) and ancient sedimentary DNA (sedaDNA) as a new and efficient proxy for past and present biodiversity. The Basin of Mexico (BM) is located in the central part of the Trans-Mexican Volcanic Belt at 2,200 m a.s.l., with the southern portion harboring the Chalco sub-basin. Lake Chalco is one of the last remaining natural aquatic ecosystems within the ever-expanding urban area surrounding Mexico City. The paleoenvironmental history of this lake has been previously characterized using sedimentological and geochemical proxies, as well as preserved microfossils (diatoms, pollen), with a temporal framework based on multiple radiocarbon dates. However, information for the remaining taxonomic groups and metabolic pathways has remained unexplored. Here, we present the first metagenomics-based study for the Holocene in a high-altitude lake in Central Mexico, Lake Chalco. We explored the relationship between the lake’s paleoenvironmental condition and estimations of taxonomic and metabolic profiles across the sedimentary sequence (2.5 meters long). Multiple biological and abiotic variables revealed three main environmental phases: 1) a cool freshwater lake (FW1: 11,500-11,000 cal years BP), 2) a warm hyposaline lake (HS2: 11,000-6,000 cal years BP), and 3) a temperate, subsaline lake (SS3: <6,000 cal years BP). We describe the structure of the microbiota community and the turnover in taxonomic richness across the three Holocene paleoenvironmental phases. Over the past 12,000 years, the most abundant domains in Lake Chalco sediments were Bacteria, followed by Archaea and Eukarya (36,722 genera). The analysis of functional proteins showed high biodiversity, with a total of 27,636,243 proteins identified, but it was only possible to annotate 3,227,398 of them. We also identified several genes associated with relevant pathways, such as methanogenesis. Altogether, this study allowed us to reconstruct the natural history of Lake Chalco and its surroundings.

Accurate Reconstruction of Microbial Strains from Metagenomic Sequencing Using Representative Reference Genomes

10.1101/215707 ◽

2017 ◽

Cited By ~ 2

Author(s):

Zhemin Zhou ◽

Nina Luhmann ◽

Nabil-Fareed Alikhan ◽

Christopher Quince ◽

Mark Achtman

Keyword(s):

Evaluation Studies ◽

Species Level ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Reference Databases ◽

Microbial Strains ◽

Taxonomic Assignments ◽

Taxonomic Groups ◽

Reference Genomes ◽

Recent Evaluation

Abstract Exploring the genetic diversity of microbes within the environment through metagenomic sequencing first requires classifying these reads into taxonomic groups. Current methods compare these sequencing data with existing biased and limited reference databases. Several recent evaluation studies demonstrate that current methods either lack sufficient sensitivity for species-level assignments or suffer from false positives, overestimating the number of species in the metagenome. Both are especially problematic for the identification of low-abundance microbial species, e.g. detecting pathogens in ancient metagenomic samples. We present a new method, SPARSE, which improves taxonomic assignments of metagenomic reads. SPARSE balances existing biased reference databases by grouping reference genomes into similarity-based hierarchical clusters, implemented as an efficient incremental data structure. SPARSE assigns reads to these clusters using a probabilistic model, which specifically penalizes non-specific mappings of reads from unknown sources and hence reduces false-positive assignments. Our evaluation on simulated datasets from two recent evaluation studies demonstrated the improved precision of SPARSE in comparison to other methods for species-level classification. In a third simulation, our method successfully differentiated multiple co-existing Escherichia coli strains from the same sample. In real archaeological datasets, SPARSE identified ancient pathogens with ≤ 0.02% abundance, consistent with published findings that required additional sequencing data. In these datasets, other methods either missed targeted pathogens or reported non-existent ones. SPARSE and all evaluation scripts are available at https://github.com/zheminzhou/SPARSE.

Overcoming limitations to environmental DNA studies: A coastal temperate reference sequence database for multiple chloroplast gene regions generated in a single assay.

10.22541/au.163252330.05592688/v1 ◽

2021 ◽

Author(s):

Nicole Foster ◽

Kor-jent Dijk ◽

Ed Biffin ◽

Jennifer Young ◽

Vicki Thomson ◽

...

Keyword(s):

Dna Sequences ◽

Dna Barcode ◽

Environmental Dna ◽

Reference Sequence ◽

Reference Database ◽

Chloroplast Gene ◽

Coastal Plants ◽

Reference Databases ◽

Targeted Capture ◽

Comprehensive Reference

A proliferation of environmental DNA (eDNA) research has increased the reliance on reference sequence databases to assign unknown DNA sequences to known taxa. Without comprehensive reference databases, DNA extracted from environmental samples cannot be correctly assigned to taxa, limiting the use of this genetic information to identify organisms in unknown sample mixtures. For animals, standard metabarcoding practices involve amplification of the mitochondrial cytochrome c oxidase subunit 1 (CO1) region, which is universally amplifiable across the majority of animal taxa. This region, however, does not work well as a DNA barcode for plants and fungi, and there is no similar universal single barcode locus with the same species resolution. Generating reference sequences has therefore been more difficult, and several loci have been suggested for use in parallel to reach species-level identification. For this reason, we developed a multi-gene targeted capture approach to generate reference DNA sequences for plant taxa across 20 target chloroplast gene regions in a single assay. We successfully compiled a reference database for 93 temperate coastal plants including seagrasses, mangroves, and saltmarshes/samphires. We demonstrate the importance of a comprehensive reference database to prevent species going undetected in eDNA studies. We also investigate how using multiple chloroplast gene regions impacts the ability to discriminate between taxa.

Current limitations and future prospects of detection and biomonitoring of NIS in the Mediterranean Sea through environmental DNA

NeoBiota ◽

10.3897/neobiota.70.71862 ◽

2021 ◽

Vol 70 ◽

pp. 151-165

Author(s):

Francesco Zangaro ◽

Benedetta Saccomanno ◽

Eftychia Tzafesta ◽

Fabio Bozzeda ◽

Valeria Specchia ◽

...

Keyword(s):

Mediterranean Sea ◽

Dna Barcode ◽

Environmental Dna ◽

Special Focus ◽

Indigenous Species ◽

Dna Barcodes ◽

Identification Of Species ◽

The Mediterranean ◽

Taxonomic Groups ◽

Non Indigenous Species

The biodiversity of the Mediterranean Sea is currently threatened by the introduction of Non-Indigenous Species (NIS). Therefore, monitoring the distribution of NIS is of utmost importance to preserve the ecosystems. A promising approach for the identification of species and the assessment of biodiversity is the use of DNA barcoding, as well as DNA and eDNA metabarcoding. Currently, the main limitation in the use of genomic data for species identification is the incompleteness of the DNA barcode databases. In this research, we assessed the availability of DNA barcodes in the main reference libraries for the most up-to-date inventory of 665 confirmed NIS in the Mediterranean Sea, with a special focus on the cytochrome oxidase I (COI) barcode and primers. The results of this study show that there are no barcodes for 33.18% of the species in question, and that 45.30% of the 382 species with a COI barcode have no publicly available primers. This highlights the importance of directing scientific efforts to fill the barcode gap for specific taxonomic groups, in order to support the effective application of the eDNA technique for investigating the occurrence and distribution of NIS in the Mediterranean Sea.

PLANiTS: a curated sequence reference dataset for plant ITS DNA metabarcoding

Database ◽

10.1093/database/baz155 ◽

2020 ◽

Vol 2020 ◽

Cited By ~ 6

Author(s):

Elisa Banchi ◽

Claudio G Ametrano ◽

Samuele Greco ◽

David Stanković ◽

Lucia Muggia ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Its Region ◽

Computational Effort ◽

Its Sequences ◽

Reference Dataset ◽

Bioinformatic Pipeline ◽

Taxonomic Level ◽

Dna Metabarcoding ◽

Reference Databases

Abstract DNA metabarcoding combines DNA barcoding with high-throughput sequencing to identify different taxa within environmental communities. The ITS has already been proposed and widely used as a universal barcode marker for plants, but a comprehensive, updated and accurate reference dataset of plant ITS sequences has not been available so far. Here, we constructed reference datasets of Viridiplantae ITS1, ITS2 and entire ITS sequences, including both Chlorophyta and Streptophyta. The sequences were retrieved from NCBI, and the ITS region was extracted. The sequences underwent an identity check to remove misidentified records and were clustered at 99% identity to reduce redundancy and computational effort. For this step, we developed a script called ‘better clustering for QIIME’ (bc4q) to ensure that the representative sequences are chosen according to the composition of the cluster at a different taxonomic level. The three datasets obtained with the bc4q script are PLANiTS1 (100 224 sequences), PLANiTS2 (96 771 sequences) and PLANiTS (97 550 sequences), and all are pre-formatted for QIIME, this being the most used bioinformatic pipeline for metabarcoding analysis. As curated and updated reference databases, PLANiTS1, PLANiTS2 and PLANiTS are proposed as a reliable, pivotal first step towards a general standardization of plant DNA metabarcoding studies. The bc4q script is presented as a new tool useful for any research dealing with sequence clustering. Database URL: https://github.com/apallavicini/bc4q; https://github.com/apallavicini/PLANiTS.
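
One possible reading of "representative sequences chosen according to the composition of the cluster" is to pick, for each cluster, a member that carries the cluster's most common label at a chosen taxonomic rank. The Python sketch below illustrates that reading only. It is not the bc4q script, whose actual selection rule is not described in this abstract, and the function name, record layout and placeholder labels are assumptions.

```python
from collections import Counter

def pick_representative(cluster, rank):
    """Pick the first member whose label at `rank` is the most common in the cluster."""
    counts = Counter(member["taxonomy"][rank] for member in cluster)
    majority_label, _ = counts.most_common(1)[0]
    for member in cluster:
        if member["taxonomy"][rank] == majority_label:
            return member

# Hypothetical cluster of sequences already grouped at some identity threshold.
cluster = [
    {"id": "s1", "taxonomy": {"genus": "genus_B"}},
    {"id": "s2", "taxonomy": {"genus": "genus_A"}},
    {"id": "s3", "taxonomy": {"genus": "genus_A"}},
]
representative = pick_representative(cluster, rank="genus")
```

Choosing by majority label, rather than simply taking the first or longest sequence, keeps a single mislabelled or minority record from defining the taxonomy of the whole cluster.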
