Bacteria are everywhere, even in your COI data: The art of getting to know the unknown unknowns and shine light on the dark matter!

ARPHA Conference Abstracts ◽

10.3897/aca.4.e64966 ◽

2021 ◽

Vol 4 ◽

Author(s):

Haris Zafeiropoulos ◽

Laura Gargan ◽

Christina Pavloudi ◽

Evangelos Pafilis ◽

Jens Carlsson

Keyword(s):

Dark Matter ◽

Phylogenetic Tree ◽

Sequence Data ◽

Marker Gene ◽

Environmental Dna ◽

Tree Of Life ◽

Taxonomic Assignment ◽

Species Identity ◽

Consensus Sequences ◽

Coi Sequences

Environmental DNA (eDNA) metabarcoding has been commonly used in recent years (Jeunen et al. 2019) to identify the species composition of environmental samples. By making use of genetic markers anchored in conserved gene regions, universally present across the species of large taxonomic groups, eDNA metabarcoding exploits both extra- and intra-cellular DNA fragments for biodiversity assessment. However, there is no truly “universal” marker gene capable of amplifying all species across different taxa (Kress et al. 2015). The mitochondrial cytochrome C oxidase subunit I gene (COI) has many of the desirable properties of a “universal” marker and has been widely used for assessing species identity in Eukaryotes, especially metazoans (Andújar et al. 2018). However, a great number of COI Operational Taxonomic Units (OTUs) and/or Amplicon Sequence Variants (ASVs) retrieved from such studies do not match reference sequences and are often referred to as “dark matter” (Deagle et al. 2014). The aim of this study was to discover the origins and identities of these COI dark matter sequences. We built a reference phylogenetic tree that included as much COI-sequence-related information across the tree of life as possible. An overview of the steps followed is presented in Fig. 1a. Briefly, the Midori reference 2 database was used to retrieve eukaryotic sequences (183,330 species). In addition, the API of the BOLD database was used as the source for the corresponding Bacteria (559 genera) and Archaea (41 genera) sequences. Consensus sequences at the family level were constructed from each of these three initial COI datasets. The COI-oriented reference phylogenetic tree of life was then built from 1,240 consensus sequences, more than 80% of which came from eukaryotic taxa. Phylogeny-based taxonomic assignment was then used to place query sequences.
Three subsets of OTUs from marine and freshwater samples, retrieved during in-house metabarcoding experiments, were placed in the reference tree (Fig. 1b): a) the total set of sequences, b) the sequences assigned to Eukaryotes and c) the unassigned sequences. It is clear that a large proportion of sequences targeting the COI region of Eukaryotes actually represents bacterial branches in the phylogenetic tree (Fig. 1b). We conclude that COI metabarcoding studies targeting Eukaryotes may come with a great bias derived from amplification and sequencing of bacterial taxa, depending on the primer pair used. However, for the time being, publicly available bacterial COI sequences are far too few to represent the bacterial variability; thus, a reliable taxonomic identification of them is not possible. We suggest that bacterial COI sequences should be included in the reference databases used for the taxonomy assignment of OTUs/ASVs in COI-based eukaryote metabarcoding studies, enabling researchers to exclude amplified non-target bacterial sequences. Further, the approach presented here allows researchers to better understand the unknown unknowns and shed light on the dark matter of their metabarcoding sequence data.
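
The family-level consensus step described in the abstract above can be illustrated with a minimal sketch. The abstract does not say how the consensus sequences were computed, so the majority-rule approach, the function names and the input format (pre-aligned sequences tagged with a family name) are all assumptions made for illustration only.

```python
from collections import Counter


def majority_consensus(aligned_seqs, gap="-"):
    """Return a majority-rule consensus of equal-length aligned sequences.

    Hypothetical helper: the abstract does not describe the consensus method.
    """
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    consensus = []
    for column in zip(*aligned_seqs):
        counts = Counter(base.upper() for base in column)
        # Most frequent symbol wins; ties are broken by character order
        # so that the result is deterministic.
        best = min(counts, key=lambda b: (-counts[b], b))
        if best != gap:  # columns dominated by gaps are dropped
            consensus.append(best)
    return "".join(consensus)


def consensus_by_family(records):
    """Group (family, aligned_sequence) pairs and build one consensus each."""
    groups = {}
    for family, seq in records:
        groups.setdefault(family, []).append(seq)
    return {family: majority_consensus(seqs) for family, seqs in groups.items()}


if __name__ == "__main__":
    # Toy, made-up records; real input would be aligned COI sequences.
    records = [
        ("Fam1", "ACGT"), ("Fam1", "ACGA"), ("Fam1", "ACGT"),
        ("Fam2", "TT-A"), ("Fam2", "TT-A"),
    ]
    print(consensus_by_family(records))
```

Each family is assumed to carry its own alignment, so sequences only need equal length within a family.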


Bacteria are everywhere, even in your COI marker gene data!

10.1101/2021.07.10.451903 ◽

2021 ◽

Author(s):

Haris Zafeiropoulos ◽

Laura Gargan ◽

Sanni Hintikka ◽

Christina Pavloudi ◽

Jens Carlsson

Keyword(s):

Dark Matter ◽

Phylogenetic Tree ◽

Marker Gene ◽

Software Tool ◽

Reference Sequence ◽

Taxonomic Assignment ◽

Consensus Sequences ◽

Gene Coi ◽

Domains Of Life ◽

Coi Sequences

The mitochondrial cytochrome C oxidase subunit I gene (COI) is commonly used in eDNA metabarcoding studies, especially for assessing metazoan diversity. Yet, a great number of COI operational taxonomic units and/or amplicon sequence variants retrieved from such studies do not get a taxonomic assignment with a reference sequence and are referred to as "dark matter". For a thorough investigation of this dark matter, we have developed the Dark mAtteR iNvestigator (DARN) software tool. A reference COI-oriented phylogenetic tree was built from 1,240 consensus sequences covering all three domains of life, with more than 80% of those representing eukaryotic taxa. With respect to eukaryotes, consensus sequences at the family level were constructed from 183,330 retrieved from the Midori reference 2 database. Similarly, sequences from 559 bacterial genera and 41 archaeal genera were retrieved from the BOLD database. DARN makes use of the phylogenetic tree to investigate and quantify pre-processed sequences of amplicon samples and to provide both a tabular and a graphical overview of phylogenetic assignments. To evaluate DARN, both environmental and bulk metabarcoding samples from different aquatic environments, amplified with various primer sets, were analysed. We demonstrate that a large proportion of non-target prokaryotic organisms such as bacteria and archaea are also amplified in eDNA samples, and we suggest that bacterial COI sequences be included in the reference databases used for the taxonomy assignment to allow for further analyses of dark matter. DARN source code is available on GitHub at https://github.com/hariszaf/darn and as a Docker image at https://hub.docker.com/r/hariszaf/darn.


Decona: From demultiplexing to consensus for Nanopore amplicon data

ARPHA Conference Abstracts ◽

10.3897/aca.4.e65029 ◽

2021 ◽

Vol 4 ◽

Author(s):

Saskia Oosterbroek ◽

Karlijn Doorenspleet ◽

Reindert Nijland ◽

Lara Jansen

Keyword(s):

Sequence Data ◽

Variant Calling ◽

Environmental Dna ◽

Laptop Computer ◽

Consensus Sequences ◽

Sequencing Errors ◽

Blast Output ◽

Command Line Tool ◽

Microbial Symbionts ◽

User Friendly

Sequencing of long amplicons is one of the major benefits of Nanopore technologies, as it allows for reads much longer than those produced by Illumina platforms. One of the major challenges for the analysis of these long Nanopore reads is the relatively high error rate. Sequencing errors are generally corrected by consensus generation and polishing. This is still a challenge for mixed samples such as metabarcoding environmental DNA, bulk DNA, mixed amplicon PCRs and contaminated samples, because sequence data have to be clustered before consensus generation. To this end, we developed Decona (https://github.com/Saskia-Oosterbroek/decona), a command line tool that creates consensus sequences from mixed (metabarcoding) samples using a single command. Decona uses the CD-HIT algorithm to cluster reads after demultiplexing (qcat) and filtering (NanoFilt). The sequences in each cluster are subsequently aligned (Minimap2), consensus sequences are generated (Racon) and finally polished (Medaka). Variant calling of the clusters (Medaka) is optional. With the integration of the BLAST+ application, Decona not only generates consensus sequences but also produces BLAST output if desired. The program can be used on a laptop computer, making it suitable for use under field conditions. Amplicon data ranging from 300 to 7,500 nucleotides were successfully processed by Decona, creating consensus sequences reaching over 99.9% read identity. This included fish datasets (environmental DNA from filtered water) from a curated aquarium, vertebrate datasets that were contaminated with human sequences, and sponge datasets in which sponge sequences were separated from their countless microbial symbionts. Decona considerably simplifies and speeds up post-sequencing processes, providing consensus sequences and BLAST output through a single command. Classifying consensus sequences instead of raw sequences improves classification accuracy and drastically decreases the number of sequences that need to be classified.
Overall, it is a user-friendly option for researchers with limited knowledge of script-based data processing.
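
The reason reads must be clustered before consensus generation can be shown with a toy sketch of greedy incremental clustering. This is not Decona's code and not the CD-HIT implementation: the similarity measure (Python's `difflib.SequenceMatcher` ratio), the threshold value and the function name are stand-in assumptions chosen only to keep the example self-contained.

```python
from difflib import SequenceMatcher


def greedy_cluster(reads, threshold=0.8):
    """Greedy incremental clustering (illustrative stand-in, not CD-HIT itself).

    Reads are visited from longest to shortest. Each read joins the first
    cluster whose representative it matches with similarity >= threshold;
    otherwise it becomes the representative of a new cluster.
    """
    clusters = []  # list of (representative, members) pairs
    for read in sorted(reads, key=len, reverse=True):
        for representative, members in clusters:
            similarity = SequenceMatcher(
                None, representative, read, autojunk=False
            ).ratio()
            if similarity >= threshold:
                members.append(read)
                break
        else:
            # No existing cluster was close enough: start a new one.
            clusters.append((read, [read]))
    return clusters


if __name__ == "__main__":
    # Made-up reads: two near-identical amplicons and one unrelated read.
    reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC"]
    for representative, members in greedy_cluster(reads):
        print(representative, len(members))
```

A per-cluster consensus would then be built from each member list, so that reads from different taxa are never averaged together.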


PEMA v2: addressing metabarcoding bioinformatics analysis challenges

ARPHA Conference Abstracts ◽

10.3897/aca.4.e64902 ◽

2021 ◽

Vol 4 ◽

Author(s):

Haris Zafeiropoulos ◽

Christina Pavloudi ◽

Evangelos Pafilis

Keyword(s):

High Performance ◽

Bioinformatics Analysis ◽

Marker Gene ◽

Environmental Dna ◽

Third Party ◽

Reference Database ◽

Marker Genes ◽

Specific Reference ◽

Taxonomic Assignment ◽

Internal Joint

Environmental DNA (eDNA) and metabarcoding have launched a new era in bio- and eco-assessment in recent years (Ruppert et al. 2019). The simultaneous identification, at the lowest taxonomic level possible, of a mixture of taxa from a great range of samples is now feasible; thus, the number of eDNA metabarcoding studies has increased radically (Deiner and 2017). While the experimental part of eDNA metabarcoding can be rather challenging depending on the special characteristics of each study, computational issues are considered to be its major bottlenecks. Among the latter, the bioinformatics analysis of metabarcoding data and especially the taxonomy assignment of the sequences are fundamental challenges. Many steps are required to obtain taxonomically assigned matrices from raw data. For most of these, a plethora of tools is available. However, each tool's execution parameters need to be tailored to reflect each experiment's idiosyncrasy; thus, tuning the bioinformatics analysis has proved fundamental (Kamenova 2020). The computational capacity of high-performance computing (HPC) systems is frequently required for such analyses. On top of that, the imperfect completeness and correctness of the reference taxonomy databases is another important issue (Loos et al. 2020). Based on third-party tools, we have developed the Pipeline for Environmental Metabarcoding Analysis (PEMA), an HPC-centered, containerized assembly of key metabarcoding analysis tools. PEMA combines state-of-the-art technologies and algorithms with an easy to get-set-use framework, allowing researchers to thoroughly tune each study thanks to roll-back checkpoints and on-demand partial pipeline execution features (Zafeiropoulos 2020). Once PEMA was released, users soon highlighted two main pitfalls: PEMA supported only 4 marker genes and was bound to specific reference databases.
In this new version of PEMA, the analysis of any marker gene is now possible thanks to a new feature that allows classifiers to be trained on a user-provided reference database and then used for taxonomic assignment. Fig. 1 shows the taxonomy-assignment-related PEMA modules; all those outside the dashed box have been developed for this new PEMA release. As shown, the RDPClassifier has been trained with Midori reference 2 and added as an option, classifying not only metazoans but sequences from all taxonomic groups of Eukaryotes in the case of the COI marker gene. A PEMA documentation site is now also available. PEMA.v2 containers are available via DockerHub and SingularityHub as well as through the Elixir Greece AAI Service. PEMA has also been selected to be part of the LifeWatch ERIC Internal Joint Initiative for the analysis of ARMS data and will soon be available through the Tesseract VRE.


Anacapa Toolkit: an environmental DNA toolkit for processing multilocus metabarcode datasets

10.1101/488627 ◽

2018 ◽

Cited By ~ 1

Author(s):

Emily E. Curd ◽

Zack Gold ◽

Gaurav S Kandlikar ◽

Jesse Gomer ◽

Max Ogden ◽

...

Keyword(s):

High Throughput Sequencing ◽

Sequence Data ◽

Environmental Dna ◽

Taxonomic Diversity ◽

R Package ◽

Community Diversity ◽

Taxonomic Assignment ◽

Lowest Common Ancestor ◽

Reference Databases ◽

Comprehensive Reference

1. Environmental DNA (eDNA) metabarcoding is a promising method to monitor species and community diversity that is rapid, affordable, and non-invasive. Longstanding needs of the eDNA community are modular informatics tools, comprehensive and customizable reference databases, flexibility across high-throughput sequencing platforms, fast multilocus metabarcode processing, and accurate taxonomic assignment. As bioinformatics tools continue to improve, addressing each of these demands within a single bioinformatics toolkit is becoming a reality.
2. We present the modular metabarcode sequence toolkit Anacapa (https://github.com/limey-bean/Anacapa/), which addresses the above needs, allowing users to build comprehensive reference databases and assign taxonomy to raw multilocus metabarcode sequence data. A novel aspect of Anacapa is our database building module, Creating Reference libraries Using eXisting tools (CRUX), which generates comprehensive reference databases for specific user-defined metabarcode loci. The Quality Control and Dereplication module sorts and processes multiple metabarcode loci and processes merged, unmerged and unpaired reads, maximizing recovered diversity. This is followed by amplicon sequence variant (ASV) detection using DADA2. The Anacapa Classifier module aligns these ASVs to CRUX-generated reference databases using Bowtie2. Taxonomy is assigned to ASVs with confidence scores using a Bayesian Lowest Common Ancestor (BLCA) method. The Anacapa Toolkit also includes an R package, ranacapa, for automated results exploration through standard biodiversity statistical analysis.
3. We performed a series of benchmarking tests to verify that the Anacapa Toolkit generates comprehensive reference databases that capture wide taxonomic diversity and that it can assign high-quality taxonomy to both MiSeq-length and HiSeq-length sequence data. We demonstrate the value of the Anacapa Toolkit in assigning taxonomy to eDNA sequences from seawater samples from southern California, including the capability of this toolkit to process multilocus metabarcoding data.
4. The Anacapa Toolkit broadens the exploration of eDNA and assists in biodiversity assessment and management by generating metabarcode-specific databases, processing multilocus data, retaining all read types, and expanding non-traditional eDNA targets. Anacapa software and source code are open and available in a virtual container to ease installation.
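
The lowest-common-ancestor idea behind the taxonomy assignment step can be sketched as follows. This is a plain, non-Bayesian LCA over the lineages of a query's reference hits; it does not reproduce the BLCA method or its confidence scores, and the function name and lineage format are assumptions made for illustration.

```python
def lowest_common_ancestor(lineages):
    """Return the longest shared prefix of several taxonomic lineages.

    Each lineage is a list of names ordered from the highest rank downwards.
    Plain LCA only: no weighting of hits and no confidence scores.
    """
    if not lineages:
        return []
    shared = []
    for names_at_rank in zip(*lineages):
        if all(name == names_at_rank[0] for name in names_at_rank):
            shared.append(names_at_rank[0])
        else:
            break  # first disagreement ends the shared lineage
    return shared


if __name__ == "__main__":
    # Made-up lineages for the reference hits of one query sequence.
    hits = [
        ["Eukaryota", "Chordata", "Actinopterygii", "Perciformes"],
        ["Eukaryota", "Chordata", "Actinopterygii", "Gadiformes"],
        ["Eukaryota", "Chordata", "Mammalia", "Primates"],
    ]
    print(lowest_common_ancestor(hits))
```

When the hits disagree below a given rank, the query is assigned only as deep as the lineages still agree, which trades resolution for a lower risk of over-assignment.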


Techniques for the verification of minimal phylogenetic trees illustrated with ten mammalian haemoglobin sequences

Biochemical Journal ◽

10.1042/bj1870065 ◽

1980 ◽

Vol 187 (1) ◽

pp. 65-74 ◽

Cited By ~ 12

Author(s):

D Penny ◽

M D Hendy ◽

L R Foulds

Keyword(s):

Amino Acid ◽

Phylogenetic Tree ◽

Protein Sequence ◽

Phylogenetic Trees ◽

Sequence Data ◽

Protein Sequences ◽

Nucleotide Sequences ◽

Amino Acid Sequences ◽

Minimal Tree ◽

Protein Sequence Data

We have recently reported a method to identify the shortest possible phylogenetic tree for a set of protein sequences [Foulds, Hendy & Penny (1979) J. Mol. Evol. 13, 127–150; Foulds, Penny & Hendy (1979) J. Mol. Evol. 13, 151–166]. The present paper discusses issues that arise during the construction of minimal phylogenetic trees from protein-sequence data. The conversion of the data from amino acid sequences into nucleotide sequences is shown to be advantageous. A new variation of a method for constructing a minimal tree is presented. Our previous methods have involved first constructing a tree and then either proving that it is minimal or transforming it into a minimal tree. The approach presented in the present paper progressively builds up a tree, taxon by taxon. We illustrate this approach by using it to construct a minimal tree for ten mammalian haemoglobin alpha-chain sequences. Finally we define a measure of the complexity of the data and illustrate a method to derive a directed phylogenetic tree from the minimal tree.
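
A "minimal" tree in this sense is one that requires the fewest changes to explain the data. As a hedged illustration, the sketch below scores a fixed tree with Fitch's small-parsimony algorithm, a standard technique that is swapped in here for clarity and is not the authors' tree-building procedure. The nested-tuple tree format and the toy alignment are assumptions.

```python
def fitch_site(tree, states):
    """Return (possible_root_states, change_count) for one alignment column.

    `tree` is a leaf name (str) or a pair (left_subtree, right_subtree);
    `states` maps each leaf name to its character at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # The children can share a state: no extra change at this node.
        return common, left_cost + right_cost
    # The children disagree: one change is needed on one of the branches.
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, alignment):
    """Sum the Fitch change counts over all sites of an alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


if __name__ == "__main__":
    # Made-up nucleotide alignment for four taxa.
    alignment = {"a": "AAG", "b": "AAG", "c": "ACT", "d": "ACT"}
    print(tree_length((("a", "b"), ("c", "d")), alignment))
    print(tree_length((("a", "c"), ("b", "d")), alignment))
```

Comparing the lengths of alternative topologies in this way is the basic operation that any search for a minimal tree relies on.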


Phylogenetic Relations of Rhizoplaca Zopf. from Anatolia Inferred from ITS Sequence Data

Zeitschrift für Naturforschung C ◽

10.1515/znc-2006-5-617 ◽

2006 ◽

Vol 61 (5-6) ◽

pp. 405-412 ◽

Cited By ~ 4

Author(s):

Demet Cansaran ◽

Sümer Aras ◽

İrfan Kandemir ◽

Gökhan Halıcı

Keyword(s):

Genetic Variability ◽

Phylogenetic Tree ◽

Sequence Data ◽

Volcanic Rocks ◽

Morphological Characteristics ◽

Its Rdna ◽

Its Sequence ◽

Phylogenetic Relations ◽

Rdna Sequences

Like many lichen-forming fungi, species of the genus Rhizoplaca have wide geographical distributions, but studies of their genetic variability are limited. ITS rDNA sequences of three species of Rhizoplaca from Anatolia were generated and aligned with those of other species from other countries and with data from Lecanora species. The examined species were collected from the volcanic rocks of Mount Erciyes, which is located in the middle of Anatolia (Turkey). The sequence data were aligned with eight other samples of Rhizoplaca and six different species of Lecanora obtained from GenBank. The results support the concept maintained by Arup and Grube (2000) that Rhizoplaca may not be a genus separate from Lecanora. According to the phylogenetic tree, Rhizoplaca melanophthalma from Turkey, two different samples of R. melanophthalma from Arizona (AF159929, AF159934) and a sample from Austria formed a group under the same branch. R. peltata and R. chrysoleuca samples from Anatolia, located in two other branches of the tree, formed sister groups with the samples of the same species from different countries. Although R. peltata remained on the same branch as other samples of the same species from other countries, it was placed in a different branch within the group. When the three species from Anatolia were considered alone, it was noticed that Rhizoplaca melanophthalma and Rhizoplaca peltata are phylogenetically closer to each other than to Rhizoplaca chrysoleuca; the morphological characteristics also support this result.


Characterising Foot-and-Mouth Disease Virus in Clinical Samples Using Nanopore Sequencing

Frontiers in Veterinary Science ◽

10.3389/fvets.2021.656256 ◽

2021 ◽

Vol 8 ◽

Author(s):

Emma Brown ◽

Graham Freimanis ◽

Andrew E. Shaw ◽

Daniel L. Horton ◽

Simon Gubbins ◽

...

Keyword(s):

Cell Culture ◽

Sequence Data ◽

Foot And Mouth Disease ◽

Mouth Disease ◽

Cell Culture Supernatant ◽

Strain Identification ◽

Consensus Sequences ◽

Fmdv Serotypes ◽

Mouth Disease Virus ◽

Foot And Mouth

The sequencing of viral genomes provides important data for the prevention and control of foot-and-mouth disease (FMD) outbreaks. Sequence data can be used for strain identification, outbreak tracing, and aiding the selection of the most appropriate vaccine for the circulating strains. At present, sequencing of FMD virus (FMDV) relies upon the time-consuming transport of samples to well-resourced laboratories. The Oxford Nanopore Technologies MinION portable sequencer has the potential to allow sequencing in remote, decentralised laboratories closer to the outbreak location. In this study, we investigated the utility of the MinION to generate sequence data of sufficient quantity and quality for the characterisation of FMDV serotypes O, A and Asia 1. Prior to sequencing, a universal two-step RT-PCR was used to amplify parts of the 5′UTR, as well as the leader, capsid and parts of the 2A encoding regions of FMDV RNA extracted from three sample matrices: cell culture supernatant, tongue epithelial suspension and oral swabs. The resulting consensus sequences were compared with reference sequences generated on the Illumina MiSeq platform. Consensus sequences with an accuracy of 100% were achieved within 10 and 30 min from the start of the sequencing run when using RNA extracted from cell culture supernatants and tongue epithelial suspensions, respectively. In contrast, sequencing from swabs required up to 2.5 h. Together, these results demonstrate that the MinION sequencer can be used to accurately and rapidly characterise serotypes A, O, and Asia 1 of FMDV using amplicons generated from a variety of different sample matrices.


Characterization of an unusually conserved AluI highly reiterated DNA sequence family from the honeybee, Apis mellifera.

Genetics ◽

10.1093/genetics/134.4.1195 ◽

1993 ◽

Vol 134 (4) ◽

pp. 1195-1204

Author(s):

S Tarès ◽

J M Cornuet ◽

P Abad

Keyword(s):

Apis Mellifera ◽

Dna Sequence ◽

Dna Sequences ◽

Sequence Data ◽

Sequence Divergence ◽

Repeated Sequence ◽

Consensus Sequences ◽

Dna Sequence Data ◽

Repeat Class ◽

Honeybee Subspecies

An AluI family of highly reiterated nontranscribed sequences has been found in the genome of the honeybee Apis mellifera. This repeated sequence is shown to be present at approximately 23,000 copies per haploid genome, constituting about 2% of the total genomic DNA. The nucleotide sequence of 10 monomers was determined. The consensus sequence is 176 nucleotides long and has an A + T content of 58%. There are clusters of both direct and inverted repeats. Internal subrepeating units ranging from 11 to 17 nucleotides are observed, suggesting that it could have evolved from a shorter sequence. DNA sequence data reveal that this repeat class is unusually homogeneous compared to the other classes of invertebrate highly reiterated DNA sequences. The average pairwise sequence divergence between the repeats is 2.5%. In spite of this unusual homogeneity, divergence has been found in the repeated sequence hybridization ladder between four different honeybee subspecies. Therefore, the AluI highly reiterated sequences provide a new probe for fingerprinting in A. m. mellifera.
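
The two summary statistics quoted in the abstract, A + T content and average pairwise sequence divergence, can be computed as in the minimal sketch below. It assumes pre-aligned, equal-length monomers and uses the uncorrected proportion of differing sites (p-distance) as the divergence measure; the abstract does not state which measure was used, and the toy sequences are made up.

```python
from itertools import combinations


def at_content(seq):
    """Fraction of A and T bases in a sequence."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq)


def p_distance(a, b):
    """Proportion of sites that differ between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_pairwise_divergence(seqs):
    """Average p-distance over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Made-up monomers, not the published AluI sequences.
    monomers = ["ACGTACGTAC", "ACGTACGTAA", "ACGTACGTAC"]
    print(at_content(monomers[0]))
    print(mean_pairwise_divergence(monomers))
```

As a sanity check on the figures in the abstract, 23,000 copies of a 176-nucleotide monomer amount to 23,000 × 176 = 4,048,000 nucleotides of repeat sequence per haploid genome.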


debar, a sequence-by-sequence denoiser for COI-5P DNA barcode data

10.1101/2021.01.04.425285 ◽

2021 ◽

Author(s):

Cameron M. Nugent ◽

Tyler A. Elliott ◽

Sujeevan Ratnasingham ◽

Paul D. N. Hebert ◽

Sarah J. Adamowicz

Keyword(s):

High Throughput Sequencing ◽

Dna Barcode ◽

R Package ◽

Error Rates ◽

Real World Data ◽

Species Discovery ◽

Consensus Sequences ◽

In Silico Studies ◽

Coi Sequences

DNA barcoding and metabarcoding are now widely used to advance species discovery and biodiversity assessments. High-throughput sequencing (HTS) has expanded the volume and scope of these analyses, but elevated error rates introduce noise into sequence records that can inflate estimates of biodiversity. Denoising of barcode and metabarcode data, that is, the separation of biological signal from instrument (technical) noise, currently employs abundance-based methods which do not capitalize on the highly conserved structure of the cytochrome c oxidase subunit I (COI) region employed as the animal barcode. This manuscript introduces debar, an R package that utilizes a profile hidden Markov model to denoise indel errors in COI sequences introduced by instrument error. In silico studies demonstrated that debar recognized 95% of artificially introduced indels in COI sequences. When applied to real-world data, debar reduced indel errors in circular consensus sequences obtained with the Sequel platform by 75%, and those generated on the Ion Torrent S5 by 94%. The false correction rate was less than 0.1%, indicating that debar is receptive to the majority of true COI variation in the animal kingdom. In conclusion, the debar package improves DNA barcode and metabarcode workflows by aiding the generation of more accurate sequences, which in turn aids the characterization of species diversity.


Phylogenetic and Chemical Diversity of a Hybrid-Isoprenoid-Producing Streptomycete Lineage

Applied and Environmental Microbiology ◽

10.1128/aem.01814-13 ◽

2013 ◽

Vol 79 (22) ◽

pp. 6894-6902 ◽

Cited By ~ 13

Author(s):

Kelley A. Gallagher ◽

Kristin Rauscher ◽

Laura Pavan Ioca ◽

Paul R. Jensen

Keyword(s):

Secondary Metabolites ◽

Secondary Metabolite ◽

Sequence Data ◽

Environmental Dna ◽

Chemical Diversity ◽

Phenotypic Trait ◽

Content Type ◽

Operational Taxonomic Units ◽

Culture Independent ◽

Structural Classes

Streptomyces species dedicate a large portion of their genomes to secondary metabolite biosynthesis. A diverse and largely marine-derived lineage within this genus has been designated MAR4 and identified as a prolific source of hybrid isoprenoid (HI) secondary metabolites. These terpenoid-containing compounds are common in nature but rarely observed as bacterial secondary metabolites. To assess the phylogenetic diversity of the MAR4 lineage, complementary culture-based and culture-independent techniques were applied to marine sediment samples collected off the Channel Islands, CA. The results, including those from an analysis of publicly available sequence data and strains isolated as part of prior studies, placed 40 new strains in the MAR4 clade, of which 32 originated from marine sources. When combined with sequences cloned from environmental DNA, 28 MAR4 operational taxonomic units (0.01% genetic distance) were identified. Of these, 82% consisted exclusively of either cloned sequences or cultured strains, supporting the complementarity of these two approaches. Chemical analyses of diverse MAR4 strains revealed the production of five different HI structure classes. All 21 MAR4 strains tested produced at least one HI class, with most strains producing from two to four classes. The two major clades within the MAR4 lineage displayed distinct patterns in the structural classes and the number and amount of HIs produced, suggesting a relationship between taxonomy and secondary metabolite production. The production of HI secondary metabolites appears to be a phenotypic trait of the MAR4 lineage, which represents an emerging model with which to study the ecology and evolution of HI biosynthesis.
