MetaCurator: A hidden Markov model-based toolkit for extracting and curating sequences from taxonomically-informative genetic markers

2019
Author(s): Rodney T. Richardson, Douglas B. Sponsler, Harper McMinn-Sauder, Reed M. Johnson

Summary: The community-level analysis of samples containing diverse genetic material, via metabarcoding and metagenomic approaches, is increasingly popular. While the production of sequence data for such studies has become straightforward, questions remain about how best to analyze and taxonomically characterize sequence data. For many sequence classification approaches, an important component of the workflow involves the curation of reference sequences. Ideally, this involves trimming away extraneous sequence at the 3′ and 5′ ends of the target marker of interest, as well as the removal of reference sequence duplicates. Here, we present MetaCurator, a software package written in Python, designed for automated reference sequence curation and highly generalizable across markers and study systems. MetaCurator is organized in a modular fashion, so users can implement tools individually in addition to utilizing the automated and flexible MetaCurator parental code. Aside from modules used to organize and format taxonomic lineage data, MetaCurator contains two signature tools. IterRazor utilizes profile hidden Markov models and an iterative search framework to exhaustively identify and extract the precise amplicon marker of interest from available reference sequence data. DerepByTaxonomy then facilitates sequence dereplication using a taxonomically aware approach, removing duplicates only when they belong to the same taxon. This is important for cases of incomplete lineage sorting between species and for highly conserved markers, such as plant rbcL and trnL, which often display no sequence divergence across taxa, even at the genus level.
Availability and implementation: MetaCurator is supported on OSX and Linux (RedHat/CentOS) and is freely available under a GPL v3.0 license at https://github.com/RTRichar/[email protected]
Supplementary information: Code associated with this work is available at https://github.com/RTRichar/MetabarcodeDBsV2 and additional analysis is presented in supplementary files.
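The taxonomically aware dereplication described in the summary can be illustrated with a minimal Python sketch. This is not MetaCurator's DerepByTaxonomy code: the function name `derep_by_taxonomy`, the `(seq_id, taxon, sequence)` record layout, and the example sequences are all hypothetical, and the sketch only collapses exact duplicates that share a taxon label.

```python
def derep_by_taxonomy(records):
    """Keep one record per (taxon, sequence) pair.

    records: iterable of (seq_id, taxon, sequence) tuples (hypothetical layout).
    Identical sequences are collapsed only when they carry the same taxon
    label, so a sequence shared by two taxa survives once for each taxon.
    """
    seen = set()
    kept = []
    for seq_id, taxon, sequence in records:
        key = (taxon, sequence.upper())
        if key in seen:
            continue  # exact duplicate within the same taxon: drop it
        seen.add(key)
        kept.append((seq_id, taxon, sequence))
    return kept


# Made-up reference records for illustration only.
refs = [
    ("r1", "Quercus ilex", "ACGTACGT"),
    ("r2", "Quercus ilex", "ACGTACGT"),       # same taxon, same sequence
    ("r3", "Quercus coccifera", "ACGTACGT"),  # same sequence, other taxon
    ("r4", "Quercus ilex", "ACGTTCGT"),       # same taxon, new sequence
]
kept_ids = [rec[0] for rec in derep_by_taxonomy(refs)]
print(kept_ids)
```

Keying on the pair rather than on the sequence alone is what preserves `r3`: a naive dereplication would discard it and lose the information that two taxa share the marker sequence.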

PeerJ, 2016, Vol 4, pp. e1897
Author(s): Marco Cosimo Simeone, Guido W. Grimm, Alessio Papini, Federico Vessella, Simone Cardoni, ...

Nucleotide sequences from the plastome are currently the main source for assessing taxonomic and phylogenetic relationships in flowering plants and their historical biogeography at all hierarchical levels. One major exception is the large and economically important genus Quercus (oaks). Whereas differentiation patterns of the nuclear genome are in agreement with morphology and the fossil record, diversity patterns in the plastome are at odds with established taxonomic and phylogenetic relationships. However, the extent and evolutionary implications of this incongruence have yet to be fully uncovered. The DNA sequence divergence of four Euro-Mediterranean Group Ilex oak species (Quercus ilex L., Q. coccifera L., Q. aucheri Jaub. & Spach., Q. alnifolia Poech.) was explored at three chloroplast markers (rbcL, trnK/matK, trnH-psbA). Phylogenetic relationships were reconstructed including worldwide members of an additional 55 species representing all Quercus subgeneric groups. Family and order sequence data were harvested from gene banks to better frame the observed divergence in larger taxonomic contexts. We found a strong geographic sorting in the focal group and the genus in general that is entirely decoupled from species boundaries. High plastid divergence in members of Quercus Group Ilex, including haplotypes shared with related, but long isolated oak lineages, points towards multiple geographic origins of this group of oaks. The results suggest that incomplete lineage sorting and repeated phases of asymmetrical introgression among ancestral lineages of Group Ilex and two other main Groups of Eurasian oaks (Cyclobalanopsis and Cerris) caused this complex pattern. Comparison with the current phylogenetic synthesis also suggests an initial high- versus mid-latitude biogeographic split within Quercus.
High plastome plasticity of Group Ilex reflects geographic area disruptions, possibly linked with high tectonic activity of past and modern distribution ranges, that did not leave imprints in the nuclear genome of modern species and infrageneric lineages.


The Auk, 2003, Vol 120 (3), pp. 889-907
Author(s): Kim T. Scribner, Sandra L. Talbot, John M. Pearce, Barbara J. Pierson, Karen S. Bollinger, ...

Abstract: Using molecular genetic markers that differ in mode of inheritance and rate of evolution, we examined levels and partitioning of genetic variation for seven nominal subspecies (11 breeding populations) of Canada Geese (Branta canadensis) in western North America. Gene trees constructed from mtDNA control region sequence data show that subspecies of Canada Geese do not have distinct mtDNA. Large- and small-bodied forms of Canada Geese were highly diverged (0.077 average sequence divergence) and represent monophyletic groups. A majority (65%) of 20 haplotypes resolved were observed in single breeding locales. However, within both large- and small-bodied forms, certain haplotypes occurred across multiple subspecies. Population trees for both nuclear (microsatellites) and mitochondrial markers were generally concordant and provide resolution of population and subspecific relationships indicating incomplete lineage sorting. All populations and subspecies were genetically diverged, but to varying degrees. Analyses of molecular variance, nested-clade, and coalescence-based analyses of mtDNA suggest that both historical (past fragmentation) and contemporary forces have been important in shaping current spatial genetic distributions. Gene flow appears to be ongoing, though at different rates, even among currently recognized subspecies. The efficacy of current subspecific taxonomy is discussed in light of hypothesized historical vicariance and current demographic trends of management and conservation concern.


2015
Author(s): Marco Cosimo Simeone, Guido W Grimm, Alessio Papini, Federico Vessella, Simone Cardoni, ...

Nucleotide sequences from the plastome are currently the main source for assessing taxonomic and phylogenetic relationships in flowering plants and their historical biogeography at all hierarchical levels. One exception is the large and economically important genus Quercus (oaks). Whereas differentiation patterns of the nuclear genome are in agreement with morphology and the fossil record, diversity patterns in the plastome are at odds with established taxonomic and phylogenetic relationships. However, the extent and evolutionary implications of this incongruence have yet to be fully uncovered. The DNA sequence divergence of four Euro-Mediterranean Group Ilex oak species (Quercus ilex L., Q. coccifera L., Q. aucheri Jaub. & Spach., Q. alnifolia Poech.) was explored at three chloroplast markers (rbcL, trnK-matK, trnH-psbA). Phylogenetic relationships were reconstructed including worldwide members of an additional 55 species representing all Quercus subgeneric groups. Family and order sequence data were harvested from gene banks to better frame the observed divergence in larger taxonomic contexts. We found a strong geographic sorting in the focal group and the genus in general that is entirely decoupled from species boundaries. Main plastid haplotypes shared by distinct oak lineages from the same geographic region and high plastid diversity in members of Group Ilex are indicative of a polyphyletic origin of their plastomes. The results suggest that incomplete lineage sorting and repeated phases of unidirectional introgression among ancestral lineages of Group Ilex and two other main Groups of Eurasian oaks (Cyclobalanopsis and Cerris) caused this complex pattern. Comparison with the current phylogenetic synthesis also suggests an initial high- versus mid-latitude biogeographic split within Quercus.
High plastome plasticity of Group Ilex reflects geographic area disruptions, possibly linked with high tectonic activity of past and modern distribution ranges, that did not leave imprints in the nuclear genome of modern species and infrageneric lineages.


Insects, 2021, Vol 12 (5), pp. 455
Author(s): Na Ra Jeong, Min Jee Kim, Sung-Soo Kim, Sei-Woong Choi, Iksoo Kim

Conogethes pinicolalis has long been considered as a Pinaceae-feeding type of the yellow peach moth, C. punctiferalis, in Korea. In this study, the divergence of C. pinicolalis from the fruit-feeding moth C. punctiferalis was analyzed in terms of morphology, ecology, and genetics. C. pinicolalis differs from C. punctiferalis in several morphological features. Through field observation, we confirmed that pine trees are the host plants for the first generation of C. pinicolalis larvae, in contrast to fruit-feeding C. punctiferalis larvae. We successfully reared C. pinicolalis larvae to adults by providing them pine needles as a diet. From a genetic perspective, the sequences of mitochondrial COI of these two species substantially diverged by an average of 5.46%; moreover, phylogenetic analysis clearly assigned each species to an independent clade. On the other hand, nuclear EF1α showed a lower sequence divergence (2.10%) than COI. Overall, EF1α-based phylogenetic analysis confirmed each species as an independent clade, but a few haplotypes of EF1α indicated incomplete lineage sorting between these two species. In conclusion, our results demonstrate that C. pinicolalis is an independent species according to general taxonomic criteria; however, analysis of the EF1α sequence revealed a short divergence time.


2019
Author(s): Joshua I Brian, Simon K Davy, Shaun P Wilkinson

Coral reefs rely on their intracellular dinoflagellate symbionts (family Symbiodiniaceae) for nutritional provision in nutrient-poor waters, yet this association is threatened by thermally stressful conditions. Despite this, the evolutionary potential of these symbionts remains poorly characterised. In this study, we tested the potential for divergent Symbiodiniaceae types to sexually reproduce (i.e. hybridise) within Cladocopium, the most ecologically prevalent genus in this family. With sequence data from three organelles (cob gene, mitochondria; psbAncr region, chloroplast; and ITS2 region, nucleus), we utilised the Incongruence Length Difference test, Approximately Unbiased test, tree hybridisation analyses and visual inspection of raw data in stepwise fashion to highlight incongruences between organelles, and thus provide evidence of reticulate evolution. Using this approach, we identified three putative hybrid Cladocopium samples among the 158 analysed, at two of the seven sites sampled. These samples were identified as the common Cladocopium types C40 or C1 with respect to the mitochondria and chloroplasts, but the rarer types C3z, C3u and C1# with respect to their nuclear identity. These five Cladocopium types have previously been confirmed as evolutionarily distinct and were also recovered in non-incongruent samples multiple times, which is strongly suggestive that they sexually reproduced to produce the incongruent samples. A concomitant inspection of Next Generation Sequencing data for these samples suggests that other plausible explanations, such as incomplete lineage sorting, are much less likely. The approach taken in this study allows incongruences between gene regions to be identified with confidence, and brings new light to the evolutionary potential within Symbiodiniaceae.


Zootaxa, 2020, Vol 4750 (3), pp. 328-348
Author(s): David A. Gray, David B. Weissman, Jeffrey A. Cole, Emily Moriarty Lemmon

We present the first comprehensive molecular phylogeny of Gryllus field cricket species found in the United States and Canada, select additional named Gryllus species found in Mexico and the Bahamas, plus the European field cricket G. campestris Linnaeus and the Afro-Eurasian cricket G. bimaculatus De Geer. Acheta, Teleogryllus, and Nigrogryllus were used as outgroups. Anchored hybrid enrichment was used to generate 492,531 base pairs of DNA sequence from 563 loci. RAxML analysis of concatenated sequence data and Astral analysis of gene trees gave broadly congruent results, especially for older branches and overall tree structure. The North American Gryllus are monophyletic with respect to the two Old World taxa; certain sub-groups show rapid recent divergence. This is the first Anchored Hybrid Enrichment study of an insect group done for closely related species within a single genus, and the results illustrate the challenges of reconstructing the evolutionary history of young rapidly diverged taxa when both incomplete lineage sorting and probable hybridization are at play. Because Gryllus field crickets have been used extensively as a model system in evolutionary ecology, behavior, neuro-physiology, speciation, and life-history and life-cycle evolution, these results will help inform, interpret, and guide future research in these areas. 


2019, Vol 37 (4), pp. 1211-1223
Author(s): Tomáš Flouri, Xiyun Jiao, Bruce Rannala, Ziheng Yang

Abstract: Recent analyses suggest that cross-species gene flow or introgression is common in nature, especially during species divergences. Genomic sequence data can be used to infer introgression events and to estimate the timing and intensity of introgression, providing an important means to advance our understanding of the role of gene flow in speciation. Here, we implement the multispecies-coalescent-with-introgression model, an extension of the multispecies-coalescent model to incorporate introgression, in our Bayesian Markov chain Monte Carlo program Bpp. The multispecies-coalescent-with-introgression model accommodates deep coalescence (or incomplete lineage sorting) and introgression and provides a natural framework for inference using genomic sequence data. Computer simulation confirms the good statistical properties of the method, although hundreds or thousands of loci are typically needed to estimate introgression probabilities reliably. Reanalysis of data sets from the purple cone spruce confirms the hypothesis of homoploid hybrid speciation. Using genomic sequence data from six mosquito species in the Anopheles gambiae species complex, we estimated the introgression probability, which varies considerably across the genome, likely driven by differential selection against introgressed alleles.


Author(s): Todd McLay, Gareth D. Holmes, Paul I. Forster, Susan E. Hoebee, Denise R. Fernando

The rainforest genus Gossia N.Snow & Guymer (Myrtaceae) occurs in Australia, Melanesia and Malesia, and is capable of hyperaccumulating the heavy metal manganese (Mn). Here, we used nuclear ribosomal and plastid spacer DNA-sequence data to reconstruct the phylogeny of 19 Australian species of Gossia and eight New Caledonian taxa. Our results indicated that the relationship between Gossia and Austromyrtus (Nied.) Burret is not fully resolved, and most Australian species were supported as monophyletic. Non-monophyly might be related to incomplete lineage sorting or inaccurate taxonomic classification. Bark type appears to be a morphological synapomorphy separating two groups of species, with more recently derived lineages having smooth and mottled ‘python’ bark. New Caledonian species were well resolved in a single clade, but were not the first diverging Gossia lineage, calling into doubt the results of a recent study that found Zealandia as the ancestral area of tribe Myrteae. Within Australia, the evolution of multiple clades has probably been driven by well-known biogeographic barriers. Some species with more widespread distributions have been able to cross these barriers by having a wide range of soil-substrate tolerances. Novel Mn-hyperaccumulating species were identified, and, although Mn hyperaccumulation was not strongly correlated with phylogenetic position, there appeared to be some difference in accumulation levels among clades. Our study is the first detailed phylogenetic investigation of Gossia and will serve as a reference for future studies seeking to understand the origin and extent of hyperaccumulation within the Myrteae and Myrtaceae more broadly.


2018, Vol 30 (1), pp. 216-236
Author(s): Rasmus Troelsgaard, Lars Kai Hansen

Model-based classification of sequence data using a set of hidden Markov models is a well-known technique. The involved score function, which is often based on the class-conditional likelihood, can, however, be computationally demanding, especially for long data sequences. Inspired by recent theoretical advances in spectral learning of hidden Markov models, we propose a score function based on third-order moments. In particular, we propose to use the Kullback-Leibler divergence between theoretical and empirical third-order moments for classification of sequence data with discrete observations. The proposed method provides lower computational complexity at classification time than the usual likelihood-based methods. In order to demonstrate the properties of the proposed method, we perform classification of both simulated data and empirical data from a human activity recognition study.
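The idea of scoring a sequence by comparing theoretical and empirical third-order moments can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the authors' method: the theoretical moments are computed by brute-force enumeration of hidden-state paths rather than by any spectral technique, the KL direction (empirical relative to theoretical) is an assumption, and the two models `sticky` and `alternating`, with all their parameters, are invented for the example.

```python
import math
from collections import Counter
from itertools import product


def theoretical_triples(pi, T, O):
    """P(x1=a, x2=b, x3=c) for an HMM with start distribution pi,
    transition matrix T[i][j], and emission matrix O[state][symbol]."""
    n_states, n_symbols = len(pi), len(O[0])
    moments = {}
    for a, b, c in product(range(n_symbols), repeat=3):
        p = 0.0
        # Sum over every hidden path of length three (brute force).
        for h1, h2, h3 in product(range(n_states), repeat=3):
            p += (pi[h1] * O[h1][a] * T[h1][h2] * O[h2][b]
                  * T[h2][h3] * O[h3][c])
        moments[(a, b, c)] = p
    return moments


def empirical_triples(seq):
    """Relative frequencies of consecutive observation triples."""
    counts = Counter(zip(seq, seq[1:], seq[2:]))
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}


def kl_score(seq, model, eps=1e-12):
    """KL divergence of empirical from theoretical triple distributions
    (direction chosen here as an assumption)."""
    theo = theoretical_triples(*model)
    emp = empirical_triples(seq)
    return sum(p * math.log(p / max(theo[t], eps)) for t, p in emp.items())


def classify(seq, models):
    """Return the name of the model with the smallest KL score."""
    return min(models, key=lambda name: kl_score(seq, models[name]))


# Two invented 2-state, 2-symbol models: (pi, T, O).
emit = [[0.9, 0.1], [0.1, 0.9]]
models = {
    "sticky": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], emit),
    "alternating": ([0.5, 0.5], [[0.1, 0.9], [0.9, 0.1]], emit),
}
label_runs = classify([0] * 10 + [1] * 10, models)
label_flip = classify([0, 1] * 10, models)
print(label_runs, label_flip)
```

Once the theoretical moments of each class are cached, scoring a sequence needs only one pass to count triples, which is where the lower classification-time cost mentioned in the abstract would come from; this sketch recomputes the moments on every call purely for brevity.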

