Measuring genetic differentiation from Pool-seq data

2018 ◽  
Author(s):  
Valentin Hivert ◽  
Raphël Leblois ◽  
Eric J. Petit ◽  
Mathieu Gautier ◽  
Renaud Vitalis

Abstract
The recent advent of high-throughput sequencing and genotyping technologies enables the comparison of patterns of polymorphisms at a very large number of markers. While the characterization of genetic structure from individual sequencing data remains expensive for many non-model species, it has been shown that sequencing pools of individual DNAs (Pool-seq) represents an attractive and cost-effective alternative. However, analyzing sequence read counts from a DNA pool instead of individual genotypes raises statistical challenges in deriving correct estimates of genetic differentiation. In this article, we provide a method-of-moments estimator of FST for Pool-seq data, based on an analysis-of-variance framework. We show, by means of simulations, that this new estimator is unbiased and outperforms previously proposed estimators. We evaluate the robustness of our estimator to model misspecification, such as sequencing errors and uneven contributions of individual DNAs to the pools. Lastly, by reanalyzing published Pool-seq data of different ecotypes of the prickly sculpin Cottus asper, we show how the use of an unbiased FST estimator may question the interpretation of population structure inferred from previous analyses.
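As a toy illustration of the general idea of moment-based FST estimation from pooled read counts (this is not the estimator derived in the article, which additionally corrects for the number of individuals contributing to each pool), a Hudson-style ratio-of-averages over per-site allele counts might be sketched as:

```python
def fst_from_read_counts(counts1, counts2):
    """Toy moment-based FST from per-site read counts in two pools.

    counts1/counts2: lists of (ref_reads, alt_reads) per SNP.
    Uses a Hudson-style ratio of averages with a finite-sample
    correction on read depth only; the Pool-seq estimator in the
    article also accounts for pool size (individuals per pool).
    """
    num_sum, den_sum = 0.0, 0.0
    for (r1, a1), (r2, a2) in zip(counts1, counts2):
        n1, n2 = r1 + a1, r2 + a2
        p1, p2 = r1 / n1, r2 / n2
        # Squared frequency difference, minus the binomial sampling
        # noise expected from finite read depth at each site.
        num_sum += ((p1 - p2) ** 2
                    - p1 * (1 - p1) / (n1 - 1)
                    - p2 * (1 - p2) / (n2 - 1))
        den_sum += p1 * (1 - p2) + p2 * (1 - p1)
    return num_sum / den_sum

# Two strongly differentiated SNPs sequenced at 100x in each pool.
fst = fst_from_read_counts([(80, 20), (90, 10)], [(20, 80), (10, 90)])
```

Treating read counts as if they were individual genotypes is exactly the shortcut the article shows to be biased; the sketch above only corrects for read depth, which is why a dedicated Pool-seq estimator is needed.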

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Yasemin Guenay-Greunke ◽  
David A. Bohan ◽  
Michael Traugott ◽  
Corinna Wallinger

Abstract
High-throughput sequencing platforms are increasingly being used for targeted amplicon sequencing because they enable cost-effective sequencing of large sample sets. For meaningful interpretation of targeted amplicon sequencing data and comparison between studies, it is critical that bioinformatic analyses do not introduce artefacts and rely on detailed protocols to ensure that all methods are properly performed and documented. The analysis of large sample sets and the use of predefined indexes create challenges, such as adjusting the sequencing depth across samples and taking sequencing errors or index hopping into account. However, the potential biases these factors introduce into high-throughput amplicon sequencing data sets, and how they may be overcome, have rarely been addressed. Using the example of a nested metabarcoding analysis of 1920 carabid beetle regurgitates to assess plant feeding, we investigated: (i) the variation in sequencing depth of individually tagged samples and the effect of library preparation on the data output; (ii) the influence of sequencing errors within index regions and their consequences for demultiplexing; and (iii) the effect of index hopping. Our results demonstrate that, despite library quantification, large variation in read counts and sequencing depth occurred among samples, and that accounting for sequencing errors in bioinformatic software is essential for accurate adapter/primer trimming and demultiplexing. Moreover, setting an index hopping threshold to avoid incorrect assignment of samples is highly recommended.
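The demultiplexing concern described above, that sequencing errors inside the index region cause reads to go unassigned or, worse, to be misassigned, can be sketched minimally as nearest-index matching with an error tolerance and an ambiguity check (real tools such as cutadapt additionally handle indels and quality scores):

```python
def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def demultiplex(read_index, sample_indexes, max_mismatch=1):
    """Assign a read's observed index sequence to a sample, tolerating
    sequencing errors within the index region.

    Returns the sample name, or None when no index is close enough or
    when two indexes are equally close (an ambiguous, unsafe call).
    Minimal sketch only; sample names and sequences are made up.
    """
    hits = sorted((hamming(read_index, idx), name)
                  for name, idx in sample_indexes.items())
    best_dist, best_name = hits[0]
    if best_dist > max_mismatch:
        return None  # likely index hopping or heavy error: discard
    if len(hits) > 1 and hits[1][0] == best_dist:
        return None  # ambiguous between two samples: discard
    return best_name

indexes = {"s1": "ACGTACGT", "s2": "TGCATGCA"}
assert demultiplex("ACGTACGT", indexes) == "s1"   # exact match
assert demultiplex("ACGAACGT", indexes) == "s1"   # one error tolerated
assert demultiplex("ACGAAGGT", indexes) is None   # two errors: rejected
```

Choosing `max_mismatch` is the trade-off the abstract alludes to: too strict and error-bearing reads are lost, too lenient and reads hop between samples.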


PLoS ONE ◽  
2021 ◽  
Vol 16 (5) ◽  
pp. e0247541
Author(s):  
Brian P. Anton ◽  
Alexey Fomenkov ◽  
Victoria Wu ◽  
Richard J. Roberts

Single-molecule Real-Time (SMRT) sequencing can easily identify sites of N6-methyladenine and N4-methylcytosine within DNA sequences, but similar identification of 5-methylcytosine sites is not as straightforward. In prokaryotic DNA, methylation typically occurs within specific sequence contexts, or motifs, that are a property of the methyltransferases that “write” these epigenetic marks. We present here a straightforward, cost-effective alternative to both SMRT and bisulfite sequencing for the determination of prokaryotic 5-methylcytosine methylation motifs. The method, called MFRE-Seq, relies on excision and isolation of fully methylated fragments of predictable size using MspJI-Family Restriction Enzymes (MFREs), which depend on the presence of 5-methylcytosine for cleavage. We demonstrate that MFRE-Seq is compatible with both Illumina and Ion Torrent sequencing platforms and requires only a digestion step and simple column purification of size-selected digest fragments prior to standard library preparation procedures. We applied MFRE-Seq to numerous bacterial and archaeal genomic DNA preparations and successfully confirmed known motifs and identified novel ones. This method should be a useful complement to existing methodologies for studying prokaryotic methylomes and characterizing the contributing methyltransferases.
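Once size-selected digest fragments are sequenced, the remaining task is to read a methylation motif out of the fragment pile. As a loose illustration (not the actual MFRE-Seq analysis, and assuming, hypothetically, that fragments are already aligned so the candidate methylated cytosine sits at the same position in each), a per-position consensus could be derived like this:

```python
from collections import Counter

def position_consensus(fragments, min_frac=0.9):
    """Derive a simple consensus motif from aligned digest fragments.

    Positions where one base dominates (>= min_frac of fragments)
    contribute that base; all other positions contribute 'N'.  Real
    motif discovery is considerably more involved; this only
    illustrates reading a motif out of a stack of fragments.
    """
    length = len(fragments[0])
    motif = []
    for i in range(length):
        counts = Counter(f[i] for f in fragments)
        base, n = counts.most_common(1)[0]
        motif.append(base if n / len(fragments) >= min_frac else "N")
    return "".join(motif)

# Hypothetical fragments sharing a conserved CCGG core.
frags = ["ACCGGA", "TCCGGC", "GCCGGT", "ACCGGG"]
assert position_consensus(frags) == "NCCGGN"
```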


2020 ◽  
Vol 11 ◽  
Author(s):  
Paul E. Smith ◽  
Sinead M. Waters ◽  
Ruth Gómez Expósito ◽  
Hauke Smidt ◽  
Ciara A. Carberry ◽  
...  

Our understanding of complex microbial communities, such as those residing in the rumen, has drastically advanced through the use of high-throughput sequencing (HTS) technologies. Indeed, with the use of barcoded amplicon sequencing, it is now cost-effective and computationally feasible to identify individual rumen microbial genera associated with ruminant livestock nutrition, genetics, performance and greenhouse gas production. However, across all disciplines of microbial ecology, there is currently little reporting of the use of internal controls for validating HTS results. Furthermore, there is little consensus on the most appropriate reference database for analyzing rumen microbiota amplicon sequencing data. Therefore, in this study, a synthetic rumen-specific sequencing standard was used to assess the effects of database choice on results obtained from rumen microbial amplicon sequencing. Four DADA2 reference training sets (RDP, SILVA, GTDB, and RefSeq + RDP) were compared to assess their ability to correctly classify sequences included in the rumen-specific sequencing standard. In addition, two thresholds of phylogenetic bootstrapping, 50 and 80, were applied to investigate the effect of increasing stringency. Sequence classification differences were apparent amongst the databases. For example, the classification of Clostridium differed between all databases, thus highlighting the need for a consistent approach to nomenclature amongst different reference databases. It is hoped that the effect of database choice on taxonomic classification observed in this study will encourage research groups across various microbial disciplines to develop and routinely use their own microbiome-specific reference standard to validate analysis pipelines and database choice.
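The comparison the study performs, scoring each database against a sequencing standard of known composition, can be sketched as a small accuracy tally (a hypothetical stand-in for comparing DADA2 taxonomy output across training sets; all names and calls below are invented):

```python
def classification_accuracy(assignments, truth):
    """Fraction of mock-standard sequences each reference database
    classified to the expected genus.

    assignments: {db_name: {seq_id: genus_or_None}}
    truth: {seq_id: expected_genus}
    Unassigned sequences (None, e.g. those falling below a bootstrap
    threshold) count as incorrect, as do nomenclature mismatches.
    """
    return {db: sum(calls.get(s) == g for s, g in truth.items()) / len(truth)
            for db, calls in assignments.items()}

truth = {"seq1": "Clostridium", "seq2": "Prevotella"}
calls = {"RDP":   {"seq1": "Clostridium", "seq2": "Prevotella"},
         "SILVA": {"seq1": "Clostridium_sensu_stricto", "seq2": "Prevotella"}}
acc = classification_accuracy(calls, truth)
```

Note how the invented SILVA call illustrates the Clostridium nomenclature problem from the abstract: a taxonomically defensible but differently named assignment scores as a mismatch unless names are reconciled first.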


2013 ◽  
Vol 456 (3) ◽  
pp. 385-395 ◽  
Author(s):  
Anja Berndt ◽  
Kevin A. Wilkinson ◽  
Michaela J. Heimann ◽  
Paul Bishop ◽  
Jeremy M. Henley

We demonstrate that SUMO1 monobodies are effective tools for characterizing protein SUMOylation in vivo, and provide a high-affinity, specific and cost-effective alternative to conventional antibodies.


2018 ◽  
Author(s):  
Fatih Karaoglanoglu ◽  
Camir Ricketts ◽  
Marzieh Eslami Rasekh ◽  
Ezgi Ebren ◽  
Iman Hajirasouliha ◽  
...  

Abstract
Many algorithms aimed at characterizing genomic structural variation (SV) have been developed since the inception of high-throughput sequencing. However, the full spectrum of SVs in the human genome has not yet been assessed. Most existing methods focus on the discovery and genotyping of deletions, insertions, and mobile elements. Detection of balanced SVs with no gain or loss of genomic segments (e.g., inversions) is a particularly challenging task. Long read sequencing has been leveraged to find short inversions, but there is still a need for methods to detect large genomic inversions. Furthermore, there are currently no algorithms to predict the insertion locus of large interspersed segmental duplications.

Here we propose novel algorithms to characterize large (>40 Kbp) interspersed segmental duplications and large (>80 Kbp) inversions using Linked-Read sequencing data. Linked-Read sequencing provides long-range information, where Illumina reads are tagged with barcodes that can be used to assign short reads to pools of larger (30-50 Kbp) molecules. Our methods rely on the split molecule sequence signature that we have previously described [11]. Similar to split reads, split molecules refer to large segments of DNA that span an SV breakpoint; when mapped to the reference genome, the mapping of these segments is discontinuous. We redesign our earlier algorithm, VALOR, to specifically leverage Linked-Read sequencing data to discover large inversions and characterize interspersed segmental duplications. We implement our new algorithms in a new software package called VALOR2.

Availability
VALOR2 is available at https://github.com/BilkentCompGen/valor.
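The split molecule signal can be sketched in miniature: cluster the mapped positions of one barcode's reads into inferred molecules, and a "molecule" that resolves into two distant clusters is a candidate breakpoint-spanning segment. This is only the first step of such an analysis; VALOR2 itself does considerably more (pairing split molecules across breakpoints, filtering, SV-type inference).

```python
def molecules_from_reads(positions, max_gap=50_000):
    """Group the mapped positions of one barcode's reads into inferred
    molecules: consecutive reads further apart than max_gap start a
    new cluster.  Returns (start, end) spans; two or more spans for a
    single molecule suggest a discontinuous (split-molecule) mapping.
    """
    positions = sorted(positions)
    clusters = [[positions[0]]]
    for pos in positions[1:]:
        if pos - clusters[-1][-1] > max_gap:
            clusters.append([])
        clusters[-1].append(pos)
    return [(c[0], c[-1]) for c in clusters]

# One barcode whose reads map to two loci ~1 Mbp apart: a candidate
# split molecule spanning an SV breakpoint.
segs = molecules_from_reads([100_000, 110_000, 125_000, 1_200_000, 1_215_000])
assert segs == [(100_000, 125_000), (1_200_000, 1_215_000)]
```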


2018 ◽  
Author(s):  
Nicholas Stoler ◽  
Barbara Arbeithuber ◽  
Gundula Povysil ◽  
Monika Heinzl ◽  
Renato Salazar ◽  
...  

Abstract
Duplex sequencing is the most accurate approach for the identification of sequence variants present at very low frequencies. Its power comes from pooling together multiple descendants of both strands of original DNA molecules, which allows distinguishing true nucleotide substitutions from PCR amplification and sequencing artifacts. This strategy comes at a cost: sequencing the same molecule multiple times increases dynamic range but significantly diminishes coverage, making whole-genome duplex sequencing prohibitively expensive. Furthermore, every duplex experiment produces a substantial proportion of singleton reads that cannot be used in the analysis and are, in effect, thrown away. In this paper we demonstrate that a significant fraction of these reads contains PCR or sequencing errors within duplex tags. Correction of such errors allows "reuniting" these reads with their respective families, increasing the output of the method and making it more cost-effective. Additionally, we combine this error correction strategy with a number of algorithmic improvements in a new version of the duplex analysis software, Du Novo 2.0, readily available through Galaxy, Bioconda, and as source code.
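The idea of reuniting singletons with their families can be sketched as follows: a read whose duplex tag sits within one mismatch of exactly one established family tag is most plausibly an error-bearing member of that family. This sketches the intuition only, not the actual Du Novo 2.0 algorithm, and the tags below are invented.

```python
def rescue_singletons(family_sizes, singletons, max_dist=1):
    """Reassign singleton reads whose duplex tag likely contains a
    PCR/sequencing error to an existing read family.

    family_sizes: {tag: number_of_reads}
    singletons: list of tags observed only once
    A singleton is rescued only when exactly one established family
    tag lies within max_dist mismatches; ties stay unresolved.
    """
    def dist(a, b):
        return sum(x != y for x, y in zip(a, b))

    rescued = {}
    for tag in singletons:
        near = [t for t in family_sizes if dist(tag, t) <= max_dist]
        if len(near) == 1:
            rescued[tag] = near[0]  # reunite with its family
    return rescued

families = {"AACCGGTT": 12, "TTGGCCAA": 9}
res = rescue_singletons(families, ["AACCGGTA", "ACGTACGT"])
assert res == {"AACCGGTA": "AACCGGTT"}  # second tag matches nothing
```

Each rescued read adds a descendant back to its family, which is exactly where the method recovers otherwise discarded sequencing yield.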


2019 ◽  
Author(s):  
Kevin H.-C. Wei ◽  
Aditya Mantha ◽  
Doris Bachtrog

Abstract
Recombination is the exchange of genetic material between homologous chromosomes via physical crossovers. Pioneered by T. H. Morgan and A. Sturtevant over a century ago, methods to estimate recombination rate and genetic distance require scoring a large number of recombinant individuals between molecular or visible markers. While high-throughput sequencing methods have allowed for genome-wide crossover detection, producing high-resolution maps, such methods rely on large numbers of individually sequenced recombinants and are therefore difficult to scale. Here, we present a simple and scalable method to infer near chromosome-wide recombination rates from marker selected pools, together with the corresponding analytical software MarSuPial. Rather than genotyping individuals from recombinant backcrosses, we bulk sequence marker selected pools to infer the allele frequency decay around the selected locus; since the number of recombinant individuals increases proportionally to the genetic distance from the selected locus, the allele frequency across the chromosome can be used to estimate the genetic distance and recombination rate. We mathematically demonstrate the relationship between allele frequency attenuation, recombinant fraction, genetic distance, and recombination rate in marker selected pools. Based on available chromosome-wide recombination rate models of Drosophila, we simulated read counts and determined that nonlinear local regressions (LOESS) produce robust estimates despite the high noise inherent to sequencing data. To empirically validate this approach, we show that (single) marker selected pools closely recapitulate genetic distances inferred from scoring recombinants between double markers. We theoretically determine how secondary loci with viability impacts can modulate the allele frequency decay and how to account for such effects directly from the data.
We generated recombinant maps for three wild-derived strains, which correlate strongly with previous genome-wide measurements. Interestingly, amidst extensive recombination rate variation, multiple regions of the genomes show elevated rates across all strains. Lastly, we apply this method to estimate chromosome-wide crossover interference. Altogether, we find that marker selected pools are a simple and cost-effective method for broad recombination rate estimates. Although the approach does not identify individual crossover events, it can generate near chromosome-wide recombination maps with as little as one or two libraries.
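The core relationship, allele frequency at a linked locus to genetic distance, can be illustrated for an idealized marker-selected backcross pool: if only carriers of the marker allele are pooled, the selected strain's allele frequency f at a linked locus gives a recombinant fraction r = 1 - f, which a map function converts to distance. This sketch uses Haldane's no-interference map function as an assumption and inverts single points, whereas MarSuPial fits the full decay curve (LOESS) and accounts for viability effects.

```python
import math

def genetic_distance_cM(selected_allele_freq):
    """Genetic distance (centiMorgans) from the selected-marker allele
    frequency at a linked locus, for an idealized marker-selected
    backcross pool.

    Recombinant fraction: r = 1 - f.
    Haldane map function:  d = -50 * ln(1 - 2r) cM (assumes no
    crossover interference).
    """
    r = 1.0 - selected_allele_freq
    if not 0.0 <= r < 0.5:
        raise ValueError("recombinant fraction must lie in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)

assert genetic_distance_cM(1.0) == 0.0            # at the selected locus
assert abs(genetic_distance_cM(0.9) - 11.157) < 0.01  # f=0.9 -> ~11.2 cM
```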


PeerJ ◽  
2016 ◽  
Vol 4 ◽  
pp. e1839 ◽  
Author(s):  
Tom O. Delmont ◽  
A. Murat Eren

High-throughput sequencing provides a fast and cost-effective means to recover genomes of organisms from all domains of life. However, adequate curation of the assembly results against potential contamination by non-target organisms requires advanced bioinformatics approaches and practices. Here, we re-analyzed the sequencing data generated for the tardigrade Hypsibius dujardini, and created a holistic display of the eukaryotic genome assembly using DNA data originating from two groups and eleven sequencing libraries. By using bacterial single-copy genes, k-mer frequencies, and coverage values of scaffolds, we could identify and characterize multiple near-complete bacterial genomes from the raw assembly, and curate a 182 Mbp draft genome for H. dujardini supported by RNA-Seq data. Our results indicate that most contaminant scaffolds were assembled from Moleculo long-read libraries, and that most of these contaminants differed between library preparations. Our re-analysis shows that visualization and curation of eukaryotic genome assemblies can benefit from tools designed to address the needs of today's microbiologists, who are constantly challenged by the difficulties associated with the identification of distinct microbial genomes in complex environmental metagenomes.
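A crude caricature of the outlier logic behind such curation: scaffolds whose composition or coverage deviates strongly from the bulk of the assembly are candidate contaminants. The real workflow combines k-mer profiles, bacterial single-copy genes, and signal across multiple libraries; the thresholds and scaffold values below are invented for illustration.

```python
def flag_contaminants(scaffolds, gc_tol=0.10, cov_tol=5.0):
    """Flag scaffolds whose GC content or coverage deviates strongly
    from the median scaffold of the assembly.

    scaffolds: {name: (gc_fraction, mean_coverage)}
    A scaffold is flagged when its GC differs from the median by more
    than gc_tol, or its coverage is more than cov_tol-fold above or
    below the median coverage.
    """
    gcs = sorted(gc for gc, _ in scaffolds.values())
    covs = sorted(c for _, c in scaffolds.values())
    med_gc, med_cov = gcs[len(gcs) // 2], covs[len(covs) // 2]
    return [name for name, (gc, cov) in scaffolds.items()
            if abs(gc - med_gc) > gc_tol
            or cov > cov_tol * med_cov
            or cov < med_cov / cov_tol]

scaffolds = {"s1": (0.45, 30.0), "s2": (0.46, 28.0),
             "s3": (0.44, 33.0), "s4": (0.70, 300.0)}
assert flag_contaminants(scaffolds) == ["s4"]  # GC and coverage outlier
```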


2016 ◽  
Vol 73 (4) ◽  
pp. 865-875 ◽  
Author(s):  
Vittorio Boscaro ◽  
Alessia Rossi ◽  
Claudia Vannini ◽  
Franco Verni ◽  
Sergei I. Fokin ◽  
...  
