Geometry of the sample frequency spectrum and the perils of demographic inference

Mapping Intimacies ◽

10.1101/233908 ◽

2017 ◽

Cited By ~ 1

Author(s):

Zvi Rosen ◽

Anand Bhaskar ◽

Sebastien Roch ◽

Yun S. Song

Keyword(s):

Frequency Spectrum ◽

Dna Sequences ◽

Sequence Data ◽

Strong Dependence ◽

Population History ◽

Piecewise Constant ◽

Sample Frequency ◽

Demographic Inference ◽

Pathological Behavior ◽

Inference Methods

Abstract The sample frequency spectrum (SFS), which describes the distribution of mutant alleles in a sample of DNA sequences, is a widely used summary statistic in population genetics. The expected SFS has a strong dependence on the historical population demography and this property is exploited by popular statistical methods to infer complex demographic histories from DNA sequence data. Most, if not all, of these inference methods exhibit pathological behavior, however. Specifically, they often display runaway behavior in optimization, where the inferred population sizes and epoch durations can degenerate to 0 or diverge to infinity, and show undesirable sensitivity of the inferred demography to perturbations in the data. The goal of this paper is to provide theoretical insights into why such problems arise. To this end, we characterize the geometry of the expected SFS for piecewise-constant demographic histories and use our results to show that the aforementioned pathological behavior of popular inference methods is intrinsic to the geometry of the expected SFS. We provide explicit descriptions and visualizations for a toy model with sample size 4, and generalize our intuition to arbitrary sample sizes n using tools from convex and algebraic geometry. We also develop a universal characterization result which shows that the expected SFS of a sample of size n under an arbitrary population history can be recapitulated by a piecewise-constant demography with only κ_n epochs, where κ_n is between n/2 and 2n – 1. The set of expected SFS for piecewise-constant demographies with fewer than κ_n epochs is open and non-convex, which causes the above phenomena for inference from data.
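
To make the summary statistic concrete, the following is a minimal sketch of how an observed (unfolded) SFS could be tabulated from a small sample. The function name `sample_frequency_spectrum`, the 0/1 encoding (0 = ancestral, 1 = mutant), and the toy data are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter


def sample_frequency_spectrum(haplotypes):
    """Return the unfolded SFS of n sequences as a list of length n - 1.

    haplotypes: list of n equal-length lists of 0/1 values
    (assumed encoding: 0 = ancestral allele, 1 = mutant allele).
    Entry i - 1 of the result counts the sites at which exactly
    i of the n sequences carry the mutant allele.
    """
    n = len(haplotypes)
    # zip(*haplotypes) iterates over sites (columns); the column sum
    # is the number of mutant copies observed at that site.
    mutant_counts = Counter(sum(site) for site in zip(*haplotypes))
    # Sites with 0 or n mutant copies are not segregating and are dropped.
    return [mutant_counts.get(i, 0) for i in range(1, n)]


# Hypothetical toy sample: 4 sequences, 5 sites (matches the toy sample size 4).
sample = [
    [1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 0, 0],
]
print(sample_frequency_spectrum(sample))  # prints [2, 0, 2]
```

Here two sites carry a single mutant copy, none carry two, and two carry three; the fifth site is monomorphic and does not contribute.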


Fundamental limits on the accuracy of demographic inference based on the sample frequency spectrum

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1503717112 ◽

2015 ◽

Vol 112 (25) ◽

pp. 7677-7682 ◽

Cited By ~ 53

Author(s):

Jonathan Terhorst ◽

Yun S. Song

Keyword(s):

Frequency Spectrum ◽

Dna Sequences ◽

Convergence Rates ◽

Fixed Number ◽

Estimation Accuracy ◽

Information Theoretic ◽

Estimation Problems ◽

Sample Frequency ◽

Demographic Inference ◽

Segregating Sites

The sample frequency spectrum (SFS) of DNA sequences from a collection of individuals is a summary statistic that is commonly used for parametric inference in population genetics. Despite the popularity of SFS-based inference methods, little is currently known about the information theoretic limit on the estimation accuracy as a function of sample size. Here, we show that using the SFS to estimate the size history of a population has a minimax error of at least O(1/log s), where s is the number of independent segregating sites used in the analysis. This rate is exponentially worse than known convergence rates for many classical estimation problems in statistics. Another surprising aspect of our theoretical bound is that it does not depend on the dimension of the SFS, which is related to the number of sampled individuals. This means that, for a fixed number s of segregating sites considered, using more individuals does not help to reduce the minimax error bound. Our result pertains to populations that have experienced a bottleneck, and we argue that it can be expected to apply to many populations in nature.
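
The practical meaning of a 1/log s rate can be illustrated with a short numerical comparison. This is only a sketch: constants are ignored, the natural logarithm is assumed, and 1/sqrt(s) is used here as an illustrative stand-in for a classical parametric rate (that choice is an assumption, not something stated in the abstract).

```python
import math


def compare_rates(site_counts):
    """Return (s, 1/log s, 1/sqrt s) for each number of segregating sites s."""
    return [(s, 1 / math.log(s), 1 / math.sqrt(s)) for s in site_counts]


for s, log_rate, root_rate in compare_rates([10**2, 10**4, 10**6, 10**8]):
    print(f"s={s:>9}  1/log(s)={log_rate:.4f}  1/sqrt(s)={root_rate:.4f}")
```

Squaring the number of sites (for example, going from 10^2 to 10^4) only halves 1/log s, whereas it reduces 1/sqrt(s) by a factor of sqrt(s).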


Geometry of the Sample Frequency Spectrum and the Perils of Demographic Inference

Genetics ◽

10.1534/genetics.118.300733 ◽

2018 ◽

Vol 210 (2) ◽

pp. 665-682 ◽

Cited By ~ 9

Author(s):

Zvi Rosen ◽

Anand Bhaskar ◽

Sebastien Roch ◽

Yun S. Song

Keyword(s):

Frequency Spectrum ◽

Sample Frequency ◽

Demographic Inference


Hierarchical Models for Mitochondrial DNA Sequence Data

Austrian Journal of Statistics ◽

10.17713/ajs.v36i1.320 ◽

2016 ◽

Vol 36 (1) ◽

Author(s):

Paola Berchialla

Keyword(s):

Mitochondrial Dna ◽

Dna Sequence ◽

Dna Sequences ◽

Sequence Data ◽

Population History ◽

Parametric Models ◽

Italian Population ◽

Bayesian Hierarchical ◽

Dna Sequence Data ◽

Mitochondrial Dna Sequence

We introduce a Bayesian hierarchical model for mitochondrial DNA sequence data, which is fitted via acceptance-rejection algorithms. The model incorporates parametric models of population history explicitly as well as a mutational process allowing for a simultaneous parameter estimation whose importance has become increasingly clear in many recent studies. The model is applied to a sample of DNA sequences from the Italian population.


Inferring the ancestry of everyone

10.1101/458067 ◽

2018 ◽

Cited By ~ 7

Author(s):

Jerome Kelleher ◽

Yan Wong ◽

Patrick K. Albers ◽

Anthony W. Wohns ◽

Gil McVean

Keyword(s):

Dna Sequences ◽

Evolutionary Biology ◽

Sequence Data ◽

Data Sets ◽

Original Sequence ◽

Efficient Access ◽

History Of ◽

Comparable Accuracy ◽

Rich Information ◽

Inference Methods

Abstract A central problem in evolutionary biology is to infer the full genealogical history of a set of DNA sequences. This history contains rich information about the forces that have influenced a sexually reproducing species. However, existing methods are limited: the most accurate is unable to cope with more than a few dozen samples. With modern genetic data sets rapidly approaching millions of genomes, there is an urgent need for efficient inference methods to exploit such rich resources. We introduce an algorithm to infer whole-genome history which has comparable accuracy to the state-of-the-art but can process around four orders of magnitude more sequences. Additionally, our method results in an “evolutionary encoding” of the original sequence data, enabling efficient access to genealogies and calculation of genetic statistics over the data. We apply this technique to human data from the 1000 Genomes Project, Simons Genome Diversity Project and UK Biobank, showing that the genealogies we estimate are both rich in biological signal and efficient to process.


Characterization of squalene synthase gene from Gymnema sylvestre R. Br.

Beni-Suef University Journal of Basic and Applied Sciences ◽

10.1186/s43088-020-00094-4 ◽

2021 ◽

Vol 10 (1) ◽

Author(s):

Kuldeepsingh A. Kalariya ◽

Ram Prasnna Meena ◽

Lipi Poojara ◽

Deepa Shahi ◽

Sandip Patel

Keyword(s):

Dna Sequences ◽

Genomic Dna ◽

Competitive Inhibition ◽

Sequence Data ◽

Homology Model ◽

Squalene Synthase ◽

Gymnema Sylvestre ◽

Gardenia Jasminoides ◽

Ramachandran Plots ◽

Flanking Regions

Abstract Background: Squalene synthase (SQS) is a rate-limiting enzyme necessary to produce pentacyclic triterpenes in plants. It is an important enzyme producing squalene molecules required to run steroidal and triterpenoid biosynthesis pathways working in competitive inhibition mode. Reports are available on the SQS gene in several plants, but detailed information on the SQS gene in Gymnema sylvestre R. Br. is not available. G. sylvestre is a priceless rare vine of central eco-region known for its medicinally important triterpenoids. Our work aims to characterize the GS-SQS gene in this high-value medicinal plant. Results: Coding DNA sequences (CDS) of 1245 bp length representing the GS-SQS gene, predicted from transcriptome data in G. sylvestre, were used for further characterization. The SWISS protein structure modeled for the GS-SQS amino acid sequence data had a MolProbity Score of 1.44 and a Clash Score of 3.86. The quality estimates and statistical score of the Ramachandran plot analysis indicated that the homology model was reliable. For full-length amplification of the gene, primers designed from flanking regions of the CDS encoding GS-SQS were used for amplification against genomic DNA as template, which resulted in a single-band product of approximately 6.2 kb. Sequencing of this product through NGS generated 2.32 Gb of data and 3347 scaffolds with an N50 value of 457 bp. These scaffolds were compared to identify similarity with other SQS genes as well as the GS-SQSs of the transcriptome. Scaffold_3347, representing the GS-SQS gene, harbored two introns of 101 and 164 bp. Both intronic regions were validated by primers designed from adjoining regions outside the introns on the scaffold representing the GS-SQS gene. Amplification took place when the template was genomic DNA and failed when the template was cDNA, confirming the presence of two introns in the GS-SQS gene in Gymnema sylvestre R. Br. Conclusion: This study shows that the GS-SQS gene is very closely related to those of Coffea arabica and Gardenia jasminoides and that this gene harbors two introns of 101 and 164 bp.


Discovery of New Genera Challenges the Subtribal Classification of Tok-Tok Beetles (Coleoptera: Tenebrionidae: Sepidiini)

Insect Systematics and Diversity ◽

10.1093/isd/ixab006 ◽

2021 ◽

Vol 5 (2) ◽

Author(s):

Olivia M Gearner ◽

Marcin J Kamiński ◽

Kojun Kanda ◽

Kali Swichtenberg ◽

Aaron D Smith

Keyword(s):

Dna Sequences ◽

Species Group ◽

Phylogenetic Placement ◽

Darkling Beetles ◽

Species Groups ◽

Taxonomic Groups ◽

Inference Methods ◽

A New Species ◽

Coleoptera Tenebrionidae

Abstract Sepidiini is a speciose tribe of desert-inhabiting darkling beetles, which contains a number of poorly defined taxonomic groups and is in need of revision at all taxonomic levels. In this study, two previously unrecognized lineages were discovered, based on morphological traits, among the extremely speciose genera Psammodes Kirby, 1819 (164 species and subspecies) and Ocnodes Fåhraeus, 1870 (144 species and subspecies), namely the Psammodes spinosus species-group and Ocnodes humeralis species-group. In order to test their phylogenetic placement, a phylogeny of the tribe was reconstructed based on analyses of DNA sequences from six nonoverlapping genetic loci (CAD, wg, COI JP, COI BC, COII, and 28S) using Bayesian and maximum likelihood inference methods. The aforementioned, morphologically defined, species-groups were recovered as distinct and well-supported lineages within Molurina + Phanerotomeina and are interpreted as independent genera, respectively, Tibiocnodes Gearner & Kamiński gen. nov. and Tuberocnodes Gearner & Kamiński gen. nov. A new species, Tuberocnodes synhimboides Gearner & Kamiński sp. nov., is also described. Furthermore, as the recovered phylogenetic placement of Tibiocnodes and Tuberocnodes undermines the monophyly of Molurina and Phanerotomeina, an analysis of the available diagnostic characters for those subtribes is also performed. As a consequence, Phanerotomeina is considered as a synonym of the newly redefined Molurina sens. nov. Finally, spectrograms of vibrations produced by substrate tapping of two Molurina species, Toktokkus vialis (Burchell, 1822) and T. synhimboides, are presented.


An Integrated Framework for the Inference of Viral Population History From Reconstructed Genealogies

Genetics ◽

10.1093/genetics/155.3.1429 ◽

2000 ◽

Vol 155 (3) ◽

pp. 1429-1437

Author(s):

Oliver G Pybus ◽

Andrew Rambaut ◽

Paul H Harvey

Keyword(s):

Maximum Likelihood ◽

Sequence Data ◽

Demographic History ◽

Population History ◽

Maximum Likelihood Estimates ◽

Viral Population ◽

True Parameter ◽

Subtype B ◽

Exponential Growth Model ◽

Parameter Values

Abstract We describe a unified set of methods for the inference of demographic history using genealogies reconstructed from gene sequence data. We introduce the skyline plot, a graphical, nonparametric estimate of demographic history. We discuss both maximum-likelihood parameter estimation and demographic hypothesis testing. Simulations are carried out to investigate the statistical properties of maximum-likelihood estimates of demographic parameters. The simulations reveal that (i) the performance of exponential growth model estimates is determined by a simple function of the true parameter values and (ii) under some conditions, estimates from reconstructed trees perform as well as estimates from perfect trees. We apply our methods to HIV-1 sequence data and find strong evidence that subtypes A and B have different demographic histories. We also provide the first (albeit tentative) genetic evidence for a recent decrease in the growth rate of subtype B.
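
As a rough illustration of the skyline idea, the sketch below applies the classic per-interval estimate as it is commonly stated: for a coalescent interval of duration γ during which k lineages exist, the estimate is γ · k(k − 1)/2. The function name, the input layout (intervals ordered from the tips toward the root), and the toy numbers are assumptions for illustration, and the units of the result depend on how the genealogy's branch lengths are scaled.

```python
def classic_skyline(interval_lengths):
    """Return one piecewise-constant estimate per coalescent interval.

    interval_lengths: durations of successive coalescent intervals of a
    genealogy with n tips, ordered from the tips (n lineages) toward
    the root (2 lineages), so there are n - 1 intervals.
    """
    n = len(interval_lengths) + 1
    estimates = []
    for index, gamma in enumerate(interval_lengths):
        k = n - index  # number of lineages present during this interval
        estimates.append(gamma * k * (k - 1) / 2)
    return estimates


# Hypothetical genealogy with 4 tips and three coalescent intervals.
print(classic_skyline([0.1, 0.2, 0.5]))
```

Plotting these values as a step function against cumulative time gives the staircase shape that the skyline plot is named for.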


Molecular evolution of homologous gene sequences in germline-limited and somatic chromosomes of Acricotopus

Genome ◽

10.1139/g04-026 ◽

2004 ◽

Vol 47 (4) ◽

pp. 732-741 ◽

Cited By ~ 2

Author(s):

Wolfgang Staiber

Keyword(s):

Molecular Evolution ◽

Dna Sequences ◽

Sequence Data ◽

Structural Evolution ◽

5S Rdna ◽

Oligonucleotide Primer ◽

Homologous Gene ◽

Gene Sequences ◽

Nucleotide Substitutions ◽

Degenerate Oligonucleotide

The origin of germline-limited chromosomes (Ks) as descendants of somatic chromosomes (Ss) and their structural evolution was recently elucidated in the chironomid Acricotopus. The Ks consist of large S-homologous sections and of heterochromatic segments containing germline-specific, highly repetitive DNA sequences. Less is known about the molecular evolution and features of the sequences in the S-homologous K sections. More information about this was obtained by comparing homologous gene sequences of Ks and Ss. Genes for 5.8S, 18S, 28S, and 5S ribosomal RNA were chosen for the comparison and therefore isolated first by PCR from somatic DNA of Acricotopus and sequenced. Specific K DNA was collected by microdissection of monopolar moving K complements from differential gonial mitoses and was then amplified by degenerate oligonucleotide primer (DOP)-PCR. With the sequence data of the somatic rDNAs, the homologous 5.8S and 5S rDNA sequences were isolated by PCR from the DOP-PCR sequence pool of the Ks. In addition, a number of K DOP-PCR sequences were directly cloned and analysed. One K clone contained a section of a putative N-acetyltransferase gene. Compared with its homolog from the Ss, the sequence exhibited few nucleotide substitutions (99.2% sequence identity). The same was true for the 5.8S and 5S sequences from Ss and Ks (97.5%–100% identity). This supports the idea that the S-homologous K sequences may be conserved and do not evolve independently from their somatic homologs. Possible mechanisms effecting such conservation of S-derived sequences in the Ks are discussed. Key words: microdissection, DOP-PCR, germline-limited chromosomes, molecular evolution.


SHI7 Is a Self-Learning Pipeline for Multipurpose Short-Read DNA Quality Control

mSystems ◽

10.1128/msystems.00202-17 ◽

2018 ◽

Vol 3 (3) ◽

Cited By ~ 15

Author(s):

Gabriel A. Al-Ghalith ◽

Benjamin Hillmann ◽

Kaiwei Ang ◽

Robin Shields-Cutler ◽

Dan Knights

Keyword(s):

Quality Control ◽

Dna Sequences ◽

Sequence Data ◽

Background Knowledge ◽

Sequencing Technology ◽

Data Set ◽

Short Read ◽

Dna Quality ◽

Public Data ◽

User Friendly

ABSTRACT Next-generation sequencing technology is of great importance for many biological disciplines; however, due to technical and biological limitations, the short DNA sequences produced by modern sequencers require numerous quality control (QC) measures to reduce errors, remove technical contaminants, or merge paired-end reads together into longer or higher-quality contigs. Many tools for each step exist, but choosing the appropriate methods and usage parameters can be challenging because the parameterization of each step depends on the particularities of the sequencing technology used, the type of samples being analyzed, and the stochasticity of the instrumentation and sample preparation. Furthermore, end users may not know all of the relevant information about how their data were generated, such as the expected overlap for paired-end sequences or type of adaptors used to make informed choices. This increasing complexity and nuance demand a pipeline that combines existing steps together in a user-friendly way and, when possible, learns reasonable quality parameters from the data automatically. We propose a user-friendly quality control pipeline called SHI7 (canonically pronounced “shizen”), which aims to simplify quality control of short-read data for the end user by predicting presence and/or type of common sequencing adaptors, what quality scores to trim, whether the data set is shotgun or amplicon sequencing, whether reads are paired end or single end, and whether pairs are stitchable, including the expected amount of pair overlap. We hope that SHI7 will make it easier for all researchers, expert and novice alike, to follow reasonable practices for short-read data quality control.
IMPORTANCE Quality control of high-throughput DNA sequencing data is an important but sometimes laborious task requiring background knowledge of the sequencing protocol used (such as adaptor type, sequencing technology, insert size/stitchability, paired-endedness, etc.). Quality control protocols typically require applying this background knowledge to selecting and executing numerous quality control steps with the appropriate parameters, which is especially difficult when working with public data or data from collaborators who use different protocols. We have created a streamlined quality control pipeline intended to substantially simplify the process of DNA quality control from raw machine output files to actionable sequence data. In contrast to other methods, our proposed pipeline is easy to install and use and attempts to learn the necessary parameters from the data automatically with a single command.


Diversity of echinostomes (Digenea: Echinostomatidae) in their snail hosts at high latitudes

Parasite ◽

10.1051/parasite/2021054 ◽

2021 ◽

Vol 28 ◽

pp. 59

Author(s):

Camila Pantoja ◽

Anna Faltýnková ◽

Katie O’Dwyer ◽

Damien Jouet ◽

Karl Skírnisson ◽

...

Keyword(s):

North America ◽

Dna Sequences ◽

Sequence Data ◽

Molecular Data ◽

Migratory Bird ◽

Parasite Fauna ◽

Life Cycles ◽

Freshwater Ecosystems ◽

Trematode Species ◽

Host Life

The biodiversity of freshwater ecosystems globally still leaves much to be discovered, not least in the trematode parasite fauna they support. Echinostome trematode parasites have complex, multiple-host life-cycles, often involving migratory bird definitive hosts, thus leading to widespread distributions. Here, we examined the echinostome diversity in freshwater ecosystems at high latitude locations in Iceland, Finland, Ireland and Alaska (USA). We report 14 echinostome species identified morphologically and molecularly from analyses of nad1 and 28S rDNA sequence data. We found echinostomes parasitising snails of 11 species from the families Lymnaeidae, Planorbidae, Physidae and Valvatidae. The number of echinostome species in different hosts did not vary greatly and ranged from one to three species. Of these 14 trematode species, we discovered four species (Echinoparyphium sp. 1, Echinoparyphium sp. 2, Neopetasiger sp. 5, and Echinostomatidae gen. sp.) as novel in Europe; we provide descriptions for the newly recorded species and those not previously associated with DNA sequences. Two species from Iceland (Neopetasiger islandicus and Echinoparyphium sp. 2) were recorded in both Iceland and North America. All species found in Ireland are new records for this country. Through the integrative taxonomic approach taken, both morphological and molecular data are provided for comparison with future studies to elucidate many of the unknown parasite life cycles and transmission routes. Our reports of species distributions spanning Europe and North America highlight the need for parasite biodiversity assessments across large geographical areas.
