A reference dataset of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree

Mapping Intimacies ◽

10.1101/055541 ◽

2016 ◽

Cited By ~ 18

Author(s):

Michael A. Eberle ◽

Epameinondas Fritzilas ◽

Peter Krusche ◽

Morten Källberg ◽

Benjamin L. Moore ◽

...

Keyword(s):

De Novo ◽

Sequence Data ◽

Objective Assessment ◽

Variant Calling ◽

Whole Genome Sequence ◽

Reference Dataset ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Genome Wide ◽

Transmission Information

Abstract Improvement of variant calling in next-generation sequence data requires a comprehensive, genome-wide catalogue of high-confidence variants called in a set of genomes for use as a benchmark. We generated deep, whole-genome sequence data of seventeen individuals in a three-generation pedigree and called variants in each genome using a range of currently available algorithms. We used haplotype transmission information to create a phased “platinum” variant catalogue of 4.7 million single nucleotide variants (SNVs) plus 0.7 million small (1-50 bp) insertions and deletions (indels) that are consistent with the pattern of inheritance in the parents and eleven children of this pedigree. Platinum genotypes are highly concordant with the current catalogue of the National Institute of Standards and Technology for both SNVs (>99.99%) and indels (99.92%), and add a validated truth catalogue that has 26% more SNVs and 45% more indels. Analysis of 334,652 SNVs that were consistent between informatics pipelines yet inconsistent with haplotype transmission (“non-platinum”) revealed that the majority of these variants are de novo and cell-line mutations or reside within previously unidentified duplications and deletions. The reference materials from this study are a resource for objective assessment of the accuracy of variant calls throughout genomes.


Haplotype-aware variant calling enables high accuracy in nanopore long-reads using deep neural networks

10.1101/2021.03.04.433952 ◽

2021 ◽

Author(s):

Kishwar Shafin ◽

Trevor Pesout ◽

Pi-Chuan Chang ◽

Maria Nattestad ◽

Alexey Kolesnikov ◽

...

Keyword(s):

De Novo ◽

Sequence Data ◽

Variant Calling ◽

High Accuracy ◽

Superior Performance ◽

Read Length ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Short Read ◽

Long Read

Long-read sequencing has the potential to transform variant detection by reaching currently difficult-to-map regions and routinely linking together adjacent variations to enable read-based phasing. Third-generation nanopore sequence data has demonstrated a long read length, but current interpretation methods for its novel pore-based signal have unique error profiles, making accurate analysis challenging. Here, we introduce a haplotype-aware variant calling pipeline, PEPPER-Margin-DeepVariant, that produces state-of-the-art variant calling results with nanopore data. We show that our nanopore-based method outperforms the short-read-based single nucleotide variant identification method at the whole-genome scale and produces high-quality single nucleotide variants in segmental duplications and low-mappability regions where short-read-based genotyping fails. We show that our pipeline can provide highly contiguous phase blocks across the genome with nanopore reads, contiguously spanning between 85% and 92% of annotated genes across six samples. We also extend PEPPER-Margin-DeepVariant to PacBio HiFi data, providing an efficient solution with performance superior to the current WhatsHap-DeepVariant standard. Finally, we demonstrate de novo assembly polishing methods that use nanopore and PacBio HiFi reads to produce diploid assemblies with high accuracy (Q35+ nanopore-polished and Q40+ PacBio-HiFi-polished).


HashSeq: a Simple, Scalable, and Conservative De Novo Variant Caller for 16S rRNA Gene Data Sets

mSystems ◽

10.1128/msystems.00697-21 ◽

2021 ◽

Author(s):

Farnaz Fouladi ◽

Jacqueline B. Young ◽

Anthony A. Fodor

Keyword(s):

16S Rrna ◽

16S Rrna Gene ◽

De Novo ◽

Sequence Data ◽

Variant Calling ◽

Rrna Gene ◽

Sequence Variants ◽

Data Sets ◽

Single Nucleotide ◽

Gene Data

Recent bioinformatics development has enabled the detection of sequence variants with a high resolution of only one single-nucleotide difference in 16S rRNA gene sequence data. Despite this progress, there are several limitations that can be associated with variant calling pipelines, such as producing a large number of low-abundance sequence variants which need to be filtered out with arbitrary thresholds in downstream analyses or having a slow runtime.


Pedigree-based estimation of human mobile element retrotransposition rates

10.1101/506691 ◽

2018 ◽

Cited By ~ 1

Author(s):

Julie Feusier ◽

W. Scott Watkins ◽

Jainy Thomas ◽

Andrew Farrell ◽

David J. Witherspoon ◽

...

Keyword(s):

De Novo ◽

Phylogenetic Analyses ◽

Mobile Element ◽

Whole Genome Sequence ◽

Structural Variants ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Alu Elements ◽

Depth Analysis ◽

Parent Of Origin

Abstract Germline mutation rates in humans have been estimated for a variety of mutation types, including single nucleotide and large structural variants. Here we directly measure the germline retrotransposition rate for the three active retrotransposon elements: L1, Alu, and SVA. We utilized three tools for calling Mobile Element Insertions (MEIs) (MELT, RUFUS, and TranSurVeyor) on blood-derived whole genome sequence (WGS) data from 603 CEPH individuals, comprising 33 three-generation pedigrees. We identified 27 de novo MEIs in 440 births. The retrotransposition rate estimate for Alu elements, one in 40, is roughly half the rate estimated using phylogenetic analyses, a difference in magnitude similar to that observed for single nucleotide variants. The L1 retrotransposition rate is one in 62 births and is within range of previous estimates (1:20-1:200 births). The SVA retrotransposition rate, one in 55 births, is much higher than the previous estimate of one in 900 births. Our large, three-generation pedigrees allowed us to assess parent-of-origin effects and the timing of insertion events in either gametogenesis or early embryonic development. We find a statistically significant paternal bias in Alu retrotransposition. Our study represents the first in-depth analysis of the rate and dynamics of human retrotransposition from WGS data in three-generation human pedigrees.


Combination of Genome-Wide Polymorphisms and Copy Number Variations of Pharmacogenes in Koreans

Journal of Personalized Medicine ◽

10.3390/jpm11010033 ◽

2021 ◽

Vol 11 (1) ◽

pp. 33

Author(s):

Nayoung Han ◽

Jung Mi Oh ◽

In-Wha Kim

Keyword(s):

Copy Number ◽

Genome Wide Association Study ◽

Copy Number Gain ◽

Copy Number Variations ◽

Gene Gain ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Haplotype Blocks ◽

Genome Wide ◽

Control And Prevention

For predicting phenotypes and executing precision medicine, combination analysis of single nucleotide variants (SNVs) genotyping with copy number variations (CNVs) is required. The aim of this study was to discover SNVs or common copy CNVs and examine the combined frequencies of SNVs and CNVs in pharmacogenes using the Korean genome and epidemiology study (KoGES), a consortium project. The genotypes (N = 72,299) and CNV data (N = 1000) were provided by the Korean National Institute of Health, Korea Centers for Disease Control and Prevention. The allele frequencies of SNVs, CNVs, and combined SNVs with CNVs were calculated and haplotype analysis was performed. CYP2D6 rs1065852 (c.100C>T, p.P34S) was the most common variant allele (48.23%). A total of 8454 haplotype blocks in 18 pharmacogenes were estimated. DMD ranked the highest in frequency for gene gain (64.52%), while TPMT ranked the highest in frequency for gene loss (51.80%). Copy number gain of CYP4F2 was observed in 22 subjects; 13 of those subjects were carriers with CYP4F2*3 gain. In the case of TPMT, approximately one-half of the participants (N = 308) had loss of the TPMT*1*1 diplotype. The frequencies of SNVs and CNVs in pharmacogenes were determined using the Korean cohort-based genome-wide association study.


Unsuspected somatic mosaicism for FBN1 gene contributes to Marfan syndrome

Genetics in Medicine ◽

10.1038/s41436-020-01078-6 ◽

2021 ◽

Author(s):

Pauline Arnaud ◽

Hélène Morel ◽

Olivier Milleron ◽

Laurent Gouya ◽

Christine Francannet ◽

...

Keyword(s):

Marfan Syndrome ◽

Somatic Mosaicism ◽

Variant Calling ◽

Copy Number Variations ◽

Pathogenic Variant ◽

Single Nucleotide Variants ◽

Bioinformatics Analyses ◽

Single Nucleotide ◽

Fbn1 Gene ◽

Pathogenic Variants

Abstract Purpose Individuals with mosaic pathogenic variants in the FBN1 gene are mainly described in the course of familial screening. In the literature, almost all these mosaic individuals are asymptomatic. In this study, we report the experience of our team on more than 5,000 Marfan syndrome (MFS) probands. Methods Next-generation sequencing (NGS) capture technology allowed us to identify five cases of MFS probands who harbored a mosaic pathogenic variant in the FBN1 gene. Results These five sporadic mosaic probands displayed classical features usually seen in Marfan syndrome. Combined with the results of the literature, these rare findings concerned both single-nucleotide variants and copy-number variations. Conclusion This underestimated finding should not be overlooked in the molecular diagnosis of MFS patients and warrants an adaptation of the parameters used in bioinformatics analyses. The five present cases of symptomatic MFS probands harboring a mosaic FBN1 pathogenic variant reinforce the fact that apparently asymptomatic mosaic parents should have a complete clinical examination and a regular cardiovascular follow-up. We advise that individuals with a typical MFS for whom no single-nucleotide pathogenic variant or exon deletion/duplication was identified should be tested by NGS capture panel with an adapted variant calling analysis.


scSNV: accurate dscRNA-seq SNV co-expression analysis using duplicate tag collapsing

Genome Biology ◽

10.1186/s13059-021-02364-5 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Gavin W. Wilson ◽

Mathieu Derouet ◽

Gail E. Darling ◽

Jonathan C. Yeung

Keyword(s):

Genetic Variants ◽

False Positive ◽

Variant Calling ◽

Call Rate ◽

Rna Seq ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Variant Call ◽

Two Samples ◽

Co Detection

Abstract Identifying single nucleotide variants has become common practice for droplet-based single-cell RNA-seq experiments; however, presently, a pipeline does not exist to maximize variant calling accuracy. Furthermore, molecular duplicates generated in these experiments have not been utilized to optimally detect variant co-expression. Herein, we introduce scSNV, designed from the ground up to “collapse” molecular duplicates and accurately identify variants and their co-expression. We demonstrate that scSNV is fast, with a reduced false-positive variant call rate, and enables the co-detection of genetic variants and A>G RNA edits across twenty-two samples.


Whole-genome sequence data suggests environmental adaptation of Ethiopian sheep populations

Genome Biology and Evolution ◽

10.1093/gbe/evab014 ◽

2021 ◽

Author(s):

Pamela Wiener ◽

Christelle Robert ◽

Abulgasim Ahbara ◽

Mazdak Salavati ◽

Ayele Abebe ◽

...

Keyword(s):

High Altitude ◽

Environmental Variables ◽

Large Scale ◽

Sequence Data ◽

Strong Association ◽

Environmental Adaptation ◽

Whole Genome Sequence ◽

Single Nucleotide Variants ◽

High Altitude Adaptation ◽

Altitude Adaptation

Abstract Great progress has been made over recent years in the identification of selection signatures in the genomes of livestock species. This work has primarily been carried out in commercial breeds for which the dominant selection pressures are associated with artificial selection. As agriculture and food security are likely to be strongly affected by climate change, a better understanding of environment-imposed selection on agricultural species is warranted. Ethiopia is an ideal setting to investigate environmental adaptation in livestock due to its wide variation in geo-climatic characteristics and the extensive genetic and phenotypic variation of its livestock. Here, we identified over three million single nucleotide variants across 12 Ethiopian sheep populations and applied landscape genomics approaches to investigate the association between these variants and environmental variables. Our results suggest that environmental adaptation for precipitation-related variables is stronger than that related to altitude or temperature, consistent with large-scale meta-analyses of selection pressure across species. The set of genes showing association with environmental variables was enriched for genes highly expressed in human blood and nerve tissues. There was also evidence of enrichment for genes associated with high-altitude adaptation although no strong association was identified with hypoxia-inducible-factor (HIF) genes. One of the strongest altitude-related signals was for a collagen gene, consistent with previous studies of high-altitude adaptation. Several altitude-associated genes also showed evidence of adaptation with temperature, suggesting a relationship between responses to these environmental factors. These results provide a foundation to investigate further the effects of climatic variables on small ruminant populations.


Prediction of genome-wide effects of single nucleotide variants on transcription factor binding

Scientific Reports ◽

10.1038/s41598-020-74793-4 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Sebastian Carrasco Pro ◽

Katia Bulekova ◽

Brian Gregor ◽

Adam Labadorf ◽

Juan Ignacio Fuxman Bass

Keyword(s):

Binding Sites ◽

Cancer Type ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Regulatory Regions ◽

Genome Wide ◽

Transcriptional Regulatory ◽

Gene Regulatory ◽

The Impact ◽

The Relationship

Abstract Single nucleotide variants (SNVs) located in transcriptional regulatory regions can result in gene expression changes that lead to adaptive or detrimental phenotypic outcomes. Here, we predict gain or loss of binding sites for 741 transcription factors (TFs) across the human genome. We calculated ‘gainability’ and ‘disruptability’ scores for each TF that represent the likelihood of binding sites being created or disrupted, respectively. We found that functional cis-eQTL SNVs are more likely to alter TF binding sites than rare SNVs in the human population. In addition, we show that cancer somatic mutations have different effects on TF binding sites from different TF families on a cancer-type basis. Finally, we discuss the relationship between these results and cancer mutational signatures. Altogether, we provide a blueprint to study the impact of SNVs derived from genetic variation or disease association on TF binding to gene regulatory regions.


An integrative approach to investigate the respective roles of single-nucleotide variants and copy-number variants in Attention-Deficit/Hyperactivity Disorder

Scientific Reports ◽

10.1038/srep22851 ◽

2016 ◽

Vol 6 (1) ◽

Cited By ~ 9

Author(s):

Leandro de Araújo Lima ◽

Ana Cecília Feio-dos-Santos ◽

Sintia Iole Belangero ◽

Ary Gadelha ◽

Rodrigo Affonseca Bressan ◽

...

Keyword(s):

Attention Deficit Hyperactivity Disorder ◽

Attention Deficit ◽

Copy Number ◽

De Novo ◽

Copy Number Variants ◽

Integrative Approach ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Hyperactivity Disorder ◽

New Genes

Abstract Many studies have attempted to investigate the genetic susceptibility of Attention-Deficit/Hyperactivity Disorder (ADHD), but without much success. The present study aimed to analyze both single-nucleotide and copy-number variants contributing to the genetic architecture of ADHD. We generated exome data from 30 Brazilian trios with sporadic ADHD. We also analyzed a Brazilian sample of 503 child/adolescent controls from a High Risk Cohort Study for the Development of Childhood Psychiatric Disorders, as well as previously published results of five CNV studies and one GWAS meta-analysis of ADHD involving children/adolescents. The results from the Brazilian trios showed that cases with de novo SNVs tend not to have de novo CNVs and vice-versa. Although the sample size is small, we could also see that various comorbidities are more frequent in cases with only inherited variants. Moreover, using only genes expressed in brain, we constructed two “in silico” protein-protein interaction networks, one with genes from any analysis, and the other with genes with hits in two analyses. Topological and functional analyses of genes in this network uncovered genes related to synapse, cell adhesion, glutamatergic and serotoninergic pathways, both confirming findings of previous studies and capturing new genes and genetic variants in these pathways.


A curated dataset of modern and ancient high-coverage shotgun human genomes

Scientific Data ◽

10.1038/s41597-021-00980-1 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Pierpaolo Maisano Delser ◽

Eppie R. Jones ◽

Anahit Hovhannisyan ◽

Lara Cassidy ◽

Ron Pinhasi ◽

...

Keyword(s):

Sequence Data ◽

Whole Genome ◽

Reference Dataset ◽

High Coverage ◽

Sample Distribution ◽

Human Samples ◽

Human Genomes ◽

Genome Wide ◽

Genome Wide Data ◽

Computationally Intensive

Abstract Over the last few years, genome-wide data for a large number of ancient human samples have been collected. Whilst datasets of captured SNPs have been collated, high coverage shotgun genomes (which are relatively few but allow certain types of analyses not possible with ascertained captured SNPs) have to be reprocessed by individual groups from raw reads. This task is computationally intensive. Here, we release a dataset including 35 whole-genome sequenced samples, previously published and distributed worldwide, together with the genetic pipeline used to process them. The dataset contains 72,041,355 sites called across 19 ancient and 16 modern individuals and includes sequence data from four previously published ancient samples which we sequenced to higher coverage (10–18x). Such a resource will allow researchers to analyse their new samples with the same genetic pipeline and directly compare them to the reference dataset without re-processing published samples. Moreover, this dataset can be easily expanded to increase the sample distribution both across time and space.
