Quantifying unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects

James Zou; Gregory Valiant; Paul Valiant; Konrad Karczewski; Siu On Chan; Kaitlin Samocha; Monkol Lek; Shamil Sunyaev; Mark Daly; Daniel G. MacArthur

doi:10.1038/ncomms13293

Quantifying the unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects

10.1101/030841 ◽

2015 ◽

Cited By ~ 1

Author(s):

James Zou ◽

Gregory Valiant ◽

Paul Valiant ◽

Konrad Karczewski ◽

Siu On Chan ◽

...

Keyword(s):

Frequency Distribution ◽

Statistical Power ◽

Large Scale ◽

Rare Variants ◽

Human Populations ◽

Loss Of Function ◽

Protein Coding ◽

Missense Variants ◽

Healthy Humans ◽

Coding Variants

As new proposals aim to sequence ever larger collections of humans, it is critical to have a quantitative framework to evaluate the statistical power of these projects. We developed a new algorithm, UnseenEst, and applied it to the exomes of 60,706 individuals to estimate the frequency distribution of all protein-coding variants, including rare variants that have not yet been observed in current cohorts. Our results quantify the number of new variants that we expect to identify as sequencing cohorts reach hundreds of thousands of individuals. With 500K individuals, we expect to capture 7.5% of all possible loss-of-function variants and 12% of all possible missense variants. We also estimate that 2,900 genes have a loss-of-function frequency of less than 0.00001 in healthy humans, consistent with very strong intolerance to gene inactivation.
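To make the idea of estimating unobserved variants concrete, here is a minimal sketch of the classical Good-Toulmin estimator. This is not UnseenEst, whose details the abstract does not give. It is an older, simpler technique that works from the same kind of input: a "fingerprint" counting how many variants were seen exactly k times. The function names and the toy allele counts are assumptions made for illustration only.

```python
from collections import Counter


def fingerprint(allele_counts):
    """Map k -> number of variants observed exactly k times (k >= 1)."""
    return Counter(c for c in allele_counts if c > 0)


def good_toulmin_new_variants(fp, t):
    """Estimate how many previously unseen variants appear when the cohort
    grows from n to (1 + t) * n individuals.

    Uses the alternating series U(t) = sum_k (-1)^(k+1) * t^k * fp[k].
    The series is only considered reliable for 0 <= t <= 1, so larger
    extrapolations are rejected here rather than silently returned.
    """
    if not 0 <= t <= 1:
        raise ValueError("Good-Toulmin is only stable for 0 <= t <= 1")
    return sum(-((-t) ** k) * n for k, n in fp.items())


if __name__ == "__main__":
    # Hypothetical allele counts for six observed variants.
    fp = fingerprint([1, 1, 1, 2, 2, 5])
    print(good_toulmin_new_variants(fp, 1.0))  # doubling the cohort
```

Extrapolating from 60,706 to 500K individuals corresponds to a cohort about eight times larger, well outside the range where this simple series behaves, which is one reason more robust estimators are needed for that kind of projection.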


The contribution of common regulatory and protein-coding TYR variants in the genetic architecture of albinism

10.1101/2021.11.01.21265733 ◽

2021 ◽

Author(s):

Vincent Michaud ◽

Eulalie Lasseaux ◽

David J Green ◽

Dave T Gerrard ◽

Claudio Plaisant ◽

...

Keyword(s):

Genetic Architecture ◽

Large Scale ◽

Diagnostic Yield ◽

Genetic Diseases ◽

Gene Encoding ◽

Protein Coding ◽

Autosomal Recessive Disorders ◽

Functional Variants ◽

Prevalent Disease ◽

Coding Variants

Genetic diseases have been historically segregated into rare Mendelian and common complex conditions. Large-scale studies using genome sequencing are eroding this distinction and are gradually unmasking the underlying complexity of human traits. We studied a cohort of 1,313 individuals with albinism aiming to gain insights into the genetic architecture of rare, autosomal recessive disorders. We investigated the contribution of regulatory and protein-coding variants at the common and rare ends of the allele-frequency spectrum. We focused on TYR, the gene encoding tyrosinase, and found that a promoter variant, TYR: c.-301C>T [rs4547091], modulates the penetrance of a prevalent, disease-associated missense change, TYR: c.1205G>A [rs1126809]. We also found that homozygosity for a haplotype formed by three common, functional variants, TYR: c.[-301C;575C>A;1205G>A], confers a high risk of albinism (OR>77) and is associated with reduced vision in UK Biobank participants. Finally, we report how the combined analysis of rare and common variants increases diagnostic yield and informs genetic counselling in families with albinism.


Rare schizophrenia risk variant burden is conserved in diverse human populations

10.1101/2022.01.03.22268662 ◽

2022 ◽

Author(s):

Dongjing Liu ◽

Dara Meyer ◽

Brian Fennessy ◽

Claudia Feng ◽

Esther Cheng ◽

...

Keyword(s):

Genetic Architecture ◽

Large Scale ◽

Current Knowledge ◽

European Ancestry ◽

P Value ◽

Human Populations ◽

Protein Coding ◽

Coding Regions ◽

Meta Analyses ◽

Shared Risk

Schizophrenia is a chronic mental illness that is amongst the most debilitating conditions encountered in medical practice. A recent landmark schizophrenia study of the protein-coding regions of the genome identified a causal role for ten genes and a concentration of rare variant signals in evolutionarily constrained genes [1]. This study, like most other large-scale human genetic studies, was mainly composed of individuals of European ancestry, and the generalizability of the findings in non-European populations is unclear. To address this gap in knowledge, we designed a custom sequencing panel based on current knowledge of the genetic architecture of schizophrenia and applied it to a new cohort of 22,135 individuals of diverse ancestries. Replicating earlier work, cases carried a significantly higher burden of rare protein-truncating variants among constrained genes (OR = 1.48, p-value = 5.4 × 10⁻⁶). In meta-analyses with existing schizophrenia datasets totaling up to 35,828 cases and 107,877 controls, this excess burden was largely consistent across five continental populations. Two genes (SRRM2 and AKAP11) were newly implicated as schizophrenia risk genes, and one gene (PCLO) was identified as a shared risk gene for schizophrenia and autism. Overall, our results lend robust support to the rare allelic spectrum of the genetic architecture of schizophrenia being conserved across diverse human populations.
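For readers unfamiliar with burden odds ratios such as the one quoted above, the following sketch shows how an odds ratio and an approximate 95% confidence interval can be computed from a 2x2 carrier table. The abstract does not say how its estimate was obtained (studies of this kind typically use covariate-adjusted regression), so this unadjusted calculation with the Woolf log-scale interval is only an illustration, and the counts in the usage example are invented.

```python
import math


def odds_ratio_with_ci(case_carriers, case_noncarriers,
                       control_carriers, control_noncarriers, z=1.96):
    """Unadjusted odds ratio with a Woolf (log-scale) confidence interval.

    All four cells must be positive; z = 1.96 gives an approximate 95% CI.
    """
    a, b, c, d = (case_carriers, case_noncarriers,
                  control_carriers, control_noncarriers)
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Hypothetical counts, not taken from the study.
    print(odds_ratio_with_ci(30, 970, 20, 980))
```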


Sequencing of 640,000 exomes identifies GPR75 variants associated with protection from obesity

Science ◽

10.1126/science.abf8683 ◽

2021 ◽

Vol 373 (6550) ◽

pp. eabf8683

Author(s):

Parsa Akbari ◽

Ankit Gilani ◽

Olukayode Sosina ◽

Jack A. Kosmicki ◽

Lori Khrimian ◽

...

Keyword(s):

Complex Traits ◽

Large Scale ◽

The United States ◽

Protein Coding ◽

Knock Out ◽

The United Kingdom ◽

Body Adiposity ◽

Coding Variants ◽

G Protein Coupled ◽

Diet Model

Large-scale human exome sequencing can identify rare protein-coding variants with a large impact on complex traits such as body adiposity. We sequenced the exomes of 645,626 individuals from the United Kingdom, the United States, and Mexico and estimated associations of rare coding variants with body mass index (BMI). We identified 16 genes with an exome-wide significant association with BMI, including those encoding five brain-expressed G protein–coupled receptors (CALCR, MC4R, GIPR, GPR151, and GPR75). Protein-truncating variants in GPR75 were observed in ~4/10,000 sequenced individuals and were associated with 1.8 kilograms per square meter lower BMI and 54% lower odds of obesity in the heterozygous state. Knock out of Gpr75 in mice resulted in resistance to weight gain and improved glycemic control in a high-fat diet model. Inhibition of GPR75 may provide a therapeutic strategy for obesity.


SMAP: A pipeline for sample matching in proteogenomics

10.1101/2021.09.17.460682 ◽

2021 ◽

Author(s):

Ling Li ◽

Mingming Niu ◽

Alyssa Erickson ◽

Jie Luo ◽

Kincaid Rowbotham ◽

...

Keyword(s):

Large Scale ◽

Ribosome Profiling ◽

Sequencing Data ◽

Protein Coding ◽

Web Based ◽

Link Type ◽

Genomics And Proteomics ◽

Sample Data ◽

Dependent Protein ◽

Coding Variants

Integration of genomics and proteomics (proteogenomics) offers unprecedented promise for in-depth understanding of human diseases. However, sample mix-up is a pervasive, recurring problem, due to complex sample processing in proteogenomics. Here we present a pipeline for Sample Matching in Proteogenomics (SMAP) for verifying sample identity to ensure data integrity. SMAP infers sample-dependent protein-coding variants from quantitative mass spectrometry (MS), and aligns the MS-based proteomic samples with genomic samples by two discriminant scores. Theoretical analysis with simulation data indicates that SMAP is capable of uniquely matching proteomic and genomic samples when ≥20% of the genotypes of individual samples are available. When SMAP was applied to a large-scale proteomics dataset from 288 biological samples generated by the PsychENCODE BrainGVEX project, we identified and corrected 18.8% (54/288) mismatched samples. The correction was further confirmed by ribosome profiling and assay for transposase-accessible chromatin sequencing data from the same set of samples. Thus our results demonstrate that SMAP is an effective tool for sample verification in a large-scale MS-based proteogenomics study. The source code, manual, and sample data of SMAP are publicly available at https://github.com/UND-Wanglab/SMAP, and a web-based SMAP can be accessed at https://smap.shinyapps.io/smap/.
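The general idea of matching samples by genotype can be sketched as follows. This is not SMAP's algorithm: the abstract does not define its two discriminant scores, so the sketch substitutes a plain genotype concordance fraction over shared variant sites. Sample IDs, variant IDs, and genotype codes are hypothetical.

```python
def concordance(calls_a, calls_b):
    """Fraction of shared variant sites where two samples' genotypes agree."""
    shared = [site for site in calls_a if site in calls_b]
    if not shared:
        return 0.0
    return sum(calls_a[site] == calls_b[site] for site in shared) / len(shared)


def best_matches(proteomic, genomic):
    """For each proteomic sample, find the genomic sample with the highest
    concordance. Returns {proteomic_id: (best_genomic_id, score)}."""
    result = {}
    for pid, pcalls in proteomic.items():
        scores = {gid: concordance(pcalls, gcalls)
                  for gid, gcalls in genomic.items()}
        best = max(scores, key=scores.get)
        result[pid] = (best, scores[best])
    return result


if __name__ == "__main__":
    # Toy data: genotypes coded as 0/1/2 alternate-allele counts.
    genomic = {"S1": {"v1": 0, "v2": 1, "v3": 2},
               "S2": {"v1": 1, "v2": 1, "v3": 0}}
    # Proteomic calls cover only a subset of sites, as MS-inferred calls would.
    proteomic = {"S1": {"v1": 1, "v3": 0},
                 "S2": {"v1": 0, "v2": 1}}
    for pid, (gid, score) in best_matches(proteomic, genomic).items():
        flag = "" if pid == gid else "  <- possible mix-up"
        print(pid, "->", gid, score, flag)
```

A proteomic sample whose best match differs from its label is only a candidate mix-up. A real pipeline would also need a rule for ambiguous or low-confidence matches, which this sketch leaves out.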


The 3D spatial constraint on 6.1 million amino acid sites in the human proteome

10.1101/2021.09.15.460390 ◽

2021 ◽

Author(s):

Bian Li ◽

Dan M. Roden ◽

John A. Capra

Keyword(s):

Genetic Variation ◽

Amino Acid ◽

Structure Prediction ◽

Large Scale ◽

Acid Sites ◽

Human Proteome ◽

Human Populations ◽

Spatial Constraint ◽

Protein Coding ◽

Individual Site

Quantification of the tolerance of protein-coding sites to genetic variation within human populations has become a cornerstone of the prediction of the function of genomic variants. We hypothesize that the constraint on missense variation at individual amino acid sites is largely shaped by direct 3D interactions with neighboring sites. To quantify the constraint on protein-coding genetic variation in 3D spatial neighborhoods, we introduce a new framework called COntact Set MISsense tolerance (or COSMIS) for estimating constraint. Leveraging recent advances in computational structure prediction, large-scale sequencing data from gnomAD, and a mutation-spectrum-aware statistical model, we comprehensively map the landscape of 3D spatial constraint on 6.1 million amino acid sites covering >80% (16,533) of human proteins. We show that the human proteome is broadly under 3D spatial constraint and that the level of spatial constraint is strongly associated with disease relevance both at the individual site level and the protein level. We demonstrate that COSMIS performs significantly better at a range of variant interpretation tasks than other population-based constraint metrics while also providing biophysical insight into the potential functional roles of constrained sites. We make our constraint maps freely available and anticipate that the structural landscape of constrained sites identified by COSMIS will facilitate interpretation of protein-coding variation in human evolution and prioritization of sites for mechanistic or functional investigation.


Mutational patterns and clonal evolution from diagnosis to relapse in pediatric acute lymphoblastic leukemia

Scientific Reports ◽

10.1038/s41598-021-95109-0 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Shumaila Sayyab ◽

Anders Lundmark ◽

Malin Larsson ◽

Markus Ringnér ◽

Sara Nystedt ◽

...

Keyword(s):

Acute Lymphoblastic Leukemia ◽

Large Scale ◽

Somatic Mutations ◽

Lymphoblastic Leukemia ◽

Clonal Evolution ◽

Point Mutations ◽

Driver Genes ◽

Protein Coding ◽

Pediatric Acute Lymphoblastic Leukemia ◽

Evolutionary Trajectories

The mechanisms driving clonal heterogeneity and evolution in relapsed pediatric acute lymphoblastic leukemia (ALL) are not fully understood. We performed whole genome sequencing of samples collected at diagnosis, relapse(s) and remission from 29 Nordic patients. Somatic point mutations and large-scale structural variants were called using individually matched remission samples as controls, and allelic expression of the mutations was assessed in ALL cells using RNA-sequencing. We observed an increased burden of somatic mutations at relapse, compared to diagnosis, and at second relapse compared to first relapse. In addition to 29 known ALL driver genes, of which nine genes carried recurrent protein-coding mutations in our sample set, we identified putative non-protein coding mutations in regulatory regions of seven additional genes that have not previously been described in ALL. Cluster analysis of hundreds of somatic mutations per sample revealed three distinct evolutionary trajectories during ALL progression from diagnosis to relapse. The evolutionary trajectories provide insight into the mutational mechanisms leading to relapse in ALL and could offer biomarkers for improved risk prediction in individual patients.


Assessing the contribution of rare-to-common protein-coding variants to circulating metabolic biomarker levels via 412,394 UK Biobank exome sequences

10.1101/2021.12.24.21268381 ◽

2021 ◽

Author(s):

Abhishek Nag ◽

Lawrence Middleton ◽

Ryan S Dhindsa ◽

Dimitrios Vitsios ◽

Eleanor M Wigmore ◽

...

Keyword(s):

Gene Networks ◽

Rare Variants ◽

Association Studies ◽

Low Frequency ◽

Genome Wide Association Studies ◽

Uk Biobank ◽

Protein Coding ◽

The Uk ◽

Metabolic Biomarkers ◽

Coding Variants

Genome-wide association studies have established the contribution of common and low frequency variants to metabolic biomarkers in the UK Biobank (UKB); however, the role of rare variants remains to be assessed systematically. We evaluated rare coding variants for 198 metabolic biomarkers, including metabolites assayed by Nightingale Health, using exome sequencing in participants from four genetically diverse ancestries in the UKB (N = 412,394). Gene-level collapsing analysis, which evaluated a range of genetic architectures, identified a total of 1,303 significant relationships between genes and metabolic biomarkers (p < 1 × 10⁻⁸), encompassing 207 distinct genes. These include associations between rare non-synonymous variants in GIGYF1 and glucose and lipid biomarkers, SYT7 and creatinine, and others, which may provide insights into novel disease biology. Compared to a previous microarray-based genotyping study in the same cohort, we observed that 40% of gene-biomarker relationships identified in the collapsing analysis were novel. Finally, we applied Gene-SCOUT, a novel tool that utilises the gene-biomarker association statistics from the collapsing analysis to identify genes having similar biomarker fingerprints and thus expand our understanding of gene networks.
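As a rough illustration of what "gene-level collapsing" means, the sketch below groups samples by whether they carry at least one qualifying variant in a gene, then compares a biomarker between carriers and non-carriers. The abstract does not describe the study's qualifying-variant models or its statistical test, so the variant filter, the difference-in-means summary, and all identifiers here are assumptions made for illustration.

```python
from statistics import mean


def collapse_carriers(genotypes, variant_to_gene, qualifying):
    """Collapse variant-level calls to gene level.

    genotypes: {sample_id: iterable of variant IDs carried by that sample}
    Returns {gene: set of samples carrying >= 1 qualifying variant in it}.
    """
    carriers = {}
    for sample, variants in genotypes.items():
        for variant in variants:
            if variant in qualifying:
                gene = variant_to_gene[variant]
                carriers.setdefault(gene, set()).add(sample)
    return carriers


def carrier_effect(gene_carriers, biomarker):
    """Mean biomarker level in carriers minus non-carriers, or None if
    either group is empty. A real analysis would use a regression model
    with covariates and a significance test instead of a raw difference."""
    carr = [x for s, x in biomarker.items() if s in gene_carriers]
    rest = [x for s, x in biomarker.items() if s not in gene_carriers]
    if not carr or not rest:
        return None
    return mean(carr) - mean(rest)


if __name__ == "__main__":
    # Hypothetical samples, variants, genes, and biomarker values.
    genotypes = {"a": ["v1"], "b": ["v2", "v3"], "c": [], "d": ["v3"]}
    variant_to_gene = {"v1": "GENE_A", "v2": "GENE_A", "v3": "GENE_B"}
    qualifying = {"v1", "v2"}  # e.g. rare, predicted-damaging variants
    carriers = collapse_carriers(genotypes, variant_to_gene, qualifying)
    biomarker = {"a": 4.0, "b": 6.0, "c": 8.0, "d": 10.0}
    print(carrier_effect(carriers["GENE_A"], biomarker))
```

Collapsing pools many individually rare variants into one per-gene indicator, which is what gives such analyses power when no single rare variant has enough carriers to test on its own.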


PaperBLAST: Text-mining papers for information about homologs

10.1101/133041 ◽

2017 ◽

Author(s):

Morgan N. Price ◽

Adam P. Arkin

Keyword(s):

Text Mining ◽

Genome Sequencing ◽

Full Text ◽

Large Scale ◽

Scientific Literature ◽

Protein Sequences ◽

Protein Coding ◽

Link Protein ◽

Protein Coding Genes ◽

Link Type

Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources that link protein sequences to scientific articles (Swiss-Prot, GeneRIF, and EcoCyc). PaperBLAST’s database includes over 700,000 scientific articles that mention over 400,000 different proteins. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators. PaperBLAST is available at http://papers.genomics.lbl.gov/.


Loss of critical developmental and human disease-causing genes in 58 mammals

10.1101/819169 ◽

2019 ◽

Author(s):

Yatish Turakhia ◽

Heidi I. Chen ◽

Amir Marcovitz ◽

Gill Bejerano

Keyword(s):

Evolutionary Biology ◽

Large Scale ◽

Gene Annotation ◽

Synonymous Substitution ◽

Specific Gene ◽

High Confidence ◽

Protein Coding ◽

Congenital Diseases ◽

Manual Curation ◽

Human Genes

Gene losses provide an insightful route for studying the morphological and physiological adaptations of species, but their discovery is challenging. Existing genome annotation tools and protein databases focus on annotating intact genes and do not attempt to distinguish nonfunctional genes from genes missing annotation due to sequencing and assembly artifacts. Previous attempts to annotate gene losses have required significant manual curation, which hampers their scalability for the ever-increasing deluge of newly sequenced genomes. Using extreme sequence erosion (deletion and non-synonymous substitution) as an unambiguous signature of loss, we developed an automated approach for detecting high-confidence protein-coding gene loss events across a species tree. Our approach relies solely on gene annotation in a single reference genome, raw assemblies for the remaining species to analyze, and the associated phylogenetic tree for all organisms involved. Using the hg38 human assembly as a reference, we discovered over 500 unique human genes affected by such high-confidence erosion events in different clades across 58 mammals. While most of these events likely have benign consequences, we also found dozens of clade-specific gene losses that result in early lethality in outgroup mammals or are associated with severe congenital diseases in humans. Our discoveries yield intriguing potential for translational medical genetics and for evolutionary biology, and our approach is readily applicable to large-scale genome sequencing efforts across the tree of life.
