GUIDES: sgRNA design for loss-of-function screens

Predicting functional effect of missense variants using graph attention neural networks

10.1101/2021.04.22.441037 ◽

2021 ◽

Author(s):

Haicang Zhang ◽

Michelle S. Xu ◽

Wendy K. Chung ◽

Yufeng Shen

Keyword(s):

Neural Networks ◽

Large Scale ◽

De Novo ◽

Neurodevelopmental Disorder ◽

Loss Of Function ◽

Sequencing Data ◽

Missense Variants ◽

Scale Population ◽

Scan Data ◽

Main Component

Abstract: Accurate prediction of damaging missense variants is critically important for interpreting genome sequence. While many methods have been developed, their performance has been limited. Recent progress in machine learning and availability of large-scale population genomic sequencing data provide new opportunities to significantly improve computational predictions. Here we describe gMVP, a new method based on graph attention neural networks. Its main component is a graph with nodes capturing predictive features of amino acids and edges weighted by coevolution strength, which enables effective pooling of information from local protein sequence context and functionally correlated distal positions. Evaluated on deep mutational scan data, gMVP outperforms published methods in identifying damaging variants in TP53, PTEN, BRCA1, and MSH2. Additionally, it achieves the best separation of de novo missense variants in neurodevelopmental disorder cases from those in controls. Finally, the model supports transfer learning to optimize gain- and loss-of-function predictions in sodium and calcium channels. In summary, we demonstrate that gMVP can improve interpretation of missense variants in clinical testing and genetic studies.
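The pooling idea in the abstract above (node features combined across edges weighted by coevolution strength) can be illustrated with a toy attention step. This is a minimal sketch and not gMVP's actual architecture: the function name `coevolution_attention`, the scaled dot-product scoring, and the choice to multiply the attention score by the edge weight are all assumptions made for illustration, and the vectors are made-up numbers.

```python
import math


def coevolution_attention(center, neighbors, coevolution):
    """Pool neighbor feature vectors into one context vector for the center residue.

    center: feature vector of the residue being scored
    neighbors: feature vectors (same length as center) of other positions
    coevolution: non-negative coevolution strength for each center-neighbor edge
    """
    dim = len(center)
    # Scaled dot-product similarity between the center node and each neighbor.
    logits = [sum(c * n for c, n in zip(center, nb)) / math.sqrt(dim)
              for nb in neighbors]
    top = max(logits)  # subtract the maximum for numerical stability
    # Assumption: the edge weight multiplies the unnormalised attention score.
    raw = [w * math.exp(l - top) for w, l in zip(coevolution, logits)]
    total = sum(raw)
    if total == 0:
        raise ValueError("all edges have zero coevolution strength")
    weights = [r / total for r in raw]
    # Weighted sum of neighbor features, one output value per feature dimension.
    pooled = [sum(w * nb[k] for w, nb in zip(weights, neighbors))
              for k in range(dim)]
    return weights, pooled


# Toy input: three neighbors, the last one with no coevolution signal.
center = [1.0, 0.0]
neighbors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
coevolution = [0.5, 0.5, 0.0]
weights, pooled = coevolution_attention(center, neighbors, coevolution)
```

In this sketch a position with zero coevolution strength contributes nothing to the pooled vector, however similar its features are to the center residue.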

Minimal genome-wide human CRISPR-Cas9 library

10.1101/848895 ◽

2019 ◽

Cited By ~ 1

Author(s):

Emanuel Gonçalves ◽

Mark Thomas ◽

Fiona M Behan ◽

Gabriele Picco ◽

Clare Pacini ◽

...

Keyword(s):

Large Scale ◽

Gene Loss ◽

Dynamic Range ◽

Loss Of Function ◽

Assay Sensitivity ◽

Guide Rna ◽

Minimal Genome ◽

Large Size ◽

Genome Wide ◽

Complex Models

Abstract: CRISPR guide-RNA libraries have been iteratively optimised to provide increasingly efficient reagents, although their large size is a barrier for many applications. We designed an optimised minimal genome-wide human CRISPR-Cas9 library (MinLibCas9), by mining existing large-scale gene loss-of-function datasets, resulting in a greater than 42% reduction in size compared to other libraries while preserving assay sensitivity and specificity. MinLibCas9 increases the dynamic range of CRISPR-Cas9 loss-of-function screens and extends their application to complex models and assays.

Exome-by-phenome-wide rare variant gene burden association with electronic health record phenotypes

10.1101/798330 ◽

2019 ◽

Author(s):

Joseph Park ◽

Nathan Katz ◽

Xinyuan Zhang ◽

Anastasia M Lucas ◽

Anurag Verma ◽

...

Keyword(s):

Large Scale ◽

Stop Codon ◽

Association Studies ◽

Whole Genome Sequencing Data ◽

Loss Of Function ◽

Sequencing Data ◽

Missense Variants ◽

Whole Exome ◽

Wide Scale ◽

Electronic Health

Abstract: Background: By coupling large-scale DNA sequencing with electronic health records (EHR), “genome-first” approaches can enhance our understanding of the contribution of rare genetic variants to disease. Aggregating rare, loss-of-function variants in a candidate gene into a “gene burden” to test for association with EHR phenotypes can identify both known and novel clinical implications for the gene in human disease. However, this methodology has not yet been applied on both an exome-wide and phenome-wide scale, and the clinical ontologies of rare loss-of-function variants in many genes have yet to be described.

Methods: We leveraged whole exome sequencing (WES) data in participants (N=11,451) in the Penn Medicine Biobank (PMBB) to address on an exome-wide scale the association of a burden of rare loss-of-function variants in each gene with diverse EHR phenotypes using a phenome-wide association study (PheWAS) approach. For discovery, we collapsed rare (minor allele frequency (MAF) ≤ 0.1%) predicted loss-of-function (pLOF) variants (i.e. frameshift insertions/deletions, gain/loss of stop codon, or splice site disruption) per gene to perform a gene burden PheWAS. Subsequent evaluation of the significant gene burden associations was done by collapsing rare (MAF ≤ 0.1%) missense variants with Rare Exonic Variant Ensemble Learner (REVEL) scores ≥ 0.5 into corresponding yet distinct gene burdens, as well as interrogation of individual low-frequency to common (MAF > 0.1%) pLOF variants and missense variants with REVEL ≥ 0.5. We replicated our findings using the UK Biobank’s (UKBB) whole exome sequence dataset (N=49,960).

Results: From the pLOF-based discovery phase, we identified 106 gene burdens with phenotype associations at p<10⁻⁶ from our exome-by-phenome-wide association studies. Positive-control associations included TTN (cardiomyopathy, p=7.83E-13), MYBPC3 (hypertrophic cardiomyopathy, p=3.48E-15), CFTR (cystic fibrosis, p=1.05E-15), CYP2D6 (adverse effects due to opiates/narcotics, p=1.50E-09), and BRCA2 (breast cancer, p=1.36E-07). Of the 106 genes, 12 gene-phenotype relationships were also detected by REVEL-informed missense-based gene burdens and 19 by single-variant analyses, demonstrating the robustness of these gene-phenotype relationships. Three genes showed evidence of association using both additional methods (BRCA1, CFTR, TGM6), leading to a total of 28 robust gene-phenotype associations within PMBB. Furthermore, replication studies in UKBB validated 30 of 106 gene burden associations, of which 12 demonstrated robustness in PMBB.

Conclusion: Our study presents 12 exome-by-phenome-wide robust gene-phenotype associations, which include three proof-of-concept associations and nine novel findings. We show the value of aggregating rare pLOF variants into gene burdens on an exome-wide scale for unbiased association with EHR phenotypes to identify novel clinical ontologies of human genes. Furthermore, we show the significance of evaluating gene burden associations through complementary, yet non-overlapping genetic association studies from the same dataset. Our results suggest that this approach applied to even larger cohorts of individuals with WES or whole-genome sequencing data linked to EHR phenotype data will yield many new insights into the relationship of genetic variation and disease phenotypes.
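The collapsing step described in the methods above (rare pLOF variants with MAF ≤ 0.1% aggregated per gene) can be sketched as a simple filter-and-union over variant records. This is an illustrative sketch only, not the study's pipeline: the record keys (`id`, `gene`, `consequence`, `maf`), the consequence labels, the function name `gene_burden`, and the toy data are all hypothetical.

```python
from collections import defaultdict

# Hypothetical labels for the pLOF classes listed in the abstract:
# frameshift indels, gain/loss of stop codon, splice site disruption.
PLOF_CONSEQUENCES = {"frameshift", "stop_gained", "stop_lost", "splice_site"}
MAF_THRESHOLD = 0.001  # 0.1%, expressed as a fraction


def gene_burden(variants, genotypes):
    """Collapse rare pLOF variants into per-gene carrier sets.

    variants: list of dicts with keys 'id', 'gene', 'consequence', 'maf'
    genotypes: dict mapping variant id -> set of sample IDs carrying the variant
    Returns a dict mapping gene -> set of samples carrying any qualifying variant.
    """
    carriers = defaultdict(set)
    for v in variants:
        is_plof = v["consequence"] in PLOF_CONSEQUENCES
        is_rare = v["maf"] <= MAF_THRESHOLD
        if is_plof and is_rare:
            # A sample counts once per gene, however many variants it carries.
            carriers[v["gene"]] |= genotypes.get(v["id"], set())
    return dict(carriers)


# Toy data, invented for illustration.
variants = [
    {"id": "v1", "gene": "TTN", "consequence": "frameshift", "maf": 0.0004},
    {"id": "v2", "gene": "TTN", "consequence": "stop_gained", "maf": 0.0009},
    {"id": "v3", "gene": "TTN", "consequence": "missense", "maf": 0.0002},
    {"id": "v4", "gene": "CFTR", "consequence": "splice_site", "maf": 0.02},
]
genotypes = {"v1": {"s1"}, "v2": {"s1", "s2"}, "v3": {"s3"}, "v4": {"s4"}}
burden = gene_burden(variants, genotypes)
```

Here the missense variant and the common splice-site variant are filtered out, so only TTN receives a burden. The resulting per-gene carrier indicator is the kind of variable that would then be tested against each EHR phenotype in a PheWAS.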

Minimal genome-wide human CRISPR-Cas9 library

Genome Biology ◽

10.1186/s13059-021-02268-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Emanuel Gonçalves ◽

Mark Thomas ◽

Fiona M. Behan ◽

Gabriele Picco ◽

Clare Pacini ◽

...

Keyword(s):

Large Scale ◽

Dynamic Range ◽

Loss Of Function ◽

Assay Sensitivity ◽

Guide Rna ◽

Minimal Genome ◽

Backward Compatibility ◽

Large Size ◽

Genome Wide ◽

Complex Models

Abstract: CRISPR guide RNA libraries have been iteratively improved to provide increasingly efficient reagents, although their large size is a barrier for many applications. We design an optimised minimal genome-wide human CRISPR-Cas9 library (MinLibCas9) by mining existing large-scale gene loss-of-function datasets, resulting in a greater than 42% reduction in size compared to other CRISPR-Cas9 libraries while preserving assay sensitivity and specificity. MinLibCas9 provides backward compatibility with existing datasets, increases the dynamic range of CRISPR-Cas9 screens and extends their application to complex models and assays.

Sequencing data discovery with MetaSeek

10.7287/peerj.preprints.27804 ◽

2019 ◽

Author(s):

Adrienne Hoarfrost ◽

Nick Brown ◽

C. Titus Brown ◽

Carol Arnosti

Keyword(s):

Large Scale ◽

Source Code ◽

Research Priorities ◽

Sequencing Data ◽

Data Discovery ◽

Web Based ◽

Search Filter ◽

Sequence Read Archive ◽

Meta Analyses ◽

Generation Sequencing

Sequencing data resources have increased exponentially in recent years, as has interest in large-scale meta-analyses of integrated next-generation sequencing datasets. However, curation of integrated datasets that match a user’s particular research priorities is currently a time-intensive and imprecise task. MetaSeek is a sequencing data discovery tool that enables users to flexibly search and filter on any metadata field to quickly find the sequencing datasets that meet their needs. MetaSeek automatically scrapes metadata from all publicly available datasets in the Sequence Read Archive, cleans and parses messy, user-provided metadata into a structured, standard-compliant database, and predicts missing fields where possible. MetaSeek provides a web-based graphical user interface and interactive visualization dashboard, as well as a programmatic API to rapidly search, filter, visualize, save, share, and download matching sequencing metadata. The MetaSeek online interface is available at https://www.metaseek.cloud/. The MetaSeek database can also be accessed via API to programmatically search, filter, and download all metadata. MetaSeek source code, metadata scrapers, and documents are available at https://github.com/MetaSeek-Sequencing-Data-Discovery/metaseek/. Additional guides, tutorials, and documents are available at https://github.com/MetaSeek-Sequencing-Data-Discovery/metaseek, and on the MetaSeek website, https://www.metaseek.cloud/. MetaSeek is distributed under an MIT license.

SMAP: A pipeline for sample matching in proteogenomics

10.1101/2021.09.17.460682 ◽

2021 ◽

Author(s):

Ling Li ◽

Mingming Niu ◽

Alyssa Erickson ◽

Jie Luo ◽

Kincaid Rowbotham ◽

...

Keyword(s):

Large Scale ◽

Ribosome Profiling ◽

Sequencing Data ◽

Protein Coding ◽

Web Based ◽

Link Type ◽

Genomics And Proteomics ◽

Sample Data ◽

Dependent Protein ◽

Coding Variants

Abstract: Integration of genomics and proteomics (proteogenomics) offers unprecedented promise for in-depth understanding of human diseases. However, sample mix-up is a pervasive, recurring problem, due to complex sample processing in proteogenomics. Here we present a pipeline for Sample Matching in Proteogenomics (SMAP) for verifying sample identity to ensure data integrity. SMAP infers sample-dependent protein-coding variants from quantitative mass spectrometry (MS), and aligns the MS-based proteomic samples with genomic samples by two discriminant scores. Theoretical analysis with simulation data indicates that SMAP is capable of uniquely matching proteomic and genomic samples when ≥20% of the genotypes of individual samples are available. When SMAP was applied to a large-scale proteomics dataset from 288 biological samples generated by the PsychENCODE BrainGVEX project, we identified and corrected 18.8% (54/288) mismatched samples. The correction was further confirmed by ribosome profiling and assay for transposase-accessible chromatin sequencing data from the same set of samples. Thus, our results demonstrate that SMAP is an effective tool for sample verification in a large-scale MS-based proteogenomics study. The source code, manual, and sample data of SMAP are publicly available at https://github.com/UND-Wanglab/SMAP, and a web-based SMAP can be accessed at https://smap.shinyapps.io/smap/.

A Survey on Classical Teletraffic Models and Network Planning Issues for Cellular Telephony

Recent Advances in Broadband Integrated Network Operations and Services Management ◽

10.4018/978-1-60960-589-6.ch015 ◽

2011 ◽

pp. 250-262

Author(s):

Francisco Barcelo-Arroyo ◽

Israel Martin-Escalona

Keyword(s):

Graphical User Interface ◽

Power Plants ◽

Sensor Node ◽

Large Scale ◽

Real Life ◽

Wireless Sensor ◽

Web Based ◽

Simulation Tools ◽

Cellular Telephony ◽

Consumption Rates

Air pollution is an important environmental issue that has a direct effect on human health and ecological balance. Factories, power plants, vehicles, windblown dust and wildfires are some of the contributors to pollution. Reasonable simulation tools exist for evaluating large scale sensor networks; however, they fail to capture significant details of node operation or practical aspects of wireless communication. Real life testbeds capture the realism and bring out important aspects for further research. In this paper, we present an implementation of a wireless sensor network testbed for automatic and real-time monitoring of environmental pollution for the protection of public spaces. The paper describes the physical setup, the sensor node hardware and software architecture for “anytime, anywhere” monitoring and management of pollution data through a single, Web-based graphical user interface. The paper presents practical issues in the integration of sensors, actual power consumption rates and develops a practical hierarchical routing methodology.

Sequencing data discovery with MetaSeek

Bioinformatics ◽

10.1093/bioinformatics/btz499 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4857-4859 ◽

Cited By ~ 1

Author(s):

Adrienne Hoarfrost ◽

Nick Brown ◽

C Titus Brown ◽

Carol Arnosti

Keyword(s):

Large Scale ◽

Source Code ◽

Research Priorities ◽

Sequencing Data ◽

Data Discovery ◽

Web Based ◽

Search Filter ◽

Sequence Read Archive ◽

Meta Analyses ◽

Generation Sequencing

Abstract: Summary: Sequencing data resources have increased exponentially in recent years, as has interest in large-scale meta-analyses of integrated next-generation sequencing datasets. However, curation of integrated datasets that match a user’s particular research priorities is currently a time-intensive and imprecise task. MetaSeek is a sequencing data discovery tool that enables users to flexibly search and filter on any metadata field to quickly find the sequencing datasets that meet their needs. MetaSeek automatically scrapes metadata from all publicly available datasets in the Sequence Read Archive, cleans and parses messy, user-provided metadata into a structured, standard-compliant database and predicts missing fields where possible. MetaSeek provides a web-based graphical user interface and interactive visualization dashboard, as well as a programmatic API to rapidly search, filter, visualize, save, share and download matching sequencing metadata. Availability and implementation: The MetaSeek online interface is available at https://www.metaseek.cloud/. The MetaSeek database can also be accessed via API to programmatically search, filter and download all metadata. MetaSeek source code, metadata scrapers and documents are available at https://github.com/MetaSeek-Sequencing-Data-Discovery/metaseek/.

Planet Microbe: a platform for marine microbiology to discover and analyze interconnected ‘omics and environmental data

Nucleic Acids Research ◽

10.1093/nar/gkaa637 ◽

2020 ◽

Vol 49 (D1) ◽

pp. D792-D802

Author(s):

Alise J Ponsero ◽

Matthew Bomhoff ◽

Kai Blumberg ◽

Ken Youens-Clark ◽

Nina M Herz ◽

...

Keyword(s):

Large Scale ◽

Environmental Data ◽

Sequencing Data ◽

Sample Collection ◽

Data Discovery ◽

Web Based ◽

Marine Microbiology ◽

Science Community ◽

Marine Microbial Communities ◽

Collaborative Efforts

Abstract: In recent years, large-scale oceanic sequencing efforts have provided a deeper understanding of marine microbial communities and their dynamics. These research endeavors require the acquisition of complex and varied datasets through large, interdisciplinary and collaborative efforts. However, no unifying framework currently exists for the marine science community to integrate sequencing data with physical, geological, and geochemical datasets. Planet Microbe is a web-based platform that enables data discovery from curated historical and on-going oceanographic sequencing efforts. In Planet Microbe, each ‘omics sample is linked with other biological and physiochemical measurements collected for the same water samples or during the same sample collection event, to provide a broader environmental context. This work highlights the need for curated aggregation efforts that can enable new insights into high-quality metagenomic datasets. Planet Microbe is freely accessible from https://www.planetmicrobe.org/.

Sequencing data discovery with MetaSeek

10.7287/peerj.preprints.27804v1 ◽

2019 ◽

Author(s):

Adrienne Hoarfrost ◽

Nick Brown ◽

C. Titus Brown ◽

Carol Arnosti

Keyword(s):

Large Scale ◽

Source Code ◽

Research Priorities ◽

Sequencing Data ◽

Data Discovery ◽

Web Based ◽

Search Filter ◽

Sequence Read Archive ◽

Meta Analyses ◽

Generation Sequencing

Sequencing data resources have increased exponentially in recent years, as has interest in large-scale meta-analyses of integrated next-generation sequencing datasets. However, curation of integrated datasets that match a user’s particular research priorities is currently a time-intensive and imprecise task. MetaSeek is a sequencing data discovery tool that enables users to flexibly search and filter on any metadata field to quickly find the sequencing datasets that meet their needs. MetaSeek automatically scrapes metadata from all publicly available datasets in the Sequence Read Archive, cleans and parses messy, user-provided metadata into a structured, standard-compliant database, and predicts missing fields where possible. MetaSeek provides a web-based graphical user interface and interactive visualization dashboard, as well as a programmatic API to rapidly search, filter, visualize, save, share, and download matching sequencing metadata. The MetaSeek online interface is available at https://www.metaseek.cloud/. The MetaSeek database can also be accessed via API to programmatically search, filter, and download all metadata. MetaSeek source code, metadata scrapers, and documents are available at https://github.com/MetaSeek-Sequencing-Data-Discovery/metaseek/. Additional guides, tutorials, and documents are available at https://github.com/MetaSeek-Sequencing-Data-Discovery/metaseek, and on the MetaSeek website, https://www.metaseek.cloud/. MetaSeek is distributed under an MIT license.
