Data imbalance in CRISPR off-target prediction

Yuli Gao; Guohui Chuai; Weichuan Yu; Shen Qu; Qi Liu

doi:10.1093/bib/bbz069

Briefings in Bioinformatics ◽

10.1093/bib/bbz069 ◽

2019 ◽

Vol 21 (4) ◽

pp. 1448-1454 ◽

Cited By ~ 4

Author(s):

Yuli Gao ◽

Guohui Chuai ◽

Weichuan Yu ◽

Shen Qu ◽

Qi Liu

Keyword(s):

Computational Models ◽

Gene Editing ◽

Target Prediction ◽

Computational Techniques ◽

Cleavage Sites ◽

Detection Techniques ◽

Data Imbalance ◽

Genome Wide ◽

Machine Learning Model ◽

Nucleotide Mismatch

Abstract For genome-wide prediction of CRISPR off-target cleavage sites (OTS), an important issue is data imbalance: the number of true OTS recognized by whole-genome off-target detection techniques is much smaller than the number of all possible nucleotide mismatch loci, which makes training machine learning models very challenging. Therefore, computational models proposed for OTS prediction and scoring should be carefully designed and properly evaluated to avoid bias. In our study, two tools are taken as examples to further emphasize the data imbalance issue in CRISPR off-target prediction, with the aim of achieving better sensitivity and specificity for optimized CRISPR gene editing. We indicate that (1) benchmarks of CRISPR off-target prediction should be properly evaluated, taking the data imbalance issue into account so that performance is not overestimated; and (2) incorporating efficient computational techniques (including ensemble learning and data synthesis techniques) can help address the data imbalance issue and improve the performance of CRISPR off-target prediction. Taken together, we call for more efforts to address the data imbalance issue in CRISPR off-target prediction to facilitate the clinical utility of CRISPR-based gene editing techniques.
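
The evaluation concern raised in this abstract can be illustrated with a minimal sketch. The toy counts (10 true OTS among 1000 candidate mismatch loci) and the function names are hypothetical and are not taken from the paper; the point is only that accuracy can look excellent on imbalanced data while recall is zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Compute accuracy, precision and recall, guarding against empty denominators."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical toy data: 10 true off-target sites among 1000 candidate loci.
y_true = [1] * 10 + [0] * 990

# A useless predictor that never calls any site an off-target.
always_negative = [0] * 1000
trivial = summarize(y_true, always_negative)

print(trivial)  # accuracy is high even though no true site is recovered
```

Under these assumptions the always-negative predictor reaches 99% accuracy with zero recall, which is why imbalance-aware metrics such as precision and recall are the more informative choice for this kind of benchmark.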

Recognition of CRISPR off-target cleavage sites with SeqGAN

Current Bioinformatics ◽

10.2174/1574893616666210727162650 ◽

2021 ◽

Vol 16 ◽

Author(s):

Wen Li ◽

Xiao-Bo Wang ◽

Yan Xu

Keyword(s):

Original Data ◽

Cleavage Sites ◽

Guide RNA ◽

Data Imbalance ◽

Sequence Generation ◽

CRISPR System ◽

Adversarial Network ◽

Genome Wide ◽

Nucleotide Mismatch ◽

AUC Value

Background: The CRISPR system can quickly edit different gene loci by changing a short sequence on a single guide RNA. However, off-target events limit the further development of the CRISPR system, and improving the efficiency and specificity of this technology while minimizing off-target risk has always been a challenge. For genome-wide CRISPR off-target cleavage sites (OTS) prediction, an important issue is data imbalance, that is, the number of true OTS identified is much smaller than the number of all possible nucleotide mismatch loci. Method: In this work, positive off-target sequences were generated with a sequence generative adversarial network (SeqGAN) to augment the Cpf1 off-target site (OTS) dataset. A deep convolutional neural network (CNN) was then trained on the data to obtain a predictor with stronger generalization ability and better performance. Results: In 10-fold cross-validation, the AUC value of the CNN classifier after SeqGAN balancing was 0.941, higher than that obtained with the original data (0.863) and with over-sampling (0.929). In independent testing, the AUC value of the CNN classifier after SeqGAN balancing was 0.841, higher than that obtained with the original data (0.833) and with over-sampling (0.836). The PR value was 0.722 after SeqGAN, about 0.16 higher than with the original data and about 0.03 higher than with over-sampling. Conclusion: SeqGAN was used for the first time to handle data imbalance in CRISPR data. All the results showed that SeqGAN can effectively generate positive data for CRISPR off-target sites.
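
The over-sampling baseline that this abstract compares against can be sketched as follows. This is plain random over-sampling of the positive class, not SeqGAN; the function name, the toy sequences and the fixed seed are assumptions made for illustration only.

```python
import random


def oversample_positives(sequences, labels, seed=0):
    """Duplicate randomly chosen positives until both classes have equal size.

    Labels are assumed to be 1 (true off-target site) or 0 (non-cleaved locus).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    positives = [s for s, y in zip(sequences, labels) if y == 1]
    negatives = [s for s, y in zip(sequences, labels) if y == 0]
    if not positives or len(positives) >= len(negatives):
        return list(sequences), list(labels)  # nothing to balance
    extra = [rng.choice(positives) for _ in range(len(negatives) - len(positives))]
    balanced_sequences = positives + extra + negatives
    balanced_labels = [1] * (len(positives) + len(extra)) + [0] * len(negatives)
    return balanced_sequences, balanced_labels


# Hypothetical toy dataset: one positive site and four negatives.
toy_sequences = ["ACGT", "TTTT", "GGGG", "CCCC", "AAAA"]
toy_labels = [1, 0, 0, 0, 0]

balanced_sequences, balanced_labels = oversample_positives(toy_sequences, toy_labels)
print(balanced_labels.count(1), balanced_labels.count(0))
```

Random over-sampling only repeats existing positives, whereas a generative approach such as the one described in the abstract produces new positive sequences; the sketch does not attempt the latter.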

Benchmarking and integrating genome-wide CRISPR off-target detection and prediction

Nucleic Acids Research ◽

10.1093/nar/gkaa930 ◽

2020 ◽

Vol 48 (20) ◽

pp. 11370-11379

Author(s):

Jifang Yan ◽

Dongyu Xue ◽

Guohui Chuai ◽

Yuli Gao ◽

Gongchen Zhang ◽

...

Keyword(s):

In Silico ◽

Gene Knockout ◽

Target Prediction ◽

Systematic Evaluation ◽

Detection Techniques ◽

Prediction Tools ◽

Benchmark Study ◽

Genome Wide ◽

Benchmark Datasets ◽

One Stop

Abstract Systematic evaluation of genome-wide Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) off-target profiles is a fundamental step for the successful application of the CRISPR system to clinical therapies. Many experimental techniques and in silico tools have been proposed for detecting and predicting genome-wide CRISPR off-target profiles. These techniques and tools, however, have not been systematically benchmarked. A comprehensive benchmark study and an integrated strategy that takes advantage of the currently available tools to improve predictions of genome-wide CRISPR off-target profiles are needed. We focused on the specificity of the traditional CRISPR SpCas9 system for gene knockout. First, we benchmarked 10 available genome-wide off-target cleavage site (OTS) detection techniques with the published OTS detection datasets. Second, taking the datasets generated from OTS detection techniques as the benchmark datasets, we benchmarked 17 available in silico genome-wide OTS prediction tools to evaluate their genome-wide CRISPR off-target prediction performances. Finally, we present the first one-stop integrated Genome-Wide Off-target cleavage Search platform (iGWOS) that was specifically designed for the optimal genome-wide OTS prediction by integrating the available OTS prediction algorithms with an AdaBoost ensemble framework.
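
The idea of combining the scores of several prediction tools with an AdaBoost ensemble can be sketched in a few lines. This is a generic discrete AdaBoost over threshold stumps written for illustration; it is not the iGWOS implementation, and the two "tool scores" per site, the labels and all function names are hypothetical.

```python
import math


def stump_predict(row, feature, threshold, polarity):
    """Threshold stump: vote `polarity` if the score reaches the threshold, else the opposite."""
    return polarity if row[feature] >= threshold else -polarity


def fit_adaboost(X, y, rounds=3):
    """Discrete AdaBoost with exhaustive stump search; labels must be +1 or -1."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        best = None
        for feature in range(len(X[0])):
            for threshold in sorted({row[feature] for row in X}):
                for polarity in (1, -1):
                    err = sum(
                        w for w, row, label in zip(weights, X, y)
                        if stump_predict(row, feature, threshold, polarity) != label
                    )
                    if best is None or err < best[0]:
                        best = (err, feature, threshold, polarity)
        err, feature, threshold, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) for perfect stumps
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, feature, threshold, polarity))
        # Up-weight misclassified sites, down-weight correctly classified ones.
        weights = [
            w * math.exp(-alpha * label * stump_predict(row, feature, threshold, polarity))
            for w, row, label in zip(weights, X, y)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(row, f, t, p) for a, f, t, p in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical scores from two prediction tools for six candidate sites.
X = [[0.9, 0.8], [0.2, 0.7], [0.8, 0.9], [0.7, 0.2], [0.1, 0.1], [0.3, 0.3]]
y = [1, 1, 1, -1, -1, -1]  # +1 = validated off-target site, -1 = not cleaved

ensemble = fit_adaboost(X, y)
print([predict(ensemble, row) for row in X])
```

On this toy data the second tool alone separates the classes, so every boosting round selects a stump on it; with real benchmark data the rounds would typically draw on different tools and thresholds.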

Analyzing and interpreting DNA double-strand break sequencing data

10.1101/2020.03.05.977801 ◽

2020 ◽

Cited By ~ 1

Author(s):

Abhishek Mitra ◽

Norbert Dojer ◽

Bernard Fongang ◽

Jules Nde ◽

Yingjie Zhu ◽

...

Keyword(s):

Strand Break ◽

Cleavage Sites ◽

DNA Double Strand Breaks ◽

Sequencing Data ◽

Strand Breaks ◽

Detection Techniques ◽

Genome Wide ◽

DNA Double Strand Break ◽

DNA Replication Stress ◽

Cell Genome

Abstract DNA double-strand breaks (DSBs) are a major threat to genomic stability and may lead to cancer. Several technologies to accurately detect DSBs genome-wide have been developed recently, but publicly available tools for analyzing the resulting data are still lacking. Here, we present a step-by-step iSeq package (http://breakome.utmb.edu/software.html), custom designed for analysis and interpretation of DSB-sequencing data. iSeq performs barcode trimming and read counting, identifies DSB-enriched regions by statistical testing, and annotates them to the desired genomic features. Applying this package, users can identify and annotate DSB-enriched regions from base pair (e.g. Cas9 cleavage sites) up to megabase (e.g. DNA replication stress-induced) resolution and, if possible, quantify DSB frequencies per cell genome-wide by combining with qDSB-Seq. iSeq can be used with any sequencing-based DSB detection technique. The analysis for Steps 1-19 can be performed within ~4 hours.

Genetic Interactions Effects of Cardiovascular Disorder Using Computational Models: A Review

Current Biotechnology ◽

10.2174/2211550109999201008125800 ◽

2020 ◽

Vol 9 (3) ◽

pp. 177-191

Author(s):

Sridharan Priya ◽

Radha K. Manavalan

Keyword(s):

Coronary Artery Disease ◽

Coronary Artery ◽

Cardiovascular Diseases ◽

Computational Models ◽

Genetic Interaction ◽

Association Studies ◽

Genetic Interactions ◽

Computational Techniques ◽

Genome Wide Association Studies ◽

Artery Disease

Background: Diseases of the heart and blood vessels, such as heart attack, Coronary Artery Disease, Myocardial Infarction (MI), High Blood Pressure, and Obesity, are generally referred to as Cardiovascular Diseases (CVD). The risk factors of CVD include gender, age, cholesterol/LDL, family history, hypertension, smoking, and genetic and environmental factors. Genome-Wide Association Studies (GWAS) focus on identifying the genetic interactions and genetic architectures of CVD. Objective: Genetic interactions, or Epistasis, refer to interactions between two or more genes in which one gene masks the traits of another gene and increases the susceptibility to CVD. Identifying Epistasis relationships through biological or laboratory methods requires an enormous workforce and high cost. Hence, this paper reviews the various statistical and Machine learning approaches proposed so far to detect genetic interaction effects for the identification of various Cardiovascular diseases, covering Coronary Artery Disease (CAD), MI, Hypertension, HDL and Lipid phenotype data, and a Body Mass Index dataset. Conclusion: This study reveals that various computational models identified candidate genes such as AGT, PAI-1, ACE, PTPN22, MTHR, FAM107B, ZNF107, PON1, PON2, GTF2E1, ADGRB3, and FTO, which play a major role in the genetic interactions that cause CVDs. The benefits, limitations, and issues of the various computational techniques for the evolution of epistasis responsible for cardiovascular diseases are presented.

Analysis of computational codon usage models and their association with translationally slow codons

10.1101/2020.03.26.010488 ◽

2020 ◽

Author(s):

Gabriel Wright ◽

Anabel Rodriguez ◽

Jun Li ◽

Patricia L. Clark ◽

Tijana Milenković ◽

...

Keyword(s):

Codon Usage ◽

Computational Models ◽

Selective Pressure ◽

Synonymous Codon ◽

Ground Truth ◽

Protein Translation ◽

Weak Correlation ◽

Experimental Conditions ◽

Synonymous Codons ◽

Genome Wide

Abstract Improved computational modeling of protein translation rates, including better prediction of where translational slowdowns along an mRNA sequence may occur, is critical for understanding co-translational folding. Because codons within a synonymous codon group are translated at different rates, many computational translation models rely on analyzing synonymous codons. Some models rely on genome-wide codon usage bias (CUB), believing that globally rare and common codons are the most informative of slow and fast translation, respectively. Others use the CUB observed only in highly expressed genes, which should be under selective pressure to be translated efficiently (and whose CUB may therefore be more indicative of translation rates). No prior work has analyzed these models for their ability to predict translational slowdowns. Here, we evaluate five models for their association with slowly translated positions as denoted by two independent ribosome footprint (RFP) count experiments from S. cerevisiae, because RFP data is often considered as a “ground truth” for translation rates across mRNA sequences. We show that all five considered models strongly associate with the RFP data and therefore have potential for estimating translational slowdowns. However, we also show that there is a weak correlation between RFP counts for the same genes originating from independent experiments, even when their experimental conditions are similar. This raises concerns about the efficacy of using current RFP experimental data for estimating translation rates and highlights a potential advantage of using computational models to understand translation rates instead.
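
As a rough illustration of what a codon usage bias (CUB) measure looks like, the sketch below computes the relative frequency of each codon within its synonymous group. It is not one of the five models evaluated in the abstract; the codon table is deliberately partial (three amino acids from the standard genetic code), and the toy gene sequences and function names are hypothetical.

```python
from collections import Counter

# Partial synonymous-codon table, for illustration only.
SYNONYMOUS_CODONS = {
    "Lys": ["AAA", "AAG"],
    "Glu": ["GAA", "GAG"],
    "Leu": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
}


def split_codons(cds):
    """Split a coding sequence into codons, dropping any trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def relative_codon_usage(sequences):
    """Frequency of each codon relative to all codons encoding the same amino acid."""
    counts = Counter(codon for cds in sequences for codon in split_codons(cds))
    usage = {}
    for codons in SYNONYMOUS_CODONS.values():
        total = sum(counts[c] for c in codons)
        for c in codons:
            usage[c] = counts[c] / total if total else 0.0
    return usage


# Hypothetical toy genes: AAA AAG AAA CTG and TTA GAA GAA GAG.
toy_genes = ["AAAAAGAAACTG", "TTAGAAGAAGAG"]
usage = relative_codon_usage(toy_genes)
print(usage["AAA"], usage["AAG"])
```

Passing all genes gives a genome-wide style estimate, while passing only a subset (for example, highly expressed genes) gives the restricted variant that the abstract contrasts with it.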

Role of Genetic Interactions in Lung Diseases Detection using Computational Approaches: A Review

Current Chinese Computer Science ◽

10.2174/2665997201666210125091915 ◽

2021 ◽

Vol 01 ◽

Author(s):

S Priya ◽

R Manavalan

Keyword(s):

Lung Cancer ◽

Lung Diseases ◽

Computational Models ◽

Association Studies ◽

Genetic Interactions ◽

Genome Wide Association Studies ◽

Computational Approaches ◽

Genome Wide ◽

Human Disorders ◽

BRCA1 BRCA2

Genome-wide Association Studies (GWAS) give special insight into the genetic differences and environmental influences involved in different human disorders and provide prognostic help to increase patient survival. Lung diseases such as lung cancer, asthma, and tuberculosis are detected by analyzing Single Nucleotide Polymorphism (SNP) genetic variations. The key causes of lung-related diseases are genetic factors and environmental and social behaviors. Epistasis effects act as a blueprint for researchers to observe the genetic variation associated with lung diseases. Manually examining the enormous number of genetic interactions to detect lung syndromes for the diagnosis of acute respiratory conditions is complicated. Because of their importance, several computational approaches have been developed to infer epistasis effects. This article provides a comprehensive and multifaceted review of relevant genetic studies published between 2006 and 2020. In this critical review, various computational approaches for detecting Epistasis effects in lung diseases such as Asthma, Tuberculosis, lung cancer, and Nicotine drug dependence are discussed extensively. The analysis shows that different computational models identified candidate genes such as CHRNA4, CHRNB2, BDNF, TAS2R16, TAS2R38, BRCA1, BRCA2, RAD21, IL4Ra, IL-13 and IL-1β, which have important roles in the genetic variants linked to pulmonary disease. The strengths and limitations of these computational approaches are described. The issues faced by the computational methods when identifying lung diseases through epistasis effects, and the parameters used by various researchers for their evaluation, are presented.

piCRISPR: Physically Informed Features Improve Deep Learning Models for CRISPR/Cas9 Off-Target Cleavage Prediction

10.1101/2021.11.16.468799 ◽

2021 ◽

Author(s):

Florian Störtz ◽

Jeffrey Mak ◽

Peter Minary

Keyword(s):

Deep Learning ◽

Gene Editing ◽

Target Prediction ◽

Model Space ◽

Chromatin Accessibility ◽

Sequence Context ◽

Prediction Algorithms ◽

Cleavage Assay

CRISPR/Cas programmable nuclease systems have become ubiquitous in the field of gene editing. With progressing development, applications in in vivo therapeutic gene editing are increasingly within reach, yet limited by possible adverse side effects from unwanted edits. Recent years have thus seen continuous development of off-target prediction algorithms trained on in vitro cleavage assay data gained from immortalised cell lines. Here, we implement novel deep learning algorithms and feature encodings for off-target prediction and systematically sample the resulting model space in order to find optimal models and inform future modelling efforts. We lay emphasis on physically informed features, hence terming our approach piCRISPR, which we gain on the large, diverse crisprSQL off-target cleavage dataset. We find that our best-performing model highlights the importance of sequence context and chromatin accessibility for cleavage prediction and outperforms state-of-the-art prediction algorithms in terms of area under precision-recall curve.

Modeling of Steelmaking Processes

Advances in Chemical and Materials Engineering - Computational Approaches to Materials Design ◽

10.4018/978-1-5225-0290-6.ch013 ◽

2016 ◽

pp. 369-421

Author(s):

Seppo Louhenkilpi ◽

Subhas Ganguly

Keyword(s):

Statistical Learning ◽

Quantum Chemical ◽

High Performance ◽

Process Model ◽

Computational Models ◽

Computational Techniques ◽

Computational Process ◽

Steel Making ◽

Steel Converter ◽

Steelmaking Technology

In the field of experiment, theory, modeling and simulation, the most noteworthy advances applicable to steelmaking technology have been closely linked with the emergence of more powerful computing tools, advances in the necessary software and algorithm design, and, to a lesser degree, the development of emerging computing theory. These have enabled the integration of several different types of computational techniques (for example, quantum chemical methods, molecular dynamics, DFT, FEM, soft computing, and statistical learning, to name a few) to provide high-performance simulations of steelmaking processes based on emerging computational models and theories. This chapter gives an overview of the general steps and concepts for developing a computational process model, including a few exercises in the area of steelmaking. The various sections of the chapter aim to describe how to develop models for various issues related to steelmaking processes and how to simulate a physical process starting from the process fundamentals. The examples include the steel converter, tank vacuum degassing, and continuous casting.

Amplification-free long-read sequencing reveals unforeseen CRISPR-Cas9 off-target activity

Genome Biology ◽

10.1186/s13059-020-02206-w ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Ida Höijer ◽

Josefin Johansson ◽

Sanna Gudmundsson ◽

Chen-Shan Chin ◽

Ignas Bunikis ◽

...

Keyword(s):

Genomic DNA ◽

Human Fibroblasts ◽

Target Prediction ◽

Human Cell Line ◽

Cleavage Sites ◽

Guide RNA ◽

Target Sites ◽

Long Read ◽

Target Activity

Abstract Background: One ongoing concern about CRISPR-Cas9 genome editing is that unspecific guide RNA (gRNA) binding may induce off-target mutations. However, accurate prediction of CRISPR-Cas9 off-target activity is challenging. Here, we present SMRT-OTS and Nano-OTS, two novel, amplification-free, long-read sequencing protocols for detection of gRNA-driven digestion of genomic DNA by Cas9 in vitro. Results: The methods are assessed using the human cell line HEK293, re-sequenced at 18x coverage using highly accurate HiFi SMRT reads. SMRT-OTS and Nano-OTS are first applied to three different gRNAs targeting HEK293 genomic DNA, resulting in a set of 55 high-confidence gRNA cleavage sites identified by both methods. Twenty-five of these sites are not reported by off-target prediction software, either because they contain four or more single nucleotide mismatches or insertion/deletion mismatches, as compared with the human reference. Additional experiments reveal that 85% of Cas9 cleavage sites are also found by other in vitro-based methods and that on- and off-target sites are detectable in gene bodies where short-reads fail to uniquely align. Even though SMRT-OTS and Nano-OTS identify several sites with previously validated off-target editing activity in cells, our own CRISPR-Cas9 editing experiments in human fibroblasts do not give rise to detectable off-target mutations at the in vitro-predicted sites. However, indel and structural variation events are enriched at the on-target sites. Conclusions: Amplification-free long-read sequencing reveals Cas9 cleavage sites in vitro that would have been difficult to predict using computational tools, including in dark genomic regions inaccessible by short-read sequencing.
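
The mismatch criterion mentioned in this abstract can be made concrete with a small sketch that counts single nucleotide mismatches between a guide and a candidate site of equal length. The sequences, the function names and the default cutoff of three mismatches are assumptions for illustration; the sketch does not handle insertion/deletion mismatches and does not reproduce any specific prediction tool.

```python
def count_mismatches(guide, site):
    """Number of positions at which two equal-length sequences differ."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    return sum(1 for g, s in zip(guide.upper(), site.upper()) if g != s)


def within_search_limit(guide, site, max_mismatches=3):
    """True if a mismatch-limited search with this cutoff would report the site."""
    return count_mismatches(guide, site) <= max_mismatches


# Hypothetical 20-nt guide and two candidate genomic sites.
guide = "GAGTCCGAGCAGAAGAAGAA"
site_three = "GAGTTAGAGCAGAAGAAAAA"  # differs at positions 5, 6 and 18
site_four = "CAGTTAGAGCAGAAGAAAAA"   # additionally differs at position 1

print(count_mismatches(guide, site_three), within_search_limit(guide, site_three))
print(count_mismatches(guide, site_four), within_search_limit(guide, site_four))
```

With a cutoff of three, the second site would be skipped, which mirrors the situation the abstract describes for cleavage sites carrying four or more mismatches.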

Prediction-based highly sensitive CRISPR off-target validation using target-specific DNA enrichment

Nature Communications ◽

10.1038/s41467-020-17418-8 ◽

2020 ◽

Vol 11 (1) ◽

Cited By ~ 2

Author(s):

Seung-Hun Kang ◽

Wi-jae Lee ◽

Ju-Hyun An ◽

Jong-Hee Lee ◽

Young-Hyun Kim ◽

...

Keyword(s):

Target Detection ◽

High Sensitivity ◽

Fold Increase ◽

Amplicon Sequencing ◽

Detection Methods ◽

Detection Techniques ◽

Amplification Method ◽

Genome Wide ◽

Sequencing Method ◽

Specific Amplification

Abstract CRISPR effectors, which comprise a CRISPR-Cas protein and a guide (g)RNA derived from the bacterial immune system, are widely used for target-specific genome editing. When the gRNA recognizes genomic loci with sequences that are similar to the target, deleterious mutations can occur. Off-target mutations with a frequency below 0.5% remain mostly undetected by current genome-wide off-target detection techniques. Here we report a method to effectively detect extremely small amounts of mutated DNA based on predicted off-target-specific amplification. In this study, we used various genome editors to induce intracellular genome mutations, and the CRISPR amplification method detected off-target mutations at a significantly higher rate (1.6- to 984-fold increase) than an existing targeted amplicon sequencing method. In the near future, CRISPR amplification in combination with genome-wide off-target detection methods will allow detection of genome editor-induced off-target mutations with high sensitivity and in a non-biased manner.
