Genome-Wide Prediction of cis-Regulatory Regions Using Supervised Deep Learning Methods

2016
Author(s): Yifeng Li, Wenqiang Shi, Wyeth W Wasserman

Identifying active cis-regulatory regions in the human genome is critical for understanding gene regulation and assessing the impact of genetic variation on phenotype. Based on rich data resources such as the Encyclopedia of DNA Elements (ENCODE) and the Functional Annotation of the Mammalian Genome (FANTOM) projects, we introduce DECRES, the first supervised deep learning approach for the identification of enhancer and promoter regions in the human genome. Because deep learning methods can discover patterns in large and complex data, their introduction enables a significant advance in our knowledge of the genomic locations of cis-regulatory regions. Using models for well-characterized cell lines, we identify key experimental features that contribute to the predictive performance. Applying DECRES, we delineate the locations of 300,000 candidate enhancers genome-wide (6.8% of the genome, of which 40,000 are supported by bidirectional transcription data) and 26,000 candidate promoters (0.6% of the genome).
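The supervised setup described above can be sketched with a small feed-forward classifier. This is a minimal illustration only: the feature matrix below is random stand-in data (real DECRES inputs come from ENCODE/FANTOM experimental signals), and the tiny scikit-learn MLP is not the authors' architecture.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical stand-in data: rows are genomic regions, columns are
# experimental signals (e.g. DNase accessibility, histone marks);
# labels mark enhancer (1) vs background (0).
rng = np.random.default_rng(0)
X = rng.random((200, 8))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # toy "regulatory" rule

# A small feed-forward network in the spirit of a supervised deep model.
clf = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0)
clf.fit(X, y)
train_acc = clf.score(X, y)  # training accuracy on the toy data
```

On real data the features, labels, and network depth would of course differ; the point is only the supervised region-classification framing.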

Cancers, 2021, Vol 13 (11), pp. 2764
Author(s): Xin Yu Liew, Nazia Hameed, Jeremie Clos

A computer-aided diagnosis (CAD) expert system is a powerful tool to efficiently assist a pathologist in achieving an early diagnosis of breast cancer. This process identifies the presence of cancer in breast tissue samples and the distinct stage of the cancer. In a standard CAD system, the main pipeline involves image pre-processing, segmentation, feature extraction, feature selection, classification, and performance evaluation. In this review paper, we survey the existing state-of-the-art machine learning approaches applied at each stage, covering both conventional and deep learning methods, compare the methods, and provide technical details with their advantages and disadvantages. The aims are to investigate the impact of CAD systems using histopathology images, to examine deep learning methods that outperform conventional methods, and to provide a summary for future researchers to analyse and improve the existing techniques. Lastly, we discuss the research gaps of existing machine learning approaches and propose future direction guidelines for upcoming researchers.


2020, Vol 10 (1)
Author(s): Sebastian Carrasco Pro, Katia Bulekova, Brian Gregor, Adam Labadorf, Juan Ignacio Fuxman Bass

Abstract Single nucleotide variants (SNVs) located in transcriptional regulatory regions can result in gene expression changes that lead to adaptive or detrimental phenotypic outcomes. Here, we predict gain or loss of binding sites for 741 transcription factors (TFs) across the human genome. We calculated ‘gainability’ and ‘disruptability’ scores for each TF that represent the likelihood of binding sites being created or disrupted, respectively. We found that functional cis-eQTL SNVs are more likely to alter TF binding sites than rare SNVs in the human population. In addition, we show that cancer somatic mutations affect the binding sites of different TF families in a cancer-type-specific manner. Finally, we discuss the relationship between these results and cancer mutational signatures. Altogether, we provide a blueprint to study the impact of SNVs derived from genetic variation or disease association on TF binding to gene regulatory regions.
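The gain/loss idea can be illustrated with a toy position weight matrix (PWM) scan: score the best motif match in the reference and variant sequences and take the difference. The PWM, sequences, and scoring scheme below are invented for illustration; the paper's actual gainability/disruptability scores aggregate over 741 real TF models.

```python
import numpy as np

# Toy PWM for a hypothetical 4-bp TF motif; rows are positions,
# columns are log-odds scores for A, C, G, T.
PWM = np.array([
    [ 2.0, -1.0, -1.0, -1.0],   # position 1 favours A
    [-1.0,  2.0, -1.0, -1.0],   # position 2 favours C
    [-1.0, -1.0,  2.0, -1.0],   # position 3 favours G
    [-1.0, -1.0, -1.0,  2.0],   # position 4 favours T
])
BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def best_motif_score(seq):
    """Maximum PWM score over all windows of the sequence."""
    k = PWM.shape[0]
    return max(
        sum(PWM[i, BASE[seq[start + i]]] for i in range(k))
        for start in range(len(seq) - k + 1)
    )

def snv_effect(ref_seq, pos, alt_base):
    """Score change caused by substituting alt_base at pos (0-based)."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return best_motif_score(alt_seq) - best_motif_score(ref_seq)

# A SNV that completes the ACGT motif scores as a predicted binding-site gain.
delta = snv_effect("ACGAAA", 3, "T")  # positive delta = gain, negative = loss
```

A positive delta is a candidate gain, a negative one a candidate disruption; thresholds and normalization are left out of this sketch.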


Diagnostics, 2021, Vol 11 (9), pp. 1672
Author(s): Luya Lian, Tianer Zhu, Fudong Zhu, Haihua Zhu

Objectives: Deep learning methods have achieved impressive diagnostic performance in the field of radiology. The current study aimed to use deep learning methods to detect caries lesions on panoramic films, classify their radiographic extension, and compare the classification results with those of expert dentists. Methods: A total of 1160 dental panoramic films were evaluated by three expert dentists. All caries lesions in the films were marked with circles, whose combination was defined as the reference dataset. A training and validation dataset (1071 films) and a test dataset (89 films) were then established from the reference dataset. A convolutional neural network, nnU-Net, was applied to detect caries lesions, and DenseNet121 was applied to classify the lesions according to their depths (lesions in the outer, middle, or inner third of dentin, denoted D1/D2/D3). The performance of the trained nnU-Net and DenseNet121 models on the test dataset was compared with the results of six expert dentists in terms of the intersection over union (IoU), Dice coefficient, accuracy, precision, recall, negative predictive value (NPV), and F1-score metrics. Results: nnU-Net yielded caries lesion segmentation IoU and Dice coefficient values of 0.785 and 0.663, respectively, and the accuracy and recall rate of nnU-Net were 0.986 and 0.821, respectively. The results of the expert dentists and the neural network were shown to be no different in terms of accuracy, precision, recall, NPV, and F1-score. For caries depth classification, DenseNet121 showed an overall accuracy of 0.957 for D1 lesions, 0.832 for D2 lesions, and 0.863 for D3 lesions. The recall results for the D1/D2/D3 lesions were 0.765, 0.652, and 0.918, respectively. All metric values, including accuracy, precision, recall, NPV, and F1-score, were likewise shown to be no different from those of the experienced dentists.
Conclusion: In detecting and classifying caries lesions on dental panoramic radiographs, the performance of deep learning methods was similar to that of expert dentists. The impact of applying these well-trained neural networks for disease diagnosis and treatment decision making should be explored.
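The segmentation metrics reported above can be computed from binary masks as follows. This is a generic sketch with made-up 2D masks, not the study's evaluation code; the same formulas apply to full-size radiograph masks.

```python
import numpy as np

def iou_and_dice(pred, target):
    """Intersection-over-union and Dice coefficient for binary masks."""
    pred, target = np.asarray(pred, bool), np.asarray(target, bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    iou = inter / union if union else 1.0     # empty masks count as perfect
    dice = 2 * inter / total if total else 1.0
    return iou, dice

# Tiny invented masks: 1 marks a predicted / reference lesion pixel.
pred   = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
iou, dice = iou_and_dice(pred, target)  # inter=2, union=4
```

Note that Dice = 2·IoU/(1+IoU), which is why the two numbers track each other in the results.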


2021
Author(s): Yiyuan Fang, Shuyi Deng, Cai Li

Germline mutation rates are essential for genetic and evolutionary analyses. Yet estimating accurate fine-scale mutation rates across the genome is a great challenge, due to relatively few observed mutations and intricate relationships between predictors and mutation rates. Here we present MuRaL (Mutation Rate Learner), a deep learning-based framework to predict fine-scale mutation rates using only genomic sequences as input. Harnessing human germline variants for comprehensive assessment, we show that MuRaL achieves better predictive performance than current state-of-the-art methods. Moreover, MuRaL can build models with relatively few training mutations and a moderate number of sequenced individuals, and it can leverage transfer learning to build models with even less training data and time. We apply MuRaL to produce genome-wide mutation rate profiles for four species (Homo sapiens, Macaca mulatta, Arabidopsis thaliana, and Drosophila melanogaster), demonstrating its broad applicability. The generated mutation rate profiles and open-source software can greatly facilitate related research.


2021
Author(s): Lama Alsudias, Paul Rayson

BACKGROUND Twitter is a real-time messaging platform widely used by people and organisations to share information on many topics. Analysing tweets could potentially be useful for infectious disease monitoring, reducing reporting lag time and providing an independent, complementary source of data compared to traditional approaches. However, such analysis is currently not possible in the Arabic-speaking world due to the lack of basic building blocks for research.
OBJECTIVE We collected around 4,000 Arabic tweets related to COVID-19 and influenza. We cleaned and labelled the tweets relative to the Arabic Infectious Diseases Ontology, which includes non-standard terminology, 11 core concepts, and 21 relations. The aim of this study is to analyse Arabic tweets to estimate their usefulness for health surveillance, understand the impact of informal terms on the analysis, show the effect of deep learning methods in the classification process, and identify the locations where infection is spreading.
METHODS We apply multi-label classification techniques, namely Binary Relevance, Classifier Chains, Label Powerset, Adapted Algorithm (MLkNN), NBSVM, BERT, and AraBERT, to identify infected people. We also use Named Entity Recognition to predict the locations affected.
RESULTS We achieve an F1-score of up to 88% in the influenza case study and 94% in the COVID-19 one. Adapting for non-standard terminology and informal language improves accuracy by as much as 15%, with an average improvement of 8%. Deep learning methods achieve a hamming loss of around 5% in the classification process. Our geo-location detection algorithm predicts the location of users from tweet content with an average accuracy of 54%.
CONCLUSIONS This study identifies two Arabic social media datasets for monitoring tweets related to influenza and COVID-19. It demonstrates the importance of including informal terms, which are regularly used by social media users, in the analysis. It also shows that BERT achieves good results when used with new terms in COVID-19 tweets. Finally, tweet content may contain useful information to determine the location of disease spread.
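Binary Relevance, the first of the multi-label techniques listed, trains one independent binary classifier per label. A minimal sketch with invented English stand-in tweets and hypothetical ontology-style labels (the real pipeline uses Arabic text and the full ontology):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import hamming_loss
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical tweets and concept labels, invented for illustration.
tweets = [
    "I have a fever and a cough",
    "my brother tested positive for covid",
    "flu season is starting in Riyadh",
    "feeling healthy after recovering from influenza",
]
labels = [{"symptom"}, {"infected"}, {"location"}, {"recovered"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)           # one binary column per label
X = TfidfVectorizer().fit_transform(tweets)

# Binary Relevance: one independent binary classifier per label column.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
pred = clf.predict(X)
loss = hamming_loss(Y, pred)            # fraction of wrong label decisions
```

Classifier Chains and Label Powerset differ only in how label dependencies are modelled; hamming loss is the metric reported in the abstract.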


2021, Vol 3 (1), pp. 243-262
Author(s): Antoine Pirovano, Hippolyte Heuberger, Sylvain Berlemont, Saïd Ladjal, Isabelle Bloch

Deep learning methods are widely used in medical applications to assist doctors in their daily routine. While performance reaches expert level, interpretability (highlighting how and what a trained model learned and why it makes a specific decision) is the next important challenge deep learning methods must address to be fully integrated into the medical field. In this paper, we address the question of interpretability in the context of whole slide image (WSI) classification: we formalize the design of WSI classification architectures and propose a piece-wise interpretability approach relying on gradient-based methods, feature visualization, and the multiple-instance learning context. After training two WSI classification architectures on the Camelyon-16 WSI dataset, highlighting the discriminative features learned, and validating our approach with pathologists, we propose a novel way of computing interpretability slide-level heat-maps, based on the extracted features, that improves tile-level classification performance. We measure the improvement using tile-level AUC, which we call Localization AUC, and show an improvement of more than 0.2. We also validate our results with a RemOve And Retrain (ROAR) measure. Then, after studying the impact of the number of features used for heat-map computation, we propose a corrective approach, relying on activation colocalization of selected features, that improves the performance and stability of our proposed method.
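The Localization AUC described above is, in essence, a tile-level ROC AUC of heat-map scores against tile labels. A minimal sketch with invented tile labels and heat-map scores (real values would come from a trained WSI model and pathologist annotations):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical tile-level data for one slide: 1 = tumour tile, 0 = normal,
# with the heat-map score assigned to each tile by the model.
tile_labels    = np.array([0, 0, 0, 1, 1, 0, 1, 0])
heatmap_scores = np.array([0.1, 0.2, 0.15, 0.9, 0.7, 0.3, 0.8, 0.25])

# "Localization AUC": how well heat-map scores rank tumour tiles
# above normal tiles within the slide.
loc_auc = roc_auc_score(tile_labels, heatmap_scores)
```

Here every tumour tile outscores every normal tile, so the toy AUC is perfect; the paper's reported gain of more than 0.2 is measured on this kind of tile-level ranking.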


Author(s): Marcel Bengs, Finn Behrendt, Julia Krüger, Roland Opfer, Alexander Schlaefer

Abstract Purpose Brain Magnetic Resonance Images (MRIs) are essential for the diagnosis of neurological diseases. Recently, deep learning methods for unsupervised anomaly detection (UAD) have been proposed for the analysis of brain MRI. These methods rely on healthy brain MRIs only and, unlike supervised deep learning, eliminate the requirement for pixel-wise annotated data. While a wide range of methods for UAD have been proposed, most are 2D and learn only from MRI slices, disregarding that brain lesions are inherently 3D, so the spatial context of MRI volumes remains unexploited. Methods We investigate whether increased spatial context, using MRI volumes combined with spatial erasing, leads to improved unsupervised anomaly segmentation performance compared to learning from slices. We evaluate and compare a 2D variational autoencoder (VAE) to its 3D counterpart, propose 3D input erasing, and systematically study the impact of the data set size on performance. Results Using two publicly available segmentation data sets for evaluation, 3D VAEs outperform their 2D counterpart, highlighting the advantage of volumetric context. Our 3D erasing methods allow for further performance improvements. Our best performing 3D VAE with input erasing achieves an average DICE score of 31.40%, compared to 25.76% for the 2D VAE. Conclusions We propose 3D deep learning methods for UAD in brain MRI combined with 3D erasing and demonstrate that 3D methods clearly outperform their 2D counterparts for anomaly segmentation. Our spatial erasing method allows for further performance improvements and reduces the requirement for large data sets.
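Input erasing on a volume can be sketched as zeroing a random sub-block of the 3D input before it reaches the autoencoder. The cubic block shape and fixed size below are assumptions for illustration, not the paper's exact erasing policy:

```python
import numpy as np

def erase_3d(volume, size, rng):
    """Randomly zero out a cubic sub-block of a 3D volume (input erasing)."""
    out = volume.copy()
    d, h, w = volume.shape
    z = rng.integers(0, d - size + 1)
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    out[z:z + size, y:y + size, x:x + size] = 0.0
    return out

rng = np.random.default_rng(0)
vol = np.ones((32, 32, 32), dtype=np.float32)  # stand-in MRI volume
erased = erase_3d(vol, size=8, rng=rng)
```

During training, the model would reconstruct the original volume from the erased input, forcing it to use surrounding 3D context.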


2020, Vol 11
Author(s): Liangxu Xie, Lei Xu, Ren Kong, Shan Chang, Xiaojun Xu

Accurate prediction of the physical properties and bioactivity of drug molecules with deep learning depends on how the molecules are represented. Many types of molecular descriptors have been developed for quantitative structure-activity/property relationship (QSAR/QSPR) modelling. However, each molecular descriptor is optimized for a specific application and has its own encoding preferences. Considering that standalone featurization methods may cover only part of the information in chemical molecules, we propose building a conjoint fingerprint by combining two complementary fingerprints. The impact of the conjoint fingerprint and of each standalone fingerprint on predictive performance was systematically evaluated in predicting the logarithm of the partition coefficient (logP) and protein-ligand binding affinity using machine learning/deep learning (ML/DL) methods, including random forest (RF), support vector regression (SVR), extreme gradient boosting (XGBoost), long short-term memory networks (LSTM), and deep neural networks (DNN). The results demonstrate that the conjoint fingerprint yields improved predictive performance, even outperforming the consensus model built from two standalone fingerprints for four of the five examined methods. Given that the conjoint fingerprint scheme is easily extensible and highly applicable, we expect the proposed conjoint scheme to create new opportunities for continuously improving the predictive performance of deep learning by harnessing the complementarity of various types of fingerprints.
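Mechanically, a conjoint fingerprint is just the concatenation of two fingerprint bit vectors into one feature vector. The generators below are random stand-ins for illustration; real work would use chemistry toolkits (e.g. RDKit Morgan and MACCS fingerprints), which are not reproduced here:

```python
import numpy as np

# Hypothetical stand-ins for two complementary fingerprint generators,
# each producing a fixed-length binary bit vector per molecule.
def fingerprint_a(seed, n_bits=128):
    rng = np.random.default_rng(seed)
    return (rng.random(n_bits) > 0.8).astype(np.int8)

def fingerprint_b(seed, n_bits=64):
    rng = np.random.default_rng(seed + 1000)
    return (rng.random(n_bits) > 0.8).astype(np.int8)

def conjoint_fingerprint(seed):
    """Concatenate two standalone fingerprints into one feature vector."""
    return np.concatenate([fingerprint_a(seed), fingerprint_b(seed)])

# Feature matrix for 10 stand-in molecules, ready for any ML/DL regressor.
X = np.stack([conjoint_fingerprint(i) for i in range(10)])
```

The resulting matrix feeds directly into RF, SVR, XGBoost, or a neural network, which is what makes the scheme easy to extend to other fingerprint pairs.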


2019
Author(s): Doug Speed, John Holmes, David J Balding

Abstract There is currently much debate regarding the best way to model how heritability varies across the genome. The authors of GCTA recommend the GCTA-LDMS-I Model, the authors of LD Score Regression recommend the Baseline LD Model, while we have instead recommended the LDAK Model. Here we provide a statistical framework for assessing heritability models using summary statistics from genome-wide association studies. Using data from studies of 31 complex human traits (average sample size 136,000), we show that the Baseline LD Model is the most realistic of the existing heritability models, but that it can be improved by incorporating features from the LDAK Model. Our framework also provides a method for estimating the selection-related parameter α from summary statistics. We find strong evidence (P < 1e-6) of negative genome-wide selection for traits including height, systolic blood pressure and college education, and that the impact of selection is stronger inside functional categories such as coding SNPs and promoter regions.
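As background to the α parameter: in the MAF-dependent parameterization used by LDAK-style models, the expected heritability contribution of SNP j scales as [2f_j(1-f_j)]^(1+α), where f_j is the minor allele frequency, so α < 0 gives rarer variants relatively larger per-SNP contributions, consistent with negative selection. A small numeric sketch (the parameterization is stated from the LDAK literature, not derived from this abstract):

```python
import numpy as np

def expected_contribution(maf, alpha):
    """Relative expected heritability of a SNP under [2f(1-f)]^(1+alpha)."""
    return (2 * maf * (1 - maf)) ** (1 + alpha)

maf = np.array([0.01, 0.1, 0.5])                    # rare, low, common SNPs
neutral  = expected_contribution(maf, alpha=0.0)    # contribution ~ heterozygosity
selected = expected_contribution(maf, alpha=-0.5)   # rare variants up-weighted
```

Comparing the rare-to-common contribution ratio under the two α values shows how negative α shifts heritability toward low-frequency variants.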

