Connecting Histopathology Imaging and Proteomics in Kidney Cancer through Machine Learning

Mapping Intimacies ◽

10.1101/756288 ◽

2019 ◽

Author(s):

Francisco Azuaje ◽

Sang-Yoon Kim ◽

Daniel Perez Hernandez ◽

Gunnar Dittmar

Keyword(s):

Machine Learning ◽

Large Scale ◽

Diagnostic Value ◽

Classification Model ◽

Clinical Approach ◽

Proteomics Data ◽

Cell Renal Cell Carcinoma ◽

Molecular Features ◽

Genes Encoding ◽

New Research

Proteomics data encode molecular features of diagnostic value and accurately reflect key underlying biological mechanisms in cancers. Histopathology imaging is a well-established clinical approach to cancer diagnosis. The predictive relationship between large-scale proteomics and H&E-stained histopathology images remains largely uncharacterized. Here we investigate such associations through the application of machine learning, including deep neural networks, to proteomics and histology imaging datasets generated by the Clinical Proteomic Tumor Analysis Consortium (CPTAC) from clear cell renal cell carcinoma patients. We report robust correlations between a set of diagnostic proteins and predictions generated by an imaging-based classification model. Proteins significantly correlated with the histology-based predictions are significantly implicated in immune responses, extracellular matrix reorganization and metabolism. Moreover, we showed that the genes encoding these proteins also reliably recapitulate the biological associations with imaging-derived predictions based on strong gene-protein expression correlations. Our findings offer novel insights into the integrative modeling of histology and omics data through machine learning, as well as the methodological basis for new research opportunities in this and other cancer types.


Connecting Histopathology Imaging and Proteomics in Kidney Cancer through Machine Learning

Journal of Clinical Medicine ◽

10.3390/jcm8101535 ◽

2019 ◽

Vol 8 (10) ◽

pp. 1535 ◽

Cited By ~ 7

Author(s):

Francisco Azuaje ◽

Sang-Yoon Kim ◽

Daniel Perez Hernandez ◽

Gunnar Dittmar

Keyword(s):

Machine Learning ◽

Large Scale ◽

Diagnostic Value ◽

Classification Model ◽

Clinical Approach ◽

Proteomics Data ◽

Cell Renal Cell Carcinoma ◽

Molecular Features ◽

Genes Encoding ◽

New Research

Proteomics data encode molecular features of diagnostic value and accurately reflect key underlying biological mechanisms in cancers. Histopathology imaging is a well-established clinical approach to cancer diagnosis. The predictive relationship between large-scale proteomics and H&E-stained histopathology images remains largely uncharacterized. Here we investigate such associations through the application of machine learning, including deep neural networks, to proteomics and histology imaging datasets generated by the Clinical Proteomic Tumor Analysis Consortium (CPTAC) from clear cell renal cell carcinoma patients. We report robust correlations between a set of diagnostic proteins and predictions generated by an imaging-based classification model. Proteins significantly correlated with the histology-based predictions are significantly implicated in immune responses, extracellular matrix reorganization, and metabolism. Moreover, we showed that the genes encoding these proteins also reliably recapitulate the biological associations with imaging-derived predictions based on strong gene–protein expression correlations. Our findings offer novel insights into the integrative modeling of histology and omics data through machine learning, as well as the methodological basis for new research opportunities in this and other cancer types.
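The correlation analysis described above can be illustrated with a small sketch. The abstract does not say which correlation measure was used, so the Spearman rank correlation here is an assumption, and the sample values and the names `protein_abundance` and `image_score` are hypothetical.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data: one protein's abundance and the image model's
# predicted tumour probability for the same six samples.
protein_abundance = [0.2, 1.5, 0.9, 2.8, 3.1, 0.4]
image_score = [0.10, 0.55, 0.40, 0.90, 0.95, 0.20]
rho = spearman(protein_abundance, image_score)
```

In this toy case the two rankings agree exactly, so `rho` is 1. In a real analysis the test would be repeated per protein and the p-values corrected for multiple testing.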


Integrate multi-omics data with biological interaction networks using Multi-view Factorization AutoEncoder (MAE)

BMC Genomics ◽

10.1186/s12864-019-6285-x ◽

2019 ◽

Vol 20 (S11) ◽

Cited By ~ 3

Author(s):

Tianle Ma ◽

Aidong Zhang

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Domain Knowledge ◽

Large Scale ◽

Learning Model ◽

Interaction Networks ◽

Omics Data ◽

Biological Interaction ◽

Molecular Features ◽

Training Objective

Background: Comprehensive molecular profiling of various cancers and other diseases has generated vast amounts of multi-omics data. Each type of omics data corresponds to one feature space, such as gene expression, miRNA expression, or DNA methylation. Integrating multi-omics data can link different layers of molecular feature spaces and is crucial to elucidating the molecular pathways underlying various diseases. Machine learning approaches to mining multi-omics data hold great promise for uncovering intricate relationships among molecular features. However, due to the “big p, small n” problem (i.e., small sample sizes with high-dimensional features), training a large-scale generalizable deep learning model with multi-omics data alone is very challenging.

Results: We developed a method called Multi-view Factorization AutoEncoder (MAE) with network constraints that can seamlessly integrate multi-omics data and domain knowledge such as molecular interaction networks. Our method learns feature and patient embeddings simultaneously with deep representation learning. Both feature representations and patient representations are subject to certain constraints specified as regularization terms in the training objective. By incorporating domain knowledge into the training objective, we implicitly introduced a good inductive bias into the machine learning model, which helps improve model generalizability. We performed extensive experiments on the TCGA datasets and demonstrated the power of integrating multi-omics data and biological interaction networks using our proposed method for predicting target clinical variables.

Conclusions: To alleviate the overfitting problem in deep learning on multi-omics data with the “big p, small n” problem, it is helpful to incorporate biological domain knowledge into the model as inductive biases. It is very promising to design machine learning models that facilitate the seamless integration of large-scale multi-omics data and biomedical domain knowledge for uncovering intricate relationships among molecular features and clinical features.
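As a rough illustration of how an interaction network can enter a training objective as a regularization term, the sketch below adds a graph-Laplacian smoothness penalty to a reconstruction error. The abstract only states that constraints are "specified as regularization terms", so the Laplacian form, the function names, and the `weight` parameter are assumptions rather than the authors' exact formulation.

```python
def laplacian_penalty(embeddings, edges):
    """Sum of squared distances between embeddings of connected features.

    For an unweighted, undirected graph with each edge listed once, this
    equals trace(Z^T L Z), where L is the graph Laplacian.
    """
    total = 0.0
    for i, j in edges:
        total += sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
    return total


def regularized_loss(reconstruction_error, embeddings, edges, weight=0.1):
    """Reconstruction error plus the weighted network penalty."""
    return reconstruction_error + weight * laplacian_penalty(embeddings, edges)


# Hypothetical example: three genes with 2-D embeddings; genes 0 and 1 interact.
gene_embeddings = [[1.0, 0.0], [0.8, 0.1], [-1.0, 2.0]]
interaction_edges = [(0, 1)]
loss = regularized_loss(0.5, gene_embeddings, interaction_edges, weight=0.1)
```

The penalty grows when interacting genes are embedded far apart, which nudges the model toward representations that agree with the network.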


Music Feature Extraction and Classification Algorithm Based on Deep Learning

Scientific Programming ◽

10.1155/2021/1651560 ◽

2021 ◽

Vol 2021 ◽

pp. 1-9

Author(s):

Jingwen Zhang

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Large Scale ◽

Rapid Development ◽

Audio Signal ◽

Classification Model ◽

Machine Learning Classification ◽

Gating Mechanism ◽

Music Information ◽

Sound Spectrum

With the rapid development of information and communication technology, the volume of digital music has grown explosively. Because users need to retrieve the music they want quickly and accurately from very large repositories, music feature extraction and classification have become an important part of music information retrieval and a research hotspot in recent years. Traditional music classification approaches use a large number of hand-designed acoustic features, whose design requires in-depth knowledge of the music domain, and features built for one classification task often do not transfer to others. Existing approaches have two shortcomings: manual feature extraction makes it hard to ensure the validity and accuracy of the features, and traditional machine learning classifiers perform poorly on multiclass problems and cannot be trained on large-scale data. Therefore, this paper converts the audio signal of music into a sound spectrum as a unified representation, avoiding the problem of manual feature selection. Based on the characteristics of the sound spectrum, the paper combines 1D convolution, a gating mechanism, residual connections, and an attention mechanism into a convolutional neural network model for music feature extraction and classification, which can extract sound spectrum characteristics that are more relevant to the music category. Finally, the paper reports comparison and ablation experiments. The experimental results show that this approach outperforms traditional models built on manual features and machine learning-based approaches.
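To make the "audio signal to sound spectrum" step concrete, here is a minimal sketch of a magnitude spectrogram built with a naive discrete Fourier transform. The frame size, hop length, and Hann window are assumptions chosen for illustration; the abstract does not specify how the paper computes its spectrum.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 for one frame (naive O(N^2) DFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        spectrum.append(abs(coeff))
    return spectrum


def spectrogram(signal, frame_size=64, hop=32):
    """Slide a Hann window over the signal and collect per-frame spectra."""
    window = [
        0.5 - 0.5 * math.cos(2 * math.pi * t / frame_size)
        for t in range(frame_size)
    ]
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(frame_size)]
        frames.append(magnitude_spectrum(frame))
    return frames


# Hypothetical input: a pure tone that completes 8 cycles per 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

Each row of `spec` is one time frame and each column one frequency bin, which is the kind of 2-D input a 1D convolution along the time axis could consume. A production pipeline would use an FFT library instead of the naive transform.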


Machine Learning Assisted Prediction of Prognostic Biomarkers Associated With COVID-19, Using Clinical and Proteomics Data

Frontiers in Genetics ◽

10.3389/fgene.2021.636441 ◽

2021 ◽

Vol 12 ◽

Author(s):

Rahila Sardar ◽

Arun Sharma ◽

Dinesh Gupta

Keyword(s):

Machine Learning ◽

Sensitivity And Specificity ◽

Protein Profile ◽

Clinical Information ◽

Classification Model ◽

Clinical Parameters ◽

Proteomics Data ◽

Model Based ◽

Diagnosis And Prognosis ◽

Maximum Accuracy

With the availability of COVID-19-related clinical data, healthcare researchers can now explore the potential of computational technologies such as artificial intelligence (AI) and machine learning (ML) to discover biomarkers for accurate detection, early diagnosis, and prognosis for the management of COVID-19. However, the identification of biomarkers associated with survival and deaths remains a major challenge for early prognosis. In the present study, we have developed and evaluated AI-based prediction algorithms for predicting a COVID-19 patient’s survival or death based on a publicly available dataset consisting of clinical parameters and protein profile data of hospital-admitted COVID-19 patients. The best classification model based on clinical parameters achieved a maximum accuracy of 89.47% for predicting survival or death of COVID-19 patients, with a sensitivity and specificity of 85.71 and 92.45%, respectively. The classification model based on normalized protein expression values of 45 proteins achieved a maximum accuracy of 89.01% for predicting survival or death, with a sensitivity and specificity of 92.68 and 86%, respectively. Interestingly, we identified 9 clinical and 45 protein-based putative biomarkers associated with the survival/death of COVID-19 patients. Based on our findings, a few clinical features and proteins correlate significantly with the literature and reaffirm their role in COVID-19 disease progression at the molecular level. The machine learning-based models developed in the present study have the potential to predict the survival chances of COVID-19-positive patients in the early stages of the disease or at the time of hospitalization. However, this has to be verified on a larger cohort of patients before it can be put into clinical practice. We have also developed a webserver, CovidPrognosis, where clinical information can be uploaded to predict the survival chances of a COVID-19 patient. The webserver is available at http://14.139.62.220/covidprognosis/.
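For readers unfamiliar with the reported metrics, the sketch below shows how accuracy, sensitivity, and specificity follow from a binary confusion matrix. The counts are made up for illustration and are not the study's data; which outcome counts as the "positive" class is also an assumption.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity (recall on positives), and specificity."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts, not taken from the paper.
metrics = classification_metrics(tp=40, fn=10, tn=45, fp=5)
```

Here `metrics` holds an accuracy of 0.85, a sensitivity of 0.8, and a specificity of 0.9. Reporting all three matters when classes are imbalanced, because accuracy alone can hide poor performance on the minority class.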


Understanding Software-2.0

ACM Transactions on Software Engineering and Methodology ◽

10.1145/3453478 ◽

2021 ◽

Vol 30 (4) ◽

pp. 1-42

Author(s):

Malinda Dilhara ◽

Ameya Ketkar ◽

Danny Dig

Keyword(s):

Machine Learning ◽

Longitudinal Study ◽

Empirical Study ◽

Large Scale ◽

Research Directions ◽

Common Practices ◽

Increasing Trend ◽

Usage Patterns ◽

New Research ◽

Shed Light

Enabled by a rich ecosystem of Machine Learning (ML) libraries, programming using learned models, i.e., Software-2.0, has gained substantial adoption. However, we do not know what challenges developers encounter when they use ML libraries. With this knowledge gap, researchers miss opportunities to contribute to new research directions, tool builders do not invest resources where automation is most needed, library designers cannot make informed decisions when releasing ML library versions, and developers fail to use common practices when using ML libraries. We present the first large-scale quantitative and qualitative empirical study to shed light on how developers in Software-2.0 use ML libraries, and how this evolution affects their code. In particular, using static analysis we perform a longitudinal study of 3,340 top-rated open-source projects with 46,110 contributors. To further understand the challenges of ML library evolution, we survey 109 developers who introduce and evolve ML libraries. Using this rich dataset we reveal several novel findings. Among others, we found an increasing trend of using ML libraries: the ratio of new Python projects that use ML libraries increased from 2% in 2013 to 50% in 2018. We identify several usage patterns, including the following: (i) 36% of the projects use multiple ML libraries to implement various stages of the ML workflows, (ii) developers update ML libraries more often than traditional libraries, (iii) strict upgrades are the most popular kind of update for ML libraries, (iv) ML library updates often result in cascading library updates, and (v) ML libraries are often downgraded (22.04% of cases). We also observed unique challenges when evolving and maintaining Software-2.0, such as (i) binary incompatibility of trained ML models and (ii) benchmarking ML models. Finally, we present actionable implications of our findings for researchers, tool builders, developers, educators, library vendors, and hardware vendors.
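The idea of classifying library updates as upgrades or downgrades can be sketched as follows. This is only an illustration and not the study's static analysis: it assumes purely numeric dotted version strings, and the `history` list of version pairs is hypothetical.

```python
def parse_version(text, width=3):
    """Turn '1.15' into (1, 15, 0); assumes numeric dotted versions only."""
    parts = [int(part) for part in text.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def classify_update(old, new):
    """Label a dependency change by comparing parsed versions."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v > old_v:
        return "upgrade"
    if new_v < old_v:
        return "downgrade"
    return "unchanged"


# Hypothetical (old, new) version pairs from a project's dependency history.
history = [
    ("1.14.0", "1.15.0"),
    ("2.0.0", "1.15.2"),
    ("0.22", "0.23.1"),
    ("1.4.0", "1.4.0"),
]
kinds = [classify_update(old, new) for old, new in history]
downgrade_share = kinds.count("downgrade") / len(kinds)
```

Real requirement files also contain ranges, pre-release tags, and pins, so a practical analysis would need a full version-specifier parser rather than this tuple comparison.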


Accuracy of Machine Learning Algorithms for the Classification of Molecular Features of Gliomas on MRI: A Systematic Literature Review and Meta-Analysis

Cancers ◽

10.3390/cancers13112606 ◽

2021 ◽

Vol 13 (11) ◽

pp. 2606

Author(s):

Evi J. van Kempen ◽

Max Post ◽

Manoj Mannil ◽

Benno Kusters ◽

Mark ter Laan ◽

...

Keyword(s):

Machine Learning ◽

Large Scale ◽

Meta Analysis ◽

Learning Algorithms ◽

External Validation ◽

Machine Learning Algorithms ◽

Molecular Characteristics ◽

Aggregated Data ◽

Molecular Features

Treatment planning and prognosis in glioma treatment are based on the classification into low- and high-grade oligodendroglioma or astrocytoma, which is mainly based on molecular characteristics (IDH1/2- and 1p/19q codeletion status). It would be of great value if this classification could be made reliably before surgery, without biopsy. Machine learning algorithms (MLAs) could play a role in achieving this by enabling glioma characterization on magnetic resonance imaging (MRI) data without invasive tissue sampling. The aim of this study is to provide a performance evaluation and meta-analysis of various MLAs for glioma characterization. A systematic literature search and a meta-analysis were performed on the aggregated data, after which subgroup analyses for several target conditions were conducted. This study is registered with PROSPERO, CRD42020191033. We identified 724 studies; 60 and 17 studies were eligible to be included in the systematic review and meta-analysis, respectively. Meta-analysis showed excellent accuracy for all subgroups, with the classification of 1p/19q codeletion status scoring significantly poorer than other subgroups (AUC: 0.748, p = 0.132). There was considerable heterogeneity among some of the included studies. Although promising results were found with regard to the ability of MLA tools to be used for the non-invasive classification of gliomas, large-scale, prospective trials with external validation are warranted in the future.


A Scalable Machine Learning Pipeline for Paddy Rice Classification Using Multi-Temporal Sentinel Data

Remote Sensing ◽

10.3390/rs13091769 ◽

2021 ◽

Vol 13 (9) ◽

pp. 1769

Author(s):

Vasileios Sitokonstantinou ◽

Alkiviadis Koukos ◽

Thanassis Drivas ◽

Charalampos Kontoes ◽

Ioannis Papoutsis ◽

...

Keyword(s):

Machine Learning ◽

Satellite Data ◽

High Performance ◽

Large Scale ◽

Paddy Rice ◽

Machine Learning Algorithms ◽

Classification Model ◽

Supervised Machine Learning ◽

Rice Area ◽

Multi Temporal

The demand for rice production in Asia is expected to increase by 70% in the next 30 years, which makes evident the need for balanced productivity and effective food security management at the national and continental levels. Consequently, the timely and accurate mapping of paddy rice extent and the assessment of its productivity are of utmost significance. In turn, this requires continuous area monitoring and large-scale mapping, at the parcel level, through the processing of big satellite data of high spatial resolution. This work designs and implements a paddy rice mapping pipeline in South Korea that is based on a time series of Sentinel-1 and Sentinel-2 data for 2018. We address two challenges: the first is the ability of our model to manage big satellite data and scale to a nationwide application; the second is the algorithm’s capacity to cope with scarce labeled data when training supervised machine learning algorithms. Specifically, we implement an approach that combines unsupervised and supervised learning. First, we generate pseudo-labels for rice classification from a single site (Seosan-Dangjin) by using a dynamic k-means clustering approach. The pseudo-labels are then used to train a Random Forest (RF) classifier that is fine-tuned to generalize to two other sites (Haenam and Cheorwon). The optimized model was then tested against 40 labeled plots, evenly distributed across the country. The paddy rice mapping pipeline is scalable, as it has been deployed in a High Performance Data Analytics (HPDA) environment using distributed implementations of both k-means and the RF classifier. When tested across the country, our model provided an overall accuracy of 96.69% and a kappa coefficient of 0.87. Moreover, the accurate paddy rice area mapping was returned early in the year (late July), which is key for timely decision-making. Finally, the performance of the generalized paddy rice classification model, when applied in the sites of Haenam and Cheorwon, was compared to the performance of two equivalent models that were trained with locally sampled labels. The results were comparable and highlighted the success of the model’s generalization and its applicability to other regions.
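The two headline metrics, overall accuracy and the kappa coefficient, can both be derived from a confusion matrix. The sketch below computes Cohen's kappa for a hypothetical two-class (rice versus non-rice) matrix; the counts are invented for illustration and do not reproduce the paper's evaluation.

```python
def overall_accuracy_and_kappa(matrix):
    """Return (overall accuracy, Cohen's kappa) for a square confusion matrix.

    Rows are reference classes and columns are predicted classes.
    """
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / total ** 2
    return observed, (observed - expected) / (1 - expected)


# Hypothetical counts: rows = true (rice, non-rice), columns = predicted.
confusion = [[45, 5], [3, 47]]
accuracy, kappa = overall_accuracy_and_kappa(confusion)
```

For this matrix the accuracy is 0.92 and kappa is 0.84. Kappa is lower than accuracy because it discounts the agreement that would occur by chance, which is why the two are usually reported together in land-cover mapping.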


Large-Scale Data Learning Method for Anomaly Detection using Machine Learning for Monitoring Vibration in Vehicle Equipment

IEEJ Transactions on Industry Applications ◽

10.1541/ieejias.140.480 ◽

2020 ◽

Vol 140 (6) ◽

pp. 480-487

Author(s):

Minoru Kondo

Keyword(s):

Machine Learning ◽

Anomaly Detection ◽

Large Scale ◽

Learning Method ◽

Large Scale Data ◽

Scale Data


Coded Computing: Mitigating Fundamental Bottlenecks in Large-Scale Distributed Computing and Machine Learning

10.1561/9781680837056 ◽

2020 ◽

Author(s):

Songze Li ◽

Salman Avestimehr

Keyword(s):

Machine Learning ◽

Distributed Computing ◽

Large Scale


Evolution of Metastable Structures in Bimetallic Catalysts from Microscopy and Machine-Learning Molecular Dynamics

10.26434/chemrxiv.11811660.v1 ◽

2020 ◽

Author(s):

Jin Soo Lim ◽

Jonathan Vandermause ◽

Matthijs A. van Spronsen ◽

Albert Musaelian ◽

Christopher R. O’Connor ◽

...

Keyword(s):

Machine Learning ◽

Molecular Dynamics ◽

Large Scale ◽

Materials Science ◽

Complete Characterization ◽

Layer By Layer ◽

Surface Restructuring ◽

Metastable Structures ◽

Mechanistic Investigation ◽

Underlying Mechanisms

Interface restructuring plays a crucial role in materials science and heterogeneous catalysis. Bimetallic systems, in particular, often adopt very different composition and morphology at surfaces compared to the bulk. For the first time, we reveal a detailed atomistic picture of the long-timescale restructuring of Pd deposited on Ag, using microscopy, spectroscopy, and novel simulation methods. Encapsulation of Pd by Ag always precedes layer-by-layer dissolution of Pd, resulting in significant Ag migration out of the surface and extensive vacancy pits. These metastable structures are of vital catalytic importance, as Ag-encapsulated Pd remains much more accessible to reactants than bulk-dissolved Pd. The underlying mechanisms are uncovered by performing fast and large-scale machine-learning molecular dynamics, followed by our newly developed method for complete characterization of atomic surface restructuring events. Our approach is broadly applicable to other multimetallic systems of interest and enables the previously impractical mechanistic investigation of restructuring dynamics.
