Confederated learning in healthcare: training machine learning models using disconnected data separated by individual, data type and identity for Large-Scale Health System Intelligence (Preprint)

10.2196/24951 ◽  
2020 ◽  
Author(s):  
Dianbo Liu ◽  
Kathe Fox ◽  
Griffin Weber ◽  
Tim Miller
2021 ◽  
Author(s):  
Norberto Sánchez-Cruz ◽  
Jose L. Medina-Franco

Epigenetic targets are a significant focus for drug discovery research, as demonstrated by the eight approved epigenetic drugs for the treatment of cancer and the increasing availability of chemogenomic data related to epigenetics. These data represent a large number of structure–activity relationships that have not yet been exploited for the development of predictive models to support medicinal chemistry efforts. Herein, we report the first large-scale study of 26,318 compounds with a quantitative measure of biological activity for 55 protein targets with epigenetic activity. Through a systematic comparison of machine learning models trained on molecular fingerprints of different designs, we built predictive models with high accuracy for the epigenetic target profiling of small molecules. The models were thoroughly validated, showing mean precisions of up to 0.952 on the epigenetic target prediction task. Our results indicate that the models reported herein have considerable potential to identify small molecules with epigenetic activity. We therefore implemented them as a freely accessible and easy-to-use web application.
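The abstract does not specify the fingerprint or model family, so the following is only a minimal sketch of the kind of fingerprint-based, per-target activity classifier it describes, assuming Morgan (ECFP-like) fingerprints and a random forest; the SMILES strings and labels are placeholders, not the study's data.

```python
# Minimal sketch (not the authors' pipeline): one activity classifier
# trained on Morgan fingerprints; profiling would run one such model
# per epigenetic target (55 in the study).
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def morgan_fp(smiles, radius=2, n_bits=2048):
    """Encode a molecule as a Morgan bit-vector fingerprint."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp)

# Hypothetical training data: SMILES and 1/0 activity labels against
# a single epigenetic target.
smiles = ["CCO", "c1ccccc1", "CC(=O)Nc1ccc(O)cc1"]
labels = [0, 1, 0]

X = np.stack([morgan_fp(s) for s in smiles])
model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(X, labels)

# Predicted probability of activity for each compound against this target.
print(model.predict_proba(X)[:, 1])
```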


Author(s):  
Mark Endrei ◽  
Chao Jin ◽  
Minh Ngoc Dinh ◽  
David Abramson ◽  
Heidi Poxon ◽  
...  

Rising power costs and constraints are driving a growing focus on the energy efficiency of high performance computing systems. The unique characteristics of a particular system and workload, and their effect on performance and energy efficiency, are typically difficult for application users to assess and to control. Settings for optimum performance and energy efficiency can also diverge, so we need to identify trade-off options that guide a suitable balance between energy use and performance. We present statistical and machine learning models that require only a small number of runs to make accurate Pareto-optimal trade-off predictions using parameters that users can control. We study model training and validation using several parallel kernels and more complex workloads, including Algebraic Multigrid (AMG), the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS), and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH). We demonstrate that we can train the models using as few as 12 runs, with a prediction error of less than 10%. Our AMG results identify trade-off options that provide up to a 45% improvement in energy efficiency for around a 10% performance loss. We reduce the sample measurement time required for AMG by 90%, from 13 h to 74 min.
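As an illustration of the trade-off identification step (not the paper's code), the sketch below filters model-predicted (runtime, energy) pairs for candidate configurations down to the Pareto-optimal options, i.e. those not dominated in both dimensions by another configuration; the prediction values are hypothetical.

```python
# Illustrative sketch: keep only Pareto-optimal (runtime, energy) trade-offs.
def pareto_front(points):
    """points: list of (runtime, energy) tuples; returns non-dominated ones."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p
            for j, q in enumerate(points) if j != i
        )
        if not dominated:
            front.append(p)
    return front

# Hypothetical model predictions for four configurations (seconds, joules).
preds = [(10.0, 500.0), (12.0, 420.0), (11.0, 600.0), (15.0, 400.0)]
print(pareto_front(preds))  # -> [(10.0, 500.0), (12.0, 420.0), (15.0, 400.0)]
```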


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Prasanna Date ◽  
Davis Arthur ◽  
Lauren Pusey-Nazzaro

Abstract Training machine learning models on classical computers is usually a time- and compute-intensive process. With Moore's law nearing its inevitable end and an ever-increasing demand for large-scale data analysis using machine learning, we must leverage non-conventional computing paradigms like quantum computing to train machine learning models efficiently. Adiabatic quantum computers can approximately solve NP-hard problems, such as quadratic unconstrained binary optimization (QUBO), faster than classical computers. Since many machine learning problems are also NP-hard, we believe adiabatic quantum computers might be instrumental in training machine learning models efficiently in the post-Moore's-law era. To be solved on adiabatic quantum computers, problems must first be formulated as QUBO problems, which is itself challenging. In this paper, we formulate the training problems of three machine learning models—linear regression, support vector machine (SVM) and balanced k-means clustering—as QUBO problems, making them amenable to training on adiabatic quantum computers. We also analyze the computational complexities of our formulations and compare them to corresponding state-of-the-art classical approaches. We show that the time and space complexities of our formulations are better than (for SVM and balanced k-means clustering) or equivalent to (for linear regression) those of their classical counterparts.
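To make the QUBO idea concrete, here is a hedged sketch of the general recipe for linear regression: encode each real-valued weight as a fixed weighted sum of binary variables, so that minimizing the squared error becomes a QUBO (minimize b^T Q b over binary b). The precision vector and toy data are illustrative choices, not the paper's exact parameters, and the brute-force minimization stands in for the quantum annealer.

```python
# Sketch: linear regression training recast as a QUBO problem.
import numpy as np

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0]])  # toy design matrix
y = np.array([1.0, 2.0, 4.0])                        # toy targets

# Each of the d weights is encoded with the same (assumed) precision vector,
# e.g. w_j = -2*b0 + 1*b1 + 0.5*b2, covering a small signed range.
prec = np.array([-2.0, 1.0, 0.5])
d, k = X.shape[1], len(prec)
P = np.kron(np.eye(d), prec.reshape(1, -1))  # (d, d*k) binary-encoding matrix

# ||X P b - y||^2 = b^T (P^T X^T X P) b - 2 (P^T X^T y)^T b + const.
# Since b_i^2 = b_i for binary b, the linear term folds into the diagonal.
Q = P.T @ X.T @ X @ P
Q[np.diag_indices_from(Q)] -= 2.0 * (P.T @ X.T @ y)

def to_bits(i, n):
    return np.array([(i >> j) & 1 for j in range(n)], dtype=float)

# Brute-force the 2^(d*k) = 64 binary vectors to verify the encoding;
# an adiabatic quantum computer would minimize b^T Q b instead.
energies = [to_bits(i, d * k) @ Q @ to_bits(i, d * k) for i in range(2 ** (d * k))]
b = to_bits(int(np.argmin(energies)), d * k)
print("recovered weights:", P @ b)
```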


2019 ◽  
Author(s):  
Mojtaba Haghighatlari ◽  
Gaurav Vishwakarma ◽  
Mohammad Atif Faiz Afzal ◽  
Johannes Hachmann

We present a multitask, physics-infused deep learning model to accurately and efficiently predict refractive indices (RIs) of organic molecules, and we apply it to a library of 1.5 million compounds. We show that it outperforms earlier machine learning models by a significant margin, and that incorporating known physics into data-derived models provides valuable guardrails. Using a transfer learning approach, we augment the model to reproduce results consistent with higher-level computational chemistry training data, but with a considerably reduced number of corresponding calculations. Prediction errors of machine learning models are typically smallest for commonly observed target property values, consistent with the distribution of the training data. However, since our goal is to identify candidates with unusually large RI values, we propose a strategy to boost the performance of our model in the sparsely sampled tails of the RI distribution: we bias the model with respect to the under-represented classes of molecules that have values in the high-RI regime. Adopting a metric popular in web search engines, we evaluate our effectiveness in ranking top candidates. We confirm that the models developed in this study can reliably predict the RIs of the top 1,000 compounds and are thus able to capture their ranking. We believe that this is the first study to develop a data-derived model that ensures the reliability of RI predictions through model augmentation in the extrapolation region at such a large scale. These results underscore the tremendous potential of machine learning to facilitate molecular (hyper)screening approaches on a massive scale and to accelerate the discovery of new compounds and materials, such as high-RI organic molecules for applications in optoelectronics.
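The abstract cites "a metric popular in web search engines" without naming it; normalized discounted cumulative gain (NDCG) is the standard such ranking metric, so the sketch below computes NDCG@k under that assumption, with a hypothetical ranking where 1 marks a true high-RI compound.

```python
# Sketch of NDCG@k, assuming this is the search-engine ranking metric meant.
import numpy as np

def ndcg_at_k(relevances, k):
    """relevances: true gains ordered by the model's predicted ranking."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))  # positions 1..k
    dcg = np.sum(rel * discounts)
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = np.sum(ideal * discounts[: ideal.size])
    return dcg / idcg if idcg > 0 else 0.0

# The model places most true high-RI compounds near the top, so NDCG@5
# is high (about 0.8 here); a perfect ranking would score 1.0.
print(ndcg_at_k([1, 1, 0, 1, 0, 0, 1], k=5))
```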


2020 ◽  
Vol 375 (1810) ◽  
pp. 20190510 ◽  
Author(s):  
Damien Beillouin ◽  
Bernhard Schauberger ◽  
Ana Bastos ◽  
Philippe Ciais ◽  
David Makowski

Extreme weather increases the risk of large-scale crop failure. The mechanisms involved are complex and intertwined, which hinders the identification of simple adaptation levers to help improve the resilience of agricultural production. Based on more than 82 000 yield records reported at the regional level in 17 European countries, we assess how climate affected the yields of nine crop species. Using machine learning models, we analyzed historical yield data since 1901 and then focused on 2018, a year that experienced a multiplicity and a diversity of atypical extreme climatic conditions. Machine learning models explain up to 65% of historical yield anomalies. We find that extremes in both temperature and precipitation are associated with negative yield anomalies, but with varying impacts in different parts of Europe. In 2018, Northern and Eastern Europe experienced multiple and simultaneous crop failures—among the highest observed in recent decades. These yield losses were associated with extremely low rainfall in combination with high temperatures between March and August 2018. However, the higher than usual yields recorded in Southern Europe—caused by favourable spring rainfall conditions—nearly offset the large decrease in Northern European crop production. Our results outline the importance of considering single and compound climate extremes when analyzing the causes of yield losses in Europe. We found no clear upward or downward trend in the frequency of extreme yield losses for any of the considered crops between 1990 and 2018. This article is part of the theme issue 'Impacts of the 2018 severe drought and heatwave in Europe: from site to continental scale'.


2020 ◽  
Vol 34 (7) ◽  
pp. 717-730 ◽  
Author(s):  
Matthew C. Robinson ◽  
Robert C. Glen ◽  
Alpha A. Lee

Abstract Machine learning methods have the potential to significantly accelerate drug discovery. However, the increasing rate at which new methodological approaches are published raises the fundamental question of how models should be benchmarked and validated. We reanalyze the data generated by a recently published large-scale comparison of machine learning models for bioactivity prediction and arrive at a somewhat different conclusion. We show that the performance of support vector machines is competitive with that of deep learning methods. Additionally, using a series of numerical experiments, we question the relevance of the area under the receiver operating characteristic (ROC) curve as a metric in virtual screening, and we suggest that the area under the precision–recall curve should be used in conjunction with the ROC curve. Our numerical experiments also highlight challenges in estimating the uncertainty in model performance via scaffold-split nested cross-validation.
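The following small numerical illustration is in the spirit of the paper's argument rather than its actual experiments: with the heavy class imbalance typical of virtual screening, ROC AUC can look excellent while the precision–recall picture remains poor. The score distributions are hypothetical.

```python
# Sketch: ROC AUC vs. precision-recall AUC on an imbalanced screening-like set.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n_inactive, n_active = 10_000, 50          # ~0.5% actives, typical of screening
y = np.concatenate([np.zeros(n_inactive), np.ones(n_active)])

# Hypothetical model scores: actives score higher on average, with overlap.
scores = np.concatenate([rng.normal(0.0, 1.0, n_inactive),
                         rng.normal(1.5, 1.0, n_active)])

print("ROC AUC:", roc_auc_score(y, scores))            # ~0.85, looks strong
print("PR AUC :", average_precision_score(y, scores))  # far lower, dominated
                                                       # by the 0.5% prevalence
```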


2020 ◽  
Author(s):  
Dianbo Liu ◽  
Kathe Fox ◽  
Griffin Weber ◽  
Tim Miller

BACKGROUND A patient's health information is generally fragmented across silos because it follows how care is delivered: multiple providers in multiple settings. Although it is technically feasible to reunite data for analysis in a manner that could underpin a rapid learning healthcare system, privacy concerns and regulatory barriers limit data centralization for this purpose.

OBJECTIVE Machine learning can be conducted in a federated manner on patient datasets that share the same set of variables but are stored separately. Federated learning cannot, however, handle the situation in which different data types for a given patient are separated vertically across different organizations, or in which patient ID matching across institutions is difficult. We call methods that enable machine learning model training on data separated along two or more dimensions "confederated machine learning." We propose and evaluate confederated learning for training machine learning models that stratify the risk of several diseases across silos when data are horizontally separated by individual, vertically separated by data type, and separated by identity without patient ID matching.

METHODS The confederated learning method can be intuitively understood as a distributed learning method with representation learning, generative modeling, imputation and data augmentation elements. It consists of three steps. Step 1: conditional generative adversarial networks with matching loss (cGANs) were trained using data from the central analyzer to infer one data type from another, for example, inferring medications from diagnoses. Generative (cGAN) models were used because a considerable percentage of individuals do not have paired data types; for instance, a patient may have diagnoses in the database but no medication information because of insurance enrollment. A cGAN can exploit data with paired information by minimizing the matching loss, and data without paired information by minimizing the adversarial loss. Step 2: missing data types in each silo were inferred using the models trained in step 1. Step 3: task-specific models, such as a model to predict a diagnosis of diabetes, were trained in a federated manner across all silos simultaneously (a minimal sketch of this step follows the abstract).

RESULTS We conducted experiments to train disease prediction models using confederated learning on a large nationwide health insurance dataset from the U.S. that was split into 99 silos. The models stratify individuals by their risk of diabetes, psychological disorders or ischemic heart disease in the next two years, using patients' diagnoses, medication claims and clinical lab test records (see the Methods section for details). The goal of these experiments was to test whether a confederated learning approach can simultaneously address the types of separation described above.

CONCLUSIONS We demonstrated that health data distributed across silos separated by individual and by data type can be used to train machine learning models without moving or aggregating the data. Our method achieves predictive accuracy competitive with a centralized upper bound in predicting the risks of diabetes, psychological disorders or ischemic heart disease using previous diagnoses, medications and lab tests as inputs. We compared the performance of the confederated learning approach with models trained on centralized data, on only the data held by the central analyzer, or on a single data type across silos. The experimental results suggest that confederated learning trains predictive models efficiently across disconnected silos.

CLINICALTRIAL NA
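As referenced in step 3 above, here is a minimal sketch of federated training across silos, assuming a FedAvg-style scheme; the paper's exact aggregation protocol, model class and hyperparameters are not specified here. A logistic regression stands in for the disease-risk model, and each silo's (X, y) stays local: only model weights travel to the central analyzer.

```python
# Sketch of step 3: federated (FedAvg-style) training across 99 silos.
import numpy as np

def local_update(w, X, y, lr=0.1, epochs=5):
    """A silo takes a few gradient steps on its own data, returns new weights."""
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))       # predicted disease risk
        w = w - lr * X.T @ (p - y) / len(y)    # logistic-loss gradient step
    return w

# Hypothetical silo data standing in for per-silo patient features/labels.
rng = np.random.default_rng(0)
silos = [(rng.normal(size=(200, 10)), rng.integers(0, 2, 200))
         for _ in range(99)]                   # 99 silos, as in the study

w = np.zeros(10)
for round_ in range(20):                       # communication rounds
    # Each silo trains locally; the central analyzer averages the weights.
    # The raw patient records never leave their silo.
    w = np.mean([local_update(w, X, y) for X, y in silos], axis=0)

print("aggregated model weights (first 3):", w[:3])
```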

