A machine learning framework for the prediction of chromatin folding in Drosophila using epigenetic features

2020 ◽  
Vol 6 ◽  
pp. e307
Author(s):  
Michal B. Rozenwald ◽  
Aleksandra A. Galitsyna ◽  
Grigory V. Sapunov ◽  
Ekaterina E. Khrameeva ◽  
Mikhail S. Gelfand

Technological advances have led to the creation of large epigenetic datasets, including information about DNA binding proteins and DNA spatial structure. Hi-C experiments have revealed that chromosomes are subdivided into sets of self-interacting domains called Topologically Associating Domains (TADs). TADs are involved in the regulation of gene expression activity, but the mechanisms of their formation are not yet fully understood. Here, we focus on machine learning methods to characterize DNA folding patterns in Drosophila based on chromatin marks across three cell lines. We present linear regression models with four types of regularization, gradient boosting, and recurrent neural networks (RNN) as tools to study chromatin folding characteristics associated with TADs given epigenetic chromatin immunoprecipitation data. The bidirectional long short-term memory RNN architecture produced the best prediction scores and identified biologically relevant features. The distributions of the protein Chriz (Chromator) and the histone modification H3K4me3 were selected as the most informative features for the prediction of TAD characteristics. This approach may be adapted to any similar biological dataset of chromatin features across various cell lines and species. The code for the implemented pipeline, Hi-ChIP-ML, is publicly available: https://github.com/MichalRozenwald/Hi-ChIP-ML

Water ◽  
2020 ◽  
Vol 12 (10) ◽  
pp. 2927
Author(s):  
Jiyeong Hong ◽  
Seoro Lee ◽  
Joo Hyun Bae ◽  
Jimin Lee ◽  
Woon Ji Park ◽  
...  

Predicting dam inflow is necessary for effective water management. This study developed machine learning models to predict the inflow into the Soyang River Dam in South Korea, using 40 years of weather and dam inflow data. Six algorithms were used: decision tree (DT), multilayer perceptron (MLP), random forest (RF), gradient boosting (GB), recurrent neural network–long short-term memory (RNN–LSTM), and convolutional neural network–LSTM (CNN–LSTM). Among these models, the multilayer perceptron showed the best results in predicting dam inflow, with a Nash–Sutcliffe efficiency (NSE) of 0.812, root mean squared error (RMSE) of 77.218 m³/s, mean absolute error (MAE) of 29.034 m³/s, correlation coefficient (R) of 0.924, and coefficient of determination (R²) of 0.817. However, when the dam inflow was below 100 m³/s, the ensemble models (random forest and gradient boosting) performed better than the MLP. Therefore, two combined machine learning (CombML) models (RF_MLP and GB_MLP) were developed, using the ensemble methods (RF and GB) at precipitation below 16 mm and the MLP at precipitation above 16 mm. The precipitation of 16 mm is the average daily precipitation at the inflow of 100 m³/s or more. Accuracy verification gave NSE 0.857, RMSE 68.417 m³/s, MAE 18.063 m³/s, R 0.927, and R² 0.859 for RF_MLP, and NSE 0.829, RMSE 73.918 m³/s, MAE 18.093 m³/s, R 0.912, and R² 0.831 for GB_MLP, which indicates that the combined models predict the dam inflow most accurately. The CombML results show that inflow can be predicted by combining several machine learning algorithms while accounting for flow characteristics such as flow regimes.
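The evaluation metrics and the threshold-based combination described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the callable-model interface, and the choice to send a precipitation of exactly 16 mm to the MLP are assumptions made for illustration (the abstract only says "below" and "above" 16 mm). NSE, RMSE, and MAE follow their standard textbook definitions.

```python
import math
from typing import Any, Callable, Sequence


def nse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency: 1 minus squared error over variance of observations."""
    mean_obs = sum(observed) / len(observed)
    squared_error = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - squared_error / spread


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error, in the units of the data (e.g. m^3/s)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error, in the units of the data."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


def combined_predict(
    precipitation_mm: float,
    features: Any,
    ensemble_model: Callable[[Any], Any],  # hypothetical: a fitted RF or GB predictor
    mlp_model: Callable[[Any], Any],       # hypothetical: a fitted MLP predictor
    threshold_mm: float = 16.0,            # threshold reported in the abstract
) -> Any:
    """Route a sample to the ensemble model on low-precipitation days, else to the MLP.

    Assumption: a value exactly at the threshold is sent to the MLP.
    """
    model = ensemble_model if precipitation_mm < threshold_mm else mlp_model
    return model(features)
```

An NSE of 1 means a perfect fit, and an NSE of 0 means the model is no better than always predicting the observed mean, which is why the metric is a common complement to RMSE and MAE in hydrology.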


2020 ◽  
Vol 12 (6) ◽  
pp. 914 ◽  
Author(s):  
Mahdieh Danesh Yazdi ◽  
Zheng Kuang ◽  
Konstantina Dimakopoulou ◽  
Benjamin Barratt ◽  
Esra Suel ◽  
...  

Estimating air pollution exposure has long been a challenge for environmental health researchers. Technological advances and novel machine learning methods have allowed us to increase the geographic range and accuracy of exposure models, making them a valuable tool in conducting health studies and identifying hotspots of pollution. Here, we have created a prediction model for daily PM2.5 levels in the Greater London area from 1st January 2005 to 31st December 2013 using an ensemble machine learning approach incorporating satellite aerosol optical depth (AOD), land use, and meteorological data. The predictions were made on a 1 km × 1 km scale over 3960 grid cells. The ensemble included predictions from three different machine learners: a random forest (RF), a gradient boosting machine (GBM), and a k-nearest neighbor (KNN) approach. Our ensemble model performed very well, with a ten-fold cross-validated R² of 0.828. Of the three machine learners, the random forest outperformed the GBM and KNN. Our model was particularly adept at predicting day-to-day changes in PM2.5 levels, with an out-of-sample temporal R² of 0.882. However, its ability to predict spatial variability was weaker, with an R² of 0.396. We believe this to be due to the smaller spatial variation in pollutant levels in this area.


2021 ◽  
Vol 11 (21) ◽  
pp. 10139
Author(s):  
Fernando J. Aguilar ◽  
Abderrahim Nemmaoui ◽  
Manuel A. Aguilar ◽  
Alberto Peñalver

Most of the allometric models used to estimate tree aboveground biomass rely on tree diameter at breast height (DBH). However, it is difficult to measure DBH from airborne remote sensors, and it is common to draw upon traditional least squares linear regression models to relate DBH to dendrometric variables measured from airborne sensors, such as tree height (H) and crown diameter (CD). This study explores the usefulness of ensemble-type supervised machine learning regression algorithms, such as random forest regression (RFR), categorical boosting (CatBoost), gradient boosting (GBoost), or AdaBoost regression (AdaBoost), as an alternative to linear regression (LR) for modelling the allometric relationships DBH = Φ(H) and DBH = Ψ(H, CD). The original dataset was made up of 2272 teak trees (Tectona grandis Linn. F.) belonging to three different plantations located in Ecuador. All teak trees were digitally reconstructed from terrestrial laser scanning point clouds. The results showed that allometric models involving both H and CD to estimate DBH performed better than those based solely on H. Furthermore, boosting machine learning regression algorithms (CatBoost and GBoost) outperformed RFR (bagging) and LR (traditional linear regression) models, both in terms of goodness-of-fit (R²) and stability (variations in training and testing samples).


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Mustafa Abed ◽  
Monzur Alam Imteaz ◽  
Ali Najah Ahmed ◽  
Yuk Feng Huang

Abstract: Evaporation is a key element for water resource management, hydrological modelling, and irrigation system design. Monthly evaporation (Ep) was projected by deploying three machine learning (ML) models, namely Extreme Gradient Boosting, ElasticNet Linear Regression, and Long Short-Term Memory, and two empirical techniques, namely Stephens-Stewart and Thornthwaite. The aim of this study is to develop a reliable generalised model to predict evaporation throughout Malaysia. In this context, monthly meteorological statistics from two weather stations in Malaysia were utilised for training and testing the models on the basis of climatic aspects such as maximum temperature, mean temperature, minimum temperature, wind speed, relative humidity, and solar radiation for the period of 2000–2019. For every approach, multiple models were formulated by utilising various combinations of input parameters and other model factors. The performance of the models was assessed using standard statistical measures. The outcomes indicated that the three machine learning models outperformed the empirical models and could considerably enhance the precision of the monthly Ep estimate even with the same combinations of inputs. In addition, the performance assessment showed that the Long Short-Term Memory Neural Network (LSTM) offered the most precise monthly Ep estimations of all the studied models for both stations. The LSTM-10 model performance measures were (R² = 0.970, MAE = 0.135, MSE = 0.027, RMSE = 0.166, RAE = 0.173, RSE = 0.029) for Alor Setar and (R² = 0.986, MAE = 0.058, MSE = 0.005, RMSE = 0.074, RAE = 0.120, RSE = 0.013) for Kota Bharu.
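The relative error measures reported for the LSTM-10 model are less familiar than MAE or RMSE, so a small sketch of their usual definitions may help. It assumes the standard formulas for relative absolute error (RAE) and relative squared error (RSE), in which the model error is normalised by the error of a baseline that always predicts the observed mean; the abstract does not state the exact formulas the authors used, and the function names here are illustrative.

```python
from typing import Sequence


def relative_absolute_error(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """RAE: total absolute error divided by that of a predict-the-mean baseline."""
    mean_obs = sum(observed) / len(observed)
    model_error = sum(abs(o - p) for o, p in zip(observed, predicted))
    baseline_error = sum(abs(o - mean_obs) for o in observed)
    return model_error / baseline_error


def relative_squared_error(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """RSE: total squared error divided by that of a predict-the-mean baseline."""
    mean_obs = sum(observed) / len(observed)
    model_error = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    baseline_error = sum((o - mean_obs) ** 2 for o in observed)
    return model_error / baseline_error


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination in its 1 - RSE form (one common definition)."""
    return 1.0 - relative_squared_error(observed, predicted)
```

Under these definitions R² and RSE sum to one, which is roughly consistent with the reported pairs (0.970 and 0.029 for Alor Setar, 0.986 and 0.013 for Kota Bharu), although small differences can arise from rounding or from a different definition of R².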


Author(s):  
Rupali Amit Bagate ◽  
R. Suguna

Identifying sarcasm in text can be challenging. In sarcasm, a negative word can flip the polarity of a positive sentence. Sentences can be classified as sarcastic or non-sarcastic. It is easier to identify sarcasm from facial expression or tone of voice than from plain text. Thus, sarcasm detection using natural language processing is a major challenge when no specific context or clue, such as #sarcasm in a tweet, is given. Research therefore tries to solve this classification problem using various optimized models. The proposed model analyzes whether a given tweet is sarcastic without relying on the presence of the hashtag #sarcasm or any other specific context in the text. To achieve better results, we used different machine learning classification methods along with deep learning embedding techniques. Our optimized model uses a stacking technique that combines the results of logistic regression and a long short-term memory (LSTM) recurrent neural network and feeds them to a light gradient boosting technique, which generates better results compared to existing machine learning and neural network algorithms. The key difference of our research work is that sarcasm detection is done without #sarcasm, which has not been much explored by earlier researchers. The metrics used for evaluation are the F1-score and the confusion matrix.


2015 ◽  
Vol 12 (14) ◽  
pp. 11833-11861 ◽  
Author(s):  
V. F. Rodriguez-Galiano ◽  
M. Sanchez-Castillo ◽  
J. Dash ◽  
P. M. Atkinson

Abstract. This research reveals new insights into the climatic drivers of anomalies in land surface phenology (LSP) across the entire European forest, while at the same time establishing a new conceptual framework for predictive modelling of LSP. Specifically, the Random Forest (RF) method, a multivariate, spatially non-stationary and non-linear machine learning approach, was introduced for phenological modelling across very large areas and across multiple years simultaneously: the typical case for satellite-observed LSP. The RF model was fitted to the relation between LSP anomalies and numerous climate predictor variables computed at biologically-relevant rather than human-imposed temporal scales. In addition, the legacy effect of an advanced or delayed spring on autumn phenology was explored. The RF models explained 81 and 62 % of the variance in the spring and autumn LSP anomalies, with relative errors of 10 and 20 %, respectively: a level of precision that has until now been unobtainable at the continental scale. Multivariate linear regression models explained only 36 and 25 %, respectively. The RF method also allowed identification of the main drivers of the anomalies in LSP through its estimation of variable importance. This research thus shows clearly the inadequacy of the hitherto applied linear regression approaches for modelling LSP and paves the way for a new set of scientific investigations based on machine learning methods.


2021 ◽  
Author(s):  
Christian Friedemann Luz ◽  
Dimitrios Soudis ◽  
Maurits H Renes ◽  
Leslie R Zwerwer ◽  
Nicoletta Giudice ◽  
...  

Objectives: Infection-related consultations on intensive care units (ICU) are an important cornerstone in the care for critically ill patients with (suspected) infections. The positive impact of consultations on quality of care and clinical outcome has previously been demonstrated. However, timing is essential, and to date consultations are typically event-triggered and reactive. Here, we investigate a proactive approach by predicting infection-related consultations using machine learning models and routine electronic health records (EHR). Methods: We used data from a mixed ICU at a large academic tertiary care hospital including 9684 admissions. EHR data comprised demographics, laboratory results, point-of-care tests, vital signs, line placements, and prescriptions. Consultations were performed by clinical microbiologists. The predicted target outcome (occurrence of a consultation) was modelled using random forest (RF), gradient boosting machines (GBM), and long short-term memory neural networks (LSTM). Results: Overall, 7.8 % of all admissions received a consultation. Time-sensitive modelling approaches and increasing numbers of patient features (parameters) performed better than static approaches in predicting infection-related consultations at the ICU. Splitting a patient admission into eight-hour intervals and using LSTM resulted in the accurate prediction of consultations up to eight hours in advance, with an area under the receiver operating characteristic curve of 0.921 and an area under the precision-recall curve of 0.673. Conclusion: We could successfully predict infection-related consultations on an ICU up to eight hours in advance, even without using classical triggers, such as (interim) microbiology reports. Predicting this key event can potentially streamline ICU and consultant workflows and improve care and outcome for critically ill patients with (suspected) infections.
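The time-sensitive approach in this abstract depends on splitting each admission into eight-hour intervals before the sequence is passed to an LSTM. The sketch below shows one way such binning could look. It is an illustration only: the event tuple layout, the function name, and the choice to aggregate repeated measurements by their mean are assumptions, since the abstract does not describe how the intervals were built or how values within an interval were summarised.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical event layout: (hours since admission, feature name, measured value)
Event = Tuple[float, str, float]


def bin_into_intervals(
    events: Iterable[Event], interval_hours: float = 8.0
) -> Dict[int, Dict[str, float]]:
    """Group timestamped measurements into fixed-length intervals.

    Returns a mapping from interval index (0 covers hours [0, 8), 1 covers
    [8, 16), and so on) to the mean value of each feature in that interval.
    Averaging is an assumption made for this sketch.
    """
    collected: Dict[int, Dict[str, List[float]]] = defaultdict(dict)
    for hours, name, value in events:
        index = int(hours // interval_hours)
        collected[index].setdefault(name, []).append(value)
    return {
        index: {name: sum(values) / len(values) for name, values in features.items()}
        for index, features in sorted(collected.items())
    }
```

For example, heart-rate readings at 0.5 h and 7.9 h fall into interval 0 and are averaged, while a reading at 8.0 h opens interval 1. The resulting per-interval feature dictionaries form the kind of ordered sequence a recurrent model consumes.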


2020 ◽  
Vol 12 (11) ◽  
pp. 4471 ◽  
Author(s):  
Jack Ngarambe ◽  
Amina Irakoze ◽  
Geun Young Yun ◽  
Gon Kim

The performance of machine learning (ML) algorithms depends on the nature of the problem at hand. ML-based modeling, therefore, should employ suitable algorithms where optimum results are desired. The purpose of the current study was to explore the potential applications of ML algorithms in modeling daylight in indoor spaces and ultimately identify the optimum algorithm. We thus developed and compared the performance of four common ML algorithms (generalized linear models, deep neural networks, random forest, and gradient boosting models) in predicting the distribution of indoor daylight illuminances. We found that deep neural networks, which showed a coefficient of determination (R²) of 0.99, outperformed the other algorithms. Additionally, we explored the use of long short-term memory to forecast the distribution of daylight at a particular future time. Our results show that long short-term memory is accurate and reliable (R² = 0.92). Our findings provide a basis for discussions on ML algorithms’ use in modeling daylight in indoor spaces, which may ultimately result in efficient tools for estimating daylight performance in the primary stages of building design and daylight control schemes for energy efficiency.


2019 ◽  
Vol 28 (3) ◽  
pp. 1039-1052
Author(s):  
Reva M. Zimmerman ◽  
JoAnn P. Silkes ◽  
Diane L. Kendall ◽  
Irene Minkina

Purpose: A significant relationship between verbal short-term memory (STM) and language performance in people with aphasia has been found across studies. However, very few studies have examined the predictive value of verbal STM in treatment outcomes. This study aims to determine if verbal STM can be used as a predictor of treatment success. Method: Retrospective data from 25 people with aphasia in a larger randomized controlled trial of phonomotor treatment were analyzed. Digit and word spans from immediately pretreatment were run in multiple linear regression models to determine whether they predict magnitude of change from pre- to posttreatment and follow-up naming accuracy. Pretreatment, immediately posttreatment, and 3 months posttreatment digit and word span scores were compared to determine if they changed following a novel treatment approach. Results: Verbal STM, as measured by digit and word spans, did not predict magnitude of change in naming accuracy from pre- to posttreatment nor from pretreatment to 3 months posttreatment. Furthermore, digit and word spans did not change from pre- to posttreatment or from pretreatment to 3 months posttreatment in the overall analysis. A post hoc analysis revealed that only the less impaired group showed significant changes in word span scores from pretreatment to 3 months posttreatment. Discussion: The results suggest that digit and word spans do not predict treatment gains. In a less severe subsample of participants, digit and word span scores can change following phonomotor treatment; however, the overall results suggest that span scores may not change significantly. The implications of these findings are discussed within the broader purview of theoretical and empirical associations between aphasic language and verbal STM processing.


2020 ◽  
Vol 39 (5) ◽  
pp. 6579-6590
Author(s):  
Sandy Çağlıyor ◽  
Başar Öztayşi ◽  
Selime Sezgin

The motion picture industry is one of the largest industries worldwide and has significant importance in the global economy. Considering the high stakes and high risks in the industry, forecast models and decision support systems are gaining importance. Several attempts have been made to estimate the theatrical performance of a movie before or at the early stages of its release. Nevertheless, these models are mostly used for predicting domestic performance, and the industry still struggles to predict box office performance in overseas markets. In this study, the aim is to design a forecast model using different machine learning algorithms to estimate the theatrical success of US movies in Turkey. A dataset of 1559 movies is constructed from various sources. First, the independent variables are grouped as pre-release, distributor type, and international distribution based on their characteristics. The number of attendances is discretized into three classes. Four popular machine learning algorithms (artificial neural networks, decision tree regression, gradient boosting tree, and random forest) are employed, and the impact of each variable group is observed by comparing the performance of the models. Then the number of target classes is increased to five and eight, and the results are compared with previously developed models in the literature.

