Application of Random Forest and ICON Models Combined with Weather Forecasts to Predict Soil Temperature and Water Content in a Greenhouse

Yi-Zhih Tsai; Kan-Sheng Hsu; Hung-Yu Wu; Shu-I Lin; Hwa-Lung Yu; Kuo-Tsang Huang; Ming-Che Hu; Shao-Yiu Hsu

doi:10.3390/w12041176

Application of Random Forest and ICON Models Combined with Weather Forecasts to Predict Soil Temperature and Water Content in a Greenhouse

Water ◽

10.3390/w12041176 ◽

2020 ◽

Vol 12 (4) ◽

pp. 1176

Author(s):

Yi-Zhih Tsai ◽

Kan-Sheng Hsu ◽

Hung-Yu Wu ◽

Shu-I Lin ◽

Hwa-Lung Yu ◽

...

Keyword(s):

Water Management ◽

Machine Learning ◽

Random Forest ◽

Soil Temperature ◽

Water Content ◽

Numerical Models ◽

Volumetric Water Content ◽

Forest Model ◽

Greenhouse Experiments ◽

Machine Learning Methods

Climate change might potentially cause extreme weather events to become more frequent and intense. It could also enhance water scarcity and reduce food security. More efficient water management techniques are thus required to ensure a stable food supply and quality. Maintaining proper soil water content and soil temperature is necessary for efficient water management in agricultural practices. The usage of water and fertilizers can be significantly improved with a precise water content prediction tool. In this study, we proposed a new framework that combines weather forecast data, numerical models, and machine learning methods to simulate and predict the soil temperature and volumetric water content in a greenhouse. To test the framework, we performed greenhouse experiments with cherry tomatoes. The numerical models and machine learning methods we selected were Newton’s law of cooling, HYDRUS-1D, the random forest model, and the ICON (inferring connections of networks) model. The measured air temperature, soil temperature, and volumetric water content during the cultivation period were used for model calibration and validation. We compared the performances of the models for soil temperature and volumetric water content predictions. The results showed that the random forest model performed a more accurate prediction than other methods under the limited information provided from greenhouse experiments. This approach provides a framework that can potentially learn best water management practices from experienced farmers and provide intelligent information for smart greenhouse management.

Download Full-text

Virtual Screening of Antitumor Inhibitors Targeting BRD4 Based on Machine Learning Methods

10.21203/rs.3.rs-768419/v1 ◽

2021 ◽

Author(s):

Keliang Wu ◽

Chenghua Zhang ◽

Bing He ◽

Huanxin Li ◽

Shan Tang ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Virtual Screening ◽

Molecular Design ◽

Random Forest Model ◽

Learning Methods ◽

Forest Model ◽

Design And Synthesis ◽

Machine Learning Methods ◽

Dynamics Simulations

Abstract BRD4 is a hot antitumor target. In this study, three kinds of machine learning methods were used to establish classification models of BRD4 inhibitors, achieving satisfactory prediction performance. Through comparison, random forest model worked best, the parameters of which were also optimized. Then, the best random forest model was applied to perform virtual screening against ZINC database and a total of 89 potential compounds with BRD4 inhibitory activity were eventually identified. Further, seven molecules were chosen from the hits, and a docking calculation was carried out for each molecule, showing a strong interaction between ligand and BRD4. Subsequently, these molecules were evaluated by molecular dynamics simulations, all having certain binding stability. The results have proved the effectiveness of the developed models based on machine learning methods and the molecules filtered by virtual screening not only have a significant guiding in practice for the molecular design and synthesis, but also can provide great possibility for the discoveries and final approvals of anti-cancer drugs targeting BRD4.

Download Full-text

Integrating Machine/Deep Learning Methods and Filtering Techniques for Reliable Mineral Phase Segmentation of 3D X-ray Computed Tomography Images

Energies ◽

10.3390/en14154595 ◽

2021 ◽

Vol 14 (15) ◽

pp. 4595

Author(s):

Parisa Asadi ◽

Lauren E. Beckingham

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Random Forest ◽

Ct Images ◽

Ct Imaging ◽

Learning Method ◽

Learning Methods ◽

X Ray ◽

Machine Learning Methods ◽

Filtering Techniques

X-ray CT imaging provides a 3D view of a sample and is a powerful tool for investigating the internal features of porous rock. Reliable phase segmentation in these images is highly necessary but, like any other digital rock imaging technique, is time-consuming, labor-intensive, and subjective. Combining 3D X-ray CT imaging with machine learning methods that can simultaneously consider several extracted features in addition to color attenuation, is a promising and powerful method for reliable phase segmentation. Machine learning-based phase segmentation of X-ray CT images enables faster data collection and interpretation than traditional methods. This study investigates the performance of several filtering techniques with three machine learning methods and a deep learning method to assess the potential for reliable feature extraction and pixel-level phase segmentation of X-ray CT images. Features were first extracted from images using well-known filters and from the second convolutional layer of the pre-trained VGG16 architecture. Then, K-means clustering, Random Forest, and Feed Forward Artificial Neural Network methods, as well as the modified U-Net model, were applied to the extracted input features. The models’ performances were then compared and contrasted to determine the influence of the machine learning method and input features on reliable phase segmentation. The results showed considering more dimensionality has promising results and all classification algorithms result in high accuracy ranging from 0.87 to 0.94. Feature-based Random Forest demonstrated the best performance among the machine learning models, with an accuracy of 0.88 for Mancos and 0.94 for Marcellus. The U-Net model with the linear combination of focal and dice loss also performed well with an accuracy of 0.91 and 0.93 for Mancos and Marcellus, respectively. In general, considering more features provided promising and reliable segmentation results that are valuable for analyzing the composition of dense samples, such as shales, which are significant unconventional reservoirs in oil recovery.

Download Full-text

Machine Learning in Aging: An Example of Developing Prediction Models for Serious Fall Injury in Older Adults

Innovation in Aging ◽

10.1093/geroni/igaa057.859 ◽

2020 ◽

Vol 4 (Supplement_1) ◽

pp. 268-269

Author(s):

Jaime Speiser ◽

Kathryn Callahan ◽

Jason Fanning ◽

Thomas Gill ◽

Anne Newman ◽

...

Keyword(s):

Machine Learning ◽

Older Adults ◽

Random Forest ◽

Decision Tree ◽

Prediction Models ◽

Receiver Operating Curve ◽

Learning Methods ◽

Life Study ◽

Fall Injury ◽

Machine Learning Methods

Abstract Advances in computational algorithms and the availability of large datasets with clinically relevant characteristics provide an opportunity to develop machine learning prediction models to aid in diagnosis, prognosis, and treatment of older adults. Some studies have employed machine learning methods for prediction modeling, but skepticism of these methods remains due to lack of reproducibility and difficulty understanding the complex algorithms behind models. We aim to provide an overview of two common machine learning methods: decision tree and random forest. We focus on these methods because they provide a high degree of interpretability. We discuss the underlying algorithms of decision tree and random forest methods and present a tutorial for developing prediction models for serious fall injury using data from the Lifestyle Interventions and Independence for Elders (LIFE) study. Decision tree is a machine learning method that produces a model resembling a flow chart. Random forest consists of a collection of many decision trees whose results are aggregated. In the tutorial example, we discuss evaluation metrics and interpretation for these models. Illustrated in data from the LIFE study, prediction models for serious fall injury were moderate at best (area under the receiver operating curve of 0.54 for decision tree and 0.66 for random forest). Machine learning methods may offer improved performance compared to traditional models for modeling outcomes in aging, but their use should be justified and output should be carefully described. Models should be assessed by clinical experts to ensure compatibility with clinical practice.

Download Full-text

Possibility of Autonomous Estimation of Shiba Goat’s Estrus and Non-Estrus Behavior by Machine Learning Methods

Animals ◽

10.3390/ani10050771 ◽

2020 ◽

Vol 10 (5) ◽

pp. 771

Author(s):

Toshiya Arakawa

Keyword(s):

Neural Network ◽

Machine Learning ◽

Random Forest ◽

Markov Models ◽

Tracking System ◽

Video Tracking ◽

Training Data ◽

Support Vector ◽

Learning Methods ◽

Machine Learning Methods

Mammalian behavior is typically monitored by observation. However, direct observation requires a substantial amount of effort and time, if the number of mammals to be observed is sufficiently large or if the observation is conducted for a prolonged period. In this study, machine learning methods as hidden Markov models (HMMs), random forests, support vector machines (SVMs), and neural networks, were applied to detect and estimate whether a goat is in estrus based on the goat’s behavior; thus, the adequacy of the method was verified. Goat’s tracking data was obtained using a video tracking system and used to estimate whether they, which are in “estrus” or “non-estrus”, were in either states: “approaching the male”, or “standing near the male”. Totally, the PC of random forest seems to be the highest. However, The percentage concordance (PC) value besides the goats whose data were used for training data sets is relatively low. It is suggested that random forest tend to over-fit to training data. Besides random forest, the PC of HMMs and SVMs is high. However, considering the calculation time and HMM’s advantage in that it is a time series model, HMM is better method. The PC of neural network is totally low, however, if the more goat’s data were acquired, neural network would be an adequate method for estimation.

Download Full-text

Machine learning approaches to the determinants of women’s vasomotor symptoms using general hospital data

10.21203/rs.3.rs-74144/v1 ◽

2020 ◽

Author(s):

Ki-Jin Ryu ◽

Kyong Wook Yi ◽

Yong Jin Kim ◽

Jung Ho Shin ◽

Jun Young Hur ◽

...

Keyword(s):

Machine Learning ◽

Insulin Resistance ◽

Random Forest ◽

Variable Importance ◽

Vasomotor Symptoms ◽

Model Assessment ◽

Cancer Antigen ◽

Learning Methods ◽

Machine Learning Methods ◽

Glutamyl Transferase

Abstract Background To analyze the determinants of women’s vasomotor symptoms (VMS) using machine learning. Methods Data came from Korea University Anam Hospital in Seoul, Korea, with 3298 women, aged 40–80 years, who attended their general health check from January 2010 to December 2012. Five machine learning methods were applied and compared for the prediction of VMS, measured by a Menopause Rating Scale. Variable importance, the effect of a variable on model performance, was used for identifying major determinants of VMS. Results In terms of the mean squared error, the random forest (0.9326) was much better than linear regression (12.4856) and artificial neural networks with one, two and three hidden layers (1.5576, 1.5184 and 1.5833, respectively). Based on variable importance from the random forest, the most important determinants of VMS were age, menopause age, thyroid stimulating hormone, monocyte and triglyceride, as well as gamma glutamyl transferase, blood urea nitrogen, cancer antigen 19 − 9, C-reactive protein and low-density-lipoprotein cholesterol. Indeed, the following determinants ranked within the top 20 in terms of variable importance: cancer antigen 125, total cholesterol, insulin, free thyroxine, forced vital capacity, alanine aminotransferase, forced expired volume in one second, height, homeostatic model assessment for insulin resistance and carcinoembryonic antigen. Conclusions Machine learning provides an invaluable decision support system for the prediction of VMS. For preventing VMS, preventive measures would be needed regarding the thyroid function, the lipid profile, the liver function, inflammation markers, insulin resistance, the monocyte, cancer antigens and the lung function.

Download Full-text

Detecting Face Touching Using Smartwatches to Mitigate the Spread of COVID-19: Pilot Study (Preprint)

10.2196/preprints.28799 ◽

2021 ◽

Author(s):

Chen Bai ◽

Yu-Peng Chen ◽

Adam Wolach ◽

Lisa Anthony ◽

Mamoun Mardini

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Random Forest ◽

Respiratory Diseases ◽

Window Size ◽

Support Vector ◽

Accelerometer Data ◽

Respiratory Illnesses ◽

Motion Data ◽

Machine Learning Methods

BACKGROUND Frequent spontaneous facial self-touches, predominantly during outbreaks, have the theoretical potential to be a mechanism of contracting and transmitting diseases. Despite the recent advent of vaccines, behavioral approaches remain an integral part of reducing the spread of COVID-19 and other respiratory illnesses. Real-time biofeedback of face touching can potentially mitigate the spread of respiratory diseases. The gap addressed in this study is the lack of an on-demand platform that utilizes motion data from smartwatches to accurately detect face touching. OBJECTIVE The aim of this study was to utilize the functionality and the spread of smartwatches to develop a smartwatch application to identifying motion signatures that are mapped accurately to face touching. METHODS Participants (n=10, 50% women, aged 20-83) performed 10 physical activities classified into: face touching (FT) and non-face touching (NFT) categories, in a standardized laboratory setting. We developed a smartwatch application on Samsung Galaxy Watch to collect raw accelerometer data from participants. Then, data features were extracted from consecutive non-overlapping windows varying from 2-16 seconds. We examined the performance of state-of-the-art machine learning methods on face touching movements recognition (FT vs NFT) and individual activity recognition (IAR): logistic regression, support vector machine, decision trees and random forest. RESULTS Machine learning models were accurate in recognizing face touching categories; logistic regression achieved the best performance across all metrics (Accuracy: 0.93 +/- 0.08, Recall: 0.89 +/- 0.16, Precision: 0.93 +/- 0.08, F1-score: 0.90 +/- 0.11, AUC: 0.95 +/- 0.07) at the window size of 5 seconds. IAR models resulted in lower performance; the random forest classifier achieved the best performance across all metrics (Accuracy: 0.70 +/- 0.14, Recall: 0.70 +/- 0.14, Precision: 0.70 +/- 0.16, F1-score: 0.67 +/- 0.15) at the window size of 9 seconds. CONCLUSIONS Wearable devices, powered with machine learning, are effective in detecting facial touches. This is highly significant during respiratory infection outbreaks, as it has a great potential to refrain people from touching their faces and potentially mitigate the possibility of transmitting COVID-19 and future respiratory diseases.

Download Full-text

Monitoring the Variation of Vegetation Water Content with Machine Learning Methods: Point–Surface Fusion of MODIS Products and GNSS-IR Observations

Remote Sensing ◽

10.3390/rs11121440 ◽

2019 ◽

Vol 11 (12) ◽

pp. 1440 ◽

Cited By ~ 1

Author(s):

Qiangqiang Yuan ◽

Shuwen Li ◽

Linwei Yue ◽

Tongwen Li ◽

Huanfeng Shen ◽

...

Keyword(s):

Neural Network ◽

Machine Learning ◽

Water Content ◽

Linear Regression Method ◽

Learning Methods ◽

Machine Learning Methods ◽

Vegetation Water Content ◽

Drought Prediction ◽

Vegetation Water ◽

Surface Fusion

Vegetation water content (VWC) is recognized as an important parameter in vegetation growth studies, natural disasters such as forest fires, and drought prediction. Recently, the Global Navigation Satellite System Interferometric Reflectometry (GNSS-IR) has emerged as an important technique for monitoring vegetation information. The normalized microwave reflection index (NMRI) was developed to reflect the change of VWC based on this fact. However, NMRI uses local site-based data, and the sparse distribution hinders the application of NMRI. In this study, we obtained a 500 m spatially continuous NMRI product by integrating GNSS-IR site data with other VWC-related products using the point–surface fusion technique. The auxiliary data in the fusion process include the normalized difference vegetation index (NDVI), gross primary productivity (GPP), and precipitation. Meanwhile, the fusion performance of three machine learning methods, i.e., the back-propagation neural network (BPNN), generalized regression neural network (GRNN), and random forest (RF) are compared and analyzed. The machine learning methods achieve satisfactory results, with cross-validation R values of 0.71–0.83 and RMSEs of 0.025–0.037. The results show a clear improvement over the traditional multiple linear regression method, which achieves R (RMSE) values of only about 0.4 (0.045). It indicates that the machine learning methods can better learn the complex nonlinear relationship between NMRI and the input VWC-related index. Among the machine learning methods, the RF model obtained the best results. Long time-series NMRI images with a 500 m spatial resolution in the western part of the continental U.S. were then obtained. The results show that the spatial distribution of the NMRI product is consistent with a drought situation from 2012 to 2014 in the U.S., which verifies the feasibility of analyzing and predicting drought times and distribution ranges by using the 500 m fusion product.

Download Full-text

TSLRF: Two-Stage Algorithm Based on Least Angle Regression and Random Forest in genome-wide association studies

Scientific Reports ◽

10.1038/s41598-019-54519-x ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 1

Author(s):

Jiali Sun ◽

Qingtai Wu ◽

Dafeng Shen ◽

Yangjun Wen ◽

Fengrong Liu ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

The Other ◽

New Method ◽

Genome Wide Association ◽

High Dimensional ◽

Simulation Experiments ◽

Machine Learning Methods ◽

Least Angle Regression ◽

Genome Wide

AbstractOne of the most important tasks in genome-wide association analysis (GWAS) is the detection of single-nucleotide polymorphisms (SNPs) which are related to target traits. With the development of sequencing technology, traditional statistical methods are difficult to analyze the corresponding high-dimensional massive data or SNPs. Recently, machine learning methods have become more popular in high-dimensional genetic data analysis for their fast computation speed. However, most of machine learning methods have several drawbacks, such as poor generalization ability, over-fitting, unsatisfactory classification and low detection accuracy. This study proposed a two-stage algorithm based on least angle regression and random forest (TSLRF), which firstly considered the control of population structure and polygenic effects, then selected the SNPs that were potentially related to target traits by using least angle regression (LARS), furtherly analyzed this variable subset using random forest (RF) to detect quantitative trait nucleotides (QTNs) associated with target traits. The new method has more powerful detection in simulation experiments and real data analyses. The results of simulation experiments showed that, compared with the existing approaches, the new method effectively improved the detection ability of QTNs and model fitting degree, and required less calculation time. In addition, the new method significantly distinguished QTNs and other SNPs. Subsequently, the new method was applied to analyze five flowering-related traits in Arabidopsis. The results showed that, the distinction between QTNs and unrelated SNPs was more significant than the other methods. The new method detected 60 genes confirmed to be related to the target trait, which was significantly higher than the other methods, and simultaneously detected multiple gene clusters associated with the target trait.

Download Full-text

Classification models using circulating neutrophil transcripts can detect unruptured intracranial aneurysm

Journal of Translational Medicine ◽

10.1186/s12967-020-02550-2 ◽

2020 ◽

Vol 18 (1) ◽

Author(s):

Kerry E. Poppenberg ◽

Vincent M. Tutino ◽

Lu Li ◽

Muhammad Waqas ◽

Armond June ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Prediction Models ◽

Model Performance ◽

Supervised Machine Learning ◽

Support Vector ◽

Learning Methods ◽

Training Cohort ◽

Network Analyses ◽

Machine Learning Methods

Abstract Background Intracranial aneurysms (IAs) are dangerous because of their potential to rupture. We previously found significant RNA expression differences in circulating neutrophils between patients with and without unruptured IAs and trained machine learning models to predict presence of IA using 40 neutrophil transcriptomes. Here, we aim to develop a predictive model for unruptured IA using neutrophil transcriptomes from a larger population and more robust machine learning methods. Methods Neutrophil RNA extracted from the blood of 134 patients (55 with IA, 79 IA-free controls) was subjected to next-generation RNA sequencing. In a randomly-selected training cohort (n = 94), the Least Absolute Shrinkage and Selection Operator (LASSO) selected transcripts, from which we constructed prediction models via 4 well-established supervised machine-learning algorithms (K-Nearest Neighbors, Random Forest, and Support Vector Machines with Gaussian and cubic kernels). We tested the models in the remaining samples (n = 40) and assessed model performance by receiver-operating-characteristic (ROC) curves. Real-time quantitative polymerase chain reaction (RT-qPCR) of 9 IA-associated genes was used to verify gene expression in a subset of 49 neutrophil RNA samples. We also examined the potential influence of demographics and comorbidities on model prediction. Results Feature selection using LASSO in the training cohort identified 37 IA-associated transcripts. Models trained using these transcripts had a maximum accuracy of 90% in the testing cohort. The testing performance across all methods had an average area under ROC curve (AUC) = 0.97, an improvement over our previous models. The Random Forest model performed best across both training and testing cohorts. RT-qPCR confirmed expression differences in 7 of 9 genes tested. Gene ontology and IPA network analyses performed on the 37 model genes reflected dysregulated inflammation, cell signaling, and apoptosis processes. In our data, demographics and comorbidities did not affect model performance. Conclusions We improved upon our previous IA prediction models based on circulating neutrophil transcriptomes by increasing sample size and by implementing LASSO and more robust machine learning methods. Future studies are needed to validate these models in larger cohorts and further investigate effect of covariates.

Download Full-text

3145 An Evaluation of Machine Learning and Traditional Statistical Methods for Discovery in Large-Scale Translational Data

Journal of Clinical and Translational Science ◽

10.1017/cts.2019.8 ◽

2019 ◽

Vol 3 (s1) ◽

pp. 2-2

Author(s):

Megan C Hollister ◽

Jeffrey D. Blume

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Random Forest ◽

Gene Expression Data ◽

Large Scale ◽

Second Generation ◽

A Priori ◽

Expression Data ◽

P Values ◽

Machine Learning Methods

OBJECTIVES/SPECIFIC AIMS: To examine and compare the claims in Bzdok, Altman, and Brzywinski under a broader set of conditions by using unbiased methods of comparison. To explore how to accurately use various machine learning and traditional statistical methods in large-scale translational research by estimating their accuracy statistics. Then we will identify the methods with the best performance characteristics. METHODS/STUDY POPULATION: We conducted a simulation study with a microarray of gene expression data. We maintained the original structure proposed by Bzdok, Altman, and Brzywinski. The structure for gene expression data includes a total of 40 genes from 20 people, in which 10 people are phenotype positive and 10 are phenotype negative. In order to find a statistical difference 25% of the genes were set to be dysregulated across phenotype. This dysregulation forced the positive and negative phenotypes to have different mean population expressions. Additional variance was included to simulate genetic variation across the population. We also allowed for within person correlation across genes, which was not done in the original simulations. The following methods were used to determine the number of dysregulated genes in simulated data set: unadjusted p-values, Benjamini-Hochberg adjusted p-values, Bonferroni adjusted p-values, random forest importance levels, neural net prediction weights, and second-generation p-values. RESULTS/ANTICIPATED RESULTS: Results vary depending on whether a pre-specified significance level is used or the top 10 ranked values are taken. When all methods are given the same prior information of 10 dysregulated genes, the Benjamini-Hochberg adjusted p-values and the second-generation p-values generally outperform all other methods. We were not able to reproduce or validate the finding that random forest importance levels via a machine learning algorithm outperform classical methods. Almost uniformly, the machine learning methods did not yield improved accuracy statistics and they depend heavily on the a priori chosen number of dysregulated genes. DISCUSSION/SIGNIFICANCE OF IMPACT: In this context, machine learning methods do not outperform standard methods. Because of this and their additional complexity, machine learning approaches would not be preferable. Of all the approaches the second-generation p-value appears to offer significant benefit for the cost of a priori defining a region of trivially null effect sizes. The choice of an analysis method for large-scale translational data is critical to the success of any statistical investigation, and our simulations clearly highlight the various tradeoffs among the available methods.

Download Full-text