Prediction of Hanwoo Cattle Phenotypes from Genotypes Using Machine Learning Methods

Animals ◽  
2021 ◽  
Vol 11 (7) ◽  
pp. 2066
Author(s):  
Swati Srivastava ◽  
Bryan Irvine Lopez ◽  
Himansu Kumar ◽  
Myoungjin Jang ◽  
Han-Ha Chai ◽  
...  

Hanwoo was originally raised for draft purposes, but the increase in local demand for red meat turned that purpose into full-scale meat-type cattle rearing; it is now considered one of the most economically important breeds and a vital food source for Koreans. The application of genomic selection in Hanwoo breeding programs in recent years was expected to lead to higher genetic progress. However, better statistical methods that can improve genomic prediction accuracy are required. Hence, this study aimed to compare the predictive performance of three machine learning methods, namely random forest (RF), extreme gradient boosting (XGB), and support vector machine (SVM), when predicting carcass weight (CWT), marbling score (MS), backfat thickness (BFT), and eye muscle area (EMA). Phenotypic and genotypic data (53,866 SNPs) from 7324 commercial Hanwoo cattle slaughtered at around 30 months of age were used. The results showed that XGB delivered the highest predictive correlation for CWT and MS, followed by GBLUP, SVM, and RF. Meanwhile, the best predictive correlation for BFT and EMA was delivered by GBLUP, followed by SVM, RF, and XGB. Although XGB presented the highest predictive correlations for some traits, we did not find an advantage of XGB, or of any machine learning method, over GBLUP according to the mean squared error of prediction. Thus, we still recommend the use of GBLUP in the prediction of genomic breeding values for carcass traits in Hanwoo cattle.
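As context for the GBLUP comparison: GBLUP is statistically equivalent to ridge regression on the marker matrix (the RR-BLUP formulation). The sketch below is a hypothetical illustration in Python with scikit-learn, using simulated genotypes rather than the study's data, and contrasts the predictive correlation of a ridge/GBLUP analogue with RF and SVM under a purely additive genetic architecture:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Simulated SNP panel: 500 animals x 1000 markers coded 0/1/2 (illustrative sizes)
n_animals, n_markers = 500, 1000
X = rng.integers(0, 3, size=(n_animals, n_markers)).astype(float)

# Purely additive phenotype: many small marker effects plus residual noise
beta = rng.normal(0.0, 0.05, size=n_markers)
y = X @ beta + rng.normal(0.0, 1.0, size=n_animals)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Ridge on markers is the RR-BLUP formulation, equivalent to GBLUP;
# alpha = residual variance / marker-effect variance = 1.0 / 0.05**2
models = {
    "ridge (GBLUP analogue)": Ridge(alpha=400.0),
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
    "SVM": SVR(C=1.0),
}

corrs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    corrs[name] = np.corrcoef(model.predict(X_te), y_te)[0, 1]
    print(f"{name}: predictive correlation = {corrs[name]:.3f}")
```

Under such an additive architecture the linear model is typically hard to beat, which is consistent with the abstract's recommendation of GBLUP.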

SPE Journal ◽  
2020 ◽  
Vol 25 (03) ◽  
pp. 1241-1258 ◽  
Author(s):  
Ruizhi Zhong ◽  
Raymond L. Johnson ◽  
Zhongwei Chen

Summary Accurate coal identification is critical in coal seam gas (CSG) (also known as coalbed methane or CBM) developments because it determines well completion design and directly affects gas production. Density logging using radioactive source tools is the primary method for coal identification, but it adds well trips to condition the hole and additional well costs for logging runs. In this paper, machine learning methods are applied to identify coals from drilling and logging-while-drilling (LWD) data to reduce overall well costs. The machine learning algorithms include logistic regression (LR), support vector machine (SVM), artificial neural network (ANN), random forest (RF), and extreme gradient boosting (XGBoost). Precision, recall, and F1 score are used as evaluation metrics. Because coal identification is an imbalanced-data problem, performance on the minority class (i.e., coals) is limited. To enhance performance on coal prediction, two data manipulation techniques [the naive random oversampling (NROS) technique and the synthetic minority oversampling technique (SMOTE)] are separately coupled with the machine learning algorithms. Case studies are performed with data from six wells in the Surat Basin, Australia. For the first set of experiments (single-well experiments), both the training data and the test data come from the same well. The machine learning methods can identify coal pay zones for sections with poor or missing logs, and rate of penetration (ROP) is found to be the most important feature. The second set of experiments (multiple-well experiments) uses training data from multiple nearby wells to predict coal pay zones in a new well; here, the most important feature is gamma ray. After placing slotted casings, all wells have coal identification rates greater than 90%, and three wells have rates greater than 99%. This indicates that machine learning methods (either XGBoost or ANN/RF with NROS/SMOTE) can be an effective way to identify coal pay zones and reduce coring or logging costs in CSG developments.
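Naive random oversampling, one of the two techniques the paper couples with its classifiers, simply duplicates minority-class rows at random until the classes balance. A minimal sketch follows; the features and labels are hypothetical stand-ins for drilling/LWD data, and scikit-learn's LogisticRegression is used as an illustrative classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Toy stand-in for drilling/LWD features (e.g., ROP, gamma ray); label 1 = coal,
# deliberately made the rare class to mimic the imbalance in the paper
n = 2000
X = rng.normal(size=(n, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0.0, 0.5, n) > 1.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

def naive_random_oversample(X, y, rng):
    """NROS: duplicate minority-class rows at random until classes balance."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

X_bal, y_bal = naive_random_oversample(X_tr, y_tr, rng)

clf = LogisticRegression().fit(X_bal, y_bal)
pred = clf.predict(X_te)
precision = precision_score(y_te, pred)
recall = recall_score(y_te, pred)
f1 = f1_score(y_te, pred)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Oversampling typically trades some precision for a higher recall on the minority (coal) class, which is the trade-off the paper's metrics track.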


2021 ◽  
Vol 12 ◽  
Author(s):  
Runbin Sun ◽  
Haokai Zhao ◽  
Shuzhen Huang ◽  
Ran Zhang ◽  
Zhenyao Lu ◽  
...  

The liver has the ability to regenerate itself in mammals, but the mechanism has not been fully explained. Here we used a GC/MS-based metabolomic method to profile the dynamic endogenous metabolic changes in the serum of C57BL/6J mice at different times after 2/3 partial hepatectomy (PHx), and nine machine learning methods, including Least Absolute Shrinkage and Selection Operator regression (LASSO), Partial Least Squares regression (PLS), Principal Components Regression (PCR), k-Nearest Neighbors (KNN), Support Vector Machines (SVM), Random Forest (RF), eXtreme Gradient Boosting (xgbDART), Neural Network (NNET) and Bayesian Regularized Neural Network (BRNN), were used for regression between the liver index and metabolomic data at different stages of liver regeneration. We found that the tree-based random forest method had the minimum average Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), the maximum R square (R2), and was also time-saving. Furthermore, variable importance in projection (VIP) analysis of the RF method was performed, and the metabolites with the top 20 VIP values were selected as the most critical metabolites contributing to the model. Ornithine, phenylalanine, 2-hydroxybutyric acid, lysine, and others were chosen as the most important metabolites, which had strong correlations with the liver index. Further pathway analysis found that arginine biosynthesis, pantothenate and CoA biosynthesis, galactose metabolism, and valine, leucine and isoleucine degradation were the most influenced pathways. In summary, several amino acid metabolic pathways and the glucose metabolism pathway changed dynamically during liver regeneration. The RF method showed advantages for predicting the liver index after PHx over the other machine learning methods used, and a metabolic clock containing four metabolites was established to predict the liver index during liver regeneration.
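The RF importance ranking used to shortlist metabolites can be sketched as follows; the data, sample sizes, and the use of impurity-based importances (rather than the paper's VIP analysis) are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)

# Toy serum metabolomics matrix: 60 samples x 30 metabolites (hypothetical data);
# the "liver index" target is driven mainly by the first two columns
n_samples, n_metabolites = 60, 30
X = rng.normal(size=(n_samples, n_metabolites))
liver_index = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0.0, 0.3, n_samples)

rf = RandomForestRegressor(n_estimators=300, random_state=2).fit(X, liver_index)

# Impurity-based importances as a stand-in for the VIP ranking in the abstract
ranking = np.argsort(rf.feature_importances_)[::-1]
print("top 5 metabolites by importance:", ranking[:5].tolist())
```

With a dominant driver in the data, the top of the ranking recovers it; the paper's VIP ranking serves the same shortlisting purpose for the real metabolites.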


2020 ◽  
Vol 12 (12) ◽  
pp. 1952 ◽  
Author(s):  
Mateo Gašparović ◽  
Dino Dobrinić

Mapping of green vegetation in urban areas using remote sensing techniques can be used as a tool for integrated spatial planning to deal with urban challenges. In this context, multitemporal (MT) synthetic aperture radar (SAR) data have not been investigated as thoroughly as optical satellite data. This research compared various machine learning methods using single-date and MT Sentinel-1 (S1) imagery, focusing on vegetation mapping in urban areas across Europe. Urban vegetation was classified using six classifiers: random forests (RF), support vector machine (SVM), extreme gradient boosting (XGB), multi-layer perceptron (MLP), AdaBoost.M1 (AB), and extreme learning machine (ELM). Whereas SVM showed the best performance in the single-date image analysis, the MLP classifier yielded the highest overall accuracy in the MT classification scenario. Mean overall accuracy (OA) values for all machine learning methods increased from 57% to 77% with speckle filtering. Using MT SAR data, i.e., three and five S1 images, an additional increase in OA of 8.59% and 13.66% occurred, respectively. Additionally, when three and five S1 images were used for classification, the F1 measure for the forest and low-vegetation land-cover classes exceeded 90%. This research confirms the suitability of MT C-band SAR imagery for urban vegetation mapping.
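A minimal classifier-comparison harness in the spirit of this study might look as follows; the features are simulated stand-ins for per-pixel SAR backscatter statistics, scikit-learn's GradientBoostingClassifier stands in for XGB, and ELM is omitted because scikit-learn has no implementation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Toy stand-in for per-pixel SAR features (e.g., VV/VH backscatter statistics)
# with three land-cover classes
X, y = make_classification(n_samples=1500, n_features=10, n_informative=6,
                           n_classes=3, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=3)

classifiers = {
    "RF": RandomForestClassifier(n_estimators=200, random_state=3),
    "SVM": SVC(),
    "GB (XGB analogue)": GradientBoostingClassifier(random_state=3),
    "MLP": MLPClassifier(max_iter=1000, random_state=3),
    "AB": AdaBoostClassifier(random_state=3),
}

oa = {}
for name, clf in classifiers.items():
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    oa[name] = accuracy_score(y_te, pred)
    macro_f1 = f1_score(y_te, pred, average="macro")
    print(f"{name}: OA={oa[name]:.3f}  macro-F1={macro_f1:.3f}")
```

Overall accuracy (OA) and the per-class F1 measure reported in the abstract are exactly the two metrics this loop computes.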


2021 ◽  
Author(s):  
Xiao Lang ◽  
Da Wu ◽  
Wengang Mao

Abstract The development and evaluation of energy efficiency measures to reduce air emissions from shipping strongly depend on a reliable description of a ship's performance when sailing at sea. Normally, model tests and semi-empirical formulas are used to model a ship's performance, but they are either expensive or lack accuracy. Nowadays, many ship performance-related parameters are recorded during a ship's sailing, and different data-driven machine learning methods have been applied to ship speed-power modelling. This paper compares different supervised machine learning algorithms, i.e., eXtreme Gradient Boosting (XGBoost), neural networks, support vector machines, and some statistical regression methods, for ship speed-power modelling. A chemical tanker sailing worldwide with full-scale measurements is employed as the case study vessel. A general data pre-processing method for machine learning is presented. The machine learning models are trained using measurement data including ship operation profiles and encountered metocean conditions. Through the benchmark study, the pros and cons of different machine learning methods for ship speed-power performance modelling are identified. The accuracy of the various algorithm-based models for predicting ship performance during individual voyages is also investigated.
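A toy version of such a speed-power benchmark can be sketched with scikit-learn; the cubic speed-power relation (propeller law) and the metocean terms below are illustrative assumptions, not the tanker's measurements, and GradientBoostingRegressor stands in for XGBoost:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(4)

# Toy speed-power data: shaft power grows roughly with the cube of speed
# (propeller law), perturbed by illustrative "metocean" terms and noise
n = 1000
speed = rng.uniform(8.0, 16.0, n)   # knots
wave = rng.uniform(0.0, 4.0, n)     # significant wave height, m
wind = rng.uniform(0.0, 15.0, n)    # wind speed, m/s
power = 0.8 * speed**3 + 50.0 * wave + 10.0 * wind + rng.normal(0.0, 50.0, n)

X = np.column_stack([speed, wave, wind])
X_tr, X_te, y_tr, y_te = train_test_split(X, power, test_size=0.2, random_state=4)

models = {
    "GB (XGBoost analogue)": GradientBoostingRegressor(random_state=4),
    "MLP": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=4),
    "SVR": SVR(C=1000.0),
}
mapes = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    mapes[name] = mean_absolute_percentage_error(y_te, model.predict(X_te))
    print(f"{name}: MAPE = {mapes[name]:.3f}")
```

On such smooth data the tree ensemble fits well out of the box, while the unscaled neural network and SVR illustrate why the paper emphasizes data pre-processing.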


2021 ◽  
Vol 26 (2) ◽  
Author(s):  
Marko Romanovych Basarab ◽  
Ekateryna Olehivna Ivanko ◽  
Vishwesh Kulkarni

The paper is devoted to the application of machine learning methods to the prediction of the development of gestational diabetes mellitus in early pregnancy. Based on two publicly available databases, the study assesses the influence of features such as body mass index, triceps skin-fold thickness, ultrasound measurements of maternal visceral fat, first measured fasting glucose, and others as predictors of gestational diabetes mellitus. Supervised machine learning methods based on decision trees, support vector machines, logistic regression, the k-nearest neighbors classifier, ensemble learning, the naive Bayes classifier, and neural networks were implemented to determine the best classification models for computerized gestational diabetes mellitus prediction. The accuracy of the different classifiers was determined and compared. The support vector machine classifier demonstrated the highest accuracy (83.0% of total correctly predicted cases, 87.9% for the healthy class, and 78.1% for gestational diabetes mellitus) in predicting the development of gestational diabetes based on features from the Pima Indians Diabetes Database. The extreme gradient boosting classifier performed best among the supervised machine learning methods for the Visceral Adipose Tissue Measurements during Pregnancy Database, showing 87.9% of total correctly predicted cases (82.2% for the healthy class and 93.6% for gestational diabetes mellitus).
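The per-class accuracies quoted above (healthy class vs. gestational diabetes class) are the class-wise recalls read off the confusion matrix. A small worked example with hypothetical labels and predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels/predictions (0 = healthy, 1 = gestational diabetes) to
# show how the per-class accuracies quoted in the abstract are computed
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0])

cm = confusion_matrix(y_true, y_pred)
per_class = cm.diagonal() / cm.sum(axis=1)  # recall of each class
total = cm.diagonal().sum() / cm.sum()

print("healthy-class accuracy:", per_class[0])  # specificity
print("GDM-class accuracy:    ", per_class[1])  # sensitivity
print("total accuracy:        ", total)
```

Here 4 of 5 healthy cases and 3 of 5 GDM cases are correct, giving 0.8, 0.6, and a total accuracy of 0.7.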


2019 ◽  
Vol 19 (25) ◽  
pp. 2301-2317 ◽  
Author(s):  
Ruirui Liang ◽  
Jiayang Xie ◽  
Chi Zhang ◽  
Mengying Zhang ◽  
Hai Huang ◽  
...  

In recent years, the successful implementation of the Human Genome Project has made people realize that genetic, environmental and lifestyle factors should be combined to study cancer due to the complexity and various forms of the disease. The increasing availability and growth rate of 'big data' derived from various omics opens a new window for the study and therapy of cancer. In this paper, we introduce the application of machine learning methods in handling cancer big data, including the use of artificial neural networks, support vector machines, ensemble learning and naïve Bayes classifiers.


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Jing Xu ◽  
Xiangdong Liu ◽  
Qiming Dai

Abstract Background Hypertrophic cardiomyopathy (HCM) represents one of the most common inherited heart diseases. To identify key molecules involved in the development of HCM, gene expression patterns of heart tissue samples from HCM patients across multiple microarray and RNA-seq platforms were investigated. Methods The significant genes were obtained through the intersection of two gene sets, corresponding to the differentially expressed genes (DEGs) identified within the microarray data and within the RNA-seq data. Those genes were further ranked using the minimum-redundancy maximum-relevance (mRMR) feature selection algorithm. Moreover, the genes were assessed by three different machine learning methods for classification: support vector machines, random forest, and k-nearest neighbors. Results Outstanding results were achieved by taking exclusively the top eight genes of the ranking into consideration. Since these eight genes were identified as candidate HCM hallmark genes, the interactions between them and known HCM disease genes were explored through the protein–protein interaction (PPI) network. Most candidate HCM hallmark genes were found to have direct or indirect interactions with known HCM disease genes in the PPI network, particularly the hub genes JAK2 and GADD45A. Conclusions This study highlights transcriptomic data integration, in combination with machine learning methods, as a means of providing insight into the key hallmark genes in the genetic etiology of HCM.
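The ranking-plus-classification step can be sketched as follows; mutual information is used here as a simple relevance criterion standing in for mRMR (which additionally penalizes redundancy among selected genes), and the expression data are simulated:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Toy expression matrix: 100 samples x 50 genes, a handful informative
X, y = make_classification(n_samples=100, n_features=50, n_informative=5,
                           random_state=5)

# Mutual information as a simple relevance criterion (a stand-in for mRMR)
mi = mutual_info_classif(X, y, random_state=5)
top8 = np.argsort(mi)[::-1][:8]

# Evaluate the top-8 "gene panel" with the three classifiers from the abstract
accs = {}
for name, clf in [("SVM", SVC()),
                  ("RF", RandomForestClassifier(random_state=5)),
                  ("KNN", KNeighborsClassifier())]:
    accs[name] = cross_val_score(clf, X[:, top8], y, cv=5).mean()
    print(f"{name}: 5-fold CV accuracy on top-8 genes = {accs[name]:.3f}")
```

Restricting the classifiers to a small, highly ranked panel mirrors the study's finding that the top eight genes alone already yield strong classification results.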


2021 ◽  
Vol 10 (4) ◽  
pp. 199
Author(s):  
Francisco M. Bellas Aláez ◽  
Jesus M. Torres Palenzuela ◽  
Evangelos Spyrakos ◽  
Luis González Vilas

This work presents new prediction models based on recent developments in machine learning methods, such as Random Forest (RF) and AdaBoost, and compares them with more classical approaches, i.e., support vector machines (SVMs) and neural networks (NNs). The models predict Pseudo-nitzschia spp. blooms in the Galician Rias Baixas. This work builds on a previous study by the authors (doi.org/10.1016/j.pocean.2014.03.003) but uses an extended database (from 2002 to 2012) and new algorithms. Our results show that RF and AdaBoost provide better prediction results compared to SVMs and NNs, as they show improved performance metrics and a better balance between sensitivity and specificity. Classical machine learning approaches show higher sensitivities, but at a cost of lower specificity and higher percentages of false alarms (lower precision). These results seem to indicate a greater adaptation of new algorithms (RF and AdaBoost) to unbalanced datasets. Our models could be operationally implemented to establish a short-term prediction system.
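The sensitivity/specificity trade-off discussed above can be made concrete on a simulated imbalanced dataset; all data here are illustrative, and the neural network is omitted for brevity:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Imbalanced toy "bloom / no-bloom" data: blooms (class 1) are the rare event
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.9, 0.1],
                           random_state=6)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=6)

metrics = {}
for name, clf in [("RF", RandomForestClassifier(random_state=6)),
                  ("AdaBoost", AdaBoostClassifier(random_state=6)),
                  ("SVM", SVC())]:
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    sens = tp / (tp + fn)      # bloom detection rate (sensitivity)
    spec = tn / (tn + fp)      # 1 - false-alarm rate on non-bloom samples
    prec = precision_score(y_te, pred, zero_division=0)
    metrics[name] = (sens, spec, prec)
    print(f"{name}: sensitivity={sens:.2f} specificity={spec:.2f} precision={prec:.2f}")
```

A model with high sensitivity but low specificity raises many false alarms (low precision), which is the imbalance the abstract attributes to the classical approaches.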


2021 ◽  
Vol 20 (1) ◽  
Author(s):  
Xiaoya Guo ◽  
Akiko Maehara ◽  
Mitsuaki Matsumura ◽  
Liang Wang ◽  
Jie Zheng ◽  
...  

Abstract Background Coronary plaque vulnerability prediction is difficult because plaque vulnerability is non-trivial to quantify, clinically available medical imaging modalities are insufficient to quantify thin cap thickness, prediction methods with high accuracy still need to be developed, and gold-standard data to validate vulnerability prediction are often not available. Patient follow-up intravascular ultrasound (IVUS), optical coherence tomography (OCT) and angiography data were acquired to construct 3D fluid–structure interaction (FSI) coronary models, and four machine learning methods were compared to identify the optimal method for predicting future plaque vulnerability. Methods Baseline and 10-month follow-up in vivo IVUS and OCT coronary plaque data were acquired from two arteries of one patient using IRB-approved protocols with informed consent obtained. IVUS- and OCT-based FSI models were constructed to obtain plaque wall stress/strain and wall shear stress. Forty-five slices were selected as the machine learning sample database for the vulnerability prediction study. Thirteen key morphological factors from the IVUS and OCT images and biomechanical factors from the FSI models were extracted from the 45 slices at baseline for analysis. The lipid percentage index (LPI), cap thickness index (CTI) and morphological plaque vulnerability index (MPVI) were quantified to measure plaque vulnerability. Four machine learning methods (least-squares support vector machine, discriminant analysis, random forest and ensemble learning) were employed to predict the changes of the three indices using all combinations of the 13 factors. A standard fivefold cross-validation procedure was used to evaluate prediction results. Results For LPI change prediction using the support vector machine, wall thickness was the optimal single-factor predictor with an area under the curve (AUC) of 0.883, and the optimal combinational-factor predictor achieved an AUC of 0.963. For CTI change prediction using discriminant analysis, minimum cap thickness was the optimal single-factor predictor with an AUC of 0.818, while the optimal combinational-factor predictor achieved an AUC of 0.836. Using random forest for predicting MPVI change, minimum cap thickness was the optimal single-factor predictor with an AUC of 0.785, and the optimal combinational-factor predictor achieved an AUC of 0.847. Conclusion This feasibility study demonstrated that machine learning methods could be used to accurately predict plaque vulnerability change based on morphological and biomechanical factors from multi-modality image-based FSI models. Large-scale studies are needed to verify our findings.
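The single-factor versus combinational-factor comparison under fivefold cross-validation can be sketched as follows; the dataset is simulated with the same shape (45 slices, 13 factors), and scikit-learn's plain SVC stands in for the least-squares SVM used in the study:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy stand-in for the 45-slice database with 13 morphological/biomechanical
# factors; shuffle=False keeps the informative columns first, so column 0 can
# play the role of a single-factor predictor
X, y = make_classification(n_samples=45, n_features=13, n_informative=5,
                           shuffle=False, random_state=7)

# Fivefold cross-validated AUC: one factor vs. all 13 factors combined
auc_single = cross_val_score(SVC(), X[:, [0]], y, cv=5, scoring="roc_auc").mean()
auc_all = cross_val_score(SVC(), X, y, cv=5, scoring="roc_auc").mean()
print(f"single-factor AUC: {auc_single:.3f}")
print(f"all-factor AUC:    {auc_all:.3f}")
```

Looping this comparison over all factor combinations, as the study does, identifies the optimal single-factor and combinational-factor predictors for each index.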


2021 ◽  
Vol 21 (S2) ◽  
Author(s):  
Huan Chen ◽  
Yingying Ma ◽  
Na Hong ◽  
Hao Wang ◽  
Longxiang Su ◽  
...  

Abstract Background Regional citrate anticoagulation (RCA) is an important local anticoagulation method during bedside continuous renal replacement therapy. To improve patient safety and achieve computer-assisted dose monitoring and control, we enrolled intensive care unit patients into a cohort, aiming to develop a data-driven machine learning model that gives early warning of citric acid overdose and provides adjustment suggestions on the citrate pumping rate and the 10% calcium gluconate input rate during RCA treatment. Methods Using patient age, gender, pumped citric acid dose, 5% NaHCO3 solvent, replacement fluid solvent, body temperature, and replacement fluid pH as clinical features, the models attempted to classify patients who received regional citrate anticoagulation into the correct outcome category. Four models, Adaboost, XGBoost, support vector machine (SVM) and a shallow neural network, were compared on their performance in predicting outcomes. Prediction results were evaluated using accuracy, precision, recall and F1-score. Results For classifying patients at the early stages of citric acid treatment, the accuracy of the neural network model was higher than that of Adaboost, XGBoost and SVM; the F1-score of the shallow neural network (90.77%) outperformed the other models (88.40%, 82.17% and 88.96% for Adaboost, XGBoost and SVM, respectively). Extended experiments and validation were further conducted using the MIMIC-III database: the F1-scores for the shallow neural network, Adaboost, XGBoost and SVM were 80.00%, 80.46%, 80.37% and 78.90%, and the AUCs were 0.8638, 0.8086, 0.8466 and 0.7919, respectively. Conclusion The results of this study demonstrated the feasibility and performance of machine learning methods for monitoring and adjusting local regional citrate anticoagulation, and further provide decision-making recommendations to clinicians at the point of care.
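Evaluating several classifiers with both F1-score and AUC, as done here, can be sketched as follows; the data and feature set are simulated stand-ins for the seven clinical features, and GradientBoostingClassifier stands in for XGBoost:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Toy stand-in for the seven clinical features; class 1 mimics citrate overdose
X, y = make_classification(n_samples=800, n_features=7, n_informative=5,
                           random_state=8)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=8)

classifiers = {
    "Adaboost": AdaBoostClassifier(random_state=8),
    "GB (XGBoost analogue)": GradientBoostingClassifier(random_state=8),
    "SVM": SVC(probability=True, random_state=8),
    "shallow NN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                                random_state=8),
}
results = {}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    f1 = f1_score(y_te, clf.predict(X_te))
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    results[name] = (f1, auc)
    print(f"{name}: F1={f1:.3f}  AUC={auc:.3f}")
```

Reporting F1 alongside AUC, as the abstract does, separates threshold-dependent performance from the classifier's overall ranking quality.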

