Non-H3 CDR template selection in antibody modeling through machine learning

PeerJ ◽

10.7717/peerj.6179 ◽

2019 ◽

Vol 7 ◽

pp. e6179 ◽

Cited By ~ 2

Author(s):

Xiyao Long ◽

Jeliazko R. Jeliazkov ◽

Jeffrey J. Gray

Keyword(s):

Machine Learning ◽

Homology Modeling ◽

De Novo ◽

Specific Binding ◽

Adaptive Immune System ◽

Gradient Boosting ◽

Extensive Literature ◽

Alignment Algorithm ◽

Cluster Membership ◽

Antibody Modeling

Antibodies are proteins generated by the adaptive immune system to recognize and counteract a plethora of pathogens through specific binding. This adaptive binding is mediated by structural diversity in the six complementarity-determining region (CDR) loops (H1, H2, H3, L1, L2 and L3), which also makes accurate structural modeling of CDRs challenging. Both homology and de novo modeling approaches have been used; to date, the former has achieved greater accuracy for the non-H3 loops. The homology modeling of non-H3 CDRs is more accurate because non-H3 CDR loops of the same length and type can be grouped into a few structural clusters. Most antibody-modeling suites utilize homology modeling for the non-H3 CDRs, differing only in the alignment algorithm and in whether and how they utilize structural clusters. While RosettaAntibody and SAbPred do not explicitly assign query CDR sequences to clusters, two other approaches, PIGS and Kotai Antibody Builder, utilize sequence-based rules to assign CDR sequences to clusters. Although the manually curated sequence rules can identify better structural templates, their curation requires extensive literature search and human effort, so they lag behind the deposition of new antibody structures and are infrequently updated. In this study, we propose a machine learning approach (Gradient Boosting Machine [GBM]) to learn the structural clusters of non-H3 CDRs from sequence alone. Compared to manual sequence rule curation, the GBM method simplifies feature selection and can easily integrate new data. We compare the classification results of the GBM method to those of RosettaAntibody in a 3-repeat 10-fold cross-validation (CV) scheme on the cluster-annotated antibody database PyIgClassify, and we observe an improvement in the classification accuracy of the concerned loops from 84.5% ± 0.24% to 88.16% ± 0.056%. The GBM models reduce the errors in specific cluster membership misclassifications when the involved clusters have relatively abundant data. Based on the factors identified, we suggest methods that can enrich structural classes with sparse data to further improve prediction accuracy in future studies.
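The 3-repeat 10-fold cross-validation scheme named in the abstract can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' pipeline: the toy labels, the `majority_class_predict` stand-in classifier, and every function name are assumptions made for the example. A real study would plug in a gradient boosting model trained on sequence features and would likely stratify the folds by cluster.

```python
import random
import statistics


def repeated_kfold_indices(n_samples, n_splits=10, n_repeats=3, seed=0):
    """Yield (repeat, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh shuffle for every repeat
        for fold in range(n_splits):
            test_idx = order[fold::n_splits]  # every n_splits-th sample
            test_set = set(test_idx)
            train_idx = [i for i in order if i not in test_set]
            yield repeat, train_idx, test_idx


def majority_class_predict(train_labels):
    """Hypothetical stand-in model: always predict the most common label."""
    return statistics.mode(train_labels)


def cross_validate(labels, n_splits=10, n_repeats=3, seed=0):
    """Return the mean and standard deviation of the per-repeat accuracy."""
    correct = [0] * n_repeats
    for repeat, train_idx, test_idx in repeated_kfold_indices(
        len(labels), n_splits, n_repeats, seed
    ):
        guess = majority_class_predict([labels[i] for i in train_idx])
        correct[repeat] += sum(labels[i] == guess for i in test_idx)
    accuracies = [c / len(labels) for c in correct]
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Toy cluster labels (made up for illustration): 70 loops in one cluster, 30 in another.
labels = ["cluster-A"] * 70 + ["cluster-B"] * 30
mean_acc, std_acc = cross_validate(labels)
print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```

Reporting the spread across repeats, rather than a single split, is what allows figures of the form "mean ± deviation" such as those quoted in the abstract.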


Non-H3 CDR template selection in antibody modeling through machine learning

10.7287/peerj.preprints.26996v1 ◽

2018 ◽

Author(s):

Xiyao Long ◽

Jeliazko R Jeliazkov ◽

Jeffrey J Gray

Keyword(s):

Machine Learning ◽

Homology Modeling ◽

De Novo ◽

Specific Binding ◽

Gradient Boosting ◽

Extensive Literature ◽

Alignment Algorithm ◽

Cluster Membership ◽

Sequence Rules ◽

Antibody Modeling

Antibodies are proteins generated by the adaptive immune system to recognize and counteract a plethora of pathogens through specific binding. This adaptive binding is mediated by structural diversity in the six complementarity-determining region (CDR) loops (H1, H2, H3, L1, L2 and L3), which also makes accurate structural modeling of CDRs challenging. Both homology and de novo modeling approaches have been used; to date, the former has achieved greater accuracy for the non-H3 loops. Homology modeling performs better for non-H3 CDRs because most non-H3 CDR loops of the same length and type can be grouped into a few structural clusters. Most antibody-modeling suites utilize homology modeling for the non-H3 CDRs, differing only in the alignment algorithm and in whether and how they utilize structural clusters. While RosettaAntibody and SAbPred do not explicitly assign query CDR sequences to clusters, two other approaches, PIGS and Kotai Antibody Builder, utilize sequence-based rules to assign CDR sequences to clusters. Although the manually curated sequence rules can identify better structural templates, their curation requires extensive literature search and human effort, so they lag behind the deposition of new antibody structures and are infrequently updated. In this study, we propose a machine learning approach (Gradient Boosting Machine [GBM]) to learn the structural clusters of non-H3 CDRs from sequence alone. We argue that, compared to manual sequence rule curation, the GBM method offers simpler feature selection and immediate integration of new data. We compare the classification results of the GBM method to those of RosettaAntibody in a 3-repeat 10-fold cross-validation scheme on the cluster-annotated antibody database PyIgClassify, and we observe an improvement in the classification accuracy from 78.8 ± 0.2% to 85.1 ± 0.2%. We find that the GBM models can reduce the errors in specific cluster membership misclassifications if the involved clusters have relatively abundant data. Based on the factors identified, we suggest that methods that enrich structural classes with sparse data could further improve prediction accuracy in future studies.


Non-H3 CDR template selection in antibody modeling through machine learning

10.7287/peerj.preprints.26996 ◽

2018 ◽

Cited By ~ 1

Author(s):

Xiyao Long ◽

Jeliazko R Jeliazkov ◽

Jeffrey J Gray

Keyword(s):

Machine Learning ◽

Homology Modeling ◽

De Novo ◽

Specific Binding ◽

Gradient Boosting ◽

Extensive Literature ◽

Alignment Algorithm ◽

Cluster Membership ◽

Sequence Rules ◽

Antibody Modeling

Antibodies are proteins generated by the adaptive immune system to recognize and counteract a plethora of pathogens through specific binding. This adaptive binding is mediated by structural diversity in the six complementarity-determining region (CDR) loops (H1, H2, H3, L1, L2 and L3), which also makes accurate structural modeling of CDRs challenging. Both homology and de novo modeling approaches have been used; to date, the former has achieved greater accuracy for the non-H3 loops. Homology modeling performs better for non-H3 CDRs because most non-H3 CDR loops of the same length and type can be grouped into a few structural clusters. Most antibody-modeling suites utilize homology modeling for the non-H3 CDRs, differing only in the alignment algorithm and in whether and how they utilize structural clusters. While RosettaAntibody and SAbPred do not explicitly assign query CDR sequences to clusters, two other approaches, PIGS and Kotai Antibody Builder, utilize sequence-based rules to assign CDR sequences to clusters. Although the manually curated sequence rules can identify better structural templates, their curation requires extensive literature search and human effort, so they lag behind the deposition of new antibody structures and are infrequently updated. In this study, we propose a machine learning approach (Gradient Boosting Machine [GBM]) to learn the structural clusters of non-H3 CDRs from sequence alone. We argue that, compared to manual sequence rule curation, the GBM method offers simpler feature selection and immediate integration of new data. We compare the classification results of the GBM method to those of RosettaAntibody in a 3-repeat 10-fold cross-validation scheme on the cluster-annotated antibody database PyIgClassify, and we observe an improvement in the classification accuracy from 78.8 ± 0.2% to 85.1 ± 0.2%. We find that the GBM models can reduce the errors in specific cluster membership misclassifications if the involved clusters have relatively abundant data. Based on the factors identified, we suggest that methods that enrich structural classes with sparse data could further improve prediction accuracy in future studies.


Reliable photometric membership (RPM) of galaxies in clusters – I. A machine learning method and its performance in the local universe

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/staa486 ◽

2020 ◽

Vol 493 (3) ◽

pp. 3429-3441

Author(s):

Paulo A A Lopes ◽

André L B Ribeiro

Keyword(s):

Machine Learning ◽

Galaxy Evolution ◽

Large Scale ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Support Vector ◽

Validation Data ◽

Membership Probability ◽

Cluster Membership ◽

Stochastic Gradient Boosting

ABSTRACT We introduce a new method to determine galaxy cluster membership based solely on photometric properties. We adopt a machine learning approach to recover a cluster membership probability from galaxy photometric parameters and finally derive a membership classification. After testing several machine learning techniques (such as stochastic gradient boosting, model averaged neural network and k-nearest neighbours), we found the support vector machine algorithm to perform best when applied to our data. Our training and validation data are from the Sloan Digital Sky Survey main sample. Hence, to be complete to $M_r^* + 3$, we limit our work to 30 clusters with $z_{\mathrm{phot\text{-}cl}} \le 0.045$. Masses ($M_{200}$) are larger than $\sim 0.6\times 10^{14} \, \mathrm{M}_{\odot }$ (most above $3\times 10^{14} \, \mathrm{M}_{\odot }$). Our results are derived taking into account all galaxies in the line of sight of each cluster, with no photometric redshift cuts or background corrections. Our method is non-parametric, making no assumptions on the number density or luminosity profiles of galaxies in clusters. Our approach delivers extremely accurate results (completeness, C $\sim$ 92 per cent and purity, P $\sim$ 87 per cent) within $R_{200}$, which is why we named our code reliable photometric membership. We discuss possible dependencies on magnitude, colour, and cluster mass. Finally, we present some applications of our method, stressing its impact on galaxy evolution and cosmological studies based on future large-scale surveys, such as eROSITA, EUCLID, and LSST.
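The completeness and purity figures quoted above can be illustrated with a short sketch. It assumes the usual definitions, which the abstract itself does not spell out: completeness is the fraction of true members that the classifier selects, and purity is the fraction of selected galaxies that are true members. The function name and the toy sample are invented for illustration.

```python
def completeness_and_purity(is_member, selected):
    """Return (completeness, purity) for a membership classification.

    is_member: true membership flags, one per galaxy
    selected:  classifier decisions, one per galaxy
    """
    true_pos = sum(1 for m, s in zip(is_member, selected) if m and s)
    n_members = sum(is_member)
    n_selected = sum(selected)
    completeness = true_pos / n_members if n_members else float("nan")
    purity = true_pos / n_selected if n_selected else float("nan")
    return completeness, purity


# Toy line-of-sight sample (made up): 4 true members, 4 interlopers.
is_member = [True, True, True, True, False, False, False, False]
selected = [True, True, True, False, True, False, False, False]
completeness, purity = completeness_and_purity(is_member, selected)
print(f"C = {completeness:.2f}, P = {purity:.2f}")
```

Under these definitions the two quantities correspond to recall and precision in general machine learning terminology.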


Forecasting US movies box office performances in Turkey using machine learning algorithms

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189120 ◽

2020 ◽

Vol 39 (5) ◽

pp. 6579-6590

Author(s):

Sandy Çağlıyor ◽

Başar Öztayşi ◽

Selime Sezgin

Keyword(s):

Machine Learning ◽

Global Economy ◽

Learning Algorithms ◽

Forecast Model ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

High Stakes ◽

Box Office ◽

Industry Forecast ◽

The Impact

The motion picture industry is one of the largest industries worldwide and has significant importance in the global economy. Considering the high stakes and high risks in the industry, forecast models and decision support systems are gaining importance. Several attempts have been made to estimate the theatrical performance of a movie before or at the early stages of its release. Nevertheless, these models are mostly used for predicting domestic performances, and the industry still struggles to predict box office performances in overseas markets. In this study, the aim is to design a forecast model using different machine learning algorithms to estimate the theatrical success of US movies in Turkey. From various sources, a dataset of 1559 movies is constructed. First, independent variables are grouped as pre-release, distributor type, and international distribution based on their characteristics. The number of attendances is discretized into three classes. Four popular machine learning algorithms (artificial neural networks, decision tree regression, gradient boosting tree, and random forest) are employed, and the impact of each group is observed by comparing the performance of the models. Then the number of target classes is increased to five and eight, and the results are compared with the previously developed models in the literature.


In silico Prediction of Inhibitory Constant of Thrombin Inhibitors Using Machine Learning

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207322666181220130232 ◽

2019 ◽

Vol 21 (9) ◽

pp. 662-669 ◽

Cited By ~ 1

Author(s):

Junnan Zhao ◽

Lu Zhu ◽

Weineng Zhou ◽

Lingfeng Yin ◽

Yuchen Wang ◽

...

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Regression Tree ◽

Large Data ◽

Thrombin Inhibitors ◽

Coagulation Cascade ◽

Gradient Boosting ◽

Support Vector ◽

Data Set ◽

Descriptor Selection

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade, which is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict Ki values of thrombin inhibitors based on a large data set by using machine learning methods. Because it can find non-intuitive regularities in high-dimensional datasets, machine learning can be used to build effective predictive models. A total of 6554 descriptors for each compound were collected and an efficient descriptor selection method was chosen to find the appropriate descriptors. Four different methods, including multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT) and Support Vector Machine (SVM), were implemented to build prediction models with these selected descriptors. Results: The SVM model was the best among these methods, with R² = 0.84 and MSE = 0.55 for the training set and R² = 0.83 and MSE = 0.56 for the test set. Several validation methods, such as the Y-randomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.


Predicting Undesired Treatment Outcome in Mental Healthcare: Machine Learning Study (Preprint)

10.2196/preprints.17235 ◽

2019 ◽

Author(s):

Kasper Van Mens ◽

Joran Lokkerbol ◽

Richard Janssen ◽

Robert de Lange ◽

Bea Tiemens

Keyword(s):

Machine Learning ◽

Treatment Outcome ◽

Mental Health Treatment ◽

Mental Healthcare ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Trade Off ◽

Trade Offs ◽

Outcome Monitoring ◽

Extreme Gradient Boosting

BACKGROUND It remains a challenge to predict which treatment will work for which patient in mental healthcare. OBJECTIVE In this study we compare machine learning algorithms to predict during treatment which patients will not benefit from brief mental health treatment and present trade-offs that must be considered before an algorithm can be used in clinical practice. METHODS Using an anonymized dataset containing routine outcome monitoring data from a mental healthcare organization in the Netherlands (n = 2,655), we applied three machine learning algorithms to predict treatment outcome. The algorithms were internally validated with cross-validation on a training sample (n = 1,860) and externally validated on an unseen test sample (n = 795). RESULTS The performance of the three algorithms did not significantly differ on the test set. With a default classification cut-off at 0.5 predicted probability, the extreme gradient boosting algorithm showed the highest positive predictive value (ppv) of 0.71 (0.61 – 0.77) with a sensitivity of 0.35 (0.29 – 0.41) and area under the curve of 0.78. A trade-off can be made between ppv and sensitivity by choosing different cut-off probabilities. With a cut-off at 0.63, the ppv increased to 0.87 and the sensitivity dropped to 0.17. With a cut-off at 0.38, the ppv decreased to 0.61 and the sensitivity increased to 0.57. CONCLUSIONS Machine learning can be used to predict treatment outcomes based on routine monitoring data. This allows practitioners to choose their own trade-off between being selective and more certain versus inclusive and less certain.
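The trade-off between ppv and sensitivity described in the abstract can be illustrated with a small sketch. The labels and predicted probabilities below are invented for the example and are unrelated to the study's data; only the three cut-off values (0.38, 0.5, 0.63) are taken from the abstract, and the function name is an assumption.

```python
def ppv_and_sensitivity(y_true, y_prob, cutoff):
    """Return (ppv, sensitivity) when probabilities >= cutoff count as positive."""
    predicted = [p >= cutoff for p in y_prob]
    tp = sum(1 for t, pred in zip(y_true, predicted) if t and pred)
    fp = sum(1 for t, pred in zip(y_true, predicted) if not t and pred)
    fn = sum(1 for t, pred in zip(y_true, predicted) if t and not pred)
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    return ppv, sensitivity


# Made-up outcomes (1 = undesired outcome) and predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.6, 0.4, 0.55, 0.3, 0.2, 0.1, 0.35, 0.45]

results = {c: ppv_and_sensitivity(y_true, y_prob, c) for c in (0.38, 0.5, 0.63)}
for cutoff, (ppv, sens) in results.items():
    print(f"cut-off {cutoff:.2f}: ppv = {ppv:.2f}, sensitivity = {sens:.2f}")
```

On this toy data, raising the cut-off makes the positive predictions more reliable while fewer true cases are caught, which is the same qualitative pattern the abstract reports.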


Evaluation of Three Different Machine Learning Methods for Object-Based Artificial Terrace Mapping—A Case Study of the Loess Plateau, China

Remote Sensing ◽

10.3390/rs13051021 ◽

2021 ◽

Vol 13 (5) ◽

pp. 1021

Author(s):

Hu Ding ◽

Jiaming Na ◽

Shangjing Jiang ◽

Jie Zhu ◽

Kai Liu ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Loess Plateau ◽

Water Conservation ◽

Nearest Neighbor ◽

Gradient Boosting ◽

K Nearest Neighbor ◽

The Loess Plateau ◽

Object Based ◽

Extreme Gradient Boosting

Artificial terraces are of great importance for agricultural production and soil and water conservation. Automatic high-accuracy mapping of artificial terraces is the basis of monitoring and related studies. Previous research achieved artificial terrace mapping based on high-resolution digital elevation models (DEMs) or imagery. Because contextual information is important for terrace mapping, object-based image analysis (OBIA) combined with machine learning (ML) technologies is widely used. However, the selection of an appropriate classifier is of great importance for the terrace mapping task. In this study, the performance of an integrated framework using OBIA and ML for terrace mapping was tested. A catchment, Zhifanggou, in the Loess Plateau, China, was used as the study area. First, optimized image segmentation was conducted. Then, features from the DEMs and imagery were extracted, and the correlations between the features were analyzed and ranked for classification. Finally, three commonly used ML classifiers, namely, extreme gradient boosting (XGBoost), random forest (RF), and k-nearest neighbor (KNN), were used for terrace mapping. The comparison with the ground truth, as delineated by field survey, indicated that random forest performed best, with a 95.60% overall accuracy (followed by 94.16% and 92.33% for XGBoost and KNN, respectively). The influence of class imbalance and feature selection is discussed. This work provides a credible framework for mapping artificial terraces.


Asymmetric Dopaminergic Dysfunction in Brain-First versus Body-First Parkinson’s Disease Subtypes

Journal of Parkinson's Disease ◽

10.3233/jpd-212761 ◽

2021 ◽

pp. 1-11

Author(s):

Karoline Knudsen ◽

Tatyana D. Fedorova ◽

Jacob Horsager ◽

Katrine B. Andersen ◽

Casper Skjærbæk ◽

...

Keyword(s):

Parkinson’s Disease ◽

De Novo ◽

Specific Binding ◽

Asymmetry Index ◽

The Body ◽

Specific Binding Ratio ◽

Dat Spect ◽

Vagal Innervation ◽

Nigrostriatal Degeneration

Background: We have hypothesized that Parkinson’s disease (PD) comprises two subtypes: brain-first, where pathogenic α-synuclein initially forms unilaterally in one hemisphere, leading to asymmetric nigrostriatal degeneration, and body-first, with initial enteric pathology that spreads through overlapping vagal innervation, leading to more symmetric brainstem involvement and hence more symmetric nigrostriatal degeneration. Isolated REM sleep behaviour disorder (iRBD) has been identified as a strong marker of the body-first type. Objective: To analyse striatal asymmetry in [18F]FDOPA PET and [123I]FP-CIT DaT SPECT data from iRBD patients, de novo PD patients with RBD (PD+RBD) and de novo PD patients without RBD (PD−RBD). These groups were defined as prodromal body-first, de novo body-first, and de novo brain-first, respectively. Methods: We included [18F]FDOPA PET scans from 21 iRBD patients, 11 de novo PD+RBD patients, 22 de novo PD−RBD patients, and 18 control subjects. Also, [123I]FP-CIT DaT SPECT data from iRBD and de novo PD patients with unknown RBD status from the PPMI dataset were analysed. Lowest putamen specific binding ratio and putamen asymmetry index (AI) were defined. Results: Nigrostriatal degeneration was significantly more symmetric in patients with RBD versus patients without RBD or with unknown RBD status in both FDOPA (p = 0.001) and DaT SPECT (p = 0.001) datasets. Conclusion: iRBD subjects and de novo PD+RBD patients present with significantly more symmetric nigrostriatal dopaminergic degeneration compared to de novo PD−RBD patients. The results support the hypothesis that body-first PD is characterized by more symmetric distribution, most likely due to more symmetric propagation of pathogenic α-synuclein compared to brain-first PD.


Potassium Ferro cyanide electrochemically detected by Differential Pulse and Square Wave Voltammetry in a Competition using Gradient Boosting as Machine Learning Algorithm

2020 4th International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT) ◽

10.1109/ismsit50672.2020.9254514 ◽

2020 ◽

Author(s):

Nemah Abu Shama ◽

Suleyman Asir ◽

Kamil Dimililer ◽

Mehmet Ozsoz

Keyword(s):

Machine Learning ◽

Square Wave Voltammetry ◽

Learning Algorithm ◽

Differential Pulse ◽

Gradient Boosting ◽

Machine Learning Algorithm ◽

Square Wave ◽

Wave Voltammetry


Improving the Resolution and Accuracy of Groundwater Level Anomalies Using the Machine Learning-Based Fusion Model in the North China Plain

Sensors ◽

10.3390/s21010046 ◽

2020 ◽

Vol 21 (1) ◽

pp. 46

Author(s):

Gangqiang Zhang ◽

Wei Zheng ◽

Wenjie Yin ◽

Weiwei Lei

Keyword(s):

Machine Learning ◽

Data Fusion ◽

Groundwater Level ◽

North China Plain ◽

North China ◽

Gradient Boosting ◽

Fusion Model ◽

The North ◽

The North China Plain ◽

Prediction Module

The launch of the GRACE satellites has provided a new avenue for studying terrestrial water storage anomalies (TWSA) with unprecedented accuracy. However, the coarse spatial resolution greatly limits its application in hydrological research on local scales. To overcome this limitation, this study develops a machine learning-based fusion model to obtain high-resolution (0.25°) groundwater level anomalies (GWLA) by integrating GRACE observations in the North China Plain. Specifically, the fusion model consists of three modules, namely the downscaling module, the data fusion module, and the prediction module. In terms of the downscaling module, the GRACE-Noah model outperforms traditional data-driven models (multiple linear regression and gradient boosting decision tree (GBDT)) with the correlation coefficient (CC) values from 0.24 to 0.78. With respect to the data fusion module, the groundwater level from 12 monitoring wells is incorporated with climate variables (precipitation, runoff, and evapotranspiration) using the GBDT algorithm, achieving satisfactory performance (mean values: CC: 0.97, RMSE: 1.10 m, and MAE: 0.87 m). By merging the downscaled TWSA and fused groundwater level based on the GBDT algorithm, the prediction module can predict the water level in specified pixels. The predicted groundwater level is validated against 6 in-situ groundwater level data sets in the study area. Compared to the downscaling module, there is a significant improvement in terms of CC metrics, on average, from 0.43 to 0.71. This study provides a feasible and accurate fusion model for downscaling GRACE observations and predicting groundwater level with improved accuracy.
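The three evaluation metrics quoted in the abstract (CC, RMSE and MAE) can be computed as in the sketch below. It assumes CC denotes the Pearson correlation coefficient, which the abstract does not state explicitly; the function names and the toy groundwater levels are invented for illustration.

```python
import math


def pearson_cc(observed, predicted):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def rmse(observed, predicted):
    """Root-mean-square error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


# Made-up groundwater levels in metres: in-situ observations vs. model output.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.5, 2.5, 4.0]
cc_value = pearson_cc(observed, predicted)
rmse_value = rmse(observed, predicted)
mae_value = mae(observed, predicted)
print(f"CC = {cc_value:.3f}, RMSE = {rmse_value:.3f} m, MAE = {mae_value:.3f} m")
```

CC measures how well the predicted series tracks the shape of the observations, while RMSE and MAE measure the size of the errors in the original units, so the three are usually reported together.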
