scholarly journals SAMbinder: A web server for predicting SAM binding residues of a protein from its amino acid sequence

2019 ◽  
Author(s):  
Piyush Agrawal ◽  
Gaurav Mishra ◽  
Gajendra P. S. Raghava

AbstractMotivationS-adenosyl-L-methionine (SAM) is one of the important cofactor present in the biological system and play a key role in many diseases. There is a need to develop a method for predicting SAM binding sites in a protein for designing drugs against SAM associated disease. Best of our knowledge, there is no method that can predict the binding site of SAM in a given protein sequence.ResultThis manuscript describes a method SAMbinder, developed for predicting SAM binding sites in a protein from its primary sequence. All models were trained, tested and evaluated on 145 SAM binding protein chains where no two chains have more than 40% sequence similarity. Firstly, models were developed using different machine learning techniques on a balanced dataset contain 2188 SAM interacting and an equal number of non-interacting residues. Our Random Forest based model developed using binary profile feature got maximum MCC 0.42 with AUROC 0.79 on the validation dataset. The performance of our models improved significantly from MCC 0.42 to 0.61, when evolutionary information in the form of PSSM profile is used as a feature. We also developed models on realistic dataset contains 2188 SAM interacting and 40029 non-interacting residues and got maximum MCC 0.61 with AUROC of 0.89. In order to evaluate the performance of our models, we used internal as well as external cross-validation technique.Availability and implementationhttps://webs.iiitd.edu.in/raghava/sambinder/.

Author(s):  
Anjali Dhall ◽  
Sumeet Patiyal ◽  
Neelam Sharma ◽  
Salman Sadullah Usmani ◽  
Gajendra P S Raghava

Abstract Interleukin 6 (IL-6) is a pro-inflammatory cytokine that stimulates acute phase responses, hematopoiesis and specific immune reactions. Recently, it was found that the IL-6 plays a vital role in the progression of COVID-19, which is responsible for the high mortality rate. In order to facilitate the scientific community to fight against COVID-19, we have developed a method for predicting IL-6 inducing peptides/epitopes. The models were trained and tested on experimentally validated 365 IL-6 inducing and 2991 non-inducing peptides extracted from the immune epitope database. Initially, 9149 features of each peptide were computed using Pfeature, which were reduced to 186 features using the SVC-L1 technique. These features were ranked based on their classification ability, and the top 10 features were used for developing prediction models. A wide range of machine learning techniques has been deployed to develop models. Random Forest-based model achieves a maximum AUROC of 0.84 and 0.83 on training and independent validation dataset, respectively. We have also identified IL-6 inducing peptides in different proteins of SARS-CoV-2, using our best models to design vaccine against COVID-19. A web server named as IL-6Pred and a standalone package has been developed for predicting, designing and screening of IL-6 inducing peptides (https://webs.iiitd.edu.in/raghava/il6pred/).


2021 ◽  
Author(s):  
Shipra Jain ◽  
Kawal Preet Kaur Malhotra ◽  
Sumeet Patiyal ◽  
Gajendra P.S. Raghava

Prostate-specific antigen (PSA) is a key biomarker, which is commonly used to screen patients of prostate cancer. There is a significant number of unnecessary biopsies that are performed every year, due to poor accuracy of PSA based biomarker. In this study, we identified alternate biomarkers based on gene expression that can be used to screen prostate cancer with high accuracy. All models were trained and test on gene expression profile of 500 prostate cancer and 51 normal samples. Numerous feature selection techniques have been used to identify potential biomarkers. These biomarkers have been used to develop various models using different machine learning techniques for predicting samples of prostate cancer. Our logistic regression-based model achieved highest AUROC 0.91 with accuracy 82.42% on validation dataset. We introduced a new approach called propensity index, where expression of gene is converted into propensity. Our propensity based approach improved the performance of classification models significantly and achieved AUROC 0.99 with accuracy 96.36% on validation dataset. We also identified and ranked selected genes which can be used to discriminate prostate cancer patients from health individuals with high accuracy. It was observed that single gene based biomarkers can only achieve accuracy around 90%. In this study, we got best performance using a panel of 10 genes; random forest model using propensity index.


2018 ◽  
Vol 43 (4) ◽  
pp. 335-357
Author(s):  
Łukasz Radliński

Abstract User satisfaction is an important feature of software quality. However, it was rarely studied in software engineering literature. By enhancing earlier research this paper focuses on predicting user satisfaction with machine learning techniques using software development data from an extended ISBSG dataset. This study involved building, evaluating and comparing a total of 15,600 prediction schemes. Each scheme consists of a different combination of its components: manual feature preselection, handling missing values, outlier elimination, value normalization, automated feature selection, and a classifier. The research procedure involved a 10-fold cross-validation and separate testing, both repeated 10 times, to train and to evaluate each prediction scheme. Achieved level of accuracy for best performing schemes expressed by Matthews correlation coefficient was about 0.5 in the cross-validation and about 0.5–0.6 in the testing stage. The study identified the most accurate settings for components of prediction schemes.


2018 ◽  
Author(s):  
Michiel Stock ◽  
Tapio Pahikkala ◽  
Antti Airola ◽  
Willem Waegeman ◽  
Bernard De Baets

AbstractMotivationSupervised machine learning techniques have traditionally been very successful at reconstructing biological networks, such as protein-ligand interaction, protein-protein interaction and gene regulatory networks. Recently, much emphasis has been placed on the correct evaluation of such supervised models. It is vital to distinguish between using the model to either predict new interactions in a given network or to predict interactions for a new vertex not present in the original network. Specific cross-validation schemes need to be used to assess the performance in such different prediction settings.ResultsWe present a series of leave-one-out cross-validation shortcuts to rapidly estimate the performance of state-of-the-art kernel-based network inference techniques.AvailabilityThe machine learning techniques with the algebraic shortcuts are implemented in the RLScore software package.


2016 ◽  
Author(s):  
Michael Powell ◽  
Mahan Hosseini ◽  
John Collins ◽  
Chloe Callahan-Flintoft ◽  
William Jones ◽  
...  

ABSTRACTMachine learning is a powerful set of techniques that has enhanced the abilities of neuroscientists to interpret information collected through EEG, fMRI, and MEG data. With these powerful techniques comes the danger of overfitting of hyper-parameters which can render results invalid, and cause a failure to generalize beyond the data set. We refer to this problem as ‘over-hyping’ and show that it is pernicious despite commonly used precautions. In particular, over-hyping occurs when an analysis is run repeatedly with slightly different analysis parameters and one set of results is selected based on the analysis. When this is done, the resulting method is unlikely to generalize to a new dataset, rendering it a partially, or perhaps even completely spurious result that will not be valid outside of the data used in the original analysis. While it is commonly assumed that cross-validation is an effective protection against such spurious results generated through overfitting or overhyping, this is not actually true. In this article, we show that both one-shot and iterative optimization of an analysis are prone to over-hyping, despite the use of cross-validation. We demonstrate that non-generalizable results can be obtained even on non-informative (i.e. random) data by modifying hyper-parameters in seemingly innocuous ways. We recommend a number of techniques for limiting over-hyping, such as lock-boxes, blind analyses, pre-registrations, and nested cross-validation. These techniques, are common in other fields that use machine learning, including computer science and physics. Adopting similar safeguards is critical for ensuring the robustness of machine-learning techniques in the neurosciences.


2021 ◽  
Author(s):  
Andrew M V Dadario ◽  
Christian Espinoza ◽  
Wellington Araujo Nogueira

Objective Anticipating fetal risk is a major factor in reducing child and maternal mortality and suffering. In this context cardiotocography (CTG) is a low cost, well established procedure that has been around for decades, despite lacking consensus regarding its impact on outcomes. Machine learning emerged as an option for automatic classification of CTG records, as previous studies showed expert level results, but often came at the price of reduced generalization potential. With that in mind, the present study sought to improve statistical rigor of evaluation towards real world application. Materials and Methods In this study, a dataset of 2126 CTG recordings labeled as normal, suspect or pathological by the consensus of three expert obstetricians was used to create a baseline random forest model. This was followed by creating a lightgbm model tuned using gaussian process regression and post processed using cross validation ensembling. Performance was assessed using the area under the precision-recall curve (AUPRC) metric over 100 experiment executions, each using a testing set comprised of 30% of data stratified by the class label. Results The best model was a cross validation ensemble of lightgbm models that yielded 95.82% AUPRC. Conclusions The model is shown to produce consistent expert level performance at a less than negligible cost. At an estimated 0.78 USD per million predictions the model can generate value in settings with CTG qualified personnel and all the more in their absence.


2020 ◽  
Vol 16 ◽  

Different currencies are being processed in money exchange shops and banks around the globe on a daily basis, where money exchange and transfer takes place. Identifying different currency is a difficult task and can lead to financial loss. There are approximately 180 currencies being used around the world, and each of them differ in color, size and texture. Thus, to correctly identify different currencies, a currency recognition systems needs to be designed. In this paper, we propose the design of an AlexNet based currency recognition system to recognize different international currency notes. We use 10-fold Cross Validation to obtain the cross-validation results of the AlexNet model. The features for the Alex model is extracted from the images back and front of each currency note. We also explore and implement deep learning models to compare the performance of the AlexNet model.


2021 ◽  
Author(s):  
Philip Fradkin ◽  
Adamo Young ◽  
Lazar Atanackovic ◽  
Brendan J Frey ◽  
Leo J Lee ◽  
...  

Molecular carcinogenicity is a preventable cause of cancer, however, most experimental testing of molecular compounds is an expensive and time consuming process, making high throughput experimental approaches infeasible. In recent years, there has been substantial progress in machine learning techniques for molecular property prediction. In this work, we propose a model for carcinogenicity prediction, CONCERTO, which uses a graph transformer in conjunction with a molecular fingerprint representation, trained on multi-round mutagenicity and carcinogenicity objectives. To train and validate CONCERTO, we augment the training dataset with more informative labels and utilize a larger external validation dataset. Extensive experiments demonstrate that our model yields results superior to alternate approaches for molecular carcinogenicity prediction.


Sign in / Sign up

Export Citation Format

Share Document