SAMbinder: A web server for predicting SAM binding residues of a protein from its amino acid sequence


10.1101/625806 ◽

2019 ◽

Cited By ~ 2

Author(s):

Piyush Agrawal ◽

Gaurav Mishra ◽

Gajendra P. S. Raghava

Keyword(s):

Binding Sites ◽

Protein Sequence ◽

Cross Validation ◽

Sequence Similarity ◽

Machine Learning Techniques ◽

Validation Dataset ◽

Evolutionary Information ◽

Learning Techniques ◽

Binding Residues ◽

Validation Technique

Abstract. Motivation: S-adenosyl-L-methionine (SAM) is an important cofactor in biological systems and plays a key role in many diseases. There is a need for a method that predicts SAM binding sites in a protein, to support the design of drugs against SAM-associated diseases. To the best of our knowledge, no existing method can predict the SAM binding sites of a given protein sequence. Results: This manuscript describes SAMbinder, a method developed for predicting SAM binding sites in a protein from its primary sequence. All models were trained, tested and evaluated on 145 SAM binding protein chains, no two of which share more than 40% sequence similarity. First, models were developed using different machine learning techniques on a balanced dataset containing 2188 SAM interacting and an equal number of non-interacting residues. Our Random Forest based model, developed using the binary profile feature, achieved a maximum MCC of 0.42 with AUROC 0.79 on the validation dataset. Performance improved significantly, from MCC 0.42 to 0.61, when evolutionary information in the form of a PSSM profile was used as a feature. We also developed models on a realistic dataset containing 2188 SAM interacting and 40029 non-interacting residues, and obtained a maximum MCC of 0.61 with AUROC of 0.89. To evaluate the performance of our models, we used internal as well as external cross-validation techniques. Availability and implementation: https://webs.iiitd.edu.in/raghava/sambinder/.
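The abstract above reports MCC alongside AUROC on a dataset with 2188 interacting and 40029 non-interacting residues. The sketch below illustrates why MCC is informative at that class ratio. It is not the authors' code: the "trivial" classifier is a hypothetical baseline invented for illustration, and only the two class sizes come from the abstract.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: MCC is taken as 0 when a row or column is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom

def accuracy(tp, tn, fp, fn):
    """Fraction of all residues labelled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Class sizes taken from the abstract's realistic dataset.
N_POS, N_NEG = 2188, 40029

# Hypothetical baseline: label every residue as non-interacting.
trivial = dict(tp=0, tn=N_NEG, fp=0, fn=N_POS)

print("accuracy:", round(accuracy(**trivial), 4))
print("MCC:", mcc(**trivial))
```

The baseline never finds a binding residue, yet its accuracy is high simply because non-interacting residues dominate. Its MCC is zero, which is why a correlation-style metric is a reasonable choice for this kind of dataset.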


Computer-aided prediction and design of IL-6 inducing peptides: IL-6 plays a crucial role in COVID-19

Briefings in Bioinformatics ◽

10.1093/bib/bbaa259 ◽

2020 ◽

Cited By ~ 2

Author(s):

Anjali Dhall ◽

Sumeet Patiyal ◽

Neelam Sharma ◽

Salman Sadullah Usmani ◽

Gajendra P S Raghava

Keyword(s):

Scientific Community ◽

Prediction Models ◽

Vital Role ◽

Machine Learning Techniques ◽

Validation Dataset ◽

Independent Validation ◽

Immune Epitope ◽

Learning Techniques ◽

Wide Range ◽

Immune Epitope Database

Abstract: Interleukin 6 (IL-6) is a pro-inflammatory cytokine that stimulates acute phase responses, hematopoiesis and specific immune reactions. Recently, it was found that IL-6 plays a vital role in the progression of COVID-19, which is responsible for the high mortality rate. To help the scientific community fight COVID-19, we have developed a method for predicting IL-6 inducing peptides/epitopes. The models were trained and tested on 365 experimentally validated IL-6 inducing and 2991 non-inducing peptides extracted from the immune epitope database. Initially, 9149 features of each peptide were computed using Pfeature, which were reduced to 186 features using the SVC-L1 technique. These features were ranked based on their classification ability, and the top 10 features were used for developing prediction models. A wide range of machine learning techniques has been deployed to develop models. The Random Forest-based model achieves a maximum AUROC of 0.84 and 0.83 on the training and independent validation datasets, respectively. We have also identified IL-6 inducing peptides in different proteins of SARS-CoV-2, using our best models to design a vaccine against COVID-19. A web server named IL-6Pred and a standalone package have been developed for predicting, designing and screening IL-6 inducing peptides (https://webs.iiitd.edu.in/raghava/il6pred/).


Application of different machine learning techniques in identifying features of protein sequence data

2016 1st India International Conference on Information Processing (IICIP) ◽

10.1109/iicip.2016.7975376 ◽

2016 ◽

Author(s):

Swati Mishra ◽

Mukesh Kumar ◽

Santanu Kumar Rath

Keyword(s):

Machine Learning ◽

Protein Sequence ◽

Sequence Data ◽

Machine Learning Techniques ◽

Learning Techniques ◽

Protein Sequence Data


A highly accurate model for screening prostate cancer using propensity index panel of ten genes

10.1101/2021.03.22.436371 ◽

2021 ◽

Author(s):

Shipra Jain ◽

Kawal Preet Kaur Malhotra ◽

Sumeet Patiyal ◽

Gajendra P.S. Raghava

Keyword(s):

Gene Expression ◽

Prostate Cancer ◽

Single Gene ◽

Specific Antigen ◽

High Accuracy ◽

Machine Learning Techniques ◽

Validation Dataset ◽

New Approach ◽

Learning Techniques ◽

Feature Selection Techniques

Prostate-specific antigen (PSA) is a key biomarker that is commonly used to screen patients for prostate cancer. A significant number of unnecessary biopsies are performed every year because of the poor accuracy of the PSA-based biomarker. In this study, we identified alternative biomarkers based on gene expression that can be used to screen for prostate cancer with high accuracy. All models were trained and tested on gene expression profiles of 500 prostate cancer and 51 normal samples. Numerous feature selection techniques were used to identify potential biomarkers. These biomarkers were used to develop various models using different machine learning techniques for predicting prostate cancer samples. Our logistic regression-based model achieved the highest AUROC, 0.91, with an accuracy of 82.42% on the validation dataset. We introduced a new approach called the propensity index, in which the expression of a gene is converted into a propensity. Our propensity-based approach improved the performance of classification models significantly and achieved an AUROC of 0.99 with an accuracy of 96.36% on the validation dataset. We also identified and ranked selected genes that can be used to discriminate prostate cancer patients from healthy individuals with high accuracy. It was observed that single-gene biomarkers can only achieve an accuracy of around 90%. In this study, we obtained the best performance with a random forest model using the propensity index on a panel of 10 genes.


Understanding the protein sequence and structural adaptation in extremophilic organisms through machine learning techniques

Physiological and Biotechnological Aspects of Extremophiles ◽

10.1016/b978-0-12-818322-9.00023-x ◽

2020 ◽

pp. 307-314

Author(s):

Abhigyan Nath ◽

S. Karthikeyan

Keyword(s):

Machine Learning ◽

Protein Sequence ◽

Machine Learning Techniques ◽

Structural Adaptation ◽

Learning Techniques


Predicting Aggregated User Satisfaction in Software Projects

Foundations of Computing and Decision Sciences ◽

10.1515/fcds-2018-0017 ◽

2018 ◽

Vol 43 (4) ◽

pp. 335-357

Author(s):

Łukasz Radliński

Keyword(s):

User Satisfaction ◽

Missing Values ◽

Cross Validation ◽

Machine Learning Techniques ◽

Software Projects ◽

Learning Techniques ◽

Prediction Scheme ◽

Testing Stage ◽

Development Data ◽

Research Procedure

Abstract: User satisfaction is an important feature of software quality. However, it has rarely been studied in the software engineering literature. Building on earlier research, this paper focuses on predicting user satisfaction with machine learning techniques using software development data from an extended ISBSG dataset. This study involved building, evaluating and comparing a total of 15,600 prediction schemes. Each scheme consists of a different combination of its components: manual feature preselection, handling missing values, outlier elimination, value normalization, automated feature selection, and a classifier. The research procedure involved a 10-fold cross-validation and separate testing, both repeated 10 times, to train and to evaluate each prediction scheme. The accuracy of the best-performing schemes, expressed by the Matthews correlation coefficient, was about 0.5 in the cross-validation and about 0.5–0.6 in the testing stage. The study identified the most accurate settings for components of prediction schemes.


Algebraic Shortcuts for Leave-One-Out Cross-Validation in Supervised Network Inference

10.1101/242321 ◽

2018 ◽

Author(s):

Michiel Stock ◽

Tapio Pahikkala ◽

Antti Airola ◽

Willem Waegeman ◽

Bernard De Baets

Keyword(s):

Machine Learning ◽

Biological Networks ◽

Regulatory Networks ◽

Network Inference ◽

Cross Validation ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Ligand Interaction ◽

Learning Techniques ◽

Leave One Out

Abstract. Motivation: Supervised machine learning techniques have traditionally been very successful at reconstructing biological networks, such as protein-ligand interaction, protein-protein interaction and gene regulatory networks. Recently, much emphasis has been placed on the correct evaluation of such supervised models. It is vital to distinguish between using the model to predict new interactions in a given network and using it to predict interactions for a new vertex not present in the original network. Specific cross-validation schemes need to be used to assess the performance in such different prediction settings. Results: We present a series of leave-one-out cross-validation shortcuts to rapidly estimate the performance of state-of-the-art kernel-based network inference techniques. Availability: The machine learning techniques with the algebraic shortcuts are implemented in the RLScore software package.
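To give a flavour of what an algebraic leave-one-out shortcut looks like, the sketch below uses the classical identity for ridge regression: the held-out residual equals the full-fit residual divided by one minus the corresponding hat-matrix diagonal. This is a textbook single-feature case chosen for illustration, not the pairwise shortcuts derived in the paper and not the RLScore API. All function names and data values are hypothetical.

```python
def ridge_fit(xs, ys, lam):
    """Single-feature ridge regression without intercept: w = Sxy / (Sxx + lam)."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)

def loo_residuals_naive(xs, ys, lam):
    """Refit the model n times, each time leaving one sample out."""
    residuals = []
    for i in range(len(xs)):
        w = ridge_fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], lam)
        residuals.append(ys[i] - w * xs[i])
    return residuals

def loo_residuals_shortcut(xs, ys, lam):
    """One fit only: e_i / (1 - h_ii), with h_ii = x_i^2 / (Sxx + lam)."""
    d = sum(x * x for x in xs) + lam
    w = ridge_fit(xs, ys, lam)
    return [(y - w * x) / (1 - x * x / d) for x, y in zip(xs, ys)]

# Illustrative toy data (assumed, not from the paper).
xs = [0.5, 1.0, 1.5, 2.0, 3.0]
ys = [1.1, 1.9, 3.2, 3.9, 6.1]
lam = 0.1  # regularisation strength; must be > 0 here

naive = loo_residuals_naive(xs, ys, lam)
fast = loo_residuals_shortcut(xs, ys, lam)
print(max(abs(a - b) for a, b in zip(naive, fast)))
```

The naive version performs one fit per sample, while the shortcut needs a single fit. The two agree up to floating-point error, which is the kind of saving such shortcuts aim for.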


I TRIED A BUNCH OF THINGS: THE DANGERS OF UNEXPECTED OVERFITTING IN CLASSIFICATION

10.1101/078816 ◽

2016 ◽

Cited By ~ 13

Author(s):

Michael Powell ◽

Mahan Hosseini ◽

John Collins ◽

Chloe Callahan-Flintoft ◽

William Jones ◽

...

Keyword(s):

Machine Learning ◽

Cross Validation ◽

Machine Learning Techniques ◽

Random Data ◽

Effective Protection ◽

Data Set ◽

Learning Techniques ◽

Original Analysis ◽

Spurious Result ◽

Spurious Results

Abstract: Machine learning is a powerful set of techniques that has enhanced the abilities of neuroscientists to interpret information collected through EEG, fMRI, and MEG data. With these powerful techniques comes the danger of overfitting of hyper-parameters, which can render results invalid and cause a failure to generalize beyond the data set. We refer to this problem as ‘over-hyping’ and show that it is pernicious despite commonly used precautions. In particular, over-hyping occurs when an analysis is run repeatedly with slightly different analysis parameters and one set of results is selected based on the analysis. When this is done, the resulting method is unlikely to generalize to a new dataset, rendering it a partially, or perhaps even completely, spurious result that will not be valid outside of the data used in the original analysis. While it is commonly assumed that cross-validation is an effective protection against such spurious results generated through overfitting or over-hyping, this is not actually true. In this article, we show that both one-shot and iterative optimization of an analysis are prone to over-hyping, despite the use of cross-validation. We demonstrate that non-generalizable results can be obtained even on non-informative (i.e. random) data by modifying hyper-parameters in seemingly innocuous ways. We recommend a number of techniques for limiting over-hyping, such as lock-boxes, blind analyses, pre-registrations, and nested cross-validation. These techniques are common in other fields that use machine learning, including computer science and physics. Adopting similar safeguards is critical for ensuring the robustness of machine-learning techniques in the neurosciences.
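The selection effect described in the abstract above can be sketched with purely random data. In this minimal illustration the "hyper-parameter" is simply which noise feature to use, each option is scored by 5-fold cross-validation, and the winner is then checked on a lock-box that played no part in the selection. The sample size, feature count and one-feature nearest-class-mean rule are all assumptions made for the sketch, not the authors' experiments.

```python
import random

def fit_threshold(xs, ys):
    """One-feature nearest-class-mean rule; returns (midpoint, direction)."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    m1, m0 = sum(pos) / len(pos), sum(neg) / len(neg)
    return (m0 + m1) / 2, (1 if m1 >= m0 else -1)

def predict(model, x):
    mid, direction = model
    return 1 if direction * (x - mid) > 0 else 0

def accuracy(model, xs, ys):
    return sum(predict(model, x) == y for x, y in zip(xs, ys)) / len(ys)

def cv_accuracy(xs, ys, k=5):
    """Plain k-fold cross-validated accuracy using interleaved folds."""
    n, scores = len(xs), []
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_idx = [i for i in range(n) if i not in test_idx]
        model = fit_threshold([xs[i] for i in train_idx],
                              [ys[i] for i in train_idx])
        held_out = sorted(test_idx)
        scores.append(accuracy(model, [xs[i] for i in held_out],
                               [ys[i] for i in held_out]))
    return sum(scores) / k

rng = random.Random(0)
N, N_FEATURES = 40, 200               # assumed sizes for the illustration
labels = [i % 2 for i in range(N)]    # balanced labels, unrelated to features

# Every feature is pure noise, so no setting can truly beat chance.
analysis = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(N_FEATURES)]
lockbox = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(N_FEATURES)]

# "Tune" by trying every feature and keeping the best cross-validated score.
cv_scores = [cv_accuracy(feature, labels) for feature in analysis]
best = max(range(N_FEATURES), key=lambda j: cv_scores[j])

# Evaluate the selected setting once on data never used for selection.
final_model = fit_threshold(analysis[best], labels)
lockbox_score = accuracy(final_model, lockbox[best], labels)

print("best cross-validated accuracy:", cv_scores[best])
print("lock-box accuracy:", lockbox_score)
```

Because the winning score is a maximum over many noisy estimates, it tends to sit above chance even though nothing is learnable. The lock-box score is an unbiased check because it is computed only once, after selection is finished.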


Classification of Fetal State through the application of Machine Learning techniques on Cardiotocography records: Towards Real World Application.

10.1101/2021.06.03.21255808 ◽

2021 ◽

Author(s):

Andrew M V Dadario ◽

Christian Espinoza ◽

Wellington Araujo Nogueira

Keyword(s):

Machine Learning ◽

Real World ◽

Cross Validation ◽

Low Cost ◽

Gaussian Process Regression ◽

Machine Learning Techniques ◽

Real World Application ◽

Learning Techniques ◽

Qualified Personnel

Objective: Anticipating fetal risk is a major factor in reducing child and maternal mortality and suffering. In this context, cardiotocography (CTG) is a low-cost, well-established procedure that has been around for decades, despite lacking consensus regarding its impact on outcomes. Machine learning emerged as an option for automatic classification of CTG records, as previous studies showed expert-level results, but these often came at the price of reduced generalization potential. With that in mind, the present study sought to improve the statistical rigor of evaluation towards real-world application. Materials and Methods: In this study, a dataset of 2126 CTG recordings labeled as normal, suspect or pathological by the consensus of three expert obstetricians was used to create a baseline random forest model. This was followed by creating a LightGBM model tuned using Gaussian process regression and post-processed using cross-validation ensembling. Performance was assessed using the area under the precision-recall curve (AUPRC) metric over 100 experiment executions, each using a testing set comprising 30% of the data, stratified by the class label. Results: The best model was a cross-validation ensemble of LightGBM models that yielded 95.82% AUPRC. Conclusions: The model is shown to produce consistent expert-level performance at negligible cost. At an estimated 0.78 USD per million predictions, the model can generate value in settings with CTG-qualified personnel and all the more in their absence.
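The evaluation protocol in the abstract above (100 executions, each holding out 30% of the data stratified by class label) can be sketched as follows. This is a minimal stdlib illustration, not the authors' pipeline: the helper name `stratified_split` and the toy class counts are hypothetical, and no model is trained here.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction, rng):
    """Split indices so each class keeps roughly the same share in the test set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        shuffled = idxs[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return sorted(train), sorted(test)

# Toy label distribution (assumed for illustration, not the real CTG counts).
toy_labels = ["normal"] * 70 + ["suspect"] * 20 + ["pathological"] * 10

rng = random.Random(0)
# 100 independent executions, each with its own stratified 70/30 split.
splits = [stratified_split(toy_labels, 0.3, rng) for _ in range(100)]

train_idx, test_idx = splits[0]
print(len(train_idx), len(test_idx))
```

Repeating the split many times gives a distribution of scores rather than a single number, which is what makes the reported performance easier to trust.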


Currency Recognition and Calculation System using Machine Learning Techniques

WSEAS TRANSACTIONS ON SIGNAL PROCESSING ◽

10.37394/232014.2020.16.5 ◽

2020 ◽

Vol 16 ◽

Keyword(s):

Cross Validation ◽

Recognition System ◽

Daily Basis ◽

Machine Learning Techniques ◽

Financial Loss ◽

Learning Techniques ◽

Currency Note ◽

Recognition Systems ◽

Calculation System ◽

Fold Cross Validation

Different currencies are processed daily in money exchange shops and banks around the globe, where money exchange and transfer take place. Identifying different currencies is a difficult task, and mistakes can lead to financial loss. There are approximately 180 currencies in use around the world, and each differs in color, size and texture. Thus, to correctly identify different currencies, a currency recognition system needs to be designed. In this paper, we propose the design of an AlexNet-based currency recognition system to recognize different international currency notes. We use 10-fold cross-validation to evaluate the AlexNet model. The features for the AlexNet model are extracted from images of the front and back of each currency note. We also explore and implement other deep learning models to compare their performance with that of the AlexNet model.


A graph neural network approach to molecule carcinogenicity prediction

10.1101/2021.11.10.468094 ◽

2021 ◽

Author(s):

Philip Fradkin ◽

Adamo Young ◽

Lazar Atanackovic ◽

Brendan J Frey ◽

Leo J Lee ◽

...

Keyword(s):

Experimental Testing ◽

External Validation ◽

Machine Learning Techniques ◽

Training Dataset ◽

Validation Dataset ◽

Molecular Fingerprint ◽

Neural Network Approach ◽

Learning Techniques ◽

Substantial Progress ◽

Experimental Approaches

Molecular carcinogenicity is a preventable cause of cancer; however, most experimental testing of molecular compounds is expensive and time-consuming, making high-throughput experimental approaches infeasible. In recent years, there has been substantial progress in machine learning techniques for molecular property prediction. In this work, we propose a model for carcinogenicity prediction, CONCERTO, which uses a graph transformer in conjunction with a molecular fingerprint representation, trained on multi-round mutagenicity and carcinogenicity objectives. To train and validate CONCERTO, we augment the training dataset with more informative labels and utilize a larger external validation dataset. Extensive experiments demonstrate that our model yields results superior to alternative approaches for molecular carcinogenicity prediction.
