DTI-MLCD: predicting drug-target interactions using multi-label learning with community detection method

Briefings in Bioinformatics ◽

10.1093/bib/bbaa205 ◽

2020 ◽

Cited By ~ 1

Author(s):

Yanyi Chu ◽

Xiaoqi Shan ◽

Tianhang Chen ◽

Mingming Jiang ◽

Yanjing Wang ◽

...

Keyword(s):

Machine Learning ◽

Community Detection ◽

Drug Target ◽

Drug Repositioning ◽

Binary Classification ◽

Predictive Performance ◽

Detection Methods ◽

Target Pair ◽

Data Sets ◽

Data Set

Abstract Identifying drug-target interactions (DTIs) is an important step for drug discovery and drug repositioning. To reduce the experimental cost, a large number of computational approaches have been proposed for this task. Machine learning-based models, especially binary classification models, have been developed to predict whether a drug-target pair interacts or not. However, there is still much room for improvement in the performance of current methods. Multi-label learning can overcome some difficulties caused by single-label learning and thereby improve predictive performance. The key challenge faced by multi-label learning is the exponential-sized output space, and considering label correlations can help to overcome this challenge. In this paper, we facilitate multi-label classification by introducing community detection methods for DTI prediction, in a method named DTI-MLCD. Moreover, we updated the gold standard data set by adding 15,000 more positive DTI samples compared with the data set that has been widely used by most previously published DTI prediction methods since 2008. The proposed DTI-MLCD is applied to both data sets, demonstrating its superiority over other machine learning methods and several existing methods. The data sets and source code of this study are freely available at https://github.com/a96123155/DTI-MLCD.
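
The idea of grouping correlated labels before multi-label learning can be sketched as follows. This is a minimal illustration, not the DTI-MLCD implementation: the function name `label_communities`, the `min_shared` parameter and the toy data are assumptions, and connected components over a label co-occurrence graph stand in for a real community detection algorithm.

```python
from collections import defaultdict
from itertools import combinations


def label_communities(interactions, min_shared=1):
    """Group targets (labels) that co-occur in at least `min_shared` drugs.

    `interactions` maps a drug ID to the set of targets it interacts with.
    Connected components are used here as a crude stand-in for a real
    community detection algorithm.
    """
    # Count how many drugs each pair of targets shares.
    shared = defaultdict(int)
    targets = set()
    for drug_targets in interactions.values():
        targets.update(drug_targets)
        for a, b in combinations(sorted(drug_targets), 2):
            shared[(a, b)] += 1

    # Build an undirected graph from sufficiently strong co-occurrences.
    neighbours = {t: set() for t in targets}
    for (a, b), count in shared.items():
        if count >= min_shared:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Collect connected components with an iterative depth-first search.
    communities, seen = [], set()
    for start in sorted(targets):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        communities.append(component)
    return communities


# Hypothetical toy data: drug -> interacting targets.
toy = {
    "d1": {"t1", "t2"},
    "d2": {"t2", "t3"},
    "d3": {"t4", "t5"},
    "d4": {"t6"},
}
communities = label_communities(toy)
```

Under this sketch, each community could be handled as its own smaller multi-label problem, so a model only has to consider label combinations within one community rather than across all targets.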

Predicting drug-target interactions using multi-label learning with community detection method (DTI-MLCD)

10.1101/2020.05.11.087734 ◽

2020 ◽

Cited By ~ 1

Author(s):

Yanyi Chu ◽

Xiaoqi Shan ◽

Dennis R. Salahub ◽

Yi Xiong ◽

Dong-Qing Wei

Keyword(s):

Machine Learning ◽

Community Detection ◽

Gold Standard ◽

Drug Target ◽

Drug Repositioning ◽

Binary Classification ◽

Predictive Performance ◽

Detection Methods ◽

Data Set ◽

Standard Data

Abstract Identifying drug-target interactions (DTIs) is an important step for drug discovery and drug repositioning. To reduce the heavy experimental cost, machine learning has been applied to this field, yielding many computational methods, especially binary classification methods. However, there is still much room for improvement in the performance of current methods. Multi-label learning can reduce the difficulties faced by binary classification while achieving high predictive performance, but it has not been explored extensively. The key challenge it faces is the exponential-sized output space, and considering label correlations can help to overcome it. Thus, we facilitate multi-label classification by introducing community detection methods for DTI prediction, in a method named DTI-MLCD. In addition, we updated the gold standard data set, which was proposed in 2008 and is still in use today. The proposed DTI-MLCD is evaluated on the gold standard data set before and after the update and outperforms other classical machine learning methods and other benchmark methods, which confirms its efficiency. The data and code for this study can be found at https://github.com/a96123155/DTI-MLCD.

Application of Machine Learning Techniques in Drug-Target Interactions Prediction

Current Pharmaceutical Design ◽

10.2174/1381612826666201125105730 ◽

2020 ◽

Vol 26 ◽

Author(s):

Shengli Zhang ◽

Jiesheng Wang ◽

Zhenhui Lin ◽

Yunyun Liang

Keyword(s):

Machine Learning ◽

Drug Target ◽

Drug Repositioning ◽

Machine Learning Techniques ◽

Data Sets ◽

Machine Learning Methods ◽

Applied Machine Learning ◽

Lab Experiments ◽

Learning Techniques ◽

Supervised Methods

Background: Drug-target interactions (DTIs) are vital for drug design and drug repositioning. However, traditional lab experiments are both expensive and time-consuming. Various computational methods that apply machine learning techniques have performed efficiently and effectively in this field. Results: Machine learning methods can be broadly divided into three categories: supervised, semi-supervised and unsupervised methods. We reviewed recent representative methods from each category that apply machine learning techniques to DTI prediction and compiled a brief list of databases frequently used in drug discovery. In addition, we compared the advantages and limitations of the methods in each category. Conclusion: Every prediction model has both strengths and weaknesses and should be applied appropriately. Three major problems in DTI prediction should be seriously considered: the lack of data sets of nonreactive drug-target pairs, overoptimistic results due to biases, and the use of regression models for DTI prediction.

A comparative evaluation of the generalised predictive ability of eight machine learning algorithms across ten clinical metabolomics data sets for binary classification

Metabolomics ◽

10.1007/s11306-019-1612-4 ◽

2019 ◽

Vol 15 (12) ◽

Cited By ~ 18

Author(s):

Kevin M. Mendez ◽

Stacey N. Reinke ◽

David I. Broadhurst

Keyword(s):

Machine Learning ◽

Gold Standard ◽

Binary Classification ◽

Learning Algorithms ◽

Predictive Ability ◽

Predictive Performance ◽

Machine Learning Algorithms ◽

Data Sets ◽

Metabolomics Data ◽

Machine Learning Methods

Abstract Introduction: Metabolomics is increasingly being used in the clinical setting for disease diagnosis, prognosis and risk prediction. Machine learning algorithms are particularly important in the construction of multivariate metabolite prediction. Historically, partial least squares (PLS) regression has been the gold standard for binary classification. Nonlinear machine learning methods such as random forests (RF), kernel support vector machines (SVM) and artificial neural networks (ANN) may be more suited to modelling possible nonlinear metabolite covariance, and thus provide better predictive models. Objectives: We hypothesise that for binary classification using metabolomics data, non-linear machine learning methods will provide superior generalised predictive ability when compared to linear alternatives, in particular when compared with the current gold standard PLS discriminant analysis. Methods: We compared the general predictive performance of eight archetypal machine learning algorithms across ten publicly available clinical metabolomics data sets. The algorithms were implemented in the Python programming language. All code and results have been made publicly available as Jupyter notebooks. Results: There was only marginal improvement in predictive ability for SVM and ANN over PLS across all data sets. RF performance was comparatively poor. The use of out-of-bag bootstrap confidence intervals provided a measure of uncertainty of model prediction such that the quality of metabolomics data was observed to be a bigger influence on generalised performance than model choice. Conclusion: The size of the data set, and choice of performance metric, had a greater influence on generalised predictive performance than the choice of machine learning algorithm.
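
The out-of-bag bootstrap confidence interval mentioned in the Results can be illustrated with a short sketch. It is not taken from the authors' notebooks: the nearest-centroid classifier, the function names and the toy data are assumptions chosen to keep the example dependency-free, and a simple percentile interval is used.

```python
import random


def nearest_centroid_predict(train_x, train_y, x):
    """Classify scalar `x` by the closer class mean (toy stand-in model)."""
    means = {}
    for label in set(train_y):
        values = [v for v, y in zip(train_x, train_y) if y == label]
        means[label] = sum(values) / len(values)
    return min(means, key=lambda label: abs(x - means[label]))


def oob_bootstrap_ci(xs, ys, n_boot=200, alpha=0.05, seed=0):
    """Percentile confidence interval for out-of-bag accuracy."""
    rng = random.Random(seed)
    n = len(xs)
    scores = []
    for _ in range(n_boot):
        in_bag = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        out_bag = sorted(set(range(n)) - set(in_bag))   # samples never drawn
        train_y = [ys[i] for i in in_bag]
        if not out_bag or len(set(train_y)) < 2:
            continue  # skip degenerate resamples
        train_x = [xs[i] for i in in_bag]
        hits = sum(
            nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]
            for i in out_bag
        )
        scores.append(hits / len(out_bag))
    scores.sort()
    lower = scores[int((alpha / 2) * (len(scores) - 1))]
    upper = scores[int((1 - alpha / 2) * (len(scores) - 1))]
    return lower, upper


# Hypothetical one-feature data set with overlapping classes.
xs = [0.1, 0.5, 0.9, 1.3, 1.1, 1.5, 1.9, 2.3]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
lower, upper = oob_bootstrap_ci(xs, ys)
```

A wide interval on a small data set signals that differences between algorithms may be smaller than the uncertainty of the estimate, which is the kind of comparison the abstract describes.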

Comprehensive assessment of machine learning-based methods for predicting antimicrobial peptides

Briefings in Bioinformatics ◽

10.1093/bib/bbab083 ◽

2021 ◽

Author(s):

Jing Xu ◽

Fuyi Li ◽

André Leier ◽

Dongxu Xiang ◽

Hsin-Hui Shen ◽

...

Keyword(s):

Machine Learning ◽

Antimicrobial Peptides ◽

Computational Methods ◽

Cross Validation ◽

Predictive Performance ◽

Support Vector ◽

Data Sets ◽

Learning Methods ◽

Data Set ◽

Machine Learning Methods

Abstract Antimicrobial peptides (AMPs) are a unique and diverse group of molecules that play a crucial role in a myriad of biological processes and cellular functions. AMP-related studies have become increasingly popular in recent years due to antimicrobial resistance, which is becoming an emerging global concern. Systematic experimental identification of AMPs faces many difficulties due to the limitations of current methods. Given its significance, more than 30 computational methods have been developed for accurate prediction of AMPs. These approaches show high diversity in their data set size, data quality, core algorithms, feature extraction, feature selection techniques and evaluation strategies. Here, we provide a comprehensive survey of a variety of current approaches for AMP identification and point out the differences between these methods. In addition, we evaluate the predictive performance of the surveyed tools based on an independent test data set containing 1536 AMPs and 1536 non-AMPs. Furthermore, we construct six validation data sets based on six different common AMP databases and compare different computational methods based on these data sets. The results indicate that amPEPpy achieves the best predictive performance and outperforms the other compared methods. As the predictive performances are affected by the different data sets used by different methods, we additionally perform a 5-fold cross-validation test to benchmark different traditional machine learning methods on the same data set. These cross-validation results indicate that random forest, support vector machine and eXtreme Gradient Boosting achieve comparatively better performance than other machine learning methods and are often the algorithms of choice in multiple AMP prediction tools.
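
The 5-fold cross-validation protocol used for the benchmark can be sketched in a few lines. This is a generic illustration rather than the authors' pipeline: `k_fold_indices` and `cross_validate` are hypothetical helper names, and the `fit` and `score` callables are supplied by the caller.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(fit, score, xs, ys, k=5, seed=0):
    """Average a score over k folds; `fit` and `score` are caller-supplied."""
    results = []
    for train, test in k_fold_indices(len(xs), k, seed):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        results.append(score(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(results) / len(results)


splits = list(k_fold_indices(12, k=5))
```

Running every method through the same splits (same `seed`) keeps the comparison on identical data, which is the reason the abstract gives for the additional benchmark.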

Building Damage Detection from Post-Event Aerial Imagery Using Single Shot Multibox Detector

Applied Sciences ◽

10.3390/app9061128 ◽

2019 ◽

Vol 9 (6) ◽

pp. 1128 ◽

Cited By ~ 12

Author(s):

Yundong Li ◽

Wei Hu ◽

Han Dong ◽

Xueyan Zhang

Keyword(s):

Machine Learning ◽

Data Augmentation ◽

Hurricane Sandy ◽

Training Data ◽

Aerial Images ◽

Detection Methods ◽

Single Shot ◽

Data Set ◽

Augmentation Strategies ◽

Post Disaster

Aerial cameras, satellite remote sensing and unmanned aerial vehicles (UAVs) equipped with cameras can facilitate search and rescue tasks after disasters. The traditional manual interpretation of huge aerial images is inefficient and could be replaced by machine learning-based methods combined with image processing techniques. Given the development of machine learning, researchers find that convolutional neural networks can effectively extract features from images. Some target detection methods based on deep learning, such as the single-shot multibox detector (SSD) algorithm, can achieve better results than traditional methods. However, the impressive performance of machine learning-based methods depends on large numbers of labeled samples. Given the complexity of post-disaster scenarios, obtaining many samples in the aftermath of disasters is difficult. To address this issue, this study proposes a damaged-building assessment method using SSD with pretraining and data augmentation, with the following highlights. (1) Objects can be detected and classified into undamaged buildings, damaged buildings, and ruins. (2) A convolution auto-encoder (CAE) that consists of VGG16 is constructed and trained using unlabeled post-disaster images. As a transfer learning strategy, the weights of the SSD model are initialized using the weights of the CAE counterpart. (3) Data augmentation strategies, such as image mirroring, rotation, Gaussian blur, and Gaussian noise processing, are utilized to augment the training data set. As a case study, aerial images of Hurricane Sandy in 2012 were used to validate the proposed method’s effectiveness. Experiments show that the pretraining strategy can improve overall accuracy by 10% compared with the SSD trained from scratch. These experiments also demonstrate that using data augmentation strategies can improve mAP and mF1 by 72% and 20%, respectively. Finally, the results are further verified on another data set from Hurricane Irma, and it is concluded that the proposed method is feasible.
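
Three of the augmentation strategies listed in point (3) can be sketched on a tiny grayscale grid. This is an illustrative sketch only: the function names, the `sigma` value and the 2x2 `tile` are assumptions, Gaussian blur is omitted for brevity, and images are plain nested lists rather than the arrays a real pipeline would use.

```python
import random


def mirror(image):
    """Flip a 2D grid of pixel values left to right."""
    return [row[::-1] for row in image]


def rotate90(image):
    """Rotate a 2D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def add_gaussian_noise(image, sigma=5.0, seed=0):
    """Add zero-mean Gaussian noise, clipping to the 0-255 range."""
    rng = random.Random(seed)
    return [
        [min(255.0, max(0.0, pixel + rng.gauss(0.0, sigma))) for pixel in row]
        for row in image
    ]


def augment(image):
    """Return the original image plus three augmented variants."""
    return [image, mirror(image), rotate90(image), add_gaussian_noise(image)]


# Hypothetical 2x2 grayscale tile.
tile = [[10, 20], [30, 40]]
variants = augment(tile)
```

For an object detector, the bounding-box annotations would need the same geometric transform as the image; that bookkeeping is left out of this sketch.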

An Optimized Stacking Ensemble Model for Phishing Websites Detection

Electronics ◽

10.3390/electronics10111285 ◽

2021 ◽

Vol 10 (11) ◽

pp. 1285

Author(s):

Mohammed Al-Sarem ◽

Faisal Saeed ◽

Zeyad Ghaleb Al-Mekhlafi ◽

Badiea Abdulkarem Mohammed ◽

Tawfik Al-Hadhrami ◽

...

Keyword(s):

Machine Learning ◽

Random Forests ◽

Ensemble Method ◽

Detection Methods ◽

Detection Accuracy ◽

Ensemble Model ◽

Security Attacks ◽

Data Set ◽

Machine Learning Methods ◽

Ensemble Machine Learning

Security attacks on legitimate websites to steal users’ information, known as phishing attacks, have been increasing. This kind of attack does not just affect individuals’ or organisations’ websites. Although several detection methods for phishing websites have been proposed using machine learning, deep learning, and other approaches, their detection accuracy still needs to be enhanced. This paper proposes an optimized stacking ensemble method for phishing website detection. The optimisation was carried out using a genetic algorithm (GA) to tune the parameters of several ensemble machine learning methods, including random forests, AdaBoost, XGBoost, Bagging, GradientBoost, and LightGBM. The optimized classifiers were then ranked, and the best three models were chosen as base classifiers of a stacking ensemble method. The experiments were conducted on three phishing website datasets that consisted of both phishing websites and legitimate websites: the Phishing Websites Data Set from UCI (Dataset 1); Phishing Dataset for Machine Learning from Mendeley (Dataset 2); and Datasets for Phishing Websites Detection from Mendeley (Dataset 3). The experimental results showed an improvement using the optimized stacking ensemble method, where the detection accuracy reached 97.16%, 98.58%, and 97.39% for Dataset 1, Dataset 2, and Dataset 3, respectively.
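
The GA-based parameter tuning step can be illustrated with a small elitist genetic search. This is a sketch under stated assumptions, not the paper's configuration: `genetic_search`, the population settings and `toy_fitness` are hypothetical, and the fitness function is a stand-in for the cross-validated accuracy of a real classifier.

```python
import random


def genetic_search(fitness, bounds, pop_size=20, generations=30, seed=0):
    """Maximise `fitness` over real-valued parameters within `bounds`."""
    rng = random.Random(seed)

    def random_individual():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def mutate(individual, rate=0.2):
        # Resample each gene with probability `rate`.
        return [
            rng.uniform(lo, hi) if rng.random() < rate else gene
            for gene, (lo, hi) in zip(individual, bounds)
        ]

    def crossover(a, b):
        # Uniform crossover: pick each gene from either parent.
        return [rng.choice(pair) for pair in zip(a, b)]

    population = [random_individual() for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        elite = population[: pop_size // 4]  # survivors kept unchanged
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite)))
            for _ in range(pop_size - len(elite))
        ]
        population = elite + children
    best = max(population, key=fitness)
    return best, history


# Hypothetical fitness: peaks at learning_rate=0.1, max_depth=6.
def toy_fitness(params):
    learning_rate, max_depth = params
    return -((learning_rate - 0.1) ** 2) - ((max_depth - 6.0) / 10.0) ** 2


best_params, history = genetic_search(toy_fitness, [(0.01, 1.0), (1.0, 12.0)])
```

Because the elite is carried over unchanged, the best fitness never decreases between generations. In a real setup, integer parameters such as a tree depth would be rounded before training, and the tuned classifiers would then be ranked to pick the base models for stacking.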

Generation of geometric interpolations of building types with deep variational autoencoders

Design Science ◽

10.1017/dsj.2020.31 ◽

2020 ◽

Vol 6 ◽

Author(s):

Jaime de Miguel Rodríguez ◽

Maria Eugenia Villafañe ◽

Luka Piškorec ◽

Fernando Sancho Caparrini

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Large Data ◽

Learning Model ◽

Large Data Sets ◽

Data Sets ◽

Connectivity Map ◽

Data Set ◽

3D Objects ◽

Machine Learning Model

Abstract This work presents a methodology for the generation of novel 3D objects resembling wireframes of building types. These result from the reconstruction of interpolated locations within the learnt distribution of variational autoencoders (VAEs), a deep generative machine learning model based on neural networks. The data set used features a scheme for geometry representation based on a ‘connectivity map’ that is especially suited to express the wireframe objects that compose it. Additionally, the input samples are generated through ‘parametric augmentation’, a strategy proposed in this study that creates coherent variations among data by enabling a set of parameters to alter representative features on a given building type. In the experiments that are described in this paper, more than 150 k input samples belonging to two building types have been processed during the training of a VAE model. The main contribution of this paper has been to explore parametric augmentation for the generation of large data sets of 3D geometries, showcasing its problems and limitations in the context of neural networks and VAEs. Results show that the generation of interpolated hybrid geometries is a challenging task. Despite the difficulty of the endeavour, promising advances are presented.
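
The interpolation between learnt locations can be sketched as linear interpolation between two latent vectors. This is a minimal sketch, assuming linear paths and a hypothetical two-dimensional latent space; the trained decoder that would turn each point into a wireframe is not shown.

```python
def interpolate_latent(z_a, z_b, steps=5):
    """Return `steps` evenly spaced points on the line from z_a to z_b."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    points = []
    for i in range(steps):
        t = i / (steps - 1)
        points.append([(1 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return points


# Hypothetical 2D latent codes for two building types.
path = interpolate_latent([0.0, 2.0], [4.0, 0.0], steps=5)
```

Each point in `path` would be passed through the VAE decoder to obtain a hybrid geometry between the two building types.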

Nowcasting heavy precipitation over the Netherlands using a 13-year radar archive: a machine learning approach

10.5194/egusphere-egu21-12814 ◽

2021 ◽

Author(s):

Eva van der Kooij ◽

Marc Schleiss ◽

Riccardo Taormina ◽

Francesco Fioranelli ◽

Dorien Lugt ◽

...

Keyword(s):

Machine Learning ◽

The Netherlands ◽

Heavy Rainfall ◽

Predictive Performance ◽

Heavy Precipitation ◽

Early Warning Systems ◽

Training Data ◽

Short Term ◽

Data Set ◽

Radar Images

Accurate short-term forecasts, also known as nowcasts, of heavy precipitation are desirable for creating early warning systems for extreme weather and its consequences, e.g. urban flooding. In this research, we explore the use of machine learning for short-term prediction of heavy rainfall showers in the Netherlands. We assess the performance of a recurrent, convolutional neural network (TrajGRU) with lead times of 0 to 2 hours. The network is trained on a 13-year archive of radar images with 5-min temporal and 1-km spatial resolution from the precipitation radars of the Royal Netherlands Meteorological Institute (KNMI). We aim to train the model to predict the formation and dissipation of dynamic, heavy, localized rain events, a task for which traditional Lagrangian nowcasting methods still come up short. We report on different ways to optimize predictive performance for heavy rainfall intensities through several experiments. The large dataset available provides many possible configurations for training. To focus on heavy rainfall intensities, we use different subsets of this dataset by applying different conditions for event selection and varying the ratio of light and heavy precipitation events present in the training data set, and we change the loss function used to train the model. To assess the performance of the model, we compare our method to current state-of-the-art Lagrangian nowcasting systems from the pySTEPS library, such as S-PROG, a deterministic approximation of an ensemble mean forecast. The results of the experiments are used to discuss the pros and cons of machine learning-based methods for precipitation nowcasting and possible ways to further increase performance.
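
One way to change the loss function so that it focuses on heavy rainfall is to weight errors on heavy-rain observations more strongly. The sketch below is an assumption about what such a loss could look like, not the loss used in the study; `weighted_mse`, `threshold` and `heavy_weight` are hypothetical names and values.

```python
def weighted_mse(observed, predicted, threshold=10.0, heavy_weight=5.0):
    """Mean squared error with extra weight on heavy-rain observations.

    `threshold` (mm/h) and `heavy_weight` are illustrative values only.
    """
    total, weight_sum = 0.0, 0.0
    for obs, pred in zip(observed, predicted):
        weight = heavy_weight if obs >= threshold else 1.0
        total += weight * (pred - obs) ** 2
        weight_sum += weight
    return total / weight_sum


# One light-rain pixel (error 1) and one heavy-rain pixel (error -2).
loss = weighted_mse([0.0, 10.0], [1.0, 8.0], threshold=5.0, heavy_weight=4.0)
```

Here the heavy-rain error contributes 4 x 4 = 16 to the weighted sum and the light-rain error contributes 1, so the loss is 17 / 5 = 3.4, compared with 2.5 for an unweighted mean squared error.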

A systematic review of machine learning-based missing value imputation techniques

Data Technologies and Applications ◽

10.1108/dta-12-2020-0298 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Tressy Thomas ◽

Enayat Rajabi

Keyword(s):

Machine Learning ◽

Selection Process ◽

Evaluation Metrics ◽

Correct Prediction ◽

Data Sets ◽

Data Set ◽

Missing Value ◽

Content Type ◽

Missing Value Imputation ◽

Literature Reviews

Purpose: The primary aim of this study is to review the studies from different dimensions, including type of methods, experimentation setup and evaluation metrics used in the novel approaches proposed for data imputation, particularly in the machine learning (ML) area. This ultimately provides an understanding of how well the proposed frameworks are evaluated and what type and ratio of missingness are addressed in the proposals. The review questions in this study are: (1) What are the ML-based imputation methods studied and proposed during 2010–2020? (2) How are the experimentation setup, characteristics of data sets and missingness employed in these studies? (3) What metrics were used for the evaluation of the imputation methods? Design/methodology/approach: The review process went through the standard identification, screening and selection process. The initial search on electronic databases for missing value imputation (MVI) based on ML algorithms returned a large number of papers, totaling 2,883. Most of the papers at this stage were not exactly an MVI technique relevant to this study. The papers were first screened by title for relevance, and 306 were identified as appropriate. Upon reviewing the abstract text, 151 papers that were not eligible for this study were dropped. This resulted in 155 research papers suitable for full-text review. From this, 117 papers are used in the assessment of the review questions. Findings: This study shows that clustering- and instance-based algorithms are the most proposed MVI methods. Percentage of correct prediction (PCP) and root mean square error (RMSE) are the most used evaluation metrics in these studies. For experimentation, the majority of the studies sourced the data sets from publicly available data set repositories. A common approach is that the complete data set is set as the baseline to evaluate the effectiveness of imputation on the test data sets with artificially induced missingness. The data set size and missingness ratio varied across the experiments, while the missing data type and mechanism pertain to the capability of imputation. Computational expense is a concern, and experimentation using large data sets appears to be a challenge. Originality/value: It is understood from the review that there is no single universal solution to the missing data problem. Variants of ML approaches work well with the missingness based on the characteristics of the data set. Most of the methods reviewed lack generalization with regard to applicability. Another concern related to applicability is the complexity of the formulation and implementation of the algorithm. Imputation methods based on k-nearest neighbors (kNN) and clustering algorithms are simple and easy to implement, which makes them popular across various domains.
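
The common evaluation approach described in the Findings (mask cells of a complete data set, impute them, then compare against the known values with RMSE) can be sketched with a simple kNN imputer. This is an illustrative sketch, not a method from any reviewed paper: `knn_impute`, the distance definition and the toy data are assumptions.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows.

    Distance uses only the columns observed in both rows.
    """
    def distance(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows where this column is observed.
            donors = [r for j, r in enumerate(rows) if j != i and r[col] is not None]
            donors.sort(key=lambda r: distance(row, r))
            nearest = donors[:k]
            imputed[i][col] = sum(r[col] for r in nearest) / len(nearest)
    return imputed


def rmse(truth, estimate):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(truth, estimate)) / len(truth))


# Complete toy data set used as the baseline.
complete = [[1.0, 2.0], [1.1, 2.2], [5.0, 10.0], [5.2, 10.4]]
# Artificially induce missingness in two cells.
masked = [[1.0, None], [1.1, 2.2], [5.0, 10.0], [5.2, None]]
filled = knn_impute(masked, k=1)
error = rmse([2.0, 10.4], [filled[0][1], filled[3][1]])
```

The sketch assumes every column has at least one observed donor row; a column that is missing everywhere would need separate handling.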

Birds Sound Classification Based on Machine Learning Algorithms

Asian Journal of Research in Computer Science ◽

10.9734/ajrcos/2021/v9i430227 ◽

2021 ◽

pp. 1-11

Author(s):

Aska E. Mehyadin ◽

Adnan Mohsin Abdulazeez ◽

Dathar Abas Hasan ◽

Jwan N. Saeed

Keyword(s):

Machine Learning ◽

Noise Suppression ◽

Bird Species ◽

Machine Learning Algorithms ◽

Data Sets ◽

Learning Technology ◽

Species Classification ◽

Data Set ◽

Sound Classification ◽

Mel Frequency Cepstral Coefficient

The bird classifier is a system equipped with machine learning technology that uses a machine learning method to store and classify bird calls. Bird species can be identified from a recording of the bird's sound alone, which makes the task easier for the system to manage. The system also provides species classification resources that allow automated species detection from observations, which can teach a machine how to recognize or classify the species. Undesirable noises are filtered out and the sounds are sorted into data sets, where each sound is run through a noise suppression filter and a separate classification procedure so that the most useful data set can be easily processed. Mel-frequency cepstral coefficients (MFCC) are used as features and tested with different algorithms, namely Naïve Bayes, J4.8 and multilayer perceptron (MLP), to classify bird species. J4.8 achieves the highest accuracy (78.40%), with an elapsed time of 39.4 seconds.
