Analysis of Expression Pattern of snoRNAs in Different Cancer Types with Machine Learning Algorithms

2019 ◽  
Vol 20 (9) ◽  
pp. 2185 ◽  
Author(s):  
Xiaoyong Pan ◽  
Lei Chen ◽  
Kai-Yan Feng ◽  
Xiao-Hua Hu ◽  
Yu-Hang Zhang ◽  
...  

Small nucleolar RNAs (snoRNAs) are a class of functional small RNAs involved in the chemical modification of rRNAs, tRNAs, and small nuclear RNAs. They are reported to play important roles in tumorigenesis via various regulatory modes: snoRNAs can both participate in the regulation of methylation and pseudouridylation and regulate the expression of their host genes. This research investigated the expression pattern of snoRNAs in eight major cancer types in TCGA via several machine learning algorithms. The expression levels of snoRNAs were first analyzed by a powerful feature selection method, Monte Carlo feature selection (MCFS), yielding a ranked feature list and a set of informative features. Then, incremental feature selection (IFS) was applied to the feature list to extract the optimal features/snoRNAs, with which the support vector machine (SVM) achieved its best performance. The discriminative snoRNAs included HBII-52-14, HBII-336, SNORD123, HBII-85-29, HBII-420, U3, HBI-43, SNORD116, SNORA73B, SCARNA4, HBII-85-20, etc., on which the SVM provided a Matthews correlation coefficient (MCC) of 0.881 for predicting these eight cancer types. In addition, the informative features were fed into the Johnson reducer and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) algorithms to generate classification rules, which clearly show the different snoRNA expression patterns across cancer types. The results indicate that the extracted discriminative snoRNAs are important for distinguishing samples of different cancer types, and that the expression pattern of snoRNAs in different cancer types can be partly uncovered by quantitative recognition rules.
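The IFS step described above amounts to a simple loop: grow the feature set along the ranked list one feature at a time and keep the prefix on which the classifier scores best. The sketch below is a minimal illustration, not the authors' code; the `evaluate` callback stands in for cross-validated SVM scoring, and the toy scores are invented for the demo (only the snoRNA names come from the abstract).

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Grow the feature set one ranked feature at a time and keep the
    prefix that maximizes the evaluation score (e.g. MCC of an SVM)."""
    best_score, best_subset = float("-inf"), []
    subset = []
    for feat in ranked_features:
        subset.append(feat)
        score = evaluate(subset)
        if score > best_score:
            best_score, best_subset = score, list(subset)
    return best_subset, best_score

# Toy demo: the invented score peaks once three features are included.
scores = {1: 0.60, 2: 0.75, 3: 0.88, 4: 0.86, 5: 0.85}
demo = incremental_feature_selection(
    ["HBII-52-14", "HBII-336", "SNORD123", "HBII-85-29", "HBII-420"],
    lambda subset: scores[len(subset)],
)
print(demo)  # (['HBII-52-14', 'HBII-336', 'SNORD123'], 0.88)
```

In practice the callback would retrain and cross-validate the classifier on each candidate prefix, which is why IFS is usually run on an already-ranked (and truncated) feature list.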

2020 ◽  
Vol 2020 ◽  
pp. 1-9
Author(s):  
Min Li ◽  
XiaoYong Pan ◽  
Tao Zeng ◽  
Yu-Hang Zhang ◽  
Kaiyan Feng ◽  
...  

Among various risk factors for the initiation and progression of cancer, alternative polyadenylation (APA) is a remarkable endogenous contributor that directly triggers the malignant phenotype of cancer cells. APA affects biological processes at the transcriptional level in various ways; as such, it can be involved in tumorigenesis through gene expression, protein subcellular localization, or transcript splicing patterns. The APA sites and status of different cancer types may have diverse modification patterns and regulatory mechanisms on transcripts. Potential APA sites were screened by applying several machine learning algorithms to a TCGA-APA dataset. First, a powerful feature selection method, minimum redundancy maximum relevance (mRMR), was applied to the dataset, producing a ranked feature list. Then, the feature list was fed into incremental feature selection, which incorporated the support vector machine as the classification algorithm, to extract key APA features and build a classifier. The classifier classified cancer patients into cancer types with perfect performance. The key APA-modified genes showed potential prognostic value, given their significance in the survival analysis of TCGA pan-cancer data.
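The greedy mRMR ranking used here can be sketched in a few lines: at each step, pick the feature with the highest mutual information with the class labels, penalized by its mean mutual information with the features already chosen. This is a minimal discrete-variable sketch, not the study's implementation; the toy feature values are invented (A is informative, B is a redundant copy of A, C carries weaker but independent signal).

```python
from collections import Counter
from math import log2

def mutual_info(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mrmr_rank(features, labels):
    """Greedy mRMR: maximize relevance to the labels minus mean
    redundancy with the already-selected features."""
    remaining, selected = dict(features), []
    while remaining:
        def score(name):
            rel = mutual_info(remaining[name], labels)
            red = (sum(mutual_info(remaining[name], features[s]) for s in selected)
                   / len(selected)) if selected else 0.0
            return rel - red
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

labels = [0, 0, 0, 1, 1, 1]
features = {"A": [0, 0, 1, 1, 1, 1],   # informative
            "B": [0, 0, 1, 1, 1, 1],   # exact copy of A, so redundant
            "C": [0, 1, 0, 0, 1, 1]}   # weaker but independent signal
print(mrmr_rank(features, labels))  # ['A', 'C', 'B']
```

Note how the redundancy penalty pushes the duplicate feature B to the bottom even though, taken alone, it is exactly as relevant as A.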


Author(s):  
Harsha A K

Abstract: Since the advent of encryption, there has been a steady increase in malware being transmitted over encrypted networks. Traditional approaches to malware detection, such as packet content analysis, are inefficient in dealing with encrypted data. In the absence of actual packet contents, we can make use of other features such as packet size, arrival time, source and destination addresses, and similar metadata to detect malware. Such information can be used to train machine learning classifiers to distinguish malicious from benign packets. In this paper, we offer an efficient malware detection approach using classification algorithms in machine learning, namely support vector machine, random forest, and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset. The dataset is then split into training and testing sets. Machine learning models are trained on the training set and evaluated against the testing set to assess their respective performances. We further tune the hyperparameters of the algorithms to achieve better results. The random forest and extreme gradient boosting algorithms performed exceptionally well in our experiments, yielding area-under-the-curve values of 0.9928 and 0.9998 respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms, and shows the importance of dimensionality reduction in such classification problems. Keywords: Malware Detection, Extreme Gradient Boosting, Random Forest, Feature Selection.
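The area-under-the-curve values reported can be computed without plotting the ROC at all, via the rank (Mann-Whitney) formulation: AUC is the probability that a randomly chosen malicious packet is scored above a randomly chosen benign one, counting ties as half. A small illustration with invented scores, not tied to the paper's dataset:

```python
def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney statistic: the fraction of
    positive/negative pairs in which the positive is scored higher
    (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# One positive is ranked below one negative: 3 of 4 pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.9998, as reported for extreme gradient boosting, means essentially every malicious/benign pair is ordered correctly by the model's score.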


2020 ◽  
Vol 9 (9) ◽  
pp. 507
Author(s):  
Sanjiwana Arjasakusuma ◽  
Sandiaga Swahyu Kusuma ◽  
Stuart Phinn

Machine learning has been employed for various mapping and modeling tasks using input variables from different sources of remote sensing data. For feature selection involving data of high spatial and spectral dimensionality, various methods have been developed and incorporated into the machine learning framework to ensure an efficient and optimal computational process. This research aims to assess the accuracy of various feature selection and machine learning methods for estimating forest height using AISA (airborne imaging spectrometer for applications) hyperspectral bands (479 bands) and airborne light detection and ranging (lidar) height metrics (36 metrics), alone and combined. Feature selection and dimensionality reduction using Boruta (BO), principal component analysis (PCA), simulated annealing (SA), and genetic algorithm (GA), in combination with machine learning algorithms such as multivariate adaptive regression splines (MARS), extra trees (ET), support vector regression (SVR) with radial basis function, and extreme gradient boosting (XGB) with tree (XGBtree and XGBdart) and linear (XGBlin) classifiers, were evaluated. The results demonstrated that the combinations BO-XGBdart and BO-SVR delivered the best model performance for estimating tropical forest height from combined lidar and hyperspectral data, with R2 = 0.53 and RMSE = 1.7 m (nRMSE 18.4%, bias 0.046 m) for BO-XGBdart and R2 = 0.51 and RMSE = 1.8 m (nRMSE 15.8%, bias −0.244 m) for BO-SVR. Our study also demonstrated the effectiveness of BO for variable selection: it discarded roughly 95% of the data, selecting the 29 most important of the initial 516 variables from the lidar metrics and hyperspectral bands.
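The error figures quoted (RMSE, nRMSE, bias) follow standard definitions; the sketch below assumes nRMSE is RMSE normalized by the observed range (conventions vary, and some studies normalize by the mean instead), with invented height values for the demo.

```python
from math import sqrt

def regression_metrics(obs, pred):
    """RMSE, nRMSE (RMSE over the observed range, in percent), and
    bias (mean of predicted minus observed)."""
    n = len(obs)
    bias = sum(p - o for o, p in zip(obs, pred)) / n
    rmse = sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / n)
    nrmse = 100 * rmse / (max(obs) - min(obs))
    return rmse, nrmse, bias

# Invented canopy heights (m): errors of +1/-1 m cancel in the bias.
rmse, nrmse, bias = regression_metrics([10, 12, 14, 16], [11, 11, 15, 15])
print(rmse, round(nrmse, 1), bias)  # 1.0 16.7 0.0
```

The sign convention matters when reading the paper's bias values: a negative bias, as for BO-SVR (−0.244 m), means the model underestimates forest height on average.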


2021 ◽  
Vol 11 ◽  
Author(s):  
Qi Wan ◽  
Jiaxuan Zhou ◽  
Xiaoying Xia ◽  
Jianfeng Hu ◽  
Peng Wang ◽  
...  

Objective: To evaluate the performance of 2D and 3D radiomics features with different machine learning approaches to classify SPLs based on magnetic resonance (MR) T2-weighted imaging (T2WI). Materials and Methods: A total of 132 patients with pathologically confirmed SPLs were examined and randomly divided into training (n = 92) and test (n = 40) datasets. A total of 1692 3D and 1231 2D radiomics features per patient were extracted. Both radiomics features and clinical data were evaluated. A total of 1260 classification models, comprising 3 normalization methods, 2 dimension reduction algorithms, 3 feature selection methods, and 10 classifiers with 7 different feature numbers (confined to 3–9), were compared. Ten-fold cross-validation on the training dataset was applied to choose the candidate final model. The area under the receiver operating characteristic curve (AUC), the precision-recall plot, and the Matthews Correlation Coefficient (MCC) were used to evaluate the performance of the machine learning approaches. Results: The 3D features were significantly superior to the 2D features, showing many more machine learning combinations with AUC greater than 0.7 in both the validation and test groups (129 vs. 11). The feature selection methods Analysis of Variance (ANOVA) and Recursive Feature Elimination (RFE) and the classifiers Logistic Regression (LR), Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), and Gaussian Process (GP) had relatively better performance. The best performance of 3D radiomics features in the test dataset (AUC = 0.824, AUC-PR = 0.927, MCC = 0.514) was higher than that of 2D features (AUC = 0.740, AUC-PR = 0.846, MCC = 0.404). The joint 3D and 2D features (AUC = 0.813, AUC-PR = 0.926, MCC = 0.563) showed similar results to the 3D features. Incorporating clinical features with 3D and 2D radiomics features slightly improved the AUC to 0.836 (AUC-PR = 0.918, MCC = 0.620) and 0.780 (AUC-PR = 0.900, MCC = 0.574), respectively. Conclusions: After algorithm optimization, 2D feature-based radiomics models yield favorable results in differentiating malignant and benign SPLs, but 3D features are still preferred because more machine learning algorithmic combinations with better performance are available. The feature selection methods ANOVA and RFE and the classifiers LR, LDA, SVM, and GP are more likely to demonstrate better diagnostic performance for 3D features in the current study.
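The Matthews Correlation Coefficient reported alongside AUC balances all four confusion-matrix cells, which makes it informative even when malignant and benign cases are imbalanced. For the binary case it reduces to a one-liner; the confusion counts below are invented for illustration.

```python
from math import sqrt

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from binary confusion counts:
    +1 is perfect, 0 is chance-level, -1 is total disagreement."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Invented counts: 40 true positives, 10 false positives,
# 5 false negatives, 45 true negatives.
print(round(mcc(40, 10, 5, 45), 3))
```

A model at, say, 85% accuracy can still land near MCC 0.5 when one class dominates, which is why the paper's MCC values (0.40-0.62) look modest next to its AUCs.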


2021 ◽  
Author(s):  
Ravi Arkalgud ◽  
Andrew McDonald ◽  
Ross Brackenridge ◽  
...  

Automation is becoming an integral part of our daily lives as technology and techniques rapidly develop, and many automation workflows are now routinely applied within the geoscience domain. The success of automated modelling fundamentally hinges on the appropriate choice of parameters and on the speed of processing, and the entire process demands that the data fed into any machine learning model be of good quality. Technological advances in well logging over the decades have enabled the collection of vast amounts of data across wells and fields. This poses a major challenge for automating petrophysical workflows: it must be ensured that the data fed in are appropriate and fit for purpose. The selection of features (logging curves) and parameters for machine learning algorithms has therefore become a topic at the forefront of related research. Inappropriate feature selections can lead to erroneous results and reduced precision, and have proved to be computationally expensive. Experienced Eye (EE) is a novel methodology, derived from Domain Transfer Analysis (DTA), which seeks to identify the optimum input curves for modelling. During the EE solution process, relationships between the input variables and target variables are developed based on the characteristics and attributes of the inputs rather than on statistical averages. The relationships so developed can then be ranked and selected for the modelling process. This paper focuses on three distinct petrophysical data scenarios in which inputs are ranked prior to modelling: prediction of continuous permeability from discrete core measurements, prediction of porosity from multiple logging measurements, and prediction of key geomechanical properties. Each input curve is ranked against a target feature.
For each case study, the best-ranked features were carried forward to the modelling stage, and the results were validated against conventional interpretation methods. Ranked features were also compared between different modelling algorithms: DTA, neural networks, and multiple linear regression. Results are compared with the available data for each case study. The new feature selection approach has been shown to improve the accuracy and precision of prediction results from multiple modelling algorithms.
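The internals of the DTA/EE ranking are not given in the abstract (it explicitly ranks by attributes of the inputs rather than statistical averages), so as a generic stand-in, here is the simplest possible curve-ranking scheme: sort candidate input curves by absolute Pearson correlation with the target. The log-curve names and values are hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def rank_inputs(curves, target):
    """Rank candidate input curves by |correlation| with the target,
    strongest first. A stand-in for the DTA/EE ranking, not its method."""
    return sorted(curves, key=lambda name: -abs(pearson(curves[name], target)))

# Hypothetical curves: DT tracks the target, NPHI anti-correlates, GR is noisy.
target = [1, 2, 3, 4]
curves = {"GR": [2, 1, 4, 3], "DT": [1, 2, 3, 4], "NPHI": [4, 4, 1, 1]}
print(rank_inputs(curves, target))  # ['DT', 'NPHI', 'GR']
```

Taking the absolute value matters for logs: a strongly anti-correlated curve (like neutron porosity against velocity) is just as useful an input as a positively correlated one.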


2021 ◽  
Vol 7 ◽  
pp. e390
Author(s):  
Shafaq Abbas ◽  
Zunera Jalil ◽  
Abdul Rehman Javed ◽  
Iqra Batool ◽  
Mohammad Zubair Khan ◽  
...  

Breast cancer is one of the leading causes of death in the current age. It often results in subpar living conditions for patients, who must go through expensive and painful treatments to fight the cancer. Worldwide, one in eight women is affected by this disease, and almost half a million women annually do not survive the fight. Machine learning algorithms have been shown to outperform existing solutions for the prediction of breast cancer using models built on previously available data. In this paper, a novel approach named BCD-WERT is proposed that utilizes the Extremely Randomized Tree and the Whale Optimization Algorithm (WOA) for efficient feature selection and classification. WOA reduces the dimensionality of the dataset and extracts the relevant features for accurate classification. Experimental results on a state-of-the-art comprehensive dataset demonstrated improved performance in comparison with eight other machine learning algorithms: Support Vector Machine (SVM), Random Forest, Kernel Support Vector Machine, Decision Tree, Logistic Regression, Stochastic Gradient Descent, Gaussian Naive Bayes, and k-Nearest Neighbor. BCD-WERT outperformed all of them with the highest accuracy rate of 99.30%, followed by SVM at 98.60%. The experimental results also reveal the effectiveness of feature selection techniques in improving prediction accuracy.
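WOA is a population-based metaheuristic; what it drives in a wrapper approach like this is the scoring of candidate binary feature masks by a classifier. The sketch below replaces the whale heuristic with exhaustive enumeration (feasible only for a handful of features) to show that wrapper evaluation step in isolation; the scoring function is invented for the demo.

```python
from itertools import product

def wrapper_search(n_features, evaluate):
    """Score every non-empty binary feature mask with the supplied
    classifier-based evaluate() and keep the best. A metaheuristic such
    as WOA explores this same space when enumeration is infeasible."""
    best = max((m for m in product((0, 1), repeat=n_features) if any(m)),
               key=evaluate)
    return best, evaluate(best)

# Invented score: features 0 and 2 help, every other selected feature hurts.
score = lambda mask: sum(1.0 if i in (0, 2) else -0.5
                         for i, bit in enumerate(mask) if bit)
print(wrapper_search(4, score))  # ((1, 0, 1, 0), 2.0)
```

With 2^n - 1 candidate masks, the search space doubles per feature, which is exactly why wrapper selection on real datasets needs a heuristic optimizer like WOA rather than enumeration.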


Author(s):  
Durmuş Özkan Şahin ◽  
Erdal Kılıç

In this study, the authors give both theoretical and experimental information about text mining, one of the topics of natural language processing. Three text mining problems for Turkish are discussed: news classification, sentiment analysis, and author recognition. The aim is to reduce the running time and increase the performance of machine learning algorithms. Four machine learning algorithms and two feature selection metrics are used to solve these text classification problems. The classification algorithms are random forest (RF), logistic regression (LR), naive Bayes (NB), and sequential minimal optimization (SMO). Chi-square and information gain metrics are used as the feature selection methods. The highest classification performance achieved in this study is 0.895 according to the F-measure metric, obtained by using the SMO classifier and the information gain metric for news classification. This study is important in terms of comparing the performances of classification algorithms and feature selection methods.
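Information gain, one of the two feature selection metrics used, scores a term by the drop in class entropy once the documents are split on that term's presence. A minimal sketch with invented Turkish news labels (the actual corpus and vocabulary are not given in the abstract):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, term_present):
    """H(class) minus the weighted class entropy after splitting the
    documents on presence (1) / absence (0) of the term."""
    n = len(labels)
    split = {}
    for y, f in zip(labels, term_present):
        split.setdefault(f, []).append(y)
    cond = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(labels) - cond

docs = ["spor", "spor", "ekonomi", "ekonomi"]      # invented class labels
print(information_gain(docs, [1, 1, 0, 0]))  # 1.0 (term separates the classes)
print(information_gain(docs, [1, 0, 1, 0]))  # 0.0 (term is uninformative)
```

Ranking the vocabulary by this score and keeping the top terms is what shrinks the feature space and, with it, the running time the authors set out to reduce.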


2020 ◽  
Vol 15 ◽  
Author(s):  
Shivani Aggarwal ◽  
Kavita Pandey

Background: Polycystic ovary syndrome, commonly known as PCOS, affects up to 18% of women of reproductive age, making it the most commonly occurring hormone-related disorder. Symptoms of PCOS include irregular periods, increased facial and body hair growth, weight gain, darkening of the skin, diabetes, and trouble conceiving (infertility). It has also come to light that patients suffering from PCOS possess a range of metabolic abnormalities. These metabolic abnormalities can lead to disorders that increase the risk of insulin resistance, type 2 diabetes, and impaired glucose tolerance (a sign of prediabetes). Family members of women suffering from PCOS are also at higher risk of developing the same metabolic abnormalities. Obesity and overweight status contribute to insulin resistance in PCOS. Objective: In the modern era, several new technologies are available to diagnose PCOS, and one of them is machine learning, whose algorithms learn from past experience and new data to produce reliable and repeatable decisions. In this article, machine learning algorithms are used to identify the important features for diagnosing PCOS. Methods: Several classification algorithms, including Support Vector Machine (SVM), Logistic Regression, Gradient Boosting, Random Forest, Decision Tree, and K-Nearest Neighbor (KNN), are applied, using well-organized test datasets to classify the records. Initially, a dataset of 541 instances and 41 attributes was taken to apply the prediction models, and manual feature selection was performed on it. Results: After feature selection, a set of 12 attributes was identified that plays a crucial role in diagnosing PCOS. Conclusion: Several research efforts are progressing in the direction of diagnosing PCOS, but until now the relevant features had not been identified.
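Of the classifiers listed, K-Nearest Neighbor is the simplest to show in full: classify a patient record by a majority vote of the k closest training records. A toy sketch with invented two-feature records, not the study's 41-attribute data:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training rows nearest (by squared
    Euclidean distance) to the query record."""
    sq_dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda row: sq_dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Invented (feature_vector, label) training rows.
train = [((0, 0), "no"), ((0, 1), "no"), ((1, 0), "no"),
         ((5, 5), "pcos"), ((5, 6), "pcos"), ((6, 5), "pcos")]
print(knn_predict(train, (5, 5)))  # pcos
print(knn_predict(train, (0, 0)))  # no
```

With 41 raw attributes on very different scales, distance-based voting like this is exactly where feature selection (and normalization) pays off, which motivates the 12-attribute subset reported in the results.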


2020 ◽  
Vol 2020 ◽  
pp. 1-10
Author(s):  
Kadir Celikmih ◽  
Onur Inan ◽  
Harun Uguz

There is a large amount of information and maintenance data in the aviation industry that could be used to obtain meaningful results in forecasting future actions. This study aims to introduce machine learning models based on feature selection and data elimination to predict failures of aircraft systems. Maintenance and failure data for aircraft equipment across a period of two years were collected, and nine input variables and one output variable were identified. A hybrid data preparation model is proposed to improve the success of failure count prediction in two stages. In the first stage, ReliefF, a feature selection method for attribute evaluation, is used to find the most effective and ineffective parameters. In the second stage, a K-means algorithm is modified to eliminate noisy or inconsistent data. The performance of the hybrid data preparation model on the equipment maintenance dataset is evaluated using a Multilayer Perceptron (MLP) as the Artificial Neural Network (ANN), Support Vector Regression (SVR), and Linear Regression (LR) as machine learning algorithms. Performance criteria such as the Correlation Coefficient (CC), Mean Absolute Error (MAE), and Root Mean Square Error (RMSE) are used to evaluate the models. The results indicate that the hybrid data preparation model is successful in predicting the failure count of the equipment.
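The data-elimination idea in the second stage can be illustrated for a single cluster: compute the centroid, then drop points whose distance to it exceeds a multiple of the mean distance. This one-cluster sketch is only an illustration of the principle (the paper modifies K-means over several clusters), and the threshold factor is an assumption for the demo.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def drop_outliers(points, factor=2.0):
    """Keep only points whose distance to the centroid is at most
    `factor` times the mean distance; the rest are treated as noise."""
    c = centroid(points)
    dist = lambda p: sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5
    mean_d = sum(dist(p) for p in points) / len(points)
    return [p for p in points if dist(p) <= factor * mean_d]

# Four tightly grouped records plus one inconsistent outlier.
print(drop_outliers([(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]))
# [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Filtering such points before training is what lets the downstream MLP, SVR, and LR models fit the dominant maintenance pattern instead of the noise.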


2021 ◽  
Vol 9 (1) ◽  
pp. 595-603
Author(s):  
Shivangi Srivastav ◽  
Rajiv Ranjan Tewari

Speech is a significant characteristic for distinguishing a person in daily human-to-human interaction and communication. Like other biometric measures, such as the face, iris, and fingerprints, voice can therefore be used as a biometric measure for recognizing or identifying a person. Speaker recognition is a form of voice recognition in which the speaker is identified from the expression rather than the message. Automatic Speaker Recognition (ASR) identifies people based on features extracted from speech utterances. Speech signals are powerful communication media that constantly convey rich and useful information, such as a speaker's emotion, gender, accent, and other distinguishing attributes. In any speaker identification task, the essential step is to extract useful features and build meaningful speaker models. A theoretical description, an organization of emotional states, and the modalities of expressing emotion are also covered. A speech emotion recognition (SER) framework is developed to conduct this investigation, built on different classifiers and different feature-extraction techniques. In this work, various machine learning algorithms are investigated to identify the decision boundary in the feature space of audio signals. Moreover, the novelty of this work lies in improving the performance of classical machine learning algorithms using information-theory-based feature selection methods. The highest accuracy achieved is 96 percent, using the random forest algorithm combined with the Joint Mutual Information feature selection method.
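The exact feature set is not specified in the abstract, but speech features of this kind are typically built from low-level per-frame descriptors. A minimal sketch of two classics, short-time energy and zero-crossing rate, computed over non-overlapping frames of a toy signal:

```python
def frame_features(signal, frame_len=4):
    """Short-time energy and zero-crossing rate for each
    non-overlapping frame of a sampled signal."""
    feats = []
    for i in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[i:i + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        zcr = sum(1 for a, b in zip(frame, frame[1:])
                  if a * b < 0) / (frame_len - 1)
        feats.append((energy, zcr))
    return feats

# An oscillating frame (high ZCR) followed by a steady one (zero ZCR).
print(frame_features([1, -1, 1, -1, 2, 2, 2, 2]))  # [(1.0, 1.0), (4.0, 0.0)]
```

Statistics of such frame-level descriptors over an utterance form the feature vectors on which the information-theoretic selection and the random forest classifier then operate.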

