Regional feature extraction of various fishes based on chemical and microbial variable selection using machine learning

Taiga Asakura; Kenji Sakata; Yasuhiro Date; Jun Kikuchi

doi:10.1039/c8ay00377g

Identifying Potential miRNA Biomarkers for Gastric Cancer Diagnosis Using Machine Learning Variable Selection Approach

Frontiers in Genetics ◽

10.3389/fgene.2021.779455 ◽

2022 ◽

Vol 12 ◽

Author(s):

Neda Gilani ◽

Reza Arabi Belaghi ◽

Younes Aftabi ◽

Elnaz Faramarzi ◽

Tuba Edgünlü ◽

...

Keyword(s):

Machine Learning ◽

Gastric Cancer ◽

Variable Selection ◽

Prediction Models ◽

Strong Relationship ◽

Training Sample ◽

Machine Learning Algorithms ◽

Molecular Events ◽

Selection Approach ◽

Ontological Analysis

Aim: This study aimed to accurately identify potential miRNAs for gastric cancer (GC) diagnosis at the early stages of the disease. Methods: We used the GSE106817 data set with 2,566 miRNAs to train the machine learning models. We used the Boruta machine learning variable selection approach to identify the miRNAs strongly associated with GC in the training sample. We then validated the prediction models in the independent sample GSE113486. Finally, an ontological analysis was done on the identified miRNAs to elicit the relevant relationships. Results: Of the 2,874 patients in the training sample, 115 (4%) had GC. Boruta identified 30 miRNAs as potential biomarkers for GC diagnosis, and hsa-miR-1343-3p ranked highest. All of the machine learning algorithms showed that, using hsa-miR-1343-3p as a biomarker with a cut-off point of 8.2, GC can be predicted with very high precision (AUC, 100%; sensitivity, 100%; specificity, 100%; ROC, 100%; Kappa, 100). Also, ontological analysis of the 30 identified miRNAs confirmed their strong relationship with cancer-associated genes and molecular events. Conclusion: hsa-miR-1343-3p could be introduced as a valuable target for studies on GC diagnosis using reliable biomarkers.
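
The single-marker rule mentioned in this abstract (comparing a sample's hsa-miR-1343-3p level with the cut-off of 8.2) can be illustrated with a minimal sketch. The expression values and labels below are invented, and the direction of the rule (values at or above the cut-off are called GC) is an assumption, since the abstract does not state it.

```python
def sensitivity_specificity(values, labels, cutoff, positive_if_above=True):
    """Return (sensitivity, specificity) for a single-marker threshold rule.

    labels: 1 = case, 0 = control.
    positive_if_above is an assumption; set it to False if low values
    of the marker indicate disease.
    """
    tp = fn = tn = fp = 0
    for value, label in zip(values, labels):
        predicted = (value >= cutoff) if positive_if_above else (value < cutoff)
        if label == 1:
            if predicted:
                tp += 1
            else:
                fn += 1
        else:
            if predicted:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Invented toy data, not taken from GSE106817 or GSE113486.
toy_values = [9.1, 8.7, 8.3, 7.9, 6.5, 7.2, 8.4]
toy_labels = [1, 1, 1, 0, 0, 0, 0]
sens, spec = sensitivity_specificity(toy_values, toy_labels, cutoff=8.2)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On real data the same function would be applied to the held-out validation sample rather than the training sample, so that the reported figures are not optimistic.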

Variable selection with missing data in both covariates and outcomes: Imputation and machine learning

Statistical Methods in Medical Research ◽

10.1177/09622802211046385 ◽

2021 ◽

pp. 096228022110463

Author(s):

Liangyuan Hu ◽

Jung-Yi Joyce Lin ◽

Jiayi Ji

Keyword(s):

Machine Learning ◽

Missing Data ◽

Variable Selection ◽

Random Forests ◽

Gradient Boosting ◽

Stepwise Selection ◽

Machine Learning Methods ◽

Extreme Gradient Boosting ◽

Selection Approach ◽

Additive Regression

Variable selection in the presence of both missing covariates and outcomes is an important statistical research topic. Parametric regression models are susceptible to misspecification and, as a result, are sub-optimal for variable selection. Flexible machine learning methods mitigate the reliance on parametric assumptions, but do not provide a variable importance measure as naturally defined as the covariate effect native to parametric models. We investigate a general variable selection approach when both the covariates and outcomes can be missing at random and have general missing data patterns. This approach exploits the flexibility of machine learning models and bootstrap imputation, which is amenable to nonparametric methods in which the covariate effects are not directly available. We conduct expansive simulations investigating the practical operating characteristics of the proposed variable selection approach when combined with four tree-based machine learning methods (extreme gradient boosting, random forests, Bayesian additive regression trees, and conditional random forests) and two commonly used parametric methods (lasso and backward stepwise selection). Numeric results suggest that extreme gradient boosting and Bayesian additive regression trees have the overall best variable selection performance with respect to the F1 score and Type I error, while the lasso and backward stepwise selection have subpar performance across various settings. There is no significant difference in the variable selection performance due to imputation methods. We further demonstrate the methods via a case study of risk factors for 3-year incidence of metabolic syndrome with data from the Study of Women’s Health Across the Nation.
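
In a simulation where the truly useful predictors are known, a selection result can be scored roughly as follows. This sketch assumes the score in question is the usual F1 (harmonic mean of precision and recall) and takes Type I error to be the fraction of noise variables that get selected; both definitions, and all variable names, are assumptions made for illustration and are not taken from the paper.

```python
def selection_metrics(selected, truly_useful, all_variables):
    """Score a variable selection result against a known truth.

    Returns (f1, type_i_error). Type I error is computed here as the
    share of noise variables that were selected (an assumed definition).
    """
    selected = set(selected)
    truly_useful = set(truly_useful)
    noise = set(all_variables) - truly_useful

    tp = len(selected & truly_useful)   # useful variables that were selected
    fp = len(selected - truly_useful)   # noise variables that were selected
    fn = len(truly_useful - selected)   # useful variables that were missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    type_i_error = fp / len(noise) if noise else 0.0
    return f1, type_i_error


# Hypothetical simulation: x1..x3 are useful, x4..x6 are noise.
variables = ["x1", "x2", "x3", "x4", "x5", "x6"]
f1, type_i = selection_metrics(
    selected=["x1", "x2", "x4"],
    truly_useful=["x1", "x2", "x3"],
    all_variables=variables,
)
print(f"F1={f1:.3f}, Type I error={type_i:.3f}")
```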

Feature extraction and prediction of Dengue Outbreaks

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit206544 ◽

2020 ◽

pp. 216-222

Author(s):

Kunal Parikh ◽

Tanvi Makadia ◽

Harshil Patel

Keyword(s):

Public Health ◽

Machine Learning ◽

Developing Countries ◽

Feature Extraction ◽

Predictive Analytics ◽

Learning Algorithm ◽

Machine Learning Algorithm ◽

Health Concerns ◽

The World ◽

Dengue Outbreaks

Dengue is unquestionably one of the biggest health concerns in India and in many other developing countries. Unfortunately, many people have lost their lives because of it. Every year, approximately 390 million dengue infections occur around the world, of which 500,000 are serious and 25,000 are fatal. Many factors can contribute to dengue, such as temperature, humidity, precipitation, and inadequate public health. In this paper, we propose a method to perform predictive analytics on a dengue dataset using KNN, a machine-learning algorithm. This analysis would help in the prediction of future cases and could save many lives.
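
For readers unfamiliar with KNN, a minimal sketch of the idea is given here: a new observation receives the majority label of its k closest training points. The feature columns (temperature, humidity, rainfall), the data, and the labels are all invented for illustration and are not the paper's dataset or pipeline.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest training points.

    Uses plain Euclidean distance. Real features should usually be
    standardized first, since rainfall dominates the distance here.
    """
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Invented rows: (temperature in C, relative humidity in %, rainfall in mm)
train_x = [
    (31, 85, 220), (30, 80, 180), (29, 78, 200),
    (22, 40, 10), (24, 45, 30), (20, 35, 5),
]
train_y = ["outbreak", "outbreak", "outbreak",
           "no outbreak", "no outbreak", "no outbreak"]

wet_month = knn_predict(train_x, train_y, query=(28, 75, 150))
dry_month = knn_predict(train_x, train_y, query=(21, 38, 8))
print(wet_month, "/", dry_month)
```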

Improving Practices for Selecting a Subset of Important Predictors in Psychology: An Application to Predicting Pain

10.31234/osf.io/j8t7s ◽

2019 ◽

Author(s):

Sierra Bainter ◽

Thomas Granville McCauley ◽

Tor D Wager ◽

Elizabeth Reynolds Losin

Keyword(s):

Variable Selection ◽

Bayesian Variable Selection ◽

Limited Information ◽

Online Application ◽

Stochastic Search Variable Selection ◽

Selection Approach ◽

Pain Ratings ◽

Research Questions ◽

Standard Techniques ◽

Search Variable

In this paper we address the problem of selecting important predictors from some larger set of candidate predictors. Standard techniques are limited by lack of power and high false positive rates. A Bayesian variable selection approach used widely in biostatistics, stochastic search variable selection, can be used instead to combat these issues by accounting for uncertainty in the other predictors of the model. In this paper we present Bayesian variable selection to aid researchers facing this common scenario, along with an online application (https://ssvsforpsych.shinyapps.io/ssvsforpsych/) to perform the analysis and visualize the results. Using an application to predict pain ratings, we demonstrate how this approach quickly identifies reliable predictors, even when the set of possible predictors is larger than the sample size. This technique is widely applicable to research questions that may be relatively data-rich, but with limited information or theory to guide variable selection.

Document Preprocessing with TF-IDF to Improve the Polarity Classification Performance of Unstructured Sentiment Analysis

Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control ◽

10.22219/kinetik.v5i3.1066 ◽

2020 ◽

pp. 235-242

Author(s):

Farrikh Alzami ◽

Erika Devi Udayanti ◽

Dwi Puji Prabowo ◽

Rama Aria Megantara

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Random Forest ◽

Sentiment Analysis ◽

Classification Performance ◽

Document Preparation ◽

Learning Models ◽

Polarity Classification ◽

Negative Sentiment ◽

Machine Learning Models

Sentiment analysis in terms of polarity classification is very important in everyday life: knowing the polarity lets people find out whether a given document has positive or negative sentiment, which helps them choose and make decisions. Sentiment analysis is usually done manually. Therefore, an automatic sentiment analysis classification process is needed. However, it is rare to find studies that discuss feature extraction and which learning models are suitable for unstructured sentiment analysis, with Amazon food reviews as the case. This research explores several feature extraction methods, namely bag of words, TF-IDF, Word2Vec, and a combination of TF-IDF and Word2Vec, with several machine learning models such as Random Forest, SVM, KNN, and Naïve Bayes, to find a combination of feature extraction and learning model that can help add variety to the analysis of sentiment polarity. With document preprocessing (handling HTML tags, punctuation, and special characters, and applying Snowball stemming), TF-IDF combined with SVM is suitable for polarity classification in unstructured sentiment analysis for the Amazon food review case, with a performance of 87.3 percent.
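
As a reminder of what TF-IDF weighting does, the sketch below computes the plain textbook variant, term frequency times log(N / document frequency), on three invented review snippets. It is not the paper's preprocessing pipeline, and library implementations typically use smoothed or normalized variants that give different numbers.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Weight = (term count / document length) * ln(N / document frequency).
    Tokenization is a naive lowercase whitespace split (an assumption).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Invented snippets standing in for food reviews.
reviews = ["great tasty snack", "stale snack", "great price"]
weights = tf_idf(reviews)
for review, w in zip(reviews, weights):
    print(review, "->", {t: round(v, 3) for t, v in w.items()})
```

Terms that occur in fewer documents ("tasty") receive a higher weight than terms shared across documents ("snack"), and a term present in every document receives a weight of zero.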

A Comparative Survey of Feature Extraction and Machine Learning Methods in Diverse Acoustic Environments

Sensors ◽

10.3390/s21041274 ◽

2021 ◽

Vol 21 (4) ◽

pp. 1274

Author(s):

Daniel Bonet-Solà ◽

Rosa Ma Alsina-Pagès

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Best Practice ◽

Nearest Neighbor ◽

Gaussian Mixture ◽

Machine Learning Algorithms ◽

Multimedia Retrieval ◽

Natural Environments ◽

K Nearest Neighbor ◽

Acoustic Environments

Acoustic event detection and analysis has been widely developed in the last few years for its valuable application in monitoring elderly or dependent people, for surveillance issues, for multimedia retrieval, or even for biodiversity metrics in natural environments. For this purpose, sound source identification is a key issue to give a smart technological answer to all the aforementioned applications. Diverse types of sounds and varied environments, together with a number of challenges in terms of application, widen the choice of candidate artificial intelligence algorithms. This paper presents a comparative study on combining several feature extraction algorithms (Mel Frequency Cepstrum Coefficients (MFCC), Gammatone Cepstrum Coefficients (GTCC), and Narrow Band (NB)) with a group of machine learning algorithms (k-Nearest Neighbor (kNN), Neural Networks (NN), and Gaussian Mixture Model (GMM)), tested over five different acoustic environments. This work has the goal of detailing a best practice method and evaluating the reliability of this general-purpose algorithm for all the classes. Preliminary results show that most of the combinations of feature extraction and machine learning present acceptable results in most of the described corpora. Nevertheless, there is a combination that outperforms the others: the use of GTCC together with kNN, and its results are further analyzed for all the corpora.

Suicidality Detection on Social Media Using Metadata and Text Feature Extraction and Machine Learning

Archives of Suicide Research ◽

10.1080/13811118.2021.1955783 ◽

2021 ◽

pp. 1-16

Author(s):

Woojin Jung ◽

Donghun Kim ◽

Seojin Nam ◽

Yongjun Zhu

Keyword(s):

Machine Learning ◽

Social Media ◽

Feature Extraction ◽

Text Feature

Multi-Sensor Fusion Module for Perceptual Target Recognition for Intelligent Machine Learning Visual Feature Extraction

IEEE Sensors Journal ◽

10.1109/jsen.2021.3061207 ◽

2021 ◽

pp. 1-1

Author(s):

Hechuang Wang

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Sensor Fusion ◽

Target Recognition ◽

Visual Feature ◽

Intelligent Machine ◽

Visual Feature Extraction

Utilising Flow Aggregation to Classify Benign Imitating Attacks

Sensors ◽

10.3390/s21051761 ◽

2021 ◽

Vol 21 (5) ◽

pp. 1761

Author(s):

Hanan Hindy ◽

Robert Atkinson ◽

Christos Tachtatzis ◽

Ethan Bayne ◽

Miroslav Bures ◽

...

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Network Traffic ◽

Cyber Attacks ◽

Detection Accuracy ◽

Computational Power ◽

Human Understanding ◽

Flow Aggregation ◽

Additional Level ◽

Attack Surfaces

Cyber-attacks continue to grow, both in terms of volume and sophistication. This is aided by an increase in available computational power, expanding attack surfaces, and advancements in the human understanding of how to make attacks undetectable. Unsurprisingly, machine learning is utilised to defend against these attacks. In many applications, the choice of features is more important than the choice of model. A range of studies have, with varying degrees of success, attempted to discriminate between benign traffic and well-known cyber-attacks. The features used in these studies are broadly similar and have demonstrated their effectiveness in situations where cyber-attacks do not imitate benign behaviour. To overcome this barrier, in this manuscript, we introduce new features based on a higher level of abstraction of network traffic. Specifically, we perform flow aggregation by grouping flows with similarities. This additional level of feature abstraction benefits from cumulative information, thus qualifying the models to classify cyber-attacks that mimic benign traffic. The performance of the new features is evaluated using the benchmark CICIDS2017 dataset, and the results demonstrate their validity and effectiveness. This novel proposal will improve the detection accuracy of cyber-attacks and also build towards a new direction of feature extraction for complex ones.
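
The general idea of flow aggregation (grouping similar flows and deriving cumulative features per group) can be sketched as follows. The grouping key (source and destination address), the field names, and the three summary features are hypothetical choices for illustration; the abstract does not specify how the authors group flows or which aggregated features they compute.

```python
from collections import defaultdict


def aggregate_flows(flows, key_fields=("src_ip", "dst_ip")):
    """Group flow records by key_fields and compute per-group summaries.

    Each flow is a dict with at least the key fields plus "bytes" and
    "duration". Field names are hypothetical, not a CICIDS2017 schema.
    """
    groups = defaultdict(list)
    for flow in flows:
        groups[tuple(flow[field] for field in key_fields)].append(flow)

    aggregated = {}
    for key, members in groups.items():
        aggregated[key] = {
            "flow_count": len(members),
            "total_bytes": sum(m["bytes"] for m in members),
            "mean_duration": sum(m["duration"] for m in members) / len(members),
        }
    return aggregated


# Invented flow records.
flows = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "bytes": 100, "duration": 1.0},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "bytes": 120, "duration": 2.0},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "bytes": 110, "duration": 3.0},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.9", "bytes": 5000, "duration": 0.5},
]
summary = aggregate_flows(flows)
for key, features in summary.items():
    print(key, features)
```

Per-group features of this kind could then be attached back to each flow as extra columns before training a classifier, so that a single flow that looks benign in isolation is judged alongside the behaviour of its group.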
