vis–NIR and XRF Data Fusion and Feature Selection to Estimate Potentially Toxic Elements in Soil

Sensors ◽  
2021 ◽  
Vol 21 (7) ◽  
pp. 2386
Author(s):  
Asa Gholizadeh ◽  
João A. Coblinski ◽  
Mohammadmehdi Saberioon ◽  
Eyal Ben-Dor ◽  
Ondřej Drábek ◽  
...  

Soil contamination by potentially toxic elements (PTEs) is intensifying under increasing industrialization. Thus, the ability to efficiently delineate contaminated sites is crucial. Visible–near infrared (vis–NIR: 350–2500 nm) and X-ray fluorescence (XRF: 0.02–41.08 keV) spectroscopic techniques have attracted tremendous attention for the assessment of PTEs. Recently, the application of fused vis–NIR and XRF spectroscopy, which exploits the complementary information captured by the two sensors, has also been increasing. Moreover, different data manipulation methods, including feature selection approaches, affect the prediction performance. This study investigated the feasibility of using single and fused vis–NIR and XRF spectra while exploring feature selection algorithms for the assessment of key soil PTEs. The soil samples were collected from one of the most heavily polluted areas of the Czech Republic and scanned using laboratory vis–NIR and XRF spectrometers. A univariate filter (UF) and a genetic algorithm (GA) were used to select the bands of greater importance for PTE prediction. A support vector machine (SVM) was then used to train the models using the full-range and feature-selected spectra of single sensors and their fusion. It was found that XRF spectra alone (primarily GA-selected) performed better than single vis–NIR and fused spectral data for prediction of PTEs. Moreover, the prediction models derived from the fused data set (particularly the GA-selected) were more accurate than those from the single vis–NIR spectra. In general, the results suggest that the GA-selected spectra obtained from the single XRF spectrometer (for As and Pb) and from the fusion of vis–NIR and XRF (for Pb) are promising for accurate quantitative estimation of these PTEs.
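As a rough illustration of the low-level fusion step described above, each sensor's spectrum can be autoscaled (so that neither sensor's units dominate) and the two vectors concatenated into a single input for the SVM. This is a minimal sketch with toy spectra; the band counts, preprocessing, and exact fusion scheme of the study may differ.

```python
def autoscale(spectrum):
    """Center and scale one spectrum so sensors with different units are comparable."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    std = (sum((x - mean) ** 2 for x in spectrum) / n) ** 0.5 or 1.0
    return [(x - mean) / std for x in spectrum]

def fuse_spectra(visnir, xrf):
    """Low-level fusion: concatenate the autoscaled vis-NIR and XRF spectra."""
    return autoscale(visnir) + autoscale(xrf)

# Toy spectra (real vis-NIR covers 350-2500 nm; XRF covers 0.02-41.08 keV).
visnir = [0.12, 0.15, 0.21, 0.35]
xrf = [1040.0, 980.0, 15.0]
fused = fuse_spectra(visnir, xrf)
```

The fused vector then feeds the same SVM training step as the single-sensor spectra.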

Sensors ◽  
2018 ◽  
Vol 18 (12) ◽  
pp. 4463 ◽  
Author(s):  
Shuxiang Fan ◽  
Changying Li ◽  
Wenqian Huang ◽  
Liping Chen

Currently, the detection of blueberry internal bruising focuses mostly on single hyperspectral imaging (HSI) systems. Attempts to fuse different HSI systems with complementary spectral ranges are still lacking. A push-broom-based HSI system and a liquid crystal tunable filter (LCTF)-based HSI system, with different sensing ranges and detectors, were investigated to jointly detect blueberry internal bruising in the lab. The mean reflectance spectrum of each berry sample was extracted from the data obtained by each of the two HSI systems. The spectral data from the two spectroscopic techniques were analyzed separately using a feature selection method, partial least squares-discriminant analysis (PLS-DA), and support vector machine (SVM), and then fused with three data fusion strategies at the data level, feature level, and decision level. The three data fusion strategies achieved better classification results than using each HSI system alone. Decision-level fusion, which integrated the classification results from the two instruments with selected relevant features, achieved the most promising results, suggesting that the two HSI systems with complementary spectral ranges, combined with feature selection and data fusion strategies, could be used synergistically to improve blueberry internal bruising detection. This study was a first step in demonstrating the feasibility of fusing two HSI systems with complementary spectral ranges for detecting blueberry bruising, which could lead to a multispectral imaging system with a few selected wavelengths and an appropriate detector for bruising detection on the packing line.
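Decision-level fusion can be illustrated with a simple rule that averages the bruised-class probabilities produced by the two systems and thresholds the result. The probabilities, threshold, and combination rule below are illustrative stand-ins, not the exact scheme used in the study.

```python
def decision_fusion(prob_a, prob_b, threshold=0.5):
    """Decision-level fusion: average the bruised-class probabilities from the
    push-broom and LCTF systems and threshold the mean (1 = bruised)."""
    return 1 if (prob_a + prob_b) / 2 >= threshold else 0

# One berry per row: (push-broom probability, LCTF probability), made-up values.
berries = [(0.9, 0.7), (0.3, 0.2), (0.55, 0.35)]
labels = [decision_fusion(a, b) for a, b in berries]
```

Averaging softens disagreements between the two instruments: a berry flagged strongly by one system but weakly by the other lands near the threshold rather than being decided by either alone.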


2019 ◽  
Vol 21 (9) ◽  
pp. 662-669 ◽  
Author(s):  
Junnan Zhao ◽  
Lu Zhu ◽  
Weineng Zhou ◽  
Lingfeng Yin ◽  
Yuchen Wang ◽  
...  

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade, which is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict the Ki values of thrombin inhibitors from a large data set using machine learning methods. Because machine learning can find non-intuitive regularities in high-dimensional datasets, it can be used to build effective predictive models. A total of 6554 descriptors for each compound were collected, and an efficient descriptor selection method was chosen to find the appropriate descriptors. Four different methods, including multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT) and Support Vector Machine (SVM), were implemented to build prediction models with these selected descriptors. Results: The SVM model was the best among these methods, with R2=0.84, MSE=0.55 for the training set and R2=0.83, MSE=0.56 for the test set. Several validation methods, such as the y-randomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.
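The R2 and MSE figures quoted for the training and test sets follow the standard definitions, which can be computed in a few lines (the Ki-style values below are made up for illustration):

```python
def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy observed vs. predicted inhibitory constants (illustrative values only).
y_true = [5.1, 6.0, 7.2, 8.3]
y_pred = [5.0, 6.2, 7.0, 8.5]
```

A y-randomization test then repeats the fit with shuffled y values; a real model should score far better than the shuffled baselines.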


2020 ◽  
Vol 44 (8) ◽  
pp. 851-860
Author(s):  
Joy Eliaerts ◽  
Natalie Meert ◽  
Pierre Dardenne ◽  
Vincent Baeten ◽  
Juan-Antonio Fernandez Pierna ◽  
...  

Abstract Spectroscopic techniques combined with chemometrics are a promising tool for the analysis of seized drug powders. In this study, the performance of three spectroscopic techniques [Mid-InfraRed (MIR), Raman and Near-InfraRed (NIR)] was compared. In total, 364 seized powders were analyzed, consisting of 276 cocaine powders (with concentrations ranging from 4 to 99 w%) and 88 powders without cocaine. A classification model (using Support Vector Machines [SVM] discriminant analysis) and a quantification model (using SVM regression) were constructed with each spectral dataset in order to discriminate cocaine powders from other powders and quantify cocaine in powders classified as cocaine positive. The performances of the models were compared with gas chromatography coupled with mass spectrometry (GC–MS) and gas chromatography with flame-ionization detection (GC–FID). Different evaluation criteria were used: number of false negatives (FNs), number of false positives (FPs), accuracy, root mean square error of cross-validation (RMSECV) and determination coefficients (R2). Ten colored powders were excluded from the classification data set due to the fluorescence background observed in their Raman spectra. For classification, the best accuracy (99.7%) was obtained with MIR spectra. With Raman and NIR spectra, the accuracy was 99.5% and 98.9%, respectively. For quantification, the best results were obtained with NIR spectra: the cocaine content was determined with an RMSECV of 3.79% and an R2 of 0.97. The performance of MIR and Raman in predicting cocaine concentrations was lower than that of NIR, with RMSECVs of 6.76% and 6.79%, respectively, both with an R2 of 0.90. The three spectroscopic techniques can be applied for both classification and quantification of cocaine, but some differences in performance were detected. The best classification was obtained with MIR spectra. For quantification, however, the RMSECV of MIR and Raman was twice as high as that of NIR.
Spectroscopic techniques combined with chemometrics can reduce the workload for confirmation analysis (e.g., chromatography-based) and therefore save time and resources.
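RMSECV is simply the root mean square error computed on out-of-fold predictions, i.e., each sample is predicted by a model calibrated without it. A minimal sketch with made-up cocaine contents (the predictions here are placeholders, not actual model output):

```python
def rmsecv(y_true, y_cv_pred):
    """Root mean square error of cross-validation: RMSE over predictions
    each made while the corresponding sample was held out of calibration."""
    n = len(y_true)
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_cv_pred)) / n) ** 0.5

# Made-up cocaine contents (w%) and their out-of-fold predictions.
y = [10.0, 45.0, 80.0]
y_hat = [12.0, 44.0, 77.0]
error = rmsecv(y, y_hat)
```

Because every prediction is made on a held-out sample, RMSECV is a less optimistic figure than the error on the calibration set itself.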


2019 ◽  
Vol 47 (3) ◽  
pp. 154-170
Author(s):  
Janani Balakumar ◽  
S. Vijayarani Mohan

Purpose Owing to the huge volume of documents available on the internet, text classification becomes a necessary task to handle these documents. To achieve optimal text classification results, feature selection, an important stage, is used to curtail the dimensionality of text documents by choosing suitable features. The main purpose of this research work is to classify personal computer documents based on their content. Design/methodology/approach This paper proposes a new algorithm for feature selection based on an artificial bee colony (ABCFS) to enhance text classification accuracy. The proposed algorithm (ABCFS) is evaluated on real and benchmark data sets and compared against existing feature selection approaches such as information gain and the χ2 statistic. To demonstrate the efficiency of the proposed algorithm, the support vector machine (SVM) and an improved SVM classifier are used in this paper. Findings The experiment was conducted on real and benchmark data sets. The real data set was collected in the form of documents stored on a personal computer, and the benchmark data set was collected from the Reuters and 20 Newsgroups corpora. The results prove the performance of the proposed feature selection algorithm by enhancing text document classification accuracy. Originality/value This paper proposes a new ABCFS algorithm for feature selection, evaluates the efficiency of the ABCFS algorithm and improves the support vector machine. In this paper, the ABCFS algorithm is used to select features from unstructured text documents, whereas in existing work the algorithm has only been used to select features from structured data. The proposed algorithm classifies documents automatically based on their content.
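The core loop of a bee-colony-style subset search can be caricatured as follows. The fitness here is a toy stand-in for the classifier accuracy that ABCFS actually optimizes, and the single greedy "bee" is a heavy simplification of the employed/onlooker/scout phases of a real artificial bee colony.

```python
import random

random.seed(0)

# Toy setup: features 0-2 are "informative"; the fitness rewards selecting
# them and penalizes subset size (a stand-in for SVM accuracy in ABCFS).
INFORMATIVE = {0, 1, 2}
N_FEATURES = 10

def fitness(subset):
    return len(subset & INFORMATIVE) - 0.1 * len(subset)

def neighbour(subset):
    """An 'employed bee' explores a neighbouring subset by flipping one feature."""
    flip = random.randrange(N_FEATURES)
    return subset ^ {flip}

def abc_style_search(iterations=300):
    best = set(random.sample(range(N_FEATURES), 5))
    for _ in range(iterations):
        cand = neighbour(best)
        if fitness(cand) > fitness(best):   # greedy replacement of worse sources
            best = cand
    return best

selected = abc_style_search()
```

With this fitness landscape every single-feature flip toward the informative set improves the score, so the search settles on exactly the informative features.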


Author(s):  
Mohammad Reza Daliri

Abstract In this article, we propose a feature selection strategy using a binary particle swarm optimization algorithm for the diagnosis of different medical diseases. Support vector machines were used for the fitness function of the binary particle swarm optimization. We evaluated our proposed method on four databases from the machine learning repository: the single proton emission computed tomography heart database, the Wisconsin breast cancer data set, the Pima Indians diabetes database, and the Dermatology data set. The results indicate that, with fewer selected features, we obtained higher accuracy in diagnosing heart, cancer, diabetes, and erythemato-squamous diseases. The results were compared with traditional feature selection methods, namely the F-score and information gain, and a superior accuracy was obtained with our method. Compared to the genetic algorithm for feature selection, the proposed method shows higher accuracy on all of the data sets except one. In addition, in comparison with other methods that used the same data, our approach achieves higher performance using fewer features.
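A compact sketch of binary particle swarm optimization for feature selection: each particle is a bit-string over the features, velocities are updated from the personal and global bests, and a sigmoid maps each velocity to the probability of setting the corresponding bit. The fitness below is a toy stand-in for the SVM-based fitness the article uses, and the feature counts and coefficients are illustrative.

```python
import math
import random

random.seed(1)

RELEVANT = {0, 3}                     # toy "diagnostic" features
N_FEATURES, N_PARTICLES, ITERS = 8, 6, 60

def fitness(bits):
    """Stand-in for SVM cross-validated accuracy: reward relevant features,
    penalize subset size."""
    chosen = {i for i, b in enumerate(bits) if b}
    return len(chosen & RELEVANT) - 0.05 * len(chosen)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Initialise particles with random bit-strings and zero velocities.
particles = [[random.randint(0, 1) for _ in range(N_FEATURES)]
             for _ in range(N_PARTICLES)]
velocity = [[0.0] * N_FEATURES for _ in range(N_PARTICLES)]
pbest = [p[:] for p in particles]
gbest = max(pbest, key=fitness)[:]

for _ in range(ITERS):
    for i, p in enumerate(particles):
        for d in range(N_FEATURES):
            r1, r2 = random.random(), random.random()
            velocity[i][d] += 2 * r1 * (pbest[i][d] - p[d]) + 2 * r2 * (gbest[d] - p[d])
            # Sigmoid transfer: velocity becomes the probability of selecting bit d.
            p[d] = 1 if random.random() < sigmoid(velocity[i][d]) else 0
        if fitness(p) > fitness(pbest[i]):
            pbest[i] = p[:]
        if fitness(p) > fitness(gbest):
            gbest = p[:]
```

The global best `gbest` is the selected feature subset; in the article's setting the bits would index medical attributes and the fitness would be SVM diagnostic accuracy.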


2014 ◽  
Vol 701-702 ◽  
pp. 110-113
Author(s):  
Qi Rui Zhang ◽  
He Xian Wang ◽  
Jiang Wei Qin

This paper reports a comparative study of feature selection algorithms on a hyperlipemia data set. Three methods of feature selection were evaluated: document frequency (DF), information gain (IG) and a χ2 statistic (CHI). The classification systems use a vector to represent a document and use tfidfie (term frequency, inverted document frequency, and inverted entropy) to compute term weights. In order to compare the effectiveness of feature selection, we used three classification methods: Naïve Bayes (NB), k Nearest Neighbor (kNN) and Support Vector Machines (SVM). The experimental results show that IG and CHI significantly outperform DF, and that SVM and NB are more effective than kNN when the macro-averaged F1 measure is used. DF is suitable for large-scale text classification tasks.
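The χ2 statistic (CHI) for a term scores the dependence between term occurrence and class membership from a 2×2 contingency table; a minimal sketch with made-up document counts:

```python
def chi2_term(n11, n10, n01, n00):
    """Chi-square statistic for a term/class 2x2 contingency table:
    n11 = in-class docs containing the term, n10 = out-of-class docs containing it,
    n01 = in-class docs without it,         n00 = out-of-class docs without it."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den if den else 0.0

# A term in 40/50 in-class docs but only 5/50 out-of-class docs scores high;
# a term spread evenly across classes scores zero.
discriminative = chi2_term(40, 5, 10, 45)
uninformative = chi2_term(25, 25, 25, 25)
```

Ranking all terms by this score and keeping the top-k is the CHI feature selection step compared against DF and IG above.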


2021 ◽  
Vol 11 ◽  
Author(s):  
Han-Ching Chan ◽  
Amrita Chattopadhyay ◽  
Eric Y. Chuang ◽  
Tzu-Pin Lu

It is difficult to determine which patients with stage I and II colorectal cancer are at high risk of recurrence, qualifying them to undergo adjuvant chemotherapy. In this study, we aimed to determine a gene signature using gene expression data that could successfully identify high risk of recurrence among stage I and II colorectal cancer patients. First, a synthetic minority oversampling technique was used to address the problem of imbalanced data due to rare recurrence events. We then applied a sequential workflow of three methods (significance analysis of microarrays, logistic regression, and recursive feature elimination) to identify genes differentially expressed between patients with and without recurrence. To stabilize the prediction algorithm, we repeated the above processes on 10 subsets by bagging the training data set and then used support vector machine methods to construct the prediction models. The final predictions were determined by majority voting. The 10 models, using 51 differentially expressed genes, successfully predicted a high risk of recurrence within 3 years in the training data set, with a sensitivity of 91.18%. For the validation data sets, the sensitivity of the prediction with samples from two other countries was 80.00% and 91.67%. These prediction models can potentially function as a tool to decide if adjuvant chemotherapy should be administered after surgery for patients with stage I and II colorectal cancer.
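Two pieces of the workflow above lend themselves to a short sketch: SMOTE-style oversampling, which creates a synthetic recurrence sample by interpolating between a minority-class sample and one of its neighbours, and the final majority vote over the 10 bagged models. All values below are toy placeholders.

```python
import random

random.seed(42)

def smote_sample(x, neighbour):
    """One synthetic minority sample: interpolate between a minority-class
    point and one of its minority-class neighbours."""
    lam = random.random()
    return [a + lam * (b - a) for a, b in zip(x, neighbour)]

def majority_vote(predictions):
    """Final recurrence call across the bagged models (1 = high risk)."""
    return 1 if sum(predictions) * 2 > len(predictions) else 0

# Toy expression profiles of two recurrence cases.
x1, x2 = [2.0, 0.5, 1.0], [4.0, 1.5, 3.0]
synthetic = smote_sample(x1, x2)

votes = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]   # one prediction per bagged SVM model
call = majority_vote(votes)
```

The interpolation keeps every synthetic coordinate between the two parent samples, so the oversampled class stays inside the region the real recurrence cases occupy.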


2020 ◽  
Author(s):  
Asa Gholizadeh ◽  
Mohammadmehdi Saberioon ◽  
Eyal Ben Dor ◽  
Raphael A. Viscarra Rossel ◽  
Lubos Boruvka

Forest ecosystems are among the main parts of the biosphere; however, they are endangered by the significant increase and harmful effects of air and soil pollutants, including potentially toxic elements (PTEs). The concentration of PTEs in forest soils varies not only laterally but also vertically with depth. Forest surface organic horizons are of particular interest in forest ecosystem monitoring due to their role as stable adsorbents of deposited atmospheric substances. Therefore, the main purpose of this study was to conduct rapid examination of forest soil PTEs (Cr, Cu, Pb, Zn, and Al), testing the capability of vis–NIR spectroscopy coupled with machine learning (ML) techniques (partial least squares regression (PLSR), support vector machine regression (SVMR), and random forest (RF)) and a fully connected neural network (FNN), a deep learning (DL) approach, in forest organic horizons. A total of 1080 forested sites across the Czech Republic were investigated at two soil layers, defining the fragmented (F) and humus (H) organic horizons (2160 samples in total). PTEs, as well as total Fe and SOC as auxiliary data, were conventionally and spectrally determined and modelled in the combined organic horizons (F + H) and in each individual horizon using the ML and DL algorithms. Results indicated that the concentration of all PTEs was higher in the H horizon than in the F horizon, and the spectral reflectance of samples tended to decrease with increasing PTE concentration. Strongly significant positive correlations between all PTEs and total Fe were obtained in all horizons, and were higher in the H and F + H horizons than in the F horizon. The highest correlations of PTEs with the spectra were at 460–590 nm, which is mostly linked to the presence of Fe-oxide. These results show the importance of Fe for spectral prediction of PTEs. Cr and Al were the most accurately predicted elements, regardless of the applied learning technique. SVMR provided the best results in assessing the H horizon (e.g., R2 = 0.88 and root mean square error (RMSE) = 3.01 mg/kg, and R2 = 0.82 and RMSE = 1682.25 mg/kg for Cr and Al, respectively); however, FNN predicted the combined F + H horizons best (R2 = 0.89 and RMSE = 2.95 mg/kg, and R2 = 0.86 and RMSE = 1593.64 mg/kg for Cr and Al, respectively) due to the larger number of samples. In the F horizon, almost no parameters were predicted adequately. This study shows that, given the availability of larger sample sizes, FNN can be a more promising technique than ML methods for assessing Cr and Al concentrations based on national spectral data in the forests of the Czech Republic.


Author(s):  
Jesmeen Mohd Zebaral Hoque ◽  
Jakir Hossen ◽  
Shohel Sayeed ◽  
Chy. Mohammed Tawsif K. ◽  
Jaya Ganesan ◽  
...  

Recently, the healthcare industry has started generating a large volume of datasets. If hospitals can employ these data, they can predict outcomes and provide better treatments at early stages with low cost. Here, data analytics (DA) was used to make correct decisions through proper analysis and prediction. However, inappropriate data may lead to flawed analysis and thus yield unacceptable conclusions. Hence, transforming the improper data within a data set into useful data is essential. A machine learning (ML) technique was used to overcome the issues caused by incomplete data. A new architecture, automatic missing value imputation (AMVI), was developed to predict missing values in the dataset, including data sampling and feature selection. Four prediction models (i.e., logistic regression, support vector machine (SVM), AdaBoost, and random forest algorithms) were selected from well-known classification algorithms. The complete AMVI architecture's performance was evaluated using a structured data set obtained from the UCI repository. An accuracy of around 90% was achieved. It was also confirmed by cross-validation that the trained ML model is suitable and not over-fitted. The trained model is built from the dataset itself and is not dependent on a specific environment; the architecture trains and selects the best-performing model depending on the data available.
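The idea behind predictive missing-value imputation can be sketched in a few lines: fit a model on the complete rows and use it to fill the missing entries. This toy version uses a one-variable least-squares fit and made-up data; it is not the AMVI architecture's actual sampling, feature selection, or classifier ensemble.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit y = a*x + b on the complete rows."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def impute(rows):
    """Predict each missing second-column value (None) from the first column,
    using a model trained only on the complete rows."""
    complete = [(x, y) for x, y in rows if y is not None]
    a, b = fit_line(*zip(*complete))
    return [(x, y if y is not None else a * x + b) for x, y in rows]

# Toy records: the last row has a missing value to be imputed.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.1), (4.0, None)]
filled = impute(data)
```

Cross-validating the downstream classifiers on the imputed table, as in the abstract, then checks that the filled values do not cause over-fitting.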


2021 ◽  
Vol 348 ◽  
pp. 01002
Author(s):  
Assia Najm ◽  
Abdelali Zakrani ◽  
Abdelaziz Marzak

Software cost prediction is a crucial element of a project's success because it helps project managers efficiently estimate the effort needed for any project. Many machine learning methods exist in the literature, such as decision trees, artificial neural networks (ANN), and support vector regressors (SVR). However, many studies confirm that accurate estimation greatly depends on hyperparameter optimization and on proper input feature selection, which highly impacts the accuracy of software cost prediction models (SCPM). In this paper, we propose an enhanced model using SVR and the Optainet algorithm. Optainet is used simultaneously to (1) select the best set of features and (2) tune the parameters of the SVR model. The experimental evaluation was conducted using a 30% holdout over seven datasets. The performance of the suggested model is then compared to a tuned SVR model using Optainet without feature selection. The results were also compared to the Boruta and random forest feature selection methods. The experiments show that, across all datasets, the Optainet-based method significantly improves the accuracy of the SVR model and outperforms the random forest and Boruta feature selection methods.
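Optainet (opt-aiNet) is an artificial immune network optimizer whose clone-and-mutate core can be caricatured as below for hyperparameter tuning. The loss function is a toy stand-in for SVR cross-validation error, and a real opt-aiNet maintains a whole network of cells with affinity-proportional mutation and suppression, which this greedy single-cell sketch omits.

```python
import random

random.seed(7)

def loss(c, eps):
    """Toy stand-in for the cross-validated error of an SVR with
    hyperparameters C and epsilon (minimum placed at C=10, eps=0.1)."""
    return (c - 10.0) ** 2 + (eps - 0.1) ** 2

def clone_and_mutate(c, eps, scale):
    """Raise one mutated clone of the current cell."""
    return c + random.gauss(0, scale), eps + random.gauss(0, scale * 0.01)

def immune_search(generations=100, clones=5):
    """Simplified opt-aiNet-style loop: keep the best cell, raise mutated
    clones around it, and shrink the mutation radius over time."""
    best = (random.uniform(0.1, 100.0), random.uniform(0.01, 1.0))
    for g in range(generations):
        scale = 5.0 / (1 + g * 0.1)
        for _ in range(clones):
            cand = clone_and_mutate(*best, scale)
            if loss(*cand) < loss(*best):
                best = cand
    return best

c, eps = immune_search()
```

In the paper's setting the candidate vector would also carry a feature-subset mask, so one search jointly performs selection and tuning.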

