Optimum Feature Subset for Optimizing Crop Yield Prediction Using Filter and Wrapper Approaches

2019 ◽  
Vol 35 (1) ◽  
pp. 9-14 ◽  
Author(s):  
P. S. Maya Gopal ◽  
R Bhargavi

Abstract. In agriculture, crop yield prediction is critical. Crop yield depends on various features, which can be categorized as geographical, climatic, and biological. Geographical features consist of cultivable land in hectares, canal length to cover the cultivable land, and the number of tanks and tube wells available for irrigation. Climatic features consist of rainfall, temperature, and radiation. Biological features consist of seeds, minerals, and nutrients. In total, 15 features were considered in this study to understand their impact on paddy crop yield for all seasons of each year. To select the vital features, five filter and wrapper approaches were applied. To assess the accuracy of each feature selection algorithm, a Multiple Linear Regression (MLR) model was used. The RMSE, MAE, R, and RRMSE metrics were used to evaluate the performance of the feature selection algorithms. Data used for the analysis was drawn from secondary sources of the state Agriculture Department, Government of Tamil Nadu, India, covering more than 30 years. Seventy-five percent of the data was used for training and 25% for testing. Low computational time was also considered in selecting the best feature subset. All feature selection algorithms gave similar results for the RMSE, RRMSE, R, and MAE values. The adjusted R² value was therefore used to find the optimum feature subset despite the small deviations. The evaluation of the dataset used in this work shows that total area of cultivation, number of tanks and open wells used for irrigation, length of canals used for irrigation, and average maximum temperature during the crop season are the best features for crop yield prediction in the study area. The MLR gives 85% model accuracy for the selected features with low computational time. Keywords: Feature selection algorithm, Model validation, Multiple linear regression, Performance metrics.
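The metrics named in this abstract can be computed as in the following minimal sketch. The abstract does not give the formulas, so two choices here are assumptions: RRMSE is taken as RMSE divided by the mean observed value, and adjusted R² uses the usual correction for the number of predictors. The function name `regression_metrics` and its arguments are hypothetical.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Return RMSE, MAE, R, RRMSE and adjusted R^2 for one fitted model."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    errors = [p - t for t, p in zip(y_true, y_pred)]

    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    rrmse = rmse / mean_true                     # assumption: relative to mean observation

    # Pearson correlation between observed and predicted values
    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    var_true = sum((t - mean_true) ** 2 for t in y_true)
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    r = cov / math.sqrt(var_true * var_pred)

    # Adjusted R^2 penalizes models that use more predictors
    r2 = 1.0 - ss_res / var_true
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

    return {"rmse": rmse, "mae": mae, "r": r, "rrmse": rrmse, "adj_r2": adj_r2}
```

Because adjusted R² falls when an extra feature adds little explanatory power, it can separate feature subsets whose RMSE and MAE are nearly identical, which matches the role the abstract gives it.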

Author(s):  
Maya Gopal P S ◽  
Bhargavi R

In agriculture, crop yield prediction is critical. Crop yield depends on various features, including geographic, climatic and biological ones. This research article discusses five Feature Selection (FS) algorithms, namely Sequential Forward FS (SFFS), Sequential Backward Elimination FS, Correlation-based FS, Random Forest Variable Importance and the Variance Inflation Factor algorithm. Data used for the analysis was drawn from secondary sources of the Tamil Nadu state Agriculture Department for a period of 30 years. 75% of the data was used for training and 25% for testing. The performance of the feature selection algorithms is evaluated with Multiple Linear Regression (MLR). The RMSE, MAE, R and RRMSE metrics are calculated for each feature selection algorithm. The adjusted R² was used to find the optimum feature subset. The time complexity of the algorithms was also considered. The selected features are applied to MLR, an Artificial Neural Network and M5Prime. MLR gives 85% accuracy using the features selected by the SFFS algorithm.
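As an illustration of the wrapper idea behind sequential forward selection, here is a minimal sketch in plain Python. It is not the authors' implementation: the stopping rule (stop when validation RMSE no longer improves), the use of a separate validation split, and the helper names `fit_predict`, `rmse` and `forward_select` are all assumptions made for the example. The least-squares fit solves the normal equations directly and will fail on exactly collinear features.

```python
import math

def fit_predict(X_train, y_train, X_new):
    """Fit ordinary least squares with an intercept, then predict for X_new."""
    rows = [[1.0] + list(r) for r in X_train]
    k = len(rows[0])
    # Augmented normal equations: (A^T A | A^T y)
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * y for r, y in zip(rows, y_train))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    beta = [aug[i][k] / aug[i][i] for i in range(k)]
    return [beta[0] + sum(b * x for b, x in zip(beta[1:], r)) for r in X_new]

def rmse(y_true, y_pred):
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def forward_select(X_train, y_train, X_val, y_val, max_features):
    """Greedily add the feature that most reduces validation RMSE."""
    selected, remaining = [], list(range(len(X_train[0])))
    best_score = float("inf")
    while remaining and len(selected) < max_features:
        scores = []
        for f in remaining:
            cols = selected + [f]
            pred = fit_predict([[r[c] for c in cols] for r in X_train], y_train,
                               [[r[c] for c in cols] for r in X_val])
            scores.append((rmse(y_val, pred), f))
        score, f = min(scores)
        if score >= best_score:   # no improvement: stop early
            break
        best_score = score
        selected.append(f)
        remaining.remove(f)
    return selected, best_score
```

Backward elimination follows the same pattern in reverse: start from all features and repeatedly drop the one whose removal hurts the score least.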


2021 ◽  
pp. 1-19
Author(s):  
Lulu Li

Set-valued data is a significant kind of data; examples include data obtained from different search engines, market data, and patients' symptoms and behaviours. An information system (IS) based on incomplete set-valued data is called an incomplete set-valued information system (ISVIS), which is a generalized model of a single-valued incomplete information system. This paper studies feature selection for an ISVIS by means of uncertainty measurement. Firstly, the similarity degree between two information values on a given feature of an ISVIS is proposed. Then, the tolerance relation on the object set with respect to a given feature subset in an ISVIS is obtained. Next, λ-reduction in an ISVIS is presented. Moreover, connections between the proposed feature selection and uncertainty measurement are exhibited. Lastly, feature selection algorithms based on the λ-discernibility matrix, λ-information granulation, λ-information entropy and λ-significance in an ISVIS are provided. To demonstrate the practical significance of the provided algorithms, a numerical experiment is carried out; the results show the number of features and average size of features obtained by each feature selection algorithm.


Author(s):  
Manpreet Kaur ◽  
Chamkaur Singh

Educational Data Mining (EDM) is an emerging research area that helps educational institutions improve the performance of their students. Feature Selection (FS) algorithms remove irrelevant data from the educational dataset and hence increase the performance of the classifiers used in EDM techniques. This paper presents an analysis of the performance of feature selection algorithms on a student data set. The paper defines several problems in its problem formulation, which are left to be resolved in future work. Furthermore, the paper attempts to play a positive role in improving education quality and to guide new researchers in making academic interventions.


PLoS ONE ◽  
2021 ◽  
Vol 16 (8) ◽  
pp. e0255307
Author(s):  
Fujun Wang ◽  
Xing Wang

Feature selection is an important task in big data analysis and information retrieval processing. It reduces the number of features by removing noisy and extraneous data. In this paper, a feature subset selection algorithm based on damping oscillation theory and a support vector machine classifier is proposed. This algorithm is called the Maximum Kendall coefficient Maximum Euclidean Distance Improved Gray Wolf Optimization algorithm (MKMDIGWO). In MKMDIGWO, first, a filter model based on the Kendall coefficient and Euclidean distance is proposed, which is used to measure the correlation and redundancy of the candidate feature subset. Second, the wrapper model is an improved grey wolf optimization algorithm whose position update formula has been modified to achieve better results. Third, the filter model and the wrapper model are dynamically adjusted by the damping oscillation theory to find an optimal feature subset. MKMDIGWO therefore achieves both the efficiency of the filter model and the high precision of the wrapper model. Experimental results on five UCI public data sets and two microarray data sets demonstrate that the MKMDIGWO algorithm has higher classification accuracy than four other state-of-the-art algorithms. The maximum ACC value of the MKMDIGWO algorithm is at least 0.5% higher than other algorithms on 10 data sets.
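To make the filter step more concrete, the sketch below ranks features by the absolute value of Kendall's rank correlation with the class label. It shows only the Kendall relevance part; the abstract does not specify how MKMDIGWO combines it with Euclidean distance, so that combination is not reproduced. The tau-a variant used here (which does not correct for ties) and the function names are assumptions made for the example.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

def rank_features(columns, target):
    """Return (name, tau) pairs, strongest absolute correlation first.

    `columns` maps a feature name to its list of values (hypothetical layout).
    """
    scored = [(name, kendall_tau(values, target)) for name, values in columns.items()]
    return sorted(scored, key=lambda item: -abs(item[1]))
```

A filter like this is cheap because it never trains a classifier, which is why hybrid methods typically use it to narrow the search before a wrapper evaluates candidate subsets.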


2021 ◽  
Author(s):  
Mariette Vreugdenhil ◽  
Isabella Pfeil ◽  
Luca Brocca ◽  
Stefania Camici ◽  
Markus Enenkel ◽  
...  

Accurate and reliable early warning systems can support anticipatory disaster risk financing, which can be more cost-effective than post-disaster emergency response. One of the challenges in anticipatory disaster risk financing is basis risk, as a result of data and model uncertainty. The increasing availability of Earth Observation (EO) data provides the opportunity to develop shadow models or include different variables in early warning systems and weather index insurance. Of particular interest is the early indication of climate impacts on agricultural production. Traditionally, crop and yield prediction models use meteorological data such as precipitation and temperature, or optical-based indicators such as the Normalized Difference Vegetation Index (NDVI), for yield prediction. In recent years, soil moisture has gained popularity for yield prediction as it controls the water availability for plants.

Here, we will present the use of different satellite-based rainfall and soil moisture products, in combination with NDVI, to develop a yield deficiency indicator over two water-limited regions. An analysis for Senegal and Morocco is performed at the national level using yield data of four major crops from the Food and Agriculture Organization of the United Nations. Freely available EO datasets for rainfall, soil moisture, root zone soil moisture and NDVI were used. All datasets were spatially resampled to a 0.1° grid, temporally aggregated to monthly anomalies and finally detrended and standardized. First, regression analysis with yearly yield was performed per EO dataset for single months. For this, EO datasets were aggregated over areas where the specific crop was grown. Secondly, based on these results, multiple linear regression was performed using the months and variables with the highest explanatory power. The multiple linear regression was used to provide spatially varying yield predictions by trading time for space. The spatial predictions were validated using sub-national yield data from Senegal.

The analysis demonstrates the added value of satellite soil moisture for early yield prediction. Both in Senegal and Morocco, rainfall and soil moisture showed a high predictive skill early in the growing season: negative early-season soil moisture anomalies often lead to low yield. NDVI showed more predictive power later in the growing season. For example, in Morocco soil moisture at the start of the season can already explain 56% of the variability in yield. NDVI can explain 80% of the yield; however, this is at the end of the growing season. Combining anomalies of the optimal months based on the different variables in multiple linear regression improved yield prediction. Again, including NDVI led to higher predictive power, at the cost of early warning. This analysis shows very clearly that soil moisture can be a valuable tool for anticipatory drought risk financing and early warning systems.
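The preprocessing step described in this abstract (detrending and standardizing a yearly series for one calendar month) might look like the following minimal sketch. The abstract does not say which trend model was used, so the linear least-squares trend over time is an assumption, as is the function name. The input is a hypothetical list with one value per year.

```python
import statistics

def detrend_and_standardize(values):
    """Remove a linear trend over time, then scale residuals to unit variance."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    # Least-squares slope of value against time index 0..n-1
    slope = (sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
             / sum((t - t_mean) ** 2 for t in range(n)))
    residuals = [v - (v_mean + slope * (t - t_mean)) for t, v in enumerate(values)]
    # Population standard deviation; a perfectly linear series gives 0 and fails here
    sd = statistics.pstdev(residuals)
    return [r / sd for r in residuals]
```

After this step a value of -1.0 means the month was one standard deviation below what the long-term trend would predict, which makes series such as rainfall, soil moisture and NDVI directly comparable as regression inputs.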


2016 ◽  
pp. 1099-1114
Author(s):  
Zongyuan Zhao ◽  
Shuxiang Xu ◽  
Byeong Ho Kang ◽  
Mir Md Jahangir Kabir ◽  
Yunling Liu ◽  
...  

Artificial neural networks (ANNs) have shown impressive ability on many real-world problems such as pattern recognition, classification and function approximation. An extension of the ANN, the higher order neural network (HONN), improves the ANN's computational and learning capabilities. However, the large number of higher order attributes leads to long learning times and complex network structures. Some irrelevant higher order attributes can also hinder the performance of a HONN. In this chapter, feature selection algorithms are used to simplify the HONN architecture. Comparisons of a fully connected HONN with a feature-selected HONN demonstrate that proper feature selection can be effective in decreasing the number of inputs, reducing computational time, and improving the prediction accuracy of the HONN.


Author(s):  
Hui Wang ◽  
Li Li Guo ◽  
Yun Lin

Automatic modulation recognition is very important for receiver design in broadband multimedia communication systems, and a reasonable signal feature extraction and selection algorithm is the key technology of digital multimedia signal recognition. In this paper, information entropy is used to extract individual features, namely power spectrum entropy, wavelet energy spectrum entropy, singular spectrum entropy and Renyi entropy. Then, a feature selection algorithm based on distance measurement and Sequential Feature Selection (SFS) is presented to select the optimal feature subset. Finally, a BP neural network is used to classify the signal modulation. The simulation results show that the four different information entropies can be used to classify different signal modulations, and that the feature selection algorithm successfully chooses the optimal feature subset and achieves the best performance.
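Entropy features of this kind share one core calculation: normalize a non-negative spectrum so it sums to one, then take its Shannon entropy. The sketch below shows that step only. Computing the power spectrum itself (for example with an FFT) is omitted, and the function name and the base-2 logarithm are assumptions made for the example.

```python
import math

def spectrum_entropy(power):
    """Shannon entropy (in bits) of a spectrum normalized to a distribution.

    `power` is a hypothetical list of non-negative spectral power values.
    """
    total = sum(power)
    probs = [p / total for p in power if p > 0]   # zero bins contribute nothing
    return -sum(p * math.log2(p) for p in probs)
```

A flat spectrum gives the maximum entropy and a single dominant spectral line gives an entropy near zero, which is what lets such a feature separate modulation types with different spectral shapes.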


2018 ◽  
Vol 2018 ◽  
pp. 1-14 ◽  
Author(s):  
Wen-Pei Chen ◽  
Shih-Hao Chang ◽  
Chuan-Yi Tang ◽  
Ming-Li Liou ◽  
Suh-Jen Jane Tsai ◽  
...  

Periodontitis is an inflammatory disease involving complex interactions between oral microorganisms and the host immune response. Understanding the structure of the microbiota community associated with periodontitis is essential for improving classifications and diagnoses of various types of periodontal diseases and will facilitate clinical decision-making. In this study, we used a 16S rRNA metagenomics approach to investigate and compare the compositions of the microbiota communities from 76 subgingival plaque samples, including 26 from healthy individuals and 50 from patients with periodontitis. Furthermore, we propose a novel feature selection algorithm for selecting the more informative features from many variables; combinations of these features and machine learning methods were used to construct models for predicting the health status of patients with periodontal disease. We identified a total of 12 phyla, 124 genera, and 355 species and observed differences between health- and periodontitis-associated bacterial communities at all phylogenetic levels. We discovered that the genera Porphyromonas, Treponema, Tannerella, Filifactor, and Aggregatibacter were more abundant in patients with periodontal disease, whereas Streptococcus, Haemophilus, Capnocytophaga, Gemella, Campylobacter, and Granulicatella were found at higher levels in healthy controls. Using our feature selection algorithm, random forests performed better in terms of predictive power than other methods and consumed the least amount of computational time.


2013 ◽  
Vol 22 (04) ◽  
pp. 1350027
Author(s):  
JAGANATHAN PALANICHAMY ◽  
KUPPUCHAMY RAMASAMY

Feature selection is essential in data mining and pattern recognition, especially for database classification. In past years, several feature selection algorithms have been proposed to measure the relevance of various features to each class. A suitable feature selection algorithm normally maximizes the relevancy and minimizes the redundancy of the selected features. The mutual information measure can successfully estimate the dependency of features on the entire sampling space, but it cannot exactly represent the redundancies among features. In this paper, a novel feature selection algorithm is proposed based on the maximum relevance and minimum redundancy criterion. Mutual information is used to measure the relevancy of each feature to the class variable, and redundancy is calculated from the relationship between candidate features, selected features and class variables. The effectiveness is tested on ten benchmark datasets available in the UCI Machine Learning Repository. The experimental results show better performance when compared with some existing algorithms.
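The maximum relevance, minimum redundancy criterion can be illustrated with the following minimal sketch for discrete features. It uses the common difference form (relevance minus mean redundancy with already-selected features), which is an assumption: the abstract describes a novel redundancy term involving the class variable that is not reproduced here. The function names and the dictionary layout of `features` are hypothetical.

```python
import math
from collections import Counter

def mutual_information(a, b):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log2(c * n / (count_a[x] * count_b[y]))
               for (x, y), c in count_ab.items())

def mrmr(features, target, k):
    """Greedily pick k feature names by relevance minus mean redundancy."""
    selected = []
    candidates = list(features)

    def score(name):
        relevance = mutual_information(features[name], target)
        if not selected:
            return relevance
        redundancy = sum(mutual_information(features[name], features[s])
                         for s in selected) / len(selected)
        return relevance - redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With a feature and an exact copy of it both available, the copy scores zero once the original is selected, so a different, non-overlapping feature is chosen next.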


2018 ◽  
Author(s):  
Matheus B. De Moraes ◽  
André L. S. Gradvohl

Data streams are transmitted at high speed and in huge volumes, and may contain critical information that needs processing in real time. Hence, to reduce computational cost and time, the system may apply a feature selection algorithm. However, this is not a trivial task because of concept drift. In this work, we show that two feature selection algorithms, Information Gain and Online Feature Selection, perform worse than classification without feature selection. Each algorithm did, however, give better results in one distinct scenario, with final accuracies up to 14% higher. The experiments, using both real and artificial datasets, show the potential of these methods due to their better adaptability in some concept drift situations.

