Ensemble of Data-Driven Prognostic Algorithms With Weight Optimization and K-Fold Cross Validation

Author(s):
Chao Hu
Byeng D. Youn
Pingfeng Wang

The traditional data-driven prognostic approach is to construct multiple candidate algorithms using a training data set, evaluate their respective performance using a testing data set, and select the one with the best performance while discarding all the others. This approach has three shortcomings: (i) the selected standalone algorithm may not be robust, i.e., it may be less accurate when the real data acquired after deployment differ from the testing data; (ii) it wastes the resources spent constructing the algorithms that are discarded at deployment; and (iii) it requires the testing data in addition to the training data, which increases the overall expense of algorithm selection. To overcome these drawbacks, this paper proposes an ensemble data-driven prognostic approach that combines multiple member algorithms with a weighted-sum formulation. Three weighting schemes, namely accuracy-based weighting, diversity-based weighting, and optimization-based weighting, are proposed to determine the weights of member algorithms for data-driven prognostics. K-fold cross validation (CV) is employed to estimate the prediction error required by the weighting schemes. Two case studies were employed to demonstrate the effectiveness of the proposed prognostic approach. The results suggest that the ensemble approach with any weighting scheme gives more accurate remaining useful life (RUL) predictions than any sole algorithm and that the optimization-based weighting scheme gives the best overall performance among the three.
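As a concrete illustration of the accuracy-based weighting scheme, the following sketch weights each member algorithm by the inverse of its k-fold CV error and combines the members through a weighted sum. The member algorithms, synthetic data, and inverse-error rule are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

def accuracy_based_weights(members, X, y, k=5):
    """Weight each member algorithm by the inverse of its k-fold CV error."""
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    errors = np.zeros(len(members))
    for train_idx, val_idx in kf.split(X):
        for i, model in enumerate(members):
            model.fit(X[train_idx], y[train_idx])
            pred = model.predict(X[val_idx])
            errors[i] += np.mean((pred - y[val_idx]) ** 2)
    inv = 1.0 / errors            # smaller CV error -> larger weight
    return inv / inv.sum()        # normalize so the weights sum to one

def ensemble_predict(members, weights, X_train, y_train, X_new):
    """Weighted-sum prediction over all member algorithms."""
    preds = np.column_stack(
        [m.fit(X_train, y_train).predict(X_new) for m in members]
    )
    return preds @ weights

# Illustrative usage with synthetic degradation-style data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.3, 200)
members = [Ridge(),
           KNeighborsRegressor(n_neighbors=5),
           DecisionTreeRegressor(max_depth=4)]
w = accuracy_based_weights(members, X, y)
rul_hat = ensemble_predict(members, w, X, y, X[:5])
```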

Author(s):
Chao Hu
Byeng D. Youn
Pingfeng Wang
Joung Taek Yoon

Prognostics aims at determining whether a failure of an engineered system (e.g., a nuclear power plant) is impending and estimating the remaining useful life (RUL) before the failure occurs. The traditional data-driven prognostic approach involves three steps: (Step 1) construct multiple candidate algorithms using a training data set; (Step 2) evaluate their respective performance using a testing data set; and (Step 3) select the one with the best performance while discarding all the others. This traditional approach faces three main challenges: (i) lack of robustness in the selected standalone algorithm; (ii) waste of the resources spent constructing the algorithms that are discarded; and (iii) demand for testing data in addition to training data. To address these challenges, this paper proposes an ensemble approach for data-driven prognostics. The approach combines multiple member algorithms with a weighted-sum formulation, where the weights are estimated by one of three weighting schemes: accuracy-based weighting, diversity-based weighting, and optimization-based weighting. To estimate the prediction error required by the accuracy- and optimization-based weighting schemes, we propose using k-fold cross validation (CV) as a robust error estimator. The performance of the proposed ensemble approach is verified with three engineering case studies. All the case studies show that the ensemble approach achieves better RUL prediction accuracy than any sole algorithm when member algorithms with good diversity show comparable prediction accuracy.
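One plausible reading of the optimization-based weighting scheme is to choose the weights that directly minimize the CV error of the weighted sum, subject to the weights being non-negative and summing to one. The sketch below implements that reading using out-of-fold predictions from scikit-learn's `cross_val_predict`; the member algorithms and constraints are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

def optimization_based_weights(members, X, y, k=5):
    """Pick the weights minimizing the k-fold CV error of the weighted sum."""
    # Column i holds member i's out-of-fold (CV) predictions.
    P = np.column_stack([cross_val_predict(m, X, y, cv=k) for m in members])

    def cv_mse(w):
        return np.mean((P @ w - y) ** 2)

    n = len(members)
    res = minimize(
        cv_mse,
        np.full(n, 1.0 / n),                    # start from equal weights
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n,                # non-negative weights
        constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},
    )
    return res.x

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.3, 200)
members = [Ridge(),
           KNeighborsRegressor(n_neighbors=5),
           DecisionTreeRegressor(max_depth=4)]
w = optimization_based_weights(members, X, y)
```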


2016
Vol 7 (2)
pp. 48-58
Author(s):
Ivana Herliana W. Jayawardanu
Seng Hansun

In 2010, cataract caused 51% of the estimated 39 million cases of blindness worldwide. In 2013, 1.8% of 1,027,763 surveyed Indonesians suffered from cataract, and half of them had not yet been treated because they were unaware of the disease. Therefore, in this research, we tried to build a system that can detect cataract disease early, as an ophthalmologist would. The system uses the C4.5 algorithm, which receives a training data set of 150 records as input and produces a set of rules that serve as decision factors. To test the system, the k-fold cross validation technique is used with k equal to 10. The analysis shows that the system's accuracy is 93.2% in detecting cataract disease and 80.5% in detecting the type of cataract one might suffer from. Index terms: C4.5 algorithm, cataract, k-fold cross validation, machine learning
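scikit-learn does not ship C4.5 itself; as a hedged approximation, `DecisionTreeClassifier` with the entropy criterion is the closest stock analogue, and the 10-fold evaluation below mirrors the k = 10 setup described above. The feature matrix is a synthetic placeholder for the 150-record training set.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic placeholder for the 150-record training set; the real
# features and labels would come from ophthalmologist assessments.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 6))
y = rng.integers(0, 2, size=150)  # 1 = cataract, 0 = healthy (illustrative)

# C4.5 builds entropy-based trees; DecisionTreeClassifier with the
# "entropy" criterion is the closest scikit-learn analogue.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)

# 10-fold cross validation, mirroring the k = 10 setup described above.
scores = cross_val_score(clf, X, y, cv=10)
print(f"mean accuracy over 10 folds: {scores.mean():.3f}")
```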


2021
Vol 1201 (1)
pp. 012005
Author(s):
B Morais da Costa
J Þ Snæbjörnsson
O A Øiseth
J Wang
J B Jakobsen

Abstract This study presents a data-driven model to predict mean turbulence intensities at desired generic locations, for all wind directions. The model, a multilayer perceptron, requires only information about the local topography and a historical dataset of wind measurements and topography at other locations. Five years of data from six different wind measurement mast locations were used. A k-fold cross-validation scheme evaluated the model at each location: four locations were used for training, another location for validation, and the remaining one was used to test the model. The model outperformed the approach given in the European standard for both performance metrics used. The results of different hyperparameter optimizations are presented, allowing for uncertainty estimates of the model performances.
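The described split (train on four masts, validate on another, test on the remaining one) is essentially a grouped cross-validation over locations. A minimal sketch with scikit-learn's `LeaveOneGroupOut` follows; the features, targets, and network size are placeholders, and the separate validation mast is noted but omitted for brevity.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data: each row is a wind record, `groups` marks which of
# the six masts it came from; features would encode local topography
# and wind direction, the target is mean turbulence intensity.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))
y = rng.uniform(0.05, 0.3, size=600)
groups = np.repeat(np.arange(6), 100)

for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    # One mast is held out for testing; the study additionally holds a
    # second mast out of training for validation (omitted here).
    model = make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                     random_state=0),
    )
    model.fit(X[train_idx], y[train_idx])
    rmse = np.sqrt(np.mean((model.predict(X[test_idx]) - y[test_idx]) ** 2))
    print(f"held-out mast {groups[test_idx][0]}: RMSE = {rmse:.4f}")
```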


Author(s):
Mohamed Guendouz
Abdelmalek Amine
Reda Mohamed Hamou

The GitHub website is nowadays an essential tool for developers from around the world; it serves as a social network in which they can share their open source projects with others in the form of repositories. This paper presents and discusses the design and implementation of a new recommender system for GitHub repositories based on a collaborative-filtering approach, which can be useful in many ways when searching for the right solutions to build projects. As GitHub grows very popular and millions of developers share their projects on it, such a recommender system can reduce searching time and make search results more relevant. The authors evaluate their system by conducting a set of experiments on a real data set using different well-known metrics and the k-fold cross validation method. The results obtained from these experiments are very promising: the authors found that their recommender system can reach better precision and recall.
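The paper summary does not spell out the exact collaborative-filtering variant; a common item-based formulation scores unseen repositories by their cosine similarity to the repositories a user has already interacted with. The interaction matrix below is a hypothetical toy example.

```python
import numpy as np

# Rows: users, columns: GitHub repositories; 1 means the user starred
# or contributed to the repository (hypothetical interaction matrix).
R = np.array([
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 0, 1, 1],
], dtype=float)

def column_cosine_similarity(M):
    """Pairwise cosine similarity between columns (repositories)."""
    norms = np.linalg.norm(M, axis=0)
    norms[norms == 0] = 1.0
    return (M.T @ M) / np.outer(norms, norms)

def recommend(R, user, top_n=2):
    """Score unseen repositories by similarity to the user's known ones."""
    sim = column_cosine_similarity(R)
    scores = R[user] @ sim            # aggregate item-item similarity
    scores[R[user] > 0] = -np.inf     # never re-recommend seen repositories
    return np.argsort(scores)[::-1][:top_n]

print(recommend(R, user=1))           # repository indices to suggest
```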


2014
Vol 556-562
pp. 4244-4247
Author(s):
Zhen Duo Wang
Jing Wang
Huan Wang

Quantization methods are a significant mining task supporting operations such as learning and classification in machine learning and data mining, since many mining and learning methods in these fields require that the data set contain partitioned (discretized) features. In this paper, we propose a spheriform quantization method based on sub-region inherent dimension, which induces the number and size of quantified intervals in a data-driven way. The method assumes that a quantified cluster of points can be contained in a lower intrinsic m-dimensional spheriform space of expected radius. The sample points in the spheriform can be obtained by adaptively selecting the neighborhood of an initial observation based on sub-region inherent dimension. Experimental results and analysis on UCI real data sets demonstrate that our method significantly enhances classification accuracy compared to traditional quantization methods when used with the C4.5 decision tree.
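The spheriform construction itself is not fully specified in this summary; as a loose, hedged analogue of data-driven quantization, the sketch below derives interval cut points from the cluster structure of a feature rather than from fixed-width bins.

```python
import numpy as np
from sklearn.cluster import KMeans

def data_driven_bins(values, n_intervals=4):
    """Derive interval cut points from the cluster structure of a feature
    instead of fixed-width bins (a loose analogue only)."""
    km = KMeans(n_clusters=n_intervals, n_init=10, random_state=0)
    km.fit(values.reshape(-1, 1))
    centers = np.sort(km.cluster_centers_.ravel())
    cuts = (centers[:-1] + centers[1:]) / 2.0   # midpoints between centers
    return np.digitize(values, cuts), cuts

# Two well-separated modes: the induced cut point falls between them.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(8, 1, 100)])
codes, cuts = data_driven_bins(x, n_intervals=2)
```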


Water
2021
Vol 13 (1)
pp. 107
Author(s):
Elahe Jamalinia
Faraz S. Tehrani
Susan C. Steele-Dunne
Philip J. Vardon

Climatic conditions and vegetation cover influence water flux in a dike, and potentially the dike stability. A comprehensive numerical simulation is computationally too expensive to be used for the near real-time analysis of a dike network. Therefore, this study investigates a random forest (RF) regressor to build a data-driven surrogate for a numerical model to forecast the temporal macro-stability of dikes. To that end, daily inputs and outputs of a ten-year coupled numerical simulation of an idealised dike (2009–2019) are used to create a synthetic data set, comprising features that can be observed from a dike surface, with the calculated factor of safety (FoS) as the target variable. The data set before 2018 is split into training and testing sets to build and train the RF. The predicted FoS is strongly correlated with the numerical FoS for data that belong to the test set (before 2018). However, the trained model shows lower performance for data in the evaluation set (after 2018) if further surface cracking occurs. This proof-of-concept shows that a data-driven surrogate can be used to determine dike stability for conditions similar to the training data, which could be used to identify vulnerable locations in a dike network for further examination.
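A minimal sketch of the surrogate setup, assuming a scikit-learn random forest and the temporal split described above (fit on pre-2018 data, evaluate on post-2018 data); the feature names and synthetic series are placeholders for the outputs of the coupled numerical simulation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Placeholder daily series (2009-2019): surface-observable features and
# the simulated factor of safety (FoS) as the target, standing in for
# the synthetic data set produced by the coupled numerical model.
dates = pd.date_range("2009-01-01", "2019-12-31", freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(len(dates), 4)), index=dates,
                  columns=["rainfall", "evaporation", "ndvi", "crack_index"])
df["FoS"] = 1.5 - 0.1 * df["rainfall"].rolling(30, min_periods=1).mean() \
            + rng.normal(0, 0.02, len(dates))

# Train and test on pre-2018 data, evaluate on post-2018 data,
# mirroring the temporal split described above.
train = df[df.index < "2018-01-01"]
evaluation = df[df.index >= "2018-01-01"]

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(train.drop(columns="FoS"), train["FoS"])
fos_hat = rf.predict(evaluation.drop(columns="FoS"))
```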


SPE Journal
2021
pp. 1-25
Author(s):
Chang Gao
Juliana Y. Leung

Summary The steam-assisted gravity drainage (SAGD) recovery process is strongly impacted by the spatial distributions of heterogeneous shale barriers. Though detailed compositional flow simulators are available for SAGD recovery performance evaluation, the simulation process is usually quite computationally demanding, rendering their use over a large number of reservoir models for assessing the impacts of heterogeneity (uncertainties) impractical. In recent years, data-driven proxies have been widely proposed to reduce the computational effort; nevertheless, the proxy must be trained using a large data set consisting of many flow simulation cases that are ideally spanning the model parameter spaces. The question remains: is there a more efficient way to screen a large number of heterogeneous SAGD models? Such techniques could help to construct a training data set with less redundancy; they can also be used to quickly identify a subset of heterogeneous models for detailed flow simulation. In this work, we formulated two particular distance measures, flow-based and static-based, to quantify the similarity among a set of 3D heterogeneous SAGD models. First, to formulate the flow-based distance measure, a physics-based particle-tracking model is used: Darcy's law and energy balance are integrated to mimic the steam chamber expansion process; steam particles that are located at the edge of the chamber would release their energy to the surrounding cold bitumen, while detailed fluid displacements are not explicitly simulated. The steam chamber evolution is modeled, and a flow-based distance between two given reservoir models is defined as the difference in their chamber sizes over time. Second, to formulate the static-based distance, the Hausdorff distance (Hausdorff 1914) is used: it is often used in image processing to compare two images according to their corresponding spatial arrangement and shapes of various objects. A suite of 3D models is constructed using representative petrophysical properties and operating constraints extracted from several pads in Suncor Energy's Firebag project. The computed distance measures are used to partition the models into different groups. To establish a baseline for comparison, flow simulations are performed on these models to predict the actual chamber evolution and production profiles. The grouping results according to the proposed flow- and static-based distance measures match reasonably well to those obtained from detailed flow simulations. Significant improvement in computational efficiency is achieved with the proposed techniques. They can be used to efficiently screen a large number of reservoir models and facilitate the clustering of these models into groups with distinct shale heterogeneity characteristics. This presents significant potential to be integrated with other data-driven approaches for reducing the computational load typically associated with detailed flow simulations involving multiple heterogeneous reservoir realizations.
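The static-based distance builds on the Hausdorff distance between shapes; SciPy's `directed_hausdorff` gives the directed variant, from which the symmetric distance follows. The point clouds below are illustrative stand-ins for the occupied cells of two heterogeneous models.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(A, B)[0], directed_hausdorff(B, A)[0])

# Illustrative 2D point clouds standing in for the occupied cells of
# two heterogeneous reservoir models (each row an (x, z) coordinate).
rng = np.random.default_rng(0)
model_a = rng.uniform(0, 10, size=(50, 2))
model_b = model_a + rng.normal(0, 0.5, size=(50, 2))

# Pairwise distances over a suite of models can feed a clustering step
# that groups realizations with similar heterogeneity characteristics.
print(f"Hausdorff distance: {hausdorff(model_a, model_b):.3f}")
```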


2016
Vol 2016 (4)
pp. 21-36
Author(s):
Tao Wang
Ian Goldberg

Abstract Website fingerprinting allows a local, passive observer monitoring a web-browsing client’s encrypted channel to determine her web activity. Previous attacks have shown that website fingerprinting could be a threat to anonymity networks such as Tor under laboratory conditions. However, there are significant differences between laboratory conditions and realistic conditions. First, in laboratory tests we collect the training data set together with the testing data set, so the training data set is fresh, but an attacker may not be able to maintain a fresh data set. Second, laboratory packet sequences correspond to a single page each, but for realistic packet sequences the split between pages is not obvious. Third, packet sequences may include background noise from other types of web traffic. These differences adversely affect website fingerprinting under realistic conditions. In this paper, we tackle these three problems to bridge the gap between laboratory and realistic conditions for website fingerprinting. We show that we can maintain a fresh training set with minimal resources. We demonstrate several classification-based techniques that allow us to split full packet sequences effectively into sequences corresponding to a single page each. We describe several new algorithms for tackling background noise. With our techniques, we are able to build the first website fingerprinting system that can operate directly on packet sequences collected in the wild.
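Website fingerprinting classifiers in this line of work typically operate on features summarizing a packet sequence (counts, directions, volume). The sketch below is a deliberately coarse illustration with a 1-nearest-neighbor classifier and hypothetical traces, not the authors' attack.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def features(packet_seq):
    """Coarse summary features of one packet sequence (signed lengths:
    positive = outgoing, negative = incoming) -- illustrative only."""
    seq = np.asarray(packet_seq, dtype=float)
    return [len(seq),            # total packet count
            (seq > 0).sum(),     # outgoing packets
            (seq < 0).sum(),     # incoming packets
            np.abs(seq).sum()]   # total traffic volume

# Hypothetical traces: signed packet lengths per page load, labeled by site.
traces = [[600, -1500, -1500, 600], [600, -1500, 600], [-1500, 600, -1500]]
labels = [0, 1, 0]

X = np.array([features(t) for t in traces])
clf = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
print(clf.predict([features([600, -1500, -1500])]))
```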


2013
Vol 284-287
pp. 3111-3114
Author(s):
Hsiang Chuan Liu
Wei Sung Chen
Ben Chang Shia
Chia Chen Lee
Shang Ling Ou
...

In this paper, a novel fuzzy measure, the high order lambda measure, was proposed. Based on the Choquet integral with respect to this new measure, a novel composition forecasting model, which combines the GM(1,1) forecasting model, the time series model, and the exponential smoothing model, was also proposed. To evaluate the efficiency of this improved composition forecasting model, an experiment on real data using the 5-fold cross validation mean square error was conducted. The performances of the Choquet integral composition forecasting model with the P-measure, lambda-measure, L-measure, and high order lambda measure, respectively, a ridge regression composition forecasting model, a multiple linear regression composition forecasting model, and the traditional linear weighted composition forecasting model were compared. The experimental results showed that the Choquet integral composition forecasting model with respect to the high order lambda measure has the best performance.
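The high order lambda measure generalizes the classical Sugeno λ-measure, and the Choquet integral over such a measure has a simple discrete form. The sketch below implements the classical λ-measure (not the paper's high order variant, which is an assumption of scope here) and the discrete Choquet integral used to fuse member forecasts.

```python
import numpy as np
from scipy.optimize import brentq

def solve_lambda(densities):
    """Solve prod(1 + lam*g_i) = 1 + lam for the Sugeno lambda-measure."""
    def f(lam):
        return np.prod(1.0 + lam * densities) - (1.0 + lam)
    s = densities.sum()
    if np.isclose(s, 1.0):
        return 0.0                                  # additive special case
    if s > 1.0:
        return brentq(f, -1.0 + 1e-9, -1e-9)        # lambda in (-1, 0)
    return brentq(f, 1e-9, 1e9)                     # lambda > 0

def lambda_measure(subset, densities, lam):
    """g(A) of the lambda-measure from the densities of A's members."""
    g = densities[list(subset)]
    if np.isclose(lam, 0.0):
        return g.sum()
    return (np.prod(1.0 + lam * g) - 1.0) / lam

def choquet(values, densities):
    """Discrete Choquet integral of `values` w.r.t. the lambda-measure."""
    lam = solve_lambda(densities)
    order = np.argsort(values)              # ascending
    total, prev = 0.0, 0.0
    for i, idx in enumerate(order):
        coalition = order[i:]               # elements with value >= values[idx]
        total += (values[idx] - prev) * lambda_measure(coalition, densities, lam)
        prev = values[idx]
    return total

# Fuse three member forecasts whose importance densities do not sum to 1.
forecasts = np.array([10.2, 9.8, 10.5])
densities = np.array([0.4, 0.3, 0.2])
print(choquet(forecasts, densities))
```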


2019
Vol 23 (1)
pp. 67-77
Author(s):
Yao Yevenyo Ziggah
Hu Youjian
Alfonso Rodrigo Tierra
Prosper Basommi Laari

The popularity of Artificial Neural Network (ANN) methodology has been growing in a wide variety of areas in geodesy and geospatial sciences. Its ability to perform coordinate transformation between different datums has been well documented in the literature. In applying ANN methods to coordinate transformation, usually only the train-test (hold-out cross-validation) approach has been used to evaluate their performance. Here, the data set is divided into two disjoint subsets: training (model building) and testing (model validation). However, one major drawback of the hold-out cross-validation procedure is inappropriate data partitioning; an improper split of the data could lead to high variance and bias in the generated results. Besides, in a sparse dataset situation, hold-out cross-validation is not suitable. For these reasons, the K-fold cross-validation approach has been recommended. Consequently, this study, for the first time, explored the potential of using the K-fold cross-validation method in the performance assessment of a radial basis function neural network (RBFNN) and the Bursa-Wolf model under a data-insufficient situation in the Ghana geodetic reference network. The statistical analysis of the results revealed that incorrect data partitioning could lead to false reportage on the predictive performance of the transformation model. The findings revealed that the RBFNN and Bursa-Wolf model produced transformation accuracies of 0.229 m and 0.469 m, respectively. It was also realised that maximum horizontal errors of 0.881 m and 2.131 m were given by the RBFNN and Bursa-Wolf model, respectively. The obtained results are applicable per the cadastral surveying and plan production requirements set by the Ghana Survey and Mapping Division. This study will contribute to the usage of the K-fold cross-validation approach in developing countries with sparse dataset situations like Ghana, as well as in the geodetic sciences, where ANN users seldom apply statistical resampling techniques.
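A minimal sketch of the K-fold comparison, reporting mean and maximum horizontal error for two transformation models; kernel ridge regression with an RBF kernel stands in for the RBFNN, a plain linear fit stands in for the similarity-type Bursa-Wolf model, and the common points are placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import LinearRegression

def kfold_horizontal_error(model, X, Y, k=10):
    """Mean and max horizontal error of a 2D coordinate transformation
    under k-fold CV (Y columns: Easting, Northing)."""
    errs = []
    for tr, te in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model.fit(X[tr], Y[tr])
        d = model.predict(X[te]) - Y[te]
        errs.append(np.hypot(d[:, 0], d[:, 1]))
    errs = np.concatenate(errs)
    return errs.mean(), errs.max()

# Placeholder common points between two datums (source E,N -> target E,N).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1e5, size=(60, 2))
Y = X * 1.00001 + np.array([200.0, -150.0]) + rng.normal(0, 0.3, (60, 2))

for name, mdl in [("RBF stand-in", KernelRidge(kernel="rbf", gamma=1e-9)),
                  ("linear stand-in", LinearRegression())]:
    mean_e, max_e = kfold_horizontal_error(mdl, X, Y, k=10)
    print(f"{name}: mean {mean_e:.3f} m, max {max_e:.3f} m")
```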

