An Ensemble Approach for Robust Data-Driven Prognostics

Author(s):  
Chao Hu ◽  
Byeng D. Youn ◽  
Pingfeng Wang ◽  
Joung Taek Yoon

Prognostics aims to determine whether a failure of an engineered system (e.g., a nuclear power plant) is impending and to estimate the remaining useful life (RUL) before the failure occurs. The traditional data-driven prognostic approach involves three steps: (Step 1) construct multiple candidate algorithms using a training data set; (Step 2) evaluate their respective performance using a testing data set; and (Step 3) select the one with the best performance while discarding all the others. This traditional approach faces three main challenges: (i) lack of robustness in the selected standalone algorithm; (ii) waste of the resources spent constructing the algorithms that are discarded; and (iii) demand for testing data in addition to training data. To address these challenges, this paper proposes an ensemble approach for data-driven prognostics. The approach combines multiple member algorithms with a weighted-sum formulation, where the weights are estimated using one of three weighting schemes: accuracy-based, diversity-based, and optimization-based weighting. To estimate the prediction error required by the accuracy- and optimization-based weighting schemes, we propose using k-fold cross-validation (CV) as a robust error estimator. The performance of the proposed ensemble approach is verified with three engineering case studies. All the case studies show that the ensemble approach achieves better accuracy in RUL predictions than any sole algorithm when the member algorithms have good diversity and comparable prediction accuracy.
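The weighted-sum idea above can be illustrated with a minimal sketch of accuracy-based weighting: each member algorithm receives a weight inversely proportional to its k-fold CV error. The function names and numbers here are illustrative, not taken from the paper.

```python
# Minimal sketch of accuracy-based ensemble weighting (illustrative names).
# Each member algorithm i has a k-fold cross-validation error e_i estimated
# on the training data; its weight is proportional to 1 / e_i.

def accuracy_based_weights(cv_errors):
    """Weight each member inversely to its CV error; weights sum to 1."""
    inv = [1.0 / e for e in cv_errors]
    total = sum(inv)
    return [v / total for v in inv]

def ensemble_predict(member_predictions, weights):
    """Weighted-sum RUL prediction from the member predictions."""
    return sum(w * p for w, p in zip(weights, member_predictions))

# Example: three members with CV errors 2.0, 4.0, 4.0 (in RUL units)
weights = accuracy_based_weights([2.0, 4.0, 4.0])       # -> [0.5, 0.25, 0.25]
rul = ensemble_predict([100.0, 120.0, 90.0], weights)   # -> 102.5
```

Because the weights come from CV on the training data alone, no separate testing data set is needed to combine the members.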

Author(s):  
Chao Hu ◽  
Byeng D. Youn ◽  
Pingfeng Wang

The traditional data-driven prognostic approach is to construct multiple candidate algorithms using a training data set, evaluate their respective performance using a testing data set, and select the one with the best performance while discarding all the others. This approach has three shortcomings: (i) the selected standalone algorithm may not be robust, i.e., it may be less accurate when the real data acquired after deployment differ from the testing data; (ii) it wastes the resources spent constructing the algorithms that are discarded at deployment; and (iii) it requires testing data in addition to training data, which increases the overall expense of algorithm selection. To overcome these drawbacks, this paper proposes an ensemble data-driven prognostic approach that combines multiple member algorithms with a weighted-sum formulation. Three weighting schemes, namely accuracy-based, diversity-based, and optimization-based weighting, are proposed to determine the weights of the member algorithms. k-fold cross-validation (CV) is employed to estimate the prediction error required by the weighting schemes. Two case studies were used to demonstrate the effectiveness of the proposed approach. The results suggest that the ensemble approach with any weighting scheme gives more accurate RUL predictions than any sole algorithm, and that the optimization-based weighting scheme gives the best overall performance among the three.
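Optimization-based weighting can be sketched for the two-member case: choose the weight that minimizes the squared error of the combined prediction on cross-validation data. A fine grid search stands in here for a proper constrained optimizer, and all numbers are made up for illustration.

```python
# Hedged sketch of optimization-based weighting for two member algorithms.
# The weight w of member A (member B gets 1 - w) is chosen to minimize the
# sum of squared errors of the weighted prediction on cross-validation data.

def best_weight(preds_a, preds_b, targets, steps=1000):
    """Grid-search w in [0, 1] minimizing SSE of w*a + (1-w)*b."""
    def sse(w):
        return sum((w * a + (1 - w) * b - t) ** 2
                   for a, b, t in zip(preds_a, preds_b, targets))
    return min((i / steps for i in range(steps + 1)), key=sse)

# Illustrative CV predictions of RUL for five units, plus true RULs:
a = [10.0, 22.0, 28.0, 41.0, 52.0]
b = [12.0, 18.0, 33.0, 38.0, 47.0]
t = [11.0, 20.0, 30.0, 40.0, 50.0]
w = best_weight(a, b, t)  # weight on member A
```

With more members, the same objective is minimized over the full weight simplex, which is why a real optimizer replaces the grid search in practice.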


Author(s):  
Chao Hu ◽  
Byeng D. Youn ◽  
Taejin Kim

Traditional data-driven prognostics often requires a large amount of failure data for offline training in order to achieve good accuracy in online prediction. In many engineered systems, however, failure data are expensive and time-consuming to obtain, while suspension data are readily available. In such cases, it becomes critical to utilize suspension data, which may carry rich information about the degradation trend and help achieve more accurate remaining useful life (RUL) prediction. To this end, this paper proposes a co-training-based data-driven prognostic algorithm, denoted Coprog, which uses two individual data-driven algorithms, each predicting the RULs of suspension units for the other. The confidence of an individual algorithm in predicting the RUL of a suspension unit is quantified by the extent to which including that unit in the training data set reduces the sum of squared errors (SSE) in RUL prediction on the failure units. After a suspension unit is chosen and its RUL is predicted by an individual algorithm, it becomes a virtual failure unit that is added to the training data set. Results from two case studies suggest that Coprog gives more accurate RUL predictions than any individual algorithm that does not consider suspension data, and that Coprog can effectively exploit suspension data to improve accuracy in data-driven prognostics.
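The confidence measure above can be sketched as follows. The `retrain_and_sse` callback is hypothetical scaffolding standing in for whichever data-driven learner is retrained with the candidate unit included; the SSE numbers are made up.

```python
# Sketch of Coprog's suspension-unit selection (illustrative scaffolding).
# Confidence in a suspension unit = drop in sum of squared errors (SSE)
# on the failure units when that unit, with its predicted RUL, is added
# to the training set.

def sse(predicted, actual):
    """Sum of squared errors between predicted and actual RULs."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))

def select_suspension_unit(suspension_units, base_sse, retrain_and_sse):
    """Pick the suspension unit whose inclusion most reduces SSE on the
    failure units; return (unit, confidence)."""
    best_unit, best_gain = None, float("-inf")
    for unit in suspension_units:
        gain = base_sse - retrain_and_sse(unit)  # confidence in this unit
        if gain > best_gain:
            best_unit, best_gain = unit, gain
    return best_unit, best_gain

# Toy example: pretend retraining with unit "u2" helps the most.
fake_sse_after = {"u1": 9.0, "u2": 4.0, "u3": 8.0}
unit, conf = select_suspension_unit(["u1", "u2", "u3"], 10.0,
                                    lambda u: fake_sse_after[u])
# unit == "u2", conf == 6.0
```

The chosen unit then becomes a virtual failure unit, and the loop repeats with the remaining suspension units.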


Water ◽  
2021 ◽  
Vol 13 (1) ◽  
pp. 107
Author(s):  
Elahe Jamalinia ◽  
Faraz S. Tehrani ◽  
Susan C. Steele-Dunne ◽  
Philip J. Vardon

Climatic conditions and vegetation cover influence water flux in a dike, and potentially the dike stability. A comprehensive numerical simulation is computationally too expensive to be used for the near real-time analysis of a dike network. Therefore, this study investigates a random forest (RF) regressor to build a data-driven surrogate for a numerical model to forecast the temporal macro-stability of dikes. To that end, daily inputs and outputs of a ten-year coupled numerical simulation of an idealised dike (2009–2019) are used to create a synthetic data set, comprising features that can be observed from a dike surface, with the calculated factor of safety (FoS) as the target variable. The data set before 2018 is split into training and testing sets to build and train the RF. The predicted FoS is strongly correlated with the numerical FoS for data that belong to the test set (before 2018). However, the trained model shows lower performance for data in the evaluation set (after 2018) if further surface cracking occurs. This proof-of-concept shows that a data-driven surrogate can be used to determine dike stability for conditions similar to the training data, which could be used to identify vulnerable locations in a dike network for further examination.
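The temporal split described above (train/test before 2018, evaluation after) can be sketched in plain Python; the feature names and values are invented placeholders for the surface-observable inputs, with the numerical factor of safety (FoS) as target.

```python
# Illustrative sketch of the data split used for the RF surrogate:
# daily (date, features, FoS) records partitioned at a cutoff date.
from datetime import date

def split_by_date(records, cutoff=date(2018, 1, 1)):
    """Partition (day, features, fos) records at the cutoff date."""
    before = [r for r in records if r[0] < cutoff]
    after = [r for r in records if r[0] >= cutoff]
    return before, after

records = [
    (date(2017, 6, 1), {"rainfall_mm": 2.1}, 1.35),  # hypothetical values
    (date(2018, 3, 5), {"rainfall_mm": 0.0}, 1.28),
    (date(2019, 1, 9), {"rainfall_mm": 5.4}, 1.22),
]
train_test, evaluation = split_by_date(records)  # 1 record vs. 2 records
```

Only the pre-cutoff portion is further split into training and testing sets for fitting the random forest; the post-cutoff portion probes how the surrogate degrades on unseen conditions.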


Author(s):  
Zhimin Xi ◽  
Rong Jing ◽  
Pingfeng Wang ◽  
Chao Hu

This paper develops a Copula-based sampling method for data-driven prognostics and health management (PHM). The principal idea is to first build a statistical relationship between the failure time and the time realizations at specified degradation levels on the basis of offline training data sets, and then identify possible failure times for online testing units based on the constructed statistical model and the available online testing data. Specifically, three technical components are proposed to implement the methodology. First, a generic health index system is proposed to represent the health degradation of engineering systems. Next, a Copula-based model is proposed to capture the statistical relationship between the failure time and the time realizations at specified degradation levels. Finally, a sampling approach is proposed to estimate the failure time and remaining useful life (RUL) of online testing units. Two case studies, a bearing system in electric cooling fans and the 2008 IEEE PHM challenge problem, are employed to demonstrate the effectiveness of the proposed methodology.
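As a heavily simplified stand-in for the idea above: offline training pairs link the time at which a unit reached a given degradation level to its eventual failure time, and an online unit's possible failure times are resampled from the most similar training units. The real method fits a copula to this dependence; the nearest-neighbour resampling and all numbers below are illustrative only.

```python
# Simplified empirical stand-in for Copula-based failure-time sampling.

def sample_failure_times(train_pairs, online_crossing_time, k=2):
    """train_pairs: (time_at_level, failure_time) from offline units.
    Return failure times of the k training units whose level-crossing
    time is closest to that of the online unit."""
    ranked = sorted(train_pairs,
                    key=lambda p: abs(p[0] - online_crossing_time))
    return [failure for _, failure in ranked[:k]]

# Offline units crossed the degradation level at hours 10, 12, 20, 25 and
# failed at hours 30, 35, 55, 70 respectively (illustrative numbers).
pairs = [(10.0, 30.0), (12.0, 35.0), (20.0, 55.0), (25.0, 70.0)]
candidates = sample_failure_times(pairs, online_crossing_time=11.0)
# candidates == [30.0, 35.0]; RUL estimates follow by subtracting the
# current age of the online unit.
```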


SPE Journal ◽  
2021 ◽  
pp. 1-25
Author(s):  
Chang Gao ◽  
Juliana Y. Leung

Summary
The steam-assisted gravity drainage (SAGD) recovery process is strongly impacted by the spatial distributions of heterogeneous shale barriers. Though detailed compositional flow simulators are available for SAGD recovery performance evaluation, the simulation process is usually quite computationally demanding, rendering their use over a large number of reservoir models for assessing the impacts of heterogeneity (uncertainties) impractical. In recent years, data-driven proxies have been widely proposed to reduce the computational effort; nevertheless, a proxy must be trained using a large data set consisting of many flow simulation cases that ideally span the model parameter spaces. The question remains: is there a more efficient way to screen a large number of heterogeneous SAGD models? Such techniques could help construct a training data set with less redundancy; they can also be used to quickly identify a subset of heterogeneous models for detailed flow simulation. In this work, we formulated two distance measures, flow-based and static-based, to quantify the similarity among a set of 3D heterogeneous SAGD models. First, to formulate the flow-based distance measure, a physics-based particle-tracking model is used: Darcy's law and energy balance are integrated to mimic the steam chamber expansion process; steam particles located at the edge of the chamber release their energy to the surrounding cold bitumen, while detailed fluid displacements are not explicitly simulated. The steam chamber evolution is modeled, and the flow-based distance between two given reservoir models is defined as the difference in their chamber sizes over time. Second, to formulate the static-based distance, the Hausdorff distance (Hausdorff 1914) is used: it is often used in image processing to compare two images according to the spatial arrangement and shapes of their objects.
A suite of 3D models is constructed using representative petrophysical properties and operating constraints extracted from several pads in Suncor Energy's Firebag project. The computed distance measures are used to partition the models into different groups. To establish a baseline for comparison, flow simulations are performed on these models to predict the actual chamber evolution and production profiles. The grouping results according to the proposed flow- and static-based distance measures match reasonably well with those obtained from detailed flow simulations. Significant improvement in computational efficiency is achieved with the proposed techniques. They can be used to efficiently screen a large number of reservoir models and to cluster these models into groups with distinct shale heterogeneity characteristics. This presents significant potential for integration with other data-driven approaches to reduce the computational load typically associated with detailed flow simulations involving multiple heterogeneous reservoir realizations.
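The symmetric Hausdorff distance underlying the static-based measure can be sketched for two point sets; here the points are 2D coordinates standing in for steam-chamber outlines (illustrative data).

```python
# Symmetric Hausdorff distance between two finite point sets:
# the larger, over both directions, of the farthest nearest-neighbour gap.
from math import dist

def hausdorff(set_a, set_b):
    def directed(u, v):
        return max(min(dist(p, q) for q in v) for p in u)
    return max(directed(set_a, set_b), directed(set_b, set_a))

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(0.0, 0.0), (4.0, 0.0)]
# directed(a, b) = 1.0, directed(b, a) = 3.0, so hausdorff(a, b) == 3.0
```

Because it compares shapes point-by-point, the measure is sensitive to the spatial arrangement of objects, which is exactly what makes it useful for grouping models by shale-barrier geometry.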


2016 ◽  
Vol 2016 (4) ◽  
pp. 21-36 ◽  
Author(s):  
Tao Wang ◽  
Ian Goldberg

Abstract
Website fingerprinting allows a local, passive observer monitoring a web-browsing client’s encrypted channel to determine her web activity. Previous attacks have shown that website fingerprinting could be a threat to anonymity networks such as Tor under laboratory conditions. However, there are significant differences between laboratory conditions and realistic conditions. First, in laboratory tests we collect the training data set together with the testing data set, so the training data set is fresh, but an attacker may not be able to maintain a fresh data set. Second, laboratory packet sequences correspond to a single page each, but for realistic packet sequences the split between pages is not obvious. Third, packet sequences may include background noise from other types of web traffic. These differences adversely affect website fingerprinting under realistic conditions. In this paper, we tackle these three problems to bridge the gap between laboratory and realistic conditions for website fingerprinting. We show that we can maintain a fresh training set with minimal resources. We demonstrate several classification-based techniques that allow us to split full packet sequences effectively into sequences corresponding to a single page each. We describe several new algorithms for tackling background noise. With our techniques, we are able to build the first website fingerprinting system that can operate directly on packet sequences collected in the wild.
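The page-split problem can be illustrated with a naive time-gap heuristic: start a new page whenever the inter-packet gap exceeds a threshold. The paper's approach is classification-based and more robust; this baseline, with made-up timestamps, only illustrates the task.

```python
# Naive time-gap baseline for splitting a packet trace into per-page
# segments (illustrative stand-in for classification-based splitting).

def split_by_gap(timestamps, gap=1.0):
    """Split a sorted packet-timestamp sequence into per-page segments."""
    pages, current = [], [timestamps[0]]
    for prev, t in zip(timestamps, timestamps[1:]):
        if t - prev > gap:       # long silence: assume a new page begins
            pages.append(current)
            current = []
        current.append(t)
    pages.append(current)
    return pages

ts = [0.0, 0.1, 0.3, 2.5, 2.6, 2.9]
pages = split_by_gap(ts)  # -> [[0.0, 0.1, 0.3], [2.5, 2.6, 2.9]]
```

A fixed threshold fails when pages load back-to-back or background traffic fills the gaps, which is precisely why the authors train classifiers to find split points instead.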


2018 ◽  
Vol 13 (3) ◽  
pp. 408-428 ◽  
Author(s):  
Phu Vo Ngoc

We have surveyed many significant approaches over the years because sentiment classification offers crucial contributions that can be applied in everyday life, such as in political activities, commodity production, and commercial activities. We propose a novel model using Latent Semantic Analysis (LSA) and the Dennis Coefficient (DNC) for big data sentiment classification in English. Many LSA vectors (LSAVs) have successfully been formed using the DNC. We use the DNC and the LSAVs to classify 11,000,000 documents of our testing data set against 5,000,000 documents of our training data set in English. The model uses many sentiment lexicons from our basis English sentiment dictionary (bESD). We have tested the proposed model in both a sequential environment and a distributed network system; the results of the sequential system are not as good as those of the parallel environment. We achieved 88.76% accuracy on the testing data set, which is better than the accuracies of many previous models of semantic analysis. We have also compared the novel model with previous models, and the results of our proposed model are better than those of the previous models. Many different fields can use the results of the novel model in commercial applications and sentiment classification surveys.


Author(s):  
Sangram Patil ◽  
Aum Patil ◽  
Vishwadeep Handikherkar ◽  
Sumit Desai ◽  
Vikas M. Phalle ◽  
...  

Rolling element bearings are very important and highly utilized in many industries. Their catastrophic failure under fluctuating working conditions leads to unscheduled breakdowns and increased accidental economic losses. These issues have triggered the need for a reliable and automatic prognostics methodology that can prevent a potentially expensive maintenance program. Accordingly, Remaining Useful Life (RUL) prediction based on artificial intelligence has become an attractive methodology for several researchers. In this study, a data-driven condition monitoring approach is implemented for predicting the RUL of a bearing under a certain load and speed. The approach demonstrates the use of ensemble regression techniques such as Random Forest and Gradient Boosting for RUL prediction using time-domain features extracted from the given vibration signals. The extracted features are ranked using a Decision Tree (DT)-based ranking technique, and training and testing feature vectors are produced and fed as inputs to the ensemble techniques. Hyper-parameters are tuned for these models using an exhaustive parameter search, and the performance of the models is further verified by plotting the respective learning curves. The present work uses the FEMTO bearing data set provided by the IEEE PHM Data Challenge 2012. A Weibull hazard rate function for each bearing from the learning data set is used to find the target values, i.e., the projected RUL of the bearings. Results of the proposed models are compared with well-established data-driven approaches from the literature and are found to be better than all the models applied to this data set, demonstrating the reliability of the proposed model.
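Typical time-domain features of the kind mentioned above can be sketched in plain Python; RMS, kurtosis, and peak value are common choices for vibration windows, though the exact feature set used in the study may differ, and the sample window below is invented.

```python
# Common time-domain features of a vibration window (illustrative set).
from math import sqrt

def time_domain_features(window):
    n = len(window)
    mean = sum(window) / n
    var = sum((x - mean) ** 2 for x in window) / n      # population variance
    rms = sqrt(sum(x * x for x in window) / n)          # root mean square
    kurt = (sum((x - mean) ** 4 for x in window) / n) / var ** 2
    peak = max(abs(x) for x in window)
    return {"rms": rms, "kurtosis": kurt, "peak": peak}

feats = time_domain_features([0.1, -0.2, 0.4, -0.1, 0.3, -0.5])
# e.g. feats["peak"] == 0.5
```

Features like these, computed per window over a bearing's life, form the feature vectors that are ranked and fed to the ensemble regressors.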


2021 ◽  
Vol 2021 (29) ◽  
pp. 141-147
Author(s):  
Michael J. Vrhel ◽  
H. Joel Trussell

A database of realizable filters is created and searched to obtain the best filter that, when placed in front of an existing camera, results in improved colorimetric capabilities for the system. The image data with the external filter is combined with image data without the filter to provide a six-band system. The colorimetric accuracy of the system is quantified using simulations that include a realistic signal-dependent noise model. Using a training data set, we selected the optimal filter based on four criteria: Vora Value, Figure of Merit, training average ΔE, and training maximum ΔE. Each selected filter was used on testing data. The filters chosen using the training ΔE criteria consistently outperformed the theoretical criteria.


Author(s):  
Nguyen Duy Dat ◽  
Vo Ngoc Phu ◽  
Vo Thi Ngoc Tran ◽  
Vo Thi Ngoc Chau ◽  
Tuan A. Nguyen

Sentiment classification is significant in everyone's everyday life, such as in political activities, commodity production, and commercial activities. In this research, we propose a new model for big data sentiment classification in a parallel network environment. Our new model uses the STING Algorithm (SA), from the data mining field, for English document-level sentiment classification with Hadoop Map (M)/Reduce (R), based on the 90,000 English sentences of the training data set in a Cloudera parallel network environment, a distributed system. To our knowledge, no existing scientific study is similar to this survey. Our new model can classify the sentiment of millions of English documents with the shortest execution time in the parallel network environment. We tested our new model on the 25,000 English documents of the testing data set and achieved 61.2% accuracy. Our English training data set includes 45,000 positive and 45,000 negative English sentences.

