Incorporating Misclassification Error in Skill Assessment

William Briggs; Matt Pocernich; David Ruppert

doi:10.1175/mwr3032.1

Monthly Weather Review ◽

10.1175/mwr3032.1 ◽

2005 ◽

Vol 133 (11) ◽

pp. 3382-3392 ◽

Cited By ~ 10

Author(s):

William Briggs ◽

Matt Pocernich ◽

David Ruppert

Keyword(s):

Gold Standard ◽

Score Test ◽

Skill Score ◽

Statistical Test ◽

Skill Assessment ◽

Misclassification Error ◽

Model Parameters ◽

Standard Data ◽

Test Of Significance ◽

Meteorological Observations

Abstract It is desirable to account for misclassification error of meteorological observations so that the true skill of the forecast can be assessed. Errors in observations can occur, among other places, in pilot reports of icing and in tornado spotting. Not accounting for misclassification error gives a misleading picture of the forecast’s true performance. An extension to the climate skill score test developed in Briggs and Ruppert is presented to account for possible misclassification error of the meteorological observation. This extension supposes a statistical misclassification-error model where “gold standard” data, or expert opinion, is available to characterize the misclassification-error characteristics of the observation. These model parameters are then inserted into the Briggs and Ruppert skill score for which a statistical test of significance can be performed.

Seasonal streamflow forecasts for Europe – I. Hindcast verification with pseudo- and real observations

10.5194/hess-2016-603 ◽

2016 ◽

Cited By ~ 1

Author(s):

Wouter Greuell ◽

Wietse H. P. Franssen ◽

Hester Biemans ◽

Ronald W. A. Hutjes

Keyword(s):

Correlation Coefficient ◽

Lead Time ◽

Irrigation Management ◽

Skill Score ◽

Skill Assessment ◽

General Tendency ◽

Infiltration Capacity ◽

Streamflow Forecasts ◽

Meteorological Observations ◽

North Germany

Abstract. Seasonal predictions can be exploited, among other things, to optimize hydropower energy generation, navigability of rivers and irrigation management to decrease crop yield losses. This paper is the first of two papers dealing with a model-based system built to produce seasonal hydrological forecasts (WUSHP: Wageningen University Seamless Hydrological Prediction system), applied here to Europe. The present paper presents the development and the skill evaluation of the system. In WUSHP hydrology is simulated by running the Variable Infiltration Capacity (VIC) hydrological model with forcing from bias-corrected output of ECMWF's Seasonal Forecasting System 4. The system is probabilistic. For the assessment of skill, we performed hindcast simulations (1981–2010) and a reference simulation, in which VIC was forced by gridded meteorological observations, to generate initial hydrological conditions for the hindcasts and discharge output for skill assessment (pseudo-observations). Skill is analysed with monthly temporal resolution for the entire annual cycle. Using the pseudo-observations and taking the correlation coefficient as metric, hot spots of significant skill in runoff were identified in Fennoscandia (from January to October), the southern part of the Mediterranean (from June to August), Poland, North Germany, Romania and Bulgaria (mainly from November to January) and West France (from December to May). The spatial pattern of skill fades with increasing lead time but some skill is left at the end of the hindcasts (7 months). On average across the domain, skill in discharge is slightly higher than skill in runoff. This can be explained by the delay between runoff and discharge and the general tendency of decreasing skill with lead time. Theoretical skill as determined with the pseudo-observations was compared to actual skill as determined with real discharge observations from 747 stations. Actual skill is mostly and often substantially less than theoretical skill, which is consistent with a conceptual analysis of the two types of verification. Qualitatively, results are hardly sensitive to the different skill metrics considered in this study (correlation coefficient, ROC area and Ranked Probability Skill Score) but ROC areas tend to be slightly larger for the Below Normal than for the Above Normal tercile.

The STORK dataset: Linked midwifery and delivery records of the mothers and index children in the Avon Longitudinal Study of Parents and Children (ALSPAC)

Wellcome Open Research ◽

10.12688/wellcomeopenres.16247.1 ◽

2020 ◽

Vol 5 ◽

pp. 229

Author(s):

Mark Mummé ◽

Andy Boyd ◽

Jean Golding ◽

John Macleod

Keyword(s):

Longitudinal Study ◽

Gold Standard ◽

Small Sample ◽

Birth Cohort Study ◽

Future Research ◽

Record System ◽

Standard Data ◽

Parents And Children ◽

Sample Spot ◽

Avon Longitudinal Study

This data note describes the linked antenatal and delivery records of the mothers and index children of the Avon Longitudinal Study of Parents and Children (ALSPAC) birth cohort study. These records were extracted from the computerised maternity record system ‘STORK’ used by the two largest NHS trusts in the study catchment area. The STORK database was designed to be populated by midwives and other health professionals during a woman’s pregnancy and shortly after the baby’s birth. These early computer records were initiated in the early 1990s, shortly before the start of enrolment to ALSPAC. At this time the use of electronic medical record systems such as ‘STORK’ was very new; the accuracy of the records has been questioned and little contemporary detailed documentation is available. Small-sample spot checks on the accuracy of the information in ‘STORK’ suggest extensive missingness and differences against gold-standard fieldworker-abstracted information in some variables, yet high levels of completeness and agreement with gold-standard data in others. Software code was created using STATA (StataCorp LLC) to transform the original CSV (comma-separated values) files into a cohesive and consistent format, which was reviewed for data completeness with a view to its potential use in future research. The cleaned ‘STORK’ records provide health, social and maternity data from the very earliest period of the ALSPAC study in an easily accessible format, which is particularly useful when other sources of data are missing.

The skill assessment of ENSO prediction issued by JMA ensemble prediction system and CFSv2

IOP Conference Series Earth and Environmental Science ◽

10.1088/1755-1315/893/1/012047 ◽

2021 ◽

Vol 893 (1) ◽

pp. 012047

Author(s):

R Rahmat ◽

A M Setiawan ◽

Supari

Keyword(s):

Southern Oscillation ◽

Skill Score ◽

Skill Assessment ◽

Japan Meteorological Agency ◽

Ensemble Prediction ◽

In Situ Observation ◽

Prediction System ◽

Ensemble Prediction System ◽

Enso Prediction ◽

The Government

Abstract Indonesian climate is strongly affected by the El Niño-Southern Oscillation (ENSO), one of its climate-driving factors. ENSO prediction for the upcoming months or year is crucial for the government in designing further strategic policy. Besides producing its own ENSO prediction, BMKG also regularly releases the ENSO status and predictions collected from other climate centers, such as the Japan Meteorological Agency (JMA) and the National Oceanic and Atmospheric Administration (NOAA). However, the skill of these products is not yet well known. The aim of this study is to conduct a simple assessment of the skill of the JMA Ensemble Prediction System (EPS) and NOAA Climate Forecast System version 2 (CFSv2) ENSO predictions using the World Meteorological Organization (WMO) Standard Verification System for Long Range Forecast (SVS-LRF) method. The two sets of ENSO predictions are also compared with each other using Student's t-test. The ENSO prediction data were obtained from the ENSO JMA and ENSO NCEP forecast archive files, while the observed Nino 3.4 values were calculated from Centennial in situ Observation-Based Estimates (COBE) Sea Surface Temperature Anomaly (SSTA). The ENSO predictions issued by both JMA and NCEP have good skill at lead times of 1 to 3 months, indicated by a high correlation coefficient and positive values of the Mean Square Skill Score (MSSS). However, the skill of both systems is significantly reduced for May-August target months. Careful interpretation is needed for ENSO predictions issued for this period.
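
As a hedged illustration of the MSSS mentioned in this abstract, the sketch below uses the common definition MSSS = 1 - MSE_forecast / MSE_climatology. It is a simplification: the climatology reference is assumed to be the sample mean of the verification observations, without the cross-validation adjustments of the full WMO SVS-LRF procedure. The function name and inputs are hypothetical and are not taken from the paper.

```python
def mean_square_skill_score(forecasts, observations):
    """Simplified MSSS: 1 - MSE(forecast) / MSE(climatology).

    Assumption: climatology is the plain mean of `observations`
    (no cross-validation, unlike the full WMO SVS-LRF procedure).
    """
    n = len(observations)
    if n == 0 or n != len(forecasts):
        raise ValueError("forecasts and observations must be non-empty and equal in length")

    climatology = sum(observations) / n
    mse_forecast = sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / n
    mse_climatology = sum((climatology - o) ** 2 for o in observations) / n

    if mse_climatology == 0:
        raise ValueError("observations have zero variance; MSSS is undefined")

    # Positive values mean the forecast beats the climatology reference.
    return 1.0 - mse_forecast / mse_climatology
```

Under this definition a perfect forecast scores 1, a forecast that always equals climatology scores 0, and negative values indicate a forecast worse than climatology.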

Adjusting for publication biases across similar interventions performed well when compared with gold standard data

Journal of Clinical Epidemiology ◽

10.1016/j.jclinepi.2011.01.009 ◽

2011 ◽

Vol 64 (11) ◽

pp. 1230-1241 ◽

Cited By ~ 23

Author(s):

Santiago G. Moreno ◽

Alex J. Sutton ◽

A.E. Ades ◽

Nicola J. Cooper ◽

Keith R. Abrams

Keyword(s):

Gold Standard ◽

Standard Data

Data Preprocessing Method and Fault Diagnosis Based on Evaluation Function of Information Contribution Degree

Journal of Control Science and Engineering ◽

10.1155/2018/6565737 ◽

2018 ◽

Vol 2018 ◽

pp. 1-10

Author(s):

Siyu Ji ◽

Chenglin Wen

Keyword(s):

Fault Diagnosis ◽

Combined Cycle ◽

Training Data ◽

New Method ◽

Model Parameters ◽

Evaluation Function ◽

Data Set ◽

Standard Data ◽

True Value ◽

The Mean

A neural network is a data-driven algorithm; building the network model requires a large amount of training data, so a significant amount of time is spent training the model parameters. However, the system mode is updated from time to time, and prediction using the original model parameters then causes the model output to deviate greatly from the true value. Traditional methods such as gradient descent and least squares are all centralized, making it difficult to adaptively update model parameters according to system changes. First, in order to adaptively update the network parameters, this paper introduces an evaluation function and gives a new method to evaluate the parameters of the function. Without changing the other parameters of the model, the new method updates some parameters in real time to ensure the accuracy of the model. Then, based on the evaluation function, the Mean Impact Value (MIV) algorithm is used to calculate the weights of the features, and the weighted data are fed into the established fault diagnosis model. Finally, the validity of the algorithm is verified on a simulation example using the UCI Combined Cycle Power Plant (UCI-ccpp) standard data set.

The Potential Impact of Using Persistence as a Reference Forecast on Perceived Forecast Skill

Weather and Forecasting ◽

10.1175/2008waf2007037.1 ◽

2008 ◽

Vol 23 (5) ◽

pp. 1022-1031 ◽

Cited By ~ 15

Author(s):

Marion P. Mittermaier

Keyword(s):

Skill Score ◽

Skill Assessment ◽

Added Value ◽

Forecast Performance ◽

Additional Degree ◽

Perceived Performance ◽

Precipitation Total ◽

Forecasting System ◽

Potential Impact ◽

The Impact

Abstract Skill is defined as actual forecast performance relative to the performance of a reference forecast. It is shown that the choice of reference (e.g., random or persistence) can affect the perceived performance of the forecast system. Two scores, the equitable threat score (ETS) and the odds ratio benefit skill score (ORBSS), were chosen to show the impact of using a persistence forecast, first using some simple hypothetical scenarios and second for actual forecasts from the Met Office Unified Model (UM) of precipitation, total cloud cover, and visibility during 2006. Overall persistence offers a sterner test of true forecast added value and accuracy, but using a more realistic reference may come at a cost. Using persistence introduces an additional degree of freedom to the skill assessment, which may be rather variable for “weather parameters.” Ultimately, the aim of any forecasting system should be to achieve a substantive separation between the inherent skill of the reference (which represents basic predictability) and the actual forecast.
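
As a hedged illustration of the scores discussed in this abstract, the sketch below computes the equitable threat score (ETS) from a 2x2 contingency table using its standard definition, in which hits expected from a random forecast are subtracted out. It also shows how a persistence reference can be formed by reusing the previous observation as the forecast. The function names and the toy series are hypothetical and are not taken from the paper; the ORBSS is not shown.

```python
def contingency_counts(forecast, observed):
    """Count hits, false alarms, misses and correct negatives for binary events."""
    pairs = list(zip(forecast, observed))
    hits = sum(1 for f, o in pairs if f and o)
    false_alarms = sum(1 for f, o in pairs if f and not o)
    misses = sum(1 for f, o in pairs if not f and o)
    correct_negatives = sum(1 for f, o in pairs if not f and not o)
    return hits, false_alarms, misses, correct_negatives


def equitable_threat_score(hits, false_alarms, misses, correct_negatives):
    """ETS = (hits - hits_random) / (hits + false_alarms + misses - hits_random)."""
    n = hits + false_alarms + misses + correct_negatives
    if n == 0:
        return float("nan")
    # Hits expected by chance, given the forecast and observed event frequencies.
    hits_random = (hits + false_alarms) * (hits + misses) / n
    denominator = hits + false_alarms + misses - hits_random
    if denominator == 0:
        return float("nan")
    return (hits - hits_random) / denominator


# Toy binary event series (hypothetical data, for illustration only).
observed = [0, 0, 1, 1, 0, 1, 1, 1, 0, 0]

# Persistence reference: the forecast for time t is the observation at t - 1.
persistence_forecast = observed[:-1]
verifying_observations = observed[1:]

persistence_ets = equitable_threat_score(
    *contingency_counts(persistence_forecast, verifying_observations)
)
```

Scoring the actual forecast over the same verifying observations and comparing it with `persistence_ets` gives a rough sense of the separation between forecast and reference that the abstract describes.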

“Gold standard” data for evaluation and comparison of 3D/2D registration methods

Computer Aided Surgery ◽

10.3109/10929080500097687 ◽

2004 ◽

Vol 9 (4) ◽

pp. 137-144 ◽

Cited By ~ 15

Author(s):

Dejan Tomaževič ◽

Boštjan Likar ◽

Franjo Pernuš

Keyword(s):

Gold Standard ◽

Standard Data

Validation for 2D/3D registration II: The comparison of intensity- and gradient-based merit functions using a new gold standard data set

Medical Physics ◽

10.1118/1.3553403 ◽

2011 ◽

Vol 38 (3) ◽

pp. 1491-1502 ◽

Cited By ~ 27

Author(s):

Christelle Gendrin ◽

Primož Markelj ◽

Supriyanto Ardjo Pawiro ◽

Jakob Spoerk ◽

Christoph Bloch ◽

...

Keyword(s):

Gold Standard ◽

Merit Functions ◽

3D Registration ◽

Data Set ◽

Standard Data ◽

Gradient Based

Extraction of Traditional Chinese Medicine Entity: Design of a Novel Span-Level Named Entity Recognition Method With Distant Supervision (Preprint)

10.2196/preprints.28219 ◽

2021 ◽

Author(s):

Qi Jia ◽

Dezheng Zhang ◽

Haifeng Xu ◽

Yonghong Xie

Keyword(s):

Chinese Medicine ◽

Gold Standard ◽

False Negative ◽

Named Entity Recognition ◽

Entity Recognition ◽

Data Set ◽

Standard Data ◽

Named Entity ◽

Medical Entity ◽

Clinical Records

BACKGROUND Traditional Chinese medicine (TCM) clinical records contain patients' symptoms, diagnoses, and the subsequent treatment given by doctors. These records are important resources for research and analysis of TCM diagnosis knowledge. However, most TCM clinical records are unstructured text. Therefore, a method to automatically extract medical entities from TCM clinical records is indispensable. OBJECTIVE Training a medical entity extraction model requires a large annotated corpus. The cost of annotating a corpus is very high and there is a lack of gold-standard data sets for supervised learning methods. Therefore, we utilized distantly supervised named entity recognition (NER) to respond to the challenge. METHODS We propose a span-level distantly supervised NER approach to extract TCM medical entities. It utilizes a pretrained language model and a simple multilayer neural network as a classifier to detect and classify entities. We also designed a negative sampling strategy for the span-level model. The strategy randomly selects negative samples in every epoch and periodically filters out possible false-negative samples, which reduces their harmful influence. RESULTS We compare our method with other baseline methods on a gold-standard data set to illustrate its effectiveness. The F1 score of our method is 77.34, and it remarkably outperforms the other baselines. CONCLUSIONS We developed a distantly supervised NER approach to extract medical entities from TCM clinical records. We evaluated our approach on a TCM clinical record data set. Our experimental results indicate that the proposed approach achieves better performance than other baselines.

Investigating Heterogeneity in Brand Preferences in Logit Models for Panel Data

Journal of Marketing Research ◽

10.1177/002224379102800404 ◽

1991 ◽

Vol 28 (4) ◽

pp. 417-428 ◽

Cited By ~ 69

Author(s):

Pradeep K. Chintagunta ◽

Dipak C. Jain ◽

Naufel J. Vilcassim

Keyword(s):

Panel Data ◽

Random Effects ◽

Brand Choice ◽

Marketing Mix ◽

Statistical Test ◽

Model Parameters ◽

Logit Models ◽

Holdout Sample ◽

Brand Preferences

In analyzing panel data, the issue of heterogeneity across households is an important consideration. If heterogeneity is present but is ignored in the analysis, it will result in biased and inconsistent estimates of the effects of marketing mix variables on brand choice. The authors propose the use of a random effects specification to account for heterogeneity in brand preferences across households in a logit framework. The model parameters are estimated by both parametric and semiparametric approaches. The authors also compare their results with those obtained from logit models in which observed past choice behavior is used to capture such heterogeneity. The different models are estimated with the IRI saltine crackers dataset. A formal statistical test of the model specifications shows that the semiparametric specification is the most preferred in terms of the overall fit of the model to the data. In addition, that specification predicts best when the models are validated in a holdout sample of households.
