The impact of different sources of heterogeneity on loss of accuracy from genomic prediction models

2018 ◽  
Author(s):  
Yuqing Zhang ◽  
Christoph Bernau ◽  
Giovanni Parmigiani ◽  
Levi Waldron

SUMMARY Cross-study validation (CSV) of prediction models is an alternative to traditional cross-validation (CV) in domains where multiple comparable datasets are available. Although many studies have noted potential sources of heterogeneity in genomic studies, to our knowledge none have systematically investigated their intertwined impacts on prediction accuracy across studies. We employ a hybrid parametric/non-parametric bootstrap method to realistically simulate publicly available compendia of microarray, RNA-seq, and whole metagenome shotgun (WMS) microbiome studies of health outcomes. Three types of heterogeneity between studies are manipulated and studied: 1) imbalances in the prevalence of clinical and pathological covariates, 2) differences in gene covariance that could be caused by batch, platform, or tumor purity effects, and 3) differences in the “true” model that associates gene expression and clinical factors to outcome. We assess model accuracy while altering these factors. Lower accuracy is seen in CSV than in CV. Surprisingly, heterogeneity in known clinical covariates and differences in gene covariance structure contribute very little to the loss of accuracy when validating in new studies. However, forcing identical generative models greatly reduces the within/across-study difference. These results, observed consistently for multiple disease outcomes and omics platforms, suggest that the most easily identifiable sources of study heterogeneity are not necessarily the primary ones that undermine the ability to accurately replicate the accuracy of omics prediction models in new studies. Unidentified heterogeneity, such as could arise from unmeasured confounding, may be more important.
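The CSV-versus-CV contrast at the heart of this abstract can be sketched minimally as follows; the one-dimensional threshold classifier and the simulated mean shift between studies are illustrative stand-ins, not the paper's hybrid parametric/non-parametric bootstrap pipeline.

```python
import random
import statistics

def accuracy(model, data):
    """Fraction of (feature, label) pairs the model classifies correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

def fit_threshold(data):
    """Toy one-feature classifier: threshold at the midpoint of class means."""
    m0 = statistics.mean(x for x, y in data if y == 0)
    m1 = statistics.mean(x for x, y in data if y == 1)
    t = (m0 + m1) / 2
    return lambda x: int(x > t) if m1 > m0 else int(x < t)

def cv_accuracy(study, k=5, seed=0):
    """Traditional k-fold cross-validation within a single study."""
    rng = random.Random(seed)
    data = study[:]
    rng.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    accs = []
    for i in range(k):
        train = [d for j, f in enumerate(folds) if j != i for d in f]
        accs.append(accuracy(fit_threshold(train), folds[i]))
    return statistics.mean(accs)

def csv_accuracy(train_study, test_study):
    """Cross-study validation: fit on one study, test on another."""
    return accuracy(fit_threshold(train_study), test_study)
```

Simulating a second study whose feature distribution is shifted relative to the first typically yields lower CSV than CV accuracy, mirroring the paper's headline observation.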

2015 ◽  
Vol 43 (10) ◽  
pp. 1657-1666 ◽  
Author(s):  
Jun-cheng Zhang ◽  
Wen-quan Ling ◽  
Zhao-yi Zhang ◽  
Jun Xie

We conceptualized work engagement as a mediator and person–supervisor fit as a moderator for understanding the impact mechanism of organizational commitment on turnover intention. With survey data collected from a sample of 512 building engineers in Taiwan, we tested the total effect moderation model that we proposed in this study via a path analysis procedure using a parametric bootstrap method. Results indicated that work engagement partially mediated the negative effect of organizational commitment on turnover intention, and that the negative relationship between organizational commitment and turnover intention became weaker when person–supervisor fit was closer. Implications for management theory and practice are discussed.
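As a sketch of the parametric bootstrap idea used for such indirect (mediated) effects, one can draw each path coefficient from a normal approximation of its sampling distribution and take percentile bounds on the product; the coefficients and standard errors in the test are hypothetical, not taken from the study.

```python
import random

def parametric_bootstrap_indirect(a, se_a, b, se_b, n_boot=5000, alpha=0.05, seed=0):
    """Parametric bootstrap CI for an indirect effect a*b: draw each path
    coefficient from a normal approximation of its sampling distribution
    and take percentiles of the resulting products."""
    rng = random.Random(seed)
    draws = sorted(rng.gauss(a, se_a) * rng.gauss(b, se_b) for _ in range(n_boot))
    lo = draws[int(alpha / 2 * n_boot)]
    hi = draws[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

A confidence interval that excludes zero supports partial mediation of the kind reported here.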


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Yahya Albalawi ◽  
Jim Buckley ◽  
Nikola S. Nikolov

Abstract This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four out of the 26 pre-processing techniques improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to BLSTM led to the most accurate model, with an F1 score of 75.2% and accuracy of 90.7%; Mazajak CBOW with the same architecture achieved a higher F1 score of 90.8% but a lower accuracy of 70.89%. Our results also show that the performance of the best of the traditional classifiers we trained is comparable to the deep learning methods on the first dataset, but significantly worse on the second dataset.
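The evaluation loop implied here — apply a candidate pre-processing, retrain, and compare F1 on held-out data — can be sketched as follows; the toy keyword classifier in the test stands in for KNN/SVM/NB/Logistic Regression and is purely illustrative.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def evaluate_preprocessings(preprocessings, train, test, fit):
    """Apply each candidate pre-processing to train/test text, fit a
    classifier, and record the resulting F1 on the held-out set."""
    scores = {}
    for name, prep in preprocessings.items():
        tr = [(prep(x), y) for x, y in train]
        te = [(prep(x), y) for x, y in test]
        model = fit(tr)
        scores[name] = f1_score([y for _, y in te], [model(x) for x, _ in te])
    return scores
```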


Genetics ◽  
1998 ◽  
Vol 149 (1) ◽  
pp. 445-458 ◽  
Author(s):  
Nick Goldman ◽  
Jeffrey L Thorne ◽  
David T Jones

Abstract Empirically derived models of amino acid replacement are employed to study the association between various physical features of proteins and evolution. The strengths of these associations are statistically evaluated by applying the models of protein evolution to 11 diverse sets of protein sequences. Parametric bootstrap tests indicate that the solvent accessibility status of a site has a particularly strong association with the process of amino acid replacement that it experiences. Significant association between secondary structure environment and the amino acid replacement process is also observed. Careful description of the length distribution of secondary structure elements and of the organization of secondary structure and solvent accessibility along a protein did not always significantly improve the fit of the evolutionary models to the data sets that were analyzed. As indicated by the strength of the association of both solvent accessibility and secondary structure with amino acid replacement, the process of protein evolution—both above and below the species level—will not be well understood until the physical constraints that affect protein evolution are identified and characterized.


BMJ Open ◽  
2021 ◽  
Vol 11 (8) ◽  
pp. e045572
Author(s):  
Andreas Daniel Meid ◽  
Ana Isabel Gonzalez-Gonzalez ◽  
Truc Sophia Dinh ◽  
Jeanet Blom ◽  
Marjan van den Akker ◽  
...  

Objective To explore factors that potentially impact external validation performance while developing and validating a prognostic model for hospital admissions (HAs) in complex older general practice patients. Study design and setting Using individual participant data from four cluster-randomised trials conducted in the Netherlands and Germany, we used logistic regression to develop a prognostic model to predict all-cause HAs within a 6-month follow-up period. A stratified intercept was used to account for heterogeneity in baseline risk between the studies. The model was validated both internally and by using internal-external cross-validation (IECV). Results Prior HAs, physical components of health-related quality of life, a comorbidity index, and medication-related variables were used in the final model. While achieving moderate discriminatory performance, internal bootstrap validation revealed a pronounced risk of overfitting. The results of the IECV, in which calibration was highly variable even after accounting for between-study heterogeneity, agreed with this finding. Heterogeneity was equally reflected in differing baseline risk, predictor effects and absolute risk predictions. Conclusions Predictor effect heterogeneity and differing baseline risk can explain the limited external performance of HA prediction models. With such drivers known, model adjustments in external validation settings (eg, intercept recalibration, complete updating) can be applied more purposefully. Trial registration number PROSPERO ID: CRD42018088129.


Animals ◽  
2021 ◽  
Vol 11 (7) ◽  
pp. 2050
Author(s):  
Beatriz Castro Dias Cuyabano ◽  
Gabriel Rovere ◽  
Dajeong Lim ◽  
Tae Hun Kim ◽  
Hak Kyo Lee ◽  
...  

It is widely known that the environment influences phenotypic expression and that its effects must be accounted for in genetic evaluation programs. The most commonly used method to account for environmental effects is to add herd and contemporary group to the model. Although generally informative, the herd effect treats different farms as independent units. However, if two farms are located physically close to each other, they potentially share correlated environmental factors. We introduce a method to model herd effects that uses the physical distances between farms, based on Global Positioning System (GPS) coordinates, as a proxy for the correlation matrix of these effects, aiming to account for similarities and differences between farms due to environmental factors. A population of Hanwoo Korean cattle was used to evaluate the impact of modelling herd effects as correlated, in comparison to assuming the farms are completely independent units, on the variance components and genomic prediction. The main result was an increase in the reliabilities of the predicted genomic breeding values compared to the reliabilities obtained with traditional models (across the four traits evaluated, reliabilities of prediction increased by 0.05 ± 0.01 to 0.33 ± 0.03), suggesting that the traditional models may overestimate heritabilities. Although little to no significant gain was obtained in phenotypic prediction, the increased reliability of the predicted genomic breeding values is of practical relevance for genetic evaluation programs.
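One way to turn GPS coordinates into a correlation proxy for herd effects, as described above, is a distance-decay kernel over great-circle distances; the exponential form and the range parameter `rho` are illustrative assumptions, since the abstract does not specify the exact construction.

```python
import math

def haversine_km(p1, p2):
    """Great-circle distance (km) between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def herd_correlation_matrix(coords, rho=50.0):
    """Correlation proxy for herd effects that decays exponentially with
    the physical distance between farms (rho is a hypothetical range, km)."""
    n = len(coords)
    return [[math.exp(-haversine_km(coords[i], coords[j]) / rho)
             for j in range(n)] for i in range(n)]
```

The resulting symmetric matrix, with ones on the diagonal and larger values for physically closer farms, would replace the identity matrix implicitly assumed when herds are treated as independent.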


2021 ◽  
Vol 6 (1) ◽  
pp. e003499
Author(s):  
Ryan G Wagner ◽  
Nigel J Crowther ◽  
Lisa K Micklesfield ◽  
Palwende Romauld Boua ◽  
Engelbert A Nonterah ◽  
...  

Introduction Cardiovascular disease (CVD) risk factors are increasing in sub-Saharan Africa. The impact of these risk factors on future CVD outcomes and burden is poorly understood. We examined the magnitude of modifiable risk factors, estimated future CVD risk and compared results between three commonly used 10-year CVD risk factor algorithms and their variants in four African countries. Methods In the Africa-Wits-INDEPTH partnership for Genomic studies (the AWI-Gen Study), 10 349 randomly sampled individuals aged 40–60 years from six sites participated in a survey, with blood pressure, blood glucose and lipid levels measured. Using these data, 10-year CVD risk estimates were generated using Framingham, Globorisk and WHO-CVD and their office-based variants. Differences in future CVD risk and results by algorithm are described using kappa statistics and correlation coefficients to examine agreement and correlations, respectively. Results The 10-year CVD risk across all participants in all sites varied from 2.6% (95% CI: 1.6% to 4.1%) using the WHO-CVD lab algorithm to 6.5% (95% CI: 3.7% to 11.4%) using the Framingham office algorithm, with substantial differences in risk between sites. The highest risk was in South African settings (in urban Soweto: 8.9% (IQR: 5.3–15.3)). Agreement between algorithms was low to moderate (kappa from 0.03 to 0.55) and correlations ranged between 0.28 and 0.70. Depending on the algorithm used, the proportion of those at high risk (defined as risk of a 10-year CVD event >20%) who were under treatment for a modifiable risk factor ranged from 19.2% to 33.9%, with substantial variation by both sex and site. Conclusion The African sites in this study are at different stages of an ongoing epidemiological transition, as evidenced by both risk factor levels and estimated 10-year CVD risk. There is low correlation and disparate levels of population risk, predicted by different risk algorithms, within sites. Validating existing risk algorithms or designing context-specific 10-year CVD risk algorithms is essential for accurately defining population risk and targeting national policies and individual CVD treatment on the African continent.
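Agreement between risk algorithms, after dichotomizing each estimate at the >20% high-risk threshold, can be quantified with Cohen's kappa as reported above; a minimal implementation:

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters beyond chance,
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    cats = set(labels_a) | set(labels_b)
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    pe = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats)
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

def risk_category(risk, high=0.20):
    """Dichotomize a 10-year CVD risk estimate at the >20% threshold."""
    return "high" if risk > high else "low"
```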


Sensors ◽  
2021 ◽  
Vol 21 (10) ◽  
pp. 3338
Author(s):  
Ivan Vajs ◽  
Dejan Drajic ◽  
Nenad Gligoric ◽  
Ilija Radovanovic ◽  
Ivan Popovic

Existing government air quality monitoring networks consist of static measurement stations, which are highly reliable and accurately measure a wide range of air pollutants, but are very large, expensive and require significant amounts of maintenance. As a promising solution, low-cost sensors are being introduced as complementary air quality monitoring stations. These sensors are, however, not as reliable, due to lower accuracy, a short life cycle and corresponding calibration issues. Recent studies have shown that low-cost sensors are affected by relative humidity and temperature. In this paper, we explore machine learning methods to further improve the calibration algorithms, with the aim of increasing measurement accuracy by taking into account the impact of temperature and humidity on the readings. A detailed comparative analysis of linear regression, artificial neural network and random forest algorithms is presented, analyzing their performance on measurements of CO, NO2 and PM10 particles; depending on the observed period of the year, the achieved R² was 0.93–0.97 for CO, 0.82–0.94 for NO2 and 0.73–0.89 for PM10. A comprehensive analysis and recommendations on how low-cost sensors could be used as complementary monitoring stations to the reference ones, to increase spatial and temporal measurement resolution, are provided.
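Of the three algorithms compared, the simplest — a linear calibration of the raw sensor reading against a reference station, with temperature and relative humidity as covariates — can be sketched with ordinary least squares; the variable names and the synthetic data in the test are assumptions, not the paper's dataset.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_calibration(rows, refs):
    """Least-squares calibration y ~ b0 + b1*raw + b2*temp + b3*rh,
    fit via the normal equations (rows are (raw, temp, rh) tuples,
    refs are the co-located reference-station values)."""
    X = [[1.0, *row] for row in rows]
    k = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    Xty = [sum(x[i] * y for x, y in zip(X, refs)) for i in range(k)]
    return solve(XtX, Xty)

def r_squared(y, yhat):
    """Coefficient of determination of predictions yhat against y."""
    m = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - m) ** 2 for a in y)
    return 1 - ss_res / ss_tot
```

The neural-network and random-forest variants in the paper replace the linear form but are evaluated with the same R² metric.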


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Neel Patel ◽  
William S. Bush

Abstract Background Transcriptional regulation is complex, requiring multiple cis (local) and trans-acting mechanisms working in concert to drive gene expression, with disruption of these processes linked to multiple diseases. Previous computational attempts to understand the influence of regulatory mechanisms on gene expression have used prediction models containing input features derived from cis-regulatory factors. However, local chromatin looping and trans-acting mechanisms are known to also influence transcriptional regulation, and their inclusion may improve model accuracy and interpretation. In this study, we create a general model of transcription factor influence on gene expression by incorporating both cis and trans gene regulatory features. Results We describe a computational framework to model gene expression for GM12878 and K562 cell lines. This framework weights the impact of transcription factor-based regulatory data using multi-omics gene regulatory networks to account for both cis- and trans-acting mechanisms, as well as measures of the local chromatin context. These prediction models perform significantly better than models containing cis-regulatory features alone. Models that additionally integrate long-distance chromatin interactions (chromatin looping) between distal transcription factor binding regions and gene promoters also show improved accuracy. As a demonstration of their utility, effect estimates from these models were used to weight cis-regulatory rare variants for sequence kernel association test analyses of gene expression. Conclusions Our models generate refined effect estimates for the influence of individual transcription factors on gene expression, allowing characterization of their roles across the genome. This work also provides a framework for integrating multiple data types into a single model of transcriptional regulation.
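The final step described — using model effect estimates to weight cis-regulatory rare variants — can be illustrated with a simplified burden-style score; the weighting scheme below is a hypothetical stand-in for the sequence kernel association test (SKAT) actually used, which is a variance-component score test rather than a simple burden sum.

```python
def variant_weights(variant_tfs, tf_effects):
    """Hypothetical weighting: each rare variant is weighted by the summed
    absolute model effect of the transcription factors whose binding
    sites it overlaps."""
    return [sum(abs(tf_effects.get(tf, 0.0)) for tf in tfs)
            for tfs in variant_tfs]

def weighted_burden_scores(genotypes, weights):
    """Per-sample weighted rare-allele burden (a simpler cousin of SKAT):
    each row of genotypes holds 0/1/2 allele counts per variant."""
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]
```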


2021 ◽  
Author(s):  
Tomasz Hadas ◽  
Grzegorz Marut ◽  
Jan Kapłon ◽  
Witold Rohm

The dynamics of water vapor distribution in the troposphere, measured with Global Navigation Satellite Systems (GNSS), is a subject of weather research and climate studies. With GNSS, remote sensing of the troposphere in Europe is performed continuously and operationally under the E-GVAP (http://egvap.dmi.dk/) program with more than 2000 permanent stations. These data are one component of the assimilation systems of mesoscale weather prediction models (10 km scale) for many nations across Europe. However, advancing precise local forecasts of severe weather requires higher-resolution models and observing systems. Further densification of the tracking network, e.g. in urban or mountain areas, would be costly with geodetic-grade equipment. However, the rapid development of GNSS-based applications has resulted in a steady release of mass-market GNSS receivers, and it has been demonstrated that post-processing of GPS data from a dual-frequency low-cost receiver allows retrieving zenith total delay (ZTD) with high accuracy. Although low-cost receivers are a promising solution to the problem of densifying GNSS networks for water vapor monitoring, they still have technological limitations and require further development and calibration.

We have developed a low-cost GNSS station dedicated to real-time GNSS meteorology, which provides GPS, GLONASS and Galileo dual-frequency observations either in RINEX v3.04 format or via an RTCM v3.3 stream, with either Ethernet or GSM data transmission. The first two units are deployed in close vicinity of the permanent station WROC, which belongs to the International GNSS Service (IGS) network. We therefore compare results from real-time and near real-time processing of GNSS observations from a low-cost unit with IGS Final products. We also investigate the impact of replacing a standard patch antenna with an inexpensive survey-grade antenna. Finally, we deploy a local network of low-cost receivers in and around the city of Wroclaw, Poland, in order to analyze the dynamics of troposphere delay at very high spatial resolution.

As a measure of accuracy, we use the standard deviation of the differences between the estimated ZTD and the IGS Final product. For the near real-time mode, that accuracy is 5 mm and 6 mm for the single-frequency (L1) and dual-frequency (L1/L5, E5b) solutions, respectively. We attribute the lower accuracy of the dual-frequency relative solution to the missing antenna phase center correction model for the L5 and E5b frequencies. With the real-time Precise Point Positioning technique, we estimate ZTD with an accuracy of 7.5–8.6 mm. After the antenna replacement, the accuracy improves by almost a factor of 2 (to 4.1 mm), which is close to the 3.1 mm accuracy that we obtain in real time using data from the WROC station.
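The accuracy measure used throughout — the standard deviation of ZTD differences against the IGS Final product — is straightforward to compute; the function names, millimeter units and the sample values in the test are illustrative.

```python
import statistics

def ztd_accuracy_mm(estimated_mm, reference_mm):
    """Accuracy measure used above: standard deviation of the differences
    between estimated ZTD and the reference (e.g. IGS Final) ZTD."""
    diffs = [e - r for e, r in zip(estimated_mm, reference_mm)]
    return statistics.stdev(diffs)

def ztd_bias_mm(estimated_mm, reference_mm):
    """Mean difference (systematic offset), reported separately from the
    standard-deviation accuracy."""
    return statistics.mean(e - r for e, r in zip(estimated_mm, reference_mm))
```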

