A machine-learning heuristic to improve gene score prediction of polygenic traits

Mapping Intimacies ◽

10.1101/107409 ◽

2017 ◽

Author(s):

Guillaume Paré ◽

Shihong Mao ◽

Wei Q. Deng

Keyword(s):

Machine Learning ◽

Machine Learning Techniques ◽

Risk Scores ◽

UK Biobank ◽

Polygenic Risk ◽

Learning Techniques ◽

Diabetes Status ◽

Polygenic Traits ◽

The UK ◽

Prediction Problems

Abstract Machine-learning techniques have helped solve a broad range of prediction problems, yet are not widely used to build polygenic risk scores for the prediction of complex traits. We propose a novel heuristic based on machine-learning techniques (GraBLD) to boost the predictive performance of polygenic risk scores. Gradient boosted regression trees were first used to optimize the weights of SNPs included in the score, followed by a novel regional adjustment for linkage disequilibrium. A calibration set with sample size of ~200 individuals was sufficient for optimal performance. GraBLD yielded prediction R² of 0.239 and 0.082 using GIANT summary association statistics for height and BMI in the UK Biobank study (N=130K; 1.98M SNPs), explaining 46.9% and 32.7% of the overall polygenic variance, respectively. For diabetes status, the area under the receiver operating characteristic curve was 0.602 in the UK Biobank study using summary-level association statistics from the DIAGRAM consortium. GraBLD outperformed other polygenic score heuristics for the prediction of height (p < 2.2 × 10⁻¹⁶) and BMI (p < 1.57 × 10⁻⁴), and was equivalent to LDpred for diabetes. Results were independently validated in the Health and Retirement Study (N=8,292; 688,398 SNPs). Our report demonstrates the use of machine-learning techniques, coupled with summary-level data from large genome-wide meta-analyses, to improve the prediction of polygenic traits.
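As a rough illustration of the quantities this abstract reports, the sketch below computes a polygenic score as a weighted sum of allele dosages and a prediction R² as the squared Pearson correlation between scores and trait values. It does not implement GraBLD: the gradient boosting and linkage disequilibrium adjustment are omitted, and the function names, weights, genotypes, and trait values are all hypothetical toy inputs.

```python
from statistics import mean


def polygenic_score(dosages, weights):
    """Weighted sum of allele dosages (0, 1, or 2 per SNP) for one individual."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


def r_squared(predicted, observed):
    """Squared Pearson correlation between predicted scores and observed values."""
    mp, mo = mean(predicted), mean(observed)
    cov = sum((p - mp) * (o - mo) for p, o in zip(predicted, observed))
    var_p = sum((p - mp) ** 2 for p in predicted)
    var_o = sum((o - mo) ** 2 for o in observed)
    return cov * cov / (var_p * var_o)


# Hypothetical toy data: 3 SNP weights, 4 individuals, 1 quantitative trait.
weights = [0.5, -0.2, 0.1]
genotypes = [[0, 1, 2], [2, 0, 1], [1, 1, 0], [2, 2, 2]]
trait = [0.1, 1.0, 0.2, 0.9]

scores = [polygenic_score(g, weights) for g in genotypes]
print(scores, r_squared(scores, trait))
```

In a heuristic such as the one described, the per-SNP weights would come from the fitted model rather than being set by hand; only the scoring and evaluation steps are shown here.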

Association of accelerometer-derived sleep measures with lifetime psychiatric diagnoses: A cross-sectional study of 89,205 participants from the UK Biobank

PLoS Medicine ◽

10.1371/journal.pmed.1003782 ◽

2021 ◽

Vol 18 (10) ◽

pp. e1003782

Author(s):

Michael Wainberg ◽

Samuel E. Jones ◽

Lindsay Melhuish Beaupre ◽

Sean L. Hill ◽

Daniel Felsky ◽

...

Keyword(s):

Bipolar Disorder ◽

Sleep Duration ◽

Sleep Efficiency ◽

Risk Scores ◽

Psychiatric Diagnoses ◽

UK Biobank ◽

Major Depressive ◽

Cross Sectional ◽

Polygenic Risk ◽

The UK

Background Sleep problems are both symptoms of and modifiable risk factors for many psychiatric disorders. Wrist-worn accelerometers enable objective measurement of sleep at scale. Here, we aimed to examine the association of accelerometer-derived sleep measures with psychiatric diagnoses and polygenic risk scores in a large community-based cohort. Methods and findings In this post hoc cross-sectional analysis of the UK Biobank cohort, 10 interpretable sleep measures (bedtime, wake-up time, sleep duration, wake after sleep onset, sleep efficiency, number of awakenings, duration of longest sleep bout, number of naps, and variability in bedtime and sleep duration) were derived from 7-day accelerometry recordings across 89,205 participants (aged 43 to 79, 56% female, 97% self-reported white) taken between 2013 and 2015. These measures were examined for association with lifetime inpatient diagnoses of major depressive disorder, anxiety disorders, bipolar disorder/mania, and schizophrenia spectrum disorders from any time before the date of accelerometry, as well as polygenic risk scores for major depression, bipolar disorder, and schizophrenia. Covariates consisted of age and season at the time of the accelerometry recording, sex, Townsend deprivation index (an indicator of socioeconomic status), and the top 10 genotype principal components. We found that sleep pattern differences were ubiquitous across diagnoses: each diagnosis was associated with a median of 8.5 of the 10 accelerometer-derived sleep measures, with measures of sleep quality (for instance, sleep efficiency) generally more affected than mere sleep duration. Effect sizes were generally small: for instance, the largest magnitude effect size across the 4 diagnoses was β = −0.11 (95% confidence interval −0.13 to −0.10, p = 3 × 10⁻⁵⁶, FDR = 6 × 10⁻⁵⁵) for the association between lifetime inpatient major depressive disorder diagnosis and sleep efficiency. Associations largely replicated across ancestries and sexes, and accelerometry-derived measures were concordant with self-reported sleep properties. Limitations include the use of accelerometer-based sleep measurement and the time lag between psychiatric diagnoses and accelerometry. Conclusions In this study, we observed that sleep pattern differences are a transdiagnostic feature of individuals with lifetime mental illness, suggesting that they should be considered regardless of diagnosis. Accelerometry provides a scalable way to objectively measure sleep properties in psychiatric clinical research and practice, even across tens of thousands of individuals.

Exploring various polygenic risk scores for skin cancer in the phenomes of the Michigan genomics initiative and the UK Biobank with a visual catalog: PRSWeb

PLoS Genetics ◽

10.1371/journal.pgen.1008202 ◽

2019 ◽

Vol 15 (6) ◽

pp. e1008202 ◽

Cited By ~ 11

Author(s):

Lars G. Fritsche ◽

Lauren J. Beesley ◽

Peter VandeHaar ◽

Robert B. Peng ◽

Maxwell Salvatore ◽

...

Keyword(s):

Skin Cancer ◽

Risk Scores ◽

UK Biobank ◽

Polygenic Risk ◽

The UK

Popular music lyrics and musicians’ gender over time: A computational approach

Psychology of Music ◽

10.1177/0305735619871602 ◽

2019 ◽

pp. 030573561987160 ◽

Cited By ~ 1

Author(s):

Manuel Anglada-Tort ◽

Amanda E Krause ◽

Adrian C North

Keyword(s):

Machine Learning ◽

Popular Music ◽

Machine Learning Techniques ◽

Mixed Effect ◽

Gender Distribution ◽

Learning Techniques ◽

Inflection Points ◽

The UK ◽

Music Lyrics ◽

Over Time

The present study investigated how the gender distribution of the United Kingdom’s most popular artists has changed over time and the extent to which these changes might relate to popular music lyrics. Using data mining and machine learning techniques, we analyzed all songs that reached the UK weekly top 5 sales charts from 1960 to 2015 (4,222 songs). DICTION software facilitated a computerized analysis of the lyrics, measuring a total of 36 lyrical variables per song. Results showed a significant inequality in gender representation on the charts. However, the presence of female musicians increased significantly over the time span. The most critical inflection points leading to changes in the prevalence of female musicians were in 1968, 1976, and 1984. Linear mixed-effect models showed that the total number of words and the use of self-reference in popular music lyrics changed significantly as a function of musicians’ gender distribution over time, and particularly around the three critical inflection points identified. Irrespective of gender, there was a significant trend toward increasing repetition in the lyrics over time. Results are discussed in terms of the potential advantages of using machine learning techniques to study naturalistic singles sales charts data.

Symptom clusters among cancer survivors: what can machine learning techniques tell us?

BMC Medical Research Methodology ◽

10.1186/s12874-021-01352-4 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Koen I. Neijenhuijs ◽

Carel F. W. Peeters ◽

Henk van Weert ◽

Pim Cuijpers ◽

Irma Verdonck-de Leeuw

Keyword(s):

Machine Learning ◽

High Risk ◽

Cancer Survivors ◽

Well Being ◽

Physical Symptoms ◽

Symptom Clusters ◽

Machine Learning Techniques ◽

Risk Scores ◽

Learning Techniques ◽

Patient Reported

Abstract Purpose Knowledge regarding symptom clusters may inform targeted interventions. The current study investigated symptom clusters among cancer survivors, using machine learning techniques on a large data set. Methods Data consisted of self-reports of cancer survivors who used a fully automated online application, ‘Oncokompas’, that supports them in their self-management. This is done by 1) monitoring their symptoms through patient reported outcome measures (PROMs); and 2) providing a personalized overview of supportive care options tailored to their scores, aiming to reduce symptom burden and improve health-related quality of life. In the present study, data on 26 generic symptoms (physical and psychosocial) were used. Results of the PROM of each symptom are presented to the user as a no well-being risk, moderate well-being risk, or high well-being risk score. Data of 1032 cancer survivors were analysed using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) on high risk scores and moderate-to-high risk scores separately. Results When analyzing the high risk scores, seven clusters were extracted: one main cluster which contained the most frequently occurring physical and psychosocial symptoms, and six subclusters with different combinations of these symptoms. When analyzing moderate-to-high risk scores, three clusters were extracted: two main clusters, which separated physical symptoms (and their consequences) from psychosocial symptoms, and one subcluster with only body weight issues. Conclusion There appears to be an inherent difference in the co-occurrence of symptoms depending on symptom severity. Among survivors with high risk scores, the data showed a clustering of more connections between physical and psychosocial symptoms in separate subclusters. Among survivors with moderate-to-high risk scores, we observed fewer connections in the clustering between physical and psychosocial symptoms.

Significant Sparse Polygenic Risk Scores across 428 traits in UK Biobank

10.1101/2021.09.02.21262942 ◽

2021 ◽

Author(s):

Yosuke Tanigawa ◽

Junyang Qian ◽

Guhan Ram Venkataraman ◽

Johanne M. Justesen ◽

Ruilin Li ◽

...

Keyword(s):

Genetic Variants ◽

Quantitative Traits ◽

Predictive Performance ◽

Risk Scores ◽

Polygenic Risk Score ◽

UK Biobank ◽

Polygenic Risk ◽

Systematic Assessment ◽

Phenotype Data ◽

The UK

We present a systematic assessment of polygenic risk score (PRS) prediction across more than 1,600 traits using genetic and phenotype data in the UK Biobank. We report 428 sparse PRS models with significant (p < 2.5e-5) incremental predictive performance when compared against the covariate-only model that considers age, sex, and the genotype principal components. We report a significant correlation between the number of genetic variants selected in the sparse PRS model and the incremental predictive performance in quantitative traits (Spearman's ρ = 0.54, p = 1.4e-15), but not in binary traits (ρ = 0.059, p = 0.35). The sparse PRS models trained on European individuals showed limited transferability when evaluated on non-European individuals in the UK Biobank. We provide the PRS model weights on the Global Biobank Engine (https://biobankengine.stanford.edu/prs).

Use of Machine Learning in Stock Market Prediction

European Journal of Technology ◽

10.47672/ejt.634 ◽

2020 ◽

Vol 4 (1) ◽

pp. 60-73

Author(s):

Memoona Shaheen ◽

Mehreen Arshad

Keyword(s):

Machine Learning ◽

Policy Implications ◽

Machine Learning Techniques ◽

Support Vector ◽

Hybrid Techniques ◽

Current Trends ◽

Time Period ◽

Learning Techniques ◽

Vector Machines ◽

Prediction Problems

Objective: The objective of this study was to examine and determine future directions for machine learning techniques based on a review of the current literature. Methodology: A systematic review was used to survey current trends in peer-reviewed journal articles from the past twenty years. The studies were grouped into four categories: neural networks, support vector machines, genetic algorithms, and hybrid techniques. Studies in each of these categories were evaluated. Findings: First, there is a strong link between machine learning methods and the prediction problems they are associated with. Second, past studies need to improve the generalizability of their results. Most of the studies reviewed in this analysis applied machine learning systems to only one market or one time period, without considering whether the system would be adaptable to other situations and conditions. Limitations, future trends, and policy implications have been defined.

Abstract P879: Differences in Statistical Performance of Polygenic Risk Scores for Cardiovascular Disease Across Different Race/Ethnicities

Stroke ◽

10.1161/str.52.suppl_1.p879 ◽

2021 ◽

Vol 52 (Suppl_1) ◽

Author(s):

Julian N Acosta ◽

Cameron Both ◽

Natalia Szejko ◽

Stacy Brown ◽

Kevin N Sheth ◽

...

Keyword(s):

Cardiovascular Disease ◽

Logistic Regression ◽

Genetic Risk ◽

Regression Models ◽

Risk Scores ◽

UK Biobank ◽

Polygenic Risk ◽

Logistic Regression Models ◽

The UK ◽

Significant Health

Introduction: Genome-wide association studies have identified numerous genetic risk variants for stroke and myocardial infarction (MI) in Europeans. However, the limited applicability of these results to non-Europeans due to racial/ethnic differences in the genetic architecture of cardiovascular disease (CVD), coupled with the limited availability of genomic data in non-Europeans, may create significant health disparities now that genomic-based precision medicine is a reality. We tested the hypothesis that the performance of polygenic risk scores (PRS) for CVD differs in Europeans versus non-Europeans. Methods: We conducted a nested study within the UK Biobank, a prospective, population-based study that enrolled ~500,000 participants across the UK. For this study, we identified self-reported black participants and randomly matched them 1:1 by age and sex with white participants. We created a PRS using previously discovered loci for stroke and MI. We then tested whether this PRS representing the aggregate polygenic susceptibility to CVD yielded similar precision in black versus white participants in logistic regression models. Results: Of the 502,536 participants enrolled in the UK Biobank, 8,061 were self-reported blacks, with 7,644 having available data for our analyses. We randomly matched these participants with white individuals, leading to a total sample size of 15,288 (mean age 51.9 [SD 8.1], female 8,722 [57%]). The total number of events was 741 overall, with 363 happening in blacks and 378 happening in whites. In logistic regression models including age, sex, and 5 principal components, the statistical precision (e.g., narrower confidence intervals) for the PRS was substantially higher for whites (OR 1.22, 95% CI 1.08-1.37; p<0.0001) compared to blacks (OR 1.24, 95% CI 1.05-1.47; p=0.01). Secondary analyses using genetically-determined ancestry yielded similar results.
Conclusion: Because CVD-related PRSs are derived mainly using genetic risk factors identified in populations of European ancestry, their statistical performance is lower in non-European populations. This asymmetry can lead to significant health disparities now that these tools are being evaluated in multiple precision medicine approaches.
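The precision comparison in the Results can be made concrete by converting each reported 95% confidence interval into a width on the log-odds scale. The sketch below assumes the intervals are symmetric Wald intervals on that scale (an assumption, since the abstract does not state how they were computed); the function names are illustrative only.

```python
import math


def log_ci_width(lower, upper):
    """Width of an odds-ratio confidence interval on the log-odds scale."""
    return math.log(upper) - math.log(lower)


def approx_standard_error(lower, upper, z=1.96):
    """Approximate SE of log(OR), assuming a symmetric Wald 95% interval."""
    return log_ci_width(lower, upper) / (2 * z)


# Intervals as reported in the abstract.
se_white = approx_standard_error(1.08, 1.37)
se_black = approx_standard_error(1.05, 1.47)

print(f"white: SE of log(OR) ~ {se_white:.3f}")
print(f"black: SE of log(OR) ~ {se_black:.3f}")
```

Under that assumption the implied standard error is smaller for the white group, which is what a narrower interval (higher precision) means, even though the two point estimates (1.22 and 1.24) are nearly identical.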

Comparison of Machine Learning Techniques for Prediction Problems

Advances in Intelligent Systems and Computing - Web, Artificial Intelligence and Network Applications ◽

10.1007/978-3-030-15035-8_69 ◽

2019 ◽

pp. 713-723 ◽

Cited By ~ 3

Author(s):

Yoney Kirsal Ever ◽

Kamil Dimililer ◽

Boran Sekeroglu

Keyword(s):

Machine Learning ◽

Machine Learning Techniques ◽

Learning Techniques ◽

Prediction Problems

Evaluation of polygenic risk scores for 17 cancer types in relation to cognitive decline in the UK Biobank

Alzheimer's & Dementia ◽

10.1002/alz.041625 ◽

2020 ◽

Vol 16 (S10) ◽

Author(s):

Rebecca E Graff ◽

Sarah F Ackley ◽

Monica Ospina Romero ◽

Scott C Zimmerman ◽

Fanny M Elahi ◽

...

Keyword(s):

Cognitive Decline ◽

Risk Scores ◽

UK Biobank ◽

Polygenic Risk ◽

Cancer Types ◽

The UK

Cost Benefits of Using Machine Learning Features in NIDS for Cyber Security in UK Small Medium Enterprises (SME)

Future Internet ◽

10.3390/fi13080186 ◽

2021 ◽

Vol 13 (8) ◽

pp. 186

Author(s):

Nisha Rawindaran ◽

Ambikesh Jayal ◽

Edmond Prakash ◽

Chaminda Hewage

Keyword(s):

Machine Learning ◽

Home Environment ◽

Cyber Security ◽

Working Environment ◽

Machine Learning Techniques ◽

Learning Techniques ◽

Set Up ◽

The UK ◽

Medium Enterprises ◽

At Home

Cyber security has made an impact and has challenged Small and Medium Enterprises (SMEs) in how they protect and secure data. With the increase in wired and wireless connections and devices on SME networks, unpredictable malicious activities and interruptions have risen. Finding the harmony between the advancement of technology and costs has always been a balancing act, particularly in convincing the finance directors of these SMEs to invest capital in their IT infrastructure. This paper looks at various devices currently on the market to detect intrusions, and at how these devices handle prevention strategies for SMEs in their working environment both at home and in the office, in terms of their credibility in handling zero-day attacks against the costs of doing so. The experiment was set up during the 2020 COVID-19 pandemic, when the world experienced an unprecedented event of large scale. The operational working environment of SMEs reflected the context when the UK went into lockdown. Before the pandemic, this experiment would have taken place entirely within an operational office environment; however, the COVID-19 period has pushed us to evaluate every aspect of cybersecurity from the office while keeping the data safe within the home environment. The devices chosen for this experiment were OpenSource devices, SNORT and pfSense, to detect activities within the home environment, and Cisco, a commercial device, set up within an SME network. All three devices operated in a live environment within the SME network structure, with employees being both at home and in the office. All three devices were observed in terms of the rules they displayed, their costs, and the machine learning techniques integrated within them. The results revealed these aspects to be important in how they identified zero-day attacks. The findings showed that OpenSource devices, whilst free to download, required a high level of expertise in personnel to implement and embed machine learning rules into the business solution, even for staff working from home. However, when using Cisco, the price reflected the buy-in into this expertise and Cisco’s mainframe network, to give up-to-date information on cyber-attacks. The requirements of the UK General Data Protection Regulations Act (GDPR) were also acknowledged as part of the broader framework of the study. Machine learning techniques such as anomaly-based intrusion detection did show better detection through a commercial subscription-based model with support from Cisco compared to that of the OpenSource model, which required internal expertise in machine learning. A cost model was used to compare the outcome of SMEs’ decision making in getting the right framework in place to secure their data. In conclusion, finding a balance between IT expertise and the costs of products that are able to help SMEs protect and secure their data will allow SMEs to benefit from a more intelligent, controlled environment with applied machine learning techniques, without compromising on costs.
