Population-specific recombination maps from segments of identity by descent

2019 ◽  
Author(s):  
Ying Zhou ◽  
Brian L. Browning ◽  
Sharon R. Browning

Abstract Recombination rates vary significantly across the genome, and estimates of recombination rates are needed for downstream analyses such as haplotype phasing and genotype imputation. Existing methods for recombination rate estimation are limited by insufficient amounts of informative genetic data or by high computational cost. We present a method for using segments of identity by descent to infer recombination rates. Our method can be applied to sequenced population cohorts to obtain high-resolution, population-specific recombination maps. We use our method to generate new recombination maps for European Americans and for African Americans from TOPMed sequence data from the Framingham Heart Study (1626 unrelated individuals) and the Jackson Heart Study (2046 unrelated individuals). We compare our maps to existing maps using the Pearson correlation between estimated recombination rates. In Europeans we use the deCODE map, which is based on a very large set of Icelandic family data (126,407 meioses), as a gold standard against which to compare other maps. Our European American map has higher accuracy at fine-scale resolution (1-10 kb) than linkage disequilibrium maps from the HapMap and 1000 Genomes projects. Our African American map has much higher accuracy than an admixture-based map derived from a similar number of individuals, and similar accuracy at fine scales (1-10 kb) to an admixture-based map derived from 15 times as many individuals.
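The map comparison described above comes down to correlating two maps' per-interval rates. A minimal, hedged sketch of that comparison; the rate values and the plain-Python `pearson` helper are illustrative, not the study's data or code:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Illustrative per-bin recombination rates (cM/Mb) at one resolution.
map_a = [0.5, 1.2, 3.4, 0.9, 2.1]
map_b = [0.6, 1.0, 3.1, 1.1, 2.0]
print(round(pearson(map_a, map_b), 3))
```

Binning both maps at the same resolution (e.g. 1 kb or 10 kb windows) before correlating is what makes the fine-scale accuracy comparison meaningful.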

2018 ◽  
Author(s):  
Mark Abney ◽  
Aisha El Sherbiny

Abstract Motivation: Genotype imputation, though generally accurate, often results in many genotypes being poorly imputed, particularly in studies where the individuals are not well represented by standard reference panels. When individuals in the study share regions of the genome identical by descent (IBD), it is possible to use this information in combination with a study-specific reference panel (SSRP) to improve the imputation results. Kinpute uses IBD information, due to either recent, familial relatedness or distant, unknown ancestors, in conjunction with the output from linkage disequilibrium (LD) based imputation methods to compute more accurate genotype probabilities. Kinpute uses a novel method for IBD imputation, which works even in the absence of a pedigree, and results in substantially improved imputation quality. Results: Given initial estimates of average IBD between subjects in the study sample, Kinpute uses a novel algorithm to select an optimal set of individuals to sequence and use as an SSRP. Kinpute is designed to take as input both this SSRP and the genotype probabilities output by other LD-based imputation software, and uses a new method to combine the LD-imputed genotype probabilities with IBD configurations to substantially improve imputation. We tested Kinpute on a human population isolate in which 98 individuals have been sequenced. In half of this sample, whose sequence data were masked, we used Impute2 to perform LD-based imputation, and Kinpute was used to obtain higher-accuracy genotype probabilities. Measures of imputation accuracy improved significantly, particularly for those genotypes that Impute2 imputed with low certainty. Availability: Kinpute is an open-source and freely available C++ software package that can be downloaded from https://github.com/markabney/Kinpute/releases.
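As a rough illustration of combining LD-based and IBD-based genotype probabilities: Kinpute's actual algorithm is more sophisticated, and the `combine` function, its equal-weight blend, and the probability values below are assumptions for illustration only:

```python
def combine(ld_probs, ibd_probs, weight=0.5):
    """Blend LD-based and IBD-informed genotype probabilities and renormalize.
    Genotypes are coded as 0/1/2 copies of the alternate allele."""
    raw = [(1 - weight) * p + weight * q for p, q in zip(ld_probs, ibd_probs)]
    total = sum(raw)
    return [r / total for r in raw]

# LD imputation was uncertain here; an IBD match with a sequenced panel
# member strongly favours the heterozygote, shifting the combined call.
print(combine([0.4, 0.35, 0.25], [0.05, 0.9, 0.05]))
```

The point of the sketch is only that IBD evidence can sharpen a genotype call that LD-based imputation left uncertain, which is where the abstract reports the largest gains.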


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Yingxi Yang ◽  
Hui Wang ◽  
Wen Li ◽  
Xiaobo Wang ◽  
Shizhao Wei ◽  
...  

Abstract Background: Protein post-translational modification (PTM) is central to investigating the mechanisms of protein function. With the rapid development of proteomics technology, a large amount of protein sequence data has been generated, which highlights the importance of in-depth study and analysis of PTMs in proteins. Method: We propose a new multi-classification machine learning pipeline, MultiLyGAN, to identify seven types of lysine-modified sites. Using eight sequential and five structural feature-construction methods, 1497 valid features remained after filtering by the Pearson correlation coefficient. To address class imbalance, two influential deep generative methods, the Conditional Generative Adversarial Network (CGAN) and the Conditional Wasserstein Generative Adversarial Network (CWGAN), were leveraged and compared to generate new samples for the classes with fewer samples. Finally, a random forest algorithm was used to predict the seven categories. Results: In tenfold cross-validation, accuracy (Acc) and Matthews correlation coefficient (MCC) were 0.8589 and 0.8376, respectively. In the independent test, Acc and MCC were 0.8549 and 0.8330, respectively. The results indicated that CWGAN better addressed the data imbalance and stabilized the training error. Additionally, an accumulated feature-importance analysis showed that CKSAAP, PWM and structural features were the three most important feature-encoding schemes. MultiLyGAN can be found at https://github.com/Lab-Xu/MultiLyGAN. Conclusions: The CWGAN greatly improved predictive performance in all experiments. Features derived from the CKSAAP, PWM and structure schemes were the most informative and made the greatest contribution to PTM prediction.
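The Pearson-correlation feature filtering mentioned above can be sketched as a greedy pass that drops any feature highly correlated with one already kept. This is a minimal illustration; the paper's threshold and exact procedure may differ, and the feature columns below are invented:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def filter_correlated(columns, threshold=0.9):
    """Greedily keep a feature column only if it is not highly
    correlated with any column already kept."""
    kept = []
    for col in columns:
        if all(abs(pearson(col, k)) < threshold for k in kept):
            kept.append(col)
    return kept

# Three illustrative feature columns; the second is a scaled copy of the
# first, so it is redundant and gets filtered out.
cols = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [4.0, 1.0, 3.0, 2.0]]
print(len(filter_correlated(cols)))
```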


Nature ◽  
2021 ◽  
Vol 590 (7845) ◽  
pp. 290-299 ◽  
Author(s):  
Daniel Taliun ◽  
Daniel N. Harris ◽  
Michael D. Kessler ◽  
Jedidiah Carlson ◽  
...  

Abstract The Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases. The initial phases of the programme focused on whole-genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here we describe the TOPMed goals and design as well as the available resources and early insights obtained from the sequence data. The resources include a variant browser, a genotype imputation server, and genomic and phenotypic data that are available through dbGaP (Database of Genotypes and Phenotypes)1. In the first 53,831 TOPMed samples, we detected more than 400 million single-nucleotide and insertion or deletion variants after alignment with the reference genome. Additional previously undescribed variants were detected through assembly of unmapped reads and customized analysis in highly variable loci. Among the more than 400 million detected variants, 97% have frequencies of less than 1% and 46% are singletons that are present in only one individual (53% among unrelated individuals). These rare variants provide insights into mutational processes and recent human evolutionary history. The extensive catalogue of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and noncoding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and reach of genome-wide association studies to include variants down to a frequency of approximately 0.01%.
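The rare-variant and singleton tallies above follow from simple allele-count thresholds. A hedged sketch; the `classify` helper and the counts are illustrative, not TOPMed's pipeline:

```python
def classify(ac, an):
    """Classify a variant from its alternate allele count `ac` out of
    `an` total alleles: a singleton appears on exactly one haplotype,
    a rare variant has frequency below 1%."""
    if ac == 1:
        return "singleton"
    return "rare" if ac / an < 0.01 else "common"

# (allele count, total alleles) for a few invented variants; 53,831
# diploid samples would give roughly 107,662 alleles per site.
variants = [(1, 107662), (3, 107662), (5000, 107662)]
print([classify(ac, an) for ac, an in variants])  # → ['singleton', 'rare', 'common']
```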


2020 ◽  
Vol 16 (11) ◽  
pp. e1008415
Author(s):  
Teresa Maria Rosaria Noviello ◽  
Francesco Ceccarelli ◽  
Michele Ceccarelli ◽  
Luigi Cerulo

Small non-coding RNAs (ncRNAs) are short non-coding sequences involved in gene regulation in many biological processes and diseases. The lack of a complete understanding of their biological functions, especially in a genome-wide scenario, has demanded new computational approaches to annotate their roles. Secondary structure is widely held to be a key determinant of RNA function, and machine-learning approaches have successfully predicted RNA function from secondary-structure information. Here we show that RNA function can be predicted with good accuracy from a lightweight representation of sequence information, without the need to compute secondary-structure features, which is computationally expensive. This finding appears to go against the dogma of secondary structure being a key determinant of function in RNA. Compared with recent secondary-structure-based methods, the proposed solution is more robust to sequence boundary noise and drastically reduces the computational cost, allowing for large-volume annotations. Scripts and datasets to reproduce the results of the experiments in this study are available at: https://github.com/bioinformatics-sannio/ncrna-deep.
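A lightweight sequence representation of the kind the abstract refers to can be as simple as one-hot encoding of the bases; the representation actually used in the paper may differ, and `one_hot` is an illustrative stand-in:

```python
ALPHABET = "ACGU"  # RNA bases

def one_hot(seq):
    """Encode an RNA sequence as a list of 4-element indicator vectors,
    one per base, in ACGU order."""
    return [[1 if base == a else 0 for a in ALPHABET] for base in seq.upper()]

print(one_hot("GAU"))  # → [[0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1]]
```

Such an encoding is computed in linear time per sequence, which is what makes it so much cheaper than folding-based secondary-structure features.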


Author(s):  
Hehe Fan ◽  
Zhongwen Xu ◽  
Linchao Zhu ◽  
Chenggang Yan ◽  
Jianjun Ge ◽  
...  

We aim to significantly reduce the computational cost of classifying temporally untrimmed videos while retaining similar accuracy. Existing video classification methods sample frames at a predefined frequency over the entire video. In contrast, we propose an end-to-end deep reinforcement learning approach that enables an agent to classify videos by watching only a very small portion of the frames, much as humans do. We make two main contributions. First, information is not evenly distributed across video frames over time. An agent needs to watch more carefully when a clip is informative and skip frames that are redundant or irrelevant. The proposed approach enables the agent to adapt its sampling rate to the video content and skip most frames without loss of information. Second, the number of frames an agent should watch to reach a confident decision varies greatly from one video to another. We incorporate an adaptive stop network that measures a confidence score and generates a timely trigger to stop the agent watching the video, which improves efficiency without loss of accuracy. Our approach significantly reduces the computational cost on the large-scale YouTube-8M dataset, while the accuracy remains the same.
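The adaptive-sampling loop can be sketched as an agent that skips further ahead over uninformative clips and stops once its confidence passes a threshold. The `informative` and `confidence` callbacks below are stand-ins for the paper's learned networks, not its implementation:

```python
def classify_video(frames, informative, confidence, stop_at=0.9):
    """Walk through `frames`, updating a confidence score per watched
    frame; step by 1 through informative clips, skip 4 frames through
    uninformative ones, and stop early once confidence reaches `stop_at`.
    Returns (final confidence, number of frames actually watched)."""
    t, watched, conf = 0, 0, 0.0
    while t < len(frames) and conf < stop_at:
        watched += 1
        conf = confidence(frames[t], conf)
        # Watch densely when the clip is informative, skip ahead otherwise.
        t += 1 if informative(frames[t]) else 4
    return conf, watched

# Toy run: no clip is "informative" and each watched frame adds 0.25
# confidence, so the agent stops after watching only 4 of 100 frames.
frames = list(range(100))  # stand-in for decoded frames
conf, watched = classify_video(frames, lambda f: False, lambda f, c: c + 0.25)
print(conf, watched)
```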


Circulation ◽  
2012 ◽  
Vol 125 (suppl_10) ◽  
Author(s):  
Jaclyn Ellis ◽  
Jeremy Walston ◽  
Josee Dupuis ◽  
Emma Larkin ◽  
Maja Barbalic ◽  
...  

INTRODUCTION: C-reactive protein (CRP) is a heritable biomarker of systemic inflammation and a predictor of cardiovascular disease (CVD). Cigarette smoking is a major risk factor in the development of CVD and has been shown to affect circulating levels of CRP. Therefore, we sought to determine how this important environmental exposure may influence genetic associations with CRP in a multi-ethnic setting. METHODS: Using the ITMAT Broad-CARe (IBC) SNP array, a custom 50,000-SNP gene-centric array with dense coverage of over 2,000 candidate genes for CVD pathways, we performed a meta-analysis of up to 26,065 participants of European descent and 7,584 participants of African descent for association with log-CRP level within smoking-status strata. The 2 smoking strata were: never smokers and ever smokers (comprising current and former smokers). We conducted IBC-wide association scans for CRP within each cohort, race and smoking stratum and meta-analyzed by race. Samples were from the Candidate gene Association Resource (CARe) cohorts (Atherosclerosis Risk in Communities Study, Framingham Heart Study, Cardiovascular Health Study, Cleveland Family Study, Coronary Artery Risk Development in Young Adults Study, Jackson Heart Study, and Multi-Ethnic Study of Atherosclerosis). Results were considered panel-wide statistically significant if p<2.2×10−6. RESULTS: The overall sample size for ever smokers (never smokers) was 11,698 (10,344) in European Americans and 3,448 (4,330) in African Americans. The per-allele beta coefficients for genes previously established to be associated with CRP and present on the IBC chip (CRP, APOE, GCKR, IL6R, LEPR, HNF1A, NLRP3) were very similar in magnitude between smoking strata in European Americans. However, in African Americans, the estimated per-allele CRP and IL6R betas were twice as large in ever smokers as in never smokers.
In the European American analysis, one gene not previously reported for association with CRP reached IBC-wide significance for a CRP-lowering effect in never smokers (GSTT1, p=4.8×10−7 for SNP rs405597), but not in ever smokers (p=0.078). CONCLUSION: This large-scale candidate-gene meta-analysis identified one novel locus (GSTT1) associated with serum CRP levels in those reporting never having regularly smoked. Polymorphisms in GSTT1, which plays a role in detoxification, have previously been reported to interact with smoking for other phenotypes, including birth weight and colorectal cancer. We also observed evidence that smoking modifies the effects of the previously established loci CRP and IL6R in African Americans. These results may identify important context-specific genetic effects that influence chronic inflammation.
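Per-cohort association results like these are typically combined by fixed-effect inverse-variance meta-analysis. A minimal sketch with invented effect sizes; the abstract does not specify the weighting scheme, so this is an assumption:

```python
def meta_analyze(betas, ses):
    """Fixed-effect inverse-variance meta-analysis: combine per-cohort
    effect sizes `betas` with standard errors `ses` into a pooled
    estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in ses]
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se = (1.0 / sum(weights)) ** 0.5
    return beta, se

# Invented per-cohort log-CRP betas and standard errors for one SNP.
print(meta_analyze([0.12, 0.08, 0.10], [0.03, 0.05, 0.04]))
```

The pooled standard error is always smaller than any single cohort's, which is what lets a multi-cohort analysis detect modest effects like the GSTT1 signal.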


2019 ◽  
Vol 17 (06) ◽  
pp. 1950077 ◽  
Author(s):  
Sheng-Tong Zhou ◽  
Qian Xiao ◽  
Jian-Min Zhou ◽  
Hong-Guang Li

The Rackwitz–Fiessler (RF) method is well accepted as an efficient way to solve uncorrelated non-Normal reliability problems by transforming the original non-Normal variables into equivalent Normal variables based on the equivalent Normal conditions. However, this traditional RF method is often abandoned when correlated reliability problems are involved, because the point-by-point application of the equivalent Normal conditions makes it hard for the RF method to clearly describe the correlations of the transformed variables. To this end, some improvements on the traditional RF method are presented from the viewpoints of the isoprobabilistic transformation and copula theory. First, the forward transformation of the RF method from the original space to the standard Normal space is interpreted geometrically as an isoprobabilistic transformation. This viewpoint allows us to describe the stochastic dependence of the transformed variables in the same way as in the Nataf transformation (NATAF). Thus, a corresponding enhanced RF (EnRF) method is proposed to deal with correlated reliability problems described by Pearson linear correlation. Further, we uncover the implicit Gaussian copula hypothesis of the RF method, following from the invariance theorem of copulas and the strictly increasing isoprobabilistic transformation. Meanwhile, based on copula-only rank correlations such as the Spearman and Kendall correlations, two improved RF (IRF) methods are introduced to overcome the potential pitfalls of the Pearson correlation in EnRF. Taking NATAF as a reference, the computational cost and efficiency of the three proposed RF methods within the Hasofer–Lind reliability algorithm are also discussed. Finally, four illustrative structural reliability examples demonstrate the availability and advantages of the newly proposed RF methods.
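The isoprobabilistic transformation at the heart of the discussion maps a variable X with CDF F to a standard Normal variable via z = Φ⁻¹(F(x)). A minimal sketch using an exponential variable for illustration (the choice of distribution here is an assumption, not one of the paper's examples):

```python
import math
from statistics import NormalDist

def to_standard_normal(x, cdf):
    """Map a realization x of a variable with CDF `cdf` into the
    standard Normal space: z = Phi^{-1}(F(x))."""
    return NormalDist().inv_cdf(cdf(x))

# Exponential(rate=1) example: F(x) = 1 - exp(-x).
exp_cdf = lambda x: 1.0 - math.exp(-x)

# The median ln(2) has F = 0.5, so it maps to z = 0, the standard
# Normal median, preserving probability content as required.
z = to_standard_normal(math.log(2.0), exp_cdf)
print(round(z, 6))
```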


2019 ◽  
Vol 32 (10) ◽  
pp. 968-974 ◽  
Author(s):  
Jinhee Jeong ◽  
Haidong Zhu ◽  
Ryan A Harris ◽  
Yanbin Dong ◽  
Shaoyong Su ◽  
...  

Abstract BACKGROUND Ethnic differences in nighttime blood pressure (BP) have long been documented, with African Americans (AAs) having higher BP than European Americans (EAs). Lower nighttime melatonin, a key regulator of circadian rhythms, has been associated with higher nighttime BP levels in EAs. This study sought to test the hypothesis that AAs have lower nighttime melatonin secretion than EAs. We also determined whether this ethnic difference in melatonin could partially explain the ethnic difference in nighttime BP. METHODS A total of 150 young adults (71 AA; 46% female; mean age: 27.7 years) enrolled in the Georgia Stress and Heart study provided an overnight urine sample for the measurement of 6-sulfatoxymelatonin, a major metabolite of melatonin. Urine melatonin excretion (UME) was calculated as the ratio of the 6-sulfatoxymelatonin concentration to the creatinine concentration. Twenty-four-hour ambulatory BP was assessed, and nighttime systolic BP (SBP) was used as the major index of BP regulation. RESULTS After adjustment for age, sex, body mass index, and smoking, AAs had significantly lower UME (P = 0.002) and higher nighttime SBP than EAs (P = 0.036). Lower UME was significantly associated with higher nighttime SBP, and this relationship did not depend on ethnicity. The ethnic difference in nighttime SBP was significantly attenuated after adding UME to the model (P = 0.163). CONCLUSION This study is the first to document an ethnic difference in nighttime melatonin excretion, demonstrating that AAs have lower melatonin secretion than EAs. Furthermore, the ethnic difference in nighttime melatonin can partially account for the established ethnic difference in nighttime SBP.


2019 ◽  
Vol 142 (6) ◽  
Author(s):  
Yutian Wang ◽  
Peng Hao ◽  
Zhendong Guo ◽  
Dachuan Liu ◽  
Qiang Gao

Abstract High computational cost is always a major concern in reliability-based design optimization (RBDO) of complex problems. RBDO performance can be degraded by inaccurate reliability analysis (RA), which is caused by multiple local optima and multiple design points in highly non-linear spaces. In order to reduce the computational burden and guarantee the accuracy of RA (and thus improve RBDO performance), a global RBDO algorithm adopting an improved constraint boundary sampling (GRBDO-ICBS) method is proposed. Specifically, the GRBDO-ICBS method first narrows the search region of concern using a Kriging-based global search. The accuracy of the design points is verified by the expected risk function (ERF), and inaccurate design points are added to the training samples to update the Kriging model. Then a multi-start gradient-based sequential RBDO is carried out, which attempts to locate all design points in the search region of concern. The performance of GRBDO-ICBS is demonstrated on four examples. All results show that the proposed method can achieve accuracy similar to Monte Carlo simulation (MCS)-based RBDO at a much lower computational cost.
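The Monte Carlo simulation (MCS) baseline mentioned above estimates a failure probability as the fraction of samples violating a limit-state function g(x) < 0. A minimal sketch with an illustrative limit state, not one of the paper's four examples; its cost, one g-evaluation per sample, is exactly what surrogate methods like Kriging aim to avoid:

```python
import random

def mcs_failure_probability(g, sample, n=100_000, seed=0):
    """Crude Monte Carlo estimate of P[g(X) < 0]: draw `n` samples via
    `sample(rng)` and count limit-state violations."""
    rng = random.Random(seed)
    failures = sum(1 for _ in range(n) if g(sample(rng)) < 0)
    return failures / n

# Limit state g(x) = 3 - x with x ~ Normal(0, 1), so the true failure
# probability is P(x > 3), about 0.00135.
pf = mcs_failure_probability(lambda x: 3.0 - x, lambda rng: rng.gauss(0.0, 1.0))
print(pf)
```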


2018 ◽  
Vol 18 (13) ◽  
pp. 9975-10006 ◽  
Author(s):  
Leighton A. Regayre ◽  
Jill S. Johnson ◽  
Masaru Yoshioka ◽  
Kirsty J. Pringle ◽  
David M. H. Sexton ◽  
...  

Abstract. Changes in aerosols cause a change in net top-of-the-atmosphere (ToA) short-wave and long-wave radiative fluxes; rapid adjustments in clouds, water vapour and temperature; and an effective radiative forcing (ERF) of the planetary energy budget. The diverse sources of model uncertainty and the computational cost of running climate models make it difficult to isolate the main causes of aerosol ERF uncertainty and to understand how observations can be used to constrain it. We explore the aerosol ERF uncertainty by using fast model emulators to generate a very large set of aerosol–climate model variants that span the model uncertainty due to 27 parameters related to atmospheric and aerosol processes. Sensitivity analyses show that the uncertainty in the ToA flux is dominated (around 80 %) by uncertainties in the physical atmosphere model, particularly parameters that affect cloud reflectivity. However, uncertainty in the change in ToA flux caused by aerosol emissions over the industrial period (the aerosol ERF) is controlled by a combination of uncertainties in aerosol (around 60 %) and physical atmosphere (around 40 %) parameters. Four atmospheric and aerosol parameters account for around 80 % of the uncertainty in short-wave ToA flux (mostly parameters that directly scale cloud reflectivity, cloud water content or cloud droplet concentrations), and these parameters also account for around 60 % of the aerosol ERF uncertainty. The common causes of uncertainty mean that constraining the modelled planetary brightness to tightly match satellite observations changes the lower 95 % credible aerosol ERF value from −2.65 to −2.37 W m−2. This suggests the strongest forcings (below around −2.4 W m−2) are inconsistent with observations.
These results show that, even though the ToA flux is 2 orders of magnitude larger than the aerosol ERF, the observed flux can constrain the uncertainty in ERF because their values are connected by constrainable process parameters. The key to further reducing the aerosol ERF uncertainty will be to identify observations that can additionally constrain individual parameter ranges and/or combined parameter effects, which can be achieved through sensitivity analysis of perturbed parameter ensembles.

