Extending long-range phasing and haplotype library imputation algorithms to very large and heterogeneous datasets

2018 ◽  
Author(s):  
Daniel Money ◽  
David Wilson ◽  
Janez Jenko ◽  
Gregor Gorjanc ◽  
John M. Hickey

Abstract
Background: This paper describes the latest improvements to the long-range phasing and haplotype library imputation algorithms that enable them to successfully phase both datasets with one million individuals and datasets genotyped using different sets of single nucleotide polymorphisms (SNPs). Previous publicly available implementations of long-range phasing could not phase large datasets due to the computational cost of defining surrogate parents by exhaustive all-against-all searches. Further, neither long-range phasing nor haplotype library imputation was designed to deal with large amounts of missing data, which is inherent when using multiple SNP arrays.
Methods: Here, we developed methods that avoid the need for all-against-all searches by performing long-range phasing on subsets of individuals and then combining the results. We also extended the long-range phasing and haplotype library imputation algorithms to enable them to use different sets of markers, including missing values, when determining surrogate parents and identifying haplotypes. We implemented and tested these extensions in an updated version of our phasing software AlphaPhase.
Results: A simulated dataset with one million individuals genotyped with the same set of 6,711 SNPs for a single chromosome took two days to phase. A larger dataset with one million individuals genotyped with 49,579 SNPs for a single chromosome took 14 days to phase. The percentage of correctly phased alleles at heterozygous loci was 90.5% and 90.0%, respectively, for the two datasets, which is comparable to the accuracy achieved with previous versions of AlphaPhase on smaller datasets. The phasing accuracy for datasets with different sets of markers was generally lower than that for datasets with one set of markers. For a simulated dataset with three sets of markers, 2.8% of alleles at heterozygous positions were phased incorrectly, whereas the equivalent figure with one set of markers was 0.6%.
Conclusions: The improved long-range phasing and haplotype library imputation algorithms enable AlphaPhase to quickly and accurately phase very large and heterogeneous datasets. This will enable more powerful breeding and genetics research and application.
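The subset strategy can be illustrated with a minimal sketch. Long-range phasing identifies surrogate parents as individuals who share no opposing homozygotes (one homozygous reference, the other homozygous alternate at the same locus); restricting the pairwise comparisons to subsets avoids the all-against-all search, and skipping missing genotypes reflects the multi-array extension. Function names and the 0/1/2/9 genotype coding are illustrative, not AlphaPhase's actual implementation:

```python
import itertools

def opposing_homozygotes(g1, g2, missing=9):
    """Count loci where one individual is homozygous reference (0) and the
    other homozygous alternate (2); missing genotypes (9) are skipped."""
    return sum(1 for a, b in zip(g1, g2)
               if missing not in (a, b) and {a, b} == {0, 2})

def surrogate_parents(genos, subsets, max_opposing=0):
    """Screen surrogate-parent candidates within each subset instead of
    performing an all-against-all search over the whole population."""
    surrogates = {i: set() for i in genos}
    for subset in subsets:
        for i, j in itertools.combinations(subset, 2):
            if opposing_homozygotes(genos[i], genos[j]) <= max_opposing:
                surrogates[i].add(j)
                surrogates[j].add(i)
    return surrogates
```

Because each subset is processed independently, the per-subset results can also be computed in parallel before being combined.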

2020 ◽  
Vol 21 (11) ◽  
pp. 1068-1077
Author(s):  
Xiaochao Sun ◽  
Bin Yang ◽  
Qunye Zhang

Many studies have shown that the spatial distribution of genes within a single chromosome exhibits distinct patterns. However, little is known about the characteristics of the inter-chromosomal distribution of genes (including protein-coding genes, processed transcripts and pseudogenes) in different genomes. In this study, we explored these issues using the available genomic data of both human and model organisms. Moreover, we also analyzed the distribution pattern of protein-coding genes that have been associated with 14 common diseases, as well as the insertion/deletion mutations and single nucleotide polymorphisms detected by whole-genome sequencing in an acute promyelocytic leukemia patient. We obtained the following novel findings. Firstly, the inter-chromosomal distribution of genes displays a non-stochastic pattern, and the gene densities in different chromosomes are heterogeneous. This kind of heterogeneity is observed in the genomes of both lower and higher species. Secondly, protein-coding genes involved in certain biological processes tend to be enriched on one or a few chromosomes. Our findings add new insights into our understanding of the spatial distribution of the genome and of disease-related genes across chromosomes. These results could be useful in improving the efficiency of disease-associated gene screening studies by targeting specific chromosomes.
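A test of chromosome-level heterogeneity like the one described can be sketched as a goodness-of-fit statistic comparing observed gene counts per chromosome with the counts expected if gene density were uniform across chromosomes. The numbers in the test below are toy values; the paper's analysis is far more extensive:

```python
def gene_density_chi2(gene_counts, chrom_lengths):
    """Chi-square goodness-of-fit statistic for gene counts per chromosome
    against counts expected under uniform density (null hypothesis:
    genes are distributed in proportion to chromosome length)."""
    total_genes = sum(gene_counts)
    total_len = sum(chrom_lengths)
    expected = [total_genes * length / total_len for length in chrom_lengths]
    return sum((obs - exp) ** 2 / exp
               for obs, exp in zip(gene_counts, expected))
```

A large statistic relative to the chi-square distribution with (number of chromosomes - 1) degrees of freedom indicates non-stochastic, heterogeneous gene placement.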


2006 ◽  
Vol 04 (03) ◽  
pp. 639-647 ◽  
Author(s):  
ELEAZAR ESKIN ◽  
RODED SHARAN ◽  
ERAN HALPERIN

The common approaches for haplotype inference from genotype data are targeted toward phasing short genomic regions. Longer regions are often tackled in a heuristic manner due to the high computational cost. Here, we describe a novel approach for phasing genotypes over long regions, which is based on combining information from local predictions on short, overlapping regions. The phasing is done in a way that maximizes a natural maximum-likelihood criterion. Among other things, this criterion takes into account the physical distance between neighboring single nucleotide polymorphisms. The approach is very efficient, has been applied to several large-scale datasets, and was shown to be successful in two recent benchmarking studies (Zaitlen et al., in press; Marchini et al., in preparation). Our method is publicly available via a webserver.
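The combination step can be illustrated by stitching two overlapping phased windows: the second window's two haplotypes are kept or swapped depending on which orientation agrees better with the first window on the shared SNPs. This is a toy stand-in for the paper's maximum-likelihood criterion, which additionally weights evidence by physical distance between SNPs:

```python
def stitch(phase1, phase2, overlap):
    """Join two locally phased windows sharing `overlap` SNPs.
    Each phase is a pair (hap_a, hap_b) of allele lists; the second
    window's orientation is chosen to best match the first on the overlap."""
    a1, b1 = phase1
    a2, b2 = phase2
    same = sum(x == y for x, y in zip(a1[-overlap:], a2[:overlap]))
    flip = sum(x == y for x, y in zip(a1[-overlap:], b2[:overlap]))
    if flip > same:
        a2, b2 = b2, a2  # swap haplotypes of the second window
    return a1 + a2[overlap:], b1 + b2[overlap:]
```

Applying this left to right across all windows yields chromosome-length haplotypes from short local predictions.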


2020 ◽  
Vol 52 (1) ◽  
Author(s):  
Daniel Money ◽  
David Wilson ◽  
Janez Jenko ◽  
Andrew Whalen ◽  
Steve Thorn ◽  
...  

Plants ◽  
2020 ◽  
Vol 9 (9) ◽  
pp. 1153
Author(s):  
Yudai Kawamoto ◽  
Hirotaka Toda ◽  
Hiroshi Inoue ◽  
Kappei Kobayashi ◽  
Naoto Yamaoka ◽  
...  

To further develop barley breeding and genetics, more information on gene functions, based on the analysis of mutants of each gene, is needed. However, barley mutant resources are not as well developed as those of model plants such as Arabidopsis and rice. Although genome-editing techniques can generate mutants, they are not yet an efficient option because only a limited number of cultivars can be transformed. Here, we developed a mutant population using ‘Mannenboshi’, a cultivar that produces good-quality grain with high yields but is susceptible to disease, to establish a Targeting Induced Local Lesions IN Genomes (TILLING) system that can isolate mutants in a high-throughput manner. To evaluate the utility of the 8043 prepared M3 lines, we investigated the frequency of mutant occurrence using a rapid, visually detectable waxy phenotype as an indicator. Four mutants were isolated, and single nucleotide polymorphisms (SNPs) in the Waxy gene were identified as novel alleles. We confirmed that the mutations could be easily detected using the mismatch endonuclease CELI, revealing that a sufficient number of mutants can be rapidly isolated from our TILLING population.


Author(s):  
Mohammad Poursina ◽  
Jeremy Laflin ◽  
Kurt S. Anderson

In molecular simulations, the dominant portion of the computational cost is associated with force-field calculations. Herein, we extend the approach used to approximate the long-range gravitational force, and the associated moment, in spacecraft dynamics to the Coulomb forces present in coarse-grained biopolymer simulations. We approximate the resultant force and moment for long-range particle-body and body-body interactions due to the electrostatic force field. The resultant moment approximated here arises because the net force does not necessarily act through the center of mass of the body (pseudoatom). This moment is accounted for in multibody-based coarse-grained simulations but is neglected in bead models, which use particle dynamics to describe the dynamics of the system. A novel binary divide-and-conquer algorithm (BDCA) is presented to implement the force-field approximation. The proposed algorithm is implemented by considering each rigid/flexible domain as a node at the leaf level of the binary tree. This substructuring strategy is well suited to coarse-grained simulations of chain biopolymers using an articulated multibody approach.
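At lowest order, such an approximation replaces a distant body's charges with their total charge placed at the center of charge; the resultant moment about the pseudoatom's center of mass then arises because the net force acts at the center of charge, not the center of mass. The sketch below is this monopole-level idea only, not the paper's full expansion or the BDCA tree traversal:

```python
def monopole_force_moment(charges, positions, com, q_src, p_src, k=1.0):
    """Approximate the resultant Coulomb force on a body from a distant
    source charge, plus the moment about the body's center of mass (com).
    Assumes the total charge is nonzero so the center of charge exists."""
    Q = sum(charges)
    # center of charge: charge-weighted mean position of the body's particles
    coc = [sum(q * p[i] for q, p in zip(charges, positions)) / Q
           for i in range(3)]
    r = [coc[i] - p_src[i] for i in range(3)]
    d = sum(c * c for c in r) ** 0.5
    force = [k * Q * q_src * c / d ** 3 for c in r]
    # lever arm: the net force acts at coc, which is offset from com
    arm = [coc[i] - com[i] for i in range(3)]
    moment = [arm[1] * force[2] - arm[2] * force[1],
              arm[2] * force[0] - arm[0] * force[2],
              arm[0] * force[1] - arm[1] * force[0]]
    return force, moment
```

In a tree-based scheme, this aggregate evaluation is applied only to well-separated node pairs, while near-field interactions are computed directly.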


2020 ◽  
Vol 82 (12) ◽  
pp. 2711-2724 ◽  
Author(s):  
Pezhman Kazemi ◽  
Jaume Giralt ◽  
Christophe Bengoa ◽  
Armin Masoumian ◽  
Jean-Philippe Steyer

Abstract
Because of the static nature of conventional principal component analysis (PCA), natural process variations may be interpreted as faults when it is applied to processes with time-varying behavior. In this paper, we therefore propose a complete adaptive process monitoring framework based on incremental principal component analysis (IPCA). This framework updates the eigenspace by incorporating new data into the PCA at low computational cost. Moreover, the contribution of each variable is provided recursively using complete decomposition contribution (CDC). To impute missing values, the empirical best linear unbiased prediction (EBLUP) method is incorporated into the framework. The effectiveness of the framework is evaluated using benchmark simulation model No. 2 (BSM2). Our simulation results show the ability of the proposed approach to distinguish between time-varying behavior and faulty events while correctly isolating sensor faults, even when these faults are relatively small.
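The core of such a scheme can be sketched as a low-cost running update of the mean and covariance with each new sample, followed by extraction of the dominant eigenspace and a squared prediction error (SPE) statistic on the residual subspace. This is a minimal stand-in for the paper's full IPCA/CDC/EBLUP framework, with illustrative class and method names:

```python
import numpy as np

class IncrementalPCAMonitor:
    """Sketch of IPCA-style monitoring: running mean/covariance updates,
    a top-k eigenspace, and SPE-based fault detection."""

    def __init__(self, n_components):
        self.k = n_components
        self.n = 0
        self.mean = None
        self.cov = None

    def update(self, x):
        """Incorporate one new sample at low cost (rank-one update)."""
        x = np.asarray(x, dtype=float)
        if self.n == 0:
            self.mean = x.copy()
            self.cov = np.zeros((x.size, x.size))
        else:
            delta = x - self.mean
            self.mean += delta / (self.n + 1)
            self.cov += (np.outer(delta, x - self.mean) - self.cov) / (self.n + 1)
        self.n += 1

    def spe(self, x):
        """Squared prediction error of x in the residual subspace."""
        w, v = np.linalg.eigh(self.cov)
        P = v[:, np.argsort(w)[::-1][: self.k]]  # top-k loading vectors
        r = np.asarray(x, dtype=float) - self.mean
        resid = r - P @ (P.T @ r)
        return float(resid @ resid)
```

A sample consistent with normal (possibly drifting) operation projects almost entirely onto the retained eigenspace and yields a small SPE, while a sensor fault orthogonal to it produces a large SPE.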


2019 ◽  
Vol 36 (8) ◽  
pp. 2328-2336
Author(s):  
Chuanyi Zhang ◽  
Idoia Ochoa

Abstract
Motivation: Variants identified by current genomic analysis pipelines contain many incorrectly called variants. These can potentially be eliminated by applying state-of-the-art filtering tools, such as Variant Quality Score Recalibration (VQSR) or Hard Filtering (HF). However, these methods are highly user-dependent and fail to run in some cases. We propose VEF, a variant filtering tool based on decision-tree ensemble methods that overcomes the main drawbacks of VQSR and HF. Contrary to these methods, we treat filtering as a supervised learning problem, using variant call data with known ‘true’ variants, i.e. a gold standard, for training. Once trained, VEF can be applied directly to filter the variants contained in a given Variant Call Format (VCF) file (we consider training and testing VCF files generated with the same tools, as we assume they will share feature characteristics).
Results: For the analysis, we used whole-genome sequencing (WGS) human datasets for which gold standards are available. We show on these data that the proposed filtering tool VEF consistently outperforms VQSR and HF. In addition, we show that VEF generalizes well even when some features have missing values, when the training and testing datasets differ in coverage, and when sequencing pipelines other than GATK are used. Finally, since training needs to be performed only once, there is a significant saving in running time compared with VQSR (approximately 4 versus 50 min for filtering the single nucleotide polymorphisms of a WGS human sample).
Availability and implementation: Code and scripts available at: github.com/ChuanyiZ/vef.
Supplementary information: Supplementary data are available at Bioinformatics online.
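The supervised formulation can be sketched with an off-the-shelf decision-tree ensemble: per-variant features from a VCF form the design matrix, and gold-standard labels (true variant or not) form the targets. The two features and toy values below are illustrative stand-ins for real VCF annotations, not VEF's actual feature set:

```python
from sklearn.ensemble import RandomForestClassifier

def train_variant_filter(features, labels, n_trees=100):
    """Fit a decision-tree ensemble that separates true from false
    variant calls using gold-standard labels (VEF-style formulation)."""
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    clf.fit(features, labels)
    return clf

# Toy per-variant features: [quality-by-depth-like score, strand-bias-like score]
X = [[30.0, 1.0], [28.0, 2.0], [2.0, 60.0], [3.0, 55.0]]
y = [1, 1, 0, 0]  # 1 = true variant according to the gold standard
clf = train_variant_filter(X, y, n_trees=10)
```

Because training happens once, applying the fitted ensemble to new VCF files is a fast prediction pass, which is where the reported speed advantage over VQSR comes from.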


2016 ◽  
Vol 59 (3) ◽  
pp. 351-361 ◽  
Author(s):  
Meng Zhang ◽  
Chuanying Pan ◽  
Qin Lin ◽  
Shenrong Hu ◽  
Ruihua Dang ◽  
...  

Abstract
Nanog is an important pluripotent transcription regulator for reprogramming somatic cells into induced pluripotent stem cells (iPSCs), and its overexpression leads to high expression of growth and differentiation factor 3 (GDF3), which affects animal growth traits. Therefore, the aim of this study was to explore the genetic variations within the Nanog gene and their effects on phenotypic traits in cattle. Six novel exonic single nucleotide polymorphisms (SNPs) were found in six cattle breeds. Seven haplotypes were analyzed: TCAACC (0.260), TCAATA (0.039), TCATCC (0.019), TCGACC (0.506), TCGATA (0.137), TCGTCC (0.036), and CTGATA (0.003). There was strong linkage disequilibrium between SNP1 and SNP2 in Jiaxian cattle, as well as between SNP5 and SNP6 in both Jiaxian and Nanyang cattle. Moreover, SNP3, SNP4, and SNP5 were associated with phenotypes. Individuals with the GG genotype at the SNP3 locus or the AA genotype at the SNP4 locus showed better body slanting length and chest circumference, or body height and hucklebone width, in Nanyang cattle. The superiority of the SNP5-C allele regarding body height and cannon circumference was observed in Jiaxian cattle. The combination of SNP3 and SNP4 (GG–AA) had positive effects on body height, body slanting length, and chest circumference. These findings indicate that Nanog, as a regulator of bovine growth traits, could be a candidate gene for marker-assisted selection (MAS) in cattle breeding and genetics.
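The pairwise linkage disequilibrium reported between SNP pairs can be illustrated with the standard r² statistic computed from two-locus haplotypes. The 0/1 allele coding and the toy haplotypes in the test are illustrative, not data from the study:

```python
def r_squared(haplotypes):
    """Linkage disequilibrium r^2 between two biallelic SNPs, given a
    list of two-locus haplotypes such as (1, 0): allele at SNP A, SNP B."""
    n = len(haplotypes)
    pA = sum(h[0] for h in haplotypes) / n          # freq of allele 1 at SNP A
    pB = sum(h[1] for h in haplotypes) / n          # freq of allele 1 at SNP B
    pAB = sum(1 for h in haplotypes if h == (1, 1)) / n
    D = pAB - pA * pB                               # disequilibrium coefficient
    return D * D / (pA * (1 - pA) * pB * (1 - pB))
```

r² = 1 indicates the SNPs are perfectly correlated (as for the strongly linked pairs reported above), while r² = 0 indicates independence.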

