Extending long-range phasing and haplotype library imputation algorithms to very large and heterogeneous datasets

2018 ◽  
Author(s):  
Daniel Money ◽  
David Wilson ◽  
Janez Jenko ◽  
Gregor Gorjanc ◽  
John M. Hickey

Abstract
Background: This paper describes the latest improvements to the long-range phasing and haplotype library imputation algorithms that enable them to successfully phase both datasets with one million individuals and datasets genotyped using different sets of single nucleotide polymorphisms (SNPs). Previous publicly available implementations of long-range phasing could not phase large datasets due to the computational cost of defining surrogate parents by exhaustive all-against-all searches. Further, neither long-range phasing nor haplotype library imputation was designed to deal with large amounts of missing data, which is inherent when using multiple SNP arrays.
Methods: Here, we developed methods that avoid the need for all-against-all searches by performing long-range phasing on subsets of individuals and then combining the results. We also extended the long-range phasing and haplotype library imputation algorithms to enable them to use different sets of markers, including missing values, when determining surrogate parents and identifying haplotypes. We implemented and tested these extensions in an updated version of our phasing software AlphaPhase.
Results: A simulated dataset with one million individuals genotyped with the same set of 6,711 SNPs for a single chromosome took two days to phase. A larger dataset with one million individuals genotyped with 49,579 SNPs for a single chromosome took 14 days to phase. The percentage of correctly phased alleles at heterozygous loci was 90.5% and 90.0%, respectively, for the two datasets, which is comparable to the accuracy achieved with previous versions of AlphaPhase on smaller datasets. The phasing accuracy for datasets with different sets of markers was generally lower than that for datasets with one set of markers. For a simulated dataset with three sets of markers, 2.8% of alleles at heterozygous positions were phased incorrectly, whereas the equivalent figure with one set of markers was 0.6%.
Conclusions: The improved long-range phasing and haplotype library imputation algorithms enable AlphaPhase to quickly and accurately phase very large and heterogeneous datasets. This will enable more powerful breeding and genetics research and application.
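The subset strategy can be illustrated with a minimal sketch. Long-range phasing identifies surrogate parents as individuals who share no opposing homozygotes (one homozygous reference, the other homozygous alternate at the same locus); restricting the pairwise comparisons to subsets avoids the all-against-all search, and skipping missing genotypes reflects the multi-array extension. Function names and the 0/1/2/9 genotype coding are illustrative, not AlphaPhase's actual implementation:

```python
import itertools

def opposing_homozygotes(g1, g2, missing=9):
    """Count loci where one individual is homozygous reference (0) and the
    other homozygous alternate (2); missing genotypes (9) are skipped."""
    return sum(1 for a, b in zip(g1, g2)
               if missing not in (a, b) and {a, b} == {0, 2})

def surrogate_parents(genos, subsets, max_opposing=0):
    """Screen surrogate-parent candidates within each subset instead of
    performing an all-against-all search over the whole population."""
    surrogates = {i: set() for i in genos}
    for subset in subsets:
        for i, j in itertools.combinations(subset, 2):
            if opposing_homozygotes(genos[i], genos[j]) <= max_opposing:
                surrogates[i].add(j)
                surrogates[j].add(i)
    return surrogates
```

Because each subset is processed independently, the per-subset results can also be computed in parallel before being combined.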

2020 ◽  
Vol 21 (11) ◽  
pp. 1068-1077
Author(s):  
Xiaochao Sun ◽  
Bin Yang ◽  
Qunye Zhang

Many studies have shown that the spatial distribution of genes within a single chromosome exhibits distinct patterns. However, little is known about the characteristics of the inter-chromosomal distribution of genes (including protein-coding genes, processed transcripts and pseudogenes) in different genomes. In this study, we explored these issues using the available genomic data of both human and model organisms. Moreover, we also analyzed the distribution pattern of protein-coding genes that have been associated with 14 common diseases, as well as the insertion/deletion mutations and single nucleotide polymorphisms detected by whole-genome sequencing in an acute promyelocytic leukemia patient. We obtained the following novel findings. Firstly, the inter-chromosomal distribution of genes displays a non-stochastic pattern, and the gene densities in different chromosomes are heterogeneous. This kind of heterogeneity is observed in the genomes of both lower and higher species. Secondly, protein-coding genes involved in certain biological processes tend to be enriched on one or a few chromosomes. Our findings add new insights into our understanding of the spatial distribution of the genome and of disease-related genes across chromosomes. These results could be useful in improving the efficiency of disease-associated gene screening studies by targeting specific chromosomes.
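A test of chromosome-level heterogeneity like the one described can be sketched as a goodness-of-fit statistic comparing observed gene counts per chromosome with the counts expected if gene density were uniform across chromosomes. The numbers in the test below are toy values; the paper's analysis is far more extensive:

```python
def gene_density_chi2(gene_counts, chrom_lengths):
    """Chi-square goodness-of-fit statistic for gene counts per chromosome
    against counts expected under uniform density (null hypothesis:
    genes are distributed in proportion to chromosome length)."""
    total_genes = sum(gene_counts)
    total_len = sum(chrom_lengths)
    expected = [total_genes * length / total_len for length in chrom_lengths]
    return sum((obs - exp) ** 2 / exp
               for obs, exp in zip(gene_counts, expected))
```

A large statistic relative to the chi-square distribution with (number of chromosomes - 1) degrees of freedom indicates non-stochastic, heterogeneous gene placement.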


2006 ◽  
Vol 04 (03) ◽  
pp. 639-647 ◽  
Author(s):  
ELEAZAR ESKIN ◽  
RODED SHARAN ◽  
ERAN HALPERIN

The common approaches for haplotype inference from genotype data are targeted toward phasing short genomic regions. Longer regions are often tackled in a heuristic manner due to the high computational cost. Here, we describe a novel approach for phasing genotypes over long regions, which is based on combining information from local predictions on short, overlapping regions. The phasing is done in a way that maximizes a natural maximum-likelihood criterion. Among other things, this criterion takes into account the physical distance between neighboring single nucleotide polymorphisms. The approach is very efficient, has been applied to several large-scale datasets, and was shown to be successful in two recent benchmarking studies (Zaitlen et al., in press; Marchini et al., in preparation). Our method is publicly available via a webserver.
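The combination step can be illustrated by stitching two overlapping phased windows: the second window's two haplotypes are kept or swapped depending on which orientation agrees better with the first window on the shared SNPs. This is a toy stand-in for the paper's maximum-likelihood criterion, which additionally weights evidence by physical distance between SNPs:

```python
def stitch(phase1, phase2, overlap):
    """Join two locally phased windows sharing `overlap` SNPs.
    Each phase is a pair (hap_a, hap_b) of allele lists; the second
    window's orientation is chosen to best match the first on the overlap."""
    a1, b1 = phase1
    a2, b2 = phase2
    same = sum(x == y for x, y in zip(a1[-overlap:], a2[:overlap]))
    flip = sum(x == y for x, y in zip(a1[-overlap:], b2[:overlap]))
    if flip > same:
        a2, b2 = b2, a2  # swap haplotypes of the second window
    return a1 + a2[overlap:], b1 + b2[overlap:]
```

Applying this left to right across all windows yields chromosome-length haplotypes from short local predictions.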


2020 ◽  
Vol 52 (1) ◽  
Author(s):  
Daniel Money ◽  
David Wilson ◽  
Janez Jenko ◽  
Andrew Whalen ◽  
Steve Thorn ◽  
...  

Plants ◽  
2020 ◽  
Vol 9 (9) ◽  
pp. 1153
Author(s):  
Yudai Kawamoto ◽  
Hirotaka Toda ◽  
Hiroshi Inoue ◽  
Kappei Kobayashi ◽  
Naoto Yamaoka ◽  
...  

To further develop barley breeding and genetics, more information on gene functions, based on the analysis of mutants of each gene, is needed. However, barley mutant resources are not as well developed as those of model plants such as Arabidopsis and rice. Although genome-editing techniques can generate mutants, they are not yet an efficient option because only a limited number of cultivars can be transformed. Here, we developed a mutant population using ‘Mannenboshi’, a cultivar that produces good-quality grain with high yields but is susceptible to disease, to establish a Targeting Induced Local Lesions IN Genomes (TILLING) system that can isolate mutants in a high-throughput manner. To evaluate the utility of the 8043 prepared M3 lines, we investigated the frequency of mutant occurrence using a rapid, visually detectable waxy phenotype as an indicator. Four mutants were isolated, and single nucleotide polymorphisms (SNPs) in the Waxy gene were identified as novel alleles. We confirmed that the mutations could be easily detected using the mismatch endonuclease CELI, revealing that a sufficient number of mutants can be rapidly isolated from our TILLING population.


Author(s):  
Mohammad Poursina ◽  
Jeremy Laflin ◽  
Kurt S. Anderson

In molecular simulations, the dominant portion of the computational cost is associated with force-field calculations. Herein, we extend the approach used to approximate the long-range gravitational force, and the associated moment, in spacecraft dynamics to the Coulomb forces present in coarse-grained biopolymer simulations. We approximate the resultant force and moment for long-range particle-body and body-body interactions due to the electrostatic force field. The resultant moment approximated here arises because the net force does not necessarily act through the center of mass of the body (pseudoatom). This moment is accounted for in multibody-based coarse-grained simulations but is neglected in bead models, which use particle dynamics to describe the dynamics of the system. A novel binary divide-and-conquer algorithm (BDCA) is presented to implement the force-field approximation. The proposed algorithm is implemented by considering each rigid/flexible domain as a node at the leaf level of the binary tree. This substructuring strategy is well suited to coarse-grained simulations of chain biopolymers using an articulated multibody approach.
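At lowest order, such an approximation replaces a distant body's charges with their total charge placed at the center of charge; the resultant moment about the pseudoatom's center of mass then arises because the net force acts at the center of charge, not the center of mass. The sketch below is this monopole-level idea only, not the paper's full expansion or the BDCA tree traversal:

```python
def monopole_force_moment(charges, positions, com, q_src, p_src, k=1.0):
    """Approximate the resultant Coulomb force on a body from a distant
    source charge, plus the moment about the body's center of mass (com).
    Assumes the total charge is nonzero so the center of charge exists."""
    Q = sum(charges)
    # center of charge: charge-weighted mean position of the body's particles
    coc = [sum(q * p[i] for q, p in zip(charges, positions)) / Q
           for i in range(3)]
    r = [coc[i] - p_src[i] for i in range(3)]
    d = sum(c * c for c in r) ** 0.5
    force = [k * Q * q_src * c / d ** 3 for c in r]
    # lever arm: the net force acts at coc, which is offset from com
    arm = [coc[i] - com[i] for i in range(3)]
    moment = [arm[1] * force[2] - arm[2] * force[1],
              arm[2] * force[0] - arm[0] * force[2],
              arm[0] * force[1] - arm[1] * force[0]]
    return force, moment
```

In a tree-based scheme, this aggregate evaluation is applied only to well-separated node pairs, while near-field interactions are computed directly.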


2020 ◽  
Vol 82 (12) ◽  
pp. 2711-2724 ◽  
Author(s):  
Pezhman Kazemi ◽  
Jaume Giralt ◽  
Christophe Bengoa ◽  
Armin Masoumian ◽  
Jean-Philippe Steyer

Abstract
Because of the static nature of conventional principal component analysis (PCA), natural process variations may be interpreted as faults when it is applied to processes with time-varying behavior. In this paper, we therefore propose a complete adaptive process monitoring framework based on incremental principal component analysis (IPCA). This framework updates the eigenspace by incorporating new data into the PCA at low computational cost. Moreover, the contribution of each variable is provided recursively using complete decomposition contribution (CDC). To impute missing values, the empirical best linear unbiased prediction (EBLUP) method is incorporated into the framework. The effectiveness of the framework is evaluated using benchmark simulation model No. 2 (BSM2). Our simulation results show the ability of the proposed approach to distinguish between time-varying behavior and faulty events while correctly isolating sensor faults, even when these faults are relatively small.
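The core of such a scheme can be sketched as a low-cost running update of the mean and covariance with each new sample, followed by extraction of the dominant eigenspace and a squared prediction error (SPE) statistic on the residual subspace. This is a minimal stand-in for the paper's full IPCA/CDC/EBLUP framework, with illustrative class and method names:

```python
import numpy as np

class IncrementalPCAMonitor:
    """Sketch of IPCA-style monitoring: running mean/covariance updates,
    a top-k eigenspace, and SPE-based fault detection."""

    def __init__(self, n_components):
        self.k = n_components
        self.n = 0
        self.mean = None
        self.cov = None

    def update(self, x):
        """Incorporate one new sample at low cost (rank-one update)."""
        x = np.asarray(x, dtype=float)
        if self.n == 0:
            self.mean = x.copy()
            self.cov = np.zeros((x.size, x.size))
        else:
            delta = x - self.mean
            self.mean += delta / (self.n + 1)
            self.cov += (np.outer(delta, x - self.mean) - self.cov) / (self.n + 1)
        self.n += 1

    def spe(self, x):
        """Squared prediction error of x in the residual subspace."""
        w, v = np.linalg.eigh(self.cov)
        P = v[:, np.argsort(w)[::-1][: self.k]]  # top-k loading vectors
        r = np.asarray(x, dtype=float) - self.mean
        resid = r - P @ (P.T @ r)
        return float(resid @ resid)
```

A sample consistent with normal (possibly drifting) operation projects almost entirely onto the retained eigenspace and yields a small SPE, while a sensor fault orthogonal to it produces a large SPE.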


2019 ◽  
Vol 36 (8) ◽  
pp. 2328-2336
Author(s):  
Chuanyi Zhang ◽  
Idoia Ochoa

Abstract
Motivation: Variants identified by current genomic analysis pipelines contain many incorrectly called variants. These can potentially be eliminated by applying state-of-the-art filtering tools, such as Variant Quality Score Recalibration (VQSR) or Hard Filtering (HF). However, these methods are highly user-dependent and fail to run in some cases. We propose VEF, a variant filtering tool based on decision-tree ensemble methods that overcomes the main drawbacks of VQSR and HF. Contrary to these methods, we treat filtering as a supervised learning problem, using variant call data with known ‘true’ variants, i.e. a gold standard, for training. Once trained, VEF can be applied directly to filter the variants contained in a given Variant Call Format (VCF) file (we consider training and testing VCF files generated with the same tools, as we assume they will share feature characteristics).
Results: For the analysis, we used whole-genome sequencing (WGS) human datasets for which gold standards are available. We show on these data that the proposed filtering tool VEF consistently outperforms VQSR and HF. In addition, we show that VEF generalizes well even when some features have missing values, when the training and testing datasets differ in coverage, and when sequencing pipelines other than GATK are used. Finally, since training needs to be performed only once, there is a significant saving in running time compared with VQSR (approximately 4 versus 50 min for filtering the single nucleotide polymorphisms of a WGS human sample).
Availability and implementation: Code and scripts available at: github.com/ChuanyiZ/vef.
Supplementary information: Supplementary data are available at Bioinformatics online.
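The supervised formulation can be sketched with an off-the-shelf decision-tree ensemble: per-variant features from a VCF form the design matrix, and gold-standard labels (true variant or not) form the targets. The two features and toy values below are illustrative stand-ins for real VCF annotations, not VEF's actual feature set:

```python
from sklearn.ensemble import RandomForestClassifier

def train_variant_filter(features, labels, n_trees=100):
    """Fit a decision-tree ensemble that separates true from false
    variant calls using gold-standard labels (VEF-style formulation)."""
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    clf.fit(features, labels)
    return clf

# Toy per-variant features: [quality-by-depth-like score, strand-bias-like score]
X = [[30.0, 1.0], [28.0, 2.0], [2.0, 60.0], [3.0, 55.0]]
y = [1, 1, 0, 0]  # 1 = true variant according to the gold standard
clf = train_variant_filter(X, y, n_trees=10)
```

Because training happens once, applying the fitted ensemble to new VCF files is a fast prediction pass, which is where the reported speed advantage over VQSR comes from.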


2016 ◽  
Vol 59 (3) ◽  
pp. 351-361 ◽  
Author(s):  
Meng Zhang ◽  
Chuanying Pan ◽  
Qin Lin ◽  
Shenrong Hu ◽  
Ruihua Dang ◽  
...  

Abstract
Nanog is an important pluripotent transcription regulator for reprogramming somatic cells into induced pluripotent stem cells (iPSCs), and its overexpression leads to high expression of growth and differentiation factor 3 (GDF3), which affects animal growth traits. Therefore, the aim of this study was to explore the genetic variations within the Nanog gene and their effects on phenotypic traits in cattle. Six novel exonic single nucleotide polymorphisms (SNPs) were found in six cattle breeds. Seven haplotypes were analyzed: TCAACC (0.260), TCAATA (0.039), TCATCC (0.019), TCGACC (0.506), TCGATA (0.137), TCGTCC (0.036), and CTGATA (0.003). There was strong linkage disequilibrium between SNP1 and SNP2 in Jiaxian cattle, as well as between SNP5 and SNP6 in both Jiaxian and Nanyang cattle. Moreover, SNP3, SNP4, and SNP5 were associated with phenotypes. Individuals with the GG genotype at the SNP3 locus or the AA genotype at the SNP4 locus showed better body slanting length and chest circumference, or body height and hucklebone width, in Nanyang cattle. The superiority of the SNP5-C allele regarding body height and cannon circumference was observed in Jiaxian cattle. The combination of SNP3 and SNP4 (GG–AA) had positive effects on body height, body slanting length, and chest circumference. These findings indicate that Nanog, as a regulator of bovine growth traits, could be a candidate gene for marker-assisted selection (MAS) in cattle breeding and genetics.
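The pairwise linkage disequilibrium reported between SNP pairs can be illustrated with the standard r² statistic computed from two-locus haplotypes. The 0/1 allele coding and the toy haplotypes in the test are illustrative, not data from the study:

```python
def r_squared(haplotypes):
    """Linkage disequilibrium r^2 between two biallelic SNPs, given a
    list of two-locus haplotypes such as (1, 0): allele at SNP A, SNP B."""
    n = len(haplotypes)
    pA = sum(h[0] for h in haplotypes) / n          # freq of allele 1 at SNP A
    pB = sum(h[1] for h in haplotypes) / n          # freq of allele 1 at SNP B
    pAB = sum(1 for h in haplotypes if h == (1, 1)) / n
    D = pAB - pA * pB                               # disequilibrium coefficient
    return D * D / (pA * (1 - pA) * pB * (1 - pB))
```

r² = 1 indicates the SNPs are perfectly correlated (as for the strongly linked pairs reported above), while r² = 0 indicates independence.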

