VcfR: a package to manipulate and visualize VCF format data in R

2016 ◽  
Author(s):  
Brian J. Knaus ◽  
Niklaus J. Grünwald

Software to call single nucleotide polymorphisms or related genetic variants has converged on the variant call format (VCF) as the output format of choice. This has created a need for tools to work with VCF files. While a growing number of software tools exist to read VCF data, many only extract the genotypes without including the data associated with each genotype that describes its quality. We created the R package vcfR to address this issue. We developed a VCF file exploration tool implemented in the R language because R provides an interactive experience and an environment that is commonly used for genetic data analysis. Functions to read and write VCF files into R as well as functions to extract portions of the data and to plot summary statistics of the data are implemented. VcfR further provides the ability to visualize how various parameterizations of the data affect the results. Additional tools are included to integrate sequence (FASTA) and annotation data (GFF) for visualization of genomic regions such as chromosomes. Conversion functions translate data from the vcfR data structure to formats used by other R genetics packages. Computationally intensive functions are implemented in C++ to improve performance. Use of these tools is intended to facilitate VCF data exploration, including intuitive methods for data quality control and easy export to other R packages for further analysis. VcfR thus provides essential, novel tools currently not available in R.
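To illustrate what "the data associated with each genotype" means in a VCF file, here is a minimal Python sketch that pairs the FORMAT keys of one data line with each sample's values. It is not vcfR code and does not reflect how vcfR is implemented; the function name `parse_sample_fields` and the example line are hypothetical, and the sketch assumes a tab-delimited line that includes a FORMAT column.

```python
def parse_sample_fields(vcf_line):
    """Return one dict per sample, mapping FORMAT keys to that sample's values."""
    cols = vcf_line.rstrip("\n").split("\t")
    # Fixed VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT, then samples.
    format_keys = cols[8].split(":")
    samples = []
    for sample in cols[9:]:
        values = sample.split(":")
        # Trailing fields may be omitted in VCF; zip stops at the shorter list.
        samples.append(dict(zip(format_keys, values)))
    return samples


# Hypothetical data line with two samples: genotype (GT), genotype quality (GQ), depth (DP).
line = "chr1\t1042\t.\tA\tG\t50\tPASS\tDP=30\tGT:GQ:DP\t0/1:48:14\t1/1:12:3"
records = parse_sample_fields(line)
```

Keeping GQ and DP next to GT is what makes per-genotype quality filtering possible, for example masking genotypes whose depth falls below a chosen threshold.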

2014 ◽  
Vol 17 (4) ◽  
Author(s):  
Raymond K. Walters ◽  
Charles Laurin ◽  
Gitta H. Lubke

Epistasis is a growing area of research in genome-wide studies, but the differences between alternative definitions of epistasis remain a source of confusion for many researchers. One problem is that models for epistasis are presented in a number of formats, some of which have difficult-to-interpret parameters. In addition, the relation between the different models is rarely explained. Existing software for testing epistatic interactions between single-nucleotide polymorphisms (SNPs) does not provide the flexibility to compare the available model parameterizations. For that reason we have developed an R package for investigating epistatic and penetrance models, EpiPen, to aid users who wish to easily compare, interpret, and utilize models for two-locus epistatic interactions. EpiPen facilitates research on SNP-SNP interactions by allowing the R user to easily convert between common parametric forms for two-locus interactions, generate data for simulation studies, and perform power analyses for the selected model with a continuous or dichotomous phenotype. The usefulness of the package for model interpretation and power analysis is illustrated using data on rheumatoid arthritis.


Author(s):  
Gloria Pérez-Rubio ◽  
Luis Alberto López-Flores ◽  
Ana Paula Cupertino ◽  
Francisco Cartujano-Barrera ◽  
Luz Myriam Reynales-Shigematsu ◽  
...  

Previous studies have identified variants in genes encoding proteins associated with the degree of addiction, smoking onset, and cessation. We aimed to describe thirty-one single nucleotide polymorphisms (SNPs) in seven candidate genomic regions spanning six genes associated with tobacco smoking in a cross-sectional study of two different interventions for quitting smoking: (1) thirty-eight smokers were recruited via multimedia to participate in the e-Decídete! program (e-Dec) and (2) ninety-four attended an institutional smoking cessation program on-site. SNP genotyping was done by real-time PCR using TaqMan probes. The analysis of alleles and genotypes was carried out using EpiInfo v7. On-site subjects had more years of smoking and a higher tobacco index than e-Dec smokers (p < 0.05, both). In CYP2A6, we found differences in rs28399433 (p < 0.01): the e-Dec group had a higher frequency of the TT genotype (0.78 vs. 0.35), whereas the TG genotype frequency was higher in the on-site group (0.63 vs. 0.18), as was the GG genotype frequency (0.03 vs. 0.02). Moreover, three SNPs in NRXN1, two in CHRNA3, and two in CHRNA5 had differences in genotype frequencies (p < 0.01). Cigarettes per day differed (p < 0.05) across the metabolizer classification by CYP2A6 alleles. In conclusion, subjects attending a mobile smoking cessation intervention smoked fewer cigarettes per day, for fewer years, and had fewer cumulative pack-years. There were differences in the genotype frequencies of SNPs in genes related to nicotine metabolism and nicotine dependence. Slow metabolizers smoked more cigarettes per day than intermediate and normal metabolizers.


2006 ◽  
Vol 04 (03) ◽  
pp. 639-647 ◽  
Author(s):  
ELEAZAR ESKIN ◽  
RODED SHARAN ◽  
ERAN HALPERIN

The common approaches for haplotype inference from genotype data are targeted toward phasing short genomic regions. Longer regions are often tackled in a heuristic manner, due to the high computational cost. Here, we describe a novel approach for phasing genotypes over long regions, which is based on combining information from local predictions on short, overlapping regions. The phasing is done in a way that maximizes a natural maximum likelihood criterion. Among other things, this criterion takes into account the physical distance between neighboring single nucleotide polymorphisms. The approach is very efficient, is applied to several large-scale datasets, and is shown to be successful in two recent benchmarking studies (Zaitlen et al., in press; Marchini et al., in preparation). Our method is publicly available via a webserver.


Animals ◽  
2020 ◽  
Vol 10 (1) ◽  
pp. 170 ◽  
Author(s):  
Zengkui Lu ◽  
Yaojing Yue ◽  
Chao Yuan ◽  
Jianbin Liu ◽  
Zhiqiang Chen ◽  
...  

Body weight is an important economic trait for sheep and it is vital for their successful production and breeding. Therefore, identifying the genomic regions and biological pathways that contribute to understanding variability in body weight traits is significant for selection purposes. In this study, the genome-wide associations of birth, weaning, yearling, and adult weights of 460 fine-wool sheep were determined using resequencing technology. The results showed that 113 single nucleotide polymorphisms (SNPs) reached the genome-wide significance levels for the four body weight traits and 30 genes were annotated effectively, including AADACL3, VGF, NPC1, and SERPINA12. The genes annotated by these SNPs significantly enriched 78 gene ontology terms and 25 signaling pathways, and were found to mainly participate in skeletal muscle development and lipid metabolism. These genes can be used as candidate genes for body weight in sheep, and provide useful information for the production and genomic selection of Chinese fine-wool sheep.


2018 ◽  
Vol 63 (No. 4) ◽  
pp. 136-143
Author(s):  
N. Moravčíková ◽  
M. Simčič ◽  
G. Mészáros ◽  
J. Sölkner ◽  
V. Kukučková ◽  
...  

The aim of this study was to analyse the genomic regions that have been targets of natural selection in order to identify the loci responsible mainly for fitness traits across six alpine cattle breeds. The genome-wide scan for selection signatures was performed using genotyping data from a total of 465 animals. After applying data quality control, a total of 35 873 single nucleotide polymorphisms were usable for the subsequent analysis. The detection of genomic regions affected by natural selection was carried out using principal component analysis. The analysis was based on the assumption that markers strongly related to the population structure are also candidates for the local adaptation potential of the population. Based on an expected false discovery rate of 10%, up to 1138 loci were identified as outliers. The strongest signals of selection were found in genomic regions on BTA 1, 2, 3, 6, 9, 11, 13, and 22. Most genes located in the identified regions have previously been associated with the immune system as well as body growth and muscle formation, which mainly reflects the pressure of both natural and artificial selection on the adaptation of the analysed breeds to local environmental conditions. The results also indicated that those regions represent a correlated selection response that maintains the fitness of the analysed breeds.


2021 ◽  
Author(s):  
Yu-Ming Hsu ◽  
Matthieu Falque ◽  
Olivier Martin

In essentially all species where meiotic crossovers have been studied, they occur preferentially in open chromatin, typically near gene promoters and to a lesser extent at the end of genes. Here, in the case of Arabidopsis thaliana, we unveil further trends arising when one considers contextual information, namely summarized epigenetic status, size of underlying genomic regions, and degree of divergence between homologs. For instance, we find that the intergenic recombination rate is reduced if those regions are less than 1.5 kb in size. Furthermore, we propose that the presence of single nucleotide polymorphisms is a factor driving an enhanced crossover rate compared to when homologous sequences are identical, in agreement with previous works comparing rates in homozygous and heterozygous blocks. Lastly, by integrating these different factors, we produce a quantitative and predictive model of the recombination landscape that reproduces much of the experimental variation.


2017 ◽  
Author(s):  
Débora Y. C. Brandt ◽  
Jônatas César ◽  
Jérôme Goudet ◽  
Diogo Meyer

Balancing selection is defined as a class of selective regimes that maintain polymorphism above what is expected under neutrality. Theory predicts that balancing selection reduces population differentiation, as measured by FST. However, balancing selection regimes in which different sets of alleles are maintained in different populations could increase population differentiation. To tackle this issue, we investigated population differentiation at the HLA genes, which constitute the most striking example of balancing selection in humans. We found that population differentiation of single nucleotide polymorphisms (SNPs) at the HLA genes is on average lower than that of SNPs in other genomic regions. However, this result depends on accounting for the differences in allele frequency between selected and putatively neutral sites. Our finding of reduced differentiation at SNPs within HLA genes suggests a predominant role of shared selective pressures among populations at a global scale. However, in pairs of closely related populations, where genome-wide differentiation is low, differentiation at HLA is higher than in other genomic regions. This pattern was reproduced in simulations of overdominant selection. We conclude that population differentiation at the HLA genes is generally lower than genome-wide, but it may be higher for recently diverged population pairs, and that this pattern can be explained by a simple overdominance regime.
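As a rough illustration of how population differentiation at a single biallelic SNP can be quantified, the sketch below computes FST as (HT - HS) / HT from the allele frequencies of two populations, weighting both populations equally. This is the simple textbook definition; the abstract does not say which estimator the authors used, so it should not be read as their method. The function names are hypothetical.

```python
def heterozygosity(p):
    """Expected heterozygosity of a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_biallelic(p1, p2):
    """FST = (HT - HS) / HT for two equally weighted populations."""
    # HS: mean within-population expected heterozygosity.
    hs = (heterozygosity(p1) + heterozygosity(p2)) / 2.0
    # HT: expected heterozygosity of the pooled population.
    p_bar = (p1 + p2) / 2.0
    ht = heterozygosity(p_bar)
    if ht == 0.0:
        # Locus is monomorphic overall; treat differentiation as zero (a convention chosen here).
        return 0.0
    return (ht - hs) / ht
```

Identical frequencies in both populations give an FST of 0, while fixation of alternative alleles gives 1. Because HT depends on the pooled allele frequency, comparing FST between sets of SNPs with different frequency spectra can mislead, which is consistent with the abstract's caution about accounting for allele frequency differences.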


eLife ◽  
2020 ◽  
Vol 9 ◽  
Author(s):  
Kira Delmore ◽  
Juan Carlos Illera ◽  
Javier Pérez-Tris ◽  
Gernot Segelbacher ◽  
Juan S Lugo Ramos ◽  
...  

Seasonal migration is a taxonomically widespread behaviour that integrates across many traits. The European blackcap exhibits enormous variation in migration and is renowned for research on its evolution and genetic basis. We assembled a reference genome for blackcaps and obtained whole genome resequencing data from individuals across its breeding range. Analyses of population structure and demography suggested divergence began ~30,000 ya, with evidence for one admixture event between migrant and resident continent birds ~5000 ya. The propensity to migrate, orientation and distance of migration all map to a small number of genomic regions that do not overlap with results from other species, suggesting that there are multiple ways to generate variation in migration. Strongly associated single nucleotide polymorphisms (SNPs) were located in regulatory regions of candidate genes that may serve as major regulators of the migratory syndrome. Evidence for selection on shared variation was documented, providing a mechanism by which rapid changes may evolve.


2021 ◽  
Vol 140 (12) ◽  
pp. 1753-1773
Author(s):  
Andrew J. Pakstis ◽  
Neeru Gandotra ◽  
William C. Speed ◽  
Michael Murtha ◽  
Curt Scharfe ◽  
...  

Single-nucleotide polymorphisms (SNPs) and small genomic regions with multiple SNPs (microhaplotypes, MHs) are rapidly emerging as novel forensic investigative tools to assist in individual identification, kinship analyses, ancestry inference, and deconvolution of DNA mixtures. Here, we analyzed information for 90 microhaplotype loci in 4009 individuals from 79 world populations in 6 major biogeographic regions. The study included multiplex microhaplotype sequencing (mMHseq) data analyzed for 524 individuals from 16 populations and genotype data for 3485 individuals from 63 populations curated from public repositories. Analyses of the 79 populations revealed excellent characteristics for this 90-plex MH panel for various forensic applications achieving an overall average effective number of allele values (Ae) of 4.55 (range 1.04–19.27) for individualization and mixture deconvolution. Population-specific random match probabilities ranged from a low of 10^-115 to a maximum of 10^-66. Mean informativeness (In) for ancestry inference was 0.355 (range 0.117–0.883). 65 novel SNPs were detected in 39 of the MHs using mMHseq. Of the 3018 different microhaplotype alleles identified, 1337 occurred at frequencies > 5% in at least one of the populations studied. The 90-plex MH panel enables effective differentiation of population groupings for major biogeographic regions as well as delineation of distinct subgroupings within regions. Open-source, web-based software is available to support validation of this technology for forensic case work analysis and to tailor MH analysis for specific geographical regions.
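The effective number of alleles (Ae) reported above is conventionally defined as the reciprocal of the sum of squared allele frequencies at a locus. The Python sketch below applies that standard definition to hypothetical microhaplotype frequencies; it is an illustration only and is not taken from the authors' software.

```python
def effective_number_of_alleles(freqs):
    """Ae = 1 / sum(p_i^2) for allele frequencies that sum to 1."""
    if abs(sum(freqs) - 1.0) > 1e-6:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 / sum(p * p for p in freqs)


# Hypothetical locus with four equally common microhaplotype alleles.
ae_even = effective_number_of_alleles([0.25, 0.25, 0.25, 0.25])
# Hypothetical locus dominated by a single allele.
ae_skewed = effective_number_of_alleles([0.9, 0.1])
```

A locus with k equally frequent alleles has Ae equal to k, whereas a locus dominated by one allele has Ae close to 1. This is why higher Ae values indicate loci that are more useful for individualization and mixture deconvolution.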

