scholarly journals BasePlayer: Versatile Analysis Software for Large-scale Genomic Variant Discovery

2017 ◽  
Author(s):  
Riku Katainen ◽  
Iikki Donner ◽  
Tatiana Cajuso ◽  
Eevi Kaasinen ◽  
Kimmo Palin ◽  
...  

AbstractNext-generation sequencing (NGS) is being routinely applied in life sciences and clinical practice, where the interpretation of the resulting massive data has become a critical challenge. Computational workflows, such as the Broad GATK, have been established to take raw sequencing data and produce processed data for downstream analyses. Consequently, results of these computationally demanding workflows, consisting of e.g. sequence alignment and variant calling, are increasingly being provided for customers by sequencing and bioinformatics facilities. However, downstream variant analysis, whole-genome level in particular, has been lacking a multi-purpose tool, which could take advantage of rapidly growing genomic information and integrate genetic variant, sequence, genomic annotation and regulatory (e.g. ENCODE) data interactively and in a visual fashion. Here we introduce a highly efficient and user-friendly software, BasePlayer (http://baseplayer.fi), for biological discovery in large-scale NGS data. BasePlayer enables tightly integrated comparative variant analysis and visualization of thousands of NGS data samples and millions of variants, with numerous applications in disease, regulatory and population genomics. Although BasePlayer has been designed primarily for whole-genome and exome sequencing data, it is well-suited to various study settings, diseases and organisms by supporting standard and upcoming file formats. BasePlayer transforms an ordinary desktop computer into a large-scale genomic research platform, enabling also a non-technical user to perform complex comparative variant analyses, population frequency filtering and genome level annotations under intuitive, scalable and highly-responsive user interface to facilitate everyday genetic research as well as the search of novel discoveries.

Author(s):  
Anne Krogh Nøhr ◽  
Kristian Hanghøj ◽  
Genis Garcia Erill ◽  
Zilong Li ◽  
Ida Moltke ◽  
...  

Abstract Estimation of relatedness between pairs of individuals is important in many genetic research areas. When estimating relatedness, it is important to account for admixture if this is present. However, the methods that can account for admixture are all based on genotype data as input, which is a problem for low-depth next-generation sequencing (NGS) data from which genotypes are called with high uncertainty. Here we present a software tool, NGSremix, for maximum likelihood estimation of relatedness between pairs of admixed individuals from low-depth NGS data, which takes the uncertainty of the genotypes into account via genotype likelihoods. Using both simulated and real NGS data for admixed individuals with an average depth of 4x or below we show that our method works well and clearly outperforms all the commonly used state-of-the-art relatedness estimation methods PLINK, KING, relateAdmix, and ngsRelate that all perform quite poorly. Hence, NGSremix is a useful new tool for estimating relatedness in admixed populations from low-depth NGS data. NGSremix is implemented in C/C ++ in a multi-threaded software and is freely available on Github https://github.com/KHanghoj/NGSremix.


Genes ◽  
2021 ◽  
Vol 12 (2) ◽  
pp. 258
Author(s):  
Karim Karimi ◽  
Duy Ngoc Do ◽  
Mehdi Sargolzaei ◽  
Younes Miar

Characterizing the genetic structure and population history can facilitate the development of genomic breeding strategies for the American mink. In this study, we used the whole genome sequences of 100 mink from the Canadian Centre for Fur Animal Research (CCFAR) at the Dalhousie Faculty of Agriculture (Truro, NS, Canada) and Millbank Fur Farm (Rockwood, ON, Canada) to investigate their population structure, genetic diversity and linkage disequilibrium (LD) patterns. Analysis of molecular variance (AMOVA) indicated that the variation among color-types was significant (p < 0.001) and accounted for 18% of the total variation. The admixture analysis revealed that assuming three ancestral populations (K = 3) provided the lowest cross-validation error (0.49). The effective population size (Ne) at five generations ago was estimated to be 99 and 50 for CCFAR and Millbank Fur Farm, respectively. The LD patterns revealed that the average r2 reduced to <0.2 at genomic distances of >20 kb and >100 kb in CCFAR and Millbank Fur Farm suggesting that the density of 120,000 and 24,000 single nucleotide polymorphisms (SNP) would provide the adequate accuracy of genomic evaluation in these populations, respectively. These results indicated that accounting for admixture is critical for designing the SNP panels for genotype-phenotype association studies of American mink.


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e5895 ◽  
Author(s):  
Thomas Andreas Kohl ◽  
Christian Utpatel ◽  
Viola Schleusener ◽  
Maria Rosaria De Filippo ◽  
Patrick Beckert ◽  
...  

Analyzing whole-genome sequencing data of Mycobacterium tuberculosis complex (MTBC) isolates in a standardized workflow enables both comprehensive antibiotic resistance profiling and outbreak surveillance with highest resolution up to the identification of recent transmission chains. Here, we present MTBseq, a bioinformatics pipeline for next-generation genome sequence data analysis of MTBC isolates. Employing a reference mapping based workflow, MTBseq reports detected variant positions annotated with known association to antibiotic resistance and performs a lineage classification based on phylogenetic single nucleotide polymorphisms (SNPs). When comparing multiple datasets, MTBseq provides a joint list of variants and a FASTA alignment of SNP positions for use in phylogenomic analysis, and identifies groups of related isolates. The pipeline is customizable, expandable and can be used on a desktop computer or laptop without any internet connection, ensuring mobile usage and data security. MTBseq and accompanying documentation is available from https://github.com/ngs-fzb/MTBseq_source.


2021 ◽  
Author(s):  
Jiru Han ◽  
Jacob E Munro ◽  
Anthony Kocoski ◽  
Alyssa E Barry ◽  
Melanie Bahlo

Short tandem repeats (STRs) are highly informative genetic markers that have been used extensively in population genetics analysis. They are an important source of genetic diversity and can also have functional impact. Despite the availability of bioinformatic methods that permit large-scale genome-wide genotyping of STRs from whole genome sequencing data, they have not previously been applied to sequencing data from large collections of malaria parasite field samples. Here, we have genotyped STRs using HipSTR in more than 3,000 Plasmodium falciparum and 174 Plasmodium vivax published whole-genome sequence data from samples collected across the globe. High levels of noise and variability in the resultant callset necessitated the development of a novel method for quality control of STR genotype calls. A set of high-quality STR loci (6,768 from P. falciparum and 3,496 from P. vivax) were used to study Plasmodium genetic diversity, population structures and genomic signatures of selection and these were compared to genome-wide single nucleotide polymorphism (SNP) genotyping data. In addition, the genome-wide information about genetic variation and other characteristics of STRs in P. falciparum and P. vivax have been made available in an interactive web-based R Shiny application PlasmoSTR (https://github.com/bahlolab/PlasmoSTR).


2018 ◽  
Author(s):  
Paul Lacaze ◽  
Mark Pinese ◽  
Warren Kaplan ◽  
Andrew Stone ◽  
Marie-Jo Brion ◽  
...  

AbstractAllele frequency data from human reference populations is of increasing value for filtering and assignment of pathogenicity to genetic variants. Aged and healthy populations are more likely to be selectively depleted of pathogenic alleles, and therefore particularly suitable as a reference populations for the major diseases of clinical and public health importance. However, reference studies of the healthy elderly have remained under-represented in human genetics. We have developed the Medical Genome Reference Bank (MGRB), a large-scale comprehensive whole-genome dataset of confirmed healthy elderly individuals, to provide a publicly accessible resource for health-related research, and for clinical genetics. It also represents a useful resource for studying the genetics of healthy aging. The MGRB comprises 4,000 healthy, older individuals with no reported history of cancer, cardiovascular disease or dementia, recruited from two Australian community-based cohorts. DNA derived from blood samples will be subject to whole genome sequencing. The MGRB will measure genome-wide genetic variation in 4,000 individuals, mostly of European decent, aged 60-95 years (mean age ≥ 75 years). The MGRB has committed to a policy of data sharing, employing a hierarchical data management system to maintain participant privacy and confidentiality, whilst maximizing research and clinical usage of the database. The MGRB will represent a dataset of international significance, broadly accessible to the clinical and genetic research community.


2020 ◽  
Author(s):  
Caoimhe E. O’Brien ◽  
João Oliveira-Pacheco ◽  
Eoin Ó Cinnéide ◽  
Max A. B. Hasse ◽  
Chris Todd Hittinger ◽  
...  

AbstractCandida tropicalis is a human pathogen that primarily infects the immunocompromised. Whereas the genome of one isolate, C. tropicalis MYA-3404, was originally sequenced in 2009, there have been no large-scale, multi-isolate studies of the genetic and phenotypic diversity of this species. Here, we used whole genome sequencing and phenotyping to characterize 77 isolates C. tropicalis isolates from clinical and environmental sources from a variety of locations. We show that most C. tropicalis isolates are diploids with approximately 2 - 6 heterozygous variants per kilobase. The genomes are relatively stable, with few aneuploidies. However, we identified one highly homozygous isolate and six isolates of C. tropicalis with much higher heterozygosity levels ranging from 36 - 49 heterozygous variants per kilobase. Our analyses show that the heterozygous isolates represent two different hybrid lineages, where the hybrids share one parent (A) with most other C. tropicalis isolates, but the second parent (B or C) differs by at least 4% at the genome level. Four of the sequenced isolates descend from an AB hybridization, and two from an AC hybridization. The hybrids are MTLa/α heterozygotes. Hybridization, or mating, between different parents is therefore common in the evolutionary history of C. tropicalis. The new hybrids were predominantly found in environmental niches, including from soil. Hybridization is therefore unlikely to be associated with virulence. In addition, we used genotype-phenotype correlation and CRISPR-Cas9 editing to identify a genome variant that results in the inability of one isolate to utilize certain branched-chain amino acids as a sole nitrogen source.Author summaryCandida tropicalis is an important fungal pathogen, which is particularly common in the Asia-Pacific and Latin America. There is currently very little known about the diversity of genotype and phenotype of C. tropicalis isolates. By carrying out a phylogenomic analysis of 77 isolates, we find that C. tropicalis genomes range from very homozygous to highly heterozygous. We show that the heterozygous isolates are hybrids, most likely formed by mating between different parents. Unlike other Candida species, the hybrids are more common in environmental than in clinical niches, suggesting that for this species, hybridization is not associated with virulence. We also explore the range of phenotypes, and we identify a genomic variant that is required for growth on valine and isoleucine as sole nitrogen sources.


Genes ◽  
2020 ◽  
Vol 11 (12) ◽  
pp. 1444
Author(s):  
Nazeefa Fatima ◽  
Anna Petri ◽  
Ulf Gyllensten ◽  
Lars Feuk ◽  
Adam Ameur

Long-read single molecule sequencing is increasingly used in human genomics research, as it allows to accurately detect large-scale DNA rearrangements such as structural variations (SVs) at high resolution. However, few studies have evaluated the performance of different single molecule sequencing platforms for SV detection in human samples. Here we performed Oxford Nanopore Technologies (ONT) whole-genome sequencing of two Swedish human samples (average 32× coverage) and compared the results to previously generated Pacific Biosciences (PacBio) data for the same individuals (average 66× coverage). Our analysis inferred an average of 17k and 23k SVs from the ONT and PacBio data, respectively, with a majority of them overlapping with an available multi-platform SV dataset. When comparing the SV calls in the two Swedish individuals, we find a higher concordance between ONT and PacBio SVs detected in the same individual as compared to SVs detected by the same technology in different individuals. Downsampling of PacBio reads, performed to obtain similar coverage levels for all datasets, resulted in 17k SVs per individual and improved overlap with the ONT SVs. Our results suggest that ONT and PacBio have a similar performance for SV detection in human whole genome sequencing data, and that both technologies are feasible for population-scale studies.


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Giulio Caravagna ◽  
Guido Sanguinetti ◽  
Trevor A. Graham ◽  
Andrea Sottoriva

Abstract Background The large-scale availability of whole-genome sequencing profiles from bulk DNA sequencing of cancer tissues is fueling the application of evolutionary theory to cancer. From a bulk biopsy, subclonal deconvolution methods are used to determine the composition of cancer subpopulations in the biopsy sample, a fundamental step to determine clonal expansions and their evolutionary trajectories. Results In a recent work we have developed a new model-based approach to carry out subclonal deconvolution from the site frequency spectrum of somatic mutations. This new method integrates, for the first time, an explicit model for neutral evolutionary forces that participate in clonal expansions; in that work we have also shown that our method improves largely over competing data-driven methods. In this Software paper we present mobster, an open source R package built around our new deconvolution approach, which provides several functions to plot data and fit models, assess their confidence and compute further evolutionary analyses that relate to subclonal deconvolution. Conclusions We present the mobster package for tumour subclonal deconvolution from bulk sequencing, the first approach to integrate Machine Learning and Population Genetics which can explicitly model co-existing neutral and positive selection in cancer. We showcase the analysis of two datasets, one simulated and one from a breast cancer patient, and overview all package functionalities.


Sign in / Sign up

Export Citation Format

Share Document