Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects

bioRxiv, 10.1101/269316, 2018, Cited By ~ 1

Author(s): Allison A. Regier, Yossi Farjoun, David Larson, Olga Krasheninina, Hyun Min Kang, ...

Keyword(s): Data Processing, Genome Sequencing, Statistical Power, Human Genetics, Variant Calling, Joint Analysis, Sequencing Analysis, Batch Effects, Many Sources, Genomic Regions

Abstract

Hundreds of thousands of human whole genome sequencing (WGS) datasets will be generated over the next few years to interrogate a broad range of traits, across diverse populations. These data are more valuable in aggregate: joint analysis of genomes from many sources increases sample size and statistical power for trait mapping, and will enable studies of genome biology, population genetics and genome function at unprecedented scale. A central challenge for joint analysis is that different WGS data processing and analysis pipelines cause substantial batch effects in combined datasets, necessitating computationally expensive reprocessing and harmonization prior to variant calling. This approach is no longer tenable given the scale of current studies and data volumes. Here, in a collaboration across multiple genome centers and NIH programs, we define WGS data processing standards that allow different groups to produce “functionally equivalent” (FE) results suitable for joint variant calling with minimal batch effects. Our approach promotes broad harmonization of upstream data processing steps, while allowing for diverse variant callers. Importantly, it allows each group to continue innovating on data processing pipelines, as long as results remain compatible. We present initial FE pipelines developed at five genome centers and show that they yield similar variant calling results – including single nucleotide (SNV), insertion/deletion (indel) and structural variation (SV) – and produce significantly less variability than sequencing replicates. Residual inter-pipeline variability is concentrated at low quality sites and repetitive genomic regions prone to stochastic effects. This work alleviates a key technical bottleneck for genome aggregation and helps lay the foundation for broad data sharing and community-wide “big-data” human genetics studies.

Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects

Nature Communications, 10.1038/s41467-018-06159-4, 2018, Vol 9 (1), Cited By ~ 58

Author(s): Allison A. Regier, Yossi Farjoun, David E. Larson, Olga Krasheninina, Hyun Min Kang, ...

Keyword(s): Genome Sequencing, Human Genetics, Variant Calling, Functional Equivalence, Sequencing Analysis

1722-P: Colocalization of TOPMed Whole Genome Sequencing Analysis and Tissue-Specific eQTL Signals Detects Target Genes for Type 2 Diabetes Risk

Diabetes, 10.2337/db19-1722-p, 2019, Vol 68 (Supplement 1), pp. 1722-P

Author(s): Mindy D. Szeto, Heather M. Highland, Alisa Manning

Keyword(s): Type 2 Diabetes, Whole Genome Sequencing, Genome Sequencing, Target Genes, Diabetes Risk, Whole Genome, Sequencing Analysis, Tissue Specific

Whole Genome Sequencing Refines Knowledge on the Population Structure of Mycobacterium bovis from a Multi-Host Tuberculosis System

Microorganisms, 10.3390/microorganisms9081585, 2021, Vol 9 (8), pp. 1585

Author(s): Ana C. Reis, Liliana C. M. Salvador, Suelee Robbe-Austerman, Rogério Tenreiro, Ana Botelho, ...

Keyword(s): Population Structure, Whole Genome Sequencing, Wild Boar, Genome Sequencing, Mycobacterium Bovis, Red Deer, Variable Number Tandem Repeat, Variant Calling, Whole Genome, Network Analyses

Classical molecular analyses of Mycobacterium bovis based on spoligotyping and Variable Number Tandem Repeat (MIRU-VNTR) brought the first insights into the epidemiology of animal tuberculosis (TB) in Portugal, showing high genotypic diversity of circulating strains that mostly cluster within the European 2 clonal complex. Previous surveillance provided valuable information on the prevalence and spatial occurrence of TB and highlighted prevalent genotypes in areas where livestock and wild ungulates are sympatric. However, links at the wildlife–livestock interfaces were established mainly via classical genotype associations. Here, we apply whole genome sequencing (WGS) to cattle, red deer and wild boar isolates to reconstruct the M. bovis population structure in a multi-host, multi-region disease system and to explore links at a fine genomic scale between M. bovis from wildlife hosts and cattle. Whole genome sequences of 44 representative M. bovis isolates, obtained between 2003 and 2015 from three TB hotspots, were compared through single nucleotide polymorphism (SNP) variant calling analyses. Consistent with previous results combining classical genotyping with Bayesian population admixture modelling, SNP-based phylogenies support the branching of this M. bovis population into five genetic clades, three with apparent geographic specificities, as well as the establishment of an SNP catalogue specific to each clade, which may be explored in the future as phylogenetic markers. The core genome alignment of SNPs was integrated within a spatiotemporal metadata framework to further structure this M. bovis population by host species and TB hotspots, providing a baseline for network analyses in different epidemiological and disease control contexts. WGS of M. bovis isolates from Portugal is reported for the first time in this pilot study, refining the spatiotemporal context of TB at the wildlife–livestock interface and providing further support to the key role of red deer and wild boar on disease maintenance. The SNP diversity observed within this dataset supports the natural circulation of M. bovis for a long time period, as well as multiple introduction events of the pathogen in this Iberian multi-host system.

Assessing genomic diversity and signatures of selection in Jiaxian Red cattle using whole-genome sequencing data

BMC Genomics, 10.1186/s12864-020-07340-0, 2021, Vol 22 (1)

Author(s): Xiaoting Xia, Shunjin Zhang, Huaju Zhang, Zijing Zhang, Ningbo Chen, ...

Keyword(s): Population Structure, Whole Genome Sequencing, Genome Sequencing, Genomic Variation, Genomic Diversity, System Response, Whole Genome, Population Structure Analysis, Native Cattle, Genomic Regions

Abstract

Background: Native cattle breeds are an important source of genetic variation because they might carry alleles that enable them to adapt to local environment and tough feeding conditions. Jiaxian Red, a Chinese native cattle breed, is reported to have originated from crossbreeding between taurine and indicine cattle; their history as a draft and meat animal dates back at least 30 years. Using whole-genome sequencing (WGS) data of 30 animals from the core breeding farm, we investigated the genetic diversity, population structure and genomic regions under selection of Jiaxian Red cattle. Furthermore, we used 131 published genomes of world-wide cattle to characterize the genomic variation of Jiaxian Red cattle.

Results: The population structure analysis revealed that Jiaxian Red cattle harboured the ancestry with East Asian taurine (0.493), Chinese indicine (0.379), European taurine (0.095) and Indian indicine (0.033). Three methods (nucleotide diversity, linkage disequilibrium decay and runs of homozygosity) implied the relatively high genomic diversity in Jiaxian Red cattle. We used θπ, CLR, FST and XP-EHH methods to look for the candidate signatures of positive selection in Jiaxian Red cattle. A total number of 171 (θπ and CLR) and 17 (FST and XP-EHH) shared genes were identified using different detection strategies. Functional annotation analysis revealed that these genes are potentially responsible for growth and feed efficiency (CCSER1), meat quality traits (ROCK2, PPP1R12A, CYB5R4, EYA3, PHACTR1), fertility (RFX4, SRD5A2) and immune system response (SLAMF1, CD84 and SLAMF6).

Conclusion: We provide a comprehensive overview of sequence variations in Jiaxian Red cattle genomes. Selection signatures were detected in genomic regions that are possibly related to economically important traits in Jiaxian Red cattle. We observed a high level of genomic diversity and low inbreeding in Jiaxian Red cattle. These results provide a basis for further resource protection and breeding improvement of this breed.

Clinical-grade whole-genome sequencing and 3′ transcriptome analysis of colorectal cancer patients

Genome Medicine, 10.1186/s13073-021-00852-8, 2021, Vol 13 (1)

Author(s): Agata Stodolna, Miao He, Mahesh Vasipalli, Zoya Kingsbury, Jennifer Becq, ...

Keyword(s): Colorectal Cancer, Whole Genome Sequencing, Genome Sequencing, Transcriptome Analysis, Variant Calling, Standard Of Care, Genomic Variation, Whole Genome, Clinical Grade, Pathway Gene

Abstract

Background: Clinical-grade whole-genome sequencing (cWGS) has the potential to become the standard of care within the clinic because of its breadth of coverage and lack of bias towards certain regions of the genome. Colorectal cancer presents a difficult treatment paradigm, with over 40% of patients presenting at diagnosis with metastatic disease. We hypothesised that cWGS coupled with 3′ transcriptome analysis would give new insights into colorectal cancer.

Methods: Patients underwent PCR-free whole-genome sequencing and alignment and variant calling using a standardised pipeline to output SNVs, indels, SVs and CNAs. Additional insights into the mutational signatures and tumour biology were gained by the use of 3′ RNA-seq.

Results: Fifty-four patients were studied in total. Driver analysis identified the Wnt pathway gene APC as the only consistently mutated driver in colorectal cancer. Alterations in the PI3K/mTOR pathways were seen as previously observed in CRC. Multiple private CNAs, SVs and gene fusions were unique to individual tumours. Approximately 30% of patients had a tumour mutational burden of > 10 mutations/Mb of DNA, suggesting suitability for immunotherapy.

Conclusions: Clinical whole-genome sequencing offers a potential avenue for the identification of private genomic variation that may confer sensitivity to targeted agents and offer patients new options for targeted therapies.
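The abstract above reports tumour mutational burden against a threshold of 10 mutations/Mb but does not say how the figure was computed. As a rough illustration only, the arithmetic can be sketched as follows. The function names, the choice of denominator (callable bases) and the example counts are all assumptions made for this sketch, not details taken from the paper.

```python
def tumour_mutational_burden(mutation_count: int, callable_bases: int) -> float:
    """Return mutations per megabase.

    Assumption: the denominator is the number of callable bases in the
    sequenced region; the abstract does not state which denominator was used.
    """
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    megabases = callable_bases / 1_000_000
    return mutation_count / megabases


def exceeds_threshold(tmb: float, threshold: float = 10.0) -> bool:
    """True when the burden is strictly above the threshold (> 10 mutations/Mb)."""
    return tmb > threshold


# Hypothetical example: 45,000 somatic mutations over 3 Gb of callable genome.
example_tmb = tumour_mutational_burden(45_000, 3_000_000_000)
print(example_tmb, exceeds_threshold(example_tmb))
```

Which mutation classes are counted (for instance, all somatic SNVs and indels versus coding mutations only) changes the result substantially, so any real comparison against the threshold needs the study's own definition.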

Unbiased machine learning methods to predict the limitations of variant calling in homologous genomic regions using next-generation sequencing

Molecular Genetics and Metabolism, 10.1016/s1096-7192(21)00467-4, 2021, Vol 132, pp. S250-S252

Author(s): Feng Li, Rohan Gnanaolivu, Noemi Vidal-Folch, Neiladri Saha, Nipun Mistry, ...

Keyword(s): Machine Learning, Next Generation Sequencing, Variant Calling, Next Generation, Learning Methods, Machine Learning Methods, Genomic Regions, Generation Sequencing

Estimating sequencing error rates using families

BioData Mining, 10.1186/s13040-021-00259-6, 2021, Vol 14 (1)

Author(s): Kelley Paskov, Jae-Yoon Jung, Brianna Chrisman, Nate T. Stockham, Peter Washington, ...

Keyword(s): Whole Genome Sequencing, Exome Sequencing, Genome Sequencing, Variant Calling, Error Rates, Sequencing Error, Whole Genome, Sequencing Data, Sequencing Platform, Whole Exome

Abstract

Background: As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates since it allows us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample.

Results: We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method’s versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that: 1) Sequencing error rates between samples in the same dataset can vary by over an order of magnitude. 2) Variant calling performance decreases substantially in low-complexity regions of the genome. 3) Variant calling performance in whole exome sequencing data decreases with distance from the nearest target region. 4) Variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood. 5) Whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites.

Conclusion: Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.
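To make the idea of a Mendelian error concrete, here is a minimal sketch of checking whether a child's genotype is consistent with its parents' genotypes and counting the inconsistent sites in a trio. This is not the method introduced in the paper, which goes further and infers per-sample precision and recall. The function names and the tuple-based genotype encoding are assumptions for illustration, and the check assumes autosomal diploid sites with no missing calls.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """Check one site. Each genotype is a pair of allele codes, e.g. (0, 1).

    The child must carry one allele from the mother and one from the father.
    Assumes an autosomal diploid site with no missing calls.
    """
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) in possible


def observed_mendelian_error_rate(trio_calls):
    """Fraction of sites where the trio is inconsistent.

    trio_calls is an iterable of (child, mother, father) genotype triples.
    """
    calls = list(trio_calls)
    if not calls:
        return 0.0
    errors = sum(
        1 for child, mother, father in calls
        if not is_mendelian_consistent(child, mother, father)
    )
    return errors / len(calls)


# Hypothetical calls at three sites for one trio.
sites = [
    ((0, 1), (0, 0), (1, 1)),  # consistent: 0 from mother, 1 from father
    ((1, 1), (0, 0), (0, 1)),  # inconsistent: mother cannot transmit allele 1
    ((0, 0), (0, 1), (0, 1)),  # consistent
]
print(observed_mendelian_error_rate(sites))
```

The observed rate is only a lower bound on the true error rate, because, as the abstract notes, only a fraction of sequencing errors show up as Mendelian errors; a wrong call that still happens to be consistent with the parents goes undetected by this check.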

Whole genome sequencing analysis of Salmonella enterica serovar Weltevreden isolated from human stool and contaminated food samples collected from the Southern coastal area of China

International Journal of Food Microbiology, 10.1016/j.ijfoodmicro.2017.10.032, 2018, Vol 266, pp. 317-323, Cited By ~ 4

Author(s): Baisheng Li, Xingfen Yang, Hailing Tan, Bixia Ke, Dongmei He, ...

Keyword(s): Whole Genome Sequencing, Genome Sequencing, Coastal Area, Salmonella Enterica, Food Samples, Whole Genome, Sequencing Analysis, Contaminated Food, Human Stool

A large-scale whole-genome sequencing analysis reveals false positives of bacterial essential genes

Applied Microbiology and Biotechnology, 10.1007/s00253-021-11702-3, 2021

Author(s): Yuanhao Li, Bo Jiang, Weijun Dai

Keyword(s): Whole Genome Sequencing, Genome Sequencing, Large Scale, False Positives, Essential Genes, Whole Genome, Sequencing Analysis

Fitting whole-genome sequencing analysis for metastasis

Nature Cancer, 10.1038/s43018-021-00312-7, 2021, Vol 2 (12), pp. 1290-1290

Author(s): Julia Simundza

Keyword(s): Whole Genome Sequencing, Genome Sequencing, Whole Genome, Sequencing Analysis
