High Coverage Whole Genome Sequencing of the Expanded 1000 Genomes Project Cohort Including 602 Trios

10.1101/2021.02.06.430068 ◽

2021 ◽

Cited By ~ 4

Author(s):

Marta Byrska-Bishop ◽

Uday S. Evani ◽

Xuefang Zhao ◽

Anna O. Basile ◽

Haley J. Abel ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Sequence Data ◽

Whole Genome ◽

1000 Genomes Project ◽

Phase 3 ◽

High Coverage ◽

Entire Cohort ◽

1000 Genomes ◽

Low Coverage

Abstract: The 1000 Genomes Project (1kGP), launched in 2008, is the largest fully open resource of whole genome sequencing (WGS) data consented for public distribution of raw sequence data without access or use restrictions. The final (phase 3) 2015 release of 1kGP included 2,504 unrelated samples from 26 populations, representing five continental regions of the world, and was based on a combination of technologies including low coverage WGS (mean depth 7.4X), high coverage whole exome sequencing (mean depth 65.7X), and microarray genotyping. Here, we present a new, high coverage WGS resource encompassing the original 2,504 1kGP samples, as well as an additional 698 related samples that result in 602 complete trios in the 1kGP cohort. We sequenced this expanded 1kGP cohort of 3,202 samples to a targeted depth of 30X using Illumina NovaSeq 6000 instruments. We performed SNV/INDEL calling against the GRCh38 reference using GATK's HaplotypeCaller, and generated a comprehensive set of SVs by integrating multiple analytic methods through a sophisticated machine learning model, upgrading the 1kGP dataset to current state-of-the-art standards. Using this strategy, we defined over 111 million SNVs, 14 million INDELs, and ∼170 thousand SVs across the entire cohort of 3,202 samples, with estimated false discovery rates (FDRs) of 0.3%, 1.0%, and 1.8%, respectively. By comparison to the low-coverage phase 3 callset, we observed substantial improvements in variant discovery and estimated FDR that were facilitated by high coverage re-sequencing and expansion of the cohort. Specifically, we called 7% more SNVs, 59% more INDELs, and 170% more SVs per genome than the phase 3 callset. Moreover, we leveraged the presence of families in the cohort to achieve superior haplotype phasing accuracy, and we demonstrate the improvements that the high coverage panel brings, especially for INDEL imputation. We make all the data generated as part of this project publicly available, and we envision this updated version of the 1kGP callset becoming the new de facto public resource for the worldwide scientific community working on genomics and genetics.
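The figures quoted in this abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only: it treats the rounded counts from the abstract ("over 111 million", "∼170 thousand") as exact inputs, which is an assumption, and the variable names are hypothetical.

```python
# Cohort size: original unrelated samples plus newly added related samples.
original_samples = 2504
related_samples = 698
cohort = original_samples + related_samples  # the abstract reports 3,202

# (approximate variant count, estimated FDR) per variant class, as quoted
# in the abstract. The counts are rounded, so the results are rough.
callset = {
    "SNV": (111_000_000, 0.003),
    "INDEL": (14_000_000, 0.010),
    "SV": (170_000, 0.018),
}

# Expected number of false discoveries = count * FDR.
expected_false = {name: round(n * fdr) for name, (n, fdr) in callset.items()}

print("cohort size:", cohort)
for name, value in expected_false.items():
    print(f"{name}: roughly {value:,} expected false discoveries")
```

Under these assumptions, a lower FDR does not by itself mean fewer false calls in absolute terms, because the SNV class is far larger than the SV class.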


Kssd: sequence dimensionality reduction by k-mer substring space sampling enables real-time large-scale datasets analysis

Genome Biology ◽

10.1186/s13059-021-02303-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Huiguang Yi ◽

Yanling Lin ◽

Chengqi Lin ◽

Wenfei Jin

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Real Data ◽

Whole Genome ◽

1000 Genomes Project ◽

1000 Genomes ◽

Sequence Read Archive ◽

Large Scale Dataset ◽

Ncbi Sequence Read Archive

Abstract: Here, we develop k-mer substring space decomposition (Kssd), a sketching technique which is significantly faster and more accurate than current sketching methods. We show that it is the only method that can be used for large-scale dataset comparisons at population resolution on simulated and real data. Using Kssd, we prioritize references for all 1,019,179 bacteria whole genome sequencing (WGS) runs from NCBI Sequence Read Archive and find misidentification or contamination in 6,164 of these. Additionally, we analyze WGS and exome runs of samples from the 1000 Genomes Project.
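For readers unfamiliar with sequence sketching in general, the following is a minimal sketch of a generic bottom-s (MinHash-style) k-mer sketch. It is not Kssd's algorithm, which the abstract does not describe in detail; the function names, the choice of SHA-1 as the hash, and the default parameters are all assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq, k=4, size=8):
    """Keep the `size` smallest k-mer hash values (a bottom-s sketch)."""
    hashes = {
        int(hashlib.sha1(km.encode()).hexdigest(), 16) for km in kmers(seq, k)
    }
    return set(sorted(hashes)[:size])


def jaccard_estimate(a, b, size=8):
    """Estimate the Jaccard similarity of two k-mer sets from their sketches."""
    merged = set(sorted(a | b)[:size])  # bottom-s sketch of the union
    if not merged:
        return 0.0
    return len(merged & a & b) / len(merged)


same = jaccard_estimate(sketch("ACGTACGTAC"), sketch("ACGTACGTAC"))
different = jaccard_estimate(sketch("AAAAAAAA"), sketch("CCCCCCCC"))
print(same, different)
```

The point of any such scheme is that two datasets are compared through small fixed-size summaries rather than through their full k-mer sets.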


Prioritising positively selected variants in whole-genome sequencing data using FineMAV

BMC Bioinformatics ◽

10.1186/s12859-021-04506-9 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Fadilla Wahyudi ◽

Farhang Aghakhanian ◽

Sadequr Rahman ◽

Yik-Ying Teo ◽

Michał Szpak ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Population Genomics ◽

Software Tool ◽

Human Populations ◽

Whole Genome ◽

Sequencing Data ◽

1000 Genomes Project ◽

1000 Genomes ◽

Genome Browsers

Abstract: Background: In population genomics, polymorphisms that are highly differentiated between geographically separated populations are often suggestive of Darwinian positive selection. Genomic scans have highlighted several such regions in African and non-African populations, but only a handful of these have functional data that clearly implicate the candidate variants driving the selection process. Fine-Mapping of Adaptive Variation (FineMAV) was developed to address this in a high-throughput manner using population-based whole-genome sequences generated by the 1000 Genomes Project. It pinpoints positively selected genetic variants in sequencing data by prioritizing high-frequency, population-specific and functional derived alleles. Results: We developed stand-alone software that implements the FineMAV statistic. To visualise the FineMAV scores graphically, it outputs the statistics as bigWig files, a common file format supported by many genome browsers. It is available with both a command-line and a graphical user interface. The software was tested by replicating the FineMAV scores obtained using 1000 Genomes Project African, European, East and South Asian populations and subsequently applied to whole-genome sequencing datasets from Singapore and China to highlight population-specific variants that can be subsequently modelled. The software tool is publicly available at https://github.com/fadilla-wahyudi/finemav. Conclusions: The software tool described here determines genome-wide FineMAV scores, using low- or high-coverage whole-genome sequencing datasets, that can be used to prioritize a list of population-specific, highly differentiated candidate variants for in vitro or in vivo functional screens. The tool displays these scores on human genome browsers for easy visualisation, annotation and comparison between different genomic regions in worldwide human populations.


Large-scale whole-genome sequencing of three diverse Asian populations in Singapore

10.1101/390070 ◽

2018 ◽

Cited By ~ 3

Author(s):

Degang Wu ◽

Jinzhuang Dou ◽

Xiaoran Chai ◽

Claire Bellis ◽

Andreas Wilm ◽

...

Keyword(s):

Genetic Structure ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Human Genetics ◽

Genotype Imputation ◽

Whole Genome ◽

Fine Scale ◽

1000 Genomes Project ◽

1000 Genomes ◽

Asian Populations

Abstract: Asian populations are currently underrepresented in human genetics research. Here we present whole-genome sequencing data of 4,810 Singaporeans from three diverse ethnic groups: 2,780 Chinese, 903 Malays, and 1,127 Indians. Despite a medium depth of 13.7×, we achieved essentially perfect (>99.8%) sensitivity and accuracy for detecting common variants and good sensitivity (>89%) for detecting extremely rare variants with <0.1% allele frequency. We found 89.2 million single-nucleotide polymorphisms (SNPs) and 9.1 million small insertions and deletions (INDELs), more than half of which have not been cataloged in dbSNP. In particular, we found 126 common deleterious mutations (MAF > 0.01) that were absent in the existing public databases, highlighting the importance of local population reference for genetic diagnosis. We describe fine-scale genetic structure of Singapore populations and their relationship to worldwide populations from the 1000 Genomes Project. In addition to revealing noticeable amounts of admixture among three Singapore populations and a Malay-related novel ancestry component that has not been captured by the 1000 Genomes Project, our analysis also identified some fine-scale features of genetic structure consistent with two waves of prehistoric migration from south China to Southeast Asia. Finally, we demonstrate that our data can substantially improve genotype imputation not only for Singapore populations, but also for populations across Asia and Oceania. These results highlight the genetic diversity in Singapore and the potential impacts of our data as a resource to empower human genetics discovery in a broad geographic region.
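To make the "extremely rare" threshold in this abstract concrete, the sketch below converts the <0.1% allele frequency cutoff into an allele count for a cohort of this size. It assumes two allele copies per person (autosomal sites) and complete genotyping at every site, neither of which is stated in the abstract; the variable names are hypothetical.

```python
# Group sizes as stated in the abstract.
groups = {"Chinese": 2780, "Malay": 903, "Indian": 1127}
total = sum(groups.values())  # the abstract reports 4,810

# Assumption: diploid autosomal sites, so two allele copies per person.
chromosomes = 2 * total

# Largest allele count whose frequency is strictly below 0.1% (1 in 1,000).
# Integer arithmetic avoids floating-point edge cases at the boundary.
max_rare_count = (chromosomes - 1) // 1000

print(total, chromosomes, max_rare_count)
```

Under these assumptions, a variant in this frequency class is carried on at most nine of the sampled chromosomes.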


Comprehensive Characterization of Human Genome Variation by High Coverage Whole-Genome Sequencing of Forty Four Caucasians

PLoS ONE ◽

10.1371/journal.pone.0059494 ◽

2013 ◽

Vol 8 (4) ◽

pp. e59494 ◽

Cited By ~ 39

Author(s):

Hui Shen ◽

Jian Li ◽

Jigang Zhang ◽

Chao Xu ◽

Yan Jiang ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Human Genome ◽

Genome Sequencing ◽

Whole Genome ◽

High Coverage ◽

Genome Variation ◽

Comprehensive Characterization


Whole-genome sequencing of nine esophageal adenocarcinoma cell lines

F1000Research ◽

10.12688/f1000research.7033.1 ◽

2016 ◽

Vol 5 ◽

pp. 1336 ◽

Cited By ~ 8

Author(s):

Gianmarco Contino ◽

Matthew D. Eldridge ◽

Maria Secrier ◽

Lawrence Bower ◽

Rachael Fels Elliott ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Esophageal Adenocarcinoma ◽

Cell Lines ◽

Genome Sequencing ◽

Sequence Data ◽

Whole Genome ◽

Nucleotide Polymorphisms ◽

Single Nucleotide Variants ◽

High Coverage ◽

Single Nucleotide

Esophageal adenocarcinoma (EAC) is highly mutated and molecularly heterogeneous. The number of cell lines available for study is limited and their genomes have been only partially characterized. An accurate annotation of their mutational landscape is crucial for sound experimental design and correct interpretation of genotype-phenotype findings. We performed high coverage, paired-end whole genome sequencing on eight EAC cell lines (ESO26, ESO51, FLO-1, JH-EsoAd1, OACM5.1 C, OACP4 C, OE33, SK-GT-4), all verified against original patient material, and one esophageal high grade dysplasia cell line, CP-D. We have made available the aligned sequence data and report single nucleotide variants (SNVs), small insertions and deletions (indels), and copy number alterations, identified by comparison with the human reference genome and known single nucleotide polymorphisms (SNPs). We compare these putative mutations to mutations found in primary tissue EAC samples, to inform the use of these cell lines as a model of EAC.


Whole-genome sequencing of 1,171 elderly admixed individuals from the largest Latin American metropolis (São Paulo, Brazil)

10.21203/rs.3.rs-85969/v1 ◽

2020 ◽

Author(s):

Michel Naslavsky ◽

Marilia Scliar ◽

Guilherme Yamamoto ◽

Jaqueline Wang ◽

Stepanka Zverinova ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Latin American ◽

Population Genomics ◽

Mobile Element ◽

Whole Genome ◽

High Coverage ◽

Genomic Studies ◽

Novel Alleles ◽

Recessive Disorders

Abstract: As whole-genome sequencing (WGS) becomes the gold standard tool for studying population genomics and medical applications, data on diverse non-European and admixed individuals are still scarce. Here, we present a high-coverage WGS dataset of 1,171 highly admixed elderly Brazilians from a census-based cohort, providing over 76 million variants, of which ~2 million are absent from large public databases. WGS enabled identifying ~2,000 novel mobile element insertions, nearly 5 Mb of genomic segments absent from the human genome reference, and over 140 novel alleles from HLA genes. We reclassified and curated the pathogenicity assertions of nearly four hundred variants in genes associated with dominantly inherited Mendelian disorders and calculated the incidence for selected recessive disorders, demonstrating the clinical usefulness of the present study. Finally, we observed that whole-genome and HLA imputation could be significantly improved compared to available datasets since rare variation represents the largest proportion of input from WGS. These results demonstrate that even smaller sample sizes of underrepresented populations bring relevant data for genomic studies, especially when exploring analyses allowed only by WGS.


Whole-genome sequencing of 1,171 elderly admixed individuals from the largest Latin American metropolis (São Paulo, Brazil)

10.1101/2020.09.15.298026 ◽

2020 ◽

Author(s):

Michel S. Naslavsky ◽

Marilia O. Scliar ◽

Guilherme L. Yamamoto ◽

Jaqueline Yu Ting Wang ◽

Stepanka Zverinova ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Latin American ◽

Population Genomics ◽

Mobile Element ◽

Whole Genome ◽

High Coverage ◽

Genomic Studies ◽

Novel Alleles ◽

Recessive Disorders

Abstract: As whole-genome sequencing (WGS) becomes the gold standard tool for studying population genomics and medical applications, data on diverse non-European and admixed individuals are still scarce. Here, we present a high-coverage WGS dataset of 1,171 highly admixed elderly Brazilians from a census-based cohort, providing over 76 million variants, of which ~2 million are absent from large public databases. WGS enabled identifying ~2,000 novel mobile element insertions, nearly 5 Mb of genomic segments absent from the human genome reference, and over 140 novel alleles from HLA genes. We reclassified and curated the pathogenicity assertions of nearly four hundred variants in genes associated with dominantly inherited Mendelian disorders and calculated the incidence for selected recessive disorders, demonstrating the clinical usefulness of the present study. Finally, we observed that whole-genome and HLA imputation could be significantly improved compared to available datasets since rare variation represents the largest proportion of input from WGS. These results demonstrate that even smaller sample sizes of underrepresented populations bring relevant data for genomic studies, especially when exploring analyses allowed only by WGS.


Utility of Whole Genome Sequencing in diagnosing complex disorders: lesson from renal tubular disorders

Endocrine Abstracts ◽

10.1530/endoabs.59.p052 ◽

2018 ◽

Author(s):

Mark Stevenson ◽

Alistair T Pagnamenta ◽

Heather G Mack ◽

Judith A Savige ◽

Kate E Lines ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Whole Genome ◽

Complex Disorders ◽

Tubular Disorders ◽

Renal Tubular Disorders ◽

Renal Tubular


1722-P: Colocalization of TOPMed Whole Genome Sequencing Analysis and Tissue-Specific eQTL Signals Detects Target Genes for Type 2 Diabetes Risk

Diabetes ◽

10.2337/db19-1722-p ◽

2019 ◽

Vol 68 (Supplement 1) ◽

pp. 1722-P

Author(s):

MINDY D. SZETO ◽

HEATHER M. HIGHLAND ◽

ALISA MANNING ◽

Keyword(s):

Type 2 Diabetes ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Target Genes ◽

Diabetes Risk ◽

Whole Genome ◽

Sequencing Analysis ◽

Tissue Specific
