Large-scale whole-genome sequencing of three diverse Asian populations in Singapore

Mapping Intimacies ◽

10.1101/390070 ◽

2018 ◽

Cited By ~ 3

Author(s):

Degang Wu ◽

Jinzhuang Dou ◽

Xiaoran Chai ◽

Claire Bellis ◽

Andreas Wilm ◽

...

Keyword(s):

Genetic Structure ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Human Genetics ◽

Genotype Imputation ◽

Whole Genome ◽

Fine Scale ◽

1000 Genomes Project ◽

1000 Genomes ◽

Asian Populations

Abstract Asian populations are currently underrepresented in human genetics research. Here we present whole-genome sequencing data of 4,810 Singaporeans from three diverse ethnic groups: 2,780 Chinese, 903 Malays, and 1,127 Indians. Despite a medium depth of 13.7×, we achieved essentially perfect (>99.8%) sensitivity and accuracy for detecting common variants and good sensitivity (>89%) for detecting extremely rare variants with <0.1% allele frequency. We found 89.2 million single-nucleotide polymorphisms (SNPs) and 9.1 million small insertions and deletions (INDELs), more than half of which have not been cataloged in dbSNP. In particular, we found 126 common deleterious mutations (MAF>0.01) that were absent in the existing public databases, highlighting the importance of local population reference for genetic diagnosis. We describe fine-scale genetic structure of Singapore populations and their relationship to worldwide populations from the 1000 Genomes Project. In addition to revealing noticeable amounts of admixture among three Singapore populations and a Malay-related novel ancestry component that has not been captured by the 1000 Genomes Project, our analysis also identified some fine-scale features of genetic structure consistent with two waves of prehistoric migration from south China to Southeast Asia. Finally, we demonstrate that our data can substantially improve genotype imputation not only for Singapore populations, but also for populations across Asia and Oceania. These results highlight the genetic diversity in Singapore and the potential impacts of our data as a resource to empower human genetics discovery in a broad geographic region.

High coverage whole genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios

10.1101/2021.02.06.430068 ◽

2021 ◽

Cited By ~ 4

Author(s):

Marta Byrska-Bishop ◽

Uday S. Evani ◽

Xuefang Zhao ◽

Anna O. Basile ◽

Haley J. Abel ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Sequence Data ◽

Whole Genome ◽

1000 Genomes Project ◽

Phase 3 ◽

High Coverage ◽

Entire Cohort ◽

1000 Genomes ◽

Low Coverage

Abstract The 1000 Genomes Project (1kGP), launched in 2008, is the largest fully open resource of whole genome sequencing (WGS) data consented for public distribution of raw sequence data without access or use restrictions. The final (phase 3) 2015 release of 1kGP included 2,504 unrelated samples from 26 populations, representing five continental regions of the world and was based on a combination of technologies including low coverage WGS (mean depth 7.4X), high coverage whole exome sequencing (mean depth 65.7X), and microarray genotyping. Here, we present a new, high coverage WGS resource encompassing the original 2,504 1kGP samples, as well as an additional 698 related samples that result in 602 complete trios in the 1kGP cohort. We sequenced this expanded 1kGP cohort of 3,202 samples to a targeted depth of 30X using Illumina NovaSeq 6000 instruments. We performed SNV/INDEL calling against the GRCh38 reference using GATK’s HaplotypeCaller, and generated a comprehensive set of SVs by integrating multiple analytic methods through a sophisticated machine learning model, upgrading the 1kGP dataset to current state-of-the-art standards. Using this strategy, we defined over 111 million SNVs, 14 million INDELs, and ∼170 thousand SVs across the entire cohort of 3,202 samples with estimated false discovery rate (FDR) of 0.3%, 1.0%, and 1.8%, respectively. By comparison to the low-coverage phase 3 callset, we observed substantial improvements in variant discovery and estimated FDR that were facilitated by high coverage re-sequencing and expansion of the cohort. Specifically, we called 7% more SNVs, 59% more INDELs, and 170% more SVs per genome than the phase 3 callset. Moreover, we leveraged the presence of families in the cohort to achieve superior haplotype phasing accuracy and we demonstrate improvements that the high coverage panel brings especially for INDEL imputation. We make all the data generated as part of this project publicly available and we envision this updated version of the 1kGP callset to become the new de facto public resource for the worldwide scientific community working on genomics and genetics.

Kssd: sequence dimensionality reduction by k-mer substring space sampling enables real-time large-scale datasets analysis

Genome Biology ◽

10.1186/s13059-021-02303-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Huiguang Yi ◽

Yanling Lin ◽

Chengqi Lin ◽

Wenfei Jin

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Real Data ◽

Whole Genome ◽

1000 Genomes Project ◽

1000 Genomes ◽

Sequence Read Archive ◽

Large Scale Dataset ◽

Ncbi Sequence Read Archive

Abstract Here, we develop k-mer substring space decomposition (Kssd), a sketching technique which is significantly faster and more accurate than current sketching methods. We show that it is the only method that can be used for large-scale dataset comparisons at population resolution on simulated and real data. Using Kssd, we prioritize references for all 1,019,179 bacteria whole genome sequencing (WGS) runs from NCBI Sequence Read Archive and find misidentification or contamination in 6164 of these. Additionally, we analyze WGS and exome runs of samples from the 1000 Genomes Project.
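
To give a feel for what a k-mer sketch is, the Python snippet below is a minimal, generic illustration and not the Kssd algorithm itself, whose sampling scheme is not described in this abstract. It assumes a simple hash-based subsampling rule (keep a k-mer when its hash is divisible by `keep_one_in`) and compares two sketches by Jaccard similarity. The function names, the choice of BLAKE2b, and the default parameters are all hypothetical.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_hash(kmer):
    """Map a k-mer to a stable 64-bit integer (BLAKE2b is an arbitrary choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=16, keep_one_in=16):
    """Keep roughly 1/keep_one_in of the k-mer hashes (assumed sampling rule)."""
    return {h for h in map(kmer_hash, kmers(seq, k)) if h % keep_one_in == 0}


def jaccard(a, b):
    """Jaccard similarity of two sketches; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    s1 = sketch("ACGTACGTGGCCTTAAGGCC", k=4, keep_one_in=1)
    s2 = sketch("ACGTACGTGGCCTTAAGGCA", k=4, keep_one_in=1)
    print(f"Jaccard similarity: {jaccard(s1, s2):.3f}")
```

Because every sequence is reduced with the same rule, sketches of different genomes stay comparable while being much smaller than the full k-mer sets. A production tool would also handle reverse complements and ambiguous bases, which this sketch ignores.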

Prioritising positively selected variants in whole-genome sequencing data using FineMAV

BMC Bioinformatics ◽

10.1186/s12859-021-04506-9 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Fadilla Wahyudi ◽

Farhang Aghakhanian ◽

Sadequr Rahman ◽

Yik-Ying Teo ◽

Michał Szpak ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Population Genomics ◽

Software Tool ◽

Human Populations ◽

Whole Genome ◽

Sequencing Data ◽

1000 Genomes Project ◽

1000 Genomes ◽

Genome Browsers

Abstract Background: In population genomics, polymorphisms that are highly differentiated between geographically separated populations are often suggestive of Darwinian positive selection. Genomic scans have highlighted several such regions in African and non-African populations, but only a handful of these have functional data that clearly associates candidate variations driving the selection process. Fine-Mapping of Adaptive Variation (FineMAV) was developed to address this in a high-throughput manner using population based whole-genome sequences generated by the 1000 Genomes Project. It pinpoints positively selected genetic variants in sequencing data by prioritizing high frequency, population-specific and functional derived alleles. Results: We developed a stand-alone software that implements the FineMAV statistic. To graphically visualise the FineMAV scores, it outputs the statistics as bigWig files, which is a common file format supported by many genome browsers. It is available as a command-line and graphical user interface. The software was tested by replicating the FineMAV scores obtained using 1000 Genomes Project African, European, East and South Asian populations and subsequently applied to whole-genome sequencing datasets from Singapore and China to highlight population specific variants that can be subsequently modelled. The software tool is publicly available at https://github.com/fadilla-wahyudi/finemav. Conclusions: The software tool described here determines genome-wide FineMAV scores, using low or high-coverage whole-genome sequencing datasets, that can be used to prioritize a list of population specific, highly differentiated candidate variants for in vitro or in vivo functional screens. The tool displays these scores on the human genome browsers for easy visualisation, annotation and comparison between different genomic regions in worldwide human populations.

High Coverage Whole Genome Sequencing of the Expanded 1000 Genomes Project Cohort Including 602 Trios

SSRN Electronic Journal ◽

10.2139/ssrn.3967671 ◽

2021 ◽

Author(s):

Marta Byrska-Bishop ◽

Uday S. Evani ◽

Xuefang Zhao ◽

Anna O. Basile ◽

Haley J. Abel ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Whole Genome ◽

1000 Genomes Project ◽

High Coverage ◽

1000 Genomes

Whole genome sequencing data of multiple individuals of Pakistani descent

Scientific Data ◽

10.1038/s41597-020-00664-2 ◽

2020 ◽

Vol 7 (1) ◽

Author(s):

Shahid Y. Khan ◽

Muhammad Ali ◽

Mei-Chong W. Lee ◽

Zhiwei Ma ◽

Pooja Biswas ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Sequence Data ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Asian Populations ◽

Ethnic Populations ◽

Novel Variants ◽

Intergenic Regions

Abstract Here we report whole genome sequencing of four individuals (H3, H4, H5, and H6) from a family of Pakistani descent. Whole genome sequencing yielded 1084.92, 894.73, 1068.62, and 1005.77 million mapped reads corresponding to 162.73, 134.21, 160.29, and 150.86 Gb sequence data and 52.49x, 43.29x, 51.70x, and 48.66x average coverage for H3, H4, H5, and H6, respectively. We identified 3,529,659, 3,478,495, 3,407,895, and 3,426,862 variants in the genomes of H3, H4, H5, and H6, respectively, including 1,668,024 variants common in the four genomes. Further, we identified 42,422, 39,824, 28,599, and 35,206 novel variants in the genomes of H3, H4, H5, and H6, respectively. A major fraction of the variants identified in the four genomes reside within the intergenic regions of the genome. Single nucleotide polymorphism (SNP) genotype based comparative analysis with ethnic populations of 1000 Genomes database linked the ancestry of all four genomes with the South Asian populations, which was further supported by mitochondria based haplogroup analysis. In conclusion, we report whole genome sequencing of four individuals of Pakistani descent.
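
The read counts, sequence yields, and coverage figures quoted above are linked by simple arithmetic, which the following Python sketch reproduces. Two values are assumptions not stated in the abstract: a read length of 150 bp and a genome size of 3.1 Gb, both chosen because they are consistent with the reported numbers. The function names are illustrative.

```python
# Assumptions (not stated in the abstract): 150 bp reads and a 3.1 Gb genome.
READ_LENGTH_BP = 150
GENOME_SIZE_GB = 3.1

# Mapped reads in millions, as reported for each individual.
MAPPED_READS_MILLIONS = {"H3": 1084.92, "H4": 894.73, "H5": 1068.62, "H6": 1005.77}


def yield_gb(reads_millions, read_length_bp=READ_LENGTH_BP):
    """Total sequence yield in gigabases from a read count given in millions."""
    return reads_millions * 1e6 * read_length_bp / 1e9


def mean_coverage(total_gb, genome_size_gb=GENOME_SIZE_GB):
    """Average depth of coverage: sequenced bases divided by genome size."""
    return total_gb / genome_size_gb


if __name__ == "__main__":
    for sample, reads in MAPPED_READS_MILLIONS.items():
        gb = yield_gb(reads)
        print(f"{sample}: {gb:.2f} Gb, {mean_coverage(gb):.2f}x")
```

Under these assumptions the computed yields and depths agree with the reported values to within rounding.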

Optimizing Genomic Selection in Dezhou Donkey Using Low Coverage Whole Genome Sequencing

10.21203/rs.3.rs-607740/v1 ◽

2021 ◽

Author(s):

Changheng Zhao ◽

Jun Teng ◽

Xinhao Zhang ◽

Dan Wang ◽

Xinyi Zhang ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genomic Selection ◽

Genome Sequencing ◽

Sequence Data ◽

Low Cost ◽

Imputation Accuracy ◽

Genotype Imputation ◽

Whole Genome Sequence ◽

Whole Genome ◽

Low Coverage

Abstract Background: Low coverage whole genome sequencing is a low-cost genotyping technology. Combining with genotype imputation approaches, it is likely to become a critical component of cost-efficient genomic selection programs in agricultural livestock. Here, we used the low-coverage sequence data of 617 Dezhou donkeys to investigate the performance of genotype imputation for low coverage whole genome sequence data and genomic selection based on the imputed genotype data. The specific aims were: (i) to measure the accuracy of genotype imputation under different sequencing depths, sample sizes, MAFs, and imputation pipelines; and (ii) to assess the accuracy of genomic selection under different marker densities derived from the imputed sequence data, different strategies for constructing the genomic relationship matrixes, and single- vs multi-trait models. Results: We found that a high imputation accuracy (> 0.95) can be achieved for sequence data with sequencing depth as low as 1x and the number of sequenced individuals equal to 400. For genomic selection, the best performance was obtained by using a marker density of 410K and a G matrix constructed using marker dosage information. Multi-trait GBLUP performed better than single-trait GBLUP. Conclusions: Our study demonstrates that low coverage whole genome sequencing would be a cost-effective method for genomic selection in Dezhou Donkey.

Large-Scale Whole-Genome Sequencing of Three Diverse Asian Populations in Singapore

Cell ◽

10.1016/j.cell.2019.09.019 ◽

2019 ◽

Vol 179 (3) ◽

pp. 736-749.e15 ◽

Cited By ~ 17

Author(s):

Degang Wu ◽

Jinzhuang Dou ◽

Xiaoran Chai ◽

Claire Bellis ◽

Andreas Wilm ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Whole Genome ◽

Asian Populations

Whole-genome sequencing of Burkholderia pseudomallei from an urban melioidosis hot spot reveals a fine-scale population structure and localised spatial clustering in the environment

Scientific Reports ◽

10.1038/s41598-020-62300-8 ◽

2020 ◽

Vol 10 (1) ◽

Cited By ~ 1

Author(s):

Audrey Rachlin ◽

Mark Mayo ◽

Jessica R. Webb ◽

Mariana Kleinecke ◽

Vanessa Rigas ◽

...

Keyword(s):

Population Structure ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Burkholderia Pseudomallei ◽

Spatial Clustering ◽

Hot Spot ◽

Whole Genome ◽

Fine Scale ◽

Scale Population

A population-specific reference panel for improved genotype imputation in African Americans

Communications Biology ◽

10.1038/s42003-021-02777-9 ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Jared O’Connell ◽

Taedong Yun ◽

Meghan Moreno ◽

Helen Li ◽

Nadia Litterman ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

African Ancestry ◽

Genotype Imputation ◽

Whole Genome Sequencing Data ◽

Specific Reference ◽

Whole Genome ◽

Sequencing Data ◽

High Quality ◽

Sub Saharan

Abstract There is currently a dearth of accessible whole genome sequencing (WGS) data for individuals residing in the Americas with Sub-Saharan African ancestry. We generated whole genome sequencing data at intermediate (15×) coverage for 2,294 individuals with large amounts of Sub-Saharan African ancestry, predominantly Atlantic African admixed with varying amounts of European and American ancestry. We performed extensive comparisons of variant callers, phasing algorithms, and variant filtration on these data to construct a high quality imputation panel containing data from 2,269 unrelated individuals. With the exception of the TOPMed imputation server (which notably cannot be downloaded), our panel substantially outperformed other available panels when imputing African American individuals. The raw sequencing data, variant calls and imputation panel for this cohort are all freely available via dbGaP and should prove an invaluable resource for further study of admixed African genetics.

Genotyping by low-coverage whole-genome sequencing in intercross pedigrees from outbred founders: a cost efficient approach

10.1101/421768 ◽

2018 ◽

Author(s):

Yanjun Zan ◽

Thibaut Payen ◽

Mette Lillie ◽

Christa F. Honaker ◽

Paul B. Siegel ◽

...

Keyword(s):

High Resolution ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Sequence Data ◽

Genotype Imputation ◽

Whole Genome ◽

Efficient Manner ◽

Founder Line ◽

Cost Efficient ◽

Low Coverage

Abstract Background: Experimental intercrosses between outbred founder populations are powerful resources for mapping loci contributing to complex traits (Quantitative Trait Loci or QTL). Here, we present an approach and accompanying software for high-resolution genotype imputation in such populations using whole-genome high coverage sequence data on founder individuals (∼30×) and low coverage sequence data on intercross individuals (∼0.4×). The method is illustrated in a large F2 pedigree between lines of chickens that have been divergently selected for 40 generations for the same trait (body weight at 8 weeks of age). Results: Described is how hundreds of individuals were whole-genome sequenced in a cost- and time-efficient manner using a Tn5-based library preparation protocol optimized for this application. In total, 7.6M markers segregated in this pedigree and 10.0 to 13.7% were informative for imputing the founder line genotypes within the F0-F2 families. The genotypes imputed from low coverage sequence data were consistent with the founder line genotypes estimated using SNP and microsatellite markers both at individual imputed sites (92%) and across the genome of individual chickens (93%). The resolution of the recombination breakpoints was high with 50% being resolved within <10kb. Conclusions: A method for genotype imputation from low-coverage whole-genome sequencing in outbred intercrosses is described and evaluated. By applying it to an outbred chicken F2 cross it is illustrated that it provides high quality, high-resolution genotypes in a time and cost efficient manner.
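
A rough way to see why imputation is needed at ∼0.4× but not at ∼30× is to ask what fraction of sites is expected to be covered by at least one read. The Python sketch below assumes read depth at a site follows a Poisson distribution with the mean depth as its rate. This is a common simplification and an assumption here, not something stated in the abstract; real libraries show more uneven coverage. The function name is illustrative.

```python
import math


def fraction_covered(mean_depth, min_reads=1):
    """P(X >= min_reads) for X ~ Poisson(mean_depth), an idealised depth model."""
    p_below = sum(
        math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)
        for k in range(min_reads)
    )
    return 1.0 - p_below


if __name__ == "__main__":
    # Depths quoted in the abstract: ~0.4x for intercross birds, ~30x for founders.
    for depth in (0.4, 30.0):
        print(f"{depth:>4}x: {fraction_covered(depth):.1%} of sites with >= 1 read")
```

Under this model only about a third of sites receive any read at 0.4×, so most genotypes in the intercross individuals have to be inferred from the founder haplotypes.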
