Human Reference Genome and a High Contiguity Ethnic Genome AK1

Mapping Intimacies ◽

10.1101/795807 ◽

2019 ◽

Author(s):

Jina Kim ◽

Joohon Sung ◽

Kyudong Han ◽

Wooseok Lee ◽

Seyoung Mun ◽

...

Keyword(s):

Reference Genome ◽

Human Genetics ◽

Missing Information ◽

East Asians ◽

Human Reference Genome ◽

Multiple Populations ◽

The Common ◽

Genome Assemblies ◽

Reference Genomes ◽

Unmapped Reads

Abstract: Studies have shown that the current human reference genome (GRCh38) might miss information for some populations, but “exactly what we miss” is still elusive due to the lower contiguity of non-reference genomes. We juxtaposed GRCh38 with a high-contiguity genome assembly, AK1, to show that ∼1.8% (∼53.4 Mbp) of AK1 sequences are missing from GRCh38, with ∼0.76% (∼22.2 Mbp) of ectopic chromosomes. The unique AK1 sequences harbored ∼1,390 putative coding elements. Used as a reference, ∼5.3 Mb (∼0.2%) of the AK1 sequences aligned to and recovered the “unmapped” reads of fourteen individuals (5 East-Asians, 4 Europeans, and 5 Africans). The regions to which the “unmapped” reads aligned included 110 common (shared between ≥2 individuals) and 38 globally (≥7 individuals) missing regions with 25 candidate coding elements. We verified that many of the common missing regions exist in multiple populations and in chimpanzee DNA. Our study illuminates not only the discovery of missing information but also the use of highly precise ethnic genomes in understanding human genetics.


A High Quality Asian Genome Assembly Identifies Features of Common Missing Regions

Genes ◽

10.3390/genes11111350 ◽

2020 ◽

Vol 11 (11) ◽

pp. 1350

Author(s):

Jina Kim ◽

Joohon Sung ◽

Kyudong Han ◽

Wooseok Lee ◽

Seyoung Mun ◽

...

Keyword(s):

Genome Assembly ◽

Genome Analysis ◽

Reference Genome ◽

East Asians ◽

High Quality ◽

Human Reference Genome ◽

The Common ◽

Occurrence Mechanism ◽

Genomic Regions ◽

Unmapped Reads

The current human reference genome (GRCh38), with its superior quality, has contributed significantly to genome analysis. However, GRCh38 may still underrepresent the ethnic genome, specifically for Asians, though exactly what we are missing is still elusive. Here, we juxtaposed GRCh38 with a high-contiguity genome assembly of one Korean (AK1) to show that part of the AK1 genome is missing in GRCh38 and that the missing regions harbored ~1390 putative coding elements. Furthermore, we found that multiple populations shared certain parts of the missing genome when we analyzed the “unmapped” (to GRCh38) reads of fourteen individuals (five East-Asians, four Europeans, and five Africans), amounting to ~5.3 Mb (~0.2% of AK1) of the total genomic regions. The AK1 regions recovered from the “unmapped reads”, which were the estimated missing regions that did not exist in GRCh38, harbored candidate coding elements. We verified that most of the common (shared by ≥7 individuals) missing regions exist in human and chimpanzee DNA. Moreover, we identified the occurrence mechanism and ethnic heterogeneity as well as the presence of the common missing regions. This study illuminates a potential advantage of using a pangenome reference and brings up the need for further investigations on the various features of regions globally missed in GRCh38.
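The abstract does not say how the “unmapped” reads were extracted, so the following is only a minimal sketch of one common first step: selecting records whose unmapped bit (0x4 in the SAM FLAG field) is set. The function name `unmapped_reads` and the example records are hypothetical and exist only for illustration; this is not the authors' pipeline.

```python
# Minimal sketch (assumption: plain-text SAM input, not the authors' actual workflow).
# In the SAM format, bit 0x4 of the FLAG field marks a segment as unmapped.

UNMAPPED = 0x4


def unmapped_reads(sam_lines):
    """Yield (read_name, sequence) for every SAM record flagged as unmapped."""
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        if flag & UNMAPPED:
            yield name, seq


# Hypothetical records: read1 is aligned, read2 and read3 are unmapped.
example_sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTGGGG\tIIIIIIII",
    "read3\t77\t*\t0\t0\t*\t*\t0\t0\tCCCCAAAA\tIIIIIIII",
]

recovered = list(unmapped_reads(example_sam))
for name, seq in recovered:
    print(name, seq)
```

Reads selected this way could then be realigned against an alternative assembly such as AK1; that realignment step is outside the scope of this sketch.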


dnAQET: a framework to compute a consolidated metric for benchmarking quality of de novo assemblies

BMC Genomics ◽

10.1186/s12864-019-6070-x ◽

2019 ◽

Vol 20 (1) ◽

Author(s):

Gokhan Yavas ◽

Huixiao Hong ◽

Wenming Xiao

Keyword(s):

Quality Assessment ◽

Genome Assembly ◽

Reference Genome ◽

De Novo ◽

Quality Score ◽

De Novo Genome Assembly ◽

Genome Assemblies ◽

Reference Genomes ◽

Better Than

Abstract
Background: Accurate de novo genome assembly has become a reality with the advancements in sequencing technology. With the ever-increasing number of de novo genome assembly tools, assessing the quality of assemblies has become of great importance in genome research. Although many quality metrics have been proposed and software tools for calculating those metrics have been developed, the existing tools do not produce a unified measure to reflect the overall quality of an assembly.
Results: To address this issue, we developed the de novo Assembly Quality Evaluation Tool (dnAQET) that generates a unified metric for benchmarking the quality assessment of assemblies. Our framework first calculates individual quality scores for the scaffolds/contigs of an assembly by aligning them to a reference genome. Next, it computes a quality score for the assembly using its overall reference genome coverage, the quality score distribution of its scaffolds and the redundancy identified in it. Using synthetic assemblies randomly generated from the latest human genome build, various builds of the reference genomes for five organisms and six de novo assemblies for sample NA24385, we tested dnAQET to assess its capability for benchmarking quality evaluation of genome assemblies. For synthetic data, our quality score increased with decreasing number of misassemblies and redundancy and increasing average contig length and coverage, as expected. For genome builds, the dnAQET quality score calculated for a more recent reference genome was better than the score for an older version. To compare with some of the most frequently used measures, 13 other quality measures were calculated. The quality score from dnAQET was found to be better than all other measures in terms of consistency with the known quality of the reference genomes, indicating that dnAQET is reliable for benchmarking quality assessment of de novo genome assemblies.
Conclusions: dnAQET is a scalable framework designed to evaluate a de novo genome assembly based on the aggregated quality of its scaffolds (or contigs). Our results demonstrated that the dnAQET quality score is reliable for benchmarking quality assessment of genome assemblies. dnAQET can help researchers to identify the most suitable assembly tools and to select the high-quality assemblies generated.


Towards a reference genome that captures global genetic diversity

Nature Communications ◽

10.1038/s41467-020-19311-w ◽

2020 ◽

Vol 11 (1) ◽

Author(s):

Karen H. Y. Wong ◽

Walfred Ma ◽

Chun-Yu Wei ◽

Erh-Chan Yeh ◽

Wan-Jia Lin ◽

...

Keyword(s):

Genetic Diversity ◽

Reference Genome ◽

Regulatory Elements ◽

Human Populations ◽

Single Individual ◽

Rna Seq ◽

Human Reference Genome ◽

Reference Sequences ◽

Genome Annotations ◽

Unmapped Reads

Abstract: The current human reference genome is predominantly derived from a single individual and it does not adequately reflect human genetic diversity. Here, we analyze 338 high-quality human assemblies of genetically divergent human populations to identify missing sequences in the human reference genome with breakpoint resolution. We identify 127,727 recurrent non-reference unique insertions spanning 18,048,877 bp, some of which disrupt exons and known regulatory elements. To improve genome annotations, we linearly integrate these sequences into the chromosomal assemblies and construct a Human Diversity Reference. Leveraging this reference, an average of 402,573 previously unmapped reads can be recovered for a given genome sequenced to ~40X coverage. Transcriptomic diversity among these non-reference sequences can also be directly assessed. We successfully map tens of thousands of previously discarded RNA-Seq reads to this reference and identify transcription evidence in 4781 gene loci, underlining the importance of these non-reference sequences in functional genomics. Our extensive datasets are important advances toward a comprehensive reference representation of global human genetic diversity.


Liftoff: accurate mapping of gene annotations

Bioinformatics ◽

10.1093/bioinformatics/btaa1016 ◽

2020 ◽

Author(s):

Alaina Shumate ◽

Steven L Salzberg

Keyword(s):

Reference Genome ◽

Supplementary Information ◽

Closely Related Species ◽

Protein Coding ◽

Human Reference Genome ◽

Sequence Identity ◽

Gene Annotations ◽

Genome Assemblies ◽

Average Sequence Identity ◽

High Quality Genome

Abstract
Motivation: Improvements in DNA sequencing technology and computational methods have led to a substantial increase in the creation of high-quality genome assemblies of many species. To understand the biology of these genomes, annotation of gene features and other functional elements is essential; however, for most species, only the reference genome is well-annotated.
Results: One strategy to annotate new or improved genome assemblies is to map or ‘lift over’ the genes from a previously-annotated reference genome. Here we describe Liftoff, a new genome annotation lift-over tool capable of mapping genes between two assemblies of the same or closely-related species. Liftoff aligns genes from a reference genome to a target genome and finds the mapping that maximizes sequence identity while preserving the structure of each exon, transcript, and gene. We show that Liftoff can accurately map 99.9% of genes between two versions of the human reference genome with an average sequence identity >99.9%. We also show that Liftoff can map genes across species by successfully lifting over 98.3% of human protein-coding genes to a chimpanzee genome assembly with 98.2% sequence identity.
Availability and Implementation: Liftoff can be installed via bioconda and PyPI. Additionally, the source code for Liftoff is available at https://github.com/agshumate/Liftoff
Supplementary information: Supplementary data are available at Bioinformatics online.


Assembly and Annotation of an Ashkenazi Human Reference Genome

10.1101/2020.03.18.997395 ◽

2020 ◽

Cited By ~ 2

Author(s):

Alaina Shumate ◽

Aleksey V. Zimin ◽

Rachel M. Sherman ◽

Daniela Puiu ◽

Justin M. Wagner ◽

...

Keyword(s):

Dna Sequences ◽

Reference Genome ◽

Gene Families ◽

Gene Content ◽

Specific Reference ◽

Protein Coding ◽

Human Reference Genome ◽

Protein Coding Genes ◽

Reference Genomes ◽

Similar Gene

Abstract: Here we describe the assembly and annotation of the genome of an Ashkenazi individual and the creation of a new, population-specific human reference genome. This genome is more contiguous and more complete than GRCh38, the latest version of the human reference genome, and is annotated with highly similar gene content. The Ashkenazi reference genome, Ash1, contains 2,973,118,650 nucleotides as compared to 2,937,639,212 in GRCh38. Annotation identified 20,157 protein-coding genes, of which 19,563 are >99% identical to their counterparts on GRCh38. Most of the remaining genes have small differences. 40 of the protein-coding genes in GRCh38 are missing from Ash1; however, all of these genes are members of multi-gene families for which Ash1 contains other copies. 11 genes appear on different chromosomes from their homologs in GRCh38. Alignment of DNA sequences from an unrelated Ashkenazi individual to Ash1 identified ~1 million fewer homozygous SNPs than alignment of those same sequences to the more-distant GRCh38 genome, illustrating one of the benefits of population-specific reference genomes.


The Dark Matter of Large Cereal Genomes: Long Tandem Repeats

International Journal of Molecular Sciences ◽

10.3390/ijms20102483 ◽

2019 ◽

Vol 20 (10) ◽

pp. 2483 ◽

Cited By ~ 5

Author(s):

Veronika Kapustová ◽

Zuzana Tulpová ◽

Helena Toegelová ◽

Petr Novák ◽

Jiří Macas ◽

...

Keyword(s):

Bread Wheat ◽

Tandem Repeats ◽

Reference Genome ◽

Sequence Data ◽

Repetitive Sequences ◽

Short Read Sequence ◽

Cereal Genomes ◽

Genome Assemblies ◽

Reference Genomes ◽

Size Estimates

Reference genomes of important cereals, including barley, emmer wheat and bread wheat, were released recently. Their comparison with genome size estimates obtained by flow cytometry indicated that the assemblies represent not more than 88–98% of the complete genome. This work is aimed at identifying the missing parts in two cereal genomes and proposing techniques to make the assemblies more complete. We focused on tandemly organised repetitive sequences, known to be underrepresented in genome assemblies generated from short-read sequence data. Our study found arrays of three tandem repeats with unit sizes of 1242 to 2726 bp present in the bread wheat reference genome generated from short reads. However, this and another wheat genome assembly employing long PacBio reads failed in integrating correctly the 2726-bp repeat in the pseudomolecule context. This suggests that tandem repeats of this size, frequently incorporated in unassigned scaffolds, may contribute to shrinking of pseudomolecules without reducing size of the entire assembly. We demonstrate how this missing information may be added to the pseudomolecules with the aid of nanopore sequencing of individual BAC clones and optical mapping. Using the latter technique, we identified and localised a 470-kb long array of 45S ribosomal DNA absent from the reference genome of barley.


Refgenie: a reference genome resource manager

GigaScience ◽

10.1093/gigascience/giz149 ◽

2020 ◽

Vol 9 (2) ◽

Cited By ~ 3

Author(s):

Michał Stolarczyk ◽

Vincent P Reuter ◽

Jason P Smith ◽

Neal E Magee ◽

Nathan C Sheffield

Keyword(s):

Genome Analysis ◽

High Throughput Sequencing ◽

Reference Genome ◽

Sequencing Analysis ◽

Resource Manager ◽

Computing Environments ◽

Downstream Analysis ◽

Genome Assemblies ◽

Server Application ◽

Reference Genomes

Abstract
Background: Reference genome assemblies are essential for high-throughput sequencing analysis projects. Typically, genome assemblies are stored on disk alongside related resources; e.g., many sequence aligners require the assembly to be indexed. The resulting indexes are broadly applicable for downstream analysis, so it makes sense to share them. However, there is no simple tool to do this.
Results: Here, we introduce refgenie, a reference genome assembly asset manager. Refgenie makes it easier to organize, retrieve, and share genome analysis resources. In addition to genome indexes, refgenie can manage any files related to reference genomes, including sequences and annotation files. Refgenie includes a command line interface and a server application that provides a RESTful API, so it is useful for both tool development and analysis.
Conclusions: Refgenie streamlines sharing genome analysis resources among groups and across computing environments. Refgenie is available at https://refgenie.databio.org.


Genome Graphs

10.1101/101378 ◽

2017 ◽

Cited By ~ 34

Author(s):

Adam M. Novak ◽

Glenn Hickey ◽

Erik Garrison ◽

Sean Blum ◽

Abram Connelly ◽

...

Keyword(s):

Reference Genome ◽

Human Genetics ◽

De Novo ◽

Variant Calling ◽

Reference Structure ◽

Read Mapping ◽

Human Reference Genome ◽

A Genome ◽

Universal Reference ◽

Genome Graph

Abstract: There is increasing recognition that a single, monoploid reference genome is a poor universal reference structure for human genetics, because it represents only a tiny fraction of human variation. Adding this missing variation results in a structure that can be described as a mathematical graph: a genome graph. We demonstrate that, in comparison to the existing reference genome (GRCh38), genome graphs can substantially improve the fractions of reads that map uniquely and perfectly. Furthermore, we show that this fundamental simplification of read mapping transforms the variant calling problem from one in which many non-reference variants must be discovered de novo to one in which the vast majority of variants are simply re-identified within the graph. Using standard benchmarks as well as a novel reference-free evaluation, we show that a simplistic variant calling procedure on a genome graph can already call variants at least as well as, and in many cases better than, a state-of-the-art method on the linear human reference genome. We anticipate that graph-based references will supplant linear references in humans and in other applications where cohorts of sequenced individuals are available.
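To make the idea of a genome graph concrete, here is a toy sketch, not the authors' data structure or mapper: a four-node graph with one SNP "bubble" (alleles A and G). The node sequences, the helper names `path_sequences` and `maps_perfectly`, and the naive substring check are all assumptions chosen for illustration. The sketch shows a read carrying the non-reference allele that has no perfect match on the linear reference but does match a path through the graph.

```python
# Toy genome graph (illustrative assumption, not a real mapper):
# nodes hold sequence segments, edges connect consecutive segments.
nodes = {1: "ACGT", 2: "A", 3: "G", 4: "TTCA"}
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}  # nodes 2 and 3 form a SNP bubble


def path_sequences(nodes, edges, start):
    """Yield the sequence spelled by every path from `start` to a sink (acyclic graphs only)."""
    stack = [(start, nodes[start])]
    while stack:
        node, seq = stack.pop()
        if not edges[node]:
            yield seq
        for nxt in edges[node]:
            stack.append((nxt, seq + nodes[nxt]))


def maps_perfectly(read, haplotypes):
    """Naive exact-match check: is the read a substring of any haplotype?"""
    return any(read in haplotype for haplotype in haplotypes)


linear_reference = ["ACGTATTCA"]  # the path through node 2 only
graph_haplotypes = sorted(path_sequences(nodes, edges, 1))
read = "GTGTT"  # carries the non-reference G allele

print("linear reference:", maps_perfectly(read, linear_reference))
print("genome graph:    ", maps_perfectly(read, graph_haplotypes))
```

Enumerating every path is only workable for a toy example; practical graph mappers index the graph rather than expanding all haplotypes.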


Refgenie: a reference genome resource manager

10.1101/698704 ◽

2019 ◽

Cited By ~ 2

Author(s):

Michal Stolarczyk ◽

Vincent P. Reuter ◽

Neal E. Magee ◽

Nathan C. Sheffield

Keyword(s):

High Throughput Sequencing ◽

Reference Genome ◽

Sequencing Analysis ◽

Tool Development ◽

Resource Manager ◽

Reference Genome Assembly ◽

Downstream Analysis ◽

Genome Assemblies ◽

Server Application ◽

Reference Genomes

Reference genome assemblies are essential for high-throughput sequencing analysis projects. Typically, genome assemblies are stored on disk alongside related resources; for example, many sequence aligners require the assembly to be indexed. The resulting indexes are broadly applicable for downstream analysis, so it makes sense to share them. However, there is no simple tool to do this. To this end, we introduce refgenie, a reference genome assembly asset manager. Refgenie makes it easier to organize, retrieve, and share genome analysis resources. In addition to genome indexes, refgenie can manage any files related to reference genomes, including sequences and annotation files. Refgenie includes a command-line interface and a server application that provides a RESTful API, so it is useful for both tool development and analysis.
Availability: https://refgenie.databio.org


Association Mapping from Sequencing Reads Using K-mers

10.1101/141267 ◽

2017 ◽

Cited By ~ 5

Author(s):

Atif Rahman ◽

Ingileif Hallgrímsdóttir ◽

Michael B. Eisen ◽

Lior Pachter

Keyword(s):

Cardiovascular Diseases ◽

Reference Genome ◽

Heart Diseases ◽

Association Studies ◽

South Asians ◽

Simulated Data ◽

Genome Wide Association Studies ◽

Human Reference Genome ◽

1000 Genomes ◽

Reference Genomes

Abstract: Genome-wide association studies (GWAS) rely on microarrays, or more recently mapping of whole-genome sequencing reads, to genotype individuals. The reliance on prior sequencing of a reference genome for the organism on which the association study is to be performed limits the scope of association studies, and also precludes the identification of differences between cases and controls outside of the reference. We present an alignment-free method for association studies that is based on counting k-mers in sequencing reads, testing for associations directly between k-mers and the trait of interest, and local assembly of the statistically significant k-mers to identify sequence differences. Results with simulated data and an analysis of the 1000 genomes data provide a proof of principle for the approach. In a pairwise comparison of the Toscani in Italia (TSI) and the Yoruba in Ibadan, Nigeria (YRI) populations we find that sequences identified by our method largely agree with results obtained using standard GWAS based on variant calling from mapped reads. However, unlike standard GWAS, we find that our method identifies associations with structural variations and sites not present in the reference genome, revealing sequences absent from the human reference genome. We also analyze data from the Bengali from Bangladesh (BEB) population to explore a possible genetic basis for the high rate of mortality due to cardiovascular diseases (CVD) among South Asians and find significant differences in frequencies of a number of non-synonymous variants in genes linked to CVDs between BEB and TSI samples, including the site rs1042034, which has been associated with higher risk of CVDs previously, and the nearby rs676210 in the Apolipoprotein B (ApoB) gene.
Author Summary: We present a method for associating regions in genomes to traits or diseases. The method is based on finding differences in frequencies of short strings of letters in sequencing reads and does not require reads to be aligned to a reference genome. This makes it applicable to the study of organisms with no or incomplete reference genomes. We test our method with simulated data and sequencing data from the 1000 genomes project and find agreement with the conventional approach based on alignment to a reference genome. In addition, our method finds associations with sequences not in reference genomes and reveals sequences missing from the human reference genome. We also explore high rates of mortality due to cardiovascular diseases among South Asians and find prevalence of variations in genes associated with heart diseases in samples from the Bengali from Bangladesh population, including one that has been reported to be associated with early onset of cardiovascular diseases.
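The first two steps described above (counting k-mers in reads, then testing each k-mer for association with the trait) can be sketched as follows. This is a simplified illustration under stated assumptions: the abstract does not specify the statistical test, so a plain Pearson chi-square on a 2x2 table is swapped in here, the reads and the helper names `count_kmers` and `chi_square` are hypothetical, and the local assembly step is omitted.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def chi_square(a, b, total_a, total_b):
    """Pearson chi-square (1 d.f.) for a k-mer seen `a` times among `total_a`
    k-mers in group A and `b` times among `total_b` k-mers in group B.
    Assumption: a stand-in statistic, not the test used in the paper."""
    n = total_a + total_b
    present = a + b
    absent = n - present
    if present == 0 or absent == 0:
        return 0.0
    # 2x2 table: [[a, b], [total_a - a, total_b - b]]
    numerator = n * (a * (total_b - b) - b * (total_a - a)) ** 2
    return numerator / (present * absent * total_a * total_b)


# Hypothetical reads: cases and controls differ at a single base.
cases = ["ACGTGA", "ACGTGA"]
controls = ["ACGTTA", "ACGTTA"]
k = 4

case_counts = count_kmers(cases, k)
control_counts = count_kmers(controls, k)
total_cases = sum(case_counts.values())
total_controls = sum(control_counts.values())

scores = {
    kmer: chi_square(case_counts[kmer], control_counts[kmer], total_cases, total_controls)
    for kmer in set(case_counts) | set(control_counts)
}

# K-mers shared equally by both groups score zero; group-specific k-mers rank highest.
for kmer, score in sorted(scores.items(), key=lambda item: (-item[1], item[0])):
    print(kmer, round(score, 3))
```

Because nothing here depends on a reference sequence, k-mers from regions absent in the reference are scored like any others, which is the property the abstract emphasizes.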
