A High Quality Asian Genome Assembly Identifies Features of Common Missing Regions

Jina Kim; Joohon Sung; Kyudong Han; Wooseok Lee; Seyoung Mun; Jooyeon Lee; Kunhyung Bahk; Inchul Yang; Young-Kyung Bae; Changhoon Kim; Jong-Il Kim; Jeong-Sun Seo

doi:10.3390/genes11111350

Genes ◽

10.3390/genes11111350 ◽

2020 ◽

Vol 11 (11) ◽

pp. 1350

Author(s):

Jina Kim ◽

Joohon Sung ◽

Kyudong Han ◽

Wooseok Lee ◽

Seyoung Mun ◽

...

Keyword(s):

Genome Assembly ◽

Genome Analysis ◽

Reference Genome ◽

East Asians ◽

High Quality ◽

Human Reference Genome ◽

The Common ◽

Occurrence Mechanism ◽

Genomic Regions ◽

Unmapped Reads

The current human reference genome (GRCh38), with its superior quality, has contributed significantly to genome analysis. However, GRCh38 may still underrepresent the ethnic genome, specifically for Asians, though exactly what we are missing is still elusive. Here, we juxtaposed GRCh38 with a high-contiguity genome assembly of one Korean (AK1) to show that part of the AK1 genome is missing in GRCh38 and that the missing regions harbored ~1390 putative coding elements. Furthermore, we found that multiple populations shared certain parts of the missing genome when we analyzed the “unmapped” (to GRCh38) reads of fourteen individuals (five East Asians, four Europeans, and five Africans), amounting to ~5.3 Mb (~0.2% of AK1) of the total genomic regions. The AK1 regions recovered from the “unmapped reads”, which were the estimated missing regions that did not exist in GRCh38, harbored candidate coding elements. We verified that most of the common (shared by ≥7 individuals) missing regions exist in human and chimpanzee DNA. Moreover, we identified the occurrence mechanism and ethnic heterogeneity, as well as the presence, of the common missing regions. This study illuminates a potential advantage of using a pangenome reference and brings up the need for further investigations on the various features of regions globally missed in GRCh38.

Human Reference Genome and a High Contiguity Ethnic Genome AK1

10.1101/795807 ◽

2019 ◽

Author(s):

Jina Kim ◽

Joohon Sung ◽

Kyudong Han ◽

Wooseok Lee ◽

Seyoung Mun ◽

...

Keyword(s):

Reference Genome ◽

Human Genetics ◽

Missing Information ◽

East Asians ◽

Human Reference Genome ◽

Multiple Populations ◽

The Common ◽

Genome Assemblies ◽

Reference Genomes ◽

Unmapped Reads

Abstract Studies have shown that the current human reference genome (GRCh38) might miss information for some populations, but “exactly what we miss” is still elusive due to the lower contiguity of non-reference genomes. We juxtaposed GRCh38 with a high-contiguity genome assembly, AK1, to show that ∼1.8% (∼53.4 Mbp) of AK1 sequences are missing from GRCh38, with ∼0.76% (∼22.2 Mbp) of ectopic chromosomes. The unique AK1 sequences harbored ∼1,390 putative coding elements. Used as a reference, ∼5.3 Mb (∼0.2%) of the AK1 sequences aligned to and recovered the “unmapped” reads of fourteen individuals (5 East Asians, 4 Europeans, and 5 Africans). The regions to which “unmapped” reads aligned included 110 common (shared between ≥2 individuals) and 38 globally (≥7 individuals) missing regions with 25 candidate coding elements. We verified that many of the common missing regions exist in multiple populations and in chimpanzee DNA. Our study illuminates not only the discovery of missing information but also the use of highly precise ethnic genomes in understanding human genetics.

Whole-genome assembly of Ganoderma leucocontextum (Ganodermataceae, Fungi) discovered from the Tibetan Plateau of China

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab337 ◽

2021 ◽

Author(s):

Yuanchao Liu ◽

Longhua Huang ◽

Huiping Hu ◽

Manjun Cai ◽

Xiaowei Liang ◽

...

Keyword(s):

Genome Assembly ◽

Southwest China ◽

Reference Genome ◽

Biological Activities ◽

Single Copy ◽

The Tibetan Plateau ◽

Whole Genome ◽

High Quality ◽

Pharmacological Activities ◽

Genetic Studies

Abstract Ganoderma leucocontextum, a newly discovered species of Ganodermataceae in China, has diverse pharmacological activities. G. leucocontextum is widely cultivated in southwest China, but systematic genetic study has been impeded by the lack of a reference genome. Herein, we present the first whole-genome assembly of G. leucocontextum, based on the Illumina and Nanopore platforms, from high-quality DNA extracted from a monokaryon strain (DH-8). The generated genome was 50.05 Mb in size with an N50 scaffold size of 3.06 Mb, 78,206 coding sequences and 13,390 putative genes. Genome completeness was assessed using the Benchmarking Universal Single-Copy Orthologs (BUSCO) tool, which identified 96.55% of the 280 Fungi BUSCO genes. Furthermore, differences in functional genes of secondary metabolites (terpenoids) were analyzed between G. leucocontextum and G. lucidum. G. leucocontextum has more genes related to terpenoid synthesis than G. lucidum, which may be one of the reasons why they exhibit different biological activities. This is the first genome assembly and annotation for G. leucocontextum, which would enrich the toolbox for biological and genetic studies in G. leucocontextum.

Genome Assembly and Population Resequencing Reveal the Geographical Divergence of 'Shanmei' (Rubus corchorifolius)

10.1101/2021.11.22.469527 ◽

2021 ◽

Author(s):

Yinqing Yang ◽

Kang Zhang ◽

Ya Xiao ◽

Lingkui Zhang ◽

Yile Huang ◽

...

Keyword(s):

Genome Assembly ◽

Ancestral Population ◽

Effective Population ◽

High Quality ◽

Local Environments ◽

Rubus Species ◽

Rubus Chingii ◽

Rubus Chingii Hu ◽

Genomic Regions ◽

High Quality Genome

Rubus corchorifolius (Shanmei or mountain berry, 2n = 14) is widely distributed in China, and its fruit has high nutritional and medicinal value. Here, we report a high-quality chromosome-scale genome assembly of Shanmei, with a size of 215.69 Mb and encompassing 26,696 genes. Genome comparisons among Rosaceae species show that Shanmei and Fupenzi (Rubus chingii Hu) are most closely related, followed by black raspberry (Rubus occidentalis). Further resequencing of 101 samples of Shanmei collected from four regions in the provinces of Yunnan, Hunan, Jiangxi and Sichuan in South China reveals that the Hunan population of Shanmei possesses the highest diversity and may represent the relatively more ancestral population. Moreover, the Yunnan population undergoes strong selection based on nucleotide diversity, linkage disequilibrium and the historical effective population size analyses. Furthermore, genes from candidate genomic regions that show strong divergence are significantly enriched in flavonoid biosynthesis and plant hormone signal transduction, indicating the genetic basis of adaptation of Shanmei to the local environments. The high-quality genome sequences and the variome dataset of Shanmei provide valuable resources for breeding applications and for elucidating the genome evolution and ecological adaptation of Rubus species.

Genome assembly of the JD17 soybean provides a new reference genome for comparative genomics

10.1101/2021.11.23.469778 ◽

2021 ◽

Author(s):

Xinxin Yi ◽

Jing Liu ◽

Shengcai Chen ◽

Hao Wu ◽

Min Liu ◽

...

Keyword(s):

Nitrogen Fixation ◽

Genome Assembly ◽

Reference Genome ◽

De Novo ◽

Genomic Analysis ◽

Comparative Genomic ◽

High Quality ◽

Genome Wide ◽

A Genome ◽

Cultivated Soybean

Cultivated soybean (Glycine max) is an important source of protein and oil. Many elite cultivars with different traits have been developed for different conditions. Each soybean strain has its own genetic diversity, and the availability of more high-quality soybean genomes can enhance comparative genomic analysis for identifying the genetic underpinnings of its unique traits. In this study, we constructed a high-quality de novo assembly of an elite soybean cultivar, Jidou 17 (JD17), with chromosome contiguity and high accuracy. We annotated 52,840 gene models and reconstructed 74,054 high-quality full-length transcripts. We performed a genome-wide comparative analysis based on the reference genome of JD17 with three published soybeans (WM82, ZH13 and W05), which identified five large inversions and two large translocations specific to JD17, 20,984–46,912 PAVs spanning 13.1–46.9 Mb in size, and 5–53 large PAV clusters larger than 500 kb. Between JD17 and these three genomes, 1,695,741–3,664,629 SNPs and 446,689–800,489 indels were identified and annotated. Symbiotic nitrogen fixation (SNF) genes were identified and the effects of these variants were further evaluated. It was found that the coding sequences of 9 nitrogen fixation-related genes were greatly affected. The high-quality genome assembly of JD17 can serve as a valuable reference for soybean functional genomics research.

A Chromosome-Scale Genome Assembly Resource for Myriosclerotinia sulcatula Infecting Sedge Grass (Carex sp.)

Molecular Plant-Microbe Interactions ◽

10.1094/mpmi-03-20-0060-a ◽

2020 ◽

Vol 33 (7) ◽

pp. 880-883

Author(s):

Stefan Kusch ◽

Heba M. M. Ibrahim ◽

Catherine Zanchetta ◽

Celine Lopez-Roques ◽

Cecile Donnadieu ◽

...

Keyword(s):

Host Range ◽

Sclerotinia Sclerotiorum ◽

Genome Assembly ◽

Plant Pathogens ◽

Reference Genome ◽

Close Relative ◽

High Quality ◽

Protein Coding ◽

Protein Coding Genes ◽

Reference Genome Assembly

The fungus Myriosclerotinia sulcatula is a close relative of the notorious polyphagous plant pathogens Botrytis cinerea and Sclerotinia sclerotiorum but exhibits a host range restricted to plants from the Carex genus (Cyperaceae family). To date, there are no genomic resources available for fungi in the Myriosclerotinia genus. Here, we present a chromosome-scale reference genome assembly for M. sulcatula. The assembly contains 24 contigs with a total length of 43.53 Mbp, with scaffold N50 of 2,649.7 kbp and N90 of 1,133.1 kbp. BRAKER-predicted gene models were manually curated using WebApollo, resulting in 11,275 protein-coding genes that we functionally annotated. We provide a high-quality reference genome assembly and annotation for M. sulcatula as a resource for studying evolution and pathogenicity in fungi from the Sclerotiniaceae family.
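Several abstracts in this listing report N50 and N90 values. As a reading aid, the following is a minimal sketch of how such statistics are commonly computed, assuming the usual definition (the length L such that sequences of length ≥ L cover at least the stated percentage of the total assembly). The function name `nx` and the toy lengths are illustrative assumptions, not taken from any of the papers listed here.

```python
def nx(lengths, percent=50):
    """Return the Nx statistic (N50 when percent=50, N90 when percent=90).

    Assumed definition: sort sequences from longest to shortest and walk
    down the list until the running total reaches `percent` of the total
    length; the length of the sequence reached at that point is Nx.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    return ordered[-1]


# Hypothetical contig lengths (arbitrary units), for illustration only.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50 = nx(toy_lengths, 50)
n90 = nx(toy_lengths, 90)
print(n50, n90)
```

N90 is never larger than N50 for the same assembly, because more of the shorter sequences are needed to reach 90% of the total length.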

Genome Assembly and Annotation of Botryosphaeria dothidea sdau11-99, a Latent Pathogen of Apple Fruit Ring Rot in China

Plant Disease ◽

10.1094/pdis-06-20-1182-a ◽

2020 ◽

Author(s):

Chengming Yu ◽

Yufei Diao ◽

Quan Lu ◽

Jiaping Zhao ◽

Shengnan Cui ◽

...

Keyword(s):

Genome Sequence ◽

Genome Assembly ◽

Fungal Pathogen ◽

Woody Plants ◽

Reference Genome ◽

Apple Fruit ◽

Botryosphaeria Dothidea ◽

High Quality ◽

Ring Rot ◽

Wide Range

Botryosphaeria dothidea is a latent and important fungal pathogen of a wide range of woody plants. Fruit ring rot caused by B. dothidea is a major disease of apple in China. This study establishes a high-quality, nearly complete and well-annotated genome sequence of B. dothidea strain sdau11-99. The findings of this research provide a reference genome resource for further research on the fruit ring rot pathogen on apple and other hosts.

Towards a reference genome that captures global genetic diversity

Nature Communications ◽

10.1038/s41467-020-19311-w ◽

2020 ◽

Vol 11 (1) ◽

Author(s):

Karen H. Y. Wong ◽

Walfred Ma ◽

Chun-Yu Wei ◽

Erh-Chan Yeh ◽

Wan-Jia Lin ◽

...

Keyword(s):

Genetic Diversity ◽

Reference Genome ◽

Regulatory Elements ◽

Human Populations ◽

Single Individual ◽

Rna Seq ◽

Human Reference Genome ◽

Reference Sequences ◽

Genome Annotations ◽

Unmapped Reads

Abstract The current human reference genome is predominantly derived from a single individual and it does not adequately reflect human genetic diversity. Here, we analyze 338 high-quality human assemblies of genetically divergent human populations to identify missing sequences in the human reference genome with breakpoint resolution. We identify 127,727 recurrent non-reference unique insertions spanning 18,048,877 bp, some of which disrupt exons and known regulatory elements. To improve genome annotations, we linearly integrate these sequences into the chromosomal assemblies and construct a Human Diversity Reference. Leveraging this reference, an average of 402,573 previously unmapped reads can be recovered for a given genome sequenced to ~40X coverage. Transcriptomic diversity among these non-reference sequences can also be directly assessed. We successfully map tens of thousands of previously discarded RNA-Seq reads to this reference and identify transcription evidence in 4781 gene loci, underlining the importance of these non-reference sequences in functional genomics. Our extensive datasets are important advances toward a comprehensive reference representation of global human genetic diversity.

LRez: C++ API and toolkit for analyzing and managing Linked-Reads data

Bioinformatics Advances ◽

10.1093/bioadv/vbab022 ◽

2021 ◽

Author(s):

Pierre Morisse ◽

Claire Lemaitre ◽

Fabrice Legeai

Keyword(s):

Genome Assembly ◽

Low Cost ◽

Variant Calling ◽

Supplementary Information ◽

Supplementary Data ◽

High Quality ◽

Dna Molecule ◽

Sequencing Technologies ◽

Wide Range ◽

Genomic Regions

Abstract Motivation: Linked-Reads technologies combine both the high quality and low cost of short-read sequencing and long-range information, through the use of barcodes tagging reads which originate from a common long DNA molecule. This technology has been employed in a broad range of applications including genome assembly, phasing and scaffolding, as well as structural variant calling. However, to date, no tool or API dedicated to the manipulation of Linked-Reads data exists. Results: We introduce LRez, a C++ API and toolkit which allows easy management of Linked-Reads data. LRez includes various functionalities, for computing numbers of common barcodes between genomic regions, extracting barcodes from BAM files, as well as indexing and querying BAM, FASTQ and gzipped FASTQ files to quickly fetch all reads or alignments containing a given barcode. LRez is compatible with a wide range of Linked-Reads sequencing technologies, and can thus be used in any tool or pipeline requiring barcode processing or indexing, in order to improve their performance. Availability and implementation: LRez is implemented in C++, supported on Unix-based platforms, and available under the AGPL-3.0 License at https://github.com/morispi/LRez, and as a bioconda module. Supplementary information: Supplementary data are available at Bioinformatics Advances.
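To illustrate what "computing numbers of common barcodes between genomic regions" means, here is a minimal conceptual sketch. It does not use or reproduce the LRez API; the function `count_common_barcodes`, the region names, and the barcode strings are all hypothetical, and each region's barcodes are assumed to have already been extracted into a plain Python collection.

```python
def count_common_barcodes(barcodes_a, barcodes_b):
    """Count the distinct barcodes observed in both regions.

    Each argument is an iterable of barcode strings seen on reads that
    align to one genomic region. Duplicates within a region are ignored.
    """
    return len(set(barcodes_a) & set(barcodes_b))


# Hypothetical barcodes per region, for illustration only.
region_1 = ["AAAC", "AAAG", "AAAT", "AAAC"]
region_2 = ["AAAG", "AAAT", "CCCA"]
region_3 = ["GGGT"]

shared_12 = count_common_barcodes(region_1, region_2)
shared_13 = count_common_barcodes(region_1, region_3)
print(shared_12, shared_13)
```

The intuition is that two regions sharing many barcodes are likely covered by the same long DNA molecules, which is the kind of signal scaffolding and structural-variant tools rely on.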

Chromosome-level genome assembly of a benthic associated Syngnathiformes species: the common dragonet, Callionymus lyra

Gigabyte ◽

10.46471/gigabyte.6 ◽

2020 ◽

Vol 2020 ◽

pp. 1-10

Author(s):

Sven Winter ◽

Stefan Prost ◽

Jordi de Raad ◽

Raphael T. F. Coimbra ◽

Magnus Wolf ◽

...

Keyword(s):

Genome Assembly ◽

Morphological Differentiation ◽

Chromosome Length ◽

Morphological Transformation ◽

High Quality ◽

The North ◽

Repeat Content ◽

The Common ◽

Species Specific ◽

Chromosome Level

Background: The common dragonet, Callionymus lyra, is one of three Callionymus species inhabiting the North Sea. All three species show strong sexual dimorphism. The males show strong morphological differentiation, e.g., species-specific colouration and size relations, while the females of different species have few distinguishing characters. Callionymus belongs to the ‘benthic associated clade’ of the order Syngnathiformes. The ‘benthic associated clade’ so far is not represented by genome data and serves as an important outgroup to understand the morphological transformation in ‘long-snouted’ Syngnathiformes such as seahorses and pipefishes. Findings: Here, we present the chromosome-level genome assembly of C. lyra. We applied Oxford Nanopore Technologies’ long-read sequencing, short-read DNBseq, and proximity-ligation-based scaffolding to generate a high-quality genome assembly. The resulting assembly has a contig N50 of 2.2 Mbp and a scaffold N50 of 26.7 Mbp. The total assembly length is 568.7 Mbp, of which over 538 Mbp were scaffolded into 19 chromosome-length scaffolds. The identification of 94.5% complete BUSCO genes indicates high assembly completeness. Additionally, we sequenced and assembled a multi-tissue transcriptome with a total length of 255.5 Mbp that was used to aid the annotation of the genome assembly. The annotation resulted in 19,849 annotated transcripts and identified a repeat content of 27.7%. Conclusions: The chromosome-level assembly of C. lyra provides a high-quality reference genome for future population genomic, phylogenomic, and phylogeographic analyses.

Complete Genome Assembly of Myxococcus xanthus Strain DZ2 Using Long High-Fidelity (HiFi) Reads Generated with PacBio Technology

Microbiology Resource Announcements ◽

10.1128/mra.00530-21 ◽

2021 ◽

Vol 10 (28) ◽

Author(s):

Rikesh Jain ◽

Bianca H. Habermann ◽

Tâm Mignot

Keyword(s):

Genome Assembly ◽

Complete Genome ◽

Reference Genome ◽

Myxococcus Xanthus ◽

High Fidelity ◽

High Quality ◽

Microbial Ecosystem ◽

Gram Negative ◽

Content Type

Myxococcus xanthus is a Gram-negative social bacterium belonging to the order Myxococcales of the class Deltaproteobacteria. It is a facultative social predator found in soils across the globe and is thought to be crucial for the microbial ecosystem. Here, we report a complete high-quality reference genome of the M. xanthus strain DZ2.
