The genome of C57BL/6J “Eve”, the mother of the laboratory mouse genome reference strain

10.1101/517466 ◽

2019 ◽

Author(s):

Vishal Kumar Sarsani ◽

Narayanan Raghupathy ◽

Ian T. Fiddes ◽

Joel Armstrong ◽

Francoise Thibaud-Nissen ◽

...

Keyword(s):

Reference Genome ◽

De Novo ◽

Laboratory Mouse ◽

Mouse Strains ◽

Structural Variations ◽

High Coverage ◽

Coding Sequences ◽

Adam And Eve ◽

Long Read ◽

Mouse Reference Genome

ABSTRACT Isogenic laboratory mouse strains are used to enhance reproducibility as individuals within a strain are essentially genetically identical. For the most widely used isogenic strain, C57BL/6, there is also a wealth of genetic, phenotypic, and genomic data, including one of the highest quality reference genomes (GRCm38.p6). However, laboratory mouse strains are living reagents and hence genetic drift occurs and is an unavoidable source of accumulating genetic variability that can have an impact on reproducibility over time. Nearly 20 years after the first release of the mouse reference genome, individuals from the strain it represents (C57BL/6J) are at least 26 inbreeding generations removed from the individuals used to generate the mouse reference genome. Moreover, C57BL/6J is now maintained through the periodic reintroduction of mice from cryopreserved embryo stocks that are derived from a single breeder pair, aptly named C57BL/6J Adam and Eve. To more accurately represent the genome of today’s C57BL/6J mice, we have generated a de novo assembly of the C57BL/6J Eve genome (B6Eve) using high coverage, long-read sequencing, optical mapping, and short-read data. Using these data, we addressed recurring variants observed in previous mouse studies. We have also identified structural variations that impact coding sequences, closed gaps in the mouse reference assembly, some of which are in genes, and we have identified previously unannotated coding sequences through long read sequencing of cDNAs. This B6Eve assembly explains discrepant observations that have been associated with GRCm38-based analyses, and has provided data towards a reference genome that is more representative of the C57BL/6J mice that are in use today.

Multiple laboratory mouse reference genomes define strain specific haplotypes and novel functional loci

10.1101/235838 ◽

2018 ◽

Cited By ~ 5

Author(s):

Jingtao Lilue ◽

Anthony G. Doran ◽

Ian T. Fiddes ◽

Monica Abrudan ◽

Joel Armstrong ◽

...

Keyword(s):

Reference Genome ◽

De Novo ◽

Inbred Mouse ◽

Laboratory Mouse ◽

Model Organism ◽

Inbred Strains ◽

Mouse Strains ◽

Disease Response ◽

Unannotated Gene ◽

Mouse Reference Genome

Abstract The most commonly employed mammalian model organism is the laboratory mouse. A wide variety of genetically diverse inbred mouse strains, representing distinct physiological states, disease susceptibilities, and biological mechanisms have been developed over the last century. We report full length draft de novo genome assemblies for 16 of the most widely used inbred strains and reveal for the first time extensive strain-specific haplotype variation. We identify and characterise 2,567 regions on the current Genome Reference Consortium mouse reference genome exhibiting the greatest sequence diversity between strains. These regions are enriched for genes involved in defence and immunity, and exhibit enrichment of transposable elements and signatures of recent retrotransposition events. Combinations of alleles and genes unique to an individual strain are commonly observed at these loci, reflecting distinct strain phenotypes. Several immune related loci, some in previously identified QTLs for disease response, have novel haplotypes not present in the reference that may explain the phenotype. We used these genomes to improve the mouse reference genome resulting in the completion of 10 new gene structures, and 62 new coding loci were added to the reference genome annotation. Notably this high quality collection of genomes revealed a previously unannotated gene (Efcab3-like) encoding 5,874 amino acids, one of the largest known in the rodent lineage. Interestingly, Efcab3-like−/− mice exhibit severe size anomalies in four regions of the brain suggesting a mechanism of Efcab3-like regulating brain development.

A long reads-based de-novo assembly of the genome of the Arlee homozygous line reveals chromosomal rearrangements in rainbow trout

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab052 ◽

2021 ◽

Author(s):

Guangtu Gao ◽

Susana Magadan ◽

Geoffrey C Waldbieser ◽

Ramey C Youngblood ◽

Paul A Wheeler ◽

...

Keyword(s):

Rainbow Trout ◽

Chromosome Number ◽

Genome Assembly ◽

De Novo Assembly ◽

De Novo ◽

Sequence Data ◽

Structural Variations ◽

High Coverage ◽

Haploid Chromosome Number ◽

Long Reads

Abstract Currently, there is still a need to improve the contiguity of the rainbow trout reference genome and to use multiple genetic backgrounds that will represent the genetic diversity of this species. The Arlee doubled haploid line originated from a domesticated hatchery strain that was originally collected from the northern California coast. The Canu pipeline was used to generate the Arlee line genome de-novo assembly from high coverage PacBio long-reads sequence data. The assembly was further improved with Bionano optical maps and Hi-C proximity ligation sequence data to generate 32 major scaffolds corresponding to the karyotype of the Arlee line (2N = 64). It is composed of 938 scaffolds with N50 of 39.16 Mb and a total length of 2.33 Gb, of which ∼95% was in 32 chromosome sequences with only 438 gaps between contigs and scaffolds. In rainbow trout the haploid chromosome number can vary from 29 to 32. In the Arlee karyotype the haploid chromosome number is 32 because chromosomes Omy04, 14 and 25 are divided into six acrocentric chromosomes. Additional structural variations that were identified in the Arlee genome included the major inversions on chromosomes Omy05 and Omy20 and 15 additional smaller inversions that will require further validation. This is also the first rainbow trout genome assembly that includes a scaffold with the sex-determination gene (sdY) in the chromosome Y sequence. The utility of this genome assembly is demonstrated through the improved annotation of the duplicated genome loci that harbor the IGH genes on chromosomes Omy12 and Omy13.
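The abstract above summarizes contiguity with an N50 value. As a minimal sketch of how that statistic is commonly computed (this is not the authors' pipeline, and the scaffold lengths below are hypothetical toy values), N50 is the length of the scaffold at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly size:

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical scaffold lengths (arbitrary units), for illustration only.
toy_scaffolds = [50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

With the toy input, the total is 150, so the running sum first reaches 75 at the second-longest scaffold. Real assemblies would supply lengths parsed from a FASTA index or similar.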

Closing Human Reference Genome Gaps: Identifying and Characterizing Gap-Closing Sequences

G3 Genes|Genome|Genetics ◽

10.1534/g3.120.401280 ◽

2020 ◽

Vol 10 (8) ◽

pp. 2801-2809 ◽

Cited By ~ 1

Author(s):

Tingting Zhao ◽

Zhongqu Duan ◽

Georgi Z. Genchev ◽

Hui Lu

Keyword(s):

Reference Genome ◽

De Novo ◽

Sequence Length ◽

Sequencing Data ◽

Human Reference Genome ◽

Satellite Sequences ◽

Long Read ◽

Data Gap ◽

Simple Repeats ◽

Gap Closing

Despite continuous updates of the human reference genome, there are still hundreds of unresolved gaps which account for about 5% of the total sequence length. Given the availability of whole genome de novo assemblies, especially those derived from long-read sequencing data, gap-closing sequences can be determined. By comparing 17 de novo long-read sequencing assemblies with the human reference genome, we identified a total of 1,125 gap-closing sequences for 132 (16.9% of 783) gaps and added up to 2.2 Mb novel sequences to the human reference genome. More than 90% of the non-redundant sequences could be verified by unmapped reads from the Simons Genome Diversity Project dataset. In addition, 15.6% of the non-reference sequences were found in at least one of four non-human primate genomes. We further demonstrated that the non-redundant sequences had high content of simple repeats and satellite sequences. Moreover, 43 (32.6%) of the 132 closed gaps were shown to be polymorphic; such sequences may play an important biological role and can be useful in the investigation of human genetic diversity.
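Assembly gaps like those discussed above are conventionally written as runs of `N` in a reference sequence. The following is an illustrative sketch, assuming a plain in-memory sequence string and a hypothetical helper name `find_gaps`; it only locates candidate gaps and does not reproduce the gap-closing comparison described in the abstract:

```python
import re


def find_gaps(sequence, min_len=1):
    """Return half-open (start, end) intervals for runs of N/n.

    Only runs of at least `min_len` bases are reported.
    """
    gaps = []
    for match in re.finditer(r"[Nn]+", sequence):
        if match.end() - match.start() >= min_len:
            gaps.append((match.start(), match.end()))
    return gaps


# Hypothetical toy sequence, for illustration only.
toy_reference = "ACGTNNNNACGTNNAC"
print(find_gaps(toy_reference))
print(find_gaps(toy_reference, min_len=3))
```

The `min_len` threshold is a design choice for this sketch: short runs of `N` often mark ambiguous bases rather than true assembly gaps, so a caller may want to ignore them.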

Rapid low-cost assembly of the Drosophila melanogaster reference genome using low-coverage, long-read sequencing

10.1101/267401 ◽

2018 ◽

Cited By ~ 6

Author(s):

Edwin A. Solares ◽

Mahul Chakraborty ◽

Danny E. Miller ◽

Shannon Kalsow ◽

Kate Hall ◽

...

Keyword(s):

Drosophila Melanogaster ◽

Genetic Variation ◽

Large Scale ◽

Reference Genome ◽

De Novo ◽

Low Cost ◽

Nucleotide Polymorphisms ◽

Structural Variants ◽

High Coverage ◽

Reference Assembly

ABSTRACT Accurate and comprehensive characterization of genetic variation is essential for deciphering the genetic basis of diseases and other phenotypes. A vast amount of genetic variation stems from large-scale sequence changes arising from the duplication, deletion, inversion, and translocation of sequences. In the past 10 years, high-throughput short reads have greatly expanded our ability to assay sequence variation due to single nucleotide polymorphisms. However, a recent de novo assembly of a second Drosophila melanogaster reference genome has revealed that short read genotyping methods miss hundreds of structural variants, including those affecting phenotypes. While genomes assembled using high-coverage long reads can achieve high levels of contiguity and completeness, concerns about cost, errors, and low yield have limited widespread adoption of such sequencing approaches. Here we resequenced the reference strain of D. melanogaster (ISO1) on a single Oxford Nanopore MinION flow cell run for 24 hours. Using only reads longer than 1 kb or with at least 30x coverage, we assembled a highly contiguous de novo genome. The addition of inexpensive paired reads and subsequent scaffolding using an optical map technology achieved an assembly with completeness and contiguity comparable to the D. melanogaster reference assembly. Comparison of our assembly to the reference assembly of ISO1 uncovered a number of structural variants (SVs), including novel LTR transposable element insertions and duplications affecting genes with developmental, behavioral, and metabolic functions. Collectively, these SVs provide a snapshot of the dynamics of genome evolution. Furthermore, our assembly and comparison to the D. melanogaster reference genome demonstrates that high-quality de novo assembly of reference genomes and comprehensive variant discovery using such assemblies are now possible by a single lab for under $1,000 (USD).
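The abstract above refers to a read-length cutoff and to sequencing coverage. A minimal sketch of both ideas follows, assuming read lengths are already available as integers; the function names, the toy read lengths, and the toy genome size are hypothetical and are not taken from the study. Coverage here is the usual estimate of total sequenced bases divided by genome size, and the sketch does not attempt to interpret the abstract's exact selection rule:

```python
def filter_reads(read_lengths, min_length=1000):
    """Keep only reads strictly longer than `min_length` bases."""
    return [n for n in read_lengths if n > min_length]


def coverage(read_lengths, genome_size):
    """Estimate average depth as total bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


# Hypothetical toy values, for illustration only.
toy_reads = [500, 1500, 3000, 800, 12000]
toy_genome_size = 5500

kept = filter_reads(toy_reads)
print(kept)
print(coverage(kept, toy_genome_size))
```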

A comparison of methodological approaches to the study of young sex chromosomes: A case study in Poecilia

10.1101/2021.11.29.470452 ◽

2021 ◽

Author(s):

Iulia Darolti ◽

Pedro Almeida ◽

Alison E Wright ◽

Judith E Mank

Keyword(s):

Sex Chromosomes ◽

Reference Genome ◽

Sex Chromosome ◽

De Novo ◽

Sequence Similarity ◽

Sequence Evolution ◽

Structural Variations ◽

Recombination Suppression ◽

Custom Made ◽

Methodological Approaches

Studies of sex chromosome systems at early stages of divergence are key to understanding the initial process and underlying causes of recombination suppression. However, identifying signatures of divergence in homomorphic sex chromosomes can be challenging due to high levels of sequence similarity between the X and the Y. Variations in methodological precision and underlying data can make all the difference between detecting subtle divergence patterns or missing them entirely. Recent efforts to test for X-Y sequence differentiation in the guppy have led to contradictory results. Here we apply different analytical methodologies to the same dataset to test for the accuracy of different approaches in identifying patterns of sex chromosome divergence in the guppy. Our comparative analysis reveals that the most substantial source of variation in the results of the different analyses lies in the reference genome used. Analyses using custom-made de novo genome assemblies for the focal species successfully recover a signal of divergence across different methodological approaches. By contrast, using the distantly related Xiphophorus reference genome results in variable patterns, due to both sequence evolution and structural variations on the sex chromosomes between the guppy and Xiphophorus. Changes in mapping and filtering parameters can additionally introduce noise and obscure the signal. Our results illustrate how analytical differences can alter perceived results and we highlight best practices for the study of nascent sex chromosomes.

Telomere-to-telomere assembly of a complete human X chromosome

10.1101/735928 ◽

2019 ◽

Cited By ~ 43

Author(s):

Karen H. Miga ◽

Sergey Koren ◽

Arang Rhie ◽

Mitchell R. Vollger ◽

Ariel Gershman ◽

...

Keyword(s):

Human Genome ◽

X Chromosome ◽

Satellite Dna ◽

Tandem Repeats ◽

Reference Genome ◽

De Novo ◽

Hydatidiform Mole ◽

Segmental Duplications ◽

High Coverage ◽

Current Reference

After nearly two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no one chromosome has been finished end to end, and hundreds of unresolved gaps persist [1,2]. The remaining gaps include ribosomal rDNA arrays, large near-identical segmental duplications, and satellite DNA arrays. These regions harbor largely unexplored variation of unknown consequence, and their absence from the current reference genome can lead to experimental artifacts and hide true variants when re-sequencing additional human genomes. Here we present a de novo human genome assembly that surpasses the continuity of GRCh38 [2], along with the first gapless, telomere-to-telomere assembly of a human chromosome. This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome [3], we reconstructed the ∼2.8 megabase centromeric satellite DNA array and closed all 29 remaining gaps in the current reference, including new sequence from the human pseudoautosomal regions and cancer-testis ampliconic gene families (CT-X and GAGE). This complete chromosome X, combined with the ultra-long nanopore data, also allowed us to map methylation patterns across complex tandem repeats and satellite arrays for the first time. These results demonstrate that finishing the human genome is now within reach and will enable ongoing efforts to complete the remaining human chromosomes.

A whole genome atlas of 81 Psilocybe genomes as a resource for psilocybin production.

F1000Research ◽

10.12688/f1000research.55301.2 ◽

2021 ◽

Vol 10 ◽

pp. 961

Author(s):

Kevin McKernan ◽

Liam Kane ◽

Yvonne Helbert ◽

Lei Zhang ◽

Nathan Houde ◽

...

Keyword(s):

Gene Cluster ◽

Single Molecule ◽

Reference Genome ◽

De Novo ◽

Genomic Diversity ◽

Sequence Coverage ◽

Single Molecule Sequencing ◽

Contiguous Gene ◽

Long Read ◽

Interesting Variation

The Psilocybe genus is well known for the synthesis of valuable psychoactive compounds such as Psilocybin, Psilocin, Baeocystin and Aeruginascin. The ubiquity of Psilocybin synthesis in Psilocybe has been attributed to a horizontal gene transfer mechanism of a ~20 kb gene cluster. A recently published highly contiguous reference genome derived from long read single molecule sequencing has underscored interesting variation in this Psilocybin synthesis gene cluster. This reference genome has also enabled the shotgun sequencing of spores from many Psilocybe strains to better catalog the genomic diversity in the Psilocybin synthesis pathway. Here we present the de novo assembly of 81 Psilocybe genomes compared to the P. envy reference genome. Surprisingly, the genomes of Psilocybe galindoi, Psilocybe tampanensis and Psilocybe azurescens lack sequence coverage over the previously described Psilocybin synthesis pathway but do demonstrate amino acid sequence homology to a less contiguous gene cluster and may illuminate the previously proposed evolution of psilocybin synthesis.

Improved Apis mellifera reference genome based on the alternative long-read-based assemblies

10.1101/2021.04.30.442202 ◽

2021 ◽

Author(s):

Milyausha Kaskinova ◽

Bayazit Yunusbayev ◽

Radick Altinbaev ◽

Rika Raffiudin ◽

Madeline H. Carpenter ◽

...

Keyword(s):

Apis Mellifera ◽

Honey Bee ◽

Reference Genome ◽

De Novo ◽

Gene Annotation ◽

Model Organism ◽

Functional Genomic ◽

Long Read ◽

Chromosome Level

ABSTRACT Apis mellifera L., the western honey bee, is a major crop pollinator that plays a key role in beekeeping and serves as an important model organism in social behavior studies. Recent efforts have improved the quality of the honey bee reference genome and developed a chromosome-level assembly of sixteen chromosomes, two of which are gapless. However, the rest suffer from 51 gaps, 160 unplaced/unlocalized scaffolds, and the lack of 2 distal telomeres. The gaps are located in hard-to-assemble, extended, highly repetitive chromosomal regions that may contain functional genomic elements. Here, we use de novo re-assemblies from the raw reads of the most recent reference genome, Amel_HAv_3.1, and other long-read-based assemblies (INRA_AMelMel_1.0, ASM1384120v1, and ASM1384124v1) of the honey bee genome to resolve 13 gaps, five unplaced/unlocalized scaffolds, and the missing telomeres of Amel_HAv_3.1. The total length of the resolved gaps is 848,747 bp. The accuracy of the corrected assembly was validated by mapping PacBio reads and performing gene annotation assessment. Comparative analysis suggests that the PacBio-reads-based assemblies of the honey bee genomes failed in the same highly repetitive extended regions of the chromosomes, especially on chromosome 10. To fully resolve these extended repetitive regions, further work using ultra-long Nanopore sequencing would be needed. Our updated assembly facilitates more accurate reference-guided scaffolding and marker/sequence mapping in honey bee genomics studies.

Chromosome-length genome assembly and structural variations of the primal Basenji dog (Canis lupus familiaris) genome

BMC Genomics ◽

10.1186/s12864-021-07493-6 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Richard J. Edwards ◽

Matt A. Field ◽

James M. Ferguson ◽

Olga Dudchenko ◽

Jens Keilwagen ◽

...

Keyword(s):

Reference Genome ◽

De Novo ◽

Genome Structure ◽

Canis Lupus Familiaris ◽

Structural Variations ◽

German Shepherd ◽

High Quality ◽

Entire Family ◽

The Impact ◽

Reference Genomes

Abstract Background: Basenjis are considered an ancient dog breed of central African origins that still live and hunt with tribesmen in the African Congo. Nicknamed the barkless dog, Basenjis possess unique phylogeny, geographical origins and traits, making their genome structure of great interest. The increasing number of available canid reference genomes allows us to examine the impact the choice of reference genome makes with regard to reference genome quality and breed relatedness. Results: Here, we report two high quality de novo Basenji genome assemblies: a female, China (CanFam_Bas), and a male, Wags. We conduct pairwise comparisons and report structural variations between assembled genomes of three dog breeds: Basenji (CanFam_Bas), Boxer (CanFam3.1) and German Shepherd Dog (GSD) (CanFam_GSD). CanFam_Bas is superior to CanFam3.1 in terms of genome contiguity and comparable overall to the high quality CanFam_GSD assembly. By aligning short read data from 58 representative dog breeds to three reference genomes, we demonstrate how the choice of reference genome significantly impacts both read mapping and variant detection. Conclusions: The growing number of high-quality canid reference genomes means the choice of reference genome is an increasingly critical decision in subsequent canid variant analyses. The basal position of the Basenji makes it suitable for variant analysis for targeted applications of specific dog breeds. However, we believe more comprehensive analyses across the entire family of canids are better suited to a pangenome approach. Collectively this work highlights the importance the choice of reference genome makes in all variation studies.

Analysis of a small outbreak of Shiga toxin-producing Escherichia coli O157:H7 using long-read sequencing

Microbial Genomics ◽

10.1099/mgen.0.000545 ◽

2021 ◽

Vol 7 (3) ◽

Author(s):

David R. Greig ◽

Claire Jenkins ◽

Saheer E. Gharbia ◽

Timothy J. Dallman

Keyword(s):

Reference Genome ◽

Genetic Relatedness ◽

De Novo ◽

Methodological Approach ◽

Foodborne Pathogen ◽

Variant Calling ◽

Sequencing Data ◽

Deletion Event ◽

Base Calling ◽

Long Read

Compared to short-read sequencing data, long-read sequencing facilitates single contiguous de novo assemblies and characterization of the prophage region of the genome. Here, we describe our methodological approach to using Oxford Nanopore Technologies (ONT) sequencing data to quantify genetic relatedness and to look for microevolutionary events in the core and accessory genomes to assess the within-outbreak variation of four genetically and epidemiologically linked isolates. Analysis of both Illumina and ONT sequencing data detected one SNP between the four sequences of the outbreak isolates. The variant calling procedure highlighted the importance of masking homologous sequences in the reference genome regardless of the sequencing technology used. Variant calling also highlighted the systemic errors in ONT base-calling and the ambiguous mapping of Illumina reads, which result in variations in the genetic distance when comparing one technology to the other. The prophage component of the outbreak strain was analysed, and nine of the 16 prophages showed some similarity to the prophage in the Sakai reference genome, including the stx2a-encoding phage. Prophage comparison between the outbreak isolates identified minor genome rearrangements in one of the isolates, including an inversion and a deletion event. The ability to characterize the accessory genome in this way is the first step to understanding the significance of these microevolutionary events and their impact on the evolutionary history, virulence and potentially the likely source and transmission of this zoonotic, foodborne pathogen.
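The abstract above mentions quantifying genetic relatedness by SNP differences and masking homologous reference regions before variant calling. A minimal sketch of that idea, assuming two already-aligned sequences of equal length and a set of masked zero-based positions, could look like the following. The function name, the toy sequences, and the masked position are hypothetical, and this is not the authors' variant-calling procedure:

```python
def snp_distance(seq_a, seq_b, masked=frozenset()):
    """Count mismatching A/C/G/T positions between two aligned sequences.

    Positions listed in `masked` (zero-based) are skipped, as are
    positions where either sequence has a non-ACGT symbol such as N or '-'.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    distance = 0
    for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if i in masked or a not in valid or b not in valid:
            continue
        if a != b:
            distance += 1
    return distance


# Hypothetical toy alignment, for illustration only.
isolate_1 = "ACGTACGT"
isolate_2 = "ACGAACCT"
print(snp_distance(isolate_1, isolate_2))
print(snp_distance(isolate_1, isolate_2, masked={6}))
```

Masking changes the reported distance in the toy example, which illustrates why the choice of masked regions matters when distances from different sequencing technologies are compared.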
