New whole genome de novo assemblies of three divergent strains of rice (O. sativa) documents novel gene space of aus and indica

10.1101/003764 ◽

2014 ◽

Cited By ~ 4

Author(s):

Michael C Schatz ◽

Lyza G Maron ◽

Joshua C Stein ◽

Alejandro Hernandez Wences ◽

James Gurtowski ◽

...

Keyword(s):

Structural Variation ◽

De Novo ◽

Sequence Data ◽

Biological Properties ◽

Genomic Diversity ◽

Reference Sequence ◽

Human Populations ◽

Whole Genome ◽

Rice Varieties ◽

Assembly Technology

The use of high throughput genome-sequencing technologies has uncovered a large extent of structural variation in eukaryotic genomes that makes important contributions to genomic diversity and phenotypic variation. Currently, when the genomes of different strains of a given organism are compared, whole genome resequencing data are aligned to an established reference sequence. However, when the reference differs in significant structural ways from the individuals under study, the analysis is often incomplete or inaccurate. Here, we use rice as a model to explore the extent of structural variation among strains adapted to different ecologies and geographies, and show that this variation can be significant, often matching or exceeding the variation present in closely related human populations or other mammals. We demonstrate how improvements in sequencing and assembly technology allow rapid and inexpensive de novo assembly of next generation sequence data into high-quality assemblies that can be directly compared to provide an unbiased assessment. Using this approach, we are able to accurately assess the "pan-genome" of three divergent rice varieties and document several megabases of each genome absent in the other two. Many of the genome-specific loci are annotated to contain genes, reflecting the potential for new biological properties that would be missed by standard resequencing approaches. We further provide a detailed analysis of several loci associated with agriculturally important traits, illustrating the utility of our approach for biological discovery. All of the data and software are openly available to support further breeding and functional studies of rice and other species.

A computational framework to assess genome-wide distribution of polymorphic human endogenous retrovirus-K in human populations

10.1101/444034 ◽

2018 ◽

Cited By ~ 1

Author(s):

Weiling Li ◽

Lin Lin ◽

Raunaq Malhotra ◽

Lei Yang ◽

Raj Acharya ◽

...

Keyword(s):

Disease Risk ◽

Sequence Data ◽

Endogenous Retrovirus ◽

Genomic Diversity ◽

Human Endogenous Retrovirus ◽

Human Populations ◽

Whole Genome ◽

Short Read ◽

Reference Set ◽

Type K

Abstract: Human Endogenous Retrovirus type K (HERV-K) is the only HERV known to be insertionally polymorphic. It is possible that HERV-Ks contribute to human disease because people differ in both number and genomic location of these retroviruses. Indeed, viral transcripts, proteins, and antibodies against HERV-K are detected in cancers, auto-immune, and neurodegenerative diseases. However, attempts to link a polymorphic HERV-K with any disease have been frustrated in part because the population frequency of HERV-K provirus at each site is lacking and it is challenging to identify closely related elements such as HERV-K from short read sequence data. We present an integrated and computationally robust approach that uses whole genome short read data to determine the occupation status at all sites reported to contain a HERV-K provirus. Our method estimates the proportion of fixed-length genomic sequences (k-mers) from whole genome sequence data matching a reference set of k-mers unique to each HERV-K locus and applies mixture model-based clustering to account for low depth sequence data. Our analysis of 1000 Genomes Project Data (KGP) reveals numerous differences among the five KGP super-populations in the frequency of individual and co-occurring HERV-K proviruses; we provide a visualization tool to easily depict the prevalence of any combination of HERV-K among KGP populations. Further, the genome burden of polymorphic HERV-K is variable in humans, with East Asian (EAS) individuals having the fewest integration sites. Our study identifies population-specific sequence variation for several HERV-K proviruses. We expect these resources will advance research on HERV-K contributions to human diseases.

Author summary: Human Endogenous Retrovirus type K (HERV-K) is the youngest of the retrovirus families in the human genome and is the only group that is polymorphic; a HERV-K can be present in one individual but absent from others. HERV-Ks could contribute to disease risk, but establishing a link between a polymorphic HERV-K and a specific disease has been difficult. We develop an easy-to-use method that reveals the considerable variation existing among global populations in the frequency of individual and co-occurring polymorphic HERV-K, and in the total number of HERV-K that any individual has in their genome. Our study provides a global reference set of HERV-K genomic diversity and the tools needed to determine the genomic landscape of HERV-K in any patient population.
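
The k-mer matching step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy sequences, and the choice to report the fraction of a locus's reference k-mers seen in the reads are all assumptions made for illustration. Reverse complements, sequencing depth, and the mixture model-based clustering are left out.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_proportion(reads, locus_kmers, k):
    """Fraction of a locus's reference k-mers observed in at least one read.

    Hypothetical helper: the name and the exact definition of the
    proportion are assumptions, not taken from the paper.
    """
    if not locus_kmers:
        raise ValueError("locus_kmers must not be empty")
    observed = set()
    for read in reads:
        # Keep only read k-mers that belong to the locus reference set.
        observed |= kmers(read, k) & locus_kmers
    return len(observed) / len(locus_kmers)


# Toy data, invented for illustration only.
locus_sequence = "ACGTACGGTCA"
reference_kmers = kmers(locus_sequence, 4)
reads = ["ACGTACG", "TTTTTTT"]
proportion = kmer_proportion(reads, reference_kmers, 4)
```

In this toy case the first read covers half of the locus k-mers and the second read covers none. A real analysis would compare such proportions across many individuals before deciding whether a provirus is present at a site.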

A study of transposable element-associated structural variations (TASVs) using a de novo-assembled Korean genome

Experimental & Molecular Medicine ◽

10.1038/s12276-021-00586-y ◽

2021 ◽

Author(s):

Seyoung Mun ◽

Songmi Kim ◽

Wooseok Lee ◽

Keunsoo Kang ◽

Thomas J. Meyer ◽

...

Keyword(s):

Genome Sequencing ◽

Genome Assembly ◽

De Novo ◽

Personal Genome ◽

Human Populations ◽

Whole Genome ◽

Structural Variations ◽

Insert Size ◽

Human Genomes ◽

Next Generation Sequencing (NGS)

Abstract: Advances in next-generation sequencing (NGS) technology have made personal genome sequencing possible, and indeed, many individual human genomes have now been sequenced. Comparisons of these individual genomes have revealed substantial genomic differences between human populations as well as between individuals from closely related ethnic groups. Transposable elements (TEs) are known to be one of the major sources of these variations and act through various mechanisms, including de novo insertion, insertion-mediated deletion, and TE–TE recombination-mediated deletion. In this study, we carried out de novo whole-genome sequencing of one Korean individual (KPGP9) via multiple insert-size libraries. The de novo whole-genome assembly resulted in 31,305 scaffolds with a scaffold N50 size of 13.23 Mb. Furthermore, through computational data analysis and experimental verification, we revealed that 182 TE-associated structural variation (TASV) insertions and 89 TASV deletions contributed 64,232 bp in sequence gain and 82,772 bp in sequence loss, respectively, in the KPGP9 genome relative to the hg19 reference genome. We also verified structural differences associated with TASVs by comparative analysis with TASVs in recent genomes (AK1 and TCGA genomes) and reported their details. Here, we constructed a new Korean de novo whole-genome assembly and provide the first study, to our knowledge, focused on the identification of TASVs in an individual Korean genome. Our findings again highlight the role of TEs as a major driver of structural variations in human individual genomes.
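
For readers unfamiliar with the scaffold N50 statistic quoted above, the following Python sketch shows the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name and the toy lengths are invented for illustration and are unrelated to the KPGP9 data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy scaffold lengths, invented for illustration only.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_scaffolds)
```

Here the total is 300, and the two longest scaffolds (80 and 70) already reach 150, so the N50 is 70.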

Identification of meiotic recombination through gamete genome reconstruction using whole genome linked-reads

10.1101/363341 ◽

2018 ◽

Author(s):

Peng Xu ◽

Zechen Chong ◽

Keyword(s):

Meiotic Recombination ◽

Haplotype Diversity ◽

Genomic Analysis ◽

Genomic Diversity ◽

Pedigree Information ◽

Human Populations ◽

Whole Genome ◽

Homologous Chromosomes ◽

Template Strand ◽

Recombination Hotspots

Abstract: Meiotic recombination (MR), which transmits exchanged genetic materials between homologous chromosomes to offspring, plays a crucial role in shaping genomic diversity in eukaryotic organisms. In humans, thousands of meiotic recombination hotspots have been mapped by population genetics approaches. However, direct identification of MR events for individuals is still challenging due to the difficulty in resolving the haplotypes of homologous chromosomes and reconstructing the gamete genome. Whole genome linked-read sequencing (lrWGS) can generate haplotype sequences of mega-base pairs (N50 ~2.5 Mb) after computational phasing. However, the haplotype information is still isolated in a large number of fragmented genomic regions and limited by switch errors, impeding its further application in chromosome-scale analysis. In this study, we developed a tool, MRLR (Meiotic Recombination identification by Linked-Read sequencing), for the analysis of individual MR events. By leveraging trio pedigree information with lrWGS haplotypes, our pipeline is sufficient to reconstruct the whole human gamete genome with 99.8% haplotyping accuracy. By analyzing the haplotype exchange between homologous chromosomes, MRLR identified 462 high-resolution MR events in 6 human trio samples from the Genome In A Bottle (GIAB) and the Human Genome Structural Variation Consortium (HGSVC). In three datasets of the HGSVC, our results recapitulated 149 (92%) previously identified high-confidence MR events and discovered 85 novel events. About half (40) of the new events are supported by single-cell template strand sequencing (Strand-seq) results. We found that 332 (71.9%) MR events co-localize with recombination hotspots (>10 cM/Mb) in human populations, and MR breakpoint regions are enriched in PRDM9 and DMC1 binding sites. In addition, 48% (221) of the breakpoint regions were detected inside a gene, indicating these MRs can directly affect the haplotype diversity of genic regions. Taken together, our approach provides new opportunities in the haplotype-based genomic analysis of individual meiotic recombination. The MRLR software is implemented in Perl and is freely available at https://github.com/ChongLab/MRLR.
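
The idea of locating recombination events from haplotype exchange can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not MRLR itself (which is written in Perl): it assumes each marker along a gamete chromosome has already been assigned to one of the two parental haplotypes, and it reports every interval in which that assignment changes. The function name, the labels `H1` and `H2`, and the coordinates are hypothetical.

```python
def find_haplotype_switches(positions, origins):
    """Return (left, right) marker intervals where the parental origin changes.

    positions: marker coordinates in ascending order.
    origins:   parental haplotype label for each marker (e.g. "H1" or "H2").
    """
    if len(positions) != len(origins):
        raise ValueError("positions and origins must have the same length")
    intervals = []
    for i in range(1, len(origins)):
        if origins[i] != origins[i - 1]:
            # The exchange happened somewhere between these two markers.
            intervals.append((positions[i - 1], positions[i]))
    return intervals


# Toy markers, invented for illustration only.
toy_positions = [100, 250, 400, 900, 1200]
toy_origins = ["H1", "H1", "H2", "H2", "H1"]
toy_events = find_haplotype_switches(toy_positions, toy_origins)
```

With these toy markers the function reports two candidate intervals, (250, 400) and (900, 1200). A real pipeline would also need to separate true recombination from phasing switch errors, which this sketch does not attempt.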

Reference-Guided De Novo Genome Assembly to Dissect a QTL Region for Submergence Tolerance Derived from Ciherang-Sub1

Plants ◽

10.3390/plants10122740 ◽

2021 ◽

Vol 10 (12) ◽

pp. 2740

Author(s):

Yuya Liang ◽

Shichen Wang ◽

Chersty L. Harper ◽

Nithya K. Subramanian ◽

Rodante E. Tabien ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Global Climate ◽

Major Effect ◽

Sequence Information ◽

Whole Genome ◽

Submergence Tolerance ◽

De Novo Genome Assembly ◽

Rice Varieties ◽

Genome Profile

Global climate change has increased the number of severe flooding events that affect agriculture, including rice production in the U.S. and internationally. Heavy rainfall can cause rice plants to be completely submerged, which can significantly affect grain yield or completely destroy the plants. Recently, a major-effect QTL for submergence tolerance during the vegetative stage, qSub8.1, which originated from Ciherang-Sub1, was identified in a mapping population derived from a cross between Ciherang-Sub1 and IR10F365. Ciherang-Sub1 was, in turn, derived from a cross between Ciherang and IR64-Sub1. Here, we characterize the qSub8.1 region by analyzing the sequence information of Ciherang-Sub1 and its two parents (Ciherang and IR64-Sub1) and compare the whole genome profile of these varieties with the Nipponbare and Minghui 63 (MH63) reference genomes. The three rice varieties were sequenced with 150 bp paired-end whole-genome shotgun sequencing (Illumina HiSeq4000) and assembled with the Trimmomatic-SOAPdenovo2-MUMmer3 pipeline, resulting in approximate genome sizes of 354.4, 343.7, and 344.7 Mb, with N50 values of 25.1, 25.4, and 26.1 kb, respectively. The results showed that the Ciherang-Sub1 genome is composed of 59–63% Ciherang, 22–24% IR64-Sub1, and 15–17% unknown sources. The genome profile revealed a more detailed genomic composition than previous marker-assisted breeding and showed that the qSub8.1 region is mostly from Ciherang, with some introgressed segments from IR64-Sub1 and currently unknown source(s).

MUMdex: MUM-based structural variation detection

10.1101/078261 ◽

2016 ◽

Cited By ~ 2

Author(s):

Peter A. Andrews ◽

Ivan Iossifov ◽

Jude Kendall ◽

Steven Marks ◽

Lakshmi Muthuswamy ◽

...

Keyword(s):

Reference Genome ◽

De Novo ◽

Sequence Data ◽

Genomic Analysis ◽

Supplementary Information ◽

Whole Genome ◽

Analysis Software ◽

Population Database ◽

Genome Data ◽

Split Read

Abstract

Motivation: Standard genome sequence alignment tools primarily designed to find one alignment per read have difficulty detecting inversion, translocation and large insertion and deletion (indel) events. Moreover, dedicated split read alignment methods that depend only upon the reference genome may misidentify or find too many potential split read alignments because of reference genome anomalies.

Methods: We introduce MUMdex, a Maximal Unique Match (MUM)-based genomic analysis software package consisting of a sequence aligner to the reference genome, a storage-indexing format and analysis software. Discordant reference alignments of MUMs are especially suitable for identifying inversion, translocation and large indel differences in unique regions. Extracted population databases are used as filters for flaws in the reference genome. We describe the concepts underlying MUM-based analysis, the software implementation and its usage.

Results: We demonstrate via simulation that the MUMdex aligner and alignment format are able to correctly detect and record genomic events. We characterize alignment performance and output file sizes for human whole genome data and compare to Bowtie 2 and the BAM format. Preliminary results demonstrate the practicality of the analysis approach by detecting de novo mutation candidates in human whole genome DNA sequence data from 510 families. We provide a population database of events from these families for use by others.

Availability: http://mumdex.com/

Contact: [email protected] (or [email protected])

Supplementary information: Supplementary data are available online.

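Pembentukan Pustaka Genom, Resekuensing, dan Identifikasi SNP Berdasarkan Sekuen Genom Total Genotipe Kedelai Indonesia [Genomic Library Construction, Resequencing, and SNP Identification Based on Whole-Genome Sequences of Indonesian Soybean Genotypes]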

Jurnal AgroBiogen ◽

10.21082/jbio.v11n1.2015.p7-16 ◽

2016 ◽

Vol 11 (1) ◽

pp. 7 ◽

Cited By ~ 2

Author(s):

I Made Tasma ◽

Dani Satyawan ◽

Habib Rijzaani

Keyword(s):

Genomic Library ◽

Genomic Sequence ◽

Sequence Data ◽

Soybean Genome ◽

Quality Analysis ◽

Reference Sequence ◽

Sequencing Error ◽

Whole Genome ◽

Genome Wide ◽

Soybean Genotypes

Resequencing of the soybean genome facilitates the discovery of SNP markers useful for supporting the national soybean breeding programs. The objectives of the present study were to construct soybean genomic libraries, to resequence the whole genome of five Indonesian soybean genotypes, and to identify SNPs based on the resequencing data. The study consisted of genomic library construction and quality analysis, whole-genome resequencing of five soybean genotypes, and genome-wide SNP identification based on alignment of the resequencing data with the reference sequence, Williams 82. The five Indonesian soybean genotypes were Tambora, Grobogan, B3293, Malabar, and Davros. The results showed that the soybean genomic libraries were successfully constructed, with a size of 400 bp and library concentrations ranging from 21.2 to 64.5 ng/μl. Resequencing of the libraries resulted in 50.1 × 10^9 bp of total genomic sequence. The quality of the genomic libraries and sequence data resulting from this study was high, as indicated by a Q score of 88.6% and a low sequencing error of only 0.97%. Bioinformatic analysis resulted in a total of 2,597,286 SNPs, 257,598 insertions, and 202,157 deletions. Of the total SNPs identified, only 95,207 SNPs (2.15%) were located within exons. Among those, 49,926 SNPs caused missense mutations and 1,535 SNPs caused nonsense mutations. Once verified, the SNPs resulting from this study will be very useful for developing a genome-wide SNP chip for the soybean genome to accelerate soybean breeding programs.

A benchmarking of human mitochondrial DNA haplogroup classifiers from whole-genome and whole-exome sequence data

10.1101/2021.02.11.430775 ◽

2021 ◽

Author(s):

Víctor García-Olivares ◽

Adrián Muñoz-Barrera ◽

José Miguel Lorenzo-Salazar ◽

Carlos Zaragoza-Trello ◽

Luis A. Rubio-Rodríguez ◽

...

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Sequence Data ◽

Qualitative Assessment ◽

Whole Genome ◽

Third Generation ◽

Sequencing Data ◽

Short Read ◽

Bioinformatic Tools ◽

Whole Exome

Abstract: The mitochondrial genome (mtDNA) is of interest for a range of fields including evolutionary, forensic, and medical genetics. Human mitogenomes can be classified into evolutionarily related haplogroups that provide ancestral information and pedigree relationships. Because of this and the advent of high-throughput sequencing (HTS) technology, there is a diversity of bioinformatic tools for haplogroup classification. We present a benchmarking of the 11 most salient tools for human mtDNA classification using empirical whole-genome (WGS) and whole-exome (WES) short-read sequencing data from 36 unrelated donors. Given its relevance, we also assess the best performing tool in third-generation long noisy read WGS data obtained with nanopore technology for a subset of the donors. We found that, for short-read WGS, most of the tools exhibit high accuracy for haplogroup classification irrespective of the input file used for the analysis. However, for short-read WES, Haplocheck and MixEmt were the most accurate tools. Based on the performance shown for WGS and WES, and the accompanying qualitative assessment, Haplocheck stands out as the most complete tool. For third-generation HTS data, we also showed that Haplocheck was able to accurately retrieve mtDNA haplogroups for all samples assessed, although only after following assembly-based approaches (either a reference-based assembly or a hybrid de novo assembly). Taken together, our results provide guidance for researchers to select the most suitable tool to conduct mtDNA analyses from HTS data.

First phylogenetic analysis of Malian SARS-CoV-2 sequences provide molecular insights into the genomic diversity of the Sahel region

10.1101/2020.09.23.20165639 ◽

2020 ◽

Author(s):

Bourema Kouriba ◽

Angela Duerr ◽

Alexandra Rehn ◽

Abdoul Karim Sangare ◽

Brehima Youssouf Traoure ◽

...

Keyword(s):

Phylogenetic Analysis ◽

Genome Sequencing ◽

Sequence Data ◽

Genomic Diversity ◽

Whole Genome ◽

Sequencing Data ◽

Genome Sequences ◽

Spreading Dynamics ◽

Sahel Region ◽

Limited Sequence

We are currently facing a pandemic of COVID-19, caused by a spillover of an animal-originating coronavirus to humans that occurred in the Wuhan region, China, in December 2019. From China the virus has spread to 188 countries and regions worldwide, reaching the Sahel region on the 2nd of March 2020. Since whole genome sequencing (WGS) data are crucial to understanding the spreading dynamics of the ongoing pandemic, but only limited sequence data are available from the Sahel region to date, we have focused our efforts on generating the first Malian sequencing data available. Screening of 217 Malian patient samples for the presence of SARS-CoV-2 resulted in 38 positive isolates, from which 21 whole genome sequences were generated. Our analysis shows that both the early A (19B) and the fast-evolving B (20A/C) clades are present in Mali, indicating multiple and independent introductions of SARS-CoV-2 to the Sahel region.

Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly

Nature Biotechnology ◽

10.1038/nbt.1904 ◽

2011 ◽

Vol 29 (8) ◽

pp. 723-730 ◽

Cited By ~ 88

Author(s):

Yingrui Li ◽

Hancheng Zheng ◽

Ruibang Luo ◽

Honglong Wu ◽

Hongmei Zhu ◽

...

Keyword(s):

De Novo Assembly ◽

Structural Variation ◽

De Novo ◽

Whole Genome ◽

Single Nucleotide ◽

Human Genomes ◽

Nucleotide Resolution ◽

Single Nucleotide Resolution

Variant calling for cpn60 barcode sequence-based microbiome profiling

10.1101/749267 ◽

2019 ◽

Author(s):

Sarah J. Vancuren ◽

Scott J. Dos Santos ◽

Janet E. Hill ◽

Keyword(s):

De Novo ◽

Sequence Data ◽

Variant Calling ◽

Taxonomic Composition ◽

Species Level ◽

Reference Sequence ◽

Sequence Length ◽

Sequence Variant ◽

Operational Taxonomic Units ◽

Microbiome Profiling

Abstract: Amplification and sequencing of conserved genetic barcodes such as the cpn60 gene is a common approach to determining the taxonomic composition of microbiomes. Exact sequence variant calling has been proposed as an alternative to previously established methods for aggregation of sequence reads into operational taxonomic units (OTU). We investigated the utility of variant calling for cpn60 barcode sequences and determined the minimum sequence length required to provide species-level resolution. Sequence data from the 5’ region of the cpn60 barcode amplified from the human vaginal microbiome (n=45), and a mock community were used to compare variant calling to de novo assembly of reads, and mapping to a reference sequence database in terms of number of OTU formed, and overall community composition. Variant calling resulted in microbiome profiles that were consistent in apparent composition to those generated with the other methods but with significant logistical advantages. Variant calling is rapid, achieves high resolution of taxa, and does not require reference sequence data. Our results further demonstrate that 150 bp from the 5’ end of the cpn60 barcode sequence is sufficient to provide species-level resolution of microbiota.
